2023年经济学人 人工智能公司的数据争夺战(下)

 

The upshot has been a flurry of dealmaking as AI companies race to secure data sources.

结果是随着人工智能公司竞相获取数据来源,一系列交易被达成。

In July OpenAI inked a deal with Associated Press, a news agency, to access its archive of stories.

今年7月,OpenAI与新闻机构美联社签署了一项协议,以获取其新闻报道档案的使用权。

It has also recently expanded an agreement with Shutterstock, a provider of stock photography, with which Meta has a deal, too.

OpenAI最近还扩大了与Shutterstock(一家版权图片提供商)的协议,Meta也与其做了交易。

On August 8th it was reported that Google was in discussions with Universal Music, a record label, to license artists’ voices to feed a songwriting AI tool.

8月8日,有报道称,谷歌正在与唱片公司环球音乐洽谈,希望授权把歌手的声音输入给一个编写歌曲的AI工具。

Rumours swirl about AI labs approaching the BBC, Britain’s public broadcaster.

关于各家AI实验室与英国公共广播公司BBC接洽的谣言也沸沸扬扬。

Another supposed target is JSTOR, a digital library of academic journals.

另一个假定的目标是JSTOR,一个收纳学术期刊的数字图书馆。

Holders of information are taking advantage of their greater bargaining power.

信息持有者正在利用他们更大的议价能力。

Reddit, a discussion forum, and Stack Overflow, a question-and-answer site popular with coders, have increased the cost of access to their data.

论坛Reddit和深受程序员欢迎的问答网站Stack Overflow提高了访问其数据的成本。

Both websites are particularly valuable because users “upvote” preferred answers, helping models know which are most relevant.

这两个网站都特别有价值,因为用户会投票把更好的回答“顶上去”,从而帮助模型了解哪些回答最相关。

Twitter (now known as X), a social-media site, has put in place measures to limit the ability of bots to scrape the site and now charges anyone who wishes to access its data.

社交媒体网站推特(现已更名为X)已经采取措施,限制机器人抓取其网站数据的能力,并向任何想要访问其数据的人收费。

Elon Musk, its mercurial owner, is planning to build his own AI business using the data.

推特的老板--捉摸不定的埃隆·马斯克--正计划利用这些数据建立自己的人工智能业务。

As a consequence, model-builders are working hard to improve the quality of the inputs they already have.

因此,模型建造者正在努力提高现有数据的质量。

Many AI labs employ armies of data annotators to perform tasks such as labelling images and rating answers.

许多AI实验室雇佣了大量的数据注释员,来执行诸如给图像标记和给答案评分的任务。

Some of that work is complex; an advert for one such job seeks applicants with a master’s degree or doctorate in life sciences.

其中一些工作很复杂,有一条这类工作的招聘广告希望应聘人有生命科学硕士或博士学位。

But much of it is mundane, and is being outsourced to places such as Kenya where labour is cheap.

但大多数工作很单调,并被外包到肯尼亚等劳动力廉价的地方。

AI firms are also gathering data through users’ interactions with their tools.

人工智能公司也在通过用户与其工具的互动来收集数据。

Many of these have a feedback mechanism, where users indicate which outputs are useful.

其中许多都有反馈机制,用户可以指出哪些输出是有用的。

Firefly’s text-to-image generator allows users to pick from one of four options.

萤火虫的文本转图像生成器允许用户从四个选项中进行选择。

Bard, Google’s chatbot, proposes three answers.

谷歌的聊天机器人巴德会给出三个答案。

Users can give ChatGPT a thumbs-up or thumbs-down to its responses.

用户可以对ChatGPT的回复点击“喜欢”或“不喜欢”。

That information can be fed back as an input into the underlying model, forming what Douwe Kiela, co-founder of Contextual AI, a startup, calls the “data flywheel”.

这些信息可以再反馈回底层模型,形成初创公司Contextual AI的联合创始人杜威·基拉所说的“数据飞轮”。

A stronger signal still of the quality of a chatbot’s answers is whether users copy the text and paste it elsewhere, he adds.

他补充说,表明聊天机器人的回答质量高的一个更有力的信号是,用户会把文本复制并粘贴到其他地方。

That information helped Google rapidly improve its translation tool.

这些信息帮助谷歌迅速改进了其翻译工具。

There is, however, one source of data that remains largely untapped: the information that exists within the walls of the tech firms’ corporate customers.

然而,有一个数据来源在很大程度上仍未被开发:科技公司企业客户的内部信息。

Many businesses possess, often unwittingly, vast amounts of useful data, from call-centre transcripts to customer spending records.

许多企业拥有大量有用的数据,从客服中心的文字记录到客户的消费记录,这些数据往往都是在无意中掌握的。

Such information is especially valuable because it can be used to fine-tune models for specific business purposes, such as helping call-centre workers answer queries or analysts spot ways to boost sales.

这类信息特别有价值,因为可以用来微调模型而达到特定的商业目的,比如帮助客服中心的工作人员回答问题,或者帮助分析师找到提高销量的方法。

Yet making use of that rich resource is not always straightforward.

然而,这些丰富的资源并不总是可以直接利用。

Roy Singh of Bain, a consultancy, notes that most firms have historically paid little attention to the types of vast but unstructured datasets that would prove most useful for training AI tools.

贝恩咨询公司的罗伊·辛格指出,过去大多数公司几乎没有注意到那些海量但非结构化的数据集,这些数据集对训练AI工具是最有用的。

Often these are spread across various systems, buried in company servers rather than in the cloud.

这些数据通常分布在不同的系统中,深藏在公司的服务器里,而不是在云端。

Unlocking that information would help companies customise AI tools to serve their needs better.

解锁这些信息将帮助公司定制AI工具,以更好地满足自身需求。

Amazon and Microsoft, two tech giants, now offer tools to help companies improve management of their unstructured datasets, as does Google.

亚马逊和微软这两家科技巨头现在提供工具,帮助公司改善对非结构化数据集的管理,谷歌也有类似行动。

Christian Kleinerman of Snowflake, a database firm, says that business is booming as clients look to “tear down data silos”.

来自数据库公司雪花的克里斯蒂安·克莱纳曼表示,随着客户想要“拆除储存数据的筒仓”,数据业务正在蓬勃发展。

Startups are piling in.

初创企业正蜂拥而至。

In April Weaviate, an AI-focused database business, raised $50m at a valuation of $200m.

今年4月,专注于人工智能的数据库企业Weaviate以2亿美元的估值筹集了5000万美元。

Barely a week later Pinecone, a rival, raised $100m at a $750m valuation.

仅仅一周后,其竞争对手Pinecone就以7.5亿美元的估值筹集了1亿美元。

Earlier this month Neon, another database startup, raised an additional $46m in funding.

本月初,另一家数据库初创公司Neon又筹集了4600万美元的资金。

The scramble for data is only just getting started.

数据争夺战才刚刚开始。

  原文地址:http://www.tingroom.com/lesson/jjxrhj/2023jjxr/565525.html