The Economist, 2023: AI companies' scramble for data (part 1)
Business

Digging for digits

A scramble for data is under way among AI companies.

Not so long ago analysts were openly wondering whether artificial intelligence (AI) would be the death of Adobe, a maker of software for creative types. New tools like DALL-E 2 and Midjourney, which conjure up pictures from text, seemed set to render Adobe’s image-editing offerings redundant. As recently as April, Seeking Alpha, a financial-news site, published an article headlined “Is AI the Adobe killer?”

Far from it. Adobe has used its database of hundreds of millions of stock photos to build its own suite of AI tools, dubbed Firefly. Since its release in March the software has been used to create over 1bn images, says Dana Rao, a company executive. By avoiding mining the internet for images, as rivals did, Adobe has skirted the deepening dispute over copyright that now dogs the industry. The firm’s share price has risen by 36% since Firefly was launched.

Adobe’s triumph over the doomsters illustrates a wider point about the contest for dominance in the fast-developing market for AI tools. The supersize models powering the latest wave of so-called “generative” AI rely on oodles of data. Having already helped themselves to much of the internet, often without permission, AI firms are now seeking out new data sources to sustain the feeding frenzy. Meanwhile, companies with vast troves of the stuff are weighing up how best to profit from it. A data land grab is under way.
The two essential ingredients for an AI model are datasets, on which the system is trained, and processing power, through which the model detects relationships within and among those datasets. Those two ingredients are, to an extent, substitutes: a model can be improved either by ingesting more data or adding more processing power. The latter, however, is becoming difficult owing to a shortage of specialist AI chips, leading model-builders to be doubly focused on seeking out data. Demand for data is growing so fast that the stock of high-quality text available for training may be exhausted by 2026, reckons Epoch AI, a research outfit. The latest AI models from Google and Meta, two tech giants, are likely trained on over 1trn words. By comparison, the sum total of English words on Wikipedia, an online encyclopedia, is about 4bn.

It is not only the size of datasets that counts. The better the data, the better the model. Text-based models are ideally trained on long-form, well-written, factually accurate writing, notes Russell Kaplan of Scale AI, a data startup. Models fed this information are more likely to produce similarly high-quality output. Likewise, AI chatbots give better answers when asked to explain their working step by step, increasing demand for sources like textbooks. Specialised information sets are also prized, as they allow models to be “fine-tuned” for more niche applications.
Microsoft’s purchase of GitHub, a repository for software code, for $7.5bn in 2018 helped it develop a code-writing AI tool.

As demand for data grows, accessing it is getting trickier, with content creators now demanding compensation for material that has been ingested into AI models. A number of copyright-infringement cases have already been brought against model-builders in America. A group of authors, including Sarah Silverman, a comedian, are suing OpenAI, maker of ChatGPT, an AI chatbot, and Meta. A group of artists are similarly suing Stability AI, which builds text-to-image tools, and Midjourney.
Source: http://www.tingroom.com/lesson/jjxrhj/2023jjxr/565524.html