A research group says artificial intelligence (AI) companies could run out of publicly available data to train their systems in less than eight years.
Training data includes writing and information publicly available on the internet. AI companies use this material to "train" AI systems to create human-sounding writing. This "training" is how developers create large language models. Currently, many technology companies are developing large language models this way.
The nonprofit research group Epoch AI examines issues relating to AI. It has been following the development of large language models for a few years. In a recent paper, the group said technology companies will exhaust the supply of publicly available training data for AI language models between 2026 and 2032.
The team's latest paper has been reviewed by experts, or peer reviewed. It is to be presented at the International Conference on Machine Learning in Vienna, Austria, this summer. Epoch AI is linked to the research group Rethink Priorities based in San Francisco, California.
A 'gold rush'
Researcher Tamay Besiroglu is one of the paper's writers. He compared the current situation to a "gold rush" in which limited resources are depleted. He said the field of AI might face problems as the current speed of development uses up the current supply of human writing.
As a result, technology companies like OpenAI, the maker of ChatGPT, and Google are seeking to pay for high-quality data. Their goal is to ensure a flow of good material to train their systems. OpenAI has made deals with the social media service Reddit and the news provider News Corp. to use their material. The researchers consider this a short-term answer.
Over the long term, the group said, there will not be enough new blogs, news stories or social media writing to support the speed of AI development. That could lead companies to seek online data considered private, such as email and phone communications. They also might increasingly use AI-created data, such as chatbot content.
A 'bottleneck' in development?
Besiroglu described the issue as a "bottleneck" that can prevent companies from making improvements to their AI models, a process called "scaling up."
"...Scaling up models has been probably the most important way of expanding their capabilities5 and improving the quality of their output."
The Epoch AI group first made their predictions two years ago. That was weeks before the release of ChatGPT. At the time, the group said "high-quality language data" would be exhausted by 2026. Since then, AI researchers have developed new methods that make better use of data and that "overtrain" models on the same data many times. But there are limits to such methods.
The amount of written information fed into AI systems has been growing, and so has computing power, Epoch AI said. The parent company of Facebook, Meta Platforms, recently said the latest version of its Llama 3 model was trained on up to 15 trillion word pieces called tokens.
But whether a "bottleneck" in development is a concern remains the subject of debate.
Nicolas Papernot teaches computer engineering at the University of Toronto. He was not involved in the Epoch study. He said more skilled AI systems can also be built by training them for specialized tasks. Papernot said he is concerned that training AI systems on AI-produced writing could lead to a situation known as "model collapse."
Permission and quality
Also, internet-based services such as Reddit and the information service Wikipedia are considering how they are being used by AI models. Wikipedia has placed few restrictions on how AI companies use its articles, which are written by volunteers.
But professional writers are worried about their protected materials. Last fall, 17 writers brought a legal action against OpenAI for what they called "systematic theft on a mass scale." They said ChatGPT was using their materials, which are protected by copyright laws, without permission.
AI developers are concerned about the quality of what they train their systems on. Epoch AI's study noted that paying millions of humans to write for AI models "is unlikely to be an economical way" to improve performance.
The chief of OpenAI, Sam Altman, told a group at a United Nations event last month that his company has experimented with "generating lots of synthetic data" for training. He said both humans and machines produce high- and low-quality data.
Altman expressed concerns, however, about depending too heavily on synthetic data over other technical methods to improve AI models.
"There'd be something very strange if the best way to train a model was to just generate...synthetic data and feed that back in," Altman said. "Somehow that seems inefficient15."
Words in This Story
exhaust -v. to completely use up a resource
depleted -adj. almost completely used up
trajectory -n. the direction that something is taking or is predicted to take
synthetic -adj. created by a process that is not natural
scale -n. the level of size of a thing
generate -v. to create something through a process