We can all agree that training data is essential to any machine
learning model, but what happens when that data runs out? According
to a recent article in New Scientist, the
high-quality language data used to train models such as ChatGPT
could run out as soon as 2026.
High-quality language data includes books and scientific papers
but is slow and costly to generate. Lower-quality data includes
posts on blogs, forums and social media and is plentiful, but
machine learning models based on lower-quality data may struggle to
make the paradigm-shifting developments seen in recent machine
learning models. Not only is this data shortage likely to slow
development, it could also drive up the cost of training data.
But all is not lost. While these predictions are based on
human-created data, synthetic data can also be generated, providing
a potentially limitless supply. The effectiveness of synthetic data
for training machine learning models must still be evaluated, but it
certainly opens up new opportunities for training. In addition, more
efficient learning algorithms are continually being developed,
enabling models to extract more knowledge from existing data sets,
learn from smaller data sets and even transfer learning from one
task to another.
I look forward to reading about innovations in these areas over
the coming years and I am sure we will continue to see huge leaps
in AI development into 2026 and beyond.
The content of this article is intended to provide a general
guide to the subject matter. Specialist advice should be sought
about your specific circumstances.