
November 1 to 15 (#19 of 2024)

2024-11-19

The wall

For my generation, The Wall was both a double album by Pink Floyd that hypnotized us at the end of the 1970s and a film by Alan Parker that blew our minds in the early 1980s. I remember seeing it in the cinema and leaving stunned by those delirious images of marching hammers and children turned into automatons by an alienating educational system. It was the era of Reagan and Thatcher, and the wall stood for authoritarianism, oppression, and control. We lived with the fear that someone might press the nuclear button at any moment. The wall represented all of that, and it had to come down.

During this last fortnight, though, people have been talking about a different wall: the scaling wall for language models. The concept is explained very well in this week’s episode of Monos estocásticos and in Antonio Ortiz’s article at Error500. For the detailed version, go there.

Error500: "La hipótesis del escalado de la inteligencia artificial hasta llegar a la AGI" ("It is the concept that has mobilized the most money in the world over the last two years…").

Today I only want to offer a quick sketch, with a few links and my personal take.

On November 9, The Information published the article OpenAI Shifts Strategy as Rate of ‘GPT’ AI Improvements Slows. I have not been able to read it because it sits behind a paywall and I have not found a free version. A few days later Reuters published another piece that included remarks from Ilya Sutskever along the lines of: we need to try new things, and scaling alone is not enough. And in between there appeared a paper, Scaling Laws for Precision, plus a thread on X, also pointing to problems in model scaling. Everything sounded negative.

On top of that, we are now a year and a half past the release of GPT-4, and still no larger model has appeared. No GPT-5, no Claude 3.5 Opus, no Gemini 2 Ultra. The next step in scaling, a model with more than 10T parameters, is taking a long time to arrive.

All this has started to cast doubt on the big hypothesis that has been driving the industry in recent years. Will the giant data center plans of the tech companies turn out to be useless? Will NVIDIA collapse? Will the bubble burst?

Fortunately, things calmed down at the end of the fortnight, when Altman gave us some cheer, saying that all this is just rumor and invention, and that there is no wall.

Can we believe Altman? Here is my personal take. During these two weeks I listened to two very interesting interviews. The first was Dwarkesh Patel interviewing Gwern Branwen,1 one of the earliest people to formulate the scaling hypothesis.

Although the hypothesis had already been laid out in a January 2020 OpenAI paper, Scaling Laws for Neural Language Models, and Andrej Karpathy had anticipated the future much earlier, in 2015, in his post The Unreasonable Effectiveness of Recurrent Neural Networks, it was Gwern's post, The Scaling Hypothesis, that went viral and brought the idea to a broader public.
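
For reference, the core finding of that 2020 paper can be written in one line: test loss falls as a smooth power law in parameters, data, and compute, with no sign of saturation over many orders of magnitude. The exponents below are the approximate values reported there; the notation and rounding are mine.

```latex
% Approximate power-law fits reported in "Scaling Laws for Neural Language Models"
% (Kaplan et al., 2020); constants rounded, notation mine.
\begin{align*}
L(N) &\approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, & \alpha_N &\approx 0.076 && \text{(non-embedding parameters)}\\
L(D) &\approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, & \alpha_D &\approx 0.095 && \text{(dataset size in tokens)}\\
L(C) &\approx \left(\tfrac{C_c}{C}\right)^{\alpha_C}, & \alpha_C &\approx 0.050 && \text{(training compute, optimally allocated)}
\end{align*}
```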

The other interview of the fortnight was François Chollet on the Machine Learning Street Talk podcast.

It is a very technical interview, full of interesting details. I am studying it carefully and will comment on it in a future article.

Both interviews talk about how to explain the behavior of LLMs. These neural networks learn a vast number of programs, functions that predict the next token, and then construct new functions by exploring the huge space of possible combinations and keeping the best ones.
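
To make "a function that predicts the next token" concrete, here is a deliberately tiny sketch: a bigram counter over an invented corpus. Everything in it is illustrative; a real LLM learns enormously richer functions, but the interface is the same, context in, next-token prediction out.

```python
from collections import Counter, defaultdict

# Toy "next-token predictor": count bigrams in a tiny corpus and
# predict the most frequent continuation.
corpus = "the sea is blue . the sky is blue . the table is set .".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most likely next token seen after `token`."""
    counts = bigrams[token]
    return counts.most_common(1)[0][0] if counts else "."

print(predict_next("is"))   # -> 'blue' (seen twice, vs 'set' once)
print(predict_next("the"))  # -> 'sea' (ties broken by first occurrence)
```

Scaling, in this picture, amounts to giving the learner the capacity, data, and training time to discover better and more abstract versions of predict_next.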

Although Chollet has often said that LLMs cannot become AGIs, and hence his ARC competition, his criticism is based on their inability to deal with novelty and on the inefficiency of gradient descent as a way of recombining the model’s structure from just a few examples. Contrary to what many people think he means, Chollet does not say that LLMs cannot generalize. In fact, he explicitly says in the interview that LLMs do build models from training data, and that these models are functions defining curves that let LLMs interpolate. But, and this is my reading, those curves may live in a very abstract space: literary style, sentiment classification, and so on.
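
My own toy way of picturing the "curves" point, with invented data and no claim that this is how Chollet would put it: a model fitted to samples of a smooth function does fine when asked about points between its training examples, and tends to fall apart as soon as it has to extrapolate beyond them.

```python
import numpy as np

# Fit a polynomial to noisy samples of a smooth "world function" and compare
# interpolation (inside the training range) with extrapolation (outside it).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 6.0, 30)
y_train = np.sin(x_train) + rng.normal(0, 0.05, x_train.shape)

coeffs = np.polyfit(x_train, y_train, deg=7)
model = np.poly1d(coeffs)

x_inside = 3.1   # within the training range: interpolation
x_outside = 9.0  # well outside it: extrapolation
print(abs(model(x_inside) - np.sin(x_inside)))    # small error
print(abs(model(x_outside) - np.sin(x_outside)))  # typically a huge error
```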

Gwern talks about something very similar, although he uses the language of Turing machines. It is essentially the same idea. When we talk about Turing machines, we are talking about algorithms. LLMs learn algorithms that let them predict the next token in a sequence. As Karpathy said, neural networks are unreasonably effective at that, or as Sutskever put it, models just want to learn.

So the version of the scaling thesis I currently have in my head would be something like this:

  1. LLMs build a vast number of functions that they use to predict the next token.
  2. The larger LLMs become, and the more data they are trained on, and the longer they train, the higher the level of abstraction of those functions, and the better they can generalize. Small models can capture syntactic regularities, while larger ones capture semantic ones, such as “the sea is blue”, “a table can have objects on top of it”, or “a car drives on a road”.
  3. I do believe people in the industry when they say the current models can still be scaled for another couple of generations. They all have commercial interests, of course, but I do not see any decisive reason why scaling must stop here. I do not think, for example, that we have hit a wall in training data. Data can be generated artificially, or by experts writing exercise books. And we still have not explored the use of full video sequences at something closer to 25 fps, rather than the 1 fps snapshots being used now. That will require vastly more compute, but it is not obviously impossible; a rough back-of-envelope follows this list.
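
A purely illustrative back-of-envelope on that last point, with every number assumed rather than sourced: going from 1 fps to 25 fps multiplies the video token count, and hence the training compute, by 25, using the common approximation that training FLOPs ≈ 6 × parameters × tokens.

```python
# Back-of-envelope: how much more compute does 25 fps video cost vs 1 fps?
# All inputs are assumptions for illustration, not measurements.

hours_of_video = 1_000_000   # hypothetical corpus size, in hours
tokens_per_frame = 256       # assumed visual tokens per frame
params = 2e12                # assumed 2T-parameter model

def training_flops(fps: float) -> float:
    frames = hours_of_video * 3600 * fps
    tokens = frames * tokens_per_frame
    return 6 * params * tokens  # common approximation: FLOPs ≈ 6 * N * D

flops_1fps = training_flops(1)
flops_25fps = training_flops(25)
print(f"1 fps : {flops_1fps:.2e} FLOPs")
print(f"25 fps: {flops_25fps:.2e} FLOPs ({flops_25fps / flops_1fps:.0f}x more)")
```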

We will see. As Antonio Ortiz says in his article, the good thing is that we will not have to wait long to find out. Next year should be the year when the next big model appears, whether GPT-5, Gemini 2, or Grok 3. Soon we will know whether scaling still works.

Like Mulder, I want to believe. Gorbachev arrived. Reagan won the Cold War and another wall fell in 1989. But now, forty years later, we are more or less where we were in the 1980s, maybe worse.

Everyone in my generation also saw the film in which the supercomputer WOPR nearly triggered the final nuclear war.2 The computer had a backdoor through which one could reach its true personality. Its real name was Joshua, and in the end it managed to generalize correctly and align itself with human values:

A strange game. The only winning move is not to play.

Stephen Falken had programmed that computer and named it Joshua after his dead son. The motives of today's Falkens are more prosaic. But I would like to believe the outcome will be the same: that Altman, Amodei, Sutskever, Karpathy, Chollet, Murati, and the rest of the San Francisco crowd will lead us to the techno-utopia of GPT-2030, full of machines of loving grace.


See you next time.


  1. Gwern Branwen is a pseudonym. He is an anonymous figure who has spent years building Gwern.net, a huge hypertext in which he annotates his ideas. He not only writes the content; he is also the author of the software that runs it, available openly on GitHub. The interview is extraordinary, not just because of its content but because it is the first public appearance of a brilliant and enigmatic character. Even so, it is only partly public, since the video image is computer-generated and the voice is not Gwern's own. He explains in the interview that he has been deaf since childhood and is reluctant to appear with his real voice. It looks as if the interview may become a turning point in his life, and that he may stop living in a modest house on $12,000 a year and instead move to San Francisco. ↩︎

  2. Some of us wanted to be Matthew Broderick, bought a Spectrum, and got hooked forever on computing and programming. ↩︎