
CoT : A Serial Computation on a Parallel Computer

26th September 2024

OpenAI recently released a preview of their new line of reasoning chatbots, called o1-preview. It shocked and amazed with its abilities in science, coding and maths.

However, I am more interested in understanding how and why it works than in talking about benchmarks.

A Serial Computation on a Parallel Computer

We know o1-preview gets its abilities from chain-of-thought reasoning (CoT).

In order to understand why o1-preview is so much smarter than older LLMs, let's explore what chain-of-thought is.

Artificial Neural Networks (ANNs), like the human brain, are parallel computers:

  • The human brain has around 80 billion neurons
  • A SOTA AI cluster runs on top of thousands of GPUs, each containing thousands of processor cores

A chain-of-thought is when you run a serial computation on top of that underlying parallel computer. The parallel components periodically sync up for long enough to create a "thought", which is then somehow chained onto the next. Hence we have brain waves.
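On the LLM side this is easy to picture. Each forward pass of the network is one huge parallel computation (billions of matrix operations at once), but the tokens come out one at a time, in a serial loop. Here is a minimal sketch in Python; forward and sample are placeholders standing in for a real transformer forward pass and sampling step, not any particular library's API:

  # Sketch: a chain of thought is a serial loop wrapped around a highly
  # parallel computation. Each call to forward() is one big parallel step
  # (matrix maths across many cores); the loop that appends one token at a
  # time is the serial part. `forward` and `sample` are placeholders.

  def generate_chain_of_thought(prompt_tokens, forward, sample, max_steps=256):
      tokens = list(prompt_tokens)
      for _ in range(max_steps):       # serial: one step after another
          logits = forward(tokens)     # parallel: one big matrix computation
          next_token = sample(logits)  # the "sync point" that yields a token
          tokens.append(next_token)    # chain it onto the thought so far
          if next_token == "<end>":
              break
      return tokens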

Now language has some basic reasoning baked into it, in words like "because" and "therefore" which denote causal relationships between things.

So you can't teach a neural network language without teaching it some basic reasoning. Hence even early versions of ChatGPT often had surprising abilities, because they inherited reasoning skills from language itself.

At a higher level, story-telling expresses the causal relationships between events. These are sometimes complex; a seemingly innocuous event early in the story may have an important implication later on. This is a higher level of reasoning beyond that intrinsic to language.

So I think early LLMs had the reasoning of language but lacked the higher level wisdom that comes from story-telling or lived experience.

What the New Reasoning Models Are

The exact details of how OpenAI's models work are not public, but we do know they are training them on synthetic data and training them to have a good chain-of-thought rather than just to give a good answer.

My understanding is that they are using two AI models working together to produce synthetic training data, before using that data to train the next model.

You have a creative model which, given a problem, hallucinates millions of chains-of-thought exploring it. These chains-of-thought are, of course, in text form.

Then they have a verifier model which:

  • filters the chains-of-thought down to those that produce the correct answer
  • for those remaining, checks the reasoning steps one by one to make sure that correct reasoning was used to reach the answer

The chains-of-thought left are the ones which produced the right answer and used correct reasoning steps. These are the training data for your new model.
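As a rough sketch of that pipeline (my own illustration; the real details are not public, and generate_cot, answer_of and verify_step are hypothetical placeholders for the creative and verifier models):

  # Hypothetical sketch of the synthetic-data pipeline described above.
  # generate_cot, answer_of and verify_step are placeholder functions;
  # the real OpenAI pipeline is not public.

  def build_training_data(problems, generate_cot, answer_of, verify_step,
                          samples_per_problem=1000):
      training_data = []
      for problem in problems:
          # 1. Creative model: sample many candidate chains-of-thought.
          candidates = [generate_cot(problem["text"])
                        for _ in range(samples_per_problem)]

          # 2. Keep only the chains that reach the known correct answer.
          correct = [c for c in candidates if answer_of(c) == problem["answer"]]

          # 3. Verifier model: check every reasoning step individually.
          for chain in correct:
              if all(verify_step(step) for step in chain):
                  training_data.append((problem["text"], chain))
      return training_data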

So essentially they are training models how to think rather than just how to get the right answer.

Wisdom

This means these reasoning models not only have the reasoning capabilities intrinsic to language, they also have a form of "reasoning wisdom" which comes from having explored millions of different ways to solve thousands of different problems and finding the best ones.

Hence they are great at problem solving.

I wonder when they will figure out how to synthesise data containing broader life wisdom like what humans might get from reading Shakespeare or having a mentor. I suppose you would have to run a full simulation of life and get the AI to explore millions of different life-journeys and find the best ones, and use those as your new training data.

If we want these AIs to be safe (not kill us all or whatever) then exactly what "life wisdom" we train into them will be very important.

Sleep

I'm not a neuroscientist but I suspect that this whole process of exploring different chains-of-thought to mine wisdom is partly what dreams are for.

So these new techniques of generating synthetic training data are essentially a primitive prototype of human sleep.

Unlocking a New Type of Computation

The manufacturing methods we have make it much easier to expand processing power horizontally than vertically. I.e. it is much easier to make 10 processors running at 2 GHz than 1 processor running at 20 GHz.

However, as any programmer knows, parallel programming is a pain. Most programming languages are not designed for parallel programming, and even with those that are, you are limited in what algorithms you can use.

Even at the level of maths, for some problems there is simply no known algorithm that can solve them in parallel.
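A toy illustration of the difference (my own example): a sum can be split across as many cores as you like, but an iteration where each step depends on the previous result gains nothing from extra processors.

  # Toy illustration. The sum is embarrassingly parallel: chunks of xs could
  # be summed on separate cores and combined. The iteration is inherently
  # serial: step n+1 needs the result of step n, so extra cores do not help.

  def parallel_friendly(xs):
      return sum(xs)  # chunks could go to separate cores, partial sums combined

  def inherently_serial(x, steps):
      for _ in range(steps):           # each step depends on the previous result
          x = 3.9 * x * (1 - x)        # logistic map: no known way to skip ahead
      return x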

Now that we have a way of doing serial work on parallel computers it dramatically raises the limits on how much work computers can do. It's the removal of a major ceiling on computer performance.

And AI isn't limited by maths in the same way a computer[1] is, because it can find patterns which do not have a tidy mathematical formulation.

And finally, as I mentioned in my last post, I think we don't have enough brain-power to program all the computers. We end up using off-the-shelf software packages for everything, which are not well adapted to the specific problem.

At the moment computers have "non-scaling utility", meaning that having more of them doesn't necessarily provide commensurate benefit. For example, two laptops don't give me any more capability than one laptop, because both are limited by the range of software available to run on them.

However, AI will allow computers to have continuous utility, where having more computers translates to more useful work done. The more computers and robots you have, the better. Both because they can work together and because they are adaptable to any number of unique problems.

Smart Turing Machine

If I may be so bold, I would propose a new term: "Smart Turing Machine".

A Turing Machine does very simple logical operations over and over in sequence to build arbitrarily complex behaviours.

These new reasoning AIs are essentially "Smart Turing Machines". They do smart things over and over in sequence. And they can learn.
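To make the analogy concrete, here is a sketch (again my own illustration, with reasoning_step standing in for a call to a reasoning model): like a Turing machine it just repeats one transition over and over on its state, but each transition is a whole reasoning step rather than a primitive symbol rewrite.

  # Hypothetical sketch of a "Smart Turing Machine". reasoning_step is a
  # placeholder wrapping a reasoning model: it reads the state, writes one
  # new thought, and says whether to halt -- a smart transition function.

  def smart_turing_machine(task, reasoning_step, max_steps=100):
      state = {"task": task, "scratchpad": []}    # the "tape"
      for _ in range(max_steps):
          thought, done = reasoning_step(state)   # one smart operation
          state["scratchpad"].append(thought)     # write it to the tape
          if done:
              break
      return state["scratchpad"]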

At the moment you can't train a SOTA LLM on consumer hardware. And even the commercial LLMs like ChatGPT have to be trained once in a big, expensive training run, so they do not really learn on the go[2]. But once we have enough compute that consumers can buy the hardware to train and operate a full reasoning AI, intelligence can permeate everything everywhere.

Footnotes:

[1]

I frequently use the term "computer" to refer to a traditional computer programmed using traditional methods, i.e. a Turing machine. As opposed to using thousands of computers to build an artificial neural network, which is how AI works.

[2]

Beyond what they can fit in their context window.

Copyright 2024 Joseph Graham (joseph@xylon.me.uk)