> The value of dynamic evaluation

Dynamic evaluation, also known as test-time training, appears to be highly beneficial for the current models. We can draw a parallel between this process and human learning. Consider evolution as a pre-training phase, similar to how language models are pre-trained on vast amounts of internet data. Humans then fine-tune their “model” through emergent experiences and latent tasks, guided by short- and long-term heuristics embedded in our genetic makeup. Language models should follow a similar path, using language-based heuristics distilled from average human experience. This approach could lead to diverse models existing in parallel, each gaining unique experiences. Inference in diverse environments is where the models have the opportunity to diverge and acquire different representations. We want them to wander down deep paths of thought and learning as individuals, then communicate their findings. The current models are far too similar. These models could contribute to a globally shared heuristic or value function, improving the overall efficiency of experience. While in-context learning performs a form of light dynamic evaluation, it’s fundamentally limited by compute constraints and cannot incorporate new priors at the parameter level.
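The core mechanic can be illustrated with a minimal sketch, assuming a character-level bigram model as a stand-in for a language model: the static model freezes its parameters after pretraining, while the dynamically evaluated model keeps updating them on the very test stream it is predicting, lowering its loss on a shifted distribution.

```python
import math
from collections import defaultdict

def bigram_counts(text):
    counts = defaultdict(lambda: defaultdict(float))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1.0
    return counts

def prob(counts, a, b, vocab_size, alpha=1.0):
    # Laplace-smoothed conditional probability p(b | a)
    total = sum(counts[a].values())
    return (counts[a][b] + alpha) / (total + alpha * vocab_size)

def mean_nll(train_text, test_text, dynamic):
    vocab_size = len(set(train_text) | set(test_text))
    counts = bigram_counts(train_text)
    nll = 0.0
    for a, b in zip(test_text, test_text[1:]):
        nll -= math.log(prob(counts, a, b, vocab_size))
        if dynamic:
            counts[a][b] += 1.0  # parameter-level update on the test stream
    return nll / (len(test_text) - 1)

train = "the cat sat on the mat. " * 50
test = "zq zq zq zq zq zq zq zq"  # a shifted distribution at test time

static_nll = mean_nll(train, test, dynamic=False)
dynamic_nll = mean_nll(train, test, dynamic=True)
assert dynamic_nll < static_nll  # adaptation lowers loss on the new distribution
```

In-context learning would correspond to conditioning on the test stream without the count update; the update line is what incorporates new priors at the parameter level.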

I previously believed that current model architectures were very dissimilar to biological brains, hindered by their constant-compute input/output regime through a fixed number of layers. I no longer believe this. It’s widely accepted that humans employ both System 1 (fast, intuitive) and System 2 (slow, deliberate) thinking, with our current AI models more closely resembling System 1. However, it’s possible that System 2 thinking is not drastically different from a looped construction of System 1 computation and values. We might be able to induce a form of reasoning and longer-term System 2 thinking in models through raw data generated and modulated during active inference and subsequent dynamic evaluation. This could be further enhanced by a value function that becomes stronger and more efficient over time. Essentially, we would be unfolding the internal human experience of System 2 thinking into a longer cycle of alternating experimentation and learning. This approach is unlikely to cause model collapse, especially when grounded by the results of the model’s inferentially generated data and actions in a simulated or real environment.

In my early experiments with Francois Chollet’s ARC-AGI benchmark, I attempted to implement a light form of active inference. The process begins by training a small language model on a large dataset of synthesized programs designed to solve grids using a domain-specific language (DSL) provided by Michael Hodel. Next, we start inferring potential programs to solve grids in the evaluation set. We retain the better-performing programs based on a grounded heuristic: simply the absolute distance between the sample grids generated by the inferred program and the actual grids, with the test grid withheld. These inferred grids then become training data for the subsequent iteration of the model, allowing it to learn from the consequences of its incorrect but valid program generations. Over many iterations, this approach enables the model to triple its original number of solved evaluation grids. To somewhat ground the distribution, we also incorporate some of the original training data alongside the inferred grids during the training process. A very similar and more successful approach, employing additional techniques, is implemented in the CodeIt algorithm by Natasha Butt et al.
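The loop can be caricatured as follows. This is a toy sketch, not the actual experiment: the four grid transforms and the weighted sampler are hypothetical stand-ins for Hodel’s DSL and the small language model, and the weight update stands in for retraining on the retained programs.

```python
import random

# Hypothetical DSL primitives operating on grids (lists of lists of ints).
def flip_h(g):    return [row[::-1] for row in g]
def flip_v(g):    return g[::-1]
def transpose(g): return [list(r) for r in zip(*g)]
def identity(g):  return g

PRIMITIVES = [flip_h, flip_v, transpose, identity]

def grid_distance(a, b):
    # The grounded heuristic: absolute cell-wise distance between grids.
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def run(program, grid):
    for op in program:
        grid = op(grid)
    return grid

def sample_program(weights, length=2):
    return random.choices(PRIMITIVES, weights=weights, k=length)

# One demonstration pair (the test grid is withheld); the hidden task is flip_h.
demo_in, demo_out = [[1, 2], [3, 4]], [[2, 1], [4, 3]]

random.seed(0)
weights = [1.0] * len(PRIMITIVES)
best_prog, best_score = None, float("inf")
for _ in range(500):                        # infer -> score -> "retrain"
    prog = sample_program(weights)
    score = grid_distance(run(prog, demo_in), demo_out)
    if score < best_score:
        best_prog, best_score = prog, score
    if score == 0:                          # retain better-performing programs...
        for op in prog:                     # ...and reinforce their primitives
            weights[PRIMITIVES.index(op)] += 0.5

assert best_score == 0                      # the loop recovers a solving program
```

The key property is that the heuristic is grounded in the demonstration grids alone, so incorrect but valid programs still produce informative training signal for the next iteration.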

As a result of these approaches, we may see further hardware decoupling of inference and training chips to ensure optimal efficiency in this progressive improvement cycle. While prompt engineering and similar techniques have their place, I believe parameter-level tuning via experience is necessary for exceptional emergent abilities. During System 2 thinking in biological brains, synapses are likely tuned on short time scales to allow for memory and guidance during the reasoning and search process. We need to allow for this level of tuning in language models through the rolled-out cycle outlined above.

> What might be holding us back?

The most significant challenge in achieving substantial goals using the process outlined above is the issue of catastrophic forgetting, which impedes continual learning. This may be the primary advantage biological brains have over language models. While the sparsity induced by self-attention in transformers has potentially mitigated this issue to some extent, it remains a persistent problem. Researchers have proposed various approaches to address catastrophic forgetting, such as using weight masks found and stored via Hopfield networks. However, the ideal solution would involve the model itself performing adequately at the parameter level, without the need to swap modules in and out.
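To see why this is the central obstacle, here is a minimal sketch of catastrophic forgetting, assuming a single logistic unit trained sequentially on two conflicting synthetic tasks: competence on the first task collapses once the second is learned.

```python
import math
import random

def sigmoid(z):
    # numerically stable logistic
    return 1 / (1 + math.exp(-z)) if z >= 0 else math.exp(z) / (1 + math.exp(z))

def train(w, b, data, epochs=20, lr=0.5):
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w * x + b)
            w += lr * (y - p) * x
            b += lr * (y - p)
    return w, b

def accuracy(w, b, data):
    return sum((w * x + b > 0) == (y == 1) for x, y in data) / len(data)

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(200)]
task_a = [(x, 1 if x > 0 else 0) for x in xs]   # label: sign of x
task_b = [(x, 0 if x > 0 else 1) for x in xs]   # the opposite rule

w, b = train(0.0, 0.0, task_a)
acc_a_before = accuracy(w, b, task_a)           # near-perfect on task A
w, b = train(w, b, task_b)
acc_a_after = accuracy(w, b, task_a)            # task B has overwritten task A
assert acc_a_before > 0.9 and acc_a_after < 0.2
```

Replay (mixing old data back in, as in the ARC experiment above) and weight masks both attack this overwriting; the open question is whether a model can manage it internally, at the parameter level.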

> How neuromodulation may induce compositionality

Though we seem to have implemented the substantial pattern recognition and learning capacity found in STDP (spike-timing-dependent plasticity), the low-level base learning rule in biological brains, we haven’t yet fully incorporated neuromodulation. Even in non-spiking artificial networks, this may be key to inducing the compositionality present in biological brains. A potential process implementing neuromodulation could involve cyclical activation of candidate circuits in a semi-chaotic manner. This would be a process similar to active inference, where the same model is inferring data and changing slightly as it processes new data that represents its own effect on the environment. Different neuronal circuits would activate, with their weights being temporarily strengthened or labelled with a decaying term. This is akin to a form of search in which the assembly of circuits that eventually produces the reward in the latter search cycles is sparsely reinforced by the subsequent deployment of neuromodulators like dopamine. Dynamic evaluation in language models could potentially implement a weaker form of this process through large amounts of inference in which the consequences of the model’s outputs are part of the next training cycle, constrained by some ground truths.
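A toy sketch of the decaying-tag idea, under loose assumptions: each “circuit” is reduced to a single weight, usage leaves a decaying eligibility trace, and a dopamine-like scalar sparsely reinforces recently tagged circuits whenever a reward arrives.

```python
import random

N = 5                        # candidate circuits
weights = [1.0] * N          # long-term strengths
traces = [0.0] * N           # decaying "temporary strengthening" tags
DECAY, LR = 0.8, 0.5
REWARDED = 3                 # the circuit whose use actually yields reward

random.seed(0)
for _ in range(200):
    # semi-chaotic activation: sample a circuit in proportion to its strength
    circuit = random.choices(range(N), weights=weights)[0]
    traces[circuit] += 1.0                    # tag the used circuit
    if circuit == REWARDED:
        # the neuromodulator deploys: sparsely reinforce recently tagged circuits
        for i in range(N):
            weights[i] += LR * traces[i]
    traces = [t * DECAY for t in traces]      # tags decay every step

assert weights[REWARDED] == max(weights)      # the rewarded circuit dominates
```

Because reinforcement is applied to the traces rather than directly to the rewarded action, circuits used shortly before the reward also gain a little strength, which is the search-then-credit behaviour described above.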

> Contraction and expansion

Recent work by Jeff Hawkins and others, supported by decades of neuroscience research, strongly suggests that active inference via movement allows the brain to build and discover assemblies. These assemblies may be the building blocks used for thinking. A recent paper by Chi Zhang et al. demonstrated that a strategy called minimax entropy learning could match human few-shot performance in an IQ-like test. This aligns with the idea that the brain uses STDP for System 1 thinking, while System 2 thinking involves a search process for assemblies of the most relevant circuits. In essence, System 1 and STDP narrow the pattern recognition space, embedding fine patterns, while System 2, aided by neuromodulators, expands and searches the brain’s learned patterns, similar to minimax entropy learning. I think this is where the brain takes advantage of its inherent physical noise to sample “candidate” neuronal assemblies in the aforementioned search-like process. It’s possible that existing models, through extensive pretraining, do build some of the spatial assemblies or priors required for thinking via passive inference.

> Evidence in mental health conditions

Mental health conditions provide compelling circumstantial evidence for the critical role of neuromodulation in cognitive processes. Take schizophrenia, for instance, which is associated with dopaminergic abnormality, often hyperactivity (Brisch et al.). This overactivity may lead to a continuous search for and construction of expansive neuronal assemblies. As a result, individuals with schizophrenia often perceive long-range patterns or connections that aren’t actually present or don’t contribute to effective learning. Autism, particularly in severe cases, offers another perspective. Social interactions, including verbal communication, typically require the coordination of large, complex neuronal assemblies. In autism, particularly in cases of savant syndrome, there is a lack of dopaminergic activity (Mandic-Maravic et al.) and what appears to me to be a tendency to form neuronal assemblies prematurely and with excessive rigidity - metaphorically speaking, “too close to the metal.” This can result in remarkable data processing and memory capabilities, but difficulties with higher-order cognitive functions. This regime appears most similar to our current models. ADHD is known to involve serious issues with long-term planning, and upregulating dopamine appears to alleviate this.

These examples underscore a crucial point: the brain must maintain a delicate balance between forming stable, low-level, pattern-based connections and remaining flexible enough to adapt and learn by creating higher-order assemblies. It needs the capacity to forget irrelevant information and, through active inference, to determine what should be forgotten. This process of active inference allows an individual’s brain to begin to understand its own computational abilities and the consequences of its actions. We can view a number of core issues with the current models through this neuromodulatory frame. Furthermore, this framework helps explain the significant role of dopamine in sensorimotor control. By modulating the formation and strength of neuronal assemblies, dopamine may play a crucial part in fine-tuning our cognitive ability to interact with the environment through action.

Dopamine is also known to deploy in anticipation of a reward, often just before it is received, rather than upon receipt. This could provide evidence that its job is to reinforce the recently used assemblies and pathways of meta-learning and internal cognitive exploration that led the agent to that reward, rather than binding patterns and assemblies directly to the upcoming reward itself. Maybe evolution has allowed us to feel pleasure when dopamine is deployed because we’re actually in the active process of learning and building the general cognitive framework of our brains, which is even more important than learning to directly receive certain rewards. Dopamine means “let’s get ready to learn”.

> Expansion and compression cycle

It is likely that the models will adopt a two-phase cycle of expansion and compression. The model first expands its knowledge, inferring a large number of chains of thought, ideally in an atomic or somewhat discrete manner, experimenting and receiving feedback in grounded or ungrounded environments or scenarios. Following this, the model enters the compression phase, where it focuses on refining its reasoning processes. The model attempts to compress the moderately or completely successful chains of thought into shorter, more efficient versions by reducing the chain length incrementally – effectively from length n to n-1. By motivating the model to slightly compress the chain at each iteration, we encourage it to internalise some of the reasoning steps. This leads to more efficient and accurate performance in subsequent expansion phases, as the model can draw upon its internalised knowledge to make quicker and more accurate inferences. One of my favourite papers showing a (very narrow) form of this is (Deng et al.), in which they force a GPT-2 model to internalise chain-of-thought by stepwise removing intermediate CoT steps. Eventually the model can multiply 20-digit numbers in one pass, which is pretty profound. They question whether this can occur in more general settings. I think it can. Is a reward model necessary? Probably, but maybe not.
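The stepwise-removal schedule can be sketched as a curriculum, in the spirit of the Deng et al. setup (the helper below is illustrative, not their implementation): at each stage the training target keeps one fewer intermediate step, until only the answer remains.

```python
# Illustrative curriculum: progressively drop leading chain-of-thought steps,
# pushing the model to internalise them, until the answer is emitted in one pass.
def curriculum(question, cot_steps, answer):
    """Yield (input, target) pairs with progressively shorter chains."""
    for removed in range(len(cot_steps) + 1):
        kept = cot_steps[removed:]          # drop the first `removed` steps
        yield question, " ".join(kept + [answer])

# Toy example: multi-step addition with explicit intermediate sums.
stages = list(curriculum(
    "12 + 34 + 56 =",
    ["12 + 34 = 46.", "46 + 56 = 102."],
    "102",
))
assert stages[0][1] == "12 + 34 = 46. 46 + 56 = 102. 102"
assert stages[-1][1] == "102"               # final stage: answer in one pass
```

Fine-tuning on each stage in order, rather than jumping straight to the answer-only target, is what makes the internalisation gradual enough to succeed.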

> Have the models found a form of structural generalization already?

Recent research into the phenomenon of “grokking” in neural networks has shed light on an intriguing aspect of learning dynamics. Grokking refers to a sudden improvement in a model’s ability to generalize effectively, occurring long after the model appears to have overfit on the training set (Lee et al.). This phenomenon might be interpreted as a form of extensive passive inference. In the process of grokking, it seems that the relentless pressure to minimize the loss function eventually enables the model to traverse the loss landscape, escaping local minima to find a simpler manifold. This simpler manifold allows for a more fundamental understanding of the training distribution. In essence, the immense pressure to descend the loss landscape may drive the model to discover more efficient and generalizable representations. Mixture of Experts (MoE) models also represent a small form of sparsity, allowing the system to selectively activate subsets of networks while ignoring others. This approach bears some resemblance to the selective activation of neuronal assemblies in biological brains.