Daily Paper Reading (9/n)
Topic: How do LLMs actually use depth?
Link: http://arxiv.org/abs/2602.05970

Inverse Depth Scaling From Most Layers Being Similar
Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks.

We know that deeper models generally have larger capacity, but what is the mechanism behind it?

This paper first summarizes three assembly assumptions from existing work:

Compositional Assembly: Depth builds qualitatively new abstractions layer by layer. Simpler inputs may stop changing early, while harder inputs keep evolving deeper into the network. If this were true, shallow and deep models should share similar early representations, with deeper models adding new higher-level stages.

Procedural Assembly: Depth acts like a finer discretization of a smooth underlying transformation. As depth grows, each layer makes a smaller step, and neighboring layer updates should be strongly correlated because they follow the same smooth dynamics.
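As a toy illustration (mine, not from the paper), the procedural picture can be mimicked by Euler-discretizing one fixed smooth flow: a depth-\(L\) network takes \(L\) steps of size \(1/L\), so increasing depth should barely change the final output. All names and constants below are illustrative assumptions.

```python
# Sketch (not the paper's code): procedural assembly viewed as Euler
# discretization of a smooth flow dh/dt = tanh(W h).
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))  # one fixed, smooth dynamic
h0 = rng.normal(size=d)

def run(L):
    """Depth-L residual net as L Euler steps of dh/dt = tanh(W h)."""
    h = h0.copy()
    for _ in range(L):
        h = h + np.tanh(W @ h) / L  # per-layer step shrinks as 1/L
    return h

def gap(La, Lb):
    """Relative distance between the depth-La and depth-Lb outputs."""
    a, b = run(La), run(Lb)
    return float(np.linalg.norm(a - b) / np.linalg.norm(b))

coarse, fine = gap(16, 256), gap(64, 256)
print(coarse, fine)  # fine < coarse: deeper = finer discretization of one map
```

Under this hypothesis, going deeper does not add new stages; it refines the same underlying transformation.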

Ensemble Averaging: Many layers perform similar transformations, and depth mainly helps by averaging away errors across these similar layers. This predicts evenly spread updates, with per-layer update size shrinking roughly like \(1/L\), and a weak correlation structure rather than coherent smooth dynamics.
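The \(1/L\) prediction has a textbook statistical analogue, sketched below (my illustration, not the paper's experiment): if \(L\) layers each compute the same target update plus independent noise, the mean-squared error of their average shrinks like \(1/L\).

```python
# Sketch (not from the paper): averaging L noisy copies of the same
# transformation drives the error down like 1/L.
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=256)  # hypothetical ideal per-layer update

def mse_of_average(L, noise=0.5, trials=500):
    """MSE of the average of L noisy copies of the same target update."""
    errs = []
    for _ in range(trials):
        noisy_layers = target + rng.normal(scale=noise, size=(L, target.size))
        errs.append(np.mean((noisy_layers.mean(axis=0) - target) ** 2))
    return float(np.mean(errs))

m4, m64 = mse_of_average(4), mse_of_average(64)
print(m4 / m64)  # close to 64 / 4 = 16, i.e. error scales like 1/L
```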

They first rule out compositional assembly using hidden-state geometry in real pretrained LLMs. Only a small fraction of tokens show early stopping. The first and last layers change hidden states a lot, while most middle layers make similarly small incremental updates.

Then, to distinguish procedural assembly from ensemble averaging, they:

  • look at correlations between neighboring layer updates. In LLMs, these correlations are generally weak, which is already inconsistent with smooth procedural dynamics.
  • further build a separate, controlled toy teacher-student setup. Note that this part is not done on a pretrained LLM: the teacher is a randomly initialized and fixed deep residual network, and the student is trained to match the teacher’s output distribution.
  • Case 1: With tied teacher weights, the target transformation becomes smooth across depth, so increasing student depth behaves like a finer discretization of the same underlying dynamics. This is the regime where procedural assembly can arise.
  • Case 2: With independent teacher weights, the target transformation becomes non-smooth across depth, so the student can no longer realize a clean smooth procedure and instead falls into ensemble averaging.
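The tied-vs-independent contrast can be sketched in a few lines of numpy (my reconstruction under simplifying assumptions, not the paper's code): run the same residual teacher once with a single shared weight matrix and once with fresh weights per layer, and compare the cosine similarity of neighboring layer updates.

```python
# Sketch (my reconstruction, not the paper's code): a residual teacher
# h_{l+1} = h_l + tanh(W_l h_l) / L. Tied weights give a smooth trajectory
# with strongly correlated neighboring updates; independent weights do not.
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 32  # illustrative width and depth

def neighbor_update_cosine(tied):
    """Mean cosine similarity between consecutive residual updates."""
    Ws = [rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
          for _ in range(1 if tied else L)]
    h, updates = rng.normal(size=d), []
    for l in range(L):
        u = np.tanh(Ws[0 if tied else l] @ h) / L
        h = h + u
        updates.append(u)
    cos = [u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
           for u, v in zip(updates, updates[1:])]
    return float(np.mean(cos))

c_tied = neighbor_update_cosine(tied=True)    # smooth flow: near 1
c_indep = neighbor_update_cosine(tied=False)  # independent layers: near 0
print(c_tied, c_indep)
```

The weakly correlated, independent-weights regime is the one that matches the update statistics measured in real LLMs.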

Given that the independent-teacher regime better matches the empirical signatures of real LLMs, and that in the toy model inverse depth scaling remains robust across temperatures under independent teacher weights, ensemble averaging emerges as the most plausible explanation for how real LLMs use depth.

Personal Note

Summary

  1. Many middle layers perform similar transformations. Loop it! This is exactly the core observation behind the title, “Inverse Depth Scaling From Most Layers Being Similar”.

  2. The observation that the first and last layers play roles distinct from the middle layers is consistent with the Prelude/Coda design in “Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach”.

  3. The logic: they observe an inverse depth scaling law. Since both Procedural Assembly and Ensemble Averaging can produce this law in some regimes, they run further controlled experiments and argue that real LLMs, especially their middle layers, are better matched by Ensemble Averaging.