
Some background on how we got to where we are today. For this viewer, it started out relatively comprehensible, but gradually and then suddenly became more or less impenetrable. Your mileage may vary.

A couple of takeaways: one is how much technological advancement comes in intermediate steps built on what came before. And for current LLMs and such, when the discussion says “well, it’s just a probabilistic projection of the word that comes next”…this is in no way a simple projection. The mathematics involved are hairy, AFAIK…and further, the “black box” of LLMs “does work” in ways we are hard pressed to explain with any certainty.

I did notice the words “weight” and “weighted,” which are often mentioned in connection with, say, an Open Source version of an LLM. As in “the weights are included”…or “the weights are NOT included.” Below are explanations to clarify, as provided by ChatGPT o3.

  1. The Everyday Autocomplete Analogy
    When you type “How are” into your phone, the keyboard often guesses the next word will be “you.” It is not reading your mind; it has seen countless text snippets where “How are you” appears. A large language model (LLM) works on the very same principle but at a far bigger scale. Instead of checking a few million text messages, it has digested hundreds of billions of sentences from books, web pages, and dialogue. Every time you prompt it, the model asks: “Given every word so far, which next token (word-piece) is most probable?” It then picks according to the probabilities it has learned.
  2. Why This “Next-Word Game” Becomes Powerful
    • Combinatorial explosion: English has roughly 50,000 common tokens. After only five words, there are 50,000⁵ possible sequences—far more than grains of sand on Earth.
    • Context window: Modern LLMs can look at thousands of preceding tokens at once, so their “guess” is conditioned on a rich history, not just the last few words.
    • Temperature knob: At generation time you can ask the model to choose the very top candidate (deterministic) or pick randomly according to the learned probability curve (creative). These small controls turn a blunt autocomplete trick into storytelling, code writing, or step-by-step reasoning.
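    To make the temperature knob concrete, here is a minimal Python sketch (the candidate words and scores are invented for illustration, not taken from any real model): it scales raw scores by a temperature value, turns them into probabilities with a softmax, and then either takes the top candidate or samples from the curve.

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Pick the next token from raw scores ("logits").

    Near-zero temperature: always take the single most probable token.
    Temperature 1.0: sample according to the learned probabilities.
    Higher temperature: flatten the curve, so odder picks become likelier.
    """
    if temperature < 0.01:                        # effectively deterministic
        return max(logits, key=logits.get)
    # Softmax with temperature: scale scores, exponentiate, normalize.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())                    # subtract the max for numerical stability
    exp = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}
    return random.choices(list(probs), weights=probs.values())[0]

# Hypothetical scores for the word after "How are ..."
logits = {"you": 4.0, "they": 2.0, "we": 1.5, "things": 1.0}
print(sample_next_token(logits, temperature=0.001))  # always "you"
print(sample_next_token(logits, temperature=1.5))    # usually "you", sometimes not
```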
  3. Information Theory View (Plain Language)
    Claude Shannon defined information as reduction of uncertainty. An LLM’s job is to reduce uncertainty about the next token as much as possible.
    • Entropy: If the model narrows the next token from 50,000 possibilities down to ten plausible ones, it has greatly reduced entropy.
    • Conditional entropy: The longer the context, the tighter the prediction—similar to guessing the last word of “Romeo and Juliet are tragic ____.” Your mental entropy is near zero: “lovers” is almost guaranteed.
    A subtle twist: because the model is probabilistic, low-probability continuations are never impossible—only unlikely. That is why you can coax creative, odd, or erroneous outputs by adjusting sampling settings.
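    A small Python illustration of the entropy point (the probability numbers are invented): the same formula, H = −Σ p·log₂ p, gives a large value when many next tokens remain plausible and a value near zero when one token dominates.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over the nonzero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Wide-open guess: ten equally plausible next tokens.
print(entropy_bits([0.1] * 10))          # ~3.32 bits of uncertainty

# "Romeo and Juliet are tragic ____": one candidate dominates.
print(entropy_bits([0.97, 0.02, 0.01]))  # ~0.22 bits, close to zero
```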
  4. What “Weights” Really Are
    A neural network is a giant nested equation. “Weights” are the numeric knobs inside that equation. Each weight tells the model how strongly to pass a signal from one neuron to the next.
    • Scale: GPT-3 has 175 billion weights; newer models such as GPT-4o are believed to have even more. They are just floating-point numbers like 0.137 or –2.84.
    • Training: During training, the system shows the network a sentence with the last token blanked out, asks it to predict, measures the error, and nudges the relevant weights to reduce that error. Do this trillions of times and the weight values come to encode an enormous statistical map of language.
    • After training: The weights are fixed; inference is simply matrix arithmetic that applies these stored numbers to new text.
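    Here is a deliberately tiny sketch of the “inference is simply matrix arithmetic” point, in Python with NumPy. The five-word vocabulary, the sizes, and the (random) weight values are made up purely for illustration; a real model has billions of learned weights and many stacked layers, but the mechanical step is the same: look up numbers, multiply, normalize.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["I", "like", "apples", "bananas", "You"]
d = 4                                               # toy embedding size

# These two arrays ARE the "weights": plain floating-point numbers that
# training would have tuned (here they are random, just for illustration).
embeddings = rng.normal(size=(len(vocab), d))       # one vector per token
output_weights = rng.normal(size=(d, len(vocab)))   # maps a vector to scores

def next_token_probs(token):
    """Inference: look up a vector, multiply by the weights, softmax."""
    vec = embeddings[vocab.index(token)]            # shape (d,)
    scores = vec @ output_weights                   # one score per vocabulary token
    exp = np.exp(scores - scores.max())             # softmax, numerically safe
    return dict(zip(vocab, exp / exp.sum()))

print(next_token_probs("like"))   # a probability for every token in the vocabulary
```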
  5. Toy Example: A Baby Bigram Model
    Imagine a mini-corpus of three sentences:

    • “I like apples.”
    • “I like bananas.”
    • “You like bananas.”
      The model counts how often each word follows each other word:
      • after “I” the word “like” appears 2 times → P(like|I)=1.0
      • after “like” the word “apples” appears once and “bananas” twice → P(apples|like)=0.33, P(bananas|like)=0.67
      Those conditional probabilities are the “weights” of this toy network. Scale this counting trick up to 10¹² tokens and many network layers and you have an LLM.
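    For the curious, the toy bigram model above fits in a few lines of Python (punctuation is dropped to keep the counting simple); the probabilities it prints match the ones listed:

```python
from collections import defaultdict

corpus = ["I like apples", "I like bananas", "You like bananas"]

# Count how often each word follows each other word.
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

# Turn the counts into conditional probabilities P(next | previous).
probs = {
    prev: {nxt: n / sum(following.values()) for nxt, n in following.items()}
    for prev, following in counts.items()
}

print(probs["I"])     # {'like': 1.0}
print(probs["like"])  # {'apples': 0.333..., 'bananas': 0.666...}
```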
  6. Why The Weights Feel “Magical”
    • Distributed knowledge: No single weight stores the fact “Paris is in France.” Instead that fact is smeared across thousands of weights.
    • Non-linear mixing: Each layer combines earlier activations in complex ways, letting the model capture syntax, semantics, even rudimentary reasoning.
    • Gradient descent: Training is like hiking down a foggy mountain by always stepping toward the steepest downward slope. Millions of microscopic steps eventually find low-error valleys where the weights jointly encode useful patterns.
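    The foggy-mountain picture corresponds to a loop like the one below: a one-weight toy in Python whose “mountain” is a simple error curve with its valley at 3.0. Real training does the same nudging across billions of weights at once, with the slope supplied by backpropagation.

```python
# Toy loss surface: the error is lowest when the weight equals 3.0.
def loss(w):
    return (w - 3.0) ** 2

def slope(w):
    """Gradient of the loss at the current weight value."""
    return 2 * (w - 3.0)

w = -5.0                                 # start somewhere random on the mountain
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * slope(w)        # step toward the steepest downward slope

print(round(w, 3))                       # ~3.0, the bottom of the valley
```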
  7. Limitations You Can Observe
    • Over-reliance on surface statistics can lead to confident but wrong answers (“hallucinations”).
    • If the training data missed a niche fact, the probabilities never learned it, so the model may invent.
    • Long chains of logic are hard because each step must survive probabilistic sampling noise.
  8. Practical Implications for PSA or Any User
    • Prompt engineering = steering probabilities. Clear, specific prompts reduce uncertainty and push the model toward your desired region of the probability space.
    • Fine-tuning = re-weighting on your data. By showing the model PSA’s 2,500 posts, you slightly adjust the relevant weights so next-token probabilities better reflect your mission’s vocabulary and viewpoints.
    • Safety and alignment = shaping low-probability tails. Policies and reward-models try to dampen undesirable continuations without crippling useful creativity.
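    In terms of the toy bigram model from point 5, fine-tuning is roughly “keep the counts you already have and add counts from your own corpus,” so the conditional probabilities shift toward your vocabulary. The sketch below uses two invented sentences as a stand-in for PSA’s posts; a real fine-tune adjusts billions of weights by gradient descent rather than by counting, but the effect on next-token probabilities is similar in spirit.

```python
from collections import Counter, defaultdict

base_corpus = ["I like apples", "I like bananas", "You like bananas"]
extra_corpus = ["I like advocacy", "You like advocacy"]   # stand-in for PSA's posts

def bigram_probs(sentences):
    """The same counting trick as the toy model in point 5."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
            for prev, nexts in counts.items()}

print(bigram_probs(base_corpus)["like"])                 # apples and bananas only
print(bigram_probs(base_corpus + extra_corpus)["like"])  # "advocacy" now has probability
```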

In short, an LLM is “just” an enormous next-word guessing engine, but the guesswork operates over astronomically large possibility spaces with trillions of finely tuned weights, guided by principles straight out of information theory. That scale and tuning turn a simple statistical trick into what feels, at the surface, like fluent intelligence.