# How LLMs Work — Walkthrough

## Step 1: Tokenization — Breaking Text into Pieces

Before an LLM can process your text, a tokenizer splits it into tokens: subword units from a fixed vocabulary, each of which the model sees as an integer ID.

**Task:** Go to the OpenAI Tokenizer (https://platform.openai.com/tokenizer) or think manually: how would the sentence "I'm learning about tokenization" get split into tokens? Try splitting it yourself, then check.

**Question:** Why do you think LLMs use subword tokens instead of whole words or individual characters? What trade-off does tokenization make?

**Checkpoint:** The learner should understand that subword tokenization balances vocabulary size with coverage: common words get one token, rare words are split into pieces, and because any string can be built from smaller pieces (down to single characters or bytes), no word is ever "unknown."
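
As a rough illustration of that trade-off, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `VOCAB` and the function `tokenize` are hypothetical and hand-picked for this example; production tokenizers learn their vocabularies from data (for example with byte-pair encoding) and will likely split the sentence differently.

```python
# Toy greedy longest-match subword tokenizer.
# VOCAB is a hypothetical, hand-picked vocabulary, not taken from any real model.
VOCAB = {"I", "'m", " learning", " about", " token", "ization", " un", "believ", "able"}


def tokenize(text: str, vocab: set[str] = VOCAB) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Fallback: emit a single character, so no input is ever "unknown".
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("I'm learning about tokenization"))
print(tokenize(" unbelievable"))
print(tokenize("xyz"))
```

With this toy vocabulary, "tokenization" is not a single entry, so it is split into ` token` and `ization`, while a string with no matching entries falls back to single characters.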

---

## Step 2: Embeddings — From Tokens to Numbers

<!-- hint:card type="concept" title="Embeddings" -->

Tokens are discrete symbols. The model needs continuous numbers to compute with. Each token gets mapped to a high-dimensional vector called an embedding.

**Task:** Consider the words "king", "queen", "man", "woman". If these are represented as vectors, what relationships would you expect between them? Sketch out (on paper or mentally) how you'd arrange them in 2D space.

**Question:** The famous word2vec result is: king - man + woman ≈ queen. What does this tell you about what embeddings capture? Is it just spelling, or something deeper?

**Checkpoint:** The learner should understand that embeddings encode semantic meaning — words with similar meanings have similar vectors, and relationships between words are preserved as directions in vector space.
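
The analogy can be reproduced with a tiny hand-made example. This is only a sketch: the 2-D vectors in `EMBEDDINGS` are invented so that one axis loosely means "royalty" and the other "gender", and the helper names `cosine` and `analogy` are hypothetical. Learned embeddings have far more dimensions, and their axes are not this interpretable.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
# These numbers are invented for illustration, not taken from word2vec.
EMBEDDINGS = {
    "king": (0.9, 0.9),
    "queen": (0.9, -0.9),
    "man": (0.1, 0.9),
    "woman": (0.1, -0.9),
    "apple": (-0.8, 0.0),
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(
        x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    candidates = (w for w in embeddings if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


print(analogy("king", "man", "woman"))  # queen
```

Subtracting "man" and adding "woman" moves the vector along the gender axis while leaving the royalty axis untouched, which is the sense in which a relationship is stored as a direction.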

---

## Step 3: Attention — The Core Mechanism

<!-- hint:card type="concept" title="Attention" -->
<!-- hint:diagram mermaid-type="flowchart" topic="transformer architecture" -->

The core of the transformer architecture is the attention mechanism. It lets each token "look at" the other tokens in the context to work out what it means there.

**Task:** Consider the sentence: "The bank by the river was steep." Now consider: "The bank approved my loan." The word "bank" appears in both. How would a model figure out which meaning is intended?

**Question:** Why is attention described as letting tokens "attend to" each other? What would happen if a model could only see tokens in order (left to right) without attention?

**Checkpoint:** The learner should understand that attention allows each token to weigh the relevance of all other tokens, resolving ambiguity based on context. Without attention, the model would struggle with long-range dependencies.
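
The "weighing" step can be sketched as scaled dot-product attention. The version below is deliberately simplified and rests on several assumptions: the 2-D vectors in `EMB` are invented (axis 0 loosely "finance", axis 1 loosely "nature"), the same vectors serve as queries, keys, and values, and there are no learned projections, multiple heads, or causal mask as a real transformer would have. The function names are hypothetical.

```python
import math

# Hypothetical 2-D vectors: axis 0 ~ "finance", axis 1 ~ "nature".
EMB = {
    "the": (0.0, 0.0),
    "bank": (1.0, 1.0),  # ambiguous on its own
    "river": (0.0, 2.0),
    "loan": (2.0, 0.0),
}


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def attend(query, keys, values):
    """Weighted average of the values, using the attention weights."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def contextualize(word, sentence, embeddings=EMB):
    """Mix a word's vector with the vectors of the tokens around it."""
    vectors = [embeddings[t] for t in sentence]
    return attend(embeddings[word], vectors, vectors)


print(contextualize("bank", ["the", "bank", "river"]))
print(contextualize("bank", ["the", "bank", "loan"]))
```

In this toy setup the context-aware vector for "bank" leans toward the "nature" axis next to "river" and toward the "finance" axis next to "loan", even though the starting vector for "bank" is identical in both sentences.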

---

## Step 4: Context Windows — Memory Limits

Every LLM has a maximum context window: the total number of tokens it can process at once (input + output). The limit varies by model and version; for example, Claude models have offered 200K tokens, and GPT-4 variants have ranged from 8K to 128K.

**Task:** Estimate: roughly how many pages of text fit in a 100K-token context window? (Hint: a typical page is ~300 words, and English averages ~1.3 tokens per word.)

**Question:** If an LLM can only "see" what's in its context window, what happens when you ask about something mentioned 150K tokens ago in a long conversation? Does the model remember it the same way you would?

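**Checkpoint:** The learner should understand that LLMs have no persistent memory beyond the context window, that anything outside the window is simply not visible to the model, that information buried in the middle of a very long context may be used less reliably than information near the start or end (the "lost in the middle" phenomenon), and that ~100K tokens is roughly 250 pages.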
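
The page estimate can be checked with a few lines of arithmetic. The function name `pages_for_tokens` is made up for this sketch, and its defaults are just the rough figures from the hint (~1.3 tokens per word, ~300 words per page), so the result is an approximation rather than a property of any particular model.

```python
def pages_for_tokens(
    tokens: int, tokens_per_word: float = 1.3, words_per_page: int = 300
) -> float:
    """Convert a token budget into an approximate page count."""
    words = tokens / tokens_per_word  # tokens -> words
    return words / words_per_page  # words -> pages


print(round(pages_for_tokens(100_000)))  # 256
```

About 77,000 words fit in 100K tokens under these assumptions, which is where the figure of roughly 250 pages comes from.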

---

## Step 5: Next-Token Prediction and Temperature

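LLMs generate text one token at a time: at each step the model assigns a probability to every possible next token given all previous tokens, and one token is then selected from that distribution. The temperature parameter controls how "creative" the selection is, that is, how willing the sampler is to pick lower-probability tokens.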

**Task:** Imagine the model is completing: "The capital of France is ___". At temperature 0, it always picks the highest-probability token. At temperature 1, it samples from the distribution. What output would you expect at each temperature?

**Question:** If temperature 0 always picks the most likely token, why would you ever want higher temperature? What tasks benefit from predictability vs. creativity?

**Checkpoint:** The learner should understand that temperature 0 gives deterministic, focused output (good for factual Q&A), higher temperatures introduce variety (good for creative writing), and very high temperatures produce incoherent text.
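
One common way to implement temperature is to divide the model's raw scores (logits) by the temperature before applying softmax. The sketch below assumes that approach; the logits in `LOGITS` are invented for the France example, the function names are hypothetical, and temperature 0 is handled as a special case that takes the highest-scoring token.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [score / temperature for score in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng=random):
    """Pick the next token: greedy at temperature 0, sampled otherwise."""
    if temperature == 0:
        # Dividing by zero is undefined, so temperature 0 means "take the argmax".
        best = max(range(len(logits)), key=logits.__getitem__)
        return tokens[best]
    probs = apply_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical logits for "The capital of France is ___".
TOKENS = [" Paris", " Lyon", " a"]
LOGITS = [6.0, 2.0, 1.0]

print(sample_token(TOKENS, LOGITS, 0))  # always " Paris"
for t in (0.5, 1.0, 5.0):
    print(t, [round(p, 3) for p in apply_temperature(LOGITS, t)])
```

As the temperature rises, probability mass moves from ` Paris` toward the less likely tokens, which is why very high temperatures tend to produce incoherent text.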

---

## Step 6: Why LLMs Hallucinate

LLMs sometimes generate confident-sounding but incorrect information. This isn't a bug — it's a consequence of how they work.

**Task:** Ask an LLM a very specific factual question about something obscure (e.g., "What was the population of a small town in 1987?"). Compare the answer to a verified source. Does the model hedge or state it confidently?

**Question:** LLMs are fundamentally pattern-completion machines. Given that they predict the next most likely token, why would they generate false information with high confidence instead of saying "I don't know"?

**Checkpoint:** The learner should understand that LLMs don't have a "knowledge" store they query; they complete patterns from training data. "I don't know" is rarely the most probable continuation, so the model defaults to plausible-sounding completions. Reinforcement learning from human feedback (RLHF) helps but doesn't eliminate the problem.
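
A toy model makes the pattern-completion point concrete. The sketch below is a bigram model, which predicts the next word from the previous word alone. That is a drastic simplification of an LLM, and the corpus, town names, and function names are all made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up training corpus about fictional towns (not real data).
CORPUS = [
    "the population of alder was 2000 .",
    "the population of birch was 3000 .",
    "the population of cedar was 2000 .",
]


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Probability of each next word, given only the previous word."""
    total = sum(counts[prev].values())
    return {word: count / total for word, count in counts[prev].items()}


model = train_bigrams(CORPUS)

# "dunmore" never appears in the corpus, yet the model still has an answer,
# because the pattern "... was <number>" is all it has learned.
prompt = "the population of dunmore was".split()
probs = next_word_probs(model, prompt[-1])
print(max(probs, key=probs.get))  # 2000
print(probs.get("unknown", 0.0))  # 0.0
```

The model has seen nothing about the town in the prompt, yet it completes the sentence with a confident-looking number, because an admission of ignorance never followed "was" in its training data.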

---

## Step 7: Choosing the Right Model

<!-- hint:buttons type="single" prompt="Which task needs a large model?" options="Classify tickets,Technical architecture,Short translation" -->
<!-- hint:celebrate -->

Different tasks call for different models. Larger models aren't always better — they're slower and more expensive.

**Task:** You have three tasks: (1) classify customer support tickets into 5 categories, (2) write a detailed technical architecture document, (3) translate short phrases to Spanish. For each, would you choose a small/fast model or a large/powerful one? Why?

**Question:** What factors should you consider when choosing a model besides raw capability? Think about latency, cost, context window, and task complexity.

**Checkpoint:** The learner should understand the trade-offs: small models for simple/high-volume tasks (classification, extraction), large models for complex reasoning and generation, and that cost and latency matter as much as capability in production systems.
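
The cost side of the trade-off is simple arithmetic once volume is known. In the sketch below, the prices, request volume, and token counts are placeholders chosen only to show the calculation; they are not real quotes for any provider or model, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(requests_per_day, tokens_per_request, price_per_million_tokens, days=30):
    """Estimate monthly spend from volume and a per-million-token price."""
    tokens = requests_per_day * tokens_per_request * days
    return tokens / 1_000_000 * price_per_million_tokens


# Placeholder prices in dollars per million tokens (hypothetical, not real quotes).
SMALL_PRICE = 0.50
LARGE_PRICE = 10.00

# Hypothetical workload: 50,000 support tickets a day, ~400 tokens each.
tickets_small = monthly_cost(50_000, 400, SMALL_PRICE)
tickets_large = monthly_cost(50_000, 400, LARGE_PRICE)

print(tickets_small)  # 300.0
print(tickets_large)  # 6000.0
```

At high volume the price gap is multiplied across every request, so a simple task such as ticket classification is a natural fit for the smaller model, provided its accuracy is good enough.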
