# How LLMs Work — Tokens, Context Windows, and Model Behavior

<!-- hint:slides topic="How LLMs work: tokenization, transformer architecture, attention mechanism, context windows, temperature, and model behavior" slides="6" -->

## What Is an LLM?

A **Large Language Model (LLM)** is a statistical model trained on vast amounts of text. It learns patterns in language — grammar, facts, reasoning, style — and uses those patterns to predict the next token. It doesn't "know" things in the human sense; it completes patterns based on what it was trained on.

## Tokenization

Text is broken into **tokens** — subword units (words, parts of words, punctuation). Most models use **BPE (Byte Pair Encoding)** or similar: common words are single tokens; rare words split into smaller pieces.

- "Hello" → 1 token
- "Claude" → 1–2 tokens
- "unbelievable" → 2–3 tokens

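Exact splits vary by tokenizer, so treat these counts as typical rather than fixed. Token count affects cost and context limits. As a rule of thumb, 1 token ≈ 4 characters of English text; code and non-English text often need more tokens for the same number of characters.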
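To make the merge idea behind BPE concrete, here is a minimal character-level sketch. It is a simplification under stated assumptions: production tokenizers usually work on bytes, pre-split text with extra rules, and ship a fixed merge table, and the function names `apply_merge`, `train_bpe`, and `encode` are invented for this example.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[tuple(apply_merge(symbols, best))] += freq
        words = new_words
    return merges

def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

merges = train_bpe("low low low lower lowest", num_merges=2)
print(encode("low", merges))     # frequent word collapses to one token
print(encode("lowest", merges))  # rarer word keeps smaller pieces
```

The frequent word ends up as a single token while the rarer one is split, which is the behavior described above.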

## The Transformer Pipeline

```mermaid
flowchart LR
    A[Input Text] --> B[Tokenize]
    B --> C[Embed]
    C --> D[Attention Layers]
    D --> E[Generate Next Token]
    E --> F[Detokenize]
    F --> G[Output Text]
```

| Stage | What Happens |
|-------|--------------|
| **Tokenize** | Split input into tokens |
| **Embed** | Map tokens to vectors (numerical representations) |
| **Attention** | Each layer weighs how relevant every earlier token is to the next prediction |
| **Generate** | Predict next token (probabilistically) |
| **Detokenize** | Convert tokens back to text |
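The outer stages of the pipeline (tokenize, embed, detokenize) can be sketched with a toy vocabulary. Everything here is hypothetical: the six-entry `VOCAB`, the whitespace tokenizer, and the random embedding table stand in for a learned subword tokenizer and learned vectors.

```python
import random

VOCAB = ["<unk>", "the", "cat", "sat", "on", "mat"]  # hypothetical toy vocabulary
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(text):
    """Map each whitespace-separated word to an ID; unknown words map to <unk> (ID 0)."""
    return [TOKEN_TO_ID.get(w, 0) for w in text.lower().split()]

def make_embeddings(vocab_size, dim, seed=0):
    """Build a random lookup table; in a real model these vectors are learned."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Embedding is a table lookup: one vector per token ID."""
    return [table[i] for i in token_ids]

def detokenize(token_ids):
    """Convert IDs back to text."""
    return " ".join(VOCAB[i] for i in token_ids)

ids = tokenize("The cat sat on the mat")
vectors = embed(ids, make_embeddings(len(VOCAB), dim=4))
print(ids)
print(len(vectors), "vectors of size", len(vectors[0]))
print(detokenize(ids))
```

Note that both occurrences of "the" get the same vector at this stage; it is the attention layers that make a token's representation depend on its context.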

## Attention (Simplified)

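The **attention mechanism** lets the model look at all previous tokens and decide which ones are most relevant for predicting the next. For example, given "The cat sat on the", a model predicting "mat" might attend strongly to "sat" and "on", which suggest that a location comes next. The actual attention patterns are learned during training and differ across layers.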
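A minimal sketch of causal scaled dot-product attention in plain Python may help. It assumes one simplification that real transformers do not make: the same input vectors are used as queries, keys, and values, whereas a real model applies separate learned projections and runs many attention heads in parallel.

```python
import math

def softmax(xs):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """For each position t, mix the values of positions 0..t, weighted by relevance."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Causal mask: position t may only score itself and earlier positions.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = causal_attention(x, x, x)
for t, w in enumerate(weights):
    print(t, [round(v, 3) for v in w])
```

Each row of weights sums to 1, and the first token can only attend to itself because nothing precedes it.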

## Context Windows

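The **context window** is the maximum number of tokens the model can "see" at once, counting both prompt and response. What happens when you exceed it depends on the system: many APIs reject the request or cut the response short, while chat applications typically drop or truncate the oldest tokens.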

- Small: 4K–8K tokens
- Medium: 32K–128K tokens
- Large: 200K+ tokens

What fits depends on the model. As a rough guide, 100K tokens ≈ 75K words, about the length of a novel.
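One common client-side way to stay inside a context window is to keep only the most recent messages that fit. The sketch below assumes the rough 4-characters-per-token heuristic instead of a real tokenizer, and the names `estimate_tokens` and `fit_to_window` are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough estimate only; exact counts require the model's own tokenizer."""
    return max(1, -(-len(text) // chars_per_token))  # ceiling division

def fit_to_window(messages, context_window, reserved_for_response):
    """Keep the newest contiguous messages whose estimated tokens fit the prompt budget."""
    budget = context_window - reserved_for_response
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept)), used

history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
kept, used = fit_to_window(history, context_window=30, reserved_for_response=10)
print(len(kept), "messages kept,", used, "tokens used")
```

Reserving part of the window for the response matters because prompt and response share the same limit.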

## Next-Token Prediction

At each step, the model outputs a probability distribution over its vocabulary for the **next token**. One token is sampled from that distribution (shaped by temperature), appended to the sequence, and the process repeats. Output is produced one token at a time; there is no separate step that drafts the whole response in advance.
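The generation loop can be sketched with a toy stand-in for the model. The `TOY_MODEL` table below is invented for illustration and conditions only on the previous token; a real LLM conditions on the entire context and computes its probabilities with a neural network.

```python
import random

# Hypothetical toy "model": next-token probabilities given only the previous token.
TOY_MODEL = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"</s>": 1.0},
}

def generate(model, max_tokens=10, seed=0):
    """Sample one token at a time until the end marker or the token limit."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = model[tokens[-1]]            # distribution over next tokens
        choices, probs = zip(*dist.items())
        nxt = rng.choices(choices, weights=probs, k=1)[0]
        if nxt == "</s>":                   # end-of-sequence marker
            break
        tokens.append(nxt)                  # the new token becomes part of the context
    return tokens[1:]                       # drop the start marker

print(" ".join(generate(TOY_MODEL, seed=0)))
print(" ".join(generate(TOY_MODEL, seed=1)))
```

Different seeds can produce different continuations from the same starting point, which is why sampled output varies between runs.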

## Temperature and Sampling

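- **Temperature 0** — Always pick the highest-probability token (greedy decoding). Output is nearly deterministic, though some systems still vary slightly between runs.
- **Temperature 0.5** — Slight randomness; high-probability tokens are still strongly favored.
- **Temperature 1.0** — Sample from the model's predicted probabilities as-is.
- **Temperature 2.0** — Flattens the distribution; more varied output, but often less coherent.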
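The effect of these settings can be seen by dividing the model's raw scores (logits) by the temperature before applying softmax. This is a sketch with made-up logits; treating temperature 0 as greedy argmax is a common convention, since dividing by zero is undefined.

```python
import math

def apply_temperature(logits, temperature):
    """Return next-token probabilities after temperature scaling."""
    if temperature == 0:
        # Convention: temperature 0 means greedy decoding (all mass on the top token).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, and higher temperatures spread it across the alternatives.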

## Training vs Inference

| Phase | Purpose |
|-------|---------|
| **Training** | Learn patterns from data; update weights; expensive, one-time (per model) |
| **Inference** | Generate output; weights fixed; cost per token |
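The split between the two phases can be illustrated with a one-parameter model. It is far simpler than an LLM but follows the same pattern: training adjusts the weight, inference only reads it. The data, learning rate, and function names are made up for this sketch.

```python
def train(data, lr=0.1, epochs=50):
    """Training: repeatedly nudge the weight to reduce squared error on the data."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # derivative of (w*x - y)^2 with respect to w
            w -= lr * grad              # the weight update happens only in training
    return w

def infer(w, x):
    """Inference: use the fixed weight to produce an output; nothing is updated."""
    return w * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data following y = 2x
w = train(data)
print(round(w, 4), round(infer(w, 5.0), 4))
```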

## Fine-Tuning vs Prompting

- **Prompting** — No model changes. You steer with instructions and examples. Fast, flexible.
- **Fine-tuning** — Continue training the model on task-specific examples, which updates its weights. Better for domain-specific behavior, but requires data and compute.

## Why LLMs Hallucinate

LLMs complete patterns — they don't verify facts. Hallucination occurs when:

- The model confidently continues a plausible pattern that isn't true
- Training data had errors or conflicting information
- The prompt invites fabrication (e.g., "What did person X say in 2024?" when X said nothing)
- Temperature is high, increasing randomness

Mitigation: grounding answers in retrieved sources (RAG, Retrieval-Augmented Generation), verifying outputs, and using a lower temperature for factual tasks.
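Grounding can be sketched as "retrieve relevant text, then put it in the prompt". The example below is a deliberately crude stand-in: it ranks documents by shared words, whereas real RAG systems typically use embedding-based search. The function names, the prompt wording, and the sample documents are all hypothetical.

```python
def retrieve(question, documents, top_k=2):
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_grounded_prompt(question, documents, top_k=2):
    """Build a prompt that asks the model to answer only from retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, top_k))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

docs = [
    "The context window is the maximum number of tokens the model can see",
    "Bananas are yellow",
    "Temperature controls sampling randomness",
]
print(build_grounded_prompt("What is the context window", docs, top_k=1))
```

Giving the model an explicit way to say it does not know reduces the pressure to fabricate an answer when the context is silent.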

## Emergent Abilities and Scaling Laws

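Larger models can show **emergent abilities**: capabilities such as few-shot learning and chain-of-thought reasoning that are weak or absent in smaller models and appear at larger scale. How sudden these jumps really are is debated, since the answer depends partly on how the capability is measured. **Scaling laws** describe how performance (typically measured as loss) improves predictably with more data, parameters, and compute.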
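As an illustrative form only, scaling-law studies commonly fit loss with power laws in parameter count $N$ and training tokens $D$:

$$
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

Here $E$, $A$, $B$, $\alpha$, and $\beta$ are constants fitted to experiments. Their values differ between studies and are not specified here; the point is the shape, in which loss falls smoothly toward a floor $E$ as $N$ and $D$ grow.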

## Model Families

| Family | Examples | Notes |
|--------|----------|-------|
| GPT | GPT-4, GPT-4o | OpenAI; strong generalist |
| Claude | Claude 3, Claude 3.5 | Anthropic; long context, safety |
| Llama | Llama 3, Llama 3.1 | Meta; open weights |
| Gemini | Gemini Pro, Ultra | Google; multimodal |

## Safety and Alignment

**RLHF (Reinforcement Learning from Human Feedback)** — Train a reward model from human preferences, then optimize the language model (the "policy") toward that reward. The goal is to reduce harmful outputs and improve helpfulness.
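The reward-model step is often trained with a pairwise preference loss: given a response humans preferred and one they rejected, the loss is small when the reward model scores the preferred one higher. This sketch shows only that loss on hand-picked scores; it assumes a Bradley-Terry style formulation and omits the reward network and the policy optimization step entirely.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    diff = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical reward scores for a (chosen, rejected) response pair.
print(round(preference_loss(2.0, 0.0), 4))  # reward model agrees with the human label
print(round(preference_loss(0.0, 2.0), 4))  # reward model disagrees, so the loss is larger
```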

**Alignment** — Making models behave according to human values: harmless, honest, helpful.

---

## Key Takeaways

1. **Tokens** — Subword units; ~4 chars/token in English; drive cost and context limits
2. **Transformers** — Tokenize → Embed → Attention → Generate → Detokenize
3. **Context window** — Max tokens (prompt + response) the model can use; overflow is truncated or rejected
4. **Hallucination** — Confident pattern completion without fact-checking; use RAG, verification
5. **Temperature** — Low = consistent; high = creative