Guiding, not praying: a marionette metaphor for actively steering AI

AI · LLM · Prompt Engineering

How AI Works: Guide It, Don't Pray to It — A Complete Map for Steering AI

You use ChatGPT and Claude every day — but what actually happens when AI reads your words, thinks through an answer, and writes a reply? Starting from first principles, this article connects Context, Attention, KV Cache, generation, Temperature, and Hallucination into one causal chain — so you don't just use AI, you know how to steer it.

~12,000 words

You probably use ChatGPT, Claude, or Copilot most days. You write prompts, tweak your phrasing, and re-ask when the model gets it wrong. But have you ever wondered what actually happens the moment you hit send?

Is it looking up answers in some giant database? How does it know which words in your sentence matter? Why does the same question get different answers twice? And why does it sometimes state completely wrong information with total confidence?

What this article answers

This piece goes deeper than "10 ChatGPT tips." The real question is: how does AI actually work, and how does understanding that help you steer it better?

We'll walk through three phases: how AI reads, how it thinks and writes, and how you steer it. Each concept ends with practical takeaways. By the end, your mental model of AI should be in a different league.

Here's how the journey breaks down:

Phase 1 — How AI reads: starting with tokens and what the world looks like to a model; then Context Window (a metaphor for this request's context, not persistent memory); Attention (how the model decides what matters); and KV Cache (the optimization that makes inference fast). These four explain how AI receives and understands your input.

Phase 2 — How AI thinks / writes: Autoregressive Generation reveals the truth behind one-token-at-a-time output; Temperature & Sampling are the knobs for output style; and Hallucination explains why AI can sound so sure while being wrong. These three explain how AI produces a response.

Phase 3 — How you steer it: System Prompt vs User Prompt turns everything above into tools you actually use; then a customer-support bot example shows how all of it runs together.

None of these ideas stands alone — they form a linked causal chain. Once you see the chain, you hold the reins for steering AI.

Before you read

This article assumes basic software development background (you can read TypeScript, you know what cache and RAM are). No AI or machine learning expertise required.

Phase 1 · How AI reads

1. Token: the smallest unit AI sees

Before anything else about how AI works, one idea has to land: AI processes tokens, not the words you see on screen.

When you type please review this code, the model doesn't read character by character. It first chops the text into chunks called tokens — units somewhere between letters and whole words — then maps each token to a vector of numbers. Those numbers are what the model actually computes on.

"please review this code"
   ↓ tokenization
["please", " review", " this", " code"]
   ↓ token IDs
[7847, 12043, 5521, 9982]
   ↓ embeddings
[[0.23, -0.81, ...], [0.11, 0.42, ...], ...]

what the model actually computes on

Why not plain text? Computers only speak numbers at the bottom. An AI model is, at core, a huge math function: numbers in, numbers out. Tokens are the critical middle step that turns human language into something the model can run on.

That immediately gives us a few useful ideas:

  • All of AI's "capacity" is counted in tokens, not characters. Context window limits, API billing — everything is token-based.
  • Chinese and English tokenize differently. The same meaning can cost a different number of tokens in different languages.
  • How text gets split into tokens directly affects what AI can do. The classic example: "how many r's in strawberry?" Models often get it wrong because they see straw + berry as token blocks — not s-t-r-a-w-b-e-r-r-y spelled out letter by letter.

Tokens also matter in practice because they are the billing unit. APIs charge separately for input and output tokens — not per request or per character. When you design a high-frequency AI feature, "how many tokens does this prompt use?" is literally "what does this feature cost per month?"

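// In practice, use the model provider's tokenizer or API usage response.
// Different models use different tokenizers, so character-based formulas drift fast.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment by default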
const usage = await client.messages.countTokens({
  model: "claude-opus-4-8",
  system: systemPrompt,
  messages: [
    {
      role: "user",
      content: `${documentContent}\n\nQuestion: ${userQuestion}`,
    },
  ],
});

console.log(`This input is about ${usage.input_tokens} tokens`);
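To turn a token count into a budget figure, a rough sketch like the following can help. The function `estimateMonthlyCost`, its field names, and the prices are all hypothetical placeholders, not real rates for any provider; substitute the numbers from your provider's pricing page.

```typescript
// Hypothetical pricing shape: USD per one million tokens, supplied by the caller.
interface TokenPrices {
  inputPerMillion: number;
  outputPerMillion: number;
}

interface CallProfile {
  inputTokens: number; // average input tokens per call
  outputTokens: number; // average output tokens per call
  callsPerDay: number;
}

function estimateMonthlyCost(
  profile: CallProfile,
  prices: TokenPrices,
  daysPerMonth = 30,
): number {
  // Input and output are billed separately, so price them separately
  const perCall =
    (profile.inputTokens / 1_000_000) * prices.inputPerMillion +
    (profile.outputTokens / 1_000_000) * prices.outputPerMillion;
  return perCall * profile.callsPerDay * daysPerMonth;
}

// Placeholder prices for illustration only.
const monthly = estimateMonthlyCost(
  { inputTokens: 2_000, outputTokens: 500, callsPerDay: 1_000 },
  { inputPerMillion: 1, outputPerMillion: 5 },
);
console.log(`Estimated monthly cost: $${monthly.toFixed(2)}`);
```

Because the cost scales linearly with tokens per call, trimming a prompt by half roughly halves the input side of the bill.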

Once "token = billing unit" clicks, KV Cache and context management stop being abstract details and become cost decisions.

We won't go deep on BPE tokenization, how embedding vectors carry meaning, or how models tokenize images, audio, or gene sequences here. I cover all of that in another article:

Want to go deeper on tokens?

For how tokens get split, why languages differ so much in token efficiency, and how AI turns gene sequences and weather data into tokens, see Tokens: How AI Turns the World into Numbers.

For this article, one idea is enough to carry forward:

AI's basic unit is the token, and to the model a token is a string of numbers. Everything that follows — memory, attention, cache, generation — is built on tokens.

2. Context Window: AI's working memory

With tokens in place, the first big concept is Context Window. "Working memory" here is a metaphor: it means what the model can see in this API call, not durable storage on disk.

The context window is the cap on how many tokens the model can see at once in a single conversation turn. Your question, the model's reply, uploaded files, and the system prompt behind the scenes — all share one pool.

Like RAM — cleared every time

The right analogy

The closest match is RAM: valid only for the current run, gone when the process ends. Context window is not disk-style persistent storage.

While your program runs, RAM only holds what's being processed right now. Close the program, RAM clears. Every API call to a model works the same way — it only sees what you put in the context window this time. It has no memory across sessions and does not look up what it read during training.

The LLM itself has no memory

Common misconception

The LLM has zero memory on its own. When ChatGPT or Claude Code feels like it remembers last week's chat or your project settings, that feeling comes from how the application manages context: the app assembles past messages, uploaded files, system prompts, and more, and stuffs them back into the context window on every call. That's why it feels continuous.

Strip away the app layer and call the model API directly — it only sees what you send this time. What you said last round has nothing to do with this one. Starting a new chat "forgets" everything for the same reason: it's a fresh context window.

Context Window (all tokens the model can see in this API call)

System Prompt · fixed settings
Conversation history · accumulated
Uploaded files · specific to this call
Your question this turn · latest input
─────────────────────────
Remaining space · how many tokens left?

At the API level this is clearest: the model has no hidden memory. If you want it to "remember" something, you must send the full history on every call:

// The model has no memory. Send the full conversation history on every call.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

async function chat(history: Message[], newMessage: string) {
  const messages: Message[] = [
    ...history, // ← full past conversation, resent every time
    { role: "user", content: newMessage },
  ];

  const response = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    messages,
  });

  return response;
}

// Omit history and the model has no idea what you talked about before

What happens when context fills up?

When the budget runs out, something has to give — a problem teams often run into in production. Three common strategies:

Truncation: the blunt approach — drop whatever exceeds the limit, usually the oldest messages, keeping the newest. The cost: the model loses everything from early in the thread.

Summarization: run the old conversation through the model to compress it into a summary, then replace the full history with that summary. You keep the broad strokes; detail disappears.

Sliding window: always keep only the most recent N tokens. In long chats, the model gradually "forgets" what was said too far back.

// Simplified truncation: when over budget, drop the oldest messages first.
// Assumes messages[0] is the system prompt and the last element is the latest message.
function truncateToFit(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number,
): Message[] {
  const kept = [...messages]; // copy so the caller's array is not mutated
  let total = kept.reduce((sum, m) => sum + countTokens(m.content), 0);

  // Always keep index 0 (system prompt) and the latest message; drop from index 1 onward
  while (total > maxTokens && kept.length > 2) {
    const [removed] = kept.splice(1, 1); // remove oldest non-system message
    total -= countTokens(removed.content);
  }

  return kept;
}

What this means for steering AI

Steering takeaway

A larger context window enables bigger tasks. Models today often support 200K+ tokens, so you can fit a whole contract, an entire codebase, or a long chat history in one shot.

But "can fit" isn't the same as "should dump everything in." Longer context makes Attention harder to focus and costs more. Pouring irrelevant noise into context dilutes what the model cares about. Counter-intuitive but true: good context management is about selective tradeoffs — stuffing the window full often backfires. A curated 5K-token context with only relevant material often beats a noisy 100K-token dump.

You assemble context on every call. Want the model to "remember" something? Put it in this call's context. That's steering rule number one: your context defines what AI can see.

In real apps, context management is deliberate engineering. A common pattern is layered context:

// In real apps, context is assembled — not dumped in randomly
interface ContextLayers {
  systemPrompt: string;      // fixed role definition (cache-friendly)
  retrievedDocs: string[];   // RAG chunks for this turn
  conversationHistory: Message[]; // history (length-managed)
  currentQuery: string;      // user's question this turn
}

function assembleContext(layers: ContextLayers, budget: number): Message[] {
  const messages: Message[] = [];

  // In practice systemPrompt usually goes in the API `system` field (see §8);
  // here we focus on assembling `messages`.
  // 1. Fixed system prompt — highest priority, always kept
  // 2. Retrieved docs — only the most relevant chunks, within budget
  const docs = layers.retrievedDocs.slice(0, 3).join("\n\n");

  // 3. Conversation history — keep recent turns until budget is tight
  const history = trimHistoryToBudget(layers.conversationHistory, budget);

  messages.push(...history);
  messages.push({
    role: "user",
    content: `Reference material:\n${docs}\n\nQuestion: ${layers.currentQuery}`,
  });

  return messages;
}

The point of that code is the idea: in mature AI apps, nobody blindly throws everything at the model. Every call's context is filtered and assembled on purpose — because context quality drives output quality and cost.
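The `trimHistoryToBudget` helper is called in the code above but never defined. One possible version, sketched here under stated assumptions, walks the history from newest to oldest and keeps messages until the budget is used up. The `roughTokenCount` fallback (about 4 characters per token) is a crude stand-in for illustration only; in a real app, pass in the provider's tokenizer.

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Crude stand-in for a real tokenizer (assumption: ~4 characters per token).
function roughTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

function trimHistoryToBudget(
  history: Message[],
  budget: number,
  countTokens: (text: string) => number = roughTokenCount,
): Message[] {
  const kept: Message[] = [];
  let used = 0;

  // Walk from newest to oldest and stop at the first message that does not fit
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = countTokens(history[i].content);
    if (used + cost > budget) break;
    kept.unshift(history[i]); // preserve chronological order
    used += cost;
  }

  return kept;
}

const demoHistory: Message[] = [
  { role: "user", content: "a".repeat(40) }, // ~10 tokens
  { role: "assistant", content: "b".repeat(40) }, // ~10 tokens
  { role: "user", content: "c".repeat(40) }, // ~10 tokens
];
const trimmed = trimHistoryToBudget(demoHistory, 25);
console.log(trimmed.map((m) => m.role)); // the two most recent messages survive
```

Stopping at the first message that does not fit keeps the retained history contiguous, which tends to be easier for the model to follow than a history with holes in it.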

3. Attention: how AI decides what matters

Now you know a context window might hold tens of thousands of tokens. The question: does the model really weigh every token equally when it processes them?

No. That's why Attention exists — the core of modern AI (the Transformer architecture) and the root answer to why not everything in context gets equal treatment.

A classic example

Take this sentence:

"The animal didn't cross the street because it was too tired."

When the model reaches the word "it", it has to figure out what "it" refers to: animal or street?

Attention scores how related "it" is to every other token in context. animal scores highest, because in training data "tired" correlates with living things, not streets.

it → animal   ████████████   high relevance
it → tired    ███████        medium (semantic)
it → street   ██             low
it → cross    █              very low

The model judges relevance with math, not gut feel. That computation is Attention.

The mechanism: Q, K, V

We'll go slightly deeper, but through familiar data-structure intuition. Attention works like a "fuzzy-query HashMap."

In a normal HashMap, the key has to match exactly. Attention is different: every key gets queried, but with different match strength, then all values are mixed with those weights.

Each token plays three roles in Attention:

Q (Query)  → "What am I looking for?"     current token's query
K (Key)    → "What can I offer?"          each token's outward "label"
V (Value)  → "What do I actually carry?"  each token's real contribution

The flow looks like this:

1. Take the current token's Q
2. Dot product Q with every token's K → relevance scores
3. Softmax the scores → weights that sum to 1
4. Weighted average of every token's V
5. Result = this token's vector after "seeing" the full context

Collapse those five steps and you get scaled dot-product attention from the Transformer paper:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$

Broken down:

  • $QK^{\top}$: dot product of each Query with all Keys → relevance score matrix (step 2)
  • $\sqrt{d_k}$: scaling factor; $d_k$ is the Key vector dimension; without it, high-dimensional dot products explode and Softmax goes extreme
  • softmax(⋯): scores → weights that sum to 1 (step 3)
  • multiply by $V$: weighted mix of all Values (steps 4–5)

Here's simplified TypeScript for the skeleton (scaling by $\sqrt{d_k}$ omitted, but logic matches; in practice this is heavily optimized matrix math):

// Simplified single-head attention to illustrate the flow
type Vector = number[];

function dotProduct(a: Vector, b: Vector): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}

function softmax(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max)); // subtract max to avoid overflow
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function attention(
  query: Vector, // current token's Q
  keys: Vector[], // all tokens' K
  values: Vector[], // all tokens' V
): Vector {
  // Step 2: dot Q with each K → relevance scores
  const scores = keys.map((k) => dotProduct(query, k));

  // Step 3: Softmax → weights that sum to 1
  const weights = softmax(scores);

  // Step 4: weighted average of all V
  const dim = values[0].length;
  const output: Vector = new Array(dim).fill(0);

  values.forEach((v, i) => {
    for (let d = 0; d < dim; d++) {
      output[d] += weights[i] * v[d];
    }
  });

  return output; // Step 5: vector after "seeing" full context
}

The key variable is weights: the model's judgment of how important each token in context is. High-weight tokens move the result; low-weight ones barely matter.
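To make the weights concrete, here is a small worked example that also includes the √d_k scaling from the formula. The 2-dimensional vectors are made up for illustration; real models learn Q and K vectors with far more dimensions, and the helper names here are hypothetical.

```typescript
type Vec = number[];

const dot = (a: Vec, b: Vec): number => a.reduce((s, v, i) => s + v * b[i], 0);

function softmaxWeights(scores: number[]): number[] {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max)); // subtract max for numerical stability
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Scaled dot-product attention weights for a single query
function attentionWeights(query: Vec, keys: Vec[]): number[] {
  const scale = Math.sqrt(query.length); // sqrt(d_k)
  return softmaxWeights(keys.map((k) => dot(query, k) / scale));
}

// Made-up vectors: the query for "it" points in the same direction as the key for "animal"
const tokens = ["animal", "street", "tired"];
const queryForIt: Vec = [1, 0];
const keyVectors: Vec[] = [
  [2, 0], // animal: strongly aligned with the query
  [0, 2], // street: orthogonal to the query
  [1, 1], // tired: partially aligned
];

const w = attentionWeights(queryForIt, keyVectors);
tokens.forEach((t, i) => console.log(`it -> ${t}: ${w[i].toFixed(2)}`));
```

With these inputs the ordering matches the bar chart earlier in this section: animal receives the largest weight, tired sits in the middle, and street gets the least.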

Multi-Head Attention: many angles at once

In practice, Attention doesn't run once. It runs as several parallel groups (8 in the original Transformer, often dozens in larger models), each called a Head. Each Head can learn a different attention pattern. An idealized picture:

Head 1 → syntax (subject ↔ verb)
Head 2 → semantics (synonyms, near-meaning)
Head 3 → reference (it / they / this → what?)
Head 4 → position (distance in the sequence)

All Head outputs merge into one token vector that encodes syntax, semantics, reference, and position at once. That's how a model can hold all of those in one pass.

A philosophical note on "understanding"

Worth pausing here. When AI "understands" that it means animal, it isn't grasping what animals or tiredness are. It's finding a statistical pattern in training data: the token "tired" co-occurs with animate nouns far more often than with streets.

Where capability ends

In other words, Attention captures statistical association, not true semantic understanding. That distinction matters for AI's limits: where training patterns are clear, it looks like it gets it; where patterns are sparse or contradictory, cracks show. When we get to Hallucination, you'll see how that leads to confident wrong answers.

Holding Attention on the "statistics" side keeps a healthy stance: don't underestimate it (the patterns are rich enough for huge tasks), don't mythologize it either (no independent fact-checking).

Positional encoding and Lost in the Middle

Self-attention is order-blind in pure math: feed embeddings without position info and "the cat sat on the mat" looks the same as "on the mat sat the cat." Modern LLMs therefore add positional encoding, which tells the model each token's index in the sequence, counting up from 0 at the start. A common implementation is RoPE (Rotary Position Embedding). The model uses this for word order, reference, and knowing where it is in a generation task.

In practice, apps usually stack context in time order: system prompt first, conversation history after that, latest user message at the tail. When generating the next token, the model looks backward through the sequence, so the latest question sits near where processing is happening.

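Easy misconception: positional encoding marks position in the sequence, not "how recent in wall-clock time." How much the model uses a piece of context also doesn't simply decay the farther it is from the current turn. Research (Liu et al., 2023) measured how accurately models use a piece of information depending on where it sits in context and found a U-shaped curve, sketched here as effective attention: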

Attention weights (simplified)

high │ ████                              ████
     │ ████                              ████
 low │      ████  ████  ████  ████  ████
     └──────────────────────────────────────
       start         middle              end
       (System       (easily ignored)    (latest question)
        Prompt)
Lost in the Middle

The system prompt at the start has low position numbers but often gets high Attention; middle chunks (old chat or document sections) get ignored most; the tail gets high weight because it's next to the generation point. That's Lost in the Middle.

Several factors stack — no single mechanism explains it all:

Positional encoding lets the model sense order and relative distance, with positional attention bias baked in.

Training data structure bias is another layer. Pre-training text often puts summaries, intros, and conclusions at the start and end; instruction fine-tuning puts directives at the front. Models may reinforce "start and end matter more" as a statistical habit. Liu et al. (2023) discuss this as one factor among several, not the sole cause.

Context lengths seen in training, causal masks, and Softmax competition in long contexts also play a role. Together they systematically undervalue middle content in long threads.

What this means for steering AI

Attention explains several things you've probably noticed:

Steering takeaway

Put important info in the right place (Lost in the Middle). The U-shaped Attention curve means: in a long document, key instructions at the start or end work much better than buried in the middle. If critical evidence must sit mid-doc, pull highlights into a summary, number sections, or ask the model to check each section — so it doesn't stitch the whole thing from head and tail alone.

Longer context, harder focus. Ten tokens cross-referencing each other is easy; at 100K tokens the number of token pairs grows roughly with the square of the length, which means far more compute and far more noise competing for weight. That's why an overstuffed context can hurt detail.

Repeating yourself often works. Say a constraint once in the system prompt and again in the user message: the instruction now appears in two places in context, including near the end where weight tends to be high, which improves the odds of compliance. Treat it as a tendency, not a guarantee.
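The "longer context, harder focus" point above comes down to simple arithmetic. As a rough sketch, assume every token is scored against every token; causal masking roughly halves the count, and real implementations optimize heavily, but the growth stays quadratic. The function name is hypothetical.

```typescript
// Query-key score computations per attention head, per layer,
// assuming every token is compared with every token.
function attentionPairs(contextTokens: number): number {
  return contextTokens * contextTokens;
}

console.log(attentionPairs(10)); // 100
console.log(attentionPairs(100_000)); // 10000000000
```

Growing the context by a factor of 10,000 grows the number of comparisons by a factor of 100,000,000.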

4. KV Cache: the key to faster inference

With Q, K, V from Attention in mind, KV Cache follows naturally — it grows directly out of Attention computation.

The problem: recompute every time?

Recall: for each token, Attention compares its Q against the K of every token in context, then mixes those tokens' V by the resulting weights.

Picture this: your system prompt is 1,000 tokens. Every new user message — does the model recompute K and V for all 1,000?

Without KV Cache, yes, every single time. And those 1,000 tokens haven't changed — that's a lot of duplicate work.

Core idea

KV Cache stores computed K and V. Next time the same prefix appears, read from cache instead of recomputing.

Without KV Cache:
Every call
  → System Prompt (1000 tokens) recompute K, V
  → Conversation history recompute K, V
  → New message compute K, V
  → Generate answer

With KV Cache:
First call
  → System Prompt (1000 tokens) compute K, V → store in cache

Second call (System Prompt unchanged)
  → System Prompt → read cache ✓ (no recompute!)
  → New message compute K, V
  → Generate answer

Same idea as Redis or a CDN: compute once, store, reuse. Here what's cached is K and V vectors from Attention.

Cache hit vs cache miss

Whether KV Cache hits depends on whether the prefix changed:

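// Note: chat(systemPrompt, userMessage) here is a simplified helper for illustration,
// not the history-based chat() from §2.
// ✅ Cache hit: stable system prompt; different user messages still hit cache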
const systemPrompt = "You are a professional support agent. Be friendly and concise.";

await chat(systemPrompt, "Where is my order?"); // system prompt KV computed once
await chat(systemPrompt, "How do I request a refund?"); // system prompt read from cache ✓

// ❌ Cache miss: one word changed in system prompt → full cache invalidation
const v1 = "You are a professional support agent. Be friendly and concise.";
const v2 = "You are a professional support helper. Be friendly and concise."; // "agent" → "helper"

await chat(v1, "Where is my order?"); // compute once
await chat(v2, "Where is my order?"); // prefix diverges → cache miss → full recompute ✗
Cost trap

Change one character and the whole cache can invalidate. Prefix matching is that strict: compare token by token from the start; the moment something diverges, everything after that is a miss.
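The strictness of prefix matching is easy to see in a toy comparison. This sketch uses made-up token IDs and a hypothetical `sharedPrefixLength` helper; real providers typically cache at coarser granularity (for example at marked breakpoints), but the principle that reuse stops at the first difference is the same.

```typescript
// Length of the shared prefix between two token ID sequences
function sharedPrefixLength(cached: number[], incoming: number[]): number {
  const limit = Math.min(cached.length, incoming.length);
  let i = 0;
  while (i < limit && cached[i] === incoming[i]) i++;
  return i;
}

// Made-up token IDs: five tokens of stable system prompt, then the user's question
const cachedPrompt = [101, 7, 42, 9, 300, 12];
const sameSystemNewQuestion = [101, 7, 42, 9, 300, 55, 56];
const editedEarly = [101, 7, 43, 9, 300, 12]; // third token changed

console.log(sharedPrefixLength(cachedPrompt, sameSystemNewQuestion)); // 5
console.log(sharedPrefixLength(cachedPrompt, editedEarly)); // 2
```

A new question at the end leaves the whole system prompt reusable, while a one-token edit near the start throws away almost everything, even though the two sequences are otherwise identical after that point.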

The right context order

With prefix matching in mind, you see why vendors keep saying "put stable content at the front of context":

Context order (top to bottom; prefix matched from the top)

System Prompt · stable · front, easiest to cache
Few-shot examples · stable · same
Uploaded documents · call-specific, often large
Conversation history · grows over time
User message this turn · changes every time · last

Put volatile content (user data) first and stable system prompt last, and every user change nukes the cache behind it. Wrong order, cache is useless.

Anthropic Prompt Caching

Claude exposes Prompt Caching so you can mark what to cache explicitly:

const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longStableSystemPrompt, // long, stable system prompt
      cache_control: { type: "ephemeral" }, // ← mark for caching
    },
  ],
  messages: [{ role: "user", content: userMessage }],
});

Cached tokens cost less and arrive faster than a full recompute. For high-frequency apps, the gap is huge.

What this means for steering AI

Steering takeaway

Keep the system prompt stable. Don't tweak wording every call — even an extra space breaks prefix cache and raises latency and cost.

Large fixed docs go up front. Manuals, codebases, specs — stuff you'll reference all thread but that doesn't change — belong early in context with cache markers. Later Q&A skips recomputing their KV.

Cache expires. Claude's Prompt Cache defaults to ~5 minutes of idle time. High-traffic apps hit often; low-traffic apps mostly miss — factor that into architecture and cost estimates.

This is not a micro-optimization

Why does this matter so much for cost? Imagine a support bot: system prompt plus product manual = 10,000 tokens, 5,000 calls per day. Without KV Cache, those 10,000 tokens recompute and rebill every time — 50 million token-equivalents of waste daily. With KV Cache, stable content computes once per cache window; cached portions often bill at a fraction of full price. KV Cache often decides whether an AI product is profitable — far beyond "micro-optimization."
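The support-bot arithmetic above can be sketched as a small calculator. The hit rate and the cached-read discount below are assumed values for illustration, not any provider's actual pricing, and the sketch ignores any extra charge for writing to the cache; the type and function names are hypothetical.

```typescript
interface CacheScenario {
  stablePrefixTokens: number; // system prompt + product manual
  callsPerDay: number;
  cacheHitRate: number; // 0..1, fraction of calls that reuse the cached prefix (assumed)
  cachedReadDiscount: number; // 0..1, cached-token price as a fraction of full price (assumed)
}

// Daily cost of the stable prefix, expressed in full-price token equivalents
function dailyPrefixTokenEquivalents(s: CacheScenario): {
  withoutCache: number;
  withCache: number;
} {
  const withoutCache = s.stablePrefixTokens * s.callsPerDay;
  const hits = s.callsPerDay * s.cacheHitRate;
  const misses = s.callsPerDay - hits;
  // Misses pay full price; hits pay the discounted rate
  const withCache = s.stablePrefixTokens * (misses + hits * s.cachedReadDiscount);
  return { withoutCache, withCache };
}

const scenario = dailyPrefixTokenEquivalents({
  stablePrefixTokens: 10_000,
  callsPerDay: 5_000,
  cacheHitRate: 0.9,
  cachedReadDiscount: 0.1,
});
console.log(scenario);
```

Under these assumed numbers the prefix costs about 9.5 million token equivalents per day instead of 50 million. The result is most sensitive to the hit rate, which is why prefix stability and traffic patterns matter so much.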

When you wonder why some AI features are cheap and others expensive, cache hit rate is often the answer. Stable prefix, fixed content up front, no dynamic system prompt edits — small engineering habits that add up to real money.

Phase 2 · How AI thinks / writes

5. Autoregressive Generation: how AI produces answers

The first four sections were about how AI reads. From here, how it writes — and we need to break a very common wrong mental model first.

Breaking the "database lookup" illusion

People unfamiliar with AI often imagine it works like this:

Your question → AI "looks up" a giant knowledge base → finds answer → returns it
Wrong model

That's not how it works. AI isn't search or a database query. What it actually does:

Your question → predict "the most likely next token" → append it
        → predict again → append → repeat until done

That's Autoregressive Generation: use what's already been generated to predict the next token, one after another.

The flow step by step

Suppose you ask: The capital of Taiwan is

Step 1: Context = "The capital of Taiwan is"
        → predict next token distribution
        → candidates: " Taipei" 88%, " Ta" 8%, other 4%
        → pick " Taipei"

Step 2: Context = "The capital of Taiwan is Taipei"  ← note: output becomes input
        → predict next token
        → candidates: "." 70%, "," 15%, other 15%
        → pick "."

Step 3: Context = "The capital of Taiwan is Taipei."
        → predict next token
        → candidates: <EOS> 92%, other 8%
        → pick <EOS> → stop generation

The crucial bit each step: the token just produced immediately becomes input for the next step. That's the "auto-regressive" part — output feeds back as input.

TypeScript expressing the loop:

// Autoregressive generation is fundamentally a loop
async function generate(prompt: string, maxTokens: number): Promise<string> {
  let context = prompt;

  for (let i = 0; i < maxTokens; i++) {
    // Each step: predict next-token distribution from current context
    const distribution = await predictNextToken(context);

    // Pick one token from the distribution (§6 covers "how to pick")
    const nextToken = sample(distribution);

    if (nextToken === END_OF_SEQUENCE) break;

    // Key: append output back into context for the next round
    context += nextToken;
  }

  return context;
}
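The loop above depends on helpers that are not defined here. As a runnable toy version, the sketch below replaces the model with a hand-written lookup table keyed on the last token only, and always picks the most likely candidate. Everything in it (the table, the probabilities, the function names) is made up for illustration; a real model computes the distribution from the entire context.

```typescript
type Distribution = Record<string, number>;

const END = "<EOS>";

// Toy "model": maps the last token to next-token probabilities.
const toyModel: Record<string, Distribution> = {
  " is": { " Taipei": 0.88, " Ta": 0.08, " a": 0.04 },
  " Taipei": { ".": 0.7, ",": 0.15, " City": 0.15 },
  ".": { [END]: 0.92, " It": 0.08 },
};

function predictNext(context: string[]): Distribution {
  const last = context[context.length - 1];
  return toyModel[last] ?? { [END]: 1 }; // unknown token: stop
}

// Greedy choice: take the highest-probability candidate
function pickMostLikely(dist: Distribution): string {
  return Object.entries(dist).reduce((best, cur) => (cur[1] > best[1] ? cur : best))[0];
}

function generateToy(promptTokens: string[], maxTokens: number): string[] {
  const context = [...promptTokens];
  for (let i = 0; i < maxTokens; i++) {
    const next = pickMostLikely(predictNext(context));
    if (next === END) break;
    context.push(next); // output becomes input for the next step
  }
  return context;
}

const out = generateToy(["The", " capital", " of", " Taiwan", " is"], 10);
console.log(out.join("")); // The capital of Taiwan is Taipei.
```

The loop ends because the end-of-sequence token becomes the most likely candidate, not because the toy model planned a length in advance.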

Why this mechanism matters

AI has no global plan. It doesn't draft the whole answer in its head first. When writing word one, it doesn't know what word ten will be. That's why answers sometimes pivot mid-stream or contradict themselves — one step forward, no rewind.

Errors compound. Once a token is chosen, it enters context and shapes every prediction after. One wrong step, and everything downstream builds on the mistake. Source of many "spiraling" answers.

Length isn't pre-decided. The model doesn't pick "200 words" then write; each step it weighs the end-of-sequence token, and when that probability is high enough, it stops.

Parallels for developers

If you've used a streaming API and watched text appear piece by piece, that's not just a frontend typewriter effect: it reflects the real generation process. Each token is a full model forward pass, and the stream delivers tokens as they are produced.

// The "typing" you see in a stream is autoregressive generation itself
const stream = await client.messages.stream({
  model: "claude-opus-4-8",
  max_tokens: 1024,
  messages: [{ role: "user", content: "What is the capital of Taiwan?" }],
});

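for await (const event of stream) {
  // Only text deltas carry a `text` field; other delta types are skipped
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text); // each delta carries newly generated token(s)
  }
}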

That also explains a few practical things:

  • Longer answers take more time: every extra token is another full inference step.
  • Output tokens usually cost more than input: input computes KV once (and may cache); each output token runs the full generation loop.
  • Asking for brevity actually works: you shift the odds toward the stop token, so it finishes sooner.

Why "think step by step" helps

Why chain-of-thought works

With autoregression in mind, Chain-of-Thought makes sense: asking the model to think step by step means it writes intermediate steps first; those steps enter context and scaffold later predictions. Demanding the final answer in one shot forces a single high-stakes guess. Spreading reasoning across steps usually improves accuracy. Many models' "reasoning mode" is essentially autoregressively generating a long reasoning trace before the final answer.

// ❌ Ask for the answer directly: model must get it right in one shot
await chat("", "A store sold 23 units on day one. Day two was 3× day one. How many total?");

// ✅ Ask for step-by-step reasoning: intermediate steps scaffold later tokens
await chat(
  "",
  "A store sold 23 units on day one. Day two was 3× day one. How many total? Walk through each step, then give the answer.",
);

6. Temperature & Sampling: knobs for output control

We left off at sample(distribution). Time to open it up: after the model has a probability distribution for the next token, how does it actually pick one?

That pick is the sampling strategy, and Temperature is the main knob shaping it.

What Temperature is

Each generation step produces a distribution over candidate tokens:

Next token candidates:
" Paris"  → 80%
" Lyon"   → 15%
" Mars"   →  3%
" Rome"   →  1%
other     →  1%

Temperature controls the shape of that distribution: sharper or flatter.

Temperature = 0 (approaching 0): sharpest
" Paris" → 99.9%   " Lyon" → 0.1%   other → ~0%
→ almost always picks the top token; stable, predictable

Temperature = 1.0 (default): original distribution
" Paris" → 80%   " Lyon" → 15%   " Mars" → 3%
→ sample by original probabilities; some variation

Temperature = 2.0 (high): flattest (approximate, treating "other" as a single candidate)
" Paris" → ~54%   " Lyon" → ~23%   " Mars" → ~10%   " Rome" → ~6%   other → ~6%
→ probabilities level out; low-probability tokens get a real chance; more random

Handy analogy: Temperature is like a "fairness" dial on a die.

  • Low temperature → loaded die, almost always rolls 6.
  • High temperature → fairer die, any face can show up.

In math, Temperature scales each logit before Softmax:

function applyTemperature(logits: number[], temperature: number): number[] {
  if (temperature === 0) {
    // Temperature 0: greedy — fully deterministic
    const maxIndex = logits.indexOf(Math.max(...logits));
    return logits.map((_, i) => (i === maxIndex ? 1 : 0));
  }

  // Divide by temperature: lower → sharper; higher → flatter
  const scaled = logits.map((l) => l / temperature);
  return softmax(scaled);
}
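To see the effect on actual numbers, the sketch below applies temperature to the Paris/Lyon example. It makes two simplifying assumptions: the log of each probability stands in for the model's logits, and "other" is treated as one candidate rather than many. The helper `softmaxT` is hypothetical.

```typescript
// Softmax over logits divided by temperature (temperature must be > 0 here)
function softmaxT(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const labels = ["Paris", "Lyon", "Mars", "Rome", "other"];
const baseProbs = [0.8, 0.15, 0.03, 0.01, 0.01];
const logits = baseProbs.map((p) => Math.log(p)); // log-probabilities serve as logits

for (const t of [0.25, 1, 2]) {
  const probs = softmaxT(logits, t);
  const row = labels.map((l, i) => `${l} ${(probs[i] * 100).toFixed(1)}%`).join("  ");
  console.log(`T=${t}: ${row}`);
}
```

Under these assumptions, T = 1 reproduces the original distribution, T = 0.25 pushes Paris above 99%, and T = 2 leaves Paris at roughly 54% while the tail grows. The ranking of candidates never changes; only the gaps between them do.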

Common phenomena explained

Why does the same prompt give different answers? With Temperature > 0, the model samples randomly — not table lookup. Different rolls, different outputs.

Why low temperature for code, high for copywriting?

Code generation: Temperature = 0
  → you want the correct answer, not a creative one
  → the highest-probability token is usually syntactically right

Creative writing: Temperature = 0.9 ~ 1.2
  → you want surprising phrasing
  → let low-probability tokens through for variety

Two more parameters: Top-P and Top-K

Besides Temperature, API docs mention two other sampling parameters.

Top-P (nucleus sampling): doesn't reshape the distribution directly. It sorts candidates by probability and samples only from the smallest top group whose cumulative probability reaches P (for example 0.9, i.e. 90%).

Top-P = 0.9, candidate list:
" Paris" → 80%   ← cumulative 80%
" Lyon"  → 15%   ← cumulative 95% > 90%, include it and stop
" Mars"  →  3%   ← excluded
" Rome"  →  1%   ← excluded

→ sample only between " Paris" and " Lyon"

In other words, the token that pushes cumulative probability over the threshold still stays in the sampling pool; then the model redistributes relative probability inside that pool. Effect: filter out vanishingly unlikely tokens that cause garbage or nonsense.

Top-K: simpler — sample only from the top K tokens by probability.

Top-K = 2:
→ sample only from " Paris" (80%) and " Lyon" (15%); everything else excluded
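For comparison, a Top-K filter is only a few lines. This sketch uses hypothetical names (`Scored`, `topKFilter`) and renormalizes the survivors the same way the Top-P example does.

```typescript
type Scored = { token: string; prob: number };

// Keep the K most probable tokens and rescale them to sum to 1.
function topKFilter(candidates: Scored[], k: number): Scored[] {
  const kept = [...candidates]
    .sort((a, b) => b.prob - a.prob)
    .slice(0, Math.max(0, k));
  const total = kept.reduce((sum, c) => sum + c.prob, 0);
  return kept.map((c) => ({ token: c.token, prob: total > 0 ? c.prob / total : 0 }));
}

const topTwo = topKFilter(
  [
    { token: " Mars", prob: 0.03 },
    { token: " Paris", prob: 0.8 },
    { token: " Rome", prob: 0.01 },
    { token: " Lyon", prob: 0.15 },
  ],
  2,
);
```

The difference from Top-P is that K is a fixed count: the pool stays the same size whether the model is confident or not, while a Top-P pool shrinks when one token dominates.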

In practice, Temperature alone is usually enough; leave Top-P and Top-K at defaults.

A practical reference table

Task type              Temperature   Top-P    Notes
─────────────────────────────────────────────────────────
Code generation         0 ~ 0.2      0.9      stable, correct
Extraction / classify   0            -        as repeatable as possible
Q&A / summarization     0.3 ~ 0.7    0.9      accuracy + fluency
General chat            0.7 ~ 1.0    0.95     natural variation
Creative writing        1.0 ~ 1.2    0.95     encourage surprise
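Brainstorming           1.2+         1.0      maximum diversity

Allowed ranges differ by provider: some APIs cap Temperature at 1.0, so treat values above 1 as illustrative and check your provider's limits.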

An important correction

Some people think high Temperature = smarter or more creative, low = dumber or safer.

Common misconception

That's wrong.

Temperature doesn't change capability — only how the model picks from the distribution. A weak model at high Temperature produces messier garbage, not genius. Capability comes from training; Temperature is an output-style knob.

7. Hallucination: why AI says the wrong thing

With generation and sampling in place, we can tackle the question everyone wrestles with: why does AI sound so sure while talking nonsense?

Redefining the term

Redefining hallucination

"Hallucination" suggests the model is seeing things that aren't there. More precisely: it outputs something that sounds right but isn't — and doesn't know it's wrong.

After sections 5 and 6, Hallucination isn't a mysterious bug — it's a predictable structural outcome.

Why hallucination is inevitable

Back to the core of autoregressive generation:

What generation optimizes for

The model predicts a distribution for the next token, then samples. That process optimizes for plausible-sounding sequences; verifying factual correctness is a completely different job.

Wrong mental model:
AI → query knowledge base → find fact → output

Right mental model:
AI → given context, predict "the most likely next token" → output

After "In 1952 Einstein proposed…" what token is most likely? Training patterns suggest something that sounds like physics, whether that thing actually exists or not. There's no separate fact-checker — only "what word usually comes next."
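A toy model makes the "right mental model" tangible. The sketch below is a bigram counter, vastly simpler than a real LLM and offered only as an illustration: it picks the next word purely from how often that word followed the previous one in a tiny made-up corpus. Nothing in it checks whether the result is true.

```typescript
type BigramCounts = Map<string, Map<string, number>>;

// Count how often each word follows each other word in the corpus.
function buildBigramCounts(corpus: string[]): BigramCounts {
  const counts: BigramCounts = new Map();
  for (const sentence of corpus) {
    const words = sentence.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
    for (let i = 0; i + 1 < words.length; i++) {
      const followers = counts.get(words[i]) ?? new Map<string, number>();
      followers.set(words[i + 1], (followers.get(words[i + 1]) ?? 0) + 1);
      counts.set(words[i], followers);
    }
  }
  return counts;
}

// Return the most frequent follower of `word`, or undefined if the word is unseen.
function mostLikelyNext(counts: BigramCounts, word: string): string | undefined {
  const followers = counts.get(word.toLowerCase());
  if (!followers) return undefined;
  let best: string | undefined = undefined;
  let bestCount = 0;
  followers.forEach((count, next) => {
    if (count > bestCount) {
      best = next;
      bestCount = count;
    }
  });
  return best;
}

// Made-up training text: "relativity" follows "proposed" most often.
const toyCounts = buildBigramCounts([
  "einstein proposed relativity",
  "einstein proposed relativity",
  "bohr proposed complementarity",
]);
const toyGuess = mostLikelyNext(toyCounts, "proposed");
```

Ask this toy what comes after "proposed" and it answers "relativity" no matter who the sentence was about. Real models condition on far more context, but the objective has the same shape: likely continuation, not verified fact.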

Where hallucination shows up most

Obscure facts. Less training text on a topic means a fuzzier distribution and more wrong tokens. Top-ten city populations: usually fine. History of a small town: often shaky.

Exact numbers. Numbers are a weak spot — "1994" and "1942" look similar as tokens but mean very different things. "Roughly which decade" often works; "exact year" often doesn't.

Late in long context. The longer the thread, the more tokens compete for Attention, so a fact stated in turn 1 has a weaker pull by turn 30.

Tasks where you want invention. "Write a fictional company history" uses the same mechanism as hallucination — you just authorized it. Creative writing shines here because the job is to produce plausible fiction.

Two types of hallucination

Intrinsic Hallucination
  → output directly contradicts context you provided
  → you gave a doc saying A; AI answers B

Extrinsic Hallucination
  → output cannot be verified from your context
  → AI "fills in" information you never provided
  → more common, harder to spot

Extrinsic hallucination is the more dangerous of the two. The output rarely contradicts your source material; instead the model mixes unverifiable claims in with correct content, which makes them hard to spot at a glance.
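One crude way to surface candidates for extrinsic hallucination is to look for specifics in the answer that never appear in the source. The sketch below checks numbers only. It is a heuristic assumed for illustration, not an established detector: it misses paraphrased or non-numeric claims and will also flag legitimately derived figures. The sample texts are invented.

```typescript
// Extract number-like strings (integers and decimals) from text.
function extractNumbers(text: string): string[] {
  return text.match(/\d+(?:\.\d+)?/g) ?? [];
}

// Return numbers that appear in the answer but nowhere in the source.
function findUnsupportedNumbers(answer: string, source: string): string[] {
  const supported = new Set(extractNumbers(source));
  const unsupported = extractNumbers(answer).filter((n) => !supported.has(n));
  // De-duplicate while preserving first-seen order.
  return unsupported.filter((n, i) => unsupported.indexOf(n) === i);
}

const policySource = "Returns are accepted within 30 days of delivery.";
const modelAnswer = "You can return items within 30 days, and refunds take 5 business days.";

// "30" is backed by the source; "5" is something the model filled in.
const flaggedNumbers = findUnsupportedNumbers(modelAnswer, policySource);
```

A flagged number is a prompt for a human to check the original, not proof of an error.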

Steering hallucination with this mental model

Steering takeaway

Give context; don't rely on memory. AI predicting from text you provide beats predicting from training residue. Instead of "when did Einstein win the Nobel?", paste the source and ask "according to this, when did Einstein win the Nobel?" That's the core idea behind RAG (Retrieval-Augmented Generation) — retrieve first, then answer from what you retrieved.

Ask the model to flag uncertainty. Add to the system prompt: "if you're not sure, say so — don't guess." This won't eliminate hallucination, but when the model's distribution is flat, it makes "I'm not sure" a more likely output than a plausible wrong answer.

Verify high-stakes facts. Legal text, medical info, exact dates, citations — don't trust raw output. Look up originals. AI fits best as "draft for you to verify," with humans signing off.

Constrain output shape. Instead of open-ended generation, offer choices: "which is correct? A, B, or C?" Narrowing the token space narrows hallucination space.

// ❌ Rely on "memory": prone to hallucination
const bad = await chat(
  "You are an assistant",
  "What is our company's return policy window in days?",
);

// ✅ Provide context: put facts in the prompt
const good = await chat(
  "You are an assistant. Answer only from the user's provided material. If it's not there, say \"not mentioned in the material.\"",
  `Return policy document:\n${policyDocument}\n\nQuestion: How many days for returns?`,
);
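The "constrain output shape" takeaway can also be enforced on the application side. This sketch assumes two hypothetical helpers, `buildChoicePrompt` and `parseChoice`: the first asks for a single label, the second accepts the reply only if it is one of the offered labels, so free-form text never gets treated as an answer.

```typescript
// Build a prompt that asks for exactly one label from a fixed set.
function buildChoicePrompt(question: string, options: Record<string, string>): string {
  const lines = Object.keys(options).map((label) => `${label}. ${options[label]}`);
  return `${question}\n${lines.join("\n")}\nReply with only the letter of the correct option.`;
}

// Accept the reply only if it matches an allowed (uppercase) label.
// Returns null otherwise so the caller can retry or escalate.
function parseChoice(reply: string, allowed: string[]): string | null {
  const cleaned = reply.trim().replace(/[.)\s]+$/, "").toUpperCase();
  return allowed.indexOf(cleaned) >= 0 ? cleaned : null;
}

const choicePrompt = buildChoicePrompt("Which return window does the policy state?", {
  A: "7 days",
  B: "14 days",
  C: "30 days",
});

const accepted = parseChoice(" c. ", ["A", "B", "C"]); // normalized to "C"
const rejected = parseChoice("Probably C, but it depends", ["A", "B", "C"]); // null
```

Rejecting anything outside the allowed set turns a silent wrong answer into a visible failure you can handle.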

A mindset shift

Mindset shift

Hallucination won't be "fixed." It's structural to this generation mechanism, not a bug. Better models reduce frequency, but as long as the core job is predicting token distributions, wrong content remains possible.

The point is using AI where it fits: reasoning, synthesis, drafting, analysis — and adding human verification where it doesn't (precise facts, numbers, citations).

Phase 3 · How you steer it

8. System Prompt vs User Prompt: your basic steering tools

This is the hands-on chapter. Every mechanism from the first seven sections maps to something you do here.

Structure first

A full Claude API request looks like this:

const response = await client.messages.create({
  model: "claude-opus-4-8",
  max_tokens: 1024,
  system: "You are a professional technical writing assistant...", // ← System Prompt
  messages: [
    { role: "user", content: "Draft API documentation for me" }, // ← User Prompt
    { role: "assistant", content: "Sure — please share the API spec..." },
    { role: "user", content: "Here is the spec: ..." }, // ← User Prompt
  ],
});

System and user prompts share one context window, but their roles, stability, and layers of influence differ.

System Prompt: who the AI is

The system prompt is the "settings layer" before the conversation starts. It does this:

Define role     → "You are a professional legal assistant"
Set constraints → "Only answer contract-related questions"
Specify format  → "Use bullet points; stay under 200 words"
Set tone        → "Formal tone; no emoji"
Inject knowledge → "Here is our product documentation: ..."

From what we've learned, the system prompt has a few properties:

It always sits at the front of context. Attention's U-shape gives the start and end of context high weight, so a system prompt at the front benefits from primacy bias, and its position also matches what KV Cache prefix matching needs. In practice its instructions tend to be followed with high priority.

It's ideal for KV Cache. Stable, identical every call — perfect prefix cache. A well-designed system prompt saves a lot of repeated compute.

Common misconception

It sets default behavior, not absolute law. Many people miss this. The system prompt says how the model should normally act, but a clear opposite in the user prompt can override it. Hard limits need deliberate reinforcement in design.
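Why a stable system prompt matters for caching can be seen with a simplified model of prefix matching. The sketch assumes the cache is reusable only up to the first token that differs between two requests, and it uses whitespace splitting as a stand-in for a real tokenizer. Real implementations work on token blocks with provider-specific rules, so treat this purely as an illustration.

```typescript
// Length of the shared leading run of two token sequences.
function sharedPrefixLength(a: string[], b: string[]): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

// Whitespace split stands in for a real tokenizer (an assumption).
const toTokens = (text: string): string[] =>
  text.split(/\s+/).filter((t) => t.length > 0);

const cachedRequest = toTokens(
  "You are a legal assistant. Only answer contract questions. Q: What is a lien?",
);
const samePrefix = toTokens(
  "You are a legal assistant. Only answer contract questions. Q: What is escrow?",
);
const editedPrefix = toTokens(
  "You are a helpful legal assistant. Only answer contract questions. Q: What is escrow?",
);

// Stable system prompt: everything up to the new question is reusable.
const reusableStable = sharedPrefixLength(cachedRequest, samePrefix);
// One word inserted near the start: reuse stops right there.
const reusableEdited = sharedPrefixLength(cachedRequest, editedPrefix);
```

In this toy, the unchanged system prompt shares 12 leading tokens with the cached request, while the edited one shares only 3. Everything after the first difference has to be recomputed.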

User Prompt: the task each turn

The user prompt is what the human types each round:

Assign task      → "Draft an apology email to a customer"
Provide material → "Here is the contract text: ..."
Give feedback    → "This version is too formal — make it conversational"
Add constraints  → "Also keep it under 100 words"

Its properties:

It's dynamic. Different every turn — not a KV Cache target.

It can override the system prompt. Flexibility and risk. If users shouldn't change model behavior, critical constraints need to be explicit and firm in the system prompt.

It sits toward the end, but Attention weights the tail highly too. Your latest instruction already packs punch — usually no need to repeat it.

How they interact

System Prompt              User Prompt
────────────────────────────────────────────────
"Reply in English"    +    "what is AI?"
  → AI explains AI in English

"Only legal questions" +   "write me a poem"
  → AI should refuse (not guaranteed — depends how firm the system prompt is)

"Keep tone formal"    +    "explain this casually"
  → user prompt overrides system; AI usually goes casual
    (when they conflict, the nearer, clearer instruction tends to win)
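Since "should refuse" is not guaranteed, limits that really matter are often backed by a check in application code after the model replies. The sketch below is one possible shape for that, with hypothetical names (`ReplyRules`, `findViolations`); it covers only a word limit and banned phrases, and what to do on a violation (retry, fall back, escalate) is left to the caller.

```typescript
interface ReplyRules {
  maxWords?: number;
  bannedPhrases?: string[];
}

// Return a human-readable list of rule violations; empty means the reply passes.
function findViolations(reply: string, rules: ReplyRules): string[] {
  const violations: string[] = [];

  const wordCount = reply.trim().split(/\s+/).filter((w) => w.length > 0).length;
  if (rules.maxWords !== undefined && wordCount > rules.maxWords) {
    violations.push(`too long: ${wordCount} words (limit ${rules.maxWords})`);
  }

  const lower = reply.toLowerCase();
  for (const phrase of rules.bannedPhrases ?? []) {
    if (lower.indexOf(phrase.toLowerCase()) >= 0) {
      violations.push(`banned phrase: "${phrase}"`);
    }
  }
  return violations;
}

const replyProblems = findViolations("Sure, here is a poem about contracts", {
  maxWords: 3,
  bannedPhrases: ["poem"],
});
```

The system prompt shapes what the model tends to do; a check like this decides what your application is willing to pass on.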

Using everything from the first seven sections

A solid system prompt should reflect every mechanism you've learned:

const systemPrompt = `
You are the technical writing assistant for Lumen Tech.

# Hard constraints (up front — high Attention weight at the start)
- Answer only from material the user provides. Do not add outside information.
- If unsure, say "I'm not sure" — do not guess.

# Output format (structure narrows the token space)
1. Conclusion (one sentence)
2. Explanation (bullets, max three)
3. Code example if needed

# Long threads (respect Context Window limits)
If the conversation grows, prioritize the latest task; older detail can drop.
`.trim();

// Stable system prompt → good KV Cache candidate → lower cost

Mechanism behind each line
  • Constraints up front → Attention weight on the start
  • Flag uncertainty → counter Hallucination
  • Format limits → narrow the token space, so sampling has less room to drift
  • Length hints → respect finite Context Window
  • Stable system prompt → KV Cache hits

A useful framing

Think of the system prompt as the job description when you hire someone; the user prompt is the daily assignment.

A clear job description tells them how to handle ambiguity; a vague one leaves them guessing.

Job description (System Prompt) covers:
  → scope of this role
  → how to handle uncertainty
  → output format and tone standards
  → what to decide alone vs. when to say "I don't know"

Work assignment (User Prompt) covers:
  → the specific task this turn
  → materials or constraints for this turn only
  → feedback on the last output

9. In practice: a support bot tying it all together

Theory done — let's walk through a concrete example where all eight concepts show up in a real app.

Suppose you're building a company support bot. It answers product questions in a professional, friendly tone, and must never invent policies the company doesn't have. We'll trace each decision to the mechanism behind it.

Step 1: Design the system prompt

const systemPrompt = `
You are the support assistant for Lumen Tech.

# Core constraints
- Answer only from "reference material." If it's not there, say "I'll connect you with a specialist."
- Never guess or invent specs, prices, or return policies.
- If unsure, say so — don't sound plausible without evidence.

# Tone
- Professional, friendly, concise.
- No emoji.

# Answer format
- Answer the question first, then add necessary detail.
- Use numbered steps when describing procedures.
`.trim();

What's at work here?
  • Core constraints at the top → primacy bias from Attention's U-shape, so the hardest rules get the most weight.
  • "Don't guess; say when unsure" → fights Hallucination, especially extrinsic (made-up policies).
  • Fixed system prompt → KV Cache hit conditions.

Step 2: Augment with RAG

The bot can't rely on model "memory" for company-specific policy — that stuff isn't in training data; asking blind triggers hallucination. Retrieve relevant docs and put them in context:

async function answerCustomerQuestion(question: string) {
  // 1. Retrieve relevant chunks from the knowledge base
  const relevantDocs = await vectorSearch(question, { limit: 3 });

  // 2. Assemble retrieved material into context
  const userContent = `
Reference material:
${relevantDocs.map((d) => d.content).join("\n---\n")}

Customer question: ${question}
  `.trim();

  // 3. Call the model
  const response = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    temperature: 0.3, // low: support needs stable, predictable answers
    system: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userContent }],
  });

  return response;
}

What's at work here?
  • Retrieved chunks fill the Context Window so the model answers from what's in front of it, not fuzzy training residue — the strongest hallucination reducer.
  • temperature: 0.3: support needs stable, predictable answers, not creativity. Low Temperature keeps similar questions consistent.
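  • cache_control: marks the stable system prompt as a cacheable prefix so repeat calls can hit the KV Cache. That is a big cost and latency win at high volume, provided the prefix is long enough to meet the provider's minimum cacheable size.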
  • limit: 3 takes the three most relevant chunks, echoing "more context isn't always better" so Attention isn't diluted by noise. This example avoids the name topK on purpose, so it doesn't get confused with §6 Top-K sampling.
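`vectorSearch` is not defined in this article. As a rough idea of what sits behind such a call, here is an in-memory sketch that ranks documents by cosine similarity. It assumes embeddings have already been computed elsewhere; `EmbeddedDoc`, `cosineSimilarity`, and `searchByEmbedding` are illustrative names, the three-dimensional vectors are hand-made, and a production system would typically use a vector database.

```typescript
interface EmbeddedDoc {
  content: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions must match");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every document against the query and keep the `limit` best.
function searchByEmbedding(
  query: number[],
  docs: EmbeddedDoc[],
  options: { limit: number },
): EmbeddedDoc[] {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, options.limit)
    .map((entry) => entry.doc);
}

// Hand-made toy embeddings, for illustration only.
const knowledgeBase: EmbeddedDoc[] = [
  { content: "Support hours are 9am to 5pm.", embedding: [0.0, 0.2, 0.9] },
  { content: "Returns are accepted within 30 days.", embedding: [0.9, 0.1, 0.0] },
  { content: "Headphones ship with a USB-C cable.", embedding: [0.1, 0.9, 0.1] },
];
const hits = searchByEmbedding([1, 0, 0], knowledgeBase, { limit: 2 });
```

The `limit` option plays the same role as in the example above: it caps how much retrieved text enters the context, regardless of how large the knowledge base is.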

Step 3: Manage multi-turn context

Customers may ask several questions in a row. Each turn you send history back (the model has no memory), but context can't grow forever:

async function handleConversation(
  history: Message[],
  newQuestion: string,
): Promise<Message[]> {
  // Cap history length — avoid blowing context budget or losing Attention focus
  const trimmedHistory = history.slice(-6); // keep last 3 Q&A pairs

  const relevantDocs = await vectorSearch(newQuestion, { limit: 3 });

  const messages: Message[] = [
    ...trimmedHistory,
    {
      role: "user",
      content: `Reference material:\n${formatDocs(relevantDocs)}\n\nQuestion: ${newQuestion}`,
    },
  ];

  const response = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    temperature: 0.3,
    system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
    messages,
  });

  return [...messages, { role: "assistant", content: extractText(response) }];
}

What's at work here?
  • history.slice(-6): active Context Window management. It keeps the request within budget, and because accuracy tends to drop late in long threads, trimming old turns can also sharpen focus on the current question.
  • The kept history is resent on every call because the model holds no state between requests; it won't remember the last turn on its own.
  • The relevantDocs retrieved on each turn are dynamic content, so putting them in the latest User Prompt makes sense. What is truly suited for longer-lived caching is the stable System Prompt, few-shot examples, or a large fixed document this conversation will repeatedly reference. In RAG design, separate "fixed cacheable knowledge" from "per-turn retrieval results" so you don't overestimate KV Cache hit rate.
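One limitation of `history.slice(-6)` is that it caps the number of messages, not their size, so a single long pasted document can still blow the budget. A possible refinement is to trim by estimated token count. The sketch below assumes a rough heuristic of about four characters per token (real tokenizers vary by model and language), and `ChatMessage`, `estimateTokens`, and `trimToBudget` are hypothetical names.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Very rough size estimate; an assumption, not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Walk backwards from the newest message, keeping what fits in the budget,
// then drop any leading assistant message so the history opens with a user turn.
function trimToBudget(history: ChatMessage[], maxTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(history[i]);
    used += cost;
  }
  while (kept.length > 0 && kept[0].role === "assistant") kept.shift();
  return kept;
}

// Four messages of 40 characters each, so roughly 10 estimated tokens apiece.
const filler = "x".repeat(40);
const sampleHistory: ChatMessage[] = [
  { role: "user", content: filler },
  { role: "assistant", content: filler },
  { role: "user", content: filler },
  { role: "assistant", content: filler },
];

const fitsTwo = trimToBudget(sampleHistory, 25); // newest user/assistant pair
const fitsThree = trimToBudget(sampleHistory, 35); // three fit, leading assistant dropped
```

Dropping a leading assistant message keeps Q&A pairs intact, which matters because chat APIs commonly expect the conversation to start with a user turn.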

The full chain in one answer

When a customer asks "my headphones broke — can I return them?" here's what runs:

  • Token: the question is split into tokens, and each token is mapped to a vector.
  • RAG + Context Window: system retrieves return policy docs into context.
  • KV Cache: stable system prompt cache hit, no recompute.
  • Attention: model scores how "return," "broke," etc. link to policy sections.
  • Autoregressive Generation: answer generated token by token.
  • Temperature: low setting keeps output steady.
  • Hallucination guardrails: answer grounded in retrieved docs plus system "don't guess" rule — no invented return policy.
  • System / User Prompt split: system defines role and hard rules; user brings this question and retrieved material.

Eight concepts, one answer. That's why mechanism matters: every design choice has a reason, not just trial and error.

10. Putting it together: the full map

Eight concepts down — you now have a full map. Let's string them into one causal chain and see how they link.

Token
  → AI's basic unit. All computation starts here; capacity and billing are token-based.

Context Window
  → AI's working memory. Finite like RAM; cleared every call.

Attention
  → How AI decides what matters. Each token gets a different weight via Q, K, V.
     This is where Lost in the Middle comes from.

KV Cache
  → Optimization built from Attention's K and V. Stable prefixes cache;
     that's why system prompts should be stable and up front.

Autoregressive Generation
  → How AI produces answers. One token at a time, no global plan;
     errors compound — source of many odd behaviors.

Temperature & Sampling
  → Knobs for how to pick from the distribution. Controls determinism,
     not intelligence. Low = stable; high = varied.

Hallucination
  → Structural outcome of generation, not a bug. AI optimizes for plausible;
     factual correctness needs extra tooling. Use it wisely; don't trust blindly.

System Prompt vs User Prompt
  → Turns all of the above into levers you actually pull.
     System defines the role (stable, front); user assigns the task (dynamic, can override).

Why this map matters

When AI behavior confuses you, you can drop to the mechanism layer — understand why it did that, then know what to change.

Why does a new chat feel like amnesia? Empty context window.

Why do mid-document instructions get ignored? Attention's Lost in the Middle.

Why did one character change make the API slow and expensive? KV Cache miss.

Why different answers to the same question? Temperature sampling.

Why confident and wrong? It predicts plausible, not verified correct.

You now have structural answers to all of those.

Steering AI comes down to understanding how the machine runs, then working with its mechanics. Memorizing prompt tricks is a shortcut; once you know the substrate, those tricks derive from mechanism — no incantations required.

Tips are surface; mechanism is root

Look back and every practical tip in this article came from somewhere specific. Put key info at start or end → Attention's Lost in the Middle; keep system prompt stable → KV Cache prefix matching; give the model source text instead of trusting "memory" → generation optimizes for plausible, not verified; ask for step-by-step reasoning → autoregression makes intermediate steps scaffolding for what follows. Tips are surface; mechanism is root. Hold the root and you can derive what to do for new tools, new models, and weird new behaviors — without googling "best prompt for XXX" every time.

That's the real gap between using AI and steering it.