Context Engineering#

Context engineering is the discipline of deliberately designing, structuring, and managing the total information environment provided to an LLM — system prompts, retrieved documents, tool outputs, conversation history, and structured metadata — to maximize the quality and reliability of its outputs.

While prompt engineering focuses on crafting the instruction text, context engineering treats the entire context window as an engineered artifact with a budget, architecture, and lifecycle.

Learning Objectives#

Understand the distinction between prompt engineering and context engineering
Map the anatomy of an LLM context window and its components
Apply token budgeting strategies for context allocation
Distinguish write context from read context
Implement dynamic context assembly with RAG and tool outputs
Leverage context caching for cost optimization
Recognize common context anti-patterns and their fixes

1. From Prompt Engineering to Context Engineering#

Prompt engineering treats the user instruction as the primary lever for model behavior. Context engineering expands the scope to everything the model sees:

Prompt Engineering          Context Engineering
─────────────────          ────────────────────
"Write a summary"    →     System prompt + persona
                           + retrieved documents
                           + tool call results
                           + conversation history
                           + output format schema
                           + few-shot examples

The shift matters because production LLM systems rarely depend on a single instruction. A customer-support agent’s quality depends on the retrieved ticket history, the customer profile, the policy documents, and the tool outputs — not just “Answer the customer’s question.”

2. Anatomy of a Context Window#

Every LLM call assembles a context window from distinct components, each serving a different role:

graph TB
    CW["Context Window (e.g. 128k tokens)"]
    CW --> SP["System Prompt<br/>Identity, rules, output format"]
    CW --> FS["Few-Shot Examples<br/>Desired input→output pairs"]
    CW --> RAG["Retrieved Context<br/>Documents from vector search"]
    CW --> TO["Tool Outputs<br/>API responses, DB results"]
    CW --> CH["Conversation History<br/>Prior turns in the session"]
    CW --> UI["User Instruction<br/>The current request"]

Component Roles#

Component	Purpose	Typical Size
System prompt	Define persona, rules, output format	500–2,000 tokens
Few-shot examples	Demonstrate desired behavior	500–3,000 tokens
Retrieved context	Ground answers in source material	2,000–20,000 tokens
Tool outputs	Inject real-time data	500–5,000 tokens
Conversation history	Maintain multi-turn coherence	1,000–50,000 tokens
User instruction	The current task	50–500 tokens

3. Context Budgeting#

A context budget allocates the available token window across components, just as a financial budget allocates funds across departments. Overfilling the window forces truncation or request errors and dilutes the relevant material; underfilling it leaves the model under-informed.

Priority-Based Allocation#

from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4o")

CONTEXT_LIMIT = 128_000
RESPONSE_RESERVE = 4_096  # tokens reserved for output

BUDGET = {
    "system_prompt": 1_500,
    "few_shot": 2_000,
    "retrieved_context": 20_000,
    "tool_outputs": 5_000,
    # History gets the remainder: 28_500 is the sum of the four fixed allocations above
    "conversation_history": CONTEXT_LIMIT - RESPONSE_RESERVE - 28_500,
}


def count_tokens(text: str) -> int:
    return len(enc.encode(text))


def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Truncate text to fit within token budget."""
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])
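The arithmetic behind the history allocation can be made explicit. The following is a minimal sketch, assuming a hypothetical helper named `allocate_budget` (it is not part of the original code), that derives the history share from the fixed allocations and fails loudly if they over-commit the window:

```python
def allocate_budget(
    context_limit: int,
    response_reserve: int,
    fixed: dict[str, int],
) -> dict[str, int]:
    """Give conversation history whatever remains after the fixed allocations."""
    remaining = context_limit - response_reserve - sum(fixed.values())
    if remaining < 0:
        raise ValueError(f"Fixed allocations exceed the window by {-remaining} tokens")
    return {**fixed, "conversation_history": remaining}


budget = allocate_budget(
    context_limit=128_000,
    response_reserve=4_096,
    fixed={
        "system_prompt": 1_500,
        "few_shot": 2_000,
        "retrieved_context": 20_000,
        "tool_outputs": 5_000,
    },
)
print(budget["conversation_history"])  # 95404
```

Note that nothing here reserves room for the user instruction itself, so in this sketch it would come out of the history share.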

Truncation Strategies#

When a component exceeds its budget, you need a truncation strategy:

Recency: Keep the most recent items (conversation history)
Relevance: Keep the highest-scored items (retrieved documents)
Summarization: Compress older content into summaries
Sliding window: Drop the oldest items beyond a threshold

def manage_conversation_history(
    messages: list[dict],
    max_tokens: int,
    summarizer=None,
) -> list[dict]:
    """Keep recent messages within budget; summarize or drop the overflow."""
    total = sum(count_tokens(m["content"]) for m in messages)

    if total <= max_tokens:
        return messages

    # Always keep a leading system message (if present) and the last N messages
    keep_recent = 6  # last 3 exchanges
    head = messages[:1] if messages[0]["role"] == "system" else []
    body = messages[len(head):]
    old_messages = body[:-keep_recent]
    recent_messages = body[-keep_recent:]

    if summarizer and old_messages:
        # The summary is assumed to be short; its size is not re-checked here
        summary = summarizer(old_messages)
        note = {"role": "system", "content": f"Earlier conversation summary: {summary}"}
        return head + [note] + recent_messages

    # Fallback: drop the oldest messages until within budget.
    # If the kept messages alone exceed the budget, the result is still over.
    while old_messages and total > max_tokens:
        dropped = old_messages.pop(0)
        total -= count_tokens(dropped["content"])

    return head + old_messages + recent_messages
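The relevance strategy can be sketched in the same spirit. This example assumes retrieval returns `(score, text)` pairs and uses a hypothetical `approx_tokens` whitespace counter as a stand-in for a real tokenizer; both are illustrative choices, not part of the original code:

```python
def approx_tokens(text: str) -> int:
    """Crude whitespace count; swap in a real tokenizer in practice."""
    return len(text.split())


def pack_by_relevance(
    scored_docs: list[tuple[float, str]],
    max_tokens: int,
) -> list[str]:
    """Greedily keep the highest-scored documents that fit within max_tokens."""
    kept: list[str] = []
    used = 0
    for _score, text in sorted(scored_docs, key=lambda pair: pair[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost > max_tokens:
            continue  # this one does not fit; a smaller, lower-ranked one still might
        kept.append(text)
        used += cost
    return kept


docs = [
    (0.91, "reset your password from the account settings page"),
    (0.42, "our office is closed on public holidays"),
    (0.77, "password resets expire after one hour"),
]
print(pack_by_relevance(docs, max_tokens=14))
```

With a 14-token budget, the two password documents fit and the lowest-scored document is left out.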

4. Write Context vs. Read Context#

A framing often attributed to Andrej Karpathy distinguishes two modes of context:

Write context: Information you author and control — system prompts, few-shot examples, output schemas. These are static and optimized offline.
Read context: Information assembled at runtime — retrieved documents, tool outputs, user messages. These are dynamic and require runtime engineering.

graph LR
    subgraph Write["Write Context (authored offline)"]
        SP2["System prompt"]
        FS2["Few-shot examples"]
        OS["Output schema"]
    end
    subgraph Read["Read Context (assembled at runtime)"]
        RAG2["RAG results"]
        TO2["Tool outputs"]
        CH2["Chat history"]
        UI2["User message"]
    end
    Write --> MW["Context Window"]
    Read --> MW

Optimization Strategies by Type#

Write context (optimize once, reuse many times):

A/B test system prompt variants across evaluation datasets
Curate few-shot examples that cover edge cases
Version-control prompts alongside application code

Read context (optimize per-request):

Rank retrieved documents by relevance before injection
Summarize verbose tool outputs
Compress conversation history as it grows
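As one illustration of trimming read context per request, verbose tool outputs can be reduced before injection. The sketch below uses deterministic field filtering rather than an LLM summarizer; the function name `compact_tool_output`, the field names, and the sample payload are all hypothetical:

```python
import json


def compact_tool_output(payload: dict, keep_fields: set[str], max_items: int = 3) -> str:
    """Keep only whitelisted fields and cap list lengths before injection."""
    compact: dict = {}
    for key, value in payload.items():
        if key not in keep_fields:
            continue  # drop fields the model does not need (ids, traces, ...)
        if isinstance(value, list):
            compact[key] = value[:max_items]
            if len(value) > max_items:
                # Tell the model that something was left out
                compact[f"{key}_omitted"] = len(value) - max_items
        else:
            compact[key] = value
    return json.dumps(compact, separators=(",", ":"))


raw = {
    "status": "degraded",
    "region": "eu-west-1",
    "request_id": "hypothetical-id",
    "incidents": ["disk latency", "dns timeout", "cert expiry", "queue backlog"],
    "debug_trace": "x" * 500,
}
print(compact_tool_output(raw, keep_fields={"status", "region", "incidents"}))
```

Recording how many items were omitted lets the model acknowledge the gap rather than assume the list is complete.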

5. Dynamic Context Assembly#

In production, context is assembled from multiple sources at request time. The assembly pipeline determines what the model sees:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document

SYSTEM_TEMPLATE = """You are a technical support agent for Acme Cloud.

## Rules
- Only answer based on the provided context
- If unsure, say "I'll escalate this to a specialist"
- Always cite the source document

## Customer Profile
{customer_profile}

## Retrieved Documentation
{retrieved_docs}

## Recent Ticket History
{ticket_history}
"""


def assemble_context(
    user_query: str,
    customer_id: str,
    retriever,
    ticket_db,
    customer_db,
) -> list:
    """Assemble the full context for a support query."""
    # 1. Retrieve relevant documentation
    docs = retriever.invoke(user_query)
    # Use .get so a document without a "source" key does not raise KeyError
    retrieved_text = "\n\n".join(
        f"[{doc.metadata.get('source', 'unknown')}]\n{doc.page_content}" for doc in docs[:5]
    )

    # 2. Fetch customer profile (tool output)
    profile = customer_db.get_profile(customer_id)
    profile_text = f"Name: {profile['name']}, Plan: {profile['plan']}, Region: {profile['region']}"

    # 3. Get recent ticket history
    tickets = ticket_db.get_recent(customer_id, limit=5)
    history_text = "\n".join(f"- [{t['date']}] {t['subject']}: {t['status']}" for t in tickets)

    # 4. Assemble into prompt
    prompt = ChatPromptTemplate.from_messages([
        ("system", SYSTEM_TEMPLATE),
        ("human", "{query}"),
    ])

    # invoke() returns a prompt value; convert it to the list of messages
    return prompt.invoke({
        "customer_profile": profile_text,
        "retrieved_docs": retrieved_text,
        "ticket_history": history_text,
        "query": user_query,
    }).to_messages()

Source Attribution#

Always tag retrieved content with its source so the model can cite it:

def format_retrieved_docs(docs: list[Document]) -> str:
    """Format documents with source attribution."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        formatted.append(f"[Source {i}: {source}]\n{doc.page_content}")
    return "\n\n---\n\n".join(formatted)

6. Context Caching and Optimization#

LLM providers offer caching mechanisms that reduce cost and latency when the context prefix is reused across requests:

Provider	Feature	How It Works
Anthropic	Prompt caching	Cache static prefix; pay reduced rate for cached tokens
OpenAI	Automatic prefix caching	Automatically caches matching prefixes of prompts with 1,024 tokens or more
Google	Context caching	Explicit cache creation with TTL

Anthropic Cache Example#

from anthropic import Anthropic

client = Anthropic()

# The system prompt and few-shot examples are cached
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a technical documentation assistant...",  # long system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I configure SSL?"}],
)

Design for Cacheability#

Structure your context so that the stable prefix (system prompt, few-shot examples, reference documentation) comes first, and dynamic content (user query, tool outputs) comes last:

graph TB
    subgraph CACHED["Cached — Stable across requests"]
        A["System prompt"]
        B["Few-shot examples"]
        C["Reference docs"]
    end
    subgraph DYNAMIC["Dynamic — Changes per request"]
        D["Retrieved context"]
        E["User query"]
    end
    CACHED --> DYNAMIC

    style CACHED fill:#1a3a1a,stroke:#4caf50,color:#e0e0e0
    style DYNAMIC fill:#3a1a1a,stroke:#f44336,color:#e0e0e0

7. Grounding and Hallucination Reduction#

Context engineering is your primary defense against hallucination. The model can only be faithful to context it actually receives.

Grounding Techniques#

Explicit sourcing instructions: “Only answer based on the provided documents. If the answer is not in the documents, say so.”
Source tagging: Label each document chunk so the model can cite [Source 3]
Relevance filtering: Remove low-scoring retrieval results before injection; irrelevant context can distract the model and increase hallucination
Fact-check prompting: Ask the model to verify its claims against the provided context in a second pass
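Source tagging also makes a cheap post-hoc check possible. The sketch below assumes the model cites sources as `[Source N]` (or `[Source N: name]`) and flags citations that point at sources that were never provided; the function name `check_citations` and the sample answer are hypothetical:

```python
import re


def check_citations(answer: str, num_sources: int) -> dict:
    """Report which [Source N] tags in the answer point at real sources."""
    cited = {int(n) for n in re.findall(r"\[Source (\d+)[\]:]", answer)}
    valid = {n for n in cited if 1 <= n <= num_sources}
    return {
        "cited": sorted(valid),
        "invalid": sorted(cited - valid),
        "uncited": not cited,  # True when the answer cites nothing at all
    }


answer = (
    "Enable TLS in the listener config [Source 2]. "
    "Certificates renew automatically [Source 7]."
)
print(check_citations(answer, num_sources=3))
# {'cited': [2], 'invalid': [7], 'uncited': False}
```

This only verifies that citations exist and are in range; whether the cited passage actually supports the claim still needs a second-pass check.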

The Lost-in-the-Middle Problem#

Research on long-context models (Liu et al., 2023, “Lost in the Middle”) shows that LLMs use information at the beginning and end of the context more reliably than information in the middle. Mitigations:

Place the most relevant documents first and last
Keep the total number of retrieved documents small (3–5 instead of 10–20)
Use reciprocal rank fusion (RRF) to merge the rankings from multiple retrievers into a single ranking before injection
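Two of these mitigations are easy to sketch. The example below assumes documents are identified by string ids and that each retriever returns a best-first list; `place_best_at_edges` is a hypothetical helper name, and `k = 60` is the constant conventionally used with reciprocal rank fusion:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids: score(d) = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def place_best_at_edges(ranked: list[str]) -> list[str]:
    """Alternate documents between front and back so the weakest land in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, doc_id in enumerate(ranked):
        (front if i % 2 == 0 else back).append(doc_id)
    return front + back[::-1]


vector_hits = ["a", "b", "c", "d"]
keyword_hits = ["c", "a", "e"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)                       # ['a', 'c', 'b', 'e', 'd']
print(place_best_at_edges(fused))  # ['a', 'b', 'd', 'e', 'c']
```

After reordering, the top-ranked document is first, the second-ranked is last, and the lowest-ranked documents sit in the middle of the context.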

8. Patterns and Anti-Patterns#

Patterns (Do)#

Pattern	Description
Budget-first design	Define token budgets per component before building
Layered context	Stable prefix → semi-stable retrieval → dynamic query
Progressive disclosure	Start with summaries, fetch details on demand
Context validation	Log and monitor what the model actually sees
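Context validation can start as a small audit step run before each call. The following is a rough sketch, assuming components are plain strings keyed by name and using a hypothetical `approx_tokens` whitespace counter in place of a real tokenizer:

```python
import json
import logging

logger = logging.getLogger("context_audit")


def approx_tokens(text: str) -> int:
    """Crude whitespace count; swap in a real tokenizer in practice."""
    return len(text.split())


def audit_context(components: dict[str, str], budget: dict[str, int]) -> dict:
    """Record per-component token usage and flag components over budget."""
    report: dict = {}
    for name, text in components.items():
        used = approx_tokens(text)
        limit = budget.get(name)  # None means no budget was defined
        report[name] = {
            "tokens": used,
            "budget": limit,
            "over_budget": limit is not None and used > limit,
        }
    logger.info("context_audit %s", json.dumps(report))
    return report


report = audit_context(
    components={
        "system_prompt": "You are a support agent",
        "retrieved_context": "a b c d e f",
    },
    budget={"system_prompt": 10, "retrieved_context": 4},
)
print(report["retrieved_context"])
```

Emitting the report as structured JSON makes it straightforward to chart token usage per component over time.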

Anti-Patterns (Avoid)#

Anti-Pattern	Problem
Context stuffing	Dumping everything into the window degrades quality
Blind retrieval	Injecting all results without relevance filtering
History hoarding	Letting conversation history grow untruncated until it hits the token limit
Prompt-only thinking	Optimizing the instruction while ignoring retrieved context quality

9. Putting It All Together#

A production context engineering pipeline:

graph LR
    UQ["User Query"] --> RP["Route & Plan"]
    RP --> R["Retrieve"]
    RP --> T["Call Tools"]
    R --> F["Filter & Rank"]
    T --> S["Summarize Outputs"]
    F --> A["Assemble Context"]
    S --> A
    A --> B["Check Budget"]
    B -->|Over budget| TR["Truncate"]
    TR --> A
    B -->|Within budget| LLM["Send to LLM"]
    LLM --> V["Validate Output"]
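The "Check Budget" and "Truncate" loop in the diagram might look like the sketch below. It assumes each component is a list of items sorted most-important-first and that the caller supplies a `drop_order` naming which components to trim first; these names and the `approx_tokens` counter are hypothetical:

```python
def approx_tokens(text: str) -> int:
    """Crude whitespace count; swap in a real tokenizer in practice."""
    return len(text.split())


def fit_to_budget(
    components: dict[str, list[str]],
    drop_order: list[str],
    max_tokens: int,
) -> dict[str, list[str]]:
    """Trim the least important items, component by component, until the total fits."""
    fitted = {name: list(items) for name, items in components.items()}

    def total() -> int:
        return sum(approx_tokens(item) for items in fitted.values() for item in items)

    for name in drop_order:
        while total() > max_tokens and fitted.get(name):
            fitted[name].pop()  # items are sorted best-first, so drop from the end
    if total() > max_tokens:
        raise ValueError("Context still exceeds the budget after truncation")
    return fitted


components = {
    "system": ["You are a support agent for Acme Cloud"],
    "retrieved": ["ssl setup guide for load balancers", "billing faq", "holiday schedule"],
    "tool_outputs": ["plan: pro region: eu", "last login: yesterday"],
}
fitted = fit_to_budget(components, drop_order=["tool_outputs", "retrieved"], max_tokens=22)
print(fitted["tool_outputs"])  # ['plan: pro region: eu']
```

Components missing from `drop_order` (here, the system prompt) are never trimmed, which is why the function raises rather than silently sending an oversized context.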

Prompt Caching: The Highest-ROI Optimization#

As of 2026, prompt caching is often the highest-impact cost optimization for LLM applications that reuse a long prompt prefix. When the prefix of a prompt matches a previously cached version, the provider bills those tokens at a steep discount.

How Prompt Caching Works#

graph LR
    P["Prompt"] --> SP["Static Prefix<br/>(system prompt, few-shot examples,<br/>stable reference docs)"]
    P --> DP["Dynamic Suffix<br/>(user query, conversation turn)"]
    SP -->|"Cache Write (first call)"| CACHE[("Provider Cache<br/>TTL: 5 min")]
    CACHE -->|"Cache Read (subsequent calls)"| LLM["LLM"]
    DP --> LLM

Provider Pricing#

Provider	Cache Write Cost	Cache Read Cost	Effective Discount	Min Tokens	TTL
Anthropic	1.25x base price	0.1x base price	90% on reads	1,024	5 min
OpenAI	1x (auto, no extra)	0.5x base price	50% on reads	1,024	~5-10 min
Google Gemini	Standard input rate plus hourly storage (explicit caching)	0.25x base price	75% on reads	32,768	Configurable
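The multipliers in the table translate into an expected cost once a hit rate is assumed. The sketch below is a back-of-the-envelope model, not a provider formula: it assumes every cache miss is billed at the write rate, ignores output tokens, and uses the Anthropic multipliers from the table with illustrative values for the cached share and hit rate:

```python
def relative_input_cost(
    cached_fraction: float,
    hit_rate: float,
    write_multiplier: float,
    read_multiplier: float,
) -> float:
    """Expected input cost per request relative to no caching (1.0 = unchanged).

    cached_fraction: share of input tokens that sit in the cacheable prefix
    hit_rate: share of requests that read the prefix from cache
    """
    prefix_cost = hit_rate * read_multiplier + (1 - hit_rate) * write_multiplier
    return cached_fraction * prefix_cost + (1 - cached_fraction)


cost = relative_input_cost(
    cached_fraction=0.8,   # assumed: 80% of input tokens are stable prefix
    hit_rate=0.9,          # assumed: 9 in 10 requests hit the cache
    write_multiplier=1.25,
    read_multiplier=0.1,
)
print(round(cost, 3))  # 0.372
```

Under these assumptions input cost falls to about 37% of the uncached cost. With a hit rate of zero the same function returns 1.2, which shows that a cache that is written but never read costs more than no cache at all.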

Design Rules for Cache-Friendly Prompts#

Static content FIRST: System prompt → few-shot examples → stable retrieved docs → dynamic user query
Stable prefix: Don’t change the system prompt between calls unless necessary
Batch similar requests: Group requests that share the same context prefix
Monitor cache hit rate: Track cache_creation_input_tokens vs cache_read_input_tokens in Anthropic responses

# Anthropic prompt caching example
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert audit assistant...",  # Long system prompt
            "cache_control": {"type": "ephemeral"}  # Cache this block
        }
    ],
    messages=[{"role": "user", "content": "What are the key risks?"}],
)
# Check cache performance in response.usage:
# cache_creation_input_tokens, cache_read_input_tokens
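Those two usage fields can be turned into a simple metric. This sketch assumes the usage data has been converted to a plain dict and that `input_tokens` counts only the tokens not covered by the cache; the function name `cache_read_share` and the sample numbers are hypothetical:

```python
def cache_read_share(usage: dict) -> float:
    """Share of a request's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Illustrative values: the first call writes the cache, the second reads it
first_call = {"input_tokens": 20, "cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0}
second_call = {"input_tokens": 20, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000}

print(cache_read_share(first_call))             # 0.0
print(round(cache_read_share(second_call), 3))  # 0.99
```

Averaging this share across requests gives a hit-rate signal; a value that stays near zero usually means the prefix is changing between calls or falls below the provider's minimum cacheable length.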

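Typical production savings: 45-80% cost reduction when the cache is properly configured; the actual figure depends on the cache hit rate and on how much of each prompt sits in the cached prefix.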

Practice#

Take an existing RAG pipeline and add explicit token budgeting. Measure how output quality changes as you vary the retrieved context budget from 2,000 to 20,000 tokens.
Implement a conversation history manager that summarizes old messages when the history exceeds its token budget.
Restructure a prompt to maximize cache hit rate by moving stable content to the prefix.