What Is a Context Window in AI?

A context window is the maximum amount of text, measured in tokens, that a large language model can process in a single request. Everything the model uses to generate its response, including the system prompt, the conversation history, any retrieved documents, and the user's question, must fit inside this window. When the limit is reached, the application has to drop, truncate, or summarize older content; otherwise the request no longer fits.

The context window is one of the most consequential design parameters of any LLM-based application. It determines how much background a model can hold while it reasons, how many documents a RAG system can retrieve at once, and how many turns a conversation can run before the model loses track. In 2026, frontier model windows have grown from the original 2,048 tokens of GPT-3 (2020) to 10 million tokens for Meta's Llama 4 Scout, a roughly 5,000x increase in about six years.

TL;DR

A context window is the maximum number of tokens an LLM can read and reason over at once — covering system prompts, retrieved documents, chat history, and the question itself. Modern models range from 128K (the working baseline) to 10M tokens (Llama 4 Scout). Bigger windows are not free: they raise cost, slow inference, and suffer from the "lost in the middle" problem where models miss facts buried mid-context. Most production systems still use RAG to feed only the most relevant content into a smaller, focused window.

What Is a Context Window?

Every LLM is a fixed-size machine: it accepts a defined number of tokens as input and produces tokens as output, one at a time. The context window is the cap on input + output combined (or sometimes split, depending on the provider). The window represents the model's working memory — but unlike human memory, it is reset to zero at the start of each request. Anything the model "remembers" from earlier in a session is only remembered because the application replayed it back into the prompt.

Context windows differ from the broader concept of model memory. A context window is short-term and per-request. Persistent memory, by contrast, is something the application layer adds — by storing past conversations, summarizing them, and re-injecting relevant pieces into each new prompt. The context window is the bottleneck through which every piece of that memory must pass.
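The replay described above can be sketched in a few lines. This is a minimal illustration, not any provider's API: `estimate_tokens` and `build_prompt` are hypothetical helpers, the 4-characters-per-token estimate is a rough assumption for English text, and the sketch ignores the room the answer itself needs.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token for English."""
    return max(1, len(text) // 4)


def build_prompt(system_prompt: str, history: list[str], question: str,
                 window_tokens: int) -> list[str]:
    """Replay as much recent history as fits; the oldest turns are dropped first."""
    budget = (window_tokens
              - estimate_tokens(system_prompt)
              - estimate_tokens(question))
    kept: list[str] = []
    for turn in reversed(history):      # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > budget:
            break                       # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    # Restore chronological order between the system prompt and the question.
    return [system_prompt, *reversed(kept), question]
```

Whatever falls outside the returned list is simply invisible to the model on that request, which is why the application layer, not the model, owns persistent memory.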

How Tokens Work

LLMs do not see text — they see tokens. A token is a chunk of text produced by a tokenizer, the algorithm a model uses to break input into manageable pieces. Tokens roughly correspond to common substrings: short words ("the", "and") are usually one token; longer or rare words split into multiple tokens. Punctuation, whitespace, and Unicode characters all consume tokens too.

A useful rule of thumb for English text:

  • 1 token ≈ 4 characters or about 0.75 words
  • 1,000 tokens ≈ 750 English words or roughly 1.5 single-spaced pages of plain prose
  • 128,000 tokens ≈ 96,000 words or a 300-page paperback
  • 1,000,000 tokens ≈ 750,000 words, somewhat more than the full text of War and Peace (under 600,000 words in most English translations)
  • 10,000,000 tokens — Llama 4 Scout's window — covers approximately 7.5 million words, the equivalent of a small library
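The conversions in this list are simple arithmetic, sketched below. The constants are the rules of thumb stated above, not tokenizer output, and the function names are made up for illustration; real counts should come from the provider's tokenizer.

```python
CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text


def tokens_from_chars(char_count: int) -> int:
    """Estimate tokens from a character count."""
    return round(char_count / CHARS_PER_TOKEN)


def tokens_from_words(word_count: int) -> int:
    """Estimate tokens from an English word count."""
    return round(word_count / WORDS_PER_TOKEN)


def words_from_tokens(token_count: int) -> int:
    """Estimate how many English words fit in a token budget."""
    return round(token_count * WORDS_PER_TOKEN)
```

For example, `words_from_tokens(128_000)` gives 96,000 and `tokens_from_words(750)` gives 1,000, matching the figures in the list.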

Non-English languages, code, and structured data tokenize differently. Czech, German, and other morphologically rich languages typically use 1.5–2x more tokens for the same meaning. Source code is often more efficient (1 token can represent a full symbol like function), but heavy whitespace or unusual characters can blow up the count. Always check token usage with the provider's tokenizer before designing prompts that rely on tight limits.

Context Window Sizes Across Models (2026)

As of May 2026, frontier context windows fall into four tiers. The numbers below reflect the maximum advertised window — actual useful context is usually smaller (see Lost in the Middle).

Figure: Context window sizes of major LLMs, May 2026 (sources: vendor docs)

  • Llama 4 Scout (Meta): 10M
  • Gemini 3 Pro (Google): 2M
  • Claude Opus 4.6 (Anthropic): 1M
  • Gemini 3.1 Pro (Google): 1M
  • Claude Sonnet 4.6 (Anthropic): 1M
  • Llama 4 Maverick (Meta): 1M
  • GPT-5.4 extended (OpenAI): 1M (2x cost)
  • GPT-5.4 standard (OpenAI): 272K
  • GPT-4 Turbo (OpenAI): 128K
  • GPT-3, 2020 (OpenAI): 2K (baseline)

Tier 1 — 10M tokens

Meta Llama 4 Scout currently leads the field at 10 million tokens, the largest commercially available context window. At this size, the model can hold an entire mid-sized codebase or a year of email history in a single request. Real-world utility at the upper end is debated — see the cost and "lost in the middle" sections below.

Tier 2 — 1–2M tokens

The 1-million-token tier is the new frontier mainstream. Google Gemini 3 Pro goes up to 2M tokens. Claude Opus 4.6, Claude Sonnet 4.6 (after the March 13, 2026 expansion), Gemini 3.1 Pro, Gemini 3 Flash, and Llama 4 Maverick all support 1M-token windows. OpenAI GPT-5.4 exposes 1M tokens through the API and Codex with a 2x pricing surcharge above 272K.

Tier 3 — 128K–272K tokens

This is the working baseline for production systems. GPT-5.4 standard is 272K. GPT-4 Turbo, GPT-4o, Llama 3.1 405B, and Mistral Large 2 all sit at 128K. Most enterprise applications, RAG pipelines, and agent frameworks are built around this tier — not because larger windows are unavailable, but because they are not always worth the cost or latency.

Tier 4 — under 32K tokens

Smaller and older models still fill specialized roles: classifiers, embedding models, fine-tuned task models, and edge-deployed local LLMs. Their 4K–32K windows are deliberate, optimizing for speed and inference cost over breadth.

Why Context Window Matters

Context window size directly bounds what an LLM application can do in a single shot. The same model with an 8K window versus a 1M window is, functionally, two very different products:

  • Document processing — A 128K window can hold a 300-page contract; an 8K window cannot. Whole-document analysis, cross-document comparison, and summarization at scale are gated by window size.
  • Codebase reasoning — Large windows let coding assistants understand multi-file dependencies. A 1M-token window holds a meaningful slice of a real production codebase.
  • Conversation depth — Multi-turn chatbots replay history into the prompt. A small window forces aggressive truncation; a large one preserves nuance and earlier instructions.
  • Agent reliability — Tool-using AI agents accumulate observations from each step. Bigger windows let agents handle longer tasks before they hit the limit and lose state.
  • RAG retrieval depth — A larger window allows retrieving more passages per question, increasing recall — though as discussed below, this is not a free win.

Context Window vs RAG vs Fine-Tuning

A common misconception is that larger context windows make retrieval-augmented generation (RAG) obsolete. In practice, the three approaches solve different problems and are usually combined.

  • Context window is short-term, per-request memory. It is filled fresh every time and disappears after the response.
  • RAG retrieves only the most relevant chunks of a much larger corpus and injects them into the window. It scales to terabytes of documents that no window — even 10M tokens — could hold.
  • Fine-tuning bakes knowledge or behavior into the model weights themselves. The model learns patterns it then applies without needing the source content in the prompt.

Even with a 10M-token window, RAG remains the dominant production pattern for one simple reason: cost and latency scale with input size. Stuffing every query with millions of tokens is wasteful when only a few thousand are relevant. RAG keeps the prompt focused; the window provides the room to fit the retrieved snippets, the system prompt, and the question together.

Big windows do not replace governed retrieval. A 1M-token window can hold thousands of documents, but stuffing all of them per query costs more, runs slower, and degrades accuracy. RAG with a smaller, governed window almost always outperforms naive context-stuffing — especially when the underlying data has lineage, ownership, and freshness metadata to filter on.
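One way to picture "focused" retrieval is a greedy packer that fills a fixed token budget with the best-scoring chunks. This is a sketch under assumptions: the `Chunk` type, its `score` field (from some retriever), and its precomputed `tokens` count are all hypothetical, and real pipelines often add reranking and metadata filters on top.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float   # relevance score from a retriever (hypothetical)
    tokens: int    # token count, ideally measured with the provider's tokenizer


def pack_chunks(chunks: list[Chunk], budget_tokens: int) -> list[Chunk]:
    """Greedily keep the highest-scoring chunks that still fit in the budget."""
    selected: list[Chunk] = []
    remaining = budget_tokens
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.tokens <= remaining:
            selected.append(chunk)
            remaining -= chunk.tokens
    return selected
```

The budget passed in would be the window size minus the system prompt, the question, and the room reserved for the answer, so the prompt stays small no matter how large the corpus is.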

The "Lost in the Middle" Problem

A model that advertises 1M tokens does not necessarily perform well at 1M tokens. Research published since 2023 — and consistently confirmed across model generations — shows that LLMs preferentially attend to information at the beginning and end of their context, with notable accuracy degradation for facts placed in the middle. The phenomenon is widely known as "lost in the middle."

Reported degradation patterns vary by model, but the broad finding holds: most current models show 10–25% accuracy drops for needle-in-a-haystack questions when the relevant passage sits in the middle 50% of the window, compared to placement at the start or end. Counterintuitively, models with the largest windows often degrade more across their full range — there is simply more "middle" to get lost in.

Practical implications:

  • Place critical instructions and key facts at the start or end of the prompt, not buried in the middle.
  • Validate any application that relies on long-context recall with synthetic needle-in-a-haystack tests across the actual window length you intend to use.
  • Treat advertised window size as a maximum, not a working capacity — most production systems use only 30–60% of the advertised window in practice.
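A synthetic needle-in-a-haystack test mostly comes down to building the same prompt with the key fact at different depths. The sketch below only constructs the prompts; sending them to a model and scoring the answers is left out, and the function names and default depths are illustrative assumptions.

```python
def build_needle_prompt(filler: list[str], needle: str, depth: float,
                        question: str) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    passages = filler[:position] + [needle] + filler[position:]
    return "\n".join(passages + [question])


def needle_suite(filler: list[str], needle: str, question: str,
                 depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, str]:
    """Build one prompt per depth so recall can be compared across positions."""
    return {d: build_needle_prompt(filler, needle, d, question) for d in depths}
```

Running the suite with filler sized to the prompt length the application actually uses shows whether recall dips at the middle depths for that model.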

Cost and Performance Trade-offs

Context length is one of the dominant cost levers in LLM operations. Three considerations matter:

Cost

API providers charge per input and output token. A query that uses 500K input tokens costs roughly four times as much as one that uses 128K, regardless of how much of that context the model actually uses to generate the answer. Some providers, including OpenAI for GPT-5.4 above 272K, charge a multiplier on top of base rates for extended windows.
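The arithmetic can be sketched as follows. The price per million tokens is a placeholder, and the assumption that the multiplier reprices the whole request once it crosses the threshold (rather than only the tokens above it) is ours; check the provider's pricing page for the actual rule.

```python
from typing import Optional


def input_cost_usd(input_tokens: int, price_per_million_usd: float,
                   threshold_tokens: Optional[int] = None,
                   multiplier: float = 1.0) -> float:
    """Input cost for one request.

    Assumption: once the input exceeds threshold_tokens, the multiplier
    applies to the entire request, not only to the tokens above the threshold.
    """
    rate = price_per_million_usd
    if threshold_tokens is not None and input_tokens > threshold_tokens:
        rate *= multiplier
    return input_tokens / 1_000_000 * rate
```

At a hypothetical $2 per million input tokens, a 128K request costs about $0.26 and a 500K request $1.00, a ratio of roughly 3.9x. With a 2x multiplier above 272K, the same 500K request costs $2.00.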

Latency

Time to first token grows with input size. Long contexts can add several seconds — sometimes tens of seconds — of latency before the model starts streaming output. For interactive UIs, this is often the binding constraint, not cost.

Output budget

Some providers count input and output tokens against the same overall window. A long input can leave too little room for a long answer. Always check whether the provider exposes an explicit max_output_tokens parameter and whether output is counted separately.
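A small budgeting helper makes the two cases concrete. It is a sketch, not a provider API: `shared_window` is a made-up flag standing in for whichever accounting rule the provider documents, and the numbers in the usage note are illustrative.

```python
def available_output_tokens(window_tokens: int, input_tokens: int,
                            max_output_tokens: int,
                            shared_window: bool = True) -> int:
    """Tokens left for the answer under a shared or separate output limit."""
    if input_tokens > window_tokens:
        raise ValueError("input alone exceeds the context window")
    if not shared_window:
        # Output is counted separately, so only the output cap applies.
        return max_output_tokens
    # Output shares the window with the input.
    return min(max_output_tokens, window_tokens - input_tokens)
```

With a 128K shared window, a 125K-token input, and a 16K output cap, only 3,000 tokens remain for the answer; with separate accounting the full 16K would be available.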

Best Practices for Working With Context Windows

Across production systems, a small set of patterns consistently produces better results than naive context-stuffing.

  • Retrieve, don't dump — use RAG to filter content down to what is actually relevant. Even with a 10M window, a focused 20K prompt usually beats a 1M prompt for accuracy and cost.
  • Place anchors at the edges — put the system prompt, critical instructions, and the user question near the start or end of the context. Reserve the middle for retrieved supporting content.
  • Compress aggressively — summarize older conversation turns rather than replaying them verbatim. Techniques like recursive summarization can reduce a 50K-token chat history to a summary of roughly 2K tokens, often with little quality loss.
  • Chunk with overlap — when splitting long documents for retrieval, use overlapping chunks so important spans are not split across boundaries.
  • Govern what enters the window — every token in context costs money and influences output. Filter retrieved content by ownership, freshness, classification, and lineage before injecting it. This is where active metadata earns its keep.
  • Test at your real working size — never trust advertised limits. Run needle-in-a-haystack evals at the prompt size your application actually uses.
  • Monitor token usage in production — instrument every call. Token consumption is the surest leading indicator of cost overruns and quality regressions.
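The "chunk with overlap" practice from the list above can be sketched as a sliding window over a token sequence. The function name and parameters are illustrative, and the usage note uses whitespace-split words as stand-ins for tokens; a real pipeline would split on tokenizer output or on sentence boundaries.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int,
                       overlap: int) -> list[list[str]]:
    """Split tokens into chunks of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # the last chunk already reaches the end of the input
    return chunks
```

Calling `chunk_with_overlap(text.split(), chunk_size=200, overlap=40)` yields chunks in which each one repeats the last 40 words of the previous chunk, so a sentence that straddles a boundary appears whole in at least one of them.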

The context window is a powerful resource, but it is not a substitute for grounded, governed access to enterprise data. The most reliable AI systems combine modest windows with strong retrieval over active metadata, data lineage, and a semantic layer — feeding the model exactly the context it needs, no more.

© Dawiso s.r.o. All rights reserved