RAG vs Long Context: Approaches to Supplying External Knowledge to LLMs
Large language models (LLMs) have a fundamental limitation. They are trained on historical data and remain fixed at the point of their training cutoff. As a result, they have no awareness of events that occurred after training, and they do not have access to private or proprietary information such as internal documentation, company knowledge bases, or private code repositories.
If we want an LLM to reason about that information, we must provide it as context during inference. This is commonly referred to as context injection. The core architectural question is how to supply the right information to the model at the right time.
Two primary approaches have emerged to address this problem: Retrieval-Augmented Generation (RAG) and long context prompting. Each approach reflects a different philosophy about how external knowledge should be incorporated into model reasoning.
Retrieval-Augmented Generation
Retrieval-Augmented Generation is the approach most commonly used in production systems today. It relies on preprocessing documents and retrieving relevant pieces of information at query time.
In a typical RAG pipeline, source documents such as PDFs, documentation, code files, or manuals are first divided into smaller segments, a process commonly referred to as chunking. The chunks are then passed through an embedding model, which converts each piece of text into a vector representation that captures its semantic meaning. The resulting vectors are stored in a vector database.
When a user submits a query, the system performs a semantic similarity search against the stored vectors. The retrieval step returns the most relevant document chunks. These snippets are then inserted into the LLM prompt along with the user's query. The model generates its response using both the query and the retrieved context.
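The pipeline described above can be sketched in a few dozen lines. This is a minimal, self-contained illustration, not a production implementation: the embed function is a toy hashed bag-of-words stand-in for a real embedding model, and the in-memory VectorStore class stands in for a real vector database.

```python
import math
import zlib
from collections import Counter

def embed(text: str, dim: int = 256) -> list[float]:
    # Toy embedding: a hashed bag-of-words vector, normalized to unit
    # length. A real system would call an embedding model here.
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode()) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Both vectors are unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def chunk(document: str, size: int = 40) -> list[str]:
    # Fixed-size word chunking; production systems often chunk by
    # sentence or section boundaries, with overlap between chunks.
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class VectorStore:
    def __init__(self) -> None:
        self.entries: list[tuple[list[float], str]] = []

    def index(self, document: str) -> None:
        # Done once per document, ahead of query time.
        for c in chunk(document):
            self.entries.append((embed(c), c))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Semantic similarity search: rank chunks by cosine similarity.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def build_prompt(query: str, store: VectorStore) -> str:
    # Only the retrieved chunks, not the whole corpus, reach the model.
    context = "\n---\n".join(store.retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The key property to notice is that the model's input is bounded by k chunks regardless of how many documents are indexed.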
This approach works well in practice and allows LLMs to operate over large document collections. However, it introduces a dependency on the retrieval process itself. If the retrieval step fails to identify the correct information, the model never sees the relevant content.
Long Context Prompting
An alternative approach is long context prompting. Instead of building a retrieval pipeline, this method places the entire relevant dataset directly into the model's context window and allows the model's attention mechanism to identify useful information.
This approach removes several components from the system architecture. There is no need for embedding models, vector databases, or retrieval pipelines. The system simply gathers the relevant documents and includes them in the prompt.
Historically, this approach was not practical because early LLMs supported very small context windows, often around four thousand tokens. That capacity is insufficient for anything beyond a few pages of text.
Recent models support significantly larger context windows, sometimes exceeding one million tokens. One million tokens corresponds to roughly seven hundred thousand words. In principle, a large collection of documents can now be placed directly in the prompt.
This increase in context capacity raises an architectural question. If we can place all relevant information directly in the prompt, is the additional infrastructure required for RAG still necessary?
Advantages of Long Context
Long context prompting has several advantages, particularly in environments where the dataset is relatively small and well defined.
Simpler Architecture
A production RAG system contains multiple components. These include chunking logic, embedding models, vector storage, retrieval mechanisms, and often reranking systems to improve search results. The system must also maintain synchronization between the vector database and the underlying source documents.
Long context prompting removes most of this infrastructure. The system simply gathers the required documents and passes them to the model. This reduces operational complexity and eliminates several potential points of failure.
Removal of Retrieval Errors
RAG systems depend on semantic search to identify relevant information. Semantic search operates on vector representations of text and attempts to locate content that is semantically similar to the query.
This process is inherently probabilistic. In some cases the retrieval step fails to return the relevant document, even though the correct information exists in the dataset. When this occurs, the model produces an answer without having seen the relevant material.
Long context prompting eliminates this failure mode because the model receives the entire dataset.
Global Document Reasoning
Some tasks require reasoning across entire documents rather than isolated snippets.
For example, consider two documents: a product requirements specification and a set of release notes. A user might ask which security requirements were omitted from the final release.
A RAG system will likely retrieve sections related to security requirements and sections related to the release notes. However, the retrieval step cannot return the conceptual difference between the two documents.
To answer the question correctly, the model must examine both documents in their entirety and compare them. Long context prompting allows this type of global reasoning because the full documents are available to the model.
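The reason retrieval alone cannot answer this is that the answer is a relation between documents, not a passage in either one. A toy illustration, with the requirements modeled as sets of hypothetical strings:

```python
def omitted_requirements(spec: set[str], release_notes: set[str]) -> set[str]:
    # The answer is a *difference* between whole documents; no single
    # chunk of either document contains it, so similarity search over
    # chunks can never return it directly.
    return spec - release_notes

spec = {"encrypt data at rest", "enforce 2FA", "audit logging"}
released = {"encrypt data at rest", "audit logging"}
# omitted_requirements(spec, released) == {"enforce 2FA"}
```

With long context prompting, the model can perform this comparison itself because both documents are present in full.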
Limitations of Long Context
Despite its simplicity, long context prompting introduces several practical challenges.
Computational Cost
Large prompts are expensive to process. Consider a five hundred page manual: tokenizing it may produce approximately two hundred and fifty thousand tokens.
If the document is included in every prompt, the model must process the entire document for each user query. This creates a significant computational cost.
In contrast, a RAG system processes the document once during indexing. At query time, only a small number of relevant chunks are retrieved and passed to the model.
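The cost difference can be made concrete with simple arithmetic. A back-of-the-envelope sketch, using the manual's roughly two hundred and fifty thousand tokens and assumed figures for the query and retrieved-chunk sizes:

```python
def tokens_processed(queries: int, doc_tokens: int, query_tokens: int,
                     retrieved_tokens: int, long_context: bool) -> int:
    # Total input tokens the model must process across all queries.
    if long_context:
        # The full document rides along with every single query.
        return queries * (doc_tokens + query_tokens)
    # RAG: the document is embedded once at index time; each query
    # only sends the retrieved chunks to the model.
    return queries * (retrieved_tokens + query_tokens)

MANUAL_TOKENS = 250_000   # the 500-page manual from the text
long_ctx = tokens_processed(1_000, MANUAL_TOKENS, 50, 2_000, long_context=True)
rag = tokens_processed(1_000, MANUAL_TOKENS, 50, 2_000, long_context=False)
# Over 1,000 queries: 250,050,000 input tokens vs 2,050,000, ~120x fewer.
```

The assumed figures (fifty-token queries, two thousand tokens of retrieved context) are illustrative, but the shape of the result holds: long context pays the full document cost per query, while RAG pays it once.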
Prompt caching can reduce costs for static documents, but this optimization is less effective when documents change frequently.
Difficulty Locating Specific Information
There is a common assumption that if information is present in the context window, the model will be able to use it effectively. In practice, extremely large context windows can dilute the model's attention, a failure mode often described as information getting "lost in the middle" of the prompt.
If a question refers to a specific paragraph buried inside a very large document collection, the model may fail to locate it reliably. In some cases it may generate an answer based on surrounding text rather than the exact source.
RAG systems reduce this problem by filtering the dataset before it reaches the model. The retrieval process removes irrelevant content and presents the model with only a small set of relevant passages.
Dataset Size Limitations
Even context windows containing millions of tokens are small relative to typical enterprise datasets.
Corporate data lakes often contain terabytes or petabytes of information. It is not feasible to include all of that data in a single prompt.
In these environments, a retrieval layer remains necessary to filter large datasets into a manageable subset that can be processed by the model.
Choosing Between RAG and Long Context
The choice between RAG and long context depends primarily on the characteristics of the dataset and the type of reasoning required.
Long context prompting works well when the dataset is bounded and when tasks require reasoning across entire documents. Examples include analyzing legal agreements, reviewing technical specifications, or summarizing books.
RAG remains the more practical solution when working with large knowledge bases or enterprise data systems. It reduces computational cost and allows systems to operate over datasets that are far larger than any available context window.
In practice, many modern architectures combine both approaches. Retrieval is used to narrow a large dataset to a manageable subset, and long context windows are then used to provide the model with enough information to perform deeper reasoning.
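Such a hybrid pipeline can be sketched in a few lines: a coarse relevance score selects whole documents under a token budget, and the survivors are passed to the model in full so that global reasoning over each one remains possible. The scoring function and budget figures here are illustrative assumptions, not a production design:

```python
def score(query: str, document: str) -> float:
    # Coarse relevance: fraction of query words present in the document.
    # A real system would use embeddings or a search index here.
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d) / (len(q) or 1)

def hybrid_prompt(documents: dict[str, str], query: str,
                  budget_tokens: int = 200_000) -> str:
    # Stage 1: retrieval narrows the corpus to the most relevant
    # documents that fit within the token budget.
    ranked = sorted(documents.items(),
                    key=lambda kv: score(query, kv[1]), reverse=True)
    selected, used = [], 0
    for name, text in ranked:
        cost = len(text) // 4  # rough four-characters-per-token estimate
        if used + cost > budget_tokens:
            break
        selected.append(f"=== {name} ===\n{text}")
        used += cost
    # Stage 2: the selected documents go into the prompt whole, so the
    # model can still reason across each document globally.
    return "Documents:\n" + "\n\n".join(selected) + f"\n\nQuestion: {query}"
```

Retrieval here operates at document granularity rather than chunk granularity, which is one way to keep the global-reasoning benefit of long context while still bounding cost.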
Conclusion
The rapid expansion of context window sizes has changed the design space for LLM systems. Problems that previously required complex retrieval pipelines can now sometimes be addressed with direct prompting.
However, retrieval remains essential when dealing with large or continuously growing datasets. RAG and long context prompting should not be viewed as competing solutions but as complementary techniques.
The key architectural decision is determining when a bounded dataset can be provided directly to the model and when a retrieval layer is required to manage scale.