What is RAG?
Retrieval-Augmented Generation (RAG) is an AI framework that enhances the accuracy and relevance of responses from large language models (LLMs) by integrating external information sources into the generation process. This approach allows LLMs to access up-to-date and domain-specific data, addressing limitations associated with static training datasets.
How does RAG work?
RAG operates through a multi-step process:
- Indexing: External data sources, such as documents, databases, or websites, are processed and transformed into vector representations (embeddings) that capture semantic meaning. These embeddings are stored in a vector database for efficient retrieval.
- Retrieval: When a user poses a query, the system searches the vector database to identify and retrieve the most relevant documents or data segments that pertain to the query.
- Augmentation: The retrieved information is combined with the original user query to form an augmented prompt. This enriched prompt provides the LLM with additional context necessary for generating a more accurate response.
- Generation: The LLM processes the augmented prompt to produce a response that is informed by both its pre-existing knowledge and the newly retrieved information.
This methodology enables LLMs to generate responses that are grounded in current and specific data, thereby reducing the likelihood of inaccuracies or "hallucinations."
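Conceptually, the whole pipeline boils down to a few function calls. Here is a minimal Python sketch with hypothetical helpers (embed, build_prompt, llm.generate) just to show the shape of the flow; a fully runnable example follows later in this post.
# Hypothetical helpers for illustration only; a runnable version appears below.
def index_documents(documents, vector_db):
    vector_db.add(embed(documents))           # 1. Indexing: store document embeddings

def rag_answer(query, vector_db, llm):
    docs = vector_db.search(embed(query))     # 2. Retrieval: find the most relevant documents
    prompt = build_prompt(query, docs)        # 3. Augmentation: add them to the prompt
    return llm.generate(prompt)               # 4. Generation: answer with the extra context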
What's really different with RAG?
The key technical innovation of Retrieval-Augmented Generation (RAG) lies in its combination of information retrieval with large language model generation to produce more accurate and contextually relevant outputs.
Retrieval: A vector search engine is used to fetch relevant information from an external database or document collection based on the user query. The retrieved information acts as additional context.
Generation: The retrieved information is combined with the query and passed to the language model, enhancing the relevance and factuality of its response.
RAG relies on vector embeddings to represent text semantically, enabling efficient and accurate retrieval of information. Vector embeddings capture the meaning of text, allowing the system to find relevant information even if the query wording differs from the stored text.
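To make that concrete, here is a small sketch using sentence-transformers (the same all-MiniLM-L6-v2 model used in the code later in this post) showing that a query and a differently worded document still end up close in embedding space:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
query = "Who started SpaceX?"
docs = ["SpaceX was founded by Elon Musk.", "The Eiffel Tower is in Paris."]

# Cosine similarity between the query embedding and each document embedding
scores = util.cos_sim(model.encode(query), model.encode(docs))
print(scores)  # the SpaceX sentence scores much higher, despite the different wording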
Unlike traditional language models, which rely on static knowledge encoded during training, RAG dynamically incorporates external knowledge at inference time. This flexibility allows the model to remain up-to-date without retraining and to specialize in domain-specific tasks by accessing relevant datasets.
Enough, show me the code!
Install the required Python libraries:
pip install faiss-cpu sentence-transformers transformers
Then, this is a simple example of RAG:
import faiss
from sentence_transformers import SentenceTransformer
from transformers import pipeline
# 1. Initialize Embedding Model and FAISS Index
embedding_model = SentenceTransformer('all-MiniLM-L6-v2') # For embedding generation
vector_dimension = embedding_model.get_sentence_embedding_dimension()
index = faiss.IndexFlatL2(vector_dimension) # FAISS index for L2 similarity
# 2. Sample Dataset (Replace this with your own documents)
documents = [
    "The Eiffel Tower is in Paris.",
    "The Great Wall of China is visible from space.",
    "Python is a popular programming language.",
    "SpaceX was founded by Elon Musk."
]
document_embeddings = embedding_model.encode(documents)
index.add(document_embeddings) # Add document embeddings to FAISS
# 3. RAG Workflow
def retrieve_and_generate(query, k=2):
    # Encode the query into an embedding
    query_embedding = embedding_model.encode([query])
    # Retrieve top-k similar documents
    distances, indices = index.search(query_embedding, k)
    retrieved_docs = [documents[i] for i in indices[0]]
    # Combine retrieved documents for context
    context = " ".join(retrieved_docs)
    augmented_query = f"Context: {context} Query: {query}"
    # Use a generative model to answer (in practice, create the pipeline once, outside the function)
    generator = pipeline('text-generation', model='gpt2')
    response = generator(augmented_query, max_new_tokens=50, num_return_sequences=1)
    return response[0]['generated_text']
# 4. Example Usage
query = "Who founded SpaceX?"
output = retrieve_and_generate(query)
print(output)
What's happening in the code?
- Indexing: The text corpus (documents) is embedded using SentenceTransformer, a model optimized for creating dense vector representations. The embeddings are stored in a FAISS index for efficient similarity search.
- Query Processing: The input query is embedded and compared with the stored embeddings to find the most relevant documents.
- Augmented Prompting: The retrieved documents are concatenated into a contextual input (augmented query) and fed to a text generation model.
- Text Generation: The generative model (GPT-2 in this example; in practice a stronger model such as GPT-4.5 or Llama 4) generates a response based on the augmented query.
What are embeddings?
Embeddings are numerical representations of data, typically in the form of vectors, that encode the semantic or contextual meaning of the data into a fixed-dimensional space.
Data points with similar meanings or characteristics are mapped to vectors that are close to each other in the embedding space.
Regardless of the input size, embeddings are represented by a fixed number of dimensions (e.g., a vector of length 512).
In other terms, natural language text is converted ("embedded") into vectors that you can think of as points on a map. Text that is close in meaning will be close on the map.
In the code above, the SentenceTransformer model is used to convert the text into embeddings.
They are a cornerstone of many modern machine learning and natural language processing (NLP) applications, enabling efficient and meaningful computations on complex data.
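A quick way to see the fixed-dimension property is to embed texts of very different lengths and compare the shapes (using the same all-MiniLM-L6-v2 model as above, which produces 384-dimensional vectors):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
short_vec = model.encode("Paris")
long_vec = model.encode("The Eiffel Tower, completed in 1889, is a wrought-iron lattice tower in Paris.")
print(short_vec.shape, long_vec.shape)  # both (384,): the dimension is fixed regardless of input length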
Different types of embeddings
Word Embeddings represent words or tokens as vectors in a way that captures semantic relationships. Example: In the embedding space, the vector for "king" might be close to "queen" and maintain a similar relationship as "man" to "woman."
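If you want to try the analogy yourself, here is a minimal sketch using gensim's pre-trained GloVe word vectors (the exact neighbours returned depend on the model you load):
import gensim.downloader as api

# Downloads a small pre-trained GloVe model on first use
glove = api.load("glove-wiki-gigaword-50")

# vector("king") - vector("man") + vector("woman") is expected to land near vector("queen")
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))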
Sentence and Document Embeddings represent entire sentences, paragraphs, or documents.
In the same way, there are also image embeddings, audio embeddings, and video embeddings. Combining modalities leads to embeddings that represent data from multiple modalities (e.g., text and images) in a unified space.
Graph embeddings are also a thing, key to use cases like social networks, knowledge graphs, and recommendation systems.
Embeddings are typically generated by neural networks as intermediate outputs from trained layers. For example, a word embedding layer in a transformer model.
There are many embedding models out there, but here are some of the most popular ones:
- Stella: High-performance, open-source embedding model available in 400M and 1.5B parameter sizes.
- NV-Embed-v2 (NVIDIA): Generalist embedding model fine-tuned on Mistral 7B, suitable for various tasks including retrieval and classification.
- Gemini Embedding (Google DeepMind): Multilingual embedding model leveraging Gemini's capabilities, excelling in language understanding tasks.
- OpenAI's text-embedding-3-large: Advanced embedding model with up to 3072 dimensions, offering strong performance across various benchmarks.
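As an example of calling a hosted model from this list, here is a minimal sketch using OpenAI's Python client (it assumes the openai package is installed and an OPENAI_API_KEY environment variable is set):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="The Eiffel Tower is in Paris.",
)
vector = response.data[0].embedding
print(len(vector))  # 3072 dimensions by default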
And then there is generation
The generation component is where the language model synthesizes a response by combining the user's query with relevant information retrieved from external sources.
This process enhances the model's ability to produce accurate and contextually relevant answers, especially for queries requiring up-to-date or domain-specific knowledge.
The retrieved documents are combined with the original user query to form an augmented prompt. This prompt provides the language model with additional context, grounding its response in the retrieved information.
Improving retrieval
Before generation, the retrieved documents can be re-ranked to prioritize the most relevant information, ensuring that the model focuses on the most pertinent context. As we discussed in our search and LLM post, ranking is a crucial step in improving the quality of the response. You can also use cross-encoders or specialized models to re-rank the top retrieved documents by relevance.
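Here is a minimal re-ranking sketch using a cross-encoder from sentence-transformers (the model name is one common choice; any cross-encoder trained for passage ranking works the same way):
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "Who founded SpaceX?"
retrieved_docs = [
    "SpaceX was founded by Elon Musk.",
    "The Eiffel Tower is in Paris.",
]

# Score each (query, document) pair jointly, then sort by descending relevance
scores = reranker.predict([(query, doc) for doc in retrieved_docs])
reranked = [doc for _, doc in sorted(zip(scores, retrieved_docs), reverse=True)]
print(reranked)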
Negative examples help: during retriever training, include "hard negatives" (e.g., highly similar but incorrect passages) to make the model better at discerning relevance.
Selecting the most relevant portions of the retrieved documents helps reduce noise and focuses the model's attention on critical information. A related technique is chunking: splitting long documents into smaller, coherent chunks (e.g., 300-500 tokens) to give retrieval better granularity. A simple sketch follows below.
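Here is a simple word-based chunking sketch (production pipelines usually use token-aware splitters, but the idea is the same):
def chunk_text(text, chunk_size=300, overlap=50):
    """Split a long document into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Each chunk is then embedded and indexed individually, as in the FAISS example above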
You can also adjust the language model to better handle augmented prompts to enhance its ability to integrate retrieved information effectively.
Enhancing generation
Limit the input context to the top-N most relevant chunks to avoid overwhelming the generation model. Use sliding windows to optimize for token limits if dealing with long queries.
Preprocess retrieved documents into a structured format (e.g., bullet points, summaries) before feeding them to the model.
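A small sketch combining both ideas, assuming you already have (score, chunk) pairs from a re-ranker like the one above and using a rough word count as a stand-in for token limits:
def build_context(scored_chunks, top_n=3, max_words=400):
    # Keep only the top-N highest-scoring chunks that fit the word budget
    best = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)[:top_n]
    selected, used = [], 0
    for score, chunk in best:
        words = len(chunk.split())
        if used + words > max_words:
            break
        selected.append(f"- {chunk}")  # format each chunk as a bullet point
        used += words
    return "\n".join(selected)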
Prompt engineering
Provide the model with clear instructions on how to utilize retrieved documents.
Prompt with explicit instructions
Context: [RETRIEVED DOCUMENTS]
Question: [USER QUERY]
Answer:
You can also assign a role to the model, framing the model as a domain expert to improve the specificity of the response.
Prompt with role assignment
You are a financial expert. Use the provided context to answer accurately.
Context: [RETRIEVED DOCUMENTS]
Question: [USER QUERY]
Answer:
Asking the model to think step by step can improve the quality of the response. This is called chain of thought prompting.
Prompt with chain of thought
Context: [RETRIEVED DOCUMENTS]
Question: [USER QUERY]
Task: Think step by step, responding in the following format:
- [Reasoning]
- [Answer]
Finally, you can use guardrails to prevent hallucination, so that the model doesn't generate an answer that is not based on the context.
Prompt with guardrails to prevent hallucination
You are a financial expert. Use the provided context to answer accurately.
Only answer questions based on the provided context. If unsure, respond with "I don't know based on the context provided."
Context: [RETRIEVED DOCUMENTS]
Question: [USER QUERY]
Answer:
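In code, filling such a template is plain string formatting; here is a small sketch, reusing the retrieved documents from the earlier example:
GUARDRAIL_TEMPLATE = (
    "You are a financial expert. Use the provided context to answer accurately.\n"
    "Only answer questions based on the provided context. "
    "If unsure, respond with \"I don't know based on the context provided.\"\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer:"
)

prompt = GUARDRAIL_TEMPLATE.format(
    context=" ".join(retrieved_docs),  # e.g., the re-ranked documents from above
    question="Who founded SpaceX?",
)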
Conclusion
Retrieval-Augmented Generation represents a significant advancement in making language models more reliable, accurate, and practical for real-world applications.
By combining the power of information retrieval with generative AI, RAG addresses many of the limitations of traditional language models, such as:
- Providing up-to-date information without model retraining
- Reducing hallucinations by grounding responses in verified sources
- Enabling domain-specific expertise through custom knowledge bases
- Offering transparency and traceability of information sources
As the field continues to evolve, we can expect to see more sophisticated RAG implementations with improved retrieval mechanisms, better context processing, and more efficient embedding techniques.
The future of RAG likely includes multimodal capabilities, handling not just text but also images, audio, and other data types, making it an increasingly valuable tool in the AI landscape.
Whether you're building a customer support system, a research assistant, or a domain-specific AI application, RAG provides a robust framework for creating more reliable and contextually aware AI systems.
By understanding and implementing RAG effectively, developers can create AI applications that not only generate responses but do so with accuracy, relevance, and verifiable sources.