Intermediate · Tutorial 12

RAG Fundamentals: Retrieval-Augmented Generation Basics

NeuronDB Team
2/24/2025
35 min read

RAG Fundamentals Overview

Retrieval-Augmented Generation (RAG) combines retrieval with generation: relevant documents are retrieved and supplied as context to a language model. Grounding generation in retrieved knowledge improves answer quality, reduces hallucinations, and produces knowledge-grounded responses.

RAG has three main components: retrieval finds relevant documents, augmentation adds them to the prompt as context, and generation produces an answer using that context.

Figure: RAG Architecture

The diagram shows the RAG flow: a query triggers retrieval, the retrieved documents provide context, and the generator produces an answer conditioned on that context.

RAG Architecture Components

A RAG architecture includes a document store, a retriever, and a generator. The document store holds the knowledge base, the retriever finds relevant documents, and the generator produces answers.

The document store can be a vector database, a traditional database, or a hybrid of both. The retriever uses semantic search, keyword search, or a combination. The generator is a language model.
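
As a concrete illustration, here is a minimal vector-store-backed document store using FAISS. This is a sketch under our own assumptions (faiss-cpu and sentence-transformers installed); the tutorial's main examples below use a plain in-memory list instead.

# Vector Document Store (sketch; assumes faiss-cpu is installed)
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
docs = ["Machine learning is a subset of AI.", "Deep learning uses many layers."]

# Normalize embeddings so inner product equals cosine similarity
embs = embedder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(embs.shape[1])
index.add(np.asarray(embs, dtype='float32'))

query_emb = embedder.encode(["What is machine learning?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_emb, dtype='float32'), k=2)
print([docs[i] for i in ids[0]])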

# RAG Architecture
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from transformers import pipeline

class RAGSystem:
    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.generator = pipeline('text-generation', model='gpt2')
        self.documents = []
        self.embeddings = None

    def add_documents(self, documents):
        # Embed the whole knowledge base up front
        self.documents = documents
        self.embeddings = self.embedder.encode(documents)

    def retrieve(self, query, top_k=3):
        # Rank documents by cosine similarity to the query embedding
        query_emb = self.embedder.encode([query])
        similarities = cosine_similarity(query_emb, self.embeddings)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [self.documents[i] for i in top_indices]

    def generate(self, query, context):
        prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
        # max_new_tokens bounds the answer length regardless of prompt length
        output = self.generator(prompt, max_new_tokens=100)[0]['generated_text']
        return output.split("Answer:")[-1].strip()

    def query(self, query):
        context = " ".join(self.retrieve(query))
        return self.generate(query, context)

# Example
rag = RAGSystem()
rag.add_documents(["Machine learning is...", "Deep learning uses..."])
answer = rag.query("What is machine learning?")
print("Answer: " + str(answer))

This architecture enables knowledge-grounded generation, improving answer quality and reducing hallucinations.

Detailed RAG Implementation Patterns

Basic RAG uses simple retrieval and generation: the query triggers a semantic search, the top-k documents are retrieved, a context is built from them, the prompt combines context and query, and the generator produces the answer.

Advanced RAG adds reranking and filtering: the initial retrieval gathers more candidates, a reranker improves their order, filtering removes irrelevant documents, and the final context uses only the best ones.

Iterative RAG refines retrieval through feedback: an initial answer is generated and analyzed for gaps, additional retrieval fills those gaps, and the process repeats until the answer is complete.

# Detailed RAG Implementation
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import pipeline

class AdvancedRAGSystem:
    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.generator = pipeline('text-generation', model='gpt2')
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        self.documents = []
        self.embeddings = None

    def _generate(self, prompt):
        # Shared helper: generate, then keep only the text after "Answer:"
        output = self.generator(prompt, max_new_tokens=200)[0]['generated_text']
        return output.split("Answer:")[-1].strip()

    def basic_rag(self, query, top_k=3):
        """Basic RAG implementation"""
        # Retrieve
        query_emb = self.embedder.encode([query])
        similarities = cosine_similarity(query_emb, self.embeddings)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]
        retrieved = [self.documents[i] for i in top_indices]
        # Build context
        context = "\n\n".join(f"Document {i+1}: {doc}" for i, doc in enumerate(retrieved))
        # Generate
        prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
        return self._generate(prompt)

    def advanced_rag(self, query, retrieve_k=20, rerank_k=5):
        """Advanced RAG with reranking"""
        # Initial retrieval: cast a wide net
        query_emb = self.embedder.encode([query])
        similarities = cosine_similarity(query_emb, self.embeddings)[0]
        top_indices = np.argsort(similarities)[::-1][:retrieve_k]
        candidates = [self.documents[i] for i in top_indices]
        # Rerank with a cross-encoder, which scores query-document pairs jointly
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = self.reranker.predict(pairs)
        rerank_indices = np.argsort(rerank_scores)[::-1][:rerank_k]
        final_docs = [candidates[i] for i in rerank_indices]
        # Build context from the reranked documents
        context = "\n\n".join(f"Document {i+1}: {doc}" for i, doc in enumerate(final_docs))
        prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
        return self._generate(prompt)

    def iterative_rag(self, query, max_iterations=3):
        """Iterative RAG with feedback"""
        context = ""
        answer = ""
        for iteration in range(max_iterations):
            # Retrieve based on the query plus the current answer
            search_query = f"{query} {answer}" if answer else query
            query_emb = self.embedder.encode([search_query])
            similarities = cosine_similarity(query_emb, self.embeddings)[0]
            top_indices = np.argsort(similarities)[::-1][:5]
            retrieved = [self.documents[i] for i in top_indices]
            # Accumulate context across iterations
            new_context = "\n\n".join(f"Doc: {doc}" for doc in retrieved)
            context = context + "\n\n" + new_context if context else new_context
            # Generate an answer from the accumulated context
            prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
            answer = self._generate(prompt)
            # Check if the answer is complete (simplified placeholder check)
            if len(answer) > 50:
                break
        return answer

# Example
rag = AdvancedRAGSystem()
rag.documents = ["ML is AI subset", "Neural networks have layers", "Deep learning uses many layers"]
rag.embeddings = rag.embedder.encode(rag.documents)
basic_answer = rag.basic_rag("What is machine learning?")
print("Basic RAG answer: " + str(basic_answer))
advanced_answer = rag.advanced_rag("What is machine learning?", retrieve_k=10, rerank_k=3)
print("Advanced RAG answer: " + str(advanced_answer))

RAG Quality Evaluation

Evaluate RAG systems along multiple axes: answer quality measures correctness, answer relevance measures topical alignment, answer completeness measures information coverage, and context utilization measures how much of the retrieved documents the answer actually uses.

Answer quality can be measured with human evaluation or automated metrics: BLEU measures n-gram overlap, ROUGE measures summary quality, and BERTScore measures semantic similarity. Human evaluation remains the gold standard. The evaluator below covers ROUGE and BERTScore; a BLEU sketch follows it.

# RAG Quality Evaluation
from rouge_score import rouge_scorer
from bert_score import score as bert_score

class RAGEvaluator:
    def __init__(self):
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    def evaluate_answer_quality(self, generated_answer, reference_answer):
        """Evaluate answer quality using multiple metrics"""
        metrics = {}
        # ROUGE scores (reference first, candidate second)
        rouge_scores = self.rouge_scorer.score(reference_answer, generated_answer)
        metrics['rouge1'] = rouge_scores['rouge1'].fmeasure
        metrics['rouge2'] = rouge_scores['rouge2'].fmeasure
        metrics['rougeL'] = rouge_scores['rougeL'].fmeasure
        # BERTScore
        P, R, F1 = bert_score([generated_answer], [reference_answer], lang='en', verbose=False)
        metrics['bertscore_precision'] = P.item()
        metrics['bertscore_recall'] = R.item()
        metrics['bertscore_f1'] = F1.item()
        # Answer length
        metrics['answer_length'] = len(generated_answer.split())
        metrics['reference_length'] = len(reference_answer.split())
        return metrics

    def evaluate_context_utilization(self, retrieved_docs, generated_answer):
        """Measure how well answer uses retrieved context"""
        answer_words = set(generated_answer.lower().split())
        doc_words_sets = [set(doc.lower().split()) for doc in retrieved_docs]
        all_doc_words = set().union(*doc_words_sets)
        # Overlap ratio
        overlap = answer_words & all_doc_words
        utilization = len(overlap) / len(answer_words) if len(answer_words) > 0 else 0
        return {
            'context_utilization': utilization,
            'overlap_count': len(overlap),
            'answer_word_count': len(answer_words),
            'context_word_count': len(all_doc_words)
        }

    def evaluate_retrieval_quality(self, retrieved_docs, relevant_docs):
        """Evaluate retrieval performance"""
        retrieved_set = set(retrieved_docs)
        relevant_set = set(relevant_docs)
        precision = len(retrieved_set & relevant_set) / len(retrieved_set) if len(retrieved_set) > 0 else 0
        recall = len(retrieved_set & relevant_set) / len(relevant_set) if len(relevant_set) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
        return {
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

# Example
evaluator = RAGEvaluator()
generated = "Machine learning is a subset of artificial intelligence that enables computers to learn."
reference = "Machine learning is a method of data analysis that automates analytical model building."
quality_metrics = evaluator.evaluate_answer_quality(generated, reference)
print("Answer quality metrics:")
for metric, value in quality_metrics.items():
    print(f"{metric}: {value:.4f}")
Figure: RAG Components

The diagram shows the RAG components: the document store provides knowledge, the retriever finds relevant content, and the generator produces answers.

Document Processing Pipeline

Document processing prepares documents for RAG through ingestion, chunking, embedding, and indexing. Each step affects retrieval quality.

The pipeline ingests documents from their sources, chunks them into appropriately sized pieces, generates embeddings, and indexes them for fast retrieval.

# Document Processing Pipeline
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

embedder = SentenceTransformer('all-MiniLM-L6-v2')

def chunk_document(doc, chunk_size=500):
    # Simple character-based chunking; sentence-aware splitting is better in practice
    return [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

def create_index(embeddings):
    # Exact nearest-neighbor index over cosine distance
    index = NearestNeighbors(metric='cosine')
    index.fit(embeddings)
    return index

def process_documents(documents):
    # Chunk documents
    chunks = []
    for doc in documents:
        chunks.extend(chunk_document(doc, chunk_size=500))
    # Generate embeddings
    embeddings = embedder.encode(chunks)
    # Index for fast retrieval
    index = create_index(embeddings)
    return chunks, embeddings, index

# Example
documents = ["Document 1 content...", "Document 2 content..."]
chunks, embeddings, index = process_documents(documents)
print("Processed " + str(len(chunks)) + " chunks")

Document processing strongly affects RAG quality: good chunking and indexing improve retrieval, which in turn enables accurate generation. One common refinement is overlapping chunks, sketched below.
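
A minimal sketch of overlapping chunking; the overlap parameter is our assumption. Overlap helps preserve context that would otherwise be split across chunk boundaries.

# Overlapping chunking (sketch; `overlap` is an assumed parameter)
def chunk_with_overlap(doc, chunk_size=500, overlap=100):
    # Each chunk repeats the last `overlap` characters of the previous one,
    # so text cut at a boundary still appears whole in some chunk
    step = chunk_size - overlap
    return [doc[i:i + chunk_size] for i in range(0, max(len(doc) - overlap, 1), step)]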

Figure: Document Processing

The diagram shows the processing pipeline: documents are chunked, chunks are embedded, and embeddings are indexed.

Retrieval Strategies

Retrieval strategies find relevant documents using semantic search, keyword search, or a hybrid of both; each has different strengths.

Semantic retrieval uses embeddings to find semantically similar documents. Keyword retrieval uses term matching to find documents containing the query terms. Hybrid retrieval combines both.

# Retrieval Strategies
# `embedder` is the SentenceTransformer defined in the pipeline above
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_retrieve(query, documents, embeddings, top_k=5):
    # Rank by cosine similarity between query and document embeddings
    scores = cosine_similarity(embedder.encode([query]), embeddings)[0]
    top = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in top]

def keyword_retrieve(query, documents, top_k=5):
    # TF-IDF term matching (BM25 is a common alternative)
    vectorizer = TfidfVectorizer()
    doc_vecs = vectorizer.fit_transform(documents)
    scores = cosine_similarity(vectorizer.transform([query]), doc_vecs)[0]
    top = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in top]

def hybrid_retrieve(query, documents, embeddings, alpha=0.5, top_k=5):
    # Weighted fusion: alpha * semantic score + (1 - alpha) * keyword score
    sem = cosine_similarity(embedder.encode([query]), embeddings)[0]
    vectorizer = TfidfVectorizer()
    doc_vecs = vectorizer.fit_transform(documents)
    kw = cosine_similarity(vectorizer.transform([query]), doc_vecs)[0]
    combined = alpha * sem + (1 - alpha) * kw
    top = np.argsort(combined)[::-1][:top_k]
    return [documents[i] for i in top]
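
The keyword path above uses TF-IDF; BM25 is the other option mentioned. A minimal sketch with the rank_bm25 package follows (the package choice is our assumption; any BM25 implementation works).

# BM25 keyword retrieval (sketch; assumes the rank_bm25 package is installed)
from rank_bm25 import BM25Okapi

def bm25_retrieve(query, documents, top_k=5):
    # BM25 scores term matches with length normalization and term saturation
    bm25 = BM25Okapi([doc.lower().split() for doc in documents])
    scores = bm25.get_scores(query.lower().split())
    top = np.argsort(scores)[::-1][:top_k]
    return [documents[i] for i in top]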

The choice of strategy affects RAG performance: semantic retrieval suits meaning-based queries, keyword retrieval suits exact matches, and hybrid retrieval handles diverse query mixes.

Figure: Retrieval Strategies

The diagram shows the retrieval strategies: semantic uses embeddings, keyword uses term matching, and hybrid combines both.

Context Building

Context building prepares retrieved documents for generation: it combines multiple documents, formats them, and ensures the result fits within the prompt's length limit.

It involves document selection, formatting, and truncation: selection chooses the most relevant documents, formatting structures the context, and truncation keeps it within the length budget.

# Context Building
def format_document(doc):
    # Minimal formatting; real systems often add titles or source metadata
    return doc.strip()

def build_context(retrieved_docs, max_length=1000):
    # Greedily pack documents until the character budget is exhausted
    context_parts = []
    current_length = 0
    for doc in retrieved_docs:
        doc_text = format_document(doc)
        if current_length + len(doc_text) <= max_length:
            context_parts.append(doc_text)
            current_length += len(doc_text)
        else:
            break
    return "\n\n".join(context_parts)

# Example
retrieved = ["Doc 1", "Doc 2", "Doc 3"]
context = build_context(retrieved, max_length=500)
print("Context: " + str(context))

Context building affects generation quality: good context supplies the relevant information the generator needs. Note that the budget above is measured in characters; a token-aware variant is sketched below.
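
Model limits are expressed in tokens, not characters, so a token-aware budget is closer to what production systems do. Here is a minimal sketch using the GPT-2 tokenizer, our choice to match the generator used throughout this tutorial.

# Token-aware context budgeting (sketch using the GPT-2 tokenizer)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')

def build_context_tokens(retrieved_docs, max_tokens=512):
    # Greedily pack documents while counting tokens instead of characters
    context_parts, used = [], 0
    for doc in retrieved_docs:
        n_tokens = len(tokenizer.encode(doc))
        if used + n_tokens > max_tokens:
            break
        context_parts.append(doc)
        used += n_tokens
    return "\n\n".join(context_parts)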

Prompt Construction

Prompt construction creates effective prompts for generation by combining context, query, and instructions in a clear structure.

Prompts typically include a context section, a query section, and an instruction section: the context provides background, the query specifies the task, and the instructions guide generation.

# Prompt Construction
def construct_prompt(query, context, instruction=""):
    prompt = f"""Context:
{context}
Question: {query}
{instruction}
Answer:"""
    return prompt

# Example
query = "What is machine learning?"
context = "Machine learning is a subset of AI..."
prompt = construct_prompt(query, context, "Answer based on the context provided.")
print("Prompt: " + str(prompt))

Prompt construction affects generation quality: clear prompts and good structure produce better answers.

Generation Integration

Generation integration uses a language model to produce the final answer: it takes the prompt with context and generates a coherent response grounded in that context.

Generation uses decoder models such as GPT, which generate text autoregressively, conditioning each token on the context and the tokens produced so far.

# Generation Integration
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')

def generate_answer(prompt, max_new_tokens=100):
    # max_new_tokens bounds the answer; max_length would count the prompt too
    output = generator(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    return output[0]['generated_text'].split("Answer:")[-1].strip()

# Example
prompt = "Context: ML is AI subset.\n\nQuestion: What is ML?\n\nAnswer:"
answer = generate_answer(prompt)
print("Answer: " + str(answer))

Generation integration produces the final answer, using the retrieved context to keep it relevant.

Summary

RAG combines retrieval and generation. The architecture includes a document store, a retriever, and a generator. Document processing prepares the knowledge base, retrieval strategies find relevant content, context building packages that content, prompt construction frames it for the model, and generation integration produces the answer. Together, these components enable knowledge-grounded generation.
