RAG Fundamentals Overview
Retrieval-Augmented Generation (RAG) combines retrieval and generation: it retrieves documents relevant to a query and uses them as context for a language model. Grounding generation in retrieved knowledge improves answer quality and reduces hallucinations.
RAG has three main components: retrieval finds relevant documents, augmentation adds them to the prompt as context, and generation produces an answer conditioned on that context.
Figure: RAG Architecture
The diagram shows the RAG flow: a query triggers retrieval, the retrieved documents provide context, and the generator produces the answer using that context.
RAG Architecture Components
A RAG architecture includes a document store, a retriever, and a generator. The document store holds the knowledge base and can be a vector database, a traditional database, or a hybrid of the two. The retriever finds relevant documents using semantic search, keyword search, or both. The generator is a language model that produces the final answer.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from transformers import pipeline

class RAGSystem:
    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.generator = pipeline('text-generation', model='gpt2')
        self.documents = []
        self.embeddings = None

    def add_documents(self, documents):
        # Embed the whole corpus once, up front.
        self.documents = documents
        self.embeddings = self.embedder.encode(documents)

    def retrieve(self, query, top_k=3):
        # Rank documents by cosine similarity to the query embedding.
        query_emb = self.embedder.encode([query])
        similarities = cosine_similarity(query_emb, self.embeddings)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [self.documents[i] for i in top_indices]

    def generate(self, query, context):
        prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
        # max_new_tokens bounds the generated text independently of prompt length.
        output = self.generator(prompt, max_new_tokens=100)
        return output[0]['generated_text']

    def query(self, query):
        context = " ".join(self.retrieve(query))
        return self.generate(query, context)

rag = RAGSystem()
rag.add_documents(["Machine learning is...", "Deep learning uses..."])
answer = rag.query("What is machine learning?")
print("Answer:", answer)
This architecture enables knowledge-grounded generation: answers draw on retrieved evidence, which improves quality and reduces hallucinations.
Detailed RAG Implementation Patterns
Basic RAG uses a single retrieval step: the query triggers a semantic search, the top-k documents are retrieved, a context is built from them, and the generator answers a prompt containing that context and the query.
Advanced RAG adds reranking and filtering. The initial retrieval fetches more candidates than needed, a reranker reorders them by relevance, irrelevant documents are filtered out, and only the best documents go into the final context.
Iterative RAG refines retrieval through feedback: an initial answer is generated, analyzed for gaps, and additional retrieval fills those gaps; the process repeats until the answer is complete. All three patterns are sketched below.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import pipeline

class AdvancedRAGSystem:
    def __init__(self):
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.generator = pipeline('text-generation', model='gpt2')
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        self.documents = []
        self.embeddings = None

    def _semantic_search(self, query, top_k):
        # Rank all documents by cosine similarity to the query embedding.
        query_emb = self.embedder.encode([query])
        similarities = cosine_similarity(query_emb, self.embeddings)[0]
        return np.argsort(similarities)[::-1][:top_k]

    def _generate(self, prompt):
        # Generate, then keep only the text after the final "Answer:" marker.
        output = self.generator(prompt, max_new_tokens=200)
        return output[0]['generated_text'].split("Answer:")[-1].strip()

    def basic_rag(self, query, top_k=3):
        """Basic RAG: retrieve top-k, build context, generate once."""
        retrieved = [self.documents[i] for i in self._semantic_search(query, top_k)]
        context = "\n\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(retrieved))
        return self._generate(f"Context: {context}\n\nQuestion: {query}\n\nAnswer:")

    def advanced_rag(self, query, retrieve_k=20, rerank_k=5):
        """Advanced RAG: over-retrieve, then rerank with a cross-encoder."""
        candidates = [self.documents[i] for i in self._semantic_search(query, retrieve_k)]
        pairs = [[query, doc] for doc in candidates]
        rerank_scores = self.reranker.predict(pairs)
        rerank_indices = np.argsort(rerank_scores)[::-1][:rerank_k]
        final_docs = [candidates[i] for i in rerank_indices]
        context = "\n\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(final_docs))
        return self._generate(f"Context: {context}\n\nQuestion: {query}\n\nAnswer:")

    def iterative_rag(self, query, max_iterations=3):
        """Iterative RAG: expand the search query with the current answer."""
        context = ""
        answer = ""
        for _ in range(max_iterations):
            # Use the partial answer as feedback to broaden retrieval.
            search_query = f"{query} {answer}" if answer else query
            retrieved = [self.documents[i] for i in self._semantic_search(search_query, 5)]
            new_context = "\n\n".join(f"Doc: {doc}" for doc in retrieved)
            context = f"{context}\n\n{new_context}" if context else new_context
            answer = self._generate(f"Context: {context}\n\nQuestion: {query}\n\nAnswer:")
            if len(answer) > 50:  # crude completeness check
                break
        return answer

rag = AdvancedRAGSystem()
rag.documents = ["ML is AI subset", "Neural networks have layers", "Deep learning uses many layers"]
rag.embeddings = rag.embedder.encode(rag.documents)
basic_answer = rag.basic_rag("What is machine learning?")
print("Basic RAG answer:", basic_answer)
advanced_answer = rag.advanced_rag("What is machine learning?", retrieve_k=10, rerank_k=3)
print("Advanced RAG answer:", advanced_answer)
RAG Quality Evaluation
Evaluate RAG systems along several dimensions: answer quality (correctness), answer relevance (topic alignment), answer completeness (information coverage), and context utilization (how much the answer draws on the retrieved documents).
Answer quality can be measured with human evaluation or automated metrics: BLEU measures n-gram overlap, ROUGE measures overlap-oriented summary quality, and BERTScore measures semantic similarity. Human evaluation remains the gold standard.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

class RAGEvaluator:
    def __init__(self):
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    def evaluate_answer_quality(self, generated_answer, reference_answer):
        """Evaluate answer quality using multiple metrics."""
        metrics = {}
        # ROUGE: n-gram and longest-common-subsequence overlap.
        rouge_scores = self.rouge_scorer.score(reference_answer, generated_answer)
        metrics['rouge1'] = rouge_scores['rouge1'].fmeasure
        metrics['rouge2'] = rouge_scores['rouge2'].fmeasure
        metrics['rougeL'] = rouge_scores['rougeL'].fmeasure
        # BERTScore: semantic similarity from contextual embeddings.
        P, R, F1 = bert_score([generated_answer], [reference_answer], lang='en', verbose=False)
        metrics['bertscore_precision'] = P.item()
        metrics['bertscore_recall'] = R.item()
        metrics['bertscore_f1'] = F1.item()
        metrics['answer_length'] = len(generated_answer.split())
        metrics['reference_length'] = len(reference_answer.split())
        return metrics

    def evaluate_context_utilization(self, retrieved_docs, generated_answer):
        """Measure how much of the answer's vocabulary comes from the retrieved context."""
        answer_words = set(generated_answer.lower().split())
        doc_words_sets = [set(doc.lower().split()) for doc in retrieved_docs]
        all_doc_words = set().union(*doc_words_sets)
        overlap = answer_words & all_doc_words
        utilization = len(overlap) / len(answer_words) if answer_words else 0
        return {
            'context_utilization': utilization,
            'overlap_count': len(overlap),
            'answer_word_count': len(answer_words),
            'context_word_count': len(all_doc_words),
        }

    def evaluate_retrieval_quality(self, retrieved_docs, relevant_docs):
        """Evaluate retrieval with set-based precision, recall, and F1."""
        retrieved_set = set(retrieved_docs)
        relevant_set = set(relevant_docs)
        precision = len(retrieved_set & relevant_set) / len(retrieved_set) if retrieved_set else 0
        recall = len(retrieved_set & relevant_set) / len(relevant_set) if relevant_set else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
        return {'precision': precision, 'recall': recall, 'f1': f1}

evaluator = RAGEvaluator()
generated = "Machine learning is a subset of artificial intelligence that enables computers to learn."
reference = "Machine learning is a method of data analysis that automates analytical model building."
quality_metrics = evaluator.evaluate_answer_quality(generated, reference)
print("Answer quality metrics:")
for metric, value in quality_metrics.items():
    print(f"  {metric}: {value:.4f}")
Figure: RAG Components
The diagram shows the RAG components: the document store provides knowledge, the retriever finds relevant content, and the generator produces answers.
Document Processing Pipeline
Document processing prepares documents for RAG through ingestion, chunking, embedding, and indexing; each step affects retrieval quality. The pipeline ingests documents from their sources, splits them into appropriately sized chunks, embeds each chunk, and indexes the embeddings for fast retrieval. In the sketch below, a simple character-based chunker and a plain NumPy array stand in for a production text splitter and vector index.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')

def chunk_document(doc, chunk_size=500):
    # Fixed-size character chunks; real pipelines usually split on sentence
    # or paragraph boundaries, often with overlap between chunks.
    return [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

def process_documents(documents):
    chunks = []
    for doc in documents:
        chunks.extend(chunk_document(doc, chunk_size=500))
    embeddings = embedder.encode(chunks)
    # A plain array stands in for the index; see the FAISS sketch below.
    index = np.asarray(embeddings)
    return chunks, embeddings, index

documents = ["Document 1 content...", "Document 2 content..."]
chunks, embeddings, index = process_documents(documents)
print(f"Processed {len(chunks)} chunks")
Document processing strongly affects RAG quality: well-chosen chunk sizes and clean text improve retrieval, which in turn enables accurate generation.
Figure: Document Processing
The diagram shows the processing pipeline: documents are chunked, chunks are embedded, and embeddings are indexed.
Retrieval Strategies
Retrieval strategies find relevant documents via semantic search, keyword search, or a hybrid of the two, and each has different strengths. Semantic retrieval uses embeddings to find semantically similar documents; keyword retrieval uses term matching to find documents containing the query's terms; hybrid retrieval combines both scores. The sketches below reuse the embedder defined above and use a simple term-overlap score for the keyword path.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def semantic_retrieve(query, documents, embeddings, top_k=5):
    # Rank documents by embedding cosine similarity to the query.
    scores = cosine_similarity(embedder.encode([query]), embeddings)[0]
    top = np.argsort(scores)[::-1][:top_k]
    return [(i, scores[i]) for i in top]

def keyword_retrieve(query, documents, top_k=5):
    # Score documents by the fraction of query terms they contain.
    terms = set(query.lower().split())
    scores = np.array([len(terms & set(d.lower().split())) / len(terms) for d in documents])
    top = np.argsort(scores)[::-1][:top_k]
    return [(i, scores[i]) for i in top]

def hybrid_retrieve(query, documents, embeddings, alpha=0.5, top_k=5):
    # Blend the two score lists; alpha weights the semantic side.
    # Note the scales differ; production systems normalize scores first.
    combined = {}
    for i, s in semantic_retrieve(query, documents, embeddings, top_k * 2):
        combined[i] = combined.get(i, 0.0) + alpha * s
    for i, s in keyword_retrieve(query, documents, top_k * 2):
        combined[i] = combined.get(i, 0.0) + (1 - alpha) * s
    ranked = sorted(combined, key=combined.get, reverse=True)
    return [documents[i] for i in ranked[:top_k]]
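A quick usage check on a toy three-document corpus (embeddings computed with the same embedder):
corpus = ["Machine learning is a subset of AI.",
          "Python is a programming language.",
          "Deep learning uses neural networks."]
corpus_embs = embedder.encode(corpus)
print(hybrid_retrieve("What is machine learning?", corpus, corpus_embs, alpha=0.5, top_k=2))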
The choice of strategy affects RAG performance: semantic retrieval suits meaning-based queries, keyword retrieval suits exact matches, and hybrid retrieval handles diverse query mixes.
Figure: Retrieval Strategies
The diagram shows the retrieval strategies: semantic uses embeddings, keyword uses term matching, and hybrid combines both methods.
Context Building
Context building prepares retrieved documents for generation: it selects the most relevant documents, formats them into a clearly structured context, and truncates the result so it fits the prompt's length budget. In the sketch below, each document gets a simple numbered header, and documents are added until the length budget is exhausted.
def build_context(retrieved_docs, max_length=1000):
    context_parts = []
    current_length = 0
    for i, doc in enumerate(retrieved_docs):
        doc_text = f"Document {i + 1}: {doc}"
        # Stop once the next document would push the context past the limit.
        if current_length + len(doc_text) > max_length:
            break
        context_parts.append(doc_text)
        current_length += len(doc_text)
    return "\n\n".join(context_parts)

retrieved = ["Doc 1", "Doc 2", "Doc 3"]
context = build_context(retrieved, max_length=500)
print("Context:", context)
Context building affects generation quality: a well-built context supplies exactly the relevant information the generator needs.
Prompt Construction
Prompt construction creates effective prompts for generation by structuring the information clearly. A prompt typically has three sections: context providing background, a query specifying the task, and instructions guiding the generation.
def construct_prompt(query, context, instruction=""):
    prompt = f"""Context:
{context}

Question: {query}

{instruction}

Answer:"""
    return prompt

query = "What is machine learning?"
context = "Machine learning is a subset of AI..."
prompt = construct_prompt(query, context, "Answer based on the context provided.")
print("Prompt:", prompt)
Prompt construction affects generation quality: clear, well-structured prompts produce better answers.
Generation Integration
Generation integration uses a language model to produce the final answer: it takes the prompt with its retrieved context, generates text autoregressively conditioned on that context, and returns a response grounded in the provided documents. Decoder-only models such as GPT-2 are a common choice.
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')

def generate_answer(prompt, max_new_tokens=200):
    # Keep only the text generated after the "Answer:" marker.
    output = generator(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    return output[0]['generated_text'].split("Answer:")[-1].strip()

prompt = "Context: ML is AI subset.\n\nQuestion: What is ML?\n\nAnswer:"
answer = generate_answer(prompt)
print("Answer:", answer)
Generation integration produces the final answer, using the retrieved context to keep the response relevant and grounded.
Summary
RAG combines retrieval and generation. The architecture includes a document store, a retriever, and a generator; document processing prepares the knowledge base, retrieval strategies find relevant content, context building and prompt construction package that content for the model, and generation integration produces the final answer. Together these components enable knowledge-grounded generation.