Semantic search finds documents by meaning rather than exact keyword matches. It converts both queries and documents into embeddings that capture their semantics, compares those embeddings with a similarity measure, and ranks documents by semantic relevance. Because matching happens in embedding space, it understands query intent and finds relevant documents even when they share no words with the query, which makes it more effective than keyword search for many tasks.
The diagram shows the semantic search flow: the query is converted to an embedding, the documents are converted to embeddings, and a similarity search finds the most relevant documents.
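The flow above can be sketched end to end with toy vectors (a minimal sketch: the three-dimensional embeddings below are invented for illustration; a real system would produce them with a trained encoder such as a sentence-transformers model):

```python
import numpy as np

# Toy document embeddings (one row per document) and a query embedding.
doc_embs = np.array([[0.9, 0.1, 0.0],   # doc 0: about cats
                     [0.1, 0.9, 0.0],   # doc 1: about finance
                     [0.8, 0.2, 0.1]])  # doc 2: also about cats
query_emb = np.array([1.0, 0.0, 0.0])   # query: about cats

# Cosine similarity = dot product of L2-normalized vectors.
doc_norm = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
q_norm = query_emb / np.linalg.norm(query_emb)
scores = doc_norm @ q_norm

# Rank documents by semantic relevance, most similar first.
ranking = np.argsort(-scores)
print(ranking.tolist())  # the two cat documents outrank the finance one
```

Real systems replace the toy vectors with encoder output and the brute-force dot product with an approximate nearest-neighbor index, but the ranking step is the same.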
Document Chunking Strategies
Document chunking splits documents into searchable pieces. Chunk size must balance context and granularity: chunks that are too large lose precision, while chunks that are too small lose context.
Common chunking strategies are fixed-size, sentence-based, and semantic chunking. Fixed-size chunking uses character or token limits, sentence-based chunking splits at sentence boundaries, and semantic chunking groups related content.
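The first two strategies can be sketched in a few lines (a minimal illustration; the 40-character limit and the naive `". "` split are arbitrary choices, not recommendations):

```python
def fixed_size_chunks(text, chunk_size=40):
    # Split on hard character limits; may cut words or sentences mid-way.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def sentence_chunks(text):
    # Naive sentence-based split; production code would use a real
    # sentence segmenter (e.g. spaCy or NLTK) instead of splitting on ". ".
    return [s.strip() for s in text.split(". ") if s.strip()]

text = "Embeddings capture meaning. Keywords match exact words. Hybrid search combines both."
print(fixed_size_chunks(text))  # 3 uniform slices, boundaries ignore sentences
print(sentence_chunks(text))    # 3 complete sentences of varying length
```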
SET embedding = neurondb.embed(chunk, 'sentence-transformers/all-MiniLM-L6-v2');
Chunking directly affects search quality: good chunking preserves context and enables precise retrieval.
Detailed Chunking Strategies
Fixed-size chunking uses character or token limits. It is simple to implement and works well for uniform documents, but it may split sentences or concepts; overlap between chunks helps preserve context across boundaries.
Sentence-based chunking splits at sentence boundaries. It preserves sentence integrity and works well for natural language, but it produces variable-sized chunks and requires sentence segmentation.
Semantic chunking groups related content, using embeddings to find topical boundaries. It produces coherent, higher-quality chunks at the cost of more computation.
# Detailed Chunking Implementation: start a new chunk where adjacent sentences diverge
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
long_text = "Machine learning is a subset of AI. It enables computers to learn. Deep learning uses neural networks. Neural networks have multiple layers. Each layer processes information differently."
sentences = [s.strip() for s in long_text.split(". ") if s]  # naive sentence split
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)
adjacent_sims = cosine_similarity(embeddings[:-1], embeddings[1:]).diagonal()
boundaries = [i + 1 for i, sim in enumerate(adjacent_sims) if sim < 0.5]  # chunk break where similarity drops
Choose chunk size based on document type: short documents need smaller chunks, while long documents can use larger ones. Typical sizes are 200-500 tokens; test different sizes for your use case.
Overlap between chunks helps preserve context and ensures important information is not split across a boundary, which improves retrieval quality at the cost of extra storage. Typical overlap is 10-20% of the chunk size.
Also consider document structure: respect paragraph boundaries when possible, preserve section headers, and maintain list formatting. All of these improve chunk quality.
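A chunker with overlap can be sketched as follows (a simplified sketch: whitespace words stand in for real tokenizer tokens, and the 50/10 sizes are arbitrary; a production pipeline would count tokens with the embedding model's own tokenizer):

```python
def chunk_with_overlap(text, chunk_size=50, overlap=10):
    # Approximate tokens with whitespace-separated words.
    words = text.split()
    step = chunk_size - overlap  # each new chunk re-reads `overlap` words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# 120 synthetic words -> chunks covering words 0-49, 40-89, 80-119.
doc = " ".join(f"w{i}" for i in range(120))
chunks = chunk_with_overlap(doc, chunk_size=50, overlap=10)
print(len(chunks))  # 3
```

The last 10 words of each chunk repeat as the first 10 of the next, so a sentence straddling a boundary still appears whole in at least one chunk.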
The diagram shows the chunking strategies: fixed-size creates uniform chunks, sentence-based preserves sentence boundaries, and semantic chunking groups related content.
Query Processing
Query processing prepares queries for search and improves search quality. It includes normalization, expansion, and embedding: normalization standardizes the query text, expansion adds related terms, and embedding converts the query to a vector for similarity search.
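The normalization and expansion steps can be sketched with plain string handling (a minimal sketch; the `RELATED_TERMS` table is a made-up toy example, where real systems might mine related terms from query logs or a thesaurus):

```python
import re

def normalize_query(query):
    # Lowercase, trim, and collapse runs of whitespace.
    return re.sub(r"\s+", " ", query.strip().lower())

# Hypothetical expansion table for illustration only.
RELATED_TERMS = {"ml": "machine learning", "nn": "neural network"}

def expand_query(query):
    tokens = query.split()
    extras = [RELATED_TERMS[t] for t in tokens if t in RELATED_TERMS]
    return " ".join(tokens + extras)

q = expand_query(normalize_query("  NN   training "))
print(q)  # "nn training neural network"
```

The expanded string is then encoded into an embedding, so documents about neural networks match even a terse abbreviated query.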
# Query Processing
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode("how do neural networks learn", normalize_embeddings=True)
The diagram shows the embedding generation process: the text is preprocessed and tokenized, the encoder processes the tokens, and the resulting embedding vector is normalized for similarity search.
Ranking Algorithms
Ranking algorithms order search results by relevance, starting from similarity scores and optionally combining multiple signals. Common approaches are similarity-based ranking, learning-to-rank, and hybrid methods: similarity-based ranking uses embedding similarity directly, learning-to-rank trains a machine learning model, and hybrid approaches combine multiple signals.
# Ranking Algorithms
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query_emb = np.random.rand(1, 384)   # placeholder query embedding
doc_embs = np.random.rand(10, 384)   # placeholder document embeddings
ranked_indices = np.argsort(-cosine_similarity(query_emb, doc_embs)[0])  # best first
Ranking improves result quality. It orders by relevance. It combines multiple signals.
Detailed Ranking Algorithms
Similarity-based ranking uses embedding similarity directly. It is simple and fast and works well when the embeddings are good, but it may miss important signals.
Learning-to-rank uses machine learning models. Features include query-document similarity, document length, position, and query characteristics; the model learns optimal feature weights from training data, which improves ranking quality but requires labeled examples.
BM25 is a probabilistic ranking function that combines term frequency and inverse document frequency. It works well for keyword search and can be combined with semantic scores.
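A minimal BM25 scorer over a toy corpus (a sketch: `k1` and `b` use their common default values, tokenization is naive whitespace splitting, and the IDF uses the usual smoothed form):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    toks = [d.lower().split() for d in docs]        # naive tokenization
    avgdl = sum(len(t) for t in toks) / len(toks)   # average document length
    n = len(docs)
    scores = []
    for t in toks:
        tf = Counter(t)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in toks if term in d)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
            f = tf[term]
            # Term frequency saturates via k1; b normalizes for doc length.
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(score)
    return scores

docs = ["neural networks learn features", "markets move daily", "networks route packets"]
scores = bm25_scores("neural networks", docs)  # doc 0 matches both query terms
```

A hybrid ranker could then mix these lexical scores with embedding similarities, e.g. as a weighted sum after min-max normalization.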
# Detailed Ranking Implementation: pointwise learning-to-rank
import numpy as np
from sklearn.ensemble import RandomForestRegressor
X = np.random.rand(200, 4)   # features: similarity, doc length, position, query length
y = np.random.rand(200)      # placeholder relevance labels; real labels come from judgments
ranker = RandomForestRegressor(n_estimators=50).fit(X, y)  # predict relevance, sort by prediction
The diagram shows the ranking process: similarity scores are computed, results are ordered by score, and the top results are returned.
Keyword vs Semantic Search
Keyword search matches exact words. It is fast and simple, but it misses synonyms and related concepts. Semantic search matches meaning and finds related content, which works better for many queries.
Keyword search uses inverted indexes: it matches query terms and ranks by term frequency. Semantic search uses embeddings: it matches meaning and ranks by semantic similarity.
# Keyword vs Semantic Search
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
docs = ["neural networks learn representations", "markets fluctuate daily"]
keyword_vectors = TfidfVectorizer().fit_transform(docs)  # sparse, exact-term vectors
semantic_vectors = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)  # dense, meaning vectors
Hybrid search combines keyword and semantic search, using both exact matches and meaning. It improves result quality and handles diverse query types.
Hybrid methods include score fusion, reranking, and multi-stage retrieval: score fusion combines the two similarity scores, reranking uses semantic scores to reorder keyword results, and multi-stage retrieval uses keyword search for recall and semantic search for precision.
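Score fusion can be sketched with reciprocal rank fusion (RRF), a common technique that combines two rankings without requiring their scores to be on the same scale (`k=60` is the constant conventionally used):

```python
def rrf_fuse(keyword_ranking, semantic_ranking, k=60):
    # Each ranking is a list of doc ids, best first; RRF sums 1/(k + rank).
    fused = {}
    for ranking in (keyword_ranking, semantic_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

keyword = ["d3", "d1", "d2"]   # ranking from the keyword searcher
semantic = ["d1", "d2", "d3"]  # ranking from the semantic searcher
fused = rrf_fuse(keyword, semantic)
print(fused)  # ['d1', 'd3', 'd2']: d1 ranks high in both lists
```

Because RRF uses only rank positions, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.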