Machine Learning & Embeddings
ML Capabilities
NeuronDB provides 19 fully implemented ML algorithms (part of 52 total ML features), written in pure C, all supporting GPU acceleration and SIMD optimization. The extension also offers in-database ML inference, embedding generation, and model lifecycle management.
19 Fully Implemented ML Algorithms
NeuronDB includes a complete suite of machine learning algorithms organized by category:
Clustering (5 algorithms):
- K-Means - Lloyd's algorithm with k-means++ initialization, GPU-accelerated
- Mini-batch K-Means - Efficient variant for large datasets and streaming data
- DBSCAN - Density-based clustering for arbitrary shapes and noise detection
- Gaussian Mixture Model (GMM) - EM algorithm for probabilistic clustering
- Hierarchical Clustering - Agglomerative clustering with multiple linkage criteria
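As an illustration of the first entry above, here is a minimal Python sketch of Lloyd's algorithm with k-means++ initialization. It is not NeuronDB's C implementation: the function names (`kmeans`, `kmeans_pp_init`) and the `seed` parameter are hypothetical, and the sketch assumes the input contains at least k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update
    until the labels stop changing or max_iter is reached."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

The k-means++ step spreads the initial centers apart, which tends to reduce the number of Lloyd iterations compared with uniform random seeding.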
Dimensionality Reduction (2 algorithms):
- PCA - Principal Component Analysis with SIMD-optimized covariance computation
- PCA Whitening - PCA followed by rescaling each component to unit variance, yielding decorrelated, normalized features
Quantization (2 algorithms):
- Product Quantization (PQ) - Vector compression for 8-16x memory reduction
- Optimized Product Quantization (OPQ) - Rotation-optimized PQ for better accuracy
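To make the compression idea concrete, the following Python sketch encodes a vector against fixed per-subspace codebooks and computes the resulting size ratio. It is an illustration only: the codebooks are hard-coded here (in practice they are usually learned, for example with k-means per subspace), the helper names are hypothetical, and the ratio ignores the storage cost of the codebooks themselves. The 128-dimension float32 setup is an assumed example, not a documented NeuronDB configuration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Split the vector into len(codebooks) subvectors and replace each
    with the index of its nearest centroid in the matching codebook."""
    m = len(codebooks)
    sub_dim = len(vector) // m
    codes = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        codes.append(
            min(range(len(codebook)), key=lambda c: sq_dist(sub, codebook[c]))
        )
    return codes


def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    out = []
    for code, codebook in zip(codes, codebooks):
        out.extend(codebook[code])
    return out


def compression_ratio(dim, num_subvectors, bits_per_code=8, bytes_per_float=4):
    """Raw vector size divided by PQ code size (codebook storage excluded)."""
    return (dim * bytes_per_float) / (num_subvectors * bits_per_code / 8)


# Two subspaces of two dimensions each, two centroids per codebook.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.1, 0.8], codebooks)
print(codes, pq_decode(codes, codebooks))
print(compression_ratio(128, 64), compression_ratio(128, 32))
```

Under these assumptions, a 128-dimension float32 vector (512 bytes) stored as 64 or 32 one-byte codes gives ratios of 8 and 16, which is in line with the range quoted above.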
Outlier Detection (3 methods):
- Z-Score - Statistical outlier detection using standard deviations
- Modified Z-Score - Robust variant using median and MAD (Median Absolute Deviation)
- IQR - Non-parametric interquartile range method for non-normal distributions
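The three methods above can be sketched in a few lines of standard-library Python. This is a hedged illustration rather than NeuronDB's implementation: the function names are hypothetical, and the thresholds (3.0 for Z-score, 3.5 with the 0.6745 constant for modified Z-score, 1.5 times the IQR) are common conventions assumed here, not values taken from NeuronDB.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


def modified_zscore_outliers(values, threshold=3.5):
    """Robust variant: median and MAD replace mean and standard deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


def iqr_outliers(values, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [v for v in values if v < low or v > high]


data = [10, 11, 12, 13, 14, 15, 100]
print(zscore_outliers(data))           # [] (the outlier inflates the std)
print(modified_zscore_outliers(data))  # [100]
print(iqr_outliers(data))              # [100]
```

On this small sample the plain Z-score misses the extreme value because that value inflates the standard deviation, while the median-based and IQR-based methods still flag it.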
Reranking (3 algorithms):
- MMR (Maximal Marginal Relevance) - Diversity-focused reranking
- Ensemble Reranking - Weighted and Borda count methods
- Learning-to-Rank (LTR) - Machine learning-based ranking
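For MMR, a minimal greedy sketch in Python is shown below. It assumes cosine similarity and a trade-off parameter `lam` between relevance and diversity; the function names and defaults are hypothetical and do not reflect NeuronDB's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query, docs, k, lam=0.5):
    """Greedy MMR: repeatedly pick the candidate that maximizes
    lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    Returns the indices of the selected documents in pick order."""
    selected = []
    candidates = list(range(len(docs)))

    def score(i):
        relevance = cosine(query, docs[i])
        redundancy = max(
            (cosine(docs[i], docs[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.9, 0.12], [0.7, -0.7]]
print(mmr(query, docs, k=2, lam=1.0))  # pure relevance: [0, 1]
print(mmr(query, docs, k=2, lam=0.5))  # balanced: [0, 2]
```

With `lam=1.0` the near-duplicate document is returned second; with `lam=0.5` it is skipped in favor of a less similar one.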
Quality Metrics (6 metrics):
- Recall@K, Precision@K, F1@K - Search quality metrics for retrieval evaluation
- MRR (Mean Reciprocal Rank) - Ranking quality measurement
- Davies-Bouldin Index - Clustering quality measurement (lower is better)
- Silhouette Score - Clustering cohesion and separation analysis
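The retrieval metrics in this list follow standard definitions, sketched below in Python. The function names and the input shapes (a ranked list of IDs plus a set of relevant IDs) are assumptions for illustration, not NeuronDB's interface; the clustering metrics are not covered here.

```python
def precision_recall_f1_at_k(retrieved, relevant, k):
    """Precision@K, Recall@K and F1@K for one query.
    `retrieved` is a ranked list of IDs, `relevant` a set of relevant IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def mean_reciprocal_rank(results):
    """MRR over (retrieved, relevant) pairs: the average of 1/rank of the
    first relevant item per query (0 when none is retrieved)."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, item in enumerate(retrieved, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0


p, r, f1 = precision_recall_f1_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=2)
print(p, r, f1)
print(mean_reciprocal_rank([(["a", "b"], {"a"}), (["x", "y", "z"], {"y"})]))
```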
Drift Detection (3 methods):
- Centroid Drift - Track changes in cluster centroids over time
- Distribution Divergence - Statistical distribution comparison
- Temporal Drift Monitoring - Time-series based drift detection
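One simple way to picture centroid drift is to compare the mean vector of a baseline batch of embeddings with the mean vector of a later batch. The Python sketch below does exactly that with Euclidean distance; the function names and the choice of distance are assumptions for illustration, and NeuronDB may measure drift differently.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def centroid_drift(baseline, current):
    """Euclidean distance between the centroids of two embedding batches."""
    return math.dist(centroid(baseline), centroid(current))


baseline = [[0.0, 0.0], [2.0, 0.0]]   # centroid (1, 0)
current = [[3.0, 4.0], [5.0, 4.0]]    # centroid (4, 4)
print(centroid_drift(baseline, current))  # 5.0
```

A monitoring job could compute this value per time window and alert when it exceeds a chosen threshold.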
Analytics (4 features):
- Topic Discovery - Automatic topic modeling and analysis
- Similarity Histograms - Analyze similarity distributions in vector space
- KNN Graph Building - Construct k-nearest neighbor graphs for analysis
- Embedding Quality Assessment - Evaluate embedding quality and coherence
Search (2 algorithms):
- Hybrid Lexical-Semantic Fusion - Combine vector and text search signals
- Reciprocal Rank Fusion (RRF) - Merges multiple ranked lists by summing reciprocal ranks
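As a sketch of how fusion can combine a lexical ranking with a vector ranking, here is Reciprocal Rank Fusion in Python. Each document scores the sum of 1 / (k + rank) over the lists it appears in. The function name is hypothetical, and k = 60 is a commonly used constant assumed here, not a documented NeuronDB default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.
    Each document accumulates 1 / (k + rank) from every list it appears in
    (rank starts at 1); the result is sorted by descending fused score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


lexical = ["a", "b", "c"]   # e.g. full-text search order
semantic = ["b", "c", "a"]  # e.g. vector similarity order
print(reciprocal_rank_fusion([lexical, semantic]))  # ['b', 'a', 'c']
```

Because only ranks are used, the two input lists do not need comparable score scales.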
Traditional ML Algorithms:
NeuronDB additionally supports Random Forest, XGBoost, LightGBM, CatBoost, Linear/Logistic Regression, Ridge, Lasso, SVM, KNN, Naive Bayes, Decision Trees, Neural Networks, and Deep Learning models for classification, regression, and advanced ML workflows.
In-Database ML Inference
Run ML models directly inside PostgreSQL with zero data movement, batch inference for high throughput, real-time predictions with low latency, and automatic GPU acceleration when available.
Embedding Generation
Generate embeddings from text, images, and more. Supports OpenAI, Cohere, HuggingFace models, custom model deployment, automatic batching and caching, and multi-modal embeddings (text, image, audio).
Model Management
Deploy and manage ML models efficiently with model versioning and rollback, A/B testing support, resource quota management, and performance monitoring.
AutoML & Feature Store
Automated hyperparameter tuning, model selection, and feature management with versioning for production ML workflows.
Supported Models
Text Embeddings
- text-embedding-ada-002 (OpenAI) - 1536 dimensions - General text similarity
- text-embedding-3-small (OpenAI) - 1536 dimensions - Efficient embeddings
- text-embedding-3-large (OpenAI) - 3072 dimensions - High quality embeddings
- embed-english-v3.0 (Cohere) - 1024 dimensions - English text
- embed-multilingual-v3.0 (Cohere) - 1024 dimensions - Multilingual text
Sentence Transformers
- all-MiniLM-L6-v2 (HuggingFace) - 384 dimensions - Fast, lightweight
- all-mpnet-base-v2 (HuggingFace) - 768 dimensions - High quality
- paraphrase-multilingual-MiniLM (HuggingFace) - 384 dimensions - 50+ languages
Multimodal
- CLIP-ViT-B-32 (OpenAI) - 512 dimensions - Image + text
- CLIP-ViT-L-14 (OpenAI) - 768 dimensions - High quality image search
ML Functions
embed_text()
Generate text embeddings with automatic caching.
Signature: embed_text(text TEXT, model TEXT DEFAULT 'all-MiniLM-L6-v2') RETURNS vector
Example
SELECT embed_text('Machine learning with PostgreSQL');
embed_text_batch()
Generate embeddings for multiple texts efficiently.
Signature: embed_text_batch(texts TEXT[], model TEXT DEFAULT 'all-MiniLM-L6-v2') RETURNS vector[]
Example
SELECT embed_text_batch(ARRAY['text1', 'text2'], 'all-MiniLM-L6-v2');
train_random_forest_classifier()
Train a Random Forest classifier with GPU support.
Signature: train_random_forest_classifier(table_name TEXT, features_col TEXT, label_col TEXT, n_trees INT, max_depth INT)
Example
SELECT train_random_forest_classifier('training_data', 'features', 'label', 100, 10);
cluster_kmeans()
K-means clustering with GPU acceleration.
Signature: cluster_kmeans(table_name TEXT, vector_column TEXT, k INTEGER, max_iter INTEGER DEFAULT 100)
Example
SELECT cluster_kmeans('documents', 'embedding', 5, 100);
Next Steps
- ONNX Inference - Deploy ONNX models
- Embeddings - Generate embeddings
- Clustering - ML clustering algorithms