Machine Learning & Embeddings | NeuronDB Documentation

ML Capabilities

NeuronDB provides comprehensive machine learning capabilities with 19 fully implemented ML algorithms (part of 52 total ML features) in pure C, all supporting GPU acceleration and SIMD optimization. The extension offers in-database ML inference, embedding generation, and complete model lifecycle management.

19 Fully Implemented ML Algorithms

NeuronDB includes a complete suite of machine learning algorithms organized by category:

Clustering (5 algorithms):

K-Means - Lloyd's algorithm with k-means++ initialization, GPU-accelerated
Mini-batch K-Means - Efficient variant for large datasets and streaming data
DBSCAN - Density-based clustering for arbitrary shapes and noise detection
Gaussian Mixture Model (GMM) - EM algorithm for probabilistic clustering
Hierarchical Clustering - Agglomerative clustering with multiple linkage criteria
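
To make the K-Means entry concrete, here is a minimal pure-Python sketch of Lloyd's algorithm with k-means++ seeding. It is illustrative only and is not NeuronDB's C implementation; the function names (sq_dist, kmeanspp_init, kmeans) and the seed parameter are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """k-means++ seeding: pick each new center with probability
    proportional to its squared distance from the nearest chosen center.
    Assumes the input contains at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update
    until the assignments stop changing or max_iter is reached."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers
```

Calling kmeans(points, 2) on two well-separated groups of points should place each group in its own cluster; the seeding step makes a poor starting configuration unlikely but not impossible.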

Dimensionality Reduction (2 algorithms):

PCA - Principal Component Analysis with SIMD-optimized covariance computation
PCA Whitening - PCA followed by rescaling each component to unit variance, producing decorrelated, normalized features

Quantization (2 algorithms):

Product Quantization (PQ) - Vector compression for 8-16x memory reduction
Optimized Product Quantization (OPQ) - Rotation-optimized PQ for better accuracy
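
The memory reduction from PQ follows from simple arithmetic: each vector is replaced by one small code per subvector. The sketch below shows one set of parameters that would land in the 8-16x range. The dimensionality, subvector counts, and 8-bit codes are assumptions chosen for illustration, not documented NeuronDB defaults, and the helper name pq_compression_ratio is hypothetical.

```python
def pq_compression_ratio(dim, num_subvectors, bits_per_code=8, bytes_per_float=4):
    """Ratio of raw float storage to PQ code storage for one vector.
    Ignores the shared codebook, whose cost is amortized over all vectors."""
    raw_bytes = dim * bytes_per_float
    code_bytes = num_subvectors * bits_per_code / 8
    return raw_bytes / code_bytes


# A 384-dimensional float32 vector occupies 1536 bytes.
ratio_fine = pq_compression_ratio(384, 192)    # 192 one-byte codes
ratio_coarse = pq_compression_ratio(384, 96)   # 96 one-byte codes
```

Under these assumptions, 192 subvectors give an 8x reduction and 96 subvectors give 16x; fewer subvectors compress more but approximate the original vector less precisely.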

Outlier Detection (3 methods):

Z-Score - Statistical outlier detection using standard deviations
Modified Z-Score - Robust variant using median and MAD (Median Absolute Deviation)
IQR - Non-parametric interquartile range method for non-normal distributions
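
The three methods differ mainly in how they estimate "typical" spread. The following sketch implements the textbook form of each using Python's statistics module. The function names and default thresholds (3.0, 3.5, and 1.5) are common conventions assumed here, not values taken from NeuronDB.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mean) / sd > threshold]


def modified_zscore_outliers(values, threshold=3.5):
    """Flag values using the median and MAD instead of mean and std dev.
    The 0.6745 factor rescales MAD to match the standard deviation
    for normally distributed data."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

On a small sample such as [10, 11, 12, 11, 10, 12, 11, 100], the plain Z-Score flags nothing because the extreme value inflates the standard deviation, while the Modified Z-Score and IQR methods both flag 100. This is the sense in which the latter two are robust.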

Reranking (3 algorithms):

MMR (Maximal Marginal Relevance) - Diversity-focused reranking
Ensemble Reranking - Weighted and Borda count methods
Learning-to-Rank (LTR) - Machine learning-based ranking
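
As an illustration of diversity-focused reranking, here is a minimal sketch of the standard MMR selection loop. It assumes cosine similarity and a trade-off parameter lam between relevance and redundancy; the function names and the dictionary-based candidate format are hypothetical and do not reflect NeuronDB's interface.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query, candidates, k, lam=0.5):
    """Greedy Maximal Marginal Relevance.
    candidates: dict mapping an id to its vector.
    lam=1.0 ranks purely by relevance; lower values penalize
    candidates that resemble already-selected results."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:

        def score(cid):
            relevance = cosine(query, remaining[cid])
            redundancy = max(
                (cosine(remaining[cid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

With a query of (1, 0) and candidates a = (1, 0.1), b = (1, 0.12), and c = (0.6, -0.8), pure relevance (lam=1.0) returns a then b, which are near-duplicates. With lam=0.5 the second pick becomes c, because b's similarity to a outweighs its relevance.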

Quality Metrics (6 metrics):

Recall@K, Precision@K, F1@K - Search quality metrics for retrieval evaluation
MRR (Mean Reciprocal Rank) - Ranking quality measurement
Davies-Bouldin Index - Clustering quality measurement (lower is better)
Silhouette Score - Clustering cohesion and separation analysis
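
The retrieval metrics above have short standard definitions, sketched below in plain Python. The function names and the (retrieved, relevant) input format are assumptions for illustration, and Precision@K here divides by K even when fewer than K results are returned, which is one common convention.

```python
def hits_at_k(retrieved, relevant, k):
    """Number of relevant documents among the top-k retrieved."""
    return sum(1 for doc in retrieved[:k] if doc in relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return hits_at_k(retrieved, relevant, k) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k."""
    return hits_at_k(retrieved, relevant, k) / len(relevant) if relevant else 0.0


def f1_at_k(retrieved, relevant, k):
    """Harmonic mean of Precision@K and Recall@K."""
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0


def mean_reciprocal_rank(queries):
    """Average of 1/rank of the first relevant result per query.
    queries: list of (retrieved_list, relevant_set) pairs.
    A query with no relevant result contributes 0."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

For example, if the top five results are d1 through d5 and the relevant set is {d2, d5, d9}, then Precision@5 is 2/5, Recall@5 is 2/3, and F1@5 is 0.5.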

Drift Detection (3 methods):

Centroid Drift - Track changes in cluster centroids over time
Distribution Divergence - Statistical distribution comparison
Temporal Drift Monitoring - Time-series based drift detection
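
One simple way to quantify centroid drift is the distance between the mean vector of a baseline window and that of a current window. The sketch below assumes Euclidean distance and a single centroid per window; how NeuronDB actually measures drift is not stated here, and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def centroid_drift(baseline, current):
    """Euclidean distance between the centroids of two sets of vectors."""
    return math.dist(centroid(baseline), centroid(current))
```

A monitoring job could compute this value per time window and raise an alert when it exceeds a threshold chosen for the embedding model in use.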

Analytics (4 features):

Topic Discovery - Automatic topic modeling and analysis
Similarity Histograms - Analyze similarity distributions in vector space
KNN Graph Building - Construct k-nearest neighbor graphs for analysis
Embedding Quality Assessment - Evaluate embedding quality and coherence

Search (2 algorithms):

Hybrid Lexical-Semantic Fusion - Combine vector and text search signals
Reciprocal Rank Fusion (RRF) - Fuses multiple ranked result lists into a single ranking
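
RRF scores each document by summing 1 / (k + rank) over every ranked list in which it appears. The sketch below uses k = 60, the constant commonly used in the literature; the function name rrf and the tie-breaking by document id are assumptions for illustration, not NeuronDB's interface.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion.
    rankings: list of ranked lists of document ids, best first.
    Returns all ids sorted by fused score, highest first
    (ties broken by id for a deterministic result)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Fuse a vector-search ranking with a text-search ranking.
fused = rrf([["a", "b", "c"], ["b", "d", "a"]])
```

Here the fused order is b, a, d, c: documents that appear in both lists rise above those found by only one search method, without any need to normalize the underlying similarity scores.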

Traditional ML Algorithms:

NeuronDB also supports Random Forest, XGBoost, LightGBM, CatBoost, Linear and Logistic Regression, Ridge, Lasso, SVM, KNN, Naive Bayes, Decision Trees, Neural Networks, and Deep Learning models for classification, regression, and more advanced ML workflows.

In-Database ML Inference

Run ML models directly inside PostgreSQL without moving data out of the database. Inference supports batch execution for high throughput and real-time predictions with low latency, and uses GPU acceleration automatically when available.

Embedding Generation

Generate embeddings from text, images, and audio. NeuronDB supports OpenAI, Cohere, and HuggingFace models as well as custom model deployment, with automatic batching and caching.

Model Management

Deploy and manage ML models with versioning and rollback, A/B testing support, resource quota management, and performance monitoring.

AutoML & Feature Store

NeuronDB provides automated hyperparameter tuning and model selection, along with a versioned feature store for production ML workflows.

Supported Models

Text Embeddings

text-embedding-ada-002 (OpenAI) - 1536 dimensions - General text similarity
text-embedding-3-small (OpenAI) - 1536 dimensions - Efficient embeddings
text-embedding-3-large (OpenAI) - 3072 dimensions - High quality embeddings
embed-english-v3.0 (Cohere) - 1024 dimensions - English text
embed-multilingual-v3.0 (Cohere) - 1024 dimensions - Multilingual text

Sentence Transformers

all-MiniLM-L6-v2 (HuggingFace) - 384 dimensions - Fast, lightweight
all-mpnet-base-v2 (HuggingFace) - 768 dimensions - High quality
paraphrase-multilingual-MiniLM (HuggingFace) - 384 dimensions - 50+ languages

Multimodal

CLIP-ViT-B-32 (OpenAI) - 512 dimensions - Image + text
CLIP-ViT-L-14 (OpenAI) - 768 dimensions - High quality image search

ML Functions

embed_text()

Generate text embeddings with automatic caching.

Signature: embed_text(text TEXT, model TEXT DEFAULT 'all-MiniLM-L6-v2') RETURNS vector

Example

SELECT embed_text('Machine learning with PostgreSQL');

embed_text_batch()

Generate embeddings for multiple texts efficiently.

Signature: embed_text_batch(texts TEXT[], model TEXT DEFAULT 'all-MiniLM-L6-v2') RETURNS vector[]

Example

SELECT embed_text_batch(ARRAY['text1', 'text2'], 'all-MiniLM-L6-v2');

train_random_forest_classifier()

Train a Random Forest classifier with GPU support.

Signature: train_random_forest_classifier(table_name TEXT, features_col TEXT, label_col TEXT, n_trees INT, max_depth INT)

Example

SELECT train_random_forest_classifier('training_data', 'features', 'label', 100, 10);

cluster_kmeans()

K-means clustering with GPU acceleration.

Signature: cluster_kmeans(table_name TEXT, vector_column TEXT, k INTEGER, max_iter INTEGER DEFAULT 100)

Example

SELECT cluster_kmeans('documents', 'embedding', 5, 100);

Next Steps

ONNX Inference - Deploy ONNX models
Embeddings - Generate embeddings
Clustering - ML clustering algorithms