
Machine Learning & Embeddings

ML Capabilities

NeuronDB provides 19 fully implemented ML algorithms (part of 52 total ML features), written in pure C, all supporting GPU acceleration and SIMD optimization. The extension also offers in-database ML inference, embedding generation, and complete model lifecycle management.

19 Fully Implemented ML Algorithms

NeuronDB includes a complete suite of machine learning algorithms organized by category:

Clustering (5 algorithms):

  • K-Means - Lloyd's algorithm with k-means++ initialization, GPU-accelerated
  • Mini-batch K-Means - Efficient variant for large datasets and streaming data
  • DBSCAN - Density-based clustering for arbitrary shapes and noise detection
  • Gaussian Mixture Model (GMM) - EM algorithm for probabilistic clustering
  • Hierarchical Clustering - Agglomerative clustering with multiple linkage criteria
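
To make the K-Means entry concrete, the following is a minimal pure-Python sketch of Lloyd's algorithm seeded with k-means++. It is an illustration only, not NeuronDB's C implementation; the names `sq_dist`, `kmeans_pp_init`, and `kmeans` are hypothetical and chosen for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: the first centroid is uniform at random, each later
    one is drawn with probability proportional to its squared distance to the
    nearest centroid chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until the
    centroids stop moving or max_iter is reached."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

points = [(0, 0), (1, 0), (100, 0), (101, 0)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids), labels)
```

Mini-batch K-Means follows the same assignment and update idea but applies each update to a small random sample of points rather than the full dataset.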

Dimensionality Reduction (2 algorithms):

  • PCA - Principal Component Analysis with SIMD-optimized covariance computation
  • PCA Whitening - Advanced PCA with decorrelation and normalization

Quantization (2 algorithms):

  • Product Quantization (PQ) - Vector compression for 8-16x memory reduction
  • Optimized Product Quantization (OPQ) - Rotation-optimized PQ for better accuracy
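
The memory reduction quoted for PQ follows from replacing each sub-vector with a small integer code. The sketch below shows that arithmetic and a toy encode/decode step. It assumes 4-byte floats, 8-bit codes, and codebooks that have already been trained (in practice usually with k-means per sub-space); the function names are hypothetical and do not come from NeuronDB.

```python
def pq_compression_ratio(dim, num_subvectors, bits_per_code=8, bytes_per_float=4):
    """Raw vector size divided by PQ code size (codebook storage ignored)."""
    raw_bytes = dim * bytes_per_float
    code_bytes = num_subvectors * bits_per_code / 8
    return raw_bytes / code_bytes

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_encode(vector, codebooks):
    """Split the vector into len(codebooks) equal sub-vectors and store, for
    each one, the index of the nearest codeword in that sub-space's codebook."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        codes.append(min(range(len(codebook)),
                         key=lambda c: sq_dist(sub, codebook[c])))
    return codes

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected codewords."""
    out = []
    for code, codebook in zip(codes, codebooks):
        out.extend(codebook[code])
    return out

# A 384-dimensional float32 vector: one code per 2 dims gives 8x, per 4 dims 16x.
print(pq_compression_ratio(384, 192), pq_compression_ratio(384, 96))

# Toy codebooks: two sub-spaces with two 2-d codewords each.
codebooks = [[(0, 0), (1, 1)], [(0, 0), (5, 5)]]
codes = pq_encode((0.9, 1.2, 4.0, 6.0), codebooks)
print(codes, pq_decode(codes, codebooks))
```

OPQ adds a learned rotation of the vector before the split, so that the sub-spaces are better balanced; the encode and decode steps are otherwise the same.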

Outlier Detection (3 methods):

  • Z-Score - Statistical outlier detection using standard deviations
  • Modified Z-Score - Robust variant using median and MAD (Median Absolute Deviation)
  • IQR - Non-parametric interquartile range method for non-normal distributions
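
The three methods above can be compared on a small sample. This is a sketch using Python's `statistics` module, not NeuronDB's implementation. The cut-offs (3 for the Z-score, 3.5 for the modified Z-score, 1.5 × IQR) and the 0.6745 scaling constant are common textbook conventions and are assumed here; NeuronDB's defaults may differ.

```python
import statistics

def z_scores(values):
    """Classic Z-score: distance from the mean in population standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def modified_z_scores(values):
    """Modified Z-score: uses the median and MAD, so a single extreme value
    cannot inflate the scale estimate. Assumes MAD > 0."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [0.6745 * (v - med) / mad for v in values]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 100]

z_flagged = [v for v, z in zip(data, z_scores(data)) if abs(z) > 3]
mz_flagged = [v for v, z in zip(data, modified_z_scores(data)) if abs(z) > 3.5]

print(z_flagged)           # the plain Z-score misses 100 in this small sample
print(mz_flagged)          # the modified Z-score flags it
print(iqr_outliers(data))  # so does the IQR rule
```

In this sample the extreme value inflates the standard deviation enough that its own Z-score stays below 3, which is the masking effect the median-based variants are designed to avoid.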

Reranking (3 algorithms):

  • MMR (Maximal Marginal Relevance) - Diversity-focused reranking
  • Ensemble Reranking - Weighted and Borda count methods
  • Learning-to-Rank (LTR) - Machine learning-based ranking
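
As an illustration of diversity-focused reranking, here is a minimal sketch of MMR over cosine similarity. It assumes the usual formulation, where each step picks the candidate maximizing `lambda_ * relevance - (1 - lambda_) * redundancy`; the function names and the `lambda_` parameter are hypothetical and not taken from NeuronDB.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr(query, candidates, top_k, lambda_=0.5):
    """Return candidate indices in MMR order.

    lambda_ = 1.0 is pure relevance ranking; lower values penalize candidates
    that are similar to ones already selected."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < top_k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = (1.0, 0.0)
candidates = [
    (1.0, 0.10),   # most relevant
    (1.0, 0.11),   # near-duplicate of the first candidate
    (1.0, -1.0),   # less relevant, but different
]
print(mmr(query, candidates, top_k=3, lambda_=1.0))  # relevance only
print(mmr(query, candidates, top_k=3, lambda_=0.5))  # near-duplicate is pushed down
```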

Quality Metrics (6 metrics):

  • Recall@K, Precision@K, F1@K - Search quality metrics for retrieval evaluation
  • MRR (Mean Reciprocal Rank) - Ranking quality measurement
  • Davies-Bouldin Index - Clustering quality measurement (lower is better)
  • Silhouette Score - Clustering cohesion and separation analysis
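
The retrieval metrics above have short standard definitions. The sketch below implements them over document IDs in plain Python; the function names are illustrative and are not NeuronDB's SQL interface.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def f1_at_k(retrieved, relevant, k):
    """Harmonic mean of Precision@K and Recall@K."""
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant result over all queries.
    A query with no relevant result contributes 0."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

print(precision_at_k(retrieved, relevant, 4))  # 2 of the top 4 are relevant
print(recall_at_k(retrieved, relevant, 4))     # 2 of the 3 relevant docs were found
print(f1_at_k(retrieved, relevant, 4))
print(mean_reciprocal_rank([(retrieved, relevant), (["a", "b"], {"a"})]))
```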

Drift Detection (3 methods):

  • Centroid Drift - Track changes in cluster centroids over time
  • Distribution Divergence - Statistical distribution comparison
  • Temporal Drift Monitoring - Time-series based drift detection
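
One simple reading of centroid drift is the distance between the mean embedding of a baseline batch and the mean embedding of a newer batch. The sketch below assumes that reading and uses Euclidean distance. The function names and the alert threshold are hypothetical; NeuronDB's own metric may be defined differently.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def centroid_drift(baseline, current):
    """Euclidean distance between the centroids of two batches."""
    return math.dist(centroid(baseline), centroid(current))

baseline = [(0.0, 0.0), (2.0, 0.0)]   # centroid (1, 0)
current = [(3.0, 4.0), (5.0, 4.0)]    # centroid (4, 4)

drift = centroid_drift(baseline, current)
DRIFT_THRESHOLD = 1.0  # hypothetical alert level; tune for your embedding scale
print(drift, drift > DRIFT_THRESHOLD)
```

Running this comparison on successive time windows gives a basic form of temporal monitoring: a drift value that keeps growing suggests the incoming data has moved away from what the model or index was built on.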

Analytics (4 features):

  • Topic Discovery - Automatic topic modeling and analysis
  • Similarity Histograms - Analyze similarity distributions in vector space
  • KNN Graph Building - Construct k-nearest neighbor graphs for analysis
  • Embedding Quality Assessment - Evaluate embedding quality and coherence
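
For KNN graph building, a brute-force construction is enough to show the idea: each vector is linked to its k closest neighbors. The sketch below assumes Euclidean distance and an O(n²) scan. The `knn_graph` name is illustrative, and a production implementation would normally use an index rather than comparing all pairs.

```python
import math

def knn_graph(vectors, k):
    """Map each vector's index to the indices of its k nearest neighbors
    (Euclidean distance, ties broken by lower index)."""
    graph = {}
    for i, v in enumerate(vectors):
        neighbors = sorted(
            (math.dist(v, w), j) for j, w in enumerate(vectors) if j != i)
        graph[i] = [j for _, j in neighbors[:k]]
    return graph

vectors = [(0, 0), (1, 0), (5, 0), (6, 0)]
print(knn_graph(vectors, k=1))
```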

Search (2 algorithms):

  • Hybrid Lexical-Semantic Fusion - Combine vector and text search signals
  • Reciprocal Rank Fusion (RRF) - Fuses multiple ranked result lists into a single ranking
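
RRF is simple enough to sketch in a few lines: each document scores `1 / (k + rank)` in every list that contains it, and the scores are summed. The example below fuses a vector-search ranking with a lexical ranking. The constant `k = 60` is the value commonly used in the literature and is an assumption here, not a documented NeuronDB default; the function name is illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document receives 1 / (k + rank) from every ranking it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken by document ID."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

vector_hits = ["a", "b", "c"]    # e.g. from a semantic (embedding) search
lexical_hits = ["b", "d", "a"]   # e.g. from a full-text search

print(reciprocal_rank_fusion([vector_hits, lexical_hits]))
```

Because only ranks are used, the two input lists do not need comparable score scales, which is what makes RRF a convenient way to combine vector and text search results.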

Traditional ML Algorithms:

NeuronDB also supports Random Forest, XGBoost, LightGBM, CatBoost, Linear/Logistic Regression, Ridge, Lasso, SVM, KNN, Naive Bayes, Decision Trees, Neural Networks, and Deep Learning models for classification, regression, and advanced ML workflows.

In-Database ML Inference

Run ML models directly inside PostgreSQL with zero data movement, batch inference for high throughput, real-time predictions with low latency, and automatic GPU acceleration when available.

Embedding Generation

Generate embeddings from text, images, and audio. NeuronDB supports OpenAI, Cohere, and HuggingFace models, custom model deployment, automatic batching and caching, and multi-modal embeddings (text, image, audio).

Model Management

Deploy and manage ML models efficiently with model versioning and rollback, A/B testing support, resource quota management, and performance monitoring.

AutoML & Feature Store

Automated hyperparameter tuning, model selection, and feature management with versioning for production ML workflows.

Supported Models

Text Embeddings

  • text-embedding-ada-002 (OpenAI) - 1536 dimensions - General text similarity
  • text-embedding-3-small (OpenAI) - 1536 dimensions - Efficient embeddings
  • text-embedding-3-large (OpenAI) - 3072 dimensions - High quality embeddings
  • embed-english-v3.0 (Cohere) - 1024 dimensions - English text
  • embed-multilingual-v3.0 (Cohere) - 1024 dimensions - Multilingual text

Sentence Transformers

  • all-MiniLM-L6-v2 (HuggingFace) - 384 dimensions - Fast, lightweight
  • all-mpnet-base-v2 (HuggingFace) - 768 dimensions - High quality
  • paraphrase-multilingual-MiniLM (HuggingFace) - 384 dimensions - 50+ languages

Multimodal

  • CLIP-ViT-B-32 (OpenAI) - 512 dimensions - Image + text
  • CLIP-ViT-L-14 (OpenAI) - 768 dimensions - High quality image search

ML Functions

embed_text()

Generate text embeddings with automatic caching.

Signature: embed_text(text TEXT, model TEXT DEFAULT 'all-MiniLM-L6-v2') RETURNS vector

Example

SELECT embed_text('Machine learning with PostgreSQL');

embed_text_batch()

Generate embeddings for multiple texts efficiently.

Signature: embed_text_batch(texts TEXT[], model TEXT DEFAULT 'all-MiniLM-L6-v2') RETURNS vector[]

Example

SELECT embed_text_batch(ARRAY['text1', 'text2'], 'all-MiniLM-L6-v2');

train_random_forest_classifier()

Train a Random Forest classifier with GPU support.

Signature: train_random_forest_classifier(table_name TEXT, features_col TEXT, label_col TEXT, n_trees INT, max_depth INT)

Example

SELECT train_random_forest_classifier('training_data', 'features', 'label', 100, 10);

cluster_kmeans()

K-means clustering with GPU acceleration.

Signature: cluster_kmeans(table_name TEXT, vector_column TEXT, k INTEGER, max_iter INTEGER DEFAULT 100)

Example

SELECT cluster_kmeans('documents', 'embedding', 5, 100);

Next Steps