ML Analytics Suite
Clustering Algorithms
NeuronDB provides 5 clustering algorithms with GPU acceleration:
K-Means Clustering
Lloyd's K-Means with k-means++ initialization for finding customer segments, topic clusters, and data grouping. GPU-accelerated for large datasets.
Mini-batch K-Means
Efficient variant of K-Means using mini-batches for faster convergence on very large datasets. Ideal for streaming data and incremental updates.
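The per-batch update rule can be sketched in plain Python (a minimal illustration of the standard decaying-learning-rate scheme, not NeuronDB's GPU implementation):

```python
def minibatch_kmeans_step(centers, counts, batch):
    """One mini-batch update: assign each sample to its nearest center,
    then nudge that center toward the sample with a learning rate that
    decays as the center absorbs more samples."""
    for x in batch:
        # index of the nearest center (squared Euclidean distance)
        j = min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-center learning rate
        centers[j] = tuple((1 - eta) * c + eta * a
                           for c, a in zip(centers[j], x))
    return centers, counts

centers, counts = [(0.0, 0.0), (10.0, 10.0)], [1, 1]
centers, counts = minibatch_kmeans_step(centers, counts, [(1.0, 1.0)])
```

Because each sample touches only one center, updates stream through arbitrarily large datasets in fixed memory, which is what makes the variant suitable for incremental workloads.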
K-Means clustering
SELECT cluster_kmeans(
'train_data', -- table with vectors
'features', -- vector column
7, -- K clusters
50 -- max iterations
);
-- Project-based training and versioning
SELECT neurondb_train_kmeans_project(
'fraud_kmeans', -- project name
'train_data',
'features',
7,
50
) AS model_id;
-- List models for a project
SELECT version, algorithm, parameters, is_deployed
FROM neurondb_list_project_models('fraud_kmeans')
ORDER BY version;
- Time Complexity: O(n·k·i·d) for n points, k clusters, i iterations, and d dimensions
- Initialization: k-means++
- Project Models: Versioned training runs
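The k-means++ seeding step noted above can be sketched as follows (a plain-Python illustration of the seeding rule only; the function name and data are made up for the example):

```python
import random

def kmeans_pp_init(points, k, rng=random.Random(0)):
    """k-means++ seeding: pick each new center with probability
    proportional to squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance from each point to its nearest existing center
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr))
                  for ctr in centers)
              for pt in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(pt)
                break
    return centers

pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
centers = kmeans_pp_init(pts, 2)
```

Weighting by squared distance makes the initial centers spread across the data, which is why k-means++ typically converges in fewer Lloyd iterations than random seeding.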
DBSCAN
Density-based clustering for arbitrary shapes. Automatically detects noise while grouping dense regions.
DBSCAN clustering
SELECT *
FROM cluster_dbscan(
relation => 'train_data',
column_name => 'features',
eps => 0.35,
min_samples => 12,
distance => 'cosine'
);
- No need to specify cluster count: DBSCAN finds density-based groupings.
- Handles noise and outliers automatically
- Works well with non-spherical clusters
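The density-expansion logic behind `cluster_dbscan` can be sketched like this (Euclidean distance is used here for brevity where the SQL example used cosine; this is an illustration, not the extension's code):

```python
from collections import deque

def dbscan(points, eps, min_samples, dist):
    """Return a label per point: a cluster id, or -1 for noise."""
    n = len(points)
    labels = [None] * n
    # neighborhoods include the point itself, as in the usual formulation
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1          # provisionally noise
            continue
        labels[i] = cid             # new core point: start a cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cid     # noise reachable from a core is a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])   # expand only through core points
        cid += 1
    return labels

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
labels = dbscan(pts, eps=2.0, min_samples=2, dist=euclid)
```

The expansion through core points only is what lets DBSCAN trace arbitrarily shaped dense regions while leaving sparse points labeled as noise.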
Gaussian Mixture Model (GMM)
Expectation-Maximization (EM) algorithm for probabilistic clustering. Handles overlapping clusters and provides probability assignments.
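The EM loop can be sketched for the one-dimensional case (a from-scratch illustration with crude initialization; not NeuronDB's implementation):

```python
import math

def em_gmm_1d(xs, k=2, iters=50):
    """Fit a 1-D Gaussian mixture with expectation-maximization."""
    lo, hi = min(xs), max(xs)
    mu = [lo + (hi - lo) * (i + 1) / (k + 1) for i in range(k)]
    var = [(hi - lo) ** 2 / 4 or 1.0] * k
    pi = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [pi[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2
                         for r, x in zip(resp, xs)) / nj or 1e-9
    return pi, mu, var

xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
pi, mu, var = em_gmm_1d(xs)
```

The responsibilities computed in the E-step are exactly the soft probability assignments mentioned above, which is what lets GMM model overlapping clusters where hard assignment would fail.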
Hierarchical Clustering
Agglomerative hierarchical clustering that builds a tree of clusters. Supports multiple linkage criteria (single, complete, average, Ward).
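A naive agglomerative pass can be sketched as follows (single and complete linkage only; average and Ward linkage are omitted for brevity, and production implementations avoid this O(n³) formulation):

```python
def agglomerative(points, n_clusters, linkage="single"):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = [[p] for p in points]
    # single linkage = closest pair across clusters; complete = farthest pair
    agg = min if linkage == "single" else max
    def cdist(c1, c2):
        return agg(dist(a, b) for a in c1 for b in c2)
    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cdist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))   # merge j into i
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
out = agglomerative(pts, 2)
```

Recording the sequence of merges, rather than stopping at a fixed count, yields the full cluster tree (dendrogram) that hierarchical clustering is known for.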
Outlier Detection
NeuronDB provides multiple outlier detection methods:
Z-Score Outlier Detection
Statistical method using standard deviations. Identifies points that deviate significantly from the mean.
Z-score outliers
SELECT *
FROM detect_outliers_zscore(
(SELECT embedding FROM documents),
2.5 -- threshold
);
Modified Z-Score
Robust variant using median and median absolute deviation (MAD). More resistant to outliers in the training data.
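The modified z-score computation can be sketched directly (a standard formulation using the 0.6745 consistency factor; the 3.5 cutoff is a common convention, not a NeuronDB default):

```python
import statistics

def modified_zscores(xs):
    """Modified z-score: 0.6745 * (x - median) / MAD.
    The 0.6745 factor makes the MAD comparable to a standard
    deviation under a normal distribution."""
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    return [0.6745 * (x - med) / mad for x in xs]

xs = [10.0, 10.1, 9.9, 10.2, 9.8, 30.0]
scores = modified_zscores(xs)
outliers = [x for x, z in zip(xs, scores) if abs(z) > 3.5]
```

Note that the extreme value barely shifts the median and MAD, whereas it would inflate a mean-and-standard-deviation estimate; that is the robustness being claimed above.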
IQR (Interquartile Range)
Non-parametric method using quartiles. Effective for non-normal distributions.
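The IQR fence rule can be sketched as follows (linear-interpolation quantiles and the conventional k = 1.5 multiplier; an illustration rather than the extension's exact quantile method):

```python
def iqr_outliers(xs, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = sorted(xs)
    def quantile(q):
        # linear interpolation between closest ranks
        pos = q * (len(s) - 1)
        lo, frac = int(pos), pos - int(pos)
        return s[lo] + frac * (s[min(lo + 1, len(s) - 1)] - s[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    spread = q3 - q1
    lo_fence, hi_fence = q1 - k * spread, q3 + k * spread
    return [x for x in xs if x < lo_fence or x > hi_fence]

data = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0]
flagged = iqr_outliers(data)
```

Because quartiles ignore the magnitude of the tails entirely, the fences stay stable under skewed or heavy-tailed data where z-score thresholds drift.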
Dimensionality Reduction
Reduce vector dimensions while preserving important information:
PCA (Principal Component Analysis)
Linear dimensionality reduction that finds principal components maximizing variance. SIMD-optimized for fast covariance computation.
PCA
SELECT *
FROM reduce_pca(
(SELECT embedding FROM documents),
50 -- target dimensions
);
PCA Whitening
Advanced PCA variant that decorrelates and normalizes the data. Produces features with unit variance and zero correlation.
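The whitening transform can be sketched for the 2-D case: rotate onto the principal axes, then divide each axis by the square root of its eigenvalue (a from-scratch illustration assuming a nonzero covariance term; real pipelines would use the SQL functions above or a linear-algebra library):

```python
import math

def whiten_2d(points):
    """PCA whitening for 2-D data: decorrelate, then scale to unit variance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [p[0] - mx for p in points]
    ys = [p[1] - my for p in points]
    # covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x in xs) / n
    b = sum(x * y for x, y in zip(xs, ys)) / n
    c = sum(y * y for y in ys) / n
    # eigenvalues of a symmetric 2x2 matrix
    tr, det = a + c, a * c - b * b
    gap = math.sqrt(tr * tr / 4 - det)
    l1, l2 = tr / 2 + gap, tr / 2 - gap
    # unit eigenvector for l1 (assumes b != 0), plus its orthogonal complement
    norm = math.hypot(l1 - c, b)
    u1 = ((l1 - c) / norm, b / norm)
    u2 = (-u1[1], u1[0])
    return [((x * u1[0] + y * u1[1]) / math.sqrt(l1),
             (x * u2[0] + y * u2[1]) / math.sqrt(l2))
            for x, y in zip(xs, ys)]
```

After the transform each coordinate has unit variance and zero cross-correlation, which is exactly the property described above.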
Quality Metrics
Assess clustering and search quality with comprehensive metrics:
- Recall@K, Precision@K, F1@K: Search quality metrics
- MRR (Mean Reciprocal Rank): Ranking quality
- Davies-Bouldin Index: Clustering quality (lower is better)
- Silhouette Score: Clustering cohesion and separation
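Two of these metrics are simple enough to define inline; the sketch below shows the standard Recall@K and MRR formulas (illustrative helper names, not the extension's API):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(queries):
    """Mean reciprocal rank over (retrieved, relevant) pairs:
    average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in queries:
        rr = 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(queries)

r = recall_at_k(["a", "b", "c", "d"], {"b", "e"}, k=3)  # 1 of 2 relevant in top-3
m = mrr([(["x", "a"], {"a"}), (["a"], {"a"})])          # (1/2 + 1) / 2
```

Recall@K rewards coverage of the relevant set, while MRR rewards putting the first relevant result as high as possible; reporting both guards against optimizing one at the expense of the other.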
Drift Detection
Monitor data distribution changes over time:
- Centroid Drift: Track changes in cluster centroids
- Distribution Divergence: Measure statistical differences
- Temporal Drift Monitoring: Time-series based drift detection
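The centroid-drift measurement reduces to comparing batch means; a minimal sketch (hypothetical helper names, Euclidean distance assumed):

```python
def centroid(vectors):
    """Component-wise mean of a batch of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n
                 for i in range(len(vectors[0])))

def centroid_drift(baseline, current):
    """Euclidean distance between the centroids of two batches."""
    a, b = centroid(baseline), centroid(current)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

old_batch = [(0.0, 0.0), (2.0, 0.0)]
new_batch = [(3.0, 4.0), (5.0, 4.0)]
d = centroid_drift(old_batch, new_batch)
```

In practice the drift distance is tracked per time window and compared against a threshold, which is the time-series monitoring described above.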
Topic Discovery
Automatic topic modeling and analysis for text data. Identifies latent topics in document collections.
Similarity Histograms
Analyze similarity distributions in your vector space for quality assessment and optimization.
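A similarity histogram is just a binned count of pairwise similarities; a sketch using cosine similarity over the [-1, 1] range (illustrative only, and O(n²) in the number of vectors):

```python
import itertools
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.hypot(*a) * math.hypot(*b))

def similarity_histogram(vectors, bins=4):
    """Bin all pairwise cosine similarities across [-1, 1]."""
    counts = [0] * bins
    for a, b in itertools.combinations(vectors, 2):
        s = cosine(a, b)
        idx = min(int((s + 1) / 2 * bins), bins - 1)
        counts[idx] += 1
    return counts

hist = similarity_histogram([(1, 0), (0, 1), (1, 1)])
```

A histogram that piles up near 1.0 suggests near-duplicate vectors, while a very flat one suggests the embedding space is not separating content well; both are useful signals when tuning an index.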
KNN Graph Building
Construct k-nearest neighbor graphs for graph-based analysis and community detection.
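The brute-force construction can be sketched as follows (an O(n²) illustration; real builds at scale use approximate-neighbor indexes):

```python
def knn_graph(points, k):
    """Adjacency list mapping each node to its k nearest neighbors
    (excluding itself), by Euclidean distance."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    graph = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        graph[i] = others[:k]
    return graph

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
g = knn_graph(pts, 1)
```

Once built, the graph feeds standard community-detection or connected-component algorithms, turning vector similarity into graph structure.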
Next Steps
- ML Functions - Complete ML API reference
- Clustering Guide - Detailed clustering documentation
- Performance - Optimize ML operations