NeuronDB Documentation

ML Analytics Suite

Clustering Algorithms

NeuronDB provides 5 clustering algorithms with GPU acceleration:

K-Means Clustering

Lloyd's K-Means with k-means++ initialization for customer segmentation, topic clustering, and general data grouping. GPU-accelerated for large datasets.

K-Means clustering

-- K-Means clustering
SELECT cluster_kmeans(
  'train_data',   -- table with vectors
  'features',     -- vector column
  7,              -- K clusters
  50              -- max iterations
);

-- Project-based training and versioning
SELECT neurondb_train_kmeans_project(
  'fraud_kmeans',   -- project name
  'train_data',
  'features',
  7,
  50
) AS model_id;

-- List models for a project
SELECT version, algorithm, parameters, is_deployed
FROM neurondb_list_project_models('fraud_kmeans')
ORDER BY version;
  • Time Complexity: O(n·k·i·d)
  • Initialization: k-means++
  • Project Models: Versioned training runs

Mini-batch K-Means

Efficient variant of K-Means using mini-batches for faster convergence on very large datasets. Ideal for streaming data and incremental updates.
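
A possible invocation is sketched below; the cluster_kmeans_minibatch function name and its batch-size parameter are assumptions patterned on cluster_kmeans above, not confirmed API.

-- Hypothetical mini-batch call; function name and batch-size
-- parameter are illustrative, patterned on cluster_kmeans.
SELECT cluster_kmeans_minibatch(
  'train_data',   -- table with vectors
  'features',     -- vector column
  7,              -- K clusters
  50,             -- max iterations
  1024            -- mini-batch size (assumed parameter)
);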

DBSCAN

Density-based clustering for arbitrary shapes. Automatically detects noise while grouping dense regions.

DBSCAN clustering

SELECT *
FROM cluster_dbscan(
  relation      => 'train_data',
  column_name   => 'features',
  eps           => 0.35,
  min_samples   => 12,
  distance      => 'cosine'
);
  • No need to specify the cluster count; DBSCAN finds density-based groupings on its own
  • Handles noise and outliers automatically
  • Works well with non-spherical clusters

Gaussian Mixture Model (GMM)

Expectation-Maximization (EM) algorithm for probabilistic clustering. Handles overlapping clusters and provides probability assignments.
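
A sketch of a probabilistic clustering call, assuming a hypothetical cluster_gmm function that follows the named-argument style of cluster_dbscan; the parameter names and component count are illustrative only.

-- Hypothetical GMM call; function name and parameters are assumptions.
SELECT *
FROM cluster_gmm(
  relation       => 'train_data',
  column_name    => 'features',
  n_components   => 5,      -- number of Gaussian components
  max_iterations => 100     -- EM iterations
);

A probabilistic variant like this would typically expose per-component membership probabilities alongside the hard cluster assignment.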

Hierarchical Clustering

Agglomerative hierarchical clustering that builds a tree of clusters. Supports multiple linkage criteria (single, complete, average, Ward).
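
As an illustration only, a hypothetical cluster_hierarchical call in the same named-argument convention; the function name and the linkage parameter are assumptions.

-- Hypothetical agglomerative clustering call; name and parameters
-- are illustrative, not confirmed API.
SELECT *
FROM cluster_hierarchical(
  relation    => 'train_data',
  column_name => 'features',
  n_clusters  => 7,
  linkage     => 'ward'     -- single | complete | average | ward
);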

Outlier Detection

NeuronDB provides multiple outlier detection methods:

Z-Score Outlier Detection

Statistical method using standard deviations. Identifies points that deviate significantly from the mean.

Z-score outliers

SELECT *
FROM detect_outliers_zscore(
  (SELECT embedding FROM documents),
  2.5  -- threshold
);

Modified Z-Score

Robust variant using the median and median absolute deviation (MAD). More resistant to extreme values in the data itself, since the median and MAD are not skewed by outliers.
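
A sketch patterned on detect_outliers_zscore above; the detect_outliers_modified_zscore name and the 3.5 threshold (a common default for MAD-based scores) are assumptions.

-- Hypothetical modified Z-score call; function name is illustrative.
SELECT *
FROM detect_outliers_modified_zscore(
  (SELECT embedding FROM documents),
  3.5  -- threshold on the MAD-based score
);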

IQR (Interquartile Range)

Non-parametric method using quartiles. Effective for non-normal distributions.
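
A sketch of an IQR-based call in the same style; the function name and the 1.5 multiplier (flagging points outside Q1 - 1.5·IQR to Q3 + 1.5·IQR) are assumptions.

-- Hypothetical IQR call; function name and multiplier are illustrative.
SELECT *
FROM detect_outliers_iqr(
  (SELECT embedding FROM documents),
  1.5  -- IQR multiplier
);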

Dimensionality Reduction

Reduce vector dimensions while preserving important information:

PCA (Principal Component Analysis)

Linear dimensionality reduction that finds principal components maximizing variance. SIMD-optimized for fast covariance computation.

PCA

SELECT *
FROM reduce_pca(
  (SELECT embedding FROM documents),
  50  -- target dimensions
);

PCA Whitening

Advanced PCA variant that decorrelates and normalizes the data. Produces features with unit variance and zero correlation.
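
A sketch of a whitening call patterned on reduce_pca above; the reduce_pca_whiten name is an assumption.

-- Hypothetical whitening call; function name is illustrative.
SELECT *
FROM reduce_pca_whiten(
  (SELECT embedding FROM documents),
  50  -- target dimensions
);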

Quality Metrics

Assess clustering and search quality with comprehensive metrics (see the sketch after this list):

  • Recall@K, Precision@K, F1@K: Search quality metrics
  • MRR (Mean Reciprocal Rank): Ranking quality
  • Davies-Bouldin Index: Clustering quality (lower is better)
  • Silhouette Score: Clustering cohesion and separation
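
As a sketch only, clustering-quality checks might look like the calls below; the function names, argument lists, and the assumption that cluster assignments live in a cluster_id column are all illustrative.

-- Hypothetical metric calls; names and signatures are assumptions.
SELECT silhouette_score('train_data', 'features', 'cluster_id');      -- higher is better
SELECT davies_bouldin_index('train_data', 'features', 'cluster_id');  -- lower is better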

Drift Detection

Monitor data distribution changes over time (see the sketch after this list):

  • Centroid Drift: Track changes in cluster centroids
  • Distribution Divergence: Measure statistical differences
  • Temporal Drift Monitoring: Time-series based drift detection
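
A hedged sketch of a centroid-drift check between two snapshots; the detect_centroid_drift name, the snapshot table names, and the parameters are assumptions.

-- Hypothetical drift check; function name, tables, and parameters
-- are illustrative only.
SELECT *
FROM detect_centroid_drift(
  baseline_table => 'embeddings_baseline',
  current_table  => 'embeddings_current',
  column_name    => 'features',
  k              => 7   -- number of centroids to compare
);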

Topic Discovery

Automatic topic modeling and analysis for text data. Identifies latent topics in document collections.
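
A sketch of how topic discovery might be invoked; the discover_topics name, its parameters, and the output columns are assumptions.

-- Hypothetical topic-discovery call; name, parameters, and output
-- columns are illustrative.
SELECT topic_id, top_terms, document_count
FROM discover_topics(
  relation    => 'documents',
  column_name => 'embedding',
  n_topics    => 10
);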

Similarity Histograms

Analyze similarity distributions in your vector space for quality assessment and optimization.
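
A sketch of a histogram query; the similarity_histogram name, the bucket parameters, and the output columns are assumptions.

-- Hypothetical histogram call; name and columns are illustrative.
-- Each row is one similarity bucket with its pair count.
SELECT bucket_low, bucket_high, pair_count
FROM similarity_histogram(
  relation    => 'documents',
  column_name => 'embedding',
  num_buckets => 20,
  distance    => 'cosine'
);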

KNN Graph Building

Construct k-nearest neighbor graphs for graph-based analysis and community detection.
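
A sketch of a graph-building call; the build_knn_graph name and the edge-list output (source id, neighbor id, distance) are assumptions.

-- Hypothetical KNN-graph builder; name and output are illustrative.
SELECT source_id, neighbor_id, distance
FROM build_knn_graph(
  relation    => 'documents',
  column_name => 'embedding',
  k           => 10,
  distance    => 'cosine'
);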

Next Steps