Documentation branch: main (3.0.0-devel)

Observability Stack

Overview

The NeuronDB observability stack provides comprehensive monitoring, visualization, and distributed tracing for the entire ecosystem. The stack includes:

  • Prometheus - Metrics collection, alerting, and querying
  • Grafana - Pre-configured dashboards and visualization
  • Jaeger - Distributed tracing for request flows
  • Alertmanager - Alert routing and notification management

Key Features

  • Complete Coverage: All modules and variants monitored (NeuronDB, NeuronAgent, NeuronMCP, NeuronDesktop)
  • Detailed Metrics: Module-specific metrics with proper labeling
  • Comprehensive Alerts: 40+ alert rules for all critical failure modes
  • Performance Optimization: Recording rules for common queries
  • Production Ready: Alertmanager integration with notification routing
  • Pre-configured: Grafana dashboards and Prometheus rules included

Prometheus

Prometheus collects metrics from all NeuronDB ecosystem components and provides a query language (PromQL) for monitoring and alerting.

Configuration Files

The Prometheus configuration is located in the prometheus/ directory:

  • prometheus.yml - Main Prometheus configuration
  • alerts.yml - Alert rules (organized by module)
  • recording_rules.yml - Pre-computed metrics for performance
  • alertmanager.yml - Alertmanager configuration
  • postgres_exporter.yml - PostgreSQL exporter custom queries
  • service_discovery.yml - Service discovery reference

Quick Start

Start Prometheus with Docker Compose

# Start Prometheus
docker compose -f docker-compose.observability.yml up -d prometheus

# Access Prometheus UI
# http://localhost:9090

# Check targets
# http://localhost:9090/targets
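
The target check can also be scripted rather than done in the browser. The following is a minimal Python sketch, assuming Prometheus is reachable at http://localhost:9090 and serves its standard HTTP API endpoint /api/v1/targets. The function names fetch_targets and unhealthy_targets are illustrative and not part of NeuronDB.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the scrape-target list from the Prometheus HTTP API."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return the sorted scrape URLs of active targets whose health is not "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return sorted(t["scrapeUrl"] for t in targets if t.get("health") != "up")
```

Calling unhealthy_targets(fetch_targets()) after the stack starts should return an empty list once every exporter has been scraped successfully at least once.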

Metrics Endpoints

All services expose Prometheus-compatible metrics:

  • NeuronDB: Via PostgreSQL exporter at :9187/metrics
  • NeuronAgent: :8080/metrics
  • NeuronDesktop API: :8081/metrics
  • Infrastructure: Node exporter (:9100/metrics), cAdvisor (:8080/metrics)
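
Each of these endpoints returns the Prometheus text exposition format: one sample per line, with # HELP and # TYPE comment lines in between. As a rough illustration of what a scrape sees, the sketch below parses such a payload into a dictionary. It is a simplified reader for inspection and debugging, not a full implementation of the format, and parse_metrics is a hypothetical helper name.

```python
def parse_metrics(text):
    """Map each series (name plus label set) to its sample value.

    Simplified: optional timestamps are ignored, and comment lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, HELP, or TYPE comment
        if "{" in line:
            # Labels present: the series ends at the last closing brace.
            end = line.rindex("}") + 1
        else:
            # No labels: the series is just the metric name.
            end = line.index(" ")
        fields = line[end:].split()
        samples[line[:end]] = float(fields[0])
    return samples
```

Feeding it the body of a request to :9187/metrics or :8081/metrics gives a quick way to confirm that an expected metric is actually being exported before looking for it in Prometheus.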

📋 Complete Prometheus Documentation: See prometheus/README.md for detailed configuration, metrics reference, and alert rules.

Grafana

Grafana provides pre-configured dashboards for visualizing NeuronDB ecosystem metrics, performance data, and health status.

Quick Start

Start Grafana with Docker Compose

# Start Grafana
docker compose -f docker-compose.observability.yml up -d grafana

# Access Grafana UI
# http://localhost:3001
# Default credentials: admin/admin

# Grafana will automatically provision:
# - Prometheus datasource
# - Pre-configured dashboards

Pre-configured Dashboards

Grafana includes dashboards for:

  • NeuronDB: Database health, query performance, index health, cache metrics
  • NeuronAgent: Service availability, error rates, latency, execution metrics
  • NeuronDesktop: API availability, error rates, connection metrics
  • NeuronMCP: Service availability, tool execution, connection pool
  • Infrastructure: System resources, container health, network metrics

Dashboard Provisioning

Grafana dashboards are automatically provisioned from the grafana/provisioning/dashboards/ directory. The Prometheus datasource is configured in grafana/provisioning/datasources/prometheus.yml.

Custom Dashboards

Create custom dashboards in the Grafana UI, or add dashboard JSON files to the grafana/dashboards/ directory.

Jaeger

Jaeger provides distributed tracing for request flows across all NeuronDB ecosystem components.

Quick Start

Start Jaeger with Docker Compose

# Start Jaeger
docker compose -f docker-compose.observability.yml up -d jaeger

# Access Jaeger UI
# http://localhost:16686

# Jaeger endpoints:
# - UI: :16686
# - OTLP gRPC: :4317
# - OTLP HTTP: :4318

Features

  • Distributed Tracing: Track requests across all services
  • Service Map: Visualize service dependencies
  • Trace Analysis: Identify bottlenecks and slow operations
  • Performance Insights: Understand request latency breakdown

Docker Compose Setup

Use the docker-compose.observability.yml file to run the complete observability stack:

Start observability stack

# Start all observability services
docker compose -f docker-compose.observability.yml up -d

# Check status
docker compose -f docker-compose.observability.yml ps

# View logs
docker compose -f docker-compose.observability.yml logs -f

# Stop services
docker compose -f docker-compose.observability.yml down

Access URLs

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3001 (admin/admin)
  • Jaeger: http://localhost:16686
  • Alertmanager: http://localhost:9093 (if enabled)
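
A quick way to confirm the stack is up is to check that each of these ports accepts TCP connections. The sketch below does only that, using the ports listed above. It does not verify that the service behind a port is healthy, and the names STACK_PORTS, is_port_open, and stack_status are illustrative.

```python
import socket

# Host ports taken from the access URLs listed above.
STACK_PORTS = {
    "Prometheus": 9090,
    "Grafana": 3001,
    "Jaeger": 16686,
    "Alertmanager": 9093,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stack_status(host="localhost", ports=STACK_PORTS):
    """Map each service name to whether its port is currently reachable."""
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

Printing stack_status() after docker compose up gives a one-line summary. Alertmanager is expected to report False when it is not enabled.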

Kubernetes Setup

The Helm chart includes the complete observability stack. Enable it in your values file:

Enable observability in Helm values

# values.yaml
monitoring:
  enabled: true
  prometheus:
    enabled: true
    retention: "30d"
    persistence:
      enabled: true
      size: "20Gi"
  grafana:
    enabled: true
    adminPassword: "change-me"  # Change in production!
    persistence:
      enabled: true
      size: "10Gi"
  jaeger:
    enabled: true

Access Services in Kubernetes

Port-forward to observability services

# Grafana
kubectl port-forward svc/neurondb-grafana 3001:3000 -n neurondb
# Access at: http://localhost:3001

# Prometheus
kubectl port-forward svc/neurondb-prometheus 9090:9090 -n neurondb
# Access at: http://localhost:9090

# Jaeger
kubectl port-forward svc/neurondb-jaeger 16686:16686 -n neurondb
# Access at: http://localhost:16686

Service Discovery

Kubernetes deployments use ServiceMonitor resources (a Prometheus Operator custom resource) for service discovery, so Prometheus discovers and scrapes all NeuronDB ecosystem services without manually configured targets.

Metrics Reference

Key metrics exposed by each component:

NeuronDB Metrics

  • neurondb_queries_total - Total number of queries (by query_type, index_type)
  • neurondb_query_duration_seconds - Query duration histogram (by query_type)
  • neurondb_index_size_bytes - Index size in bytes (by index_name, index_type)
  • neurondb_vector_count - Number of vectors (by table_name)
  • neurondb_cache_hits_total - Cache hits (by cache_type)
  • neurondb_cache_misses_total - Cache misses (by cache_type)
  • neurondb_worker_status - Worker status (by worker_id, status)
  • neurondb_errors_total - Total errors (by error_type)
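
Counters such as neurondb_cache_hits_total and neurondb_cache_misses_total only ever increase, so useful signals come from how much they change over a time window. As a hedged illustration, the sketch below derives a cache hit rate from two readings of each counter. It assumes no counter reset occurred between the readings (Prometheus itself compensates for resets), and cache_hit_rate is a hypothetical helper, not a NeuronDB function.

```python
def cache_hit_rate(hits_start, hits_end, misses_start, misses_end):
    """Hit rate over a window, from counter readings at its start and end.

    Returns None when the window saw no cache lookups.
    Assumes neither counter was reset inside the window.
    """
    hits = hits_end - hits_start
    misses = misses_end - misses_start
    total = hits + misses
    if total <= 0:
        return None
    return hits / total
```

With 70 new hits and 30 new misses in the window, the function returns 0.7, which sits exactly at the 70% threshold used by the NeuronDBCacheHitRateLow alert.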

NeuronAgent Metrics

  • neurondb_agent_http_requests_total - Total HTTP requests (by method, endpoint, status)
  • neurondb_agent_http_request_duration_seconds - HTTP request duration (by method, endpoint)
  • neurondb_agent_executions_total - Agent executions (by agent_id, status)
  • neurondb_agent_execution_duration_seconds - Execution duration (by agent_id)
  • neurondb_agent_llm_calls_total - LLM API calls (by model, status)
  • neurondb_agent_llm_tokens_total - LLM tokens (by model, type)
  • neurondb_agent_memory_chunks_stored_total - Memory chunks stored (by agent_id)
  • neurondb_agent_tool_executions_total - Tool executions (by tool_name, status)
  • neurondb_agent_database_connections_active - Active DB connections

NeuronDesktop Metrics

  • neurondesktop_api_requests_total - Total API requests (by endpoint, method)
  • neurondesktop_api_errors_total - API errors (by endpoint, error_type)
  • neurondesktop_api_request_duration_seconds - Request duration (by endpoint)
  • neurondesktop_active_connections - Active connections
  • neurondesktop_active_mcp_connections - Active MCP connections
  • neurondesktop_active_neurondb_connections - Active NeuronDB connections
  • neurondesktop_active_agent_connections - Active agent connections

📋 Complete Metrics Reference: See prometheus/README.md for all available metrics with descriptions and labels.

Alert Rules

The bundled alerts.yml defines more than 40 alert rules, organized by module and covering the critical failure modes of each component:

NeuronDB Alerts

  • NeuronDBServiceDown (Critical) - Service down > 1m
  • NeuronDBConnectionFailure (Critical) - >5 failures in 5m
  • NeuronDBHighQueryLatency (Warning) - P95 > 1s for 5m
  • NeuronDBIndexHealthDegraded (Warning) - Health < 80% for 5m
  • NeuronDBCacheHitRateLow (Warning) - Hit rate < 70% for 5m
  • NeuronDBConnectionPoolExhausted (Critical) - Utilization > 90% for 5m
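
Latency alerts such as NeuronDBHighQueryLatency compare a P95 value against a threshold. With a histogram metric like neurondb_query_duration_seconds, that percentile is estimated from cumulative bucket counts rather than measured directly. The sketch below shows the linear-interpolation approach that PromQL's histogram_quantile function is based on. It is a simplified reimplementation for illustration, and the bucket boundaries in the usage note are made up, not NeuronDB's actual buckets.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    The bucket list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

For buckets [(0.1, 50), (0.5, 80), (1.0, 90), (2.0, 100), (math.inf, 100)], the P95 estimate is 1.5 seconds, which is above the 1s threshold. The estimate is only as precise as the bucket boundaries allow.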

NeuronAgent Alerts

  • NeuronAgentServiceDown (Critical) - Service down > 1m
  • NeuronAgentHighErrorRate (Critical) - Error rate > 5% for 5m
  • NeuronAgentHighLatency (Warning) - P95 > 1s for 5m
  • NeuronAgentExecutionFailure (Critical) - >10 failures in 5m
  • NeuronAgentDatabaseConnectionIssue (Warning) - >5 errors in 5m

Infrastructure Alerts

  • HighCPUUsage (Warning) - CPU > 80% for 5m
  • HighMemoryUsage (Warning) - Memory > 85% for 5m
  • HighDiskUsage (Warning) - Disk > 85% for 5m
  • PrometheusTargetDown (Critical) - Target down > 2m
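
The "for 5m" qualifier in these rules means the condition must hold continuously for that long before the alert fires. Until then the alert is pending, and a single sample back under the threshold resets the timer. The sketch below models that behaviour for a simple greater-than threshold. It is a simplified model that treats each sample as one rule evaluation, and alert_state is a hypothetical name.

```python
def alert_state(samples, threshold, for_seconds):
    """Return "inactive", "pending", or "firing" after the last sample.

    samples: iterable of (timestamp_seconds, value) in time order.
    """
    active_since = None
    state = "inactive"
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            # Condition cleared: the hold timer starts over.
            active_since = None
            state = "inactive"
    return state
```

For HighCPUUsage (CPU > 80% for 5m), readings of 85, 90, 88, and 91 at 0, 60, 120, and 300 seconds end in the firing state, while a dip to 70 at any point returns the alert to inactive.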

📋 Complete Alert Rules: See alerts.yml for all alert rules with conditions and descriptions.

Additional Resources