Documentation branch: main (3.0.0-devel)
High Availability Architecture
Overview
This document describes the high availability (HA) architecture for the NeuronDB ecosystem in production.
Key Features:
- ✅ Load balancing with Nginx or HAProxy
- ✅ PostgreSQL HA with Patroni for automatic failover
- ✅ Connection pooling with PgBouncer
- ✅ Horizontal scaling for stateless services
- ✅ Automatic failover and recovery
- ✅ Disaster recovery with backups and WAL archiving
Architecture Diagram
                    ┌─────────────┐
                    │    Load     │
                    │  Balancer   │
                    │   (Nginx)   │
                    └──────┬──────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
   ┌────▼────┐        ┌────▼────┐        ┌────▼────┐
   │ Desktop │        │ Desktop │        │ Desktop │
   │  API 1  │        │  API 2  │        │  API 3  │
   └────┬────┘        └────┬────┘        └────┬────┘
        │                  │                  │
        └──────────────────┼──────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
   ┌────▼────┐        ┌────▼────┐        ┌────▼────┐
   │ Agent 1 │        │ Agent 2 │        │ Agent 3 │
   └────┬────┘        └────┬────┘        └────┬────┘
        │                  │                  │
        └──────────────────┼──────────────────┘
                           │
               ┌───────────▼───────────┐
               │  PostgreSQL Primary   │
               │    (with Patroni)     │
               └───────────┬───────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
   ┌────▼────┐        ┌────▼────┐        ┌────▼────┐
   │Replica 1│        │Replica 2│        │Replica 3│
   └─────────┘        └─────────┘        └─────────┘
Components
1. Load Balancer
Nginx or HAProxy for:
- Request distribution
- SSL termination
- Health checks
- Session affinity (if needed)
2. Application Layer
Stateless Services (can scale horizontally):
- NeuronDesktop API (2+ replicas)
- NeuronAgent (2+ replicas)
- NeuronDesktop Frontend (2+ replicas)
Stateful Services:
- NeuronMCP (1 replica, can be scaled if stateless)
3. Database Layer
PostgreSQL HA using Patroni:
- Primary node (read/write)
- 2+ replica nodes (read-only)
- Automatic failover
- Connection pooling (PgBouncer)
Setup
Step 1: PostgreSQL HA with Patroni
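# docker-compose.ha.yml
# Skeleton only: the stock postgres image does not include Patroni, so each
# database node needs an image that runs Patroni alongside PostgreSQL, plus a
# DCS (etcd, Consul, or ZooKeeper) that Patroni can reach.
services:
  postgres-primary:
    image: postgres:17
    environment:
      PATRONI_SCOPE: neurondb          # cluster name shared by all members
      PATRONI_NAME: postgres-primary   # unique member name
    volumes:
      - patroni-config:/etc/patroni
      - postgres-data:/var/lib/postgresql/data
  postgres-replica-1:
    image: postgres:17
    environment:
      PATRONI_SCOPE: neurondb
      PATRONI_NAME: postgres-replica-1
    depends_on:
      - postgres-primary
  patroni:
    image: patroni/patroni:latest
    environment:
      PATRONI_SCOPE: neurondb
      PATRONI_RESTAPI_LISTEN: 0.0.0.0:8008
# Named volumes used above must be declared at the top level.
volumes:
  patroni-config:
  postgres-data:
Step 2: Connection Pooling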
  pgbouncer:
    image: pgbouncer/pgbouncer:latest
    environment:
      DATABASES_HOST: postgres-primary
      DATABASES_PORT: 5432
      DATABASES_USER: neurondb
      DATABASES_PASSWORD: ${POSTGRES_PASSWORD}
      POOL_MODE: transaction
      MAX_CLIENT_CONN: 1000
      DEFAULT_POOL_SIZE: 25
Step 3: Load Balancer
# nginx.conf
# The upstream and server blocks below belong inside the http { } context.
upstream neurondesktop_api {
    least_conn;  # prefer the backend with the fewest active connections
    server neurondesk-api-1:8081;
    server neurondesk-api-2:8081;
    server neurondesk-api-3:8081;
}
# Defined for NeuronAgent traffic; no server block below routes to it yet.
upstream neuronagent {
    least_conn;
    server neuronagent-1:8080;
    server neuronagent-2:8080;
    server neuronagent-3:8080;
}
server {
    listen 80;
    server_name api.neurondb.example.com;
    location / {
        proxy_pass http://neurondesktop_api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
Failover Scenarios
Database Primary Failure
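- Patroni detects the primary failure
- Patroni elects a new primary from the replicas
- Client routing (DNS, VIP, or a proxy) is updated to point to the new primary; Patroni does not move DNS records or VIPs on its own, so a companion mechanism is needed
- Applications reconnect automatically, provided the client or pooler retries dropped connections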
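The steps above depend on something telling clients which node is currently the primary. The following is a minimal sketch, assuming each node exposes the Patroni REST API on port 8008 (as in PATRONI_RESTAPI_LISTEN above) and that its `/primary` endpoint answers HTTP 200 only on the node holding the leader lock. The helper names `patroni_is_primary` and `find_primary` are hypothetical, not part of NeuronDB or Patroni.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def patroni_is_primary(host: str, port: int = 8008, timeout: float = 2.0) -> bool:
    """Return True if the Patroni REST API on host reports it is the primary."""
    url = f"http://{host}:{port}/primary"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx answers raise HTTPError (a URLError subclass); unreachable
        # nodes raise URLError or a socket-level OSError.
        return False


def find_primary(
    nodes: Iterable[str],
    probe: Callable[[str], bool] = patroni_is_primary,
) -> Optional[str]:
    """Return the first node the probe reports as primary, or None if there is none."""
    for node in nodes:
        if probe(node):
            return node
    return None
```

A result of `None` is expected for a short window while an election is in progress, so callers should retry rather than treat it as fatal. In practice this probe usually lives in a proxy such as HAProxy rather than in application code.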
Application Node Failure
- Load balancer detects health check failure
- Removes node from pool
- Traffic routed to healthy nodes
- Auto-scaling can replace failed node
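The ejection logic described above can be sketched as a small bookkeeping class. This is an illustration of the idea only, not how Nginx or HAProxy implement it; the `BackendPool` name and the default threshold of three consecutive failures are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BackendPool:
    """Tracks which backends are eligible for traffic based on health probes."""

    backends: List[str]
    max_fails: int = 3  # consecutive failures before ejection (assumed value)
    _fails: Dict[str, int] = field(default_factory=dict)

    def record(self, backend: str, healthy: bool) -> None:
        """Record one health-check result for a backend."""
        if healthy:
            self._fails[backend] = 0  # a passing check re-admits the node
        else:
            self._fails[backend] = self._fails.get(backend, 0) + 1

    def healthy_backends(self) -> List[str]:
        """Return backends with fewer than max_fails consecutive failures."""
        return [b for b in self.backends if self._fails.get(b, 0) < self.max_fails]
```

Requiring several consecutive failures before ejecting a node avoids flapping on a single slow response, at the cost of a slightly longer detection time.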
Monitoring
Health Checks
- Application: /health endpoint
- Database: pg_isready
- Load balancer: TCP/HTTP checks
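For the database check, a monitoring script can wrap `pg_isready` and translate its exit status into a readable state. The sketch below assumes the `pg_isready` binary is on the PATH; the function names and the three-second timeout are illustrative choices.

```python
import subprocess

# Exit statuses documented for pg_isready.
PG_ISREADY_STATUS = {
    0: "accepting connections",
    1: "rejecting connections",
    2: "no response",
    3: "no attempt made",
}


def describe_pg_isready(returncode: int) -> str:
    """Map a pg_isready exit status to a human-readable description."""
    return PG_ISREADY_STATUS.get(returncode, f"unknown exit status {returncode}")


def check_database(host: str, port: int = 5432, timeout_s: int = 3) -> str:
    """Run pg_isready against host:port and return a readable status."""
    result = subprocess.run(
        ["pg_isready", "-h", host, "-p", str(port), "-t", str(timeout_s)],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is a result here, not an error
    )
    return describe_pg_isready(result.returncode)
```

Note that `pg_isready` only reports whether the server accepts connections; it does not say whether the node is a primary or a replica.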
Metrics
- Request rate per node
- Error rate per node
- Database connection pool usage
- Replication lag
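Replication lag can be measured in bytes of WAL as well as in seconds. As a sketch, assuming the two positions come from `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on a replica, the difference can be computed client-side as follows (PostgreSQL's `pg_wal_lsn_diff` performs the same subtraction on the server). The helper names are hypothetical.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_replay_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay (never negative)."""
    return max(0, parse_lsn(primary_lsn) - parse_lsn(replica_replay_lsn))
```

Byte lag is useful alongside time lag because an idle primary can make time-based measurements look worse than they are.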
Disaster Recovery
Backup Strategy
- Daily full backups
- Continuous WAL archiving
- Off-site backup storage (S3)
Recovery Time Objectives (RTO)
- Database failover: < 30 seconds
- Application recovery: < 5 minutes
- Full disaster recovery: < 1 hour
Recovery Point Objectives (RPO)
- Database: < 5 minutes (WAL archiving)
- Application: Near-zero (stateless)
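The database RPO above can be sanity-checked with simple arithmetic. The sketch below is a rough bound under stated assumptions: recovery uses archived WAL only, PostgreSQL's `archive_timeout` forces a segment switch at least that often, and `upload_s` is the time the archive command needs to ship one segment off-site. The function names and the example values are illustrative, not measurements.

```python
def worst_case_rpo_seconds(archive_timeout_s: float, upload_s: float) -> float:
    """Rough upper bound on the data-loss window when restoring from archived WAL."""
    return archive_timeout_s + upload_s


def meets_rpo(archive_timeout_s: float, upload_s: float, rpo_target_s: float = 300.0) -> bool:
    """Check the bound against a target; 300 s matches the 5-minute RPO above."""
    return worst_case_rpo_seconds(archive_timeout_s, upload_s) < rpo_target_s
```

For example, an `archive_timeout` of 60 seconds with a 20-second upload gives a bound of 80 seconds, which fits the 5-minute target; an `archive_timeout` of 300 seconds would not.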
Scaling
Horizontal Scaling
- Add more application replicas
- Add more database replicas
- Use read replicas for queries
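Using read replicas for queries means something has to decide where each statement goes. Here is a minimal sketch of that routing, assuming writes always go to the primary and reads rotate across replicas. The `ReadWriteRouter` class and the `postgres-replica-2` host are hypothetical, and the read-only test is deliberately naive.

```python
import itertools
from typing import Iterator, List


class ReadWriteRouter:
    """Sends writes to the primary and spreads reads across replicas round-robin."""

    def __init__(self, primary: str, replicas: List[str]) -> None:
        self.primary = primary
        # Fall back to the primary for reads when no replicas are configured.
        self._reads: Iterator[str] = itertools.cycle(replicas or [primary])

    def target_for(self, sql: str) -> str:
        """Pick a host for a statement; only statements starting with SELECT count as reads."""
        words = sql.lstrip().split(None, 1)
        is_read = bool(words) and words[0].upper() == "SELECT"
        return next(self._reads) if is_read else self.primary


router = ReadWriteRouter("postgres-primary", ["postgres-replica-1", "postgres-replica-2"])
```

A real router also needs to send `SELECT ... FOR UPDATE`, calls to functions that write, and reads that must see the caller's own recent writes to the primary, since replicas can lag.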
Vertical Scaling
- Increase database resources
- Increase application resources
- Optimize queries and indexes
Best Practices
- Use connection pooling: PgBouncer for database connections
- Monitor replication lag: Keep lag < 1 second
- Test failover regularly: Monthly failover drills
- Use health checks: All services should have health endpoints
- Implement circuit breakers: Prevent cascade failures
- Use idempotent operations: Handle retries gracefully
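To illustrate the circuit breaker recommendation, here is a minimal sketch of the pattern: after a number of consecutive failures the breaker rejects calls immediately, then allows one trial call once a cool-down has passed. The class name, thresholds, and injectable clock are assumptions made for the example, not a NeuronDB API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        """Run func unless the circuit is open; count failures to decide when to open."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while a dependency is down keeps request threads and pooled connections from piling up behind it, which is how one failing service otherwise drags down its callers. Retries through the breaker are only safe when the wrapped operation is idempotent.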