# High Availability Setup

A complete guide to setting up high availability for the SBM CRM Platform.
## High Availability Architecture

### Multi-AZ Deployment

```text
     Zone A                 Zone B
┌─────────────┐        ┌─────────────┐
│   ECS-01    │        │   ECS-02    │
│ API Server  │        │ API Server  │
└──────┬──────┘        └──────┬──────┘
       │                      │
       └───────────┬──────────┘
                   │
          ┌────────▼────────┐
          │ ELB (Multi-AZ)  │
          └────────┬────────┘
                   │
       ┌───────────┴──────────┐
       │                      │
       ▼                      ▼
┌─────────────┐        ┌─────────────┐
│ RDS Primary │        │ RDS Standby │
│  (Zone A)   │◄──────►│  (Zone B)   │
└─────────────┘        └─────────────┘
       │
       ▼
┌─────────────┐
│  DCS Redis  │
│  (HA Mode)  │
└─────────────┘
```
## Application Layer HA

### Load Balancer Configuration

```yaml
ELB Configuration:
  Type: Application Load Balancer
  Multi-AZ: Enabled
  Health Check:
    Protocol: HTTP
    Path: /health
    Interval: 30s
    Timeout: 5s
    Healthy Threshold: 2
    Unhealthy Threshold: 3
  Backend Servers:
    - Zone A: sbmcrm-api-01
    - Zone B: sbmcrm-api-02
  Session Persistence: Cookie-based
  Sticky Sessions: Enabled
```
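To illustrate how cookie-based stickiness interacts with round-robin balancing, here is a minimal sketch. The `SERVERID` cookie name and the routing logic are illustrative assumptions, not the ELB's actual internals:

```javascript
// Backend names match the ELB config above
const backends = ['sbmcrm-api-01', 'sbmcrm-api-02'];
let next = 0;

// Pick a backend for a request, honoring the stickiness cookie when present
function routeRequest(cookies) {
  // If the stickiness cookie names a known backend, reuse it (sticky session)
  if (cookies.SERVERID && backends.includes(cookies.SERVERID)) {
    return cookies.SERVERID;
  }
  // Otherwise fall back to round-robin; the ELB would then set the cookie
  const target = backends[next % backends.length];
  next += 1;
  return target;
}
```

A request without a cookie alternates between the two zones; once the cookie is set, subsequent requests stay on the same instance until it is removed from the backend group.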
### Auto-Scaling Configuration

```yaml
Auto-Scaling Group:
  Min Instances: 2
  Max Instances: 10
  Desired Capacity: 2
  Scaling Policies:
    Scale Up:
      Metric: CPU Utilization
      Threshold: "> 70%"
      Duration: 5 minutes
      Action: Add 1 instance
    Scale Down:
      Metric: CPU Utilization
      Threshold: "< 30%"
      Duration: 10 minutes
      Action: Remove 1 instance
  Health Check:
    Type: ELB
    Grace Period: 300s
```
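The scale-up/scale-down rules above can be sketched as a pure decision function. This is a simplified model of the policy for illustration, not the actual auto-scaling service:

```javascript
// Decide a scaling action from average CPU (%) and how long (minutes)
// it has been sustained, respecting the min/max bounds from the config.
function scalingAction(cpuPercent, sustainedMinutes, currentInstances) {
  const MIN = 2;
  const MAX = 10;
  if (cpuPercent > 70 && sustainedMinutes >= 5 && currentInstances < MAX) {
    return { action: 'scale-up', add: 1 };
  }
  if (cpuPercent < 30 && sustainedMinutes >= 10 && currentInstances > MIN) {
    return { action: 'scale-down', remove: 1 };
  }
  return { action: 'none' };
}
```

Note that the duration requirement (5 or 10 minutes sustained) prevents flapping on short CPU spikes, and the min-instances floor keeps the group multi-AZ even when idle.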
### Health Check Endpoint

```javascript
// Health check implementation.
// checkDatabase/checkRedis/checkExternalApis each resolve to an object
// of the form { status: 'ok' | 'error', ... }.
app.get('/health', async (req, res) => {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    checks: {
      database: await checkDatabase(),
      redis: await checkRedis(),
      externalApis: await checkExternalApis()
    }
  };
  const isHealthy = Object.values(health.checks)
    .every((check) => check.status === 'ok');
  health.status = isHealthy ? 'healthy' : 'unhealthy';
  res.status(isHealthy ? 200 : 503).json(health);
});
```
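The aggregation step in the handler can be factored into a pure helper, which keeps it unit-testable independently of Express. `summarizeHealth` is an illustrative name, not part of the existing code:

```javascript
// Given the checks object from the handler above, compute the overall
// status and the HTTP status code the endpoint should return.
function summarizeHealth(checks) {
  const healthy = Object.values(checks).every((c) => c.status === 'ok');
  return {
    status: healthy ? 'healthy' : 'unhealthy',
    httpStatus: healthy ? 200 : 503
  };
}
```

Returning 503 on any failed dependency is what makes the ELB health check (path `/health`, unhealthy threshold 3) pull the instance out of rotation.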
## Database HA

### RDS High Availability

```yaml
RDS Configuration:
  High Availability: Enabled
  Replication Mode: Async
  Primary Instance:
    Zone: ap-southeast-1a
    Instance Class: rds.pg.c2.xlarge
  Standby Instance:
    Zone: ap-southeast-1b
    Instance Class: rds.pg.c2.xlarge
  Automatic Failover: Enabled
  Failover Time: "< 30 seconds"
  Backup:
    Automated: Daily at 02:00
    Retention: 30 days
    Point-in-Time Recovery: Enabled
```
### Read Replicas

```yaml
Read Replicas:
  - Zone A: Read Replica 1
  - Zone B: Read Replica 2
Use Cases:
  - Analytics queries
  - Reporting
  - Read-heavy workloads
Connection Strings:
  Primary: postgres-primary.internal
  Replicas: postgres-replica.internal
```
### Connection Pooling

```javascript
// Use connection pooling with separate pools for the primary and read replicas
const { Pool } = require('pg');

// Primary pool for writes and transactional reads
const pool = new Pool({
  host: process.env.DB_PRIMARY_HOST,
  port: 5432,
  database: 'sbmcrm_production',
  user: 'sbmcrm',
  password: process.env.DB_PASSWORD,
  max: 20,                        // maximum connections in the pool
  min: 5,                         // connections kept open when idle
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000
});

// Read replica pool for read operations
const readPool = new Pool({
  host: process.env.DB_REPLICA_HOST,
  // ... same config
});
```
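One way to decide which pool a query should use is to route plain `SELECT`s to the replicas and everything else (writes, transactions, `SELECT ... FOR UPDATE`) to the primary. A minimal sketch, assuming routing on the SQL text is acceptable for your workload:

```javascript
// Return which pool a statement should run on: 'replica' for plain
// read-only SELECTs, 'primary' for anything that writes or locks rows.
function poolFor(sql) {
  const q = sql.trim().toUpperCase();
  if (q.startsWith('SELECT') && !q.includes('FOR UPDATE')) {
    return 'replica';
  }
  return 'primary';
}
```

Because replication is asynchronous, reads routed to a replica may lag the primary slightly; read-your-own-writes flows should stay on the primary pool regardless of statement type.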
## Cache HA

### Redis High Availability

```yaml
DCS Redis Configuration:
  Mode: Master-Standby
  Replication: Enabled
  Master:
    Zone: ap-southeast-1a
    Memory: 8GB
  Standby:
    Zone: ap-southeast-1b
    Memory: 8GB
  Automatic Failover: Enabled
  Persistence:
    RDB: Enabled (every 1 hour)
    AOF: Enabled
```
### Redis Sentinel (Alternative)

```yaml
Redis Sentinel:
  Sentinels:
    - sentinel-01 (Zone A)
    - sentinel-02 (Zone B)
    - sentinel-03 (Zone C)
  Quorum: 2
  Down After Milliseconds: 5000
  Failover Timeout: 10000
```
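The quorum setting above means the master is only marked objectively down, and a failover started, once at least two of the three sentinels agree it is unreachable. A sketch of that rule:

```javascript
// Each sentinel reports whether it considers the master down
// (subjectively down after 5000 ms without a valid reply, per the config).
// Failover can proceed only once the number of agreeing sentinels
// reaches the configured quorum.
function isObjectivelyDown(sentinelVotes, quorum) {
  const agreeing = sentinelVotes.filter((v) => v.masterDown).length;
  return agreeing >= quorum;
}
```

Spreading the three sentinels across three zones (as configured) means losing any single zone still leaves a quorum able to decide on failover.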
## Storage HA

### OBS Multi-Region Replication

```yaml
OBS Configuration:
  Primary Region: ap-southeast-1
  Replication:
    Enabled: Yes
    Target Region: ap-southeast-2
    Replication Rule: All objects
  Versioning: Enabled
  Lifecycle:
    - Transition to Archive: After 90 days
    - Delete: After 365 days
```
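The lifecycle rules translate to a simple decision on object age. A sketch, assuming "after N days" means an inclusive threshold:

```javascript
// Map an object's age in days onto the lifecycle rules above
function lifecycleAction(ageDays) {
  if (ageDays >= 365) return 'delete';
  if (ageDays >= 90) return 'transition-to-archive';
  return 'keep';
}
```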
## Disaster Recovery

### Backup Strategy

```yaml
Backups:
  Database:
    Frequency: Daily at 02:00
    Retention: 30 days
    Cross-Region: Enabled
  Files:
    Frequency: Daily at 03:00
    Retention: 90 days
    Storage: OBS (Archive)
  Configuration:
    Frequency: On change
    Storage: Git repository
```
### Recovery Procedures

#### Database Failover

```bash
# Automatic failover (RDS): the standby is automatically promoted to
# primary, and the application reconnects automatically.

# Manual failover
huaweicloud rds failover \
  --instance-id sbmcrm-postgresql \
  --node-id standby-node-id
```
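During the sub-30-second failover window, clients see transient connection errors. A common client-side pattern is to retry with capped exponential backoff so the application rides out the promotion without overwhelming the new primary. A sketch of the delay schedule (base and cap values are illustrative, not from the platform config):

```javascript
// Delay before retry number `attempt` (0-based): doubles each attempt,
// capped so a long failover never produces multi-minute waits.
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

With these defaults the schedule is 500 ms, 1 s, 2 s, 4 s, 8 s, 8 s, ... which covers the advertised < 30 s failover time within a handful of attempts.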
#### Application Failover

```bash
# Remove the unhealthy instance from the ELB backend group;
# the ELB automatically routes traffic to the remaining healthy instances,
# and auto-scaling launches a replacement instance.
```
## Monitoring & Alerting

### Key Metrics

```yaml
Metrics to Monitor:
  Application:
    - Request rate
    - Error rate
    - Response time (P50, P95, P99)
    - Active connections
  Infrastructure:
    - CPU utilization
    - Memory usage
    - Disk I/O
    - Network throughput
  Database:
    - Connection count
    - Query performance
    - Replication lag
    - Backup status
  Cache:
    - Hit rate
    - Memory usage
    - Connection count
    - Replication status
```
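The response-time percentiles listed above (P50, P95, P99) can be computed from a latency sample with the nearest-rank method. A minimal sketch:

```javascript
// Nearest-rank percentile over a sample of latencies in milliseconds:
// sort ascending, then take the value at rank ceil(p/100 * n).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

In practice a monitoring agent computes these over a sliding window (e.g. the last minute of requests); P95 and P99 surface tail latency that an average would hide.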
### Alert Rules

```yaml
Critical Alerts:
  - Database failover detected
  - All application servers down
  - High error rate (> 5%)
  - Database replication lag > 10s
Warning Alerts:
  - CPU > 70%
  - Memory > 80%
  - Disk usage > 85%
  - Response time > 1s (P95)
```
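A sketch of evaluating the warning thresholds above against a metrics sample (the field names are illustrative):

```javascript
// Return the list of warning alerts triggered by a metrics sample
function warningAlerts(m) {
  const alerts = [];
  if (m.cpuPercent > 70) alerts.push('CPU > 70%');
  if (m.memoryPercent > 80) alerts.push('Memory > 80%');
  if (m.diskPercent > 85) alerts.push('Disk usage > 85%');
  if (m.p95ResponseMs > 1000) alerts.push('Response time > 1s (P95)');
  return alerts;
}
```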
## Testing HA

### Failover Testing

**Test database failover:**

1. Stop the primary database instance.
2. Verify automatic failover to the standby.
3. Verify the application reconnects.
4. Check data consistency.

**Test application failover:**

1. Stop one application server.
2. Verify traffic routes to the remaining server.
3. Verify auto-scaling launches a replacement.
4. Check application health.
### Load Testing

- Use load testing tools to generate realistic traffic.
- Verify the system handles peak load.
- Verify auto-scaling works correctly.
- Check response times under load.
## Best Practices

- **Multi-AZ Deployment**: Always deploy across multiple availability zones.
- **Health Checks**: Implement comprehensive health checks at every layer.
- **Auto-Scaling**: Configure auto-scaling for dynamic load.
- **Monitoring**: Monitor all critical components.
- **Testing**: Regularly test failover procedures.
- **Documentation**: Document all HA configurations.
- **Backups**: Maintain regular backups with cross-region replication.
## Cost Considerations

### HA Costs

- **Multi-AZ**: roughly twice the single-AZ cost.
- **Read Replicas**: additional database instance costs.
- **Auto-Scaling**: pay only for actual usage.
- **Monitoring**: minimal additional cost.

### Cost Optimization

- Use reserved instances for base capacity.
- Scale down during off-peak hours.
- Use spot instances for non-critical workloads.
- Optimize database instance sizes.
## Next Steps

- Review Cloud Architecture for setup details
- Use Terraform for infrastructure provisioning
- Configure Networking for network setup