High Availability Setup

Complete guide for setting up high availability for the SBM CRM Platform.

High Availability Architecture

Multi-AZ Deployment

    Zone A                      Zone B
┌─────────────┐             ┌─────────────┐
│   ECS-01    │             │   ECS-02    │
│ API Server  │             │ API Server  │
└──────┬──────┘             └──────┬──────┘
       │                           │
       └─────────────┬─────────────┘
                     │
            ┌────────▼────────┐
            │ ELB (Multi-AZ)  │
            └────────┬────────┘
                     │
       ┌─────────────┴─────────────┐
       │                           │
       ▼                           ▼
┌─────────────┐             ┌─────────────┐
│ RDS Primary │             │ RDS Standby │
│  (Zone A)   │◄───────────►│  (Zone B)   │
└─────────────┘             └─────────────┘

            ┌─────────────────┐
            │    DCS Redis    │
            │    (HA Mode)    │
            └─────────────────┘

Application Layer HA

Load Balancer Configuration

ELB Configuration:
  Type: Application Load Balancer
  Multi-AZ: Enabled
  Health Check:
    Protocol: HTTP
    Path: /health
    Interval: 30s
    Timeout: 5s
    Healthy Threshold: 2
    Unhealthy Threshold: 3

  Backend Servers:
    - Zone A: sbmcrm-api-01
    - Zone B: sbmcrm-api-02

  Session Persistence: Cookie-based
  Sticky Sessions: Enabled
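
The interval and thresholds above determine how long a failed backend keeps receiving traffic. The sketch below works through that arithmetic. The function name `detectionWindowSeconds` is hypothetical, and the model assumes a state change needs `threshold` consecutive probe results at one probe per interval, so treat the numbers as estimates rather than guarantees from the ELB.

```javascript
// Rough estimate of how long the load balancer needs to change a backend's state.
// Assumption: a state change requires `threshold` consecutive probe results,
// one probe per interval. Real ELB timing may differ slightly.
function detectionWindowSeconds(intervalSec, threshold) {
  return intervalSec * threshold;
}

// Values copied from the health check configuration above.
const healthCheck = { intervalSec: 30, healthyThreshold: 2, unhealthyThreshold: 3 };

const timeToUnhealthy = detectionWindowSeconds(
  healthCheck.intervalSec,
  healthCheck.unhealthyThreshold
);
const timeToHealthy = detectionWindowSeconds(
  healthCheck.intervalSec,
  healthCheck.healthyThreshold
);

console.log(`~${timeToUnhealthy}s to mark a backend unhealthy`); // 30 * 3 = 90
console.log(`~${timeToHealthy}s to mark a backend healthy again`); // 30 * 2 = 60
```

If roughly 90 seconds of partial errors is too long for your workload, a shorter interval narrows the window at the cost of more probe traffic.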

Auto-Scaling Configuration

Auto-Scaling Group:
  Min Instances: 2
  Max Instances: 10
  Desired Capacity: 2

  Scaling Policies:
    Scale Up:
      Metric: CPU Utilization
      Threshold: > 70%
      Duration: 5 minutes
      Action: Add 1 instance

    Scale Down:
      Metric: CPU Utilization
      Threshold: < 30%
      Duration: 10 minutes
      Action: Remove 1 instance

  Health Check:
    Type: ELB
    Grace Period: 300s
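
To make the scaling policies concrete, the following sketch models the decision they describe. It is an illustration of the rules only, not how the cloud auto-scaling service is implemented. The function name `desiredCapacity` and the one-sample-per-minute input shape are assumptions.

```javascript
// Scaling policy values taken from the configuration above.
const policy = {
  min: 2,
  max: 10,
  scaleUp: { cpuAbove: 70, sustainedMinutes: 5, add: 1 },
  scaleDown: { cpuBelow: 30, sustainedMinutes: 10, remove: 1 }
};

// cpuSamples: one average CPU percentage per minute, oldest first
// (hypothetical input shape). Returns the instance count to move to.
function desiredCapacity(current, cpuSamples, p = policy) {
  // True when the most recent n samples all satisfy the predicate.
  const sustained = (n, predicate) =>
    cpuSamples.length >= n && cpuSamples.slice(-n).every(predicate);

  if (sustained(p.scaleUp.sustainedMinutes, (cpu) => cpu > p.scaleUp.cpuAbove)) {
    return Math.min(p.max, current + p.scaleUp.add);
  }
  if (sustained(p.scaleDown.sustainedMinutes, (cpu) => cpu < p.scaleDown.cpuBelow)) {
    return Math.max(p.min, current - p.scaleDown.remove);
  }
  return current;
}

console.log(desiredCapacity(2, [80, 85, 90, 75, 72])); // 3: CPU above 70% for 5 minutes
console.log(desiredCapacity(2, Array(10).fill(10)));   // 2: already at the minimum
```

The longer scale-down duration (10 minutes against 5) makes the group slower to shrink than to grow, which helps avoid flapping when load hovers near a threshold.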

Health Check Endpoint

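// Health check implementation
app.get('/health', async (req, res) => {
  const timestamp = new Date().toISOString();
  try {
    // Run the dependency checks in parallel so a slow one does not delay the rest
    const [database, redis, externalApis] = await Promise.all([
      checkDatabase(),
      checkRedis(),
      checkExternalApis()
    ]);
    const checks = { database, redis, externalApis };

    const isHealthy = Object.values(checks)
      .every(check => check.status === 'ok');

    // Report the real state in the body, not only in the status code
    res.status(isHealthy ? 200 : 503).json({
      status: isHealthy ? 'healthy' : 'unhealthy',
      timestamp,
      checks
    });
  } catch (err) {
    // A check that throws counts as a failed health check
    res.status(503).json({ status: 'unhealthy', timestamp, error: err.message });
  }
});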
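
The handler above calls `checkDatabase()`, `checkRedis()`, and `checkExternalApis()` without defining them. One possible shape for those helpers is sketched below. `runCheck` is a hypothetical wrapper, not part of any library. It turns any probe into a `{ status }` object and enforces a per-check timeout, so the endpoint can answer inside the ELB's 5-second health check timeout.

```javascript
// Wraps a dependency probe so it always resolves to { status: 'ok' } or
// { status: 'error', message } within timeoutMs, and never rejects.
function runCheck(probe, timeoutMs = 2000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  // Promise.resolve().then(probe) also captures synchronous throws from probe.
  return Promise.race([Promise.resolve().then(probe), timeout])
    .then(() => ({ status: 'ok' }))
    .catch((err) => ({ status: 'error', message: err.message }))
    .finally(() => clearTimeout(timer));
}

// Hypothetical usage: `pool` is a pg Pool, `redisClient` a connected Redis client.
// const checkDatabase = () => runCheck(() => pool.query('SELECT 1'));
// const checkRedis = () => runCheck(() => redisClient.ping());
```

Because the ELB removes an instance whenever `/health` returns 503, consider whether an outage of an external API should really take every API server out of rotation. Reporting that check without letting it affect the status code may be the safer choice.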

Database HA

RDS High Availability

RDS Configuration:
  High Availability: Enabled
  Replication Mode: Async

  Primary Instance:
    Zone: ap-southeast-1a
    Instance Class: rds.pg.c2.xlarge

  Standby Instance:
    Zone: ap-southeast-1b
    Instance Class: rds.pg.c2.xlarge
    Automatic Failover: Enabled
    Failover Time: < 30 seconds

  Backup:
    Automated: Daily at 02:00
    Retention: 30 days
    Point-in-Time Recovery: Enabled

Read Replicas

Read Replicas:
  - Zone A: Read Replica 1
  - Zone B: Read Replica 2

Use Cases:
  - Analytics queries
  - Reporting
  - Read-heavy workloads

Connection String:
  Primary: postgres-primary.internal
  Replicas: postgres-replica.internal

Connection Pooling

// Use connection pooling with read replicas
const { Pool } = require('pg');

// Primary pool for writes and consistency-sensitive reads
const pool = new Pool({
  host: process.env.DB_PRIMARY_HOST,
  port: 5432,
  database: 'sbmcrm_production',
  user: 'sbmcrm',
  password: process.env.DB_PASSWORD,
  max: 20,
  min: 5,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000
});

// Read replica pool for read operations
const readPool = new Pool({
  host: process.env.DB_REPLICA_HOST,
  // ... same config
});
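
With two pools in place, each query has to be sent to the right one. The sketch below shows one way to do that. `createDb` and the `fresh` option are hypothetical names, and the `customers` table in the usage comment is only an example. The helper accepts any objects exposing `query(text, params)`, which includes the `pool` and `readPool` defined above.

```javascript
// Routes queries to a pool: writes always go to the primary,
// reads go to the replica pool unless fresh data is required.
function createDb(primary, replica) {
  return {
    write: (text, params) => primary.query(text, params),
    // Replicas can lag behind the primary. Pass { fresh: true } to read
    // from the primary when read-your-writes consistency matters.
    read: (text, params, { fresh = false } = {}) =>
      (fresh ? primary : replica).query(text, params)
  };
}

// Hypothetical usage with the pools defined above:
// const db = createDb(pool, readPool);
// await db.write('UPDATE customers SET name = $1 WHERE id = $2', ['Ada', 42]);
// const { rows } = await db.read(
//   'SELECT * FROM customers WHERE id = $1', [42], { fresh: true }
// );
```

Keeping the routing in one place makes it easy to fall back to the primary for every read if the replicas become unavailable or lag too far behind.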

Cache HA

Redis High Availability

DCS Redis Configuration:
  Mode: Master-Standby
  Replication: Enabled

  Master:
    Zone: ap-southeast-1a
    Memory: 8GB

  Standby:
    Zone: ap-southeast-1b
    Memory: 8GB
    Automatic Failover: Enabled

  Persistence:
    RDB: Enabled (every 1 hour)
    AOF: Enabled

Redis Sentinel (Alternative)

Redis Sentinel:
  Sentinels:
    - sentinel-01 (Zone A)
    - sentinel-02 (Zone B)
    - sentinel-03 (Zone C)

  Quorum: 2
  Down After Milliseconds: 5000
  Failover Timeout: 10000
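
The quorum setting is easier to reason about with a small model. In Redis Sentinel, `quorum` Sentinels must agree that the master is down, and a majority of all Sentinels must then authorize the failover. The sketch below encodes those two conditions. The function name `canFailover` is hypothetical.

```javascript
// Can a Sentinel deployment still fail over with `reachable` Sentinels up?
//  1. At least `quorum` Sentinels must agree the master is down.
//  2. A majority of all Sentinels must authorize the failover.
function canFailover(total, quorum, reachable) {
  const majority = Math.floor(total / 2) + 1;
  return reachable >= quorum && reachable >= majority;
}

console.log(canFailover(3, 2, 3)); // true: all Sentinels up
console.log(canFailover(3, 2, 2)); // true: one zone lost
console.log(canFailover(3, 2, 1)); // false: no majority left
```

Placing the third Sentinel in a third zone is what lets this layout survive the loss of either Zone A or Zone B.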

Storage HA

OBS Multi-Region Replication

OBS Configuration:
  Primary Region: ap-southeast-1
  Replication:
    Enabled: Yes
    Target Region: ap-southeast-2
    Replication Rule: All objects

  Versioning: Enabled
  Lifecycle:
    - Transition to Archive: After 90 days
    - Delete: After 365 days
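
As an illustration of the lifecycle rules, the sketch below computes which state an object would be in at a given age. It models the policy only and does not call OBS. The function name `lifecycleState` is hypothetical, and treating "after N days" as "age of at least N whole days" is an assumption.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Lifecycle thresholds from the configuration above.
const lifecycle = { archiveAfterDays: 90, deleteAfterDays: 365 };

// Returns 'standard', 'archive', or 'deleted' for an object created at
// `createdAt`, evaluated at `now` (both Date objects).
function lifecycleState(createdAt, now = new Date(), rules = lifecycle) {
  const ageDays = Math.floor((now - createdAt) / DAY_MS);
  if (ageDays >= rules.deleteAfterDays) return 'deleted';
  if (ageDays >= rules.archiveAfterDays) return 'archive';
  return 'standard';
}

const created = new Date('2024-01-01T00:00:00Z');
console.log(lifecycleState(created, new Date('2024-02-01T00:00:00Z'))); // standard
console.log(lifecycleState(created, new Date('2024-06-01T00:00:00Z'))); // archive
```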

Disaster Recovery

Backup Strategy

Backups:
  Database:
    Frequency: Daily at 02:00
    Retention: 30 days
    Cross-Region: Enabled

  Files:
    Frequency: Daily at 03:00
    Retention: 90 days
    Storage: OBS (Archive)

  Configuration:
    Frequency: On change
    Storage: Git repository

Recovery Procedures

Database Failover

# Automatic failover (RDS)
# Standby automatically promotes to primary
# Application reconnects automatically

# Manual failover
huaweicloud rds failover \
  --instance-id sbmcrm-postgresql \
  --node-id standby-node-id
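
Automatic reconnection still leaves a short window in which queries fail while the standby is promoted. One way for the application to ride through that window is to retry with exponential backoff, sketched below. `withRetry` and its options are hypothetical names. The defaults are chosen so that the total wait (0.5 + 1 + 2 + 4 + 8 + 8 + 8 = 31.5 seconds) slightly exceeds the 30-second failover target stated earlier.

```javascript
// Retries an async operation with exponential backoff.
// Intended for the failover window, when connections to the old primary
// drop until the standby has been promoted.
async function withRetry(
  operation,
  { retries = 7, baseDelayMs = 500, maxDelayMs = 8000 } = {}
) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // out of attempts
      // Delay doubles each attempt, capped at maxDelayMs.
      const delay = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Hypothetical usage with a pg Pool named `pool`:
// const result = await withRetry(() => pool.query('SELECT now()'));
```

This sketch retries on every error. In practice you would likely retry only connection-level errors, and only wrap operations that are safe to repeat.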

Application Failover

# Remove unhealthy instance from ELB
# ELB automatically routes traffic to healthy instances
# Auto-scaling launches replacement instance

Monitoring & Alerting

Key Metrics

Metrics to Monitor:
  Application:
    - Request rate
    - Error rate
    - Response time (P50, P95, P99)
    - Active connections

  Infrastructure:
    - CPU utilization
    - Memory usage
    - Disk I/O
    - Network throughput

  Database:
    - Connection count
    - Query performance
    - Replication lag
    - Backup status

  Cache:
    - Hit rate
    - Memory usage
    - Connection count
    - Replication status
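
For the response-time percentiles, a short worked example may help. The sketch below uses the nearest-rank method, which is one of several common percentile definitions, so a monitoring system may report slightly different values. The function name and the sample latencies are made up for illustration.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Illustrative latencies in milliseconds, including one slow outlier.
const latenciesMs = [120, 95, 110, 105, 2300, 130, 100, 115, 90, 125];

console.log(percentile(latenciesMs, 50)); // 110
console.log(percentile(latenciesMs, 95)); // 2300
```

The median barely registers the single slow request, while P95 exposes it. This is why the list above tracks tail percentiles as well as P50.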

Alert Rules

Critical Alerts:
  - Database failover detected
  - All application servers down
  - High error rate (> 5%)
  - Database replication lag > 10s

Warning Alerts:
  - CPU > 70%
  - Memory > 80%
  - Disk usage > 85%
  - Response time > 1s (P95)
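
The threshold-based rules above can be expressed as data and evaluated against a metrics snapshot. The sketch below is an illustration only, not the configuration format of any monitoring product, and the metric names such as `errorRatePct` are hypothetical. Event-style alerts (failover detected, all servers down) are left out because they are not simple thresholds.

```javascript
// Alert thresholds from the rules above; metric names are hypothetical.
const rules = [
  { name: 'High error rate', severity: 'critical', metric: 'errorRatePct', above: 5 },
  { name: 'Replication lag', severity: 'critical', metric: 'replicationLagSec', above: 10 },
  { name: 'CPU', severity: 'warning', metric: 'cpuPct', above: 70 },
  { name: 'Memory', severity: 'warning', metric: 'memoryPct', above: 80 },
  { name: 'Disk usage', severity: 'warning', metric: 'diskPct', above: 85 },
  { name: 'P95 response time', severity: 'warning', metric: 'p95ResponseSec', above: 1 }
];

// Returns the rules that fire for a metrics snapshot.
// Metrics missing from the snapshot never fire.
function evaluateAlerts(metrics, ruleSet = rules) {
  return ruleSet.filter((rule) => metrics[rule.metric] > rule.above);
}

const firing = evaluateAlerts({ errorRatePct: 6.2, cpuPct: 65, memoryPct: 91 });
console.log(firing.map((rule) => `${rule.severity}: ${rule.name}`));
// [ 'critical: High error rate', 'warning: Memory' ]
```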

Testing HA

Failover Testing

# Test database failover
1. Stop primary database instance
2. Verify automatic failover to standby
3. Verify application reconnection
4. Check data consistency

# Test application failover
1. Stop one application server
2. Verify traffic routes to remaining server
3. Verify auto-scaling launches replacement
4. Check application health
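
During either test it helps to measure how long the service was actually unavailable, for example by probing `/health` at a fixed interval and recording each result. The sketch below computes the longest outage from such a log. The function name `longestOutageMs` and the `{ at, ok }` record shape are assumptions.

```javascript
// results: [{ at: <ms timestamp>, ok: <boolean> }, ...] in chronological
// order, for example one entry per /health probe taken during the test.
// Returns the longest span from a first failed probe to the next success.
// An outage still in progress at the end of the log is not counted.
function longestOutageMs(results) {
  let outageStart = null;
  let longest = 0;
  for (const { at, ok } of results) {
    if (!ok && outageStart === null) outageStart = at;
    if (ok && outageStart !== null) {
      longest = Math.max(longest, at - outageStart);
      outageStart = null;
    }
  }
  return longest;
}

// Probes every 5 seconds; three consecutive failures during the failover.
const probeLog = [
  { at: 0, ok: true },
  { at: 5000, ok: true },
  { at: 10000, ok: false },
  { at: 15000, ok: false },
  { at: 20000, ok: false },
  { at: 25000, ok: true }
];

console.log(longestOutageMs(probeLog)); // 15000
```

Comparing the result with the 30-second failover target gives the test a clear pass or fail criterion. The measurement is only as precise as the probe interval.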

Load Testing

# Use load testing tools
# Verify system handles peak load
# Verify auto-scaling works correctly
# Check response times under load

Best Practices

  1. Multi-AZ Deployment: Always deploy across multiple zones
  2. Health Checks: Implement comprehensive health checks
  3. Auto-Scaling: Configure auto-scaling for dynamic load
  4. Monitoring: Monitor all critical components
  5. Testing: Regularly test failover procedures
  6. Documentation: Document all HA configurations
  7. Backup: Maintain regular backups with cross-region replication

Cost Considerations

HA Costs

  • Multi-AZ: ~2x single-AZ cost
  • Read Replicas: Additional database costs
  • Auto-Scaling: Pay for actual usage
  • Monitoring: Minimal additional cost

Cost Optimization

  • Use reserved instances for base capacity
  • Scale down during off-peak hours
  • Use spot instances for non-critical workloads
  • Optimize database instance sizes

Next Steps

  1. Review Cloud Architecture for setup details
  2. Use Terraform for infrastructure provisioning
  3. Configure Networking for network setup