When systems go down, the costs compound fast. Gartner estimates that the average cost of IT downtime is $5,600 per minute—that's over $300,000 per hour. For mission-critical government systems processing benefits, healthcare claims, or financial transactions, the impact extends beyond dollars to public trust and safety.
High availability isn't about preventing all failures. It's about designing systems that continue operating despite failures.
## Understanding Availability Metrics
Availability is measured in "nines"—the percentage of time a system is operational.
### The Nines of Availability
| Availability | Annual Downtime | Monthly Downtime | Common Use |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | Internal tools |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | Business applications |
| 99.95% | 4.38 hours | 21.9 minutes | E-commerce |
| 99.99% (four nines) | 52.6 minutes | 4.4 minutes | Financial systems |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | Critical infrastructure |
Each additional nine requires exponentially more engineering effort and cost. The goal isn't maximum availability—it's appropriate availability for your use case.
### Calculating Required Availability

Work backwards from business impact:

```
Required Availability = 1 - (Acceptable Downtime ÷ Total Time)
```

Example: a payment processing system can tolerate 4 hours of downtime per year.

```
Required Availability = 1 - (4 hours ÷ 8,760 hours/year)
                      = 1 - 0.000457
                      = 0.999543
                      ≈ 99.95%
```

This requires engineering for at least 99.95% availability, between three and four nines.
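This back-of-the-envelope math is easy to script. A small TypeScript helper (function names are my own) converts in both directions:

```typescript
// Convert between availability targets and allowed downtime.
// Helper names are illustrative, not from any particular library.

const HOURS_PER_YEAR = 8760;

/** Availability required to stay within `downtimeHours` of downtime per year. */
function requiredAvailability(downtimeHoursPerYear: number): number {
  return 1 - downtimeHoursPerYear / HOURS_PER_YEAR;
}

/** Allowed downtime per year (in hours) for a given availability target. */
function allowedDowntimeHours(availability: number): number {
  return (1 - availability) * HOURS_PER_YEAR;
}

console.log(requiredAvailability(4));      // tolerance of 4 h/yr
console.log(allowedDowntimeHours(0.9999)); // "four nines"
```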
## The Anatomy of High Availability
High availability emerges from three principles applied at every layer:
### 1. Redundancy
No single component should be a single point of failure (SPOF). Redundancy strategies vary by component type:
| Component | Redundancy Strategy |
|---|---|
| Compute | Multiple instances across availability zones |
| Storage | Replication (synchronous or asynchronous) |
| Network | Multiple paths, load balancers, DNS failover |
| Database | Primary-replica, multi-master, or distributed |
| Power | UPS, generators, multiple utility feeds |
### 2. Fault Detection
You can't recover from failures you don't detect. Implement monitoring at multiple levels:
```typescript
// Health check hierarchy
interface HealthCheck {
  // Component is running
  liveness: () => Promise<boolean>;
  // Component can serve traffic
  readiness: () => Promise<boolean>;
  // Component is performing well
  performance: () => Promise<HealthMetrics>;
}

interface HealthMetrics {
  queryLatencyP99: number;
  connectionsUsed: number;
  connectionsMax: number;
  healthy: boolean;
}

// Example: database health check (`db` is an assumed client exposing
// query, getReplicationLag, and getMetrics)
const databaseHealth: HealthCheck = {
  liveness: async () => {
    try {
      await db.query('SELECT 1');
      return true;
    } catch {
      return false;
    }
  },
  readiness: async () => {
    const replicationLag = await db.getReplicationLag();
    return replicationLag < 1000; // Less than 1 second of lag (ms)
  },
  performance: async () => {
    const metrics = await db.getMetrics();
    return {
      queryLatencyP99: metrics.latency.p99,
      connectionsUsed: metrics.connections.active,
      connectionsMax: metrics.connections.max,
      healthy:
        metrics.latency.p99 < 100 &&
        metrics.connections.active < metrics.connections.max * 0.8
    };
  }
};
```
### 3. Automated Recovery
Human response times are measured in minutes. Automated recovery happens in seconds.
```typescript
// Automated failover controller
const sleep = (ms: number) => new Promise<void>(res => setTimeout(res, ms));

class FailoverController {
  private primaryHealthy = true;
  private lastFailoverTime = 0;
  private readonly minFailoverInterval = 60000; // Prevent flapping (ms)

  async monitor(): Promise<void> {
    while (true) {
      const health = await this.checkPrimaryHealth();
      if (!health.healthy && this.primaryHealthy) {
        await this.initiateFailover(health.reason);
      } else if (health.healthy && !this.primaryHealthy) {
        await this.considerFailback();
      }
      await sleep(5000); // Check every 5 seconds
    }
  }

  private async initiateFailover(reason: string): Promise<void> {
    const now = Date.now();
    if (now - this.lastFailoverTime < this.minFailoverInterval) {
      console.warn('Failover suppressed: too recent');
      return;
    }
    console.log(`Initiating failover: ${reason}`);
    // 1. Promote replica to primary
    await this.promoteReplica();
    // 2. Update load balancer
    await this.updateTrafficRouting();
    // 3. Notify operations team
    await this.sendAlert('FAILOVER_COMPLETED', { reason });
    this.primaryHealthy = false;
    this.lastFailoverTime = now;
  }

  // checkPrimaryHealth, considerFailback, promoteReplica,
  // updateTrafficRouting, and sendAlert are environment-specific
  // and omitted here.
}
```
## Architecture Patterns for High Availability
### Active-Passive (Warm Standby)
The simplest HA pattern: one active system with a standby ready to take over.
```
    [Traffic]
        │
        ▼
┌─────────────┐       ┌─────────────┐
│   Active    │ sync  │   Passive   │
│   Primary   │──────▶│   Standby   │
└─────────────┘       └─────────────┘
```
**Pros:**

- Simple to implement and understand
- Cost-effective (standby resources can be smaller)

**Cons:**

- Failover takes time (30 seconds to minutes)
- Standby resources underutilized
- Synchronization lag can cause data loss

**Best for:** Databases, legacy applications that can't run multiple instances
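That last con can be put in numbers. Under asynchronous replication, the data at risk on failover (the recovery point objective, or RPO) is roughly the replication lag divided by the primary's write rate. A toy estimate, with an illustrative helper name:

```typescript
// Rough RPO estimate for asynchronous replication: writes not yet
// applied on the standby are lost on failover. Helper and inputs are
// illustrative, not from any particular tool.

/** Seconds of committed writes at risk, given lag in bytes and WAL write rate. */
function estimatedRpoSeconds(lagBytes: number, walBytesPerSecond: number): number {
  if (walBytesPerSecond <= 0) return 0; // idle primary: nothing to lose
  return lagBytes / walBytesPerSecond;
}

// 1 MiB of lag against a 512 KiB/s write rate ≈ 2 seconds of lost writes
console.log(estimatedRpoSeconds(1048576, 524288));
```

This is why replication-aware failover tools cap the lag a standby may have before it is eligible for promotion.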
### Active-Active (Load Balanced)
Multiple instances serve traffic simultaneously. If one fails, others absorb the load.
```
        ┌─────────────────┐
        │  Load Balancer  │
        └────────┬────────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
 ┌───────┐   ┌───────┐   ┌───────┐
 │ Node 1│   │ Node 2│   │ Node 3│
 └───────┘   └───────┘   └───────┘
```
**Pros:**

- No failover delay—traffic automatically routes to healthy nodes
- Better resource utilization
- Can scale horizontally

**Cons:**

- Requires stateless design or shared state management
- More complex deployment and configuration
- Load balancer becomes potential SPOF

**Best for:** Web applications, APIs, stateless services
### Multi-Region Active-Active
For maximum availability, deploy across geographic regions.
```typescript
// Multi-region request routing (`Request`/`Response` are treated loosely
// here; this sketch assumes a request shape with a `path` field)
interface RegionConfig {
  region: string;
  endpoint: string;
  weight: number;
  healthy: boolean;
}

class GlobalLoadBalancer {
  private regions: RegionConfig[] = [
    { region: 'us-east-1', endpoint: 'https://east.api.example.com', weight: 50, healthy: true },
    { region: 'us-west-2', endpoint: 'https://west.api.example.com', weight: 50, healthy: true },
  ];

  async routeRequest(request: Request): Promise<Response> {
    const healthyRegions = this.regions.filter(r => r.healthy);
    if (healthyRegions.length === 0) {
      throw new Error('All regions unhealthy');
    }
    // Route to nearest healthy region (simplified)
    const clientRegion = this.detectClientRegion(request);
    const targetRegion = this.findNearestHealthy(clientRegion, healthyRegions);
    return await fetch(targetRegion.endpoint + request.path, {
      method: request.method,
      headers: request.headers,
      body: request.body
    });
  }

  // detectClientRegion and findNearestHealthy (GeoIP lookup, latency
  // tables) are omitted here.
}
```
**Challenges:**

- Data consistency across regions
- Network latency between regions
- Significantly higher cost

**Solutions:**

- Eventually consistent data models
- Conflict resolution strategies (last-write-wins, CRDTs)
- Careful selection of what data needs global consistency
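Last-write-wins is the simplest of those conflict resolution strategies: when two regions update the same record concurrently, keep whichever version has the newest timestamp. A minimal sketch (types and field names are mine):

```typescript
// Last-write-wins conflict resolution between two replicas of a record.
// Types and helper are illustrative.

interface VersionedRecord<T> {
  value: T;
  updatedAtMs: number; // requires reasonably synchronized clocks
  region: string;      // tie-breaker so resolution is deterministic
}

function resolveLww<T>(a: VersionedRecord<T>, b: VersionedRecord<T>): VersionedRecord<T> {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  // Same timestamp: break the tie lexicographically by region name
  return a.region < b.region ? a : b;
}

const east = { value: 'v1', updatedAtMs: 1000, region: 'us-east-1' };
const west = { value: 'v2', updatedAtMs: 1500, region: 'us-west-2' };
console.log(resolveLww(east, west).value); // the newer write wins
```

Note that LWW silently discards the losing write and depends on clock synchronization across regions, which is exactly why CRDTs exist for data that can't tolerate either.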
## Database High Availability
Databases are often the hardest component to make highly available because they hold state.
### Replication Strategies
**Synchronous Replication**

- Primary waits for replica acknowledgment before committing
- Zero data loss on failover
- Higher latency (must wait for slowest replica)

**Asynchronous Replication**

- Primary commits immediately, replica catches up
- Potential data loss on failover (replication lag)
- Lower latency, better performance

**Semi-Synchronous**

- Wait for at least one replica (not all)
- Balance between durability and performance
### PostgreSQL HA Example
```yaml
# Patroni configuration for PostgreSQL HA
scope: production-cluster
name: pg-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: pg-node-1:8008

etcd:
  hosts:
    - etcd-1:2379
    - etcd-2:2379
    - etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB max lag for promotion
    synchronous_mode: true
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 4GB
        synchronous_commit: remote_apply
        wal_level: replica
        max_wal_senders: 5
        max_replication_slots: 5
```
### Read Replicas for Scale
Separate read and write traffic to scale reads independently:
```typescript
// Read/write splitting (`DatabaseConnection` and `QueryResult` are
// assumed client types)
class DatabaseRouter {
  private replicaIndex = 0;

  constructor(
    private primary: DatabaseConnection,
    private replicas: DatabaseConnection[]
  ) {}

  async query(sql: string, params: unknown[]): Promise<QueryResult> {
    // Writes (and reads when no replica is available) go to the primary
    if (this.isWriteQuery(sql) || this.replicas.length === 0) {
      return await this.primary.query(sql, params);
    }
    // Round-robin across replicas
    const replica = this.replicas[this.replicaIndex];
    this.replicaIndex = (this.replicaIndex + 1) % this.replicas.length;
    return await replica.query(sql, params);
  }

  private isWriteQuery(sql: string): boolean {
    // Naive prefix check; real routers also handle WITH ... INSERT,
    // transactions, and read-your-writes consistency
    const normalized = sql.trim().toUpperCase();
    return normalized.startsWith('INSERT') ||
      normalized.startsWith('UPDATE') ||
      normalized.startsWith('DELETE') ||
      normalized.startsWith('CREATE') ||
      normalized.startsWith('ALTER') ||
      normalized.startsWith('DROP');
  }
}
```
## Graceful Degradation
When failures occur, degrade gracefully rather than failing completely.
### Circuit Breaker Pattern
Prevent cascade failures by stopping calls to failing services:
```typescript
class CircuitOpenError extends Error {}

enum CircuitState {
  CLOSED,   // Normal operation
  OPEN,     // Failing, reject requests immediately
  HALF_OPEN // Testing if service recovered
}

class CircuitBreaker {
  private state = CircuitState.CLOSED;
  private failures = 0;
  private lastFailureTime = 0;
  private readonly failureThreshold = 5;
  private readonly recoveryTimeout = 30000; // ms

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === CircuitState.OPEN) {
      if (Date.now() - this.lastFailureTime > this.recoveryTimeout) {
        this.state = CircuitState.HALF_OPEN;
      } else {
        throw new CircuitOpenError('Service unavailable');
      }
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.failures = 0;
    this.state = CircuitState.CLOSED;
  }

  private onFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = CircuitState.OPEN;
    }
  }
}
```
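To see the breaker's effect, here is a small demonstration that hammers a failing dependency. A condensed breaker (threshold only, no recovery timeout) is inlined so the snippet runs on its own:

```typescript
// Demonstration: after `failureThreshold` consecutive failures the
// breaker opens and rejects calls without touching the backend.

class CircuitOpenError extends Error {}

class CircuitBreaker {
  private failures = 0;
  private open = false;
  constructor(private readonly failureThreshold = 5) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.open) throw new CircuitOpenError('Service unavailable');
    try {
      const result = await operation();
      this.failures = 0;
      return result;
    } catch (error) {
      if (++this.failures >= this.failureThreshold) this.open = true;
      throw error;
    }
  }
}

async function demo(): Promise<{ backendCalls: number; fastFails: number }> {
  const breaker = new CircuitBreaker(3);
  let backendCalls = 0;
  let fastFails = 0;
  const alwaysFails = async () => { backendCalls++; throw new Error('boom'); };

  for (let i = 0; i < 10; i++) {
    try {
      await breaker.execute(alwaysFails);
    } catch (e) {
      if (e instanceof CircuitOpenError) fastFails++;
    }
  }
  // Only the first 3 attempts reach the backend; the other 7 fail fast
  return { backendCalls, fastFails };
}

demo().then(r => console.log(r));
```

The failing backend is hit exactly three times; the remaining seven calls are rejected immediately, which is the whole point: a struggling service gets breathing room instead of a retry storm.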
### Fallback Strategies
When a service fails, provide degraded but functional alternatives:
| Service | Primary | Fallback |
|---|---|---|
| User profile | Database | Cached version |
| Recommendations | ML model | Popular items |
| Payment processing | Primary processor | Backup processor |
| Search | Elasticsearch | Database LIKE query |
```typescript
// Fallback chain (`userService` and `cache` are assumed clients)
interface UserProfile {
  id: string;
  name: string;
  stale?: boolean;
  degraded?: boolean;
}

async function getUserProfile(userId: string): Promise<UserProfile> {
  // Try primary source
  try {
    return await userService.getProfile(userId);
  } catch (error) {
    console.warn('Primary failed, trying cache', error);
  }
  // Try cache
  try {
    const cached = await cache.get(`user:${userId}`);
    if (cached) {
      return { ...cached, stale: true };
    }
  } catch (error) {
    console.warn('Cache failed, using defaults', error);
  }
  // Return minimal profile
  return {
    id: userId,
    name: 'Unknown User',
    stale: true,
    degraded: true
  };
}
```
## Chaos Engineering
Don't wait for production failures to test your resilience. Deliberately inject failures to find weaknesses.
### Principles of Chaos Engineering

- **Start with a hypothesis**: "If X fails, the system will continue operating with Y degradation"
- **Minimize blast radius**: Start small, in staging, with quick rollback
- **Run in production**: Staging never perfectly mirrors production
- **Automate**: Regular chaos experiments catch regressions
### Common Chaos Experiments
| Experiment | What It Tests |
|---|---|
| Kill random pods | Container orchestration recovery |
| Network latency injection | Timeout handling |
| CPU/memory stress | Resource exhaustion handling |
| Clock skew | Time-dependent logic |
| DNS failure | Service discovery resilience |
| Disk fill | Storage exhaustion handling |
### Example: Chaos Monkey Script
```bash
#!/bin/bash
# Simple chaos experiment: randomly kill one pod from a deployment

NAMESPACE="production"
DEPLOYMENT="api-server"

# Get random pod
POD=$(kubectl get pods -n "$NAMESPACE" -l app="$DEPLOYMENT" \
  -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | shuf -n 1)

echo "Terminating pod: $POD"
kubectl delete pod "$POD" -n "$NAMESPACE"

# Monitor recovery
echo "Monitoring recovery..."
if kubectl rollout status deployment/"$DEPLOYMENT" -n "$NAMESPACE" --timeout=60s; then
  echo "SUCCESS: Deployment recovered"
else
  echo "FAILURE: Deployment did not recover in time"
  exit 1
fi
```
## Monitoring and Alerting
You can't maintain high availability without visibility into system health.
### The Four Golden Signals
Google's SRE team recommends monitoring these four metrics:
- **Latency**: Time to serve requests
- **Traffic**: Demand on the system
- **Errors**: Rate of failed requests
- **Saturation**: How "full" the system is
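The first three signals can be computed directly from request records; saturation needs resource gauges (CPU, memory, queue depth) instead. A minimal sketch, with illustrative type names:

```typescript
// Compute three of the four golden signals from a window of request
// records. `RequestRecord` and field names are illustrative.

interface RequestRecord {
  durationMs: number;
  ok: boolean;
}

interface GoldenSignals {
  trafficRps: number;   // demand on the system
  errorRate: number;    // fraction of failed requests
  latencyP99Ms: number; // tail latency
}

function summarize(records: RequestRecord[], windowSeconds: number): GoldenSignals {
  if (records.length === 0) {
    return { trafficRps: 0, errorRate: 0, latencyP99Ms: 0 };
  }
  const sorted = records.map(r => r.durationMs).sort((a, b) => a - b);
  const p99Index = Math.min(sorted.length - 1, Math.ceil(sorted.length * 0.99) - 1);
  const errors = records.filter(r => !r.ok).length;
  return {
    trafficRps: records.length / windowSeconds,
    errorRate: errors / records.length,
    latencyP99Ms: sorted[p99Index],
  };
}
```

In practice these are computed by your metrics pipeline (Prometheus histograms, for example) rather than by hand, but the definitions are this simple.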
### SLIs, SLOs, and SLAs

- **SLI (Service Level Indicator)**: Metric that measures service quality
- **SLO (Service Level Objective)**: Target value for an SLI
- **SLA (Service Level Agreement)**: Contract with consequences for missing SLOs
```yaml
# Example SLO definition
service: payment-api
slos:
  - name: availability
    sli: successful_requests / total_requests
    target: 99.95%
    window: 30d
  - name: latency
    sli: requests_under_200ms / total_requests
    target: 95%
    window: 30d
  - name: error_rate
    sli: 1 - (error_requests / total_requests)
    target: 99.9%
    window: 30d
```
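An SLO implies an error budget: the unavailability you are allowed to spend in the window. A quick calculation (the helper name is mine):

```typescript
// Error budget: the amount of unavailability an SLO permits per window.
// Helper is illustrative.

/** Minutes of allowed downtime for `sloTarget` (e.g. 0.9995) over `windowDays`. */
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// A 99.95% availability SLO over 30 days allows ~21.6 minutes of downtime
console.log(errorBudgetMinutes(0.9995, 30));
```

Error budgets turn availability into a spendable resource: when the budget is healthy, ship faster; when it is nearly exhausted, slow down and invest in reliability.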
### Alert Fatigue Prevention

Too many alerts are as bad as too few. Design alerts that are actionable:
| Alert Type | Trigger | Action |
|---|---|---|
| Page (wake someone up) | SLO breach imminent | Immediate investigation |
| Ticket (next business day) | Degradation detected | Scheduled investigation |
| Log (informational) | Anomaly detected | Review in context |
## Key Takeaways

- **Define availability targets based on business impact**: Not all systems need five nines
- **Eliminate single points of failure**: Redundancy at every layer
- **Automate detection and recovery**: Humans are too slow for high availability
- **Design for graceful degradation**: Partial functionality beats complete failure
- **Test your resilience**: Chaos engineering finds weaknesses before production does
- **Monitor the right signals**: Latency, traffic, errors, saturation
## Building Resilient Systems
High availability isn't a feature you add at the end—it's an architectural principle that shapes every decision.
PEW Consulting has experience building mission-critical systems that achieve 99.97%+ uptime while processing millions of monthly transactions. We've applied these patterns to government systems, healthcare platforms, and high-volume e-commerce.
Schedule a consultation to discuss your availability requirements.
## Sources
- Gartner: Cost of IT Downtime
- Google SRE Book: Service Level Objectives
- AWS Well-Architected Framework: Reliability
- Netflix Chaos Engineering
Related reading: The $100 Billion Problem: Why Federal Agencies Still Run on COBOL
