When systems go down, the costs compound fast. Gartner estimates that the average cost of IT downtime is $5,600 per minute—that's over $300,000 per hour. For mission-critical government systems processing benefits, healthcare claims, or financial transactions, the impact extends beyond dollars to public trust and safety.
High availability isn't about preventing all failures. It's about designing systems that continue operating despite failures.
## Understanding Availability Metrics
Availability is measured in "nines"—the percentage of time a system is operational.
### The Nines of Availability
| Availability | Annual Downtime | Monthly Downtime | Common Use |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | Internal tools |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | Business applications |
| 99.95% | 4.38 hours | 21.9 minutes | E-commerce |
| 99.99% (four nines) | 52.6 minutes | 4.4 minutes | Financial systems |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | Critical infrastructure |
Each additional nine requires exponentially more engineering effort and cost. The goal isn't maximum availability—it's appropriate availability for your use case.
### Calculating Required Availability

Work backwards from business impact:

```
Required Availability = 1 - (Acceptable Downtime ÷ Total Time)
```

Example: a payment processing system can tolerate 4 hours of downtime per year.

```
Required Availability = 1 - (4 hours ÷ 8,760 hours/year)
                      = 1 - 0.000457
                      = 0.999543
                      ≈ 99.95%
```

This requires engineering for at least 99.95% availability, between three and four nines.
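This back-of-the-envelope math is easy to script. A small TypeScript helper (function names are my own) converts in both directions:

```typescript
// Convert between availability targets and allowed downtime.
// Helper names are illustrative, not from any particular library.

const HOURS_PER_YEAR = 8760;

/** Availability required to stay within `downtimeHours` of downtime per year. */
function requiredAvailability(downtimeHoursPerYear: number): number {
  return 1 - downtimeHoursPerYear / HOURS_PER_YEAR;
}

/** Allowed downtime per year (in hours) for a given availability target. */
function allowedDowntimeHours(availability: number): number {
  return (1 - availability) * HOURS_PER_YEAR;
}

console.log(requiredAvailability(4));      // tolerance of 4 h/yr
console.log(allowedDowntimeHours(0.9999)); // "four nines"
```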
## The Anatomy of High Availability
High availability emerges from three principles applied at every layer:
### 1. Redundancy
No single component should be a single point of failure (SPOF). Redundancy strategies vary by component type:
| Component | Redundancy Strategy |
|---|---|
| Compute | Multiple instances across availability zones |
| Storage | Replication (synchronous or asynchronous) |
| Network | Multiple paths, load balancers, DNS failover |
| Database | Primary-replica, multi-master, or distributed |
| Power | UPS, generators, multiple utility feeds |
### 2. Fault Detection
You can't recover from failures you don't detect. Implement monitoring at multiple levels:
```typescript
// Health check hierarchy
interface HealthCheck {
  // Component is running
  liveness: () => Promise<boolean>;
  // Component can serve traffic
  readiness: () => Promise<boolean>;
  // Component is performing well
  performance: () => Promise<HealthMetrics>;
}

interface HealthMetrics {
  queryLatencyP99: number;
  connectionsUsed: number;
  connectionsMax: number;
  healthy: boolean;
}

// Example: database health check (`db` is an assumed client exposing
// query, getReplicationLag, and getMetrics)
const databaseHealth: HealthCheck = {
  liveness: async () => {
    try {
      await db.query('SELECT 1');
      return true;
    } catch {
      return false;
    }
  },
  readiness: async () => {
    const replicationLag = await db.getReplicationLag();
    return replicationLag < 1000; // Less than 1 second of lag (ms)
  },
  performance: async () => {
    const metrics = await db.getMetrics();
    return {
      queryLatencyP99: metrics.latency.p99,
      connectionsUsed: metrics.connections.active,
      connectionsMax: metrics.connections.max,
      healthy:
        metrics.latency.p99 < 100 &&
        metrics.connections.active < metrics.connections.max * 0.8
    };
  }
};
```
### 3. Automated Recovery
Human response times are measured in minutes. Automated recovery happens in seconds.
```typescript
// Automated failover controller
const sleep = (ms: number) => new Promise<void>(res => setTimeout(res, ms));

class FailoverController {
  private primaryHealthy = true;
  private lastFailoverTime = 0;
  private readonly minFailoverInterval = 60000; // Prevent flapping (ms)

  async monitor(): Promise<void> {
    while (true) {
      const health = await this.checkPrimaryHealth();
      if (!health.healthy && this.primaryHealthy) {
        await this.initiateFailover(health.reason);
      } else if (health.healthy && !this.primaryHealthy) {
        await this.considerFailback();
      }
      await sleep(5000); // Check every 5 seconds
    }
  }

  private async initiateFailover(reason: string): Promise<void> {
    const now = Date.now();
    if (now - this.lastFailoverTime < this.minFailoverInterval) {
      console.warn('Failover suppressed: too recent');
      return;
    }
    console.log(`Initiating failover: ${reason}`);
    // 1. Promote replica to primary
    await this.promoteReplica();
    // 2. Update load balancer
    await this.updateTrafficRouting();
    // 3. Notify operations team
    await this.sendAlert('FAILOVER_COMPLETED', { reason });
    this.primaryHealthy = false;
    this.lastFailoverTime = now;
  }

  // checkPrimaryHealth, considerFailback, promoteReplica,
  // updateTrafficRouting, and sendAlert are environment-specific
  // and omitted here.
}
```
## Architecture Patterns for High Availability
### Active-Passive (Warm Standby)
The simplest HA pattern: one active system with a standby ready to take over.
```
    [Traffic]
        │
        ▼
┌─────────────┐       ┌─────────────┐
│   Active    │ sync  │   Passive   │
│   Primary   │──────▶│   Standby   │
└─────────────┘       └─────────────┘
```
**Pros:**

- Simple to implement and understand
- Cost-effective (standby resources can be smaller)

**Cons:**

- Failover takes time (30 seconds to minutes)
- Standby resources underutilized
- Synchronization lag can cause data loss

**Best for:** Databases, legacy applications that can't run multiple instances
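That last con can be put in numbers. Under asynchronous replication, the data at risk on failover (the recovery point objective, or RPO) is roughly the replication lag divided by the primary's write rate. A toy estimate, with an illustrative helper name:

```typescript
// Rough RPO estimate for asynchronous replication: writes not yet
// applied on the standby are lost on failover. Helper and inputs are
// illustrative, not from any particular tool.

/** Seconds of committed writes at risk, given lag in bytes and WAL write rate. */
function estimatedRpoSeconds(lagBytes: number, walBytesPerSecond: number): number {
  if (walBytesPerSecond <= 0) return 0; // idle primary: nothing to lose
  return lagBytes / walBytesPerSecond;
}

// 1 MiB of lag against a 512 KiB/s write rate ≈ 2 seconds of lost writes
console.log(estimatedRpoSeconds(1048576, 524288));
```

This is why replication-aware failover tools cap the lag a standby may have before it is eligible for promotion.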
### Active-Active (Load Balanced)
Multiple instances serve traffic simultaneously. If one fails, others absorb the load.
```
        ┌─────────────────┐
        │  Load Balancer  │
        └────────┬────────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
 ┌───────┐   ┌───────┐   ┌───────┐
 │ Node 1│   │ Node 2│   │ Node 3│
 └───────┘   └───────┘   └───────┘
```
**Pros:**

- No failover delay—traffic automatically routes to healthy nodes
- Better resource utilization
- Can scale horizontally

**Cons:**

- Requires stateless design or shared state management
- More complex deployment and configuration
- Load balancer becomes potential SPOF

**Best for:** Web applications, APIs, stateless services
### Multi-Region Active-Active
For maximum availability, deploy across geographic regions.
```typescript
// Multi-region request routing (`Request`/`Response` are treated loosely
// here; this sketch assumes a request shape with a `path` field)
interface RegionConfig {
  region: string;
  endpoint: string;
  weight: number;
  healthy: boolean;
}

class GlobalLoadBalancer {
  private regions: RegionConfig[] = [
    { region: 'us-east-1', endpoint: 'https://east.api.example.com', weight: 50, healthy: true },
    { region: 'us-west-2', endpoint: 'https://west.api.example.com', weight: 50, healthy: true },
  ];

  async routeRequest(request: Request): Promise<Response> {
    const healthyRegions = this.regions.filter(r => r.healthy);
    if (healthyRegions.length === 0) {
      throw new Error('All regions unhealthy');
    }
    // Route to nearest healthy region (simplified)
    const clientRegion = this.detectClientRegion(request);
    const targetRegion = this.findNearestHealthy(clientRegion, healthyRegions);
    return await fetch(targetRegion.endpoint + request.path, {
      method: request.method,
      headers: request.headers,
      body: request.body
    });
  }

  // detectClientRegion and findNearestHealthy (GeoIP lookup, latency
  // tables) are omitted here.
}
```
**Challenges:**

- Data consistency across regions
- Network latency between regions
- Significantly higher cost

**Solutions:**

- Eventually consistent data models
- Conflict resolution strategies (last-write-wins, CRDTs)
- Careful selection of what data needs global consistency
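Last-write-wins is the simplest of those conflict resolution strategies: when two regions update the same record concurrently, keep whichever version has the newest timestamp. A minimal sketch (types and field names are mine):

```typescript
// Last-write-wins conflict resolution between two replicas of a record.
// Types and helper are illustrative.

interface VersionedRecord<T> {
  value: T;
  updatedAtMs: number; // requires reasonably synchronized clocks
  region: string;      // tie-breaker so resolution is deterministic
}

function resolveLww<T>(a: VersionedRecord<T>, b: VersionedRecord<T>): VersionedRecord<T> {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  // Same timestamp: break the tie lexicographically by region name
  return a.region < b.region ? a : b;
}

const east = { value: 'v1', updatedAtMs: 1000, region: 'us-east-1' };
const west = { value: 'v2', updatedAtMs: 1500, region: 'us-west-2' };
console.log(resolveLww(east, west).value); // the newer write wins
```

Note that LWW silently discards the losing write and depends on clock synchronization across regions, which is exactly why CRDTs exist for data that can't tolerate either.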
## Database High Availability
Databases are often the hardest component to make highly available because they hold state.
### Replication Strategies
**Synchronous Replication**

- Primary waits for replica acknowledgment before committing
- Zero data loss on failover
- Higher latency (must wait for slowest replica)

**Asynchronous Replication**

- Primary commits immediately, replica catches up
- Potential data loss on failover (replication lag)
- Lower latency, better performance

**Semi-Synchronous**

- Wait for at least one replica (not all)
- Balance between durability and performance
### PostgreSQL HA Example
```yaml
# Patroni configuration for PostgreSQL HA
scope: production-cluster
name: pg-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: pg-node-1:8008

etcd:
  hosts:
    - etcd-1:2379
    - etcd-2:2379
    - etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB max lag for promotion
    synchronous_mode: true
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 4GB
        synchronous_commit: remote_apply
        wal_level: replica
        max_wal_senders: 5
        max_replication_slots: 5
```
### Read Replicas for Scale
Separate read and write traffic to scale reads independently:
```typescript
// Read/write splitting (`DatabaseConnection` and `QueryResult` are
// assumed client types)
class DatabaseRouter {
  private replicaIndex = 0;

  constructor(
    private primary: DatabaseConnection,
    private replicas: DatabaseConnection[]
  ) {}

  async query(sql: string, params: unknown[]): Promise<QueryResult> {
    // Writes (and reads when no replica is available) go to the primary
    if (this.isWriteQuery(sql) || this.replicas.length === 0) {
      return await this.primary.query(sql, params);
    }
    // Round-robin across replicas
    const replica = this.replicas[this.replicaIndex];
    this.replicaIndex = (this.replicaIndex + 1) % this.replicas.length;
    return await replica.query(sql, params);
  }

  private isWriteQuery(sql: string): boolean {
    // Naive prefix check; real routers also handle WITH ... INSERT,
    // transactions, and read-your-writes consistency
    const normalized = sql.trim().toUpperCase();
    return normalized.startsWith('INSERT') ||
      normalized.startsWith('UPDATE') ||
      normalized.startsWith('DELETE') ||
      normalized.startsWith('CREATE') ||
      normalized.startsWith('ALTER') ||
      normalized.startsWith('DROP');
  }
}
```
## Graceful Degradation
When failures occur, degrade gracefully rather than failing completely.
### Circuit Breaker Pattern
Prevent cascade failures by stopping calls to failing services:
```typescript
class CircuitOpenError extends Error {}

enum CircuitState {
  CLOSED,   // Normal operation
  OPEN,     // Failing, reject requests immediately
  HALF_OPEN // Testing if service recovered
}

class CircuitBreaker {
  private state = CircuitState.CLOSED;
  private failures = 0;
  private lastFailureTime = 0;
  private readonly failureThreshold = 5;
  private readonly recoveryTimeout = 30000; // ms

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === CircuitState.OPEN) {
      if (Date.now() - this.lastFailureTime > this.recoveryTimeout) {
        this.state = CircuitState.HALF_OPEN;
      } else {
        throw new CircuitOpenError('Service unavailable');
      }
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.failures = 0;
    this.state = CircuitState.CLOSED;
  }

  private onFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = CircuitState.OPEN;
    }
  }
}
```
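To see the breaker's effect, here is a small demonstration that hammers a failing dependency. A condensed breaker (threshold only, no recovery timeout) is inlined so the snippet runs on its own:

```typescript
// Demonstration: after `failureThreshold` consecutive failures the
// breaker opens and rejects calls without touching the backend.

class CircuitOpenError extends Error {}

class CircuitBreaker {
  private failures = 0;
  private open = false;
  constructor(private readonly failureThreshold = 5) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.open) throw new CircuitOpenError('Service unavailable');
    try {
      const result = await operation();
      this.failures = 0;
      return result;
    } catch (error) {
      if (++this.failures >= this.failureThreshold) this.open = true;
      throw error;
    }
  }
}

async function demo(): Promise<{ backendCalls: number; fastFails: number }> {
  const breaker = new CircuitBreaker(3);
  let backendCalls = 0;
  let fastFails = 0;
  const alwaysFails = async () => { backendCalls++; throw new Error('boom'); };

  for (let i = 0; i < 10; i++) {
    try {
      await breaker.execute(alwaysFails);
    } catch (e) {
      if (e instanceof CircuitOpenError) fastFails++;
    }
  }
  // Only the first 3 attempts reach the backend; the other 7 fail fast
  return { backendCalls, fastFails };
}

demo().then(r => console.log(r));
```

The failing backend is hit exactly three times; the remaining seven calls are rejected immediately, which is the whole point: a struggling service gets breathing room instead of a retry storm.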
### Fallback Strategies
When a service fails, provide degraded but functional alternatives:
| Service | Primary | Fallback |
|---|---|---|
| User profile | Database | Cached version |
| Recommendations | ML model | Popular items |
| Payment processing | Primary processor | Backup processor |
| Search | Elasticsearch | Database LIKE query |
```typescript
// Fallback chain (`userService` and `cache` are assumed clients)
interface UserProfile {
  id: string;
  name: string;
  stale?: boolean;
  degraded?: boolean;
}

async function getUserProfile(userId: string): Promise<UserProfile> {
  // Try primary source
  try {
    return await userService.getProfile(userId);
  } catch (error) {
    console.warn('Primary failed, trying cache', error);
  }
  // Try cache
  try {
    const cached = await cache.get(`user:${userId}`);
    if (cached) {
      return { ...cached, stale: true };
    }
  } catch (error) {
    console.warn('Cache failed, using defaults', error);
  }
  // Return minimal profile
  return {
    id: userId,
    name: 'Unknown User',
    stale: true,
    degraded: true
  };
}
```
## Chaos Engineering
Don't wait for production failures to test your resilience. Deliberately inject failures to find weaknesses.
### Principles of Chaos Engineering

- **Start with a hypothesis**: "If X fails, the system will continue operating with Y degradation"
- **Minimize blast radius**: Start small, in staging, with quick rollback
- **Run in production**: Staging never perfectly mirrors production
- **Automate**: Regular chaos experiments catch regressions
### Common Chaos Experiments
| Experiment | What It Tests |
|---|---|
| Kill random pods | Container orchestration recovery |
| Network latency injection | Timeout handling |
| CPU/memory stress | Resource exhaustion handling |
| Clock skew | Time-dependent logic |
| DNS failure | Service discovery resilience |
| Disk fill | Storage exhaustion handling |
### Example: Chaos Monkey Script
```bash
#!/bin/bash
# Simple chaos experiment: randomly kill one pod from a deployment

NAMESPACE="production"
DEPLOYMENT="api-server"

# Get random pod
POD=$(kubectl get pods -n "$NAMESPACE" -l app="$DEPLOYMENT" \
  -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | shuf -n 1)

echo "Terminating pod: $POD"
kubectl delete pod "$POD" -n "$NAMESPACE"

# Monitor recovery
echo "Monitoring recovery..."
if kubectl rollout status deployment/"$DEPLOYMENT" -n "$NAMESPACE" --timeout=60s; then
  echo "SUCCESS: Deployment recovered"
else
  echo "FAILURE: Deployment did not recover in time"
  exit 1
fi
```
## Monitoring and Alerting
You can't maintain high availability without visibility into system health.
### The Four Golden Signals
Google's SRE team recommends monitoring these four metrics:
- **Latency**: Time to serve requests
- **Traffic**: Demand on the system
- **Errors**: Rate of failed requests
- **Saturation**: How "full" the system is
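The first three signals can be computed directly from request records; saturation needs resource gauges (CPU, memory, queue depth) instead. A minimal sketch, with illustrative type names:

```typescript
// Compute three of the four golden signals from a window of request
// records. `RequestRecord` and field names are illustrative.

interface RequestRecord {
  durationMs: number;
  ok: boolean;
}

interface GoldenSignals {
  trafficRps: number;   // demand on the system
  errorRate: number;    // fraction of failed requests
  latencyP99Ms: number; // tail latency
}

function summarize(records: RequestRecord[], windowSeconds: number): GoldenSignals {
  if (records.length === 0) {
    return { trafficRps: 0, errorRate: 0, latencyP99Ms: 0 };
  }
  const sorted = records.map(r => r.durationMs).sort((a, b) => a - b);
  const p99Index = Math.min(sorted.length - 1, Math.ceil(sorted.length * 0.99) - 1);
  const errors = records.filter(r => !r.ok).length;
  return {
    trafficRps: records.length / windowSeconds,
    errorRate: errors / records.length,
    latencyP99Ms: sorted[p99Index],
  };
}
```

In practice these are computed by your metrics pipeline (Prometheus histograms, for example) rather than by hand, but the definitions are this simple.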
### SLIs, SLOs, and SLAs

- **SLI (Service Level Indicator)**: Metric that measures service quality
- **SLO (Service Level Objective)**: Target value for an SLI
- **SLA (Service Level Agreement)**: Contract with consequences for missing SLOs
```yaml
# Example SLO definition
service: payment-api
slos:
  - name: availability
    sli: successful_requests / total_requests
    target: 99.95%
    window: 30d
  - name: latency
    sli: requests_under_200ms / total_requests
    target: 95%
    window: 30d
  - name: error_rate
    sli: 1 - (error_requests / total_requests)
    target: 99.9%
    window: 30d
```
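An SLO implies an error budget: the unavailability you are allowed to spend in the window. A quick calculation (the helper name is mine):

```typescript
// Error budget: the amount of unavailability an SLO permits per window.
// Helper is illustrative.

/** Minutes of allowed downtime for `sloTarget` (e.g. 0.9995) over `windowDays`. */
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// A 99.95% availability SLO over 30 days allows ~21.6 minutes of downtime
console.log(errorBudgetMinutes(0.9995, 30));
```

Error budgets turn availability into a spendable resource: when the budget is healthy, ship faster; when it is nearly exhausted, slow down and invest in reliability.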
### Alert Fatigue Prevention

Too many alerts are as bad as too few. Design alerts that are actionable:
| Alert Type | Trigger | Action |
|---|---|---|
| Page (wake someone up) | SLO breach imminent | Immediate investigation |
| Ticket (next business day) | Degradation detected | Scheduled investigation |
| Log (informational) | Anomaly detected | Review in context |
## Key Takeaways

- **Define availability targets based on business impact**: Not all systems need five nines
- **Eliminate single points of failure**: Redundancy at every layer
- **Automate detection and recovery**: Humans are too slow for high availability
- **Design for graceful degradation**: Partial functionality beats complete failure
- **Test your resilience**: Chaos engineering finds weaknesses before production does
- **Monitor the right signals**: Latency, traffic, errors, saturation
## Building Resilient Systems
High availability isn't a feature you add at the end—it's an architectural principle that shapes every decision.
PEW Consulting has experience building mission-critical systems that achieve 99.97%+ uptime while processing millions of monthly transactions. We've applied these patterns to government systems, healthcare platforms, and high-volume e-commerce.
Schedule a consultation to discuss your availability requirements.
## Sources
- Gartner: Cost of IT Downtime
- Google SRE Book: Service Level Objectives
- AWS Well-Architected Framework: Reliability
- Netflix Chaos Engineering
Related reading: The $100 Billion Problem: Why Federal Agencies Still Run on COBOL
