Availability
Understanding availability in distributed systems and how to design highly available systems.
What is Availability?
Availability measures the proportion of time a system is operational and accessible. It is typically expressed as a percentage and often described by its number of "nines" (for example, 99.9% is "three nines").
The "Nines" of Availability
| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three nines) | 8.77 hours | 43.83 minutes | 10.08 minutes |
| 99.99% (four nines) | 52.60 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.30 seconds | 6.05 seconds |
Note: Five nines (99.999%) is a common benchmark for "high availability" and a typical target for critical systems such as banking, healthcare, and emergency services.
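The table values follow directly from the definition: allowed downtime is the unavailable fraction multiplied by the length of the period. A minimal Python sketch, assuming an average year of 365.25 days (the name `downtime_seconds` is illustrative, not taken from any library):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year, including leap days


def downtime_seconds(availability_percent: float,
                     period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime allowed in a period at the given availability."""
    return (1 - availability_percent / 100) * period_seconds


for target in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_seconds(target)
    per_month = downtime_seconds(target, SECONDS_PER_YEAR / 12)
    print(f"{target}%: {per_year / 60:.2f} min/year, {per_month / 60:.2f} min/month")
```

Passing a different `period_seconds` (a week, a quarter) gives the budget for that window.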
Calculating Availability
Single Component
Availability = MTBF / (MTBF + MTTR)
Where:
- MTBF: Mean Time Between Failures
- MTTR: Mean Time To Recovery
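As a rough illustration of the formula, the sketch below compares two hypothetical recovery times for the same failure rate. The numbers (one failure per 1,000 hours, recovery in 1 hour versus 6 minutes) are made up for the example:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), with both times in the same unit."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails about once every 1,000 hours.
slow_recovery = availability(mtbf_hours=1000, mttr_hours=1.0)
fast_recovery = availability(mtbf_hours=1000, mttr_hours=0.1)

print(f"1 hour to recover:    {slow_recovery:.4%}")
print(f"6 minutes to recover: {fast_recovery:.4%}")
```

With the failure rate unchanged, cutting recovery time by a factor of ten moves the component from roughly three nines to roughly four, which is why reducing MTTR is often the cheaper lever.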
Systems in Sequence
When components are in series (all must work):
Total Availability = A₁ × A₂ × A₃ × ... × Aₙ
Example: Three services at 99.9% each:
0.999 × 0.999 × 0.999 ≈ 0.997 = 99.7%
Systems in Parallel (Redundancy)
When components are redundant (any one can work) and fail independently of each other:
Total Availability = 1 - (1 - A)ⁿ
Example: Two servers at 99% each:
1 - (1 - 0.99)² = 1 - 0.0001 = 0.9999 = 99.99%
Strategies for High Availability
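The series and parallel formulas from the previous section show why the strategies below lean so heavily on redundancy. The following is a small sketch, assuming component failures are independent (correlated failures, such as a shared power supply, make the real figure lower). The helper names `series` and `parallel` are illustrative:

```python
from math import prod
from typing import Iterable


def series(availabilities: Iterable[float]) -> float:
    """All components must work: multiply the availabilities."""
    return prod(availabilities)


def parallel(availabilities: Iterable[float]) -> float:
    """Any one component suffices: 1 minus the chance that all fail at once."""
    return 1 - prod(1 - a for a in availabilities)


print(f"three 99.9% services in series: {series([0.999] * 3):.4%}")
print(f"two 99% servers in parallel:    {parallel([0.99, 0.99]):.4%}")

# Hypothetical stack: one 99.99% load balancer in front of two redundant 99% servers.
stack = series([0.9999, parallel([0.99, 0.99])])
print(f"load balancer + server pair:    {stack:.4%}")
```

The last line shows that the single load balancer caps the whole stack: adding more servers behind it cannot raise availability above that of the non-redundant component.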
1. Redundancy
- Active-Passive: Standby takes over on failure
- Active-Active: Multiple instances handle requests simultaneously
- N+1 Redundancy: One extra component beyond minimum required
2. Replication
- Database replication: Multiple copies of data
- Multi-region deployment: Survive regional outages
- Cross-zone redundancy: Survive datacenter failures
3. Load Balancing
- Distribute traffic across healthy instances
- Automatic health checks and failover
- Geographic load balancing for global availability
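A sketch of the first two points, assuming health status is reported by some external checker. The class `RoundRobinBalancer` and its methods are hypothetical, and real load balancers also handle weighting, connection draining, and retries:

```python
class RoundRobinBalancer:
    """Rotate through instances, skipping any that are marked unhealthy."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._healthy = set(self._instances)
        self._next = 0  # index of the next instance to consider

    def mark_unhealthy(self, instance):
        self._healthy.discard(instance)

    def mark_healthy(self, instance):
        if instance in self._instances:
            self._healthy.add(instance)

    def pick(self):
        # Look at each instance at most once per call.
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
balancer.mark_unhealthy("app-2")  # e.g. a failed health check
print([balancer.pick() for _ in range(4)])
```

Traffic keeps flowing to the remaining instances, and `mark_healthy` returns the instance to rotation once it recovers.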
4. Fault Isolation
- Bulkhead pattern: Isolate failures to prevent cascade
- Circuit breaker: Stop calling failing services
- Graceful degradation: Provide reduced functionality instead of complete failure
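The circuit breaker idea can be sketched in a few lines. This is a simplified, single-threaded illustration rather than a production implementation; the class name, the threshold of consecutive failures, and the injectable `clock` parameter are all assumptions made for the example:

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers can catch `CircuitOpenError` and return a cached or reduced response, which is one way to combine the circuit breaker with graceful degradation.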
Availability Patterns
Failover
Active-Passive (cold, warm, or hot standby):
- Primary handles all traffic
- Secondary takes over on primary failure
- Simple, but the switch causes brief downtime (longest for a cold standby that must start up first)
Active-Active:
- Multiple instances handle traffic
- No downtime on single failure
- More complex to implement
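From the client's point of view, failover can be sketched as trying endpoints in priority order. This is a simplified illustration, assuming each endpoint is a plain callable that raises `ConnectionError` when unreachable. `call_with_failover` is a hypothetical helper; in practice failover usually happens in a load balancer, DNS, or cluster manager rather than in application code:

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order and return the first successful response."""
    last_error = None
    for endpoint in endpoints:
        try:
            return endpoint(request)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next endpoint
    raise RuntimeError("all endpoints failed") from last_error


def primary(request):
    raise ConnectionError("primary is down")  # simulated outage


def secondary(request):
    return f"secondary handled {request}"


print(call_with_failover([primary, secondary], "GET /status"))
```

The time spent waiting for the primary to fail before the secondary answers is the "brief downtime during switch" in the active-passive case.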
Health Checks
Systems should continuously monitor component health:
- Liveness probes: Is the service running?
- Readiness probes: Can it handle requests?
- Deep health checks: Are dependencies healthy?
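The three kinds of check answer different questions, which the sketch below keeps separate. The `HealthChecker` class and the dependency names are hypothetical; a real service would expose these through HTTP endpoints or its orchestrator's probe mechanism:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class HealthChecker:
    # Maps a dependency name to a function returning True when it is reachable.
    dependencies: Dict[str, Callable[[], bool]] = field(default_factory=dict)
    ready: bool = False  # set to True once startup work has finished

    def liveness(self) -> bool:
        """Is the process running and able to respond at all?"""
        return True

    def readiness(self) -> bool:
        """Has the service finished starting, so it can take traffic?"""
        return self.ready

    def deep(self) -> Dict[str, bool]:
        """Check every dependency; a check that raises counts as unhealthy."""
        results = {}
        for name, check in self.dependencies.items():
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False
        return results


checker = HealthChecker(dependencies={"database": lambda: True, "cache": lambda: False})
checker.ready = True
print(checker.liveness(), checker.readiness(), checker.deep())
```

Keeping liveness independent of dependencies is a deliberate choice here: restarting a service because its database is down rarely helps, whereas taking it out of rotation via readiness does.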
Availability vs Other Properties
| Trade-off | Higher Availability | Lower Availability |
|---|---|---|
| Consistency | Eventually consistent | Strongly consistent |
| Cost | Higher (more redundancy) | Lower |
| Complexity | Higher (more components) | Lower |
| Latency | Sometimes higher | Sometimes lower |
Real-World Examples
High Availability Systems
| System | Target | Approach |
|---|---|---|
| AWS S3 | 99.99% availability (S3 Standard design target); 99.999999999% (11 nines) durability | Massive replication |
| Google Search | 99.99%+ | Global distribution |
| Banking Systems | 99.99% | Active-active with failover |
Downtime Incidents
- 2017 AWS S3 outage: A roughly 4-hour disruption in the US-East-1 region took down thousands of websites
- 2021 Facebook outage: 6-hour global outage due to BGP misconfiguration
- 2020 Google outage: 45-minute outage affecting Gmail, YouTube, etc.
Designing for Availability
Questions to Ask
- What's the acceptable downtime for this system?
- What's the cost of downtime (revenue, reputation)?
- What failure scenarios must we handle?
- What's the budget for redundancy?
Best Practices
- Define SLAs/SLOs: Set clear availability targets
- Plan for failure: Assume components will fail
- Test failover: Regularly test backup systems
- Monitor everything: Detect issues before users do
- Automate recovery: Reduce MTTR with automation
Interview Tips
- Know the nines: Memorize the availability percentages and their downtime
- Calculate compound availability: Understand how components in series/parallel affect total availability
- Discuss trade-offs: Availability often trades off against consistency; the CAP theorem says a system must give up one of the two during a network partition
- Mention real examples: Reference AWS, Google, etc. availability strategies
- Consider cost: Higher availability = higher cost
Summary
Availability is crucial for user trust and business continuity. Achieving high availability requires:
- Redundancy at every layer
- Automated failover mechanisms
- Continuous monitoring and alerting
- Regular testing of failure scenarios
- Clear SLAs and recovery procedures