Failover Strategies for Multi-Provider LLM Architectures
Provider availability incidents are inevitable. OpenAI, Anthropic, and Google have all experienced degraded performance or complete outages that took down production applications built on their APIs. This post covers the failover strategies that keep LLM applications running through these incidents.
The Availability Baseline
Major LLM providers publish uptime figures in the 99.5-99.9% range. That sounds excellent until you calculate the implications: 99.5% uptime allows approximately 44 hours of downtime per year. For customer-facing applications, that is unacceptable without a mitigation strategy.
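The arithmetic behind that figure is simple to verify (a 365-day year is assumed here):

```python
# Downtime implied by an uptime percentage, per year.
HOURS_PER_YEAR = 24 * 365  # 8760

def downtime_hours(uptime_pct: float) -> float:
    """Hours of downtime per year permitted at a given uptime percentage."""
    return (1 - uptime_pct / 100) * HOURS_PER_YEAR

for sla in (99.5, 99.9, 99.99):
    print(f"{sla}% uptime -> {downtime_hours(sla):.1f} h/year of downtime")
# 99.5% -> 43.8 h/year; 99.9% -> 8.8 h/year; 99.99% -> 0.9 h/year
```

Even the top of the published range still leaves most of a working day of downtime per year.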
Passive vs. Active Health Monitoring
Passive monitoring tracks error rates from live production traffic. It is free (you are already making these requests) and provides the most accurate picture of current availability for your specific request patterns. Active health monitoring sends dedicated health check requests on a schedule. It detects degradation before real user traffic is affected but adds cost and complexity.
Failover Trigger Thresholds
The most effective failover configurations use a combination of error rate, error type, and latency degradation as triggers. Rate limit errors warrant a different response than model errors. Transient timeout spikes may not warrant immediate failover, while persistent 5xx errors should trigger immediate circuit opening.
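One way to encode "different responses for different error types" is a per-provider circuit breaker that weighs 5xx errors heavily, tolerates isolated timeouts, and leaves rate limits to backoff logic. The thresholds and names below are illustrative assumptions, not prescribed values:

```python
from enum import Enum, auto

class ErrorKind(Enum):
    RATE_LIMIT = auto()  # 429: back off or shed load; not an availability signal
    TIMEOUT = auto()     # transient spikes tolerated; persistence required
    SERVER = auto()      # 5xx: strong signal, opens the circuit quickly

class CircuitBreaker:
    """Per-provider breaker: opens after a short run of consecutive 5xx
    errors, but only after a much longer run of timeouts."""

    def __init__(self, server_error_threshold: int = 3, timeout_threshold: int = 10):
        self.server_error_threshold = server_error_threshold
        self.timeout_threshold = timeout_threshold
        self.server_errors = 0
        self.timeouts = 0
        self.open = False  # True means: route traffic to the secondary

    def record_success(self) -> None:
        # Any success resets the consecutive-error counts.
        self.server_errors = 0
        self.timeouts = 0

    def record_error(self, kind: ErrorKind) -> None:
        if kind is ErrorKind.SERVER:
            self.server_errors += 1
            if self.server_errors >= self.server_error_threshold:
                self.open = True
        elif kind is ErrorKind.TIMEOUT:
            self.timeouts += 1
            if self.timeouts >= self.timeout_threshold:
                self.open = True
        # RATE_LIMIT deliberately does not open the circuit.
```

A production version would also add a latency-degradation trigger and a half-open state for probing recovery, but the error-type asymmetry is the core idea.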
Recovery and Traffic Shift-Back
Failover is straightforward. Recovery is harder. Traffic shifted to a secondary provider should return to the primary gradually after recovery is confirmed, not all at once. A gradual ramp (5%, 25%, 50%, 100% over 20 minutes) with automatic rollback if error rates spike prevents oscillation when a provider is only marginally recovered.
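The ramp schedule above can be sketched as a small state machine evaluated once per step interval. The rollback condition (restart the ramp when the primary's error rate exceeds a limit) and the 5% error-rate threshold are assumptions for illustration:

```python
import random

# Ramp from the post: 5% -> 25% -> 50% -> 100% over 20 minutes,
# i.e. one step every 5 minutes.
RAMP_STEPS = [0.05, 0.25, 0.50, 1.00]
STEP_SECONDS = 5 * 60

class RampBack:
    """Gradually shifts traffic back to the primary after recovery is
    confirmed, restarting the ramp if the primary's error rate spikes."""

    def __init__(self, error_rate_limit: float = 0.05):
        self.step = 0  # index into RAMP_STEPS
        self.error_rate_limit = error_rate_limit

    @property
    def primary_fraction(self) -> float:
        return RAMP_STEPS[self.step]

    def route(self) -> str:
        """Pick a provider for one request at the current ramp fraction."""
        return "primary" if random.random() < self.primary_fraction else "secondary"

    def on_interval(self, primary_error_rate: float) -> None:
        """Call once per STEP_SECONDS with the observed primary error rate."""
        if primary_error_rate > self.error_rate_limit:
            self.step = 0  # automatic rollback: restart from the lowest fraction
        elif self.step < len(RAMP_STEPS) - 1:
            self.step += 1
```

Starting the restarted ramp at 5% rather than dropping to 0% keeps a trickle of canary traffic on the primary, so recovery can be re-confirmed without another full failover event.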