Fully resolved. Total user-facing impact ~45 minutes. Root cause: simultaneous Azure OpenAI regional outage and Claude API rate limiting on our primary provider. Mitigation: we have since expanded our Azure OpenAI footprint from a single region to 7 US regions (westus, westus3, eastus, eastus2, centralus, southcentralus, northcentralus), roughly 5x effective GPT capacity, with automatic health- and rate-limit-aware routing so a localized issue in any one region no longer impacts availability.
Azure OpenAI traffic has been routed to a healthy region and Claude requests are succeeding again after backoff. Chat Agent and Questionnaire pipelines are recovering. Monitoring for stability.
We are seeing degraded performance on Chat Agent and Questionnaire answer generation. Upstream cause: an Azure OpenAI regional outage combined with Claude API rate limiting on our primary provider. Engineering is failing traffic over to alternate regions and providers.