WAF and AWS S3 problems blocking certain incoming API requests for V1 and V2 APIs
Incident Report for Voicebase
Resolved
We are calling the issue resolved after seeing consistent throughput and no errors for almost 30 minutes. The V1 API, which has a simpler scheduling algorithm than V2, has a very large backlog of work that could take several hours to burn down, even with the several thousand extra CPU cores that we have added to the cluster.
Posted 10 months ago. Feb 28, 2017 - 14:23 PST
Update
AWS has fixed S3, and our API has been processing normally since approximately 1:53 PM PST. We anticipate a very large backlog of work, so turnaround times are expected to be longer than usual.
Posted 10 months ago. Feb 28, 2017 - 14:04 PST
Update
AWS says that they have begun recovering S3 functionality, though we have not yet seen meaningful improvements.
Posted 10 months ago. Feb 28, 2017 - 12:58 PST
Update
Amazon has identified the root cause of their various problems today, including the massive S3 failure. They do not yet have a fix, however.
Posted 10 months ago. Feb 28, 2017 - 11:48 PST
Update
The S3 outage is preventing us from using buckets in an alternative location. We, along with half of the Internet, are anxiously awaiting a fix from Amazon.
Posted 10 months ago. Feb 28, 2017 - 10:29 PST
Update
We are preparing to switch to an alternative set of S3 buckets if the Amazon problem continues. Such a switchover would cause the loss of in-flight work, and would result in work being "split" across multiple sets of buckets, requiring a post-incident migration, so we will only take this step if absolutely necessary. Please note that at the current time S3 is not allowing the creation of new buckets, so there is no guarantee that this recovery step is even possible.
Posted 10 months ago. Feb 28, 2017 - 10:13 PST
Update
Amazon is experiencing very serious problems with S3. We believe that this is the root cause of the remaining issues with the V1 and V2 APIs.
Posted 10 months ago. Feb 28, 2017 - 10:05 PST
Update
WAF has recovered. GET requests are processing normally. POST requests to the V1 and V2 APIs that upload new work for processing are still failing due to the Amazon S3 outage.
Posted 10 months ago. Feb 28, 2017 - 10:02 PST
Monitoring
WAF is recovering, but POST requests that submit new work are still failing.
Posted 10 months ago. Feb 28, 2017 - 09:58 PST
Identified
Most types of API requests appear to be unaffected. The impact is mostly limited to API requests that submit new audio files for processing.
Posted 10 months ago. Feb 28, 2017 - 09:57 PST
Investigating
We are seeing failures in both the V1 and V2 APIs. Work accepted for processing will be processed normally. We are working to recover the cluster.
UPDATE: We believe that AWS S3 issues are the root cause of the outage.
Posted 10 months ago. Feb 28, 2017 - 09:50 PST