WAF and AWS S3 problems blocking certain incoming API requests for V1 and V2 APIs

Incident Report for Voicebase

Resolved

We are calling the issue resolved after seeing consistent throughput and no errors for almost 30 minutes. The V1 API, which has a simpler scheduling algorithm than V2, has a very large backlog of work that could take several hours to burn down, even with the several thousand extra CPU cores that we have added to the cluster.

Posted Feb 28, 2017 - 14:23 PST

Update

AWS has fixed S3, and as of approximately 1:53 PM PST our API is processing normally. We anticipate a very large backlog of work, so turnaround times are expected to be longer than usual.

Posted Feb 28, 2017 - 14:04 PST

Update

AWS says that they have begun recovering S3 functionality, though we have not yet seen meaningful improvements.

Posted Feb 28, 2017 - 12:58 PST

Update

Amazon has identified the root cause of their various problems today, including the massive S3 failure. They do not yet have a fix, however.

Posted Feb 28, 2017 - 11:48 PST

Update

The S3 outage is preventing us from using buckets in an alternative location. We, along with half of the Internet, are anxiously awaiting a fix from Amazon.

Posted Feb 28, 2017 - 10:29 PST

Update

We are preparing to switch to an alternative set of S3 buckets if the Amazon problem continues. Such a switchover would cause the loss of in-flight work and would leave work "split" across multiple sets of buckets, requiring a post-incident migration, so we will take this step only if absolutely necessary. Please note that S3 is currently not allowing the creation of new buckets, so there is no guarantee that this recovery step is even possible.

Posted Feb 28, 2017 - 10:13 PST

Update

Amazon is experiencing very serious problems with S3. We believe that this is the root cause of the remaining issues with the V1 and V2 APIs.

Posted Feb 28, 2017 - 10:05 PST

Update

WAF is recovered. GET requests process normally. POST requests to the V1 and V2 APIs that upload new work for processing are still failing due to the Amazon S3 outage.

Posted Feb 28, 2017 - 10:02 PST

Monitoring

WAF is recovering, but POST requests that submit new work are still failing.

Posted Feb 28, 2017 - 09:58 PST

Identified

Most types of API requests appear to be unaffected. The impact is mostly limited to API requests that submit new audio files for processing.

Posted Feb 28, 2017 - 09:57 PST

Investigating

We are seeing failures in both the V1 and V2 APIs. Work accepted for processing will be processed normally. We are working to recover the cluster.
UPDATE: We believe that AWS S3 issues are the root cause of the outage.

Posted Feb 28, 2017 - 09:50 PST