We are calling the issue resolved after seeing consistent throughput and no errors for almost 30 minutes. The V1 API, which has a simpler scheduling algorithm than V2, still has a very large backlog of work that could take several hours to burn down, even with the several thousand extra CPU cores that we have added to the cluster.
Posted 8 months ago. Feb 28, 2017 - 14:23 PST
AWS has fixed S3, and our API has been processing normally since approximately 1:53 PM PST. We anticipate a very large backlog of work, so turnaround times are expected to be longer than usual.
Posted 8 months ago. Feb 28, 2017 - 14:04 PST
AWS says that they have begun recovering S3 functionality, though we have not yet seen meaningful improvements.
Posted 8 months ago. Feb 28, 2017 - 12:58 PST
Amazon has identified the root cause of their various problems today, including the massive S3 failure. They do not yet have a fix, however.
Posted 8 months ago. Feb 28, 2017 - 11:48 PST
The S3 outage is preventing us from using buckets in an alternative location. We, along with half of the Internet, are anxiously awaiting a fix from Amazon.
Posted 8 months ago. Feb 28, 2017 - 10:29 PST
We are preparing to switch to an alternative set of S3 buckets if the Amazon problem continues. Such a switchover would cause the loss of in-flight work, and would result in work being "split" across multiple sets of buckets, requiring a post-incident migration, so we will only take this step if absolutely necessary. Please note that at the current time S3 is not allowing the creation of new buckets, so there is no guarantee that this recovery step is even possible.
Posted 8 months ago. Feb 28, 2017 - 10:13 PST
UPDATE: Amazon is experiencing very serious problems with S3. We believe that this is the root cause of the remaining issues with the V1 and V2 APIs.
Posted 8 months ago. Feb 28, 2017 - 10:05 PST
WAF has recovered. GET requests are processing normally. POST requests to the V1 and V2 APIs that upload new work for processing are still failing due to the Amazon S3 outage.
Posted 8 months ago. Feb 28, 2017 - 10:02 PST
WAF is recovering, but POST requests that submit new work are still failing.
Posted 8 months ago. Feb 28, 2017 - 09:58 PST
Most types of API requests appear to be unaffected. The impact is mostly limited to API requests that submit new audio files for processing.
Posted 8 months ago. Feb 28, 2017 - 09:57 PST
We are seeing failures in both the V1 and V2 APIs. Work accepted for processing will be processed normally. We are working to recover the cluster.
UPDATE: We believe that AWS S3 issues are the root cause of the outage.