We are calling the issue resolved after seeing consistent throughput and no errors for almost 30 minutes. The V1 API, which has a simpler scheduling algorithm than V2, has a very large backlog of work that could take several hours to burn down, even with the several thousand extra CPU cores that we have added to the cluster.
Posted over 1 year ago. Feb 28, 2017 - 14:23 PST
AWS has fixed S3, and our API is, as of approximately 1:53PM PST, processing normally. We anticipate a very large backlog of work, so turnaround times are expected to be longer than usual.
Posted over 1 year ago. Feb 28, 2017 - 14:04 PST
AWS says that they have begun recovering S3 functionality, though we have not yet seen meaningful improvements.
Posted over 1 year ago. Feb 28, 2017 - 12:58 PST
Amazon has identified the root cause of their various problems today, including the massive S3 failure. They do not yet have a fix, however.
Posted over 1 year ago. Feb 28, 2017 - 11:48 PST
The S3 outage is preventing us from using buckets in an alternative location. We, along with half of the Internet, are anxiously awaiting a fix from Amazon.
Posted over 1 year ago. Feb 28, 2017 - 10:29 PST
We are preparing to switch to an alternative set of S3 buckets if the Amazon problem continues. Such a switchover would cause the loss of in-flight work and would result in work being "split" across multiple sets of buckets, requiring a post-incident migration, so we will only take this step if absolutely necessary. Please note that S3 is currently not allowing the creation of new buckets, so there is no guarantee that this recovery step is even possible.
Posted over 1 year ago. Feb 28, 2017 - 10:13 PST
UPDATE: Amazon is experiencing very serious problems with S3. We believe that this is the root cause of the remaining issues with the V1 and V2 APIs.
Posted over 1 year ago. Feb 28, 2017 - 10:05 PST
WAF has recovered. GET requests are processing normally. POST requests to the V1 and V2 APIs that upload new work for processing are still failing due to the Amazon S3 outage.
Posted over 1 year ago. Feb 28, 2017 - 10:02 PST
WAF is recovering, but POST requests that submit new work are still failing.
Posted over 1 year ago. Feb 28, 2017 - 09:58 PST
Most types of API requests appear to be unaffected. The impact is mostly limited to API requests that submit new audio files for processing.
Posted over 1 year ago. Feb 28, 2017 - 09:57 PST
We are seeing failures in both the V1 and V2 APIs. Work accepted for processing will be processed normally. We are working to recover the cluster.
UPDATE: We believe that AWS S3 issues are the root cause of the outage.