US-East instance unplanned downtime

Incident Report for VoiceBase

Resolved

We sincerely apologize for the downtime in our US-East instance. The first signs of a problem appeared at 12:38am PST, and the issue was completely resolved by 2:57am PST.

If you would like to be insulated from single-instance issues like this, as uncommon as they are, please consider updating your API code to utilize multiple VoiceBase instances.
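As a rough illustration of the multi-instance approach suggested above, the following Python sketch tries a list of instances in order and falls back to the next one when a submission fails. It is a minimal sketch under stated assumptions, not VoiceBase client code: the helper name `submit_with_failover`, the `submit` callback, and the placeholder hostnames are all hypothetical, and the actual request logic is left to the caller.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllInstancesFailed(Exception):
    """Raised when every configured instance failed; carries the per-instance errors."""


def submit_with_failover(instances: Sequence[str], submit: Callable[[str], T]) -> T:
    """Call submit(base_url) for each instance in order and return the first success.

    `submit` is a caller-supplied function (hypothetical here) that sends one
    job to the given base URL and raises an exception on failure.
    """
    errors: List[Tuple[str, Exception]] = []
    for base_url in instances:
        try:
            return submit(base_url)
        except Exception as exc:  # in real code, narrow this to network/HTTP errors
            errors.append((base_url, exc))
    raise AllInstancesFailed(errors)


# Placeholder hostnames for illustration only; substitute the real instance URLs.
INSTANCES = [
    "https://us-east.example.invalid",
    "https://us-west.example.invalid",
]
```

Whether a job can safely be resubmitted to a second instance depends on the application; this sketch does not deduplicate jobs or retry against the same instance.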

Posted Mar 05, 2018 - 03:10 PST

Monitoring

We have completed the process of migrating the US-East instance to the DR queueing cluster.
At this time the system is recovering, a process that should take 5-10 minutes. There was no loss of work.

Posted Mar 05, 2018 - 02:50 PST

Update

We are in the process of manually switching the US-East instance from its primary queueing cluster to a standby disaster recovery cluster. The recovery hardware is fully up to date with the primary cluster, so we do not expect any message loss.

Posted Mar 05, 2018 - 01:43 PST

Identified

We have identified a routing problem, apparently the result of a hardware failure, that has caused a loss of connectivity to a key part of our messaging system in the US-East instance. The system is fully redundant but did not automatically reconfigure around the problem. We are working to recover connectivity.

Posted Mar 05, 2018 - 01:35 PST

Investigating

The VoiceBase US-East instance is not properly processing new jobs due to multiple suspected failures in its orchestration cluster. Customers may wish to move their jobs to another VoiceBase instance until the problem is resolved.

Posted Mar 05, 2018 - 00:42 PST

This incident affected: VoiceBase US-East (Processing Cluster - US).