It’s hard to write a post which talks about any virtues associated with downtime but, in the wake of last night’s Azure UK outage, we had some small sense of satisfaction that Seamless behaved both as we designed and would’ve hoped. Connectivity to the UK South Azure data centre was lost between about 22:30 and 01:40.
We are aware of the issues connecting to services in UK South, please refer to https://t.co/Dw19fIGsXf for updates.
— Azure Support (@AzureSupport) July 20, 2017
Accepting that we weren’t going to fix Azure’s connectivity issues, where do we see the positives for Seamless?
1. Our alerting let us know it was happening (before Microsoft confirmed any issues)
The Seamless monitoring alerted us to the first connectivity alert at approx. 19:43. This seemed to be an isolated alert, however we received a second connectivity alert at 22:30 and they began to be raised regularly at that time. We then began our own investigation and the Azure service page confirmed a connectivity issue at approx. 23:30. Shortly after that, we issued our alert to the Seamless user community.
2. When Azure connectivity returned, Seamless shrugged its shoulders and carried on working
Bearing in mind that connectivity was the primary issue, the service itself actually continued running. As soon as connectivity was restored, data continued sync’ing with no intervention from us (or, of course, our customers). Any data that had not been collected at source would’ve then been collected and posted to target systems. Any data “in transit” when the connectivity issues started was processed and sync’ed to relevant target(s).
3. We had mitigation actions ready in case we needed to support customers
Our mitigation options are typically to redeploy the service and the config to other Azure locations (subject to relevant instructions from our customers bearing in mind potential data considerations). In the event that the UK South region was not restored by the start of the business day, we were in a position to relocate the service and client configurations to Western Europe (Dublin). This would’ve taken us less than an hour and no data would’ve been lost in this process.
As Microsoft restored the Azure service around 01:40, this was not necessary.
No-one likes downtime and we recognise that we’re lucky that these 3 hours were outside of core business hours for our current customer base. In practice, we continue to believe that a Microsoft data centre is more secure and robust than were we to build our own. In this light, we architected the Seamless service to “fail elegantly” and although disappointed this had to be tested in anger, we’re delighted that it worked as we would’ve expected it to.