For those of you who may of been in the US or outside of Europe this past weekend, you may not have heard about the major British Airways IT outage, that took down their entire operations for most of Saturday and into Sunday. Rumors, which were later confirmed, were that a switch from primary to backup power at their primary data centre (they’re a UK company, so I’ll spell it in the Queen’s English), lead to a complete operations failure. I have a bit of inside information, since my darling wife was stuck inside of Terminal 5 at Heathrow.
There’s a requirement for planes that travel across oceans call ETOPs, which stands for Extended Range Operation with Two-Engine Airplanes, however in parlance is know as Engines Turn or Passengers Swim. This protocol and requirements are a set of rules that ensure if a plane has a problem over a body of water, it can make it back to shore for a safe landing. As someone who flies across oceans a decent amount, I am very happy the regulatory bodies have these rules in place.
However, there are no such rules for data centers that run airline operations. In fact, in January, Delta Airlines had a major failure which took down most of its operations for a couple of days. Most IT experts have surmised that Delta was running a single data center for it’s operations. Based on the evidence from Saturday’s incident with BA, I have to assume that they are, as well. One key bit of evidence, was that BA employees were unable to access email. They are an Office 365 customer, so theoretically, even if on-premises systems were down e-mail should work. However, if they were using Active Directory Federation Services, so that all of their passwords were stored on-prem, then the data center being down, would mean they couldn’t authenticate, and therefore would not have email.
This was my biggest clue that BA was running with a single data center—was that email didn’while some systems, particularly some of the mainframe systems that may handle flight operations, have a tendency to not do well with failover across sites, Active Directory is one of the best distributed systems there is, and is extremely resilient to failures. In fact, given BA’s global business, I’m really surprised they didn’t have ADFS servers in locations around the world.
Enter the Cloud
Denny and I sat talking yesterday and running some numbers on what we thought a second data center would cost a company like BA. Our rough estimate (and this is very rough) was around $30-40 million USD. While that is a ton of money, it is estimated that weekend’s mess may cost BA up to £150 million. However, companies no longer have to build multiple data centers in order to have redundancy, as Microsoft (and Amazon, and Google) have data centers throughout the world. The cloud gives you the flexibility to protect critical systems, and at a much cheaper cost. I’ve designed DR strategies for small firms that cost under $100/month, and I’ve had real-time failover that supported 99.99% uptime. With the resources of a firm like BA, this should be a no-brainer given the risk profile.
What About Outsourcing?
Much has been made of the fact that BA has outsourced much of its IT functions to TCS and various other providers. Some have even tried to place blame on the providers for this outage. Frankly, I don’t have enough detail to blame anyone, and it seems more like the data center operator’s issue. However, I do think it speaks to the lack of attention and resources paid to technology at a company that clearly depends on it heavily. Computers and data are more important to business now than ever, and if your firm doesn’t value that, you are going to have problems down the road.
In the cloud era, I’m convinced no business, no matter how big or small should run with a single data center. It is way too cheap and easy to ship your backups to multiple sites, and be online in a matter of hours with a cloud provider. Given the importance and consolidation of airlines to our world economy, it probably wouldn’t be a terrible idea if their regulators created regulations requiring failover and failover testing. Don’t let this happen to your stock price.
British Airways owner IAG drops 4% after bank holiday chaos https://t.co/CHCVqtxMET
— Financial Times (@FinancialTimes) May 30, 2017