There we were at the Oakland airport last Thursday, having dutifully arrived at 5 am for our 6 am Southwest Airlines flight to Salt Lake City. I had heard about the computer “glitch” that Southwest had suffered the day before, and saw the long lines and bleary-eyed people when we arrived at the airport, but I was still optimistic that our plane would take off on time. No such luck.
First came the announcement of a delay, then a cancellation because “there was only one pilot available” and two are needed to fly a plane.
So we dejectedly trudged back home. In further reading about this Southwest debacle, I realized we were among the lucky ones. We could just go home; others were stuck in airports far from their homes or destinations, and still others missed important family and business events.
All in all, Southwest ended up cancelling some 2,300 flights and delaying thousands more over a five-day period. At my guess of 100 to 150 people per flight, the cancellations alone affected roughly a quarter of a million people or more, and with the delays the total was probably closer to half a million, not to mention the residual impacts on people expecting the arrival of these passengers, hotel bookings, rental cars, and much more.
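As a rough check on that estimate, here is a minimal sketch of the arithmetic in Python. It covers only the roughly 2,300 cancelled flights and my guessed range of 100 to 150 passengers per flight; the variable names are purely illustrative, and the delayed flights are left out because I do not have a count for them.

```python
# Back-of-the-envelope estimate of passengers on cancelled flights only.
# Both inputs are rough: about 2,300 cancellations and a guess of
# 100 to 150 people per flight. Delayed flights are not included.
CANCELLED_FLIGHTS = 2300
PASSENGERS_PER_FLIGHT_LOW = 100
PASSENGERS_PER_FLIGHT_HIGH = 150

low_estimate = CANCELLED_FLIGHTS * PASSENGERS_PER_FLIGHT_LOW
high_estimate = CANCELLED_FLIGHTS * PASSENGERS_PER_FLIGHT_HIGH

print(f"Cancelled flights alone: {low_estimate:,} to {high_estimate:,} passengers")
# prints "Cancelled flights alone: 230,000 to 345,000 passengers"
```

So the cancellations by themselves come to somewhere between a quarter and a third of a million people; it is the thousands of delayed flights that would push the total toward half a million.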
Southwest’s Explanation – Router Failure
Southwest Airlines CEO Gary Kelly estimated the “glitch” will result in a loss of $10 million to the airline. The cost in disrupted plans, cancelled events and bookings, the stress and consternation of travelers, and other ancillary factors is much harder to calculate, but it certainly far exceeds the costs Southwest will incur.
What was Southwest’s explanation? According to an article I read, the CEO blamed a router:
Kelly said legacy technology used for a router failed and that backup systems did not overcome the problem as expected. He compared it to having a power failure, but then having the backup generator fail too.
“We do have significant redundancies built into our mission-critical systems. Those redundancies did not work,” Kelly said. “We need to understand why and make sure that that doesn’t happen again.”
Note to Mr. Kelly: “You think?”
Theories on Critical System Failure
This “Southwest Summer Screwup” brought up theories I studied years ago on Critical System Failures – how one small event can cause a complex system to catastrophically fail. There are many articles on this online, as it is obviously an area of major concern, especially to IT organizations. One I found recently was from IT Skeptic: “Great paper on failure of complex systems”, where he references work – both a paper and a video – by Richard Cook, MD.
If you are in IT, the paper and video are well worth the time. Here is one quote I found pertinent:
The surprise is not that there are so many accidents [in complex systems]. The surprise is that there are so few. … Richard Cook
Dr. Cook then goes on to explain that IT systems are usually designed for reliability when they should be designed for resilience. That is, instead of planning for an imagined ideal, static state of an IT network, designers need to build in resilience and adaptability for a real-world, dynamic system and the many perturbations and disruptions that will inevitably occur.
In other words, if I may borrow from Dr. Cook’s presentation:
Crow Canyon SharePoint and Office 365 ITSM Applications
Here at Crow Canyon, we develop ITSM (IT Service Management) systems incorporating processes for identifying root causes, escalating incidents to problems, enacting change management, tracking asset and equipment performance and maintenance, and much more. These considerations are integral parts of our application design and custom development work. We follow the ITIL standards while at the same time adapting to how the real world operates and what is truly workable and usable by an IT team.
With Southwest’s epic fail in our recent memory, we are again reminded of just how important preparation for system failure and disaster recovery is in any IT organization. We will explore this topic in more detail in future articles in this blog, as well as, of course, continue to incorporate these strategies into our applications.
[For the record, as for the flight disruptions, one of our group rebooked on Amtrak, and a friend and I rebooked to flights in September. Southwest issued a 50% off coupon for our next flights and promised to refund all the airfare to us. Small compensation, but appreciated.]
— by Scott Restivo, CEO, Crow Canyon Systems