The perfect storm: Multiple mishaps lead to disaster

A chain of software errors and lapses in judgment can produce a situation that means the difference between life and death for customers. Test expert Chris McMahon shares lessons learned from experiencing a disaster in an organization that tested 911 location handling software.

It happened more than a decade ago. The company involved no longer exists in any recognizable form, nor does the code base. Even so, I am going to gloss over a few details, because telling this story still makes me a little nervous after all this time. This software disaster taught me some lessons that, hopefully, I can impart to you before your organization experiences a perfect storm of its own.

Dealing with life-or-death software
When I began my software career, I worked testing 911 location information handling. If you are choking, and you have a land line, and you call 911, the emergency dispatcher has your address, and even though you are choking and cannot speak, the dispatcher can send help to where you live because of the software that I was testing. That was good software, a good code base made by a good team. But one time something went terribly wrong.

In the US, as towns and cities grow and demographics change, it is not uncommon for the phone system to add area codes. Users are migrated to the new area code over time, so one day your phone number might be 111-555-1212, and then on a certain date your number changes to 222-555-1212. For safety reasons, the 911 system keeps the old numbers around just in case some stragglers do not get the new area code in a timely fashion. But eventually, the old numbers are deleted from the system for maintenance and performance reasons.

At one point my company was doing a routine deletion of 911 information for phone numbers made obsolete by area code splits. We would run a routine report to generate a list of these numbers. One time the running of this report happened to be the responsibility of someone fairly new to the company, who was not very experienced. She ran the report, but she gave it the wrong parameters, and the list generated by the report contained both the old 911 information and also the new, valid 911 information. She turned over the report to our system administrators and thought nothing more of it.

Check into anomalies
When the report reached our sysadmins, one of the senior sysadmins immediately noticed that there were far too many phone numbers in the list. He raised an objection, sending the report back to our business people and noting that the file of numbers was far too large and that something must have gone wrong in the running of the report. What happened next was our first mistake. Instead of investigating the contents of the faulty report, the business manager defended the inexperienced person who had generated it. After some fairly heated back-and-forth conversation, our senior sysadmin was essentially ordered to delete all of the numbers in the report regardless. Over strenuous objection, he did exactly that.

The report ran too long. After far too long a time, we as a company came to the realization that we had deleted nearly all of the 911 location information for an entire midwestern US state.
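
The senior sysadmin's objection was, in effect, a manual sanity check on the size of the deletion list. A check like that can be automated so that an oversized list stops the job instead of starting an argument. The following is only a minimal sketch, not anything the company actually ran: the function name, the `expected_count` estimate, and the 10 percent tolerance are all assumptions made for illustration.

```python
class DeletionListTooLarge(Exception):
    """Raised when a deletion list is bigger than anyone expected."""


def check_deletion_list(numbers, expected_count, tolerance=0.10):
    """Return the list unchanged if its size is plausible, otherwise raise.

    expected_count is a hypothetical estimate of how many numbers the
    area code split should have made obsolete. tolerance is the fraction
    by which the list may exceed that estimate before the job is refused.
    """
    limit = int(expected_count * (1 + tolerance))
    if len(numbers) > limit:
        raise DeletionListTooLarge(
            f"list has {len(numbers)} numbers, expected at most {limit}; "
            "investigate the report before deleting anything"
        )
    return numbers
```

The point of raising an exception rather than printing a warning is that the deletion cannot proceed until someone has looked at the report and either fixed it or consciously raised the limit.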

Test your recovery programs
But in the world of 911 location information, everything is recorded and backed up with multiple layers of redundancy. The software that deleted these numbers made copies of every record, so that in case something went wrong, all the deleted records could be re-loaded into the system. The deletion report algorithm went:

  • Read record to be deleted.
  • Read corresponding system record.
  • Write copy of system record to file.
  • Delete system record.
  • Read next record to be deleted, and repeat.
And this was our second mistake. The program that deleted the information from the system had been written by our sysadmin staff, not our development staff, and had never been through our formal validation and testing process. On this particular operating system, the file size must be specified when a file is created. Once the file is full, any subsequent attempt to write to it fails, and the operating system returns an error. Our delete program, having never been tested, ignored that error. So in the case of our perfect storm, what actually happened was:

  • Read record to be deleted.
  • Read corresponding system record.
  • Attempt to write copy of system record to file.
  • Delete system record.
  • Read next record to be deleted, and repeat.
So we had almost an entire US state's worth of 911 information deleted and only a few backup records had actually been created for all of the deleted records.
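
The difference between the two lists above is a single unchecked error. The sketch below shows one way a delete loop can refuse to delete a record whose backup copy was not written. It is an illustration only: the original program, its language, and its operating system are not described in enough detail to reproduce, so `FixedSizeBackup`, `delete_with_backup`, and the dictionary standing in for the 911 system are all hypothetical.

```python
class FixedSizeBackup:
    """Hypothetical stand-in for a preallocated, fixed-size backup file.

    Once capacity is reached, every further write raises OSError, which
    mimics the operating system returning an error for a full file.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []

    def write(self, record):
        if len(self.records) >= self.capacity:
            raise OSError("backup file is full")
        self.records.append(record)


def delete_with_backup(to_delete, system, backup):
    """Delete records from system only after each one is backed up.

    system is a dict mapping phone number to location record. If the
    backup write fails, the exception propagates and the loop stops
    before the corresponding system record is deleted.
    """
    deleted = 0
    for number in to_delete:
        record = system.get(number)
        if record is None:
            continue  # nothing to back up or delete for this number
        backup.write((number, record))  # must succeed before deleting
        del system[number]
        deleted += 1
    return deleted
```

With a backup that holds two records and a list of three numbers to delete, the loop stops at the third write, and the third record is still in the system. A test that fills the backup file on purpose is exactly the kind of case a formal validation process would be expected to include.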

Have a disaster recovery plan in place
But again, in the 911 business, there is always another copy. We had a very recent system backup that contained all of the deleted records. The problem was that we had this data in Colorado and we needed to deliver it to a midwestern state in order to update the physical phone switches in that state. And we had deleted so much data that it was physically impossible to transfer that data over the network in less than something like two weeks. At one point we were considering making a tape backup of the data and having someone get on an airplane with the physical tapes in hand. Ultimately one of our very experienced developers who was an expert at this operating system devised a customized compression scheme that allowed us to transfer the missing data to where it was needed, but that also took time.

While all this was going on, everyone in the company was simply hanging on, hoping and wishing for no major disasters in this particular midwestern state. 911 dispatchers could handle something like a warehouse fire, something highly localized, but if there had been a disaster covering a large area, say a major flood or a release of deadly gas, many people would have been injured or killed because the 911 system would have been overwhelmed with no automated way to discover the locations of thousands of 911 calls. Thankfully, everything remained calm while we worked to restore the information we had deleted. It was a tense time.

Lessons learned
I learned two very simple, but critically important lessons from this experience.

First, no process is perfect. Test your process. If something feels wrong, it might actually be wrong, and someone could get hurt because of it. Second, test your code. No testing process is perfect, but as long as a testing process exists, it is worth sending all of the code through that process, regardless of how seemingly trivial or routine that code might be.
