Have you ever pondered the history of incident management?
If you work in SRE, you might be so preoccupied with the day-to-day work of maintaining availability and responding to problems that you never have time to consider how your role and responsibilities evolved. That’s a shame, because SREs didn’t invent incident management ideas and practices on their own.
On the contrary, the way SREs conceive of incident response, create incident management teams and rank incidents’ urgency is informed by decades-old incident management techniques. To fully grasp what it means to be an SRE nowadays, you must appreciate this long history of crisis management and resolution.
Let’s look at how incident response has evolved over time and where modern concepts originated.
Historical Challenges in Incident Management
There have always been calamities in society, of course. Fires, floods, infrastructure failures, and other disasters have occurred for ages.
For the most part, humans have had a difficult time dealing with these sorts of events for the past several millennia. Response operations were ad hoc and largely ineffective owing to their lack of coordination and planning.
The difficulties that we faced included:
- Poor information sharing. It was difficult to give everyone involved adequate information and to keep all stakeholders aware of what was going on at all times.
- Responders were unable to determine who was in command because differing management structures made it difficult to identify leaders, unify response efforts, and distribute work.
- Ineffective crisis management.
- No consistent way to prioritize incidents; priorities were assessed in a variety of ways.
Historically, if a problem required only one small group’s attention, organizations might have been able to manage it. However, the more stakeholders were involved, the more difficult it was to respond promptly and effectively.
The Birth of ICS: Putting Out Fires
When stakeholders began to consider innovative ways to put out fires—literally—things started to improve.
In the 1960s, fire chiefs in California began to realize that they were finding it increasingly difficult to deal with the devastating blazes that burned every summer. Each year brought bigger fires, more property incinerated, and more buildings destroyed. The Laguna fire of 1970 was the straw that broke the camel’s back, prompting a paradigm shift in incident response for fire agencies.
The fire chiefs discovered that the problem was not a lack of tools or personnel. It was a failure to coordinate the responses of the multiple firefighting organizations that arrived on the scene. The agencies had difficulty deploying their assets promptly and effectively because the chain of command was unclear and their approach to fighting fires was unsystematic.
California fire chiefs addressed the problem by developing the Incident Command System, or ICS, in the 1970s. The ICS established a command hierarchy for incidents, with an incident commander at the top. It also defined distinct functional areas of incident response, such as operations, planning, logistics, and finance. Finally, it established a common set of terms that stakeholders could use to describe their actions throughout incident response, making it simpler to communicate clearly.
While the ICS was designed to combat fires, it has gradually become the de facto standard for all types of incident response plans.
From ICS to NIMS: The Evolution of Incident Management in the United States
The history of incident response doesn’t end with ICS. A new chapter opened in the early 2000s, when the federal government established a broader approach to incident management known as the National Incident Management System, or NIMS.
NIMS was created in the wake of the September 11, 2001, terrorist attacks, which highlighted the need for efficient communication both between agencies of the same kind (such as fire departments) and between completely separate organizations. NIMS built on ICS concepts.
In addition to adopting most of the incident command principles and procedures outlined in the ICS, NIMS specified standards for coordinating the provision of resources. It also adopted the concept of an emergency operations center, which is comparable to a network operations center in the digital world.
NIMS was intended as a framework for ensuring that all critical personnel can mount the appropriate level of response in an emergency. It also included fourteen management principles, comparable to compliance controls, which organizations must implement in order to respond following the NIMS approach.
Incident Management Today
Obviously, dealing with forest fires or terrorist attacks is very different from solving data center problems or fixing a faulty application deployment. ICS and NIMS were not created with site reliability engineering or IT teams in mind.
Despite this, the impact of ICS and NIMS on SREs’ thinking is evident. The term “incident commander” is derived from these frameworks. So are ideas like shared responsibility for incident response processes and the need to include all stakeholders, not just technical staff, in incident response.
Although ICS and NIMS are not terms that many SREs are likely to be acquainted with, they should be. These frameworks are the historical sources of today’s incident management ideas, and they offer valuable lessons for any SRE on the job today.