Incident Response

Feb 21, 2024

When technology workers talk about incident response, we are talking about a specific group of practices, which recur across many organizations. When the organization encounters some excursion from its normal state, its staff declare an incident to deliver an immediate response. Incident work follows a recognizable stereotype: it interrupts normal flows of work, people are re-tasked (or expected to re-task themselves) to resolve incidents as soon as possible, and staff are frequently also tasked with reviewing the incident after the fact in the hopes of identifying why it happened and what to do about it longer term.

In organizations that do some variety of incident response, incident work has effectively unlimited priority. Someone working on an active incident is allowed, and even expected, to drop nearly any other commitment, is justified in interrupting any conversation (within the community of people who are accountable for incidents, anyways), and is generally allowed to recruit peers into the incident at their discretion, granting them the same privileges in the process. The only thing that is more important than an incident is, in general, another incident.

Incident work is also, frequently, the only time business stakeholders and technical staff are on even remotely the same page about what must be done and when. Incidents of sufficient seriousness regularly pull in stakeholders from across the organization, giving technical staff direct, immediate, and often minimally-supervised access to the attention of senior executives, lawyers, customer service leadership, and other organizational functions that are normally insulated from technical departments through layers of business functionaries. While those groups may speak different languages, the question “what do you need to resolve this” allows them to communicate clearly and directly in the moment.

Nearly universally, organizations that include incident work in their staff expectations also expect staff to take time to learn or be trained on their approach to handling incidents. At the same time, incident response work is common enough to technical organizations that there are courses you can take to learn how to do it, which purport to apply to broad categories of organizations. Equally, many organizations I’ve worked with have tried to organize incident training for their staff, in the hopes of improving the outcomes of incident work for the organization; clearly, those organizations, at least, have seen value in the idea that these skills are broadly transferable.

What gets to be “an incident” is, probably, the one thing I’ve seen organizations disagree on. The dynamics above mean that incident work is both an opportunity to get urgent work done when it might otherwise not be prioritized, and a way for the people doing that work to get recognition for exceptional contributions (since incidents are, almost definitionally, viewed as exceptional circumstances). As a result, everyone wants to be sure that something they feel is urgent is recognizable to the organization through the lens of “this is an incident.” And, therefore, what constitutes an incident can range from anything as business-impacting as payments not being processed or confidential information being shared inappropriately, down to functional failures such as image previews not showing up when expected in a chat application, and often below that even down to purely technical excursions, defined more by some metric being outside of the developers’ expectations than by any outwardly-visible change in the organization’s behaviour.

Organizations I’ve worked with split this difference by assigning these categories a “severity,” with the things the execs or board members care about being given higher severity than the things that line workers get alarmed by; this is a practice common enough that you can make many technical workers laugh nervously by joking that the true measure of an incident’s severity is the number of rules you break working on it. What we’re really talking about, though, is the reality that the organization is suddenly able to agree on the importance of a task, at any level of severity, once that task becomes an incident - which can only be true if the organization is not capable of agreeing on the importance of tasks outside of incidents, or at least, not capable of doing so for long enough to action those agreements.

I contend that incident response represents a breakdown in the organization’s ability to effectively prioritize work and to hold itself accountable for completing work. Operational, legal, or professional excursions from the norm will happen no matter how prepared an organization is or how careful it is in its work, and those excursions will inevitably force the organization to respond to them - potentially quickly. How an organization responds to those excursions, and how it structures that work, says a lot about how much faith it has in its routine processes. Having a separate “incident response” mode for work means that the organization is unable to trust that work in the incident category, generally work which is both urgent and important, will actually be done on an urgent basis.

Furthermore, I contend that we, as technology professionals, cultivate this breakdown, because incident work is one of the few ways we feel we are able to exert agency over the actions of the organization. By play-acting as fire-fighters or air traffic controllers, we gain the temporary authority to Do The Right Thing, and to be seen doing it by people with the authority to give us raises and promotions, when the routine processes of the organization do not provide those opportunities. We sabotage ourselves, in so doing; by segregating the politically and personally important work to exceptional processes, we create the justification for excluding that work from the routine process.