
67 patterns · cross-incident

Patterns

Incidents as isolated narratives are interesting. Incidents read in aggregate are instructive. These essays follow a single failure shape across industries — showing that the same dynamics recur whether the substrate is code, capital, cargo, or cells.

Pattern · 14 incidents

Retry storms across industries

software · capacity · emergent

A component fails. Its clients retry. The retries concentrate load at the worst possible moment — when the component is already struggling or just recovering. The system cannot distinguish legitimate load from amplified retries. Recovery becomes impossible without explicit load-shedding.
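The standard client-side mitigation for this dynamic is capped exponential backoff with jitter: spacing retries out over time so that failed clients neither synchronize nor pile load onto a recovering component. A minimal sketch, assuming a hypothetical `request` callable that raises on failure (the function and parameter names are illustrative, not drawn from any incident here):

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base=0.1, cap=10.0):
    """Retry `request` with capped exponential backoff and full jitter.

    The backoff window doubles each attempt (base, 2*base, 4*base, ...)
    up to `cap`; sleeping a uniform random fraction of that window
    ("full jitter") keeps many failing clients from retrying in
    lockstep, which is what turns a failure into a retry storm.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the failure
            window = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, window))
```

Backoff alone only slows the amplification; as the pattern notes, the recovering component still needs explicit load-shedding to tell amplified retries apart from legitimate demand.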

This dynamic appears in DynamoDB cascades, in the 2010 Flash Crash's liquidity withdrawal, in the 2008 CME pricing storm, in bank settlement systems during stress, and in airline rebooking during weather events. The substrate changes; the shape does not.

Read the full pattern →

Pattern · 9 incidents

When the monitoring system fails with the system

design · emergent

Observability tools are supposed to be the pair of eyes that sees the outage. In practice, they often share infrastructure with the very systems they observe. The dashboard that would tell you the site is down is running on the site.

This failure pattern appears in at least nine of our analyzed incidents, including DynamoDB 2025, Facebook 2021, and the 2020 Kinesis outage. The second-order consequence is that internal engineering coordination is forced onto whatever out-of-band channel still works — Slack, Teams, in one case a decade-old IRC bot that nobody had retired.

Read the full pattern →

Pattern · 11 incidents

Normalization of deviance, in production

human · process

Diane Vaughan described this for the Challenger disaster. It generalizes: small violations of a standard, accepted without consequence, become the new standard. Over time, the organization is operating outside its own defined safety envelope and does not know it.

Challenger, Columbia, 737 MAX, Therac-25, the 2023 SVB incident, and several hospital medication-error cases share this structure. The pattern is organizationally diagnostic — when it is present in one system, it is often present elsewhere in the same organization.

Read the full pattern →

Pattern · 7 incidents

The recovery path needs the system

design

The tool used to restore service depends on the service being up. This happened at Facebook in 2021 (the badge-reader auth tool used to enter the data center required the DNS the outage had taken down) and a milder version happens almost every cloud outage. Any system whose recovery requires itself is a fragility in drag.

Read the full pattern →

Pattern · 6 incidents

Authority gradient and the first officer who would not speak

human · process

Tenerife. Air France 447. A subset of medical-error cases. When a junior-ranked operator sees something wrong and the senior is confident, hierarchy suppresses the correction. Every industry with a cockpit-shaped decision structure has this pattern and has spent decades trying to engineer it out of the room.

Read the full pattern →

Pattern · 8 incidents

Cascading dependency in unrelated domains

emergent · external

The entity experiencing the failure is several layers removed from the entity that caused it. MTA card readers on AWS via a vendor. Hospitals on Windows via CrowdStrike via Microsoft. A small library fault propagates through the dependency graph until the spectator is the victim. Software supply chain, financial correspondent banking, and logistics all show this shape.

Read the full pattern →