IRROPS Recovery: The Operating System That Keeps a Bad Day From Becoming a Bad Week

This article focuses on aviation, but the recovery system described here is transferable to any complex operation where disruption triggers cross-functional decisions, customer impact, and local execution under pressure.

Most companies think the disruption itself—the snowstorm, the power outage, the air traffic control glitch—is the problem. But disruption is normal. In aviation, the real problem is what happens after the disruption starts. It’s the slow decisions, conflicting priorities, inconsistent messaging, and local teams stuck waiting for permission while the system melts down.

The industry often relies on heroics, but playing the hero isn’t a strategy. The goal of this article is to show you how to install a repeatable recovery operating system so your teams don’t need superpowers to stabilize the operation.

Define The Problem In Practical Terms

IRROPS is just a shorthand for one reality: the day stops behaving like the schedule. Weather, mechanical issues, staffing gaps, air traffic constraints, airport congestion, vendor failures, IT outages. Any of these can break the plan.

Normal operations are mostly execution. A plan exists, capacity is predictable, and the system can absorb small misses without major consequences. Recovery is different. Recovery is decision-making under constraint, and constraints don't wait for alignment. That’s why “managing IRROPS” isn't the same as “managing delays.” You're not managing flights. You're managing:

  • Constraints: Physical and legal limits you cannot break.

  • Option collapse: Every passing minute of inaction reduces the set of feasible moves.

  • Broken coordination: The chaos that happens when the station, the crew, and the customer channels contradict each other.

Why A Bad Day Turns Into A Bad Week

A single storm shouldn’t ruin an airline's schedule for three days. When it does, it’s usually because of a specific chain reaction that creates a margin death spiral:

  • An early disruption hits. A flight is delayed by weather.

  • Decisions lag. The operations center adopts a wait-and-see posture, hoping the weather clears before it has to commit.

  • Crews time out. Pilots and flight attendants have strict legal limits on how long they can work. A short delay pushes them over their limit, and suddenly a plane with no mechanical issues can't fly because the crew is illegal.

  • Rebooking options shrink. Because planes are flying 85% full, there are no empty seats to move passengers into. The rebooking math stops working.

  • Stations get overloaded. Airport staff get overwhelmed by angry crowds and start improvising solutions that may or may not match company policy.

  • Tomorrow’s schedule breaks. Because crews and planes are in the wrong cities tonight, they can’t start tomorrow's schedule, causing a knock-on effect that ruins the next day, too.

Recovery isn’t one big decision. It’s a sequence of small decisions that either contain the blast radius or let it spread.

The Recovery Bottlenecks That Decide Outcomes

When an IRROPS event gets worse than it needed to be, it almost always runs through the same handful of failure points. These are the bottlenecks you should look for in your last major disruption:

1. Decision Rights (Who decides what)

When it’s unclear who has the authority to cancel a flight or swap a plane, everything becomes a debate. In a crisis, decision thrash—where too many people have veto power—kills speed. If you have to call a meeting to decide, you’re already too slow.

This isn't an airline-only problem. I’ve seen the same failure mode in a completely different environment: a brand-new team with two standing meetings and no real operating system. Approvals bounced through email chains and pings, cross-functional work stalled, and the default “solution” became adding more meetings and more “quick syncs.” The team didn't lack talent. It lacked decision rights and a clear escalation path.

In that role, I started by mapping reality, then turned chaos into a one-page operating structure: team charter, owner list, and escalation rules that stopped approvals from pinballing around the organization. Within a few months, work and decisions moved faster because authority stopped being ambiguous.

2. Communications Workflow (One story, one rhythm)

If internal communication is bad, the game of telephone begins. The operations center tells the station one thing, but the crew hears another. This comms drift confuses staff and destroys passenger trust.

3. Capacity Constraints (Seats + crew clocks)

Rebooking is a math problem. If you cancel a flight with 180 people and the next flights are already near full, it can take multiple departures to absorb that demand. When leaders don’t check feasibility early, the backlog becomes physical, not emotional.

Crew legality is part of the same feasibility reality. A delay can quietly turn into a legality failure that forces an outcome nobody planned for. Recovery systems that ignore legality until late tend to “discover” cancellations after options have already collapsed.
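The seat arithmetic above can be sketched in a few lines. This is a minimal illustration, not a real re-accommodation engine: it assumes every later flight has the same capacity and average load factor, and the 180-seat / 85%-full numbers are the illustrative figures from the text.

```python
import math

def departures_to_absorb(displaced: int, seats_per_flight: int, load_factor: float) -> int:
    """Estimate how many later departures it takes to absorb displaced passengers.

    Assumes each later flight has the same capacity and average load factor;
    real re-accommodation also depends on connections, fare rules, and crew legality.
    """
    open_seats = seats_per_flight * (1 - load_factor)
    if open_seats <= 0:
        raise ValueError("No spare seats: rebooking is infeasible without extra capacity")
    return math.ceil(displaced / open_seats)

# Illustrative: cancel a 180-seat flight when later flights run 85% full.
# Each later departure frees only ~27 seats, so it takes several flights
# to absorb the displaced demand.
print(departures_to_absorb(displaced=180, seats_per_flight=180, load_factor=0.85))
```

The point of running the numbers early is exactly the feasibility check described above: if the answer is "seven departures," the backlog is physical, and no amount of counter-side improvisation changes it.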

4. Station Execution (Local overload)

The best plan in the world fails if the team at the airport can’t execute it. If station managers have to call HQ to get permission for a voucher or a hotel room, the line at the counter stops moving, and the terminal becomes a bottleneck.

The IRROPS Recovery Operating System

You don’t need better luck. You need a recovery operating system that behaves the same way every time disruption hits.

This section is the install list. You’re going to see four pieces that work together: who decides, how everyone stays aligned, how feasibility is checked before big moves, and what stations can do without waiting for permission. None of it requires a multi-year transformation. It’s the minimum structure that prevents thrash, preserves options, and makes recovery performance repeatable instead of personality-dependent.

Install #1: Decision Rights Map

The goal is to eliminate confusion about who is allowed to decide. You need a pre-defined map that creates a single owner for every critical action.

What to define:

  • Who can cancel a flight, and under what triggers?

  • Who can delay versus protect a flight?

  • Who can swap aircraft tails?

  • Who can reroute crews or activate reserves?

  • Who can authorize station exceptions (hotel, meal, reroute overrides)?

  • Who owns the customer promise, and who approves changes to it?

Output: A simple table that clarifies exactly who decides to cancel, delay, swap, or authorize exceptions.
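The map can be as simple as a lookup from action to a single accountable owner. The sketch below is illustrative only: the roles and triggers are assumptions standing in for whatever your organization defines, not a prescribed org design.

```python
# A minimal sketch of a decision rights map. Roles and triggers are
# illustrative assumptions; the structure (one owner per action) is the point.
DECISION_RIGHTS = {
    "cancel_flight":     {"owner": "Ops Center Duty Manager", "trigger": "no feasible crew/aircraft in window"},
    "delay_or_protect":  {"owner": "Network Control",         "trigger": "connection bank at risk"},
    "swap_tail":         {"owner": "Fleet Desk",              "trigger": "extended maintenance hold"},
    "reroute_crew":      {"owner": "Crew Scheduling",         "trigger": "legality margin shrinking"},
    "station_exception": {"owner": "Station Manager",         "trigger": "within published guardrails"},
    "customer_promise":  {"owner": "Customer Comms Lead",     "trigger": "any change to rebooking policy"},
}

def who_decides(action: str) -> str:
    """Return the single accountable owner for a recovery action."""
    entry = DECISION_RIGHTS.get(action)
    if entry is None:
        raise KeyError(f"No decision right defined for '{action}': escalate")
    return entry["owner"]

print(who_decides("cancel_flight"))
```

The design choice that matters is the failure mode: an action with no entry raises an escalation instead of triggering a debate, which is exactly the behavior you want under pressure.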

Install #2: A Standard Communications Workflow

You need one story and one rhythm. This prevents information failure where passengers know a flight is canceled before the crew does.

  • Lane 1: Internal coordination workflow. This is how the organization stays synchronized: ops center, stations, crews, customer support, and digital teams operating from the same source of truth with predictable update windows.

  • Lane 2: External customer communications workflow. This is how customers get clarity: what happened, what it means for them, what their options are, and when the next update is coming.

The transferable principle is predictability. Silence and mixed messages create anxiety, which creates contact volume and escalations, which then steals capacity from recovery execution.

Output: a comms cadence, a single narrative owner, and templates + approval rules that prevent improvisation.

Install #3: Rebooking And Legality Protections

The most expensive recovery failures are decisions that look fine for two hours and detonate later. Those are usually decisions made without a structured check on legality margin, rebooking feasibility, and station capacity.

What to install:

  • Constraint Checks: Require a short feasibility check before major recovery moves so leaders stop guessing under pressure. The check should surface (1) legality risk, (2) re-accommodation feasibility, and (3) station throughput capacity.

  • Scenario Playbooks: Define default moves by disruption type (weather, ATC, aircraft shortage, IT outage) so the organization isn’t inventing a strategy in real time.

Output: A required constraint check before major recovery moves, so feasibility is explicit rather than assumed.
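The constraint check can be expressed as a short, structured gate that a leader fills in before committing a major move. This is a hedged sketch: the three checks mirror the list above, but the thresholds (60-minute legality margin, 6-hour rebooking window, 200-passenger queue) are illustrative assumptions to be tuned per operation.

```python
from dataclasses import dataclass

@dataclass
class ConstraintCheck:
    """Pre-move feasibility snapshot. Thresholds below are illustrative."""
    crew_margin_min: int        # minutes of crew legality remaining
    open_seats_next_6h: int     # re-accommodation capacity in the window
    displaced_passengers: int   # demand the proposed move would create
    station_queue_len: int      # passengers currently waiting at the counter

    def blocking_issues(self) -> list:
        """Return every constraint the proposed move would violate."""
        issues = []
        if self.crew_margin_min < 60:
            issues.append("legality risk: crew margin under 60 minutes")
        if self.open_seats_next_6h < self.displaced_passengers:
            issues.append("re-accommodation infeasible within the window")
        if self.station_queue_len > 200:
            issues.append("station throughput saturated")
        return issues

check = ConstraintCheck(crew_margin_min=45, open_seats_next_6h=120,
                        displaced_passengers=180, station_queue_len=90)
for issue in check.blocking_issues():
    print(issue)
```

Whether the check lives in code, a form, or a laminated card matters less than the rule: no major recovery move commits until the three answers are explicit.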

Install #4: Station-Level Exception Handling With Guardrails

Trusting your frontline is a must. If every exception requires a phone call to HQ, the system freezes. The line stops moving, the terminal becomes a backlog factory, and your ops center gets dragged into micro-approvals when it should be managing the network.

The fix isn’t “give stations absolute freedom”. The fix is leaders defining the decision boundaries in advance so local teams can move fast without inventing policy on the fly. Think of it as pre-approving the most common disruption decisions, then creating a clean escalation path for anything expensive, unusual, or high-risk.

The Bucket System:

  • Stations can do this without asking (standard exceptions).

  • Stations can do this if they log it (controlled exceptions).

  • Stations must escalate for approval (high-cost, high-risk exceptions).

Output: A clear exception menu with authority limits, logging, and escalation triggers so stations can move fast without breaking network control.
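The three buckets reduce to a small routing rule. In this sketch the dollar limits are invented for illustration; the real guardrails are whatever your leaders pre-approve, and the logging side effect is omitted.

```python
# A minimal sketch of the three-bucket exception router.
# The dollar limits are illustrative assumptions, not policy.
STANDARD_LIMIT = 50      # stations can do this without asking
CONTROLLED_LIMIT = 300   # stations can do this if they log it

def route_exception(cost: float, high_risk: bool = False) -> str:
    """Classify a station exception into act / act-and-log / escalate."""
    if high_risk or cost > CONTROLLED_LIMIT:
        return "escalate"
    if cost > STANDARD_LIMIT:
        return "act_and_log"
    return "act"

print(route_exception(30))                   # e.g. a meal voucher
print(route_exception(150))                  # e.g. a hotel room
print(route_exception(150, high_risk=True))  # unusual liability exposure
```

Note the ordering: risk trumps cost, so an inexpensive but unusual exception still escalates. That keeps the fast path fast without giving away network control.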

The Operating Rhythm: Running the Recovery

When recovery has no rhythm, teams convene when pressure peaks. Decisions get revisited repeatedly because there's no structured cadence for committing, communicating, executing, and reassessing. That creates thrash, and thrash consumes options.

A recovery rhythm creates predictable decision windows. It also turns status updates into decision production. The cadence can run every 15, 30, or 60 minutes depending on severity. The important part is consistency and a standard agenda that produces commitments, not debate.

Example agenda (kept intentionally simple):

  • Current snapshot (2–3 minutes): what changed since the last cycle and what’s driving the disruption

  • Constraints check (3–5 minutes): what limits are binding in the next window (crew, capacity, station throughput, customer backlog)

  • Decisions needed (5–10 minutes): what must be decided before the next cycle, by whom, and by what time

  • Execution owners (2–3 minutes): who is doing what in stations, crew coordination, customer support, and digital updates

  • Comms for this window (2–3 minutes): what we’re telling customers and internal teams, and when the next update is

  • Escalations (as needed): what exceeds local authority and needs leadership sign-off

Output: A repeatable recovery cadence (frequency + roles + agenda + escalation rules) so each cycle produces decisions and actions, not debate.

What To Measure (And Why Most Metrics Don’t Help in IRROPS)

Most dashboards show history. History is useful for post-mortems, not live recovery. In disruption, you need visibility that triggers decisions early enough to preserve options, not metrics that simply document pain after the fact.

The metrics below were chosen for one reason: they map to the four failure modes that turn a bad day into a bad week. If you can (1) stabilize faster, (2) operate more of the schedule, (3) reduce cash leakage, and (4) protect tomorrow, you’re winning IRROPS. If you can see legality risk, station saturation, and re-accommodation capacity early enough, you can prevent the spiral.

1. The Leadership Outcomes (What you’re optimizing for)

These four were selected because they're the universal scoreboard for recovery, regardless of the disruption type. They answer: How fast did we stabilize, how much did we operate, how much did it cost, and how much did we push into tomorrow?

  • Containment time: how long until the operation is stable again?

  • Completion factor: how much of the planned schedule you actually operated?

  • Compensation leakage: direct cash cost of disruption (refunds, hotels, meals, ground transport, credits).

  • Integrity check: how much today damaged tomorrow?

2. The Operational Triggers (What tells you to act now)

These triggers were picked because they’re early indicators of option collapse. Each trigger should have four things: a clear definition, a threshold, an owner, and a default response. Without that, it’s just commentary.

  • Crew duty red zone: crews approaching legality limits in the next decision window.

  • Station throughput saturation: signs the local operation can’t absorb more change without backlog.

  • Re-accommodation feasibility: whether disrupted demand is outpacing available options.

When these go red, the point isn’t to “do flight ops.” The point is to engage the constraint check and escalation process early, while the operation still has options.
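The four-part rule above (definition, threshold, owner, default response) can be encoded directly, which is what turns a metric into a trigger. Everything in this sketch is an illustrative assumption: the metric names, the threshold values, and the default responses stand in for whatever your operation defines.

```python
# A sketch of trigger definitions carrying the four required parts:
# a definition (metric), a threshold, an owner, and a default response.
# All values below are illustrative assumptions.
TRIGGERS = [
    {"name": "crew_duty_red_zone",
     "metric": "minimum crew legality margin (minutes)",
     "breached": lambda v: v < 60,
     "owner": "Crew Scheduling",
     "default_response": "run constraint check; pre-position reserves"},
    {"name": "station_saturation",
     "metric": "passengers queued per open agent",
     "breached": lambda v: v > 40,
     "owner": "Station Manager",
     "default_response": "open self-service rebooking; activate exception menu"},
    {"name": "reaccommodation_gap",
     "metric": "displaced passengers minus open seats in window",
     "breached": lambda v: v > 0,
     "owner": "Network Control",
     "default_response": "escalate for extra capacity or proactive cancellations"},
]

def fired(trigger, value):
    """Return the default response if the threshold is breached, else None."""
    return trigger["default_response"] if trigger["breached"](value) else None

print(fired(TRIGGERS[0], 45))
```

A trigger without all four parts is, as the text says, just commentary: the lambda forces the threshold to be explicit, and the default response means red never means "convene a meeting to decide what red means."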

30-Day Install Plan (Minimal Viable Recovery OS)

This is the timeline layer: how you install a minimum viable recovery operating system without launching a multi-year transformation program. The goal is speed to structure.

  • Week 1: Map the Reality: Identify where your decisions stall. Look at the last disruption and identify the bottlenecks that slowed critical decisions: cancellations, swaps, and exceptions.

  • Week 2: Lock Decision Rights + Comms: Publish the decision map and escalation rules. Clarify who commits decisions versus who advises. Establish a predictable comms rhythm appropriate to severity, and make sure every update includes the timing of the next update.

  • Week 3: Build Constraint Playbooks: Install the constraint check + scenario structure. Build the short feasibility check that must happen before major moves and define scenario-based default structures so recovery isn’t invented from scratch each time.

  • Week 4: Enable Stations: Give your station managers the authority. Define their guardrails for hotels and vouchers so they stop calling you for permission.

Takeaway

Disruption is unavoidable. The multi-day meltdown is usually self-inflicted: unclear decision rights, inconsistent comms, feasibility checks that happen too late, and stations trapped waiting for permission while the backlog grows.

The fix isn’t a transformation program. It’s four installs: a decision rights map, a comms workflow, a constraint check, and station exception guardrails. Add a recovery rhythm and the operation stops depending on heroics.

If you want help installing this, that’s the work I do: mapping the real recovery flow, clarifying decision rights and escalation, designing comms workflows that prevent thrash, building station exception guardrails, and creating metric-to-action controls that hold under pressure.

If you want to pressure-test your recovery layer, I can run an audit. We’ll map your last major disruption end-to-end, identify the 2–3 loops that created spillover, then deliver the one-page decision rights map, comms cadence, constraint check, and station guardrails your team can run immediately.
