Incident Lifecycle Guide

This guide explains the complete incident lifecycle in the Incidents Management Platform, from initial declaration through resolution and closure.

Lifecycle Overview

Every incident follows a well-defined lifecycle with clear states and transitions:

┌──────────────┐
│   canceled   │ ← open (guard: no_customer_impact)
└──────────────┘

┌──────────┐     ┌──────────────┐     ┌──────────────┐
│   open   │────►│ acknowledged │────►│  mitigated   │
└──────────┘     └──────────────┘     └──────────────┘
     │                  │                     │
     │                  │                     │
     ▼                  ▼                     ▼
┌──────────────────────────────────────────────────┐
│                    resolved                       │
└──────────────────────────────────────────────────┘
                       │
                       ▼
                ┌──────────────┐
                │    closed    │
                └──────────────┘

Any state (except duplicate) ──► duplicate (guard: duplicate_of set)
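
The diagram can also be read as a transition table. The following is a minimal Python sketch of that reading, not the platform's implementation. The names ALLOWED_TRANSITIONS and can_transition are hypothetical. Cancel is allowed from open and acknowledged because the state descriptions list it for both, and Reopen is left out because this guide does not say which state a reopened incident returns to.

```python
# Hypothetical transition table derived from the lifecycle diagram in this guide.
ALLOWED_TRANSITIONS = {
    "open": {"acknowledged", "resolved", "canceled"},
    "acknowledged": {"mitigated", "resolved", "canceled"},
    "mitigated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),      # Reopen is omitted: its target state is not documented here
    "canceled": set(),    # terminal
    "duplicate": set(),   # terminal
}


def can_transition(from_state: str, to_state: str) -> bool:
    """Return True if the diagram permits moving from from_state to to_state."""
    if from_state not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown state: {from_state}")
    # Per the diagram: any state except duplicate may move to duplicate.
    if to_state == "duplicate":
        return from_state != "duplicate"
    return to_state in ALLOWED_TRANSITIONS[from_state]
```

This check covers only the shape of the graph; guard evidence and role checks would be applied separately.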

State Descriptions

Open

The initial state for all new incidents. An incident enters the open state when:

  • A responder declares a new incident via CLI or API
  • An external system (PagerDuty, monitoring) triggers incident creation
  • A webhook from an ITSM system creates an incident

Required information:

  • Title (descriptive summary)
  • Severity (SEV-1 through SEV-4)

Actions available:

  • Acknowledge the incident
  • Cancel (if false positive)
  • Mark as duplicate

Acknowledged

Indicates that a responder has acknowledged the incident and is actively investigating.

Transition requirements:

  • None (can transition immediately from open)

Actions available:

  • Mitigate (after blast radius is stabilized)
  • Resolve (if quick resolution is possible)
  • Cancel (if false positive)

Mitigated

Indicates that the immediate customer impact has been addressed, though the root cause may not be fixed.

Transition requirements (guard):

  • blast_radius_stabilized: true
  • mitigation_summary: "<description of mitigation>"

Actions available:

  • Resolve (after root cause is fixed)

Resolved

Indicates that the root cause has been identified and fixed. The incident is stable and monitoring is in place.

Transition requirements (guard):

  • root_cause_fixed: true
  • root_cause_summary: "<description of fix>"

Actions available:

  • Close (after communications are complete)
  • Reopen (if issue recurs)

Closed

The end state for successfully resolved incidents. Communications to stakeholders are complete. Unlike the terminal states below, a closed incident can still be reopened.

Transition requirements (guard):

  • comms_complete: true

Actions available:

  • Reopen (if issue recurs)

Terminal States

Duplicate

Used when an incident is discovered to be a duplicate of another incident. This is a terminal state: no further transitions are allowed.

Transition requirements:

  • duplicate_of: "<incident_id>" (target incident must exist)

Canceled

Used for false-positive incidents or incidents that were declared in error. This is a terminal state: no further transitions are allowed.

Transition requirements (guard):

  • no_customer_impact: true
  • cancel_reason: "<reason for cancellation>"

Guard Evidence

Guards ensure that state transitions are only performed when appropriate conditions are met. Each guard requires specific metadata to be provided with the transition request.

BlastRadiusStabilizedGuard

Required for: acknowledged → mitigated

{
  "blast_radius_stabilized": true,
  "mitigation_summary": "Rolled back deployment to v2.3.1"
}

RootCauseFixedGuard

Required for: mitigated → resolved

{
  "root_cause_fixed": true,
  "root_cause_summary": "Fixed memory leak in connection pool (PR #1234)"
}

CommsCompleteGuard

Required for: resolved → closed

{
  "comms_complete": true
}

NoCustomerImpactGuard

Required for: * → canceled

{
  "no_customer_impact": true,
  "cancel_reason": "False positive - monitoring alert misconfigured"
}

DuplicateOfSetGuard

Required for: * → duplicate

{
  "duplicate_of": "INC-1234"
}
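
The guard payloads above share one pattern: a set of required metadata keys per transition. The following Python sketch shows how such evidence could be checked before a transition request is sent. It is an illustration under assumptions, not the platform's guard code: GUARD_REQUIREMENTS and missing_evidence are hypothetical names, guards are keyed by target state for simplicity, and the check that a duplicate_of target actually exists is not modeled.

```python
from typing import Any

# Required metadata per target state, copied from the guard examples in this guide.
# Keying by target state is an assumption made for this sketch.
GUARD_REQUIREMENTS = {
    "mitigated": {"blast_radius_stabilized": bool, "mitigation_summary": str},
    "resolved": {"root_cause_fixed": bool, "root_cause_summary": str},
    "closed": {"comms_complete": bool},
    "canceled": {"no_customer_impact": bool, "cancel_reason": str},
    "duplicate": {"duplicate_of": str},
}


def missing_evidence(to_state: str, metadata: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the evidence looks complete."""
    problems = []
    for key, expected_type in GUARD_REQUIREMENTS.get(to_state, {}).items():
        value = metadata.get(key)
        if not isinstance(value, expected_type):
            problems.append(f"{key}: expected {expected_type.__name__}")
        elif expected_type is bool and value is not True:
            problems.append(f"{key}: must be true")
        elif expected_type is str and not value.strip():
            problems.append(f"{key}: must not be empty")
    return problems
```

States without an entry, such as acknowledged, need no evidence and always return an empty list.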

Severity Levels

Incidents are classified by severity to ensure appropriate response:

Severity  Description                            Response Time Target  Example
SEV-1     Critical - Service down for all users  < 15 minutes          Complete outage
SEV-2     High - Major feature unavailable       < 30 minutes          Payment processing down
SEV-3     Medium - Feature degraded              < 2 hours             Slow API responses
SEV-4     Low - Minor issue                      < 24 hours            UI cosmetic bug
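
The response time targets above translate directly into durations. As a rough Python sketch, assuming the target is measured from the moment the incident is declared (this guide does not say so explicitly), a check for an overdue response could look like this. RESPONSE_TARGETS and response_overdue are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Response time targets from the severity table.
RESPONSE_TARGETS = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
    "SEV-3": timedelta(hours=2),
    "SEV-4": timedelta(hours=24),
}


def response_overdue(severity: str, declared_at: datetime, now: datetime) -> bool:
    """Return True once the elapsed time reaches the target for this severity.

    Assumption: the clock starts at declaration time.
    """
    return now - declared_at >= RESPONSE_TARGETS[severity]


declared = datetime(2025, 12, 21, 10, 0, tzinfo=timezone.utc)
print(response_overdue("SEV-2", declared, declared + timedelta(minutes=29)))
print(response_overdue("SEV-1", declared, declared + timedelta(minutes=15)))
```

The targets are written as "less than", so the sketch treats an elapsed time equal to the target as overdue.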

SEV-1 Special Rules

SEV-1 incidents have special handling:

  • Require the incident.commander role for most transitions
  • Escalate automatically if not acknowledged within the SLA
  • The executive summary field is protected
  • Cannot be closed immediately (minimum 5-minute wait)

CLI Examples

Declare an Incident

# Basic declaration
im declare --title "API latency spike" --severity SEV-2 --service api-gateway

# With labels
im declare --title "Database connection issues" \
  --severity SEV-1 \
  --service user-service \
  --labels "team=platform,env=prod"

Transition Through States

# Acknowledge
im ack --id INC-1234

# Mitigate with evidence
im mitigate --id INC-1234 --metadata '{
  "blast_radius_stabilized": true,
  "mitigation_summary": "Scaled up database connections"
}'

# Resolve with evidence
im resolve --id INC-1234 --metadata '{
  "root_cause_fixed": true,
  "root_cause_summary": "Fixed connection pool leak in v2.4.0"
}'

# Close
im close --id INC-1234 --metadata '{"comms_complete": true}'

Cancel or Mark Duplicate

# Cancel a false positive
im cancel --id INC-1235 \
  --reason "Alert misconfiguration - no actual issue"

# Mark as duplicate
im mark-duplicate --id INC-1236 --of INC-1234

API Examples

Declare Incident

curl -X POST http://localhost:8080/api/v1/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Production API latency spike",
    "severity": "SEV-2",
    "service": "api-gateway"
  }'

Transition Incident

curl -X POST http://localhost:8080/api/v1/incidents/INC-1234/transition \
  -H "Content-Type: application/json" \
  -d '{
    "to_state": "mitigated",
    "metadata": {
      "blast_radius_stabilized": true,
      "mitigation_summary": "Scaled up database connections"
    }
  }'
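
The same transition request can be built from a script. The Python sketch below uses only the standard library and mirrors the curl call above. build_transition_request is a hypothetical helper, and the base URL is the local example address from this guide. Authentication is not shown because this guide does not describe it.

```python
import json
import urllib.request


def build_transition_request(base_url: str, incident_id: str,
                             to_state: str, metadata: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the transition endpoint."""
    url = f"{base_url.rstrip('/')}/api/v1/incidents/{incident_id}/transition"
    body = json.dumps({"to_state": to_state, "metadata": metadata}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_transition_request(
    "http://localhost:8080",
    "INC-1234",
    "mitigated",
    {
        "blast_radius_stabilized": True,
        "mitigation_summary": "Scaled up database connections",
    },
)
# To send it against a running server: urllib.request.urlopen(request)
```

Separating request construction from sending makes the payload easy to inspect or log before it reaches the API.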

CloudEvents

All state transitions emit CloudEvents v1.0 events for audit and integration:

im.incident.declared.v1

Emitted when a new incident is declared.

{
  "specversion": "1.0",
  "type": "im.incident.declared.v1",
  "source": "im://api",
  "id": "evt-abc123",
  "time": "2025-12-21T10:00:00Z",
  "subject": "incident/INC-1234",
  "data": {
    "incident_id": "INC-1234",
    "title": "Production API latency spike",
    "severity": "SEV-2",
    "service": "api-gateway"
  }
}

im.incident.state_change.v1

Emitted when an incident transitions between states.

{
  "specversion": "1.0",
  "type": "im.incident.state_change.v1",
  "source": "im://api",
  "id": "evt-def456",
  "time": "2025-12-21T10:30:00Z",
  "subject": "incident/INC-1234",
  "data": {
    "incident_id": "INC-1234",
    "from_state": "acknowledged",
    "to_state": "mitigated",
    "actor": "alice@example.com",
    "metadata": {
      "blast_radius_stabilized": true,
      "mitigation_summary": "Scaled up database connections"
    }
  }
}
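
For integrations or tests that need sample events, the state change envelope above can be assembled as a plain dictionary. This Python sketch copies the field layout from the example. build_state_change_event is a hypothetical helper, and the generated id format ("evt-" plus a random hex string) is an assumption based on the sample ids, not a documented scheme.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_state_change_event(incident_id: str, from_state: str, to_state: str,
                             actor: str, metadata: dict,
                             event_id: Optional[str] = None,
                             time: Optional[datetime] = None) -> dict:
    """Assemble an im.incident.state_change.v1 event shaped like the example above.

    `time` should be a timezone-aware datetime; it is converted to UTC.
    """
    moment = (time or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "specversion": "1.0",
        "type": "im.incident.state_change.v1",
        "source": "im://api",
        "id": event_id or f"evt-{uuid.uuid4().hex}",  # id format is assumed
        "time": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "subject": f"incident/{incident_id}",
        "data": {
            "incident_id": incident_id,
            "from_state": from_state,
            "to_state": to_state,
            "actor": actor,
            "metadata": metadata,
        },
    }
```

The specversion, type, source, and id attributes are the ones CloudEvents v1.0 requires on every event; time and subject are optional in the specification but present in both examples here.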

RBAC Permissions

Different roles have different permissions for incident operations:

Operation responder commander admin
Declare
Acknowledge
Mitigate
Resolve
Close
Cancel
Mark Duplicate
Merge/Split
Override Guards

Note: SEV-1 incidents require the commander role for most transitions.

Best Practices

  1. Declare early - It’s better to declare an incident and cancel it than to miss a real issue
  2. Acknowledge quickly - Acknowledgment signals that someone is actively working on the issue
  3. Provide mitigation evidence - Clear mitigation summaries help with postmortems
  4. Document root cause - The root cause summary is invaluable for preventing recurrence
  5. Complete communications - Ensure all stakeholders are notified before closing
  6. Use severity appropriately - SEV-1 should be reserved for critical customer-impacting issues