Incident Lifecycle Guide

This guide explains the complete incident lifecycle in the Incidents Management Platform, from initial declaration through resolution and closure.

Lifecycle Overview

Every incident follows a well-defined lifecycle with clear states and transitions:

┌──────────────┐
│   canceled   │ ← open (guard: no_customer_impact)
└──────────────┘

┌──────────┐     ┌──────────────┐     ┌──────────────┐
│   open   │────►│ acknowledged │────►│  mitigated   │
└──────────┘     └──────────────┘     └──────────────┘
     │                  │                     │
     │                  │                     │
     ▼                  ▼                     ▼
┌──────────────────────────────────────────────────┐
│                    resolved                       │
└──────────────────────────────────────────────────┘
                       │
                       ▼
                ┌──────────────┐
                │    closed    │
                └──────────────┘

Any state (except duplicate) ──► duplicate (guard: duplicate_of set)
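
The diagram can also be read as a transition table. The Python sketch below transcribes only the edges drawn above. It is an illustration, not the platform's implementation: the names STATES, TRANSITIONS, and can_transition are hypothetical, guards are not evaluated, and the Reopen action is left out because this guide does not say which state it returns to.

```python
# Hypothetical transition table transcribed from the lifecycle diagram.
# Guards and the reopen action are deliberately left out of this sketch.
STATES = {
    "open", "acknowledged", "mitigated", "resolved",
    "closed", "canceled", "duplicate",
}

TRANSITIONS = {
    "open": {"acknowledged", "resolved", "canceled"},
    "acknowledged": {"mitigated", "resolved"},
    "mitigated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
    "canceled": set(),
    "duplicate": set(),
}


def can_transition(from_state: str, to_state: str) -> bool:
    """Return True if the diagram allows moving from from_state to to_state."""
    if from_state not in STATES or to_state not in STATES:
        raise ValueError(f"unknown state: {from_state!r} -> {to_state!r}")
    # Any state except duplicate may be marked as a duplicate.
    if to_state == "duplicate":
        return from_state != "duplicate"
    return to_state in TRANSITIONS[from_state]


print(can_transition("open", "acknowledged"))
print(can_transition("mitigated", "closed"))
```

Keeping the "any state to duplicate" rule as a special case avoids repeating the same edge in every row of the table.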

State Descriptions

Open

The initial state for all new incidents. An incident enters the open state when:

A responder declares a new incident via CLI or API
An external system (PagerDuty, monitoring) triggers incident creation
A webhook from an ITSM system creates an incident

Required information:

Title (descriptive summary)
Severity (SEV-1 through SEV-4)

Actions available:

Acknowledge the incident
Cancel (if false positive)
Mark as duplicate

Acknowledged

Indicates that a responder has acknowledged the incident and is actively investigating.

Transition requirements:

None (can transition immediately from open)

Actions available:

Mitigate (after blast radius is stabilized)
Resolve (if quick resolution is possible)
Cancel (if false positive)

Mitigated

Indicates that the immediate customer impact has been addressed, though the root cause may not be fixed.

Transition requirements (guard):

blast_radius_stabilized: true
mitigation_summary: "<description of mitigation>"

Actions available:

Resolve (after root cause is fixed)

Resolved

Indicates that the root cause has been identified and fixed. The incident is stable and monitoring is in place.

Transition requirements (guard):

root_cause_fixed: true
root_cause_summary: "<description of fix>"

Actions available:

Close (after communications are complete)
Reopen (if issue recurs)

Closed

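The end state for successfully resolved incidents, reached once communications to stakeholders are complete. Unlike the terminal states described below, a closed incident can still be reopened.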

Transition requirements (guard):

comms_complete: true

Actions available:

Reopen (if issue recurs)

Terminal States

Duplicate

Used when an incident is discovered to be a duplicate of another incident. This is a terminal state - no further transitions are allowed.

Transition requirements:

duplicate_of: "<incident_id>" (target incident must exist)

Canceled

Used for false positive incidents or incidents that were declared in error. This is a terminal state.

Transition requirements (guard):

no_customer_impact: true
cancel_reason: "<reason for cancellation>"

Guard Evidence

Guards ensure that state transitions are only performed when appropriate conditions are met. Each guard requires specific metadata to be provided with the transition request.

BlastRadiusStabilizedGuard

Required for: acknowledged → mitigated

{
  "blast_radius_stabilized": true,
  "mitigation_summary": "Rolled back deployment to v2.3.1"
}

RootCauseFixedGuard

Required for: mitigated → resolved

{
  "root_cause_fixed": true,
  "root_cause_summary": "Fixed memory leak in connection pool (PR #1234)"
}

CommsCompleteGuard

Required for: resolved → closed

{
  "comms_complete": true
}

NoCustomerImpactGuard

Required for: * → canceled

{
  "no_customer_impact": true,
  "cancel_reason": "False positive - monitoring alert misconfigured"
}

DuplicateOfSetGuard

Required for: * → duplicate

{
  "duplicate_of": "INC-1234"
}
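
A client can check guard evidence before it sends a transition request. The following is a minimal Python sketch under stated assumptions: the field names come from the JSON examples above, while the GUARD_FIELDS table, the guard_errors helper, and the expected types are hypothetical. It does not verify that a duplicate_of target exists, which only the server can do.

```python
# Hypothetical guard table: target state -> metadata fields the guard expects.
# Field names come from the JSON examples in this guide; the types are assumed.
GUARD_FIELDS = {
    "mitigated": {"blast_radius_stabilized": bool, "mitigation_summary": str},
    "resolved": {"root_cause_fixed": bool, "root_cause_summary": str},
    "closed": {"comms_complete": bool},
    "canceled": {"no_customer_impact": bool, "cancel_reason": str},
    "duplicate": {"duplicate_of": str},
}


def guard_errors(to_state: str, metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the evidence looks complete."""
    errors = []
    for field, expected in GUARD_FIELDS.get(to_state, {}).items():
        value = metadata.get(field)
        if not isinstance(value, expected):
            # Missing fields and wrongly typed fields are reported the same way.
            errors.append(f"{field}: expected {expected.__name__}")
        elif expected is bool and value is not True:
            errors.append(f"{field}: must be true")
        elif expected is str and not value.strip():
            errors.append(f"{field}: must not be empty")
    return errors


print(guard_errors("closed", {"comms_complete": False}))
```

States without an entry in the table, such as acknowledged, produce no errors because this guide lists no guard for them.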

Severity Levels

Incidents are classified by severity to ensure appropriate response:

Severity	Description	Response Time Target	Example
SEV-1	Critical - Service down for all users	< 15 minutes	Complete outage
SEV-2	High - Major feature unavailable	< 30 minutes	Payment processing down
SEV-3	Medium - Feature degraded	< 2 hours	Slow API responses
SEV-4	Low - Minor issue	< 24 hours	UI cosmetic bug

SEV-1 Special Rules

SEV-1 incidents have special handling:

Require incident.commander role for most transitions
Automatically escalate if not acknowledged within SLA
Executive summary field is protected
Cannot be closed immediately (minimum 5-minute wait)
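
The automatic escalation rule can be sketched as a simple time comparison. The Python example below is illustrative only: it assumes that the response time targets in the severity table double as the acknowledgment SLA, and the names RESPONSE_TARGETS and needs_escalation are hypothetical rather than part of the platform.

```python
from datetime import datetime, timedelta, timezone

# Response time targets from the severity table. Treating them as the
# acknowledgment SLA is an assumption made for this sketch.
RESPONSE_TARGETS = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
    "SEV-3": timedelta(hours=2),
    "SEV-4": timedelta(hours=24),
}


def needs_escalation(severity: str, state: str,
                     declared_at: datetime, now: datetime) -> bool:
    """Return True for a SEV-1 incident that is still open past its target."""
    if severity != "SEV-1" or state != "open":
        return False
    # The target reads "< 15 minutes", so reaching 15 minutes counts as a miss.
    return now - declared_at >= RESPONSE_TARGETS[severity]


declared = datetime(2025, 12, 21, 10, 0, tzinfo=timezone.utc)
print(needs_escalation("SEV-1", "open", declared, declared + timedelta(minutes=20)))
```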

CLI Examples

Declare an Incident

# Basic declaration
im declare --title "API latency spike" --severity SEV-2 --service api-gateway

# With labels
im declare --title "Database connection issues" \
  --severity SEV-1 \
  --service user-service \
  --labels "team=platform,env=prod"

Transition Through States

# Acknowledge
im ack --id INC-1234

# Mitigate with evidence
im mitigate --id INC-1234 --metadata '{
  "blast_radius_stabilized": true,
  "mitigation_summary": "Scaled up database connections"
}'

# Resolve with evidence
im resolve --id INC-1234 --metadata '{
  "root_cause_fixed": true,
  "root_cause_summary": "Fixed connection pool leak in v2.4.0"
}'

# Close
im close --id INC-1234 --metadata '{"comms_complete": true}'

Cancel or Mark Duplicate

# Cancel a false positive
im cancel --id INC-1235 \
  --reason "Alert misconfiguration - no actual issue"

# Mark as duplicate
im mark-duplicate --id INC-1236 --of INC-1234

API Examples

Declare Incident

curl -X POST http://localhost:8080/api/v1/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Production API latency spike",
    "severity": "SEV-2",
    "service": "api-gateway"
  }'

Transition Incident

curl -X POST http://localhost:8080/api/v1/incidents/INC-1234/transition \
  -H "Content-Type: application/json" \
  -d '{
    "to_state": "mitigated",
    "metadata": {
      "blast_radius_stabilized": true,
      "mitigation_summary": "Scaled up database connections"
    }
  }'
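
The same transition request can be built from a script. This Python sketch uses only the standard library and mirrors the curl example above: the base URL, path, and payload shape are copied from that example, and build_transition_request is a hypothetical helper name. The request is constructed but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

# Base URL taken from the curl examples above; adjust for your deployment.
BASE_URL = "http://localhost:8080/api/v1"


def build_transition_request(incident_id: str, to_state: str,
                             metadata: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the transition endpoint."""
    body = json.dumps({"to_state": to_state, "metadata": metadata}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/incidents/{incident_id}/transition",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_transition_request(
    "INC-1234",
    "mitigated",
    {
        "blast_radius_stabilized": True,
        "mitigation_summary": "Scaled up database connections",
    },
)
print(request.get_method(), request.full_url)
# To send it against a running server: urllib.request.urlopen(request)
```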

CloudEvents

All state transitions emit CloudEvents v1.0 events for audit and integration:

im.incident.declared.v1

Emitted when a new incident is declared.

{
  "specversion": "1.0",
  "type": "im.incident.declared.v1",
  "source": "im://api",
  "id": "evt-abc123",
  "time": "2025-12-21T10:00:00Z",
  "subject": "incident/INC-1234",
  "data": {
    "incident_id": "INC-1234",
    "title": "Production API latency spike",
    "severity": "SEV-2",
    "service": "api-gateway"
  }
}

im.incident.state_change.v1

Emitted when an incident transitions between states.

{
  "specversion": "1.0",
  "type": "im.incident.state_change.v1",
  "source": "im://api",
  "id": "evt-def456",
  "time": "2025-12-21T10:30:00Z",
  "subject": "incident/INC-1234",
  "data": {
    "incident_id": "INC-1234",
    "from_state": "acknowledged",
    "to_state": "mitigated",
    "actor": "alice@example.com",
    "metadata": {
      "blast_radius_stabilized": true,
      "mitigation_summary": "Scaled up database connections"
    }
  }
}
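
A consumer of these events might validate the envelope before acting on the payload. The sketch below is one possible approach in Python: it checks the four context attributes that CloudEvents v1.0 requires (specversion, type, source, id) and then reads the transition fields shown in the example above. The parse_state_change function is a hypothetical name, and the sample event repeats the values from this guide.

```python
import json

# Context attributes that CloudEvents v1.0 marks as required.
REQUIRED_ATTRIBUTES = ("specversion", "type", "source", "id")


def parse_state_change(raw: str) -> tuple[str, str, str]:
    """Return (incident_id, from_state, to_state) from a state_change event."""
    event = json.loads(raw)
    missing = [name for name in REQUIRED_ATTRIBUTES if name not in event]
    if missing:
        raise ValueError(f"missing CloudEvents attributes: {missing}")
    if event["type"] != "im.incident.state_change.v1":
        raise ValueError(f"unexpected event type: {event['type']}")
    data = event["data"]
    return data["incident_id"], data["from_state"], data["to_state"]


raw_event = json.dumps({
    "specversion": "1.0",
    "type": "im.incident.state_change.v1",
    "source": "im://api",
    "id": "evt-def456",
    "time": "2025-12-21T10:30:00Z",
    "subject": "incident/INC-1234",
    "data": {
        "incident_id": "INC-1234",
        "from_state": "acknowledged",
        "to_state": "mitigated",
        "actor": "alice@example.com",
    },
})
print(parse_state_change(raw_event))
```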

RBAC Permissions

Different roles have different permissions for incident operations:

Operation	responder	commander	admin
Declare	✓	✓	✓
Acknowledge	✓	✓	✓
Mitigate	✓	✓	✓
Resolve	✓	✓	✓
Close	-	✓	✓
Cancel	-	✓	✓
Mark Duplicate	-	✓	✓
Merge/Split	-	✓	✓
Override Guards	-	-	✓

Note: SEV-1 incidents require the commander role (incident.commander) for most transitions.
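
The permission table can be expressed as a lookup, which is handy for client-side checks or tests. This Python sketch transcribes the table above. PERMISSIONS, is_allowed, and the operation keys are hypothetical names, and the SEV-1 commander requirement is not modeled because this guide does not list which transitions it covers.

```python
# Hypothetical permission matrix transcribed from the RBAC table.
ALL_ROLES = {"responder", "commander", "admin"}
ELEVATED_ROLES = {"commander", "admin"}

PERMISSIONS = {
    "declare": ALL_ROLES,
    "acknowledge": ALL_ROLES,
    "mitigate": ALL_ROLES,
    "resolve": ALL_ROLES,
    "close": ELEVATED_ROLES,
    "cancel": ELEVATED_ROLES,
    "mark_duplicate": ELEVATED_ROLES,
    "merge_split": ELEVATED_ROLES,
    "override_guards": {"admin"},
}


def is_allowed(role: str, operation: str) -> bool:
    """Return True if the role may perform the operation; unknown operations are denied."""
    return role in PERMISSIONS.get(operation, set())


print(is_allowed("responder", "resolve"))
print(is_allowed("responder", "close"))
```

Denying unknown operations by default keeps the check safe if a new operation is added before the table is updated.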

Best Practices

Declare early - It’s better to declare an incident and cancel it than to miss a real issue
Acknowledge quickly - Acknowledgment signals that someone is actively working on the issue
Provide mitigation evidence - Clear mitigation summaries help with postmortems
Document root cause - The root cause summary is invaluable for preventing recurrence
Complete communications - Ensure all stakeholders are notified before closing
Use severity appropriately - SEV-1 should be reserved for critical customer-impacting issues

Related Guides

State Machine Developer Guide - Technical details for developers
ITSM Integration - Connecting to ServiceNow, Jira, etc.
Timeline Service - Understanding the event timeline
