Incident Lifecycle Guide
This guide explains the complete incident lifecycle in the Incidents Management Platform, from initial declaration through resolution and closure.
Lifecycle Overview
Every incident follows a well-defined lifecycle with clear states and transitions:
┌──────────────┐
│   canceled   │ ← open (guard: no_customer_impact)
└──────────────┘
┌──────────┐     ┌──────────────┐     ┌──────────────┐
│   open   │────►│ acknowledged │────►│  mitigated   │
└──────────┘     └──────────────┘     └──────────────┘
     │                  │                    │
     │                  │                    │
     ▼                  ▼                    ▼
┌──────────────────────────────────────────────────┐
│                     resolved                     │
└──────────────────────────────────────────────────┘
                        │
                        ▼
                 ┌──────────────┐
                 │    closed    │
                 └──────────────┘
Any state (except duplicate) ──► duplicate (guard: duplicate_of set)
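The lifecycle can also be read as a lookup table. The following is a minimal sketch, not the platform's implementation: the `TRANSITIONS` table and `can_transition` helper are hypothetical names, and the table follows the "Actions available" lists in this guide. It assumes that terminal states (duplicate, canceled) have no outgoing transitions, it omits the open-to-resolved arrow drawn in the diagram because the Open section does not list Resolve, and it leaves out Reopen because the guide does not say which state a reopened incident returns to.

```python
# Hypothetical transition table derived from this guide's state descriptions.
TRANSITIONS = {
    "open": {"acknowledged", "canceled", "duplicate"},
    "acknowledged": {"mitigated", "resolved", "canceled", "duplicate"},
    "mitigated": {"resolved", "duplicate"},
    "resolved": {"closed", "duplicate"},
    "closed": {"duplicate"},
    # Terminal states: assumed to allow no further transitions.
    "canceled": set(),
    "duplicate": set(),
}


def can_transition(from_state: str, to_state: str) -> bool:
    """Return True if the table allows moving from from_state to to_state."""
    if from_state not in TRANSITIONS:
        raise ValueError(f"unknown state: {from_state!r}")
    if to_state not in TRANSITIONS:
        raise ValueError(f"unknown state: {to_state!r}")
    return to_state in TRANSITIONS[from_state]


if __name__ == "__main__":
    print(can_transition("open", "acknowledged"))
    print(can_transition("open", "mitigated"))
```

A table like this only answers whether a transition is structurally possible; the guard evidence for the target state still has to be checked separately.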
State Descriptions
Open
The initial state for all new incidents. An incident enters the open state when:
- A responder declares a new incident via CLI or API
- An external system (PagerDuty, monitoring) triggers incident creation
- A webhook from an ITSM system creates an incident
Required information:
- Title (descriptive summary)
- Severity (SEV-1 through SEV-4)
Actions available:
- Acknowledge the incident
- Cancel (if false positive)
- Mark as duplicate
Acknowledged
Indicates that a responder has acknowledged the incident and is actively investigating.
Transition requirements:
- None (can transition immediately from open)
Actions available:
- Mitigate (after blast radius is stabilized)
- Resolve (if quick resolution is possible)
- Cancel (if false positive)
Mitigated
Indicates that the immediate customer impact has been addressed, though the root cause may not be fixed.
Transition requirements (guard):
- blast_radius_stabilized: true
- mitigation_summary: "<description of mitigation>"
Actions available:
- Resolve (after root cause is fixed)
Resolved
Indicates that the root cause has been identified and fixed. The incident is stable and monitoring is in place.
Transition requirements (guard):
- root_cause_fixed: true
- root_cause_summary: "<description of fix>"
Actions available:
- Close (after communications are complete)
- Reopen (if issue recurs)
Closed
Final state for successfully resolved incidents. Communications to stakeholders are complete.
Transition requirements (guard):
comms_complete: true
Actions available:
- Reopen (if issue recurs)
Terminal States
Duplicate
Used when an incident is discovered to be a duplicate of another incident. This is a terminal state - no further transitions are allowed.
Transition requirements:
- duplicate_of: "<incident_id>" (target incident must exist)
Canceled
Used for false positive incidents or incidents that were declared in error. This is a terminal state.
Transition requirements (guard):
- no_customer_impact: true
- cancel_reason: "<reason for cancellation>"
Guard Evidence
Guards ensure that state transitions are only performed when appropriate conditions are met. Each guard requires specific metadata to be provided with the transition request.
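As an illustration of how a client could pre-check guard evidence before sending a transition request, here is a small sketch. The `GUARDS` table and `missing_evidence` function are hypothetical, not part of the platform. The field names come from this guide; treating summary fields as required non-empty strings is an assumption, and the check that a `duplicate_of` target actually exists is left to the server.

```python
from typing import Any, Dict, List

# Target state -> (boolean flag that must be true, text fields that must be set).
# Field names are taken from this guide; the table itself is illustrative.
GUARDS = {
    "mitigated": ("blast_radius_stabilized", ("mitigation_summary",)),
    "resolved": ("root_cause_fixed", ("root_cause_summary",)),
    "closed": ("comms_complete", ()),
    "canceled": ("no_customer_impact", ("cancel_reason",)),
    "duplicate": (None, ("duplicate_of",)),
}


def missing_evidence(to_state: str, metadata: Dict[str, Any]) -> List[str]:
    """List guard fields that are absent or invalid for the target state."""
    flag, text_fields = GUARDS.get(to_state, (None, ()))
    problems = []
    # The flag must be the JSON boolean true, not merely a truthy value.
    if flag is not None and metadata.get(flag) is not True:
        problems.append(flag)
    for field in text_fields:
        value = metadata.get(field)
        # Assumption: text evidence must be a non-blank string.
        if not isinstance(value, str) or not value.strip():
            problems.append(field)
    return problems


if __name__ == "__main__":
    print(missing_evidence("mitigated", {"blast_radius_stabilized": True}))
```

An empty list means the metadata carries everything the guard for that target state asks for; states without a guard (such as acknowledged) always return an empty list.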
BlastRadiusStabilizedGuard
Required for: acknowledged → mitigated
{
  "blast_radius_stabilized": true,
  "mitigation_summary": "Rolled back deployment to v2.3.1"
}
RootCauseFixedGuard
Required for: mitigated → resolved
{
  "root_cause_fixed": true,
  "root_cause_summary": "Fixed memory leak in connection pool (PR #1234)"
}
CommsCompleteGuard
Required for: resolved → closed
{
  "comms_complete": true
}
NoCustomerImpactGuard
Required for: * → canceled
{
  "no_customer_impact": true,
  "cancel_reason": "False positive - monitoring alert misconfigured"
}
DuplicateOfSetGuard
Required for: * → duplicate
{
  "duplicate_of": "INC-1234"
}
Severity Levels
Incidents are classified by severity to ensure appropriate response:
| Severity | Description | Response Time Target | Example |
|---|---|---|---|
| SEV-1 | Critical - Service down for all users | < 15 minutes | Complete outage |
| SEV-2 | High - Major feature unavailable | < 30 minutes | Payment processing down |
| SEV-3 | Medium - Feature degraded | < 2 hours | Slow API responses |
| SEV-4 | Low - Minor issue | < 24 hours | UI cosmetic bug |
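The targets in the table translate directly into deadlines. The sketch below assumes the response time is measured from the moment the incident is declared until it is acknowledged; the guide does not state the reference point, and the names `RESPONSE_TARGETS`, `response_deadline`, and `is_overdue` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Response time targets copied from the severity table.
RESPONSE_TARGETS = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
    "SEV-3": timedelta(hours=2),
    "SEV-4": timedelta(hours=24),
}


def response_deadline(severity: str, declared_at: datetime) -> datetime:
    """Time by which the incident should have been acknowledged."""
    return declared_at + RESPONSE_TARGETS[severity]


def is_overdue(severity: str, declared_at: datetime, now: datetime) -> bool:
    """True once the target has elapsed.

    The table uses strict '<' targets, so reaching the deadline exactly
    already counts as a miss.
    """
    return now >= response_deadline(severity, declared_at)


if __name__ == "__main__":
    declared = datetime(2025, 12, 21, 10, 0, tzinfo=timezone.utc)
    print(response_deadline("SEV-2", declared).isoformat())
```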
SEV-1 Special Rules
SEV-1 incidents have special handling:
- Require incident.commander role for most transitions
- Automatically escalate if not acknowledged within SLA
- Executive summary field is protected
- Cannot be quickly closed (minimum 5 minute wait)
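The minimum wait before closing a SEV-1 incident could be enforced with a check like the one below. This is a sketch under one explicit assumption: the 5 minutes are counted from the time the incident entered the resolved state, which this guide does not specify. `SEV1_MIN_CLOSE_WAIT`, `earliest_close_time`, and `can_close` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Minimum wait from the SEV-1 special rules.
SEV1_MIN_CLOSE_WAIT = timedelta(minutes=5)


def earliest_close_time(severity: str, resolved_at: datetime) -> datetime:
    """Earliest moment a close request would be accepted.

    Assumption: the wait starts when the incident was resolved.
    """
    if severity == "SEV-1":
        return resolved_at + SEV1_MIN_CLOSE_WAIT
    return resolved_at


def can_close(severity: str, resolved_at: datetime, now: datetime) -> bool:
    """True if enough time has passed to close the incident."""
    return now >= earliest_close_time(severity, resolved_at)


if __name__ == "__main__":
    resolved = datetime(2025, 12, 21, 11, 0, tzinfo=timezone.utc)
    print(can_close("SEV-1", resolved, resolved + timedelta(minutes=2)))
```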
CLI Examples
Declare an Incident
# Basic declaration
im declare --title "API latency spike" --severity SEV-2 --service api-gateway
# With labels
im declare --title "Database connection issues" \
  --severity SEV-1 \
  --service user-service \
  --labels "team=platform,env=prod"
Transition Through States
# Acknowledge
im ack --id INC-1234
# Mitigate with evidence
im mitigate --id INC-1234 --metadata '{
  "blast_radius_stabilized": true,
  "mitigation_summary": "Scaled up database connections"
}'
# Resolve with evidence
im resolve --id INC-1234 --metadata '{
  "root_cause_fixed": true,
  "root_cause_summary": "Fixed connection pool leak in v2.4.0"
}'
# Close
im close --id INC-1234 --metadata '{"comms_complete": true}'
Cancel or Mark Duplicate
# Cancel a false positive
im cancel --id INC-1235 \
  --reason "Alert misconfiguration - no actual issue"
# Mark as duplicate
im mark-duplicate --id INC-1236 --of INC-1234
API Examples
Declare Incident
curl -X POST http://localhost:8080/api/v1/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Production API latency spike",
    "severity": "SEV-2",
    "service": "api-gateway"
  }'
Transition Incident
curl -X POST http://localhost:8080/api/v1/incidents/INC-1234/transition \
  -H "Content-Type: application/json" \
  -d '{
    "to_state": "mitigated",
    "metadata": {
      "blast_radius_stabilized": true,
      "mitigation_summary": "Scaled up database connections"
    }
  }'
CloudEvents
All state transitions emit CloudEvents v1.0 events for audit and integration:
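A consumer of these events might validate the envelope before acting on it. The sketch below checks the four context attributes that CloudEvents v1.0 requires (`specversion`, `id`, `source`, `type`) and summarizes the two event types described in this guide. `parse_event` and `describe` are hypothetical helpers, and `KNOWN_TYPES` lists only the types documented here; the platform may emit others.

```python
import json

# Context attributes that CloudEvents v1.0 marks as required.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")

# Only the event types documented in this guide (assumed incomplete).
KNOWN_TYPES = {"im.incident.declared.v1", "im.incident.state_change.v1"}


def parse_event(raw: str) -> dict:
    """Decode a JSON-encoded event and validate its envelope."""
    event = json.loads(raw)
    missing = [name for name in REQUIRED_ATTRIBUTES if name not in event]
    if missing:
        raise ValueError(f"missing CloudEvents attributes: {missing}")
    if event["specversion"] != "1.0":
        raise ValueError(f"unsupported specversion: {event['specversion']!r}")
    if event["type"] not in KNOWN_TYPES:
        raise ValueError(f"unexpected event type: {event['type']!r}")
    return event


def describe(event: dict) -> str:
    """One-line summary of a validated event."""
    data = event.get("data", {})
    if event["type"] == "im.incident.state_change.v1":
        return f"{data['incident_id']}: {data['from_state']} -> {data['to_state']}"
    return f"{data['incident_id']}: declared ({data['severity']})"


if __name__ == "__main__":
    sample = json.dumps({
        "specversion": "1.0",
        "type": "im.incident.declared.v1",
        "source": "im://api",
        "id": "evt-abc123",
        "data": {"incident_id": "INC-1234", "severity": "SEV-2"},
    })
    print(describe(parse_event(sample)))
```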
im.incident.declared.v1
Emitted when a new incident is declared.
{
  "specversion": "1.0",
  "type": "im.incident.declared.v1",
  "source": "im://api",
  "id": "evt-abc123",
  "time": "2025-12-21T10:00:00Z",
  "subject": "incident/INC-1234",
  "data": {
    "incident_id": "INC-1234",
    "title": "Production API latency spike",
    "severity": "SEV-2",
    "service": "api-gateway"
  }
}
im.incident.state_change.v1
Emitted when an incident transitions between states.
{
  "specversion": "1.0",
  "type": "im.incident.state_change.v1",
  "source": "im://api",
  "id": "evt-def456",
  "time": "2025-12-21T10:30:00Z",
  "subject": "incident/INC-1234",
  "data": {
    "incident_id": "INC-1234",
    "from_state": "acknowledged",
    "to_state": "mitigated",
    "actor": "alice@example.com",
    "metadata": {
      "blast_radius_stabilized": true,
      "mitigation_summary": "Scaled up database connections"
    }
  }
}
RBAC Permissions
Different roles have different permissions for incident operations:
| Operation | responder | commander | admin |
|---|---|---|---|
| Declare | ✓ | ✓ | ✓ |
| Acknowledge | ✓ | ✓ | ✓ |
| Mitigate | ✓ | ✓ | ✓ |
| Resolve | ✓ | ✓ | ✓ |
| Close | | ✓ | ✓ |
| Cancel | | ✓ | ✓ |
| Mark Duplicate | | ✓ | ✓ |
| Merge/Split | | ✓ | ✓ |
| Override Guards | | | ✓ |
Note: SEV-1 incidents require commander role for most transitions.
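A permission check over this matrix could look like the sketch below. It rests on an assumed reading of the table: Declare through Resolve are open to every role, Close, Cancel, Mark Duplicate, and Merge/Split are limited to commander and admin, and Override Guards is admin only. The `PERMISSIONS` mapping and `is_allowed` function are hypothetical. The SEV-1 rule is not modeled because the guide does not enumerate which transitions "most transitions" covers.

```python
# Operation -> roles allowed to perform it (assumed reading of the RBAC table).
EVERYONE = frozenset({"responder", "commander", "admin"})
LEADS = frozenset({"commander", "admin"})

PERMISSIONS = {
    "declare": EVERYONE,
    "acknowledge": EVERYONE,
    "mitigate": EVERYONE,
    "resolve": EVERYONE,
    "close": LEADS,
    "cancel": LEADS,
    "mark_duplicate": LEADS,
    "merge_split": LEADS,
    "override_guards": frozenset({"admin"}),
}


def is_allowed(role: str, operation: str) -> bool:
    """Return True if the role may perform the operation.

    Unknown operations are denied rather than raising, so a typo
    never grants access.
    """
    return role in PERMISSIONS.get(operation, frozenset())


if __name__ == "__main__":
    print(is_allowed("responder", "close"))
    print(is_allowed("admin", "override_guards"))
```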
Best Practices
- Declare early - It’s better to declare an incident and cancel it than to miss a real issue
- Acknowledge quickly - Acknowledgment signals that someone is actively working on the issue
- Provide mitigation evidence - Clear mitigation summaries help with postmortems
- Document root cause - The root cause summary is invaluable for preventing recurrence
- Complete communications - Ensure all stakeholders are notified before closing
- Use severity appropriately - SEV-1 should be reserved for critical customer-impacting issues
Related Guides
- State Machine Developer Guide - Technical details for developers
- ITSM Integration - Connecting to ServiceNow, Jira, etc.
- Timeline Service - Understanding the event timeline