Your First Incident
On this page
- Scenario: Database Performance Issue
- Step 1: Declare the Incident
- Step 2: Acknowledge the Incident
- Step 3: Investigation and Updates
- Step 4: Resolution
- Step 5: Close the Incident
- Step 6: Export for Postmortem
- Understanding the Dashboard
- Advanced Features
- Team Collaboration Features
- Best Practices Learned
- What You’ve Accomplished
- Next Steps
- Common Scenarios
Scenario: Database Performance Issue
Let’s walk through a realistic incident scenario to learn the platform’s capabilities. You’ll learn how to declare, manage, and resolve an incident while understanding the event-sourced timeline.
The Situation
It’s 2:30 PM on a Tuesday. Your monitoring system detects unusual database response times affecting the checkout service. Customer complaints are starting to come in. Time to declare an incident.
Step 1: Declare the Incident
Using the Web Interface
- Open the platform at http://localhost:8080
- Click “Declare Incident” (large red button on the dashboard)
- Fill in the incident details:
  - Title: Checkout service experiencing high database latency
  - Severity: SEV-2 (Major functionality impacted)
  - Service: checkout
  - Description: Database response times increased to 5+ seconds. Multiple customer reports of failed checkouts.
- Click “Create Incident”
You’ll see the incident appear with an automatically assigned ID like INC-2025-001.
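Incident IDs follow an INC-&lt;year&gt;-&lt;sequence&gt; pattern. As an illustration only (the platform's actual generator is internal), a minimal Python sketch that produces the next ID in sequence:

```python
from datetime import date

def next_incident_id(existing_ids, today=None):
    """Return the next ID in the hypothetical INC-<year>-<nnn> scheme."""
    year = (today or date.today()).year
    prefix = f"INC-{year}-"
    # Collect sequence numbers already used this year.
    seqs = [int(i[len(prefix):]) for i in existing_ids if i.startswith(prefix)]
    return f"{prefix}{max(seqs, default=0) + 1:03d}"
```

For example, `next_incident_id([], today=date(2025, 1, 1))` yields `INC-2025-001`, and the counter resets each calendar year.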
Using the CLI
Alternatively, declare the incident via command line:
./im declare \
--sev SEV-2 \
--title "Checkout service experiencing high database latency" \
--service checkout \
--description "Database response times increased to 5+ seconds. Multiple customer reports of failed checkouts."

Output:

✅ Incident declared: INC-2025-001
🔗 View at: http://localhost:8080/incidents/INC-2025-001
What Happened Behind the Scenes
The platform created the following CloudEvent in the timeline:
{
"specversion": "1.0",
"type": "incident.declared",
"source": "incident-management/cli",
"id": "evt-001",
"time": "2025-08-23T14:30:00Z",
"data": {
"incident_id": "INC-2025-001",
"title": "Checkout service experiencing high database latency",
"severity": "SEV-2",
"service": "checkout",
"status": "open"
}
}

Step 2: Acknowledge the Incident
Show that someone is actively working on the issue.
Web Interface
- Go to the incident page (click the incident ID)
- Click “Acknowledge” button
- Add a note: Investigating database performance metrics
- Click “Acknowledge Incident”
CLI
./im ack --incident INC-2025-001 --note "Investigating database performance metrics"

Timeline Update: The incident timeline now shows:
- ✅ 14:30 - Incident declared
- ✅ 14:32 - Incident acknowledged by user@example.com
Step 3: Investigation and Updates
Keep stakeholders informed with regular updates as you investigate.
Add Investigation Updates
# First update - root cause investigation
./im update --incident INC-2025-001 \
--note "Found slow query in orders table. Query execution time: 8.2s avg"
# Second update - mitigation attempt
./im update --incident INC-2025-001 \
--note "Applied database index to orders.created_at. Testing performance"
# Third update - mitigation confirmed
./im update --incident INC-2025-001 \
--note "Database performance improved. Response time now <500ms. Monitoring"Change Status to Mitigated
Once you’ve applied a fix but want to monitor before full resolution:
./im mitigate --incident INC-2025-001 \
--note "Applied database optimization. Performance restored, monitoring for stability"Status Flow:
Open → Mitigated ← You are here
Timeline View
Your incident timeline now shows comprehensive activity:
🚨 14:30 - Incident declared (SEV-2)
Checkout service experiencing high database latency
✅ 14:32 - Acknowledged by user@example.com
Investigating database performance metrics
🔍 14:35 - Investigation update
Found slow query in orders table. Query execution time: 8.2s avg
🔍 14:42 - Mitigation attempt
Applied database index to orders.created_at. Testing performance
🔍 14:45 - Performance confirmed
Database performance improved. Response time now <500ms. Monitoring
🔧 14:47 - Incident mitigated
Applied database optimization. Performance restored, monitoring for stability
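Because the timeline is event-sourced, the incident's current state is never stored directly; it is derived by replaying the events. A minimal Python sketch of that fold (the event type names other than `incident.declared` are assumptions mirroring the CloudEvent shown earlier):

```python
def replay(events):
    """Fold a list of timeline events into the incident's current state."""
    state = {}
    for evt in events:
        etype, data = evt["type"], evt.get("data", {})
        if etype == "incident.declared":
            state = dict(data)                 # title, severity, service, status="open"
        elif etype == "incident.acknowledged":
            state["status"] = "acknowledged"
        elif etype == "incident.mitigated":
            state["status"] = "mitigated"
        elif etype == "incident.resolved":
            state["status"] = "resolved"
        elif etype == "incident.closed":
            state["status"] = "closed"
    return state
```

Replaying the full event list always reproduces the same state, which is what makes the exported timeline a complete record of the incident.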
Step 4: Resolution
After monitoring confirms the fix is stable:
./im resolve --incident INC-2025-001 \
--note "Database performance stable for 30+ minutes. Customer reports ceased. Issue resolved."Status Flow:
Open → Mitigated → Resolved ← You are here
Step 5: Close the Incident
Final closure after confirming no further issues:
./im close --incident INC-2025-001

Final Status:
Open → Mitigated → Resolved → Closed ✅
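The lifecycle above can be modeled as a small state machine. A hypothetical Python sketch of the allowed transitions (the real platform may permit others, such as resolving straight from open):

```python
# Allowed status transitions for the incident lifecycle (illustrative only).
TRANSITIONS = {
    "open": {"mitigated", "resolved"},
    "mitigated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),                 # terminal state
}

def can_transition(current, target):
    """Check whether moving from `current` to `target` is a valid step."""
    return target in TRANSITIONS.get(current, set())
```

Encoding the lifecycle this way is what lets the platform reject invalid commands, such as closing an incident that was never resolved.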
Step 6: Export for Postmortem
Generate a complete incident report for postmortem analysis:
./im timeline export --incident INC-2025-001 --format md > postmortem-INC-2025-001.md

Generated Postmortem Document:
# Incident Postmortem - INC-2025-001
## Incident Summary
- **Title:** Checkout service experiencing high database latency
- **Severity:** SEV-2 (Major)
- **Service:** checkout
- **Duration:** 17 minutes to mitigation (14:30 - 14:47)
- **Status:** Closed
## Timeline
- **14:30** - Incident declared
- **14:32** - Acknowledged, investigation started
- **14:35** - Root cause identified: slow query
- **14:42** - Database index applied
- **14:47** - Incident mitigated
- **15:15** - Incident resolved and closed
## Resolution
Applied database index to orders.created_at column, reducing query time from 8.2s to <500ms.
## Action Items
- [ ] Review other queries on orders table
- [ ] Set up monitoring for query performance
- [ ] Add database performance to alerting

Understanding the Dashboard
Real-Time Updates
The web dashboard automatically refreshes every 5 seconds, so team members see updates without reloading the page:
- Active Incidents: Shows all open/mitigated incidents
- Timeline View: Chronological incident activity
- Status Badges: Color-coded severity levels
- Auto-refresh: No manual refresh needed
Incident Metrics
View key metrics from the dashboard:
- MTTR: Mean Time To Resolution
- MTTA: Mean Time To Acknowledgment
- Incident Volume: By severity and service
- Team Performance: Response times by user
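These metrics are derived from the timestamps of timeline events. A hedged Python sketch of the per-incident figures (MTTA and MTTR are then averaged across incidents), using the times from the walkthrough above:

```python
from datetime import datetime

def minutes_between(start_iso, end_iso):
    """Minutes elapsed between two CloudEvent-style UTC timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(end_iso, fmt) - datetime.strptime(start_iso, fmt)
    return delta.total_seconds() / 60

# Time to acknowledgment: declared -> acknowledged.
tta = minutes_between("2025-08-23T14:30:00Z", "2025-08-23T14:32:00Z")  # 2.0
# Time to resolution: declared -> resolved.
ttr = minutes_between("2025-08-23T14:30:00Z", "2025-08-23T15:15:00Z")  # 45.0
```

For the walkthrough incident this gives 2 minutes to acknowledgment and 45 minutes to resolution.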
Advanced Features
Severity Escalation
Escalate severity if the issue becomes more critical:
./im escalate --incident INC-2025-001 --severity SEV-1 \
--note "Issue spreading to other services, escalating to SEV-1"Service Correlation
Link related services affected by the incident:
./im correlate --incident INC-2025-001 --service payments \
--note "Payment processing also affected by database performance"SLA Tracking
The platform automatically tracks SLA metrics:
- SEV-1: 15-minute acknowledgment, 1-hour resolution target
- SEV-2: 30-minute acknowledgment, 4-hour resolution target
- SEV-3: 2-hour acknowledgment, 24-hour resolution target
- SEV-4: 8-hour acknowledgment, 72-hour resolution target
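Given a severity and a declaration time, the SLA deadlines follow directly from this table. A small Python sketch (illustrative; the platform computes these internally):

```python
from datetime import datetime, timedelta

# SLA targets from the table above: (acknowledgment target, resolution target).
SLA = {
    "SEV-1": (timedelta(minutes=15), timedelta(hours=1)),
    "SEV-2": (timedelta(minutes=30), timedelta(hours=4)),
    "SEV-3": (timedelta(hours=2),    timedelta(hours=24)),
    "SEV-4": (timedelta(hours=8),    timedelta(hours=72)),
}

def sla_deadlines(severity, declared_at):
    """Return (ack_by, resolve_by) deadlines for an incident."""
    ack, resolve = SLA[severity]
    return declared_at + ack, declared_at + resolve
```

For the SEV-2 walkthrough incident declared at 14:30, this yields an acknowledgment deadline of 15:00 and a resolution deadline of 18:30.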
Team Collaboration Features
Real-Time Notifications
When integrations are configured, updates are sent to:
- Slack/Teams: Incident notifications and status changes
- PagerDuty: Escalation and on-call management
- Email: Stakeholder updates and SLA breach alerts
Multiple User Workflow
# User A declares incident
./im declare --sev SEV-1 --title "Payment system down"
# User B acknowledges and takes ownership
./im ack --incident INC-2025-002 --note "Taking ownership, investigating"
# User C adds findings
./im update --incident INC-2025-002 --note "Found network connectivity issue"
# User A resolves
./im resolve --incident INC-2025-002 --note "Network restored, payments functional"

All actions are attributed to the acting user and recorded in the timeline.
Best Practices Learned
Incident Declaration
- Be specific in titles (avoid “System is down”)
- Choose appropriate severity based on customer impact
- Include service context for faster routing
- Declare early rather than waiting for certainty
During Investigation
- Update frequently (every 10-15 minutes for SEV-1/2)
- Be specific about actions taken and findings
- Use appropriate status transitions (Open β Mitigated β Resolved)
- Document decisions for future reference
Resolution and Closure
- Wait for stability before resolving (monitoring period)
- Document root cause and remediation steps
- Close only after confirming no customer impact
- Export timeline for postmortem analysis
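Conceptually, the timeline export walks the event log and renders each entry as markdown. A simplified, hypothetical sketch (field names assumed from the CloudEvent example earlier):

```python
def export_timeline_md(incident_id, events):
    """Render timeline events as a markdown report, loosely modeling
    what `./im timeline export --format md` produces."""
    lines = [f"# Incident Postmortem - {incident_id}", "", "## Timeline"]
    for evt in events:
        hhmm = evt["time"][11:16]   # "2025-08-23T14:30:00Z" -> "14:30"
        lines.append(f"- **{hhmm}** - {evt['note']}")
    return "\n".join(lines)
```

Because the timeline is the event log itself, the export needs no extra bookkeeping: every update you recorded during the incident becomes a postmortem entry.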
What You’ve Accomplished
- ✅ Declared your first incident with appropriate severity
- ✅ Managed the complete incident lifecycle
- ✅ Used both web interface and CLI tools
- ✅ Learned event-sourced timeline concepts
- ✅ Generated postmortem documentation
- ✅ Understood real-time collaboration features
Next Steps
Now that you’ve mastered basic incident management:
- Connect to PagerDuty for on-call management
- Add Slack for team notifications
- Configure SCIM for user provisioning
- Set up role-based access control
- Configure authentication providers
- Review audit logging capabilities
- Automate incident workflows
- Build custom integrations
- Set up monitoring alerts
- Configure high availability
- Set up monitoring and alerting
- Plan disaster recovery
Common Scenarios
High-Severity Incidents (SEV-1)
# Immediate declaration with war room
./im declare --sev SEV-1 --title "Complete payment system outage" \
--service payments --war-room
# Quick acknowledgment by incident commander
./im ack --incident INC-2025-003 --role commander

Multi-Service Incidents
# Initial incident
./im declare --sev SEV-2 --title "Database performance issue" --service db
# Correlate additional affected services
./im correlate --incident INC-2025-004 --service api
./im correlate --incident INC-2025-004 --service checkout
./im correlate --incident INC-2025-004 --service payments

Long-Running Investigations
# Regular updates during extended investigation
./im update --incident INC-2025-005 --note "Continuing investigation. Checking network logs"
./im update --incident INC-2025-005 --note "Network logs clear. Moving to application analysis"
./im update --incident INC-2025-005 --note "Found memory leak in application. Preparing fix"You’re now ready to handle real incidents with confidence! π