Your First Incident

Scenario: Database Performance Issue

Let’s walk through a realistic incident scenario to explore the platform’s capabilities. You’ll declare, manage, and resolve an incident while seeing how the event-sourced timeline records each step.

The Situation

It’s 2:30 PM on a Tuesday. Your monitoring system detects unusual database response times affecting the checkout service. Customer complaints are starting to come in. Time to declare an incident.

Step 1: Declare the Incident

Using the Web Interface

  1. Open the platform at http://localhost:8080
  2. Click “Declare Incident” (large red button on the dashboard)
  3. Fill in the incident details:
    • Title: Checkout service experiencing high database latency
    • Severity: SEV-2 (Major functionality impacted)
    • Service: checkout
    • Description: Database response times increased to 5+ seconds. Multiple customer reports of failed checkouts.
  4. Click “Create Incident”

You’ll see the incident appear with an automatically assigned ID like INC-2025-001.
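IDs follow an INC-<year>-<sequence> pattern. As a rough sketch of how such an ID might be built (how the platform actually persists and increments the counter is an assumption, not documented here):

```python
from datetime import datetime

def next_incident_id(counter: int, declared: datetime) -> str:
    """Build a sequential incident ID like INC-2025-001.

    `counter` is the next sequence number for the year of declaration;
    the zero-padded three-digit width is inferred from the example ID.
    """
    return f"INC-{declared.year}-{counter:03d}"

print(next_incident_id(1, datetime(2025, 8, 23)))  # INC-2025-001
```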

Using the CLI

Alternatively, declare the incident via command line:

./im declare \
  --sev SEV-2 \
  --title "Checkout service experiencing high database latency" \
  --service checkout \
  --description "Database response times increased to 5+ seconds. Multiple customer reports of failed checkouts."

Output:

✓ Incident declared: INC-2025-001
🔗 View at: http://localhost:8080/incidents/INC-2025-001

What Happened Behind the Scenes

The platform created the following CloudEvent in the timeline:

{
  "specversion": "1.0",
  "type": "incident.declared",
  "source": "incident-management/cli",
  "id": "evt-001",
  "time": "2025-08-23T14:30:00Z",
  "data": {
    "incident_id": "INC-2025-001",
    "title": "Checkout service experiencing high database latency",
    "severity": "SEV-2",
    "service": "checkout",
    "status": "open"
  }
}
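Because the timeline is event-sourced, the incident’s current state is never stored directly; it is derived by replaying events in order. A minimal sketch of that replay (event type names other than `incident.declared` are assumptions inferred from the CLI verbs):

```python
# Fold a list of CloudEvents into the incident's current state.
# Only "incident.declared" appears in this guide; the other type
# names are assumed from the CLI verbs (ack, mitigate, resolve, close).

STATUS_EVENTS = {
    "incident.acknowledged": "acknowledged",
    "incident.mitigated": "mitigated",
    "incident.resolved": "resolved",
    "incident.closed": "closed",
}

def replay(events: list) -> dict:
    state = {}
    for event in sorted(events, key=lambda e: e["time"]):  # ISO-8601 sorts lexically
        if event["type"] == "incident.declared":
            state = dict(event["data"])  # initial snapshot: id, title, severity, status
        elif event["type"] in STATUS_EVENTS:
            state["status"] = STATUS_EVENTS[event["type"]]
    return state

events = [
    {"type": "incident.declared", "time": "2025-08-23T14:30:00Z",
     "data": {"incident_id": "INC-2025-001", "severity": "SEV-2", "status": "open"}},
    {"type": "incident.mitigated", "time": "2025-08-23T14:47:00Z", "data": {}},
]
print(replay(events)["status"])  # mitigated
```

Replaying the same events always yields the same state, which is what makes the exported timeline a complete record of the incident.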

Step 2: Acknowledge the Incident

Acknowledging signals that someone is actively working on the issue.

Web Interface

  1. Go to the incident page (click the incident ID)
  2. Click “Acknowledge” button
  3. Add a note: Investigating database performance metrics
  4. Click “Acknowledge Incident”

CLI

./im ack --incident INC-2025-001 --note "Investigating database performance metrics"

Timeline Update: The incident timeline now shows:

  • ✅ 14:30 - Incident declared
  • ✅ 14:32 - Incident acknowledged by user@example.com

Step 3: Investigation and Updates

Keep stakeholders informed with regular updates as you investigate.

Add Investigation Updates

# First update - root cause investigation
./im update --incident INC-2025-001 \
  --note "Found slow query in orders table. Query execution time: 8.2s avg"

# Second update - mitigation attempt
./im update --incident INC-2025-001 \
  --note "Applied database index to orders.created_at. Testing performance"

# Third update - mitigation confirmed
./im update --incident INC-2025-001 \
  --note "Database performance improved. Response time now <500ms. Monitoring"

Change Status to Mitigated

Once you’ve applied a fix but want to monitor before full resolution:

./im mitigate --incident INC-2025-001 \
  --note "Applied database optimization. Performance restored, monitoring for stability"

Status Flow:

Open → Mitigated ← You are here

Timeline View

Your incident timeline now shows comprehensive activity:

🚨 14:30 - Incident declared (SEV-2)
   Checkout service experiencing high database latency

✋ 14:32 - Acknowledged by user@example.com
   Investigating database performance metrics

📝 14:35 - Investigation update
   Found slow query in orders table. Query execution time: 8.2s avg

📝 14:42 - Mitigation attempt
   Applied database index to orders.created_at. Testing performance

📝 14:45 - Performance confirmed
   Database performance improved. Response time now <500ms. Monitoring

🔧 14:47 - Incident mitigated
   Applied database optimization. Performance restored, monitoring for stability

Step 4: Resolution

After monitoring confirms the fix is stable:

./im resolve --incident INC-2025-001 \
  --note "Database performance stable for 30+ minutes. Customer reports ceased. Issue resolved."

Status Flow:

Open → Mitigated → Resolved ← You are here

Step 5: Close the Incident

Final closure after confirming no further issues:

./im close --incident INC-2025-001

Final Status:

Open → Mitigated → Resolved → Closed ✅
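The lifecycle you just walked through can be modeled as a small state machine. A sketch of how transition validation might look (the exact rules, such as whether Open may skip straight to Resolved, are assumptions):

```python
# Allowed status transitions, inferred from the Open -> Mitigated ->
# Resolved -> Closed flow in this guide. Whether shortcuts like
# Open -> Resolved are permitted is an assumption.
TRANSITIONS = {
    "open": {"mitigated", "resolved"},
    "mitigated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return target in TRANSITIONS.get(current, set())

print(can_transition("open", "mitigated"))   # True
print(can_transition("closed", "open"))      # False
```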

Step 6: Export for Postmortem

Generate a complete incident report for postmortem analysis:

./im timeline export --incident INC-2025-001 --format md > postmortem-INC-2025-001.md

Generated Postmortem Document:

# Incident Postmortem - INC-2025-001

## Incident Summary

- **Title:** Checkout service experiencing high database latency
- **Severity:** SEV-2 (Major)
- **Service:** checkout
- **Duration:** 45 minutes (14:30 - 15:15); mitigated after 17 minutes (14:47)
- **Status:** Closed

## Timeline

- **14:30** - Incident declared
- **14:32** - Acknowledged, investigation started
- **14:35** - Root cause identified: slow query
- **14:42** - Database index applied
- **14:47** - Incident mitigated
- **15:15** - Incident resolved and closed

## Resolution

Applied database index to orders.created_at column, reducing query time from 8.2s to <500ms.

## Action Items

- [ ] Review other queries on orders table
- [ ] Set up monitoring for query performance
- [ ] Add database performance to alerting

Understanding the Dashboard

Real-Time Updates

The web dashboard automatically refreshes every 5 seconds, so team members see updates within moments:

  • Active Incidents: Shows all open/mitigated incidents
  • Timeline View: Chronological incident activity
  • Status Badges: Color-coded severity levels
  • Auto-refresh: No manual refresh needed
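Auto-refresh is plain polling on a fixed interval. A sketch of the loop, with a callable standing in for the HTTP fetch since the dashboard’s actual endpoint is not documented here:

```python
import time

def poll(fetch, interval: float = 5.0, cycles: int = 1) -> list:
    """Fetch the active-incident list every `interval` seconds.

    `fetch` stands in for an HTTP GET of the dashboard data; the real
    endpoint and payload shape are assumptions. `cycles` bounds the
    loop so the sketch terminates (a real dashboard polls forever).
    """
    latest = []
    for _ in range(cycles):
        latest = fetch()
        time.sleep(interval)
    return latest

# Demonstration with a stand-in fetcher and a zero interval.
print(poll(lambda: [{"id": "INC-2025-001", "status": "mitigated"}], interval=0.0))
```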

Incident Metrics

View key metrics from the dashboard:

  • MTTR: Mean Time To Resolution
  • MTTA: Mean Time To Acknowledgment
  • Incident Volume: By severity and service
  • Team Performance: Response times by user
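MTTA and MTTR are simple averages over incident timestamps. A sketch of the arithmetic, using the times from this walkthrough (the field names are assumptions, not the platform’s schema):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"

def _minutes(start: str, end: str) -> float:
    """Elapsed minutes between two ISO-8601 UTC timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

def mtta(incidents: list) -> float:
    """Mean minutes from declaration to acknowledgment."""
    return sum(_minutes(i["declared_at"], i["acknowledged_at"]) for i in incidents) / len(incidents)

def mttr(incidents: list) -> float:
    """Mean minutes from declaration to resolution."""
    return sum(_minutes(i["declared_at"], i["resolved_at"]) for i in incidents) / len(incidents)

incident = {"declared_at": "2025-08-23T14:30:00Z",
            "acknowledged_at": "2025-08-23T14:32:00Z",
            "resolved_at": "2025-08-23T15:15:00Z"}
print(mtta([incident]), mttr([incident]))  # 2.0 45.0
```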

Advanced Features

Severity Escalation

Escalate severity if the issue becomes more critical:

./im escalate --incident INC-2025-001 --severity SEV-1 \
  --note "Issue spreading to other services, escalating to SEV-1"

Service Correlation

Link related services affected by the incident:

./im correlate --incident INC-2025-001 --service payments \
  --note "Payment processing also affected by database performance"

SLA Tracking

The platform automatically tracks SLA metrics:

  • SEV-1: 15-minute acknowledgment, 1-hour resolution target
  • SEV-2: 30-minute acknowledgment, 4-hour resolution target
  • SEV-3: 2-hour acknowledgment, 24-hour resolution target
  • SEV-4: 8-hour acknowledgment, 72-hour resolution target
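The targets above translate directly into deadlines relative to declaration time. A sketch of that calculation (target values copied from the table; the function name is an assumption):

```python
from datetime import datetime, timedelta

# (acknowledgment, resolution) targets from the SLA table, in minutes.
SLA_TARGETS = {
    "SEV-1": (15, 60),
    "SEV-2": (30, 240),
    "SEV-3": (120, 1440),
    "SEV-4": (480, 4320),
}

def sla_deadlines(severity: str, declared_at: datetime):
    """Return (ack_by, resolve_by) deadlines for a declared incident."""
    ack_min, resolve_min = SLA_TARGETS[severity]
    return (declared_at + timedelta(minutes=ack_min),
            declared_at + timedelta(minutes=resolve_min))

declared = datetime(2025, 8, 23, 14, 30)
ack_by, resolve_by = sla_deadlines("SEV-2", declared)
print(ack_by.strftime("%H:%M"), resolve_by.strftime("%H:%M"))  # 15:00 18:30
```

For the SEV-2 walkthrough above, acknowledgment at 14:32 was well inside the 15:00 deadline.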

Team Collaboration Features

Real-Time Notifications

When integrations are configured, updates are sent to:

  • Slack/Teams: Incident notifications and status changes
  • PagerDuty: Escalation and on-call management
  • Email: Stakeholder updates and SLA breach alerts

Multiple User Workflow

# User A declares incident
./im declare --sev SEV-1 --title "Payment system down"

# User B acknowledges and takes ownership
./im ack --incident INC-2025-002 --note "Taking ownership, investigating"

# User C adds findings
./im update --incident INC-2025-002 --note "Found network connectivity issue"

# User A resolves
./im resolve --incident INC-2025-002 --note "Network restored, payments functional"

All actions are attributed to the acting user and recorded in the timeline.
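Attribution works because every timeline event can carry the acting user alongside its timestamp. A sketch of rendering per-user activity lines (the `actor` field name is an assumption; the CloudEvent shown earlier only documents the envelope):

```python
def attribute(events: list) -> list:
    """Render one 'HH:MM - type by user' line per timeline event.

    The 'actor' field name is assumed; this guide does not document
    how the platform records the acting user on each event.
    """
    return [f'{e["time"][11:16]} - {e["type"]} by {e["actor"]}' for e in events]

events = [
    {"type": "incident.declared", "time": "2025-08-23T14:30:00Z", "actor": "user-a@example.com"},
    {"type": "incident.acknowledged", "time": "2025-08-23T14:32:00Z", "actor": "user-b@example.com"},
]
print(attribute(events)[0])  # 14:30 - incident.declared by user-a@example.com
```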

Best Practices Learned

Incident Declaration

  • Be specific in titles (avoid “System is down”)
  • Choose appropriate severity based on customer impact
  • Include service context for faster routing
  • Declare early rather than waiting for certainty

During Investigation

  • Update frequently (every 10-15 minutes for SEV-1/2)
  • Be specific about actions taken and findings
  • Use appropriate status transitions (Open → Mitigated → Resolved)
  • Document decisions for future reference

Resolution and Closure

  • Wait for stability before resolving (monitoring period)
  • Document root cause and remediation steps
  • Close only after confirming no customer impact
  • Export timeline for postmortem analysis

What You’ve Accomplished

✅ Declared your first incident with appropriate severity
✅ Managed the complete incident lifecycle
✅ Used both web interface and CLI tools
✅ Learned event-sourced timeline concepts
✅ Generated postmortem documentation
✅ Understood real-time collaboration features

Next Steps

Now that you’ve mastered basic incident management:

  1. Set Up Integrations

    • Connect to PagerDuty for on-call management
    • Add Slack for team notifications
    • Configure SCIM for user provisioning
  2. Configure Security

    • Set up role-based access control
    • Configure authentication providers
    • Review audit logging capabilities
  3. Explore the API

    • Automate incident workflows
    • Build custom integrations
    • Set up monitoring alerts
  4. Deploy to Production

    • Configure high availability
    • Set up monitoring and alerting
    • Plan disaster recovery

Common Scenarios

High-Severity Incidents (SEV-1)

# Immediate declaration with war room
./im declare --sev SEV-1 --title "Complete payment system outage" \
  --service payments --war-room

# Quick acknowledgment by incident commander
./im ack --incident INC-2025-003 --role commander

Multi-Service Incidents

# Initial incident
./im declare --sev SEV-2 --title "Database performance issue" --service db

# Correlate additional affected services
./im correlate --incident INC-2025-004 --service api
./im correlate --incident INC-2025-004 --service checkout
./im correlate --incident INC-2025-004 --service payments

Long-Running Investigations

# Regular updates during extended investigation
./im update --incident INC-2025-005 --note "Continuing investigation. Checking network logs"
./im update --incident INC-2025-005 --note "Network logs clear. Moving to application analysis"
./im update --incident INC-2025-005 --note "Found memory leak in application. Preparing fix"

You’re now ready to handle real incidents with confidence! 🚀