Your First Incident

Scenario: Database Performance Issue

Let’s walk through a realistic incident scenario to explore the platform’s capabilities. You’ll declare, manage, and resolve an incident while seeing how the event-sourced timeline records each step.

The Situation

It’s 2:30 PM on a Tuesday. Your monitoring system detects unusual database response times affecting the checkout service. Customer complaints are starting to come in. Time to declare an incident.

Step 1: Declare the Incident

Using the Web Interface

  1. Open the platform at http://localhost:8080
  2. Click “Declare Incident” (large red button on the dashboard)
  3. Fill in the incident details:
    • Title: Checkout service experiencing high database latency
    • Severity: SEV-2 (Major functionality impacted)
    • Service: checkout
    • Description: Database response times increased to 5+ seconds. Multiple customer reports of failed checkouts.
  4. Click “Create Incident”

You’ll see the incident appear with an automatically assigned ID like INC-2025-001.
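IDs follow an INC-<year>-<sequence> pattern. As a rough sketch of how such an ID might be built (how the platform actually persists and increments the counter is an assumption, not documented here):

```python
from datetime import datetime

def next_incident_id(counter: int, declared: datetime) -> str:
    """Build a sequential incident ID like INC-2025-001.

    `counter` is the next sequence number for the year of declaration;
    the zero-padded three-digit width is inferred from the example ID.
    """
    return f"INC-{declared.year}-{counter:03d}"

print(next_incident_id(1, datetime(2025, 8, 23)))  # INC-2025-001
```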

Using the CLI

Alternatively, declare the incident via command line:

./im declare \
  --sev SEV-2 \
  --title "Checkout service experiencing high database latency" \
  --service checkout \
  --description "Database response times increased to 5+ seconds. Multiple customer reports of failed checkouts."

Output:

✓ Incident declared: INC-2025-001
🔗 View at: http://localhost:8080/incidents/INC-2025-001

What Happened Behind the Scenes

The platform created the following CloudEvent in the timeline:

{
  "specversion": "1.0",
  "type": "incident.declared",
  "source": "incident-management/cli",
  "id": "evt-001",
  "time": "2025-08-23T14:30:00Z",
  "data": {
    "incident_id": "INC-2025-001",
    "title": "Checkout service experiencing high database latency",
    "severity": "SEV-2",
    "service": "checkout",
    "status": "open"
  }
}
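Because the timeline is event-sourced, the incident’s current state is never stored directly; it is derived by replaying events in order. A minimal sketch of that replay (event type names other than `incident.declared` are assumptions inferred from the CLI verbs):

```python
# Fold a list of CloudEvents into the incident's current state.
# Only "incident.declared" appears in this guide; the other type
# names are assumed from the CLI verbs (ack, mitigate, resolve, close).

STATUS_EVENTS = {
    "incident.acknowledged": "acknowledged",
    "incident.mitigated": "mitigated",
    "incident.resolved": "resolved",
    "incident.closed": "closed",
}

def replay(events: list) -> dict:
    state = {}
    for event in sorted(events, key=lambda e: e["time"]):  # ISO-8601 sorts lexically
        if event["type"] == "incident.declared":
            state = dict(event["data"])  # initial snapshot: id, title, severity, status
        elif event["type"] in STATUS_EVENTS:
            state["status"] = STATUS_EVENTS[event["type"]]
    return state

events = [
    {"type": "incident.declared", "time": "2025-08-23T14:30:00Z",
     "data": {"incident_id": "INC-2025-001", "severity": "SEV-2", "status": "open"}},
    {"type": "incident.mitigated", "time": "2025-08-23T14:47:00Z", "data": {}},
]
print(replay(events)["status"])  # mitigated
```

Replaying the same events always yields the same state, which is what makes the exported timeline a complete record of the incident.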

Step 2: Acknowledge the Incident

Acknowledging signals that someone is actively working on the issue.

Web Interface

  1. Go to the incident page (click the incident ID)
  2. Click “Acknowledge” button
  3. Add a note: Investigating database performance metrics
  4. Click “Acknowledge Incident”

CLI

./im ack --incident INC-2025-001 --note "Investigating database performance metrics"

Timeline Update: The incident timeline now shows:

  • ✅ 14:30 - Incident declared
  • ✅ 14:32 - Incident acknowledged by user@example.com

Step 3: Investigation and Updates

Keep stakeholders informed with regular updates as you investigate.

Add Investigation Updates

# First update - root cause investigation
./im update --incident INC-2025-001 \
  --note "Found slow query in orders table. Query execution time: 8.2s avg"

# Second update - mitigation attempt
./im update --incident INC-2025-001 \
  --note "Applied database index to orders.created_at. Testing performance"

# Third update - mitigation confirmed
./im update --incident INC-2025-001 \
  --note "Database performance improved. Response time now <500ms. Monitoring"

Change Status to Mitigated

Once you’ve applied a fix but want to monitor before full resolution:

./im mitigate --incident INC-2025-001 \
  --note "Applied database optimization. Performance restored, monitoring for stability"

Status Flow:

Open → Mitigated ← You are here

Timeline View

Your incident timeline now shows comprehensive activity:

🚨 14:30 - Incident declared (SEV-2)
   Checkout service experiencing high database latency

✋ 14:32 - Acknowledged by user@example.com
   Investigating database performance metrics

📝 14:35 - Investigation update
   Found slow query in orders table. Query execution time: 8.2s avg

📝 14:42 - Mitigation attempt
   Applied database index to orders.created_at. Testing performance

📝 14:45 - Performance confirmed
   Database performance improved. Response time now <500ms. Monitoring

🔧 14:47 - Incident mitigated
   Applied database optimization. Performance restored, monitoring for stability

Step 4: Resolution

After monitoring confirms the fix is stable:

./im resolve --incident INC-2025-001 \
  --note "Database performance stable for 30+ minutes. Customer reports ceased. Issue resolved."

Status Flow:

Open → Mitigated → Resolved ← You are here

Step 5: Close the Incident

Final closure after confirming no further issues:

./im close --incident INC-2025-001

Final Status:

Open → Mitigated → Resolved → Closed ✅
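The lifecycle you just walked through can be modeled as a small state machine. A sketch of how transition validation might look (the exact rules, such as whether Open may skip straight to Resolved, are assumptions):

```python
# Allowed status transitions, inferred from the Open -> Mitigated ->
# Resolved -> Closed flow in this guide. Whether shortcuts like
# Open -> Resolved are permitted is an assumption.
TRANSITIONS = {
    "open": {"mitigated", "resolved"},
    "mitigated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return target in TRANSITIONS.get(current, set())

print(can_transition("open", "mitigated"))   # True
print(can_transition("closed", "open"))      # False
```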

Step 6: Export for Postmortem

Generate a complete incident report for postmortem analysis:

./im timeline export --incident INC-2025-001 --format md > postmortem-INC-2025-001.md

Generated Postmortem Document:

# Incident Postmortem - INC-2025-001

## Incident Summary

- **Title:** Checkout service experiencing high database latency
- **Severity:** SEV-2 (Major)
- **Service:** checkout
- **Duration:** 45 minutes (14:30 - 15:15); mitigated after 17 minutes (14:47)
- **Status:** Closed

## Timeline

- **14:30** - Incident declared
- **14:32** - Acknowledged, investigation started
- **14:35** - Root cause identified: slow query
- **14:42** - Database index applied
- **14:47** - Incident mitigated
- **15:15** - Incident resolved and closed

## Resolution

Applied database index to orders.created_at column, reducing query time from 8.2s to <500ms.

## Action Items

- [ ] Review other queries on orders table
- [ ] Set up monitoring for query performance
- [ ] Add database performance to alerting

Understanding the Dashboard

Real-Time Updates

The web dashboard automatically refreshes every 5 seconds, so team members see updates within moments:

  • Active Incidents: Shows all open/mitigated incidents
  • Timeline View: Chronological incident activity
  • Status Badges: Color-coded severity levels
  • Auto-refresh: No manual refresh needed
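Auto-refresh is plain polling on a fixed interval. A sketch of the loop, with a callable standing in for the HTTP fetch since the dashboard’s actual endpoint is not documented here:

```python
import time

def poll(fetch, interval: float = 5.0, cycles: int = 1) -> list:
    """Fetch the active-incident list every `interval` seconds.

    `fetch` stands in for an HTTP GET of the dashboard data; the real
    endpoint and payload shape are assumptions. `cycles` bounds the
    loop so the sketch terminates (a real dashboard polls forever).
    """
    latest = []
    for _ in range(cycles):
        latest = fetch()
        time.sleep(interval)
    return latest

# Demonstration with a stand-in fetcher and a zero interval.
print(poll(lambda: [{"id": "INC-2025-001", "status": "mitigated"}], interval=0.0))
```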

Incident Metrics

View key metrics from the dashboard:

  • MTTR: Mean Time To Resolution
  • MTTA: Mean Time To Acknowledgment
  • Incident Volume: By severity and service
  • Team Performance: Response times by user
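MTTA and MTTR are simple averages over incident timestamps. A sketch of the arithmetic, using the times from this walkthrough (the field names are assumptions, not the platform’s schema):

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"

def _minutes(start: str, end: str) -> float:
    """Elapsed minutes between two ISO-8601 UTC timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60

def mtta(incidents: list) -> float:
    """Mean minutes from declaration to acknowledgment."""
    return sum(_minutes(i["declared_at"], i["acknowledged_at"]) for i in incidents) / len(incidents)

def mttr(incidents: list) -> float:
    """Mean minutes from declaration to resolution."""
    return sum(_minutes(i["declared_at"], i["resolved_at"]) for i in incidents) / len(incidents)

incident = {"declared_at": "2025-08-23T14:30:00Z",
            "acknowledged_at": "2025-08-23T14:32:00Z",
            "resolved_at": "2025-08-23T15:15:00Z"}
print(mtta([incident]), mttr([incident]))  # 2.0 45.0
```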

Advanced Features

Severity Escalation

Escalate severity if the issue becomes more critical:

./im escalate --incident INC-2025-001 --severity SEV-1 \
  --note "Issue spreading to other services, escalating to SEV-1"

Service Correlation

Link related services affected by the incident:

./im correlate --incident INC-2025-001 --service payments \
  --note "Payment processing also affected by database performance"

SLA Tracking

The platform automatically tracks SLA metrics:

  • SEV-1: 15-minute acknowledgment, 1-hour resolution target
  • SEV-2: 30-minute acknowledgment, 4-hour resolution target
  • SEV-3: 2-hour acknowledgment, 24-hour resolution target
  • SEV-4: 8-hour acknowledgment, 72-hour resolution target
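The targets above translate directly into deadlines relative to declaration time. A sketch of that calculation (target values copied from the table; the function name is an assumption):

```python
from datetime import datetime, timedelta

# (acknowledgment, resolution) targets from the SLA table, in minutes.
SLA_TARGETS = {
    "SEV-1": (15, 60),
    "SEV-2": (30, 240),
    "SEV-3": (120, 1440),
    "SEV-4": (480, 4320),
}

def sla_deadlines(severity: str, declared_at: datetime):
    """Return (ack_by, resolve_by) deadlines for a declared incident."""
    ack_min, resolve_min = SLA_TARGETS[severity]
    return (declared_at + timedelta(minutes=ack_min),
            declared_at + timedelta(minutes=resolve_min))

declared = datetime(2025, 8, 23, 14, 30)
ack_by, resolve_by = sla_deadlines("SEV-2", declared)
print(ack_by.strftime("%H:%M"), resolve_by.strftime("%H:%M"))  # 15:00 18:30
```

For the SEV-2 walkthrough above, acknowledgment at 14:32 was well inside the 15:00 deadline.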

Team Collaboration Features

Real-Time Notifications

When integrations are configured, updates are sent to:

  • Slack/Teams: Incident notifications and status changes
  • PagerDuty: Escalation and on-call management
  • Email: Stakeholder updates and SLA breach alerts

Multiple User Workflow

# User A declares incident
./im declare --sev SEV-1 --title "Payment system down"

# User B acknowledges and takes ownership
./im ack --incident INC-2025-002 --note "Taking ownership, investigating"

# User C adds findings
./im update --incident INC-2025-002 --note "Found network connectivity issue"

# User A resolves
./im resolve --incident INC-2025-002 --note "Network restored, payments functional"

All actions are attributed to the acting user and recorded in the timeline.
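Attribution works because every timeline event can carry the acting user alongside its timestamp. A sketch of rendering per-user activity lines (the `actor` field name is an assumption; the CloudEvent shown earlier only documents the envelope):

```python
def attribute(events: list) -> list:
    """Render one 'HH:MM - type by user' line per timeline event.

    The 'actor' field name is assumed; this guide does not document
    how the platform records the acting user on each event.
    """
    return [f'{e["time"][11:16]} - {e["type"]} by {e["actor"]}' for e in events]

events = [
    {"type": "incident.declared", "time": "2025-08-23T14:30:00Z", "actor": "user-a@example.com"},
    {"type": "incident.acknowledged", "time": "2025-08-23T14:32:00Z", "actor": "user-b@example.com"},
]
print(attribute(events)[0])  # 14:30 - incident.declared by user-a@example.com
```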

Best Practices Learned

Incident Declaration

  • Be specific in titles (avoid “System is down”)
  • Choose appropriate severity based on customer impact
  • Include service context for faster routing
  • Declare early rather than waiting for certainty

During Investigation

  • Update frequently (every 10-15 minutes for SEV-1/2)
  • Be specific about actions taken and findings
  • Use appropriate status transitions (Open → Mitigated → Resolved)
  • Document decisions for future reference

Resolution and Closure

  • Wait for stability before resolving (monitoring period)
  • Document root cause and remediation steps
  • Close only after confirming no customer impact
  • Export timeline for postmortem analysis

What You’ve Accomplished

✅ Declared your first incident with appropriate severity
✅ Managed the complete incident lifecycle
✅ Used both web interface and CLI tools
✅ Learned event-sourced timeline concepts
✅ Generated postmortem documentation
✅ Understood real-time collaboration features

Next Steps

Now that you’ve mastered basic incident management:

  1. Set Up Integrations

    • Connect to PagerDuty for on-call management
    • Add Slack for team notifications
    • Configure SCIM for user provisioning
  2. Configure Security

    • Set up role-based access control
    • Configure authentication providers
    • Review audit logging capabilities
  3. Explore the API

    • Automate incident workflows
    • Build custom integrations
    • Set up monitoring alerts
  4. Deploy to Production

    • Configure high availability
    • Set up monitoring and alerting
    • Plan disaster recovery

Common Scenarios

High-Severity Incidents (SEV-1)

# Immediate declaration with war room
./im declare --sev SEV-1 --title "Complete payment system outage" \
  --service payments --war-room

# Quick acknowledgment by incident commander
./im ack --incident INC-2025-003 --role commander

Multi-Service Incidents

# Initial incident
./im declare --sev SEV-2 --title "Database performance issue" --service db

# Correlate additional affected services
./im correlate --incident INC-2025-004 --service api
./im correlate --incident INC-2025-004 --service checkout
./im correlate --incident INC-2025-004 --service payments

Long-Running Investigations

# Regular updates during extended investigation
./im update --incident INC-2025-005 --note "Continuing investigation. Checking network logs"
./im update --incident INC-2025-005 --note "Network logs clear. Moving to application analysis"
./im update --incident INC-2025-005 --note "Found memory leak in application. Preparing fix"

You’re now ready to handle real incidents with confidence! 🚀