Architecture
Overview
The Incidents platform is built on an event-sourced architecture, with CloudEvents v1.0 as the canonical event format. It serves as the single source of truth for incident data while maintaining bi-directional synchronization with external ITSM platforms.
Core Architectural Principles:
- Event-sourced timeline as the foundation
- CloudEvents v1.0 for all inter-service communication
- Policy-based security with field-level redaction
- Bi-directional ITSM sync with conflict resolution
- Self-hostable and cloud-agnostic deployment
Architecture Documentation
Platform Vision: The complete technical vision and architecture is documented in vision.md (formerly docs/architecture/VISION.md).
Architecture Decision Records: All significant architectural decisions are documented using the MADR format. Browse all ADRs to understand key design choices and their rationale.
Research & Analysis: Market research, competitive analysis, and integration priorities are documented in the Research section.
Component Status: View the component status dashboard for current implementation progress.
High-Level Architecture
System Components
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  External   │    │  External   │    │  External   │
│ Monitoring  │    │    ITSM     │    │   ChatOps   │
│  (PD, etc)  │    │ (SNOW, JSM) │    │(Slack,Teams)│
└─────┬───────┘    └─────┬───────┘    └─────┬───────┘
      │                  │                  │
      │ Webhooks/API     │ Bi-directional   │ Bot/Webhook
      │                  │ Sync             │
      ▼                  ▼                  ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Provider   │    │  Provider   │    │  Provider   │
│ Connectors  │    │ Connectors  │    │ Connectors  │
└─────┬───────┘    └─────┬───────┘    └─────┬───────┘
      │                  │                  │
      └──────────────────┼──────────────────┘
                         │ CloudEvents v1.0
                         ▼
                   ┌─────────────┐
                   │  Event Bus  │
                   │ (NATS/Kafka)│
                   └─────┬───────┘
                         │
           ┌─────────────┼─────────────┐
           │             │             │
           ▼             ▼             ▼
     ┌──────────┐  ┌──────────┐  ┌──────────┐
     │Timeline  │  │Orchestr- │  │  Policy  │
     │Service   │  │   ator   │  │  Engine  │
     │(Events)  │  │(Workflow)│  │(RBAC/    │
     │          │  │          │  │ ABAC)    │
     └─────┬────┘  └─────┬────┘  └─────┬────┘
           │             │             │
           └─────────────┼─────────────┘
                         │
           ┌─────────────┼─────────────┐
           │             │             │
           ▼             ▼             ▼
     ┌──────────┐  ┌──────────┐  ┌──────────┐
     │   API    │  │  Status  │  │Artifacts │
     │ Gateway  │  │  Boards  │  │  Store   │
     │          │  │          │  │(Object   │
     │          │  │          │  │Storage)  │
     └─────┬────┘  └──────────┘  └──────────┘
           │
           ▼
     ┌──────────┐
     │    UI    │
     │(Web/CLI/ │
     │ Mobile)  │
     └──────────┘
Data Flow
External events → Ingestors → Event Bus → Orchestrator → Timeline + Boards + Connectors → API/UI
All inter-service messages use CloudEvents v1.0 envelopes (structured JSON on the bus). Policy is evaluated at Policy Enforcement Points (PEPs) before reads/lists/exports and before connector side-effects.
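To make the envelope format concrete, here is a minimal Python sketch of how a service might build a structured-mode CloudEvents v1.0 message before publishing it. The `make_event` helper and the note payload are illustrative assumptions, not part of the platform's API; only the attribute names come from the CloudEvents specification.

```python
import json
import uuid
from datetime import datetime, timezone


def make_event(event_type: str, source: str, subject: str, data: dict) -> dict:
    """Build a CloudEvents v1.0 envelope (structured JSON mode). Hypothetical helper."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return {
        "specversion": "1.0",          # required attribute
        "type": event_type,            # required attribute
        "source": source,              # required attribute
        "id": str(uuid.uuid4()),       # required; must be unique per source
        "time": now.replace("+00:00", "Z"),
        "datacontenttype": "application/json",
        "subject": subject,
        "data": data,
    }


event = make_event(
    "com.incidents.note_added.v1",
    "incidents://orchestrator",
    "incidents/12345",
    {"incident_id": "12345", "note": "Restarted the payment workers"},  # made-up payload
)
print(json.dumps(event, indent=2))
```

A random UUID is the simplest way to satisfy the uniqueness requirement; sources that can replay the same upstream message may prefer an ID derived from the upstream identifier so that replays deduplicate.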
Event-Sourced Foundation
Timeline Service
The Timeline Service is the core component that maintains an append-only log of all incident-related events using CloudEvents v1.0 format.
Key Features:
- Immutable event store with PostgreSQL backend
- CloudEvents v1.0 compliance for all events
- Deduplication using source + id combination
- Event correlation and fingerprinting
- Materialized views for performance
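Deduplication on the `source` + `id` pair can be pushed down to the store with a uniqueness constraint. The sketch below uses Python's built-in `sqlite3` as a stand-in for the PostgreSQL backend so that it runs anywhere; the column layout and the `append_event` helper are assumptions made for illustration.

```python
import json
import sqlite3

# In-memory SQLite stands in for PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE timeline_events (
        source  TEXT NOT NULL,
        id      TEXT NOT NULL,
        payload TEXT NOT NULL,
        PRIMARY KEY (source, id)  -- source + id identifies one event
    )
    """
)


def append_event(event: dict) -> bool:
    """Append an event; return False if (source, id) was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO timeline_events (source, id, payload) VALUES (?, ?, ?)",
        (event["source"], event["id"], json.dumps(event)),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the insert was ignored


evt = {"source": "incidents://orchestrator", "id": "550e8400-e29b-41d4-a716-446655440000"}
print(append_event(evt))  # first delivery is stored
print(append_event(evt))  # redelivery is dropped
```

On PostgreSQL the equivalent statement would use `INSERT ... ON CONFLICT (source, id) DO NOTHING`.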
Event Structure:
{
  "specversion": "1.0",
  "type": "com.incidents.state_change.v1",
  "source": "incidents://orchestrator",
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "time": "2025-08-23T10:30:00Z",
  "datacontenttype": "application/json",
  "subject": "incidents/12345",
  "data": {
    "incident_id": "12345",
    "previous_status": "open",
    "new_status": "mitigated",
    "actor": {
      "type": "user",
      "id": "user@company.com"
    },
    "reason": "Applied database fix"
  }
}
Event Types
Core Incident Events:
- com.incidents.declared.v1 - New incident created
- com.incidents.state_change.v1 - Status transitions
- com.incidents.assigned.v1 - Assignment changes
- com.incidents.note_added.v1 - Timeline notes
- com.incidents.resolved.v1 - Resolution with fix details
- com.incidents.closed.v1 - Final closure
Integration Events:
- com.incidents.provider.*.ingested.v1 - External system events
- com.incidents.provider.*.sync_requested.v1 - Sync operations
- com.incidents.provider.*.sync_completed.v1 - Sync results
Communication Events:
- com.incidents.chat.message.v1 - Chat messages
- com.incidents.comms.update_sent.v1 - Stakeholder updates
- com.incidents.comms.escalation.v1 - Escalation events
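The `*` in the integration event types reads as a placeholder for a provider name. If consumers want to match concrete types against such patterns, one option is to treat `*` as exactly one dot-separated segment. The following Python sketch works under that assumption; `pattern_to_regex` and `matches` are hypothetical helpers, and the provider names are examples only.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a dotted type pattern into a regex; '*' matches one segment."""
    parts = [r"[^.]+" if part == "*" else re.escape(part) for part in pattern.split(".")]
    return re.compile(r"\.".join(parts))


def matches(pattern: str, event_type: str) -> bool:
    """Return True when event_type fits the pattern segment for segment."""
    return pattern_to_regex(pattern).fullmatch(event_type) is not None


INGESTED = "com.incidents.provider.*.ingested.v1"
print(matches(INGESTED, "com.incidents.provider.pagerduty.ingested.v1"))
print(matches(INGESTED, "com.incidents.provider.pagerduty.sync_completed.v1"))
```

Restricting `*` to a single segment keeps a pattern from accidentally swallowing a longer, unrelated type name.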
Orchestrator
The Orchestrator component manages incident state machines, correlation, and workflow coordination.
State Machine
┌─────────┐
│  Draft  │ (optional pre-state)
└────┬────┘
     │
     ▼
┌─────────┐    ┌──────────┐    ┌──────────┐    ┌────────┐
│  Open   ├───►│Mitigated ├───►│Resolved  ├───►│Closed  │
└────┬────┘    └──────────┘    └──────────┘    └────────┘
     │
     ▼
┌─────────┐
│Canceled │
└─────────┘
State Definitions:
- Open: Active incident requiring response
- Mitigated: Impact reduced, root cause being addressed
- Resolved: Fixed, monitoring for stability
- Closed: Final state after confirmation
- Canceled: Invalid/duplicate incident
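The transitions drawn in the diagram can be enforced with a small guard. This is a minimal Python sketch that assumes only the edges shown in the diagram are legal; `ALLOWED`, `InvalidTransition`, and `transition` are illustrative names, and the actual Orchestrator may permit further transitions (reopening, for example).

```python
# Edges taken from the state diagram; anything else is rejected.
ALLOWED = {
    "draft": {"open"},
    "open": {"mitigated", "canceled"},
    "mitigated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),      # terminal
    "canceled": set(),    # terminal
}


class InvalidTransition(ValueError):
    """Raised when a requested status change is not an edge in ALLOWED."""


def transition(current: str, new: str) -> str:
    """Return the new status if current -> new is allowed, else raise."""
    if new not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"cannot move from {current!r} to {new!r}")
    return new


status = transition("open", "mitigated")
print(status)
```

Keeping the table as data rather than branching logic makes it easy to emit one `com.incidents.state_change.v1` event per accepted call and to test every edge.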
Correlation Engine
The Orchestrator includes intelligent event correlation to:
- Group related alerts into single incidents
- Detect duplicate incidents across providers
- Link incidents to affected services and infrastructure
- Track incident families and cascading failures
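One common way to group related alerts is to hash a few stable attributes together with a time bucket, so that alerts in the same bucket share a fingerprint. The Python sketch below assumes a fixed five-minute window and a `fingerprint` helper; both are illustrative choices, not documented behavior of the Correlation Engine.

```python
import hashlib

WINDOW_SECONDS = 300  # hypothetical window size


def fingerprint(service: str, error_type: str, epoch_seconds: float,
                window: int = WINDOW_SECONDS) -> str:
    """Hash service + error_type + time bucket into a stable correlation key."""
    bucket = int(epoch_seconds) // window          # alerts in one window share a bucket
    raw = f"{service}|{error_type}|{bucket}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


a = fingerprint("checkout", "db_timeout", 1000)
b = fingerprint("checkout", "db_timeout", 1100)   # same 300 s bucket as a
c = fingerprint("checkout", "db_timeout", 1200)   # next bucket
print(a == b, a == c)
```

Fixed buckets are simple but split alerts that straddle a boundary; a correlator that needs to avoid that would also check the neighboring bucket or use a sliding window.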
Correlation Keys:
correlation:
  fingerprints:
    - service + error_type + time_window
    - alert_rule + host + time_window
    - user_report + service + time_window
  deduplication:
    - source + origin_id (exact match)
    - fingerprint + time_window (fuzzy match)
Bi-directional ITSM Sync
Design Goals
- Keep the incident record as the canonical system of record
- Sync essential fields with ITSM platforms both ways
- Explicit field ownership and conflict resolution
- Drift detection and reconciliation
- Deterministic behavior under concurrent updates
Field Mapping
Each integrated ITSM platform has a configured field mapping:
servicenow:
  field_mappings:
    title:
      external_field: "short_description"
      source_of_truth: "internal"
      conflict_policy: "internal_wins"
    severity:
      external_field: "urgency"
      source_of_truth: "bidirectional"
      conflict_policy: "most_recent_allowed"
      value_mapping:
        SEV-1: "1 - High"
        SEV-2: "2 - Medium"
        SEV-3: "3 - Low"
    status:
      external_field: "state"
      source_of_truth: "internal"
      conflict_policy: "state_machine"
      value_mapping:
        open: "2"       # In Progress
        mitigated: "6"  # Resolved
        resolved: "6"   # Resolved
        closed: "7"     # Closed
Conflict Resolution
When field conflicts occur, the system applies configured policies:
- internal_wins: Our value takes precedence
- external_wins: Provider value takes precedence
- most_recent_allowed: Use timestamps with ETag validation
- state_machine: Apply state transition guards
- manual: Flag for human review
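The policies above can be read as a dispatch on the configured policy name. The Python sketch below covers four of them under stated assumptions: `FieldValue`, `NeedsReview`, and `resolve` are hypothetical names, ETag validation is left out, timestamp ties go to the internal value, and `state_machine` is deliberately not handled because it depends on the Orchestrator's transition guards.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FieldValue:
    value: str
    updated_at: datetime


class NeedsReview(Exception):
    """Signals that a conflict must be flagged for a human."""


def resolve(policy: str, internal: FieldValue, external: FieldValue) -> FieldValue:
    """Pick the winning value for one conflicting field."""
    if policy == "internal_wins":
        return internal
    if policy == "external_wins":
        return external
    if policy == "most_recent_allowed":
        # Assumption: ties keep the internal value; ETag checks are omitted here.
        return external if external.updated_at > internal.updated_at else internal
    if policy == "manual":
        raise NeedsReview(f"{internal.value!r} vs {external.value!r}")
    raise ValueError(f"unsupported policy in this sketch: {policy}")


ours = FieldValue("SEV-2", datetime(2025, 8, 23, 10, 0))
theirs = FieldValue("SEV-1", datetime(2025, 8, 23, 10, 5))
print(resolve("most_recent_allowed", ours, theirs).value)
```

Returning the whole `FieldValue` rather than the bare string keeps the winning timestamp available for the next comparison.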
Outbox Pattern
Reliable side-effects use the transactional outbox pattern:
-- Ensure connector actions are reliable
CREATE TABLE outbox_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(), -- built in since PostgreSQL 13
    aggregate_id UUID NOT NULL,   -- incident_id
    event_type TEXT NOT NULL,     -- provider.action type
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    processed_at TIMESTAMPTZ,     -- NULL until processed
    attempts INT DEFAULT 0,
    last_error TEXT
);
-- Atomic update + side-effect scheduling
BEGIN;
UPDATE incidents SET status = 'acknowledged' WHERE id = ?;
-- id is omitted here, so the column default generates it
INSERT INTO outbox_events (aggregate_id, event_type, payload)
VALUES (?, 'provider.pagerduty.acknowledge', ?);
COMMIT;
ChatOps Integration
Channel Strategy
Per-incident channels (default):
- Naming: #inc-<id>-sev<level>-<slug>
- Example: #inc-1234-sev2-checkout-errors
- Visibility suffix: -restricted for sensitive data
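The naming convention can be generated mechanically from incident fields. In the Python sketch below, `channel_name` is a hypothetical helper; placing `-restricted` at the very end of the name and the 80-character cap (Slack's channel-name limit) are assumptions about how the convention is applied.

```python
import re

MAX_LEN = 80  # assumed cap on the name without the leading '#'


def channel_name(incident_id: int, severity: int, title: str,
                 restricted: bool = False) -> str:
    """Build '#inc-<id>-sev<level>-<slug>' with an optional '-restricted' suffix."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    prefix = f"inc-{incident_id}-sev{severity}-"
    suffix = "-restricted" if restricted else ""
    room = MAX_LEN - len(prefix) - len(suffix)    # space left for the slug
    slug = slug[:room].rstrip("-")                # avoid a dangling hyphen after truncation
    return f"#{prefix}{slug}{suffix}"


print(channel_name(1234, 2, "Checkout errors"))
```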
Channel Features:
- Rich topic with incident summary and links
- Pinned widgets (status board, timeline, runbooks)
- Auto-invites for relevant roles and on-call teams
- Thread-to-timeline event mapping
Slash Commands
Key ChatOps commands integrated with event sourcing:
| Command | Purpose | Event Type |
|---|---|---|
| /im new [title] [sev] | Declare incident | com.incidents.declared.v1 |
| /im ic @user | Assign Incident Commander | com.incidents.role.assigned.v1 |
| /im status <state> | Transition status | com.incidents.state_change.v1 |
| /im note <text> | Add timeline note | com.incidents.note_added.v1 |
| /im ack <alert> | Acknowledge alert | com.incidents.provider.*.ack.v1 |
Message Mapping
Chat messages become CloudEvents with deterministic IDs:
{
  "specversion": "1.0",
  "type": "com.incidents.chat.message.v1",
  "source": "slack://team123/channel456",
  "id": "sha256(slack:team123:channel456:1692789123.456)",
  "time": "2025-08-23T10:30:00Z",
  "data": {
    "message": "Database connection pool optimized",
    "user_id": "U123456",
    "thread_ts": "1692789100.123",
    "permalink": "https://team.slack.com/archives/C123/p1692789123456"
  }
}
Security & Privacy Model
Policy-Based Access Control
The system implements Attribute-Based Access Control (ABAC) using Open Policy Agent (OPA):
Policy Enforcement Points (PEPs):
- API Gateway (all requests)
- Timeline Service (event reads)
- Status Boards (widget data)
- Connector sync (external updates)
- Export operations (data downloads)
Policy Decision Point (PDP):
- Centralized OPA instance with policy bundles
- Dynamic policy loading and hot reload
- Comprehensive policy testing framework
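The PEP/PDP split can be illustrated with a small enforcement helper that delegates the decision to a pluggable function. The Python sketch below is written under assumptions: `enforce` and `opa_decide` are hypothetical names, and the policy path `incidents/authz/allow` and the input document shape are invented for illustration. The only OPA-specific detail relied on is its Data API, which accepts `POST /v1/data/<path>` with an `input` document and returns the decision under `result`.

```python
import json
import urllib.request
from typing import Callable

# Hypothetical policy path; a real deployment defines its own package layout.
OPA_URL = "http://localhost:8181/v1/data/incidents/authz/allow"


def opa_decide(input_doc: dict) -> bool:
    """Ask an OPA server for a decision; anything but a literal true is a deny."""
    body = json.dumps({"input": input_doc}).encode("utf-8")
    req = urllib.request.Request(
        OPA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return json.load(resp).get("result") is True


def enforce(decide: Callable[[dict], bool], user: dict, action: str,
            resource: dict) -> None:
    """Policy Enforcement Point: raise unless the decision function allows."""
    if not decide({"user": user, "action": action, "resource": resource}):
        raise PermissionError(f"{action} denied for {user.get('id')}")


# A stub PDP keeps the example runnable without an OPA server.
def stub_decide(doc: dict) -> bool:
    return doc["action"] == "timeline:read" and "responder" in doc["user"]["roles"]


enforce(stub_decide, {"id": "u1", "roles": ["responder"]}, "timeline:read", {"incident": "12345"})
print("allowed")
```

Passing the decision function in makes each PEP testable against a stub while production code passes `opa_decide`; treating a missing `result` as a deny keeps the PEP fail-closed when no policy matches.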
Data Classification
Events and fields are classified for access control:
data_classifications:
  public:        # Status, basic timeline
  internal:      # Technical details, some communications
  confidential:  # Personal data, sensitive business info
  restricted:    # Legal hold, privacy incidents
  secret:        # Security incidents, compliance data
Field-Level Redaction
Sensitive data is redacted based on user permissions:
{
  "incident_id": "12345",
  "title": "Payment processing issue",
  "description": "**[REDACTED - CONFIDENTIAL]**",
  "assignee": {
    "name": "**[REDACTED - PII]**",
    "email": "**[REDACTED - PII]**"
  },
  "customer_impact": "**[REDACTED - RESTRICTED]**"
}
Storage Architecture
Core Storage
PostgreSQL (primary):
- Incident records and metadata
- Event store (timeline_events table)
- User and group management
- Connector configurations
- Policy cache
Redis (caching):
- Session storage
- Real-time board updates
- Rate limiting counters
- WebSocket connection state
Object Storage (artifacts):
- File attachments
- Exported timelines
- Backup archives
- Policy bundles
Data Retention
Tiered Retention:
- Active incidents: Full access, no restrictions
- Recent incidents (< 90 days): Full data with normal policies
- Archived incidents (90 days - 7 years): Read-only, compressed storage
- Legal hold: Override retention, full preservation
- Purged data (> 7 years): Hard delete unless legal hold
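The tiers above amount to a decision over an incident's age and legal-hold flag. The Python sketch below makes two assumptions that the text does not state: age is measured from the time the incident was closed, and seven years is approximated as 7 × 365 days. `retention_tier` and the tier labels are illustrative names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RECENT = timedelta(days=90)
ARCHIVE_LIMIT = timedelta(days=7 * 365)  # approximation of seven years


def retention_tier(closed_at: Optional[datetime], legal_hold: bool,
                   now: datetime) -> str:
    """Classify an incident into a retention tier."""
    if legal_hold:
        return "legal_hold"        # overrides every other rule
    if closed_at is None:
        return "active"            # still open
    age = now - closed_at
    if age < RECENT:
        return "recent"
    if age <= ARCHIVE_LIMIT:
        return "archived"
    return "purge"


now = datetime(2025, 8, 23, tzinfo=timezone.utc)
print(retention_tier(now - timedelta(days=400), False, now))
```

Checking the legal-hold flag first ensures that no age-based branch can ever schedule a held incident for deletion.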
API Design
REST API
Core Resources:
/api/v1/incidents # Incident CRUD operations
/api/v1/incidents/{id}/timeline # Event timeline access
/api/v1/incidents/{id}/board # Status board data
/api/v1/connectors # Provider integrations
/api/v1/users # User management (SCIM)
/api/v1/groups # Group management (SCIM)
/api/v1/policies # Policy administration
Event Streaming:
/api/v1/events/stream # Server-Sent Events
/api/v1/incidents/{id}/stream # Incident-specific events
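A client of the streaming endpoints needs to split the Server-Sent Events wire format into messages. The Python sketch below parses `data:` fields from an iterable of lines and decodes each message as JSON. That each message carries a JSON-encoded event is an assumption about the payload, and `parse_sse` is a hypothetical helper that ignores the `event:`, `id:`, and `retry:` fields.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE message (blank-line delimited)."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the message collected so far.
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format strips a single leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)


sample = [
    ": keep-alive\n",
    "event: timeline\n",
    'data: {"type": "com.incidents.note_added.v1"}\n',
    "\n",
]
for message in parse_sse(sample):
    print(message["type"])
```

Because the parser takes any iterable of lines, it can be fed from a decoded HTTP response body or from a fixture list in tests.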
GraphQL API
For complex queries and real-time subscriptions:
type Incident {
  id: ID!
  title: String!
  severity: Severity!
  status: IncidentStatus!
  timeline(limit: Int, offset: Int): [TimelineEvent!]!
  statusBoard(persona: Persona!): StatusBoard!
  connectedTickets: [ExternalTicket!]!
}
type Subscription {
  incidentUpdated(id: ID!): Incident!
  timelineEventAdded(incidentId: ID!): TimelineEvent!
}
Deployment Architecture
Deployment Targets
Docker Compose (development/small teams):
services:
  incident-server:
    image: incidents/server:latest
    depends_on: [postgres, redis, nats]
  postgres:
    image: postgres:15-alpine
  redis:
    image: redis:7-alpine
  nats:
    image: nats:latest
Kubernetes (production):
- Horizontal Pod Autoscaling for API servers
- StatefulSets for PostgreSQL with replicas
- Persistent volumes for data storage
- Ingress with TLS termination
- NetworkPolicies for security
Serverless (cloud-native):
- Containerized API on Cloud Run/ECS
- Managed PostgreSQL (RDS/Cloud SQL)
- Managed Redis (ElastiCache/MemoryStore)
- Event streaming via cloud-native solutions
Performance Characteristics
Throughput:
- 10,000+ events/second ingestion
- Sub-second API response times
- Real-time WebSocket updates
- Efficient bulk operations
Scalability:
- Horizontal scaling for stateless components
- Read replicas for PostgreSQL
- Redis clustering for high availability
- CDN for static assets
Reliability:
- 99.9% uptime SLA target
- Automatic failover capabilities
- Comprehensive backup and recovery
- Circuit breakers and rate limiting
This event-sourced architecture provides a solid foundation for incident management while maintaining flexibility, security, and operational excellence.