Architecture

Overview

Incidents is built on an event-sourced architecture, with CloudEvents v1.0 as the canonical event format. The system serves as the single source of truth for incident data while maintaining bi-directional synchronization with external ITSM platforms.

Core Architectural Principles:

Event-sourced timeline as the foundation
CloudEvents v1.0 for all inter-service communication
Policy-based security with field-level redaction
Bi-directional ITSM sync with conflict resolution
Self-hostable and cloud-agnostic deployment

Architecture Documentation

Platform Vision: The complete technical vision and architecture are documented in vision.md (formerly docs/architecture/VISION.md).

Architecture Decision Records: All significant architectural decisions are documented using the MADR format. Browse all ADRs to understand key design choices and their rationale.

Research & Analysis: Market research, competitive analysis, and integration priorities are documented in the Research section.

Component Status: View the component status dashboard for current implementation progress.

High-Level Architecture

System Components

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   External  │    │   External  │    │   External  │
│ Monitoring  │    │    ITSM     │    │   ChatOps   │
│  (PD, etc)  │    │ (SNOW, JSM) │    │(Slack,Teams)│
└─────┬───────┘    └─────┬───────┘    └─────┬───────┘
      │                  │                  │
      │ Webhooks/API     │ Bi-directional   │ Bot/Webhook
      │                  │ Sync             │
      ▼                  ▼                  ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Provider   │    │  Provider   │    │  Provider   │
│ Connectors  │    │ Connectors  │    │ Connectors  │
└─────┬───────┘    └─────┬───────┘    └─────┬───────┘
      │                  │                  │
      └──────────────────┼──────────────────┘
                         │ CloudEvents v1.0
                         ▼
                  ┌─────────────┐
                  │ Event Bus   │
                  │ (NATS/Kafka)│
                  └─────┬───────┘
                        │
          ┌─────────────┼─────────────┐
          │             │             │
          ▼             ▼             ▼
    ┌──────────┐  ┌──────────┐  ┌──────────┐
    │Timeline  │  │Orchestr- │  │ Policy   │
    │Service   │  │  ator    │  │ Engine   │
    │(Events)  │  │(Workflow)│  │(RBAC/    │
    │          │  │          │  │ ABAC)    │
    └─────┬────┘  └─────┬────┘  └─────┬────┘
          │             │             │
          └─────────────┼─────────────┘
                        │
          ┌─────────────┼─────────────┐
          │             │             │
          ▼             ▼             ▼
    ┌──────────┐  ┌──────────┐  ┌──────────┐
    │   API    │  │ Status   │  │Artifacts │
    │ Gateway  │  │ Boards   │  │ Store    │
    │          │  │          │  │(Object   │
    │          │  │          │  │Storage)  │
    └─────┬────┘  └──────────┘  └──────────┘
          │
          ▼
    ┌──────────┐
    │   UI     │
    │(Web/CLI/ │
    │ Mobile)  │
    └──────────┘

Data Flow

External events → Ingestors → Event Bus → Orchestrator → Timeline + Boards + Connectors → API/UI

All inter-service messages use CloudEvents v1.0 envelopes (structured JSON on the bus). Policy is evaluated at Policy Enforcement Points (PEPs) before reads/lists/exports and before connector side-effects.

Event-Sourced Foundation

Timeline Service

The Timeline Service is the core component that maintains an append-only log of all incident-related events using CloudEvents v1.0 format.

Key Features:

Immutable event store with PostgreSQL backend
CloudEvents v1.0 compliance for all events
Deduplication using source + id combination
Event correlation and fingerprinting
Materialized views for performance

Event Structure:

{
  "specversion": "1.0",
  "type": "com.incidents.state_change.v1",
  "source": "incidents://orchestrator",
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "time": "2025-08-23T10:30:00Z",
  "datacontenttype": "application/json",
  "subject": "incidents/12345",
  "data": {
    "incident_id": "12345",
    "previous_status": "open",
    "new_status": "mitigated",
    "actor": {
      "type": "user",
      "id": "user@company.com"
    },
    "reason": "Applied database fix"
  }
}
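The append-only and deduplication behaviour described above can be sketched in a few lines. This is an illustrative in-memory stand-in, not the service's actual PostgreSQL implementation: the class name `InMemoryTimeline` and its methods are hypothetical. The required attribute list follows the CloudEvents v1.0 specification.

```python
from dataclasses import dataclass, field

# Attributes the CloudEvents v1.0 specification marks as required.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


@dataclass
class InMemoryTimeline:
    """Hypothetical stand-in for the PostgreSQL-backed event store."""

    _events: list = field(default_factory=list)
    _seen: set = field(default_factory=set)

    def append(self, event: dict) -> bool:
        """Append an event; return False if (source, id) was already stored."""
        missing = [name for name in REQUIRED_ATTRIBUTES if name not in event]
        if missing:
            raise ValueError(f"missing CloudEvents attributes: {missing}")
        key = (event["source"], event["id"])
        if key in self._seen:
            return False  # same source + id: treat as a redelivery and skip
        self._seen.add(key)
        self._events.append(dict(event))  # store a shallow copy of the envelope
        return True

    def events(self) -> tuple:
        """Return a snapshot of the log in append order."""
        return tuple(self._events)
```

In a relational store the same guarantee would more likely come from a unique constraint on (source, id) than from an in-process set.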

Event Types

Core Incident Events:

com.incidents.declared.v1 - New incident created
com.incidents.state_change.v1 - Status transitions
com.incidents.assigned.v1 - Assignment changes
com.incidents.note_added.v1 - Timeline notes
com.incidents.resolved.v1 - Resolution with fix details
com.incidents.closed.v1 - Final closure

Integration Events:

com.incidents.provider.*.ingested.v1 - External system events
com.incidents.provider.*.sync_requested.v1 - Sync operations
com.incidents.provider.*.sync_completed.v1 - Sync results

Communication Events:

com.incidents.chat.message.v1 - Chat messages
com.incidents.comms.update_sent.v1 - Stakeholder updates
com.incidents.comms.escalation.v1 - Escalation events
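The `*` in the integration event types stands for a provider name. A minimal sketch of matching concrete event types against such patterns follows, assuming `*` matches exactly one dot-separated segment; that rule and both function names are assumptions of this example, not documented behaviour.

```python
import re


def type_pattern_to_regex(pattern: str) -> re.Pattern:
    """Compile a type pattern where `*` matches one dot-separated segment."""
    parts = [
        r"[^.]+" if part == "*" else re.escape(part)
        for part in pattern.split(".")
    ]
    return re.compile(r"\.".join(parts))


def matches(pattern: str, event_type: str) -> bool:
    """Return True when the whole event type fits the pattern."""
    return type_pattern_to_regex(pattern).fullmatch(event_type) is not None
```

For example, `matches("com.incidents.provider.*.ingested.v1", "com.incidents.provider.pagerduty.ingested.v1")` holds, while a type with two segments in place of the `*` does not match.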

Orchestrator

The Orchestrator component manages incident state machines, correlation, and workflow coordination.

State Machine

    ┌─────────┐
    │  Draft  │ (optional pre-state)
    └────┬────┘
         │
         ▼
    ┌─────────┐    ┌──────────┐    ┌──────────┐    ┌────────┐
    │  Open   ├───►│Mitigated ├───►│Resolved  ├───►│Closed  │
    └────┬────┘    └──────────┘    └──────────┘    └────────┘
         │
         ▼
    ┌─────────┐
    │Canceled │
    └─────────┘

State Definitions:

Draft: Optional pre-state that precedes Open
Open: Active incident requiring response
Mitigated: Impact reduced, root cause being addressed
Resolved: Fixed, monitoring for stability
Closed: Final state after confirmation
Canceled: Invalid/duplicate incident
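The transitions in the diagram can be expressed as a lookup table with a guard function. This is a sketch that reads the edges directly off the diagram; the lowercase state names and the `transition` helper are assumptions for illustration.

```python
# Allowed transitions, read off the state diagram.
TRANSITIONS = {
    "draft": {"open"},
    "open": {"mitigated", "canceled"},
    "mitigated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),      # terminal
    "canceled": set(),    # terminal
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not an edge in the diagram."""


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Keeping the edges in data rather than in branching code makes it easy to reuse the same table for the `state_machine` conflict policy guard.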

Correlation Engine

The Orchestrator correlates incoming events in order to:

Group related alerts into single incidents
Detect duplicate incidents across providers
Link incidents to affected services and infrastructure
Track incident families and cascading failures

Correlation Keys:

correlation:
  fingerprints:
    - service + error_type + time_window
    - alert_rule + host + time_window
    - user_report + service + time_window

  deduplication:
    - source + origin_id (exact match)
    - fingerprint + time_window (fuzzy match)
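One way to turn a key such as `service + error_type + time_window` into a comparable fingerprint is to hash the parts together with a time bucket. The sketch below assumes fixed, epoch-aligned windows of 300 seconds; the window size, the `|` separator, and the function name are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(parts: list[str], occurred_at: datetime, window_seconds: int = 300) -> str:
    """Hash the key parts together with the time bucket the event falls into."""
    bucket = int(occurred_at.timestamp()) // window_seconds
    material = "|".join([*parts, str(bucket)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Two alerts for the same service and error type inside one window collide.
first = fingerprint(
    ["checkout", "db_timeout"],
    datetime(2025, 8, 23, 10, 30, 0, tzinfo=timezone.utc),
)
second = fingerprint(
    ["checkout", "db_timeout"],
    datetime(2025, 8, 23, 10, 34, 59, tzinfo=timezone.utc),
)
```

Fixed buckets split events that straddle a window boundary, so a fuzzy match would typically also check the neighbouring bucket.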

Bi-directional ITSM Sync

Design Goals

Keep the incident record as the canonical system of record
Sync essential fields with ITSM platforms in both directions
Explicit field ownership and conflict resolution
Drift detection and reconciliation
Deterministic behavior under concurrent updates

Field Mapping

Each integrated ITSM platform has a configured field mapping:

servicenow:
  field_mappings:
    title:
      external_field: "short_description"
      source_of_truth: "internal"
      conflict_policy: "internal_wins"

    severity:
      external_field: "urgency"
      source_of_truth: "bidirectional"
      conflict_policy: "most_recent_allowed"
      value_mapping:
        SEV-1: "1 - High"
        SEV-2: "2 - Medium"
        SEV-3: "3 - Low"

    status:
      external_field: "state"
      source_of_truth: "internal"
      conflict_policy: "state_machine"
      value_mapping:
        open: "2" # In Progress
        mitigated: "6" # Resolved
        resolved: "6" # Resolved
        closed: "7" # Closed
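Applied to an outbound update, the mapping above might look like the following sketch. The dictionaries copy the `value_mapping` entries from the configuration; the function names and the shape of the internal incident record are assumptions.

```python
# Value mappings copied from the ServiceNow configuration above.
SEVERITY_TO_URGENCY = {"SEV-1": "1 - High", "SEV-2": "2 - Medium", "SEV-3": "3 - Low"}
STATUS_TO_STATE = {"open": "2", "mitigated": "6", "resolved": "6", "closed": "7"}


def to_servicenow(incident: dict) -> dict:
    """Translate an internal incident record into external field names and values."""
    return {
        "short_description": incident["title"],
        "urgency": SEVERITY_TO_URGENCY[incident["severity"]],
        "state": STATUS_TO_STATE[incident["status"]],
    }


def internal_statuses_for(state: str) -> list[str]:
    """List every internal status that maps to the given external state."""
    return sorted(status for status, mapped in STATUS_TO_STATE.items() if mapped == state)
```

Note that `internal_statuses_for("6")` yields both `mitigated` and `resolved`, so the status mapping cannot be inverted unambiguously. That is consistent with the configuration marking status as internally owned.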

Conflict Resolution

When field conflicts occur, the system applies configured policies:

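internal_wins: The internal value takes precedence
external_wins: The provider's value takes precedence
most_recent_allowed: The most recent write wins, determined by timestamps and validated with ETags
state_machine: The change is accepted only if it passes the state transition guards
manual: The conflict is flagged for human review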
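A minimal sketch of dispatching on the configured policy is shown below. It covers four of the five policies; `state_machine` is omitted because it depends on transition guards, and ETag validation is left out entirely. The `FieldVersion` type, the `NeedsReview` exception, and the tie-breaking rule are assumptions of this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FieldVersion:
    """A field value together with the time it was last written."""
    value: str
    updated_at: datetime


class NeedsReview(Exception):
    """Raised when a conflict must be escalated to a human."""


def resolve(policy: str, internal: FieldVersion, external: FieldVersion) -> FieldVersion:
    """Pick the winning version of a conflicting field for the given policy."""
    if policy == "internal_wins":
        return internal
    if policy == "external_wins":
        return external
    if policy == "most_recent_allowed":
        # Ties go to the internal value (an assumption of this sketch).
        return external if external.updated_at > internal.updated_at else internal
    if policy == "manual":
        raise NeedsReview(f"{internal.value!r} vs {external.value!r}")
    raise ValueError(f"unsupported policy in this sketch: {policy}")
```

Returning the whole `FieldVersion` rather than just the value keeps the winning timestamp available for the next comparison.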

Outbox Pattern

Reliable side-effects use the transactional outbox pattern:

-- Ensure connector actions are reliable
CREATE TABLE outbox_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_id UUID NOT NULL,    -- incident_id
  event_type TEXT NOT NULL,      -- provider.action type
  payload JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  processed_at TIMESTAMPTZ,
  attempts INT DEFAULT 0,
  last_error TEXT
);

-- Atomic update + side-effect scheduling
BEGIN;
  UPDATE incidents SET status = 'acknowledged' WHERE id = ?;
  INSERT INTO outbox_events (aggregate_id, event_type, payload)
    VALUES (?, 'provider.pagerduty.acknowledge', ?);
COMMIT;
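The other half of the outbox pattern is a relay that drains unprocessed rows and performs the side-effect. The sketch below uses SQLite from the Python standard library so that it runs without a server; the production store is PostgreSQL, and the `enqueue` and `relay_once` helpers, the `deliver` callback, and the `max_attempts` limit are hypothetical.

```python
import sqlite3
import uuid
from typing import Callable

# SQLite rendering of the outbox table (types simplified for the sketch).
SCHEMA = """
CREATE TABLE outbox_events (
  id TEXT PRIMARY KEY,
  aggregate_id TEXT NOT NULL,
  event_type TEXT NOT NULL,
  payload TEXT NOT NULL,
  created_at TEXT DEFAULT CURRENT_TIMESTAMP,
  processed_at TEXT,
  attempts INTEGER DEFAULT 0,
  last_error TEXT
)
"""


def enqueue(conn: sqlite3.Connection, aggregate_id: str, event_type: str, payload: str) -> str:
    """Insert an outbox row; the caller commits it together with the state change."""
    event_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO outbox_events (id, aggregate_id, event_type, payload) VALUES (?, ?, ?, ?)",
        (event_id, aggregate_id, event_type, payload),
    )
    return event_id


def relay_once(conn: sqlite3.Connection, deliver: Callable[[str, str], None], max_attempts: int = 5) -> int:
    """Try to deliver every pending row once; return how many succeeded."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox_events "
        "WHERE processed_at IS NULL AND attempts < ? ORDER BY created_at, id",
        (max_attempts,),
    ).fetchall()
    delivered = 0
    for event_id, event_type, payload in rows:
        try:
            deliver(event_type, payload)
        except Exception as exc:  # record the failure and leave the row pending
            conn.execute(
                "UPDATE outbox_events SET attempts = attempts + 1, last_error = ? WHERE id = ?",
                (str(exc), event_id),
            )
        else:
            conn.execute(
                "UPDATE outbox_events SET attempts = attempts + 1, "
                "processed_at = CURRENT_TIMESTAMP WHERE id = ?",
                (event_id,),
            )
            delivered += 1
        conn.commit()
    return delivered
```

A relay of this shape delivers at least once: if the process stops after `deliver` succeeds but before the row is marked processed, the row is sent again, so connector actions should be idempotent.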

ChatOps Integration

Channel Strategy

Per-incident channels (default):

Naming: #inc-<id>-sev<level>-<slug>
Example: #inc-1234-sev2-checkout-errors
Visibility suffix: -restricted for sensitive data
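The naming convention can be generated mechanically from the incident fields. In this sketch the slug rule (lowercase, runs of other characters collapsed to hyphens) and the `max_length` of 80 are assumptions; the function name is hypothetical and the leading `#` is left to the chat client.

```python
import re


def channel_name(incident_id: int, severity: int, title: str,
                 restricted: bool = False, max_length: int = 80) -> str:
    """Build inc-<id>-sev<level>-<slug>, with -restricted appended when requested."""
    head = f"inc-{incident_id}-sev{severity}"
    tail = "-restricted" if restricted else ""
    # Lowercase the title and collapse anything that is not a-z or 0-9 into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Trim the slug so the full name fits; 1 accounts for the joining hyphen.
    room = max(max_length - len(head) - len(tail) - 1, 0)
    slug = slug[:room].rstrip("-")
    middle = f"-{slug}" if slug else ""
    return f"{head}{middle}{tail}"
```

Trimming only the slug keeps the incident ID, severity, and visibility suffix intact even for long titles.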

Channel Features:

Rich topic with incident summary and links
Pinned widgets (status board, timeline, runbooks)
Auto-invites for relevant roles and on-call teams
Thread-to-timeline event mapping

Slash Commands

Key ChatOps commands integrated with event sourcing:

| Command | Purpose | Event Type |
| --- | --- | --- |
| `/im new [title] [sev]` | Declare incident | `com.incidents.declared.v1` |
| `/im ic @user` | Assign Incident Commander | `com.incidents.role.assigned.v1` |
| `/im status <state>` | Transition status | `com.incidents.state_change.v1` |
| `/im note <text>` | Add timeline note | `com.incidents.note_added.v1` |
| `/im ack <alert>` | Acknowledge alert | `com.incidents.provider.*.ack.v1` |

Message Mapping

Chat messages become CloudEvents with deterministic IDs:

{
  "specversion": "1.0",
  "type": "com.incidents.chat.message.v1",
  "source": "slack://team123/channel456",
  "id": "sha256(slack:team123:channel456:1692789123.456)",
  "time": "2025-08-23T10:30:00Z",
  "data": {
    "message": "Database connection pool optimized",
    "user_id": "U123456",
    "thread_ts": "1692789100.123",
    "permalink": "https://team.slack.com/archives/C123/p1692789123456"
  }
}
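The `id` in the example is written as a formula rather than a literal value. A minimal sketch of computing it follows, assuming the hash input is the colon-joined string shown in the example and that the hex digest is used as the ID; the function name is hypothetical.

```python
import hashlib


def chat_event_id(team_id: str, channel_id: str, message_ts: str) -> str:
    """Derive a stable event ID from the coordinates of a Slack message."""
    key = f"slack:{team_id}:{channel_id}:{message_ts}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the ID depends only on the message coordinates, a webhook that is redelivered produces the same `source` + `id` pair and is dropped by timeline deduplication.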

Security & Privacy Model

Policy-Based Access Control

The system implements Attribute-Based Access Control (ABAC) using Open Policy Agent (OPA):

Policy Enforcement Points (PEPs):

API Gateway (all requests)
Timeline Service (event reads)
Status Boards (widget data)
Connector sync (external updates)
Export operations (data downloads)

Policy Decision Point (PDP):

Centralized OPA instance with policy bundles
Dynamic policy loading and hot reload
Comprehensive policy testing framework

Data Classification

Events and fields are classified for access control:

data_classifications:
  public: # Status, basic timeline
  internal: # Technical details, some communications
  confidential: # Personal data, sensitive business info
  restricted: # Legal hold, privacy incidents
  secret: # Security incidents, compliance data

Field-Level Redaction

Sensitive data is redacted based on user permissions:

{
  "incident_id": "12345",
  "title": "Payment processing issue",
  "description": "**[REDACTED - CONFIDENTIAL]**",
  "assignee": {
    "name": "**[REDACTED - PII]**",
    "email": "**[REDACTED - PII]**"
  },
  "customer_impact": "**[REDACTED - RESTRICTED]**"
}
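Output of this shape can be produced by walking the record and masking any field whose label the caller is not cleared for. This is a sketch only: the `FIELD_LABELS` table, the dotted-path convention, and the `redact` function are assumptions, and in the real system the decision would come from the policy engine rather than a static dictionary.

```python
# Hypothetical mapping from field path to the label shown when redacted.
FIELD_LABELS = {
    "description": "CONFIDENTIAL",
    "assignee.name": "PII",
    "assignee.email": "PII",
    "customer_impact": "RESTRICTED",
}


def redact(record: dict, allowed_labels: set[str], _prefix: str = "") -> dict:
    """Return a copy of the record with disallowed fields replaced by markers."""
    result = {}
    for key, value in record.items():
        path = f"{_prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            result[key] = redact(value, allowed_labels, f"{path}.")
            continue
        label = FIELD_LABELS.get(path)
        if label is not None and label not in allowed_labels:
            result[key] = f"**[REDACTED - {label}]**"
        else:
            result[key] = value
    return result
```

Calling `redact(record, set())` masks every labelled field, while unlabelled fields such as `title` pass through unchanged.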

Storage Architecture

Core Storage

PostgreSQL (primary):

Incident records and metadata
Event store (timeline_events table)
User and group management
Connector configurations
Policy cache

Redis (caching):

Session storage
Real-time board updates
Rate limiting counters
WebSocket connection state

Object Storage (artifacts):

File attachments
Exported timelines
Backup archives
Policy bundles

Data Retention

Tiered Retention:

Active incidents: Full access, no restrictions
Recent incidents (< 90 days): Full data with normal policies
Archived incidents (90 days - 7 years): Read-only, compressed storage
Legal hold: Override retention, full preservation
Purged data (> 7 years): Hard delete unless legal hold
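The tiers can be read as a decision function over an incident's age. The sketch below assumes age is measured from the time the incident was closed, that an incident with no closure time is still active, and that 7 years is approximated as 7 × 365 days; none of these details are stated in the source, and the tier names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RECENT_WINDOW = timedelta(days=90)
ARCHIVE_WINDOW = timedelta(days=7 * 365)  # approximation: ignores leap days


def retention_tier(closed_at: Optional[datetime], now: datetime, legal_hold: bool = False) -> str:
    """Classify an incident into a retention tier."""
    if legal_hold:
        return "legal_hold"  # overrides every age-based rule
    if closed_at is None:
        return "active"
    age = now - closed_at
    if age < RECENT_WINDOW:
        return "recent"
    if age <= ARCHIVE_WINDOW:
        return "archived"
    return "purge"
```

Checking the legal hold first ensures that no age-based rule can ever select a held incident for deletion.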

API Design

REST API

Core Resources:

/api/v1/incidents              # Incident CRUD operations
/api/v1/incidents/{id}/timeline # Event timeline access
/api/v1/incidents/{id}/board   # Status board data
/api/v1/connectors            # Provider integrations
/api/v1/users                 # User management (SCIM)
/api/v1/groups                # Group management (SCIM)
/api/v1/policies              # Policy administration

Event Streaming:

/api/v1/events/stream         # Server-Sent Events
/api/v1/incidents/{id}/stream # Incident-specific events
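A client of the streaming endpoints has to reassemble events from the Server-Sent Events wire format. The sketch below parses already-decoded lines and assumes each event's `data` field carries one JSON document, such as a CloudEvent; that payload shape and the function name are assumptions, and fields other than `data` are ignored.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE event."""
    data: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield json.loads("\n".join(data))
                data = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            data.append(value)
```

Any iterable of text lines can be passed in, which keeps the parser independent of the HTTP client used to open the stream.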

GraphQL API

For complex queries and real-time subscriptions:

type Incident {
  id: ID!
  title: String!
  severity: Severity!
  status: IncidentStatus!
  timeline(limit: Int, offset: Int): [TimelineEvent!]!
  statusBoard(persona: Persona!): StatusBoard!
  connectedTickets: [ExternalTicket!]!
}

type Subscription {
  incidentUpdated(id: ID!): Incident!
  timelineEventAdded(incidentId: ID!): TimelineEvent!
}

Deployment Architecture

Deployment Targets

Docker Compose (development/small teams):

services:
  incident-server:
    image: incidents/server:latest
    depends_on: [postgres, redis, nats]

  postgres:
    image: postgres:15-alpine

  redis:
    image: redis:7-alpine

  nats:
    image: nats:latest

Kubernetes (production):

Horizontal Pod Autoscaling for API servers
StatefulSets for PostgreSQL with replicas
Persistent volumes for data storage
Ingress with TLS termination
NetworkPolicies for security

Serverless (cloud-native):

Containerized API on Cloud Run/ECS
Managed PostgreSQL (RDS/Cloud SQL)
Managed Redis (ElastiCache/MemoryStore)
Event streaming via cloud-native solutions

Performance Characteristics

Throughput:

10,000+ events/second ingestion
Sub-second API response times
Real-time WebSocket updates
Efficient bulk operations

Scalability:

Horizontal scaling for stateless components
Read replicas for PostgreSQL
Redis clustering for high availability
CDN for static assets

Reliability:

99.9% uptime SLA target
Automatic failover capabilities
Comprehensive backup and recovery
Circuit breakers and rate limiting

Together, the event-sourced timeline, policy-based access control, and bi-directional ITSM sync give incident data a single auditable record that can be deployed self-hosted or in the cloud.
