Architecture

Overview

The Incidents platform is built on an event-sourced architecture, with CloudEvents v1.0 as the canonical event format. It serves as the single source of truth for incident data while maintaining bi-directional synchronization with external ITSM platforms.

Core Architectural Principles:

  • Event-sourced timeline as the foundation
  • CloudEvents v1.0 for all inter-service communication
  • Policy-based security with field-level redaction
  • Bi-directional ITSM sync with conflict resolution
  • Self-hostable and cloud-agnostic deployment

Architecture Documentation

Platform Vision: The complete technical vision and architecture are documented in vision.md (formerly docs/architecture/VISION.md).

Architecture Decision Records: All significant architectural decisions are documented using the MADR format. Browse all ADRs to understand key design choices and their rationale.

Research & Analysis: Market research, competitive analysis, and integration priorities are documented in the Research section.

Component Status: View the component status dashboard for current implementation progress.

High-Level Architecture

System Components

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   External  │    │   External  │    │   External  │
│ Monitoring  │    │    ITSM     │    │   ChatOps   │
│  (PD, etc)  │    │ (SNOW, JSM) │    │(Slack,Teams)│
└─────┬───────┘    └─────┬───────┘    └─────┬───────┘
      │                  │                  │
      │ Webhooks/API     │ Bi-directional   │ Bot/Webhook
      │                  │ Sync             │
      ▼                  ▼                  ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Provider   │    │  Provider   │    │  Provider   │
│ Connectors  │    │ Connectors  │    │ Connectors  │
└─────┬───────┘    └─────┬───────┘    └─────┬───────┘
      │                  │                  │
      └──────────────────┼──────────────────┘
                         │ CloudEvents v1.0
                         ▼
                  ┌─────────────┐
                  │ Event Bus   │
                  │ (NATS/Kafka)│
                  └─────┬───────┘
                        │
          ┌─────────────┼─────────────┐
          │             │             │
          ▼             ▼             ▼
    ┌──────────┐  ┌──────────┐  ┌──────────┐
    │Timeline  │  │Orchestr- │  │ Policy   │
    │Service   │  │  ator    │  │ Engine   │
    │(Events)  │  │(Workflow)│  │(RBAC/    │
    │          │  │          │  │ ABAC)    │
    └─────┬────┘  └─────┬────┘  └─────┬────┘
          │             │             │
          └─────────────┼─────────────┘
                        │
          ┌─────────────┼─────────────┐
          │             │             │
          ▼             ▼             ▼
    ┌──────────┐  ┌──────────┐  ┌──────────┐
    │   API    │  │ Status   │  │Artifacts │
    │ Gateway  │  │ Boards   │  │ Store    │
    │          │  │          │  │(Object   │
    │          │  │          │  │Storage)  │
    └─────┬────┘  └──────────┘  └──────────┘
          │
          ▼
    ┌──────────┐
    │   UI     │
    │(Web/CLI/ │
    │ Mobile)  │
    └──────────┘

Data Flow

External events → Ingestors → Event Bus → Orchestrator → Timeline + Boards + Connectors → API/UI

All inter-service messages use CloudEvents v1.0 envelopes (structured JSON on the bus). Policy is evaluated at Policy Enforcement Points (PEPs) before reads/lists/exports and before connector side-effects.

Event-Sourced Foundation

Timeline Service

The Timeline Service is the core component that maintains an append-only log of all incident-related events using CloudEvents v1.0 format.

Key Features:

  • Immutable event store with PostgreSQL backend
  • CloudEvents v1.0 compliance for all events
  • Deduplication using source + id combination
  • Event correlation and fingerprinting
  • Materialized views for performance
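The deduplication rule above (source + id) can be sketched with an in-memory stand-in for the event store. This is a minimal illustration, not the service's implementation: the `TimelineStore` class and its method names are hypothetical, and the real store is backed by PostgreSQL.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class TimelineStore:
    """Hypothetical in-memory stand-in for the append-only event store."""

    _events: list = field(default_factory=list)
    _seen: set = field(default_factory=set)

    def append(self, event: dict) -> bool:
        """Append an event; return False if (source, id) was already stored."""
        key = (event["source"], event["id"])
        if key in self._seen:
            return False  # duplicate delivery: keep the log unchanged
        self._seen.add(key)
        # Deep-copy so later changes to the caller's dict cannot alter the log.
        self._events.append(copy.deepcopy(event))
        return True

    def events(self) -> tuple:
        """Return a snapshot of the log in append order."""
        return tuple(copy.deepcopy(self._events))
```

In a PostgreSQL-backed store the same guarantee would more likely come from a unique constraint on (source, id) than from application code.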

Event Structure:

{
  "specversion": "1.0",
  "type": "com.incidents.state_change.v1",
  "source": "incidents://orchestrator",
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "time": "2025-08-23T10:30:00Z",
  "datacontenttype": "application/json",
  "subject": "incidents/12345",
  "data": {
    "incident_id": "12345",
    "previous_status": "open",
    "new_status": "mitigated",
    "actor": {
      "type": "user",
      "id": "user@company.com"
    },
    "reason": "Applied database fix"
  }
}
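As a rough sketch, an envelope like the one above could be assembled and checked as follows. The helper names `make_event` and `missing_attributes` are illustrative assumptions; the required-attribute list (specversion, id, source, type) comes from the CloudEvents v1.0 specification.

```python
import uuid
from datetime import datetime, timezone

# Attributes that CloudEvents v1.0 marks as REQUIRED.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


def make_event(event_type: str, source: str, subject: str, data: dict) -> dict:
    """Build a structured-mode CloudEvents v1.0 envelope (hypothetical helper)."""
    return {
        "specversion": "1.0",
        "type": event_type,
        "source": source,
        "id": str(uuid.uuid4()),
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "datacontenttype": "application/json",
        "subject": subject,
        "data": data,
    }


def missing_attributes(event: dict) -> list:
    """Return the required attributes that are absent or empty."""
    return [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]
```

For example, `make_event("com.incidents.state_change.v1", "incidents://orchestrator", "incidents/12345", {...})` yields a dict that serializes with `json.dumps` into the shape shown above.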

Event Types

Core Incident Events:

  • com.incidents.declared.v1 - New incident created
  • com.incidents.state_change.v1 - Status transitions
  • com.incidents.assigned.v1 - Assignment changes
  • com.incidents.note_added.v1 - Timeline notes
  • com.incidents.resolved.v1 - Resolution with fix details
  • com.incidents.closed.v1 - Final closure

Integration Events:

  • com.incidents.provider.*.ingested.v1 - External system events
  • com.incidents.provider.*.sync_requested.v1 - Sync operations
  • com.incidents.provider.*.sync_completed.v1 - Sync results

Communication Events:

  • com.incidents.chat.message.v1 - Chat messages
  • com.incidents.comms.update_sent.v1 - Stakeholder updates
  • com.incidents.comms.escalation.v1 - Escalation events

Orchestrator

The Orchestrator component manages incident state machines, correlation, and workflow coordination.

State Machine

    ┌─────────┐
    │  Draft  │ (optional pre-state)
    └────┬────┘
         │
         ▼
    ┌─────────┐    ┌──────────┐    ┌──────────┐    ┌────────┐
    │  Open   ├───►│Mitigated ├───►│Resolved  ├───►│Closed  │
    └────┬────┘    └──────────┘    └──────────┘    └────────┘
         │
         ▼
    ┌─────────┐
    │Canceled │
    └─────────┘

State Definitions:

  • Open: Active incident requiring response
  • Mitigated: Impact reduced, root cause being addressed
  • Resolved: Fixed, monitoring for stability
  • Closed: Final state after confirmation
  • Canceled: Invalid/duplicate incident
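A transition guard for this state machine might look like the sketch below. It encodes only the arrows drawn in the diagram; whether the Orchestrator also permits reopening, skipping states, or canceling from later states is not stated here, so those are treated as disallowed. The names `ALLOWED_TRANSITIONS`, `InvalidTransition`, and `transition` are hypothetical.

```python
# Only the arrows shown in the diagram; anything else is rejected.
ALLOWED_TRANSITIONS = {
    "draft": {"open"},
    "open": {"mitigated", "canceled"},
    "mitigated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),    # terminal
    "canceled": set(),  # terminal
}


class InvalidTransition(ValueError):
    """Raised when a requested status change is not in the diagram."""


def transition(current: str, new: str) -> str:
    """Return the new status if the change is allowed, else raise."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new
```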

Correlation Engine

The Orchestrator correlates incoming events in order to:

  • Group related alerts into single incidents
  • Detect duplicate incidents across providers
  • Link incidents to affected services and infrastructure
  • Track incident families and cascading failures

Correlation Keys:

correlation:
  fingerprints:
    - service + error_type + time_window
    - alert_rule + host + time_window
    - user_report + service + time_window

  deduplication:
    - source + origin_id (exact match)
    - fingerprint + time_window (fuzzy match)
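One way to realize a "fields + time_window" fingerprint is to bucket the timestamp and hash it together with the fields. The sketch below assumes fixed, non-overlapping windows and a 300-second default; both are illustrative choices, not values taken from the configuration above.

```python
import hashlib


def time_bucket(epoch_seconds: float, window_seconds: int = 300) -> int:
    """Map a timestamp to the index of its fixed-size window."""
    return int(epoch_seconds // window_seconds)


def fingerprint(parts: list, epoch_seconds: float, window_seconds: int = 300) -> str:
    """Hash the correlation fields together with the time bucket."""
    bucket = time_bucket(epoch_seconds, window_seconds)
    raw = "|".join([*parts, str(bucket)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Two alerts for the same service and error type that fall in the same window share a fingerprint and can be grouped. Fixed buckets have a known limitation: alerts a few seconds apart can straddle a boundary and hash differently, which a sliding-window comparison would avoid.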

Bi-directional ITSM Sync

Design Goals

  • Keep the incident record as the canonical system of record
  • Sync essential fields with ITSM platforms both ways
  • Explicit field ownership and conflict resolution
  • Drift detection and reconciliation
  • Deterministic behavior under concurrent updates

Field Mapping

Each integrated ITSM platform has a configured field mapping:

servicenow:
  field_mappings:
    title:
      external_field: "short_description"
      source_of_truth: "internal"
      conflict_policy: "internal_wins"

    severity:
      external_field: "urgency"
      source_of_truth: "bidirectional"
      conflict_policy: "most_recent_allowed"
      value_mapping:
        SEV-1: "1 - High"
        SEV-2: "2 - Medium"
        SEV-3: "3 - Low"

    status:
      external_field: "state"
      source_of_truth: "internal"
      conflict_policy: "state_machine"
      value_mapping:
        open: "2" # In Progress
        mitigated: "6" # Resolved
        resolved: "6" # Resolved
        closed: "7" # Closed
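To make the mapping concrete, here is a small sketch that applies the value mappings above to an outbound record. The dictionaries mirror the configuration; the function names and the shape of the `incident` dict are assumptions for illustration.

```python
# Value mappings copied from the servicenow configuration above.
SEVERITY_TO_URGENCY = {"SEV-1": "1 - High", "SEV-2": "2 - Medium", "SEV-3": "3 - Low"}
STATUS_TO_STATE = {"open": "2", "mitigated": "6", "resolved": "6", "closed": "7"}

# Severity is bidirectional, so its mapping must be invertible.
URGENCY_TO_SEVERITY = {v: k for k, v in SEVERITY_TO_URGENCY.items()}


def to_servicenow(incident: dict) -> dict:
    """Translate internal fields into ServiceNow field names and values."""
    return {
        "short_description": incident["title"],
        "urgency": SEVERITY_TO_URGENCY[incident["severity"]],
        "state": STATUS_TO_STATE[incident["status"]],
    }


def severity_from_urgency(urgency: str) -> str:
    """Translate an inbound urgency value back to an internal severity."""
    return URGENCY_TO_SEVERITY[urgency]
```

Note that the status mapping is not invertible, because mitigated and resolved both map to "6". That is consistent with status being owned internally: an inbound state of "6" cannot say which internal status was meant.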

Conflict Resolution

When field conflicts occur, the system applies configured policies:

  • internal_wins: Our value takes precedence
  • external_wins: Provider value takes precedence
  • most_recent_allowed: Use timestamps with ETag validation
  • state_machine: Apply state transition guards
  • manual: Flag for human review
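The policies above could be dispatched roughly as follows. This is a simplified sketch under several assumptions: `FieldVersion` and `resolve` are hypothetical names, ETag validation is omitted, ties in most_recent_allowed go to the internal value, and `None` stands for "flag for human review".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FieldVersion:
    value: str
    updated_at: float  # epoch seconds of the last change


def resolve(
    policy: str,
    internal: FieldVersion,
    external: FieldVersion,
    is_allowed: Optional[Callable[[str, str], bool]] = None,
) -> Optional[str]:
    """Pick the winning value for one conflicting field."""
    if policy == "internal_wins":
        return internal.value
    if policy == "external_wins":
        return external.value
    if policy == "most_recent_allowed":
        # Assumption: on equal timestamps the internal value is kept.
        return external.value if external.updated_at > internal.updated_at else internal.value
    if policy == "state_machine":
        # Accept the external value only if the transition guard permits it.
        if is_allowed is not None and is_allowed(internal.value, external.value):
            return external.value
        return internal.value
    if policy == "manual":
        return None  # caller should queue the conflict for review
    raise ValueError(f"unknown conflict policy: {policy}")
```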

Outbox Pattern

Reliable side-effects use the transactional outbox pattern:

-- Ensure connector actions are reliable
CREATE TABLE outbox_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),  -- built in since PostgreSQL 13; the INSERT below omits id
  aggregate_id UUID NOT NULL,    -- incident_id
  event_type TEXT NOT NULL,      -- provider.action type
  payload JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  processed_at TIMESTAMPTZ,
  attempts INT DEFAULT 0,
  last_error TEXT
);

-- Atomic update + side-effect scheduling
BEGIN;
  UPDATE incidents SET status = 'acknowledged' WHERE id = ?;
  INSERT INTO outbox_events (aggregate_id, event_type, payload)
    VALUES (?, 'provider.pagerduty.acknowledge', ?);
COMMIT;

ChatOps Integration

Channel Strategy

Per-incident channels (default):

  • Naming: #inc-<id>-sev<level>-<slug>
  • Example: #inc-1234-sev2-checkout-errors
  • Visibility suffix: -restricted for sensitive data
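The naming scheme can be sketched as a small helper. The slug rules (lowercase, non-alphanumerics collapsed to hyphens) and the 80-character cap are assumptions made for illustration; `channel_name` is a hypothetical function.

```python
import re


def channel_name(incident_id: int, severity: int, title: str,
                 restricted: bool = False, max_length: int = 80) -> str:
    """Build '#inc-<id>-sev<level>-<slug>' with an optional '-restricted' suffix."""
    # Lowercase the title and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    name = f"inc-{incident_id}-sev{severity}-{slug}"
    suffix = "-restricted" if restricted else ""
    # Truncate the slug side so the suffix always survives (assumed length cap).
    name = name[: max_length - len(suffix)].rstrip("-")
    return f"#{name}{suffix}"
```

With the example above, `channel_name(1234, 2, "Checkout errors")` produces `#inc-1234-sev2-checkout-errors`.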

Channel Features:

  • Rich topic with incident summary and links
  • Pinned widgets (status board, timeline, runbooks)
  • Auto-invites for relevant roles and on-call teams
  • Thread-to-timeline event mapping

Slash Commands

Key ChatOps commands integrated with event sourcing:

Command                  Purpose                      Event Type
/im new [title] [sev]    Declare incident             com.incidents.declared.v1
/im ic @user             Assign Incident Commander    com.incidents.role.assigned.v1
/im status <state>       Transition status            com.incidents.state_change.v1
/im note <text>          Add timeline note            com.incidents.note_added.v1
/im ack <alert>          Acknowledge alert            com.incidents.provider.*.ack.v1

Message Mapping

Chat messages become CloudEvents with deterministic IDs:

{
  "specversion": "1.0",
  "type": "com.incidents.chat.message.v1",
  "source": "slack://team123/channel456",
  "id": "sha256(slack:team123:channel456:1692789123.456)",
  "time": "2025-08-23T10:30:00Z",
  "data": {
    "message": "Database connection pool optimized",
    "user_id": "U123456",
    "thread_ts": "1692789100.123",
    "permalink": "https://team.slack.com/archives/C123/p1692789123456"
  }
}
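The `id` in the example is written as a formula. A minimal sketch of computing it, assuming the key format `slack:<team>:<channel>:<ts>` shown there and a hex-encoded digest (the encoding is an assumption), is:

```python
import hashlib


def chat_event_id(team_id: str, channel_id: str, message_ts: str) -> str:
    """Derive a stable event id from the message's Slack coordinates."""
    key = f"slack:{team_id}:{channel_id}:{message_ts}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the id depends only on where and when the message was posted, re-delivering the same Slack message produces the same id, and the source + id deduplication in the Timeline Service drops the repeat.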

Security & Privacy Model

Policy-Based Access Control

The system implements Attribute-Based Access Control (ABAC) using Open Policy Agent (OPA):

Policy Enforcement Points (PEPs):

  • API Gateway (all requests)
  • Timeline Service (event reads)
  • Status Boards (widget data)
  • Connector sync (external updates)
  • Export operations (data downloads)

Policy Decision Point (PDP):

  • Centralized OPA instance with policy bundles
  • Dynamic policy loading and hot reload
  • Comprehensive policy testing framework

Data Classification

Events and fields are classified for access control:

data_classifications:
  public: # Status, basic timeline
  internal: # Technical details, some communications
  confidential: # Personal data, sensitive business info
  restricted: # Legal hold, privacy incidents
  secret: # Security incidents, compliance data

Field-Level Redaction

Sensitive data is redacted based on user permissions:

{
  "incident_id": "12345",
  "title": "Payment processing issue",
  "description": "**[REDACTED - CONFIDENTIAL]**",
  "assignee": {
    "name": "**[REDACTED - PII]**",
    "email": "**[REDACTED - PII]**"
  },
  "customer_impact": "**[REDACTED - RESTRICTED]**"
}
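A redaction pass that produces output like the example above could be sketched as follows. Everything here is illustrative: the ordering of the classification levels, the per-field rules in `FIELD_RULES`, and the idea of a single `clearance` level are assumptions, and in the real system the decision would come from the OPA policy rather than a local table.

```python
# Assumed ordering, least to most sensitive.
LEVELS = ["public", "internal", "confidential", "restricted", "secret"]
RANK = {name: i for i, name in enumerate(LEVELS)}

# Hypothetical rules: dotted field path -> (required level, label shown when redacted).
FIELD_RULES = {
    "description": ("confidential", "CONFIDENTIAL"),
    "assignee.name": ("confidential", "PII"),
    "assignee.email": ("confidential", "PII"),
    "customer_impact": ("restricted", "RESTRICTED"),
}


def redact(record: dict, clearance: str, prefix: str = "") -> dict:
    """Return a copy of record with fields above the caller's clearance masked."""
    out = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out[key] = redact(value, clearance, prefix=f"{path}.")
            continue
        level, label = FIELD_RULES.get(path, ("public", ""))
        if RANK[level] > RANK[clearance]:
            out[key] = f"**[REDACTED - {label}]**"
        else:
            out[key] = value
    return out
```

Under these rules, a caller with "internal" clearance sees the incident id and title but gets the masked description, assignee, and customer impact shown above.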

Storage Architecture

Core Storage

PostgreSQL (primary):

  • Incident records and metadata
  • Event store (timeline_events table)
  • User and group management
  • Connector configurations
  • Policy cache

Redis (caching):

  • Session storage
  • Real-time board updates
  • Rate limiting counters
  • WebSocket connection state

Object Storage (artifacts):

  • File attachments
  • Exported timelines
  • Backup archives
  • Policy bundles

Data Retention

Tiered Retention:

  • Active incidents: Full access, no restrictions
  • Recent incidents (< 90 days): Full data with normal policies
  • Archived incidents (90 days - 7 years): Read-only, compressed storage
  • Legal hold: Override retention, full preservation
  • Purged data (> 7 years): Hard delete unless legal hold
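The tiers can be expressed as a single classification function. This sketch assumes age is measured from the time the incident was closed and approximates 7 years as 7 × 365 days; neither detail is specified above, and the tier names returned are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RECENT_WINDOW = timedelta(days=90)
ARCHIVE_LIMIT = timedelta(days=7 * 365)  # assumption: 7 years as 2555 days


def retention_tier(closed_at: Optional[datetime], now: datetime,
                   legal_hold: bool = False) -> str:
    """Classify an incident into one of the retention tiers."""
    if legal_hold:
        return "legal_hold"  # overrides every age-based rule
    if closed_at is None:
        return "active"      # still open, no restrictions
    age = now - closed_at
    if age < RECENT_WINDOW:
        return "recent"
    if age <= ARCHIVE_LIMIT:
        return "archived"
    return "purge"
```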

API Design

REST API

Core Resources:

/api/v1/incidents                # Incident CRUD operations
/api/v1/incidents/{id}/timeline  # Event timeline access
/api/v1/incidents/{id}/board     # Status board data
/api/v1/connectors               # Provider integrations
/api/v1/users                    # User management (SCIM)
/api/v1/groups                   # Group management (SCIM)
/api/v1/policies                 # Policy administration

Event Streaming:

/api/v1/events/stream            # Server-Sent Events
/api/v1/incidents/{id}/stream    # Incident-specific events
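On the client side, a Server-Sent Events stream is a sequence of `field: value` lines with a blank line ending each event. The parser below is a minimal sketch that handles only the `event` and `data` fields. It assumes each `data` payload is a JSON document (for instance a CloudEvent), which is not stated above, and `parse_sse` is a hypothetical helper rather than part of any client library.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_type, decoded_data) for each event in an SSE line stream."""
    event_type, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data_lines:
                yield event_type, json.loads("\n".join(data_lines))
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if name == "event":
            event_type = value
        elif name == "data":
            data_lines.append(value)
```

The function accepts any iterable of text lines, so it can be fed from a streaming HTTP response or from a list in a unit test.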

GraphQL API

For complex queries and real-time subscriptions:

type Incident {
  id: ID!
  title: String!
  severity: Severity!
  status: IncidentStatus!
  timeline(limit: Int, offset: Int): [TimelineEvent!]!
  statusBoard(persona: Persona!): StatusBoard!
  connectedTickets: [ExternalTicket!]!
}

type Subscription {
  incidentUpdated(id: ID!): Incident!
  timelineEventAdded(incidentId: ID!): TimelineEvent!
}

Deployment Architecture

Deployment Targets

Docker Compose (development/small teams):

services:
  incident-server:
    image: incidents/server:latest
    depends_on: [postgres, redis, nats]

  postgres:
    image: postgres:15-alpine

  redis:
    image: redis:7-alpine

  nats:
    image: nats:latest

Kubernetes (production):

  • Horizontal Pod Autoscaling for API servers
  • StatefulSets for PostgreSQL with replicas
  • Persistent volumes for data storage
  • Ingress with TLS termination
  • NetworkPolicies for security

Serverless (cloud-native):

  • Containerized API on Cloud Run/ECS
  • Managed PostgreSQL (RDS/Cloud SQL)
  • Managed Redis (ElastiCache/MemoryStore)
  • Event streaming via cloud-native solutions

Performance Characteristics

Throughput:

  • 10,000+ events/second ingestion
  • Sub-second API response times
  • Real-time WebSocket updates
  • Efficient bulk operations

Scalability:

  • Horizontal scaling for stateless components
  • Read replicas for PostgreSQL
  • Redis clustering for high availability
  • CDN for static assets

Reliability:

  • 99.9% uptime SLA target
  • Automatic failover capabilities
  • Comprehensive backup and recovery
  • Circuit breakers and rate limiting

Taken together, the append-only timeline, policy-enforced access, and outbox-backed ITSM sync give the platform an auditable incident record that can be self-hosted or run on any of the deployment targets above.