State Machine Developer Guide

This guide provides technical details for developers working with the incident state machine, including extending states, implementing custom guards, and integrating with the transition system.

Architecture Overview

The state machine is implemented in internal/statemachine/ and provides:

  • State definitions - Valid incident states
  • Transition rules - Valid state transitions
  • Guard framework - Configurable transition guards
  • Terminal state enforcement - Prevents transitions from terminal states

Package Structure

internal/statemachine/
├── statemachine.go      # Core state machine implementation
├── statemachine_test.go # State machine tests
├── guards.go            # Guard interface and implementations
├── guards_test.go       # Guard tests
├── transitions.go       # Transition types and helpers
└── transitions_test.go  # Transition tests

State Definitions

States are defined as constants in statemachine.go:

type State string

const (
    StateOpen         State = "open"
    StateAcknowledged State = "acknowledged"
    StateMitigated    State = "mitigated"
    StateResolved     State = "resolved"
    StateClosed       State = "closed"
    StateDuplicate    State = "duplicate" // Terminal
    StateCanceled     State = "canceled"  // Terminal
)

// Terminal states cannot transition to any other state
var terminalStates = map[State]bool{
    StateDuplicate: true,
    StateCanceled:  true,
}

Transition Rules

The state machine defines valid transitions in a transition table:

var validTransitions = map[State][]State{
    StateOpen:         {StateAcknowledged, StateResolved, StateCanceled, StateDuplicate},
    StateAcknowledged: {StateMitigated, StateResolved, StateCanceled, StateDuplicate},
    StateMitigated:    {StateResolved, StateCanceled, StateDuplicate},
    StateResolved:     {StateClosed, StateOpen, StateDuplicate},
    StateClosed:       {StateOpen, StateDuplicate},
    // Terminal states have no valid transitions
}
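As a minimal sketch of how a lookup against this table could work, the program below copies the table and checks a few pairs. The helper name canTransition is hypothetical; the guide's own entry point is sm.IsValidTransition, whose implementation is not shown here. Terminal states need no special case in this sketch because they have no entry in the map, so the lookup yields an empty list.

```go
package main

import "fmt"

// State mirrors the string-based State type from the guide.
type State string

const (
	StateOpen         State = "open"
	StateAcknowledged State = "acknowledged"
	StateMitigated    State = "mitigated"
	StateResolved     State = "resolved"
	StateClosed       State = "closed"
	StateDuplicate    State = "duplicate"
	StateCanceled     State = "canceled"
)

// validTransitions is a copy of the transition table from the guide.
var validTransitions = map[State][]State{
	StateOpen:         {StateAcknowledged, StateResolved, StateCanceled, StateDuplicate},
	StateAcknowledged: {StateMitigated, StateResolved, StateCanceled, StateDuplicate},
	StateMitigated:    {StateResolved, StateCanceled, StateDuplicate},
	StateResolved:     {StateClosed, StateOpen, StateDuplicate},
	StateClosed:       {StateOpen, StateDuplicate},
}

// canTransition (hypothetical name) reports whether the table lists `to`
// as a target of `from`. A state with no map entry has no targets.
func canTransition(from, to State) bool {
	for _, candidate := range validTransitions[from] {
		if candidate == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(StateOpen, StateAcknowledged)) // true
	fmt.Println(canTransition(StateOpen, StateMitigated))    // false
	fmt.Println(canTransition(StateCanceled, StateOpen))     // false: terminal state
}
```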

Checking Valid Transitions

sm := statemachine.New()

// Check if transition is valid
if sm.IsValidTransition(statemachine.StateOpen, statemachine.StateAcknowledged) {
    // Transition is allowed
}

// Check if state is terminal
if statemachine.IsTerminalState(statemachine.StateDuplicate) {
    // Cannot transition out of this state
}

Guard System

Guards enforce conditions that must be met before a transition can occur. Each guard evaluates the incident and provided metadata.

Guard Interface

type Guard interface {
    // Name returns the guard's identifier
    Name() string

    // Evaluate checks if the guard condition is met
    Evaluate(incident *models.Incident, metadata map[string]interface{}) GuardResult
}

type GuardResult struct {
    Passed  bool
    Message string
}

Built-in Guards

BlastRadiusStabilizedGuard

Required for acknowledged → mitigated:

type BlastRadiusStabilizedGuard struct{}

func (g *BlastRadiusStabilizedGuard) Name() string {
    return "BlastRadiusStabilizedGuard"
}

func (g *BlastRadiusStabilizedGuard) Evaluate(
    incident *models.Incident,
    metadata map[string]interface{},
) GuardResult {
    stabilized, ok := metadata["blast_radius_stabilized"].(bool)
    if !ok || !stabilized {
        return GuardResult{
            Passed:  false,
            Message: "blast_radius_stabilized must be true",
        }
    }

    summary, ok := metadata["mitigation_summary"].(string)
    if !ok || summary == "" {
        return GuardResult{
            Passed:  false,
            Message: "mitigation_summary is required",
        }
    }

    return GuardResult{Passed: true}
}
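The type assertions in this guard are strict, which matters when metadata comes from a JSON request body. The sketch below assumes the metadata is decoded with encoding/json into a map[string]interface{}; the helper name stabilized is hypothetical. It shows that only a JSON boolean true satisfies the .(bool) assertion, while the string "true" or the number 1 does not.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stabilized (hypothetical helper) decodes a JSON object and applies the
// same check the guard uses for blast_radius_stabilized.
func stabilized(raw string) (bool, error) {
	var metadata map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &metadata); err != nil {
		return false, err
	}
	// encoding/json decodes true/false to bool, strings to string,
	// and numbers to float64, so only a real boolean passes here.
	v, ok := metadata["blast_radius_stabilized"].(bool)
	return ok && v, nil
}

func main() {
	inputs := []string{
		`{"blast_radius_stabilized": true}`,
		`{"blast_radius_stabilized": "true"}`,
		`{"blast_radius_stabilized": 1}`,
		`{}`,
	}
	for _, raw := range inputs {
		ok, err := stabilized(raw)
		fmt.Printf("%-40s passes=%v err=%v\n", raw, ok, err)
	}
}
```

API clients that send "true" as a string will therefore see the "blast_radius_stabilized must be true" message even though the value looks correct to them.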

RootCauseFixedGuard

Required for mitigated → resolved:

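type RootCauseFixedGuard struct{}

// Name is required by the Guard interface
func (g *RootCauseFixedGuard) Name() string {
    return "RootCauseFixedGuard"
}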

func (g *RootCauseFixedGuard) Evaluate(
    incident *models.Incident,
    metadata map[string]interface{},
) GuardResult {
    fixed, ok := metadata["root_cause_fixed"].(bool)
    if !ok || !fixed {
        return GuardResult{
            Passed:  false,
            Message: "root_cause_fixed must be true",
        }
    }

    summary, ok := metadata["root_cause_summary"].(string)
    if !ok || summary == "" {
        return GuardResult{
            Passed:  false,
            Message: "root_cause_summary is required",
        }
    }

    return GuardResult{Passed: true}
}

ChildrenResolvedGuard

Required for resolving parent incidents:

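type ChildrenResolvedGuard struct {
    incidentStore storage.IncidentStore
}

// Name is required by the Guard interface
func (g *ChildrenResolvedGuard) Name() string {
    return "ChildrenResolvedGuard"
}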

func (g *ChildrenResolvedGuard) Evaluate(
    incident *models.Incident,
    metadata map[string]interface{},
) GuardResult {
    children, err := g.incidentStore.GetChildrenStatuses(
        context.Background(),
        incident.ID,
    )
    if err != nil {
        return GuardResult{Passed: false, Message: err.Error()}
    }

    for _, status := range children {
        if status != models.StatusResolved && status != models.StatusClosed {
            return GuardResult{
                Passed:  false,
                Message: "all child incidents must be resolved or closed",
            }
        }
    }

    return GuardResult{Passed: true}
}

Implementing Custom Guards

To implement a custom guard:

  1. Create a struct implementing the Guard interface
  2. Register it with the state machine for specific transitions
  3. Write tests for the guard

// Example: SLAMetGuard checks the incident's SLA status. A breach is
// reported in the result message but does not block the transition.
type SLAMetGuard struct {
    slaService *sla.Service
}

func (g *SLAMetGuard) Name() string {
    return "SLAMetGuard"
}

func (g *SLAMetGuard) Evaluate(
    incident *models.Incident,
    metadata map[string]interface{},
) GuardResult {
    slaStatus, err := g.slaService.GetStatus(incident.ID)
    if err != nil {
        return GuardResult{Passed: false, Message: err.Error()}
    }

    if slaStatus.Breached {
        // Allow transition but record breach
        return GuardResult{
            Passed:  true,
            Message: "SLA breached - proceeding with transition",
        }
    }

    return GuardResult{Passed: true}
}

Registering Guards

Guards are registered for specific transition pairs:

sm := statemachine.New()

// Register guard for specific transition
sm.RegisterGuard(
    statemachine.StateAcknowledged,
    statemachine.StateMitigated,
    &BlastRadiusStabilizedGuard{},
)

// Register guard for multiple transitions
sm.RegisterGuardForStates(
    []statemachine.State{statemachine.StateResolved, statemachine.StateClosed},
    statemachine.StateOpen,
    &NoteRequiredGuard{},
)
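One way to picture registration is a map keyed by the (from, to) pair, with guards evaluated in registration order. The sketch below is an illustration under stated assumptions, not the package's implementation: the registry type and its methods are hypothetical, the Guard interface is simplified to drop the incident argument, and the NoteRequiredGuard body (including the "note" metadata key) is invented because the guide references that guard without defining it.

```go
package main

import "fmt"

type State string

type GuardResult struct {
	Passed  bool
	Message string
}

// Guard is simplified here: the incident argument is dropped to keep
// the sketch self-contained.
type Guard interface {
	Name() string
	Evaluate(metadata map[string]interface{}) GuardResult
}

// transitionKey identifies one (from, to) pair.
type transitionKey struct{ from, to State }

// registry is a hypothetical stand-in for the state machine's guard storage.
type registry struct {
	guards map[transitionKey][]Guard
}

func newRegistry() *registry {
	return &registry{guards: make(map[transitionKey][]Guard)}
}

func (r *registry) register(from, to State, g Guard) {
	key := transitionKey{from: from, to: to}
	r.guards[key] = append(r.guards[key], g)
}

// registerForStates attaches one guard to several source states that
// share the same target state.
func (r *registry) registerForStates(froms []State, to State, g Guard) {
	for _, from := range froms {
		r.register(from, to, g)
	}
}

// check runs the guards for a pair in registration order and returns
// the first failure, or nil when all pass or none are registered.
func (r *registry) check(from, to State, metadata map[string]interface{}) error {
	key := transitionKey{from: from, to: to}
	for _, g := range r.guards[key] {
		if res := g.Evaluate(metadata); !res.Passed {
			return fmt.Errorf("guard %s failed: %s", g.Name(), res.Message)
		}
	}
	return nil
}

// noteRequiredGuard is an assumed implementation; the "note" key is invented.
type noteRequiredGuard struct{}

func (noteRequiredGuard) Name() string { return "NoteRequiredGuard" }

func (noteRequiredGuard) Evaluate(metadata map[string]interface{}) GuardResult {
	note, ok := metadata["note"].(string)
	if !ok || note == "" {
		return GuardResult{Passed: false, Message: "note is required"}
	}
	return GuardResult{Passed: true}
}

func main() {
	r := newRegistry()
	r.registerForStates([]State{"resolved", "closed"}, "open", noteRequiredGuard{})

	fmt.Println(r.check("closed", "open", map[string]interface{}{}))
	fmt.Println(r.check("closed", "open", map[string]interface{}{"note": "regression found"}))
}
```

Returning on the first failure keeps error messages short; collecting every failure instead would be a reasonable alternative if clients need to fix all missing fields in one round trip.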

Using the State Machine

Basic Transition Validation

sm := statemachine.New()

// Validate without guards
err := sm.ValidateTransition(fromState, toState)
if err != nil {
    // Invalid transition
}

// Validate with guards
err = sm.ValidateTransitionWithGuards(fromState, toState, incident, metadata)
if err != nil {
    // Guard failed or invalid transition
}

In HTTP Handlers

func (h *IncidentHandlers) HandleTransitionIncident(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)
    incidentID := vars["id"]

    var req TransitionRequestBody
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        writeJSONError(w, "Invalid request body", http.StatusBadRequest)
        return
    }

    // Get incident
    incident, err := h.incidentStore.GetIncident(r.Context(), incidentID)
    if err != nil {
        writeJSONError(w, "Incident not found", http.StatusNotFound)
        return
    }

    fromState := statemachine.StatusToState(incident.Status)
    toState := statemachine.State(req.ToState)

    // Validate transition with guards
    if err := h.stateMachine.ValidateTransitionWithGuards(
        fromState,
        toState,
        incident,
        req.Metadata,
    ); err != nil {
        writeTransitionError(w, "Guard evaluation failed", err.Error())
        return
    }

    // Perform transition
    newStatus := statemachine.StateToStatus(toState)
    _, err = h.incidentStore.UpdateIncidentStatus(
        r.Context(),
        incidentID,
        newStatus,
        incident.Version,
    )
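    if errors.Is(err, storage.ErrVersionConflict) {
        writeJSONError(w, "Conflict", http.StatusConflict)
        return
    }
    if err != nil {
        // Any other storage error must not fall through to event emission
        writeJSONError(w, "Failed to update incident", http.StatusInternalServerError)
        return
    }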

    // Emit CloudEvent
    // ...
}

Status Conversion

The state machine uses the State type internally, while the models package uses IncidentStatus. Conversion helpers translate between the two:

// Convert model status to state machine state
state := statemachine.StatusToState(incident.Status)

// Convert state machine state to model status
status := statemachine.StateToStatus(state)

// Check if model status is terminal
if statemachine.IsTerminalStatus(incident.Status) {
    // Cannot transition
}
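The sketch below shows one plausible shape for these helpers, assuming that both types are string-based and share the same values ("open", "resolved", and so on). That assumption is not confirmed by this guide; if the real types differ, the helpers would need an explicit mapping table instead of a cast. All lowercase names are hypothetical stand-ins for the exported functions.

```go
package main

import "fmt"

// IncidentStatus stands in for the models package type (assumed string-based).
type IncidentStatus string

// State mirrors the state machine's State type.
type State string

// terminalStates matches the terminal set defined in the guide.
var terminalStates = map[State]bool{
	"duplicate": true,
	"canceled":  true,
}

// statusToState and stateToStatus are plain casts under the assumption
// that both types use identical string values.
func statusToState(s IncidentStatus) State { return State(s) }
func stateToStatus(s State) IncidentStatus { return IncidentStatus(s) }

// isTerminalStatus converts first, then consults the terminal set.
func isTerminalStatus(s IncidentStatus) bool {
	return terminalStates[statusToState(s)]
}

func main() {
	status := IncidentStatus("canceled")
	fmt.Println(statusToState(status), isTerminalStatus(status)) // canceled true
	fmt.Println(stateToStatus(State("open")), isTerminalStatus("open")) // open false
}
```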

Optimistic Locking

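Status updates use optimistic locking to detect concurrent modifications. Each update carries the version the caller read, and the store rejects the update if the incident has changed since: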

// Update with version check
newVersion, err := incidentStore.UpdateIncidentStatus(
    ctx,
    incidentID,
    newStatus,
    currentVersion, // Must match DB version
)
if errors.Is(err, storage.ErrVersionConflict) {
    // Another process modified the incident
    // Reload and retry
}
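The "reload and retry" step can be sketched as a bounded loop. Everything here is illustrative: the store interface, its method names, updateWithRetry, and the in-memory fake are hypothetical, and ErrVersionConflict stands in for storage.ErrVersionConflict.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrVersionConflict stands in for storage.ErrVersionConflict.
var ErrVersionConflict = errors.New("version conflict")

// store is a hypothetical, minimal view of the incident store.
type store interface {
	CurrentVersion(id string) (int, error)
	UpdateStatus(id, status string, version int) (int, error)
}

// updateWithRetry reloads the current version before each attempt and
// retries only when the update fails with a version conflict.
func updateWithRetry(s store, id, status string, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		version, err := s.CurrentVersion(id)
		if err != nil {
			return 0, err
		}
		newVersion, err := s.UpdateStatus(id, status, version)
		if err == nil {
			return newVersion, nil
		}
		if !errors.Is(err, ErrVersionConflict) {
			return 0, err // not a conflict: do not retry
		}
	}
	return 0, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, ErrVersionConflict)
}

// memStore is an in-memory fake that simulates competing writers.
type memStore struct {
	version   int
	status    string
	conflicts int // number of updates to reject before accepting one
}

func (m *memStore) CurrentVersion(string) (int, error) { return m.version, nil }

func (m *memStore) UpdateStatus(_, status string, version int) (int, error) {
	if m.conflicts > 0 {
		m.conflicts--
		m.version++ // another writer committed first
		return 0, ErrVersionConflict
	}
	if version != m.version {
		return 0, ErrVersionConflict
	}
	m.status = status
	m.version++
	return m.version, nil
}

func main() {
	s := &memStore{version: 1, conflicts: 2}
	v, err := updateWithRetry(s, "INC-1234", "resolved", 5)
	fmt.Println(v, err) // 4 <nil>
}
```

In a real handler the reload should fetch the whole incident, not just its version, and the transition should be validated again before retrying, since the competing write may have moved the incident to a state from which the requested transition is no longer valid.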

Database Implementation

UPDATE incidents
SET status = $1, version = version + 1, updated_at = NOW()
WHERE id = $2 AND version = $3
-- If rows affected = 0, version conflict
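A minimal sketch of how a store might run this statement with database/sql and turn "zero rows affected" into a conflict error. The execer interface, updateStatus, and the fake database are hypothetical; ErrVersionConflict stands in for storage.ErrVersionConflict. The fake exists only so the example runs without a real database.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// ErrVersionConflict stands in for storage.ErrVersionConflict.
var ErrVersionConflict = errors.New("version conflict")

// execer is the subset of *sql.DB and *sql.Tx used by this sketch.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

const updateStatusSQL = `
UPDATE incidents
SET status = $1, version = version + 1, updated_at = NOW()
WHERE id = $2 AND version = $3`

// updateStatus runs the guarded UPDATE and reports a conflict when no
// row matched the (id, version) pair.
func updateStatus(ctx context.Context, db execer, id, status string, version int) (int, error) {
	res, err := db.ExecContext(ctx, updateStatusSQL, status, id, version)
	if err != nil {
		return 0, err
	}
	affected, err := res.RowsAffected()
	if err != nil {
		return 0, err
	}
	if affected == 0 {
		return 0, ErrVersionConflict
	}
	return version + 1, nil
}

// fakeResult and fakeDB simulate a single stored row for demonstration.
type fakeResult int64

func (r fakeResult) LastInsertId() (int64, error) { return 0, nil }
func (r fakeResult) RowsAffected() (int64, error) { return int64(r), nil }

type fakeDB struct{ storedVersion int }

func (f *fakeDB) ExecContext(_ context.Context, _ string, args ...any) (sql.Result, error) {
	// args follow the placeholder order: status, id, version.
	if args[2].(int) != f.storedVersion {
		return fakeResult(0), nil
	}
	f.storedVersion++
	return fakeResult(1), nil
}

func main() {
	db := &fakeDB{storedVersion: 3}
	fmt.Println(updateStatus(context.Background(), db, "INC-1234", "resolved", 3))
	fmt.Println(updateStatus(context.Background(), db, "INC-1234", "closed", 3)) // stale version
}
```

Note that zero rows affected is also what the database reports when the incident ID does not exist, so a store that needs to distinguish "not found" from "conflict" has to check for the row separately.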

CloudEvents Integration

State transitions emit CloudEvents for audit and integration:

// After successful transition
if h.eventEmitter != nil {
    metadata := map[string]interface{}{
        "from_state": string(fromState),
        "to_state":   string(toState),
    }
    // Merge transition request metadata
    for k, v := range req.Metadata {
        metadata[k] = v
    }

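    // The transition is already committed at this point, so an emit
    // failure is handled here rather than returned to the client
    if _, err := h.eventEmitter.EmitIncidentEvent(
        ctx,
        models.EventTypeStateChangeV1,
        incident,
        metadata,
    ); err != nil {
        // Log the failure for follow-up
    }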
}
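One detail of the merge above is worth noting: because request metadata is copied in after from_state and to_state are set, a client that sends either key would overwrite the recorded transition. A possible variation, sketched below, copies the request metadata first and writes the reserved keys last. The helper name buildEventMetadata is hypothetical.

```go
package main

import "fmt"

// buildEventMetadata (hypothetical helper) merges client-supplied
// metadata with the transition fields. Reserved keys are written last,
// so client values cannot replace them.
func buildEventMetadata(from, to string, requestMetadata map[string]interface{}) map[string]interface{} {
	merged := make(map[string]interface{}, len(requestMetadata)+2)
	for k, v := range requestMetadata {
		merged[k] = v
	}
	merged["from_state"] = from
	merged["to_state"] = to
	return merged
}

func main() {
	md := buildEventMetadata("acknowledged", "mitigated", map[string]interface{}{
		"mitigation_summary": "Rolled back to v2.3.1",
		"to_state":           "closed", // client-supplied value is ignored
	})
	fmt.Println(md["from_state"], md["to_state"], md["mitigation_summary"])
}
```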

Event Types

Event Type                     Description
im.incident.declared.v1        New incident declared
im.incident.state_change.v1    State transition occurred
im.incident.merged.v1          Incidents merged
im.incident.split.v1           Incident split

OpenTelemetry Integration

The state machine is instrumented with OpenTelemetry for observability:

Tracing

import (
    "github.com/systmms/incidents/internal/observability"
    "go.opentelemetry.io/otel/attribute"
)

func (h *IncidentHandlers) HandleTransitionIncident(w http.ResponseWriter, r *http.Request) {
    ctx, span := observability.GetTracer().Start(
        r.Context(),
        "incident.transition",
    )
    defer span.End()
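
    // incidentID, fromState, and toState are resolved as in the
    // handler under "In HTTP Handlers"
    span.SetAttributes(
        attribute.String("incident.id", incidentID),
        attribute.String("from_state", string(fromState)),
        attribute.String("to_state", string(toState)),
    )

    // ... transition logic; pass ctx to downstream calls so that
    // their spans attach to this one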
}

Metrics

import "github.com/systmms/incidents/internal/observability"

// Record successful transition
observability.RecordIncidentTransition(
    ctx,
    string(fromState),
    string(toState),
    string(incident.Severity),
    duration,
    true, // success
)

// Record guard evaluation
observability.RecordIncidentGuardEvaluation(
    ctx,
    guard.Name(),
    string(incident.Severity),
    result.Passed,
)

Testing

Unit Testing Guards

func TestBlastRadiusStabilizedGuard(t *testing.T) {
    guard := &BlastRadiusStabilizedGuard{}
    incident := &models.Incident{
        ID:       "INC-1234",
        Severity: models.SeveritySEV2,
    }

    t.Run("passes with required metadata", func(t *testing.T) {
        metadata := map[string]interface{}{
            "blast_radius_stabilized": true,
            "mitigation_summary":      "Rolled back to v2.3.1",
        }

        result := guard.Evaluate(incident, metadata)
        assert.True(t, result.Passed)
    })

    t.Run("fails without stabilized flag", func(t *testing.T) {
        metadata := map[string]interface{}{
            "mitigation_summary": "Rolled back",
        }

        result := guard.Evaluate(incident, metadata)
        assert.False(t, result.Passed)
        assert.Contains(t, result.Message, "blast_radius_stabilized")
    })
}
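Guards that depend on a store, such as ChildrenResolvedGuard, can be exercised with a fake instead of a database. The sketch below is a simplified stand-in rather than the real guard: childStatusLister is a hypothetical narrow interface (the guide's guard depends on storage.IncidentStore), statuses are plain strings, and the checks run from main so the example is runnable on its own. In a real test file the same cases would sit in t.Run subtests.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

type GuardResult struct {
	Passed  bool
	Message string
}

// childStatusLister is a hypothetical narrow interface covering the one
// store method the guard needs.
type childStatusLister interface {
	GetChildrenStatuses(ctx context.Context, parentID string) ([]string, error)
}

// childrenResolvedGuard is a simplified copy of the guard's logic.
type childrenResolvedGuard struct {
	store childStatusLister
}

func (g *childrenResolvedGuard) Evaluate(parentID string) GuardResult {
	statuses, err := g.store.GetChildrenStatuses(context.Background(), parentID)
	if err != nil {
		return GuardResult{Passed: false, Message: err.Error()}
	}
	for _, s := range statuses {
		if s != "resolved" && s != "closed" {
			return GuardResult{Passed: false, Message: "all child incidents must be resolved or closed"}
		}
	}
	return GuardResult{Passed: true}
}

// fakeStore returns canned statuses or a canned error.
type fakeStore struct {
	statuses []string
	err      error
}

func (f fakeStore) GetChildrenStatuses(context.Context, string) ([]string, error) {
	return f.statuses, f.err
}

var errStoreDown = errors.New("store unavailable")

func main() {
	cases := []struct {
		name  string
		store fakeStore
	}{
		{"all children done", fakeStore{statuses: []string{"resolved", "closed"}}},
		{"one child still open", fakeStore{statuses: []string{"resolved", "open"}}},
		{"store error", fakeStore{err: errStoreDown}},
	}
	for _, c := range cases {
		guard := &childrenResolvedGuard{store: c.store}
		fmt.Printf("%s: %+v\n", c.name, guard.Evaluate("INC-1234"))
	}
}
```

The store-error case is worth covering explicitly: the guard fails closed, so an unavailable store blocks the transition rather than letting it through.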

Integration Testing Transitions

func TestStateTransitions(t *testing.T) {
    sm := statemachine.New()

    // Register guards
    sm.RegisterGuard(
        statemachine.StateAcknowledged,
        statemachine.StateMitigated,
        &BlastRadiusStabilizedGuard{},
    )

    incident := &models.Incident{
        ID:     "INC-1234",
        Status: models.StatusAcknowledged,
    }

    t.Run("valid transition with guard passes", func(t *testing.T) {
        metadata := map[string]interface{}{
            "blast_radius_stabilized": true,
            "mitigation_summary":      "Fixed",
        }

        err := sm.ValidateTransitionWithGuards(
            statemachine.StateAcknowledged,
            statemachine.StateMitigated,
            incident,
            metadata,
        )
        assert.NoError(t, err)
    })
}

Best Practices

  1. Always validate transitions - Use ValidateTransitionWithGuards before performing state changes
  2. Handle version conflicts - Implement retry logic for optimistic locking conflicts
  3. Emit events after success - Only emit CloudEvents after the database update succeeds
  4. Test guards thoroughly - Guards are critical for data integrity
  5. Use meaningful guard messages - Messages are returned to API clients
  6. Log guard failures - Failed guards may indicate issues that need investigation