State Machine Developer Guide
This guide provides technical details for developers working with the incident state machine, including extending states, implementing custom guards, and integrating with the transition system.
Architecture Overview
The state machine is implemented in internal/statemachine/ and provides:
- State definitions - Valid incident states
- Transition rules - Valid state transitions
- Guard framework - Configurable transition guards
- Terminal state enforcement - Prevents transitions from terminal states
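As a rough illustration of how the transition rules and terminal state enforcement fit together, the sketch below looks up a transition in a table and rejects anything leaving a terminal state. The states and table are the ones defined later in this guide; the `isValidTransition` function is a hypothetical stand-in, not the package's actual implementation.

```go
package main

import "fmt"

type State string

const (
	StateOpen         State = "open"
	StateAcknowledged State = "acknowledged"
	StateMitigated    State = "mitigated"
	StateResolved     State = "resolved"
	StateClosed       State = "closed"
	StateDuplicate    State = "duplicate"
	StateCanceled     State = "canceled"
)

// Terminal states cannot transition to any other state.
var terminalStates = map[State]bool{
	StateDuplicate: true,
	StateCanceled:  true,
}

var validTransitions = map[State][]State{
	StateOpen:         {StateAcknowledged, StateResolved, StateCanceled, StateDuplicate},
	StateAcknowledged: {StateMitigated, StateResolved, StateCanceled, StateDuplicate},
	StateMitigated:    {StateResolved, StateCanceled, StateDuplicate},
	StateResolved:     {StateClosed, StateOpen, StateDuplicate},
	StateClosed:       {StateOpen, StateDuplicate},
}

// isValidTransition is a hypothetical helper: it rejects transitions out of
// terminal states first, then scans the table entry for the source state.
func isValidTransition(from, to State) bool {
	if terminalStates[from] {
		return false
	}
	for _, candidate := range validTransitions[from] {
		if candidate == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isValidTransition(StateOpen, StateAcknowledged)) // true
	fmt.Println(isValidTransition(StateOpen, StateMitigated))    // false
	fmt.Println(isValidTransition(StateCanceled, StateOpen))     // false
}
```

The explicit terminal check is redundant with the table here, since terminal states have no entries, but it keeps the rule enforced even if someone adds an entry for a terminal state by mistake.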
Package Structure
internal/statemachine/
├── statemachine.go # Core state machine implementation
├── statemachine_test.go # State machine tests
├── guards.go # Guard interface and implementations
├── guards_test.go # Guard tests
├── transitions.go # Transition types and helpers
└── transitions_test.go # Transition tests
State Definitions
States are defined as constants in statemachine.go:
type State string
const (
StateOpen State = "open"
StateAcknowledged State = "acknowledged"
StateMitigated State = "mitigated"
StateResolved State = "resolved"
StateClosed State = "closed"
StateDuplicate State = "duplicate" // Terminal
StateCanceled State = "canceled" // Terminal
)
// Terminal states cannot transition to any other state
var terminalStates = map[State]bool{
StateDuplicate: true,
StateCanceled: true,
}
Transition Rules
The state machine defines valid transitions in a transition table:
var validTransitions = map[State][]State{
StateOpen: {StateAcknowledged, StateResolved, StateCanceled, StateDuplicate},
StateAcknowledged: {StateMitigated, StateResolved, StateCanceled, StateDuplicate},
StateMitigated: {StateResolved, StateCanceled, StateDuplicate},
StateResolved: {StateClosed, StateOpen, StateDuplicate},
StateClosed: {StateOpen, StateDuplicate},
// Terminal states have no valid transitions
}
Checking Valid Transitions
sm := statemachine.New()
// Check if transition is valid
if sm.IsValidTransition(statemachine.StateOpen, statemachine.StateAcknowledged) {
// Transition is allowed
}
// Check if state is terminal
if statemachine.IsTerminalState(statemachine.StateDuplicate) {
// Cannot transition out of this state
}
Guard System
Guards enforce conditions that must be met before a transition can occur. Each guard evaluates the incident and provided metadata.
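To make the evaluation flow concrete, here is a minimal sketch of running a list of guards against an incident. It assumes guards run in order and that the first failure stops evaluation; that fail-fast policy, the `evaluateGuards` function, the `requireFlagGuard` type, and the simplified `Incident` struct are all illustrative assumptions rather than the package's real code.

```go
package main

import "fmt"

// Incident is a simplified stand-in for models.Incident.
type Incident struct {
	ID string
}

// GuardResult and Guard mirror the shapes used in this guide.
type GuardResult struct {
	Passed  bool
	Message string
}

type Guard interface {
	Name() string
	Evaluate(incident *Incident, metadata map[string]interface{}) GuardResult
}

// requireFlagGuard is a hypothetical guard that passes only when the named
// metadata key is present and set to true.
type requireFlagGuard struct {
	key string
}

func (g requireFlagGuard) Name() string { return "requireFlagGuard(" + g.key + ")" }

func (g requireFlagGuard) Evaluate(_ *Incident, metadata map[string]interface{}) GuardResult {
	// A missing key or a non-bool value leaves flag as false.
	flag, _ := metadata[g.key].(bool)
	if !flag {
		return GuardResult{Passed: false, Message: g.key + " must be true"}
	}
	return GuardResult{Passed: true}
}

// evaluateGuards runs each guard in order and returns an error describing
// the first guard that does not pass (assumed fail-fast behavior).
func evaluateGuards(guards []Guard, incident *Incident, metadata map[string]interface{}) error {
	for _, g := range guards {
		if res := g.Evaluate(incident, metadata); !res.Passed {
			return fmt.Errorf("guard %s failed: %s", g.Name(), res.Message)
		}
	}
	return nil
}

func main() {
	guards := []Guard{requireFlagGuard{key: "blast_radius_stabilized"}}
	incident := &Incident{ID: "INC-1234"}

	// Missing metadata: the guard fails and its message is surfaced.
	fmt.Println(evaluateGuards(guards, incident, map[string]interface{}{}))

	// Flag present and true: all guards pass and the error is nil.
	fmt.Println(evaluateGuards(guards, incident, map[string]interface{}{
		"blast_radius_stabilized": true,
	}))
}
```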
Guard Interface
type Guard interface {
// Name returns the guard's identifier
Name() string
// Evaluate checks if the guard condition is met
Evaluate(incident *models.Incident, metadata map[string]interface{}) GuardResult
}
type GuardResult struct {
Passed bool
Message string
}
Built-in Guards
BlastRadiusStabilizedGuard
Required for acknowledged → mitigated:
type BlastRadiusStabilizedGuard struct{}
func (g *BlastRadiusStabilizedGuard) Name() string {
return "BlastRadiusStabilizedGuard"
}
func (g *BlastRadiusStabilizedGuard) Evaluate(
incident *models.Incident,
metadata map[string]interface{},
) GuardResult {
stabilized, ok := metadata["blast_radius_stabilized"].(bool)
if !ok || !stabilized {
return GuardResult{
Passed: false,
Message: "blast_radius_stabilized must be true",
}
}
summary, ok := metadata["mitigation_summary"].(string)
if !ok || summary == "" {
return GuardResult{
Passed: false,
Message: "mitigation_summary is required",
}
}
return GuardResult{Passed: true}
}
RootCauseFixedGuard
Required for mitigated → resolved:
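type RootCauseFixedGuard struct{}
// Name is required to satisfy the Guard interface
func (g *RootCauseFixedGuard) Name() string {
return "RootCauseFixedGuard"
}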
func (g *RootCauseFixedGuard) Evaluate(
incident *models.Incident,
metadata map[string]interface{},
) GuardResult {
fixed, ok := metadata["root_cause_fixed"].(bool)
if !ok || !fixed {
return GuardResult{
Passed: false,
Message: "root_cause_fixed must be true",
}
}
summary, ok := metadata["root_cause_summary"].(string)
if !ok || summary == "" {
return GuardResult{
Passed: false,
Message: "root_cause_summary is required",
}
}
return GuardResult{Passed: true}
}
ChildrenResolvedGuard
Required for resolving parent incidents:
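type ChildrenResolvedGuard struct {
incidentStore storage.IncidentStore
}
// Name is required to satisfy the Guard interface
func (g *ChildrenResolvedGuard) Name() string {
return "ChildrenResolvedGuard"
}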
func (g *ChildrenResolvedGuard) Evaluate(
incident *models.Incident,
metadata map[string]interface{},
) GuardResult {
children, err := g.incidentStore.GetChildrenStatuses(
context.Background(),
incident.ID,
)
if err != nil {
return GuardResult{Passed: false, Message: err.Error()}
}
for _, status := range children {
if status != models.StatusResolved && status != models.StatusClosed {
return GuardResult{
Passed: false,
Message: "all child incidents must be resolved or closed",
}
}
}
return GuardResult{Passed: true}
}
Implementing Custom Guards
To implement a custom guard:
- Create a struct implementing the Guard interface
- Register it with the state machine for specific transitions
- Write tests for the guard
// Example: SLAMetGuard checks the SLA status and notes a breach without blocking the transition
type SLAMetGuard struct {
slaService *sla.Service
}
func (g *SLAMetGuard) Name() string {
return "SLAMetGuard"
}
func (g *SLAMetGuard) Evaluate(
incident *models.Incident,
metadata map[string]interface{},
) GuardResult {
slaStatus, err := g.slaService.GetStatus(incident.ID)
if err != nil {
return GuardResult{Passed: false, Message: err.Error()}
}
if slaStatus.Breached {
// Allow transition but record breach
return GuardResult{
Passed: true,
Message: "SLA breached - proceeding with transition",
}
}
return GuardResult{Passed: true}
}
Registering Guards
Guards are registered for specific transition pairs:
sm := statemachine.New()
// Register guard for specific transition
sm.RegisterGuard(
statemachine.StateAcknowledged,
statemachine.StateMitigated,
&BlastRadiusStabilizedGuard{},
)
// Register guard for multiple transitions
sm.RegisterGuardForStates(
[]statemachine.State{statemachine.StateResolved, statemachine.StateClosed},
statemachine.StateOpen,
&NoteRequiredGuard{},
)
Using the State Machine
Basic Transition Validation
sm := statemachine.New()
// Validate without guards
err := sm.ValidateTransition(fromState, toState)
if err != nil {
// Invalid transition
}
// Validate with guards
err = sm.ValidateTransitionWithGuards(fromState, toState, incident, metadata)
if err != nil {
// Guard failed or invalid transition
}
In HTTP Handlers
func (h *IncidentHandlers) HandleTransitionIncident(w http.ResponseWriter, r *http.Request) {
vars := mux.Vars(r)
incidentID := vars["id"]
var req TransitionRequestBody
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
writeJSONError(w, "Invalid request body", http.StatusBadRequest)
return
}
// Get incident
incident, err := h.incidentStore.GetIncident(r.Context(), incidentID)
if err != nil {
writeJSONError(w, "Incident not found", http.StatusNotFound)
return
}
fromState := statemachine.StatusToState(incident.Status)
toState := statemachine.State(req.ToState)
// Validate transition with guards
if err := h.stateMachine.ValidateTransitionWithGuards(
fromState,
toState,
incident,
req.Metadata,
); err != nil {
writeTransitionError(w, "Guard evaluation failed", err.Error())
return
}
// Perform transition
newStatus := statemachine.StateToStatus(toState)
_, err = h.incidentStore.UpdateIncidentStatus(
r.Context(),
incidentID,
newStatus,
incident.Version,
)
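if err == storage.ErrVersionConflict {
writeJSONError(w, "Conflict", http.StatusConflict)
return
}
// Any other storage error must also stop the handler before events are emitted
if err != nil {
writeJSONError(w, "Failed to update incident status", http.StatusInternalServerError)
return
}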
// Emit CloudEvent
// ...
}
Status Conversion
The state machine uses the State type internally, while the models use IncidentStatus. Conversion helpers translate between the two:
// Convert model status to state machine state
state := statemachine.StatusToState(incident.Status)
// Convert state machine state to model status
status := statemachine.StateToStatus(state)
// Check if model status is terminal
if statemachine.IsTerminalStatus(incident.Status) {
// Cannot transition
}
Optimistic Locking
Status updates use optimistic locking to detect concurrent modifications. An update succeeds only if the caller's version matches the stored version:
// Update with version check
newVersion, err := incidentStore.UpdateIncidentStatus(
ctx,
incidentID,
newStatus,
currentVersion, // Must match DB version
)
if err == storage.ErrVersionConflict {
// Another process modified the incident
// Reload and retry
}
Database Implementation
UPDATE incidents
SET status = $1, version = version + 1, updated_at = NOW()
WHERE id = $2 AND version = $3
-- If rows affected = 0, version conflict
CloudEvents Integration
State transitions emit CloudEvents for audit and integration:
// After successful transition
if h.eventEmitter != nil {
metadata := map[string]interface{}{
"from_state": string(fromState),
"to_state": string(toState),
}
// Merge transition request metadata
for k, v := range req.Metadata {
metadata[k] = v
}
eventID, _ := h.eventEmitter.EmitIncidentEvent(
ctx,
models.EventTypeStateChangeV1,
incident,
metadata,
)
}
Event Types
| Event Type | Description |
|---|---|
| im.incident.declared.v1 | New incident declared |
| im.incident.state_change.v1 | State transition occurred |
| im.incident.merged.v1 | Incidents merged |
| im.incident.split.v1 | Incident split |
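A consumer of these events will typically branch on the event type string. The sketch below shows one way to do that. Only `EventTypeStateChangeV1` is named in this guide; the other constant names and the `describeEvent` helper are assumptions made for illustration.

```go
package main

import "fmt"

// Event type identifiers from the table above. Only EventTypeStateChangeV1
// appears in this guide; the other constant names are assumed.
const (
	EventTypeDeclaredV1    = "im.incident.declared.v1"
	EventTypeStateChangeV1 = "im.incident.state_change.v1"
	EventTypeMergedV1      = "im.incident.merged.v1"
	EventTypeSplitV1       = "im.incident.split.v1"
)

// describeEvent is a hypothetical helper that returns a short description
// for a known event type and reports whether the type was recognized.
func describeEvent(eventType string) (string, bool) {
	switch eventType {
	case EventTypeDeclaredV1:
		return "New incident declared", true
	case EventTypeStateChangeV1:
		return "State transition occurred", true
	case EventTypeMergedV1:
		return "Incidents merged", true
	case EventTypeSplitV1:
		return "Incident split", true
	default:
		// Unknown types are reported rather than treated as errors so that
		// consumers can ignore event versions they do not understand.
		return "", false
	}
}

func main() {
	desc, ok := describeEvent(EventTypeStateChangeV1)
	fmt.Println(desc, ok)

	_, ok = describeEvent("im.incident.unknown.v9")
	fmt.Println(ok)
}
```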
OpenTelemetry Integration
The state machine is instrumented with OpenTelemetry for observability:
Tracing
import (
"go.opentelemetry.io/otel/attribute"
"github.com/systmms/incidents/internal/observability"
)
func (h *IncidentHandlers) HandleTransitionIncident(w http.ResponseWriter, r *http.Request) {
ctx, span := observability.GetTracer().Start(
r.Context(),
"incident.transition",
)
defer span.End()
span.SetAttributes(
attribute.String("incident.id", incidentID),
attribute.String("from_state", string(fromState)),
attribute.String("to_state", string(toState)),
)
// ... transition logic
}
Metrics
import "github.com/systmms/incidents/internal/observability"
// Record successful transition
observability.RecordIncidentTransition(
ctx,
string(fromState),
string(toState),
string(incident.Severity),
duration,
true, // success
)
// Record guard evaluation
observability.RecordIncidentGuardEvaluation(
ctx,
guard.Name(),
string(incident.Severity),
result.Passed,
)
Testing
Unit Testing Guards
func TestBlastRadiusStabilizedGuard(t *testing.T) {
guard := &BlastRadiusStabilizedGuard{}
incident := &models.Incident{
ID: "INC-1234",
Severity: models.SeveritySEV2,
}
t.Run("passes with required metadata", func(t *testing.T) {
metadata := map[string]interface{}{
"blast_radius_stabilized": true,
"mitigation_summary": "Rolled back to v2.3.1",
}
result := guard.Evaluate(incident, metadata)
assert.True(t, result.Passed)
})
t.Run("fails without stabilized flag", func(t *testing.T) {
metadata := map[string]interface{}{
"mitigation_summary": "Rolled back",
}
result := guard.Evaluate(incident, metadata)
assert.False(t, result.Passed)
assert.Contains(t, result.Message, "blast_radius_stabilized")
})
}
Integration Testing Transitions
func TestStateTransitions(t *testing.T) {
sm := statemachine.New()
// Register guards
sm.RegisterGuard(
statemachine.StateAcknowledged,
statemachine.StateMitigated,
&BlastRadiusStabilizedGuard{},
)
incident := &models.Incident{
ID: "INC-1234",
Status: models.StatusAcknowledged,
}
t.Run("valid transition with guard passes", func(t *testing.T) {
metadata := map[string]interface{}{
"blast_radius_stabilized": true,
"mitigation_summary": "Fixed",
}
err := sm.ValidateTransitionWithGuards(
statemachine.StateAcknowledged,
statemachine.StateMitigated,
incident,
metadata,
)
assert.NoError(t, err)
})
}
Best Practices
- Always validate transitions - Use ValidateTransitionWithGuards before performing state changes
- Handle version conflicts - Implement retry logic for optimistic locking conflicts
- Emit events after success - Only emit CloudEvents after the database update succeeds
- Test guards thoroughly - Guards are critical for data integrity
- Use meaningful guard messages - Messages are returned to API clients
- Log guard failures - Failed guards may indicate issues that need investigation
Related Documentation
- Incident Lifecycle Guide - User-facing lifecycle documentation
- API Reference - Full API documentation
- Timeline Events - CloudEvents specification