Health Monitoring
Keelson's layered health monitoring provides generic building blocks — presence detection, health scoring, and composite aggregation — that any application-specific decision layer can consume.
Phase 1 status: This document covers the protocol conventions, message definitions, and reference configuration schema. The aggregator implementation is planned for Phase 2.
Overview
Health monitoring in keelson follows a 3-layer architecture:
| Layer | Responsibility | Mechanism |
|---|---|---|
| Layer 1 — Presence | Detect whether source processes are running | Zenoh liveliness tokens |
| Layer 2 — Health assessment | Evaluate per-component health; produce a composite score | Health aggregator (configurable) |
| Layer 3 — Application logic | Consume the composite score to drive domain-specific decisions | Application-defined (see examples below) |
Layers 1–2 are generic keelson infrastructure. Layer 3 is where applications map the composite score to actionable decisions.
Layer 1: Liveliness (Presence Detection)
Each source process declares a liveliness token using the convention defined in the protocol specification, Section 5:
{base_path}/@v0/{entity_id}/pubsub/*/{source_id}
The * wildcard in the subject position signals that the source is alive and may produce output on any subject. This is a coarse presence signal — the token does not declare which specific subjects the source publishes.
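The token convention above can be sketched as a pair of string helpers. This is only an illustration: the function names `liveliness_key` and `parse_liveliness_key` are hypothetical and not part of keelson, and the parser assumes that everything after `*/` is the source id.

```python
def liveliness_key(base_path: str, entity_id: str, source_id: str) -> str:
    """Build the liveliness token key for one source process."""
    return f"{base_path}/@v0/{entity_id}/pubsub/*/{source_id}"


def parse_liveliness_key(key: str) -> tuple[str, str, str]:
    """Split a token key into (base_path, entity_id, source_id)."""
    base_path, sep, tail = key.partition("/@v0/")
    if not sep:
        raise ValueError(f"not an @v0 key: {key!r}")
    entity_id, marker, rest = tail.partition("/pubsub/")
    if not marker:
        raise ValueError(f"not a pubsub key: {key!r}")
    # Assumption: the subject chunk is the literal "*" and the remainder
    # of the key is the source id.
    subject, _, source_id = rest.partition("/")
    if subject != "*" or not source_id:
        raise ValueError(f"not a liveliness token key: {key!r}")
    return base_path, entity_id, source_id
```

For example, `liveliness_key("keelson", "landkrabban", "ais")` yields `keelson/@v0/landkrabban/pubsub/*/ais`, and parsing that key returns the three parts again.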
A health aggregator subscribes to liveliness events to detect source join/leave:
session.liveliness().declare_subscriber(
    "keelson/@v0/landkrabban/pubsub/**",
    callback,
)
See protocol specification, Section 5 for full details on token format, subscriber patterns, and verbatim chunk isolation.
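What a subscriber callback might do with join/leave events can be sketched without a live Zenoh session. In Zenoh, a liveliness subscriber sees a PUT sample when a token appears and a DELETE sample when it disappears; the `PresenceTracker` class below is hypothetical and takes the sample kind as a plain string, so mapping a real sample's kind and key expression onto these arguments is left to the caller.

```python
class PresenceTracker:
    """Keep the set of liveliness token keys that are currently present."""

    def __init__(self) -> None:
        self.alive: set[str] = set()

    def on_event(self, kind: str, key: str) -> None:
        """Apply one liveliness event; kind is "PUT" (join) or "DELETE" (leave)."""
        if kind == "PUT":
            self.alive.add(key)
        elif kind == "DELETE":
            # discard() tolerates a leave event for a key never seen joining.
            self.alive.discard(key)
        else:
            raise ValueError(f"unexpected sample kind: {kind!r}")
```

Keeping the tracker free of Zenoh types makes it easy to unit-test the presence logic separately from the transport.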
Declaring liveliness in connectors
Any connector that publishes data into keelson (a source/ingestion connector) should declare a liveliness token. The token signals "this source process is alive and may produce output."
When to declare: Source connectors that publish to pubsub/ key expressions, such as ais2keelson, n2k2keelson, nmea01832keelson, or platform-geometry2keelson.
When NOT to declare:
- Sink connectors (subscribers/recorders like keelson2foxglove, keelson2mcap) — they have no --source-id and don't publish into keelson
- Offline utilities (klog2mcap, mcap-tagg) — not long-running network processes
- RPC-only services (mediamtx-whep) — until a separate RPC liveliness convention is defined
Pattern: Use the declare_liveliness_token context manager from keelson.scaffolding immediately after opening the Zenoh session. The token is automatically undeclared when the with block exits:
import zenoh
from keelson.scaffolding import declare_liveliness_token

# conf, args, and run() come from the connector's own setup code.
with zenoh.open(conf) as session:
    with declare_liveliness_token(session, args.realm, args.entity_id, args.source_id):
        run(session, args)
What it gives you: Health aggregators and monitoring UIs can detect source join/leave events without polling. When a connector process starts, it appears in the liveliness set; when it exits (cleanly or via crash), the token is automatically removed and subscribers receive a leave event.
Layer 2: Health Aggregation
The health aggregator is a generic, configurable component that produces a single composite score (0.0–1.0) for downstream consumers. It evaluates per-component health using a weighted scoring model. Each component is assigned:
- weight — its relative importance in the composite score (all weights should sum to 1.0)
- stale_threshold_ms — maximum age of the last received message before the component is considered stale (health score → 0.0)
- health_rules — conditions evaluated against incoming messages
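The per-component settings above could be modelled roughly as follows. This is a sketch, not the reference configuration schema: the `ComponentConfig` dataclass, the helper names, and the use of millisecond integer timestamps are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ComponentConfig:
    """Hypothetical in-memory form of one component's settings."""
    name: str
    weight: float
    stale_threshold_ms: int
    health_rules: list = field(default_factory=list)


def weights_sum_to_one(components: list, tol: float = 1e-6) -> bool:
    """Check that the component weights add up to 1.0 within a tolerance."""
    return math.isclose(sum(c.weight for c in components), 1.0, abs_tol=tol)


def is_stale(last_message_ms: int, now_ms: int, stale_threshold_ms: int) -> bool:
    """A component is stale once its last message is older than the threshold."""
    return (now_ms - last_message_ms) > stale_threshold_ms
```

Validating the weights once at load time keeps the composite score inside the 0.0 to 1.0 range without any later renormalization.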
Health rules
Each rule inspects a specific subject and evaluates a condition:
| Rule type | Description | Example |
|---|---|---|
| Value threshold | Numeric comparison against a message field | good_if: "value < 2.0" |
| Enum/state requirement | Exact match against an expected value | require: "FIX_3D" |
| Message rate | Frequency of messages on a subject | good_if: "> 20 Hz" |
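The three rule types in the table can be illustrated with small checker functions. The expression grammar is not specified in this document, so the sketch assumes that value thresholds take the form `value <op> <number>`; the function names are hypothetical.

```python
import operator
import re

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
        ">=": operator.ge, "==": operator.eq}
# Assumed grammar: the word "value", a comparison operator, a decimal number.
_EXPR = re.compile(r"^\s*value\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def check_threshold(expr: str, value: float) -> bool:
    """Evaluate a value-threshold expression such as "value < 2.0"."""
    match = _EXPR.match(expr)
    if match is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    return _OPS[match.group(1)](value, float(match.group(2)))


def check_required(required: str, actual: str) -> bool:
    """Enum/state requirement: the field must equal the expected value exactly."""
    return actual == required


def message_rate_hz(timestamps_s: list) -> float:
    """Average message rate over a window of arrival times in seconds."""
    if len(timestamps_s) < 2:
        return 0.0
    span = timestamps_s[-1] - timestamps_s[0]
    return (len(timestamps_s) - 1) / span if span > 0 else 0.0
```

Parsing a fixed pattern rather than calling `eval` keeps configuration files from executing arbitrary code.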
A component's health score is determined by the worst-performing rule:
- All rules pass good_if → score = 1.0
- At least one rule in degraded_if range → score = 0.5
- Any rule fails all conditions or the component is stale → score = 0.0
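The worst-rule scoring described above fits in a few lines. The outcome labels `"good"`, `"degraded"`, and `"failed"` are illustrative names, and treating a non-stale component with no rules as healthy is an assumption of this sketch.

```python
RULE_SCORES = {"good": 1.0, "degraded": 0.5, "failed": 0.0}


def component_score(rule_outcomes: list, stale: bool = False) -> float:
    """Score a component by its worst-performing rule."""
    if stale:
        # A stale component scores 0.0 regardless of its rule outcomes.
        return 0.0
    # Assumption: with no rules configured, a fresh component scores 1.0.
    return min((RULE_SCORES[outcome] for outcome in rule_outcomes), default=1.0)
```

Here `component_score(["good", "degraded"])` gives 0.5, and adding a single `"failed"` outcome drops the score to 0.0.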
Composite score
The composite score is the weighted sum of all component scores:
composite_score = Σ (component_weight × component_score)
This normalized score is the output of Layer 2. Layer 3 consumers interpret it according to their own domain logic.
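As a worked example of the weighted sum, the sketch below combines three components. The component names, weights, and scores are made up for illustration.

```python
def composite_score(components: dict) -> float:
    """Weighted sum over {name: (weight, score)} entries."""
    return sum(weight * score for weight, score in components.values())


# Hypothetical deployment: weights sum to 1.0.
example = {
    "gnss": (0.5, 1.0),   # all rules good
    "imu": (0.3, 0.5),    # one rule degraded
    "ais": (0.2, 0.0),    # stale
}
result = composite_score(example)  # 0.5*1.0 + 0.3*0.5 + 0.2*0.0, about 0.65
```

Because every component score lies in 0.0 to 1.0 and the weights sum to 1.0, the result stays in the same range.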
Layer 3: Application-Specific Decision Logic
The composite score produced by Layer 2 is the input to whatever domain-specific logic a deployment requires. Applications subscribe to the composite score and apply their own rules to translate it into actionable decisions. Keelson does not prescribe what those decisions are — it only guarantees a well-defined, normalized health signal.
Example use cases:
- Operational authority for autonomous vessels — map the composite score to authority levels (detailed below)
- Dashboard health indicators — translate the score into green / yellow / red status for operator UIs
- Automated alerting or degraded-mode switching — trigger alarms or fall back to a safe mode when the score drops below a threshold
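A Layer 3 consumer can be very small. The dashboard case might look like the sketch below; the cut-off values 0.8 and 0.5 are arbitrary placeholders and are not defined by keelson.

```python
def status_colour(score: float, green_at: float = 0.8, yellow_at: float = 0.5) -> str:
    """Map a composite score to a traffic-light status for an operator UI."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= green_at:
        return "green"
    if score >= yellow_at:
        return "yellow"
    return "red"
```

Each deployment would choose its own cut-offs; the function only shows that Layer 3 logic consumes the score and nothing else.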
Example: Operational Authority for Autonomous Vessels
This built-in example maps the composite score to operational authority levels aligned with the IMO MASS (Maritime Autonomous Surface Ships) framework.
The aggregator publishes an OperationalAuthority message to:
{base_path}/@v0/{entity_id}/pubsub/operational_authority/{aggregator_id}
Message format
The message contains:
| Field | Type | Description |
|---|---|---|
| timestamp | google.protobuf.Timestamp | Time of the authority determination |
| level | AuthorityLevel enum | Current authority level |
| composite_score | float | Normalized composite health score (0.0–1.0) |
| reason | string | Human-readable explanation |
| component_scores | map<string, float> | Per-component health scores for observability |
Authority levels
The AuthorityLevel enum is aligned with the IMO MASS framework:
| Value | Name | Description |
|---|---|---|
| 0 | AUTHORITY_LEVEL_UNKNOWN | Authority level has not been determined |
| 1 | AUTHORITY_LEVEL_MINIMAL_SAFE_MODE | Minimal safe operation (e.g., all-stop, hold position) |
| 2 | AUTHORITY_LEVEL_SUPERVISED_REMOTE | Remote operator with limited situational awareness |
| 3 | AUTHORITY_LEVEL_REMOTE_CONTROLLED | Full remote control with good situational awareness |
| 4 | AUTHORITY_LEVEL_ASSISTED_AUTONOMOUS | Autonomous with operator supervision |
| 5 | AUTHORITY_LEVEL_FULL_AUTONOMOUS | Fully autonomous operation |
Authority thresholds and hysteresis
The composite score is mapped to an authority level using configurable thresholds. The aggregator selects the highest authority level whose threshold is met:
| Authority level | Default threshold |
|---|---|
| FULL_AUTONOMOUS | ≥ 0.85 |
| ASSISTED_AUTONOMOUS | ≥ 0.65 |
| REMOTE_CONTROLLED | ≥ 0.45 |
| SUPERVISED_REMOTE | ≥ 0.25 |
| MINIMAL_SAFE_MODE | < 0.25 |
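The default thresholds can be read as a highest-match lookup. The sketch below mirrors the enum values from the table above in a Python `IntEnum`; it stands in for the generated protobuf enum, and the function name `level_for_score` is hypothetical. Hysteresis is deliberately left out here.

```python
from enum import IntEnum


class AuthorityLevel(IntEnum):
    """Python stand-in for the AuthorityLevel enum values listed above."""
    UNKNOWN = 0
    MINIMAL_SAFE_MODE = 1
    SUPERVISED_REMOTE = 2
    REMOTE_CONTROLLED = 3
    ASSISTED_AUTONOMOUS = 4
    FULL_AUTONOMOUS = 5


# Default thresholds from the table, ordered from highest level to lowest.
DEFAULT_THRESHOLDS = [
    (AuthorityLevel.FULL_AUTONOMOUS, 0.85),
    (AuthorityLevel.ASSISTED_AUTONOMOUS, 0.65),
    (AuthorityLevel.REMOTE_CONTROLLED, 0.45),
    (AuthorityLevel.SUPERVISED_REMOTE, 0.25),
]


def level_for_score(score: float, thresholds=DEFAULT_THRESHOLDS) -> AuthorityLevel:
    """Return the highest authority level whose threshold the score meets."""
    for level, threshold in thresholds:
        if score >= threshold:
            return level
    # Below every threshold: fall back to the minimal safe mode.
    return AuthorityLevel.MINIMAL_SAFE_MODE
```

With the defaults, a score of 0.7 maps to `ASSISTED_AUTONOMOUS` and a score of 0.1 maps to `MINIMAL_SAFE_MODE`.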
A hysteresis band (default: 0.05) prevents rapid oscillation between levels. Transitioning down requires the score to drop below threshold - hysteresis, and transitioning up requires the score to exceed threshold + hysteresis.
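One way to read the hysteresis rule is as a small state update that depends on the current level. The sketch below is an interpretation, not the aggregator's implementation: levels are plain indices (0 for MINIMAL_SAFE_MODE up to 4 for FULL_AUTONOMOUS), and allowing a single update to cross several levels is an assumption.

```python
# thresholds[i] is the score required for level i + 1; level 0 needs none.
DEFAULT_LEVEL_THRESHOLDS = [0.25, 0.45, 0.65, 0.85]


def next_level(current: int, score: float,
               thresholds=DEFAULT_LEVEL_THRESHOLDS,
               hysteresis: float = 0.05) -> int:
    """Return the new level index given the current level and a fresh score."""
    level = current
    # Step up while the score exceeds the next level's threshold plus the band.
    while level < len(thresholds) and score > thresholds[level] + hysteresis:
        level += 1
    # Step down while the score is below the current level's threshold minus the band.
    while level > 0 and score < thresholds[level - 1] - hysteresis:
        level -= 1
    return level
```

Starting at level 3 (threshold 0.65), a score of 0.62 keeps the level at 3 because it is still above 0.60, while 0.58 moves it down to level 2. From level 2, a score of 0.68 is not enough to move back up; the score has to exceed 0.70 first.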