Observability
Observability in Satusky spans three distinct layers that are easy to confuse:
| Layer | Question |
|---|---|
| Workload | Are pods built, scheduled, and ready? |
| Endpoint | Can users reach the public hostname over valid HTTPS? |
| Machine | Is the underlying capacity healthy and usable? |
A complete operator view needs all three.
Current telemetry sources
| Source | Current use |
|---|---|
| Kubernetes watch / informer data | deployment status and pod readiness |
| Kubernetes metrics API | CPU / memory observations |
| Prometheus | application, network, and machine time-series |
| Hubble-derived metrics paths | deployment network quality / latency inputs |
| Talos API | low-level machine resources and state |
| SideroLink | machine connectivity and discovery |
| WebSockets | live deployment, machine, log, and notification streams |
| PostgreSQL | historical metrics, billing records, persisted metadata |
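One way to read the table is as a lookup from a signal to the single source that owns it. The following is a minimal sketch of that idea, not Satusky's implementation: the signal names (`pod_readiness`, `cpu_memory`, and so on) and the `source_for` helper are hypothetical, and only the source assignments are taken from the table above.

```python
from enum import Enum


class Source(Enum):
    K8S_WATCH = "Kubernetes watch / informer data"
    K8S_METRICS = "Kubernetes metrics API"
    PROMETHEUS = "Prometheus"
    HUBBLE = "Hubble-derived metrics paths"
    TALOS = "Talos API"
    SIDEROLINK = "SideroLink"
    WEBSOCKETS = "WebSockets"
    POSTGRESQL = "PostgreSQL"


# Hypothetical signal names; the source assignments follow the table above.
SOURCE_OF_TRUTH = {
    "deployment_status": Source.K8S_WATCH,
    "pod_readiness": Source.K8S_WATCH,
    "cpu_memory": Source.K8S_METRICS,
    "network_latency": Source.HUBBLE,
    "machine_state": Source.TALOS,
    "machine_connectivity": Source.SIDEROLINK,
    "billing_records": Source.POSTGRESQL,
}


def source_for(signal: str) -> Source:
    """Return the one source that owns a signal, or fail loudly if none is registered."""
    try:
        return SOURCE_OF_TRUTH[signal]
    except KeyError:
        raise KeyError(f"no source of truth registered for {signal!r}") from None
```

Failing on an unregistered signal, rather than guessing a default, keeps the boundary explicit: a new signal has to be assigned an owner before anything can read it.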
Current shape
The backend already contains many useful observability pieces:
- deployment live-status WebSockets,
- machine status WebSockets,
- deployment metrics endpoints,
- Prometheus-backed application and network metrics,
- machine health jobs,
- Talos resource inspection,
- machine logs and events.
The gap is less “no observability exists” and more “the user-facing model is not yet unified.”
Desired operator model
- status = concise current state
- metrics = changing measurements over time
- logs = emitted events / text streams
- events = lifecycle and system transitions
- check = cross-plane validation, especially for domains

That means:
- `deploy status` should remain workload-focused,
- domain checks should own public endpoint diagnostics,
- machine commands should own fleet health,
- dashboards may compose them, but the architecture should not blur them.
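The separation described above can be sketched as a small data model in which every observation carries both its kind and its layer, and a dashboard groups observations without merging them. This is an illustrative sketch only: `Kind`, `Layer`, `Observation`, and `compose_dashboard` are hypothetical names, not part of any Satusky API.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    STATUS = "status"    # concise current state
    METRICS = "metrics"  # changing measurements over time
    LOGS = "logs"        # emitted events / text streams
    EVENTS = "events"    # lifecycle and system transitions
    CHECK = "check"      # cross-plane validation, especially for domains


class Layer(Enum):
    WORKLOAD = "workload"
    ENDPOINT = "endpoint"
    MACHINE = "machine"


@dataclass(frozen=True)
class Observation:
    kind: Kind
    layer: Layer
    subject: str  # hypothetical identifier, e.g. a deployment name or hostname
    summary: str


def compose_dashboard(observations: list[Observation]) -> dict[Layer, list[Observation]]:
    """Group observations by layer so they can be shown side by side.

    Nothing is merged: each observation keeps its own kind and layer.
    """
    grouped: dict[Layer, list[Observation]] = {layer: [] for layer in Layer}
    for obs in observations:
        grouped[obs.layer].append(obs)
    return grouped
```

Because the layer is a required field, a workload status cannot silently stand in for an endpoint check; a dashboard has to show an empty endpoint column instead.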
Health is multi-dimensional
A deployment can be:
- pod-ready but publicly unreachable,
- publicly routable but backed by a degraded node,
- healthy in one cluster and unhealthy in another,
- consuming resources normally while billing state is impaired.
The platform should preserve these dimensions instead of collapsing them into one opaque “healthy” bit.
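A minimal sketch of what preserving these dimensions could look like follows. It assumes a per-cluster record with one state per layer plus a separate billing state; the type names, field names, and the `problems` helper are all hypothetical and chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class State(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ClusterHealth:
    cluster: str
    workload: State  # pods built, scheduled, and ready
    endpoint: State  # public hostname reachable over valid HTTPS
    machine: State   # underlying capacity healthy and usable


@dataclass(frozen=True)
class DeploymentHealth:
    clusters: tuple[ClusterHealth, ...]
    billing: State


def problems(health: DeploymentHealth) -> list[str]:
    """List every non-OK dimension instead of returning a single boolean."""
    found = []
    for c in health.clusters:
        for name in ("workload", "endpoint", "machine"):
            state = getattr(c, name)
            if state is not State.OK:
                found.append(f"{c.cluster}/{name}: {state.value}")
    if health.billing is not State.OK:
        found.append(f"billing: {health.billing.value}")
    return found


# Pod-ready but publicly unreachable in one cluster, healthy in the other.
example = DeploymentHealth(
    clusters=(
        ClusterHealth("cluster-a", State.OK, State.DEGRADED, State.OK),
        ClusterHealth("cluster-b", State.OK, State.OK, State.OK),
    ),
    billing=State.OK,
)
print(problems(example))  # ['cluster-a/endpoint: degraded']
```

An empty list plays the role of "healthy", while any non-empty result names the cluster and dimension that need attention.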
Current gaps and target direction
| Gap | Target |
|---|---|
| Machine telemetry exists mostly behind APIs, not a mature CLI contract. | First-class machine status and metrics workflows. |
| Public endpoint readiness is not yet reported with the same rigor as pod readiness. | Route/DNS/TLS/HTTP checks become standard. |
| Metrics sources are rich but scattered. | One documented observability model with clear source-of-truth boundaries. |
| Historical billing observations and live operational metrics can be conflated. | Distinguish accounting records from live telemetry. |
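As an illustration of the DNS/TLS/HTTP part of the endpoint target, the sketch below runs the three probes in order using only the Python standard library and stops at the first failure, since each stage depends on the one before it. It is an assumption-laden example rather than Satusky's check: the function names, the 5-second timeout, and the rule that treats HTTP 5xx as a failure are all hypothetical, and the route check is omitted because it depends on platform-specific state.

```python
from __future__ import annotations

import http.client
import socket
import ssl
from typing import Callable

Probe = Callable[[str], str]


def check_dns(host: str) -> str:
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return f"resolves to {infos[0][4][0]}"


def check_tls(host: str) -> str:
    ctx = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((host, 443), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return f"handshake ok ({tls.version()})"


def check_http(host: str) -> str:
    conn = http.client.HTTPSConnection(host, timeout=5)
    try:
        conn.request("HEAD", "/")
        status = conn.getresponse().status
    finally:
        conn.close()
    if status >= 500:  # assumption: only server errors count as failure
        raise RuntimeError(f"HTTP {status}")
    return f"HTTP {status}"


DEFAULT_PROBES: list[tuple[str, Probe]] = [
    ("dns", check_dns),
    ("tls", check_tls),
    ("http", check_http),
]


def check_endpoint(
    host: str, probes: list[tuple[str, Probe]] = DEFAULT_PROBES
) -> list[tuple[str, bool, str]]:
    """Run probes in order and stop at the first failure.

    Returns (stage, passed, detail) for every stage that actually ran.
    """
    results = []
    for name, probe in probes:
        try:
            results.append((name, True, probe(host)))
        except (OSError, http.client.HTTPException, RuntimeError) as exc:
            results.append((name, False, str(exc)))
            break
    return results
```

Passing the probes as a parameter keeps the ordering logic testable without network access, and reporting the failing stage by name lets an operator tell a DNS problem from a certificate problem instead of seeing a generic "unreachable".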