Observability

Observability in Satusky spans three layers that are easy to confuse:

Layer	Question
Workload	Are pods built, scheduled, and ready?
Endpoint	Can users reach the public hostname over valid HTTPS?
Machine	Is the underlying capacity healthy and usable?

A complete operator view needs all three.

Current telemetry sources

Source	Current use
Kubernetes watch / informer data	deployment status and pod readiness
Kubernetes metrics API	CPU / memory observations
Prometheus	application, network, and machine time-series
Hubble-derived metrics paths	deployment network quality / latency inputs
Talos API	low-level machine resources and state
SideroLink	machine connectivity and discovery
WebSockets	live deployment, machine, log, and notification streams
PostgreSQL	historical metrics, billing records, persisted metadata

Current shape

The backend already contains many useful observability pieces:

deployment live-status WebSockets,
machine status WebSockets,
deployment metrics endpoints,
Prometheus-backed application and network metrics,
machine health jobs,
Talos resource inspection,
machine logs and events.

The gap is less “no observability exists” and more “the user-facing model is not yet unified.”

Desired operator model

status  = concise current state
metrics = changing measurements over time
logs    = emitted events / text streams
events  = lifecycle and system transitions
check   = cross-plane validation, especially for domains

That means:

deploy status should remain workload-focused,
domain checks should own public endpoint diagnostics,
machine commands should own fleet health,
dashboards may compose them, but the architecture should not blur them.

Health is multi-dimensional

A deployment can be:

pod-ready but publicly unreachable,
publicly routable but backed by a degraded node,
healthy in one cluster and unhealthy in another,
consuming resources normally while billing state is impaired.

The platform should preserve these dimensions instead of collapsing them into one opaque “healthy” bit.
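
As a minimal sketch of what preserving these dimensions could look like, the following keeps one state per dimension and per cluster, and reports only the impaired ones. The names `State`, `DeploymentHealth`, `impaired`, and `summarize` are hypothetical and chosen for illustration; they are not taken from the Satusky codebase.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    FAILED = "failed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class DeploymentHealth:
    """Health of one deployment in one cluster, kept as separate dimensions."""

    cluster: str
    workload: State  # pods built, scheduled, and ready
    endpoint: State  # public hostname reachable over valid HTTPS
    machine: State   # underlying capacity healthy and usable
    billing: State   # accounting state, independent of resource usage

    def impaired(self) -> dict[str, State]:
        """Return every dimension that is not OK, keyed by dimension name."""
        dims = {
            "workload": self.workload,
            "endpoint": self.endpoint,
            "machine": self.machine,
            "billing": self.billing,
        }
        return {name: state for name, state in dims.items() if state is not State.OK}


def summarize(reports: list[DeploymentHealth]) -> list[str]:
    """Render one line per cluster, naming each impaired dimension explicitly."""
    lines = []
    for report in reports:
        bad = report.impaired()
        if not bad:
            lines.append(f"{report.cluster}: all dimensions ok")
        else:
            detail = ", ".join(f"{name}={state.value}" for name, state in bad.items())
            lines.append(f"{report.cluster}: {detail}")
    return lines


if __name__ == "__main__":
    reports = [
        # pod-ready but publicly unreachable
        DeploymentHealth("cluster-a", State.OK, State.FAILED, State.OK, State.OK),
        # publicly routable but backed by a degraded node
        DeploymentHealth("cluster-b", State.OK, State.OK, State.DEGRADED, State.OK),
    ]
    print("\n".join(summarize(reports)))
```

Because no single boolean is derived, a dashboard can still collapse the record for display, while the underlying data continues to say which dimension is at fault.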

Current gaps and target direction

Gap	Target
Machine telemetry is exposed mostly through APIs rather than a mature CLI contract.	First-class machine status and metrics workflows.
Public endpoint readiness is not yet reported with the same rigor as pod readiness.	Route/DNS/TLS/HTTP checks become standard.
Metrics sources are rich but scattered.	One documented observability model with clear source-of-truth boundaries.
Historical billing observations and live operational metrics can be conflated.	Distinguish accounting records from live telemetry.
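
One way the Route/DNS/TLS/HTTP target could be approached is as an ordered series of stages, where the first failure is reported and later stages are marked as skipped. The sketch below uses only the Python standard library. The function names, the 5-second timeouts, and the rule that any HTTP status below 500 counts as reachable are assumptions made for illustration. The route stage depends on platform ingress state, so it is left as a caller-supplied probe rather than guessed at here.

```python
import http.client
import socket
import ssl
from typing import Callable, Optional

# A probe takes a hostname and returns None on success or an error string.
Probe = Callable[[str], Optional[str]]


def probe_dns(host: str) -> Optional[str]:
    """Check that the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
        return None
    except socket.gaierror as exc:
        return f"cannot resolve: {exc}"


def probe_tls(host: str) -> Optional[str]:
    """Check that a TLS handshake succeeds with a verified certificate."""
    ctx = ssl.create_default_context()  # verifies the chain and the hostname
    try:
        with socket.create_connection((host, 443), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return None
    except OSError as exc:  # ssl.SSLError and timeouts are OSError subclasses
        return f"handshake error: {exc}"


def probe_http(host: str) -> Optional[str]:
    """Check that GET / over HTTPS returns a non-5xx status (assumed rule)."""
    conn = http.client.HTTPSConnection(host, timeout=5)
    try:
        conn.request("GET", "/")
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException) as exc:
        return f"request error: {exc}"
    finally:
        conn.close()
    return None if status < 500 else f"server returned {status}"


def check_domain(host: str, stages: list[tuple[str, Probe]]) -> list[tuple[str, str]]:
    """Run stages in order; after the first failure, mark the rest as skipped."""
    results = []
    failed = False
    for name, probe in stages:
        if failed:
            results.append((name, "skipped"))
            continue
        error = probe(host)
        if error is None:
            results.append((name, "ok"))
        else:
            results.append((name, f"failed: {error}"))
            failed = True
    return results


# Default stages; a platform-specific route probe would be prepended by the caller.
DEFAULT_STAGES: list[tuple[str, Probe]] = [
    ("dns", probe_dns),
    ("tls", probe_tls),
    ("http", probe_http),
]
```

Calling `check_domain("app.example.com", DEFAULT_STAGES)` performs real network requests. Passing stub probes instead makes the staging logic testable offline, and reporting each stage by name keeps endpoint diagnostics separate from pod readiness.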