docs: add observability page (#2384)

Co-authored-by: Moritz Sanft <58110325+msanft@users.noreply.github.com> Co-authored-by: 3u13r <lc@edgeless.systems> Co-authored-by: Thomas Tendyck <51411342+thomasten@users.noreply.github.com>
2025-08-03 20:44:14 -04:00 · 2023-10-04 09:37:46 +02:00 · 2023-10-04 09:37:46 +02:00 · 7c76592a08
commit 7c76592a08
parent e938cc5e63
7 changed files with 187 additions and 1 deletions
--- a/docs/docs/architecture/observability.md
+++ b/docs/docs/architecture/observability.md
@ -0,0 +1,78 @@
+# Observability
+
+In Kubernetes, observability is the ability to gain insight into the behavior and performance of applications.
+It helps identify and resolve issues more effectively, ensuring stability and performance of Kubernetes workloads, reducing downtime and outages, and improving efficiency.
+The "three pillars of observability" are logs, metrics, and traces.
+
+In the context of Confidential Computing, observability is a delicate subject and needs to be applied such that it doesn't leak any sensitive information.
+The following gives an overview of where and how you can apply standard observability tools in Constellation.
+
+## Cloud resource monitoring
+
+While inaccessible, Constellation's nodes are still visible as black box VMs to the hypervisor.
+Resource consumption, such as memory and CPU utilization, can be monitored from the outside and observed via the cloud platforms directly.
+Similarly, other resources, such as storage and network and their respective metrics, are visible via the cloud platform.
+
+## Metrics
+
+Metrics are numeric representations of data measured over intervals of time. They're essential for understanding system health and gaining insights using telemetry signals.
+
+By default, Constellation exposes the [metrics for Kubernetes system components](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/) inside the cluster.
+Similarly, the [etcd metrics](https://etcd.io/docs/v3.5/metrics/) endpoints are exposed inside the cluster.
+These [metrics endpoints can be disabled](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#disabling-metrics).
+
+You can collect these cluster-internal metrics via tools such as [Prometheus](https://prometheus.io/) or the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
+
+Constellation's CNI Cilium also supports [metrics via Prometheus endpoints](https://docs.cilium.io/en/latest/observability/metrics/).
+However, in Constellation, they're disabled by default and must be enabled first.
+
+## Logs
+
+Logs represent discrete events that usually describe what's happening with your service.
+The payload is an actual message emitted from your system along with a metadata section containing a timestamp, labels, and tracking identifiers.
+
+### System logs
+
+Constellation uses cloud logging for events occurring during the early stages of a node's boot process.
+These logs include [Bootstrapper](./microservices.md#bootstrapper) events and [state disk UUIDs](../architecture/images.md#state-disk).
+You can access the cloud logging [directly via the cloud provider endpoints](../workflows/troubleshooting.md#cloud-logging).
+
+More detailed system-level logs are accessible via `/var/log` and [journald](https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html) on the nodes directly.
+They can be collected from there, for example, via [Filebeat and Logstash](https://www.elastic.co/guide/en/beats/filebeat/current/logstash-output.html), which are tools of the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
+
+In case of an error during the initialization, the CLI automatically collects the [Bootstrapper](./microservices.md#bootstrapper) logs and returns these as a file for [troubleshooting](../workflows/troubleshooting.md). Here is an example of such an event:
+
+```shell-session
+Cluster initialization failed. This error is not recoverable.
+Terminate your cluster and try again.
+Fetched bootstrapper logs are stored in "constellation-cluster.log"
+```
+
+### Kubernetes logs
+
+Constellation supports the [Kubernetes logging architecture](https://kubernetes.io/docs/concepts/cluster-administration/logging/).
+By default, logs are written to the nodes' encrypted state disks.
+These include the Pod and container logs and the [system component logs](https://kubernetes.io/docs/concepts/cluster-administration/logging/#system-component-logs).
+
+[Constellation services](microservices.md) run as Pods inside the `kube-system` namespace and use the standard container logging mechanism.
+The same applies for the [Cilium Pods](https://docs.cilium.io/en/latest/operations/troubleshooting/#logs).
+
+You can collect logs from within the cluster via tools such as [Fluentd](https://github.com/fluent/fluentd), [Loki](https://github.com/grafana/loki), or the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
+
+## Traces
+
+Modern systems are implemented as interconnected complex and distributed microservices. Understanding request flows and system communications is challenging, mainly because all systems in a chain need to be modified to propagate tracing information. Distributed tracing is a new approach to increasing observability and understanding performance bottlenecks. A trace represents consecutive events that reflect an end-to-end request path in a distributed system.
+
+Constellation supports [traces for Kubernetes system components](https://kubernetes.io/docs/concepts/cluster-administration/system-traces/).
+By default, they're disabled and need to be enabled first.
+
+Similarly, Cilium can be enabled to [export traces](https://cilium.io/use-cases/metrics-export/).
+
+You can collect these traces via tools such as [Jaeger](https://www.jaegertracing.io/) or [Zipkin](https://zipkin.io/).
+
+## Integrations
+
+Platforms and SaaS solutions such as Datadog, logz.io, Dynatrace, or New Relic facilitate the observability challenge for Kubernetes and provide all-in-one SaaS solutions.
+They install agents into the cluster that collect metrics, logs, and tracing information and upload them into the data lake of the platform.
+Technically, the agent-based approach is compatible with Constellation, and attaching these platforms is straightforward.
+However, you need to evaluate if the exported data might violate Constellation's compliance and privacy guarantees by uploading them to a third-party platform.
--- a/docs/docs/architecture/overview.md
+++ b/docs/docs/architecture/overview.md
@ -22,3 +22,9 @@ You can learn more about [the images](images.md) and how verified boot ensures t
 ## About key management and cryptographic primitives

 Encryption of data at-rest, in-transit, and in-use is the fundamental building block for confidential computing and Constellation. Learn more about the [keys and cryptographic primitives](keys.md) used in Constellation, [encrypted persistent storage](encrypted-storage.md), and [network encryption](networking.md).
+
+## About observability
+
+Observability in Kubernetes refers to the capability to swiftly troubleshoot issues using telemetry signals such as logs, metrics, and traces.
+In the realm of Confidential Computing, it's crucial that observability aligns with confidentiality, necessitating careful implementation.
+Learn more about the [observability capabilities in Constellation](./observability.md).
--- a/docs/sidebars.js
+++ b/docs/sidebars.js
@ -254,6 +254,11 @@ const sidebars = {
          label: 'Networking',
          id: 'architecture/networking',
        },
+        {
+          type: 'doc',
+          label: 'Observability',
+          id: 'architecture/observability',
+        },
      ],
    },
    {
--- a/docs/styles/Vocab/constellation/accept.txt
+++ b/docs/styles/Vocab/constellation/accept.txt
@ -15,11 +15,15 @@ Bootstrapper
 config
 cyber
 datacenter
+Datadog
 deallocate
 Dockerfile
+Dynatrace
 [Ee]mojivoto
 etcd
+Filebeat
 Filestore
+Fluentd
 Fulcio
 Mbps
 Gbps
@ -30,12 +34,14 @@ iam
 IAM
 iodepth
 initramfs
+journald
 [Kk]3s
 Kata
 kubeadm
 kubectl
 kubelet
 libcryptsetup
+Logstash
 MicroK8s
 [Mm]inikube
 namespace
@ -45,11 +51,13 @@ Rekor
 resizable
 rollout
 sigstore
+[Ss]uperset
 Syft
 systemd
 [Uu]nencrypted
 unspoofable
 updatable
+UUID
 proxied
 QEMU
 virsh
@ -58,4 +66,4 @@ whitepaper
 WireGuard
 Xeon
 xsltproc
-[Ss]uperset
+Zipkin
--- a/docs/versioned_docs/version-2.11/architecture/observability.md
+++ b/docs/versioned_docs/version-2.11/architecture/observability.md
@ -0,0 +1,78 @@
+# Observability
+
+In Kubernetes, observability is the ability to gain insight into the behavior and performance of applications.
+It helps identify and resolve issues more effectively, ensuring stability and performance of Kubernetes workloads, reducing downtime and outages, and improving efficiency.
+The "three pillars of observability" are logs, metrics, and traces.
+
+In the context of Confidential Computing, observability is a delicate subject and needs to be applied such that it doesn't leak any sensitive information.
+The following gives an overview of where and how you can apply standard observability tools in Constellation.
+
+## Cloud resource monitoring
+
+While inaccessible, Constellation's nodes are still visible as black box VMs to the hypervisor.
+Resource consumption, such as memory and CPU utilization, can be monitored from the outside and observed via the cloud platforms directly.
+Similarly, other resources, such as storage and network and their respective metrics, are visible via the cloud platform.
+
+## Metrics
+
+Metrics are numeric representations of data measured over intervals of time. They're essential for understanding system health and gaining insights using telemetry signals.
+
+By default, Constellation exposes the [metrics for Kubernetes system components](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/) inside the cluster.
+Similarly, the [etcd metrics](https://etcd.io/docs/v3.5/metrics/) endpoints are exposed inside the cluster.
+These [metrics endpoints can be disabled](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#disabling-metrics).
+
+You can collect these cluster-internal metrics via tools such as [Prometheus](https://prometheus.io/) or the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
+
+Constellation's CNI Cilium also supports [metrics via Prometheus endpoints](https://docs.cilium.io/en/latest/observability/metrics/).
+However, in Constellation, they're disabled by default and must be enabled first.
+
+## Logs
+
+Logs represent discrete events that usually describe what's happening with your service.
+The payload is an actual message emitted from your system along with a metadata section containing a timestamp, labels, and tracking identifiers.
+
+### System logs
+
+Constellation uses cloud logging for events occurring during the early stages of a node's boot process.
+These logs include [Bootstrapper](./microservices.md#bootstrapper) events and [state disk UUIDs](../architecture/images.md#state-disk).
+You can access the cloud logging [directly via the cloud provider endpoints](../workflows/troubleshooting.md#cloud-logging).
+
+More detailed system-level logs are accessible via `/var/log` and [journald](https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html) on the nodes directly.
+They can be collected from there, for example, via [Filebeat and Logstash](https://www.elastic.co/guide/en/beats/filebeat/current/logstash-output.html), which are tools of the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
+
+In case of an error during the initialization, the CLI automatically collects the [Bootstrapper](./microservices.md#bootstrapper) logs and returns these as a file for [troubleshooting](../workflows/troubleshooting.md). Here is an example of such an event:
+
+```shell-session
+Cluster initialization failed. This error is not recoverable.
+Terminate your cluster and try again.
+Fetched bootstrapper logs are stored in "constellation-cluster.log"
+```
+
+### Kubernetes logs
+
+Constellation supports the [Kubernetes logging architecture](https://kubernetes.io/docs/concepts/cluster-administration/logging/).
+By default, logs are written to the nodes' encrypted state disks.
+These include the Pod and container logs and the [system component logs](https://kubernetes.io/docs/concepts/cluster-administration/logging/#system-component-logs).
+
+[Constellation services](microservices.md) run as Pods inside the `kube-system` namespace and use the standard container logging mechanism.
+The same applies for the [Cilium Pods](https://docs.cilium.io/en/latest/operations/troubleshooting/#logs).
+
+You can collect logs from within the cluster via tools such as [Fluentd](https://github.com/fluent/fluentd), [Loki](https://github.com/grafana/loki), or the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
+
+## Traces
+
+Modern systems are implemented as interconnected complex and distributed microservices. Understanding request flows and system communications is challenging, mainly because all systems in a chain need to be modified to propagate tracing information. Distributed tracing is a new approach to increasing observability and understanding performance bottlenecks. A trace represents consecutive events that reflect an end-to-end request path in a distributed system.
+
+Constellation supports [traces for Kubernetes system components](https://kubernetes.io/docs/concepts/cluster-administration/system-traces/).
+By default, they're disabled and need to be enabled first.
+
+Similarly, Cilium can be enabled to [export traces](https://cilium.io/use-cases/metrics-export/).
+
+You can collect these traces via tools such as [Jaeger](https://www.jaegertracing.io/) or [Zipkin](https://zipkin.io/).
+
+## Integrations
+
+Platforms and SaaS solutions such as Datadog, logz.io, Dynatrace, or New Relic facilitate the observability challenge for Kubernetes and provide all-in-one SaaS solutions.
+They install agents into the cluster that collect metrics, logs, and tracing information and upload them into the data lake of the platform.
+Technically, the agent-based approach is compatible with Constellation, and attaching these platforms is straightforward.
+However, you need to evaluate if the exported data might violate Constellation's compliance and privacy guarantees by uploading them to a third-party platform.
--- a/docs/versioned_docs/version-2.11/architecture/overview.md
+++ b/docs/versioned_docs/version-2.11/architecture/overview.md
@ -22,3 +22,9 @@ You can learn more about [the images](images.md) and how verified boot ensures t
 ## About key management and cryptographic primitives

 Encryption of data at-rest, in-transit, and in-use is the fundamental building block for confidential computing and Constellation. Learn more about the [keys and cryptographic primitives](keys.md) used in Constellation, [encrypted persistent storage](encrypted-storage.md), and [network encryption](networking.md).
+
+## About observability
+
+Observability in Kubernetes refers to the capability to swiftly troubleshoot issues using telemetry signals such as logs, metrics, and traces.
+In the realm of Confidential Computing, it's crucial that observability aligns with confidentiality, necessitating careful implementation.
+Learn more about the [observability capabilities in Constellation](./observability.md).
--- a/docs/versioned_sidebars/version-2.11-sidebars.json
+++ b/docs/versioned_sidebars/version-2.11-sidebars.json
@ -233,6 +233,11 @@
          "type": "doc",
          "label": "Networking",
          "id": "architecture/networking"
+        },
+        {
+          "type": "doc",
+          "label": "Observability",
+          "id": "architecture/observability"
        }
      ]
    },