docs: add observability page (#2384)

Co-authored-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>
Co-authored-by: 3u13r <lc@edgeless.systems>
Co-authored-by: Thomas Tendyck <51411342+thomasten@users.noreply.github.com>
This commit is contained in:
Moritz Eckert 2023-10-04 09:37:46 +02:00 committed by GitHub
parent e938cc5e63
commit 7c76592a08
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
7 changed files with 187 additions and 1 deletions

View File

@ -0,0 +1,78 @@
# Observability
In Kubernetes, observability is the ability to gain insight into the behavior and performance of applications.
It helps identify and resolve issues more effectively, ensuring stability and performance of Kubernetes workloads, reducing downtime and outages, and improving efficiency.
The "three pillars of observability" are logs, metrics, and traces.
In the context of Confidential Computing, observability is a delicate subject and needs to be applied such that it doesn't leak any sensitive information.
The following gives an overview of where and how you can apply standard observability tools in Constellation.
## Cloud resource monitoring
While inaccessible, Constellation's nodes are still visible as black box VMs to the hypervisor.
Resource consumption, such as memory and CPU utilization, can be monitored from the outside and observed via the cloud platforms directly.
Similarly, other resources, such as storage and network and their respective metrics, are visible via the cloud platform.
## Metrics
Metrics are numeric representations of data measured over intervals of time. They're essential for understanding system health and gaining insights using telemetry signals.
By default, Constellation exposes the [metrics for Kubernetes system components](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/) inside the cluster.
Similarly, the [etcd metrics](https://etcd.io/docs/v3.5/metrics/) endpoints are exposed inside the cluster.
These [metrics endpoints can be disabled](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#disabling-metrics).
You can collect these cluster-internal metrics via tools such as [Prometheus](https://prometheus.io/) or the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
Constellation's CNI Cilium also supports [metrics via Prometheus endpoints](https://docs.cilium.io/en/latest/observability/metrics/).
However, in Constellation, they're disabled by default and must be enabled first.
## Logs
Logs represent discrete events that usually describe what's happening with your service.
The payload is an actual message emitted from your system along with a metadata section containing a timestamp, labels, and tracking identifiers.
### System logs
Constellation uses cloud logging for events occurring during the early stages of a node's boot process.
These logs include [Bootstrapper](./microservices.md#bootstrapper) events and [state disk UUIDs](../architecture/images.md#state-disk).
You can access the cloud logging [directly via the cloud provider endpoints](../workflows/troubleshooting.md#cloud-logging).
More detailed system-level logs are accessible via `/var/log` and [journald](https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html) on the nodes directly.
They can be collected from there, for example, via [Filebeat and Logstash](https://www.elastic.co/guide/en/beats/filebeat/current/logstash-output.html), which are tools of the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
In case of an error during the initialization, the CLI automatically collects the [Bootstrapper](./microservices.md#bootstrapper) logs and returns these as a file for [troubleshooting](../workflows/troubleshooting.md). Here is an example of such an event:
```shell-session
Cluster initialization failed. This error is not recoverable.
Terminate your cluster and try again.
Fetched bootstrapper logs are stored in "constellation-cluster.log"
```
### Kubernetes logs
Constellation supports the [Kubernetes logging architecture](https://kubernetes.io/docs/concepts/cluster-administration/logging/).
By default, logs are written to the nodes' encrypted state disks.
These include the Pod and container logs and the [system component logs](https://kubernetes.io/docs/concepts/cluster-administration/logging/#system-component-logs).
[Constellation services](microservices.md) run as Pods inside the `kube-system` namespace and use the standard container logging mechanism.
The same applies for the [Cilium Pods](https://docs.cilium.io/en/latest/operations/troubleshooting/#logs).
You can collect logs from within the cluster via tools such as [Fluentd](https://github.com/fluent/fluentd), [Loki](https://github.com/grafana/loki), or the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
## Traces
Modern systems are implemented as interconnected complex and distributed microservices. Understanding request flows and system communications is challenging, mainly because all systems in a chain need to be modified to propagate tracing information. Distributed tracing is a new approach to increasing observability and understanding performance bottlenecks. A trace represents consecutive events that reflect an end-to-end request path in a distributed system.
Constellation supports [traces for Kubernetes system components](https://kubernetes.io/docs/concepts/cluster-administration/system-traces/).
By default, they're disabled and need to be enabled first.
Similarly, Cilium can be enabled to [export traces](https://cilium.io/use-cases/metrics-export/).
You can collect these traces via tools such as [Jaeger](https://www.jaegertracing.io/) or [Zipkin](https://zipkin.io/).
## Integrations
Platforms and SaaS solutions such as Datadog, logz.io, Dynatrace, or New Relic facilitate the observability challenge for Kubernetes and provide all-in-one SaaS solutions.
They install agents into the cluster that collect metrics, logs, and tracing information and upload them into the data lake of the platform.
Technically, the agent-based approach is compatible with Constellation, and attaching these platforms is straightforward.
However, you need to evaluate if the exported data might violate Constellation's compliance and privacy guarantees by uploading them to a third-party platform.

View File

@ -22,3 +22,9 @@ You can learn more about [the images](images.md) and how verified boot ensures t
## About key management and cryptographic primitives
Encryption of data at-rest, in-transit, and in-use is the fundamental building block for confidential computing and Constellation. Learn more about the [keys and cryptographic primitives](keys.md) used in Constellation, [encrypted persistent storage](encrypted-storage.md), and [network encryption](networking.md).
## About observability
Observability in Kubernetes refers to the capability to swiftly troubleshoot issues using telemetry signals such as logs, metrics, and traces.
In the realm of Confidential Computing, it's crucial that observability aligns with confidentiality, necessitating careful implementation.
Learn more about the [observability capabilities in Constellation](./observability.md).

View File

@ -254,6 +254,11 @@ const sidebars = {
label: 'Networking',
id: 'architecture/networking',
},
{
type: 'doc',
label: 'Observability',
id: 'architecture/observability',
},
],
},
{

View File

@ -15,11 +15,15 @@ Bootstrapper
config
cyber
datacenter
Datadog
deallocate
Dockerfile
Dynatrace
[Ee]mojivoto
etcd
Filebeat
Filestore
Fluentd
Fulcio
Mbps
Gbps
@ -30,12 +34,14 @@ iam
IAM
iodepth
initramfs
journald
[Kk]3s
Kata
kubeadm
kubectl
kubelet
libcryptsetup
Logstash
MicroK8s
[Mm]inikube
namespace
@ -45,11 +51,13 @@ Rekor
resizable
rollout
sigstore
[Ss]uperset
Syft
systemd
[Uu]nencrypted
unspoofable
updatable
UUID
proxied
QEMU
virsh
@ -58,4 +66,4 @@ whitepaper
WireGuard
Xeon
xsltproc
[Ss]uperset
Zipkin

View File

@ -0,0 +1,78 @@
# Observability
In Kubernetes, observability is the ability to gain insight into the behavior and performance of applications.
It helps identify and resolve issues more effectively, ensuring stability and performance of Kubernetes workloads, reducing downtime and outages, and improving efficiency.
The "three pillars of observability" are logs, metrics, and traces.
In the context of Confidential Computing, observability is a delicate subject and needs to be applied such that it doesn't leak any sensitive information.
The following gives an overview of where and how you can apply standard observability tools in Constellation.
## Cloud resource monitoring
While inaccessible, Constellation's nodes are still visible as black box VMs to the hypervisor.
Resource consumption, such as memory and CPU utilization, can be monitored from the outside and observed via the cloud platforms directly.
Similarly, other resources, such as storage and network and their respective metrics, are visible via the cloud platform.
## Metrics
Metrics are numeric representations of data measured over intervals of time. They're essential for understanding system health and gaining insights using telemetry signals.
By default, Constellation exposes the [metrics for Kubernetes system components](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/) inside the cluster.
Similarly, the [etcd metrics](https://etcd.io/docs/v3.5/metrics/) endpoints are exposed inside the cluster.
These [metrics endpoints can be disabled](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#disabling-metrics).
You can collect these cluster-internal metrics via tools such as [Prometheus](https://prometheus.io/) or the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
Constellation's CNI Cilium also supports [metrics via Prometheus endpoints](https://docs.cilium.io/en/latest/observability/metrics/).
However, in Constellation, they're disabled by default and must be enabled first.
## Logs
Logs represent discrete events that usually describe what's happening with your service.
The payload is an actual message emitted from your system along with a metadata section containing a timestamp, labels, and tracking identifiers.
### System logs
Constellation uses cloud logging for events occurring during the early stages of a node's boot process.
These logs include [Bootstrapper](./microservices.md#bootstrapper) events and [state disk UUIDs](../architecture/images.md#state-disk).
You can access the cloud logging [directly via the cloud provider endpoints](../workflows/troubleshooting.md#cloud-logging).
More detailed system-level logs are accessible via `/var/log` and [journald](https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html) on the nodes directly.
They can be collected from there, for example, via [Filebeat and Logstash](https://www.elastic.co/guide/en/beats/filebeat/current/logstash-output.html), which are tools of the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
In case of an error during the initialization, the CLI automatically collects the [Bootstrapper](./microservices.md#bootstrapper) logs and returns these as a file for [troubleshooting](../workflows/troubleshooting.md). Here is an example of such an event:
```shell-session
Cluster initialization failed. This error is not recoverable.
Terminate your cluster and try again.
Fetched bootstrapper logs are stored in "constellation-cluster.log"
```
### Kubernetes logs
Constellation supports the [Kubernetes logging architecture](https://kubernetes.io/docs/concepts/cluster-administration/logging/).
By default, logs are written to the nodes' encrypted state disks.
These include the Pod and container logs and the [system component logs](https://kubernetes.io/docs/concepts/cluster-administration/logging/#system-component-logs).
[Constellation services](microservices.md) run as Pods inside the `kube-system` namespace and use the standard container logging mechanism.
The same applies for the [Cilium Pods](https://docs.cilium.io/en/latest/operations/troubleshooting/#logs).
You can collect logs from within the cluster via tools such as [Fluentd](https://github.com/fluent/fluentd), [Loki](https://github.com/grafana/loki), or the [Elastic Stack](https://www.elastic.co/de/elastic-stack/).
## Traces
Modern systems are implemented as interconnected complex and distributed microservices. Understanding request flows and system communications is challenging, mainly because all systems in a chain need to be modified to propagate tracing information. Distributed tracing is a new approach to increasing observability and understanding performance bottlenecks. A trace represents consecutive events that reflect an end-to-end request path in a distributed system.
Constellation supports [traces for Kubernetes system components](https://kubernetes.io/docs/concepts/cluster-administration/system-traces/).
By default, they're disabled and need to be enabled first.
Similarly, Cilium can be enabled to [export traces](https://cilium.io/use-cases/metrics-export/).
You can collect these traces via tools such as [Jaeger](https://www.jaegertracing.io/) or [Zipkin](https://zipkin.io/).
## Integrations
Platforms and SaaS solutions such as Datadog, logz.io, Dynatrace, or New Relic facilitate the observability challenge for Kubernetes and provide all-in-one SaaS solutions.
They install agents into the cluster that collect metrics, logs, and tracing information and upload them into the data lake of the platform.
Technically, the agent-based approach is compatible with Constellation, and attaching these platforms is straightforward.
However, you need to evaluate if the exported data might violate Constellation's compliance and privacy guarantees by uploading them to a third-party platform.

View File

@ -22,3 +22,9 @@ You can learn more about [the images](images.md) and how verified boot ensures t
## About key management and cryptographic primitives
Encryption of data at-rest, in-transit, and in-use is the fundamental building block for confidential computing and Constellation. Learn more about the [keys and cryptographic primitives](keys.md) used in Constellation, [encrypted persistent storage](encrypted-storage.md), and [network encryption](networking.md).
## About observability
Observability in Kubernetes refers to the capability to swiftly troubleshoot issues using telemetry signals such as logs, metrics, and traces.
In the realm of Confidential Computing, it's crucial that observability aligns with confidentiality, necessitating careful implementation.
Learn more about the [observability capabilities in Constellation](./observability.md).

View File

@ -233,6 +233,11 @@
"type": "doc",
"label": "Networking",
"id": "architecture/networking"
},
{
"type": "doc",
"label": "Observability",
"id": "architecture/observability"
}
]
},