Recovery of a Constellation cluster means getting it back into a healthy state after too many concurrent node failures in the control plane.
An unhealthy cluster can have many causes, ranging from a power outage or a planned reboot to the migration of nodes and regions.
Recovery events are rare, because Constellation is built for high availability and automatically and securely replaces failed nodes. When a node is replaced, Constellation's control plane first verifies the new node before it sends the node the cryptographic keys required to decrypt its [state disk](../architecture/
Constellation provides a recovery mechanism for cases where the control plane has failed and is unable to replace nodes.
The `constellation recover` command securely connects to all nodes in need of recovery using [attested TLS](../architecture/ and provides them with the keys to decrypt their state disks and continue booting.
## Identify unhealthy clusters
The first step to recovery is identifying when a cluster becomes unhealthy.
Usually, the first sign is that the Kubernetes API server becomes unresponsive.
You can check the health status of the nodes via the cloud service provider (CSP).
Constellation provides logging information on the boot process and status via [cloud logging](
In the following, you'll find detailed descriptions for identifying clusters stuck in recovery for each CSP.
First, open the AWS console to view all Auto Scaling Groups (ASGs) in the region of your cluster. Select the ASG of the control plane `<cluster-name>-<UID>-control-plane` and check that enough members are in a *Running* state.
Second, check the boot logs of these *Instances*. In the ASG's *Instance management* view, select each instance you want to inspect. In the upper right corner, select **Actions > Monitor and troubleshoot > Get system log**.
In the serial console output, search for `Waiting for decryption key`.
Similar output to the following means your node was restarted and needs to decrypt the [state disk](../architecture/
The node will then try to connect to the [*JoinService*](../architecture/ and obtain the decryption key.
If this fails due to an unhealthy control plane, you will see log messages similar to the following:
```json
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:77","msg":"Received list with JoinService endpoints","endpoints":["",""]}
```
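If you save the serial console output of several instances as text, you can script the check described above. The following is a minimal sketch, not part of Constellation. It assumes the output contains the `Waiting for decryption key` marker and JSON log lines in the format shown above. The function name `find_rejoin_attempts` and the `SAMPLE` text are hypothetical.

```python
import json

# Marker string to search for in the serial console output.
MARKER = "Waiting for decryption key"

# Hypothetical sample: the marker followed by one rejoinClient log line.
SAMPLE = "\n".join([
    "Waiting for decryption key",
    '{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient",'
    '"caller":"rejoinclient/client.go:77",'
    '"msg":"Received list with JoinService endpoints","endpoints":["",""]}',
])


def find_rejoin_attempts(console_output: str) -> dict:
    """Summarize signs that a node is waiting for its state disk key."""
    attempts = []
    for line in console_output.splitlines():
        # Console lines may carry a prefix, so parse from the first "{".
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # Not a JSON log line; skip it.
        if isinstance(entry, dict) and entry.get("logger") == "rejoinClient":
            attempts.append(entry)
    return {
        "waiting_for_key": MARKER in console_output,
        "rejoin_attempts": attempts,
    }


if __name__ == "__main__":
    result = find_rejoin_attempts(SAMPLE)
    print(result["waiting_for_key"], len(result["rejoin_attempts"]))
```

A node that reports the marker and keeps logging `rejoinClient` messages without ever receiving a key is a candidate for manual recovery.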
Once a node has received its keys, for example through `constellation recover`, its log shows messages similar to the following:

```json
{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer","caller":"recoveryserver/server.go:125","msg":"Received state disk key and measurement secret, shutting down server"}
{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer.gRPC","caller":"zap/server_interceptors.go:61","msg":"finished streaming call with code OK","grpc.start_time":"2022-09-08T10:26:59Z","system":"grpc","span.kind":"server","grpc.service":"recoverproto.API","grpc.method":"Recover","peer.address":"","grpc.code":"OK","grpc.time_ms":15.701}
```
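You can also check log lines like the two above programmatically. The sketch below assumes that a `Recover` call finishing with `grpc.code` set to `OK` means the node received its keys, as these log messages suggest. The helper name `recovery_completed` and the `RECOVERY_LOG` sample are hypothetical and not part of Constellation.

```python
import json

# Hypothetical sample: a shortened version of the recoveryServer.gRPC line.
RECOVERY_LOG = (
    '{"level":"INFO","ts":"2022-09-08T10:26:59Z",'
    '"logger":"recoveryServer.gRPC",'
    '"msg":"finished streaming call with code OK",'
    '"grpc.service":"recoverproto.API","grpc.method":"Recover",'
    '"grpc.code":"OK","grpc.time_ms":15.701}'
)


def recovery_completed(log_text: str) -> bool:
    """Return True if the log records a Recover call that ended with OK."""
    for line in log_text.splitlines():
        # Parse from the first "{" to tolerate console prefixes.
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # Not a JSON log line; skip it.
        if not isinstance(entry, dict):
            continue
        if entry.get("grpc.method") == "Recover" and entry.get("grpc.code") == "OK":
            return True
    return False


if __name__ == "__main__":
    print(recovery_completed(RECOVERY_LOG))
```

Matching on the structured fields rather than on the message text keeps the check independent of the exact wording of the log message.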