AB#2262 Automatic recovery (#158)

* Update `constellation recover` to be fully automated
* Update recovery docs

Signed-off-by: Daniel Weiße <dw@edgeless.systems>
2025-07-23 23:40:44 -04:00 · 2022-09-26 09:57:40 +02:00 · 30f0554168
commit 30f0554168
parent 4f596cde3c
6 changed files with 408 additions and 240 deletions
--- a/docs/docs/workflows/recovery.md
+++ b/docs/docs/workflows/recovery.md
@@ -51,7 +51,7 @@ If that fails, because the control plane is unhealthy, you will see log messages
 {"level":"ERROR","ts":"2022-09-08T09:57:23Z","logger":"rejoinClient","caller":"rejoinclient/client.go:110","msg":"Failed to rejoin on all endpoints"}
 ```

-This means that you have to recover the node manually. For this, you need its IP address, which can be obtained from the *Overview* page under *Private IP address*.
+This means that you have to recover the node manually.

 </tabItem>
 <tabItem value="gcp" label="GCP">
@@ -88,33 +88,26 @@ If that fails, because the control plane is unhealthy, you will see log messages
 {"level":"ERROR","ts":"2022-09-08T10:22:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:110","msg":"Failed to rejoin on all endpoints"}
 ```

-This means that you have to recover the node manually. For this, you need its IP address, which can be obtained from the *"VM Instance" -> "network interfaces"* page under *"Primary internal IP address."*
+This means that you have to recover the node manually.

 </tabItem>
 </tabs>

 ## Recover your cluster

-The following process needs to be repeated until a [member quorum for etcd](https://etcd.io/docs/v3.5/faq/#what-is-failure-tolerance) is established.
-For example, assume you have 5 control-plane nodes in your cluster and 4 of them have been rebooted due to a maintenance downtime in the cloud environment.
-You have to run through the following process for 2 of these nodes and recover them manually to recover the quorum.
-From there, your cluster will auto heal the remaining 2 control-plane nodes and the rest of your cluster.
+Recovering a cluster requires the following parameters:

-Recovering a node requires the following parameters:
-
-* The node's IP address
+* The `constellation-id.json` file in your working directory or the cluster's load balancer IP address
 * Access to the master secret of the cluster

-See the [Identify unhealthy clusters](#identify-unhealthy-clusters) description of how to obtain the node's IP address.
-Note that the recovery command needs to connect to the recovering nodes.
-Nodes only have private IP addresses in the VPC of the cluster, hence, the command needs to be issued from within the VPC network of the cluster.
-The easiest approach is to set up a jump host connected to the VPC network and perform the recovery from there.
+A cluster can be recovered like this:

-Given these prerequisites a node can be recovered like this:
-
-```
-$ constellation recover -e 34.107.89.208 --master-secret constellation-mastersecret.json
+```bash
+$ constellation recover --master-secret constellation-mastersecret.json
 Pushed recovery key.
+Pushed recovery key.
+Pushed recovery key.
+Recovered 3 control-plane nodes.
 ```

 In the serial console output of the node you'll see a similar output to the following: