Update docs to new recover workflow (#100)

Signed-off-by: Daniel Weiße <dw@edgeless.systems>
Daniel Weiße 2022-09-08 14:47:48 +02:00 committed by GitHub
parent 8cb155d5c5
commit 21397bf98b

@@ -44,26 +44,27 @@ In the serial console output search for `Waiting for decryption key`.
Output similar to the following means your node was restarted and needs to decrypt the [state disk](../architecture/images.md#state-disk):
```shell
{"level":"INFO","ts":"2022-08-01T08:02:20Z","caller":"cmd/main.go:46","msg":"Starting disk-mapper","version":"0.0.0","cloudProvider":"azure"}
{"level":"INFO","ts":"2022-08-01T08:02:20Z","logger":"setupManager","caller":"setup/setup.go:57","msg":"Preparing existing state disk"}
{"level":"INFO","ts":"2022-08-01T08:02:20Z","logger":"keyService","caller":"keyservice/keyservice.go:92","msg":"Waiting for decryption key. Listening on: [::]:9000"}
{"level":"INFO","ts":"2022-09-08T09:56:41Z","caller":"cmd/main.go:55","msg":"Starting disk-mapper","version":"2.0.0","cloudProvider":"azure"}
{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"setupManager","caller":"setup/setup.go:72","msg":"Preparing existing state disk"}
{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"recoveryServer","caller":"recoveryserver/server.go:59","msg":"Starting RecoveryServer"}
{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"rejoinClient","caller":"rejoinclient/client.go:65","msg":"Starting RejoinClient"}
```
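If you prefer the CLI over the Azure portal, the serial console output can also be retrieved with the Azure CLI. This is only a sketch, not part of the documented workflow: the resource group and instance name are placeholders, and it assumes boot diagnostics are enabled on the VM. Grepping for `disk-mapper` matches the log lines shown above:

```bash
# Fetch the VM's boot diagnostics (serial console) log and search it
# for the disk-mapper messages shown above. Names are placeholders.
az vm boot-diagnostics get-boot-log \
  --resource-group <resource-group> \
  --name <cluster-name>-control-plane-<suffix> \
  | grep "disk-mapper"
```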
The node will then try to connect to the [*JoinService*](../architecture/components.md#joinservice) and obtain the decryption key.
If that fails because the control plane is unhealthy, you'll see log messages similar to the following:
```shell
{"level":"INFO","ts":"2022-08-01T08:02:21Z","logger":"keyService","caller":"keyservice/keyservice.go:118","msg":"Received list with JoinService endpoints: [10.9.0.5:30090 10.9.0.6:30090 10.9.0.7:30090 10.9.0.8:30090 10.9.0.9:30090 10.9.0.10:30090 10.9.0.11:30090 10.9.0.12:30090 10.9.0.13:30090 10.9.0.14:30090 10.9.0.15:30090 10.9.0.16:30090 10.9.0.17:30090 10.9.0.18:30090 10.9.0.19:30090 10.9.0.20:30090 10.9.0.21:30090 10.9.0.22:30090 10.9.0.23:30090]"}
{"level":"INFO","ts":"2022-08-01T08:02:21Z","logger":"keyService","caller":"keyservice/keyservice.go:145","msg":"Requesting rejoin ticket","endpoint":"10.9.0.5:30090"}
{"level":"ERROR","ts":"2022-08-01T08:02:21Z","logger":"keyService","caller":"keyservice/keyservice.go:148","msg":"Failed to request rejoin ticket","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.9.0.5:30090: connect: connection refused\"","endpoint":"10.9.0.5:30090"}
{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"rejoinClient","caller":"rejoinclient/client.go:77","msg":"Received list with JoinService endpoints","endpoints":["10.9.0.5:30090","10.9.0.6:30090"]}
{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"10.9.0.5:30090"}
{"level":"WARN","ts":"2022-09-08T09:57:03Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.9.0.5:30090: i/o timeout\"","endpoint":"10.9.0.5:30090"}
{"level":"INFO","ts":"2022-09-08T09:57:03Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"10.9.0.6:30090"}
{"level":"WARN","ts":"2022-09-08T09:57:23Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.9.0.6:30090: i/o timeout\"","endpoint":"10.9.0.6:30090"}
{"level":"ERROR","ts":"2022-09-08T09:57:23Z","logger":"rejoinClient","caller":"rejoinclient/client.go:110","msg":"Failed to rejoin on all endpoints"}
```
That means you have to recover that node manually.
-Before you continue with the [recovery process](#recover-your-cluster) you need to know the node's IP address and state disk's UUID.
+Before you continue with the [recovery process](#recover-your-cluster), you need to know the node's IP address.
For the IP address, return to the instance's *Overview* page and find the *Private IP address*.
-For the UUID open the [Cloud logging](troubleshooting.md#azure) explorer.
-Type `traces | where message contains "Disk UUID"` and click `Run`.
-Find the entry corresponding to that instance `{"instance-name":"<cluster-name>-control-plane-<suffix>"}` and take the UUID from the message field `Disk UUID: <UUID>`.
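The private IP address can also be read with the Azure CLI instead of the portal. A sketch with placeholder resource group and instance names (not part of the documented workflow):

```bash
# Print the first private IP address of the control-plane VM.
az vm list-ip-addresses \
  --resource-group <resource-group> \
  --name <cluster-name>-control-plane-<suffix> \
  --query "[0].virtualMachine.network.privateIpAddresses[0]" \
  --output tsv
```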
</tabItem>
<tabItem value="gcp" label="GCP" default>
@@ -81,25 +82,28 @@ In the serial console output search for `Waiting for decryption key`.
Output similar to the following means your node was restarted and needs to decrypt the [state disk](../architecture/images.md#state-disk):
```shell
{"level":"INFO","ts":"2022-07-29T09:45:55Z","caller":"cmd/main.go:46","msg":"Starting disk-mapper","version":"0.0.0","cloudProvider":"gcp"}
{"level":"INFO","ts":"2022-07-29T09:45:55Z","logger":"setupManager","caller":"setup/setup.go:57","msg":"Preparing existing state disk"}
{"level":"INFO","ts":"2022-07-29T09:45:55Z","logger":"keyService","caller":"keyservice/keyservice.go:92","msg":"Waiting for decryption key. Listening on: [::]:9000"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","caller":"cmd/main.go:55","msg":"Starting disk-mapper","version":"2.0.0","cloudProvider":"gcp"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"setupManager","caller":"setup/setup.go:72","msg":"Preparing existing state disk"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:65","msg":"Starting RejoinClient"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"recoveryServer","caller":"recoveryserver/server.go:59","msg":"Starting RecoveryServer"}
```
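As on Azure, the serial console can be read from the command line. A sketch using the gcloud CLI; instance name and zone are placeholders:

```bash
# Dump the instance's serial port output and search it for the
# disk-mapper messages shown above.
gcloud compute instances get-serial-port-output <instance-name> \
  --zone <zone> | grep "disk-mapper"
```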
The node will then try to connect to the [*JoinService*](../architecture/components.md#joinservice) and obtain the decryption key.
If that fails because the control plane is unhealthy, you'll see log messages similar to the following:
```shell
{"level":"INFO","ts":"2022-07-29T09:46:15Z","logger":"keyService","caller":"keyservice/keyservice.go:118","msg":"Received list with JoinService endpoints: [192.168.178.2:30090]"}
{"level":"INFO","ts":"2022-07-29T09:46:15Z","logger":"keyService","caller":"keyservice/keyservice.go:145","msg":"Requesting rejoin ticket","endpoint":"192.168.178.2:30090"}
{"level":"ERROR","ts":"2022-07-29T09:46:15Z","logger":"keyService","caller":"keyservice/keyservice.go:148","msg":"Failed to request rejoin ticket","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.178.2:30090: connect: connection refused\"","endpoint":"192.168.178.2:30090"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:77","msg":"Received list with JoinService endpoints","endpoints":["192.168.178.4:30090","192.168.178.2:30090"]}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"192.168.178.4:30090"}
{"level":"WARN","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.178.4:30090: connect: connection refused\"","endpoint":"192.168.178.4:30090"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"192.168.178.2:30090"}
{"level":"WARN","ts":"2022-09-08T10:22:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.178.2:30090: i/o timeout\"","endpoint":"192.168.178.2:30090"}
{"level":"ERROR","ts":"2022-09-08T10:22:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:110","msg":"Failed to rejoin on all endpoints"}
```
That means you have to recover that node manually.
-Before you continue with the [recovery process](#recover-your-cluster) you need to know the node's IP address and state disk's UUID.
+Before you continue with the [recovery process](#recover-your-cluster), you need to know the node's IP address.
For the IP address, go to the *"VM Instance" -> "Network interfaces"* page and take the address from *"Primary internal IP address"*.
-For the UUID open the [Cloud logging](troubleshooting.md#cloud-logging) explorer, you'll find that right above the serial console link (see the picture above).
-Search for `Disk UUID: <UUID>`.
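Alternatively, the primary internal IP can be read with the gcloud CLI. A sketch with placeholder instance name and zone:

```bash
# Print the primary internal IP of the control-plane instance.
gcloud compute instances describe <instance-name> \
  --zone <zone> \
  --format='get(networkInterfaces[0].networkIP)'
```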
</tabItem>
</tabs>
@@ -114,10 +118,9 @@ From there, your cluster will auto heal the remaining 2 control-plane nodes and
Recovering a node requires the following parameters:
* The node's IP address
-* The node's state disk UUID
* Access to the master secret of the cluster
-See the [Identify unhealthy clusters](#identify-unhealthy-clusters) description of how to obtain the node's IP address and state disk UUID.
+See [Identify unhealthy clusters](#identify-unhealthy-clusters) for how to obtain the node's IP address.
Note that the recovery command needs to connect to the recovering nodes.
Nodes only have private IP addresses in the cluster's VPC; hence, the command needs to be issued from within the cluster's VPC network.
The easiest approach is to set up a jump host connected to the VPC network and perform the recovery from there.
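From the jump host, you can sanity-check that the recovering node is reachable before running the recovery. This is a sketch only: port 9000 is where the disk-mapper's key service listened in the pre-2.0 logs above, so confirm the port for your version, and the node IP is a placeholder:

```bash
# Probe TCP reachability of the recovering node from the jump host.
# Port 9000 is taken from the pre-2.0 logs above; verify for your version.
nc -zv <node-ip> 9000
```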
@@ -125,23 +128,17 @@ The easiest approach is to set up a jump host connected to the VPC network and p
Given these prerequisites, a node can be recovered like this:
```bash
-$ constellation recover -e 34.107.89.208 --disk-uuid b27f817c-6799-4c0d-81d8-57abc8386b70 --master-secret constellation-mastersecret.json
+$ constellation recover -e 34.107.89.208 --master-secret constellation-mastersecret.json
Pushed recovery key.
```
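If more than one control-plane node is stuck, the same command can be repeated per node. A sketch with hypothetical node IPs (taken from the example logs above), using only the flags documented here:

```bash
# Recover each unhealthy control-plane node in turn.
for ip in 10.9.0.5 10.9.0.6; do
  constellation recover -e "$ip" --master-secret constellation-mastersecret.json
done
```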
In the serial console output of the node you'll see output similar to the following:
```shell
-[ 3225.621753] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
-[ 3225.628807] EXT4-fs (dm-1): write access will be enabled during recovery
-[ 3226.295816] EXT4-fs (dm-1): recovery complete
-[ 3226.301618] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
-[ 3226.338157] systemd[1]: run-state.mount: Deactivated successfully.
-[ OK [[ 3226.347833] systemd[1]: Finished Prepare encrypted state disk.
-0m] Finished Prepare encrypted state disk.
-Startin[ 3226.363705] systemd[1]: Starting OSTree Prepare OS/...
-g OSTre[ 3226.370625] ostree-prepare-root[939]: preparing sysroot at /sysroot
-e Prepare OS/...
+{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer","caller":"recoveryserver/server.go:93","msg":"Received recover call"}
+{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer","caller":"recoveryserver/server.go:125","msg":"Received state disk key and measurement secret, shutting down server"}
+{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer.gRPC","caller":"zap/server_interceptors.go:61","msg":"finished streaming call with code OK","grpc.start_time":"2022-09-08T10:26:59Z","system":"grpc","span.kind":"server","grpc.service":"recoverproto.API","grpc.method":"Recover","peer.address":"192.0.2.3:41752","grpc.code":"OK","grpc.time_ms":15.701}
+{"level":"INFO","ts":"2022-09-08T10:27:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:87","msg":"RejoinClient stopped"}
```
After enough control-plane nodes have been recovered and the Kubernetes cluster becomes healthy again, the rest of the cluster will start auto-healing using the mechanism described above.
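One way to confirm that the cluster has healed, assuming you have kubeconfig access to it:

```bash
# All nodes should eventually report Ready once auto-healing completes.
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=10m
```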