Update docs to new recover workflow (#100)
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
commit 21397bf98b (parent 8cb155d5c5)
@@ -44,26 +44,27 @@ In the serial console output search for `Waiting for decryption key`.
Similar output to the following means your node was restarted and needs to decrypt the [state disk](../architecture/images.md#state-disk):

```shell
-{"level":"INFO","ts":"2022-08-01T08:02:20Z","caller":"cmd/main.go:46","msg":"Starting disk-mapper","version":"0.0.0","cloudProvider":"azure"}
-{"level":"INFO","ts":"2022-08-01T08:02:20Z","logger":"setupManager","caller":"setup/setup.go:57","msg":"Preparing existing state disk"}
-{"level":"INFO","ts":"2022-08-01T08:02:20Z","logger":"keyService","caller":"keyservice/keyservice.go:92","msg":"Waiting for decryption key. Listening on: [::]:9000"}
+{"level":"INFO","ts":"2022-09-08T09:56:41Z","caller":"cmd/main.go:55","msg":"Starting disk-mapper","version":"2.0.0","cloudProvider":"azure"}
+{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"setupManager","caller":"setup/setup.go:72","msg":"Preparing existing state disk"}
+{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"recoveryServer","caller":"recoveryserver/server.go:59","msg":"Starting RecoveryServer"}
+{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"rejoinClient","caller":"rejoinclient/client.go:65","msg":"Starting RejoinClient"}
```

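If you prefer the command line over the portal, the serial console output can also be retrieved with the Azure CLI. The following is only a sketch: it assumes boot diagnostics are enabled and that the node is addressable as a regular VM (scale set instances may need a different lookup); `<resource-group>` and the instance name are placeholders.

```bash
# Sketch: fetch the instance's boot/serial log and search for the disk-mapper marker.
# Placeholders: <resource-group>, <cluster-name>-control-plane-<suffix>.
az vm boot-diagnostics get-boot-log \
  --resource-group <resource-group> \
  --name <cluster-name>-control-plane-<suffix> \
  | grep "Waiting for decryption key"
```
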
The node will then try to connect to the [*JoinService*](../architecture/components.md#joinservice) and obtain the decryption key.
If that fails, because the control plane is unhealthy, you will see log messages similar to the following:

```shell
-{"level":"INFO","ts":"2022-08-01T08:02:21Z","logger":"keyService","caller":"keyservice/keyservice.go:118","msg":"Received list with JoinService endpoints: [10.9.0.5:30090 10.9.0.6:30090 10.9.0.7:30090 10.9.0.8:30090 10.9.0.9:30090 10.9.0.10:30090 10.9.0.11:30090 10.9.0.12:30090 10.9.0.13:30090 10.9.0.14:30090 10.9.0.15:30090 10.9.0.16:30090 10.9.0.17:30090 10.9.0.18:30090 10.9.0.19:30090 10.9.0.20:30090 10.9.0.21:30090 10.9.0.22:30090 10.9.0.23:30090]"}
-{"level":"INFO","ts":"2022-08-01T08:02:21Z","logger":"keyService","caller":"keyservice/keyservice.go:145","msg":"Requesting rejoin ticket","endpoint":"10.9.0.5:30090"}
-{"level":"ERROR","ts":"2022-08-01T08:02:21Z","logger":"keyService","caller":"keyservice/keyservice.go:148","msg":"Failed to request rejoin ticket","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.9.0.5:30090: connect: connection refused\"","endpoint":"10.9.0.5:30090"}
+{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"rejoinClient","caller":"rejoinclient/client.go:77","msg":"Received list with JoinService endpoints","endpoints":["10.9.0.5:30090","10.9.0.6:30090"]}
+{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"10.9.0.5:30090"}
+{"level":"WARN","ts":"2022-09-08T09:57:03Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.9.0.5:30090: i/o timeout\"","endpoint":"10.9.0.5:30090"}
+{"level":"INFO","ts":"2022-09-08T09:57:03Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"10.9.0.6:30090"}
+{"level":"WARN","ts":"2022-09-08T09:57:23Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.9.0.6:30090: i/o timeout\"","endpoint":"10.9.0.6:30090"}
+{"level":"ERROR","ts":"2022-09-08T09:57:23Z","logger":"rejoinClient","caller":"rejoinclient/client.go:110","msg":"Failed to rejoin on all endpoints"}
```

That means you have to recover that node manually.
-Before you continue with the [recovery process](#recover-your-cluster) you need to know the node's IP address and state disk's UUID.
+Before you continue with the [recovery process](#recover-your-cluster) you need to know the node's IP address.
For the IP address, return to the instances *Overview* page and find the *Private IP address*.
-For the UUID open the [Cloud logging](troubleshooting.md#azure) explorer.
-Type `traces | where message contains "Disk UUID"` and click `Run`.
-Find the entry corresponding to that instance `{"instance-name":"<cluster-name>-control-plane-<suffix>"}` and take the UUID from the message field `Disk UUID: <UUID>`.

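The private IP address can also be read with the Azure CLI. This is a sketch under the assumption that the control-plane nodes run in a scale set; `<resource-group>` and `<scale-set-name>` are placeholders, and the exact query path may vary with CLI version.

```bash
# Sketch: list the private IP addresses of all instances in the control-plane scale set.
az vmss nic list \
  --resource-group <resource-group> \
  --vmss-name <scale-set-name> \
  --query "[].ipConfigurations[].privateIpAddress" \
  --output tsv
```
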
</tabItem>
<tabItem value="gcp" label="GCP" default>
@@ -81,25 +82,28 @@ In the serial console output search for `Waiting for decryption key`.
Similar output to the following means your node was restarted and needs to decrypt the [state disk](../architecture/images.md#state-disk):

```shell
-{"level":"INFO","ts":"2022-07-29T09:45:55Z","caller":"cmd/main.go:46","msg":"Starting disk-mapper","version":"0.0.0","cloudProvider":"gcp"}
-{"level":"INFO","ts":"2022-07-29T09:45:55Z","logger":"setupManager","caller":"setup/setup.go:57","msg":"Preparing existing state disk"}
-{"level":"INFO","ts":"2022-07-29T09:45:55Z","logger":"keyService","caller":"keyservice/keyservice.go:92","msg":"Waiting for decryption key. Listening on: [::]:9000"}
+{"level":"INFO","ts":"2022-09-08T10:21:53Z","caller":"cmd/main.go:55","msg":"Starting disk-mapper","version":"2.0.0","cloudProvider":"gcp"}
+{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"setupManager","caller":"setup/setup.go:72","msg":"Preparing existing state disk"}
+{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:65","msg":"Starting RejoinClient"}
+{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"recoveryServer","caller":"recoveryserver/server.go:59","msg":"Starting RecoveryServer"}
```

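On GCP the serial console output can also be pulled with `gcloud` and searched directly; the instance name and zone below are placeholders.

```bash
# Sketch: print the instance's serial console output and search for the disk-mapper marker.
gcloud compute instances get-serial-port-output <instance-name> \
  --zone <zone> \
  | grep "Waiting for decryption key"
```
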
The node will then try to connect to the [*JoinService*](../architecture/components.md#joinservice) and obtain the decryption key.
If that fails, because the control plane is unhealthy, you will see log messages similar to the following:

```shell
-{"level":"INFO","ts":"2022-07-29T09:46:15Z","logger":"keyService","caller":"keyservice/keyservice.go:118","msg":"Received list with JoinService endpoints: [192.168.178.2:30090]"}
-{"level":"INFO","ts":"2022-07-29T09:46:15Z","logger":"keyService","caller":"keyservice/keyservice.go:145","msg":"Requesting rejoin ticket","endpoint":"192.168.178.2:30090"}
-{"level":"ERROR","ts":"2022-07-29T09:46:15Z","logger":"keyService","caller":"keyservice/keyservice.go:148","msg":"Failed to request rejoin ticket","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.178.2:30090: connect: connection refused\"","endpoint":"192.168.178.2:30090"}
+{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:77","msg":"Received list with JoinService endpoints","endpoints":["192.168.178.4:30090","192.168.178.2:30090"]}
+{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"192.168.178.4:30090"}
+{"level":"WARN","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.178.4:30090: connect: connection refused\"","endpoint":"192.168.178.4:30090"}
+{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"192.168.178.2:30090"}
+{"level":"WARN","ts":"2022-09-08T10:22:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.178.2:30090: i/o timeout\"","endpoint":"192.168.178.2:30090"}
+{"level":"ERROR","ts":"2022-09-08T10:22:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:110","msg":"Failed to rejoin on all endpoints"}
```

That means you have to recover that node manually.
-Before you continue with the [recovery process](#recover-your-cluster) you need to know the node's IP address and state disk's UUID.
+Before you continue with the [recovery process](#recover-your-cluster) you need to know the node's IP address.
For the IP address go to the *"VM Instance" -> "network interfaces"* page and take the address from *"Primary internal IP address."*
-For the UUID open the [Cloud logging](troubleshooting.md#cloud-logging) explorer, you'll find that right above the serial console link (see the picture above).
-Search for `Disk UUID: <UUID>`.

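The primary internal IP address can also be read with `gcloud`; instance name and zone are placeholders.

```bash
# Sketch: read the primary internal IP address (networkIP) of the instance.
gcloud compute instances describe <instance-name> \
  --zone <zone> \
  --format='get(networkInterfaces[0].networkIP)'
```
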
</tabItem>
</tabs>
@@ -114,10 +118,9 @@ From there, your cluster will auto heal the remaining 2 control-plane nodes and
Recovering a node requires the following parameters:

* The node's IP address
-* The node's state disk UUID
* Access to the master secret of the cluster

-See the [Identify unhealthy clusters](#identify-unhealthy-clusters) description of how to obtain the node's IP address and state disk UUID.
+See the [Identify unhealthy clusters](#identify-unhealthy-clusters) description of how to obtain the node's IP address.
Note that the recovery command needs to connect to the recovering nodes.
Nodes only have private IP addresses in the VPC of the cluster, hence, the command needs to be issued from within the VPC network of the cluster.
The easiest approach is to set up a jump host connected to the VPC network and perform the recovery from there.
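As a sketch of the jump host approach, assuming a host reachable via SSH and attached to the cluster's VPC, the CLI and master secret can be copied over and the recovery run remotely; `<user>`, `<jump-host>`, `<node-ip>`, and the paths are placeholders.

```bash
# Sketch: run the recovery from a jump host inside the cluster's VPC.
scp constellation constellation-mastersecret.json <user>@<jump-host>:
ssh <user>@<jump-host> "./constellation recover -e <node-ip> --master-secret constellation-mastersecret.json"
```
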
@@ -125,23 +128,17 @@ The easiest approach is to set up a jump host connected to the VPC network and p
Given these prerequisites a node can be recovered like this:

```bash
-$ constellation recover -e 34.107.89.208 --disk-uuid b27f817c-6799-4c0d-81d8-57abc8386b70 --master-secret constellation-mastersecret.json
+$ constellation recover -e 34.107.89.208 --master-secret constellation-mastersecret.json
Pushed recovery key.
```

In the serial console output of the node you'll see a similar output to the following:

```shell
-[ 3225.621753] EXT4-fs (dm-1): INFO: recovery required on readonly filesystem
-[ 3225.628807] EXT4-fs (dm-1): write access will be enabled during recovery
-[ 3226.295816] EXT4-fs (dm-1): recovery complete
-[ 3226.301618] EXT4-fs (dm-1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
-[ 3226.338157] systemd[1]: run-state.mount: Deactivated successfully.
-[[0;32m  OK  [[ 3226.347833] systemd[1]: Finished Prepare encrypted state disk.
-0m] Finished [0;1;39mPrepare encrypted state disk[0m.
-Startin[ 3226.363705] systemd[1]: Starting OSTree Prepare OS/...
-g [0;1;39mOSTre[ 3226.370625] ostree-prepare-root[939]: preparing sysroot at /sysroot
-e Prepare OS/[0m...
+{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer","caller":"recoveryserver/server.go:93","msg":"Received recover call"}
+{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer","caller":"recoveryserver/server.go:125","msg":"Received state disk key and measurement secret, shutting down server"}
+{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer.gRPC","caller":"zap/server_interceptors.go:61","msg":"finished streaming call with code OK","grpc.start_time":"2022-09-08T10:26:59Z","system":"grpc","span.kind":"server","grpc.service":"recoverproto.API","grpc.method":"Recover","peer.address":"192.0.2.3:41752","grpc.code":"OK","grpc.time_ms":15.701}
+{"level":"INFO","ts":"2022-09-08T10:27:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:87","msg":"RejoinClient stopped"}
```

After enough control plane nodes have been recovered and the Kubernetes cluster becomes healthy again, the rest of the cluster will start auto healing using the mechanism described above.
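
To follow the auto-healing, assuming you still have working administrator credentials for the cluster, you can, for example, watch the nodes return to `Ready`:

```bash
# Sketch: watch node status while the remaining nodes rejoin and become Ready.
kubectl get nodes --watch
```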