Mirror of https://github.com/edgelesssys/constellation.git
Update docs to new recover workflow (#100)
Signed-off-by: Daniel Weiße <dw@edgeless.systems>
parent 8cb155d5c5
commit 21397bf98b

@@ -44,26 +44,27 @@ In the serial console output search for `Waiting for decryption key`.

Similar output to the following means your node was restarted and needs to decrypt the [state disk](../architecture/images.md#state-disk):

```shell
{"level":"INFO","ts":"2022-09-08T09:56:41Z","caller":"cmd/main.go:55","msg":"Starting disk-mapper","version":"2.0.0","cloudProvider":"azure"}
{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"setupManager","caller":"setup/setup.go:72","msg":"Preparing existing state disk"}
{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"recoveryServer","caller":"recoveryserver/server.go:59","msg":"Starting RecoveryServer"}
{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"rejoinClient","caller":"rejoinclient/client.go:65","msg":"Starting RejoinClient"}
```

The node will then try to connect to the [*JoinService*](../architecture/components.md#joinservice) and obtain the decryption key.
If that fails because the control plane is unhealthy, you will see log messages similar to the following:

```shell
{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"rejoinClient","caller":"rejoinclient/client.go:77","msg":"Received list with JoinService endpoints","endpoints":["10.9.0.5:30090","10.9.0.6:30090"]}
{"level":"INFO","ts":"2022-09-08T09:56:43Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"10.9.0.5:30090"}
{"level":"WARN","ts":"2022-09-08T09:57:03Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.9.0.5:30090: i/o timeout\"","endpoint":"10.9.0.5:30090"}
{"level":"INFO","ts":"2022-09-08T09:57:03Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"10.9.0.6:30090"}
{"level":"WARN","ts":"2022-09-08T09:57:23Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.9.0.6:30090: i/o timeout\"","endpoint":"10.9.0.6:30090"}
{"level":"ERROR","ts":"2022-09-08T09:57:23Z","logger":"rejoinClient","caller":"rejoinclient/client.go:110","msg":"Failed to rejoin on all endpoints"}
```

That means you have to recover that node manually.

Before you continue with the [recovery process](#recover-your-cluster) you need to know the node's IP address.
For the IP address, return to the instance's *Overview* page and find the *Private IP address*.
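
If you prefer the command line over the portal, the lookup can also be sketched with the Azure CLI (not part of the original walkthrough; `<resource-group>` and `<instance-name>` are placeholders, and scale-set instances may require the corresponding `az vmss` subcommands instead):

```bash
# Print the node's private IP address (assumes an authenticated Azure CLI session).
az vm list-ip-addresses \
  --resource-group <resource-group> \
  --name <instance-name> \
  --query '[0].virtualMachine.network.privateIpAddresses[0]' \
  --output tsv
```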

</tabItem>

<tabItem value="gcp" label="GCP" default>

@@ -81,25 +82,28 @@ In the serial console output search for `Waiting for decryption key`.

Similar output to the following means your node was restarted and needs to decrypt the [state disk](../architecture/images.md#state-disk):

```shell
{"level":"INFO","ts":"2022-09-08T10:21:53Z","caller":"cmd/main.go:55","msg":"Starting disk-mapper","version":"2.0.0","cloudProvider":"gcp"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"setupManager","caller":"setup/setup.go:72","msg":"Preparing existing state disk"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:65","msg":"Starting RejoinClient"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"recoveryServer","caller":"recoveryserver/server.go:59","msg":"Starting RecoveryServer"}
```

The node will then try to connect to the [*JoinService*](../architecture/components.md#joinservice) and obtain the decryption key.
If that fails because the control plane is unhealthy, you will see log messages similar to the following:

```shell
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:77","msg":"Received list with JoinService endpoints","endpoints":["192.168.178.4:30090","192.168.178.2:30090"]}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"192.168.178.4:30090"}
{"level":"WARN","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.178.4:30090: connect: connection refused\"","endpoint":"192.168.178.4:30090"}
{"level":"INFO","ts":"2022-09-08T10:21:53Z","logger":"rejoinClient","caller":"rejoinclient/client.go:96","msg":"Requesting rejoin ticket","endpoint":"192.168.178.2:30090"}
{"level":"WARN","ts":"2022-09-08T10:22:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:101","msg":"Failed to rejoin on endpoint","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.178.2:30090: i/o timeout\"","endpoint":"192.168.178.2:30090"}
{"level":"ERROR","ts":"2022-09-08T10:22:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:110","msg":"Failed to rejoin on all endpoints"}
```

That means you have to recover that node manually.

Before you continue with the [recovery process](#recover-your-cluster) you need to know the node's IP address.
For the IP address, go to the *"VM Instance" -> "network interfaces"* page and take the address from *"Primary internal IP address."*
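
If you prefer the command line, the serial console output and the internal IP can also be fetched with `gcloud` (a minimal sketch, not part of the original walkthrough; `<instance-name>` and `<zone>` are placeholders):

```bash
# Fetch the serial console output and look for the disk-mapper messages.
gcloud compute instances get-serial-port-output <instance-name> \
  --zone <zone> | grep -i "disk-mapper"

# Print the node's primary internal IP address.
gcloud compute instances describe <instance-name> \
  --zone <zone> --format='get(networkInterfaces[0].networkIP)'
```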

</tabItem>

</tabs>

@@ -114,10 +118,9 @@ From there, your cluster will auto heal the remaining 2 control-plane nodes and

Recovering a node requires the following parameters:

* The node's IP address
* Access to the master secret of the cluster

See the [Identify unhealthy clusters](#identify-unhealthy-clusters) description of how to obtain the node's IP address.
Note that the recovery command needs to connect to the recovering nodes.
Nodes only have private IP addresses in the VPC of the cluster; hence, the command needs to be issued from within the VPC network of the cluster.
The easiest approach is to set up a jump host connected to the VPC network and perform the recovery from there.
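
As a rough sketch (assuming a Linux jump host reachable over SSH at the hypothetical address `jump-host`, with the Constellation CLI and the cluster's master secret copied over), running the recovery from there could look like this:

```bash
# Copy the CLI binary and the master secret to the jump host (paths are illustrative).
scp constellation constellation-mastersecret.json jump-host:~/

# Issue the recovery from inside the VPC; the exact recover invocation is shown below.
ssh jump-host './constellation recover -e <node-private-ip> --master-secret constellation-mastersecret.json'
```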

@@ -125,23 +128,17 @@ The easiest approach is to set up a jump host connected to the VPC network and p

Given these prerequisites, a node can be recovered like this:

```bash
$ constellation recover -e 34.107.89.208 --master-secret constellation-mastersecret.json
Pushed recovery key.
```

In the serial console output of the node you'll see output similar to the following:

```shell
{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer","caller":"recoveryserver/server.go:93","msg":"Received recover call"}
{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer","caller":"recoveryserver/server.go:125","msg":"Received state disk key and measurement secret, shutting down server"}
{"level":"INFO","ts":"2022-09-08T10:26:59Z","logger":"recoveryServer.gRPC","caller":"zap/server_interceptors.go:61","msg":"finished streaming call with code OK","grpc.start_time":"2022-09-08T10:26:59Z","system":"grpc","span.kind":"server","grpc.service":"recoverproto.API","grpc.method":"Recover","peer.address":"192.0.2.3:41752","grpc.code":"OK","grpc.time_ms":15.701}
{"level":"INFO","ts":"2022-09-08T10:27:13Z","logger":"rejoinClient","caller":"rejoinclient/client.go:87","msg":"RejoinClient stopped"}
```

After enough control plane nodes have been recovered and the Kubernetes cluster becomes healthy again, the rest of the cluster will start auto healing using the mechanism described above.
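
Once that happens you can watch the cluster heal (a small sketch, assuming you have the cluster's kubeconfig at hand):

```bash
# Watch nodes return to Ready as the control plane recovers and workers rejoin.
kubectl get nodes --watch
```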