ci: collect cluster metrics to OpenSearch (#2347)

* add Metricbeat deployment to debugd

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* set metricbeat debugd image version

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* fix k8s deployment

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* use 2 separate deployments

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* only deploy via k8s in non-debug-images

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* add missing tilde

* remove k8s metrics

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* unify flag

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* add cloud metadata processor to filebeat

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* ci: fix debugd logcollection (#2355)

* add missing keyvault access role

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* bump logstash image version

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* bump filebeat / metricbeat image version

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* log used image version

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* use debugging image versions

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* increase wait timeout for image upload

* add cloud metadata processor to filebeat

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* fix template locations in container

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* fix image version typo

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* add filebeat / metricbeat users

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* remove user additions

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* update workflow step name

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* only mount config files

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* document potential rc

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* fix IAM permissions in workflow

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* fix AWS permissions

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* tidy

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* add missing workflow input

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* rename action

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* pin image versions

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* remove unnecessary workflow inputs

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

---------

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* add refStream input

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* remove inputs.yml dep

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* increase system metric period

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

* fix linkchecker

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>

---------

Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>
Moritz Sanft 2023-09-27 16:17:31 +02:00 committed by GitHub
parent c7b728f202
commit f4b2d02194
GPG key ID: 4AEE18F83AFDEB23
27 changed files with 699 additions and 149 deletions


@@ -8,6 +8,14 @@ The logcollection functionality can be deployed to both [debug](./debug-cluster.
In debug clusters, logcollection functionality should be deployed automatically through the debug daemon `debugd`, which runs *before* the bootstrapper
and can therefore, contrary to non-debug clusters, also collect logs of the bootstrapper.
> [!WARNING]
> If logs from an E2E test run for a debug cluster with a bootstrapping failure are missing in OpenSearch, this might be caused by a race condition
> between the termination of the cluster and the start-up of the logcollection containers in the debugd.
> If the failure can be reproduced manually, it is best to do so and watch the serial console of the bootstrapping node with the following command until the logcollection containers have started:
> ```bash
> journalctl _SYSTEMD_UNIT=debugd.service | grep logcollect
> ```
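While reproducing the failure, it can be more convenient to follow the journal and stop at the first matching line instead of re-running the command. The following is a minimal sketch, assuming the relevant debugd log lines contain the string `logcollect` (as the command above implies); the helper name `wait_for_logcollect` is hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): read log lines from stdin,
# print the first one that mentions "logcollect", and stop reading.
# Exits 0 if such a line was seen, 1 if the input ended without a match.
wait_for_logcollect() {
  grep -m 1 -i "logcollect"
}

# Intended use on the bootstrapping node (not executed here):
#   journalctl -f _SYSTEMD_UNIT=debugd.service | wait_for_logcollect
```

Because `journalctl -f` keeps following the journal, the pipeline only ends once a matching line has been printed and `journalctl` attempts its next write.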
## Deployment in Non-Debug Clusters
In non-debug clusters, logcollection functionality needs to be explicitly deployed as a Kubernetes Deployment through Helm. To do that, a few steps need to be followed:
@@ -29,28 +37,39 @@ In non-debug clusters, logcollection functionality needs to be explicitly deploy
One can add additional key-value pairs to the configuration by appending `--info key=value` to the command.
These key-value pairs will be attached to the log entries and can be used to filter them in OpenSearch.
For example, it might be helpful to add a `test=<xyz>` tag to be able to filter out logs from a specific test run.
2. Add the Elastic Helm repository
```bash
helm repo add elastic https://helm.elastic.co
helm repo update
```
2. Deploy Logstash
```bash
cd logstash
make add
make install
helm install logstash elastic/logstash \
--wait --timeout=1200s --values values.yml
cd ..
```
This will add the required Logstash Helm charts and deploy them to your cluster.
3. Deploy Filebeat
2. Deploy Beats
```bash
cd metricbeat
helm install metricbeat-k8s elastic/metricbeat \
--wait --timeout=1200s --values values-control-plane.yml
helm install metricbeat-system elastic/metricbeat \
--wait --timeout=1200s --values values-all-nodes.yml
cd ..
cd filebeat
make add
make install
helm install filebeat elastic/filebeat \
--wait --timeout=1200s --values values.yml
cd ..
```
This will add the required Filebeat Helm charts and deploy them to your cluster.
This will add the required Filebeat and Metricbeat Helm charts and deploy them to your cluster.
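After the installs, it may be useful to confirm that all four releases exist. The sketch below uses `helm status` for that; the release names are taken from the `helm install` commands above, while the function name `missing_releases` is an assumption made for illustration. It checks the namespace of the current kubectl context only.

```bash
#!/usr/bin/env bash
# Release names as used in the `helm install` commands above.
releases=(logstash metricbeat-k8s metricbeat-system filebeat)

# Hypothetical helper: print every release that `helm status` cannot find
# and return non-zero if at least one is missing.
missing_releases() {
  local release missing=0
  for release in "${releases[@]}"; do
    if ! helm status "$release" >/dev/null 2>&1; then
      echo "$release"
      missing=1
    fi
  done
  return "$missing"
}

# Example (not executed here):
#   missing_releases || echo "some logcollection releases are not deployed" >&2
```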
To remove Logstash or Filebeat, `cd` into the corresponding directory and run `make remove`.
To remove Logstash or one of the Beats, run `helm uninstall <release>` with the corresponding release name (`logstash`, `filebeat`, `metricbeat-k8s`, or `metricbeat-system`).
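To tear down the whole logcollection stack at once, the releases can be removed in a loop. This is a sketch under the assumption that the release names match the `helm install` commands above; `remove_logcollection` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: uninstall every logcollection release.
# A failed uninstall is reported on stderr, and the loop continues with the next release.
remove_logcollection() {
  local release
  for release in logstash filebeat metricbeat-k8s metricbeat-system; do
    helm uninstall "$release" || echo "could not uninstall $release" >&2
  done
}

# Example (not executed here):
#   remove_logcollection
```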
## Inspecting Logs in OpenSearch