Icinga2 loses state history and notifications during restart #10179

Open
w1ll-i-code opened this issue Sep 30, 2024 · 6 comments

@w1ll-i-code

Describe the bug

If an object has a state change during an icinga2 restart (e.g. during a deploy), it is sometimes not written to the database and does not trigger the notifications.

To Reproduce

  1. Import the basket with icingacli director basket restore < icinga-lost-statechange-basket.json
    1. icinga-lost-statechange-basket.json
    2. This basket contains:
      1. A check command that randomly goes into warning to generate the state changes.
      2. A service template that runs the check command
      3. A service group to quickly create lots of services, making the occurrence more likely.
      4. A host template as the target for the apply rule of the service group.
  2. Create a few hosts:
    1. for i in $(seq --equal-width 1 100); do
          icingacli director host create "host-icinga-lost-statehistory-${i}" --imports 'ht-icinga-lost-statechange'
       done
  3. Deploy the config
    1. icingacli director config deploy
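
The basket contents are not shown here, so the following is only a minimal sketch of what a check plugin that "randomly goes into warning" could look like. The function name `random_state`, the 20% probability, and the output strings are assumptions for illustration, not the actual command from the basket.

```python
import random

OK, WARNING = 0, 1  # standard monitoring plugin exit codes


def random_state(warn_probability=0.2, rng=random):
    """Return (exit_code, plugin_output); WARNING with the given probability.

    warn_probability and the output texts are illustrative assumptions.
    """
    if rng.random() < warn_probability:
        return WARNING, "WARNING - simulated problem"
    return OK, "OK - nothing to report"


def main():
    """Print the plugin output and return the exit code for the caller."""
    code, output = random_state()
    print(output)
    return code
```

A real plugin script would end by passing the return value of `main()` to `sys.exit()`, so that Icinga 2 sees the exit code as the check state.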

With that configuration running, deploy icinga2 a few times: icingacli director config deploy --force --wait

Soon there will be state changes in the state history that should not be possible:
[Image: Screenshot from 2024-09-30 14-38-26]

In this case, the service went from hard warning into soft warning. The soft warning history says that the last state was Ok, but that was never written into the history.
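
The inconsistency described above can be expressed as a simple continuity rule: each history entry's recorded previous state should match the state of the entry before it. The sketch below uses a simplified, hypothetical record layout (the field names are not the real IDO or Icinga DB columns) and is not the attached script.

```python
from dataclasses import dataclass


@dataclass
class HistoryEntry:
    # Field names are illustrative, not the real IDO or Icinga DB columns.
    timestamp: int
    state: str           # e.g. "OK", "WARNING"
    state_type: str      # "soft" or "hard"
    previous_state: str  # state the entry claims to have come from


def find_gaps(entries):
    """Return (earlier, later) pairs where later.previous_state != earlier.state."""
    ordered = sorted(entries, key=lambda e: e.timestamp)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later.previous_state != earlier.state:
            gaps.append((earlier, later))
    return gaps


history = [
    HistoryEntry(100, "WARNING", "soft", "OK"),
    HistoryEntry(160, "WARNING", "hard", "WARNING"),
    # The recovery to OK that should sit here was never written.
    HistoryEntry(400, "WARNING", "soft", "OK"),
]

for earlier, later in find_gaps(history):
    print(f"gap between t={earlier.timestamp} and t={later.timestamp}: "
          f"expected previous state {earlier.state}, got {later.previous_state}")
```

In this made-up history, the pair at t=160 and t=400 is flagged, which mirrors the hard warning followed by a soft warning that claims to come from Ok.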

To find lost state histories quicker I used the following script:
dropped_state_query.tar.gz

It takes the endpoint, user, and password as parameters. If the database is PostgreSQL, run it with the --postgres flag.

Expected behavior

I expect icinga2 not to lose state changes like that.

Your Environment


  • Version used (icinga2 --version):
icinga2 - The Icinga 2 network monitoring daemon (version: r2.14.2-1)

Copyright (c) 2012-2024 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Build information:
  Compiler: GNU 8.5.0
  Build host: staging5591master
  OpenSSL version: OpenSSL 1.1.1k  FIPS 25 Mar 2021
  • Operating System and version:
System information:
  Platform: Red Hat Enterprise Linux
  Platform version: 8.10 (Ootpa)
  Kernel: Linux
  Kernel version: 4.18.0-553.el8_10.x86_64
  Architecture: x86_64
  • Enabled features (icinga2 feature list):
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 journald opentsdb perfdata syslog
Enabled features: api checker ido-mysql influxdb livestatus mainlog notification
  • Icinga Web 2 version and modules (System - About):
Icinga Web 2  NetEye release 4.39 (Traditional bock)

PHP Version   7.4.33
MODULE                  VERSION
analytics               1.58.0
auditlog                1.15.1
cube                    1.1.0
customproblemview       0.0.0
director                1.11.1
geomap                  1.22.0
grafana                 1.4.2
neteye                  1.155.0-1
host2servicedetailview  1.4.0
idoreports              0.10.1
incubator               0.22.0
ipl                     v0.5.0
lampo                   1.2.2
leafletjs               1.9.4
loginaudit              0.0.1
mapDatatype             0.1.0
monitoring              2.10.5
monitoringview          1.7.0
nagvis                  1.1.1
pdfexport               0.10.2
reactbundle             0.9.0
reporting               1.0.0
shutdownmanager         0.0.0
srwebbackend            0.0.0
tornado                 2.19.2
update                  1.44.1-2
  • Config validation (icinga2 daemon -C):
[2024-09-30 14:45:34 +0200] information/cli: Icinga application loader (version: r2.14.2-1)
[2024-09-30 14:45:34 +0200] information/cli: Loading configuration file(s).
[2024-09-30 14:45:34 +0200] information/ConfigItem: Committing config item(s).
[2024-09-30 14:45:34 +0200] information/ApiListener: My API identity: localhost.localdomain
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 LivestatusListener.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 ServiceGroup.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 902 Services.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 3 Zones.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 2 NotificationCommands.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 101 Hosts.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 Endpoint.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 6 ApiUsers.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 251 CheckCommands.
[2024-09-30 14:45:34 +0200] information/ScriptGlobal: Dumping variables to file '/neteye/shared/icinga2/data/cache/icinga2/icinga2.vars'
[2024-09-30 14:45:34 +0200] information/cli: Finished validating the configuration file(s).

Additional context

I have observed the loss of notifications in production but have not yet reproduced it locally. I suspect, however, that the two behaviors are linked.

We could also observe the same behavior when creating objects over the icinga2 api and then immediately sending a check-result. Once again, I have not replicated this locally yet, but I suspect the problem is the same in all these cases.
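
For readers who want to try the API scenario, here is a rough sketch of the two requests involved: creating a service object and then immediately submitting a check result for it. It only builds the requests and does not send them. The base URL, the host and service names, and the use of the `dummy` check command are assumptions; sending the requests would additionally need API credentials and TLS settings for the local setup.

```python
import json
import urllib.request
from urllib.parse import quote

API = "https://localhost:5665/v1"  # assumption: default API port on the local host
HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}


def create_service_request(host, service, check_command="dummy"):
    """Build the PUT request that creates a service object (not sent here)."""
    name = quote(f"{host}!{service}", safe="")
    body = {"attrs": {"check_command": check_command}}
    return urllib.request.Request(
        f"{API}/objects/services/{name}",
        data=json.dumps(body).encode(),
        method="PUT",
        headers=HEADERS,
    )


def check_result_request(host, service, exit_status, output):
    """Build the POST request that submits a passive check result (not sent here)."""
    body = {
        "type": "Service",
        "filter": f'host.name=="{host}" && service.name=="{service}"',
        "exit_status": exit_status,
        "plugin_output": output,
    }
    return urllib.request.Request(
        f"{API}/actions/process-check-result",
        data=json.dumps(body).encode(),
        method="POST",
        headers=HEADERS,
    )


# Hypothetical names; the second request would be sent right after the first.
create = create_service_request("host-1", "svc")
result = check_result_request("host-1", "svc", 1, "WARNING - test")
print(create.get_method(), create.full_url)
print(result.get_method(), result.full_url)
```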

@w1ll-i-code
Author

On vanilla icinga2 I can only reproduce it on the rhel8 package:

[Image]

@w1ll-i-code
Author

w1ll-i-code commented Oct 2, 2024

Here is everything you will need to reproduce the same errors locally:

  1. Download the tar archive and unpack it with tar -xaf icinga2-bugreport-reproduce.tar.gz
  2. In the resulting directory you will find two sub-directories:
    1. dropped_state_query contains a Rust program to locate the lost events more easily
    2. reproduce_local contains a docker-compose setup for the test environment.
  3. Fix the credentials in icinga2-bugreport-reproduce/reproduce_local/icinga2-init/rhel8.Dockerfile
    1. The username and password for the icinga2 RHEL 8 repo
    2. The org ID and activation key for the Red Hat subscription
  4. To run the test environment run docker-compose up --build in reproduce_local
    1. The docker-compose setup is a bit flaky, so if a service fails, it might need to be restarted.
  5. After the test environment is running, run:
    docker exec reproduce_local_icingaweb_1 bash -c "while true; do icingacli director config deploy --force ; sleep 30; done"
  6. In the directory dropped_state_query run
    cargo run -- --host localhost --user icingadb --password icingadb --database icingadb
    1. You might have to install cargo first. For that follow the instructions on https://rustup.rs/

Here is the archive with all needed data:
icinga2-bugreport-reproduce.tar.gz

@w1ll-i-code
Author

I could also reproduce the same problem on both Debian and RHEL 9, but it was orders of magnitude less likely to happen there.

@w1ll-i-code
Author

I have now also reproduced the same issue with icingadb:

How to Reproduce:

  1. Launch an icinga2 environment from the docker-compose playground
    git clone https://github.com/lippserd/docker-compose-icinga
    cd docker-compose-icinga
    docker-compose up
  2. Copy the icinga director basket into the directory
    icinga-lost-statechange-basket.json
  3. Load the configuration from the basket:
    docker exec docker-compose-icinga_director_1 icingacli director basket restore < icinga-lost-statechange-basket.json
  4. Create a lot of hosts (and with it services from the service set in the basket):
    docker exec docker-compose-icinga_director_1 bash -c 'for i in $(seq -f "%05g" 0 999); do
         icingacli director host create "host-icinga-lost-statehistory-${i}" --imports "ht-icinga-lost-statechange"
       done'
  5. Deploy the new director config:
    docker exec docker-compose-icinga_director_1 icingacli director config deploy
  6. Deploy the new config every 5 minutes to trigger the bug:
    docker exec docker-compose-icinga_director_1 bash -c 'while true; do sleep 300; icingacli director config deploy --force; done'

From my local tests on a VM with 4 cores and 8 GB of RAM, the error should be observable after around 10 minutes, i.e. two deploys.
To locate the errors automatically from the db, I have attached the rust program I used to analyze the history tables.
dropped_state_query.tar.gz

To run that, download the dropped_state_query.tar.gz file and run:

tar -xaf dropped_state_query.tar.gz
cd dropped_state_query
cargo run --release -- --host 127.0.0.1 --user icingadb --password icingadb --icingadb | tee analyzed-history.log

Hint: Rust can be installed easily from https://rustup.rs/

Once the Rust program has reported a missing state_history entry and/or notification, you can verify it by looking at the service history.

[Image: dropped_state_change_and_notification]

@lippserd
Member

Hi all,

Thank you for all the details. I am trying to reproduce the scenario, but so far without success. When was the screenshot taken? Immediately after executing the queries to check if something is missing in the database? If so, there is a high chance that Icinga DB had not inserted everything yet.

Also, missing entries in the database do not necessarily mean that Icinga has not sent a notification. That should rather be verified using custom check and notification plugins.

Best regards,
Eric
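
One way to approach the verification suggested above is a notification plugin that records every invocation in a file, independently of the database. The following is only a sketch under assumptions: the function name, the log format, and the field set are made up, and the values would have to be passed in from a NotificationCommand, for example via runtime macros such as $host.name$, $service.name$, $service.state$ and $notification.type$.

```python
import json
import os
import tempfile
import time


def record_notification(log_path, host, service, state, notification_type):
    """Append one JSON line per notification so delivery can be audited later."""
    entry = {
        "time": time.time(),
        "host": host,
        "service": service,
        "state": state,
        "type": notification_type,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


# Usage example with hypothetical values, writing to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "notifications.log")
record_notification(path, "host-1", "svc", "CRITICAL", "PROBLEM")
```

Comparing such a log against the state history would show whether a notification was actually triggered for a given state change, regardless of what reached the database.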

@w1ll-i-code
Author

w1ll-i-code commented Oct 16, 2024

Hello Eric.

I ran the tests over the weekend. Even after several days, the missing entries do not appear in icingadb. This is of course obvious to you, as I posted the screenshot on the 14th, while the state changes happened on the 10th. And as you know, if the state changes had happened only a few minutes earlier, the view would have shown the time elapsed since then rather than the date.

As for the notifications, we became aware of the problem because one of our services went into critical without sending any notifications. That normally works, and we have tested the configuration to make sure it does. It was by pure coincidence that we noticed this, which led us to investigate.
