node_is_alive returns unknown when patroni is not reachable #77

Open
MLyssens opened this issue Oct 28, 2024 · 2 comments · May be fixed by #78
@MLyssens

Hi,

When I stop Patroni on a node and run node_is_alive, the check returns UNKNOWN:

check_patroni -vvv -e http://x.x.x.x:8008 node_is_alive
DEBUG - Trying to connect to http://x.x.x.x:8008/liveness with cert: None verify: None
DEBUG - Starting new HTTP connection (1): x.x.x.x:8008
DEBUG - HTTPConnectionPool(host='x.x.x.x', port=8008): Max retries exceeded with url: /liveness (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4205724790>: Failed to establish a new connection: [Errno 111] Connection refused'))
NODEISALIVE UNKNOWN - Connection failed for all provided endpoints
unknown: Connection failed for all provided endpoints

When the connection attempt fails, I would expect the check to return CRITICAL instead. The system calls show the refused connection followed by the UNKNOWN output:

connect(3, {sa_family=AF_INET, sin_port=htons(8008), sin_addr=inet_addr("x.x.x.x")}, 16) = -1 ECONNREFUSED (Connection refused)
close(3)                                = 0
alarm(0)                                = 2
write(1, "NODEISALIVE UNKNOWN - Connection failed for all provided endpoints\nunknown: Connection failed for all provided endpoints\n", 121NODEISALIVE UNKNOWN - Connection failed for all provided endpoints
unknown: Connection failed for all provided endpoints
) = 121

@blogh
Collaborator

blogh commented Nov 6, 2024

Hi!

Sorry for the wait and thank you for the patch/report.

The current value is set on purpose, because:

  • it could also be a configuration error, which should map to UNKNOWN according to the Nagios conventions;
  • I think I remember looking at other checks, such as check_pgactivity, which do the same in similar situations.

@blogh
Collaborator

blogh commented Nov 6, 2024

I recognise that, given the probe's description, this is not obvious:

Check if the node is alive ie patroni is running. This is a liveness check as defined in Patroni's documentation.

My idea was to map the check onto the /liveness endpoint: 200 = OK, anything other than 200 = CRITICAL, failed connection = UNKNOWN.

GET /liveness: returns HTTP status code 200 if Patroni heartbeat loop is properly running and 503 if the last run was
more than ttl seconds ago on the primary or 2*ttl on the replica. Could be used for livenessProbe.
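
For readers following along, the mapping described above could be sketched roughly as follows. This is only an illustration using the Python standard library, not check_patroni's actual code: the names `map_liveness` and `fetch_status` are hypothetical, and the exit codes are the usual Nagios plugin values (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import urllib.error
import urllib.request
from typing import Optional

# Usual Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def map_liveness(status: Optional[int]) -> int:
    """Map a /liveness HTTP status (None = no connection) to a Nagios state.

    Hypothetical helper: 200 -> OK, any other status -> CRITICAL,
    no answer at all -> UNKNOWN.
    """
    if status is None:
        return UNKNOWN
    return OK if status == 200 else CRITICAL


def fetch_status(endpoint: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status of GET <endpoint>/liveness, or None if unreachable."""
    url = endpoint.rstrip("/") + "/liveness"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: nobody answered.
        return None
```

With this split, a stopped Patroni (connection refused) makes `fetch_status` return `None`, which maps to UNKNOWN, while a 503 from a stalled heartbeat loop maps to CRITICAL. Changing the behaviour requested in this issue would amount to returning CRITICAL in the `None` branch.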

@blogh blogh self-assigned this Nov 6, 2024