Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add handling for xml error when no battery stats are reported by the device #303

Merged
merged 1 commit into from
Mar 10, 2024

Conversation

r0ckarong
Copy link
Contributor

As previously mentioned here:

#245 (comment)

My watchtower instance automatically updated my container to the 2.4 release and it started crashing immediately. You mentioned in your comment here that you would consider just handling the parse error. This is what this patch should do.

Unfortunately, I'm not really able to test because I just can't get poetry installed on my system and I'm still struggling to get a newer python to work since the 3.6 release I have currently won't let me run the code directly.

@r0ckarong r0ckarong changed the title Add handling for xml error when no battery stats are reported by the … Add handling for xml error when no battery stats are reported by the device Mar 9, 2024
@pdreker
Copy link
Owner

pdreker commented Mar 10, 2024

I'll take it. I cannot test this "in real life" either, as I don't have any Home Automation Devices on my Fritz Box.

@pdreker pdreker merged commit 02197fa into pdreker:main Mar 10, 2024
1 of 2 checks passed
@pdreker
Copy link
Owner

pdreker commented Mar 10, 2024

NB: I assumed the exporter could be built from the Dockerfile with no external dependencies, but it seems this is not the case. I'll look into a clean Docker Multistage build, so it can be built with no local requirements aside from Docker.

@r0ckarong
Copy link
Contributor Author

So far no problems with the 2.4.2 container on my side. Thanks for the quick turnaround!

@r0ckarong
Copy link
Contributor Author

Correction: Not all good. The container doesn't crash and I don't see any immediate problems but the metrics are not showing up in prometheus anymore. I haven't yet found out what's wrong. The configuration for the containers is identical to before and just reverting back to the 2.3 image works fine.

@r0ckarong r0ckarong deleted the fix-nonavm branch March 10, 2024 12:47
@pdreker
Copy link
Owner

pdreker commented Mar 10, 2024

Can you check, what the exporter reports back via "curl"? curl http://<EXPORTER_IP>:9787/metrics should at least show the "usual" metrics. When I try 2.4.3 (no real changes from 2.4.2, except for a broken ARMv6/7 build I get the basic stats from my box, but obviously I do not get HA statistics, as I donÄt have any HA devices on my box here.

@pdreker
Copy link
Owner

pdreker commented Mar 10, 2024

Also "Setting log:level: DEBUG in the YAML config might youeld more interesting output, if not already enabled.

@r0ckarong
Copy link
Contributor Author

r0ckarong commented Mar 10, 2024

One potential candidate for the problem:

Got IndexError for index 35, stopping

2024/03/10 13:59:01,stderr,"2024-03-10 12:59:01,335    DEBUG fritzexporter.fritzcapability | Got IndexError for index 35, stopping
"
2024/03/10 13:59:01,stderr,"2024-03-10 12:59:01,321    DEBUG fritzexporter.fritzcapability | Fetching home automation device information for index 35
"
2024/03/10 13:59:00,stderr,"2024-03-10 12:59:00,778    DEBUG fritzexporter.fritzcapability | Fetching home automation device information for index 34
"
2024/03/10 13:59:00,stderr,"2024-03-10 12:59:00,218    DEBUG fritzexporter.fritzcapability | Fetching home automation device information for index 33
"
2024/03/10 13:58:59,stderr,"2024-03-10 12:58:59,639    DEBUG fritzexporter.fritzcapability | Fetching home automation device information for index 32
... [more devices came before]

Need to look at the code what this actually does.

@pdreker
Copy link
Owner

pdreker commented Mar 10, 2024

The code assumes, that the device have a strictly consecuitve index (1, 2, 3, 4, ... NOT 1, 2, 4, ...). There is no way to ask for how many devices there are, so the code simply enumerates the devices until it gets an IndexError and stops there.

@r0ckarong
Copy link
Contributor Author

Ok so my suspicion at the moment is that there is some timing problem between the exporter and prometheus. The new version takes WAYY longer to complete the metrics listing than the old one. In 2.3 I get <5s complete times on the curl check. For 2.4.2 it takes upwards of 23 seconds for the same thing.

I already have a 60s scrape interval but nothing ever shows up in prometheus. Right now I'm at a loss what could be happening.

@pdreker
Copy link
Owner

pdreker commented Mar 10, 2024

You can check the prometheus webinterface under Status -> Targets, to see what Prometheus thinks of the exporter.

If it really takes 23s to fetch all information it may be worth checking the Prometheus scrape_timeout setting. IIRC this is 15s by default.

@pdreker
Copy link
Owner

pdreker commented Mar 10, 2024

Against my FritzBox 5590 Fiber running with HostInfo enabled takes 6.6s for 45 devices in my WiFi. 23s is an awfully long time...

If you have 35 HA devices, 23s boils down to 650ms per device (completely ignoring all other metrics). The Fritz Boxes are not known to be super fast, so this may actually be expected. The code enumerates all devices, and then sequentially reads their status info. Soooo... This may actually be OK or to be expected.

In the end I think Prometheus scrape_timeout is my best guess at the moment.

@pdreker
Copy link
Owner

pdreker commented Mar 10, 2024

Uuh... reading the code explains the performance :-/ Definitely something that needs to be improved...

The exporter naively walks through the devices by enumerating them by index until it fails. If it does not fail, it will parse the whole status response from TR-064 adds a metric ton of metrics (pun intended) and then calls "getdeviceinfos" with the AIN of the device via HTTP, and again adds a ton of metrics.

So it does a lot of things and uses up two remote calls to the box per HA device. I think it would be smarter to *only use the HTTP API for the HA stuff, especially since it seems, that there is a call, which returns all devices, but not all infos for all devices...

So... yes, it currently is slow. and this needs work. Unfortunately this is not something I can do quickly, as I will have to rework the HA side of things a implement a more smart XML "Parser" to extract all the information with fewer calls.

@pdreker
Copy link
Owner

pdreker commented Mar 10, 2024

I have opened a new issue for this: #310

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants