-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add handling for xml error when no battery stats are reported by the device #303
Conversation
I'll take it. I cannot test this "in real life" either, as I don't have any Home Automation Devices on my Fritz Box. |
NB: I assumed the exporter could be built from the Dockerfile with no external dependencies, but it seems this is not the case. I'll look into a clean Docker Multistage build, so it can be built with no local requirements aside from Docker. |
So far no problems with the 2.4.2 container on my side. Thanks for the quick turnaround! |
Correction: Not all good. The container doesn't crash and I don't see any immediate problems but the metrics are not showing up in prometheus anymore. I haven't yet found out what's wrong. The configuration for the containers is identical to before and just reverting back to the 2.3 image works fine. |
Can you check, what the exporter reports back via "curl"? |
Also "Setting |
One potential candidate for the problem:
Need to look at the code what this actually does. |
The code assumes, that the device have a strictly consecuitve index (1, 2, 3, 4, ... NOT 1, 2, 4, ...). There is no way to ask for how many devices there are, so the code simply enumerates the devices until it gets an IndexError and stops there. |
Ok so my suspicion at the moment is that there is some timing problem between the exporter and prometheus. The new version takes WAYY longer to complete the metrics listing than the old one. In 2.3 I get <5s complete times on the curl check. For 2.4.2 it takes upwards of 23 seconds for the same thing. I already have a 60s scrape interval but nothing ever shows up in prometheus. Right now I'm at a loss what could be happening. |
You can check the prometheus webinterface under Status -> Targets, to see what Prometheus thinks of the exporter. If it really takes 23s to fetch all information it may be worth checking the Prometheus scrape_timeout setting. IIRC this is 15s by default. |
Against my FritzBox 5590 Fiber running with HostInfo enabled takes 6.6s for 45 devices in my WiFi. 23s is an awfully long time... If you have 35 HA devices, 23s boils down to 650ms per device (completely ignoring all other metrics). The Fritz Boxes are not known to be super fast, so this may actually be expected. The code enumerates all devices, and then sequentially reads their status info. Soooo... This may actually be OK or to be expected. In the end I think Prometheus scrape_timeout is my best guess at the moment. |
Uuh... reading the code explains the performance :-/ Definitely something that needs to be improved... The exporter naively walks through the devices by enumerating them by index until it fails. If it does not fail, it will parse the whole status response from TR-064 adds a metric ton of metrics (pun intended) and then calls "getdeviceinfos" with the AIN of the device via HTTP, and again adds a ton of metrics. So it does a lot of things and uses up two remote calls to the box per HA device. I think it would be smarter to *only use the HTTP API for the HA stuff, especially since it seems, that there is a call, which returns all devices, but not all infos for all devices... So... yes, it currently is slow. and this needs work. Unfortunately this is not something I can do quickly, as I will have to rework the HA side of things a implement a more smart XML "Parser" to extract all the information with fewer calls. |
I have opened a new issue for this: #310 |
As previously mentioned here:
#245 (comment)
My watchtower instance automatically updated my container to the 2.4 release and it started crashing immediately. You mentioned in your comment here that you would consider just handling the parse error. This is what this patch should do.
Unfortunately, I'm not really able to test because I just can't get poetry installed on my system and I'm still struggling to get a newer python to work since the 3.6 release I have currently won't let me run the code directly.