Investigate and fix source of errors in WMArchive #359

Open · vkuznet opened this issue May 12, 2023 · 6 comments

@vkuznet (Contributor) commented May 12, 2023

Our CMSWEB operator has reported on the Mattermost channel an increased number of errors in the WMArchive data flow. They may be related to reports from CERN IT (Lionel), who requested a change to the heartbeat configuration. I post below Lionel's suggestion and my response:

# from Lionel

I don’t know how the rate of warnings (of this kind) evolved over time. When investigating an unrelated issue I noticed
the abnormal rate of warnings coming from two machines and reported it.

However, I can give you hints on how heart-beats should be used.

Their main purpose is to allow one end to detect when the other end of the STOMP connection is not alive. The thresholds
depend on the use case.

For long-running connections, a large heart-beat threshold is fine. The worst that can happen is that a dead connection
stays open for a bit longer than needed. This is not a big deal. Here, thresholds in the range of 1 to 10 minutes are fine.

For short-lived connections, we often see a higher rate of new connections being created. Here, a smaller threshold should
be used to avoid having too many dead connections lying around. Here, thresholds in the range of 1 second to 1 minute are
fine.

In any case, the messaging client must send heart-beats frequently enough to prevent the connection from being closed at the
other end. The warnings I reported (Channel was inactive for too ... long) are exactly this: the broker closing the
connection because it hasn't received a heart-beat recently enough. So the best practice is to send heart-beats well
before the maximum delay, usually at half of it.

So, normally, the client has two different values: the proposed heart-beat timeout (which may change during heart-beat
negotiation) and the actual timeout used, usually half of what has been negotiated.
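
For illustration, a minimal sketch of how such a heart-beat proposal looks from a Go STOMP client, assuming the github.com/go-stomp/stomp/v3 library (the broker hostname below is a placeholder; port 61313 and the topic come from the WMArchive logs quoted later in this thread, and the actual WMArchive producer code may differ):

```go
package main

import (
	"log"
	"time"

	"github.com/go-stomp/stomp/v3"
)

func main() {
	// Propose 1-minute heart-beats in both directions; the effective
	// values are negotiated during the CONNECT/CONNECTED exchange.
	conn, err := stomp.Dial("tcp", "broker.example.cern.ch:61313",
		stomp.ConnOpt.HeartBeat(time.Minute, time.Minute))
	if err != nil {
		log.Fatalf("unable to connect: %v", err)
	}
	defer conn.Disconnect()

	// Between Send calls the library keeps the connection alive with
	// heart-beat frames, so a low message rate does not by itself
	// trigger broker-side "Channel was inactive for too long" warnings.
	if err := conn.Send("/topic/cms.jobmon.wmarchive", "application/json",
		[]byte(`{"status":"ok"}`)); err != nil {
		log.Fatalf("send failed: %v", err)
	}
}
```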

# VK response

> Lionel,
> thanks for the feedback. Unfortunately, I do not see how it may be helpful, since
> our service does not control the data flow. It is defined by the upstream CMS GRID
> machinery, i.e. once jobs are finished the data will be sent to WMArchive,
> which will forward it to CERN AMQ. That implies that our data rate is unknown
> by definition, and I can't clearly say whether it is too high or too low, since it
> depends on CMS data processing operations. Therefore, it is unclear which
> threshold will serve best. Moreover, it is likely that if we choose one
> threshold it may work for a while and then, due to a change in CMS data flow,
> it will no longer fit the expectations.
>
> That said, I'm in favor of changing the thresholds to 1 min, since that sits
> between the two extremes. Please let me know whether this would be an
> optimal solution. If you agree, our CMSWEB operator Aroosha
> can change the WMArchive configuration file accordingly and restart the service.

# Lionel feedback

The heart-beat rate is independent of the message rate. The client must send an EOL (a heart-beat frame) in case it has nothing to send.

This is explained in https://stomp.github.io/stomp-specification-1.2.html#Heart-beating

So, we should check the optimal heart-beat rate assignment based on the STOMP documentation linked above.
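
Per that spec, each side advertises what it can send and what it wants to receive, and the effective interval in each direction is the larger of the two values, with 0 on either side disabling heart-beats. A small Go helper illustrating the negotiation arithmetic (a sketch of the spec's rule, not WMArchive code):

```go
// negotiatedHeartBeat returns the effective heart-beat interval (ms) for
// one direction of a STOMP 1.2 connection: the sender advertises "I can
// send every senderMs", the receiver advertises "I want one at least
// every receiverMs", and the result is the maximum of the two. A zero
// on either side disables heart-beats in that direction.
func negotiatedHeartBeat(senderMs, receiverMs int) int {
	if senderMs == 0 || receiverMs == 0 {
		return 0
	}
	if senderMs > receiverMs {
		return senderMs
	}
	return receiverMs
}
```

For example, if the client CONNECTs with heart-beat:60000,60000 and the broker answers with heart-beat:10000,10000, the client must emit a frame (or a bare EOL) at least every 60000 ms; per Lionel's best practice above, it should actually send roughly every 30 s.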

@yuyiguo (Member) commented May 12, 2023

@vkuznet
What is the average duration of a WMArchive connection to the AMQ brokers? Do you have monitoring to share?

@vkuznet (Contributor, Author) commented May 12, 2023

Yuyi, you can find the relevant information here: https://monit-grafana.cern.ch/d/u_qOeVqZk/wmarchive-monit?orgId=11 and https://monit-grafana.cern.ch/d/wma-service/wmarchive-service?orgId=11 The first dashboard contains the latency plot.

@yuyiguo (Member) commented May 12, 2023

Valentin, which plots show the WMArchive-to-AMQ connection duration or disconnection rate?

@vkuznet (Contributor, Author) commented May 14, 2023

Yuyi, I pointed to the existing dashboards, but they do not show the duration of the AMQ connection; someone should add that to the code. That said, it is trivial to see from the WMArchive logs (vocms750:/cephfs/product/wma-logs/):

...
2023/05/14 00:05:32 stomp.go:168: send data to 188.185.13.100:61313 endpoint /topic/cms.jobmon.wmarchive
2023/05/14 00:05:32 stomp.go:168: send data to 188.185.11.68:61313 endpoint /topic/cms.jobmon.wmarchive
2023/05/14 00:06:28 stomp.go:168: send data to 188.185.35.176:61313 endpoint /topic/cms.jobmon.wmarchive
2023/05/14 00:06:28 wmarchive.go:298: POST /wmarchive/data/ 10.100.36.192:60508 [WMCore.Services.Requests/v002] [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0255.cern.ch] [188.185.89.194] {"result":[{"ids":["c5b8aed966fc4585865d2da2ebfd1b0d"],"status":"ok"}]}
2023/05/14 00:06:31 stomp.go:168: send data to 188.184.92.147:61313 endpoint /topic/cms.jobmon.wmarchive
2023/05/14 00:06:31 stomp.go:168: send data to 188.184.92.147:61313 endpoint /topic/cms.jobmon.wmarchive

So a connection does not last more than a minute: the log records a line every time WMArchive sends data, and the timestamps show that we usually have a few such entries within a minute.
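
For completeness, a hypothetical sketch of measuring the gap between consecutive sends from such logs (the timestamp layout and the "send data to" marker come from the excerpt above; the input file name is illustrative):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
	"time"
)

func main() {
	// Illustrative file name; real logs live under /cephfs/product/wma-logs/.
	f, err := os.Open("wmarchive.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var prev time.Time
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		// Only consider the stomp.go "send data to ..." entries.
		if len(line) < 19 || !strings.Contains(line, "send data to") {
			continue
		}
		// Each line starts with a "2023/05/14 00:05:32"-style timestamp.
		ts, err := time.Parse("2006/01/02 15:04:05", line[:19])
		if err != nil {
			continue
		}
		if !prev.IsZero() {
			fmt.Printf("gap between sends: %v\n", ts.Sub(prev))
		}
		prev = ts
	}
}
```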

@LionelCons commented:

FWIW, I can confirm that the problem is still present. I still see an abnormal number of warnings coming from the cmsweb machines and linked to the small (1.5s) heart-beat threshold.

@vkuznet (Contributor, Author) commented May 16, 2023

I updated the WMArchive configuration to use a 1 min threshold for the recv/send timeouts on the production and testbed clusters (FYI: @arooshap, @muhammadimranfarooqi). Apart from that, as I explained earlier, my effort is no longer allocated to development of services outside of the WM area, and further development efforts should be addressed via @klannon.
