Investigate and fix source of errors in WMArchive #359

Open · vkuznet opened this issue May 12, 2023 · 6 comments

@vkuznet (Contributor) commented May 12, 2023

Our CMSWEB operator has reported on the Mattermost channel an increased number of errors in the WMArchive data flow. They may be related to reports from CERN IT (Lionel), who requested a change to the heartbeat configuration. I post below Lionel's suggestion and my response:

# from Lionel

I don’t know how the rate of warnings (of this kind) evolved over time. When investigating an unrelated issue I noticed
the abnormal rate of warnings coming from two machines and reported it.

However, I can give you hints on how heart-beats should be used.

Their main purpose is to allow one end to detect when the other end of the STOMP connection is not alive. The thresholds
depend on the use case.

For long-running connections, a large heart-beat threshold is fine. The worst that can happen is that a dead connection
stays open for a bit longer than needed. This is not a big deal. Here, thresholds in the range of 1 to 10 minutes are fine.

For short-lived connections, we often see a higher rate of new connections being created. Here, a smaller threshold should
be used to avoid having too many dead connections lying around. Here, thresholds in the range of 1 second to 1 minute are
fine.

In any case, the messaging client must send heart-beats frequently enough to prevent the connection from being closed at the
other end. The warnings I reported (Channel was inactive for too ... long) are exactly this: the broker closing the
connection because it hasn't received a heart-beat recently enough. So the best practice is to send heart-beats well
before the maximum delay, usually at half of it.

So, normally, the client has two different values: the proposed heart-beat timeout (which may change during heart-beat
negotiation) and the actual timeout used, usually half of what has been negotiated.
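
For illustration, a minimal sketch of how such a heart-beat proposal looks from a Go STOMP client, assuming the github.com/go-stomp/stomp/v3 library (the broker hostname below is a placeholder; port 61313 and the topic come from the WMArchive logs quoted later in this thread, and the actual WMArchive producer code may differ):

```go
package main

import (
	"log"
	"time"

	"github.com/go-stomp/stomp/v3"
)

func main() {
	// Propose 1-minute heart-beats in both directions; the effective
	// values are negotiated during the CONNECT/CONNECTED exchange.
	conn, err := stomp.Dial("tcp", "broker.example.cern.ch:61313",
		stomp.ConnOpt.HeartBeat(time.Minute, time.Minute))
	if err != nil {
		log.Fatalf("unable to connect: %v", err)
	}
	defer conn.Disconnect()

	// Between Send calls the library keeps the connection alive with
	// heart-beat frames, so a low message rate does not by itself
	// trigger broker-side "Channel was inactive for too long" warnings.
	if err := conn.Send("/topic/cms.jobmon.wmarchive", "application/json",
		[]byte(`{"status":"ok"}`)); err != nil {
		log.Fatalf("send failed: %v", err)
	}
}
```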

# VK response

> Lionel,
> thanks for the feedback. Unfortunately, I do not see how it may be helpful, since
> our service does not control the data flow. It is defined by the upstream CMS GRID
> machinery, i.e. once jobs are finished the data will be sent to WMArchive,
> which will forward it to CERN AMQ. That implies that our data rate is unknown
> by definition, and I can't clearly say whether it is too high or too low, since it
> depends on CMS data processing operations. Therefore, it is unclear which
> threshold will serve best. Moreover, it is likely that if we choose one
> threshold it may work for a while and then, due to a change in CMS data flow,
> it will no longer fit the expectations.
>
> That said, I'm in favor of changing the thresholds to 1 min, since that sits
> between the two extremes. Please let me know whether this would be an
> optimal solution. If you agree, our CMSWEB operator Aroosha
> can change the WMArchive configuration file accordingly and restart the service.

# Lionel feedback

The heart-beat rate is independent of the message rate. The client must send an EOL (a heart-beat frame) in case it has nothing to send.

This is explained in https://stomp.github.io/stomp-specification-1.2.html#Heart-beating

So, we should check the optimal heart-beat rate assignment based on the STOMP documentation linked above.
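
Per that spec, each side advertises what it can send and what it wants to receive, and the effective interval in each direction is the larger of the two values, with 0 on either side disabling heart-beats. A small Go helper illustrating the negotiation arithmetic (a sketch of the spec's rule, not WMArchive code):

```go
// negotiatedHeartBeat returns the effective heart-beat interval (ms) for
// one direction of a STOMP 1.2 connection: the sender advertises "I can
// send every senderMs", the receiver advertises "I want one at least
// every receiverMs", and the result is the maximum of the two. A zero
// on either side disables heart-beats in that direction.
func negotiatedHeartBeat(senderMs, receiverMs int) int {
	if senderMs == 0 || receiverMs == 0 {
		return 0
	}
	if senderMs > receiverMs {
		return senderMs
	}
	return receiverMs
}
```

For example, if the client CONNECTs with heart-beat:60000,60000 and the broker answers with heart-beat:10000,10000, the client must emit a frame (or a bare EOL) at least every 60000 ms; per Lionel's best practice above, it should actually send roughly every 30 s.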

@yuyiguo (Member) commented May 12, 2023

@vkuznet
What is the average duration of a WMArchive connection to the AMQ brokers? Do you have monitoring to share?

@vkuznet (Contributor, Author) commented May 12, 2023

Yuyi, you can find the relevant information here: https://monit-grafana.cern.ch/d/u_qOeVqZk/wmarchive-monit?orgId=11 and https://monit-grafana.cern.ch/d/wma-service/wmarchive-service?orgId=11 The first dashboard contains the latency plot.

@yuyiguo (Member) commented May 12, 2023

Valentin, which plots show the WMArchive-to-AMQ connection duration or disconnection rate?

@vkuznet (Contributor, Author) commented May 14, 2023

Yuyi, I pointed to the existing dashboards, but they do not show the duration of the AMQ connection; someone should add that to the code. That said, it is trivial to see from the WMArchive logs (vocms750:/cephfs/product/wma-logs/):

...
2023/05/14 00:05:32 stomp.go:168: send data to 188.185.13.100:61313 endpoint /topic/cms.jobmon.wmarchive
2023/05/14 00:05:32 stomp.go:168: send data to 188.185.11.68:61313 endpoint /topic/cms.jobmon.wmarchive
2023/05/14 00:06:28 stomp.go:168: send data to 188.185.35.176:61313 endpoint /topic/cms.jobmon.wmarchive
2023/05/14 00:06:28 wmarchive.go:298: POST /wmarchive/data/ 10.100.36.192:60508 [WMCore.Services.Requests/v002] [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0255.cern.ch] [188.185.89.194] {"result":[{"ids":["c5b8aed966fc4585865d2da2ebfd1b0d"],"status":"ok"}]}
2023/05/14 00:06:31 stomp.go:168: send data to 188.184.92.147:61313 endpoint /topic/cms.jobmon.wmarchive
2023/05/14 00:06:31 stomp.go:168: send data to 188.184.92.147:61313 endpoint /topic/cms.jobmon.wmarchive

So a connection does not last more than a minute: the log records a line every time WMArchive sends data, and the timestamps show that we usually have a few such entries within a minute.
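
For completeness, a hypothetical sketch of measuring the gap between consecutive sends from such logs (the timestamp layout and the "send data to" marker come from the excerpt above; the input file name is illustrative):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
	"time"
)

func main() {
	// Illustrative file name; real logs live under /cephfs/product/wma-logs/.
	f, err := os.Open("wmarchive.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var prev time.Time
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		// Only consider the stomp.go "send data to ..." entries.
		if len(line) < 19 || !strings.Contains(line, "send data to") {
			continue
		}
		// Each line starts with a "2023/05/14 00:05:32"-style timestamp.
		ts, err := time.Parse("2006/01/02 15:04:05", line[:19])
		if err != nil {
			continue
		}
		if !prev.IsZero() {
			fmt.Printf("gap between sends: %v\n", ts.Sub(prev))
		}
		prev = ts
	}
}
```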

@LionelCons commented:

FWIW, I can confirm that the problem is still present. I still see an abnormal number of warnings coming from the cmsweb machines and linked to the small (1.5s) heart-beat threshold.

@vkuznet (Contributor, Author) commented May 16, 2023

I updated the WMArchive configuration to use a 1 min threshold for the recv/send timeouts on the production and testbed clusters (FYI: @arooshap, @muhammadimranfarooqi). Apart from that, as I explained earlier, my effort is no longer allocated to development of services outside of the WM area, and further development efforts should be addressed via @klannon.
