TCP Out of Memory #159

Open
eleblebici opened this issue Oct 23, 2024 · 2 comments

eleblebici commented Oct 23, 2024

Bug Description

The memory usage keeps increasing:

ubuntu@osinfrand1:~$ juju exec --machine 2 -- "date ; cat /proc/net/sockstat"
Mon Oct 21 07:33:34 UTC 2024
sockets: used 2166
TCP: inuse 1579 orphan 140 tw 8 alloc 13489 mem 215980
UDP: inuse 6 mem 144
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
 
ubuntu@osinfrand1:~$ juju exec --machine 2 -- "date ; cat /proc/net/sockstat"
Mon Oct 21 08:17:29 UTC 2024
sockets: used 2165
TCP: inuse 1580 orphan 141 tw 14 alloc 20460 mem 431280
UDP: inuse 6 mem 146
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
ubuntu@osinfrand1:~$ juju exec --machine 2 -- "date ; cat /proc/net/sockstat"
Mon Oct 21 08:31:07 UTC 2024
sockets: used 2164
TCP: inuse 1579 orphan 145 tw 21 alloc 22608 mem 497946
UDP: inuse 6 mem 144
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
ubuntu@osinfrand1:~$ juju exec --machine 2 -- "date ; cat /proc/net/sockstat"
Mon Oct 21 08:35:17 UTC 2024
sockets: used 2164
TCP: inuse 1579 orphan 145 tw 9 alloc 23261 mem 518181
UDP: inuse 6 mem 144
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
ubuntu@osinfrand1:~$ juju exec --machine 2 -- "date ; cat /proc/net/sockstat"
Mon Oct 21 09:10:40 UTC 2024
sockets: used 2165
TCP: inuse 1579 orphan 140 tw 5 alloc 28838 mem 690292
UDP: inuse 6 mem 144
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
ubuntu@osinfrand1:~$ juju exec --machine 2 -- "date ; cat /proc/net/sockstat"
Mon Oct 21 09:13:24 UTC 2024
sockets: used 2164
TCP: inuse 1579 orphan 140 tw 26 alloc 29263 mem 703098
UDP: inuse 6 mem 144
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
ubuntu@osinfrand1:~$ juju exec --machine 2 -- "date ; cat /proc/net/sockstat"
Mon Oct 21 10:09:38 UTC 2024
sockets: used 2178
TCP: inuse 1581 orphan 140 tw 23 alloc 38209 mem 977463
UDP: inuse 6 mem 145
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

It seems we also have orphaned sockets that are not closed by the vector service:

tcp CLOSE-WAIT 153530 0 10.252.20.82:5044 10.252.20.125:39332 users:(("vector",pid=1028,fd=16)) ino:1197351486 sk:6572e cgroup:/system.slice/vector.service -->
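
For reference, the CLOSE-WAIT sockets held by vector on the Logstash port can be listed with ss (a quick diagnostic sketch; the filter assumes port 5044, as in the configuration further down):

# list CLOSE-WAIT sockets on port 5044 together with the owning process
sudo ss -ntp state close-wait '( sport = :5044 )'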

Looking at the "/proc/net/sockstat" output:

TCP: inuse 302 orphan 5 tw 4 alloc 866299 mem 20649344

We see that 866,299 sockets are allocated and roughly 79 GiB worth of memory pages are reserved for them (the mem column counts 4 KiB pages, not bytes; the actual memory usage will most likely be less than this).

The "pidstat_-p_ALL_-rudvwsRU_--human_-h" output shows that the "vector" process has 859617 file descriptors open. Combining that finding with the active socket count of ~3000 866299 allocated sockets in sockstat, it points to the "vector" process not properly cleaning up the sockets and leaking them, resulting in a TCP memory exhaustion issue. This is consistent with everything being back to normal after restarting the "vector".

We also have the logs below in the unit:

Sep 23 11:35:24 oscomputend3 filebeat[3250]: ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 10.252.20.223:60778->10.252.20.82:5044: i/o timeout
Sep 23 11:35:24 oscomputend3 filebeat[3250]: ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 10.252.20.223:60778->10.252.20.82:5044: i/o timeout

We tried adding the "keepalive" option to the vector configuration, but it did not help:

  logstash:
    address: 0.0.0.0:5044
    type: logstash
    keepalive:
      time_secs: 60
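
To check whether the keepalive setting actually takes effect on the accepted connections, the socket timers can be inspected with ss (a hedged check; -o prints timer information, and established sockets with keepalive enabled report a keepalive timer):

# show timers for established sockets on the Logstash port
sudo ss -tno state established '( sport = :5044 )'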

To Reproduce

Unfortunately, I do not have steps to reproduce.

I think this issue could also be related to Loki and may be relevant to #130, although large_dir is already enabled for the device; that did not help either.

We also tried tuning some Loki settings, but they did not help either:

juju config loki ingestion-burst-size-mb=512
juju config loki ingestion-rate-mb=128

Environment

cos-proxy charm revision is 117.

ubuntu@juju-8b9a36-openstack-21:/var/lib/juju/agents/unit-cos-proxy-0/charm$ ./vector --version
vector 0.36.0 (x86_64-unknown-linux-musl a5e48bb 2024-02-13 14:43:11.911392615)

Relevant log output

Sep 23 11:35:24 oscomputend3 filebeat[3250]: ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 10.252.20.223:60778->10.252.20.82:5044: i/o timeout
Sep 23 11:35:24 oscomputend3 filebeat[3250]: ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 10.252.20.223:60778->10.252.20.82:5044: i/o timeout

Aug 19 08:53:47 oscomputend3 kernel: [397430.141313] TCP: out of memory -- consider tuning tcp_mem

Additional context

Please let me know if anything else is needed for the resolution.
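
Since the kernel message above suggests tuning tcp_mem, for reference the current limits can be inspected like this (values are in 4 KiB pages, comparable to the mem column in /proc/net/sockstat):

# min, pressure and max TCP memory, in pages
sysctl net.ipv4.tcp_mem
cat /proc/net/sockstat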

eleblebici (Author) commented:

Hi,

Any update on this one?

Thanks.

simskij (Member) commented Nov 14, 2024

Hi @eleblebici,

For a short-term fix, you should be able to just restart the vector service. If memory is still not shrinking after a restart, that is an indicator of queue build-up in vector.

If you could get us the following, that would help us dig further (a rough way to collect these is sketched after the list):

  • a juju debug-log --replay for cos-proxy, as well as for loki
  • a dump of what vector top shows if you leave it running for a couple of minutes
  • a juju status --relations in the model where cos-proxy resides
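
A possible way to collect these artifacts (adjust unit and application names to your model; this is just a sketch):

# replayed debug logs for the relevant applications
juju debug-log --replay --include cos-proxy/0 > cos-proxy-debug.log
juju debug-log --replay --include loki/0 > loki-debug.log
# relation overview for the model
juju status --relations > juju-status-relations.txt
# on the cos-proxy machine, leave `vector top` running for a few minutes and capture its output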

Thanks.
