Yesterday I did a fun experiment.
I made a copy of an internal single-tenant monitoring cluster (well, with only a subset of the data, but anyway)
the number of pods doesn't really matter, but for completeness:
16 query pods
32 read pods (16x2 replication)
16 write pods
More important is that the active series count (both in idx and tank) was about 75k per read and write pod, or about 1.2M cluster wide (3.5M taking replication into account)
I then did this query
time curl 'localhost/graphite/render?target=*&target=*.*&target=*.*.*&target=*.*.*.*&target=*.*.*.*.*&target=*.*.*.*.*.*&target=*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*.*.*.*' >/dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 587M 0 587M 0 0 7914k 0 --:--:-- 0:01:16 --:--:-- 156M
real 1m 16.05s
user 0m 0.18s
sys 0m 0.53s
this query hit 1.186M series (all series in the cluster) and returned 15M points (so about 12 points per series, thanks to max-points-per-req-soft, the 2h rollup being selected, and the default time window being 24h)
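As a quick sanity check on those numbers: a 24h window at a 2h rollup gives 24h / 2h = 12 points per series, and 1.186M series x 12 points ≈ 14.2M points, which roughly matches the ~15M returned.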
To achieve this, here is what happened to memory usage:
cluster wide: 20.6 -> 46.5 GB
mt-query: ~50 MB -> 4.5 GB (only the pod that was hit; the others stayed the same)
https://snapshot.raintank.io/dashboard/snapshot/6NNs0f0zsDcx1lXoJGb1xkAV4QoXXIp2?orgId=2
(legend names in snapshots get messed up: grafana#27973)
The memory usage for read pods seemed a bit crazy.
I would expect memory usage to be something along the lines of (at least conceptually, minus the overhead), per series:
for the initial find:
metric metadata/details (name, interval, etc): few hundred B
for the getdata:
12 points of 12B each = 144B
sliceheader = 24B
metric metadata/details (name, interval, etc): few hundred B
let's call the /getdata portion 600B per series; at 75k series that's 45MB, not ~1300MB. That's a lot more overhead than expected. (See the quick sketch below.)
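To make that estimate concrete, here's a quick back-of-envelope sketch in Go; the 420B metadata figure is only a placeholder for "a few hundred B", chosen so the per-series total lands near 600B:

package main

import "fmt"

func main() {
	const (
		pointsPerSeries = 12    // 24h window at a 2h rollup
		bytesPerPoint   = 12    // as estimated above
		sliceHeader     = 24    // Go slice header
		metaPerSeries   = 420   // "a few hundred B" of name/interval/etc, assumed
		seriesPerPod    = 75000 // active series per read pod
	)
	perSeries := pointsPerSeries*bytesPerPoint + sliceHeader + metaPerSeries // 588B
	fmt.Printf("expected per read pod: ~%.0fMB\n", float64(perSeries*seriesPerPod)/1e6)
	// prints ~44MB, nowhere near the ~1300MB growth seen in the read pod profile below
}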
I ran the same query again and took a bunch of profiles.
I could have done this a lot more scientifically and gotten profiles at exactly the right times (exact query start, just before the first response comes in on the query pod, right when the last response comes in on the query pod, etc) but I didn't have a lot of time to set up a proper experiment.
https://snapshot.raintank.io/dashboard/snapshot/r5dzlOiBEDznDwxIRrbtoQn9s28arbEk?orgId=2
I actually ran it a 3rd time to be able to get the file query.new-after, a heap profile on the query pod right after the query completes. For all intents and purposes I think we can pretend I just executed one query and collected profiles as follows: profiles.zip
The query pods first spend ~900MB encoding the request bodies for the /getdata calls, then the rest decoding responses from the read pods and generating the final output. Nothing too unexpected here, though I hadn't considered the cost of the bodies for /getdata before.
the read pod profile is pretty clear:
(pprof) top30 -cum
Showing nodes accounting for 1323.76MB, 94.98% of 1393.78MB total
Dropped 112 nodes (cum <= 6.97MB)
Showing top 30 nodes out of 45
flat flat% sum% cum cum%
3MB 0.22% 0.22% 1245MB 89.33% github.com/grafana/metrictank/api.(*Server).getTarget
0 0% 0.22% 1245MB 89.33% github.com/grafana/metrictank/api.(*Server).getTargetsLocal.func1
0 0% 0.22% 1241.50MB 89.07% github.com/grafana/metrictank/api.(*Server).getSeriesFixed
1241.50MB 89.07% 89.29% 1241.50MB 89.07% github.com/grafana/metrictank/pointslicepool.(*PointSlicePool).GetMin
0 0% 89.29% 1239.95MB 88.96% github.com/grafana/metrictank/api.Fix
(...)
ROUTINE ======================== github.com/grafana/metrictank/api.Fix in /go/src/github.com/grafana/metrictank/api/dataprocessor.go
0 1.21GB (flat, cum) 88.96% of Total
. . 74: // the requested range is too narrow for the requested interval
. . 75: return []schema.Point{}
. . 76: }
. . 77:
. . 78: neededCap := int((last-first)/interval + 1)
. 1.21GB 79: out := pointSlicePool.GetMin(neededCap)
. . 80: out = out[:neededCap]
. . 81:
. . 82: // i iterates in. o iterates out. t is the ts we're looking to fill.
. . 83: for t, i, o := first, 0, -1; t <= last; t += interval {
. . 84: o += 1
(pprof) list GetMin
Total: 1.36GB
ROUTINE ======================== github.com/grafana/metrictank/pointslicepool.(*PointSlicePool).GetMin in /go/src/github.com/grafana/metrictank/pointslicepool/pointslicepool.go
1.21GB 1.21GB (flat, cum) 89.07% of Total
. . 43: return make([]schema.Point, 0, minCap)
. . 44: }
. . 45: // even if our caller needs a smaller cap now, we expect they will put it back in the pool
. . 46: // so it can later be reused.
. . 47: // may as well allocate a size now that we expect will be more useful down the road.
1.21GB 1.21GB 48: return make([]schema.Point, 0, p.defaultSize)
. . 49:}
While we only need a cap of 12, we allocate a cap of 2000! At 12B per point that's 24kB per series, or about 1.8GB per read pod on average (pre-overhead)!
This is a pretty interesting case. We have the max-points-per-req settings, which allow us to control how much space we allocate for points across the entire request (regardless of how many series are used), but this is some kind of loophole: once you query enough series that the number of points per series is small, we lose control over the amount of space allocated, because we always allocate slices with a capacity of at least 2000 points.
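For context, here is a simplified, self-contained sketch (not the actual metrictank code, just the shape the profile points at) of how a GetMin like this hands out 2000-cap slices even for 12-point requests:

package main

import (
	"fmt"
	"sync"
)

// Point stands in for schema.Point: a value plus a timestamp.
type Point struct {
	Val float64
	Ts  uint32
}

// pointSlicePool is a simplified stand-in for the real PointSlicePool,
// written only to illustrate the allocation behaviour discussed above.
type pointSlicePool struct {
	defaultSize int // 2000 in the scenario profiled here
	p           sync.Pool
}

// GetMin returns a slice with capacity >= minCap.
func (p *pointSlicePool) GetMin(minCap int) []Point {
	if s, ok := p.p.Get().([]Point); ok && cap(s) >= minCap {
		return s[:0]
	}
	if minCap > p.defaultSize {
		return make([]Point, 0, minCap)
	}
	// even a tiny request gets a defaultSize-cap slice: at 12B per point
	// that's ~24kB per series, which is where the ~1.8GB figure comes from
	return make([]Point, 0, p.defaultSize)
}

func main() {
	pool := &pointSlicePool{defaultSize: 2000}
	fmt.Println(cap(pool.GetMin(12))) // prints 2000, not 12
}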
1) save memory by not using large slices when we only need small ones. In particular, fix #1912.
2) use a default size that corresponds to a capacity that append() actually uses (I believe a size of 800 gets rounded up by make).
(side note: this issue can happen both when requesting very short time ranges - which obviously result in few points - but also longer ones, because either due to the time window being queried, or thanks to max-points-per-req-soft, a rollup may be selected where even a long window still results in few points)
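As a rough, purely illustrative sketch of what direction (1) could look like (leaving out the pool reuse for brevity; the 1/4 threshold is an assumption, not necessarily what the actual fix does):

package main

import "fmt"

// Point stands in for schema.Point: a value plus a timestamp.
type Point struct {
	Val float64
	Ts  uint32
}

// getMinCapped only pads small requests up to defaultSize when they are in the
// same ballpark as the default; truly small requests get roughly what they asked for.
func getMinCapped(defaultSize, minCap int) []Point {
	switch {
	case minCap > defaultSize:
		return make([]Point, 0, minCap)
	case minCap >= defaultSize/4:
		// close enough to the default that handing out default-sized slices still pays off
		return make([]Point, 0, defaultSize)
	default:
		// e.g. 75k requests for 12 points no longer cost 75k slices of cap 2000
		return make([]Point, 0, minCap)
	}
}

func main() {
	fmt.Println(cap(getMinCapped(2000, 12)))   // 12
	fmt.Println(cap(getMinCapped(2000, 1500))) // 2000
}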
Dieterbe changed the title from "queries with high number of series trigger super duper excessive pointslicepool usage" to "queries with high number of series trigger excessive pointslicepool usage" on Oct 6, 2020