Yesterday I did a fun experiment.
I made a copy of an internal single-tenant monitoring cluster (well, with only a subset of the data, but anyway)
the number of pods doesn't really matter, but for completeness:
16 query pods
32 read pods (16x2 replication)
16 write pods
More important is that the active series count (both in idx and tank) was about 75k per read and write pod, or about 1.2M cluster wide (3.5M taking replication into account)
I then did this query
time curl 'localhost/graphite/render?target=*&target=*.*&target=*.*.*&target=*.*.*.*&target=*.*.*.*.*&target=*.*.*.*.*.*&target=*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*.*.*.*' >/dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 587M 0 587M 0 0 7914k 0 --:--:-- 0:01:16 --:--:-- 156M
real 1m 16.05s
user 0m 0.18s
sys 0m 0.53s
this query hit 1.186M series (all series in the cluster) and returned 15M points (so about 12 points per series, thanks to max-points-per-req-soft, the 2h rollup being selected, and the default time window being 24h)
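As a quick sanity check on those numbers: a 24h window at a 2h rollup gives 24h / 2h = 12 points per series, and 1.186M series x 12 points ≈ 14.2M points, which roughly matches the ~15M returned.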
To achieve this, here is what happened to memory usage:
cluster wide: 20.6 -> 46.5 GB
mt-query: ~50 MB -> 4.5 GB (only the pod that was hit; the others stayed the same)
https://snapshot.raintank.io/dashboard/snapshot/6NNs0f0zsDcx1lXoJGb1xkAV4QoXXIp2?orgId=2
(legend names in snapshots get messed up: grafana#27973)
The memory usage for read pods seemed a bit crazy.
I would expect memory usage to be something along the lines of (at least conceptually, minus the overhead), per series:
for the initial find:
metric metadata/details (name, interval, etc): few hundred B
for the getdata:
12 points of 12B each = 144B
sliceheader = 24B
metric metadata/details (name, interval, etc): few hundred B
let's call the /getdata portion 600B per series; at 75k series that's 45MB, not ~1300MB. That's a lot more overhead than expected. (See the quick sketch below.)
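To make that estimate concrete, here's a quick back-of-envelope sketch in Go; the 420B metadata figure is only a placeholder for "a few hundred B", chosen so the per-series total lands near 600B:

package main

import "fmt"

func main() {
	const (
		pointsPerSeries = 12    // 24h window at a 2h rollup
		bytesPerPoint   = 12    // as estimated above
		sliceHeader     = 24    // Go slice header
		metaPerSeries   = 420   // "a few hundred B" of name/interval/etc, assumed
		seriesPerPod    = 75000 // active series per read pod
	)
	perSeries := pointsPerSeries*bytesPerPoint + sliceHeader + metaPerSeries // 588B
	fmt.Printf("expected per read pod: ~%.0fMB\n", float64(perSeries*seriesPerPod)/1e6)
	// prints ~44MB, nowhere near the ~1300MB growth seen in the read pod profile below
}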
I ran the same query again and took a bunch of profiles.
I could have done this a lot more scientifically and gotten profiles at exactly the right times (exact query start, just before the first response comes in on the query pod, right when the last response comes in on the query pod, etc) but I didn't have a lot of time to set up a proper experiment.
https://snapshot.raintank.io/dashboard/snapshot/r5dzlOiBEDznDwxIRrbtoQn9s28arbEk?orgId=2
I actually ran it a 3rd time to be able to get the file query.new-after, a heap profile on the query pod right after the query completes. For all intents and purposes I think we can pretend I just executed one query and collected profiles as follows: profiles.zip
The query pods first spend ~900MB encoding the request bodies for the /getdata calls, then the rest decoding responses from the read pods and generating the final output. Nothing too unexpected here, though I hadn't considered the cost of the bodies for /getdata before.
the read pod profile is pretty clear:
(pprof) top30 -cum
Showing nodes accounting for 1323.76MB, 94.98% of 1393.78MB total
Dropped 112 nodes (cum <= 6.97MB)
Showing top 30 nodes out of 45
flat flat% sum% cum cum%
3MB 0.22% 0.22% 1245MB 89.33% github.com/grafana/metrictank/api.(*Server).getTarget
0 0% 0.22% 1245MB 89.33% github.com/grafana/metrictank/api.(*Server).getTargetsLocal.func1
0 0% 0.22% 1241.50MB 89.07% github.com/grafana/metrictank/api.(*Server).getSeriesFixed
1241.50MB 89.07% 89.29% 1241.50MB 89.07% github.com/grafana/metrictank/pointslicepool.(*PointSlicePool).GetMin
0 0% 89.29% 1239.95MB 88.96% github.com/grafana/metrictank/api.Fix
(...)
ROUTINE ======================== github.com/grafana/metrictank/api.Fix in /go/src/github.com/grafana/metrictank/api/dataprocessor.go
0 1.21GB (flat, cum) 88.96% of Total
. . 74: // the requested range is too narrow for the requested interval
. . 75: return []schema.Point{}
. . 76: }
. . 77:
. . 78: neededCap := int((last-first)/interval + 1)
. 1.21GB 79: out := pointSlicePool.GetMin(neededCap)
. . 80: out = out[:neededCap]
. . 81:
. . 82: // i iterates in. o iterates out. t is the ts we're looking to fill.
. . 83: for t, i, o := first, 0, -1; t <= last; t += interval {
. . 84: o += 1
(pprof) list GetMin
Total: 1.36GB
ROUTINE ======================== github.com/grafana/metrictank/pointslicepool.(*PointSlicePool).GetMin in /go/src/github.com/grafana/metrictank/pointslicepool/pointslicepool.go
1.21GB 1.21GB (flat, cum) 89.07% of Total
. . 43: return make([]schema.Point, 0, minCap)
. . 44: }
. . 45: // even if our caller needs a smaller cap now, we expect they will put it back in the pool
. . 46: // so it can later be reused.
. . 47: // may as well allocate a size now that we expect will be more useful down the road.
1.21GB 1.21GB 48: return make([]schema.Point, 0, p.defaultSize)
. . 49:}
While we only need a cap of 12, we allocate a cap of 2000! At 12B per point that's 24kB per series, or about 1.8GB per read pod on average (pre-overhead)!
This is a pretty interesting case. We have the max-points-per-req settings, which allow us to control how much space we allocate for points across the entire request (regardless of how many series are used), but this is some kind of loophole: once you query enough series that the number of points per series is small, we lose control over the amount of space allocated, because we always allocate slices with a capacity of at least 2000 points.
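For context, here is a simplified, self-contained sketch (not the actual metrictank code, just the shape the profile points at) of how a GetMin like this hands out 2000-cap slices even for 12-point requests:

package main

import (
	"fmt"
	"sync"
)

// Point stands in for schema.Point: a value plus a timestamp.
type Point struct {
	Val float64
	Ts  uint32
}

// pointSlicePool is a simplified stand-in for the real PointSlicePool,
// written only to illustrate the allocation behaviour discussed above.
type pointSlicePool struct {
	defaultSize int // 2000 in the scenario profiled here
	p           sync.Pool
}

// GetMin returns a slice with capacity >= minCap.
func (p *pointSlicePool) GetMin(minCap int) []Point {
	if s, ok := p.p.Get().([]Point); ok && cap(s) >= minCap {
		return s[:0]
	}
	if minCap > p.defaultSize {
		return make([]Point, 0, minCap)
	}
	// even a tiny request gets a defaultSize-cap slice: at 12B per point
	// that's ~24kB per series, which is where the ~1.8GB figure comes from
	return make([]Point, 0, p.defaultSize)
}

func main() {
	pool := &pointSlicePool{defaultSize: 2000}
	fmt.Println(cap(pool.GetMin(12))) // prints 2000, not 12
}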
1) save memory by not using large slices when we only need small ones. In particular, fix #1912.
2) use a default size that corresponds to a capacity that append() actually uses (I believe a size of 800 gets rounded up by make).
(side note: this issue can happen both when requesting very short time ranges - which obviously result in few points - but also longer ones, because either due to the time window being queried, or thanks to max-points-per-req-soft, a rollup may be selected where even a long window still results in few points)
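As a rough, purely illustrative sketch of what direction (1) could look like (leaving out the pool reuse for brevity; the 1/4 threshold is an assumption, not necessarily what the actual fix does):

package main

import "fmt"

// Point stands in for schema.Point: a value plus a timestamp.
type Point struct {
	Val float64
	Ts  uint32
}

// getMinCapped only pads small requests up to defaultSize when they are in the
// same ballpark as the default; truly small requests get roughly what they asked for.
func getMinCapped(defaultSize, minCap int) []Point {
	switch {
	case minCap > defaultSize:
		return make([]Point, 0, minCap)
	case minCap >= defaultSize/4:
		// close enough to the default that handing out default-sized slices still pays off
		return make([]Point, 0, defaultSize)
	default:
		// e.g. 75k requests for 12 points no longer cost 75k slices of cap 2000
		return make([]Point, 0, minCap)
	}
}

func main() {
	fmt.Println(cap(getMinCapped(2000, 12)))   // 12
	fmt.Println(cap(getMinCapped(2000, 1500))) // 2000
}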
Dieterbe changed the title from "queries with high number of series trigger super duper excessive pointslicepool usage" to "queries with high number of series trigger excessive pointslicepool usage" on Oct 6, 2020