runtime error: slice bounds out of range #14345

Open
ohtyap opened this issue Oct 1, 2024 · 8 comments


ohtyap commented Oct 1, 2024

Describe the bug
When querying certain timeframes within Loki, loki-read crashes (see stack trace output below). Sadly, I do not have more info than the stack trace; I was not able to see any pattern for when this happens. It happens for different labels (e.g. completely different types of logs), different times, etc.
Any idea how to debug this further?

To Reproduce
I am not able to deliberately reproduce the issue ... it just happens from time to time (and basically completely breaks the usage of Loki).

Expected behavior
loki not crashing 😅

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm

Screenshots, Promtail config, or terminal output


2024-10-01 16:12:17.196

panic: runtime error: slice bounds out of range [:1360462] with capacity 1319365

goroutine 11486 [running]:
github.com/grafana/loki/v3/pkg/chunkenc.newByteChunk({0xc039b8019a, 0x141fc5, 0x1421c5}, 0x0, 0x0, 0x0)
    /src/loki/pkg/chunkenc/memchunk.go:479 +0xd9e
github.com/grafana/loki/v3/pkg/chunkenc.NewByteChunk(...)
    /src/loki/pkg/chunkenc/memchunk.go:392
github.com/grafana/loki/v3/pkg/chunkenc.(*Facade).UnmarshalFromBuf(0xc00e86f4a0, {0xc039b8019a?, 0x35a2f00?, 0x4da0e40?})
    /src/loki/pkg/chunkenc/facade.go:64 +0x34
github.com/grafana/loki/v3/pkg/storage/chunk.(*Chunk).Decode(0xc00f757b10, 0xc00ab8dd38, {0xc039b80000, 0x14215f, 0x14235f})
    /src/loki/pkg/storage/chunk/chunk.go:359 +0x596
github.com/grafana/loki/v3/pkg/storage/chunk/client.(*client).getChunk(_, {_, _}, _, {{0x33f196330e3b21ab, {0xc00b1cf338, 0x4}, 0x19247502928, 0x1924761e382, 0x2e112952}, ...})
    /src/loki/pkg/storage/chunk/client/object_client.go:187 +0x5ca
github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks.func2()
    /src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:48 +0x1de
created by github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks in goroutine 11537
    /src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:45 +0x485
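
For reference, a panic of this shape usually means that a length decoded from the chunk data itself is larger than the buffer that was actually fetched, so the re-slice blows up. A minimal, hypothetical Go sketch of that failure mode (this is not Loki's actual decoding code, and the 4-byte length prefix layout is an assumption for illustration only):

// Sketch only: shows how trusting a stored length can produce exactly this panic.
package main

import "encoding/binary"

// decodeSection assumes a made-up layout: a 4-byte big-endian length prefix
// followed by that many bytes of payload.
func decodeSection(b []byte) []byte {
    n := binary.BigEndian.Uint32(b) // length taken from (possibly corrupt or truncated) data
    // If the stored length (e.g. 1360462) exceeds cap(b) (e.g. 1319365),
    // this re-slice panics with "slice bounds out of range".
    return b[:n]
}

func main() {
    buf := make([]byte, 16) // stand-in for a chunk that is smaller than its header claims
    binary.BigEndian.PutUint32(buf, 1360462)
    _ = decodeSection(buf) // panics: slice bounds out of range [:1360462] with capacity 16
}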
Jayclifford345 (Contributor) commented

Hi @ohtyap, can you provide some more information for the team about your deployment:

  • Version
  • Deployment type
  • The query
  • Range of data
  • Loki config

That should help us out.


ohtyap commented Oct 2, 2024

@Jayclifford345 thanks a lot, sure, here is the info:

Version: We encountered the problem with Loki 3.1.0; in the meantime we upgraded to 3.1.1 (it happens there too). But I am not sure if it is really a 3.1.x problem.

Deployment Type: Simple Scalable (on AWS with S3 storage)

The query and range of data: It happens every time Loki tries to retrieve the affected log lines/chunks 🤷 Basically a {pod="xyz"} is enough to trigger the error if an "infected" timeframe is selected. In some cases we were able to narrow the affected timeframe down to a few seconds. If someone selects such a timeframe (even just 1 sec of it), the crash happens.

The loki config:

loki:
  server:
    grpc_server_max_recv_msg_size: 1.048576e+08
    grpc_server_max_send_msg_size: 1.048576e+08
  storage:
    type: 's3'
    s3:
      region: XYZ
    bucketNames:
      chunks: XYZ
      ruler: XYZ
      admin: XYZ
  limits_config:
    max_line_size: 512KB
    max_query_length: 0
    ingestion_rate_mb: 24
    ingestion_burst_size_mb: 32
    reject_old_samples_max_age: 30d
    split_queries_by_interval: 1h
    query_timeout: 10m
    tsdb_max_query_parallelism: 100
    retention_period: 90d
    retention_stream:
      - selector: '{channel="XYZ"}'
        period: 372d
        priority: 100

  schemaConfig:
    configs:
      - from: 2024-04-30
        store: tsdb
        object_store: aws
        schema: v13
        index:
          prefix: index_
          period: 24h
  querier:
    max_concurrent: 16

  query_scheduler:
    max_outstanding_requests_per_tenant: 3276

  compactor:
    retention_enabled: true
    working_directory: /retention
    delete_request_store: s3
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    compaction_interval: 10m

  ingester:
    chunk_encoding: snappy
    autoforget_unhealthy: true


axelbodo commented Oct 4, 2024

I can deterministically reproduce this, and I get the following error in all queriers right before the crash loop:
panic: runtime error: slice bounds out of range [:837727] with capacity 760512

goroutine 12813 [running]:
github.com/grafana/loki/v3/pkg/chunkenc.newByteChunk({0xc02f7001f8, 0xb98c0, 0xb9ac0}, 0x0, 0x0, 0x0)
/src/loki/pkg/chunkenc/memchunk.go:479 +0xdbe
github.com/grafana/loki/v3/pkg/chunkenc.NewByteChunk(...)
/src/loki/pkg/chunkenc/memchunk.go:392
github.com/grafana/loki/v3/pkg/chunkenc.(*Facade).UnmarshalFromBuf(0xc02f04a2d0, {0xc02f7001f8?, 0x3260580?, 0x4944280?>
/src/loki/pkg/chunkenc/facade.go:64 +0x34
github.com/grafana/loki/v3/pkg/storage/chunk.(*Chunk).Decode(0xc02d1b7b10, 0xc00079d570, {0xc02f700000, 0xb9ab8, 0xb9cb>
/src/loki/pkg/storage/chunk/chunk.go:359 +0x5a2
github.com/grafana/loki/v3/pkg/storage/chunk/client.(*client).getChunk(_, {_, _}, _, {{0x4246a33a92bd0a8b, {0xc02f04c96>
/src/loki/pkg/storage/chunk/client/object_client.go:187 +0x34f
github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks.func2()
/src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:48 +0x1c4
created by github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks in goroutine 12716
/src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:45 +0x485

sum(count_over_time(
  {service_name="logstash-loki-pipeline", loki_log_group="k8s-pod"}
    | logfmt
  [1m]
)) by (namespace)

It happens on a 16-hour range and on an 8-hour range as well, but not on a 4-hour range; maybe it is not the range itself that counts, but the data in the extra range.

The version we use: b4f7181 (HEAD, tag: v3.0.0)

Jayclifford345 (Contributor) commented

Thank you all for the extra information. I will raise this at the Loki engineer call tomorrow.


axelbodo commented Oct 9, 2024

Thank you!
I narrowed the query range down to 1s and the over_time interval to 10ms, and some of the queriers produced this panic. However, when just 1 or 2 of them are panicking, Grafana doesn't show "context cancelled" on the UI, so it surely relates to some chunks in that interval.
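
For illustration, the kind of range narrowing described above can be scripted as a bisection. The sketch below is hypothetical: queryPanics is just a placeholder callback (e.g. something that runs the query against that window and checks whether a querier panicked), not a real Loki API, and it assumes a single offending window.

package main

import (
    "fmt"
    "time"
)

// bisect halves the [from, to) window until it is at most minWidth wide,
// keeping whichever half still triggers the panic.
func bisect(from, to time.Time, queryPanics func(from, to time.Time) bool, minWidth time.Duration) (time.Time, time.Time) {
    for to.Sub(from) > minWidth {
        mid := from.Add(to.Sub(from) / 2)
        if queryPanics(from, mid) {
            to = mid // offending chunk is in the first half
        } else {
            from = mid // otherwise assume it is in the second half
        }
    }
    return from, to
}

func main() {
    start := time.Date(2024, 10, 1, 0, 0, 0, 0, time.UTC)
    end := start.Add(16 * time.Hour)
    // Placeholder check: pretend the first half always panics.
    check := func(from, to time.Time) bool { return true }
    f, t := bisect(start, end, check, time.Second)
    fmt.Println("offending window:", f, "to", t)
}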


ohtyap commented Oct 11, 2024

@Jayclifford345 Is there anything we can help with, like providing additional information or something specific to check?

Jayclifford345 (Contributor) commented

Hi @ohtyap, sorry for the late reply; we sadly didn't have the engineer call this week since it's focus week. I will make sure the team is aware on Monday to take a look.


ohtyap commented Oct 11, 2024

@Jayclifford345 No worries; I suspected it would be hard to reproduce. So please ping me in case we can check or try something, or help out on this one in any other way. But I hope there is a solution or a fix, as otherwise Loki is sadly not usable for us (because this bug makes usage quite unreliable: broken dashboards and alarms, etc.) 🤞
