datapoint for aggregation too far in past #309

Open · eberkut opened this issue Dec 28, 2021 · 0 comments

eberkut commented Dec 28, 2021

  • What version of the operator are you running?

m3db-operator-0.13.0

  • What version of Kubernetes are you running?
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:41:42Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
  • What are you trying to do?

I have deployed an M3DB cluster with the operator. I have a default namespace for live data and an aggregated namespace for long-term storage. My spec is:

apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: m3db-cluster
spec:
  image: quay.io/m3db/m3dbnode:latest
  replicationFactor: 3
  numberOfShards: 128
  isolationGroups:
  - name: group1
    numInstances: 2
    nodeAffinityTerms:
    - key: alpha.eksctl.io/nodegroup-name
      values:
      - prod-opsmon-ng-1a
  - name: group2
    numInstances: 2
    nodeAffinityTerms:
    - key: alpha.eksctl.io/nodegroup-name
      values:
      - prod-opsmon-ng-1b
  - name: group3
    numInstances: 2
    nodeAffinityTerms:
    - key: alpha.eksctl.io/nodegroup-name
      values:
      - prod-opsmon-ng-1c
  namespaces:
  - name: default
    options:
      bootstrapEnabled: true
      flushEnabled: true
      writesToCommitLog: true
      cleanupEnabled: true
      snapshotEnabled: true
      repairEnabled: false
      retentionOptions:
        retentionPeriod: 2160h
        blockSize: 12h
        bufferFuture: 1h
        bufferPast: 2h
        blockDataExpiry: true
        blockDataExpiryAfterNotAccessPeriod: 10m
      indexOptions:
        enabled: true
        blockSize: 12h
      aggregationOptions:
        aggregations:
          - aggregated: false
  - name: longterm
    options:
      bootstrapEnabled: true
      flushEnabled: true
      writesToCommitLog: true
      cleanupEnabled: true
      snapshotEnabled: true
      repairEnabled: false
      retentionOptions:
        retentionPeriod: 9000h
        blockSize: 12h
        bufferFuture: 1h
        bufferPast: 2h
        blockDataExpiry: true
        blockDataExpiryAfterNotAccessPeriod: 30m
      indexOptions:
        enabled: true
        blockSize: 12h
      aggregationOptions:
        aggregations:
          - aggregated: true
            attributes:
              resolution: 10m
              downsampleOptions:
                all: true
  etcdEndpoints:
  - http://etcd.monitoring.svc.cluster.local:2379
  containerResources:
    requests:
      memory: 16Gi
      cpu: '4'
    limits:
      memory: 28Gi
      cpu: '8'
  dataDirVolumeClaimTemplate:
    metadata:
      name: m3db-data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: gp3
      resources:
        requests:
          storage: 10Ti

My namespaces are initializing correctly:

{
  "registry": {
    "namespaces": {
      "default": {
        "bootstrapEnabled": true,
        "flushEnabled": true,
        "writesToCommitLog": true,
        "cleanupEnabled": true,
        "repairEnabled": false,
        "retentionOptions": {
          "retentionPeriodNanos": "7776000000000000",
          "blockSizeNanos": "43200000000000",
          "bufferFutureNanos": "3600000000000",
          "bufferPastNanos": "7200000000000",
          "blockDataExpiry": true,
          "blockDataExpiryAfterNotAccessPeriodNanos": "600000000000",
          "futureRetentionPeriodNanos": "0"
        },
        "snapshotEnabled": true,
        "indexOptions": {
          "enabled": true,
          "blockSizeNanos": "43200000000000"
        },
        "schemaOptions": null,
        "coldWritesEnabled": false,
        "runtimeOptions": null,
        "cacheBlocksOnRetrieve": false,
        "aggregationOptions": {
          "aggregations": [
            {
              "aggregated": false,
              "attributes": null
            }
          ]
        },
        "stagingState": {
          "status": "READY"
        },
        "extendedOptions": null
      },
      "longterm": {
        "bootstrapEnabled": true,
        "flushEnabled": true,
        "writesToCommitLog": true,
        "cleanupEnabled": true,
        "repairEnabled": false,
        "retentionOptions": {
          "retentionPeriodNanos": "32400000000000000",
          "blockSizeNanos": "43200000000000",
          "bufferFutureNanos": "3600000000000",
          "bufferPastNanos": "7200000000000",
          "blockDataExpiry": true,
          "blockDataExpiryAfterNotAccessPeriodNanos": "1800000000000",
          "futureRetentionPeriodNanos": "0"
        },
        "snapshotEnabled": true,
        "indexOptions": {
          "enabled": true,
          "blockSizeNanos": "43200000000000"
        },
        "schemaOptions": null,
        "coldWritesEnabled": false,
        "runtimeOptions": null,
        "cacheBlocksOnRetrieve": false,
        "aggregationOptions": {
          "aggregations": [
            {
              "aggregated": true,
              "attributes": {
                "resolutionNanos": "600000000000",
                "downsampleOptions": {
                  "all": true
                }
              }
            }
          ]
        },
        "stagingState": {
          "status": "READY"
        },
        "extendedOptions": null
      }
    }
  }
}

I have configured a couple of test Prometheus environments to remote_write to M3DB.

remote_write:
  - url: https://REDACTED/api/v1/prom/remote/write
    remote_timeout: 30s
    queue_config:
      capacity: 10000
      max_samples_per_send: 3000
      batch_send_deadline: 10s
      min_shards: 4
      max_shards: 200
      min_backoff: 100ms
      max_backoff: 10s

  • What happened?

Source Prometheus logs are filled with remote_write errors:

ts=2021-12-28T17:22:17.510Z caller=dedupe.go:112 component=remote level=error remote_name=8ae741 url=https://REDACTED/api/v1/prom/remote/write msg="non-recoverable error" count=3000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: {\"status\":\"error\",\"error\":\"bad_request_errors: count=58, last=datapoint for aggregation too far in past: off_by=10m17.435332333s, timestamp=2021-12-28T17:10:00Z, past_limit=2021-12-28T17:20:17Z, timestamp_unix_nanos=1640711400000000000, past_limit_unix_nanos=1640712017435332333\"}"
ts=2021-12-28T17:24:58.540Z caller=dedupe.go:112 component=remote level=error remote_name=8ae741 url=https://REDACTED/api/v1/prom/remote/write msg="non-recoverable error" count=3000 exemplarCount=0 err="server returned HTTP status 400 Bad Request: {\"status\":\"error\",\"error\":\"bad_request_errors: count=2, last=datapoint for aggregation too far in past: off_by=1m5.057778422s, timestamp=2021-12-28T17:21:53Z, past_limit=2021-12-28T17:22:58Z, timestamp_unix_nanos=1640712113450000000, past_limit_unix_nanos=1640712178507778422\"}"

And on the M3DB nodes I get "datapoint for aggregation too far in past" errors, with off_by ranging from a few seconds to about 10 minutes.

{"level":"error","ts":1640712238.5445695,"msg":"write error","rqID":"f13324a3-0126-45cd-b05b-d81b8ddc7a15","remoteAddr":"192.168.4.56:37888","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":1,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=5.067515455s, timestamp=2021-12-28T17:21:53Z, past_limit=2021-12-28T17:21:58Z, timestamp_unix_nanos=1640712113450000000, past_limit_unix_nanos=1640712118517515455"}
{"level":"error","ts":1640712257.5282857,"msg":"write error","rqID":"51c82440-a10a-464d-a701-fe4e29dfa196","remoteAddr":"192.168.92.76:26034","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":59,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=10m17.485669938s, timestamp=2021-12-28T17:12:00Z, past_limit=2021-12-28T17:22:17Z, timestamp_unix_nanos=1640711520000000000, past_limit_unix_nanos=1640712137485669938"}
{"level":"error","ts":1640712267.5369058,"msg":"write error","rqID":"7da4065e-50db-48e4-afcb-3d721cad7d93","remoteAddr":"192.168.55.139:51756","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":2,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=42.383129547s, timestamp=2021-12-28T17:21:45Z, past_limit=2021-12-28T17:22:27Z, timestamp_unix_nanos=1640712105132000000, past_limit_unix_nanos=1640712147515129547"}
{"level":"error","ts":1640712329.0677392,"msg":"write error","rqID":"1a31c55c-b894-4208-a279-4559e9f12ae9","remoteAddr":"192.168.92.76:26034","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":2,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=1m43.902580804s, timestamp=2021-12-28T17:21:45Z, past_limit=2021-12-28T17:23:29Z, timestamp_unix_nanos=1640712105132000000, past_limit_unix_nanos=1640712209034580804"}
{"level":"error","ts":1640712387.462137,"msg":"write error","rqID":"72c66e32-cfe1-4d65-b999-5050cd3c0f93","remoteAddr":"192.168.92.76:26032","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":2,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=2m42.280133072s, timestamp=2021-12-28T17:21:45Z, past_limit=2021-12-28T17:24:27Z, timestamp_unix_nanos=1640712105132000000, past_limit_unix_nanos=1640712267412133072"}
{"level":"error","ts":1640712387.5179935,"msg":"write error","rqID":"c3bf9e0b-b8c0-4871-b61f-bcfcdd45d6b0","remoteAddr":"192.168.92.76:26030","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":1,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=2m42.356698292s, timestamp=2021-12-28T17:21:45Z, past_limit=2021-12-28T17:24:27Z, timestamp_unix_nanos=1640712105132000000, past_limit_unix_nanos=1640712267488698292"}
{"level":"error","ts":1640712387.5640953,"msg":"write error","rqID":"e03314d0-3466-40fa-a968-435cc0094e36","remoteAddr":"192.168.55.139:51756","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":2,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=2m42.410037855s, timestamp=2021-12-28T17:21:45Z, past_limit=2021-12-28T17:24:27Z, timestamp_unix_nanos=1640712105132000000, past_limit_unix_nanos=1640712267542037855"}
{"level":"error","ts":1640712429.2815886,"msg":"write error","rqID":"3204ed52-ece5-45f9-a9ee-203dd75766e6","remoteAddr":"192.168.92.76:26032","httpResponseStatusCode":400,"numRegularErrors":0,"numBadRequestErrors":60,"lastRegularError":"","lastBadRequestErr":"datapoint for aggregation too far in past: off_by=10m9.226459202s, timestamp=2021-12-28T17:15:00Z, past_limit=2021-12-28T17:25:09Z, timestamp_unix_nanos=1640711700000000000, past_limit_unix_nanos=1640712309226459202"}

Checking the Prometheus remote_write metrics, it looks like no samples at all are being sent successfully.

It doesn't seem to be a throughput issue: the configured max shards count is never reached and the remote_write send duration p99 is fairly low. There's no large backlog of samples either. I've also verified that there's no clock synchronization issue on either the Prometheus sources or the M3DB cluster nodes.

(screenshots: prom-remote-1, prom-remote-2, showing Prometheus remote_write metrics)
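
For reference, the shard and send-duration checks above are based on Prometheus's standard remote_write instrumentation metrics, roughly along these lines (metric names as of Prometheus 2.x, adjust if your version differs):

# current vs. configured maximum number of remote_write shards
prometheus_remote_storage_shards
prometheus_remote_storage_shards_max

# p99 duration of sends to the remote endpoint
histogram_quantile(0.99, sum by (le) (rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m])))

# samples failing with non-recoverable errors
rate(prometheus_remote_storage_samples_failed_total[5m])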

I've seen references on GitHub and Slack to the downsample bufferPastLimits setting (e.g. m3db/m3#2355), but I don't see any way to customize this value with the operator.
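
My reading of m3db/m3#2355 is that the coordinator config exposes this as downsample.bufferPastLimits, so a possible (untested) workaround might be to point the operator at a custom config map via spec.configMapName and raise the buffer there. A rough sketch of what I have in mind (key names taken from that PR, not verified against m3db-operator 0.13.0, and the config map name is just a placeholder):

# in the custom m3dbnode config referenced by spec.configMapName
coordinator:
  downsample:
    bufferPastLimits:
      - resolution: 0s
        bufferPast: 90s
      - resolution: 10m
        bufferPast: 15m

# in the M3DBCluster spec
spec:
  configMapName: m3db-cluster-custom-config

If the operator already supports passing this through, pointers welcome.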
