Promtail missing logs when push latency is high on kubernetes #14462

Open
liam-howe-maersk opened this issue Oct 11, 2024 · 0 comments
Describe the bug
We have observed that when the latency of pushing logs from Promtail to Loki is high and Promtail is hosted on Kubernetes, logs can be dropped; it appears that log files on the Kubernetes node are skipped entirely, leading to gaps in the logs. When running an OpenTelemetry Collector on the same cluster instead, we observe that all expected logs are pushed to Loki.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki (3.1.1)
  2. Started Promtail (3.1.1) DaemonSet on a k8s cluster to tail k8s pods using kubernetes_sd_configs
  3. Push latency must be high; in our case this is due to crossing regions and it averages 1 second per push, calculated using the following query:
sum(rate(promtail_request_duration_seconds_sum[$__rate_interval]))
/ 
sum(rate(promtail_request_duration_seconds_count[$__rate_interval]))
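As a cross-check on the Promtail side, its client metrics can indicate whether entries are being reported as dropped after retries are exhausted, or whether they simply never show up as sent. A rough sketch, with metric names taken from Promtail's documented observability metrics (exact availability may vary by version):

# entries Promtail gave up on after exhausting retries
sum(rate(promtail_dropped_entries_total[$__rate_interval]))

# entries Promtail successfully delivered
sum(rate(promtail_sent_entries_total[$__rate_interval]))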

Expected behavior
All logs should be pushed to Loki

Environment:

  • Infrastructure: Kubernetes v1.29.7
  • Deployment tool: Applying k8s resources

Screenshots, Promtail config, or terminal output

Promtail config:
server:
  log_level: debug
  log_format: logfmt
  http_listen_port: 3101

clients:
  - url: http://loki-instance/loki/api/v1/push
    oauth2:
      client_id: ${OAUTH2_CLIENT_ID}
      client_secret: ${OAUTH2_CLIENT_SECRET}
      token_url: https://login.microsoftonline.com/.../oauth2/v2.0/token
      scopes:
        - "scope.default"
    external_labels:
      k8s_cluster: my-cluster
      env: prod
      provider: azure
    backoff_config:
      min_period: 1s
      max_period: 2m
      max_retries: 10

positions:
  filename: /run/promtail/positions.yaml

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - app-namespace
            - platform-monitoring
    pipeline_stages:
      - cri: {}
      - labeldrop:
          - filename
          - stream
          - level
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_pod_name
        regex: (app-namespace.+|platform-monitoring;promtail-liam-test-logs.+)
        action: keep
      - source_labels:
          - __meta_kubernetes_pod_controller_name
        regex: ([0-9a-z-.]+?)(-[0-9a-f]{8,10})?
        action: replace
        target_label: __tmp_controller_name
      - source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_name
          - __meta_kubernetes_pod_label_app
          - __tmp_controller_name
          - __meta_kubernetes_namespace
        regex: ^;*([^;]+)(;.*)?$
        action: replace
        target_label: app
      - source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_component
          - __meta_kubernetes_pod_label_component
        regex: ^;*([^;]+)(;.*)?$
        action: replace
        target_label: component
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: node_name
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - action: replace
        replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
      - action: replace
        regex: true/(.*)
        replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_annotationpresent_kubernetes_io_config_hash
          - __meta_kubernetes_pod_annotation_kubernetes_io_config_hash
          - __meta_kubernetes_pod_container_name
        target_label: __path__
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_label_env
          - __meta_kubernetes_pod_label_environment
        regex: ^;*([^;]+)(;.*)?$
        target_label: env

tracing:
  enabled: false

We have tested this by deploying an application to the cluster that prints a log line and increments a counter metric every time it receives an HTTP request. We would therefore expect the count of log lines and the increase in the counter metric to be roughly equal. When we run a load test, however, the count of logs shipped via Promtail is far lower. For comparison we also deployed OpenTelemetry Collectors to the cluster; the chart below shows, during a load test, the count of HTTP requests as measured by the counter increase, the Promtail log count, and the OpenTelemetry Collector log count.

[Image: HTTP request count (counter increase) vs. Promtail log count vs. OpenTelemetry Collector log count during the load test]

As you can see, the count for Promtail is far lower. At points it flatlines, which I believe is when log files on the k8s node are being missed: push latency is high and Promtail is still busy processing earlier log files. The OpenTelemetry Collector does not seem to have this problem; I would expect Promtail to also buffer the newer log files while it struggles to push the older ones.
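For reference, the comparison in the chart can be approximated with queries roughly like the following. The metric name http_requests_total and the app label value are placeholders for our test application's counter and stream labels, not the exact names:

# HTTP requests handled by the test app, from its own Prometheus counter (placeholder metric name)
sum(increase(http_requests_total{app="load-test-app"}[1m]))

# log lines ingested into Loki for the same app (LogQL metric query)
sum(count_over_time({app="load-test-app"} [1m]))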

For further evidence that high latency is the issue: we have another cluster where latency is generally low, and we do not observe log loss there. However, we recently had some instability with our Loki ingestion and saw push latency increase; below you can see that exactly when latency spikes there is a gap in the application's logs.

[Images: push latency spike and the corresponding gap in the application's logs]

In order to identify which log lines were missing after a load test, I gained access to the Kubernetes node to look at the log files directly. There I could see a pattern that looked like this:

-rw-r----- 1 root root 15314989 Oct 11 09:20 0.log
-rw-r--r-- 1 root root  3388094 Oct 11 07:50 0.log.20241011-075001.gz
-rw-r--r-- 1 root root  3612170 Oct 11 07:51 0.log.20241011-075031.gz
-rw-r--r-- 1 root root  3342465 Oct 11 07:51 0.log.20241011-075102.gz
-rw-r----- 1 root root 62686779 Oct 11 07:51 0.log.20241011-075132

None of the log lines in 0.log.20241011-075031.gz were available in Loki, while I could see log lines from the other files. This further indicates that it is entire log files that are being missed during these high-latency periods.
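To line the gap up against the rotated file, a LogQL metric query over the time window covered by 0.log.20241011-075031.gz (between the adjacent rotation timestamps) can be run in Grafana Explore; the stream selector below is a placeholder for the affected pod, not the exact labels:

# should drop to near zero across the window covered by the missing rotated file
sum(count_over_time({namespace="app-namespace", pod="<affected-pod>"} [30s]))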
