Describe the bug
We have observed that when the latency of pushing logs from Promtail to Loki is high and Promtail is hosted on Kubernetes, logs can be dropped: entire log files on the Kubernetes node appear to be skipped, leading to gaps in the logs. When we instead run an OpenTelemetry Collector on the same cluster, all expected logs are pushed to Loki.
To Reproduce
Steps to reproduce the behavior:
1. Started Loki (3.1.1)
2. Started a Promtail (3.1.1) DaemonSet on a Kubernetes cluster to tail pod logs using kubernetes_sd_configs
3. Push latency must be high; in our case pushes cross regions and latency averages 1 second per push (calculated from Promtail's request metrics)

Expected behavior
All logs should be pushed to Loki.

Environment:
Loki 3.1.1; Promtail 3.1.1 running as a DaemonSet on Kubernetes (see reproduction steps above).

Screenshots, Promtail config, or terminal output
Click here for Promtail config
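For reference, a minimal sketch of the kind of scrape config in use; the actual config is collapsed above, so the URL, labels, and paths here are illustrative, not our exact setup:

# Illustrative Promtail config sketch - not the exact config from this cluster.
server:
  http_listen_port: 9080
positions:
  filename: /run/promtail/positions.yaml
clients:
  # Hypothetical cross-region Loki endpoint (the source of the high push latency)
  - url: https://loki.example.com/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Standard mapping from the discovered pod to its log files on the node
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log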
We have tested this by deploying an application to the cluster that prints a log line and increments a counter metric every time it receives an HTTP request. We would therefore expect the count of log lines and the increase in the counter metric to be roughly equal. When we run a load test, however, the count of logs ingested via Promtail is far lower. For comparison we have also deployed OpenTelemetry Collectors to the cluster. The graph below shows, during a load test, the count of HTTP requests as measured by the counter increase, the Promtail log count, and the OpenTelemetry Collector log count.
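The comparison was made with queries along these lines (the metric name and app label belong to our hypothetical test app, not to Promtail itself):

# PromQL: HTTP requests according to the app's counter
sum(increase(http_requests_total{app="load-test-app"}[5m]))

# LogQL: log lines for the same app as ingested through Promtail
sum(count_over_time({app="load-test-app"}[5m]))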
As you can see, the count for Promtail is far lower. At points it flatlines, and I believe this is when log files on the Kubernetes node are being missed: push latency is high and Promtail is still trying to process previous log files. The OpenTelemetry Collector does not seem to have this problem. I would expect Promtail to also buffer the newer log files while it struggles to push the older ones.
For further evidence that high latency is the issue: we have another cluster where latency is generally low, and we do not observe log loss there. Recently, however, we had some instability with our Loki ingestion and push latency increased; you can see that exactly when latency spikes there is a gap in the application's logs.
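The push latency referred to above is derived from Promtail's client metrics, roughly like this (assuming the promtail_request_duration_seconds histogram; adjust if your metric names differ):

# Average seconds per push over the last 5 minutes
sum(rate(promtail_request_duration_seconds_sum[5m]))
  /
sum(rate(promtail_request_duration_seconds_count[5m]))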
To identify which log lines were missing after a load test, I gained access to the Kubernetes node to look at the log files directly. There I could see a pattern like this:
-rw-r----- 1 root root 15314989 Oct 11 09:20 0.log
-rw-r--r-- 1 root root 3388094 Oct 11 07:50 0.log.20241011-075001.gz
-rw-r--r-- 1 root root 3612170 Oct 11 07:51 0.log.20241011-075031.gz
-rw-r--r-- 1 root root 3342465 Oct 11 07:51 0.log.20241011-075102.gz
-rw-r----- 1 root root 62686779 Oct 11 07:51 0.log.20241011-075132
None of the log lines in 0.log.20241011-075031.gz seemed to be available in Loki, while I could see log lines from the others. This further supports the conclusion that entire log files are being skipped during these high-latency periods.
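A per-file LogQL aggregation over Promtail's default filename label (the pod selector here is hypothetical) makes this easy to check: the rotated file in question returns no lines at all.

# Lines ingested per source file during the load-test window
sum by (filename) (count_over_time({pod="my-app"}[3h]))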