Fluentd pod crashing on Azure Container Service #6

Open · Cryptophobia opened this issue Mar 20, 2018 · 10 comments

@Cryptophobia

From @sbulman on August 5, 2017 7:27

Hi All,

I'm following the instructions to set up Deis on Azure Container Service. One of the deis-logger-fluentd pods is crashing with the following log.

2017-08-05 07:21:26 +0000 [info]: reading config file path="/opt/fluentd/conf/fluentd.conf"
2017-08-05 07:22:27 +0000 [error]: config error file="/opt/fluentd/conf/fluentd.conf" error_class=Fluent::ConfigError error="Invalid Kubernetes API v1 endpoint https://10.0.0.1:443: Timed out connecting to server"

Any ideas?

Thanks.

Copied from original issue: deis/workflow#847

@Cryptophobia

From @sbulman on August 5, 2017 7:48

A bit more info. I created the ACS cluster with 1 agent. The fluentd pod that is crashing is on the master node. The pod running on the agent appears to be working fine.

@Cryptophobia

From @henrydecovo on September 25, 2017 19:44

We're facing the same issue, same symptoms and circumstances as @sbulman.
The fluentd logger pod continually crashes on the master node on Azure ACS.

@Cryptophobia

From @bacongobbler on September 25, 2017 19:50

There should not be a fluentd pod running on the master node. There was an open ticket about DaemonSet pods being accidentally scheduled on the Kubernetes master node, which was eventually solved upstream.

More background context is in this ticket, which was resolved in Kubernetes 1.5.0+ via kubernetes/kubernetes#35526.

@Cryptophobia

From @henrydecovo on September 25, 2017 21:36

Ok, thanks @bacongobbler for the context. It still appears to be an issue though on ACS today. Any thoughts much appreciated!

The fluentd logger pod event for the master node indicates the following error:

Error syncing pod, skipping: failed to "StartContainer" for "deis-logger-fluentd" with CrashLoopBackOff: "Back-off 10s restarting failed container=deis-logger-fluentd pod=deis-logger-fluentd-swjnl_deis

K8S versions (client and Azure Container Service):

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-14T06:55:55Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

Deis version 2.18.0

The fluentd pod is definitely running on the master node on ACS, as shown by the event logs; in this case it was created by k8s-master-47933ef9-0.

@Cryptophobia

From @monaka on December 25, 2017 7:54

I also got the same issue on my K8s/CoreOS cluster.
It's not on ACS, but it might have the same root cause.

In my case, it was fixed by adding the option --register-with-taints=node-role.kubernetes.io/master=true:NoSchedule to hyperkube.

The unschedulable field of a node is not respected by the DaemonSet controller.
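
For reference, a minimal sketch of the taint that flag registers on the node object (the node name is hypothetical); a DaemonSet pod then needs a matching toleration to land on this node:

apiVersion: v1
kind: Node
metadata:
  name: k8s-master-0   # hypothetical node name
spec:
  taints:
  - key: node-role.kubernetes.io/master
    value: "true"
    effect: NoSchedule   # pods without a matching toleration are not scheduled here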

@Jaipreet95

I have the master tainted with the same flag but I'm still facing the same issue. Is there any workaround for this?

@Cryptophobia

@Jaipreet95 : Have you tried adding a toleration on the fluentd daemonset so that it is allowed to schedule onto the tainted master nodes?

Something like the tolerations field in the pod spec below:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      # "Exists" matches the master taint regardless of its value, so this also
      # covers a taint registered as node-role.kubernetes.io/master=true:NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
...
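
Note the operator: Exists field: with the default Equal operator and no value, the toleration would only match a taint that was registered without a value, not the node-role.kubernetes.io/master=true taint from the earlier comment.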

@Jaipreet95 commented May 29, 2018

A taint allows a node to refuse pods unless they have a matching toleration.
I want to schedule the pod on the master nodes to collect Kubernetes-specific logs, so I added the matching toleration to the fluentd daemonset.

@PaulCharlton

Is it a bug in fluentd? Assuming that the k8s API is at 10.0.0.1 (the first address in the pod CIDR range) is not a reasonable assumption – shouldn't it reach the master node by its local DNS name instead? (See the sketch after this list.)

  1. On ACS, the vnet CIDR may encompass a much broader range than is available for pods.
  2. On ACS, the master node is allocated near the end of the node CIDR range, not at the start, so it will definitely not be at "x.x.x.1".
  3. On ACS, the first available IP for a node is "x.x.x.5" from the range available for nodes.
  4. On ACS, with the "Azure CNI" networking plugin, pods are allocated IP addresses from the same CIDR range as the nodes.
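
If the hardcoded endpoint is indeed the problem, one possible workaround is to point the fluentd kubernetes metadata plugin at the API server's cluster DNS name instead of an IP. A minimal sketch, assuming the image reads /opt/fluentd/conf/fluentd.conf from a mountable ConfigMap (the ConfigMap name and filter block are illustrative; kubernetes_url is a real parameter of fluent-plugin-kubernetes_metadata_filter):

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-conf   # hypothetical name
  namespace: deis
data:
  fluentd.conf: |
    <filter kubernetes.**>
      @type kubernetes_metadata
      # resolve the API server through cluster DNS rather than a hardcoded IP
      kubernetes_url https://kubernetes.default.svc
    </filter>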

@Cryptophobia

@PaulCharlton , we are currently not doing any testing on ACS and have no way to really verify this. We would have to deploy a k8s cluster and deploy Hephy Workflow on top. This sounds like an issue where the fluentd pods expect the k8s API to be at 10.0.0.1 on the cluster's internal CIDR range.

If I were to guess from the information we have, it is most likely failing because of number 2:

On ACS, the master node is allocated near the end of the node CIDR range, not at the start, so it will definitely not be at "x.x.x.1"
