Fluentd pod crashing on Azure Container Service #6

Open · Cryptophobia opened this issue Mar 20, 2018 · 10 comments

@Cryptophobia

From @sbulman on August 5, 2017 7:27

Hi All,

I'm following the instructions to set up Deis on Azure Container Service. One of the deis-logger-fluentd pods is crashing with the following log.

2017-08-05 07:21:26 +0000 [info]: reading config file path="/opt/fluentd/conf/fluentd.conf"
2017-08-05 07:22:27 +0000 [error]: config error file="/opt/fluentd/conf/fluentd.conf" error_class=Fluent::ConfigError error="Invalid Kubernetes API v1 endpoint https://10.0.0.1:443: Timed out connecting to server"

Any ideas?

Thanks.

Copied from original issue: deis/workflow#847

@Cryptophobia

From @sbulman on August 5, 2017 7:48

A bit more info. I created the ACS cluster with 1 agent. The fluentd pod that is crashing is on the master node. The pod running on the agent appears to be working fine.

@Cryptophobia

From @henrydecovo on September 25, 2017 19:44

We're facing the same issue, same symptoms and circumstances as @sbulman.
The fluentd logger pod continually crashes on the master node on Azure ACS.

@Cryptophobia

From @bacongobbler on September 25, 2017 19:50

There should not be a fluentd pod running on the master node. There was an open ticket about DaemonSet pods being accidentally scheduled on the Kubernetes master node, which was eventually solved upstream.

More background context is in this ticket, which was resolved in Kubernetes 1.5.0+ via kubernetes/kubernetes#35526.

@Cryptophobia

From @henrydecovo on September 25, 2017 21:36

Ok, thanks @bacongobbler for the context. It still appears to be an issue though on ACS today. Any thoughts much appreciated!

The fluentd logger pod event for the master node indicates the following error:

Error syncing pod, skipping: failed to "StartContainer" for "deis-logger-fluentd" with CrashLoopBackOff: "Back-off 10s restarting failed container=deis-logger-fluentd pod=deis-logger-fluentd-swjnl_deis

K8S versions (client and Azure Container Service):

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-14T06:55:55Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

Deis version 2.18.0

The fluentd pod is definitely running on the master node on ACS, as shown by the event logs; in this case it was created by k8s-master-47933ef9-0.

@Cryptophobia

From @monaka on December 25, 2017 7:54

I also got the same issue on my K8s/CoreOS cluster.
It's not on ACS, but it might have the same root cause.

In my case, it was fixed by adding the option --register-with-taints=node-role.kubernetes.io/master=true:NoSchedule to hyperkube.

The unschedulable field of a node is not respected by the DaemonSet controller.
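
For reference, a minimal sketch of the taint that flag registers on the node object (the node name is hypothetical); a DaemonSet pod then needs a matching toleration to land on this node:

apiVersion: v1
kind: Node
metadata:
  name: k8s-master-0   # hypothetical node name
spec:
  taints:
  - key: node-role.kubernetes.io/master
    value: "true"
    effect: NoSchedule   # pods without a matching toleration are not scheduled here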

@Jaipreet95

I have the master tainted with the same flag but I'm still facing the same issue. Is there any workaround for this?

@Cryptophobia

@Jaipreet95 : Have you tried adding a toleration on the fluentd daemonset so that it is allowed to schedule onto the tainted master nodes?

Something like the tolerations field in the pod spec below:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      # "Exists" matches the master taint regardless of its value, so this also
      # covers a taint registered as node-role.kubernetes.io/master=true:NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
...
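
Note the operator: Exists field: with the default Equal operator and no value, the toleration would only match a taint that was registered without a value, not the node-role.kubernetes.io/master=true taint from the earlier comment.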

@Jaipreet95 commented May 29, 2018

A taint allows a node to refuse pods unless they have a matching toleration.
I want to schedule the pod on the master nodes to collect Kubernetes-specific logs, so I added the matching toleration to the fluentd daemonset.

@PaulCharlton

Is it a bug in fluentd? Assuming that the k8s API is at 10.0.0.1 (the first address in the pod CIDR range) is not a reasonable assumption – shouldn't it reach the master node by its local DNS name instead? (See the sketch after this list.)

  1. On ACS, the vnet CIDR may encompass a much broader range than is available for pods.
  2. On ACS, the master node is allocated near the end of the node CIDR range, not at the start, so it will definitely not be at "x.x.x.1".
  3. On ACS, the first available IP for a node is "x.x.x.5" from the range available for nodes.
  4. On ACS, with the "Azure CNI" networking plugin, pods are allocated IP addresses from the same CIDR range as the nodes.
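
If the hardcoded endpoint is indeed the problem, one possible workaround is to point the fluentd kubernetes metadata plugin at the API server's cluster DNS name instead of an IP. A minimal sketch, assuming the image reads /opt/fluentd/conf/fluentd.conf from a mountable ConfigMap (the ConfigMap name and filter block are illustrative; kubernetes_url is a real parameter of fluent-plugin-kubernetes_metadata_filter):

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-conf   # hypothetical name
  namespace: deis
data:
  fluentd.conf: |
    <filter kubernetes.**>
      @type kubernetes_metadata
      # resolve the API server through cluster DNS rather than a hardcoded IP
      kubernetes_url https://kubernetes.default.svc
    </filter>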

@Cryptophobia

@PaulCharlton , we are currently not doing any testing on ACS and have no way to really verify this. We would have to deploy a k8s cluster and deploy Hephy Workflow on top. This sounds like an issue where the fluentd pods expect the k8s API to be at 10.0.0.1 on the cluster's internal CIDR range.

If I were to guess from the information we have, it is most likely failing because of number 2:

On ACS, the master node is allocated near the end of the node CIDR range, not at the start, so it will definitely not be at "x.x.x.1"
