Kubernetes gRPC liveness probes fail #11

Open
akolybelnikov opened this issue Jun 21, 2022 · 8 comments

Comments

@akolybelnikov

After implementing the Deploy Locally chapter, the k8s liveness and readiness probes fail with:

kubectl get pod:

NAME        READY   STATUS             RESTARTS       AGE
proglog-0   0/1     CrashLoopBackOff   12 (19s ago)   32m

kubectl describe pod:

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  30m                  default-scheduler  Successfully assigned default/proglog-0 to kind-control-plane
  Normal   Pulling    30m                  kubelet            Pulling image "busybox"
  Normal   Pulled     30m                  kubelet            Successfully pulled image "busybox" in 3.724847794s
  Normal   Created    30m                  kubelet            Created container proglog-config-init
  Normal   Started    30m                  kubelet            Started container proglog-config-init
  Warning  Unhealthy  29m (x2 over 30m)    kubelet            Readiness probe failed: command "/bin/grpc_health_probe -addr=:8400" timed out
  Warning  Unhealthy  29m                  kubelet            Readiness probe failed: timeout: failed to connect service ":8400" within 1s
  Normal   Killing    29m                  kubelet            Container proglog failed liveness probe, will be restarted
  Warning  Unhealthy  29m (x3 over 30m)    kubelet            Liveness probe failed: command "/bin/grpc_health_probe -addr=:8400" timed out
  Warning  Unhealthy  29m                  kubelet            Readiness probe failed:
  Normal   Started    29m (x3 over 30m)    kubelet            Started container proglog
  Normal   Created    29m (x4 over 30m)    kubelet            Created container proglog
  Normal   Pulled     29m (x4 over 30m)    kubelet            Container image "github.com/akolybelnikov/proglog:0.0.1" already present on machine
  Warning  BackOff    22s (x140 over 29m)  kubelet            Back-off restarting failed container

I tried compiling and deploying the code from the cloned repository with the same result.
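
For reference, the probe command /bin/grpc_health_probe queries the standard gRPC health service, which the book's server registers roughly like this (a minimal sketch, assuming a *grpc.Server named gsrv; not the book's verbatim code). If the container crashes during startup, the probe never reaches this service and simply times out:

package server

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// registerHealth wires up the standard gRPC health service that
// /bin/grpc_health_probe queries. If the server crashes before reaching
// this point, the probe can only time out, as seen in the events above.
func registerHealth(gsrv *grpc.Server) {
	hsrv := health.NewServer()
	hsrv.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(gsrv, hsrv)
}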

@iamnotaturtle

What do the logs say?

kubectl logs proglog-0

If it's a config parsing error then the fix is here: #12

@akolybelnikov
Author

It looks like an error in the container itself:

Defaulted container "proglog" out of: proglog, proglog-config-init (init)
2022-09-25T21:42:18.367Z [INFO]  raft: Initial configuration (index=1): [{Suffrage:Voter ID:proglog-0 Address:proglog-0.proglog.default.svc.cluster.local:8400}]
2022-09-25T21:42:18.367Z [INFO]  raft: Node at 10.244.0.6:8400 [Follower] entering Follower state (Leader: "")
2022-09-25T21:42:19.880Z [WARN]  raft: Heartbeat timeout from "" reached, starting election
2022-09-25T21:42:19.880Z [INFO]  raft: Node at 10.244.0.6:8400 [Candidate] entering Candidate state in term 13
2022-09-25T21:42:19.900Z [DEBUG] raft: Votes needed: 1
2022-09-25T21:42:19.900Z [DEBUG] raft: Vote granted from proglog-0 in term 13. Tally: 1
2022-09-25T21:42:19.900Z [INFO]  raft: Election won. Tally: 1
2022-09-25T21:42:19.900Z [INFO]  raft: Node at 10.244.0.6:8400 [Leader] entering Leader state
2022-09-25T21:42:19.900Z [ERROR] raft: Failed to commit logs: EOF
2022-09-25T21:42:19.900Z [INFO]  raft: Node at 10.244.0.6:8400 [Follower] entering Follower state (Leader: "")
2022-09-25T21:42:21.189Z [WARN]  raft: Heartbeat timeout from "" reached, starting election
2022-09-25T21:42:21.189Z [INFO]  raft: Node at 10.244.0.6:8400 [Candidate] entering Candidate state in term 14
2022-09-25T21:42:21.214Z [DEBUG] raft: Votes needed: 1
2022-09-25T21:42:21.214Z [DEBUG] raft: Vote granted from proglog-0 in term 14. Tally: 1
2022-09-25T21:42:21.214Z [INFO]  raft: Election won. Tally: 1
2022-09-25T21:42:21.214Z [INFO]  raft: Node at 10.244.0.6:8400 [Leader] entering Leader state
2022-09-25T21:42:21.214Z [ERROR] raft: Failed to commit logs: EOF
2022-09-25T21:42:21.214Z [INFO]  raft: Node at 10.244.0.6:8400 [Follower] entering Follower state (Leader: "")
Error: timed out

@bxffour

bxffour commented Jan 21, 2023

Were you able to solve this problem? I've been struggling with it.

@cschar

cschar commented Jul 30, 2023

Deploy Locally chapter fixes so far:

First hurdle: Dockerfile

In the Dockerfile, the grpc_health_probe install step in the repo has a few typos: "https:#" instead of "https://", and no space after -qO.

Fix in the Dockerfile:

# install grpc_health_probe 
RUN GRPC_HEALTH_PROBE_VERSION=v0.3.2 && \
    wget -q -O /go/bin/grpc_health_probe \
    https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64 && \
    chmod +x /go/bin/grpc_health_probe

Also, after building (I had a bunch of clusters running from before), my test cluster was kind-two. I eventually figured out I had to use
kind load docker-image github.com/cschar/proglog:0.0.1 --name kind-two to load the image onto the correct cluster.

Second hurdle: CrashLoopBackOff fix

# initial cryptic message:
└─▪ kubectl logs proglog-0 proglog
Error: While parsing config: yaml: line 3

helm install proglog deploy/proglog
kubectl get pods   # <--- should see CrashLoopBackOff here
kubectl describe pod proglog-0
kubectl logs proglog-0 proglog
kubectl logs proglog-0 proglog-config-init   # <-- here is where you can see the debug echo output set in statefulset.yaml

To get past this I eliminated any multiline-string constructs that might have been causing errors and built the config up step by step.
You can add extra echo statements here to help debug.

Part of proglog/templates/statefulset.yaml:

      initContainers:
      - name: {{ include "proglog.fullname" . }}-config-init
        image: busybox
        imagePullPolicy: IfNotPresent
        command:          
          - /bin/sh
          - -c
          - |-
            ID=$(echo $HOSTNAME | rev | cut -d- -f1 | rev)
            echo "started with $HOSTNAME, and after parsing..."
            echo "got id: $ID"
            cat > /var/run/proglog/config.yaml <<EOD
            #this is a comment in a config file
            data-dir: /var/run/proglog/data
            rpc-port: {{.Values.rpcPort}}
            bind-addr: "$HOSTNAME.proglog.{{.Release.Namespace}}.svc.cluster.local:{{.Values.serfPort}}"
            bootstrap: $([ $ID = 0 ] && echo true || echo false )            
            EOD

            if [ "$ID" -eq 0 ]; then
              echo "The ID variable is equal to 0. skipping setting start-join-addrs"
            else
              echo "The ID variable is equal to $ID."
              echo start-join-addrs: \"proglog-0.proglog.{{.Release.Namespace}}.svc.cluster.local:{{.Values.serfPort}}\" >> /var/run/proglog/config.yaml
            fi

            echo -e "generated file /var/run/proglog/config.yaml \n "
            cat /var/run/proglog/config.yaml

Running kubectl logs proglog-0 proglog-config-init will show the output above.

Third hurdle: the cluster is running, but the internal Raft machinery is breaking

This is similar to where @akolybelnikov's comment above is.

Currently stuck here...

└─▪ kubectl logs proglog-0 proglog
2023-07-30T20:18:44.402Z [INFO]  raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:proglog-0 Address:proglog-0.proglog.default.svc.cluster.local:8400}]"
2023-07-30T20:18:44.402Z [INFO]  raft: entering follower state: follower="Node at 10.244.0.22:8400 [Follower]" leader-address= leader-id=
2023-07-30T20:18:45.465Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2023-07-30T20:18:45.465Z [INFO]  raft: entering candidate state: node="Node at 10.244.0.22:8400 [Candidate]" term=6
2023-07-30T20:18:45.467Z [DEBUG] raft: voting for self: term=6 id=proglog-0
2023-07-30T20:18:45.468Z [DEBUG] raft: calculated votes needed: needed=1 term=6
2023-07-30T20:18:45.468Z [DEBUG] raft: vote granted: from=proglog-0 term=6 tally=1
2023-07-30T20:18:45.468Z [INFO]  raft: election won: term=6 tally=1
2023-07-30T20:18:45.468Z [INFO]  raft: entering leader state: leader="Node at 10.244.0.22:8400 [Leader]"
2023-07-30T20:18:45.468Z [ERROR] raft: failed to commit logs: error=EOF
2023-07-30T20:18:45.468Z [INFO]  raft: entering follower state: follower="Node at 10.244.0.22:8400 [Follower]" leader-address= leader-id=
Error: timed out

I ended up googling "failed to commit logs: error=EOF" and, lo and behold, the first result is about the book!:

https://forum.devtalk.com/t/distributed-services-with-go-unable-to-pass-readiness-liveness-checks-page-210-215/22354/2

The patch in there by varun fixes most issues, and then nicewook has a final patch in his comment to fix GetServers.

Thank you all for contributing to the fixes.

@Gh05ts

Gh05ts commented Dec 4, 2023

Stuck here as well. I'm not very familiar with k8s, so I'm not able to properly understand what's going on: is the expected behavior just one pod, or is the StatefulSet supposed to bring up 3 pods? It seems like only one instance is up (the leader node), and it just retries a few times and times out.

I commented out the probes and also tried giving them very long periods, but none of that changes anything. The issue isn't just about the probes; it seems like either the k8s networking isn't configured right, the application isn't getting configured properly, or the application isn't accessing the disk properly.

With the E2E tests working fine locally, it does seem weird, but if someone figures it out, please let me know as well.

@Gh05ts

Gh05ts commented Dec 5, 2023

Okay, another update: by increasing the wait-for-leader timeout in the code from 3s to 30s, I'm able to get another pod started.
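
For context, the timeout in question is the wait-for-leader loop; the book's DistributedLog has a helper along these lines (a sketch with assumed names, not verbatim; the caller passes in the 3s value that was raised to 30s here):

package dlog

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

// waitForLeader blocks until the local Raft node sees a leader or the
// timeout expires.
func waitForLeader(r *raft.Raft, timeout time.Duration) error {
	timeoutc := time.After(timeout)
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-timeoutc:
			// matches the "Error: timed out" line in the pod logs above
			return fmt.Errorf("timed out")
		case <-ticker.C:
			if r.Leader() != "" {
				return nil
			}
		}
	}
}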

[screenshot]

Now the issue seems to be that the follower is trying to read and failing inside the raft connection.

And the leader is giving this:
2023-12-05T15:38:51.201Z [ERROR] raft: failed to commit logs: error=EOF

@Gh05ts

Gh05ts commented Dec 5, 2023

Okay, I'm back like I never left (I did leave, but I'm back). Anyway, I narrowed it down and found where the issue is happening; well, not quite, I just found where the error is and still need to fix it. The issue is:

[screenshot]

index.size + entWidth = 1036 while the mmap size is 1024, so the index append fails, Raft is not able to store the entry, and the leader is useless, so the followers can't do anything anyway.
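
For context, the failing check is the bounds test in the segment index's Write; a rough sketch of it (names like entWidth, offWidth, and posWidth follow the book's index, but treat the exact code as an assumption):

package log

import (
	"encoding/binary"
	"io"
)

var enc = binary.BigEndian

const (
	offWidth uint64 = 4
	posWidth uint64 = 8
	entWidth        = offWidth + posWidth // 12 bytes per index entry
)

type index struct {
	mmap []byte // the book memory-maps the file; a byte slice behaves the same for this check
	size uint64 // bytes already written to the index
}

func (i *index) Write(off uint32, pos uint64) error {
	// With MaxIndexBytes = 1024 and size already at 1024 (a file that was
	// never truncated on Close), size+entWidth = 1036 > len(i.mmap) = 1024,
	// so Write fails and Raft reports "failed to commit logs: EOF".
	if uint64(len(i.mmap)) < i.size+entWidth {
		return io.EOF
	}
	enc.PutUint32(i.mmap[i.size:i.size+offWidth], off)
	enc.PutUint64(i.mmap[i.size+offWidth:i.size+entWidth], pos)
	i.size += entWidth
	return nil
}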

I'll try to debug how the behavior differs between local and k8s for this to be happening.

@Gh05ts

Gh05ts commented Dec 5, 2023

Okay, I got it working. I don't think what I've done is the full solution, but it gets you over the hump for the time being.

What I observed locally was that the file size was initially 0, but somehow in k8s the initial size of the index file was 1024. Why, I don't know; I never debugged and confirmed it to be 1024 on the first run, only on runs that had already failed, but the previous error does indicate that to be the case. Based on that, I did the following to fix it.

[screenshot]

- /bin/sh
- -c
- |-
  rm -r /var/run/proglog/data
  mkdir -p /var/run/proglog/data
  ID=$(echo $HOSTNAME | rev | cut -d- -f1 | rev)
  cat > /var/run/proglog/config.yaml <<EOD
  data-dir: /var/run/proglog/data
  rpc-port: {{.Values.rpcPort}}
  bind-addr: "$HOSTNAME.proglog.{{.Release.Namespace}}.svc.cluster.local:{{.Values.serfPort}}"
  $([ $ID != 0 ] && echo 'start-join-addrs: "proglog-0.proglog.{{.Release.Namespace}}.svc.cluster.local:{{.Values.serfPort}}"')
  bootstrap: $([ $ID = 0 ] && echo true || echo false)
  EOD

I added an explicit rm and mkdir for the data directory. I didn't debug the old behavior fully, but I think the file was initially being created with size 1024 on the first run and then no writes were happening, hence the crash loop and no progress (I could be wrong). I also don't think this is the ideal solution, because since it's a StatefulSet you want to save state, not clean it on each run; perhaps doing this only on the first run and not on subsequent runs might be a solution? At the very least, it's now running properly on k8s.

Edit: It does not work on subsequent runs (if the rm is removed) even when the first run is successful. As pointed out by #5 and #6, the Raft logStore is never closed, and this leaves the index file at 1024 bytes; it then fails the size check, no appends happen, you get the EOF, and it never runs.
And the health probes need to be disabled as well.
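
For reference, the reason a never-closed store leaves a 1024-byte index behind: the index file is pre-grown to its maximum size before being memory-mapped, and only a clean Close truncates it back to the bytes actually written. A rough sketch of that Close, with names assumed from the book's index:

package log

import "os"

type index struct {
	file *os.File
	size uint64 // bytes actually written; the book also keeps the mmap here
}

// Close syncs and truncates the index file back to its real size. If the
// Raft logStore is never closed, this truncate never runs, the file stays
// at its pre-grown 1024 bytes, and on the next start every Write hits the
// bounds check and returns io.EOF.
func (i *index) Close() error {
	if err := i.file.Sync(); err != nil {
		return err
	}
	if err := i.file.Truncate(int64(i.size)); err != nil {
		return err
	}
	return i.file.Close()
}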

The only proper solution I can think of is to not use the distributed log as the logStore: either just use boltdb as both the logStore and the stableStore and only use the log in the FSM, or properly handle Close in the logStore that is handed to Raft.
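
A minimal sketch of the boltdb option, using hashicorp/raft-boltdb (the helper name, file paths, and surrounding variables are assumptions, not the book's code):

package dlog

import (
	"path/filepath"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb/v2"
)

// newRaft backs Raft's log and stable stores with BoltDB instead of the
// application's distributed log; BoltStore implements both raft.LogStore
// and raft.StableStore.
func newRaft(dataDir string, cfg *raft.Config, fsm raft.FSM,
	snaps raft.SnapshotStore, trans raft.Transport) (*raft.Raft, error) {
	logStore, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft", "log.db"))
	if err != nil {
		return nil, err
	}
	stableStore, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft", "stable.db"))
	if err != nil {
		return nil, err
	}
	return raft.NewRaft(cfg, fsm, logStore, stableStore, snaps, trans)
}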

Edit part 2: Even using boltdb fails for this application for some reason. Perhaps I didn't use it correctly, or I'm missing the initial index write that Raft requires, but when I looked up other Raft examples on GitHub (basic Go setups using hashicorp/raft with the logStore and stableStore backed by bbolt (boltdb/v2)), nobody explicitly sets the index to 1 for the initial write, so I'm not sure how they got it working. I also fixed closing the logStore when using the application's distributed log, but the application still somehow manages to corrupt the file on the second run. I haven't properly debugged or understood this behavior yet. My best guess at this point is that either the followers are messing it up because they are not closing properly (and this is still missed after my changes), or when the leader writes the first entry on the second run it expands the file size, or the followers are somehow doing something similar.
