Kubernetes gRPC liveness probes fail #11
Comments
What do the logs say?
If it's a config parsing error, then the fix is here: #12
It looks like an error in the container itself:
Were you able to solve this problem? I've been struggling with it.
Deploy Locally chapter. Fixes so far:

First hurdle: the Dockerfile. In the Dockerfile, the gRPC code in the repo has a few typos: "https:#" instead of "https://", and no space after -q0. Fix in:

Also, after building (I had a bunch of clusters running from before), my test cluster was ...

Second hurdle: CrashLoopBackOff. To get past this, I eliminated any multiline-string constructs that might have been causing errors and broke the config generation into step-by-step commands in the initContainers section of proglog/templates/statefulset.yaml:
```yaml
initContainers:
  - name: {{ include "proglog.fullname" . }}-config-init
    image: busybox
    imagePullPolicy: IfNotPresent
    command:
      - /bin/sh
      - -c
      - |-
        ID=$(echo $HOSTNAME | rev | cut -d- -f1 | rev)
        echo "started with $HOSTNAME, and after parsing..."
        echo "got id: $ID"
        cat > /var/run/proglog/config.yaml <<EOD
        #this is a comment in a config file
        data-dir: /var/run/proglog/data
        rpc-port: {{.Values.rpcPort}}
        bind-addr: "$HOSTNAME.proglog.{{.Release.Namespace}}.svc.cluster.local:{{.Values.serfPort}}"
        bootstrap: $([ $ID = 0 ] && echo true || echo false )
        EOD
        if [ "$ID" -eq 0 ]; then
          echo "The ID variable is equal to 0. skipping setting start-join-addrs"
        else
          echo "The ID variable is equal to $ID."
          echo start-join-addrs: \"proglog-0.proglog.{{.Release.Namespace}}.svc.cluster.local:{{.Values.serfPort}}\" >> /var/run/proglog/config.yaml
        fi
        echo -e "generated file /var/run/proglog/config.yaml \n "
        cat /var/run/proglog/config.yaml
```
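As an aside on the config parsing errors mentioned earlier in the thread: the generated /var/run/proglog/config.yaml gets read back by the service at startup. Below is a minimal sketch of that read path, assuming a viper-based setup along the lines of the book; the path and keys mirror the file generated above, but none of this is the repo's exact code.

```go
package main

import (
	"fmt"
	"log"

	"github.com/spf13/viper"
)

// loadConfig reads the YAML file written by the init container and
// prints the values the service would see. Keys match the generated
// file (data-dir, rpc-port, bind-addr, bootstrap, start-join-addrs);
// this is a sketch, not the repo's parsing code.
func loadConfig(path string) error {
	viper.SetConfigFile(path)
	if err := viper.ReadInConfig(); err != nil {
		// A malformed heredoc (stray quotes, bad indentation) surfaces
		// here as a config parsing error and the container crash-loops.
		return fmt.Errorf("parsing %s: %w", path, err)
	}
	fmt.Println("data-dir:        ", viper.GetString("data-dir"))
	fmt.Println("rpc-port:        ", viper.GetInt("rpc-port"))
	fmt.Println("bind-addr:       ", viper.GetString("bind-addr"))
	fmt.Println("bootstrap:       ", viper.GetBool("bootstrap"))
	fmt.Println("start-join-addrs:", viper.GetStringSlice("start-join-addrs"))
	return nil
}

func main() {
	if err := loadConfig("/var/run/proglog/config.yaml"); err != nil {
		log.Fatal(err)
	}
}
```

If the heredoc above produces malformed YAML, this is the step that fails, which is why breaking the generation into simpler commands helps.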
With that in place, the pods are up and running. Third hurdle: the cluster is running, but the internal Raft machinery is breaking, similar to where @akolybelnikov's comment above is stuck. Currently stuck here...
I ended up googling around. The patch in here by varun fixes most issues, and then nicewook has a final patch to fix GetServers in his comment. Thank you all for contributing to the fixes.
Stuck here as well. I'm not very familiar with k8s, so I can't properly understand what is going on. Is the ideal behavior that there should just be one pod and 3 StatefulSets, or are there supposed to be 3 pods? It seems like only one instance is up (the leader node), and it just tries a few times and times out. I commented out the probes and also tried giving very long periods of time, but none of those things change anything, so the issue is not just about the probes. It seems like either the k8s networking is not configured right, the application is not getting configured properly, or the application is not accessing disk properly. With the E2E tests working fine locally, it does seem weird, but if someone figures it out, please let me know as well.
Okay, another update: by increasing the wait-for-leader timeout in the file from 3s to 30s, I'm able to get another pod started. Now the issue seems to be that the follower is trying to read and failing inside the Raft connection, and the leader is giving this:
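For context on why bumping that timeout can help: the wait-for-leader step is just a poll against Raft with a deadline, and on k8s the per-pod DNS records and peer discovery tend to take longer than they do locally. A rough sketch of the pattern, assuming a book-style setup; the package and function names are illustrative, not the repo's exact code.

```go
package raftutil

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

// waitForLeader polls the Raft node until some server is reported as
// leader or the deadline passes. Locally an election usually settles
// well inside 3s; on k8s it can take longer, which is why raising the
// timeout (e.g. to 30s) lets the second pod get past this step.
func waitForLeader(r *raft.Raft, timeout time.Duration) error {
	deadline := time.After(timeout)
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-deadline:
			return fmt.Errorf("timed out waiting for a raft leader")
		case <-ticker.C:
			if r.Leader() != "" {
				return nil
			}
		}
	}
}
```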
Okay, I'm back like I never left (I did leave, though, but I'm back). Anyway, I narrowed it down and found where the issue is happening. Well, not quite: I found where the error is; I still need to fix it. The issue is that index.size + entWidth = 1036 while the mmap size is 1024, so the index append fails, Raft is not able to store the entry, the leader is useless, and the followers can't do anything either. I will try to debug how the behavior differs between local and k8s for this to be happening.
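For readers following along, here is roughly where those numbers come from, sketched in the style of the book's index (the names, entry widths, and the []byte standing in for the mmap are assumptions, not the repo's exact code): the index file is grown to MaxIndexBytes (1024 in this config) so it can be memory-mapped, and Close() truncates it back to the bytes actually used. If the process never closes the index, the file stays at 1024 bytes; on the next start size reads as 1024, and the very first write needs 1024 + 12 = 1036 bytes, which fails.

```go
package index

import (
	"io"
	"os"
)

// Each entry is a 4-byte offset plus an 8-byte position, so entWidth = 12.
const (
	offWidth uint64 = 4
	posWidth uint64 = 8
	entWidth        = offWidth + posWidth
)

type index struct {
	file *os.File
	mmap []byte // stands in for the real memory-mapped region
	size uint64 // bytes written so far, taken from the file size on open
}

// Write appends one (offset, position) entry to the mapped region.
func (i *index) Write(off uint32, pos uint64) error {
	if uint64(len(i.mmap)) < i.size+entWidth {
		// The index looks full even though no entries were ever written,
		// because size was initialised from a 1024-byte file.
		return io.EOF
	}
	// ... encode off and pos into i.mmap[i.size : i.size+entWidth] ...
	i.size += entWidth
	return nil
}

// Close truncates the file back to its real size; skipping this is what
// leaves a 1024-byte file behind for the next run.
func (i *index) Close() error {
	if err := i.file.Truncate(int64(i.size)); err != nil {
		return err
	}
	return i.file.Close()
}
```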
Okay, I got it working. I don't think what I've done is the full solution, but it gets you over the hump for the time being. What I observed locally was that the index file size was initially 0, but in k8s the initial size of the index file was somehow 1024. Why, I don't know; I never debugged and confirmed it to be 1024 on the first run, only on failed previous runs, but the previous error does indicate that to be the case. Based on that, I did the following to fix it:

```yaml
- /bin/sh
- -c
- |-
  rm -r /var/run/proglog/data
  mkdir -p /var/run/proglog/data
  ID=$(echo $HOSTNAME | rev | cut -d- -f1 | rev)
  cat > /var/run/proglog/config.yaml <<EOD
  data-dir: /var/run/proglog/data
  rpc-port: {{.Values.rpcPort}}
  bind-addr: "$HOSTNAME.proglog.{{.Release.Namespace}}.svc.cluster.local:{{.Values.serfPort}}"
  $([ $ID != 0 ] && echo 'start-join-addrs: "proglog-0.proglog.{{.Release.Namespace}}.svc.cluster.local:{{.Values.serfPort}}"')
  bootstrap: $([ $ID = 0 ] && echo true || echo false)
  EOD
```

I added an explicit rm and mkdir for the data directory. I did not debug the old behavior fully, but I think the file was initially being created at size 1024 on the first run and then no write was happening, hence the crash loop and no progress. I could be wrong, and I don't think this is the ideal solution either, because since it's a StatefulSet you want to keep state, not clean it on each run. Perhaps doing this only on the first run and not on subsequent runs might be a solution? At the very least, it's now running properly on k8s as well.

Edit: It does not work on subsequent runs (if the rm is removed) after a successful first run. As pointed out by #5 and #6, the Raft logStore is never closed, which causes the index file to be 1024 bytes; it then fails the check, no appends happen, we get EOF, and it never runs. The only proper solution I can think of is: don't use the distributed log as the logStore; instead use boltdb as both the logStore and the stableStore and only use the log in the FSM, or properly handle Close in the logStore handed to Raft.

Edit part 2: Even using boltdb fails for this application for some reason. Perhaps I did not use it correctly, or I'm missing setting up the first index write, which is a required condition for Raft. But when I looked up other Raft examples on GitHub (basic Go Raft with hashicorp raft, using bboltdb (boltdb/v2) as the logStore and stableStore), no one is explicitly setting the index to 1 for the initial write, so I'm not sure how they got it working. I also did fix closing the logStore when using the application's distributed log, but the application still somehow manages to corrupt the file on the second run. I have not properly debugged or understood this behavior yet. My best guess at this point is that either the followers are messing it up and still not closing properly even after my changes, or when the leader writes the first entry on the second run it expands the file size, or the followers are somehow doing something similar.
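For completeness, here is a sketch of the BoltDB alternative floated in that last edit: backing Raft's LogStore and StableStore with hashicorp's raft-boltdb instead of the application's own distributed log, so only the FSM touches the segment and index files. This is not the book's design, and per the edit above it did not fully work for the commenter either; the package path and file names are assumptions.

```go
package storage

import (
	"path/filepath"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb/v2"
)

// newRaftStores opens two BoltDB files under dataDir: one for Raft's
// log entries and one for its stable (term/vote) state. *BoltStore
// satisfies both raft.LogStore and raft.StableStore.
func newRaftStores(dataDir string) (raft.LogStore, raft.StableStore, error) {
	logStore, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft-log.bolt"))
	if err != nil {
		return nil, nil, err
	}
	stableStore, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft-stable.bolt"))
	if err != nil {
		return nil, nil, err
	}
	return logStore, stableStore, nil
}
```

The commenter's other suggestion, properly closing the logStore that is handed to Raft on shutdown, keeps the book's design intact and may be the cleaner fix.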
After implementing the Deploy Locally chapter, the k8s liveness and readiness probes fail with:
kubectl get pod:
kubectl describe pod:
I tried compiling and deploying the code from the cloned repository with the same result.
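For anyone triaging the probe failures themselves: the chart's probes exec grpc_health_probe (per the book's setup), which calls the standard gRPC health-checking service, so they can only pass once the server is up with that service registered and reporting SERVING. If the process dies earlier (config parsing, the index error discussed above), the probe failures are a symptom rather than the cause. Below is a minimal sketch of the registration; the port and the bare-bones server setup are illustrative assumptions, not the repo's exact code.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Listen on the RPC port the probes target (illustrative value).
	lis, err := net.Listen("tcp", ":8400")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()

	// Register the standard health service and mark the server SERVING
	// so grpc_health_probe gets a successful response.
	hsrv := health.NewServer()
	hsrv.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(srv, hsrv)

	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```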