
[Bug][Go SDK]: Silent Dataflow failure upon glibc library mismatch #24470

Closed

mwallace582 opened this issue Dec 1, 2022 · 11 comments

@mwallace582

What happened?

Hi Apache Beam team,

I've recently identified an issue with using the Beam Go SDK with Google's Dataflow. I've already filed a bug report with Google, but I wanted to report it here as well in case there's something to be done on the SDK side.

I'll paste the contents of my linked bug report below:

Recently, I encountered an issue where I was unable to run any Dataflow pipeline using the Go language SDK. Previously I'd only used Python with Dataflow. I used the wordcount.go example pipeline as my test case. I followed the deployment instructions on the Dataflow site and didn't do anything fancy.

Upon deploying the pipeline, the Dataflow console reported that the workers had started successfully and that everything was running fine. Despite this, the pipeline never made any progress whatsoever.

I noticed some errors in the logs when looking at Stackdriver with these filters:

resource.type="dataflow_step"
resource.labels.job_id="2022-11-30_13_25_07-7319605715258819834"

The most concerning log entries were the following:

"ContainerStatus from runtime service failed" err="rpc error: code = Unknown desc = Error: No such container: 1dd1f5aed1062c9fa85e1437dcbb915a8e3844a5a8de66982aaf4c1471dc492e" containerID="1dd1f5aed1062c9fa85e1437dcbb915a8e3844a5a8de66982aaf4c1471dc492e"
"getPodContainerStatuses for pod failed" err="rpc error: code = Unknown desc = Error: No such container: 1dd1f5aed1062c9fa85e1437dcbb915a8e3844a5a8de66982aaf4c1471dc492e" pod="default/df-matthew-test-golang-v3-11301257-l37l-harness-6jnd"
"Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sdk-0-0\" with CrashLoopBackOff: \"back-off 10s restarting failed container=sdk-0-0 pod=df-matthew-test-golang-v3-11301257-l37l-harness-6jnd_default(57e4bc85cfdd8d814463331df622de1a)\"" pod="default/df-matthew-test-golang-v3-11301257-l37l-harness-6jnd" podUID=57e4bc85cfdd8d814463331df622de1a
"Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sdk-0-0\" with CrashLoopBackOff: \"back-off 20s restarting failed container=sdk-0-0 pod=df-matthew-test-golang-v3-11301257-l37l-harness-6jnd_default(57e4bc85cfdd8d814463331df622de1a)\"" pod="default/df-matthew-test-golang-v3-11301257-l37l-harness-6jnd" podUID=57e4bc85cfdd8d814463331df622de1a
"Error syncing pod, skipping" err="failed to \"StartContainer\" for \"sdk-0-0\" with CrashLoopBackOff: \"back-off 20s restarting failed container=sdk-0-0 pod=df-matthew-test-golang-v3-11301257-l37l-harness-6jnd_default(57e4bc85cfdd8d814463331df622de1a)\"" pod="default/df-matthew-test-golang-v3-11301257-l37l-harness-6jnd" podUID=57e4bc85cfdd8d814463331df622de1a

I was stuck here for a while, not understanding what was going wrong that could explain these Kubernetes errors. After staring at the logs for several hours, I came across these log messages, which were not marked as errors:

2022/12/01 00:40:02 Provision info:
pipeline_options:{fields:{key:"beam:option:go_options:v1" value:{struct_value:{fields:...blah..blah}}}}
2022/12/01 00:40:02 Initializing Go harness: /opt/apache/beam/boot --logging_endpoint=localhost:12370 --control_endpoint=localhost:12371 --artifact_endpoint=localhost:12372 --provision_endpoint=localhost:12373 --semi_persist_dir=/var/opt/google --id=sdk-0-0
/bin/worker: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /bin/worker)
/bin/worker: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /bin/worker)
2022/12/01 00:40:02 User program exited: exit status 1
ignoring event map[container:cc38c9371adfcaff693d3b7bb1791111f7793f6d875eea6e756bb0254433b09e module:libcontainerd namespace:moby topic:/tasks/delete type:*events.TaskDelete]

Not being an expert on the internals of Dataflow, I didn't know if this was a real error or one of the many benign errors that appear in the Dataflow logs. But I figured I would assume these were the errors causing my problem, and see if I could quash them.

My development machine has glibc 2.36 installed, and the apache/beam_go_sdk:2.43.0 docker image has glibc 2.31 installed. Instead of downgrading my glibc version, I did some googling, and found that I could disable Go's dynamic linking by exporting CGO_ENABLED=0 before deploying the pipeline with go run. This fixed the issue with my pipeline immediately.

I'm fairly new to Go, so I can't say with any certainty that dynamic linking should be disabled by default on all Dataflow Go pipelines, but it is something worth considering. Alternatively, it would be EXTREMELY helpful if Dataflow had reported this error more clearly in the first place. I don't think that mismatched glibc versions are a particularly rare occurrence, and this subtle error is a huge barrier to entry when using Go with Dataflow.

Issue Priority

Priority: 2

Issue Component

Component: sdk-go

mwallace582 changed the title [Bug]: Silent Dataflow failure upon glibc library mismatch → [Bug][Go SDK]: Silent Dataflow failure upon glibc library mismatch Dec 1, 2022
@kennknowles
Member

@lostluck

@mwallace582
Author

Note that I've just added detailed instructions on reproducing this issue to the linked bug report.

@lostluck
Contributor

lostluck commented Dec 2, 2022

So there are two issues here.

  1. Turns out that debian-bullseye as a base container might not have the latest and greatest glibc (this is probably working as intended, given Debian's stability ethos).
  2. Boot loader container messages aren't discoverable.

The first has a few options.

Option 1: Custom containers. That's always an option and lets any container be used as the worker container, as long as the entrypoint is the bootloader to handle the rest of the container contract.

Option 2: Disable linking by turning off CGO, as already described.

We aren't likely to move off of debian as the container base anytime soon, but we could expand https://beam.apache.org/documentation/sdks/go-cross-compilation/ with a bit of this information.

The second is that we had to hunt for the root cause. That's never good. It hurts everyone. Ideally, the error is elevated properly, and tells you where to find answers.

In this case, since we have dedicated loaders per language, they could point to relevant places on the Beam site, like https://beam.apache.org/documentation/sdks/go-cross-compilation/ or similar. That's where the documentation for this should live. As for how to elevate the error itself:


All uses of the Beam Go SDK use the same containers, so the fix would be in the repo.

In principle, the boot loader could connect to the FnAPI logging service to make the failure announcement. That should elevate it in the Dataflow logs (and in all other SDK uses with a FnAPI). It's not entirely clear to me why we don't already, beyond the fact that it hasn't been needed before.

So the initial arg log is here: https://github.com/apache/beam/blob/master/sdks/go/container/boot.go#L100

And when the exec call fails, it's logged here: https://github.com/apache/beam/blob/master/sdks/go/container/boot.go#L169

I'm not familiar with how the logging gets elevated to errors/fatals/warnings etc from the container logs. The boot loaders don't use anything particularly fancy for logging, just the standard library "log" package, so it wouldn't hurt to upgrade it somewhat. I've already been considering starting integration with the hopefully upcoming structured log library for Go...

So that would be my suggestion: We switch the boot loaders to connect to the FnAPI log service, and direct users to a beam site page where we can catalog issues and solutions.
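For illustration, here is a minimal sketch of the shape that could take. The `fnLogSender` interface and `dialLogging` helper below are hypothetical stand-ins for the generated FnAPI logging client and its gRPC wiring, which are deliberately elided; only the overall flow (keep the local log line, also announce the failure over the `--logging_endpoint`, then exit) reflects the suggestion above.

```go
// Hypothetical sketch: announce a fatal boot error over the FnAPI logging
// endpoint before exiting, so runners surface it instead of burying it in
// container stdout. Names below are illustrative, not the real Beam types.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
)

// fnLogSender is a hypothetical stand-in for a thin wrapper over the
// generated BeamFnLogging gRPC client.
type fnLogSender interface {
	SendError(ctx context.Context, msg string) error
	Close() error
}

// dialLogging is a hypothetical helper that would dial --logging_endpoint
// and open the Logging stream; the gRPC wiring is elided in this sketch.
func dialLogging(ctx context.Context, endpoint string) (fnLogSender, error) {
	return nil, fmt.Errorf("not implemented in this sketch")
}

// fatalf keeps the existing local log line (boot.go uses the standard
// "log" package today) and, on a best-effort basis, also announces the
// failure over the FnAPI logging service before exiting.
func fatalf(ctx context.Context, loggingEndpoint, format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	log.Print(msg)

	if sender, err := dialLogging(ctx, loggingEndpoint); err == nil {
		_ = sender.SendError(ctx, msg)
		_ = sender.Close()
	}
	os.Exit(1)
}

func main() {
	// e.g. after the exec of the worker binary fails:
	fatalf(context.Background(), "localhost:12370",
		"User program exited: %v", fmt.Errorf("exit status 1"))
}
```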


The bit that we don't have help for is relaying the actual link error easily. Redirecting the binary's StdErr might work, but it could lead to other noise going across the logging interface, or to duplication or overrun.
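A rough sketch of what redirecting the worker's stderr could look like, using only the standard library; the `forward` callback is a hypothetical stand-in for whatever logging path ends up being chosen, and this is not the actual boot.go code:

```go
// Rough sketch (not the actual boot.go code): run the worker binary with
// its stderr piped back through the boot loader, so each line could be
// forwarded to a logging sink instead of only landing in container output.
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
)

// runWorker starts the given binary and streams its stderr line by line to
// the forward callback (a stand-in for "send this somewhere visible").
// Note the caveat above: everything on stderr comes through here, including
// benign diagnostics, so this can be noisy.
func runWorker(path string, args []string, forward func(line string)) error {
	cmd := exec.Command(path, args...)
	cmd.Stdout = os.Stdout // leave stdout alone

	stderr, err := cmd.StderrPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}

	scanner := bufio.NewScanner(stderr)
	for scanner.Scan() {
		forward(scanner.Text())
	}
	return cmd.Wait() // non-nil for "exit status 1" style failures
}

func main() {
	err := runWorker("/bin/worker", os.Args[1:], func(line string) {
		fmt.Fprintln(os.Stderr, "worker stderr:", line)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "user program exited:", err)
		os.Exit(1)
	}
}
```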

@lostluck
Contributor

lostluck commented Feb 7, 2023

Issue #25314 also notes this particular problem with the boot loaders.

@cozos
Contributor

cozos commented Feb 12, 2023

Some thoughts:

  • In boot.go we only have stdout and stderr but no application-level information regarding the severity of the logs. So basically, we don't know what the severity of stderr should be - it could contain the root cause of a failure (i.e. process termination) but it could also contain warnings and random diagnostic output, as per the POSIX standard.
  • One way to do this is to send stderr as ERROR or FATAL if the process crashed, and INFO otherwise, although I don't love this (see the sketch after this list).
  • We also lose some context in boot.go - things like bundle_id and transform_id that could tell you where the error happened.
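Here is a small sketch of that severity heuristic, under the assumption that the boot loader tees the worker's stderr; the plain `log.Printf` at the end stands in for whatever FnAPI logging path is used, and none of this is the actual boot.go code:

```go
// Sketch of the severity heuristic floated above (not a real Beam API):
// keep a copy of the worker's stderr, then decide after the fact whether
// it should be reported as ERROR (process crashed) or INFO (clean exit).
package main

import (
	"bytes"
	"io"
	"log"
	"os"
	"os/exec"
)

func main() {
	var captured bytes.Buffer

	cmd := exec.Command("/bin/worker", os.Args[1:]...)
	cmd.Stdout = os.Stdout
	// Tee stderr: it still shows up in the container logs, but we also
	// keep it so it can be re-emitted with a severity attached.
	cmd.Stderr = io.MultiWriter(os.Stderr, &captured)

	runErr := cmd.Run()

	severity := "INFO"
	if runErr != nil {
		// The process crashed, so stderr plausibly holds the root cause
		// (e.g. the GLIBC "version not found" lines from this issue).
		severity = "ERROR"
	}
	// Stand-in for sending over the FnAPI logging service.
	log.Printf("[%s] worker stderr:\n%s", severity, captured.String())

	if runErr != nil {
		os.Exit(1)
	}
}
```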

@cozos
Contributor

cozos commented Feb 21, 2023

See https://github.com/cozos/beam/pull/3/files for a first attempt.

lostluck self-assigned this Mar 28, 2023
@lostluck
Contributor

@cozos Sorry for not seeing that sooner.

That attempt would work, but it makes every SDK depend on a Go SDK specific harness detail. Those are best kept isolated. The boot container doesn't need anything nearly as involved to log over the FnAPI, so it can be simpler in order to keep it debuggable.

@lostluck
Contributor

Also, boot.go will never have any meaningful associated bundle_id and transform_id, since those require a live pipeline execution, and those ids are only meaningful to the runner that generated them.

@lostluck
Contributor

The logging here should no longer be silent due to that previous fix. I believe that did make it into 2.47.0, releasing soon.

We likely don't want to require users to have CGO disabled, but that would also prevent issues like this, I think?

I'm less well versed in that. But I would love that verification so we can put that advice in the cross-compilation documentation, or perhaps make the "autobuild and submit" binary mode set that too (though I believe any env variables would also normally be propagated, so we'd want to respect that setting).

The linker error is a real issue, and how to resolve it might not be obvious to users. Ideally the boot loader would log additional information when this kind of error is detected, suggesting the CGO workaround (if verified) or pointing to instructions (on the cross-compilation page) on how to fix the issue.
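Purely as an illustration of that last idea, a small check over captured worker stderr could look like the following; the pattern match, the hint text, and the `glibcHint` helper are assumptions for the sketch, not the shipped fix:

```go
// Illustrative only: a small check the boot loader could run over captured
// worker stderr to recognize this class of linker failure and point users
// at the cross-compilation docs.
package main

import (
	"fmt"
	"strings"
)

// glibcHint returns an extra hint when stderr looks like the dynamic-linker
// failure from this issue, e.g.:
//   /bin/worker: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found
func glibcHint(stderr string) string {
	if strings.Contains(stderr, "GLIBC_") && strings.Contains(stderr, "not found") {
		return "worker binary appears to require a newer glibc than the SDK container provides; " +
			"consider building with CGO_ENABLED=0 or using a custom container - see " +
			"https://beam.apache.org/documentation/sdks/go-cross-compilation/"
	}
	return ""
}

func main() {
	stderr := "/bin/worker: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /bin/worker)"
	if hint := glibcHint(stderr); hint != "" {
		fmt.Println(hint)
	}
}
```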

@cozos
Contributor

cozos commented Apr 26, 2023

Thanks so much!

@jrmccluskey
Contributor

Looks like this never got closed because of how GitHub parses the "fixes" string in PR descriptions. Fixed by #26035

github-actions bot added this to the 2.50.0 Release milestone Aug 1, 2023