Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Go Direct Runner Improvements #20647

Closed
damccorm opened this issue Jun 4, 2022 · 1 comment
Closed

Go Direct Runner Improvements #20647

damccorm opened this issue Jun 4, 2022 · 1 comment

Comments

@damccorm
Copy link
Contributor

damccorm commented Jun 4, 2022

The Go SDK has a simple direct runner intended for basic batch framework testing. That is, it's only suitable for the barest tests, and not that it ensures that the basics work for arbitrary pipelines.

The runner has the following features:

  • Operates on the direct pipeline graph without marshalling through the beam protos.
    ** This prevents it from validating that the pipeline is valid for portable runners.
  • Executes the whole pipeline as a single bundle, on a single worker thread. "in process"
    ** This renders it only suitable for very small data sets, that likely operate in memory.
  • Doesn't marshal elements.
    ** While this avoids notionally unnecessary work, it's another reason why users will run into errors after using the direct runner to "validate" their pipeline before moving to Spark or Flink.

Further, the runner hasn't been validated for beam semantics, nor have more complex features of the Beam Model been implemented or validated. This makes it unsuitable for more than it's current use for demoing the SDK in basic batch operation, and the light use it has testing the SDK itself.

However, implementing full beam semantics for a runner, even without the distributed portion is a project in itself. It's part of the beam design that implementing the semantics for a beam runner to be more complicated on the runner side vs the SDK side. 

But there's no reason why we can't improve the Go Direct Runner to match all semantics required of beam for single machine contexts.

In particular the various improvements below could be made (and should probably be sharded into separate sub task JIRAs as required):

  • Convert the Go Direct Runner to a "Go Portable Runner" instead, which means implementing the Job Management  and  FnApi protocols. This would ensure that all runners are operating the Go SDK workers in the same way, via the harness. 
    ** This doesn't preclude "go awareness" for operating everthing in a single binary, or later re-optimizing to avoid serialization.
  • Allow the runner to execute "headless" (as a job submission server).
  • Allow the runner to execute more than a single bundle at once.
    ** Enabling better use of CPU cores in single execution mode.
  • Add loopback and docker execution mode support, in addition to the Go "in process" support it has.
  • Once the runner can execute portable pipelines done, it becomes possible to run the Python and Java Runner Validation Tests against the runner to validate all the features of the Beam Programming Model
    ** Each feature / TestSuite of which should be handled in separate JIRAs.
    ** Adding jenkins runs of those passing tests to ensure ongoing validation of the runner against the model.

 

A good place to start is being able to run and execute pipelines on the Python Portable runner, which implements all beam semantics correctly. Instructions for doing so are on Go Tips page in the Dev Wiki

 

Direct Runner Code: https://github.com/apache/beam/tree/master/sdks/go/pkg/beam/runners/direct

SDK Harness Code: https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/harness/harness.go 

 

Imported from Jira BEAM-11076. Original Jira may contain additional context.
Reported by: lostluck.

@lostluck
Copy link
Contributor

lostluck commented Aug 3, 2023

Overall this issue has been supplanted entirely by Prism, and replacing the direct runner with Prism. Closing.

See #24789

@lostluck lostluck closed this as completed Aug 3, 2023
@lostluck lostluck self-assigned this Aug 3, 2023
@github-actions github-actions bot added this to the 2.50.0 Release milestone Aug 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants