Revisit process of dependency staging in Beam Python #21073
Labels: core, done & done, P3, python, wish
There are a few issues:
Including Beam itself in requirements.txt causes unnecessary friction and is suboptimal: Beam already takes care to stage itself to the workers, and Beam workers already include Beam's dependencies. This is not clear from https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/, yet from a user's perspective, including Beam in requirements.txt seems natural.
Staging the sources of all dependencies listed in requirements.txt, along with their transitive dependencies, in some cases involves a hidden package compilation initiated by pip. The reason is that in certain cases pip cannot reliably identify the dependencies of a package without compiling it; see [1-3] for pointers. This increases the time it takes to launch a Beam job and may require additional software (such as Linux packages with header libraries, or gcc and its dependencies) to be available. The result is friction and confusion; the behavior is not obvious and is beyond Beam's control.
[1] pypa/pip#8387
[2] pypa/pip#7995
[3] https://discuss.python.org/t/pip-download-just-the-source-packages-no-building-no-metadata-etc/4651
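To make both points concrete, here is a minimal Python sketch, not Beam's actual implementation. The helper names `drop_beam` and `source_download_command` are hypothetical, and the sample requirement lines are made up for illustration. The first helper filters `apache-beam` out of a list of requirement lines. The second only assembles a source-only `pip download` command, the kind of invocation that can make pip build packages to discover their dependencies. Whether Beam's stager uses exactly these flags is an assumption.

```python
import re
import sys

# Hypothetical helpers for illustration; none of this is part of Beam's API.

# Leading project name of a requirement line, e.g. "apache-beam" in
# "apache-beam[gcp]==2.30.0". Comments and pip options ("-r x.txt") do not match.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name):
    """Normalize a project name so that '-', '_' and '.' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def drop_beam(lines):
    """Return the requirement lines with any apache-beam entry removed."""
    kept = []
    for line in lines:
        match = _NAME_RE.match(line)
        if match and normalize(match.group(1)) == "apache-beam":
            continue  # Beam stages itself, so it need not be listed.
        kept.append(line)
    return kept


def source_download_command(requirements_file, dest_dir):
    """Build (but do not run) a pip command that downloads sources only.

    With "--no-binary :all:" pip fetches source distributions and may have
    to build them to resolve their dependencies.
    """
    return [
        sys.executable, "-m", "pip", "download",
        "--dest", dest_dir,
        "-r", requirements_file,
        "--no-binary", ":all:",
    ]


if __name__ == "__main__":
    sample = ["apache-beam[gcp]==2.30.0", "numpy>=1.19", "# a comment"]
    print(drop_beam(sample))
    print(" ".join(source_download_command("requirements.txt", "/tmp/staging")))
```

The sketch deliberately stops short of running pip, so it can be tried without network access or a compiler toolchain.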
Imported from Jira BEAM-12555. Original Jira may contain additional context.
Reported by: tvalentyn.