
Add in-build toolchain registration to python.toolchain #2216

Open
gholms opened this issue Sep 11, 2024 · 4 comments

Comments

@gholms

gholms commented Sep 11, 2024

🚀 feature request

Relevant Rules

  • //python/extensions:python.bzl: python.toolchain
  • //python:repositories.bzl: python_register_toolchains

Description

While the underlying rules for python toolchains/runtimes support python stacks that are actual Bazel-built targets passed in by label, python_register_toolchains, and by extension the python.toolchain tag class, offers no way to do this. At present it supports only downloading hermetic toolchains from GitHub by version number. While one can technically do all the behind-the-scenes work it performs in one's own module/workspace code, exactly what that work is, and what expectations it places on the repositories one points it at, is not documented. Especially as we make more use of additional toolchain types (cc toolchains, exec tools toolchains), discovering and writing this much code is an increasing burden. We should provide better support for people who supply their own python stacks, whether built from source or precompiled.

Describe the solution you'd like

Digging through python_register_toolchains and a bunch of other code, it looks like each platform expects three toolchains, a coverage tool, a config setting, and the seven other targets which toolchain_aliases repositories name. For each case like this, we can either define a contract for what targets the user-supplied repository+package needs to provide (perhaps using some helpful macros or tag classes from us), then ask people to tell us where to fetch it from and how to patch it; or prompt for each of these things individually. Either way, we could do this either by extending the existing python_register_toolchains and python.toolchain, or by defining a new pair of these things dedicated to this use case.

For python stacks built from source, we can allow people to supply just one repository+package to use everywhere and let Bazel do all the platform management for us. I don't know how best to build a good UX around this for python stacks supplied as binaries without forcing a lot of repetition onto people who want to support lots of platforms. But to an extent that work is also self-imposed, and people could use macros or a platform-based, toolchain_aliases-like repository to manage that complexity.
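To make this concrete, one hypothetical shape the tag class could take for an in-build stack is sketched below; the repository attribute is invented for illustration and does not exist in rules_python today:

    # MODULE.bazel sketch. use_extension and python.toolchain are real; the
    # `repository` attribute is invented here to illustrate pointing the
    # toolchain machinery at a user-supplied, in-build python stack.
    python = use_extension("@rules_python//python/extensions:python.bzl", "python")
    python.toolchain(
        python_version = "3.12",
        is_default = True,
        # Invented attribute: a repository+package providing the contract
        # targets (runtimes, coverage tool, etc.) instead of a downloaded
        # hermetic release.
        repository = "@my_python_stack//stack",
    )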

Describe alternatives you've considered

Instead of supporting the in-build use case ourselves, we could document what people need to do in order to do it themselves while still integrating with the python ecosystem correctly. That makes it far less likely that people will overlook things than source-code dives do, but it effectively forces people to reimplement much of our finicky loading-phase code. Most of the smaller bits comprising it aren't exposed publicly, but perhaps opening some of them up would help support that scenario a bit.

@aignas
Collaborator

aignas commented Sep 13, 2024

I think once the python.*override classes are implemented, the code will be easier to work with. I am not sure yet what has to change and I don't have time for this right now, but feel free to look into how python.toolchain could be changed and leave notes here.

@rickeylev
Contributor

Digging through python_register_toolchains and a bunch of other code, it looks like each platform expects

Something to keep in mind is that what python_register_toolchains (or the bzlmod equivalent) does under the hood carries a lot of extra code for various edge cases and backwards compatibility. e.g. the "toolchain aliases" part, I think, exists just so the pip integration can get a reference to a host-platform-compatible interpreter to run in the pip repo-rule phase code. So don't take everything they do as "I must also define these to have something that works".

Anyways, the core of defining a toolchain is just three parts:

  • //python:toolchain_type via py_runtime
  • //python:exec_tools_toolchain_type via py_exec_tools_toolchain
  • //python/cc:toolchain_type via py_cc_toolchain
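As a rough sketch, wiring those three parts to an in-build interpreter might look like the following; the load paths follow rules_python's current layout but may differ by version, and every //stack:* label here is made up:

    # BUILD sketch: one toolchain() call per toolchain type, each wrapping
    # the matching rule against hypothetical in-build targets.
    load("@rules_python//python:py_runtime.bzl", "py_runtime")
    load("@rules_python//python:py_runtime_pair.bzl", "py_runtime_pair")
    load("@rules_python//python:py_exec_tools_toolchain.bzl", "py_exec_tools_toolchain")
    load("@rules_python//python/cc:py_cc_toolchain.bzl", "py_cc_toolchain")

    py_runtime(
        name = "py3_runtime",
        interpreter = "//stack:python3",  # source-built interpreter
        python_version = "PY3",
    )

    py_runtime_pair(
        name = "runtime_pair",
        py3_runtime = ":py3_runtime",
    )

    toolchain(
        name = "runtime_toolchain",
        toolchain = ":runtime_pair",
        toolchain_type = "@rules_python//python:toolchain_type",
    )

    py_exec_tools_toolchain(
        name = "exec_tools",
        exec_interpreter = "//stack:python3",
    )

    toolchain(
        name = "exec_tools_toolchain",
        toolchain = ":exec_tools",
        toolchain_type = "@rules_python//python:exec_tools_toolchain_type",
    )

    py_cc_toolchain(
        name = "py_cc",
        headers = "//stack:python_headers",  # cc_library of CPython headers
        libs = "//stack:libpython",          # cc_library wrapping libpython
        python_version = "3.12",             # must be an explicit string today
    )

    toolchain(
        name = "cc_toolchain",
        toolchain = ":py_cc",
        toolchain_type = "@rules_python//python/cc:toolchain_type",
    )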

we can either define a contract for what targets the user-supplied repository+package needs to provide (perhaps using some helpful macros or tag classes from us), then ask them to tell us where to fetch it from and how to patch it; or prompt for each of these things individually.

I'm not necessarily opposed to this, but I think it needs to be fleshed out more to see what it would look like. I just suspect this will turn into "one function with all the args from other functions it calls", as I describe below.

Yeah, hearing about what sort of toolchains you need to define and how they differ from the ones rules_python generates would be good info. I don't have any particularly great ideas for an API that

  • Makes it easier for users to define their own toolchain
  • Allows users to customize "the right parts" (of which there are many: in-build vs platform, coverage tool, headers/libs, precompiler, exec interpreter, etc.)
  • Allows us to tweak internals as necessary

When I implemented the local toolchains, I did try to unify the code paths for how toolchains are defined, but I wasn't able to figure out something that resembled the above. The best I could do is python/private/py_toolchain_suite.bzl#_internal_toolchain_suite -- all that does is make the native.toolchain() calls, which isn't that helpful.

Today, we have 3 toolchain implementations: hermetic, local, and runtime_env. Whenever I look through them to see how they call py_runtime et al and toolchain(), I fail to see a common denominator to factor out. The "api" I keep coming back to is basically a function that looks like one of:

  • def define_toolchain_suite(name, py_runtime_kwargs, py_exec_tools_kwargs, py_cc_kwargs)
  • def define_toolchain_suite(python_major_version, python_minor_version, python_patch_version, coverage_tool, cc_headers, cc_libs, etc) (one arg for every arg of the py_runtime/py_exec_tools/py_cc_toolchain rules)

Neither of those seems good for users or maintainers.
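For concreteness, a hypothetical sketch of the first shape; the macro and argument names are invented for illustration, not a real rules_python API:

    # .bzl sketch: fan three kwargs dicts out to the underlying rules and
    # make the toolchain() calls. Hypothetical; illustrative only.
    load("@rules_python//python:py_runtime.bzl", "py_runtime")
    load("@rules_python//python:py_runtime_pair.bzl", "py_runtime_pair")

    def define_toolchain_suite(name, py_runtime_kwargs, py_exec_tools_kwargs, py_cc_kwargs):
        py_runtime(name = name + "_runtime", **py_runtime_kwargs)
        py_runtime_pair(name = name + "_pair", py3_runtime = ":" + name + "_runtime")
        native.toolchain(
            name = name + "_toolchain",
            toolchain = ":" + name + "_pair",
            toolchain_type = "@rules_python//python:toolchain_type",
        )
        # ...repeat the same pattern for py_exec_tools_toolchain with
        # py_exec_tools_kwargs and py_cc_toolchain with py_cc_kwargs.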

I'm curious what your in-build, built-from-source toolchain looks like (insofar as the calls to toolchain() and py_runtime/et al), and how it compares to the 3 toolchain implementations I mention above.

Within Google, where we also build from source, the toolchain() and py_runtime() calls are much simpler than the 3 impls I mention above. That, plus there are various idiosyncrasies to how they're defined, the platforms supported, etc. In any case, I can't imagine what sort of helper rules_python could provide that would make defining them easier.

@gholms
Author

gholms commented Sep 14, 2024

I'm curious what your in-build, built-from-source toolchain looks like (insofar as the calls to toolchain() and py_runtime/et al), and how it compares to the 3 toolchain implementations I mention above.

Within Google, where we also build from source, the toolchain() and py_runtime() calls are much simpler than the 3 impls I mention above. That, plus there are various idiosyncrasies to how they're defined, the platforms supported, etc. In any case, I can't imagine what sort of helper rules_python could provide that would make defining them easier.

As far as Zoox goes, we're on an old enough (heavily forked) rules_python that there is still just the one toolchain type. This all came up as I looked into upgrading and hopefully reducing how much patching we wind up doing, if you're curious. As far as the toolchains go, we aren't doing anything special. Each @pythonX.Y repository provides the libpython cc_library and the py_runtime for itself. The main repository has a python_version build setting which controls everything that cares about versioning, as well as the actual py_runtime_pair and toolchain that we register. In theory, nothing stops us from moving the latter two out of the main repository and doing things just like rules_python.
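A rough sketch of that layout, with hypothetical repository and target names (and assuming bazel_skylib's string_flag for the build setting):

    # Main-repo BUILD sketch: a python_version build setting selects which
    # in-build @pythonX.Y runtime backs the single registered toolchain.
    load("@bazel_skylib//rules:common_settings.bzl", "string_flag")
    load("@rules_python//python:py_runtime_pair.bzl", "py_runtime_pair")

    string_flag(
        name = "python_version",
        build_setting_default = "3.12",
    )

    config_setting(
        name = "is_py3.11",
        flag_values = {":python_version": "3.11"},
    )

    py_runtime_pair(
        name = "runtime_pair",
        py3_runtime = select({
            ":is_py3.11": "@python3.11//:py_runtime",
            "//conditions:default": "@python3.12//:py_runtime",
        }),
    )

    toolchain(
        name = "py_toolchain",
        toolchain = ":runtime_pair",
        toolchain_type = "@rules_python//python:toolchain_type",
    )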

This is pretty easy to manage since there's only one toolchain and toolchain type. We'd probably wind up doing the other two types with more straight-line BUILD file code. Registering a couple more toolchains is also easy; it's figuring out what to give the new toolchain rules that's more challenging, because there is minimal documentation about what the new toolchain types even are, much less what their corresponding rules want as attrs. Frankly, a toolchain_suite sort of function, or even just doing the behind-the-scenes work ourselves, would work for us just fine. We would just need to know what that work is, because none of the documented ways to get a python toolchain work with in-build interpreters. This is almost certainly solvable with "advanced users" documentation.

But we also aren't doing anything meaningfully different from what rules_python is doing on its own; we just happen to be supplying our own archive/patches/BUILD file for it. This is where the contract idea springs from: at least it would be a way to make the UX better for other people. But that certainly comes with its own limitations. I just haven't managed to think of an alternative that doesn't boil down to "pass all the args," as you noted.

@rickeylev
Contributor

FYI: I sent out

re: api docs in general: Have a look at:

(I added these docs about a week ago; I was adding some toolchains within Google, had some of the same questions as you, and realized our doc site didn't have them listed :). Happy to approve doc improvements.)

See https://rules-python.readthedocs.io/en/latest/api/rules_python/index.html for an overview of most everything. The site index is also handy for ctrl+f'ing particular terms that the regular search doesn't find.

Each @pythonX.Y repository provides the libpython cc_library and the py_runtime for itself. The main repository has a python_version build setting which controls everything that cares about versioning, as well as the actual py_runtime_pair and toolchain that we register.

Thanks for the info. Yeah, if you're building from source, you probably don't need many py_runtimes defined -- I'm guessing any platform differences are handled using select() as part of building the interpreter?

FWIW:

  • A small thing you might run into when trying to use py_cc_toolchain is that it requires the python version to be explicitly set as a string. Something like feat: default py_runtime version info to --python_version #2198 for the py_cc_toolchain rule (by default, get the python version from the flag) would probably help reduce boilerplate; happy to approve a PR that does that.
  • We now have //python/config_settings:python_version
  • See also //python/cc:current_py_cc_{headers,libs} -- these targets get the headers/libs via toolchain resolution; see the sketch below.
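For example, a C extension can depend on the resolved headers/libs without hardcoding a runtime repo; a minimal sketch (my_ext.c is hypothetical):

    # BUILD sketch: get the active toolchain's headers/libs via toolchain
    # resolution instead of a hardcoded @pythonX.Y label.
    cc_library(
        name = "my_ext",
        srcs = ["my_ext.c"],  # hypothetical extension source
        deps = [
            "@rules_python//python/cc:current_py_cc_headers",
            "@rules_python//python/cc:current_py_cc_libs",
        ],
    )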

HTH

github-merge-queue bot pushed a commit that referenced this issue Sep 15, 2024
…2220)

From the discussion in #2216, it's clear that some better docs are needed to help people figure out how to define a toolchain and what the different pieces mean.

This gives some more explanation of the toolchain types, what they do, and how to define them.

Along the way:
* Add some more bazel objects to the inventory
* Fix attribute lookups in the bazel inventory
* Allow using parens in crossrefs, e.g. `foo()`
github-merge-queue bot pushed a commit that referenced this issue Sep 15, 2024
This stems from conversation in #2216. Moving the logic into a loading-phase macro makes it easier to understand how a repo with a python-build-standalone extracted archive has its toolchain defined. This also makes it easier to compare with the local runtime repo setup to try and find commonality in the future.

Along the way:
* Remove some Bazel 5 compatibility code: coverage_tool is now always passed to py_runtime. (Bazel 5 support was dropped earlier this year and we stopped testing with it; it's likely already broken with Bazel 5.)
* Add `semver.to_dict`, which makes it easier to pass into format() functions.