Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential issue with Flux 0.58 or the 0.57 python bindings installed with pip #9

Open
jwhite242 opened this issue Feb 2, 2024 · 37 comments

Comments

@jwhite242
Copy link

jwhite242 commented Feb 2, 2024

Ran into some import issues with the latest flux 0.58 install (the public systems that are flux native at llnl), and the 0.57 python bindings from pypi: it seems the python bindings install isn't quite finding the right things? Importing flux fails with a missing function in the c-python bindings (stack trace below, but doesn't seem like that function's a particular problme, just the first to get hit):

Python 3.10.8 (main, Jan  5 2023, 10:38:19) [GCC 10.3.1 20210422 (Red Hat 10.3.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import flux
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/p/lustre1/white242/maestro_flux/flux58env310/lib/python3.10/site-packages/flux/__init__.py", line 14, in <module>
    import flux.core.handle
  File "/p/lustre1/white242/maestro_flux/flux58env310/lib/python3.10/site-packages/flux/core/handle.py", line 19, in <module>
    from flux.future import Future
  File "/p/lustre1/white242/maestro_flux/flux58env310/lib/python3.10/site-packages/flux/future.py", line 78, in <module>
    def init_callback(c_future, opaque_handle):
ffi.error: ffi.def_extern('init_callback'): no 'extern "Python"' function with this name

Manually pulling the PYTHONPATH from flux env and setting it in the current shell enables things to work again, including submitting/querying jobs from python.

edit: fix code snippet..

@vsoch
Copy link
Member

vsoch commented Feb 2, 2024

I think for 0.58.0 flux you'd need 0.58.0 bindings right? Would you like me to build them?

@vsoch
Copy link
Member

vsoch commented Feb 2, 2024

Also if I remember, the latest that maestro has is pretty old (0.49.0) so that would be something to update too (I'm working on mini-mummi that uses maestro under the hood, although not flux anymore)!

https://github.com/LLNL/maestrowf/tree/develop/maestrowf/interfaces/script/_flux

@vsoch
Copy link
Member

vsoch commented Feb 2, 2024

okay give 0.58.0 a shot. https://pypi.org/project/flux-python/ You can sanity check what flux sees with flux --version and make sure the versions match, and also import flux to see flux.__file__

@jwhite242
Copy link
Author

hmph, still get the same issue with the 0.58 bindings. And to be clear, I'm not doing this with maestro, rather just installing the binding and dropping into a python shell and attempting to import flux and getting that error.

On maestro's adapter though, that's not really tightly coupled to flux's version. Mostly just tracking breaking changes with it like api tweaks and status codes/abbreviations that leak through into the python bits. Will see if that needs updating after I can verify the binding install on it's own.

@vsoch
Copy link
Member

vsoch commented Feb 6, 2024

Can you please run flux version and see what it shows? I suspect your setup is wonky.

@jwhite242
Copy link
Author

jwhite242 commented Feb 6, 2024

Apologies, the vnc systems just booted me off for some updates, so will have to login in later this morning to get the flux version and env outputs for you. But, this isn't so much a particular environment issue as a system issue I think. This is using the system installs on the EAS machines that are flux native and getting python environments to talk to it, not bootstrapping flux brokers myself inside some other scheduler. The issue occurs on both rzvernal and tioga. I've been using python 3.10, but can retry with 3.9 in case that makes any difference.

EDIT: forgot one potentially important detail. These python environments are all spawned off of the system python installs available via the module system (which has 3.9 and 3.10 available).

@vsoch
Copy link
Member

vsoch commented Feb 6, 2024

The way to debug this is to absolutely check the .so that is being paired with the python bindings, and be absolutely sure they are in sync. That error message is the python cffi libraries saying "I don't know what that function is called" and my suspicion is that you still have a version mismatch. The next thing to debug would be Python version and cffi version, which could lead to something similar I think.

@jwhite242
Copy link
Author

Ok, here's the info:

Flux version:

commands:    		0.58.0
libflux-core:		0.58.0
libflux-security:	0.10.0
build-options:		+ascii-only+systemd+hwloc==2.8.0+zmq==4.3.4

flux env

export LUA_PATH="/usr/share/lua/5.3/?.lua;;;"
export FLUX_EXEC_PATH="/usr/libexec/flux/cmd"
export PYTHONPATH="/usr/lib64/flux/python3.6"
export LUA_CPATH="/usr/lib64/lua/5.3/?.so;;;"
export FLUX_CONNECTOR_PATH="/usr/lib64/flux/connectors"
export MANPATH="/usr/share/man:/opt/cray/pe/libsci/23.12.5/share/man:/opt/cray/pe/mpich/8.1.28/ofi/man:/opt/cray/pe/mpich/8.1.28/man/mpich:/opt/cray/pe/craype/2.7.30/man:/opt/cray/pe/cce/17.0.0/cce-clang/x86_64/share/man:/opt/cray/pe/cce/17.0.0/man:/opt/cray/pe/perftools/23.12.0/man:/opt/cray/pe/papi/7.0.1.2/share/pdoc/man:/opt/cray/libfabric/2.1/share/man:/usr/share/lmod/lmod/share/man:/usr/man:/usr/local/man:/usr/X11R6/man:/usr/lib64/mvapich/default/man:/usr/tce/man"
export FLUX_MODULE_PATH="/usr/lib64/flux/modules"

for the actual .so, how can i find that? (really, it looks like these python bindings are just failing to find that? -> exporting the above PYTHONPATH manually gets everything working again, though that definitely bypasses what the flux-python package is supposed to be doing so maybe a red-herring there..)

@vsoch
Copy link
Member

vsoch commented Feb 6, 2024

You said you were using a Python 3.8, but in the above I see 3.6 is first on PYTHONPATH.

@jwhite242
Copy link
Author

No, I've been using 3.10 the whole time. That PYTHONPATH shown is from the system install, reported by the flux env cmd-> lc and/or the flux team manages that, and I can't change it, or use it to build an env off of (though building off that wouldn't really be an option as 3.6 doesn't play nice with many current packages since it's been EOL'd for so long). That's been the path on those machines since ~0.49 and hasn't seemed to be an issue till the most recent system updates? At least it hasn't affected the subset of the api I've been using so far, so could be I've just been lucky on these flux managed machines up til now.

Note, that python path is only known by the flux install. Unless I manually add it to get around this import bug that env var doesn't exist in my environment. Checked one of my old 3.9 env's with flux 0.51 bindings installed and that one doesn't export any pythonpath into my env either (looks like that's expected with virtualenv simply prepending things to $PATH instead).

@grondo
Copy link

grondo commented Feb 6, 2024

Would it be helpful to know that init_callback was added as part of the FutureExt class which didn't appear until flux-core v0.53.0? Do your flux-python bindings for Python 3.10 just need to be rebuilt?

as 3.6 doesn't play nice with many current packages since it's been EOL'd for so long

Yeah, since Flux is a system package, we have to use the system Python by default, which on RHEL 8 for better or worse is Python 3.6. We do provide RPM subpackages for 3.8 and 3.9 since those exist as alternatives in RHEL 8, but there is no /usr/bin/python3.10 available at this time (but this problem is why flux-python exists).

@vsoch
Copy link
Member

vsoch commented Feb 6, 2024

Would it be helpful to know that init_callback was added as part of the FutureExt class which didn't appear until flux-core v0.53.0? Do your flux-python bindings for Python 3.10 just need to be rebuilt?

@jwhite242 that's why I think you have a mismatch of versions here. The error is telling us that it's expecting a function to exist that does not. If you had the correctly matched .so and flux-python you would not see that.

@vsoch
Copy link
Member

vsoch commented Feb 6, 2024

Manually pulling the PYTHONPATH from flux env and setting it in the current shell enables things to work again, including submitting/querying jobs from python.

Is that not the solution? flux-python can't control how your environment is setup, including PYTHONPATH and flux versions. Is there something specifically you are opening this issue to ask for flux-python to do?

@jwhite242
Copy link
Author

@grondo I would say it would be helpful to know that that was added as i often don't notic. But for an additional data point, these bindings have actually been working just fine for multiple versions of these standalone bindings up to the 0.58 that looks to have been installed last week, which is interesting. Forget when I switched to this package, maybe 0.51? This bug does happen on a fresh install in a newly created virtual environment, so the bindings shouldn't be suffering from just being old i think; just redid all of that from scratch this morning as well, on both 3.9 and 3.10 envs.

@vsoch so, we used to do it that way, but these lovely pip installable bindings had removed the need for that up until last week. (seriously, have been loving being able to just pip install with these and having them just work, so thanks for making this standalone package)

So, poking around in the flux-python setup.py, is the lib vs lib64 pathing an issue here (honestly can't remember if flux has been in usr/lib64 this whole time or not..)? The find_flux func in the pip installed bindings suggest it's returning /usr/ from the flux installed in /usr/bin/flux. then the custom_search is adding lib, not lib64 when it's looking for things. Also, when this package is installing is it actually compilng things under the covers and pip is just hiding all that work? and if so, is there maybe some default compilers in the way and messing things up?

@grondo
Copy link

grondo commented Feb 6, 2024

Well I do notice that the flux-python copy of callbacks.h is missing the definition of init_callback which is in flux-core:

//flux_msg_handler_f
extern "Python" void message_handler_wrapper(flux_t *, flux_msg_handler_t *, const flux_msg_t *, void *);
//flux_watcher_f
extern "Python" void timeout_handler_wrapper(flux_reactor_t *, flux_watcher_t *, int, void *);
extern "Python" void fd_handler_wrapper(flux_reactor_t *, flux_watcher_t *, int, void *);
extern "Python" void signal_handler_wrapper(flux_reactor_t *, flux_watcher_t *, int, void *);
//flux_continuation_f
extern "Python" void continuation_callback(flux_future_t *, void *);

I don't think this explains why things didn't break until now though, but maybe if we update these copies from flux-core things will work again?

@vsoch
Copy link
Member

vsoch commented Feb 6, 2024

This issue probably needs to be transferred to flux-core - any changes that are needed can be made there and trickle down here.

@grondo
Copy link

grondo commented Feb 6, 2024

I meant to say the local copy of callbacks.h here is out of date with respect to flux-core. I'm not sure there are any changes needed in flux-core at this point?

@jwhite242
Copy link
Author

One other data point. Tried out one of the other native machines: corona, with flux 0.57 fails the same way. So definitely not isolated to 0.58 and that was likely just timing coincidence that i ran into the issue this past week on that version.

@grondo
Copy link

grondo commented Mar 6, 2024

This is a flux-python issue. The copy of callbacks.h here is not up to date with flux-core. I'm not sure I really understand how this flux-python repo works, though, so I apologize if there is indeed something we need to do in flux-core. (Just let me know what that is)

@vsoch
Copy link
Member

vsoch commented Mar 6, 2024

Flux python builds directly from the code there, the only thing kept here are the wrappers around that.

@grondo
Copy link

grondo commented Mar 6, 2024

Thanks! Where is the stuff in src coming from?

@grondo
Copy link

grondo commented Mar 6, 2024

And I admit I'm confused, the init_callback is definitely present in the core generated bindings, or some basic commands would fail. I don't understand why it isn't included for the flux-python built bindings though.

@vsoch
Copy link
Member

vsoch commented Mar 6, 2024

I missed this comment from a while back: #9 (comment)

Is that file explicitly in flux-python here, and we need to update it? Is there a way to programmatically determine what should be there?

@grondo
Copy link

grondo commented Mar 6, 2024

I'm not sure as I have no idea how flux-python works.

The files here: https://github.com/flux-framework/flux-python/tree/main/src

Seem to have been copied from flux-core at some point (definitely before v0.53.0). Whenever they are changed in flux-core, I would think they would need to be updated here. Some of the files are generated during the build, but callbacks.h is a normal file committed to our repo.

@vsoch
Copy link
Member

vsoch commented Mar 6, 2024

Can you point me to callbacks.h? We likely just need to copy it from flux-core upstream (and I didn't know that).

@vsoch
Copy link
Member

vsoch commented Mar 6, 2024

@vsoch
Copy link
Member

vsoch commented Mar 6, 2024

I think what needs to happen is that if anything changes in that directory, someone pings me because it warrants an update here. The assumption for the builds here is that the contents of that directory don't change.

@vsoch
Copy link
Member

vsoch commented Mar 6, 2024

okay I opened a PR to update it - what I can do after that is a release candidate / patch version for the current flux python for someone to test. #10

@vsoch
Copy link
Member

vsoch commented Mar 6, 2024

@jwhite242 when that gets merged and I build you bindings to test, do you have preference for a version?

@jwhite242
Copy link
Author

Awesome, thanks for the fix!

For the versions, could a couple be built to blow away what's on pypi? I've had users reporting this on machines with 0.58 and 0.59 installed (0.58 being the toss4_cray variants for the flux native systems, and 0.59 on all the other slurm managed machines). Looks like corona got updated too, so 0.57 seems a non-issue now? at least i'm not aware of any other instances of it.

@vsoch
Copy link
Member

vsoch commented Mar 6, 2024

Yes but please first test https://pypi.org/project/flux-python/0.59.0rc0/.

@jwhite242
Copy link
Author

sounds good. will test that out shortly

@jwhite242
Copy link
Author

alright, looks like that works! import succeeded, and was able to use the bindings to run some jobs

@vsoch
Copy link
Member

vsoch commented Mar 6, 2024

okay I think what I can do is rebuild a final release for 0.59.0, and then for 58 just do a 0.58.1

@jwhite242
Copy link
Author

sounds great. thanks so much!

@vsoch
Copy link
Member

vsoch commented Mar 6, 2024

yep! Just kicked off the builds.

@vsoch
Copy link
Member

vsoch commented Mar 7, 2024

okay these should be good to go - https://pypi.org/project/flux-python/#history. Let me know if there are any other issues or if we are good to close here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants