Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Runner doesn't pick up config and code changes until restarted #155

Open
janosh opened this issue Jul 25, 2024 · 7 comments
Open

Runner doesn't pick up config and code changes until restarted #155

janosh opened this issue Jul 25, 2024 · 7 comments

Comments

@janosh
Copy link
Collaborator

janosh commented Jul 25, 2024

when i modify some-project.yaml config file, the changes are not picked up by current or new calculations that start after those changes were made until the Runner is killed and restarted. the same is true for direct changes to Python code (e.g. in jf-remote, qtoolkit, jobflow, atomate2). while i can see this being a feature in that it ensures homogeneous settings until you deliberately restart the Runner, it wasn't clear to me when i started using jf-remote and took at least an hour to figure out. ideally, this should be clearly documented somewhere (apologies if it is already and i missed it). another option to consider would be to make this configurable in ~/.jfremote.yaml

runner_mode: static | reload-config | reload-code | reload

static mode would be the current behavior while reload-config would reload the some-project.yaml once a minute or so. reload-code i'm less sure is worth implementing as it's more of an edge case and might be a can of worms to get working reliably. in case reload-config also goes against the jobflow-remote design philosophy, feel free to consider this issue as only a suggestion for more documention.

@davidwaroquiers
Copy link
Member

Thanks @janosh for this issue. Indeed, this may not be described in the documentation. I understand it could indeed be a use case and that the runner could regularly check if the config has been updated. We already though about this and discussed with @gpetretto but I remember there were a few tricky things. What are the things in the project config you'd like to update dynamically ? I guess it's the exec_config ? or is it something else ? maybe the workers ? Thing is what if someone modifies the stores (maybe by mistake). That would be quite problematic and the database could end up in an inconsistent state. If this dynamical reload is done, there should be some checks on what can be reloaded from the project and maybe the runner should stop itself if something "not allowed" has changed

In any case, we should at least document this more clearly

@janosh
Copy link
Collaborator Author

janosh commented Jul 25, 2024

What are the things in the project config you'd like to update dynamically ? I guess it's the exec_config ? or is it something else ? maybe the workers?

in my case both the workers (as i often have to iterate on pre_run and post_run for things like module load before things start running smoothly on a new cluster)

    pre_run: &pre_run |
      module purge
      module load slurm/stage openmpi/stage vasp/stage
      export ATOMATE2_VASP_CMD="srun vasp_std"
      echo "slurm job with id $SLURM_JOB_ID started at $(date +"%Y-%m-%d %H:%M:%S %Z") on $(hostname) running '$ATOMATE2_VASP_CMD'"
    post_run: &post_run |
      echo "slurm job ended at $(date +"%Y-%m-%d %H:%M:%S %Z")"

but also the database connection info for the queue_store and the job_store. those can also take a lot of trial and error to get right due to issues like materialsproject/maggma#981. having to constantly kill and restart the runner adds to the debugging burden as I occasionally forget to restart the runner and then i'm not sure what state of the config my jobs are running with.

@davidwaroquiers
Copy link
Member

Ok so it's not really about production, it's really about when you have to set up your project. Maybe then instead of using the daemon runner you could use the "foreground" runner using jf runner run. @gpetretto may comment on this but it might be more reasonable (and relatively easy to implement ?- to have an option for jf runner run to reload the config dynamically, at least at first (before having this, optionally or not, within the daemon itself). Maybe it could be useful to have an option to directly show log messages in the terminal for jf runner run ? @gpetretto what do you think ?

@janosh
Copy link
Collaborator Author

janosh commented Jul 25, 2024

show log messages in the terminal for jf runner run ? @gpetretto what do you think ?

that would very helpful! something like

jf runner start --debug
# or
jf runner start --dev

to show the runner's logs directly in stdout

@davidwaroquiers
Copy link
Member

Well actually I just checked (I wasn't sure about this), if you use jf runner run instead of jf runner start, you have the runner but not as a daemon (i.e. background process launched by supervisor) but as a foreround process and you get the logging messages. For testing/setup, I think this should already help you. There is also the jf project check that checks databases and connections to workers.

@gpetretto
Copy link
Contributor

Hi @janosh,
thanks for reporting your experience on this matter. I have a few comments

  • The need to restart the runner should indeed be documented. I wonder which would be the most suitable section. A reasonable option would be the configuration section, but I am not sure that this would be seen by the users. Where would you have expected to find such a warning? Maybe I should create a section of the documentation dedicated to the runner.
  • While implementing the automatic update should be feasible, I am not convinced that it would be a big improvement. In the end this means that you cannot know for sure the exact moment in which the changes are taken into account. So if you are making tests changing the configuration you will still need to be aware that you cannot submit a new job right away, because it may still be with the old configuration. For how the runner works I would not be able to guarantee that the check will happen exactly every minute. So one will need to wait a bit and then starting tests hoping that the runner has already updated the configuration. To me it seems easier to restart the runner and be 100% sure.
  • When making tests I indeed tend to run the runner in the foreground. Since I started doing this before the daemon was even implemented I have a script that does:
    from jobflow_remote.jobs.runner import Runner
    Runner(project_name="xxx").run()
    however this would be equivalent to run jf runner run as suggested by @davidwaroquiers. This means that the runner runs in the frontend, with the logs printed in stdout. I think that this is the most convenient option when making tests, because one can quickly check what is going on from the logs and just stop the runner with ctrl-c right away. In the end I did not think of jf runner run as something that would be used by the standard user, so it may need some updates to make it more user friendly.

In any case, if the automatic update would be preferable it can be indeed implemented as an option, as suggested by @janosh.

@janosh
Copy link
Collaborator Author

janosh commented Jul 26, 2024

thanks for all this helpful context @gpetretto. i'll be definitely be using jf runner run for setup from now on, e.g. once we've burned our AWS starter credits and migrate our simulations to GCP. :)

Maybe I should create a section of the documentation dedicated to the runner.

having a separate docs page for the Runner makes a lot of sense imo as it was the most opaque part of jf-remote to me and still is actually.

Where would you have expected to find such a warning?

another good place would be in the CLI help of jf runner start --help. something like

The runner uses project config and Python code in the state it was when the runner started. It does not use updated configs or code changes until the runner is killed and started again.

While implementing the automatic update should be feasible, I am not convinced that it would be a big improvement. In the end this means that you cannot know for sure the exact moment in which the changes are taken into account.

point taken! though i would argue that if you've made some changes, forgot to restart the runner and then run off to a meeting and come back to check results, it doesn't matter so much if the new config starts to take effect 1 min sooner or later but rather that it was used at all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants