`Runner` doesn't pick up config and code changes until restarted #155

janosh · 2024-07-25T16:33:53Z

when i modify some-project.yaml config file, the changes are not picked up by current or new calculations that start after those changes were made until the Runner is killed and restarted. the same is true for direct changes to Python code (e.g. in jf-remote, qtoolkit, jobflow, atomate2). while i can see this being a feature in that it ensures homogeneous settings until you deliberately restart the Runner, it wasn't clear to me when i started using jf-remote and took at least an hour to figure out. ideally, this should be clearly documented somewhere (apologies if it is already and i missed it). another option to consider would be to make this configurable in ~/.jfremote.yaml

runner_mode: static | reload-config | reload-code | reload

static mode would be the current behavior while reload-config would reload the some-project.yaml once a minute or so. reload-code i'm less sure is worth implementing as it's more of an edge case and might be a can of worms to get working reliably. in case reload-config also goes against the jobflow-remote design philosophy, feel free to consider this issue as only a suggestion for more documention.

The text was updated successfully, but these errors were encountered:

davidwaroquiers · 2024-07-25T17:32:26Z

Thanks @janosh for this issue. Indeed, this may not be described in the documentation. I understand it could indeed be a use case and that the runner could regularly check if the config has been updated. We already though about this and discussed with @gpetretto but I remember there were a few tricky things. What are the things in the project config you'd like to update dynamically ? I guess it's the exec_config ? or is it something else ? maybe the workers ? Thing is what if someone modifies the stores (maybe by mistake). That would be quite problematic and the database could end up in an inconsistent state. If this dynamical reload is done, there should be some checks on what can be reloaded from the project and maybe the runner should stop itself if something "not allowed" has changed

In any case, we should at least document this more clearly

janosh · 2024-07-25T17:42:54Z

What are the things in the project config you'd like to update dynamically ? I guess it's the exec_config ? or is it something else ? maybe the workers?

in my case both the workers (as i often have to iterate on pre_run and post_run for things like module load before things start running smoothly on a new cluster)

    pre_run: &pre_run |
      module purge
      module load slurm/stage openmpi/stage vasp/stage
      export ATOMATE2_VASP_CMD="srun vasp_std"
      echo "slurm job with id $SLURM_JOB_ID started at $(date +"%Y-%m-%d %H:%M:%S %Z") on $(hostname) running '$ATOMATE2_VASP_CMD'"
    post_run: &post_run |
      echo "slurm job ended at $(date +"%Y-%m-%d %H:%M:%S %Z")"

but also the database connection info for the queue_store and the job_store. those can also take a lot of trial and error to get right due to issues like materialsproject/maggma#981. having to constantly kill and restart the runner adds to the debugging burden as I occasionally forget to restart the runner and then i'm not sure what state of the config my jobs are running with.

davidwaroquiers · 2024-07-25T21:54:49Z

Ok so it's not really about production, it's really about when you have to set up your project. Maybe then instead of using the daemon runner you could use the "foreground" runner using jf runner run. @gpetretto may comment on this but it might be more reasonable (and relatively easy to implement ?- to have an option for jf runner run to reload the config dynamically, at least at first (before having this, optionally or not, within the daemon itself). Maybe it could be useful to have an option to directly show log messages in the terminal for jf runner run ? @gpetretto what do you think ?

janosh · 2024-07-25T21:58:16Z

show log messages in the terminal for jf runner run ? @gpetretto what do you think ?

that would very helpful! something like

jf runner start --debug
# or
jf runner start --dev

to show the runner's logs directly in stdout

davidwaroquiers · 2024-07-26T06:46:50Z

Well actually I just checked (I wasn't sure about this), if you use jf runner run instead of jf runner start, you have the runner but not as a daemon (i.e. background process launched by supervisor) but as a foreround process and you get the logging messages. For testing/setup, I think this should already help you. There is also the jf project check that checks databases and connections to workers.

gpetretto · 2024-07-26T07:40:53Z

Hi @janosh,
thanks for reporting your experience on this matter. I have a few comments

The need to restart the runner should indeed be documented. I wonder which would be the most suitable section. A reasonable option would be the configuration section, but I am not sure that this would be seen by the users. Where would you have expected to find such a warning? Maybe I should create a section of the documentation dedicated to the runner.
While implementing the automatic update should be feasible, I am not convinced that it would be a big improvement. In the end this means that you cannot know for sure the exact moment in which the changes are taken into account. So if you are making tests changing the configuration you will still need to be aware that you cannot submit a new job right away, because it may still be with the old configuration. For how the runner works I would not be able to guarantee that the check will happen exactly every minute. So one will need to wait a bit and then starting tests hoping that the runner has already updated the configuration. To me it seems easier to restart the runner and be 100% sure.
When making tests I indeed tend to run the runner in the foreground. Since I started doing this before the daemon was even implemented I have a script that does:
```
from jobflow_remote.jobs.runner import Runner
Runner(project_name="xxx").run()
```
however this would be equivalent to run jf runner run as suggested by @davidwaroquiers. This means that the runner runs in the frontend, with the logs printed in stdout. I think that this is the most convenient option when making tests, because one can quickly check what is going on from the logs and just stop the runner with ctrl-c right away. In the end I did not think of jf runner run as something that would be used by the standard user, so it may need some updates to make it more user friendly.

In any case, if the automatic update would be preferable it can be indeed implemented as an option, as suggested by @janosh.

janosh · 2024-07-26T11:29:12Z

thanks for all this helpful context @gpetretto. i'll be definitely be using jf runner run for setup from now on, e.g. once we've burned our AWS starter credits and migrate our simulations to GCP. :)

Maybe I should create a section of the documentation dedicated to the runner.

having a separate docs page for the Runner makes a lot of sense imo as it was the most opaque part of jf-remote to me and still is actually.

Where would you have expected to find such a warning?

another good place would be in the CLI help of jf runner start --help. something like

The runner uses project config and Python code in the state it was when the runner started. It does not use updated configs or code changes until the runner is killed and started again.

While implementing the automatic update should be feasible, I am not convinced that it would be a big improvement. In the end this means that you cannot know for sure the exact moment in which the changes are taken into account.

point taken! though i would argue that if you've made some changes, forgot to restart the runner and then run off to a meeting and come back to check results, it doesn't matter so much if the new config starts to take effect 1 min sooner or later but rather that it was used at all

janosh mentioned this issue Sep 15, 2024

Add jf runner show-config #179

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Runner` doesn't pick up config and code changes until restarted #155

`Runner` doesn't pick up config and code changes until restarted #155

janosh commented Jul 25, 2024 •

edited

Loading

davidwaroquiers commented Jul 25, 2024

janosh commented Jul 25, 2024

davidwaroquiers commented Jul 25, 2024

janosh commented Jul 25, 2024

davidwaroquiers commented Jul 26, 2024

gpetretto commented Jul 26, 2024

janosh commented Jul 26, 2024 •

edited

Loading

Runner doesn't pick up config and code changes until restarted #155

Runner doesn't pick up config and code changes until restarted #155

Comments

janosh commented Jul 25, 2024 • edited Loading

davidwaroquiers commented Jul 25, 2024

janosh commented Jul 25, 2024

davidwaroquiers commented Jul 25, 2024

janosh commented Jul 25, 2024

davidwaroquiers commented Jul 26, 2024

gpetretto commented Jul 26, 2024

janosh commented Jul 26, 2024 • edited Loading

`Runner` doesn't pick up config and code changes until restarted #155

`Runner` doesn't pick up config and code changes until restarted #155

janosh commented Jul 25, 2024 •

edited

Loading

janosh commented Jul 26, 2024 •

edited

Loading