How to handle datasets that require processing? #7

Open
hokinus opened this issue Jul 28, 2023 · 3 comments

@hokinus (Contributor) commented Jul 28, 2023

Some of the datasets used for training require processing, such as upscaling. Example: https://github.com/YudongYao/AutoPhaseNN/blob/main/PyTorch/prepare_defectFree_data.ipynb

The procedure here is:

  1. Download the provided data files
  2. Process the files and upsample them
  3. Use the upsampled files for training

I see a few ways of handling this:

  1. As the upsampled dataset is static, create it externally and use it as a reference dataset
    • does not require anything additional in Sabath, but we will need to set up a repository for datasets
  2. Upsample after download (a sketch follows this list)
    • needs an additional mechanism in Sabath to process data as part of the dataset specification
    • it is done only once
  3. Upsample before the run
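
For illustration only, option 2 could attach a processing step to the dataset specification so it runs once, right after download. Everything below (the process key, the script arguments, the URL) is a hypothetical sketch, not an existing Sabath feature:

# Hypothetical sketch of option 2: the dataset specification carries a
# one-time processing step that runs right after the raw files are fetched.
dataset_spec = {
    "name": "autophasenn-defect-free",
    "fetch": "https://example.org/defect_free_raw.tar.gz",  # placeholder URL
    "process": {
        "command": "python upscale.py --input raw/ --output upsampled/",
        "runs_once": True,  # re-run only if the raw data changes
    },
}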

@luszczek @laszewsk What would be your preferred approach?

@luszczek (Contributor) commented

I compare this to regular software built from source code. To run it, we first need to download it, which is handled by the fetch command. But then we have to go through multiple stages: configure, build, and install. I suggest adding stages to the commands. For downloading, it could be the network transfer of a tar-ball and then expanding the tar-ball on disk. For running, it could be the stage of image upscaling followed by training and/or inference. Something like this JSON example:

"run" : [
  {
    "name" : "upscale",
   "command" : "python upscale.py"
  },{
    "name" : "train",
    "command" : "python train.py"
 }
]

I'd have to add a stage tracker so that each run does not repeat all stages from scratch but instead picks up where the previously interrupted stage finished. In the case of PtychoNN, upscaling should only happen once.

At this point, I don't know how to cleanly introduce checkpointing of stages in case of interruptions due to, e.g., an allocation running out or hardware failing.
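
A minimal sketch of such a stage tracker, assuming a stages list like the JSON example above and a per-stage marker file; the state directory, marker names, and resume logic are assumptions, not existing Sabath behavior:

import subprocess
from pathlib import Path

# Run the stages from the "run" list in order; skip any stage whose marker
# file already exists, so an interrupted run resumes where it left off.
def run_stages(stages, state_dir="state"):
    state = Path(state_dir)
    state.mkdir(exist_ok=True)
    for stage in stages:
        marker = state / f"{stage['name']}.done"
        if marker.exists():
            continue  # finished in a previous run
        subprocess.run(stage["command"], shell=True, check=True)
        marker.touch()  # record completion only after the command succeeds

run_stages([
    {"name": "upscale", "command": "python upscale.py"},
    {"name": "train", "command": "python train.py"},
])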

@hokinus (Contributor, Author) commented Jul 31, 2023

In the past I used the following mechanisms to track stage completion:

  1. Mark a stage as finished by placing an empty file, e.g. .done, in the respective output directory once it completes; the file can be stage-specific, e.g. .stage1_done.
  2. More robust: track completion as above, but keep information about the inputs/configuration inside the file, either as a full configuration or just MD5 checksums, for example:
input:
  model:
    checksum: abcd
  data:
    checksum: xyz
  config:
    checksum: 1234

This way it is easy to check whether the configuration, processing, or upstream data has changed and the processed data is stale and needs to be regenerated.
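
A rough sketch of that second variant, storing the checksums in the marker file as JSON rather than the YAML-like layout above; the file names (model.pt, data.h5, config.yaml, .stage1_done) are placeholders:

import hashlib
import json
from pathlib import Path

def md5(path):
    # MD5 of a file's contents, used only as a staleness fingerprint
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def checksums(inputs):
    return {name: md5(path) for name, path in inputs.items()}

def is_stale(done_file, inputs):
    done = Path(done_file)
    # stale if the stage never ran or any input checksum changed
    return not done.exists() or json.loads(done.read_text()) != checksums(inputs)

def mark_done(done_file, inputs):
    Path(done_file).write_text(json.dumps(checksums(inputs)))

inputs = {"model": "model.pt", "data": "data.h5", "config": "config.yaml"}
if is_stale(".stage1_done", inputs):
    # ... regenerate the processed data here ...
    mark_done(".stage1_done", inputs)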

@laszewsk (Contributor) commented

We also need some feature that tests a condition; maybe we can generalize it:

success:

  • python: os.path.exists(filename)
  • sh: sha {filename} == xyz

Note: since a lot of things are complex to express in sh, having Python available as well would be nice.
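
A rough sketch of what such a generalized success check could look like, with one python and one sh condition as suggested above; the function, the eval-based expression check, and the example file name are assumptions, not an existing Sabath feature:

import os
import subprocess

def check_success(conditions, context):
    # A stage succeeds only if every condition passes. A "python" condition
    # is an expression evaluated against the context (eval is for
    # illustration only); an "sh" condition passes if the command exits 0.
    for cond in conditions:
        if "python" in cond:
            ok = bool(eval(cond["python"], {"os": os}, dict(context)))
        elif "sh" in cond:
            ok = subprocess.run(cond["sh"].format(**context), shell=True).returncode == 0
        else:
            ok = False
        if not ok:
            return False
    return True

print(check_success(
    [{"python": "os.path.exists(filename)"},
     {"sh": "sha256sum {filename}"}],
    {"filename": "output.h5"},
))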
