How to handle datasets that require processing? #7

Open
hokinus opened this issue Jul 28, 2023 · 3 comments

@hokinus (Contributor) commented Jul 28, 2023

Some of the datasets used for training require processing, such as upscaling. Example: https://github.com/YudongYao/AutoPhaseNN/blob/main/PyTorch/prepare_defectFree_data.ipynb

The procedure here is:

  1. Download the provided data files
  2. Process the files and upsample them
  3. Use the upsampled files for training

I see a few ways of handling this:

  1. As the upsampled dataset is static, create it externally and use it as a reference dataset
    • does not require anything additional in Sabath, but we will need to set up a repository for datasets
  2. Upsample after download (a sketch follows this list)
    • needs an additional mechanism in Sabath to process data as part of the dataset specification
    • it is done only once
  3. Upsample before the run
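
For illustration only, option 2 could attach a processing step to the dataset specification so it runs once, right after download. Everything below (the process key, the script arguments, the URL) is a hypothetical sketch, not an existing Sabath feature:

# Hypothetical sketch of option 2: the dataset specification carries a
# one-time processing step that runs right after the raw files are fetched.
dataset_spec = {
    "name": "autophasenn-defect-free",
    "fetch": "https://example.org/defect_free_raw.tar.gz",  # placeholder URL
    "process": {
        "command": "python upscale.py --input raw/ --output upsampled/",
        "runs_once": True,  # re-run only if the raw data changes
    },
}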

@luszczek @laszewsk What would be your preferred approach?

@luszczek (Contributor) commented

I compare this to regular software built from source code. To run it, we first need to download it, which is handled by the fetch command. But then we have to go through multiple stages: configure, build, and install. I suggest adding stages to the commands. For downloading, it could be the network transfer of a tar-ball and then expanding the tar-ball on disk. For running, it could be the stage of image upscaling followed by training and/or inference. Something like this JSON example:

"run" : [
  {
    "name" : "upscale",
   "command" : "python upscale.py"
  },{
    "name" : "train",
    "command" : "python train.py"
 }
]

I'd have to add a stage tracker so that each run does not repeat all stages from scratch but instead picks up where the previously interrupted stage finished. In the case of PtychoNN, upscaling should only happen once.

At this point, I don't know how to cleanly introduce checkpointing of stages in case of interruptions due to, e.g., an allocation running out or hardware failing.
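
A minimal sketch of such a stage tracker, assuming a stages list like the JSON example above and a per-stage marker file; the state directory, marker names, and resume logic are assumptions, not existing Sabath behavior:

import subprocess
from pathlib import Path

# Run the stages from the "run" list in order; skip any stage whose marker
# file already exists, so an interrupted run resumes where it left off.
def run_stages(stages, state_dir="state"):
    state = Path(state_dir)
    state.mkdir(exist_ok=True)
    for stage in stages:
        marker = state / f"{stage['name']}.done"
        if marker.exists():
            continue  # finished in a previous run
        subprocess.run(stage["command"], shell=True, check=True)
        marker.touch()  # record completion only after the command succeeds

run_stages([
    {"name": "upscale", "command": "python upscale.py"},
    {"name": "train", "command": "python train.py"},
])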

@hokinus (Contributor, Author) commented Jul 31, 2023

In the past I used the following mechanisms to track stage completion:

  1. Mark a stage as finished by placing an empty file, e.g. .done, in the respective output directory once it completes; the file can be stage-specific, e.g. .stage1_done.
  2. More robust: track completion as above, but keep information about the inputs/configuration inside the file, either as a full configuration or just MD5 checksums, for example:
input:
  model:
    checksum: abcd
  data:
    checksum: xyz
  config:
    checksum: 1234

This way it is easy to check whether the configuration, processing, or upstream data has changed and the processed data is stale and needs to be regenerated.
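
A rough sketch of that second variant, storing the checksums in the marker file as JSON rather than the YAML-like layout above; the file names (model.pt, data.h5, config.yaml, .stage1_done) are placeholders:

import hashlib
import json
from pathlib import Path

def md5(path):
    # MD5 of a file's contents, used only as a staleness fingerprint
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def checksums(inputs):
    return {name: md5(path) for name, path in inputs.items()}

def is_stale(done_file, inputs):
    done = Path(done_file)
    # stale if the stage never ran or any input checksum changed
    return not done.exists() or json.loads(done.read_text()) != checksums(inputs)

def mark_done(done_file, inputs):
    Path(done_file).write_text(json.dumps(checksums(inputs)))

inputs = {"model": "model.pt", "data": "data.h5", "config": "config.yaml"}
if is_stale(".stage1_done", inputs):
    # ... regenerate the processed data here ...
    mark_done(".stage1_done", inputs)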

@laszewsk (Contributor) commented

We also need some feature that tests a condition; maybe we can generalize it:

success:

  • python: os.path.exists(filename)
  • sh: sha {filename} == xyz

Note: since a lot of things are complex to express in sh, having Python available as well would be nice.
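
A rough sketch of what such a generalized success check could look like, with one python and one sh condition as suggested above; the function, the eval-based expression check, and the example file name are assumptions, not an existing Sabath feature:

import os
import subprocess

def check_success(conditions, context):
    # A stage succeeds only if every condition passes. A "python" condition
    # is an expression evaluated against the context (eval is for
    # illustration only); an "sh" condition passes if the command exits 0.
    for cond in conditions:
        if "python" in cond:
            ok = bool(eval(cond["python"], {"os": os}, dict(context)))
        elif "sh" in cond:
            ok = subprocess.run(cond["sh"].format(**context), shell=True).returncode == 0
        else:
            ok = False
        if not ok:
            return False
    return True

print(check_success(
    [{"python": "os.path.exists(filename)"},
     {"sh": "sha256sum {filename}"}],
    {"filename": "output.h5"},
))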
