
make preparelocal use S3 for tarball and unify with --dryrun #6544

Open
belforte opened this issue Apr 16, 2021 · 17 comments

Comments

@belforte
Member

Currently preparelocal creates a tarball (inside DagmanCreator) with everything needed to execute the job wrapper,
called InputFiles.tar.gz, and sends it to the schedd, from where crab preparelocal
fetches it to create a local directory in which to run the job.

That tarball should be transferred via the S3 cache instead, possibly with the same code as for --dryrun.

Even better, --dryrun should be executed inside the directory created by preparelocal, and should not be part of the submit command.

Something like:

  • submit --dry : does everything submit does, uploads all tarballs, but does not submit to the schedd
  • preparelocal : as now, creates a local directory for a task submitted normally or with --dry
  • crab testrun (new command) : runs in the local directory for 10 events (like the current submit --dry)

Also, the schedd currently holds both InputFiles.tar.gz and input_files.tar.gz! Pretty confusing.

The difficulty is finding a way to implement this one piece at a time, without breaking things.

@belforte
Member Author

It should be enough to go in this order:

  1. make crab preparelocal use files from S3 and never talk to the schedd
  2. implement crab testrun as something to do after crab preparelocal
  3. change submit --dry to stop after uploading and point the user to crab testrun

@belforte
Member Author

this relates to #7461

@novicecpp novicecpp self-assigned this Aug 12, 2024
@novicecpp
Contributor

novicecpp commented Aug 15, 2024

The things we agreed on in the last meeting are:

  • We can deprecate the crab submit --dryrun behavior.
    • The current behavior is: we upload a "custom" InputFiles.tar.gz to S3 and set the task status to UPLOADED (see DryRunUploader). Then the client pulls this custom InputFiles.tar.gz, extracts it, and runs it locally. Each run increases the number of events consumed and measures the time, to give the user an idea of how to adjust the splitting parameters.
    • Users most likely want to check whether crabConfig.py/PSet.py are able to run on the grid rather than see how long it takes (citation needed).
  • Nothing changes for preparelocal. It downloads the submitted files and prepares the environment for running locally. Note that to do preparelocal, the user needs to submit a real task with a limited total number of jobs.
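The current --dryrun local run described above could be sketched roughly as follows. This is an illustrative sketch only: the function names and the event-growth schedule are assumptions, not the actual CRABClient code.

```python
import time

def dryrun_timing(run_job, start_events=10, rounds=3):
    """Illustrative sketch of the current `submit --dryrun` local run:
    execute the job on a growing number of events and record how long
    each round takes, so the user can tune splitting parameters.
    `run_job` stands in for the extracted job wrapper."""
    events = start_events
    report = []
    for _ in range(rounds):
        t0 = time.time()
        run_job(events)
        report.append((events, time.time() - t0))
        events *= 2  # the growth schedule here is an assumption
    return report
```

The returned list of (events, seconds) pairs is what lets the user extrapolate a sensible splitting parameter.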

@novicecpp
Contributor

novicecpp commented Aug 15, 2024

I would like to explain first how we move input files (wrapper, scripts, and user sandbox) in the crab systems.

In a normal job submission:

  • First, the user submits the task and the client uploads the sandbox to S3 (s3://crabcache/<cernusername>/sandboxes/<sha1>.tar.gz).
  • Second, TW creates InputFiles.tar.gz as a gzipped tarball to transfer to the schedd later:

        tf = tarfile.open('InputFiles.tar.gz', mode='w:gz')
        try:
            for ifname in inputFiles + subdags + ['input_args.json']:
                tf.add(ifname)

  • Third, InputFiles.tar.gz is "submitted" to the schedd along with subdag.jdl (I do not know what it is) through the transfer_input_files setting of the condor submit command:

        jobJDL["transfer_input_files"] = str(info['inputFilesString'])

  • Fourth, on the schedd, dag_bootstrap_startup.sh extracts InputFiles.tar.gz into SPOOL_DIR:

        TARBALL_NAME="InputFiles.tar.gz"

  • Finally, when Dagman submits the job, it uses transfer_input_files again, but only uploads the necessary files. You can see the list here and here:
    info.setdefault("additional_input_file", "")
    if os.path.exists("CMSRunAnalysis.tar.gz"):
        info['additional_environment_options'] += 'CRAB_RUNTIME_TARBALL=local'
        info['additional_input_file'] += ", CMSRunAnalysis.tar.gz"
    else:
        raise TaskWorkerException("Cannot find CMSRunAnalysis.tar.gz inside the cwd: %s" % os.getcwd())
    if os.path.exists("TaskManagerRun.tar.gz"):
        info['additional_environment_options'] += ' CRAB_TASKMANAGER_TARBALL=local'
    else:
        raise TaskWorkerException("Cannot find TaskManagerRun.tar.gz inside the cwd: %s" % os.getcwd())
    if os.path.exists("sandbox.tar.gz"):
        info['additional_input_file'] += ", sandbox.tar.gz"
    info['additional_input_file'] += ", run_and_lumis.tar.gz"
    info['additional_input_file'] += ", input_files.tar.gz"
    info['additional_input_file'] += ", submit_env.sh"
    info['additional_input_file'] += ", cmscp.sh"

@novicecpp
Contributor

novicecpp commented Aug 15, 2024

So, to unify preparelocal and submit --dryrun,
I will make submit --dryrun have a similar UX to preparelocal, but without submitting to the schedd. This functionality is for users to test their new tasks before the real crab submission (thanks to Dario for the idea and for pointing out the pain of preparelocal).

This is why I said earlier that I want to deprecate the behavior, not remove the command.

This is possible and easy to do on the client side. Thanks to Dario again for the crab recover command, which uses other commands like crab status/kill behind the scenes.

Basically, crab submit --dryrun will submit the task to let TW create the InputFiles.tar.gz, wait until the task status changes to UPLOADED, and then continue by triggering the preparelocal command.

For the server side, well... a lot of code changes are needed: obviously InputFiles.tar.gz needs to be uploaded to S3, and the sandbox needs to be split out of InputFiles.tar.gz so we do not reupload the sandbox again and again, especially for campaign-like tasks that share a sandbox.

I will write the details down tomorrow.
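The client-side flow just described (submit, wait for UPLOADED, reuse preparelocal) could be sketched like this. `submit`, `get_status`, and `preparelocal` are stand-ins for the real CRABClient internals, and the polling helper is hypothetical; only the status name UPLOADED comes from the comment above.

```python
import time

def wait_for_status(get_status, target="UPLOADED", timeout=600, poll=5):
    """Poll the task status until it reaches `target` (hypothetical
    helper; the real client would query the REST task database)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_status() == target:
            return
        time.sleep(poll)
    raise TimeoutError(f"task never reached status {target}")

def submit_dryrun(submit, get_status, preparelocal, poll=5):
    """Sketch of the proposed `crab submit --dryrun`: submit so that TW
    creates InputFiles.tar.gz, wait for status UPLOADED, then trigger
    the existing preparelocal machinery instead of a custom local run."""
    submit(dry_run=True)
    wait_for_status(get_status, poll=poll)
    preparelocal()
```

This mirrors how crab recover composes other commands behind the scenes.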

@novicecpp
Contributor

And of course thanks to Stefano for the original ideas on how to unify both commands, and for the attempt at improving the --dryrun code (both client and server side), which I can reuse to fix this issue.

@belforte
Member Author

belforte commented Aug 16, 2024

sounds like you know more than me about this matter now :-)

subdag.jdl is used for automatic splitting; we do not want users to run the automatic splitting machinery interactively via --dryrun

@belforte belforte mentioned this issue Aug 23, 2024
23 tasks
@novicecpp novicecpp assigned belforte and unassigned novicecpp Aug 23, 2024
@novicecpp
Contributor

Here are the pointers to the code:
Server: PR #8645
Client: PR dmwm/CRABClient#5329

These PRs change 3 things:

  • Separate sandbox.tar.gz from InputFiles.tar.gz.
    • InputFiles.tar.gz is the file that contains everything we need to run on the schedd and worker node.
    • This is needed to unify the code between submit, submit --dryrun, and preparelocal.
  • crab submit --dryrun does preparelocal, but the task is not submitted to the schedd.
  • preparelocal --jobid 1 simply calls preparelocal and executes the job in one go.
    • When I read the preparelocal code, I felt it does not make sense to have a different way to run the job. I might be wrong though.

Because I separated the sandbox from InputFiles.tar.gz, I am not sure whether some other commands or code assume the sandbox is in there, besides the code I changed.
So it needs more testing. But the simple test passed: https://gitlab.cern.ch/crab3/CRABServer/-/pipelines/7990590 (the branch is based on master 7f4e9eed).

@novicecpp
Contributor

@belforte I have moved this task to Todo in "CRAB Work Planner".
Feel free to bump priority as you see fit.

@belforte
Member Author

belforte commented Sep 28, 2024

I looked at the code and have only some minor questions.
Time to lay out a plan. As a general strategy I'd rather break submit --dryrun for a while than mix old and new code. Cleaning up has always been difficult.

  1. add support for checkS3Object in RESTCache. This can go in immediately; then deploy the new REST
  2. add checkS3Object to ServerUtilities
  3. modify AdjustSites and dag_bootstrap_startup to download the sandbox from S3. Can go in at any time
  4. modify DagmanCreator to stop downloading sandbox.tar.gz and to upload a slimmed InputFiles.tar.gz. Can go in at any time after 3.
  5. could the Client changes go in here? Need to test
  6. modify Handler to use the new DryRun.py instead of DryRunUploader.py
  7. deploy the new Client, if not done at 5.
  8. take this chance to also fix "complete transitioning preparelocal to args via JSON" CRABClient#5288, possibly adding the new tarball
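For steps 1 and 2, a checkS3Object could boil down to a HEAD request against a presigned URL. A minimal sketch under assumptions: the function name comes from the plan above, but the implementation and the `http_head` injection point are illustrative, not the actual RESTCache/ServerUtilities code.

```python
from urllib import request, error

def check_s3_object(presigned_url, http_head=None):
    """Return True if the object exists, False if S3 says it does not.
    `http_head` is injectable for testing; by default it issues a real
    HEAD request to the presigned URL."""
    if http_head is None:
        def http_head(url):
            req = request.Request(url, method="HEAD")
            try:
                with request.urlopen(req) as resp:
                    return resp.status
            except error.HTTPError as exc:
                return exc.code
    code = http_head(presigned_url)
    if code == 200:
        return True
    if code in (403, 404):  # S3 can answer 403 for a missing key
        return False
    raise RuntimeError(f"unexpected HTTP status {code} for HEAD request")
```

A HEAD request avoids downloading the tarball just to learn whether it is there.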

@belforte
Member Author

OK. Let's deploy the change to DagmanCreator ASAP so we can push the new client into production.

@belforte
Member Author

belforte commented Oct 10, 2024

back to this. Code from Wa is in https://github.com/novicecpp/CRABServer/tree/get_sandbox_in_schedd
I have now imported that branch in my repo: https://github.com/belforte/CRABServer/tree/Sandbox-in-Sched-from-Wa

@belforte
Member Author

belforte commented Oct 11, 2024

1 and 2 done in #8740
3 done in #8743

will do 4 once those are tested and merged
4 done in #8745

@belforte
Member Author

While doing 4, I found a bug in #8740 which only affected the sandbox existence check in the new code.
Will fix it in the same PR that pushes 4.

But I have also found that we still need debug_files.tar.gz to be moved around via TW, which creates confusion. We should change AdjustSites.py to download that and create the debug_files directory at bootstrap time.
Better yet, find a way to upload them as individual files to S3, as is done for twlog, but that may require extensive changes to Client, ServerUtilities and RESTCache. The current code supports uploading a single tarball using objecttype=debugfiles, but we want 3 separate files for config, pset and scriptExe.
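Uploading the three debug files as separate S3 objects might look like the sketch below. The per-file objecttype naming scheme is hypothetical (as noted above, the current code only supports a single tarball with objecttype=debugfiles), and `upload_to_s3` stands in for the real RESTCache upload call.

```python
def upload_debug_files(upload_to_s3, taskname, files):
    """Upload each debug file (config, pset, scriptExe) as its own S3
    object. `upload_to_s3` stands in for the RESTCache upload call; the
    `debugfile_<kind>` objecttype values are an assumption."""
    uploaded = []
    for kind in ("config", "pset", "scriptExe"):
        path = files.get(kind)
        if path is not None:  # scriptExe is optional
            upload_to_s3(taskname=taskname,
                         objecttype=f"debugfile_{kind}",
                         filepath=path)
            uploaded.append(kind)
    return uploaded
```

Keeping the files separate would let AdjustSites.py fetch exactly what it needs at bootstrap time instead of unpacking a tarball.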

@belforte
Member Author

belforte commented Oct 18, 2024

let's go with a:

@belforte
Member Author

belforte commented Oct 22, 2024

At the moment, preparelocal in the current client does not work with the new TW, and the new client does not work with the current TW. I need to make at least one of them backward compatible.

@belforte
Member Author

belforte commented Oct 24, 2024

I decided to go in steps:

  • deploy a Client compatible with the new handling of tarballs via S3: add support for tarballs in S3, not on webdir CRABClient#5340
  • deploy the TW changes done until now
  • do further changes for dryrun and do not worry if it stays broken until the client is updated. Or maybe simply deprecate the "run" part of submit --dryrun and tell users to use preparelocal and run_job

@aspiringmind-code @novicecpp I will gladly take your advice if you feel like suggesting something different
