
Submitting jobs

The list of samples to be processed is maintained in CSV files.

These files specify the name assigned to each sample, its cross section and BR, the dataset name, the number of files per job, and additional per-sample customization (JSONs, special parameters, etc.); see below for more details.

One can comment lines to select which samples to run on.

Steps for a full production:

  1. Setup your GRID credentials:

    voms-proxy-init --voms cms
    
  2. Create the job scripts (for example, the 2018 Data):

    batch_Condor.py samples_2018_Data.csv
    

    The job scripts are created in a folder named PROD_<name_of_csv_file>_<CJLST_framework_commit_number>, e.g. PROD_samples_2018_Data_b738728 in this case. Note that enough quota should be available in the submit area to hold all output; this is typically not the case in AFS home areas. Since, as of this writing (3/2023), submission from /eos/ areas is not permitted, it is possible to specify a destination folder in /eos/home/ where jobs will transfer their output. With this option, the output is transferred directly to /eos at the end of each job, which also avoids excessive load on AFS. Example:

    batch_Condor.py samples_2018_Data.csv -t /eos/user/j/john/230426
    

    When using this option, please note that:

    • The destination folder must be under /eos/user/.
    • Subfolders are created in the destination folder, matching the production directory tree (PROD_samples[...]/Chunks)
    • Only the .root file and the job output log are transferred; the CONDOR logs are still created in the submission folder.
    • A link to the transfer area is created in the production folder.
  3. Submit the jobs (from lxplus; does not work from private machines):

    cd PROD_samples_2018_Data_b738728
    resubmit_Condor.csh
    

BONUS: just before submitting the jobs, it is a time-saving trick to check the status of the scheduler machines that will submit on your behalf (called bigbirdXX). First, type:

    condor_status -schedd

Then look for the bigbirdXX with the smallest number of running jobs and switch your machine to that one by typing:

    export _condor_SCHEDD_HOST="bigbirdXX.cern.ch"; export _condor_CREDD_HOST="bigbirdXX.cern.ch"

  4. Check how many jobs are still running or pending on Condor:

    condor_q
    
  5. Once the jobs are done, run the following from the same folder (regardless of whether the -t option mentioned above was used):

    checkProd.csh
    

    This checks all jobs and moves all those which finished correctly to a subfolder named AAAOK.

    It can be safely run even while jobs are still running; job folders are moved only once they are fully completed.

    A failure rate of ~2% is normal due to transient problems with worker nodes. To resubmit failed jobs (repeat until all jobs succeed for data; a small fraction of failures is acceptable for MC):

    cleanup.csh
    resubmit_Condor.csh
    checkProd.csh
    # wait for all jobs to finish
    
  6. Once all jobs are finished, the trees from the different jobs need to be hadded. This can be done either locally (for private tests and productions) or into standard storage for a central production:

    • For local hadding (no -t option specified), run in the submission directory:
      haddChunks.py AAAOK
      
    • If the -t option has been used to transfer the output directly to EOS, the command should be run from the EOS folder:
      haddChunks.py PROD_samples<XXX>
      
    • For a central production: either use -t or copy the hadded trees to a subdirectory of your CERNBox directory /eos/user/<initial>/<username> (see the sketch after this list). From the CERNBox website (http://cernbox.cern.ch), find this subdirectory and share it with the e-group "[email protected]", then let the production coordinator know the location of the trees.
  7. For data, haddData.py can be used to merge the different PDs (primary datasets).
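
A purely illustrative sketch of the copy to CERNBox mentioned in the central-production bullet above (all paths and folder names below are placeholders; the destination reuses the example EOS area shown for the -t option):

    mkdir -p /eos/user/j/john/trees_2018                            # hypothetical CERNBox destination
    cp -r <hadded sample folders> /eos/user/j/john/trees_2018/      # copy the hadded trees produced by haddChunks.py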

Notes for advanced usage

CSV file field specification

identifier: name given to the sample.
process: reserved for future development.
crossSection, BR: The xsection weight in the trees will be set using xsection*BR. The two values are kept separate just for logging reasons as only their product is used, so one can directly set the product in the crossSection field, use BR for the filter efficiency, etc.
execute: reserved for future development.
dataset, prefix: if prefix = dbs, dataset is assumed to be the dataset name and the list of files is taken from DAS. Otherwise, the string "prefix+dataset" is assumed to be an EOS path.
pattern: reserved for future development.
splitLevel: number of files per job.
::variables: list of additional python variables to be set for this specific sample.
::pyFragments: list of additional python snippets to be loaded for this specific sample (e.g. JSON); assumed to be found in the pyFragments subdir.
comment: a free-text comment.

resubmit_Condor.csh and grid certificates

The script submits all jobs through CONDOR. In addition:

  • it logs the job number in the log files under log/ (useful in case the status of a specific job needs to be checked, or the job needs to be killed);
  • it checks whether a valid grid proxy is available and, in that case, makes it available to the worker node. This is needed to run on samples that are not available on EOS. See here for details on how to get a grid proxy.
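
To quickly check whether a valid proxy is already available before submitting, the standard VOMS client commands can be used (a minimal sketch; the --voms cms option matches the setup step above):

    voms-proxy-info --timeleft   # remaining proxy lifetime in seconds; 0 means no valid proxy
    voms-proxy-init --voms cms   # create a new proxy if needed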

checkProd.csh and handling of failures

The checkProd.csh script checks the cmsRun exit status (which is recorded in the exitStatus.txt file) and the validity of the root file after it is copied back to AFS.

It can be safely run even while jobs are still running; job folders are moved only once they are fully completed.

In addition, if "mf" is specified as an argument, the script moves jobs that failed with a cmsRun exit status != 0 to a folder named AAAFAIL. These can be resubmitted with the usual scheme (cd in, clean up, resubmit; see the sketch below), but it is a good idea to rename the folder beforehand (e.g. mv AAAFAIL AAARESUB1) if other jobs are still pending; otherwise, rerunning checkProd.csh mf will add more failed jobs to the same directory. Be careful not to resubmit jobs that are already running!! There is currently no protection against this.
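
As a minimal sketch of that resubmission scheme (the folder name follows the example above; make sure no jobs from this set are still running):

    mv AAAFAIL AAARESUB1     # rename so that later checkProd.csh mf runs do not mix in new failures
    cd AAARESUB1
    cleanup.csh
    resubmit_Condor.csh
    # once the resubmitted jobs have finished, check them again as in the main workflow:
    checkProd.csh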

Note that checkProd.csh does not catch all possible failures, as some jobs fail without reporting an exit status != 0.

Debugging failures

If a large fraction of chunks fail, please inspect the log files to determine the cause of the failures. Failures fall broadly into a few categories:

  • Files that are not accessible: an error appears right after attempting to open the file, possibly after several retries and/or after the open is attempted at different sites. Specifying the global redirector may help in this case (see below), as it may pick a replica at a different site.
  • Corrupted files: the file is opened and some events are processed, but a read error occurs in the middle. This can be due to corrupted replicas at some sites, or to communication issues with the remote sites. Also in this case, specifying the global redirector may help.
  • CONDOR kills the job because it exceeds resource usage; this is generally apparent from the files in the log/ folder.
  • Genuine job crashes due to configuration problems or issues in the code; these need to be debugged by experts.
  • To determine the source of the problem, start by inspecting the log.txt.gz file of each failing chunk (see the example after this list):
    • Start from the end of the file and go back to find the actual error. Note that the last error message is not necessarily the cause; for example, corrupted files generally result in a file read error followed by a crash, which is the effect, not the cause, of the problem.
    • If the log output stops with no error message, the problem may be a CONDOR issue, a timeout, etc. Check the files in the log/ folder, which should report what happened on the CONDOR side.
    • In some cases, in particular when running with the EOS transfer (-t) option, the log.txt.gz file may be missing. In this case, try one of the following:
      • check the files in the log/ folder for any hint;
      • run the job interactively, redirecting the output (cmsRun run_cfg.py |& tee log.txt);
      • resubmit the job;
      • if the output is still missing, try remaking the chunks in a different folder without the -t option and resubmit the failing one alone.
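
For example, to look at the end of a chunk's compressed log without unpacking it (standard shell tools; the chunk path is a placeholder):

    cd <failing chunk folder>
    zcat log.txt.gz | tail -n 100   # last lines of the job log, where the error usually shows up
    zless log.txt.gz                # or browse the full log interactively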

If the failure is due to a file access error, do the following:

  • Specify a redirector explicitly and resubmit the job. This is generally the first thing to try.
    A script to patch chunks is available here.
  • If you suspect the file is not accessible or corrupted, try copying the file locally:
    xrdcp -d 1 -f root://cms-xrd-global.cern.ch//store/[...] $TMP
    • If the transfer fails, report it to JIRA (see below).
    • Even if the transfer succeeds, the file may still be corrupted; this is likely the case if, in log.txt.gz, the file is opened successfully and an error appears after a number of events have been read. To verify this, obtain the checksum of the transferred file with: xrdadler32 $TMP/[file]
      and compare it with the value reported by DAS (search for the file path, click on "show" and look for the "adler32" value); see the sketch after this list.
  • Sometimes, due to network problems, reading remote files keeps failing even though copying them with xrdcp as described above succeeds. In this case, it is possible to hack run_cfg.py to prefetch the files before starting the processing, by adding the following at the end of run_cfg.py (and REMOVING the redirector for loop added by forceRedirector.csh, if present):
    import os
    for i,f in enumerate(process.source.fileNames) :
      tempdir = "."
      print("prefetching", f)
      os.system("xrdcp -d 1 -f root://cmsxrootd.fnal.gov/"+f+" "+tempdir)
      process.source.fileNames[i] = "file:"+tempdir+"/"+os.path.basename(f)
    print("New fileNames:")
    print(process.source.fileNames)
    
    IMPORTANT: this is sufficient for job sets created with the -t (transfer to EOS) option. If the production is set up to write the output back to the submission area, CONDOR will try to copy back the prefetched files as well. To avoid this, please add the following to the batchScript.sh script within each of the concerned chunks, before the last line (exit $cmsRunStatus):

find . -type f -name "*.root" ! -name 'ZZ4lAnalysis.root' -exec rm -f {} \;

  • Please note that opening the file from ROOT via xrootd may fail for no good reason if the file is specified as a command line argument; if you want to try this, please use TFile::Open("root://cms-xrd-global.cern.ch//store/[...]"). Even in this case, the fact that the file opens correctly does not exclude that it is bad (e.g. truncated) on the remote server.
  • Note that a file may have replicas at different sites, so running at different times may lead to a different outcome because xrootd picks a different replica. Some details on accessing a specific replica with rucio can be found in this thread.
  • It is possible to ignore file access failures by adding the following at the end of run_cfg.py:
    process.source.skipBadFiles = True
    This can be useful if a job keeps failing on a given file and one wants to process the rest of the job's files nevertheless. If using this, keep in mind that the job will succeed even if not all input files have been processed!
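
Putting the checksum cross-check described above together (a minimal sketch; the /store path is a placeholder, and the dasgoclient query is an assumption for reading the adler32 value from the command line instead of the DAS web page):

    # copy the suspicious file locally
    xrdcp -d 1 -f root://cms-xrd-global.cern.ch//store/path/to/file.root $TMP/file.root
    # checksum of the local copy
    xrdadler32 $TMP/file.root
    # adler32 according to DAS, for comparison (assumed dasgoclient query; the web interface works as well)
    dasgoclient -query="file file=/store/path/to/file.root | grep file.adler32"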

Reporting bad or inaccessible files

Please open a ticket on JIRA for files that you have verified to be inaccessible or corrupted (bad checksum). An example of such a report can be found here.

Debugging Condor problems

If jobs take forever to start, or go to HOLD state, please refer to this debugging guide.
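
As a quick first check before going through the guide, the standard HTCondor query commands can show why jobs are idle or held (the job/cluster ID is a placeholder):

    condor_q -hold                          # list held jobs together with their hold reasons
    condor_q -better-analyze <cluster ID>   # explain why a job is not being matched or started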