preDag divides by 0 when all probe jobs fails #6926

Closed
belforte opened this issue Dec 21, 2021 · 1 comment

@belforte
Member

ref: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/6264/1.html
and related to dmwm/CRABClient#5142

When all probe jobs fail, PreDAG should detect this and handle it properly, rather than blindly raising an exception while computing the events throughput in:

eventsThr = sumEventsThr / count
eventsSize = sumEventsSize / count

Tue, 21 Dec 2021 00:24:12 CET(+0000):INFO:PreDAG Pre-DAG started with output redirected to /data/srv/glidecondor/condor_local/spool/6577/0/cluster75266577.proc0.subproc0/prejob_logs/predag.0.txt
Tue, 21 Dec 2021 00:24:12 CET(+0000):INFO:PreDAG found 5 completed jobs
Tue, 21 Dec 2021 00:24:12 CET(+0000):INFO:PreDAG jobs remaining to process: 0-1, 0-2, 0-3, 0-4, 0-5
Tue, 21 Dec 2021 00:24:12 CET(+0000):INFO:PreDAG jobs remaining to process: 0-1, 0-2, 0-3, 0-4, 0-5
Got a fatal exception: division by zero
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/data/srv/glidecondor/condor_local/spool/6577/0/cluster75266577.proc0.subproc0/TaskWorker/TaskManagerBootstrap.py", line 85, in <module>
    retval = bootstrap()
  File "/data/srv/glidecondor/condor_local/spool/6577/0/cluster75266577.proc0.subproc0/TaskWorker/TaskManagerBootstrap.py", line 28, in bootstrap
    return PreDAG.PreDAG().execute(*sys.argv[2:])
  File "TaskWorker/Actions/PreDAG.py", line 114, in execute
    retval = self.executeInternal(*args)
  File "TaskWorker/Actions/PreDAG.py", line 209, in executeInternal
    eventsThr = sumEventsThr / count
ZeroDivisionError: division by zero
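
One way to guard the averaging step is sketched below. This is only an illustration of the idea, not the code in PreDAG.py: the function name `average_probe_metrics` and the report keys `events`, `runtime` and `size` are hypothetical, and the real fix may look different. The point is that `count` is checked before dividing, so the caller can detect "no usable probe" and fail the task cleanly instead of hitting ZeroDivisionError.

```python
def average_probe_metrics(probe_reports):
    """Average throughput and event size over usable probe reports.

    Hypothetical helper: `probe_reports` is assumed to be a list of dicts
    with 'events', 'runtime' (seconds) and 'size' (bytes) keys.
    Returns (eventsThr, eventsSize), or None if no probe produced events.
    """
    sumEventsThr = 0.0
    sumEventsSize = 0.0
    count = 0
    for report in probe_reports:
        events = report.get("events", 0)
        runtime = report.get("runtime", 0)
        size = report.get("size", 0)
        # Skip failed probes: they carry no usable events or runtime.
        if events <= 0 or runtime <= 0:
            continue
        sumEventsThr += float(events) / runtime
        sumEventsSize += float(size) / events
        count += 1
    if count == 0:
        # All probes failed: let the caller handle it explicitly.
        return None
    return sumEventsThr / count, sumEventsSize / count
```

With something like this, the caller can test for `None`, log a clear message that every probe job failed, and return a non-zero exit code rather than crashing.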

@belforte belforte self-assigned this Dec 21, 2021
@belforte belforte changed the title from "preDag divides by 0 when al probe jobs fails" to "preDag divides by 0 when all probe jobs fails" Jan 11, 2022
belforte added a commit to belforte/CRABServer that referenced this issue Jan 12, 2022
belforte added a commit that referenced this issue Jan 12, 2022
* protect against probe jobs returning no events. Fix #6926

* some pylint cleanups
@belforte
Member Author

closed via #6959 in python3 branch

mapellidario added a commit to mapellidario/CRABServer that referenced this issue Feb 28, 2022
* parent 23707a1 (dmwm#6818)

Initial changes for python3. Make it possible to run with python3 on sched.

* use gocurl from CVMFS Fix dmwm#6822 (dmwm#6824)

* Belforte patch 1 (dmwm#6825)

* use gocurl from CVMFS Fix dmwm#6822 (dmwm#6823)

* add comment about py2/3 compatibility needs

* use status_cache in pickle format/. Fix dmwm#6820 (dmwm#6829)

* Remove most old "Panda" code (dmwm#6835)

* remove PandaServerInterface. for dmwm#6542

* remove unused taskbuffer. For dmwm#6542

* remove useless comment about Panda. For dmwm#6542

* remove PanDAExceptions. For dmwm#6542

* disallow panda scheduler in regexp. for dmwm#6542

* Remove old crab cache code (dmwm#6833)

* remove code in UserFileCache. for dmwm#6776

* remove reference to UserFileCache in setup.py. For dmwm#6776

* remove all code references to UserFileCache. For dmwm#6776

* remove all calls to panda stuff in the code (dmwm#6836)

* remove pada fields. For dmwm#6542

* remove references to pandajobid DB column in code. For dmwm#6542

* remove panda-related JobGroup. For dmwm#6542

* remove useless calls to JobGroup. For dmwm#6542

* remove all references in code to panda, jobset and jobgroups. For dmwm#6542

* Move away mysql fix 6837 (dmwm#6838)

* add a place for obsolete code

* move MYSQL code to obsolete dir. Fix dmwm#6837

* remove Databases/TaskDB/Oracle/JobGroup from build. Fix dmwm#6839 (dmwm#6840)

* use urllib3 in place of urllib2 (dmwm#6841)

* remove couchDb related code. Easy part for dmwm#6834 (dmwm#6842)

* Proper fix for autom split (dmwm#6843)

* py3 fix for hashblib

* proper py3 porting of urllib2.urlopen

* remove old code. For dmwm#6845 (dmwm#6847)

* Remove couch db code (dmwm#6848)

* remove couchDb related code. Easy part for dmwm#6834

* remove CouchDB code from DagmanResubmitter. For dmwm#6845

* remove CouchDB code from PostJob. For dmwm#6845

* remove isCouchDBURL, now unused. For dmwm#6845

* one more cleanup in PostJob. For dmwm#6845

* one more cleanup in PostJob. For dmwm#6845

* restore code deleted by mistake

* [py3] src/python/Databases suports py2 and py3 (dmwm#6828)

* scr/pytohn/CRABInterface supports py3 (dmwm#6831)

* [py3] src/python/CRABInterface - changes suggested by futurize

* removed uses of deprecated panda code

* validate_str instead of validate_ustr, deprecated in WMCore

* a hack to make it run for minimal purposes (dmwm#6850)

* complete removal of unused taskbuffer

* stop trying to remove failed migrations from 2019. Fix dmwm#6854 (dmwm#6856)

* Port to python3 recent small fixes from master (dmwm#6858)

* use gocurl from CVMFS Fix dmwm#6822 (dmwm#6823)

* add comment about py2/3 compatibility needs (dmwm#6826)

* add GH remote for Diego

* upload new config version (dmwm#6852)

* stop trying to remove failed migrations from 2019. Fix dmwm#6854 (dmwm#6855)

Co-authored-by: Daina <[email protected]>

* better logging of acquired publication files. Fix dmwm#6860 (dmwm#6861)

* remove unused/undef variable. fix dmwm#6864 (dmwm#6865)

* Second batch of fixes for crabserver REST in py3. (dmwm#6873)

* HTCondorWorkflow: decode to str before parsing

* HTCondorWorkflow: convert to str output of literal eval

* slight improve to stefano's `horrible hack`

* updated version of wmcore to 1.5.5 in requirements.txt

* Add more logging (dmwm#6877)

* add logging of tmp file removal

* avoid duplicating ids. Fix dmwm#6800

* get task (DAG) status form sched. Fix dmwm#6869 (dmwm#6874)

* get task (DAG) status form sched. Fix dmwm#6869

* improve comments

* rename cache_status_jel to cache_status and use it. Fix dmwm#6411 (dmwm#6878)

* validate both temp and final output LFNs. Fix dmwm#6871 (dmwm#6879)

* change back to use py3 for cache_status

last commit had changed by mistake to use python2 for cache_status

* make migration dbg Utils worn in container. Fix dmwm#6853 (dmwm#6886)

* Py3 for publisher (dmwm#6887)

* ensure tasks is a list

* basestring -> string

* no need to cast to unicode

* use python3 to start TaskPublish

* REST and TW - correctly encode/decode input/outputs of b64encode/b64decode

* stop inserting nose in TW tarball. Fix dmwm#6455 (dmwm#6888)

* stop inserting nose in TW tarball. Fix dmwm#6455

* make sure CRAB3.zip exists, improve comments

* improve log

* port to python3 branch of  dmwm@87ada3b

* port to python3 branch of dmwm@9a72d9e

* Make new publisher default (dmwm#6892)

* make NEW_PUBLISHER the default, fix dmwm#6412

* remove code swithing NEW_PUBLISHER. Fix dmwm#6410

* add comments

* start Publisher in py3 env (dmwm#6894)

* stupid typo

* py3 crabserver compatible with tasks submitted by py2 crabserver (dmwm#6907)

- tm_split_args: convert to unicode the values in the lists: 'lumis' and 'runs'wq!

* crabserver py3 - change tag for build with jenkins (dmwm#6908)

* Make tw work in py3 for dmwm#6899 (dmwm#6901)

* Queue is now lowercase, xrange -> range

* use python3 to start TW

* start TW from python3.8 dir

* workaround ldap currently missing in py3 build

* basestring --> str

* use binary files for pickle

* make sure to hande classAd defined as bytes as well

* remove MonALISA code. Fix dmwm#6911 (dmwm#6913)

* TW - new tag of WMCore with fix to credential/proxy (dmwm#6915)

* TW - remove Logger and ProcInfo from setup.py and from bin/htcondor_make_runtime.sh (dmwm#6916)

* TW - remove Logger and ProcInfo from setup.py

* TW - remove Logger and ProcInfo from bin/htcondor_make_runtime.sh

* TW - remove apmon from setup.py

* TW - update tag of WMCore to mapellidario/py3.211214patch1

* setup.py - remove RESTInteractions from CRABClient build (dmwm#6919)

* generate Error on bad extconfig format, remove old code, cleanup. Fix dmwm#6897 See also dmwm#6897 (comment) (dmwm#6910)

* better py3 comp. for authenticatedSubprocess. fix dmwm#6899 (comment) (dmwm#6927)

* remove references to asourl/asodb in TW (dmwm#6929)

* [py3] apply py3-modernization changes to whole dmwm/CRABServer (dmwm#6921)

* [py3] migrated TW/Actions/ to py3

* [py3] fix open() mode: str for json, bytes for pickle

* [py3] fix use of hashlib.sha1(): input must be bytes

* TaskWorker/Actions/StageoutCheck: use execute_command, not executeCommand

* Publish utils for py3 (dmwm#6941)

* use python3 to run DebugFailedBlockPublication

* use python3 to run FindFailedBlockPublication

* make py3 compat and improve printout. Fix dmwm#6939

* optionally create new publication

* Fix task publish 6940 (dmwm#6942)

* avoid using undefined variable. Fix dmwm#6940

* make sure all calls to DBS are in try/excect for dmwm#6940

* use Rucio client py2 for FTS_transfer.py. Fix dmwm#6948 (dmwm#6949)

* use Rucio client py2 for FTS_transfer.py. Fix dmwm#6948

* add comment about python version

* pass $XrdSecGSISRVNAMES to cmsRun. Fix dmwm#6953 (dmwm#6955) (dmwm#6956)

* Pre dag divide by zero fix 6926 (dmwm#6959)

* protect against probe jobs returning no events. Fix dmwm#6926

* some pylint cleanups

* Cleanup userproxy from rest fix 6931 (dmwm#6960)

* remove unused retrieveUserCert for dmwm#6931

* cleanup unused userproxy from REST fix dmwm#6931

* remove unused imports

* cleanup serverdn/serverproxy/serverkey from REST code. Fix dmwm#6961

* correct kill arguments. Fix dmwm#6928 (dmwm#6964)

* requirements.txt: update wmcore tag (dmwm#6966)

* REST-py3 backward compatibile with publisher-py2 (dmwm#6967)

* Fix mkruntime 6970 (dmwm#6971)

* non need for cherrypy in TW tarball. Fix dmwm#6970

* place dummyFile in local dir and clenaup

* remove useless encode. fix dmwm#6972 (dmwm#6973)

* use $STARTDIR for dummyFile. (dmwm#6974)

* enable TaskWorker to use IDTOKENS. Fix dmwm#6903 (dmwm#6975)

* update requirements.txt to dmwm/WMCore 1.5.7 (dmwm#6982)

* use different WEB_DIR for token auth. Fix dmwm#6905 (dmwm#6983)

* correct check for classAd existance. Fix dmwm#6986 (dmwm#6987)

* define CRAB_UserHN ad for task_wrapper. Fix dmwm#6981 (dmwm#6988)

* no spaces around = in bash. properly fix dmwm#6981

* fix not py3-compatible pycurl.error handling in RESTInteractions (dmwm#6996)

* make Pre/Post/RetryJob use existing WEB_DIR. Fix dmwm#6994 (dmwm#6998)

* remove extra / in API name. Fix dmwm#7004 (dmwm#7005)

* Remove extra slash fix 7004 (dmwm#7006)

* remove extra / in API name. Fix dmwm#7004

* remove extra / in API name. Fix dmwm#7004

* restore NoAvailableStie exception for TW. Fix dmwm#7038 (dmwm#7039)

* make sure classAds for matching are ORDERED lists, fix dmwm#7043 (dmwm#7044)

* make sure eventsThr and eventsSize are not used if not initialized. Fix dmwm#7065 (dmwm#7066)

* Adjust code to work with new DBS Go based server (dmwm#6969) (dmwm#7074)

Co-authored-by: Valentin Kuznetsov <[email protected]>

* user python3 for FTS_transfers. Fix dmwm#6909 (dmwm#7052)

* adapt to new DBS serverinfo API (dmwm#7093)

* use WMCore 2.0.1.pre3 - Fix dmwm#7096 (dmwm#7097)

* point user feedback to CmsTalk. Fix dmwm#7100 (dmwm#7101)

Co-authored-by: Stefano Belforte <[email protected]>
Co-authored-by: Daina <[email protected]>
Co-authored-by: Valentin Kuznetsov <[email protected]>