crab recover "fail w/o reason" #5342

Open
2 of 3 tasks
belforte opened this issue Oct 25, 2024 · 1 comment
belforte commented Oct 25, 2024

belforte@lxplus825/bot> crab recover -d crab_20241024_131730/
Rucio client intialized for account belforte
step kill - task already killed
Command recover failed
belforte@lxplus825/bot> echo $?
28
belforte@lxplus825/bot> 

Not only that, but crab.log has no useful information either [1]

In addition, this makes the ClientValidation test fail.

But in this case the failure is simply because there are no failed jobs to recover, everything was successful [2]

  • Conditions that cause the command to exit must be logged, e.g. in places like this:
    retval = self.stepCheckKill()
    if retval["commandStatus"] != "SUCCESS": return self.stepExit(retval)
    Most likely we should leverage stepExit() for this, as the comment already indicated, unless @mapellidario knows some reason why this is a bad idea!
  • A message should be printed to the console saying what happened; it looks to me like recover never does this.
  • We may want to reserve a non-zero exit code from crab commands for situations where something bad happens. Otherwise, we should rewrite ClientValidation.
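The first two points can be sketched roughly as follows. This is only an illustration, not the actual CRABClient code: `step_check_kill`, `step_exit`, `recover`, the `reason` key and the condition that is checked are all hypothetical stand-ins; only the `retval["commandStatus"]` convention is taken from the snippet above.

```python
import logging

logger = logging.getLogger("recover_sketch")


def step_check_kill(status):
    """Hypothetical stand-in for stepCheckKill(): returns a retval dict."""
    # Assumed condition, for illustration only: the task is already
    # killed and there are no failed jobs left to recover.
    if status["dbStatus"] == "KILLED" and not status["failedJobs"]:
        return {
            "commandStatus": "FAILED",
            "reason": "task already killed and no failed jobs to recover",
        }
    return {"commandStatus": "SUCCESS"}


def step_exit(retval):
    """Log why the command stops, then hand retval back to the caller."""
    if retval["commandStatus"] != "SUCCESS":
        # logger.info reaches both crab.log and the console if the
        # logger is configured that way (an assumption in this sketch).
        logger.info("recover stops: %s", retval.get("reason", "no reason given"))
    return retval


def recover(status):
    """Run one step and exit early, with a logged reason, if it fails."""
    retval = step_check_kill(status)
    if retval["commandStatus"] != "SUCCESS":
        return step_exit(retval)
    return {"commandStatus": "SUCCESS"}
```

With this shape, every early exit goes through one place that always emits a message, so a bare "Command recover failed" can no longer be the only output.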

[1]

DEBUG 2024-10-25 16:12:55.619 UTC:       CRAB Client version: v3.240930
DEBUG 2024-10-25 16:12:55.619 UTC:       Running on: Linux lxplus825.cern.ch 4.18.0-553.22.1.el8_10.x86_64 #1 SMP Wed Sep 11 18:02:00 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
DEBUG 2024-10-25 16:12:55.619 UTC:       Not using Singularity
DEBUG 2024-10-25 16:12:55.619 UTC:       Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)
DEBUG 2024-10-25 16:12:55.619 UTC:       Executing command: 'recover'
[...]

DEBUG 2024-10-25 16:12:59.621 UTC:       stepStatus() - status, failedJobs: []
INFO 2024-10-25 16:12:59.622 UTC:        step kill - task already killed
DEBUG 2024-10-25 16:12:59.622 UTC:       stepCheckKill() - status COMPLETED
DEBUG 2024-10-25 16:12:59.622 UTC:       stepCheckKill() - command None
DEBUG 2024-10-25 16:12:59.622 UTC:       stepCheckKill() - dagStatus COMPLETED
DEBUG 2024-10-25 16:12:59.622 UTC:       stepCheckKill() - dbStatus KILLED
ERROR 2024-10-25 16:12:59.622 UTC:       Command recover failed
ERROR 2024-10-25 16:12:59.622 UTC:       Caught ClientException exception
Traceback (most recent call last):
  File "/cvmfs/cms.cern.ch/share/cms/crab-prod/v3.240930.00/bin/crab.py", line 152, in <module>
    client()
  File "/cvmfs/cms.cern.ch/share/cms/crab-prod/v3.240930.00/bin/crab.py", line 139, in __call__
    raise CommandFailedException("Command %s failed" % str(args[0]))
CRABClient.ClientExceptions.CommandFailedException: Command recover failed

[2]


 echo $?
28
belforte@lxplus825/bot> crab status -d crab_20241024_131730/

Rucio client intialized for account belforte
CRAB project directory:		/afs/cern.ch/work/b/belforte/CRAB3/TC3/dbg/bot/crab_20241024_131730
Task name:			241024_111731:cmsbot_crab_20241024_131730
Grid scheduler - Task Worker:	[email protected] - crab-preprod-tw01
Status on the CRAB server:	KILLED
Task URL to use for HELP:	https://cmsweb-testbed.cern.ch/crabserver/ui/task/241024_111731%3Acmsbot_crab_20241024_131730
Dashboard monitoring URL:	https://monit-grafana.cern.ch/d/cmsTMDetail/cms-task-monitoring-task-view?orgId=11&var-user=cmsbot&var-task=241024_111731%3Acmsbot_crab_20241024_131730&from=1729765051000&to=now
Warning:			Task killed by crab recover on '2024-10-24 14:17:38.506361', by 'cmsbot'
Status on the scheduler:	COMPLETED

Jobs status:                    finished     		100.0% (10/10)
@belforte belforte self-assigned this Oct 25, 2024
belforte added a commit to belforte/CRABClient that referenced this issue Oct 25, 2024
@belforte

I decided to add a logger.info wherever a condition leads to FAILED. That is clearer and more flexible than doing it in stepExit().

Now the existential question: if one types recover but there is nothing to do, do we end with exit code 0, or with an error?
crab resubmit already returns a non-zero exit code when all jobs succeeded.

But we do not (and cannot) check resubmit in ClientValidation.

I guess it has to be one of the two:

  1. change semantics
  2. remove from ClientValidation
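To make option 1 concrete, here is a minimal sketch of what changed semantics could look like. It is an assumption-laden illustration, not a decision or the existing client behavior: the `NOTHING_TO_DO` status, the `exit_code_for` helper and the exit values 0 and 1 are all hypothetical.

```python
# Hypothetical third status, next to the SUCCESS and FAILED values
# seen in retval["commandStatus"].
NOTHING_TO_DO = "NOTHING_TO_DO"


def exit_code_for(retval):
    """Map a retval dict to a process exit code.

    Under the assumed semantics, "nothing to recover" is not an error,
    so only a real failure yields a non-zero exit code.
    """
    status = retval["commandStatus"]
    if status in ("SUCCESS", NOTHING_TO_DO):
        return 0
    return 1  # generic failure code chosen for this sketch
```

Under this mapping, a ClientValidation check that only looks at the exit code would pass when recover finds no failed jobs, while real failures would still be caught.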
