lock/query: make robust against paddles errors #1642

jdurgin · 2021-04-20T06:00:39Z

Retry paddles requests, and for get_status() return an empty dict
rather than None so callers behave.

get_status() failing in particular has caused the dispatcher and jobs
to fail several times over the past few weeks. With this change, we
should be able to run multiple paddles workers again, since all the
common callers will retry on error.

Signed-off-by: Josh Durgin [email protected]
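
For reviewers, a minimal sketch of the behaviour described above. It is not the code in this PR: the helper names (`with_retries`, `get_status_safe`), the retry count, and the delay are assumptions for illustration, and the fetcher is injected so the snippet runs without a paddles server.

```python
import time


def with_retries(fetch, tries=3, delay=0.0):
    """Call fetch() up to `tries` times; return its result or None.

    fetch is any zero-argument callable that raises on failure
    (a hypothetical stand-in for a paddles HTTP request).
    """
    for attempt in range(tries):
        try:
            return fetch()
        except Exception:
            # Sleep between attempts, but not after the final one.
            if attempt < tries - 1:
                time.sleep(delay)
    return None


def get_status_safe(fetch, tries=3, delay=0.0):
    """Return the status dict, or {} when every attempt fails."""
    result = with_retries(fetch, tries=tries, delay=delay)
    return result if result is not None else {}


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Fails twice, then succeeds, mimicking a transient paddles error.
        calls["n"] += 1
        if calls["n"] < 3:
            raise IOError("paddles unavailable")
        return {"name": "node1", "locked": True}

    print(get_status_safe(flaky))

    def always_down():
        raise IOError("paddles unavailable")

    # Callers can keep using .get() on the result without a None check.
    print(get_status_safe(always_down).get("locked"))
```

Returning `{}` rather than `None` means a caller that does `status.get(...)` degrades to a missing value instead of raising `AttributeError`.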


jdurgin · 2021-04-20T06:02:46Z

@susebot run deploy

susebot · 2021-04-20T06:17:18Z

Commit 179edf2 is OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/321/

kshtsk · 2021-04-20T09:13:11Z

> Commit 179edf2 is OK.
> Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/321/

This in fact failed and needs to be retested.

kshtsk · 2021-04-20T21:07:24Z

2021-04-20 09:25:55,603.603 INFO:teuthology.suite.util:build not complete
2021-04-20 09:25:55,603.603 ERROR:teuthology.suite.run:Packages for os_type 'centos', flavor basic and ceph hash 'e3523634d9c2227df9af89a4eac33d16738c49cb' not found
2021-04-20 09:28:31,998.998 ERROR:teuthology.suite.util:git refresh failed for ceph: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator at 
 [no address given] to inform them of the time this error occurred,
 and the actions you performed just before this error.</p>
<p>More information about this error may be available
in the server error log.</p>
<hr>
<address>Apache/2.4.25 (Ubuntu) Server at git.ceph.com Port 8080</address>
</body></html>

2021-04-20 09:28:34,349.349 DEBUG:teuthology.suite.util:got response: {'committish': 'e3523634d9c2227df9af89a4eac33d16738c49cb', 'err': 'fatal: bad object e3523634d9c2227df9af89a4eac33d16738c49cb\n', 'sha1s': []}
Traceback (most recent call last):
  File "/home/runner/src/teuthology_master/virtualenv/bin/teuthology-suite", line 33, in <module>
    sys.exit(load_entry_point('teuthology', 'console_scripts', 'teuthology-suite')())
  File "/home/runner/src/teuthology_master/scripts/suite.py", line 189, in main
    return teuthology.suite.main(args)
  File "/home/runner/src/teuthology_master/teuthology/suite/__init__.py", line 143, in main
    run.prepare_and_schedule()
  File "/home/runner/src/teuthology_master/teuthology/suite/run.py", line 397, in prepare_and_schedule
    num_jobs = self.schedule_suite()
  File "/home/runner/src/teuthology_master/teuthology/suite/run.py", line 612, in schedule_suite
    util.find_git_parent('ceph', self.base_config.sha1)
  File "/home/runner/src/teuthology_master/teuthology/suite/util.py", line 491, in find_git_parent
    sha1s = get_sha1s(project, sha1, 2)
  File "/home/runner/src/teuthology_master/teuthology/suite/util.py", line 485, in get_sha1s
    int(count), sha1, project, resp.json()['error'])
KeyError: 'error'

kshtsk · 2021-04-20T21:28:04Z

teuthology-suite -v --machine-type gra --ceph octopus --suite smoke -d centos -D 7.6 --filter-out ubuntu,rhel,7.7,rados_bench,kclient_workunit_suites_dbench,cfuse_workunit_suites_iozone,_s3tests --limit 2 --seed 0 --newest 100

kshtsk · 2021-04-20T21:28:45Z

I guess there are no longer any centos builds for octopus?

kshtsk · 2021-04-20T21:41:02Z

@susebot run deploy

susebot · 2021-04-20T21:56:06Z

Commit 179edf2 is NOT OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/324/

jdurgin · 2021-04-20T22:03:41Z

There should be octopus centos builds. If you're seeing something missing, @djgalloway may be able to help.

kshtsk · 2021-04-21T08:15:05Z

@susebot run deploy

susebot · 2021-04-21T08:30:13Z

Commit 179edf2 is NOT OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/325/

kshtsk · 2021-04-21T08:33:41Z

2021-04-21 08:29:51,334.334 DEBUG:teuthology.suite.util:got response: {'committish': 'e647a64c1e8147b04e84575a0fc53dee65cecab2', 'err': 'fatal: bad object e647a64c1e8147b04e84575a0fc53dee65cecab2\n', 'sha1s': []}
Traceback (most recent call last):
  File "/home/runner/src/teuthology_master/virtualenv/bin/teuthology-suite", line 33, in <module>
    sys.exit(load_entry_point('teuthology', 'console_scripts', 'teuthology-suite')())
  File "/home/runner/src/teuthology_master/scripts/suite.py", line 189, in main
    return teuthology.suite.main(args)
  File "/home/runner/src/teuthology_master/teuthology/suite/__init__.py", line 143, in main
    run.prepare_and_schedule()
  File "/home/runner/src/teuthology_master/teuthology/suite/run.py", line 397, in prepare_and_schedule
    num_jobs = self.schedule_suite()
  File "/home/runner/src/teuthology_master/teuthology/suite/run.py", line 612, in schedule_suite
    util.find_git_parent('ceph', self.base_config.sha1)
  File "/home/runner/src/teuthology_master/teuthology/suite/util.py", line 491, in find_git_parent
    sha1s = get_sha1s(project, sha1, 2)
  File "/home/runner/src/teuthology_master/teuthology/suite/util.py", line 485, in get_sha1s
    int(count), sha1, project, resp.json()['error'])
KeyError: 'error'

> curl "https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=e647a64c1e8147b04e84575a0fc53dee65cecab2"
<html>
 <head>
  <title>302 Found</title>
 </head>
 <body>
  <h1>302 Found</h1>
  The resource was found at <a href="https://shaman.ceph.com/api/search/?status=ready&amp;project=ceph&amp;flavor=default&amp;distros=centos%2F8%2Fx86_64&amp;sha1=e647a64c1e8147b04e84575a0fc53dee65cecab2">https://shaman.ceph.com/api/search/?status=ready&amp;project=ceph&amp;flavor=default&amp;distros=centos%2F8%2Fx86_64&amp;sha1=e647a64c1e8147b04e84575a0fc53dee65cecab2</a>;
you should be redirected automatically.


 </body>

kshtsk · 2021-04-21T08:55:41Z

One problem I see is that http://git.ceph.com:8080/ceph.git/history/ returns JSON with 'err' instead of 'error'.
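
The traceback above ends in `resp.json()['error']` raising `KeyError`, while the logged response carries its message under `'err'`. A hedged sketch of a lookup that tolerates either key follows; `extract_error` is a hypothetical helper for illustration, not an existing teuthology function and not necessarily how a fix would be written.

```python
def extract_error(body):
    """Return the error text from a history response, or '' if absent.

    Accepts both the 'error' key the caller expected and the 'err' key
    seen in the logged response.
    """
    for key in ("error", "err"):
        message = body.get(key)
        if message:
            return message.strip()
    return ""


if __name__ == "__main__":
    response = {
        "committish": "e647a64c1e8147b04e84575a0fc53dee65cecab2",
        "err": "fatal: bad object e647a64c1e8147b04e84575a0fc53dee65cecab2\n",
        "sha1s": [],
    }
    print(extract_error(response))
```

Using `.get()` keeps the error path itself from raising, so the original "bad object" message can reach the user instead of a `KeyError`.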

kshtsk · 2021-04-21T09:00:21Z

What is that service, and when was it updated such that teuthology can no longer handle its responses correctly?

kshtsk · 2021-04-21T11:22:03Z

Scheduling fails because the arm build failed and build_complete returns False:

curl -s https://shaman.ceph.com/api/builds/ceph/octopus/e647a64c1e8147b04e84575a0fc53dee65cecab2/ | jq '.[] | select(.distro=="centos" and .distro_version=="8" and .flavor=="default") '

{
  "status": "failed",
  "sha1": "e647a64c1e8147b04e84575a0fc53dee65cecab2",
  "distro_arch": "arm64",
  "started": "2021-04-20 19:06:00.620116",
  "distro_codename": null,
  "completed": null,
  "extra": {
    "node_name": "172.21.4.63+confusa01",
    "version": "",
    "build_user": "",
    "root_build_cause": "SCMTRIGGER",
    "job_name": "ceph-dev-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic"
  },
  "modified": "2021-04-20 20:29:55.141954",
  "distro_version": "8",
  "project": "ceph",
  "url": "https://jenkins.ceph.com/job/ceph-dev-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic/47457/",
  "log_url": "https://jenkins.ceph.com/job/ceph-dev-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic/47457//consoleFull",
  "flavor": "default",
  "ref": "octopus",
  "distro": "centos"
}
{
  "status": "completed",
  "sha1": "e647a64c1e8147b04e84575a0fc53dee65cecab2",
  "distro_arch": "x86_64",
  "started": "2021-04-20 17:43:16.257781",
  "distro_codename": null,
  "completed": "2021-04-20 18:34:27.490982",
  "extra": {
    "node_name": "172.21.2.4+braggi04",
    "version": "15.2.11-166-ge647a64c",
    "build_user": "",
    "root_build_cause": "SCMTRIGGER",
    "job_name": "ceph-dev-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic"
  },
  "modified": "2021-04-20 18:34:27.492338",
  "distro_version": "8",
  "project": "ceph",
  "url": "https://jenkins.ceph.com/job/ceph-dev-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic/47457/",
  "log_url": "https://jenkins.ceph.com/job/ceph-dev-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic/47457//consoleFull",
  "flavor": "default",
  "ref": "octopus",
  "distro": "centos"
}
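
To make the effect concrete, here is a sketch of how an all-architectures check treats the two records above. This is not teuthology's `build_complete`; `builds_ready` and its `arches` parameter are hypothetical, and only the `status` and `distro_arch` fields from the shaman output are used.

```python
def builds_ready(builds, arches=None):
    """Return True when every considered build has status 'completed'.

    builds: list of shaman-style dicts with 'status' and 'distro_arch'.
    arches: optional set of architectures to consider; None means all.
    """
    considered = [
        b for b in builds
        if arches is None or b["distro_arch"] in arches
    ]
    # An empty selection counts as "not ready" rather than vacuously true.
    return bool(considered) and all(
        b["status"] == "completed" for b in considered
    )


if __name__ == "__main__":
    builds = [
        {"distro_arch": "arm64", "status": "failed"},
        {"distro_arch": "x86_64", "status": "completed"},
    ]
    print(builds_ready(builds))              # all architectures
    print(builds_ready(builds, {"x86_64"}))  # x86_64 only
```

With every architecture considered, the failed arm64 record makes the result False; restricted to x86_64, the same data yields True.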

kshtsk · 2021-04-22T16:22:46Z

I have tried to address the issue with #1643, but it is still not possible to schedule a test run because the octopus arm build failed for centos/8.

kshtsk · 2021-04-23T22:03:09Z

@susebot run deploy

susebot · 2021-04-23T23:08:15Z

Commit 179edf2 is OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/333/

jdurgin requested a review from liewegas April 20, 2021 06:03

ceph deleted a comment from susebot Apr 20, 2021

yuriw requested a review from kshtsk April 28, 2021 22:18

liewegas approved these changes Apr 28, 2021


jdurgin merged commit bea5b73 into ceph:master Apr 29, 2021

jdurgin mentioned this pull request May 5, 2021

config.py.in: use SERIALIZABLE isolation level for the db ceph/paddles#93

Merged

jdurgin deleted the wip-retry-paddles-reads branch June 28, 2021 15:17
