lock/query: make robust against paddles errors #1642

Merged Apr 29, 2021 (1 commit)


jdurgin
Member

@jdurgin jdurgin commented Apr 20, 2021

Retry paddles requests, and for get_status() return an empty dict
rather than None so callers behave.

get_status() failing in particular has caused the dispatcher and jobs
to fail several times over the past few weeks. With this change, we
should be able to run multiple paddles workers again, since all the
common callers will retry on error.

Signed-off-by: Josh Durgin [email protected]
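
A minimal sketch of the pattern described above (retry the request, and hand callers an empty dict instead of None). This is not the code from this PR: `with_retries`, `get_status_safe`, and the `fetch` callable are hypothetical names used only for illustration.

```python
import time


def with_retries(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times (attempts >= 1), sleeping `delay`
    seconds between tries, and re-raise the last error if all tries fail."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return func()
        except exceptions as exc:
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_exc


def get_status_safe(fetch):
    """Return the status dict from fetch(), or {} if every attempt fails.

    `fetch` is a hypothetical zero-argument callable standing in for the
    HTTP request to paddles; it is not teuthology's real function.
    """
    try:
        result = with_retries(fetch)
    except Exception:
        return {}
    # Treat a None payload like a failure so callers can always use .get().
    return result or {}
```

With an empty dict, a caller written as `status.get('locked')` sees a falsy value instead of raising an AttributeError on None.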

@jdurgin
Member Author

jdurgin commented Apr 20, 2021

@susebot run deploy

@jdurgin jdurgin requested a review from liewegas April 20, 2021 06:03
@susebot

susebot commented Apr 20, 2021

Commit 179edf2 is OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/321/

@kshtsk
Contributor

kshtsk commented Apr 20, 2021

Commit 179edf2 is OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/321/

This has in fact failed. It needs a retest.

@kshtsk
Contributor

kshtsk commented Apr 20, 2021

2021-04-20 09:25:55,603.603 INFO:teuthology.suite.util:build not complete
2021-04-20 09:25:55,603.603 ERROR:teuthology.suite.run:Packages for os_type 'centos', flavor basic and ceph hash 'e3523634d9c2227df9af89a4eac33d16738c49cb' not found
2021-04-20 09:28:31,998.998 ERROR:teuthology.suite.util:git refresh failed for ceph: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator at 
 [no address given] to inform them of the time this error occurred,
 and the actions you performed just before this error.</p>
<p>More information about this error may be available
in the server error log.</p>
<hr>
<address>Apache/2.4.25 (Ubuntu) Server at git.ceph.com Port 8080</address>
</body></html>

2021-04-20 09:28:34,349.349 DEBUG:teuthology.suite.util:got response: {'committish': 'e3523634d9c2227df9af89a4eac33d16738c49cb', 'err': 'fatal: bad object e3523634d9c2227df9af89a4eac33d16738c49cb\n', 'sha1s': []}
Traceback (most recent call last):
  File "/home/runner/src/teuthology_master/virtualenv/bin/teuthology-suite", line 33, in <module>
    sys.exit(load_entry_point('teuthology', 'console_scripts', 'teuthology-suite')())
  File "/home/runner/src/teuthology_master/scripts/suite.py", line 189, in main
    return teuthology.suite.main(args)
  File "/home/runner/src/teuthology_master/teuthology/suite/__init__.py", line 143, in main
    run.prepare_and_schedule()
  File "/home/runner/src/teuthology_master/teuthology/suite/run.py", line 397, in prepare_and_schedule
    num_jobs = self.schedule_suite()
  File "/home/runner/src/teuthology_master/teuthology/suite/run.py", line 612, in schedule_suite
    util.find_git_parent('ceph', self.base_config.sha1)
  File "/home/runner/src/teuthology_master/teuthology/suite/util.py", line 491, in find_git_parent
    sha1s = get_sha1s(project, sha1, 2)
  File "/home/runner/src/teuthology_master/teuthology/suite/util.py", line 485, in get_sha1s
    int(count), sha1, project, resp.json()['error'])
KeyError: 'error'

@kshtsk
Contributor

kshtsk commented Apr 20, 2021

teuthology-suite -v --machine-type gra --ceph octopus --suite smoke -d centos -D 7.6 --filter-out ubuntu,rhel,7.7,rados_bench,kclient_workunit_suites_dbench,cfuse_workunit_suites_iozone,_s3tests --limit 2 --seed 0 --newest 100

@kshtsk
Contributor

kshtsk commented Apr 20, 2021

I guess there are no builds anymore for centos for octopus?

@ceph ceph deleted a comment from susebot Apr 20, 2021
@kshtsk
Contributor

kshtsk commented Apr 20, 2021

@susebot run deploy

@ceph ceph deleted a comment from susebot Apr 20, 2021
@susebot

susebot commented Apr 20, 2021

Commit 179edf2 is NOT OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/324/

@jdurgin
Member Author

jdurgin commented Apr 20, 2021

There should be octopus centos builds; if you're seeing something missing, @djgalloway may be able to help.

@kshtsk
Contributor

kshtsk commented Apr 21, 2021

@susebot run deploy

@susebot

susebot commented Apr 21, 2021

Commit 179edf2 is NOT OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/325/

@kshtsk
Contributor

kshtsk commented Apr 21, 2021

2021-04-21 08:29:51,334.334 DEBUG:teuthology.suite.util:got response: {'committish': 'e647a64c1e8147b04e84575a0fc53dee65cecab2', 'err': 'fatal: bad object e647a64c1e8147b04e84575a0fc53dee65cecab2\n', 'sha1s': []}
Traceback (most recent call last):
  File "/home/runner/src/teuthology_master/virtualenv/bin/teuthology-suite", line 33, in <module>
    sys.exit(load_entry_point('teuthology', 'console_scripts', 'teuthology-suite')())
  File "/home/runner/src/teuthology_master/scripts/suite.py", line 189, in main
    return teuthology.suite.main(args)
  File "/home/runner/src/teuthology_master/teuthology/suite/__init__.py", line 143, in main
    run.prepare_and_schedule()
  File "/home/runner/src/teuthology_master/teuthology/suite/run.py", line 397, in prepare_and_schedule
    num_jobs = self.schedule_suite()
  File "/home/runner/src/teuthology_master/teuthology/suite/run.py", line 612, in schedule_suite
    util.find_git_parent('ceph', self.base_config.sha1)
  File "/home/runner/src/teuthology_master/teuthology/suite/util.py", line 491, in find_git_parent
    sha1s = get_sha1s(project, sha1, 2)
  File "/home/runner/src/teuthology_master/teuthology/suite/util.py", line 485, in get_sha1s
    int(count), sha1, project, resp.json()['error'])
KeyError: 'error'
> curl "https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=e647a64c1e8147b04e84575a0fc53dee65cecab2"
<html>
 <head>
  <title>302 Found</title>
 </head>
 <body>
  <h1>302 Found</h1>
  The resource was found at <a href="https://shaman.ceph.com/api/search/?status=ready&amp;project=ceph&amp;flavor=default&amp;distros=centos%2F8%2Fx86_64&amp;sha1=e647a64c1e8147b04e84575a0fc53dee65cecab2">https://shaman.ceph.com/api/search/?status=ready&amp;project=ceph&amp;flavor=default&amp;distros=centos%2F8%2Fx86_64&amp;sha1=e647a64c1e8147b04e84575a0fc53dee65cecab2</a>;
you should be redirected automatically.


 </body>
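
The 302 above redirects to the same URL with a trailing slash on the path (`/api/search/` instead of `/api/search`). A small sketch, assuming one wants to normalize the URL up front rather than follow the redirect; `ensure_trailing_slash` is a hypothetical helper, not part of teuthology or shaman.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_trailing_slash(url):
    """Return url with a '/' appended to its path, keeping the query intact."""
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```

Alternatively, `curl -L` follows the redirect without changing the URL.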

@kshtsk
Contributor

kshtsk commented Apr 21, 2021

One of the problems I see is that http://git.ceph.com:8080/ceph.git/history/ returns JSON with 'err' instead of 'error'.
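
To illustrate the mismatch: the traceback above indexes `resp.json()['error']`, while the logged response carries the message under `'err'`. A hedged sketch of a lookup that tolerates either key follows; `extract_error` is a hypothetical helper and not necessarily how the problem was fixed in teuthology.

```python
def extract_error(payload):
    """Return the error message from a response dict, or None if absent.

    Checks 'error' first, then 'err', the key seen in the logged response.
    Assumes the message, when present, is a string.
    """
    for key in ("error", "err"):
        message = payload.get(key)
        if message:
            return message.strip()
    return None
```

Using `.get()` rather than indexing means a missing key yields None instead of the KeyError shown in the traceback.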

@kshtsk
Contributor

kshtsk commented Apr 21, 2021

What is that service, and when was it updated such that teuthology can no longer handle its responses correctly?

@kshtsk
Contributor

kshtsk commented Apr 21, 2021

Scheduling fails because the arm build failed, so build_complete returns False:

curl -s https://shaman.ceph.com/api/builds/ceph/octopus/e647a64c1e8147b04e84575a0fc53dee65cecab2/ | jq '.[] | select(.distro=="centos" and .distro_version=="8" and .flavor=="default") '                           
{
  "status": "failed",
  "sha1": "e647a64c1e8147b04e84575a0fc53dee65cecab2",
  "distro_arch": "arm64",
  "started": "2021-04-20 19:06:00.620116",
  "distro_codename": null,
  "completed": null,
  "extra": {
    "node_name": "172.21.4.63+confusa01",
    "version": "",
    "build_user": "",
    "root_build_cause": "SCMTRIGGER",
    "job_name": "ceph-dev-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic"
  },
  "modified": "2021-04-20 20:29:55.141954",
  "distro_version": "8",
  "project": "ceph",
  "url": "https://jenkins.ceph.com/job/ceph-dev-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic/47457/",
  "log_url": "https://jenkins.ceph.com/job/ceph-dev-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic/47457//consoleFull",
  "flavor": "default",
  "ref": "octopus",
  "distro": "centos"
}
{
  "status": "completed",
  "sha1": "e647a64c1e8147b04e84575a0fc53dee65cecab2",
  "distro_arch": "x86_64",
  "started": "2021-04-20 17:43:16.257781",
  "distro_codename": null,
  "completed": "2021-04-20 18:34:27.490982",
  "extra": {
    "node_name": "172.21.2.4+braggi04",
    "version": "15.2.11-166-ge647a64c",
    "build_user": "",
    "root_build_cause": "SCMTRIGGER",
    "job_name": "ceph-dev-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic"
  },
  "modified": "2021-04-20 18:34:27.492338",
  "distro_version": "8",
  "project": "ceph",
  "url": "https://jenkins.ceph.com/job/ceph-dev-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic/47457/",
  "log_url": "https://jenkins.ceph.com/job/ceph-dev-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos8,DIST=centos8,MACHINE_SIZE=gigantic/47457//consoleFull",
  "flavor": "default",
  "ref": "octopus",
  "distro": "centos"
}
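
The jq filter and its result can be sketched in Python as below. This only illustrates why one failed arm64 build would make an "all builds complete" check return False; `builds_ready` is a hypothetical function, and whether teuthology's build_complete works this way is an assumption.

```python
def builds_ready(builds, distro, distro_version, flavor, arch=None):
    """Return True if at least one build matches and all matches are 'completed'.

    `builds` is a list of dicts shaped like the shaman entries shown above.
    Passing `arch` restricts the check to a single architecture.
    """
    matching = [
        b for b in builds
        if b.get("distro") == distro
        and b.get("distro_version") == distro_version
        and b.get("flavor") == flavor
        and (arch is None or b.get("distro_arch") == arch)
    ]
    return bool(matching) and all(
        b.get("status") == "completed" for b in matching
    )


# Trimmed copies of the two entries from the output above.
builds = [
    {"status": "failed", "distro_arch": "arm64",
     "distro": "centos", "distro_version": "8", "flavor": "default"},
    {"status": "completed", "distro_arch": "x86_64",
     "distro": "centos", "distro_version": "8", "flavor": "default"},
]
```

Here `builds_ready(builds, "centos", "8", "default")` is False because of the arm64 entry, while adding `arch="x86_64"` makes it True.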

@kshtsk
Contributor

kshtsk commented Apr 22, 2021

I have tried to address the issue with #1643, but it is still not able to schedule a test run because the octopus arm build failed for centos/8.

@kshtsk
Contributor

kshtsk commented Apr 23, 2021

@susebot run deploy

@susebot

susebot commented Apr 23, 2021

Commit 179edf2 is OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/333/

@yuriw yuriw requested a review from kshtsk April 28, 2021 22:18
@jdurgin jdurgin merged commit bea5b73 into ceph:master Apr 29, 2021
@jdurgin jdurgin deleted the wip-retry-paddles-reads branch June 28, 2021 15:17