flaky tests: TestUsernsCheckpoint, TestCheckpoint #4273
Comments
I've seen this a few times, too. @lifubang this means that the kernel can't freeze the cgroup despite repeated attempts, so criu gives up. Alas, this might be a kernel issue, and the CentOS 7 kernel is too old. In general, the cgroup freezer is not very reliable; I previously had to implement some hacks in runc to work around it (see #2941 and the earlier PRs linked from there). We can either try to add similar kludges to criu (https://github.com/checkpoint-restore/criu), or skip these tests on CentOS 7.
I have had to rerun the CentOS 7 tests manually many times, so let's skip them on CentOS 7?
😢 It appears on Ubuntu now. https://github.com/opencontainers/runc/actions/runs/9342659593/job/25711108530?pr=4300
Failure logs:
=== RUN TestCheckpoint
checkpoint_test.go:115: === /tmp/TestCheckpoint1478934365/003/criu-parent/dump.log ===
checkpoint_test.go:115: (00.000021) Version: 3.19 (gitid 5c35d75)
checkpoint_test.go:115: (00.000035) Running on fv-az691-944 Linux 5.15.0-1064-azure #73~20.04.1-Ubuntu SMP Mon May 6 09:43:44 UTC 2024 x86_64
checkpoint_test.go:115: (00.000038) Would overwrite RPC settings with values from /etc/criu/runc.conf
checkpoint_test.go:115: (00.000061) Loaded kdat cache from /run/criu.kdat
checkpoint_test.go:115: (00.000073) Hugetlb size 2 Mb is supported but cannot get dev's number
checkpoint_test.go:115: (00.000081) Hugetlb size 1024 Mb is supported but cannot get dev's number
checkpoint_test.go:115: (00.000391) rlimit: RLIMIT_NOFILE unlimited for self
checkpoint_test.go:115: (00.000401) Enforcing memory tracking for pre-dump.
checkpoint_test.go:115: (00.000403) Enforcing tasks run after pre-dump.
checkpoint_test.go:115: (00.000428) irmap: Searching irmap cache in work dir
checkpoint_test.go:115: (00.000437) No irmap-cache image
checkpoint_test.go:115: (00.000440) irmap: Searching irmap cache in parent
checkpoint_test.go:115: (00.000444) No parent images directory provided
checkpoint_test.go:115: (00.000446) irmap: No irmap cache
checkpoint_test.go:115: (00.000469) cpu: x86_family 25 x86_vendor_id AuthenticAMD x86_model_id AMD EPYC 7763 64-Core Processor
checkpoint_test.go:115: (00.000476) cpu: fpu: xfeatures_mask 0x5 xsave_size 832 xsave_size_max 832 xsaves_size 832
checkpoint_test.go:115: (00.000487) cpu: fpu: x87 floating point registers xstate_offsets 0 / 0 xstate_sizes 160 / 160
checkpoint_test.go:115: (00.000491) cpu: fpu: AVX registers xstate_offsets 576 / 576 xstate_sizes 256 / 256
checkpoint_test.go:115: (00.000494) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:1
checkpoint_test.go:115: (00.000651) Detected cgroup V1 freezer
checkpoint_test.go:115: (00.000655) freezing processes: 100000 attempts with 100 ms steps
checkpoint_test.go:115: (00.000665) freezer.state=THAWED
checkpoint_test.go:115: (00.000674) freezer.state=FREEZING
checkpoint_test.go:115: (00.100754) freezer.state=FREEZING
checkpoint_test.go:115: (00.200851) freezer.state=FREEZING
checkpoint_test.go:115: (00.300941) freezer.state=FREEZING
checkpoint_test.go:115: (00.401039) freezer.state=FREEZING
checkpoint_test.go:115: (00.501138) freezer.state=FREEZING
checkpoint_test.go:115: (00.601233) freezer.state=FREEZING
checkpoint_test.go:115: (00.701325) freezer.state=FREEZING
checkpoint_test.go:115: (00.801419) freezer.state=FREEZING
checkpoint_test.go:115: (00.901518) freezer.state=FREEZING
checkpoint_test.go:115: (01.001609) freezer.state=FREEZING
checkpoint_test.go:115: (01.101707) freezer.state=FREEZING
checkpoint_test.go:115: (01.201801) freezer.state=FREEZING
checkpoint_test.go:115: (01.301898) freezer.state=FREEZING
checkpoint_test.go:115: (01.402005) freezer.state=FREEZING
checkpoint_test.go:115: (01.502110) freezer.state=FREEZING
checkpoint_test.go:115: (01.602214) freezer.state=FREEZING
checkpoint_test.go:115: (01.702327) freezer.state=FREEZING
checkpoint_test.go:115: (01.802432) freezer.state=FREEZING
checkpoint_test.go:115: (01.902530) freezer.state=FREEZING
checkpoint_test.go:115: (02.002627) freezer.state=FREEZING
checkpoint_test.go:115: (02.102735) freezer.state=FREEZING
checkpoint_test.go:115: (02.202838) freezer.state=FREEZING
checkpoint_test.go:115: (02.302932) freezer.state=FREEZING
checkpoint_test.go:115: (02.403025) freezer.state=FREEZING
checkpoint_test.go:115: (02.503113) freezer.state=FREEZING
checkpoint_test.go:115: (02.603232) freezer.state=FREEZING
checkpoint_test.go:115: (02.703337) freezer.state=FREEZING
checkpoint_test.go:115: (02.803439) freezer.state=FREEZING
checkpoint_test.go:115: (02.903534) freezer.state=FREEZING
checkpoint_test.go:115: (03.003627) freezer.state=FREEZING
checkpoint_test.go:115: (03.103735) freezer.state=FREEZING
checkpoint_test.go:115: (03.203828) freezer.state=FREEZING
checkpoint_test.go:115: (03.303924) freezer.state=FREEZING
checkpoint_test.go:115: (03.404029) freezer.state=FREEZING
checkpoint_test.go:115: (03.504143) freezer.state=FREEZING
checkpoint_test.go:115: (03.604243) freezer.state=FREEZING
checkpoint_test.go:115: (03.704340) freezer.state=FREEZING
checkpoint_test.go:115: (03.804425) freezer.state=FREEZING
checkpoint_test.go:115: (03.904534) freezer.state=FREEZING
checkpoint_test.go:115: (04.004650) freezer.state=FREEZING
checkpoint_test.go:115: (04.104787) freezer.state=FREEZING
checkpoint_test.go:115: (04.204909) freezer.state=FREEZING
checkpoint_test.go:115: (04.305027) freezer.state=FREEZING
checkpoint_test.go:115: (04.405145) freezer.state=FREEZING
checkpoint_test.go:115: (04.505259) freezer.state=FREEZING
checkpoint_test.go:115: (04.605384) freezer.state=FREEZING
checkpoint_test.go:115: (04.705527) freezer.state=FREEZING
checkpoint_test.go:115: (04.805639) freezer.state=FREEZING
checkpoint_test.go:115: (04.905750) freezer.state=FREEZING
checkpoint_test.go:115: (05.005870) freezer.state=FREEZING
checkpoint_test.go:115: (05.105985) freezer.state=FREEZING
checkpoint_test.go:115: (05.206093) freezer.state=FREEZING
checkpoint_test.go:115: (05.306197) freezer.state=FREEZING
checkpoint_test.go:115: (05.406293) freezer.state=FREEZING
checkpoint_test.go:115: (05.506414) freezer.state=FREEZING
checkpoint_test.go:115: (05.606538) freezer.state=FREEZING
checkpoint_test.go:115: (05.706664) freezer.state=FREEZING
checkpoint_test.go:115: (05.806777) freezer.state=FREEZING
checkpoint_test.go:115: (05.906886) freezer.state=FREEZING
checkpoint_test.go:115: (06.006993) freezer.state=FREEZING
checkpoint_test.go:115: (06.107105) freezer.state=FREEZING
checkpoint_test.go:115: (06.207225) freezer.state=FREEZING
checkpoint_test.go:115: (06.307351) freezer.state=FREEZING
checkpoint_test.go:115: (06.407476) freezer.state=FREEZING
checkpoint_test.go:115: (06.507600) freezer.state=FREEZING
checkpoint_test.go:115: (06.607720) freezer.state=FREEZING
checkpoint_test.go:115: (06.707852) freezer.state=FREEZING
checkpoint_test.go:115: (06.807984) freezer.state=FREEZING
checkpoint_test.go:115: (06.908105) freezer.state=FREEZING
checkpoint_test.go:115: (07.008230) freezer.state=FREEZING
checkpoint_test.go:115: (07.108347) freezer.state=FREEZING
checkpoint_test.go:115: (07.208461) freezer.state=FREEZING
checkpoint_test.go:115: (07.308576) freezer.state=FREEZING
checkpoint_test.go:115: (07.408689) freezer.state=FREEZING
checkpoint_test.go:115: (07.508813) freezer.state=FREEZING
checkpoint_test.go:115: (07.608952) freezer.state=FREEZING
checkpoint_test.go:115: (07.709072) freezer.state=FREEZING
checkpoint_test.go:115: (07.809186) freezer.state=FREEZING
checkpoint_test.go:115: (07.909295) freezer.state=FREEZING
checkpoint_test.go:115: (08.009419) freezer.state=FREEZING
checkpoint_test.go:115: (08.109523) freezer.state=FREEZING
checkpoint_test.go:115: (08.209629) freezer.state=FREEZING
checkpoint_test.go:115: (08.309736) freezer.state=FREEZING
checkpoint_test.go:115: (08.409861) freezer.state=FREEZING
checkpoint_test.go:115: (08.509985) freezer.state=FREEZING
checkpoint_test.go:115: (08.610104) freezer.state=FREEZING
checkpoint_test.go:115: (08.710225) freezer.state=FREEZING
checkpoint_test.go:115: (08.810343) freezer.state=FREEZING
checkpoint_test.go:115: (08.910458) freezer.state=FREEZING
checkpoint_test.go:115: (09.010584) freezer.state=FREEZING
checkpoint_test.go:115: (09.110701) freezer.state=FREEZING
checkpoint_test.go:115: (09.210807) freezer.state=FREEZING
checkpoint_test.go:115: (09.310927) freezer.state=FREEZING
checkpoint_test.go:115: (09.411052) freezer.state=FREEZING
checkpoint_test.go:115: (09.511165) freezer.state=FREEZING
checkpoint_test.go:115: (09.611291) freezer.state=FREEZING
checkpoint_test.go:115: (09.711398) freezer.state=FREEZING
checkpoint_test.go:115: (09.811526) freezer.state=FREEZING
checkpoint_test.go:115: (09.911645) freezer.state=FREEZING
checkpoint_test.go:115: (10.000726) Error (criu/cr-dump.c:1784): Timeout reached. Try to interrupt: 0
checkpoint_test.go:115: (10.000770) freezer.state=FREEZING
checkpoint_test.go:115: (10.000850) Unfreezing tasks into 1
checkpoint_test.go:115: (10.000857) Unseizing 12457 into 1
checkpoint_test.go:115: (10.000872) Error (compel/src/lib/infect.c:418): Unable to detach from 12457: No such process
checkpoint_test.go:115: (10.000879) Writing image inventory (version 1)
checkpoint_test.go:115: (10.000952) Error (criu/cr-dump.c:1898): Pre-dumping FAILED.
checkpoint_test.go:115: === END ===
checkpoint_test.go:116: criu failed: type PRE_DUMP errno 0 log file: /tmp/TestCheckpoint1478934365/003/criu-parent/dump.log
time="2024-06-03T01:06:39Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/pids/test/integration: device or resource busy"
time="2024-06-03T01:06:39Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/blkio/test/integration: device or resource busy"
--- FAIL: TestCheckpoint (10.25s)
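When triaging a failure like this one, it can help to condense the repeated freezer.state entries into a few numbers. The following is a minimal sketch, not part of runc or criu; the function name `summarize_freezer_polls` and the regular expression are assumptions based only on the log format quoted in this comment.

```python
import re

# Matches criu log entries such as "(00.100754) freezer.state=FREEZING".
# The pattern is an assumption derived from the log excerpt in this thread.
FREEZER_RE = re.compile(r"\((\d+\.\d+)\)\s+freezer\.state=(\w+)")


def summarize_freezer_polls(log_text):
    """Summarize how long the cgroup freezer stayed in the FREEZING state."""
    polls = [(float(ts), state) for ts, state in FREEZER_RE.findall(log_text)]
    freezing = [ts for ts, state in polls if state == "FREEZING"]
    return {
        # Total number of freezer.state reads found in the log.
        "polls": len(polls),
        # How many of those reads returned FREEZING.
        "freezing_polls": len(freezing),
        # Time between the first and the last FREEZING read, in seconds.
        "freezing_seconds": round(freezing[-1] - freezing[0], 6) if freezing else 0.0,
        # State seen by the last read, or None if the log had no such entries.
        "final_state": polls[-1][1] if polls else None,
    }


if __name__ == "__main__":
    with open("dump.log") as f:  # hypothetical path to a saved criu log
        print(summarize_freezer_polls(f.read()))
```

A final state of FREEZING together with a freezing time close to the timeout suggests that the freezer never settled, as opposed to criu failing for an unrelated reason.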
Fixes opencontainers#4273 Signed-off-by: Kir Kolyshkin <[email protected]>
For CentOS 7, we use the somewhat dated criu v3.16 from https://copr.fedorainfracloud.org/coprs/adrian/criu-el7/builds/, while the latest release is v3.19 (@adrianreber might or might not want to look into that, as CentOS 7 will be EOL in a year). For Ubuntu 20.04, we use the latest criu v3.19 (thanks @rst0git for keeping up with the builds!), but it runs on an older kernel (5.15), which I think might be the reason the cgroup freezer fails. Maybe @avagin can shed some light on why simple checkpointing might fail during freeze.
I would not worry about CentOS 7. It goes EOL at the end of June 2024. Just disable it. The CentOS 7 kernel never really supported everything, and CRIU support was always a tech preview. Newer versions of CRIU probably do not even build on CentOS 7, as we removed Python 2 support from CRIU. You can also disable the CentOS Stream 8 based tests. That went EOL at the end of May 2024.
@kolyshkin Would it make sense to use a similar approach in |
Alas, with all that jazz it still fails sometimes, and people suggest even longer delays (see e.g. #4388). The question is where to draw the line: how many attempts are enough?
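One way to reason about that trade-off is to convert attempt counts into wall-clock time. This is a minimal sketch with hypothetical helper names; the figures in the usage note come from the dump.log quoted earlier in this thread (100 ms steps, timeout hit at roughly 10 s).

```python
import math


def polls_within(timeout_s, step_ms):
    """Number of state polls that fit into a timeout when sleeping step_ms between polls."""
    return math.floor(timeout_s * 1000 / step_ms)


def worst_case_wait_s(attempts, step_ms):
    """Upper bound on the time spent waiting if every attempt sleeps for step_ms."""
    return attempts * step_ms / 1000
```

For example, `polls_within(10, 100)` gives 100, which is close to the number of FREEZING entries in the log before the timeout, while `worst_case_wait_s(100000, 100)` shows that the "100000 attempts" printed by criu would take 10000 seconds if read literally. Any increase in attempts or delay is therefore bounded by whatever overall timeout is in effect.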
Cgroup v1 freezer has always been problematic, failing to freeze a cgroup. In runc, we have implemented a few kludges to increase the chance of succeeding, but those are used when runc freezes a cgroup for its own purposes (for "runc pause" and to modify device properties for cgroup v1). When criu is used, it fails to freeze a cgroup from time to time (see [1], [2]). Let's try adding kludges similar to ones in runc. Alas, I have absolutely no way to test this, so please review carefully. [1]: opencontainers/runc#4273 [2]: opencontainers/runc#4457 Signed-off-by: Kir Kolyshkin <[email protected]>
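The general shape of such a kludge can be sketched as a retry loop that occasionally thaws the cgroup before asking it to freeze again. This is an illustration only, not the code from runc or from the criu patch; the function name `freeze_cgroup` and the default values for `attempts`, `thaw_every`, and `pause` are assumptions chosen for the example.

```python
import time
from pathlib import Path


def freeze_cgroup(cgroup_dir, attempts=1000, thaw_every=50, pause=0.01):
    """Try to freeze a cgroup v1 freezer cgroup, retrying with occasional thaws.

    Returns the 1-based attempt number on which the state became FROZEN.
    Raises TimeoutError if the cgroup never leaves the FREEZING state.
    """
    state_file = Path(cgroup_dir) / "freezer.state"
    for attempt in range(1, attempts + 1):
        if attempt % thaw_every == 0:
            # Assumed kludge: thaw briefly so that a stuck task gets a chance
            # to make progress before the next freeze request.
            state_file.write_text("THAWED")
            time.sleep(pause)
        state_file.write_text("FROZEN")
        state = state_file.read_text().strip()
        if state == "FROZEN":
            return attempt
        if state != "FREEZING":
            raise RuntimeError(f"unexpected freezer.state: {state!r}")
        # Still FREEZING: wait a little and ask again.
        time.sleep(pause)
    # Give up, and thaw so the tasks are not left partially frozen.
    state_file.write_text("THAWED")
    raise TimeoutError(f"cgroup {cgroup_dir} still FREEZING after {attempts} attempts")
```

On a real system `cgroup_dir` would be a directory under the freezer hierarchy (for example somewhere below /sys/fs/cgroup/freezer), and writing to it requires the appropriate privileges. The loop can only raise the chance of success; it does not remove the underlying kernel behavior.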
I've decided to go ahead with this: checkpoint-restore/criu#2545
I saw this happen many times on CentOS 7.