flaky tests: TestUsernsCheckpoint, TestCheckpoint #4273

Open
lifubang opened this issue May 7, 2024 · 8 comments

@lifubang
Member

lifubang commented May 7, 2024

I have seen this happen many times on CentOS 7.

=== RUN   TestUsernsCheckpoint
time="2024-05-07T10:08:51Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint611938415/003/criu-parent/dump.log\""
time="2024-05-07T10:08:51Z" level=warning msg="116:(09.514467) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="117:(09.614644) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="118:(09.714816) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="119:(09.814957) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="120:(09.915110) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="121:(10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0"
time="2024-05-07T10:08:51Z" level=warning msg="122:(10.000563) freezer.state=FREEZING"
time="2024-05-07T10:08:51Z" level=warning msg="123:(10.000694) Error (compel/src/lib/infect.c:234): Unseizable non-zombie 9017 found, state D, err -1/10"
time="2024-05-07T10:08:51Z" level=warning msg="124:(10.000773) Unfreezing tasks into 1"
time="2024-05-07T10:08:51Z" level=warning msg="125:(10.000778) \tUnseizing 9017 into 1"
time="2024-05-07T10:08:51Z" level=warning msg="126:(10.000783) Error (compel/src/lib/infect.c:355): Unable to detach from 9017: No such process"
time="2024-05-07T10:08:51Z" level=warning msg="127:(10.000800) Writing image inventory (version 1)"
time="2024-05-07T10:08:51Z" level=warning msg="128:(10.000976) Error (criu/cr-dump.c:1581): Pre-dumping FAILED."
time="2024-05-07T10:08:51Z" level=warning msg=---
    checkpoint_test.go:115: === /tmp/TestUsernsCheckpoint611938415/003/criu-parent/dump.log ===
    checkpoint_test.go:115: (00.000052) Version: 3.16 (gitid 0)
    checkpoint_test.go:115: (00.000067) Running on cirrus-task-5639495050067968 Linux 3.10.0-1160.114.2.el7.x86_64 #1 SMP Wed Mar 20 15:54:52 UTC 2024 x86_64
    checkpoint_test.go:115: (00.000070) Would overwrite RPC settings with values from /etc/criu/runc.conf
    checkpoint_test.go:115: (00.000094) Loaded kdat cache from /run/criu/criu.kdat
    checkpoint_test.go:115: (00.000142) rlimit: RLIMIT_NOFILE unlimited for self
    checkpoint_test.go:115: (00.000148) Enforcing memory tracking for pre-dump.
    checkpoint_test.go:115: (00.000156) Enforcing tasks run after pre-dump.
    checkpoint_test.go:115: (00.000170) irmap: Searching irmap cache in work dir
    checkpoint_test.go:115: (00.000180) No irmap-cache image
    checkpoint_test.go:115: (00.000181) irmap: Searching irmap cache in parent
    checkpoint_test.go:115: (00.000185) No parent images directory provided
    checkpoint_test.go:115: (00.000187) irmap: No irmap cache
    checkpoint_test.go:115: (00.000205) cpu: x86_family 25 x86_vendor_id AuthenticAMD x86_model_id AMD EPYC 7B13
    checkpoint_test.go:115: (00.000210) cpu: fpu: xfeatures_mask 0x5 xsave_size 832 xsave_size_max 2440 xsaves_size 832
    checkpoint_test.go:115: (00.000213) cpu: fpu: x87 floating point registers     xstate_offsets      0 / 0      xstate_sizes    160 / 160   
    checkpoint_test.go:115: (00.000215) cpu: fpu: AVX registers                    xstate_offsets    576 / 576    xstate_sizes    256 / 256   
    checkpoint_test.go:115: (00.000217) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:0
    checkpoint_test.go:115: (00.000338) Detected cgroup V1 freezer
    checkpoint_test.go:115: (00.000340) freezing processes: 100000 attempts with 100 ms steps
    checkpoint_test.go:115: (00.000351) freezer.state=THAWED
    checkpoint_test.go:115: (00.000358) freezer.state=FREEZING
    checkpoint_test.go:115: (00.100446) freezer.state=FREEZING
    checkpoint_test.go:115: (00.201766) freezer.state=FREEZING
    checkpoint_test.go:115: (00.301871) freezer.state=FREEZING
    checkpoint_test.go:115: (00.401990) freezer.state=FREEZING
    checkpoint_test.go:115: (00.502110) freezer.state=FREEZING
    checkpoint_test.go:115: (00.602214) freezer.state=FREEZING
    checkpoint_test.go:115: (00.702313) freezer.state=FREEZING
    checkpoint_test.go:115: (00.802425) freezer.state=FREEZING
    checkpoint_test.go:115: (00.902531) freezer.state=FREEZING
    checkpoint_test.go:115: (01.002635) freezer.state=FREEZING
    checkpoint_test.go:115: (01.102755) freezer.state=FREEZING
    checkpoint_test.go:115: (01.202870) freezer.state=FREEZING
    checkpoint_test.go:115: (01.303058) freezer.state=FREEZING
    checkpoint_test.go:115: (01.403208) freezer.state=FREEZING
    checkpoint_test.go:115: (01.503308) freezer.state=FREEZING
    checkpoint_test.go:115: (01.603429) freezer.state=FREEZING
    checkpoint_test.go:115: (01.703589) freezer.state=FREEZING
    checkpoint_test.go:115: (01.803726) freezer.state=FREEZING
    checkpoint_test.go:115: (01.903872) freezer.state=FREEZING
    checkpoint_test.go:115: (02.004022) freezer.state=FREEZING
    checkpoint_test.go:115: (02.104139) freezer.state=FREEZING
    checkpoint_test.go:115: (02.204270) freezer.state=FREEZING
    checkpoint_test.go:115: (02.304422) freezer.state=FREEZING
    checkpoint_test.go:115: (02.404578) freezer.state=FREEZING
    checkpoint_test.go:115: (02.504717) freezer.state=FREEZING
    checkpoint_test.go:115: (02.604860) freezer.state=FREEZING
    checkpoint_test.go:115: (02.704987) freezer.state=FREEZING
    checkpoint_test.go:115: (02.805144) freezer.state=FREEZING
    checkpoint_test.go:115: (02.905275) freezer.state=FREEZING
    checkpoint_test.go:115: (03.005410) freezer.state=FREEZING
    checkpoint_test.go:115: (03.105546) freezer.state=FREEZING
    checkpoint_test.go:115: (03.205676) freezer.state=FREEZING
    checkpoint_test.go:115: (03.305821) freezer.state=FREEZING
    checkpoint_test.go:115: (03.405941) freezer.state=FREEZING
    checkpoint_test.go:115: (03.506057) freezer.state=FREEZING
    checkpoint_test.go:115: (03.606181) freezer.state=FREEZING
    checkpoint_test.go:115: (03.706322) freezer.state=FREEZING
    checkpoint_test.go:115: (03.806446) freezer.state=FREEZING
    checkpoint_test.go:115: (03.906569) freezer.state=FREEZING
    checkpoint_test.go:115: (04.006738) freezer.state=FREEZING
    checkpoint_test.go:115: (04.106903) freezer.state=FREEZING
    checkpoint_test.go:115: (04.207032) freezer.state=FREEZING
    checkpoint_test.go:115: (04.307154) freezer.state=FREEZING
    checkpoint_test.go:115: (04.407273) freezer.state=FREEZING
    checkpoint_test.go:115: (04.507399) freezer.state=FREEZING
    checkpoint_test.go:115: (04.607502) freezer.state=FREEZING
    checkpoint_test.go:115: (04.707592) freezer.state=FREEZING
    checkpoint_test.go:115: (04.807698) freezer.state=FREEZING
    checkpoint_test.go:115: (04.907829) freezer.state=FREEZING
    checkpoint_test.go:115: (05.007957) freezer.state=FREEZING
    checkpoint_test.go:115: (05.108092) freezer.state=FREEZING
    checkpoint_test.go:115: (05.208199) freezer.state=FREEZING
    checkpoint_test.go:115: (05.308309) freezer.state=FREEZING
    checkpoint_test.go:115: (05.408418) freezer.state=FREEZING
    checkpoint_test.go:115: (05.508566) freezer.state=FREEZING
    checkpoint_test.go:115: (05.608724) freezer.state=FREEZING
    checkpoint_test.go:115: (05.708885) freezer.state=FREEZING
    checkpoint_test.go:115: (05.809035) freezer.state=FREEZING
    checkpoint_test.go:115: (05.909159) freezer.state=FREEZING
    checkpoint_test.go:115: (06.009283) freezer.state=FREEZING
    checkpoint_test.go:115: (06.109410) freezer.state=FREEZING
    checkpoint_test.go:115: (06.209537) freezer.state=FREEZING
    checkpoint_test.go:115: (06.309662) freezer.state=FREEZING
    checkpoint_test.go:115: (06.409787) freezer.state=FREEZING
    checkpoint_test.go:115: (06.509905) freezer.state=FREEZING
    checkpoint_test.go:115: (06.610031) freezer.state=FREEZING
    checkpoint_test.go:115: (06.710165) freezer.state=FREEZING
    checkpoint_test.go:115: (06.810288) freezer.state=FREEZING
    checkpoint_test.go:115: (06.910416) freezer.state=FREEZING
    checkpoint_test.go:115: (07.010552) freezer.state=FREEZING
    checkpoint_test.go:115: (07.110678) freezer.state=FREEZING
    checkpoint_test.go:115: (07.210806) freezer.state=FREEZING
    checkpoint_test.go:115: (07.310933) freezer.state=FREEZING
    checkpoint_test.go:115: (07.411069) freezer.state=FREEZING
    checkpoint_test.go:115: (07.511252) freezer.state=FREEZING
    checkpoint_test.go:115: (07.611415) freezer.state=FREEZING
    checkpoint_test.go:115: (07.711588) freezer.state=FREEZING
    checkpoint_test.go:115: (07.811742) freezer.state=FREEZING
    checkpoint_test.go:115: (07.911897) freezer.state=FREEZING
    checkpoint_test.go:115: (08.012029) freezer.state=FREEZING
    checkpoint_test.go:115: (08.112217) freezer.state=FREEZING
    checkpoint_test.go:115: (08.212392) freezer.state=FREEZING
    checkpoint_test.go:115: (08.312553) freezer.state=FREEZING
    checkpoint_test.go:115: (08.412734) freezer.state=FREEZING
    checkpoint_test.go:115: (08.512909) freezer.state=FREEZING
    checkpoint_test.go:115: (08.613067) freezer.state=FREEZING
    checkpoint_test.go:115: (08.713220) freezer.state=FREEZING
    checkpoint_test.go:115: (08.813373) freezer.state=FREEZING
    checkpoint_test.go:115: (08.913548) freezer.state=FREEZING
    checkpoint_test.go:115: (09.013704) freezer.state=FREEZING
    checkpoint_test.go:115: (09.113850) freezer.state=FREEZING
    checkpoint_test.go:115: (09.213999) freezer.state=FREEZING
    checkpoint_test.go:115: (09.314151) freezer.state=FREEZING
    checkpoint_test.go:115: (09.414305) freezer.state=FREEZING
    checkpoint_test.go:115: (09.514467) freezer.state=FREEZING
    checkpoint_test.go:115: (09.614644) freezer.state=FREEZING
    checkpoint_test.go:115: (09.714816) freezer.state=FREEZING
    checkpoint_test.go:115: (09.814957) freezer.state=FREEZING
    checkpoint_test.go:115: (09.915110) freezer.state=FREEZING
    checkpoint_test.go:115: (10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0
    checkpoint_test.go:115: (10.000563) freezer.state=FREEZING
    checkpoint_test.go:115: (10.000694) Error (compel/src/lib/infect.c:234): Unseizable non-zombie 9017 found, state D, err -1/10
    checkpoint_test.go:115: (10.000773) Unfreezing tasks into 1
    checkpoint_test.go:115: (10.000778) 	Unseizing 9017 into 1
    checkpoint_test.go:115: (10.000783) Error (compel/src/lib/infect.c:355): Unable to detach from 9017: No such process
    checkpoint_test.go:115: (10.000800) Writing image inventory (version 1)
    checkpoint_test.go:115: (10.000976) Error (criu/cr-dump.c:1581): Pre-dumping FAILED.
    checkpoint_test.go:115: === END ===
    checkpoint_test.go:119: criu failed: type PRE_DUMP errno 0
--- FAIL: TestUsernsCheckpoint (10.31s)
@kolyshkin
Contributor

I've seen this a few times, too.

@lifubang this means that the kernel can't freeze the cgroup despite the repeated attempts, so criu gives up.

Alas, this might be a kernel issue, and the CentOS 7 kernel is too old. In general, cgroup freezer is not very reliable, I previously had to implement some hacks in runc to work around it (see #2941 and the earlier PRs linked from there).

We can either try to add similar kludges to https://github.com/checkpoint-restore/criu, or skip these tests on CentOS 7.
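
To illustrate the kind of kludge meant here, below is a minimal Go sketch of a "thaw and retry" freeze loop for the cgroup v1 freezer. It is not the actual code from #2941 or from criu. The names `freezerIO`, `fileFreezer`, `freezeWithRetry`, and `errFreezeTimeout`, and the parameters `attempts`, `thawEvery`, and `pause`, are all hypothetical and chosen only for this example. The only thing taken from the logs above is that `freezer.state` reports `THAWED`, `FREEZING`, or `FROZEN`.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// freezerIO abstracts access to a cgroup v1 freezer.state file, so the
// retry logic can be exercised without a real cgroup. (Hypothetical type.)
type freezerIO struct {
	write func(state string) error
	read  func() (string, error)
}

// fileFreezer returns a freezerIO backed by <cgroupDir>/freezer.state.
func fileFreezer(cgroupDir string) freezerIO {
	p := filepath.Join(cgroupDir, "freezer.state")
	return freezerIO{
		write: func(s string) error { return os.WriteFile(p, []byte(s), 0o644) },
		read: func() (string, error) {
			b, err := os.ReadFile(p)
			return strings.TrimSpace(string(b)), err
		},
	}
}

var errFreezeTimeout = errors.New("cgroup still FREEZING after all attempts")

// freezeWithRetry requests FROZEN up to `attempts` times. Every `thawEvery`
// attempts it thaws the cgroup first, on the assumption that a fresh
// THAWED -> FROZEN transition has a better chance than waiting on a stuck one.
func freezeWithRetry(f freezerIO, attempts, thawEvery int, pause time.Duration) error {
	for i := 0; i < attempts; i++ {
		if thawEvery > 0 && i > 0 && i%thawEvery == 0 {
			if err := f.write("THAWED"); err != nil {
				return err
			}
			time.Sleep(pause)
		}
		if err := f.write("FROZEN"); err != nil {
			return err
		}
		state, err := f.read()
		if err != nil {
			return err
		}
		switch state {
		case "FROZEN":
			return nil
		case "FREEZING":
			time.Sleep(pause) // not there yet; wait and try again
		default:
			return fmt.Errorf("unexpected freezer state %q", state)
		}
	}
	// Give up, and do not leave the cgroup half-frozen.
	_ = f.write("THAWED")
	return errFreezeTimeout
}
```

A caller would use something like `freezeWithRetry(fileFreezer(dir), 1000, 50, 10*time.Millisecond)`, where `dir` is the freezer cgroup directory; those numbers are placeholders, not tuned values.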

@lifubang
Member Author

lifubang commented Jun 1, 2024

> skip these tests on CentOS 7.

I have to rerun the CentOS 7 tests manually many times, so shall we skip them on CentOS 7?
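
Such a skip could look roughly like the Go sketch below. It rests on assumptions: the helper names `parseOSRelease`, `isCentOS7`, and `skipOnCentOS7` are made up for this example and are not existing runc test helpers, and it assumes `/etc/os-release` reports `ID="centos"` and `VERSION_ID="7"` on CentOS 7.

```go
package main

import (
	"bufio"
	"os"
	"strings"
)

// parseOSRelease extracts KEY=value pairs from os-release formatted text,
// dropping surrounding quotes from the values.
func parseOSRelease(content string) map[string]string {
	kv := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, val, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		kv[key] = strings.Trim(val, `"'`)
	}
	return kv
}

// isCentOS7 reports whether the os-release text describes CentOS 7.
func isCentOS7(content string) bool {
	kv := parseOSRelease(content)
	v := kv["VERSION_ID"]
	return kv["ID"] == "centos" && (v == "7" || strings.HasPrefix(v, "7."))
}

// skipper is the subset of *testing.T that the helper needs.
type skipper interface {
	Skip(args ...any)
}

// skipOnCentOS7 skips the calling test on CentOS 7. If the distribution
// cannot be determined, the test runs as usual.
func skipOnCentOS7(t skipper) {
	data, err := os.ReadFile("/etc/os-release")
	if err != nil {
		return
	}
	if isCentOS7(string(data)) {
		t.Skip("cgroup v1 freezer is unreliable on CentOS 7, see #4273")
	}
}
```

A checkpoint test would then call `skipOnCentOS7(t)` as its first statement.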

@lifubang
Member Author

lifubang commented Jun 3, 2024

😢 It appears on Ubuntu now, too.

https://github.com/opencontainers/runc/actions/runs/9342659593/job/25711108530?pr=4300

Failure logs

=== RUN   TestCheckpoint
    checkpoint_test.go:115: === /tmp/TestCheckpoint1478934365/003/criu-parent/dump.log ===
    checkpoint_test.go:115: (00.000021) Version: 3.19 (gitid 5c35d75)
    checkpoint_test.go:115: (00.000035) Running on fv-az691-944 Linux 5.15.0-1064-azure #73~20.04.1-Ubuntu SMP Mon May 6 09:43:44 UTC 2024 x86_64
    checkpoint_test.go:115: (00.000038) Would overwrite RPC settings with values from /etc/criu/runc.conf
    checkpoint_test.go:115: (00.000061) Loaded kdat cache from /run/criu.kdat
    checkpoint_test.go:115: (00.000073) Hugetlb size 2 Mb is supported but cannot get dev's number
    checkpoint_test.go:115: (00.000081) Hugetlb size 1024 Mb is supported but cannot get dev's number
    checkpoint_test.go:115: (00.000391) rlimit: RLIMIT_NOFILE unlimited for self
    checkpoint_test.go:115: (00.000401) Enforcing memory tracking for pre-dump.
    checkpoint_test.go:115: (00.000403) Enforcing tasks run after pre-dump.
    checkpoint_test.go:115: (00.000428) irmap: Searching irmap cache in work dir
    checkpoint_test.go:115: (00.000437) No irmap-cache image
    checkpoint_test.go:115: (00.000440) irmap: Searching irmap cache in parent
    checkpoint_test.go:115: (00.000444) No parent images directory provided
    checkpoint_test.go:115: (00.000446) irmap: No irmap cache
    checkpoint_test.go:115: (00.000469) cpu: x86_family 25 x86_vendor_id AuthenticAMD x86_model_id AMD EPYC 7763 64-Core Processor
    checkpoint_test.go:115: (00.000476) cpu: fpu: xfeatures_mask 0x5 xsave_size 832 xsave_size_max 832 xsaves_size 832
    checkpoint_test.go:115: (00.000487) cpu: fpu: x87 floating point registers xstate_offsets 0 / 0 xstate_sizes 160 / 160
    checkpoint_test.go:115: (00.000491) cpu: fpu: AVX registers xstate_offsets 576 / 576 xstate_sizes 256 / 256
    checkpoint_test.go:115: (00.000494) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:1
    checkpoint_test.go:115: (00.000651) Detected cgroup V1 freezer
    checkpoint_test.go:115: (00.000655) freezing processes: 100000 attempts with 100 ms steps
    checkpoint_test.go:115: (00.000665) freezer.state=THAWED
    checkpoint_test.go:115: (00.000674) freezer.state=FREEZING
    checkpoint_test.go:115: (00.100754) freezer.state=FREEZING
    checkpoint_test.go:115: (00.200851) freezer.state=FREEZING
    checkpoint_test.go:115: (00.300941) freezer.state=FREEZING
    checkpoint_test.go:115: (00.401039) freezer.state=FREEZING
    checkpoint_test.go:115: (00.501138) freezer.state=FREEZING
    checkpoint_test.go:115: (00.601233) freezer.state=FREEZING
    checkpoint_test.go:115: (00.701325) freezer.state=FREEZING
    checkpoint_test.go:115: (00.801419) freezer.state=FREEZING
    checkpoint_test.go:115: (00.901518) freezer.state=FREEZING
    checkpoint_test.go:115: (01.001609) freezer.state=FREEZING
    checkpoint_test.go:115: (01.101707) freezer.state=FREEZING
    checkpoint_test.go:115: (01.201801) freezer.state=FREEZING
    checkpoint_test.go:115: (01.301898) freezer.state=FREEZING
    checkpoint_test.go:115: (01.402005) freezer.state=FREEZING
    checkpoint_test.go:115: (01.502110) freezer.state=FREEZING
    checkpoint_test.go:115: (01.602214) freezer.state=FREEZING
    checkpoint_test.go:115: (01.702327) freezer.state=FREEZING
    checkpoint_test.go:115: (01.802432) freezer.state=FREEZING
    checkpoint_test.go:115: (01.902530) freezer.state=FREEZING
    checkpoint_test.go:115: (02.002627) freezer.state=FREEZING
    checkpoint_test.go:115: (02.102735) freezer.state=FREEZING
    checkpoint_test.go:115: (02.202838) freezer.state=FREEZING
    checkpoint_test.go:115: (02.302932) freezer.state=FREEZING
    checkpoint_test.go:115: (02.403025) freezer.state=FREEZING
    checkpoint_test.go:115: (02.503113) freezer.state=FREEZING
    checkpoint_test.go:115: (02.603232) freezer.state=FREEZING
    checkpoint_test.go:115: (02.703337) freezer.state=FREEZING
    checkpoint_test.go:115: (02.803439) freezer.state=FREEZING
    checkpoint_test.go:115: (02.903534) freezer.state=FREEZING
    checkpoint_test.go:115: (03.003627) freezer.state=FREEZING
    checkpoint_test.go:115: (03.103735) freezer.state=FREEZING
    checkpoint_test.go:115: (03.203828) freezer.state=FREEZING
    checkpoint_test.go:115: (03.303924) freezer.state=FREEZING
    checkpoint_test.go:115: (03.404029) freezer.state=FREEZING
    checkpoint_test.go:115: (03.504143) freezer.state=FREEZING
    checkpoint_test.go:115: (03.604243) freezer.state=FREEZING
    checkpoint_test.go:115: (03.704340) freezer.state=FREEZING
    checkpoint_test.go:115: (03.804425) freezer.state=FREEZING
    checkpoint_test.go:115: (03.904534) freezer.state=FREEZING
    checkpoint_test.go:115: (04.004650) freezer.state=FREEZING
    checkpoint_test.go:115: (04.104787) freezer.state=FREEZING
    checkpoint_test.go:115: (04.204909) freezer.state=FREEZING
    checkpoint_test.go:115: (04.305027) freezer.state=FREEZING
    checkpoint_test.go:115: (04.405145) freezer.state=FREEZING
    checkpoint_test.go:115: (04.505259) freezer.state=FREEZING
    checkpoint_test.go:115: (04.605384) freezer.state=FREEZING
    checkpoint_test.go:115: (04.705527) freezer.state=FREEZING
    checkpoint_test.go:115: (04.805639) freezer.state=FREEZING
    checkpoint_test.go:115: (04.905750) freezer.state=FREEZING
    checkpoint_test.go:115: (05.005870) freezer.state=FREEZING
    checkpoint_test.go:115: (05.105985) freezer.state=FREEZING
    checkpoint_test.go:115: (05.206093) freezer.state=FREEZING
    checkpoint_test.go:115: (05.306197) freezer.state=FREEZING
    checkpoint_test.go:115: (05.406293) freezer.state=FREEZING
    checkpoint_test.go:115: (05.506414) freezer.state=FREEZING
    checkpoint_test.go:115: (05.606538) freezer.state=FREEZING
    checkpoint_test.go:115: (05.706664) freezer.state=FREEZING
    checkpoint_test.go:115: (05.806777) freezer.state=FREEZING
    checkpoint_test.go:115: (05.906886) freezer.state=FREEZING
    checkpoint_test.go:115: (06.006993) freezer.state=FREEZING
    checkpoint_test.go:115: (06.107105) freezer.state=FREEZING
    checkpoint_test.go:115: (06.207225) freezer.state=FREEZING
    checkpoint_test.go:115: (06.307351) freezer.state=FREEZING
    checkpoint_test.go:115: (06.407476) freezer.state=FREEZING
    checkpoint_test.go:115: (06.507600) freezer.state=FREEZING
    checkpoint_test.go:115: (06.607720) freezer.state=FREEZING
    checkpoint_test.go:115: (06.707852) freezer.state=FREEZING
    checkpoint_test.go:115: (06.807984) freezer.state=FREEZING
    checkpoint_test.go:115: (06.908105) freezer.state=FREEZING
    checkpoint_test.go:115: (07.008230) freezer.state=FREEZING
    checkpoint_test.go:115: (07.108347) freezer.state=FREEZING
    checkpoint_test.go:115: (07.208461) freezer.state=FREEZING
    checkpoint_test.go:115: (07.308576) freezer.state=FREEZING
    checkpoint_test.go:115: (07.408689) freezer.state=FREEZING
    checkpoint_test.go:115: (07.508813) freezer.state=FREEZING
    checkpoint_test.go:115: (07.608952) freezer.state=FREEZING
    checkpoint_test.go:115: (07.709072) freezer.state=FREEZING
    checkpoint_test.go:115: (07.809186) freezer.state=FREEZING
    checkpoint_test.go:115: (07.909295) freezer.state=FREEZING
    checkpoint_test.go:115: (08.009419) freezer.state=FREEZING
    checkpoint_test.go:115: (08.109523) freezer.state=FREEZING
    checkpoint_test.go:115: (08.209629) freezer.state=FREEZING
    checkpoint_test.go:115: (08.309736) freezer.state=FREEZING
    checkpoint_test.go:115: (08.409861) freezer.state=FREEZING
    checkpoint_test.go:115: (08.509985) freezer.state=FREEZING
    checkpoint_test.go:115: (08.610104) freezer.state=FREEZING
    checkpoint_test.go:115: (08.710225) freezer.state=FREEZING
    checkpoint_test.go:115: (08.810343) freezer.state=FREEZING
    checkpoint_test.go:115: (08.910458) freezer.state=FREEZING
    checkpoint_test.go:115: (09.010584) freezer.state=FREEZING
    checkpoint_test.go:115: (09.110701) freezer.state=FREEZING
    checkpoint_test.go:115: (09.210807) freezer.state=FREEZING
    checkpoint_test.go:115: (09.310927) freezer.state=FREEZING
    checkpoint_test.go:115: (09.411052) freezer.state=FREEZING
    checkpoint_test.go:115: (09.511165) freezer.state=FREEZING
    checkpoint_test.go:115: (09.611291) freezer.state=FREEZING
    checkpoint_test.go:115: (09.711398) freezer.state=FREEZING
    checkpoint_test.go:115: (09.811526) freezer.state=FREEZING
    checkpoint_test.go:115: (09.911645) freezer.state=FREEZING
    checkpoint_test.go:115: (10.000726) Error (criu/cr-dump.c:1784): Timeout reached. Try to interrupt: 0
    checkpoint_test.go:115: (10.000770) freezer.state=FREEZING
    checkpoint_test.go:115: (10.000850) Unfreezing tasks into 1
    checkpoint_test.go:115: (10.000857) Unseizing 12457 into 1
    checkpoint_test.go:115: (10.000872) Error (compel/src/lib/infect.c:418): Unable to detach from 12457: No such process
    checkpoint_test.go:115: (10.000879) Writing image inventory (version 1)
    checkpoint_test.go:115: (10.000952) Error (criu/cr-dump.c:1898): Pre-dumping FAILED.
    checkpoint_test.go:115: === END ===
    checkpoint_test.go:116: criu failed: type PRE_DUMP errno 0
        log file: /tmp/TestCheckpoint1478934365/003/criu-parent/dump.log
time="2024-06-03T01:06:39Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/pids/test/integration: device or resource busy"
time="2024-06-03T01:06:39Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/blkio/test/integration: device or resource busy"
--- FAIL: TestCheckpoint (10.25s)

kolyshkin added a commit to kolyshkin/runc that referenced this issue Jun 3, 2024
@kolyshkin
Contributor

For CentOS 7, we use the somewhat dated criu v3.16 from https://copr.fedorainfracloud.org/coprs/adrian/criu-el7/builds/, with the latest one being v3.19 (@adrianreber might or might not want to look into that, as CentOS 7 will be EOL in a year).

For Ubuntu 20.04, we use the latest criu v3.19 (thanks @rst0git for keeping up with the builds!), but it runs on an older kernel (5.15), which I think might be the reason the cgroup freezer fails. Maybe @avagin can shed some light on why simple checkpointing might fail during freeze.

@adrianreber
Contributor

I would not worry about CentOS 7. It goes EOL at the end of June 2024. Just disable it. The CentOS 7 kernel never really supported everything, and CRIU support was always a tech preview. Newer versions of CRIU probably do not even build on CentOS 7, as we removed Python 2 support from CRIU. You can also disable the CentOS Stream 8 based tests. That went EOL at the end of May 2024.

@rst0git
Contributor

rst0git commented Jun 4, 2024

> In general, cgroup freezer is not very reliable, I previously had to implement some hacks in runc to work around it (see #2941 and the earlier PRs linked from there).

@kolyshkin Would it make sense to use a similar approach in freeze_processes()?

kolyshkin added a commit to kolyshkin/runc that referenced this issue Jun 4, 2024
kolyshkin added a commit to kolyshkin/runc that referenced this issue Jun 5, 2024
kolyshkin added a commit to kolyshkin/runc that referenced this issue Jun 5, 2024
@kolyshkin
Contributor

> @kolyshkin Would it make sense to use a similar approach in freeze_processes()?

Alas, with all that jazz it still fails sometimes, and people suggest even longer delays (see e.g. #4388). The question is where to draw the line: how many attempts are enough?

kolyshkin changed the title from "flaky test: TestUsernsCheckpoint" to "flaky tests: TestUsernsCheckpoint, TestCheckpoint" on Oct 28, 2024
kolyshkin added a commit to kolyshkin/criu that referenced this issue Dec 13, 2024
Cgroup v1 freezer has always been problematic, failing to freeze a
cgroup.

In runc, we have implemented a few kludges to increase the chance of
succeeding, but those are used when runc freezes a cgroup for its own
purposes (for "runc pause" and to modify device properties for cgroup
v1).

When criu is used, it fails to freeze a cgroup from time to time
(see [1], [2]). Let's try adding kludges similar to ones in runc.

Alas, I have absolutely no way to test this, so please review carefully.

[1]: opencontainers/runc#4273
[2]: opencontainers/runc#4457

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin
Contributor

> > In general, cgroup freezer is not very reliable, I previously had to implement some hacks in runc to work around it (see #2941 and the earlier PRs linked from there).
>
> @kolyshkin Would it make sense to use a similar approach in freeze_processes()?

I've decided to go ahead with this: checkpoint-restore/criu#2545

kolyshkin added a commit to kolyshkin/criu that referenced this issue Dec 16, 2024
kolyshkin added a commit to kolyshkin/criu that referenced this issue Dec 16, 2024
kolyshkin added a commit to kolyshkin/criu that referenced this issue Dec 16, 2024
kolyshkin added a commit to kolyshkin/criu that referenced this issue Dec 17, 2024