[test/DNM] Checking if criu cgroup v1 kludges help #4559

Closed

kolyshkin
Contributor

@kolyshkin kolyshkin commented Dec 17, 2024

Testing criu PR checkpoint-restore/criu#2545.

Related to #4273, #4457 etc.


Results:

  1. checkpoint-restore/criu#2545 ("Freeze fixes and v1 kludges") indeed fixes the "unable to freeze" issue (I can't reproduce it with the patches applied).
  2. Another issue was found, reported as checkpoint-restore/criu#2551 ("page-xfer error during TestUsernsCheckpoint in runc CI").

Some more details:

  • It is reproducible on Cirrus CI on AlmaLinux 8 (kernel 4.18.0-553.30.1.el8_10.x86_64)
  • I was unable to reproduce it locally in a vagrant VM running AlmaLinux 8 (kernel 4.18.0-553.16.1.el8_10.x86_64)
  • It is enough to run 10-20 iterations of make localunittest to reproduce it. Each iteration takes ~30s.
  • Out of 24 CI runs, I got 9 "unable to freeze" failures (all when using criu-dev) and 11 "page-xfer" failures (7 when using checkpoint-restore/criu#2545, "Freeze fixes and v1 kludges", and 4 when using criu-dev)
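
For anyone who wants to repeat the "10-20 iterations of make localunittest" step, here is a minimal sketch of such a loop. The helper names (`run_make_target`, `repeat_until_failure`) are made up for illustration and are not part of runc; the only thing taken from this PR is the `make localunittest` target and the iteration count.

```python
import subprocess


def run_make_target(target="localunittest"):
    """Run `make <target>` once in the current directory.

    Returns True if make exited with status 0. The target name comes from
    the PR description; everything else here is an illustrative assumption.
    """
    return subprocess.run(["make", target]).returncode == 0


def repeat_until_failure(run_once, max_iters=20):
    """Call run_once() up to max_iters times.

    Returns the 1-based index of the first failing iteration,
    or None if every iteration succeeded.
    """
    for i in range(1, max_iters + 1):
        if not run_once():
            return i
    return None


# Example use from the top of a runc checkout (not executed here):
#   failed_at = repeat_until_failure(run_make_target, max_iters=20)
#   print("no failure" if failed_at is None else f"failed at iteration {failed_at}")
```

At about 30s per iteration, 20 iterations should take roughly 10 minutes when nothing fails.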

Also

@kolyshkin kolyshkin force-pushed the test-criu-pr-2545 branch 2 times, most recently from 6f5c745 to a7ec538 on December 17, 2024 06:51
@kolyshkin
Contributor Author

I was only able to get a different criu error, unrelated to freezing. From https://cirrus-ci.com/task/6207980225429504 (search for "missing"):

=== RUN   TestUsernsCheckpoint/0
=== RUN   TestUsernsCheckpoint/1
time="2024-12-17T06:57:27Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint12863306293/003/criu/dump.log\""
time="2024-12-17T06:57:27Z" level=warning msg="841:(00.111681) page-xfer: Transferring pages:"
time="2024-12-17T06:57:27Z" level=warning msg="842:(00.111682) page-xfer: \tbuf 1/1"
time="2024-12-17T06:57:27Z" level=warning msg="843:(00.111684) page-xfer: \tp 0x7fff903b8000 [1]"
time="2024-12-17T06:57:27Z" level=warning msg="844:(00.111689) page-xfer: \th 0x7fff903b9000 [1]"
time="2024-12-17T06:57:27Z" level=warning msg="845:(00.111691) page-xfer: Checking 0x7fff903b9000/4096 hole"
time="2024-12-17T06:57:27Z" level=warning msg="846:(00.111693) Error (criu/page-xfer.c:299): page-xfer: Missing 7fff903b9000 in parent pagemap"
time="2024-12-17T06:57:27Z" level=warning msg="847:(00.111697) Error (criu/page-xfer.c:342): page-xfer: Hole 0x7fff903b9000/4096 not found in parent"
time="2024-12-17T06:57:27Z" level=warning msg="848:(00.111716) page-pipe: Killing page pipe"
time="2024-12-17T06:57:27Z" level=warning msg="849:(00.111760) ----------------------------------------"
time="2024-12-17T06:57:27Z" level=warning msg="850:(00.111764) Error (criu/mem.c:672): Can't dump page with parasite"
time="2024-12-17T06:57:27Z" level=warning msg=...
time="2024-12-17T06:57:27Z" level=warning msg="860:(00.112043) net: Unlock network"
time="2024-12-17T06:57:27Z" level=warning msg="861:(00.112046) Running network-unlock scripts"
time="2024-12-17T06:57:27Z" level=warning msg="862:(00.112048) \tRPC"
time="2024-12-17T06:57:27Z" level=warning msg="863:(00.133784) Unfreezing tasks into 1"
time="2024-12-17T06:57:27Z" level=warning msg="864:(00.133799) \tUnseizing 97673 into 1"
time="2024-12-17T06:57:27Z" level=warning msg="865:(00.133822) Error (criu/cr-dump.c:2111): Dumping FAILED."
time="2024-12-17T06:57:27Z" level=warning msg=---
    checkpoint_test.go:118: criu failed: type DUMP errno 0
=== RUN   TestUsernsCheckpoint/2
=== RUN   TestUsernsCheckpoint/3
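
To pick the CRIU errors out of a quoted dump.log like the one above, something along these lines may help. This is a rough sketch: the regular expression is an assumption inferred from the lines in this comment (logrus-style `msg="N:(timestamp) Error (file:line): text"`), not a documented log format, and `criu_errors` is a hypothetical helper name.

```python
import re

# Assumed shape, inferred from the quoted lines only:
#   ... msg="846:(00.111693) Error (criu/page-xfer.c:299): <text>"
ERROR_RE = re.compile(
    r'msg="(?P<n>\d+):\((?P<ts>\d+\.\d+)\) '
    r'Error \((?P<src>[^:()]+):(?P<line>\d+)\): (?P<text>.*)"\s*$'
)


def criu_errors(lines):
    """Return (source file, source line, message) for each CRIU error line."""
    out = []
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            out.append((m["src"], int(m["line"]), m["text"]))
    return out


# A few lines copied from the log quoted in this comment.
SAMPLE = [
    r'time="2024-12-17T06:57:27Z" level=warning msg="845:(00.111691) page-xfer: Checking 0x7fff903b9000/4096 hole"',
    r'time="2024-12-17T06:57:27Z" level=warning msg="846:(00.111693) Error (criu/page-xfer.c:299): page-xfer: Missing 7fff903b9000 in parent pagemap"',
    r'time="2024-12-17T06:57:27Z" level=warning msg="847:(00.111697) Error (criu/page-xfer.c:342): page-xfer: Hole 0x7fff903b9000/4096 not found in parent"',
    r'time="2024-12-17T06:57:27Z" level=warning msg="850:(00.111764) Error (criu/mem.c:672): Can\'t dump page with parasite"'.replace("\\'", "'"),
    r'time="2024-12-17T06:57:27Z" level=warning msg="865:(00.133822) Error (criu/cr-dump.c:2111): Dumping FAILED."',
]

for src, line_no, text in criu_errors(SAMPLE):
    print(f"{src}:{line_no}: {text}")
```

The first error in the list (criu/page-xfer.c:299) is the earliest one reported in this log; the later ones appear to follow from it.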

@kolyshkin
Contributor Author

Again (from https://cirrus-ci.com/task/6751216950050816, also on almalinux-8):

=== RUN   TestUsernsCheckpoint
=== RUN   TestUsernsCheckpoint/0
time="2024-12-17T07:10:49Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint02954809155/003/criu/dump.log\""
time="2024-12-17T07:10:49Z" level=warning msg="842:(00.186487) page-xfer: Transferring pages:"
time="2024-12-17T07:10:49Z" level=warning msg="843:(00.186489) page-xfer: \tbuf 1/1"
time="2024-12-17T07:10:49Z" level=warning msg="844:(00.186491) page-xfer: \tp 0x7ffde716c000 [1]"
time="2024-12-17T07:10:49Z" level=warning msg="845:(00.186498) page-xfer: \th 0x7ffde716d000 [1]"
time="2024-12-17T07:10:49Z" level=warning msg="846:(00.186499) page-xfer: Checking 0x7ffde716d000/4096 hole"
time="2024-12-17T07:10:49Z" level=warning msg="847:(00.186502) Error (criu/page-xfer.c:299): page-xfer: Missing 7ffde716d000 in parent pagemap"
time="2024-12-17T07:10:49Z" level=warning msg="848:(00.186506) Error (criu/page-xfer.c:342): page-xfer: Hole 0x7ffde716d000/4096 not found in parent"
time="2024-12-17T07:10:49Z" level=warning msg="849:(00.186529) page-pipe: Killing page pipe"
time="2024-12-17T07:10:49Z" level=warning msg="850:(00.186561) ----------------------------------------"
time="2024-12-17T07:10:49Z" level=warning msg="851:(00.186563) Error (criu/mem.c:672): Can't dump page with parasite"
time="2024-12-17T07:10:49Z" level=warning msg=...
time="2024-12-17T07:10:49Z" level=warning msg="861:(00.186977) net: Unlock network"
time="2024-12-17T07:10:49Z" level=warning msg="862:(00.186981) Running network-unlock scripts"
time="2024-12-17T07:10:49Z" level=warning msg="863:(00.186983) \tRPC"
time="2024-12-17T07:10:49Z" level=warning msg="864:(00.204552) Unfreezing tasks into 1"
time="2024-12-17T07:10:49Z" level=warning msg="865:(00.204578) \tUnseizing 95994 into 1"
time="2024-12-17T07:10:49Z" level=warning msg="866:(00.204602) Error (criu/cr-dump.c:2111): Dumping FAILED."
time="2024-12-17T07:10:49Z" level=warning msg=---
    checkpoint_test.go:118: criu failed: type DUMP errno 0
=== RUN   TestUsernsCheckpoint/1
=== RUN   TestUsernsCheckpoint/2

@kolyshkin
Contributor Author

This failed elsewhere just today (https://cirrus-ci.com/task/6204970728423424?logs=unit_tests#L764).

I've added testing of both criu versions to ci / almalinux jobs.

@kolyshkin kolyshkin force-pushed the test-criu-pr-2545 branch 13 times, most recently from 697e71f to 7b2855e on December 18, 2024 04:10
@kolyshkin
Contributor Author

Seeing this for the third time. From https://cirrus-ci.com/task/5627926906929152?logs=unit_tests_1#L20

=== RUN   TestUsernsCheckpoint
time="2024-12-18T04:09:25Z" level=warning msg="--- Quoting \"/tmp/TestUsernsCheckpoint1601804805/003/criu/dump.log\""
time="2024-12-18T04:09:25Z" level=warning msg="843:(00.143747) page-xfer: Transferring pages:"
time="2024-12-18T04:09:25Z" level=warning msg="844:(00.143748) page-xfer: \tbuf 1/1"
time="2024-12-18T04:09:25Z" level=warning msg="845:(00.143750) page-xfer: \tp 0x7ffcd5eec000 [1]"
time="2024-12-18T04:09:25Z" level=warning msg="846:(00.143756) page-xfer: \th 0x7ffcd5eed000 [1]"
time="2024-12-18T04:09:25Z" level=warning msg="847:(00.143758) page-xfer: Checking 0x7ffcd5eed000/4096 hole"
time="2024-12-18T04:09:25Z" level=warning msg="848:(00.143761) Error (criu/page-xfer.c:299): page-xfer: Missing 7ffcd5eed000 in parent pagemap"
time="2024-12-18T04:09:25Z" level=warning msg="849:(00.143764) Error (criu/page-xfer.c:342): page-xfer: Hole 0x7ffcd5eed000/4096 not found in parent"
time="2024-12-18T04:09:25Z" level=warning msg="850:(00.143793) page-pipe: Killing page pipe"
time="2024-12-18T04:09:25Z" level=warning msg="851:(00.143820) ----------------------------------------"
time="2024-12-18T04:09:25Z" level=warning msg="852:(00.143822) Error (criu/mem.c:672): Can't dump page with parasite"
time="2024-12-18T04:09:25Z" level=warning msg=...
time="2024-12-18T04:09:25Z" level=warning msg="862:(00.144124) net: Unlock network"
time="2024-12-18T04:09:25Z" level=warning msg="863:(00.144129) Running network-unlock scripts"
time="2024-12-18T04:09:25Z" level=warning msg="864:(00.144131) \tRPC"
time="2024-12-18T04:09:25Z" level=warning msg="865:(00.155348) Unfreezing tasks into 1"
time="2024-12-18T04:09:25Z" level=warning msg="866:(00.155361) \tUnseizing 96793 into 1"
time="2024-12-18T04:09:25Z" level=warning msg="867:(00.155389) Error (criu/cr-dump.c:2111): Dumping FAILED."
time="2024-12-18T04:09:25Z" level=warning msg=---
    checkpoint_test.go:113: criu failed: type DUMP errno 0
--- FAIL: TestUsernsCheckpoint (0.60s)

(still no luck catching the original issue)
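
One pattern worth noting: in all three logs quoted so far, the hole reported as missing starts exactly one 4096-byte page after the single page on the `p` line. The sketch below checks that from the quoted lines. Reading `p` as a dumped page and `h` as a hole is an assumption based on the "Checking ... hole" line that follows; `page_and_hole` is a hypothetical helper, and the page size is taken from the "/4096" suffix in the logs.

```python
import re

PAGE_SIZE = 4096  # from the "/4096 hole" suffix in the quoted logs

# The tab shows up as a literal backslash + "t" in the quoted (escaped) msg.
ENTRY_RE = re.compile(
    r'page-xfer: \\t(?P<kind>[ph]) (?P<addr>0x[0-9a-f]+) \[(?P<count>\d+)\]'
)


def page_and_hole(lines):
    """Return (p address, h address) found in the lines, or None for a missing one."""
    found = {}
    for line in lines:
        m = ENTRY_RE.search(line)
        if m:
            found[m["kind"]] = int(m["addr"], 16)
    return found.get("p"), found.get("h")


# The p/h lines copied from the three quoted logs, keyed by Cirrus CI task.
LOGS = {
    "task/6207980225429504": [
        r'time="2024-12-17T06:57:27Z" level=warning msg="843:(00.111684) page-xfer: \tp 0x7fff903b8000 [1]"',
        r'time="2024-12-17T06:57:27Z" level=warning msg="844:(00.111689) page-xfer: \th 0x7fff903b9000 [1]"',
    ],
    "task/6751216950050816": [
        r'time="2024-12-17T07:10:49Z" level=warning msg="844:(00.186491) page-xfer: \tp 0x7ffde716c000 [1]"',
        r'time="2024-12-17T07:10:49Z" level=warning msg="845:(00.186498) page-xfer: \th 0x7ffde716d000 [1]"',
    ],
    "task/5627926906929152": [
        r'time="2024-12-18T04:09:25Z" level=warning msg="845:(00.143750) page-xfer: \tp 0x7ffcd5eec000 [1]"',
        r'time="2024-12-18T04:09:25Z" level=warning msg="846:(00.143756) page-xfer: \th 0x7ffcd5eed000 [1]"',
    ],
}

for task, lines in LOGS.items():
    p, h = page_and_hole(lines)
    print(f"{task}: p={p:#x} h={h:#x} adjacent={h - p == PAGE_SIZE}")
```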

@kolyshkin
Contributor Author

Seeing this for the third time.

Filed checkpoint-restore/criu#2551.

@kolyshkin
Contributor Author

One more, https://cirrus-ci.com/task/5065636230987776?logs=unit_tests_stock_criu#L765

So,

  • able to reproduce on AlmaLinux 8 (kernel 4.18.0-553.30.1.el8_10.x86_64) with stock criu (3.18-5.module_el8.10.0+3926+f12484f5) in 5 to 10% of all runs.
  • not able to reproduce on AlmaLinux 8 with criu compiled from checkpoint-restore/criu#2545 ("Freeze fixes and v1 kludges")
  • not able to reproduce on AlmaLinux 9 (kernel 5.14.0-503.15.1.el9_5.x86_64) with stock criu (3.19)
  • not able to reproduce on GHA CI with Ubuntu 20.04.

To be absolutely sure, let's run more CI rounds with AlmaLinux 8 only, with and without the patches from checkpoint-restore/criu#2545.
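
As a rough guide to how many rounds "more" needs to be: if a failure shows up in a fraction p of runs, the chance of seeing n clean runs in a row purely by luck is (1 - p)^n. The sketch below solves for n. It assumes runs are independent with a constant failure rate, which may not hold on shared CI machines, and `clean_runs_needed` is a hypothetical helper; the 5% and 10% rates are the ones observed above.

```python
import math


def clean_runs_needed(p, confidence=0.95):
    """Smallest n such that (1 - p)**n <= 1 - confidence.

    p is the assumed per-run failure probability. After n consecutive
    clean runs, "the failure is still there and we were just lucky"
    has probability at most 1 - confidence under the independence
    assumption stated above.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


for rate in (0.05, 0.10):
    print(f"failure rate {rate:.0%}: {clean_runs_needed(rate)} clean runs")
```

So with a 5 to 10% failure rate, a handful of green runs says very little; a few dozen are needed before a clean streak is convincing.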

@kolyshkin kolyshkin force-pushed the test-criu-pr-2545 branch 24 times, most recently from 1b56e0d to 72a4c9f on December 18, 2024 14:04
Testing criu PR 2545.

Now with AlmaLinux 8 on Cirrus CI only.

Signed-off-by: Kir Kolyshkin <[email protected]>

Iteration 24 - 2024-12-18 06:16:13
@kolyshkin
Contributor Author

Added an overview of what I found to the description.

@kolyshkin kolyshkin closed this Dec 18, 2024