[1.1] join the cgroup after the initial setup finished #4439

Open
wants to merge 2 commits into
base: release-1.1

Conversation

@lifubang (Member) commented Oct 11, 2024

We should join the cgroup after the initial setup has finished,
but before runc init clones new child processes. (#4427)

This is because we should try our best to reduce the influence of
memory cgroup accounting from all runc init processes
before we start the container init process.

With this patch, the impact of memory accounting from
ensure_clone_binary is eliminated.
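
To make the intended ordering concrete, here is a hedged sketch of what (*initProcess).start() does with this change (Intel RDT handling omitted). The waitInit channel and the io.Copy of p.bootstrapData appear in the hunks quoted in this thread; the p.manager.Apply(p.pid()) call is assumed to be the cgroup-join step behind the "no children can escape the cgroup" comment, so treat this as illustrative rather than the exact patch:

	// 1. Wait until "runc init" has finished its initial setup (including the
	//    memory-heavy ensure_clone_binary step), so that memory allocated there
	//    is not charged to the container's memory cgroup.
	if err := <-waitInit; err != nil {
		return err
	}

	// 2. Only now join the cgroup, still before any child process is cloned,
	//    so that no child can escape the cgroup.
	if err := p.manager.Apply(p.pid()); err != nil {
		return fmt.Errorf("unable to apply cgroup configuration: %w", err)
	}

	// 3. Finally send the bootstrap data; nsexec blocks on reading it before it
	//    clones any children, so this write doubles as a synchronization point.
	if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
		return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
	}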

@@ -407,6 +407,13 @@ func (p *initProcess) start() (retErr error) {
}
}()

// We should join the cgroup after the initial setup finished,
// but before runc init clone new children processes. (#4427)
err = <-waitInit
Contributor

A noob question: will this have any performance impact, since joining the cgroup and init no longer run in parallel?

@lifubang (Member, Author) Oct 11, 2024

Good question. I ran a test starting 100 containers:
runc-1.1.15: 3.025s
With this patch: 4.123s

So we need finer-grained coordination between runc init and the main process.

Contributor

Good question. I ran a test starting 100 containers:
runc-1.1.15: 3.025s
With this patch: 4.123s

^^ Is this degradation within acceptable limits?

Member (Author)

@opencontainers/runc-maintainers PTAL

@rata (Member) Oct 14, 2024

Yes, I don't see why it wouldn't be. 10 ms more to start a container seems acceptable; I wouldn't be surprised if we get more noise from the Go GC or other code changes. Am I missing something?

@kolyshkin (Contributor)

IMO this should not be a backport, but rather an original PR targeted for release-1.1 specifically (see #4438 (comment)). In other words, it does not make sense for the main branch to have this code.

For 1.1, this commit makes sense. OR, we can port 0e9a335 and remove initWaiter entirely.

@kolyshkin (Contributor)

OR, we can port commit 0e9a335

Well, it will be hard to do I guess. So maybe this makes sense to have, as a hotfix. But please don't make it a backport @lifubang

@lifubang (Member, Author)

Well, it will be hard to do I guess. So maybe this makes sense to have, as a hotfix. But please don't make it a backport

OK, please see my comments in #4438. If you still think it's not worthwhile for the main branch, I'll open it as a hotfix.

We should join the cgroup after the initial setup has finished,
but before runc init clones new child processes. (opencontainers#4427)
This is because we should try our best to reduce the influence of
memory cgroup accounting from all runc init processes
before we start the container init process.

Signed-off-by: lifubang <[email protected]>
@lifubang lifubang force-pushed the backport-join-cgroup-later branch from ea5f8e0 to 34f1d3b on October 13, 2024 10:30
@lifubang lifubang changed the title [backport 1.1] join the cgroup after the initial setup finished [1.1] join the cgroup after the initial setup finished Oct 13, 2024
@lifubang lifubang added this to the 1.1.16 milestone Oct 13, 2024
@rata (Member) left a comment

@lifubang LGTM, thanks again for this fix!

I see this is not mentioned as a backport, so that change @kolyshkin requested is done too :)

@rata (Member) commented Oct 14, 2024

Oh, I just realized while checking my TODO list: can you revert #4423 here too? It would help us see that it is no longer needed, and give us more confidence.

@kolyshkin (Contributor) left a comment

I think we can gain some of the lost speed back by moving io.Copy of the bootstrapData together with waitInit. IOW:

diff --git a/libcontainer/process_linux.go b/libcontainer/process_linux.go
index ac3b104e..da4db4e2 100644
--- a/libcontainer/process_linux.go
+++ b/libcontainer/process_linux.go
@@ -407,6 +407,14 @@ func (p *initProcess) start() (retErr error) {
                }
        }()
 
+       if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
+               return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
+       }
+       err = <-waitInit
+       if err != nil {
+               return err
+       }
+
        // Do this before syncing with child so that no children can escape the
        // cgroup. We don't need to worry about not doing this and not being root
        // because we'd be using the rootless cgroup manager in that case.
@@ -418,14 +426,6 @@ func (p *initProcess) start() (retErr error) {
                        return fmt.Errorf("unable to apply Intel RDT configuration: %w", err)
                }
        }
-       if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
-               return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
-       }
-       err = <-waitInit
-       if err != nil {
-               return err
-       }
-
        childPid, err := p.getChildPid()
        if err != nil {
                return fmt.Errorf("can't get final child's PID from pipe: %w", err)

@kolyshkin (Contributor)

I think we can gain some of the lost speed back by moving io.Copy of the bootstrapData together with waitInit.

I did some simple testing (similar to what @lifubang did). Compared to current runc (1.1.15), I see ~40% higher times without the suggested change, and ~25% higher times with it. So, yes, it helps a bit.

@lifubang (Member, Author) commented Oct 15, 2024

Oh, I just realized while checking my TODO list: can you revert #4423 here too? It would help us see that it is no longer needed, and give us more confidence.

How about keeping this?
I think #4020 can also be reverted in the main branch, but that needs more testing; we should do the revert in the main branch first.
Keep in mind that we only needed #4020 because of #3931; at that time, we had not yet moved the binary clone from runc init to the runc parent process.

@lifubang (Member, Author) commented Oct 15, 2024

I think we can gain some of the lost speed back by moving io.Copy of the bootstrapData together with waitInit. IOW:

Maybe the suggested change will cause a new race?
In fact, besides using bootstrapData to transfer data between runc init and the runc parent process, we also use it as a synchronization mechanism to ensure that the parent process has put the runc init process into the cgroup before runc init clones new child processes. Please see:

// Do this before syncing with child so that no children can escape the
// cgroup. We don't need to worry about not doing this and not being root
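
To illustrate why the write of the bootstrap data is itself the barrier, here is a minimal, self-contained Go sketch of the pattern; applyCgroup and the payload string are illustrative stand-ins, not runc's real API. The "child" side cannot proceed until the bootstrap data arrives, and the "parent" side only writes it after the cgroup placement is done.

package main

import (
	"fmt"
	"io"
	"os"
)

// applyCgroup stands in for the cgroup manager's Apply(pid) step: in runc the
// parent places the runc init PID into the container's cgroup here.
func applyCgroup(pid int) error {
	fmt.Println("parent: placed pid", pid, "into the container's cgroup")
	return nil
}

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	done := make(chan struct{})
	go func() {
		defer close(done)
		// "Child" side (nsexec in runc): it blocks here until the parent has
		// written the bootstrap data and closed its end of the pipe, so it
		// cannot clone children any earlier.
		data, err := io.ReadAll(r)
		if err != nil {
			panic(err)
		}
		fmt.Printf("child: got %d bytes of bootstrap data, now safe to clone\n", len(data))
	}()

	// "Parent" side: join the cgroup first...
	if err := applyCgroup(os.Getpid()); err != nil {
		panic(err)
	}
	// ...and only then send the bootstrap data, releasing the child.
	if _, err := w.Write([]byte("netlink bootstrap payload")); err != nil {
		panic(err)
	}
	w.Close()
	<-done
}

Moving the real io.Copy before the cgroup Apply, as suggested above, would hand the child its data too early, which is the race being discussed here.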

@lifubang (Member, Author) commented Oct 15, 2024

Maybe we can add another wait before ‘runc init’ clones new child processes, but I don't know whether it is worth doing.

EDIT: I tested it; there is no benefit, so it's not worth doing.

@lifubang (Member, Author) commented Oct 15, 2024

Furthermore, maybe we could put only the last ‘runc init’ process into the cgroup, because it's the real container process. But there is a kernel error when moving a process into a cgroup after it has joined a new pid/cgroup namespace.

EDIT: There seems to be no good way to fix this error, so it's also not worth doing.

@lifubang (Member, Author)

It seems there is no other way to speed this up, and it now has the same speed as the main branch.
I think we should focus on how to improve the speed in the main branch.

@lifubang (Member, Author)

I think #4020 can also be reverted in the main branch, but that needs more testing; we should do the revert in the main branch first.

The revert PR for the main branch is #4446; it seems to work fine.

@lifubang lifubang force-pushed the backport-join-cgroup-later branch from dedf481 to 2813f71 on October 15, 2024 09:57
As we will fix the race between the binary clone and the cgroup join, we can eliminate the
impact of memory accounting from ensure_clone_binary, so runc will keep the same low memory
usage as before.

This reverts commit 719e2bc.

Signed-off-by: lifubang <[email protected]>
@lifubang lifubang force-pushed the backport-join-cgroup-later branch from 2813f71 to e52d0d1 on October 15, 2024 09:58
@lifubang (Member, Author)

Can you revert #4423 here too?

I have reverted it.

@rata (Member) commented Oct 15, 2024

@kolyshkin I agree with @lifubang here. If we move it as you proposed here, then nsexec has all the info to parse the netlink data and might fork before we add it to the cgroup. Or is there some other point that will stop this race from happening?

I really don't see why this perf impact is a big issue. Isn't it about 10 ms per container, and therefore within the noise of the Go GC or other things we do?

@cyphar (Member) commented Oct 16, 2024

Maybe we can just backport the code that moved the copying logic rather than trying to come up with a bespoke fix for 1.1.x that isn't in 1.2.x? Or we can revert the removal of bindfd from 1.1.x.

(Sorry for not catching this when doing the original patch or the backport...)

@lifubang (Member, Author) commented Oct 16, 2024

rather than trying to come up with a bespoke fix for 1.1.x that isn't in 1.2.x?

In fact, it should also have been fixed in the main branch after #3931 was merged, instead of by #4020, and then backported to release-1.1.
But I'm sorry that when I wrote #4020, I didn't catch this race.
From this perspective, it's reasonable to fix it in release-1.1.

As you say, there are two other ways to fix this race:

  1. Backport the code that moved the copying logic from runc init to runc parent process;
  2. Revert [1.1] nsenter: cloned_binary: remove bindfd logic entirely #4392.
    

I think both are OK; it's just a matter of choice.

@rata (Member) commented Oct 16, 2024

@cyphar bindfd is causing a lot of real issues for us in 1.1. That is not a good way forward :(

Those other changes are huge to backport; it is quite a risky backport IMHO, and maybe time-consuming too. Why don't we do this trivial PR, which fixes exactly what we need? containerd at least (maybe other projects too) is waiting for this fix. And we can aim to release 1.2 soon.

If we fail to release 1.2 soon, we can consider that backport, but without the rush of having users waiting for a fix.

What do you think?

@kolyshkin (Contributor)

I think this fix makes sense as a stop-gap measure, in case there are issues. The full backport is complicated and is not really necessary.

@kolyshkin (Contributor) left a comment

lgtm

@rata (Member) left a comment

This LGTM. I have no strong opinion on doing a last 1.1 release, though. I leave this for others to decide.

mariash added a commit to cloudfoundry/guardian that referenced this pull request Nov 6, 2024
Inigo test is failing to set memory with "device is busy" error. It
looks like 1.1.15 has an issue where process joins cgroup before the
setup is finished - opencontainers/runc#4439
@thaJeztah (Member)

I was indeed curious whether we still want to do 1.1.x releases, or consider 1.2 to be the only maintained one. I'm not strongly against it, as it's still relatively soon after 1.2, but perhaps we need to document our policy somewhere (I'm assuming we don't want to continue maintaining both 1.1 and 1.2 side-by-side forever).

@lifubang (Member, Author)

I was indeed curious whether we still want to do 1.1.x releases, or consider 1.2 to be the only maintained one. I'm not strongly against it, as it's still relatively soon after 1.2, but perhaps we need to document our policy somewhere (I'm assuming we don't want to continue maintaining both 1.1 and 1.2 side-by-side forever).

Yes, please see #4549 (comment)

@kolyshkin (Contributor)

I think (and hope) we won't be doing another 1.1.x (1.1.16) because 1.2.x is good in general, with a few rough edges fixed in point releases.

Having said that, let's wait another couple of months before we definitely say "no".

@cyphar (Member) commented Dec 20, 2024

@thaJeztah Your comments in #4557 on this topic would be appreciated 😉. The current text in that PR implies that we are not required to do 1.0.x or 1.1.x releases. IMHO we should only consider doing a 1.1.x release at this point if there is a critical security issue discovered.

@thaJeztah (Member)

Oh! Thanks for the nudge @cyphar - I have some calls coming up, but will try to have a look (apologies again for replying slowly.. a bit overloaded at the moment).

@cyphar (Member) commented Dec 20, 2024

No rush, I just wanted to make sure you had a chance to give your input since you commented about this exact topic here. 😸
