[1.1] join the cgroup after the initial setup finished #4439

Open
wants to merge 2 commits into
base: release-1.1

Conversation

@lifubang (Member) commented Oct 11, 2024

We should join the cgroup after the initial setup has finished,
but before runc init clones new child processes. (#4427)

This is because we should try our best to reduce the influence of
memory cgroup accounting from all runc init processes
before we start the container init process.

With this patch, the impact of memory accounting from
ensure_clone_binary is eliminated.
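
To make the intended ordering concrete, here is a hedged sketch of what (*initProcess).start() does with this change (Intel RDT handling omitted). The waitInit channel and the io.Copy of p.bootstrapData appear in the hunks quoted in this thread; the p.manager.Apply(p.pid()) call is assumed to be the cgroup-join step behind the "no children can escape the cgroup" comment, so treat this as illustrative rather than the exact patch:

	// 1. Wait until "runc init" has finished its initial setup (including the
	//    memory-heavy ensure_clone_binary step), so that memory allocated there
	//    is not charged to the container's memory cgroup.
	if err := <-waitInit; err != nil {
		return err
	}

	// 2. Only now join the cgroup, still before any child process is cloned,
	//    so that no child can escape the cgroup.
	if err := p.manager.Apply(p.pid()); err != nil {
		return fmt.Errorf("unable to apply cgroup configuration: %w", err)
	}

	// 3. Finally send the bootstrap data; nsexec blocks on reading it before it
	//    clones any children, so this write doubles as a synchronization point.
	if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
		return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
	}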

@@ -407,6 +407,13 @@ func (p *initProcess) start() (retErr error) {
}
}()

// We should join the cgroup after the initial setup finished,
// but before runc init clone new children processes. (#4427)
err = <-waitInit
Contributor

A noob question: will this have any performance impact, since joining the cgroup and init no longer run in parallel?

@lifubang (Member, Author) Oct 11, 2024

Good question. I ran a test starting 100 containers:
runc-1.1.15: 3.025s
With this patch: 4.123s

So we need finer-grained coordination between runc init and the main process.

Contributor

Good question. I ran a test starting 100 containers:
runc-1.1.15: 3.025s
With this patch: 4.123s

^^ Is this degradation within acceptable limits?

Member (Author)

@opencontainers/runc-maintainers PTAL

@rata (Member) Oct 14, 2024

Yes, I don't see why it wouldn't be. 10 ms more to start a container seems acceptable; I wouldn't be surprised if we get more noise from the Go GC or other code changes. Am I missing something?

@kolyshkin (Contributor)

IMO this should not be a backport, but rather an original PR targeted for release-1.1 specifically (see #4438 (comment)). In other words, it does not make sense for the main branch to have this code.

For 1.1, this commit makes sense. OR, we can port 0e9a335 and remove initWaiter entirely.

@kolyshkin (Contributor)

OR, we can port commit 0e9a335

Well, it will be hard to do I guess. So maybe this makes sense to have, as a hotfix. But please don't make it a backport @lifubang

@lifubang (Member, Author)

Well, it will be hard to do I guess. So maybe this makes sense to have, as a hotfix. But please don't make it a backport

OK, please see my comments in #4438. If you still think it's not worthwhile for the main branch, I'll open it as a hotfix.

We should join the cgroup after the initial setup has finished,
but before runc init clones new child processes. (opencontainers#4427)
This is because we should try our best to reduce the influence of
memory cgroup accounting from all runc init processes
before we start the container init process.

Signed-off-by: lifubang <[email protected]>
@lifubang lifubang force-pushed the backport-join-cgroup-later branch from ea5f8e0 to 34f1d3b on October 13, 2024 10:30
@lifubang lifubang changed the title [backport 1.1] join the cgroup after the initial setup finished [1.1] join the cgroup after the initial setup finished Oct 13, 2024
@lifubang lifubang added this to the 1.1.16 milestone Oct 13, 2024
@rata (Member) left a comment

@lifubang LGTM, thanks again for this fix!

I see this is not mentioned as a backport, so that change @kolyshkin requested is done too :)

@rata (Member) commented Oct 14, 2024

Oh, I just realized while checking my TODO list: can you revert #4423 here too? It would help us see that it is no longer needed, and give us more confidence.

@kolyshkin (Contributor) left a comment

I think we can gain some of the lost speed back by moving io.Copy of the bootstrapData together with waitInit. IOW:

diff --git a/libcontainer/process_linux.go b/libcontainer/process_linux.go
index ac3b104e..da4db4e2 100644
--- a/libcontainer/process_linux.go
+++ b/libcontainer/process_linux.go
@@ -407,6 +407,14 @@ func (p *initProcess) start() (retErr error) {
                }
        }()
 
+       if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
+               return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
+       }
+       err = <-waitInit
+       if err != nil {
+               return err
+       }
+
        // Do this before syncing with child so that no children can escape the
        // cgroup. We don't need to worry about not doing this and not being root
        // because we'd be using the rootless cgroup manager in that case.
@@ -418,14 +426,6 @@ func (p *initProcess) start() (retErr error) {
                        return fmt.Errorf("unable to apply Intel RDT configuration: %w", err)
                }
        }
-       if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
-               return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
-       }
-       err = <-waitInit
-       if err != nil {
-               return err
-       }
-
        childPid, err := p.getChildPid()
        if err != nil {
                return fmt.Errorf("can't get final child's PID from pipe: %w", err)

@kolyshkin (Contributor)

I think we can gain some of the lost speed back by moving io.Copy of the bootstrapData together with waitInit.

I did some simple testing (similar to what @lifubang did). Compared to current runc (1.1.15), I see ~40% higher times without the suggested change, and ~25% higher times with it. So, yes, it helps a bit.

@lifubang (Member, Author) commented Oct 15, 2024

Oh, I just realized while checking my TODO list: can you revert #4423 here too? It would help us see that it is no longer needed, and give us more confidence.

How about keeping this?
I think #4020 can also be reverted in the main branch, but that needs more testing; we should do the revert in the main branch first.
Keep in mind that we only needed #4020 because of #3931; at that time, we had not yet moved the binary clone from runc init to the runc parent process.

@lifubang (Member, Author) commented Oct 15, 2024

I think we can gain some of the lost speed back by moving io.Copy of the bootstrapData together with waitInit. IOW:

Maybe the suggested change will cause a new race?
In fact, besides using bootstrapData to transfer data between runc init and the runc parent process, we also use it as a synchronization mechanism to ensure that the parent process has put the runc init process into the cgroup before runc init clones new child processes. Please see:

// Do this before syncing with child so that no children can escape the
// cgroup. We don't need to worry about not doing this and not being root
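
To illustrate why the write of the bootstrap data is itself the barrier, here is a minimal, self-contained Go sketch of the pattern; applyCgroup and the payload string are illustrative stand-ins, not runc's real API. The "child" side cannot proceed until the bootstrap data arrives, and the "parent" side only writes it after the cgroup placement is done.

package main

import (
	"fmt"
	"io"
	"os"
)

// applyCgroup stands in for the cgroup manager's Apply(pid) step: in runc the
// parent places the runc init PID into the container's cgroup here.
func applyCgroup(pid int) error {
	fmt.Println("parent: placed pid", pid, "into the container's cgroup")
	return nil
}

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	done := make(chan struct{})
	go func() {
		defer close(done)
		// "Child" side (nsexec in runc): it blocks here until the parent has
		// written the bootstrap data and closed its end of the pipe, so it
		// cannot clone children any earlier.
		data, err := io.ReadAll(r)
		if err != nil {
			panic(err)
		}
		fmt.Printf("child: got %d bytes of bootstrap data, now safe to clone\n", len(data))
	}()

	// "Parent" side: join the cgroup first...
	if err := applyCgroup(os.Getpid()); err != nil {
		panic(err)
	}
	// ...and only then send the bootstrap data, releasing the child.
	if _, err := w.Write([]byte("netlink bootstrap payload")); err != nil {
		panic(err)
	}
	w.Close()
	<-done
}

Moving the real io.Copy before the cgroup Apply, as suggested above, would hand the child its data too early, which is the race being discussed here.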

@lifubang (Member, Author) commented Oct 15, 2024

Maybe we can add another wait before ‘runc init’ clones new child processes, but I don't know whether it is worth doing.

EDIT: I tested it; there is no benefit, so it's not worth doing.

@lifubang (Member, Author) commented Oct 15, 2024

Furthermore, maybe we could put only the last ‘runc init’ process into the cgroup, because it's the real container process. But there is a kernel error when moving a process into a cgroup after it has joined a new pid/cgroup namespace.

EDIT: There seems to be no good way to fix this error, so it's also not worth doing.

@lifubang (Member, Author)

It seems there is no other way to speed this up, and it now has the same speed as the main branch.
I think we should focus on how to improve the speed in the main branch.

@lifubang (Member, Author)

I think #4020 can also be reverted in the main branch, but that needs more testing; we should do the revert in the main branch first.

The revert PR for the main branch is #4446; it seems to work fine.

@lifubang lifubang force-pushed the backport-join-cgroup-later branch from dedf481 to 2813f71 on October 15, 2024 09:57
As we will fix the race between the binary clone and the cgroup join, we can eliminate the
impact of memory accounting from ensure_clone_binary, so runc will keep the same low memory
usage as before.

This reverts commit 719e2bc.

Signed-off-by: lifubang <[email protected]>
@lifubang lifubang force-pushed the backport-join-cgroup-later branch from 2813f71 to e52d0d1 on October 15, 2024 09:58
@lifubang (Member, Author)

Can you revert #4423 here too?

I have reverted it.

@rata (Member) commented Oct 15, 2024

@kolyshkin I agree with @lifubang here. If we move it as you proposed here, then nsexec has all the info to parse the netlink data and might fork before we add it to the cgroup. Or is there some other point that will stop this race from happening?

I really don't see why this perf impact is a big issue. Isn't it about 10 ms per container, and therefore within the noise of the Go GC or other things we do?

@cyphar (Member) commented Oct 16, 2024

Maybe we can just backport the code that moved the copying logic rather than trying to come up with a bespoke fix for 1.1.x that isn't in 1.2.x? Or we can revert the removal of bindfd from 1.1.x.

(Sorry for not catching this when doing the original patch or the backport...)

@lifubang (Member, Author) commented Oct 16, 2024

rather than trying to come up with a bespoke fix for 1.1.x that isn't in 1.2.x?

In fact, it should also have been fixed in the main branch after #3931 was merged, instead of by #4020, and then backported to release-1.1.
But I'm sorry that when I wrote #4020, I didn't catch this race.
From this perspective, it's reasonable to fix it in release-1.1.

As you say, there are two other ways to fix this race:

  1. Backport the code that moved the copying logic from runc init to runc parent process;
  2. Revert [1.1] nsenter: cloned_binary: remove bindfd logic entirely #4392.
    

I think both are OK; it's just a matter of choice.

@rata (Member) commented Oct 16, 2024

@cyphar bindfd is causing a lot of real issues for us in 1.1. That is not a good way forward :(

Those other changes are huge to backport; it is quite a risky backport IMHO, and maybe time-consuming too. Why don't we do this trivial PR, which fixes exactly what we need? containerd at least (maybe other projects too) is waiting for this fix. And we can aim to release 1.2 soon.

If we fail to release 1.2 soon, we can consider that backport, but without the rush of having users waiting for a fix.

What do you think?

@kolyshkin (Contributor)

I think this fix makes sense as a stop-gap measure, in case there are issues. The full backport is complicated and is not really necessary.

@kolyshkin (Contributor) left a comment

lgtm

@rata (Member) left a comment

This LGTM. I have no strong opinion on doing a last 1.1 release, though. I leave this for others to decide.

mariash added a commit to cloudfoundry/guardian that referenced this pull request Nov 6, 2024
Inigo test is failing to set memory with "device is busy" error. It
looks like 1.1.15 has an issue where process joins cgroup before the
setup is finished - opencontainers/runc#4439
@thaJeztah (Member)

I was indeed curious whether we still want to do 1.1.x releases, or consider 1.2 to be the only maintained one. I'm not strongly against it, as it's still relatively soon after 1.2, but perhaps we need to document our policy somewhere (I'm assuming we don't want to continue maintaining both 1.1 and 1.2 side-by-side forever).

@lifubang (Member, Author)

I was indeed curious whether we still want to do 1.1.x releases, or consider 1.2 to be the only maintained one. I'm not strongly against it, as it's still relatively soon after 1.2, but perhaps we need to document our policy somewhere (I'm assuming we don't want to continue maintaining both 1.1 and 1.2 side-by-side forever).

Yes, please see #4549 (comment)

@kolyshkin (Contributor)

I think (and hope) we won't be doing another 1.1.x (1.1.16) because 1.2.x is good in general, with a few rough edges fixed in point releases.

Having said that, let's wait another couple of months before we definitely say "no".

@cyphar (Member) commented Dec 20, 2024

@thaJeztah Your comments in #4557 on this topic would be appreciated 😉. The current text in that PR implies that we are not required to do 1.0.x or 1.1.x releases. IMHO we should only consider doing a 1.1.x release at this point if there is a critical security issue discovered.

@thaJeztah (Member)

Oh! Thanks for the nudge @cyphar - I have some calls coming up, but will try to have a look (apologies again for replying slowly.. a bit overloaded at the moment).

@cyphar (Member) commented Dec 20, 2024

No rush, I just wanted to make sure you had a chance to give your input since you commented about this exact topic here. 😸
