[1.1] join the cgroup after the initial setup finished #4439
base: release-1.1
Conversation
@@ -407,6 +407,13 @@ func (p *initProcess) start() (retErr error) {
 		}
 	}()
 
+	// We should join the cgroup after the initial setup finished,
+	// but before runc init clone new children processes. (#4427)
+	err = <-waitInit
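For context: the waitInit channel comes from runc's initWaiter helper (mentioned later in this thread), which reports when runc init has finished its preliminary setup. A minimal sketch of the pattern, with hypothetical names rather than runc's exact code:

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// initWaiter returns a channel that receives nil once the child has
// written a single "setup done" byte (0) to the pipe, or an error if
// the read fails or the byte is unexpected.
func initWaiter(r io.Reader) <-chan error {
	ch := make(chan error, 1)
	go func() {
		defer close(ch)
		b := make([]byte, 1)
		n, err := r.Read(b)
		if err == nil {
			if n < 1 {
				err = errors.New("short read")
			} else if b[0] != 0 {
				err = fmt.Errorf("unexpected first byte %d, expected 0", b[0])
			} else {
				ch <- nil // initial setup finished
				return
			}
		}
		ch <- fmt.Errorf("waiting for init preliminary setup: %w", err)
	}()
	return ch
}

func main() {
	pr, pw := io.Pipe()
	waitInit := initWaiter(pr)
	go func() { _, _ = pw.Write([]byte{0}) }() // simulate runc init signalling readiness
	if err := <-waitInit; err != nil {
		fmt.Println("init setup failed:", err)
		return
	}
	fmt.Println("initial setup finished; safe to join the cgroup now")
}
```

Blocking on <-waitInit right before the cgroup join is what serializes the two steps, which is also the source of the performance question below.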
A noob question: will this have any performance impact, as the cgroup join and init are now not parallel?
Good question. I ran a test starting 100 containers:
runc-1.1.15: 3.025s
With this patch: 4.123s
So we need finer-grained coordination between runc init and the main process.
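For reference, a rough harness for this kind of measurement; the comment doesn't say exactly how the test was run, and the bundle path and container-name scheme below are made up:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	const n = 100
	start := time.Now()
	for i := 0; i < n; i++ {
		// "runc run" creates and starts the container, then waits for it to
		// exit; the bundle's config.json is assumed to run a no-op command.
		name := fmt.Sprintf("bench-%d", i)
		cmd := exec.Command("runc", "run", "--bundle", "/tmp/bench-bundle", name)
		if err := cmd.Run(); err != nil {
			log.Fatalf("starting container %s: %v", name, err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("started %d containers in %v (%v per container)\n",
		n, elapsed, elapsed/time.Duration(n))
}
```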
> Good question. I ran a test starting 100 containers:
> runc-1.1.15: 3.025s
> With this patch: 4.123s
Is this degradation within acceptable limits?
@opencontainers/runc-maintainers PTAL
Yes, I don't see why it wouldn't be. ~10 ms more to start a container ((4.123 s - 3.025 s) / 100 ≈ 11 ms) seems acceptable; I wouldn't be surprised if we have more noise from the Go GC or other code changes. Am I missing something?
IMO this should not be a backport, but rather an original PR targeted at release-1.1 specifically (see #4438 (comment)). In other words, it does not make sense for the main branch to have this code; for 1.1, this commit makes sense. Or we can port 0e9a335 and remove initWaiter entirely.
OK, please see my comments in #4438. If you still think it's not worthwhile for the main branch, I'll open it as a hotfix.
We should join the cgroup after the initial setup has finished, but before runc init clones new child processes (opencontainers#4427), because we should try our best to reduce the influence of memory cgroup accounting from all runc init processes before we start the container init process.
Signed-off-by: lifubang <[email protected]>
Force-pushed from ea5f8e0 to 34f1d3b.
@lifubang LGTM, thanks again for this fix!
I see this is not mentioned as a backport, so that change @kolyshkin requested is done too :)
@@ -407,6 +407,13 @@ func (p *initProcess) start() (retErr error) {
 		}
 	}()
 
+	// We should join the cgroup after the initial setup finished,
+	// but before runc init clone new children processes. (#4427)
+	err = <-waitInit
> Yes, I don't see why it wouldn't be. ~10 ms more to start a container ((4.123 s - 3.025 s) / 100 ≈ 11 ms) seems acceptable; I wouldn't be surprised if we have more noise from the Go GC or other code changes. Am I missing something?
Oh, I just realized while checking my TODO list: can you revert #4423 here too? Seeing that it's no longer needed would give us more confidence.
I think we can gain some of the lost speed back by moving the io.Copy of the bootstrapData together with waitInit. IOW:
diff --git a/libcontainer/process_linux.go b/libcontainer/process_linux.go
index ac3b104e..da4db4e2 100644
--- a/libcontainer/process_linux.go
+++ b/libcontainer/process_linux.go
@@ -407,6 +407,14 @@ func (p *initProcess) start() (retErr error) {
 		}
 	}()
 
+	if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
+		return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
+	}
+	err = <-waitInit
+	if err != nil {
+		return err
+	}
+
 	// Do this before syncing with child so that no children can escape the
 	// cgroup. We don't need to worry about not doing this and not being root
 	// because we'd be using the rootless cgroup manager in that case.
@@ -418,14 +426,6 @@ func (p *initProcess) start() (retErr error) {
 			return fmt.Errorf("unable to apply Intel RDT configuration: %w", err)
 		}
 	}
-	if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
-		return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
-	}
-	err = <-waitInit
-	if err != nil {
-		return err
-	}
-
 	childPid, err := p.getChildPid()
 	if err != nil {
 		return fmt.Errorf("can't get final child's PID from pipe: %w", err)
I did some simple testing (similar to what @lifubang did). Compared to current runc (1.1.15), seeing ~40% higher times without the suggested change, and ~25% higher times with it. So, yes, it helps a bit.
Maybe the suggested change will cause a new race? See runc/libcontainer/process_linux.go, lines 417-418 at commit 34f1d3b.
Maybe we can add another wait before ‘runc init’ clones new child processes, but I don't know whether it's worth doing. EDIT: I tested it; there is no benefit, so it's not worth doing.
Furthermore, maybe we can put only the last ‘runc init’ process into the cgroup, because it's the real container process. But the kernel returns an error when putting a process into a cgroup after that process has joined a new pid/cgroup namespace. EDIT: There seems to be no good way to fix this error, so it's also not worth doing.
There seems to be no other way to speed this up, and it now has the same speed as the main branch.
Force-pushed from dedf481 to 2813f71.
As we will fix the race between the binary clone and the cgroup join, we can eliminate the impact of memory accounting from ensure_clone_binary, so runc will keep the same low memory usage as before.
This reverts commit 719e2bc.
Signed-off-by: lifubang <[email protected]>
Force-pushed from 2813f71 to e52d0d1.
Reverted it.
@kolyshkin I agree with @lifubang here. If we move it as you proposed, then nsexec has all the info to parse the netlink data and might fork before we add it to the cgroup. Or is there some other point that stops this race from happening? I really don't see why this perf impact is a big issue. Isn't it ~10 ms per container, and therefore within the noise of the Go GC or other things we do?
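To make the race concrete: writing a PID into cgroup.procs moves only that one process, so any child it forked before the write stays in the old cgroup. A hypothetical illustration (assumes cgroup v2, root privileges, and a pre-created /sys/fs/cgroup/demo directory):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
)

// addToCgroup moves a single process into an existing cgroup v2 directory.
// Children the process forked before this call are NOT moved with it.
func addToCgroup(cgroupDir string, pid int) error {
	procs := filepath.Join(cgroupDir, "cgroup.procs")
	return os.WriteFile(procs, []byte(strconv.Itoa(pid)), 0o644)
}

func main() {
	// The shell forks a background child immediately; if that fork wins the
	// race against addToCgroup below, the child escapes the cgroup, which is
	// the situation this PR prevents by joining before runc init can clone.
	cmd := exec.Command("sh", "-c", "sleep 60 & wait")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	if err := addToCgroup("/sys/fs/cgroup/demo", cmd.Process.Pid); err != nil {
		fmt.Println("cgroup join failed:", err)
	}
	_ = cmd.Wait()
}
```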
Maybe we can just backport the code that moved the copying logic, rather than trying to come up with a bespoke fix for 1.1.x that isn't in 1.2.x? Or we can revert the removal of … (sorry for not catching this when doing the original patch or the backport...)
In fact, it should also be fixed in the main branch after #3931 was merged, instead of in #4020, and then backported to release-1.1.
I think both are OK; it's just a matter of choice.
@cyphar bindfd is causing a lot of real issues for us in 1.1; that is not a good way forward :( Those other changes are huge to backport; it is quite a risky backport IMHO, and maybe time-consuming too. Why don't we do this trivial PR, which fixes exactly what we need? containerd at least (and maybe other projects too) is waiting for this fix, and we can aim to release 1.2 soon. If we fail to release 1.2 soon, we can consider that backport, but without the rush of having users waiting for a fix. What do you think?
I think this fix makes sense as a stop-gap measure, in case there are issues. The full backport is complicated and is not really necessary.
lgtm
This LGTM. I have no strong opinion on doing a last 1.1 release, though; I'll leave that for others to decide.
Inigo test is failing to set memory with a "device is busy" error. It looks like 1.1.15 has an issue where the process joins the cgroup before the setup is finished: opencontainers/runc#4439
I was indeed curious whether we still want to do 1.1.x releases, or consider 1.2 to be the only maintained branch. Not strongly against, as it's still relatively soon after 1.2, but perhaps we need to document our policy somewhere (I'm assuming we don't want to continue maintaining both 1.1 and 1.2 side by side forever).
Yes, please see #4549 (comment).
I think (and hope) we won't be doing another 1.1.x (1.1.16), because 1.2.x is good in general, with a few rough edges fixed in point releases. Having said that, let's wait another couple of months before we definitely say "no".
@thaJeztah Your comments in #4557 on this topic would be appreciated 😉. The current text in that PR implies that we are not required to do 1.0.x or 1.1.x releases. IMHO we should only consider doing a 1.1.x release at this point if there is a critical security issue discovered.
Oh! Thanks for the nudge @cyphar. I have some calls coming up, but will try to have a look (apologies again for replying slowly... a bit overloaded currently).
No rush, I just wanted to make sure you had a chance to give your input since you commented about this exact topic here. 😸 |
We should join the cgroup after the initial setup has finished,
but before runc init clones new child processes (#4427),
because we should try our best to reduce the influence of
memory cgroup accounting from all runc init processes
before we start the container init process.
With this patch, the impact of memory accounting
from ensure_clone_binary is eliminated.