-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure we can always terminate the parent process on error #4355
base: main
Are you sure you want to change the base?
Conversation
77069ba
to
df3a94f
Compare
libcontainer/container_linux.go
Outdated
if process.Init { | ||
if err := ignoreTerminateErrors(process.ops.terminate()); err != nil { | ||
logrus.WithError(err).Warn("unable to terminate initProcess") | ||
} | ||
if err := c.cgroupManager.Destroy(); err != nil { | ||
logrus.WithError(err).Warn("unable to destroy cgroupManager") | ||
} | ||
if c.intelRdtManager != nil { | ||
if err := c.intelRdtManager.Destroy(); err != nil { | ||
logrus.WithError(err).Warn("unable to destroy intelRdtManager") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the info in the PR and this github diff, it's not obvious why this makes sense. Can you please elaborate why before in many places we were not calling all of this and now we do? For example, the cgroup manager destroy wasn't called in all the code-paths before.
Was it missing? It wasn't needed but it's idempotent so it is fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think whether these destroy methods are called or not both are OK.
We should not have different results in one function without a specific reason, for example: in container.Run()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think we should not destroy them in before, because runc doesn’t delete the failure container automatically, users must have to use ‘runc delete’ to destroy the failure container created by ‘runc create’ or ‘runc run’. How about remove these destroy methods call in here? I think it has no compatibility problems. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, not sure I followed. Why do you want to remove these methods?
Also, what is the state the container is left? The delete operation must only work for stopped containers: https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#delete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, not sure I followed. Why do you want to remove these methods?
Yes, these destroy methods can't be removed, because we should destroy the cgroup & intelRdt manager manually if we haven't saved container's state yet.
@@ -229,6 +239,30 @@ func (c *Container) Exec() error { | |||
return c.exec() | |||
} | |||
|
|||
// terminate is to kill the container's init/exec process when got failure. | |||
func (c *Container) terminate(process *Process) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it simple to add tests? Maybe for this function. We can fake the process interface and make it return an error, and test everything that needs to happen, indeed happens?
This might not add a lot of value now, but it will if we refactor this code in the future.
8dcd42b
to
5a06eb6
Compare
I think this is really a bug, but we will hit it with a very very tiny probability, do we think we want this PR in next 1.2.0 release candidate? |
5a06eb6
to
150c32f
Compare
libcontainer/container_linux.go
Outdated
func (c *Container) Start(process *Process) error { | ||
func (c *Container) Start(process *Process) (retErr error) { | ||
c.m.Lock() | ||
defer c.m.Unlock() | ||
defer func() { | ||
if retErr != nil { | ||
c.terminate(process) | ||
} | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, this looks overcomplicated, given that this function only calls c.start.
An alternative would be
func (c *Container) Start(process *Process) error {
c.m.Lock()
defer c.m.Unlock()
if err := c.start(process); err != nil {
c.terminate(process)
return err
}
}
libcontainer/container_linux.go
Outdated
c.terminate(process) | ||
} | ||
}() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to 150c32f#r1866835966, there's no need for a defer here. Something like this would work:
if !process.Init {
return nil
}
err := c.exec()
if err != nil {
c.terminate(process)
}
return err
As we all know, we should terminate the parent process if there is an error when starting the container process, but these terminate function are called in many places, for example: `initProcess`, `setnsProcess`, and `Container`, if we forget this terminate call for some errors, it will let the container in unknown state, so we should change to call it in some final places. Signed-off-by: lifubang <[email protected]>
150c32f
to
8ff0b71
Compare
As we all know, we should terminate the parent process if there is an error when starting the container process,
but these terminate function are called in many places, for example:
initProcess
,setnsProcess
, andContainer
,if we forget this terminate call for some errors, it will let the container in unknown state, so we should change to
call it in some final places.
One possible place that missing terminate action:
https://github.com/opencontainers/runc/blob/v1.2.0-rc.2/libcontainer/container_linux.go#L357-L360