Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure we can always terminate the parent process on error #4355

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lifubang
Copy link
Member

As we all know, we should terminate the parent process if there is an error when starting the container process,
but these terminate function are called in many places, for example: initProcess, setnsProcess, and Container,
if we forget this terminate call for some errors, it will let the container in unknown state, so we should change to
call it in some final places.

One possible place that missing terminate action:
https://github.com/opencontainers/runc/blob/v1.2.0-rc.2/libcontainer/container_linux.go#L357-L360

@lifubang lifubang force-pushed the fix-no-terminate-onerror branch from 77069ba to df3a94f Compare July 22, 2024 11:12
libcontainer/container_linux.go Outdated Show resolved Hide resolved
Comment on lines 247 to 260
if process.Init {
if err := ignoreTerminateErrors(process.ops.terminate()); err != nil {
logrus.WithError(err).Warn("unable to terminate initProcess")
}
if err := c.cgroupManager.Destroy(); err != nil {
logrus.WithError(err).Warn("unable to destroy cgroupManager")
}
if c.intelRdtManager != nil {
if err := c.intelRdtManager.Destroy(); err != nil {
logrus.WithError(err).Warn("unable to destroy intelRdtManager")
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the info in the PR and this github diff, it's not obvious why this makes sense. Can you please elaborate why before in many places we were not calling all of this and now we do? For example, the cgroup manager destroy wasn't called in all the code-paths before.

Was it missing? It wasn't needed but it's idempotent so it is fine?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think whether these destroy methods are called or not both are OK.
We should not have different results in one function without a specific reason, for example: in container.Run().

Copy link
Member Author

@lifubang lifubang Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think we should not destroy them in before, because runc doesn’t delete the failure container automatically, users must have to use ‘runc delete’ to destroy the failure container created by ‘runc create’ or ‘runc run’. How about remove these destroy methods call in here? I think it has no compatibility problems. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not sure I followed. Why do you want to remove these methods?

Also, what is the state the container is left? The delete operation must only work for stopped containers: https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#delete

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not sure I followed. Why do you want to remove these methods?

Yes, these destroy methods can't be removed, because we should destroy the cgroup & intelRdt manager manually if we haven't saved container's state yet.

libcontainer/process_linux.go Show resolved Hide resolved
@@ -229,6 +239,30 @@ func (c *Container) Exec() error {
return c.exec()
}

// terminate is to kill the container's init/exec process when got failure.
func (c *Container) terminate(process *Process) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it simple to add tests? Maybe for this function. We can fake the process interface and make it return an error, and test everything that needs to happen, indeed happens?

This might not add a lot of value now, but it will if we refactor this code in the future.

@lifubang lifubang force-pushed the fix-no-terminate-onerror branch 2 times, most recently from 8dcd42b to 5a06eb6 Compare September 23, 2024 12:00
@lifubang
Copy link
Member Author

One possible place that missing terminate action:
https://github.com/opencontainers/runc/blob/v1.2.0-rc.2/libcontainer/container_linux.go#L357-L360

I think this is really a bug, but we will hit it with a very very tiny probability, do we think we want this PR in next 1.2.0 release candidate?

@lifubang lifubang force-pushed the fix-no-terminate-onerror branch from 5a06eb6 to 150c32f Compare September 26, 2024 05:00
Comment on lines 204 to 211
func (c *Container) Start(process *Process) error {
func (c *Container) Start(process *Process) (retErr error) {
c.m.Lock()
defer c.m.Unlock()
defer func() {
if retErr != nil {
c.terminate(process)
}
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, this looks overcomplicated, given that this function only calls c.start.

An alternative would be

func (c *Container) Start(process *Process) error {
	c.m.Lock()
	defer c.m.Unlock()
	if err := c.start(process); err != nil {
		c.terminate(process)
		return err
	}
}

c.terminate(process)
}
}()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to 150c32f#r1866835966, there's no need for a defer here. Something like this would work:

	if !process.Init {
		return nil
	}
	err := c.exec()
	if err != nil {
		c.terminate(process)
	}
	return err

As we all know, we should terminate the parent process if there is an
error when starting the container process, but these terminate function
are called in many places, for example: `initProcess`, `setnsProcess`,
and `Container`, if we forget this terminate call for some errors, it
will let the container in unknown state, so we should change to call it
in some final places.

Signed-off-by: lifubang <[email protected]>
@lifubang lifubang force-pushed the fix-no-terminate-onerror branch from 150c32f to 8ff0b71 Compare December 11, 2024 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants