Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a service to call signpost when we determine that the boot was successful #481

Merged
merged 1 commit into from
Nov 15, 2019

Conversation

jamieand
Copy link
Contributor

@jamieand jamieand commented Nov 4, 2019

This implements a method to measure whether or not the host booted successfully, and then call signpost when that is achieved. It consists of a new unit in the updater:

  • mark-successful-boot.service - This unit will have dependencies injected into it by services that are required for a successful boot, by use of the RequiredBy= option in those services.

In addition to this new service, changes are required to services that we want to fail the boot if they can't start:

  • The service must be of Type=notify to ensure that it does not transition to Active if it fails and systemd restarts it. This could require patches for services that don't already support sd_notify.
  • The service must include RequiredBy=mark-successful-boot.service in its [Install] section.

This change includes on-boarding kubelet.service to this way of boot measurement.

Fixes Issue #86

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link
Contributor

@tjkirch tjkirch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level, I have a hard time understanding the need for three unit files, and understanding the connection between them; I think it'd be good to put a brief explanation in the files themselves.

packages/workspaces/measure-successful-boot.timer Outdated Show resolved Hide resolved
packages/workspaces/measure-successful-boot.timer Outdated Show resolved Hide resolved
packages/workspaces/measure-successful-boot.timer Outdated Show resolved Hide resolved
packages/workspaces/measure-successful-boot.timer Outdated Show resolved Hide resolved
packages/workspaces/measure-successful-boot.service Outdated Show resolved Hide resolved
packages/workspaces/mark-successful-boot.service Outdated Show resolved Hide resolved
@tjkirch
Copy link
Contributor

tjkirch commented Nov 4, 2019

Jamie and I had some further out-of-band discussion based on my questions above. One point that helps clarify is that the measure timer and service are effectively for humans - they make sure that systemctl status doesn't show running if we didn't mark this as a successful boot. The mark service works by itself, in that it doesn't call signpost if we didn't get our success notifications, and it doesn't depend on measure, but if you look at the system via systemctl status you might still think everything is fine. I suggested some possible naming changes that might clarify this distinction.

@tjkirch
Copy link
Contributor

tjkirch commented Nov 7, 2019

Jamie and I had some further further out-of-band discussion. We landed on the idea of making kubelet (or other services that we deem required for boot) retry a set number of times (per failure incident) rather than forever. That would allow us to have services (namely mark-successful-boot) depend on them normally, and the system status would be degraded if they fail, rather than requiring out-of-band methods like timers and systemctl calls to determine whether boot was successful at some arbitrary time after start.

@jamieand jamieand force-pushed the mark-successful-boot branch from 676ac97 to a4b4ed4 Compare November 7, 2019 21:33
@jamieand
Copy link
Contributor Author

jamieand commented Nov 7, 2019

Removed measure-successful-boot.timer and measure-successful.service. This reduces the scope of this change to calling signpost if the boot succeeded. Rolling back and observing or reporting on the health of the host will be implemented in future changes.

Copy link
Contributor

@tjkirch tjkirch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see the description updated to reflect the new smaller scope, and the little cleanup with the Wants line, but no blockers from me!

@jamieand jamieand force-pushed the mark-successful-boot branch from a4b4ed4 to f25198b Compare November 14, 2019 21:33
@jamieand
Copy link
Contributor Author

Removed the Wants= from mark-successful-boot.service

@jamieand jamieand removed the request for review from iliana November 15, 2019 16:54
@tjkirch tjkirch merged commit 7342e4e into develop Nov 15, 2019
@tjkirch tjkirch deleted the mark-successful-boot branch November 15, 2019 18:57
@iliana iliana added this to the v0.2.0 milestone Nov 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants