Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[refinery] Add option to run refinery as a statefulset #237

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kamalmarhubi
Copy link

@kamalmarhubi kamalmarhubi commented Apr 19, 2023

Using a statefulset allows giving the pods a stable network identity. Setting setHostnameAsFQDN means that this stable network identity is what os.Hostname reports. Together, they allow using the hostname in the peer list, so that the peer list remains stable even as pods are rescheduled. This improves trace routing stability during refinery upgrades and Kubernetes cluster operations (upgrades / scale-downs), and even makes it possible to run refinery with an affinity preference for spot instances.

@kamalmarhubi
Copy link
Author

We've been running a statefulset on spot instances in GKE for a couple of days and it looks like it's working out ok.

Main reason this is a draft: should I instead put the statefulset in a separate file and instead avoid repetition via a pod template helper + some mustMerge stuff?

@kamalmarhubi kamalmarhubi force-pushed the refinery-sts branch 2 times, most recently from 263406d to c429925 Compare April 19, 2023 18:05
@TylerHelmuth
Copy link
Contributor

Main reason this is a draft: should I instead put the statefulset in a separate file and instead avoid repetition via a pod template helper + some mustMerge stuff?

Ya lets have separate deployment.yaml and statefulset.yaml and try to reuse what we can with helper template. A little bit of duplication is fine if it means simplicity..

@kamalmarhubi kamalmarhubi force-pushed the refinery-sts branch 3 times, most recently from 373880f to 41e6685 Compare June 6, 2023 20:56
@kamalmarhubi
Copy link
Author

kamalmarhubi commented Jun 6, 2023

@TylerHelmuth I finally got around to this. I had to debug some nindent wrongness* in _pod.tpl. I think I got it right, but I only tried our setup, where it didn't result in any diff from the pre-split version (after rebasing onto current HEAD). I don't know if some other sets of values might have annoying nindent stuff wrong, but I hope not.

‌* I love text templating white-space–sensitive structured data 🙈

@kamalmarhubi kamalmarhubi marked this pull request as ready for review June 6, 2023 21:02
@kamalmarhubi kamalmarhubi requested a review from a team as a code owner June 6, 2023 21:02
Using a statefulset allows giving the pods a stable network identity.
Setting `setHostnameAsFQDN` means that this stable network identity is
what `os.Hostname` reports. Together, they allow using the hostname in
the peer list, so that the peer list remains stable even as pods are
rescheduled. This improves trace routing stability during refinery
upgrades and Kubernetes cluster operations (upgrades / scale-downs),
and even makes it possible to run refinery with an affinity preference
for spot instances.
@kamalmarhubi
Copy link
Author

@TylerHelmuth fyi I just rebased this on top of the current main so we could upgrade to refinery 2.x. We've been running refinery as a statefulset for slightly over 3 months at this point with no real issues.

@TylerHelmuth
Copy link
Contributor

@kamalmarhubi thats awesome! I'm glad it is working well. There is some other 2.x cleanup we are prioritizing right now but we'll get back to this PR after.

@kamalmarhubi
Copy link
Author

@TylerHelmuth bumpty bump on this PR. We're upgrading refinery and I'm wondering if I should rebase this once more or not.

It kind of sounds like the pub/sub changes in 2.7, specifically sharing peer membership, makes cluster membership changes less disruptive; is that right? I think I'd prefer stable pod identity, but I'd also like to stop maintaining a long-running fork of the official chart—whether by rebasing once more and merging, or dropping the statefulset option.

Let me know which works best for you!

@TylerHelmuth
Copy link
Contributor

@kamalmarhubi you're right, and 2.8 should expand on those features. There is working ongoing in refinery to make scaling it up and down smoother. Let's wait for that work to complete and then revisit if statefulset is still something we'll need to support.

@kamalmarhubi
Copy link
Author

@TylerHelmuth sounds good. How does the cluster respond to non-resize membership changes under 2.7? More precisely, how disruptive is it if the membership changes as the deployment's pods are cycled and come up with new names? I recall this was fairly ungood in the past.

@TylerHelmuth
Copy link
Contributor

@kamalmarhubi in 2.7 peer membership updates are now pub/sub, so it happens quicker, but the refinery instances still have to rebalance when that happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants