RFE: finish Distributed architecture support #3
Comments
Re TLS: the only option I see now is to create a wildcard certificate for the pod DNS names. Meanwhile, chances are very high that the default hostname won't work for RPC since it's just the hostname of the node (because of the host networking).
In a distributed setup, conductor hostnames will be used for RPC. Currently, the IP address is used which may make TLS configuration much harder. With this change, certain tricks are possible, e.g. using pod DNS names (`<ip>.<namespace>.pod.cluster.local`) combined with a wildcard certificate (`*.<namespace>.pod.cluster.local`). See also metal3-io/ironic-standalone-operator#3 Signed-off-by: Dmitry Tantsur <[email protected]>
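As a purely illustrative sketch of that trick, a wildcard certificate for the pod DNS names could be requested roughly like this, assuming cert-manager (or something equivalent) is available; the namespace, issuer and secret names below are placeholders:

```yaml
# Hypothetical cert-manager Certificate covering all pod DNS names in one
# namespace, so any conductor pod can present it for JSON RPC.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ironic-rpc-wildcard        # placeholder name
  namespace: baremetal-operator-system
spec:
  secretName: ironic-rpc-tls       # Secret the pods would mount
  dnsNames:
    - "*.baremetal-operator-system.pod.cluster.local"
  issuerRef:
    name: ironic-ca-issuer         # placeholder Issuer
    kind: Issuer
```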
Re dnsmasq: I'm thinking of running one instance, but we could also use some fancy subnet magic to run 3 of them. Do we want to? In OpenShift we probably don't... Without that, iPXE can be weird though.
Hostname preparation work: metal3-io/ironic-image#449
iPXE: we could loop over all control plane nodes in the initial script, trying to load the 2nd stage script. For that, we need to know the IP addresses of all control plane nodes, which is probably doable via the Ironic DaemonSet?
I feel like a lot of context is missing here. What is the goal? Which containers are we talking about? Which of them need to be together in one pod and which can be separate? Which need host networking? Have we properly considered more cloud-native alternatives (e.g. LoadBalancers)? Regarding TLS and JSON RPC, I think it is worth noting that StatefulSets have predictable hostnames without IP addresses so you would not need wildcard certificates. The most cloud-native solution for mTLS is probably a service mesh though. I would not want to make that a requirement, but it could be a good idea to at least take some ideas from that area.
@lentzi90 updated the description with a lot of text. Sorry, should have done it from the beginning - I forgot that not everyone was in the performance & scale subteam discussions.
I'm not sure what that gives us: Ironic API is load balanced already.
Yeah, I've considered them. I think their limitations may be quite painful in our case.
Me neither... I'm also not sure what it solves for us: Ironic already maintains a list of its peers, we still need to configure TLS properly.
Thank you! 😊
Sorry, I should have explained a bit more what I meant. From my perspective, host networking is a bit problematic so I have been thinking about alternatives. In the CI we currently set up a VIP and use keepalived to move it between nodes as needed. A more cloud-native way of doing this would be to use a Service of type LoadBalancer. There are a few implementations that will work on baremetal, e.g. Metallb. The point is that we would get an "external" IP without host networking, which should help with some of the issues.
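To make that concrete, a Service of type LoadBalancer is only a small change to the usual Service manifest; a rough sketch (the name and label are placeholders, the port is the Ironic API port mentioned later in this thread):

```yaml
# Hypothetical Service: the load balancer implementation (e.g. MetalLB)
# assigns an "external" IP, so API access no longer depends on host networking.
apiVersion: v1
kind: Service
metadata:
  name: ironic-api                 # placeholder name
  namespace: baremetal-operator-system
spec:
  type: LoadBalancer
  selector:
    app: ironic                    # placeholder label
  ports:
    - name: api
      port: 6385
      targetPort: 6385
```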
Anything in particular? The headless service? I think it would not affect us much in the current configuration at least, since we use the host network anyway. I may easily be missing something though.
I was mostly thinking that we could take some ideas for how to configure TLS from them. Most of them work so that they inject a sidecar container that (together with a central controller) set up TLS for all pods. Each pod gets their own unique certificate and the sidecar basically acts as a proxy for the other containers in the pod. All traffic goes through it and it handles TLS for the other containers transparently. This means that the application does not even need to know about it. Obviously Ironic already handles TLS, but perhaps we can get an idea for how to generate the certificates "on the fly" like with a service mesh.
I don't think we can rely on a LoadBalancer being present, especially when the cluster is bootstrapping itself. (The downside of Kubernetes: not a lot of things can be assumed to be present...) I've considered a HostPort service, but the unpredictable port is a no-go.
It's a good improvement, but I don't think it helps with any of these issues? Furthermore, I don't think we can use dnsmasq without host networking. Nor support provisioning networks at all.
The service is easy to create, although I'm not sure what sense it makes for us. I'm worried about using persistent volumes, as well as the limitations around clean-up. If you have a more detailed guide on StatefulSets, I'd be happy to read it - the kubernetes docs are notoriously brief.
It's interesting, do you have a good write-up on this as well? We already have httpd responsible for TLS. If we could make it generate certificates... I'm not sure how signing will work though. By reaching out to the cert-service from inside the container? But who will approve the CSR?
Good discussion! I can see a way forward here! Let me see if I can get some kind of prototype up to see how/if it works with a LoadBalancer. This is possible to do also on minikube or kind so I don't think bootstrapping is an issue. It is just a matter of moving the IP from the bootstrap cluster to the self managed cluster, just like we do today with keepalived.
It should be possible to pick the port. However, there is always the risk of someone else trying to use the same port or it already being in use. 🙁
Unfortunately I don't have a good doc. 🙁 However, I don't think volumes are required. It is just the most common use case for StatefulSets so they appear in all examples...
Unfortunately I don't have very deep knowledge on how it works. I think this should answer most questions about how Istio does it though: https://istio.io/latest/docs/concepts/security/
While looking at ingress in OpenShift, I've seen this issue: ingress controllers are deployed on workers, but workers are not up yet when Metal3 is deploying them. Let's make sure we don't end up in this situation.
Only from a strange pre-defined range. We need to be able to tell admins in advance which ports they must enable in the firewalls. Ideally, we should converge to one well-known port (currently, OpenShift uses four: 6385/ironic, 5050/inspector, 6180/httpd, 6183/httpd+tls, with inspector gone soon).
Re dnsmasq and iPXE: I think something like https://bugs.launchpad.net/ironic/+bug/2044561 is doable and solves the issue with finding the right Ironic instance. Let us see what the community thinks.
Meanwhile, adding support for unix sockets in JSON RPC, if we need that: https://review.opendev.org/c/openstack/ironic-lib/+/901863
Re dnsmasq and iPXE: it's possible that Ironic actually has all we need. It now features a way to manage dnsmasq. We could run 3 dnsmasq's with host-ignore:!known and use this feature to populate host files on the appropriate Ironic instance only with the required options, instead of relying on a static dnsmasq.conf. This will complicate adding new host auto-discovery if we ever decide to support that. The last time this feature was requested, it was rejected. Another problem: we'd need disjoint DHCP ranges for different dnsmasq instances, otherwise they'll try to allocate the same IP to different nodes. It's a matter of configuration though.
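For illustration only, the dnsmasq side of that idea might look roughly like the following (wrapped in a ConfigMap; the interface, paths and range are made up, and the option names are standard dnsmasq ones):

```yaml
# Hypothetical per-instance dnsmasq configuration: ignore every host that has
# no dhcp-host entry, and let Ironic (or a helper) drop per-node host files
# into a watched directory instead of editing a static dnsmasq.conf.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ironic-dnsmasq             # placeholder name
  namespace: baremetal-operator-system
data:
  dnsmasq.conf: |
    port=0                                      # DHCP/TFTP only, no DNS
    interface=eth0                              # placeholder provisioning interface
    dhcp-range=192.168.111.100,192.168.111.149  # placeholder; must not overlap with other instances
    dhcp-ignore=tag:!known                      # answer only hosts this instance knows about
    dhcp-hostsdir=/etc/dnsmasq.d/hosts          # per-node dhcp-host files written at runtime
```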
Disjoint DHCP ranges can become a problem: how to pass the right subrange into pods? The only hack I can think of is for the operator to list control plane nodes and prepare a mapping nodeIP->subrange. Then the dnsmasq start-up script can pick the right one based on which hostIP it actually got. May get ugly with dual-stack... An obvious problem: node replacement will always cause a re-deployment.
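Just to sketch the hack: the operator could publish the mapping as a ConfigMap and the dnsmasq start-up script would pick its own entry by host IP. Everything below is made up for illustration:

```yaml
# Hypothetical mapping: host IP of each control plane node -> the DHCP
# subrange its local dnsmasq is allowed to hand out. The start-up script would
# look up the hostIP it actually got (e.g. via the downward API) and select
# the matching range; replacing a node means regenerating this map.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ironic-dhcp-subranges      # placeholder name
  namespace: baremetal-operator-system
data:
  "192.168.111.11": "192.168.111.100,192.168.111.149"
  "192.168.111.12": "192.168.111.150,192.168.111.199"
  "192.168.111.13": "192.168.111.200,192.168.111.249"
```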
I have not checked the architecture yet, but shouldn't there be a CR for each Ironic instance? So what I mean is, I would expect to see… EDIT:
There is nothing "Ironic" there, it's just a normal DaemonSet/Deployment resulting in Pods. We've discovered at the meeting that we can annotate pods, and the deployment won't undo that. The missing part would be: how to pass annotations to a pod after it's started. This is a challenge. Maybe we need a sidecar container that manages DHCP ranges and generally manages dnsmasq?
This was further discussed at the community meeting and on other channels. I am not sure that the best way to do this would be a sidecar; maybe configuring distributed layer 2 network services would require its own controller, and maybe we shouldn't put this on the back of Ironic, neither on a sidecar nor on the existing dnsmasq container. Please forgive me if the nomenclature is not correct (I hope it is, but I'm not sure); I'd just like to convey the general concept.
For the JSON RPC TLS, as I mentioned at the community meeting, my first idea would be just regular K8s services for each Ironic pod.
That would surely work, but it does not sound like good kubernetes practice? Why is it better than relying on…
I've considered that, but in a non-distributed world, dnsmasq must live inside the metal3 pod.
Okay, so for the sake of making progress, here is where we stand with the HA MVP:
I have played a bit with a loadbalancer instead of host networking here: https://github.com/lentzi90/playground/tree/ironic-loadbalancer#metal3 If you want to try it:
Is there a way to tell Inspector to just use the IP I give to reach Ironic, without also requiring this to be the IP that is associated with the listening interface? Edit: I should mention that this is similar to the BMO e2e setup where libvirt takes care of dnsmasq. I.e. there is no dnsmasq container in the Ironic Pod. I can try the same with dnsmasq in the Pod also of course but at this point I don't see how it would be useful.
Isn't a loadbalancer optional in kubernetes? If yes, we need a solution for the cases when 1) it is not present, 2) a different implementation than metallb is present. For OpenShift, I've hacked together a poor man's load balancer based on httpd that is run as a daemonset: https://github.com/openshift/ironic-image/blob/master/ironic-config/apache2-proxy.conf.j2. I'm not proud of it, but it works. Most critically, it works even when the cluster is heavily degraded.
Well, yes loadbalancers are optional and metallb is one implementation that can be used. So I would say that if there is no loadbalancer implementation, then the solution is to add metallb or kube-vip for example. The implementation should not matter since they all implement the same API. Is there a benefit to hacking together a custom poor man's load balancer just for Ironic, instead of using an off-the-shelf implementation? From what I understand they practically do the same thing. 🤔
Do off-the-shelf load balancers operate when the cluster is degraded so much that it has no workers? I was told they don't (maybe the one we use does? not sure).
That would depend on configuration. The default manifests that I used for metallb have tolerations on the daemonset so it will run on control-plane nodes. However, the controller that is reconciling the API objects does not, so it would not run if all nodes have control-plane taints. On the other hand, if you have a management cluster in the cloud, or let's say on OpenStack, that provides the load balancer, then it is completely unaffected by the cluster state. What I'm trying to say is that this all comes down to config. The custom implementation works in a given situation. I bet that we could make metallb or kube-vip work in the same way. And it is possible to break/circumvent any load balancer implementation by configuring the cluster or workload in a bad way. I'm not sure if it makes sense for Ironic, but what I would generally expect in an operator like this is that it assumes there is a load balancer implementation already in place. Then the operator can work from this. For the cluster admin, this means it is possible to choose how to do it. They can use metallb, a custom thing or maybe even run in a cloud that provides it. Quite flexible and nice. With the custom implementation "hard coded" it would be a pain to switch to another implementation I guess?
I'll just tie in the ironic-image issue here that I think is relevant: metal3-io/ironic-image#468
Adding links from discussions at KubeCon. This is about wildcard certificates for headless services:
Hi folks! Let me do a recap of our hallway discussions during KubeCon EU. After some back-and-forth, we seem to be converging towards StatefulSets. I have done experiments with Kind locally, and it seems that our previous concerns are unfounded: EmptyDir volumes work, and deleting a StatefulSet works. And (completely undocumented) it is possible to use a normal (not headless) service with StatefulSets, which will take both roles: a load-balanced service DNS name and pod-specific DNS names. That means we can generate TLS certificates for both. There has been some progress on the load balancing discussion, but it still has unsolved issues we need to dive into.
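A minimal sketch of that shape, in case it helps the discussion (the image, names and replica count are placeholders, not a proposed manifest):

```yaml
# Hypothetical StatefulSet governed by a normal (non-headless) Service.
# As described above, "ironic" would resolve to a load-balanced ClusterIP,
# while ironic-0.ironic, ironic-1.ironic, ... give stable per-pod names that
# JSON RPC certificates can be issued against.
apiVersion: v1
kind: Service
metadata:
  name: ironic
  namespace: baremetal-operator-system
spec:
  selector:
    app: ironic
  ports:
    - name: api
      port: 6385
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ironic
  namespace: baremetal-operator-system
spec:
  serviceName: ironic              # the Service above, deliberately not headless
  replicas: 3
  selector:
    matchLabels:
      app: ironic
  template:
    metadata:
      labels:
        app: ironic
    spec:
      containers:
        - name: ironic
          image: quay.io/metal3-io/ironic   # placeholder image reference
          ports:
            - containerPort: 6385
```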
Note: the discussion around load balancers and host networking has been split into #21.
/triage accepted
An obvious problem with StatefulSets: they don't work as DaemonSets (who even decided these things are orthogonal??).
Can we maybe use HostAliases to allow conductors to talk to each other? https://kubernetes.io/docs/tasks/network/customize-hosts-file-for-pods/ |
I'm not following. What is the issue with StatefulSets not working as DaemonSets? I guess you want to spread the pods so they don't run on the same host? But do you mean that you want a one-to-one mapping between pods and nodes as a DaemonSet would give? That can probably be useful in some situations but I'm not sure it would be the "default use-case". With StatefulSets, we would get a "cloud native" configuration:
With DaemonSets, we could do a more "traditional" configuration:
About HostAliases, I think we can use it but I don't get why/how? If we can configure a HostAlias, can we not then just use the IP directly anyway?
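For reference, this is roughly what a hostAliases entry looks like in a pod spec; the IPs and conductor names below are made up, and as noted it only helps if we already know the IPs anyway:

```yaml
# Hypothetical pod spec fragment: static /etc/hosts entries so conductors can
# reach each other by a stable name even without matching DNS records.
spec:
  hostAliases:
    - ip: "192.168.111.11"
      hostnames:
        - "ironic-conductor-0"
    - ip: "192.168.111.12"
      hostnames:
        - "ironic-conductor-1"
    - ip: "192.168.111.13"
      hostnames:
        - "ironic-conductor-2"
  containers:
    - name: ironic
      image: quay.io/metal3-io/ironic       # placeholder image reference
```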
I think we can use this solution.
IronicStandalone contains http, dnsmasq, ironic and ramdisk-logs. Only dnsmasq and ironic need host networking: dnsmasq uses the network to provide DHCP, and ironic uses it to set boot and power status and for other functions.
Yes, otherwise we introduce a dependency on the host network removal, which is still far away. With host networking, it's more or less a requirement.
This is interesting, but I wonder if it's maintained.
Let's not mix the host networking question into the picture please.
I've just realized we have a possible stopgap measure for JSON RPC TLS. We can make ironic-standalone-operator generate a CA and then make each Ironic instance generate and sign its own certificate based on its IP address. Yes, it's not pretty, but it can work right now. Then only the Ironic boot configuration API will be a requirement. |
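Very roughly, the per-instance half of that stopgap could be an init container that mounts the operator-generated CA and signs a certificate for its own pod IP. Everything below (image, paths, validity) is an assumption, just to show the shape:

```yaml
# Hypothetical init container: create a key and CSR for this pod's IP and sign
# it with a CA that ironic-standalone-operator would generate into a Secret.
initContainers:
  - name: generate-rpc-cert
    image: quay.io/metal3-io/ironic          # placeholder; anything with openssl
    env:
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
    command:
      - /bin/sh
      - -c
      - |
        set -eu
        openssl req -new -newkey rsa:2048 -nodes \
          -keyout /certs/tls.key -out /tmp/rpc.csr -subj "/CN=${POD_IP}"
        printf 'subjectAltName=IP:%s' "${POD_IP}" > /tmp/san.cnf
        openssl x509 -req -in /tmp/rpc.csr -CA /ca/tls.crt -CAkey /ca/tls.key \
          -CAcreateserial -days 365 -extfile /tmp/san.cnf -out /certs/tls.crt
    volumeMounts:
      - name: rpc-ca                         # Secret with the operator-generated CA
        mountPath: /ca
      - name: rpc-certs                      # emptyDir shared with the ironic container
        mountPath: /certs
```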
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale |
/remove-lifecycle stale |
/lifecycle frozen
The fact that we only run 1 Ironic instance is somewhat unfortunate: Ironic has built-in active/active HA that spreads the load evenly in the cluster by assigning each Node to one Ironic instance. Ironic also has a take-over process making sure that Nodes never go orphaned. On the other hand, due to its usage of green threads, each Ironic instance only uses 1 CPU core. Having 3 instances will improve CPU utilization. For instance, CERN manages around 10k Nodes through its 9 (at some point, not sure about the current state) Ironic conductors.
An easy step to take is to use a DaemonSet for Ironic instead of the Deployment. We will need to drop Inspector because, unlike Ironic, it's not HA-ready. The new inspection implementation won't have this issue. I believe that will give us virtual media deployments without a provisioning network right away. We will need to sort out JSON RPC since all Ironic instances need to talk to each other. If we use the pods' cluster IPs, TLS may be an issue since they're not quite predictable.
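A rough sketch of that step, keeping the current host networking (the labels, tolerations and image are placeholders):

```yaml
# Hypothetical DaemonSet running one Ironic conductor per control plane node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ironic
  namespace: baremetal-operator-system
spec:
  selector:
    matchLabels:
      app: ironic
  template:
    metadata:
      labels:
        app: ironic
    spec:
      hostNetwork: true
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: ironic
          image: quay.io/metal3-io/ironic   # placeholder image reference
```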
DHCP is also a problem. If we run one dnsmasq, we won't be able to direct a Node to the iPXE server of the Ironic instance that handles it (the Ironic API itself is fine: any Ironic will respond correctly for any Node, redirecting the request internally through the RPC). Not without changes to Ironic, at least. If we run several dnsmasq instances, that's still a problem: the request will land on a random one. Also, the networking configuration will be a challenge.
(`my-pod.my-namespace.pod` … to make TLS usable)