HA Thespian #62
Thanks for the nudge, @pjz. Thespian is already moderately resilient to failures in member nodes for the Convention, it's just the Convention Leader that is a singular component that must be running for the rest of the Convention to operate correctly. The techniques provided by PySyncObj could be used to propagate convention-related information amongst potential convention leaders, but there are still higher level behaviors and operational concerns to work out. I'm going to start documenting those here as a way to gather and identify the information and decisions that must be made before the point where this work is undertaken:
Please feel free to update this issue with comments on either the above or with additional considerations not yet captured above. I'm particularly interested in use-cases where the ability to migrate Convention leadership would be useful and effective for people: there are a lot of potential aspects to this, so being able to focus the efforts on the most applicable use-cases will help drive some of the design decisions for this.
There are definitely a lot of potential edge cases, some of them easier to address than others. Maybe start with the easy solutions and work up from there?
FWIW, my use case is fairly simple: essentially a streaming web service with some, but not a lot, of synchronization necessary on the back end, with a very flat topology node-wise, and relatively few (under a dozen) nodes, at that. Since it's so flat, any of them are capable of being CL, so I'd like to make sure that failover works. IMO, once joined to the Convention, it should be possible to seamlessly swap ConventionLeaders without interruption at any time. I think depending on certain actors running on the ConventionLeader should be discouraged. Instead, actors can advertise their utility via capabilities and globalNames, or perhaps on a ConventionRegistrar, and then be delegated to. This lessens the reliance on a single central system and also improves reliability, not only by spreading the service(s) across multiple nodes but also by easily allowing multiple instances of whatever vital service. I look forward to seeing how Thespian evolves, and helping with that evolution, if possible.
Good info, thanks @pjz. I'll go through your detailed responses in more depth soon for any followups. I want to be careful not to overpromise and to be clear on the supportable scope of HA-related work but your insight is a good one: a relatively simple solution may solve 80% of the need and go a long way to helping people in this area. To that end, I appreciate your description of the use case you are solving, and look forward to hearing about any other use cases people might provide to help scope this work. Thanks for your continued support and involvement!
I'm trying to put something into pre-production; to that end, I'd like to know what happens if the 'Convention Address.IPv4' capability is updated. Does the update correctly point the ActorSystem at a new Convention Leader? Could this be used to point existing ASs at a new Convention Leader if the original goes down?
I haven't had time to try it for details, but in thinking about this I believe that the local Admin would need to have the updated convention leader address and there's no way to change this at the moment without restarting the local Admin.
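For readers following along, here is a minimal sketch of what "restart the local Admin with a new leader address" might look like. The helper name `member_capabilities`, the example addresses, and the port numbers are hypothetical, and the `(host, port)` tuple form of the capability value is an assumption; only the `'Convention Address.IPv4'` key comes from the discussion above. The Thespian calls are left as comments so the snippet runs without Thespian installed.

```python
def member_capabilities(leader_host, leader_port=1900, admin_port=1900):
    """Build a capabilities dict for a member ActorSystem (hypothetical helper).

    Assumption: the leader address is given as a (host, port) tuple.
    """
    return {
        "Admin Port": admin_port,
        "Convention Address.IPv4": (leader_host, leader_port),
    }


# Capabilities pointing at the original leader (example address).
caps = member_capabilities("10.0.0.5")

# With Thespian installed, these would be supplied when the system starts:
#   from thespian.actors import ActorSystem
#   asys = ActorSystem("multiprocTCPBase", capabilities=caps)
#
# Per the comment above, re-pointing a member at another leader currently
# means stopping the local Admin and starting it again:
#   asys.shutdown()
#   asys = ActorSystem("multiprocTCPBase",
#                      capabilities=member_capabilities("10.0.0.6"))
new_caps = member_capabilities("10.0.0.6")
```

Any actors running under the local Admin would be lost across such a restart, which is part of why a dynamic update would be preferable.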
Being able to change the Convention Address of a running AS seems like a good, fundamental first step, so I went and read a bunch of code, and it looks like the LocalConventionState already handles some amount of re-registration, but the logger isn't nearly as dynamic. My experiments ended up with some kind of infinite loop of log messages, but I'm still trying to figure out if the loop is due to the reported messages looping or the logger looping :)
@kquick - "Currently all members identify the Convention Leader via a single IP address. Should this be expanded to a list of IP addresses? Should multicast addressing be used for attempting to identify a leader? Should broadcast conventions (e.g. port 5670) be used for leader identification?" This is a relevant use case for something I am also POC'ing and I happened to raise a bug ( #74 ) without seeing this first.
I haven't had any time to work on HA/failover capabilities, so nothing is imminent. I think this is a fairly interesting idea to pursue, although largely academic since there haven't been a large number of people expressing interest in this or upvoting this particular issue. I'm happy to support anyone wanting to work on extending Thespian in this area, and I may do it someday myself, but I don't envision being able to work on this for the foreseeable future (i.e. at least the next couple of months).
Kevin, I'll be happy to take a stab at this, and possibly create a PR for you to review. But I could possibly use some pointers to get off the ground. Best.
That would be great, Arnab. There's a lot of scoping and discussion above in this issue, and quite a bit of that is intended to scope the overall issue but should not be a concern with the initial POC work. If I were to sketch out initial POC, I would suggest:
The above is using a very simplistic election mechanism for convention leader; in the longer term, a more sophisticated election process should be used (e.g. RAFT) along with synchronization methods to ensure that actions occurring during the leadership changes aren't lost, and possibly even supporting multiple leaders. I think it will be helpful to defer these more sophisticated processes though until some of the basic functionality can be explored using the above. There are a lot of corner cases and timing issues not handled by the above as well (and it should not attempt to support AdminRouting or AdminRoutingTXOnly modes), so it won't be very robust, but it should help to ensure we have a good identification of what information should be shared/synchronized between leaders and reveal where there are unforeseen issues. Naturally, you are free to chart a different path if you are working on a POC and I'm happy to support you either way. For ongoing discussion, please feel free to use the thespian mailing list, or if you wish to discuss lower-level issues that may not be of interest to the mailing list, you can reach me directly via gmail at s/kq/kq1q/ of my github username. |
@kquick thanks for the details. I'll start going through the code to see how exactly the life cycle for the convention IP is handled. I guess I'll start off by forking this repo so that I don't mess up something here. Once I have made progress, I'll share it either on the forum or offline, with a limited audience. Best.
Update: @arnab-chanda has done a great job and as of #79 we now have initial support for HA capabilities. The current leader selection methodology is fairly simple and we will probably look at using a more sophisticated algorithm in the future, but this implementation should be enough to let people start trying out this functionality and providing feedback on where there are things still needing attention. This implementation should be backward compatible with existing Thespian; enabling HA is done simply by specifying the |
The initial support for this is available now in 3.10.6. It should be considered Beta: there is as yet no attempt to synchronize information between different leaders, only the ability to allow a leader to take over if the previously active leader exits.
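As an illustration of the "take over when the active leader exits" behavior described above, here is a minimal sketch of a priority-ordered selection over a list of candidate leaders. This is not Thespian's actual selection logic; the function `pick_leader`, the liveness callback, and the addresses are all hypothetical, and no state is synchronized between leaders (matching the Beta limitation noted above).

```python
def pick_leader(candidates, is_alive):
    """Return the first candidate that is alive, or None if none are.

    candidates: ordered list of (host, port) tuples, highest priority first.
    is_alive:   callable taking an address and returning True or False.
    """
    for addr in candidates:
        if is_alive(addr):
            return addr
    return None


# Hypothetical candidate leaders, in priority order.
candidates = [("10.0.0.5", 1900), ("10.0.0.6", 1900), ("10.0.0.7", 1900)]

# Simulated liveness: start with every candidate running.
alive = set(candidates)
leader = pick_leader(candidates, alive.__contains__)

# The active leader exits; the next candidate in the list takes over.
alive.discard(leader)
new_leader = pick_leader(candidates, alive.__contains__)
```

A fixed priority order keeps the sketch simple but does nothing about split-brain or in-flight messages during the switch, which is where a consensus protocol such as Raft would come in.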
What features are needed to make Thespian more resilient to individual node loss? I see that thespianpy#21 is still open, 3.5 years later - is PySyncObj, as suggested in thespianpy#20 to (re) determine the Convention Leader still the solution? And a global Registrar to coordinate Actors cross-node? Is that really all that's missing?