Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HA Thespian #62

Open
pjz opened this issue Mar 9, 2020 · 13 comments
Open

HA Thespian #62

pjz opened this issue Mar 9, 2020 · 13 comments

Comments

@pjz
Copy link

pjz commented Mar 9, 2020

What features are needed to make Thespian more resilient to individual node loss? I see that thespianpy#21 is still open, 3.5 years later - is PySyncObj, as suggested in thespianpy#20 to (re) determine the Convention Leader still the solution? And a global Registrar to coordinate Actors cross-node? Is that really all that's missing?

@kquick
Copy link
Owner

kquick commented Mar 10, 2020

Thanks for the nudge, @pjz.

Thespian is already moderately resilient to failures in member nodes for the Convention, it's just the Convention Leader that is a singular component that must be running for the rest of the Convention to operate correctly.

The techniques provided by PySyncObj could be used to propagate convention-related information amongst potential convention leaders, but there are still higher level behaviors and operational concerns to work out. I'm going to start documenting those here as a way to gather and identify the information and decisions that must be made before the point where this work is undertaken:

  • I believe it should be possible to support Convention Leader migration for the asymmetric configurations (multiprocTCPBase with "Admin Routing" and/or "Outbound Only" capabilities set as well as the symmetric configurations. However, the asymmetric configurations may require a limited group of systems to be designated as "Leader-capable" instead of a more broad approach that could be taken for the symmetric configurations.

  • The convention management code has been more cleanly isolated in the implementation ( thespian/system/admin/convention.py ) which should help to identify the set of information that must be migrated/assumed by a new Convention Leader. This data should be reviewed to determine if it is sufficient for migration of leadership or if additional data needs to be migrated.

  • Currently all members identify the Convention Leader via a single IP address. Should this be expanded to a list of IP addresses? Should multicast addressing be used for attempting to identify a leader? Should broadcast conventions (e.g. port 5670) be used for leader identification?

  • Some Convention configurations have a large number of members (multiple thousands). Should convention leadership potential be distributed evenly across all members or should there be a subset of members identified as leaders?

  • In-flight actor creation must be addressed with regards to a leader migration event.

  • Presumably a partitioning event would result in two (or more) distinct conventions; evaluate current management to ensure that each partitioned convention receives appropriate actor exited events for actors no longer part of the current partition.

  • How should Dead Letter handling be managed in a partitioned scenario?

  • On de-partitioning, how to reconcile and merge the information from individual conventions?

  • Should there be a capability which is automatically added for the current convention leader, to allow Actors to follow that convention leader (especially those using the notifyOnSystemRegistrationChanges extension)?

  • How should logging be handled for a partitioning scenario? For a de-partitioning scenario?

  • Currently the Convention Leader can act as the distribution node for loadable sources: they are directly loaded only on the Convention Leader and can be propagated on-demand to other nodes. For migration scenarios, how should loadable sources be managed? Will this require on-disk presence of the same loadable sources for all potential convention leaders? Should a newly-elected convention leader be allowed to pull a loaded source from the disk directly (currently this requires an external agent to load the sources, like the Director). Should potential convention leaders be required to pull any loaded source for the current leader before being considered as a leader? How should this affect distribution of the loadable source verification Actor?

Please feel free to update this issue with comments on either the above or with additional considerations not yet captured above. I'm particularly interested in use-cases where the ability to migrate Convention leadership would be useful and effective for people: there are a lot of potential aspects to this, so being able to focus the efforts on the most applicable use-cases will help drive some of the design decisions for this.

@pjz
Copy link
Author

pjz commented Mar 10, 2020

There are definitely a lot of potential edge cases, some of the easier to address than others. Maybe start with the easy solutions and work up from there?

  • I agree, supporting a 'ConventionLeaderCandidate' capability would allow Thespian to be informed about what is and isn't tenable with more exotic network topologies.
  • Allowing 'Admin Port' to be a list instead of a single string would help in not only the multi-interface case, but also the proxy case, and so seems useful.
  • Searching via various broadcast methods... IMO, the list of leaders should be dynamically pre-defined. eg. whenever a ConventionLeaderCandidate is added to the cluster, everyone agrees where that one goes in the line of succession and so fails over to it if the main Convention Leader fails. Broadcast is, IMO, for ActorSystem autodiscovery, not ConventionLeader discovery.
  • I think letting the developer designate ConventionLeaderCandidates solves the 'massive membership' problem by letting the developer solve it. Maybe they want to designate N of them for redundancy. Maybe they want to limit candidates to the set of machines closest to the middle of their topology. Let them decide.
  • I don't know enough about the internals of Thespian to have an opinion on how to handle in-flight actor creation or messaging.
  • I think partitioning is usually resolved by requiring a quorum to proceed; this may lead to a requirement of an odd number of ConventionLeaderCandidates so that a quorum (half, rounded up) always exists. Partitions with a quorum continue operating, respawning missing actors. Partitions with less than a quorum are to shut down as gracefully as possible, or perhaps just shut down their actors and keep attempting to reconnect to the main body to aid in recovery from temporary outages (though at a reasonable interval so as not to cause problems when the connection does come back up)
  • Re: adding a capability for the current convention leader... maybe? currently all nodes have a capability saying "My Convention Leader is X" right? Giving X a capability indicating that he's the convention leader seems redundant.
  • Re: Loadable Sources: Those are all good possibilities. I don't think changing the Convention Leader should give the new one any extra abilities - if a Director is being used, it needs to be used no matter which node is ConventionLeader. Redundancy may require more than one Director - each with its own key pair and copy of the source. Maybe add a capability of 'hasSourceAuthority' instead of requiring it to run on the CL, and have the CL, when asked to provide source, delegate that request to a node in the Convention that 'hasSourceAuthority' instead of answering it directly. (It could, of course, be the only AS in the Convention with that Capability if that's what's desired)

FWIW, my use case is fairly simple: essentially a streaming web service with some, but not a lot, of synchronization necessary on the back end, with a very flat topology node-wise.. and relatively few (under a dozen) nodes, at that. Since it's so flat, any of them are capable of being CL, so I'd like to make sure that failover works.

IMO, once joined to the Convention, it should be possible to seamlessly swap ConventionLeaders without interruption at any time. I think depending on certain actors being running on the ConventionLeader should be discouraged - instead actors can advertise their utility via capabilities and globalNames, or perhaps on a ConventionRegistrar, and then be delegated to - this lessens the reliance on a single central system and also improves reliability by not only spreading the service(s) across multiple nodes but also easily allowing multiple of whatever vital service.

I look forward to seeing how Thespian evolves - and helping with that evolution, if possible.

@kquick
Copy link
Owner

kquick commented Mar 10, 2020

Good info, thanks @pjz. I'll go through your detailed responses in more depth soon for any followups.

I want to be careful not to overpromise and to be clear on the supportable scope of HA-related work but your insight is a good one: a relatively simple solution may solve 80% of the need and go a long way to helping people in this area. To that end, I appreciate your description of the use case you are solving, and look forward to hearing about any other use cases people might provide to help scope this work.

Thanks for your continued support and involvement!

@pjz
Copy link
Author

pjz commented Mar 11, 2020

I'm trying to put something into pre-production; to that end, I'd like to know what happens if the 'Convention Address.IPv4' capability is updated ? Does it correctly point it at a new Convention Leader? Could this be used to point existing ASs at a new Convention Leader if the original goes down?

@kquick
Copy link
Owner

kquick commented Mar 11, 2020

I haven't had time to try it for details, but in thinking about this I believe that the local Admin would need to have the updated convention leader address and there's no way to change this at the moment without restarting the local Admin.

@pjz
Copy link
Author

pjz commented Mar 12, 2020

Being able to change the Convention Address of a running AS seems like a good fundament first step, so I went and read a bunch of code, and it looks like the LocalConventionState already handles some amount of re-registration, but the logger isn't anything like so dynamic. My experiments ended up with some kind of infinite loop of log messages.. but I'm still trying to figure out if the loop is due to the reported messages looping or the logger looping :)

@ghost
Copy link

ghost commented May 7, 2021

@kquick -

"Currently all members identify the Convention Leader via a single IP address. Should this be expanded to a list of IP addresses? Should multicast addressing be used for attempting to identify a leader? Should broadcast conventions (e.g. port 5670) be used for leader identification?"

This is a relevant use case for something I am also POC'ing and I happened to raise a bug ( #74 ) without seeing this first.
IMHO, similar feature is already present in implementations of actor system in scala, java etc.
Will this be incorporated anytime soon?

@kquick
Copy link
Owner

kquick commented May 10, 2021

I haven't had any time to work on HA/failover capabilities, so nothing is imminent. I think this is a fairly interesting idea to pursue, although largely academic since there haven't been a large number of people expressing interest in this or upvoting this particular issue. I'm happy to support anyone wanting to work on extending Thespian in this area, and I may do it someday myself, but I don't envision being able to work on this for the foreseeable future (i.e. at least the next couple of months).

@ghost
Copy link

ghost commented May 10, 2021

Kevin,

I'll be happy to take a stab at this, and possibly create a PR for you to review. But I could possibly use some pointers to get off the ground.

Best.
Arnab

@kquick
Copy link
Owner

kquick commented May 10, 2021

That would be great, Arnab. There's a lot of scoping and discussion above in this issue, and quite a bit of that is intended to scope the overall issue but should not be a concern with the initial POC work. If I were to sketch out initial POC, I would suggest:

  1. Changing the Convention Address.IPv4 attribute to take a list.
  2. On startup, the default convention leader is the first address in that group (lets call the others "Convention Supporters").
  3. When other Convention Supporters check in with the Convention Leader, the latter should recognize them as a Convention Supporter.
  4. Synchronize updates from the Convention Leader to the Convention Supporters.
  5. Convention Supporters should periodically check the health of the Convention Supporter/Leader preceding them in the list.
  6. Differentiate between active and offline Convention Leaders and Supporters in the list.
  7. Update step 5 to check both the preceding entry in the list and the active entry in the list (skipping inactive members, and the active member is the same as the preceding member if the preceding member is up).
  8. When the Convention Supporter whose preceding active member is the Convention Leader detects that the Leader is offline, it should change mode to become the Convention Leader, and send an internal message to all other Actors and Admins to notify them it is the new Convention Leader.
  9. Test all of the above to make sure the Convention can continue to run if the current Convention Leader goes offline.
  10. If a Convention Supporter comes online it should try to register with the current Convention Leader by sending an internal Thespian message to all other entries in the list; the current Convention Leader should be the only one to respond.
    a. If the Convention Leader preceeds the current ConventionSupporter in the list, the Convention Supporter simply registers in the convention and begins to get updates from the Leader.
    b. If the Convention Leader follows the current ConventionSupporter in the list, the Convention Supporter registers as in step 10a, and after it gets a convention update, it sends out a message to all other Convention members that it is the new Convention Leader. It also sends out a message to all Actors (duplicating step 8 above).
    c. If there is no Convention Leader and this is not the first entry in the list, log a failure message and exit

The above is using a very simplistic election mechanism for convention leader; in the longer term, a more sophisticated election process should be used (e.g. RAFT) along with synchronization methods to ensure that actions occurring during the leadership changes aren't lost, and possibly even supporting multiple leaders. I think it will be helpful to defer these more sophisticated processes though until some of the basic functionality can be explored using the above. There are a lot of corner cases and timing issues not handled by the above as well (and it should not attempt to support AdminRouting or AdminRoutingTXOnly modes), so it won't be very robust, but it should help to ensure we have a good identification of what information should be shared/synchronized between leaders and reveal where there are unforeseen issues.

Naturally, you are free to chart a different path if you are working on a POC and I'm happy to support you either way. For ongoing discussion, please feel free to use the thespian mailing list, or if you wish to discuss lower-level issues that may not be of interest to the mailing list, you can reach me directly via gmail at s/kq/kq1q/ of my github username.

@ghost
Copy link

ghost commented May 11, 2021

@kquick thank for the details. I'll start going through the code to see how exactly life cycle for the convention IP is handled.

I guess I'll start off by forking this repo so that I don't mess up something here. Once I have made progress, I'll share either the forum or offline, with a limited audience.

Best.

@kquick
Copy link
Owner

kquick commented Jun 22, 2021

Update: @arnab-chanda has done a great job and as of #79 we now have initial support for HA capabilities. The current leader selection methodology is fairly simple and we will probably look at using a more sophisticated algorithm in the future, but this implementation should be enough to let people start trying out this functionality and providing feedback on where there are things still needing attention.

This implementation should be backward compatible with existing Thespian; enabling HA is done simply by specifying the Convention Leader.IPv4 capability as a list instead of a single string. Further information will be available as the documentation is updated (coming soon).

@kquick
Copy link
Owner

kquick commented Jan 10, 2022

The initial support for this is available now in 3.10.6. It should be considered Beta: there is as yet no attempt to synchronize information between different leaders, only the ability to allow a leader to take over if the previously active leader exits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants