Constant fail to create actor with global name on second attempt #50
Comments
Hi @GeorgJung, Thank you for the clean sample code, and this is a Thespian bug, not something you are doing wrong. The issue is in the combined usage of globalName actors and multi-system Conventions, which has some caveats. I have identified a probable fix and should have it available for you in a couple of hours (along with more details on those caveats), pending some additional testing. Regards, |
Hi @GeorgJung, In regards to using globalName with remote Actor Systems, I have updated the documentation (http://thespianpy.com/doc/using.html#hH-2d5ef9c7-2132-45f9-b6eb-6f2729f5b232 and http://thespianpy.com/doc/using.html#hH-3c8aa454-9a21-4d24-a623-ad86bfd4f5a7, committed via kquick@59e4fb1) to describe some of the considerations for using this configuration. Please let me know if you have questions or suggestions for further clarification. Regards, |
Hi @kquick, wow, that was a fast response. I checked out your repo and tried the fix on my skeleton example. Works like a charm. It will take me a little longer to try it out in the actual project code that unveiled the bug in the first place, but I am working on that.
A quick comment on the documentation in section 11.3, "Multi-system Global Name support": the description of the registrar pattern in the third paragraph (I think it was called the coordinator pattern in other issues; at least it is very similar) might be a bit confusing. The named registrar that runs on the convention leader might still have an unwanted twin inside some convention participant. I think it should be emphasized that the developer still needs to make sure by additional means that only one registrar exists in the convention, even when using the "globalName". Maybe you could mention that a combination of unique name and unique capability (identifying the named actor and the convention participant that runs the named actor) ensures a globally unique, identifiable actor. At least this is my understanding of the current state.
One more question: Do you plan to change the mechanisms for named actors so that they will be unique throughout a convention, or is it intended behavior that they are only guaranteed to be unique within one single participant?
And a more global comment: You make a point in the documentation about being careful with using named actors for performance reasons. I'd argue that for actors that really need to be unique, the convenience of the global name feature outweighs performance considerations most of the time. After all, we use an actor framework for robustness and correctness, and writing your own registrar is error-prone. In other words: Keep up the great work of making features like that generic and opaque.
Again, thanks for the quick response, this support is a huge argument in favor of using Thespian!
Hi @kquick , I now tested the original project code with your fix. Unfortunately, I have to report that I still get the same failure. The error message in the project code is:
The minimal scenario that I created to report the error however works fine now. I did not manage to reproduce the error in a small scenario yet. The error message is produced in |
Hi @GeorgJung, I'm glad the fix worked for your sample, but that's disappointing news on the original project. One thing to check is that all of the previous ActorSystems have been shut down before retrying the new code, so that no processes are still running the older code (in this case, the local Admin would be the culprit). On a related note, I made an adjustment when working with your samples:
You can also look in $TMPDIR/thespian.log: this is a very short log of some Thespian internals. It wraps at 50KB for normal production configurations, but you can change that by editing the beginning of thespian/system/utilis.py and changing the "50" there to something larger if needed. I'd be happy to take a look at what you are getting in that file to help track down why you are still getting this issue.
Thanks for the votes of support, and also the feedback on the documentation! I appreciate both, and the latter is very welcome to ensure the documentation is understandable.
I'm not entirely sure whether I want to extend the globalName singleton quality across multiple actor systems (and if I did, I would make sure some other argument was needed to enable the new behavior, keeping the backward-compatible current behavior for anyone using it). Extending Thespian to support cross-system singletons would potentially make things easier for users, but it also masks a lot of details that might be important in that scenario (as well as adding latencies and being just generally trickier to make things look "invisible" to the user).
The registrar/coordinator pattern adds some complexities for the user as you noted, but I've used it multiple times on different Actor projects with pretty good success. At the present time, I'm thinking that maybe Thespian should provide something to facilitate using the registrar/coordinator, similar to the approach of the Troupe decorator or the ActorTypeDispatcher: a base implementation people can build from as necessary, without completely internalizing it into Thespian, which would make it harder to adapt to different circumstances. I'd be interested to hear your thoughts on the issue since it's something you are addressing at this point.
Thanks,
Hi @kquick , thanks yet again for the quick answer! Awesome. I follow your argument on the introduction of global names in conventions, the rationale is sound. I might switch to implementing the registrar pattern myself for this project. I am sure that I have no old actor system running. In fact, in the project code I start actor systems as a context:
With the actor system starting with:
which in effect is close to your try/except clause. I will inspect the logs in Thanks |
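Both the context-based startup and the try/except adjustment mentioned above aim at the same guarantee: the actor system is shut down even when the body raises, so no stale Admin process survives into the next attempt. A minimal sketch of that idea follows; it is not the project code from this thread. The names `running_system` and `make_system` are hypothetical, and the only assumption about the system object is that it offers a `shutdown()` method, as Thespian's `ActorSystem` does.

```python
from contextlib import contextmanager


@contextmanager
def running_system(make_system):
    """Yield a freshly started actor system and always shut it down.

    `make_system` is a hypothetical zero-argument factory, for example
    lambda: ActorSystem('multiprocTCPBase', capabilities).
    """
    system = make_system()
    try:
        yield system
    finally:
        # Runs on normal exit and on exceptions, so no Admin process
        # running older code is left behind for the next attempt.
        system.shutdown()
```

Used as `with running_system(factory) as asys: ...`, the shutdown happens regardless of how the block is left.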
Hi @kquick, here is the information from the logfile. Apparently, there is a problem in the TCP communication: In line 17 there is an error about closing the socket. This run was done in a VM running Ubuntu, all on 127.0.0.1. There is a mockup hardware-in-the-loop Modbus interface running on 127.0.0.1:10502 (it's a little Node.js/node-red hack; it's not always running, and it should not be the culprit). I will write a separate comment with my thoughts about the convention global names. One quick thought: Would it help if I try running the code with
Hi @kquick , about the registrar pattern. In my code, I start the convention leader first. I create a backbone actor that listens to system events. Whenever a convention participant connects, the convention leader backbone creates a site backbone that requires the capabilities of that site. Thus, I do already have a star-shaped network of backbones for each convention participant. Now when a site backbone creates its device actors, they each should be unique (globally only one per device, managing the device access). Thus, I thought it was a good idea to create those as named actors. By creating an actor of the same name, other convention participants should gain access to the device manager. If I use the registrar pattern (the leader backbone becoming the registrar), then each site backbone would need to
Each exchange is through unidirectional messages, which means that a mechanism is needed to pair up request and reply and the whole setup is prone to race conditions (i.e., I need a locking mechanism / mutual exclusion protocol at the registrar site). Both tasks are algorithmically non-trivial. I am a bit reluctant before trying that. So yeah, it would be cool for me if I could just use some mechanics provided by Thespian. How do you handle race conditions with named actors? |
Hi @GeorgJung, I think the error reported in line 17 is probably fine (I should probably change these to a warning): closing down a socket will sometimes encounter this when closing sockets for a forked child. The key problem is indicated by this line:
You can issue a
There shouldn't be any conflict with your Node.js instance running on another port. The "Admin Port" for each Actor System is explicit and must be available for use because that is the primary coordination point for that Actor System, but all other Actor ports are supplied by the system (via a bind with a port number of 0), so there shouldn't be any collisions unless your available port space is severely depleted.
You can use the multiprocUDPBase if you'd like: there should be no difference to your Actors, and it just changes the underlying transport used internally in Thespian. There are some restrictions of some system bases vs. other system bases (http://thespianpy.com/doc/using.html#outline-container-hH-2a5fa63d-e6eb-43b9-bea8-47223b27544e), but the main difference between TCP and UDP is that the latter doesn't have any validation of sends, so it's possible for messages to get dropped under load or with intermittent network connectivity.
Let me know if the investigation into the capabilities/requirements of the BatteryManagementActor resolves the problems for you.
Regards,
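The remark about system-supplied ports relies on a general operating-system feature: binding to port 0 asks the OS to pick any free port. The sketch below shows only that mechanism with the Python standard library; it is not Thespian's transport code, and `bind_ephemeral` is a hypothetical helper name.

```python
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Bind a TCP socket to port 0 and return (socket, assigned_port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0: let the OS choose a free port
    sock.listen(1)
    # getsockname() reports the port the OS actually assigned.
    return sock, sock.getsockname()[1]
```

Two sockets obtained this way and held open at the same time receive different ports, which is why a fixed port such as 10502 used by another process does not collide with them.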
Yes, this sounds like the way I would approach it.
Alternatively, you can internalize the "create if it doesn't exist" into the registrar:
If you use the alternative above, that is one way of solving the race condition, because there is a central Actor performing the "create if it doesn't exist" operation, and Actor message handling is synchronous for each Actor (although asynchronous with other Actors), so there's no additional locking/mutex needed. If you use your original method, I would just have the registration response confirm the address, which would allow the Registrar to send an exit request to any superfluous actors and just retain the original. I think this is what you were suggesting in number 5 above, but just to be sure, I would envision something like this:
Both of these forms would have a client something like the following:
If you use the second form of the Registrar, the client also needs to update the Registrar when it gets a Note that the client uses Python's open structure capabilities to preserve the original sender and message on requests to the Registrar, and the Registrar is written to preserve the original request, which makes the registrar's response fully informative and directly actionable. The alternative would have been to save the original request and sender in a list or something in the client, and then retrieve them once the Registrar response had been received; I've found this latter method to be longer and more awkward (lookups may find multiples or none), so preserving the original information on the subsequent requests has been more useful, but it does require the target Actor (in this case, the Registrar) to return the request message in some form to preserve this information. Hopefully this helps give you some ideas about how to approach this pattern. Regards, |
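A minimal sketch of the pattern described above, under stated assumptions and not the code originally posted in this thread: a Registrar that performs "create if it doesn't exist" centrally, and a client that sends the original sender and message along with its lookup so the reply is directly actionable. `StandInActor` is a hypothetical stand-in for `thespian.actors.Actor` so the sketch runs without Thespian; `Lookup`, `DeviceActor`, the tuple request format, and the `"registrar-addr"` placeholder are likewise illustrative.

```python
class StandInActor:
    """Hypothetical stand-in for thespian.actors.Actor so this sketch runs
    without Thespian. It records sends instead of delivering them."""

    def __init__(self):
        self.sent = []     # (target_address, message) pairs
        self.created = []  # actor classes passed to createActor

    def send(self, target, message):
        self.sent.append((target, message))

    def createActor(self, actor_class):
        self.created.append(actor_class)
        return "addr-%d" % len(self.created)  # fake address


class DeviceActor(StandInActor):
    """Placeholder for the actor that manages one device."""


class Lookup:
    """Registrar request that carries the original request with it."""

    def __init__(self, name, orig_sender, orig_message):
        self.name = name
        self.orig_sender = orig_sender
        self.orig_message = orig_message
        self.address = None  # filled in by the Registrar


class Registrar(StandInActor):
    def __init__(self):
        super().__init__()
        self.known = {}  # device name -> actor address

    def receiveMessage(self, message, sender):
        if isinstance(message, Lookup):
            # An actor handles one message at a time, so this
            # check-then-create step needs no lock.
            if message.name not in self.known:
                self.known[message.name] = self.createActor(DeviceActor)
            message.address = self.known[message.name]
            self.send(sender, message)  # echo the request back, now resolved


class Client(StandInActor):
    registrar = "registrar-addr"  # hypothetical; obtained elsewhere

    def receiveMessage(self, message, sender):
        if isinstance(message, Lookup):
            # Reply from the Registrar: the original request came back with
            # it, so nothing has to be looked up in local state.
            self.send(message.address,
                      (message.orig_sender, message.orig_message))
        elif isinstance(message, tuple):
            # Assumed request format: (device_name, payload).
            device_name, payload = message
            self.send(self.registrar, Lookup(device_name, sender, payload))
```

Because the reply contains the original sender and payload, the client keeps no table of pending requests, which is the trade-off discussed above: the Registrar must return the request message, but lookups on the client side disappear.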
Hi @kquick, first of all: Thanks! You solved my problem. It was indeed the capability check. I had noticed the failure message I meanwhile implemented my own version of the registrar pattern where each convention participant is responsible for the devices at their site and registers them with the registrar. Other sites will have to ask the registrar and their request gets queued until the device appears. It works in my scenario, in fact it's a near perfect fit to my requirements. So, due to the specifics of my application, the registrar idea turned out to be a lot easier than I thought (also no locking/mutex necessary). I will however certainly implement the much more generic approach that you proposed in subsequent projects. So many thanks again, P.S.: Once the fix you created Sunday night makes it into the official Thespian version, I'd like to go back to using that. When do you think I should start looking for that?
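The queueing variant described above (sites register their own devices, and lookups for a device that has not appeared yet wait until it does) can be sketched as plain bookkeeping. This is an illustrative sketch, not the implementation from the project; `DeviceRegistry` and its method names are hypothetical, and inside a Thespian registrar actor the same logic would be driven from `receiveMessage`.

```python
from collections import defaultdict


class DeviceRegistry:
    """Bookkeeping for a registrar that queues lookups for unknown devices."""

    def __init__(self):
        self.devices = {}                 # device name -> actor address
        self.waiting = defaultdict(list)  # device name -> queued requesters

    def lookup(self, name, requester):
        """Return the address if known; otherwise queue the requester."""
        if name in self.devices:
            return self.devices[name]
        self.waiting[name].append(requester)
        return None

    def register(self, name, address):
        """Record a device and return the requesters that were waiting.

        Assumption: the first registration for a name wins, since each
        site is responsible only for its own devices.
        """
        self.devices.setdefault(name, address)
        return self.waiting.pop(name, [])
```

A registrar actor would send `devices[name]` to every requester returned by `register`, and because one actor processes one message at a time, no additional locking is involved.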
Hi @GeorgJung, That's great news! I'm glad the problems got resolved, although I'd like to see if there's some way I can make this capability check mismatch easier to diagnose in the future. Do you know if you were getting any Please don't feel bad about bothering me: you definitely helped uncover a bug and also revealed an area where it would be nice to make some sort of improvement. I'm always happy to help people using Thespian because I think the Actor model has a lot to offer, especially in today's more distributed development environments. I'm also glad the Registrar pattern is working for you, and I will try to get something written up to describe this common pattern better and make it easier to use. I've created kquick#13 to track that. I will try to get a release cut in the next couple of days, and I'll leave this issue open until I get that done. Regards, |
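Since the root cause turned out to be a capability check, a small sketch of how such a check decides where an actor may run can make the failure mode concrete. `actorSystemCapabilityCheck(capabilities, requirements)` is the static method Thespian consults on an actor class; everything else here is an assumption for illustration: the `DeviceManager` class, the `"site"` capability key, and the `accepting_systems` helper. The class does not subclass Thespian's `Actor` so that the sketch runs on its own.

```python
class DeviceManager:  # with Thespian this would subclass thespian.actors.Actor
    @staticmethod
    def actorSystemCapabilityCheck(capabilities, requirements):
        """Return True only for the actor system that may host this actor."""
        # requirements may be None when the creator passes none.
        wanted = (requirements or {}).get("site")
        return wanted is not None and capabilities.get("site") == wanted


def accepting_systems(systems, requirements):
    """List the (hypothetical) systems whose capabilities pass the check."""
    return [name for name, caps in systems.items()
            if DeviceManager.actorSystemCapabilityCheck(caps, requirements)]
```

If `accepting_systems` comes back empty for the requirements a creator passes, no system in the convention can host the actor, which matches the kind of mismatch that was hard to diagnose here.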
Fixed in release 3.8.3 (https://github.com/kquick/Thespian/releases/tag/thespian-3.8.3) and published to pypi.python.org. |
Hi @kquick, I solemnly declare that I have not tried that. However, I went back to a pre-fix commit now and added a The message wrapped into the poison message (attribute I did not expect a
Thanks for publishing the fix, I'll switch back to the original version now! |
Thanks, @GeorgJung . I've created kquick#14 to work on making this better in the future. |
Hey all,
I just recently started using Thespian and I can only thank you for this great package.
I have an issue with named actors. I have a distributed system where a convention leader gets notified when new actor systems connect. The convention leader backbone actor then starts a site backbone which in turn starts actors that manage devices on site. For various reasons, I thought that the device actors would best be named actors because each device manager can only exist once (although I read that I should probably manage that by hand using the site backbones).
I want to be able to access the device managers from other convention participants (i.e., mainly from the convention leader). Assuming A is the leader, and B is a participant. B starts a device management actor named 'b'. Now A also starts an actor named 'b' to talk to the device. However, A always fails to create that actor!
I created three python scripts that illustrate the issue:
A file thespian_sandbox_testing/global_name_actor with an actor MyActor receiving a simple message:
A simple starter for the actor system at Alice's side (convention leader):
A simple starter for the actor system at Bob's side (convention participant):
I start the actor system at Alice's side, wait for the "ready to proceed" box, then start the actor system at Bob's side, and when it runs, I click "proceed" on Alice's side. Now both systems remain until the "shutdown" boxes are clicked.
Alice tries to create a Bob-specific actor, and should instead receive the address of the existing actor at Bob's side. Instead I get this error message (after the timeout of 50 seconds):
thespian.actors.ActorSystemRequestTimeout: No response received to PendingActor request to Admin at ActorAddr-(T|:1900) from ActorAddr-(T|:46523)
What am I doing wrong?