Constant fail to create actor with global name on second attempt #50

Closed
GeorgJung opened this issue Nov 5, 2017 · 16 comments

@GeorgJung

GeorgJung commented Nov 5, 2017

Hey all,

I just recently started using Thespian and I can only thank you for this great package.

I have an issue with named actors. I have a distributed system where a convention leader gets notified when new actor systems connect. The convention leader backbone actor then starts a site backbone which in turn starts actors that manage devices on site. For various reasons, I thought that the device actors would best be named actors because each device manager can only exist once (although I read that I should probably manage that by hand using the site backbones).

I want to be able to access the device managers from other convention participants (i.e., mainly from the convention leader). Assuming A is the leader, and B is a participant. B starts a device management actor named 'b'. Now A also starts an actor named 'b' to talk to the device. However, A always fails to create that actor!

I created three python scripts that illustrate the issue:

  1. A file thespian_sandbox_testing/global_name_actor with an actor MyActor receiving a simple message:

     from thespian.actors import ActorTypeDispatcher, requireCapability
     
     class Message(object):
     
         def __init__(self, msg):
             self.msg = msg
     
     @requireCapability('Bob')
     class MyActor(ActorTypeDispatcher):
         def __init__(self):
             print('Bob actor created')
     
         def receiveMsg_Message(self, message, sender):
             print('Bob actor received message ' + message.msg +
                   ' from ' + str(sender))
    
  2. A simple starter for the actor system at Alice's side (convention leader):

     from easygui import msgbox
     from thespian.actors import ActorSystem
     
     from thespian_sandbox_testing.global_name_actor import Message, MyActor
     
     
     if __name__ == '__main__':
     
         asys = ActorSystem(systemBase='multiprocTCPBase',
                            capabilities={'Alice': True,
                                          'Admin Port': 1900})
     
         msgbox(msg='Convention leader (Alice) is running.\n\nPress button to proceed.',
                    title='Alice',
                    ok_button='proceed')
     
         bob_actor = asys.createActor(MyActor, globalName='Bob actor')
     
         asys.tell(bob_actor, Message('ALICE'))
     
         msgbox(msg='Convention leader (Alice) is running.\n\nPress button to shut down.',
                title='Alice',
                ok_button='shut down')
     
         print('Bob actor address: ' + str(bob_actor))
     
         asys.shutdown()
    
  3. A simple starter for the actor system at Bob's side (convention participant):

     from easygui import msgbox
     from thespian.actors import ActorSystem
     
     from thespian_sandbox_testing.global_name_actor import Message, MyActor
     
    
     if __name__ == '__main__':
     
         asys = ActorSystem(systemBase='multiprocTCPBase',
                            capabilities={'Bob': True,
                                          'Admin Port': 1902,
                                          'Convention Address.IPv4': ('127.0.0.1', 1900)})
     
         bob_actor = asys.createActor(MyActor, globalName='Bob actor')
     
         asys.tell(bob_actor, Message('BOB'))
     
         msgbox(msg='Convention participant (Bob) is running.\n\nPress button to shut down.',
                title='Bob',
                ok_button='shut down')
     
         print('Bob actor address: ' + str(bob_actor))
     
         asys.shutdown()
    

I start the actor system at Alice's side, wait for the "ready to proceed" box, then start the actor system at Bob's side, and when it runs, I click "proceed" on Alice's side. Now both systems remain until the "shutdown" boxes are clicked.

Alice tries to create a Bob-specific actor, and should instead receive the address of the existing actor at Bob's side. Instead I get this error message (after the timeout of 50 seconds):

thespian.actors.ActorSystemRequestTimeout: No response received to PendingActor request to Admin at ActorAddr-(T|:1900) from ActorAddr-(T|:46523)

What am I doing wrong?

@kquick
Member

kquick commented Nov 6, 2017

Hi @GeorgJung,

Thank you for the clean sample code. This is a Thespian bug, not something you are doing wrong.

The issue is in the combined usage of globalName actors and multi-system Conventions, which has some caveats. I have identified a probable fix and should have it available for you in a couple of hours (along with more details on those caveats), pending some additional testing.

Regards,
Kevin

@kquick
Member

kquick commented Nov 6, 2017

Hi @GeorgJung,
I have pushed a fix for this issue to https://github.com/kquick/Thespian (note that this is the current development repo, which is different than the one where you filed this ticket). I have not created a pypi release with this fix yet: if it is convenient for you to checkout the master from that repo and try it I would appreciate the confirmation of the fix. If it is not convenient, I can generate a new release and upload it to pypi, just let me know.

In regards to using globalName with remote Actor Systems, I have updated the documentation (http://thespianpy.com/doc/using.html#hH-2d5ef9c7-2132-45f9-b6eb-6f2729f5b232 and http://thespianpy.com/doc/using.html#hH-3c8aa454-9a21-4d24-a623-ad86bfd4f5a7, committed via kquick@59e4fb1) to describe some of the considerations for using this configuration. Please let me know if you have questions or suggestions for further clarification.

Regards,
Kevin

@GeorgJung
Author

Hi @kquick ,

wow, that was a fast response. I checked out your repo and tried the fix on my skeleton example. Works like a charm. It will take me a little longer to try it out in the actual project code that unveiled the bug in the first place, but I am working on that.

A quick comment on the documentation in section 11.3, "Multi-system Global Name support": When you describe the registrar pattern (I think it was called coordinator pattern in other issues, at least it is very similar) in the third paragraph, it might be a bit confusing: The named registrar that runs on the convention leader still might have an unwanted twin inside some convention participant. I think it should be emphasized that the developer still needs to make sure by additional means that only one registrar exists in the convention, even when using the "globalName". Maybe you could mention that a combination of unique name and unique capability (identifying the named actor and the convention participant that runs the named actor) ensures a globally unique, identifiable actor. At least this is my understanding of the current state.

One more question: Do you plan to change the mechanisms for named actors so that they will be unique throughout a convention, or is it intended behavior that they are only guaranteed to be unique within one single participant?

And a more global comment: You make a point in the documentation about being careful with using named actors for performance reasons. I'd argue that for actors that really need to be unique, the convenience of the global name feature outweighs performance considerations most of the time. After all, we use an actor framework for robustness and correctness, and writing your own registrar is error prone. In other words: Keep up the great work of making features like that generic and opaque.

Again, thanks for the quick response, this support is a huge argument in favor of using Thespian!

@GeorgJung
Author

Hi @kquick ,

I now tested the original project code with your fix. Unfortunately, I have to report that I still get the same failure. The error message in the project code is:

ERROR Pending Actor create for ActorAddr-(T|:60385) failed (3586): None

The minimal scenario that I created to report the error however works fine now. I did not manage to reproduce the error in a small scenario yet.

The error message is produced in system/actorManager, but I do not have enough insight into your code to trace it. Do you have a recommendation as to where I should look for the issue?

@kquick
Member

kquick commented Nov 6, 2017

Hi @GeorgJung,

I'm glad the fix worked for your sample, but that's disappointing news on the original project. One thing to check is to make sure that all of the previous ActorSystems have been shutdown before re-trying the new code, to make sure there aren't still processes running the older code (in this case, the local Admin would be the culprit). On a related note, I made an adjustment when working with your samples:

    asys = ActorSystem(...)
    try:
        pass  # tests creating and messaging actors go here
    finally:
        asys.shutdown()

You can also look in $TMPDIR/thespian.log: this is a very short log of some Thespian internals. It wraps at 50KB for normal production configurations, but you can change that by editing the beginning of thespian/system/utilis.py and changing the "50" there to something larger if needed. I'd be happy to take a look at what you are getting in that file to help track down why you are still getting this issue.

Thanks for the votes of support, and also the feedback on the documentation! I appreciate both and the latter are very welcome to ensure the documentation is understandable.

I'm not entirely sure whether I want to extend the globalName singleton quality to cross multiple actor systems (and if I did, I would make sure there was some other argument or something needed to enable the new behavior, keeping the backward-compatible current behavior for anyone using it). Extending Thespian to support cross-system singletons would potentially make things easier for users, but it also masks over a lot of details that might be important in that scenario (as well as adding latencies and being just generally trickier to make things look "invisible" to the user). The registrar/coordinator pattern adds some complexities for the user as you noted, but I've used it multiple times on different Actor projects with pretty good success. At present, I'm leaning toward adding something to Thespian that facilitates the registrar/coordinator pattern, similar to the approach of the Troupe decorator or the ActorTypeDispatcher: a base implementation people can build on as necessary, without internalizing it so completely into Thespian that it becomes hard to adapt to different circumstances. I'd be interested to hear your thoughts on the issue since it's something you are addressing at this point.

Thanks,
Kevin

@GeorgJung
Author

Hi @kquick ,

thanks yet again for the quick answer! Awesome. I follow your argument on the introduction of global names in conventions, the rationale is sound. I might switch to implementing the registrar pattern myself for this project.

I am sure that I have no old actor system running. In fact, in the project code I start actor systems as a context:

from contextlib import contextmanager
import logging

from thespian.actors import ActorSystem


@contextmanager
def device_proxy_actor_system(system_base, log_defs, capabilities):
    '''
    @summary: Context manager which ensures that the actor system shuts
    down when the program exits.
    '''
    actor_system = ActorSystem(systemBase=system_base,
                               logDefs=log_defs,
                               capabilities=capabilities)
    try:
        yield actor_system
    finally:
        # CONTEXT_LOGGING_PREFIX is a constant defined elsewhere in the project
        logging.info(CONTEXT_LOGGING_PREFIX + 'Actor system shut down.')
        actor_system.shutdown()

With the actor system starting with:

    with ts.device_proxy_actor_system(system_base=SYSTEM,
                                      log_defs=LOGDEF,
                                      capabilities=CAPABILITIES) as asys:
        # CODE

which in effect is close to your try/finally clause.

I will inspect the logs in /tmp first thing tomorrow (I am on Central European Time)

Thanks

@GeorgJung
Author

Hi @kquick ,

here is the information from the logfile. Apparently, there is a problem in the TCP communication: In line 17 there is an error about closing the socket. This run was done in a VM running Ubuntu, all on 127.0.0.1. There is a mockup hardware in the loop modbus interface running on port 127.0.0.1:10502 (it's a little Node.js/node-red hack; it's not always running, it should not be the culprit). I will write a separate comment with my thoughts about the convention global names.

One quick thought: Would it help if I try running the code with multiprocUDPBase?

2017-11-07 08:28:25.967416 p4328 I    ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 8)
2017-11-07 08:28:26.005719 p4328 I    Pending Actor request received for actors.cluster_backbone_actor.ClusterBackboneActor reqs {'site_id': 'cluster'} from ActorAddr-(T|:41163)
2017-11-07 08:28:26.020094 p4330 I    Starting Actor actors.cluster_backbone_actor.ClusterBackboneActor at ActorAddr-(T|:38329) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900), srcHash None)
2017-11-07 08:29:12.644019 p4358 I    ++++ Admin started @ ActorAddr-(T|:1902) / gen (3, 8)
2017-11-07 08:29:12.693397 p4358 I    Admin registering with Convention @ ActorAddr-(T|:1900) (first time)
2017-11-07 08:29:12.700181 p4358 I    Setting log aggregator of ActorAddr-(T|:38677) to ActorAddr-(T|:1900)
2017-11-07 08:29:12.706856 p4328 I    Got Convention registration from ActorAddr-(T|:1902) (first time) (new? True)
2017-11-07 08:29:12.722675 p4358 I    Got Convention registration from ActorAddr-(T|:1900) (re-registering) (new? True)
2017-11-07 08:29:12.729876 p4328 I    Pending Actor request received for actors.site_backbone_actor.SiteBackboneActor reqs {'site_id': 'stuttgart_hil'} from ActorAddr-(T|:38329)
2017-11-07 08:29:12.733193 p4328 I    Requesting creation of actors.site_backbone_actor.SiteBackboneActor on remote admin ActorAddr-(T|:1902)
2017-11-07 08:29:12.734197 p4358 I    Pending Actor request received for actors.site_backbone_actor.SiteBackboneActor reqs {'site_id': 'stuttgart_hil'} from ActorAddr-(T|:1900)
2017-11-07 08:29:12.836455 p4360 I    Starting Actor actors.site_backbone_actor.SiteBackboneActor at ActorAddr-(T|:32825) (parent ActorAddr-(T|:1902), admin ActorAddr-(T|:1902), srcHash None)
2017-11-07 08:29:12.887806 p4358 I    Pending Actor request received for actors.battery_management_actor.BatteryManagementActor reqs {'site_id': 'stuttgart_hil'} from ActorAddr-(T|:32825)
2017-11-07 08:29:12.903011 p4361 I    Starting Actor actors.battery_management_actor.BatteryManagementActor at ActorAddr-(T|:40689) (parent ActorAddr-(T|:1902), admin ActorAddr-(T|:1902), srcHash None)
2017-11-07 08:29:12.969714 p4358 I    Admin registering with Convention @ ActorAddr-(T|:1900) (re-registering)
2017-11-07 08:29:12.970969 p4358 I    Setting log aggregator of ActorAddr-(T|:38677) to ActorAddr-(T|:1900)
2017-11-07 08:29:12.985087 p4328 I    Error during shutdown of socket <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=2049, proto=6>: [Errno 9] Bad file descriptor
2017-11-07 08:29:12.985417 p4328 I    Got Convention registration from ActorAddr-(T|:1902) (re-registering) (new? False)
2017-11-07 08:29:24.612772 p4328 I    Pending Actor request received for actors.battery_play_actor.BatteryPlayActor reqs None from ActorAddr-(T|:40181)
2017-11-07 08:29:24.624601 p4363 I    Starting Actor actors.battery_play_actor.BatteryPlayActor at ActorAddr-(T|:43797) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900), srcHash None)
2017-11-07 08:29:24.661043 p4328 I    Pending Actor request received for actors.battery_management_actor.BatteryManagementActor reqs {'site_id': 'stuttgart_hil_test'} from ActorAddr-(T|:43797)
2017-11-07 08:30:12.145070 p4363 I    Pending Actor create for ActorAddr-(T|:43797) failed (3586): None
2017-11-07 08:30:12.145847 p4363 I    completion error: ************* TransportIntent(ActorAddr-LocalAddr.0-=-SENDSTS_FAILED-<class 'actors.actor_messages.InitMsg'>-{'HEADERS': {}, 'MAIN': {'name': 'blue_sky_battery01', 'type': 'battery'}, 'MODBUS': {'server_ip': '127.0.0.1', 'number_port': 10502}, 'DEVICE': {'cha...-quit_0:04:12.523252)
2017-11-07 08:30:12.147556 p4363 I    completion error: ************* TransportIntent(ActorAddr-LocalAddr.0-=-SENDSTS_FAILED-<class 'actors.actor_messages.SubscribeMsg'>-<actors.actor_messages.SubscribeMsg object at 0x7fd5ad56c8d0>-quit_0:04:12.525470)
2017-11-07 08:30:12.147819 p4363 I    completion error: ************* TransportIntent(ActorAddr-LocalAddr.0-=-SENDSTS_FAILED-<class 'actors.actor_messages.ReqMsg'>-<actors.actor_messages.ReqMsg object at 0x7fd5b107beb8>-quit_0:04:59.986351)
2017-11-07 08:30:20.457072 p4326 I    ActorSystem shutdown requested.
2017-11-07 08:30:20.458065 p4328 I    Convention cleanup or deregistration for ActorAddr-(T|:1902) (known? True)
2017-11-07 08:30:20.459151 p4358 I    Convention cleanup or deregistration for ActorAddr-(T|:1900) (known? True)
2017-11-07 08:30:20.473522 p4358 I    Admin de-registering with Convention @ ActorAddr-(T|:1900)
2017-11-07 08:30:20.476251 p4358 I    Convention cleanup or deregistration for ActorAddr-(T|:1900) (known? False)
2017-11-07 08:30:20.503210 p4328 I    Convention cleanup or deregistration for ActorAddr-(T|:1902) (known? False)
2017-11-07 08:30:20.525408 p4358 I    Error during shutdown of socket <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=2049, proto=6>: [Errno 9] Bad file descriptor
2017-11-07 08:30:20.542679 p4361 I    Error during shutdown of socket <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=2049, proto=6>: [Errno 9] Bad file descriptor
2017-11-07 08:30:20.546625 p4358 I    Error during shutdown of socket <socket.socket [closed] fd=-1, family=AddressFamily.AF_INET, type=2049, proto=6>: [Errno 9] Bad file descriptor
2017-11-07 08:30:20.560040 p4326 I    ActorSystem shutdown complete.
2017-11-07 08:30:20.564700 p4328 I    ---- shutdown completed
2017-11-07 08:30:20.570154 p4358 I    ConnRefused to ActorAddr-(T|:32825); declaring as DeadTarget.
2017-11-07 08:30:20.589875 p4358 I    ---- shutdown completed
2017-11-07 08:30:20.596654 p4358 I    ConnRefused to ActorAddr-(T|:32825); declaring as DeadTarget.
2017-11-07 08:30:27.005522 p4350 I    ActorSystem shutdown requested.
2017-11-07 08:30:27.041309 p4350 I    ConnRefused to ActorAddr-(T|:1902); declaring as DeadTarget.
2017-11-07 08:30:27.044433 p4350 Warn Could not send shutdown request to Admin; aborting but not necessarily stopped

@GeorgJung
Author

Hi @kquick ,

about the registrar pattern. In my code, I start the convention leader first. I create a backbone actor that listens to system events. Whenever a convention participant connects, the convention leader backbone creates a site backbone that requires the capabilities of that site. Thus, I do already have a star-shaped network of backbones for each convention participant.

Now when a site backbone creates its device actors, they each should be unique (globally only one per device, managing the device access). Thus, I thought it was a good idea to create those as named actors. By creating an actor of the same name, other convention participants should gain access to the device manager.

If I use the registrar pattern (the leader backbone becoming the registrar), then each site backbone would need to

  1. inquire with the registrar if a name exists
  2. wait for answer from the registrar
  3. if the name does not exist, create the actor
  4. report the creation (name, address) to the registrar, which stores it in its table
  5. if the name exists, use the address provided by the registrar

Each exchange is through unidirectional messages, which means that a mechanism is needed to pair up request and reply and the whole setup is prone to race conditions (i.e., I need a locking mechanism / mutual exclusion protocol at the registrar site). Both tasks are algorithmically non-trivial. I am a bit reluctant before trying that.

So yeah, it would be cool for me if I could just use some mechanics provided by Thespian. How do you handle race conditions with named actors?

@kquick
Member

kquick commented Nov 7, 2017

Hi @GeorgJung,

I think the error reported in line 17 is probably fine (I should probably change these to a warning): closing down a socket will sometimes encounter this when closing sockets for a forked child.

The key problem is indicated by this line:
2017-11-07 08:30:12.145070 p4363 I Pending Actor create for ActorAddr-(T|:43797) failed (3586): None
This is an indication that the actorSystemCapabilityCheck() (or requireCapability decorator) didn't match any running ActorSystem's capabilities. This was most likely for the BatteryManagementActor, so I would check to make sure that the intended ActorSystem's capabilities will be matched for that Actor. I believe this should have caused a ChildActorExited message to be sent to the actor that initiated the create (BatteryPlayActor?)
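As a rough illustration of how such a mismatch arises, here is a plain-Python sketch (it does not import Thespian and is not Thespian's internal code). Only the static-method signature actorSystemCapabilityCheck(capabilities, requirements) is taken from Thespian; the class name, the system names, and the assumption that each Actor System advertises a 'site_id' capability are made up for the example. The two requirement values are the ones that appear in the log above.

```python
class BatteryManagementActorSketch(object):
    """Stand-in for an Actor subclass; only the capability check is modeled."""

    @staticmethod
    def actorSystemCapabilityCheck(capabilities, requirements):
        # Accept a system only if it advertises the requested site_id.
        wanted = (requirements or {}).get('site_id')
        return wanted is not None and capabilities.get('site_id') == wanted


def matching_systems(systems, requirements):
    """Return the names of the systems whose capabilities pass the check."""
    check = BatteryManagementActorSketch.actorSystemCapabilityCheck
    return [name for name, caps in sorted(systems.items())
            if check(caps, requirements)]


# Hypothetical capabilities for the two Actor Systems seen in the log.
systems = {
    'leader': {'Admin Port': 1900, 'site_id': 'cluster'},
    'participant': {'Admin Port': 1902, 'site_id': 'stuttgart_hil'},
}

print(matching_systems(systems, {'site_id': 'stuttgart_hil'}))       # ['participant']
print(matching_systems(systems, {'site_id': 'stuttgart_hil_test'}))  # []
```

When the second list is empty, no Actor System can host the actor, which is the situation the "Pending Actor create ... failed" line reports.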

You can issue a SIGUSR1 to any of these Actors ($ kill -USR1 {pid} on Linux) and they will generate a status output to the thespian.log file. When an ActorSystem admin generates this status output it will include the capabilities of that Actor System. Alternatively, you can use the thespian.shell utility to query the status and print the output to the current terminal (see http://thespianpy.com/doc/in_depth.html#outline-container-hH-058d8939-b973-4270-975b-3afd9c607176).

There shouldn't be any conflict with your node.js instance running on another port. The "Admin Port" for each Actor System is explicit and must be available for use because that is the primary coordination point for that Actor System, but all other Actor ports are supplied by the system (via a bind with a port number of 0), so there shouldn't be any collisions unless your available port space is severely depleted.
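For reference, the "bind with a port number of 0" behavior can be seen with the standard socket module alone. This is a generic sketch of the operating-system mechanism, not Thespian's transport code, and bind_ephemeral is a name chosen for the example.

```python
import socket


def bind_ephemeral(host='127.0.0.1'):
    """Bind a TCP socket to port 0 and return (socket, assigned_port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS to pick any free port
    return sock, sock.getsockname()[1]


sock, port = bind_ephemeral()
try:
    print('OS assigned port', port)
finally:
    sock.close()
```

Because the OS only hands out ports that are currently free, a service listening on a fixed port such as 10502 is not a candidate for assignment.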

You can use the multiprocUDPBase if you'd like: there should be no difference to your Actors, and it just changes the underlying transport used internally in Thespian. There are some restrictions of some system bases vs. other system bases (http://thespianpy.com/doc/using.html#outline-container-hH-2a5fa63d-e6eb-43b9-bea8-47223b27544e), but the main difference between TCP and UDP is that the latter doesn't have any validation of sends so it's possible for messages to get dropped under load or with intermittent network connectivity.

Let me know if the investigation into the capabilities/requirements of the BatteryManagementActor resolves the problems for you.

Regards,
Kevin

@kquick
Member

kquick commented Nov 7, 2017

> about the registrar pattern. In my code, I start the convention leader first. I create a backbone actor that listens to system events. Whenever a convention participant connects, the convention leader backbone creates a site backbone that requires the capabilities of that site. Thus, I do already have a star-shaped network of backbones for each convention participant.

Yes, this sounds like the way I would approach it.

> Now when a site backbone creates its device actors, they each should be unique (globally only one per device, managing the device access). Thus, I thought it was a good idea to create those as named actors. By creating an actor of the same name, other convention participants should gain access to the device manager.
>
> If I use the registrar pattern (the leader backbone becoming the registrar), then each site backbone would need to
>
>   1. inquire with the registrar if a name exists
>   2. wait for answer from the registrar
>   3. if the name does not exist, create the actor
>   4. report the creation (name, address) to the registrar, which stores it in its table
>   5. if the name exists, use the address provided by the registrar

Alternatively, you can internalize the "create if it doesn't exist" into the registrar:

from thespian.actors import ActorTypeDispatcher

class GetKnownActor(object):
    "Message sent to the Registrar to get an address"
    def __init__(self, name, reqs=None):
        self.name = name
        self.reqs = reqs or {}  # avoid sharing a mutable default argument

class KnownActorAddr(object):
    "Response message sent from the Registrar with the requested actor's address"
    def __init__(self, name, addr, reqmsg):
        self.name = name
        self.addr = addr
        self.reqmsg = reqmsg

class Registrar(ActorTypeDispatcher):
    def __init__(self, *args, **kw):
        super(Registrar, self).__init__(*args, **kw)
        self.known_actors = {}  # maps actor name -> actor address
    def receiveMsg_GetKnownActor(self, gka_msg, sender):
        # Create the actor on first request, then always hand out the same address
        if not self.known_actors.get(gka_msg.name, None):
            self.known_actors[gka_msg.name] = self.createActor(gka_msg.name,
                                                               targetActorRequirements=gka_msg.reqs)
        self.send(sender,
                  KnownActorAddr(gka_msg.name, self.known_actors[gka_msg.name], gka_msg))
    def receiveMsg_ChildActorExited(self, exitmsg, sender):
        # Forget every name that was registered for the exited child
        for name, addr in list(self.known_actors.items()):
            if addr == exitmsg.childAddress:
                del self.known_actors[name]

> Each exchange is through unidirectional messages, which means that a mechanism is needed to pair up request and reply and the whole setup is prone to race conditions (i.e., I need a locking mechanism / mutual exclusion protocol at the registrar site). Both tasks are algorithmically non-trivial. I am a bit reluctant before trying that.

If you use the alternative above, that is one way of solving the race condition, because there is a central Actor performing the "create if doesn't exist" operation, and Actor message handling is synchronous for each Actor (although asynchronous with other Actors), so there's no additional locking/mutex needed.
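To make the "no locking needed" point concrete, here is a toy model in plain Python (no Thespian involved; RegistrarSketch and its fields are invented for the illustration). It assumes only what is stated above: requests land in one mailbox and the actor handles them strictly one at a time.

```python
from collections import deque


class RegistrarSketch(object):
    """Toy registrar: one mailbox, messages handled strictly one at a time."""

    def __init__(self):
        self.mailbox = deque()
        self.known = {}      # name -> address
        self.created = 0     # number of actors actually created
        self.replies = []    # (requester, name, address)

    def tell(self, requester, name):
        # Senders only enqueue; they never touch self.known directly.
        self.mailbox.append((requester, name))

    def run(self):
        # Each message is fully processed before the next one is examined,
        # so "check, then create" cannot interleave with another request.
        while self.mailbox:
            requester, name = self.mailbox.popleft()
            if name not in self.known:
                self.created += 1
                self.known[name] = 'addr-%d' % self.created
            self.replies.append((requester, name, self.known[name]))


registrar = RegistrarSketch()
registrar.tell('site A', 'battery01')
registrar.tell('site B', 'battery01')  # queued before the first is handled
registrar.run()
print(registrar.created, registrar.replies)
```

Both requesters receive the same address and only one creation happens, even though the second request was queued before the first was answered.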

If you use your original method, I would just have the registration response confirm the address, which would allow the Registrar to send an exit request to any superfluous actors and just retain the original. I think this is what you were suggesting in number 5 above, but just to be sure, I would envision something like this:

from thespian.actors import ActorTypeDispatcher, ActorExitRequest

class GetKnownActor(object):
    "Message sent to the Registrar to get an address"
    def __init__(self, name):
        self.name = name

class KnownActorAddr(object):
    "Response message sent from the Registrar with the requested actor's address"
    def __init__(self, name, addr, reqmsg):
        self.name = name
        self.addr = addr  # will be None if not currently known
        self.reqmsg = reqmsg

class RegisterNewActor(object):
    "Message sent to the Registrar to register a newly created Actor"
    def __init__(self, name, addr):
        self.name = name
        self.addr = addr

class Registrar(ActorTypeDispatcher):
    def __init__(self, *args, **kw):
        super(Registrar, self).__init__(*args, **kw)
        self.known_actors = {}  # maps actor name -> actor address
    def receiveMsg_GetKnownActor(self, gka_msg, sender):
        self.send(sender, KnownActorAddr(gka_msg.name,
                                         self.known_actors.get(gka_msg.name, None), gka_msg))
    def receiveMsg_RegisterNewActor(self, rna_msg, sender):
        if not self.known_actors.get(rna_msg.name, None):
            self.known_actors[rna_msg.name] = rna_msg.addr
        else:
            # Name already registered: dismiss the superfluous new actor
            self.send(rna_msg.addr, ActorExitRequest())
        self.send(sender,
                  KnownActorAddr(rna_msg.name, self.known_actors[rna_msg.name], rna_msg))

Both of these forms would have a client something like the following:

class BackboneActor(ActorTypeDispatcher):
    def __init__(self, *args, **kw):
        super(BackboneActor, self).__init__(*args, **kw)
        self.battery_actor = None
    def receiveMsg_something(self, msg, sender):
        if self.battery_actor:
            self.handle_something(msg, sender)
        else:
            reqmsg = GetKnownActor("BatteryManagementActor")  # if registrar creates then this also has arg: , { 'has_battery': True })
            reqmsg.orig_msg = msg  # save the "something" message for later
            reqmsg.orig_sender = sender
            # self.registrar_addr was already set via some other message
            self.send(self.registrar_addr, reqmsg)
    def handle_something(self, msg, sender):
        self.send(self.battery_actor, ...)
        self.send(sender, BatterySetup())  # BatterySetup: reply message class defined elsewhere
    def receiveMsg_KnownActorAddr(self, kaa_msg, sender):
        if kaa_msg.addr:
            # could examine kaa_msg.name if there were potentially multiple different actors being registered
            self.battery_actor = kaa_msg.addr
            self.handle_something(kaa_msg.reqmsg.orig_msg, kaa_msg.reqmsg.orig_sender)
        else:
            # This clause doesn't exist in the "registrar creates" version
            reqmsg = RegisterNewActor(kaa_msg.name,
                                      self.createActor('BatteryManagementActor',
                                                       targetActorRequirements={'has_battery': True}))
            reqmsg.orig_msg = kaa_msg.reqmsg.orig_msg  # preserve stored information for clause above
            reqmsg.orig_sender = kaa_msg.reqmsg.orig_sender
            self.send(self.registrar_addr, reqmsg)

If you use the second form of the Registrar, the client also needs to update the Registrar when it gets a ChildActorExited message. For both implementations, if the client gets a PoisonMessage as a result of an attempt to send to the registered target, it could mean that the target is dead, so it should get or register a new child (the creation of the new child is automatically handled in the "registrar creates" version).

Note that the client uses Python's open structure capabilities to preserve the original sender and message on requests to the Registrar, and the Registrar is written to preserve the original request, which makes the registrar's response fully informative and directly actionable. The alternative would have been to save the original request and sender in a list or something in the client, and then retrieve them once the Registrar response had been received; I've found this latter method to be longer and more awkward (lookups may find multiples or none), so preserving the original information on the subsequent requests has been more useful, but it does require the target Actor (in this case, the Registrar) to return the request message in some form to preserve this information.

Hopefully this helps give you some ideas about how to approach this pattern.

Regards,
Kevin

@GeorgJung
Author

Hi @kquick ,

first of all: Thanks! You solved my problem.

It was indeed the capability check. I had noticed the failure message Pending Actor create for ActorAddr-(T|:43797) failed (3586): None before, but I did not know its root cause. Once you had mentioned the capability check and I knew where to look, I finally found a typo in one of the configuration files that describe my scenarios. So it was a very simple (however hard-to-find) bug in the end; with a little more insight into Thespian (which I have gained now) I think I could have found it myself. I feel a bit guilty for bothering you with what turned out to be a typo, but at least we squashed a small bug together, and I really appreciate the insight you shared about distributed architecture.

Meanwhile, I implemented my own version of the registrar pattern, where each convention participant is responsible for the devices at its site and registers them with the registrar. Other sites have to ask the registrar, and their requests get queued until the device appears. It works in my scenario; in fact it's a near perfect fit for my requirements. So, due to the specifics of my application, the registrar idea turned out to be a lot easier than I thought (also no locking/mutex necessary). I will, however, certainly implement the much more generic approach that you proposed in subsequent projects.
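For illustration only, the queueing idea can be sketched in plain Python roughly as follows. This is not the project code and does not use Thespian; the class, method, and device names are all invented for the example, and the "sent" list stands in for messages that an actor would send.

```python
class QueueingRegistrarSketch(object):
    """Toy registrar that parks lookups until the device is registered."""

    def __init__(self):
        self.devices = {}   # device name -> address
        self.waiting = {}   # device name -> requesters waiting for it
        self.sent = []      # (requester, device name, address) replies

    def lookup(self, requester, name):
        if name in self.devices:
            self.sent.append((requester, name, self.devices[name]))
        else:
            # Device not registered yet: remember who asked.
            self.waiting.setdefault(name, []).append(requester)

    def register(self, name, address):
        self.devices[name] = address
        # Answer everyone who asked before the device appeared.
        for requester in self.waiting.pop(name, []):
            self.sent.append((requester, name, address))


registrar = QueueingRegistrarSketch()
registrar.lookup('leader', 'battery01')      # too early, gets queued
registrar.register('battery01', 'addr-1')    # owning site registers the device
registrar.lookup('site B', 'battery01')      # answered immediately
print(registrar.sent)
```

Since only the owning site ever registers a given device, there is no competing "create" step, which is why no locking is needed in this variant.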

So many thanks again,
I guess we can close this thread,
Georg

ps.: Once the fix you created Sunday night makes it into the official Thespian version, I'd like to go back to using that. When do you think I should start looking for that?

@kquick
Member

kquick commented Nov 8, 2017

Hi @GeorgJung,

That's great news! I'm glad the problems got resolved, although I'd like to see if there's some way I can make this capability check mismatch easier to diagnose in the future. Do you know if you were getting any ChildActorExited or PoisonMessage messages in the broken configuration?

Please don't feel bad about bothering me: you definitely helped uncover a bug and also revealed an area where it would be nice to make some sort of improvement. I'm always happy to help people using Thespian because I think the Actor model has a lot to offer, especially in today's more distributed development environments.

I'm also glad the Registrar pattern is working for you, and I will try to get something written up to describe this common pattern better and make it easier to use. I've created kquick#13 to track that.

I will try to get a release cut in the next couple of days, and I'll leave this issue open until I get that done.

Regards,
Kevin

@kwquick
Contributor

kwquick commented Nov 9, 2017

Fixed in release 3.8.3 (https://github.com/kquick/Thespian/releases/tag/thespian-3.8.3) and published to pypi.python.org.

@kwquick kwquick closed this as completed Nov 9, 2017
@GeorgJung
Author

Hi @kquick ,

I solemnly declare that I have not tried that. However, I went back to a pre-fix commit now and added a PoisonMessage handler to the actor that tried to create the named actor with the wrong capability requirements. It fired four times on the first run.

The message wrapped into the poison message (attribute poisonMessage) is the initialization message sent to the named actor after creation. The attribute details contains the string Child Aborted (the wording might not be the best choice... maybe take Child Actor Creation Aborted). So there is no report about the reason as to why the actor creation was aborted.

I did not expect a ChildActorExited message as the named actor does not have a parent. I created a handler anyways, and as expected it didn't seem to receive any message.

@GeorgJung
Author

Thanks for publishing the fix, I'll switch back to the original version now!

@kquick
Member

kquick commented Nov 9, 2017

Thanks, @GeorgJung . I've created kquick#14 to work on making this better in the future.
