Control over failure retries #21

kquick · 2018-02-01T06:46:47Z

[Originally suggested by Daniel Mitterdorfer, restated with some modifications here:]

Currently an actor failure when processing a message causes that message to get retried and on the second failure the PoisonMessage is sent back to the original sender. In many cases, it may be useless or undesirable to retry the message and instead send the PoisonMessage on the first failure. It may also be useful to delay the retry for a brief period instead of retrying immediately. Investigate potential methods for allowing this control over the response to message failures (flags? base classes?) and evaluate it to make sure it doesn't cause other behavioral problems.
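
One way to picture the controls being asked for is a small policy object that says how many times a failed message is retried and how long to wait between attempts. The sketch below is a standalone simulation, not Thespian code: `RetryPolicy`, `Poisoned`, and `deliver` are hypothetical names invented for illustration, and the default policy only mimics the retry-once behavior described above.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical knobs: how often to retry and how long to wait."""
    max_retries: int = 1        # 1 mimics the retry-once behavior described above
    delay_seconds: float = 0.0  # pause before each retry


@dataclass
class Poisoned:
    """Stand-in for a poison wrapper that carries the original message."""
    message: object
    attempts: int


def deliver(handler, message, policy=RetryPolicy(), sleep=time.sleep):
    """Call handler(message); on failure, retry according to policy.

    Returns the handler's result, or a Poisoned wrapper once every
    allowed attempt has raised an exception.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return handler(message)
        except Exception:
            if attempts > policy.max_retries:
                return Poisoned(message, attempts)
            if policy.delay_seconds > 0:
                sleep(policy.delay_seconds)
```

With `RetryPolicy(max_retries=0)` the first failure is reported immediately, while `RetryPolicy(max_retries=1, delay_seconds=0.5)` would retry once after a short pause. Whether such a policy would best be attached per actor class, per message, or per system is exactly the open question in this issue.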


asomov · 2018-12-28T12:43:44Z

Hi, I am evaluating the library and I see an important deviation from Erlang/OTP, where the actor simply dies when a message fails and it is up to its supervisor to decide what to do next.

I found no implementation of a supervisor or of the Supervisor Behaviour. Is this a design decision, or is it just not YET implemented?
The sender does not have the whole context, so it may not be able to decide properly what to do with the 'failing' message.
The order of the messages cannot be guaranteed, because the actor will drop the 'failing' message and continue to process other messages from its mailbox.

kquick · 2018-12-29T00:52:33Z

Hi @asomov,

Thespian handles actor relationships slightly differently than Erlang/OTP, but you should have all the same functionality available in both systems.

In Thespian, the Parent actor acts as a supervisor for all of its children and can make decisions about what to do when the child fails (https://thespianpy.com/doc/using.html#hH-41cd5450-c34f-4672-aafa-c96ed29c3f01).

Any message delivery to the child is automatically retried once on failure in case the failure was due to a transient issue. If the actor fails a second time on delivery of that message, the sender receives the message back in a PoisonMessage wrapper (https://thespianpy.com/doc/using.html#hH-407c4c79-2a05-442d-b6e8-5bf7c2f2d068) and can decide what action to take. You can easily add a try: ... except: self.send(self.myAddress, ActorExitRequest()) wrapper around the body of receiveMessage to cause the child to exit when it receives a message it cannot handle. If the child actor completely dies, the parent is notified via the ChildActorExited message (https://thespianpy.com/doc/using.html#hH-ed408390-5a74-4955-9f7d-a84e87595459) and can decide at that point whether or not to re-create the child.

The above describes Thespian's alternative to the supervisor behavior in Erlang. Please let me know if this is insufficient for your specific needs: I'm happy to consider additional functionality if it fits well within the existing architecture.
I'm not sure what you are referring to as "the whole context", or which is the "failing message" in this situation. If it is the PoisonMessage response to the sender, the sender is the one that originally sent the message and should have context information about why it sent that message (this can be attached to the message itself, because the PoisonMessage wrapper returns the original message). If you meant that the "failing message" was the ChildActorExited message, I'm not sure what additional context you would want to include on that failure: there are a large number of reasons why the child actor could exit, so it would be hard to enumerate them all in a notification message like this, but I'm curious as to what type of context you are used to receiving.
Yes, Thespian itself provides best-effort delivery, and while delivery of messages between two actors is usually ordered, it is not guaranteed to be ordered. Any strict ordering requirements would need to be implemented by the actors themselves using some sort of id, timestamp, or other mechanism to ensure ordering (and completeness). In the limiting case, this issue is fundamentally about the CAP theorem, which asks whether, in the presence of a partitioning event (P), the system prioritizes consistency (C) or availability (A). For Thespian, the approach was to prioritize A, based on the observation that C can usually be built on top of A, but not vice versa. Given this architectural basis, I would be interested to hear about any alternatives or issues with this approach, and as with the above I am open to enhancements that are still in line with the core architectural principles.
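
As a rough illustration of ordering implemented by the actors themselves, the receiving actor could buffer arrivals by sequence number and release them only when the gap before them has been filled. This is a minimal sketch in plain Python rather than Thespian code: the `Resequencer` class is hypothetical, and it assumes the sender numbers its messages 0, 1, 2, ... and attaches that number to each message.

```python
class Resequencer:
    """Release messages in sequence-number order, buffering any gaps.

    Hypothetical helper: the sender is assumed to number its messages
    0, 1, 2, ... and the receiver feeds each arrival to accept().
    """

    def __init__(self):
        self._next_seq = 0   # next sequence number to release
        self._pending = {}   # out-of-order arrivals, keyed by sequence number

    def accept(self, seq, payload):
        """Record one arrival and return the payloads now deliverable."""
        if seq < self._next_seq or seq in self._pending:
            return []        # duplicate: already released or already buffered
        self._pending[seq] = payload
        ready = []
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready

    def missing(self):
        """Sequence numbers still outstanding below the highest one seen."""
        if not self._pending:
            return []
        return [s for s in range(self._next_seq, max(self._pending))
                if s not in self._pending]
```

An actor could call `accept()` from its message handler and process only the returned payloads, and it could use `missing()` to ask the sender to resend anything that was lost, which covers the completeness half of the problem.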

asomov · 2018-12-31T01:25:06Z

(Well, I am not sure this issue is a good place for this discussion. The mailing list would be a better alternative.)
Consider the use case:
Actor A sends 2 messages to actor B. The first is to create a user account, the second is to notify actor C about the created account.
The first message fails. How can the sender (actor A) fix the problem? It cannot send a message to replace the failed one. It can only send a message to the end of the mailbox, which will change (break) the order.

kquick · 2019-01-01T03:37:49Z

Hi @asomov ,

I'm happy to transfer this discussion to the mailing list. There is information on the "Contribute" page of thespianpy.com about joining the mailing list.

Does Actor A send both messages to Actor B in your scenario, or is the second message from Actor A to Actor C? In the first case, B should not forward a message to C for a user it has not performed/received a creation message for. In the second case, A should not send to C until B confirms the operation is completed (even if there is no loss of messages, there is no guarantee that Actor B runs before Actor C, even in Erlang/OTP, so any action which is contingent on successful completion of another Actor's run should involve either receiving a confirmation of completion from that actor or else allowing that actor to forward only on completion).

Here's an example of the first method: requiring a confirmation of completion:

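from thespian.actors import ActorSystem, ActorTypeDispatcher

class ActorA(ActorTypeDispatcher):
    def receiveMsg_str(self, username_msg, sender):
        b = self.createActor(ActorB)
        # Keep ActorC's address on self so receiveMsg_Created can use it.
        self.c = self.createActor(ActorC)
        self.send(b, CreateUser(username_msg))
    def receiveMsg_Created(self, created_msg, sender):
        self.send(self.c, created_msg)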

class CreateUser(object):
    def __init__(self, username):
        self.username = username

class Created(object):
    def __init__(self, createuser_obj):
        self.user_created = createuser_obj.username

class ActorB(ActorTypeDispatcher):
    def receiveMsg_CreateUser(self, create_msg, sender):
        [...do work to create requested user....]
        self.send(sender, Created(create_msg))

if __name__ == "__main__":
    asys = ActorSystem(...)
    a = asys.createActor(ActorA)
    asys.ask(a, "user_foo", 5)

In the scenario above, ActorA will not notify ActorC of the user creation until ActorB confirms that creation via a Created message. An example of the alternative method where B forwards on completion:

class ActorA(ActorTypeDispatcher):
    def receiveMsg_str(self, username_msg, sender):
        b = self.createActor(ActorB)
        c = self.createActor(ActorC)
        self.send(b, CreateUser(username_msg, c))

class CreateUser(object):
    def __init__(self, username, notify_actor):
        self.username = username
        self.notify_addr = notify_actor

class Created(object):
    def __init__(self, createuser_obj):
        self.user_created = createuser_obj.username

class ActorB(ActorTypeDispatcher):
    def receiveMsg_CreateUser(self, create_msg, sender):
        [...do work to create requested user....]
        self.send(create_msg.notify_addr, Created(create_msg))

if __name__ == "__main__":
    asys = ActorSystem(...)
    a = asys.createActor(ActorA)
    asys.ask(a, "user_foo", 5)

In this form, B does not send the message to ActorC until the user is successfully created. These are the two most common methods to ensure proper ordering of events for the Actor Model.

asomov · 2019-01-02T14:07:54Z

Hi Kevin,
your message does not answer my question. I will try to ask in a different way.
Use case: actors A and B send messages to actor C.

If actor C fails to process the message, it should never try to process it again (as in Erlang and Akka). This is important.
If actor C fails to process the message, the sending actor (either A or B) should never receive "replies" about this failure and try to "heal" the problem. They simply do not have the context to decide (as in Erlang and Akka). But it should be possible to listen for actor failures in order to react to them. (The listener may or may not be the actor which sends the messages.)

Is this possible to achieve?

kquick · 2019-01-02T19:50:17Z

Hi Andrey,

Thanks for the clarification of your scenario.

The general behavior of Thespian is that it will re-attempt delivery of a message to an actor once if the actor's receiveMessage() method throws an uncaught exception. If the second attempt encounters an exception (any exception, not necessarily the same one) then it will send the message back to the original sender in a PoisonMessage wrapper and the current actor will proceed with new messages.

If you do not want the Thespian auto-retry of a message, add a global exception catch to your receiveMessage() method:
```
class MyActor(Actor):
    def receiveMessage(self, message, sender):
        try:
            [handle the message here]
        except Exception:
            # log.error('Failed to handle message %s', str(message))
            pass
```
The try/except block will capture all exceptions and not allow them to be seen by the calling Thespian code, so Thespian will never retry delivery of the message.
The assumption in Thespian is that the actor which sent the original message ("ActorA" or "ActorB") is the one that has the most contextual information for deciding the appropriate activity when the delivery of the message fails. If you have a separate actor ("ActorX") which keeps the state information needed for recovery then you can simply have ActorA and ActorB forward the PoisonMessage from ActorC to ActorX:
```
class ActorA(Actor):
    def receiveMessage(self, message, sender):
        if isinstance(message, PoisonMessage):
            self.send(self.actorX_addr, message)
        else:
            [handle message normally]
```
It is also feasible for ActorA and/or ActorB to ignore the PoisonMessage entirely if neither they nor any other actor will be able to perform a recovery. Or you could have the try/except I described above send the message to ActorX in the exception case.
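
To avoid repeating that try/except in every actor, the reporting variant could be packaged as a decorator. The following is only a sketch under stated assumptions: `HandlerFailure` and `report_failures_to` are hypothetical names, not part of Thespian, and the only thing the code relies on from the actor is the `self.send(address, message)` method used in the examples above.

```python
import functools
import traceback


class HandlerFailure:
    """Hypothetical report sent to the recovery actor instead of retrying."""

    def __init__(self, message, original_sender, error_text):
        self.message = message
        self.original_sender = original_sender
        self.error_text = error_text


def report_failures_to(address_attr):
    """Decorate receiveMessage so exceptions are reported, not re-raised.

    address_attr names the attribute on the actor that holds the
    recovery actor's address (for example "actorX_addr").
    """
    def decorator(receive):
        @functools.wraps(receive)
        def wrapper(self, message, sender):
            try:
                return receive(self, message, sender)
            except Exception:
                # The exception is swallowed here, so the calling framework
                # never sees a failure and has nothing to retry.
                report = HandlerFailure(message, sender, traceback.format_exc())
                self.send(getattr(self, address_attr), report)
        return wrapper
    return decorator
```

Applied as `@report_failures_to("actorX_addr")` on ActorC's `receiveMessage`, this would send one failure report per failed message to ActorX, leaving ActorA and ActorB out of the recovery path entirely. The actor would still need to set `actorX_addr` itself, for example from an initialization message.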

Also note that there is a ChildActorExited message that is delivered to the parent when an actor exits; this is separate and distinct from the PoisonMessage that is delivered back to the original sender when an exception is thrown by the actor's receiveMessage().

Thespian does have some different architectural choices than either Erlang or Akka, but I believe that equivalent functionality is achievable with all three. I do appreciate your questions: these help me to validate this belief and also show where I can extend the documentation to facilitate the use of Thespian by developers familiar with Erlang or Akka.

asomov · 2019-01-03T09:19:43Z

Dear Kevin,
thank you for your time and explanations.
I see that the library does not promote the "let it crash" approach. Instead, it encourages defensive programming (manual try/except).
It is just different from my expectation of what an actor is and how it works.

kquick added the enhancement label Feb 1, 2018
