Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subscribed messages which go out of scope get garbage collected and cause undefined behaviour #676

Open
dpad opened this issue Apr 30, 2024 · 9 comments · May be fixed by #884
Open

Subscribed messages which go out of scope get garbage collected and cause undefined behaviour #676

dpad opened this issue Apr 30, 2024 · 9 comments · May be fixed by #884
Assignees
Labels
bug Something isn't working

Comments

@dpad
Copy link
Contributor

dpad commented Apr 30, 2024

Describe the bug

When creating a stand-alone message, as in the Creating Stand-Alone Messages example, if the user creates the message such that it goes out of scope and gets garbage-collected by Python, the subscribing module will still refer to the subscribed memory location.

This causes buggy undefined behaviour, including using potential garbage data as inputs to the module.

Impact
The faulty behaviour which occurs from this is extremely hard to track down and even harder to explain for users without a deep understanding of the C++/Python interaction underlying Basilisk.

For example, say a user sets up the MsisAtmosphere model with some initial space-weather parameters (in some stand-alone messages). Due to this bug, the MsisAtmosphere model never actually computes an appropriate atmospheric density, and the user has basically no idea why.

To reproduce
The example below sets an initial message to the CppModuleTemplate and checks that the output values are computed from that initial message. However, because the initial message goes out of scope and is garbage collected, its data gets overwritten with random garbage and still gets used by the CppModuleTemplate.

from Basilisk.architecture import messaging
from Basilisk.moduleTemplates import cppModuleTemplate
from Basilisk.utilities import SimulationBaseClass
from Basilisk.utilities import macros

import numpy as np

def addLocalStandaloneMessage(module):
    """
    This function creates a stand-alone message in the local function scope and
    subscribes the given module to it.

    Once this function returns, the message goes out of scope and is destroyed
    by the Python garbage collector, and its memory gets re-used.
    This causes a silent failure in the module, where it simply uses whatever
    value is in the re-used memory without realising it's not a message.
    """
    # create stand-alone input message
    msgData = messaging.CModuleTemplateMsgPayload()
    msgData.dataVector = [10., 20., 30.]
    msg = messaging.CModuleTemplateMsg().write(msgData)

    # connect to stand-alone msg
    module.dataInMsg.subscribeTo(msg)

def test_StandaloneMessageGoesOutOfScope():
    #  Create a sim module as an empty container
    scSim = SimulationBaseClass.SimBaseClass()

    #  create the simulation process
    dynProcess = scSim.CreateNewProcess("dynamicsProcess")

    # create the dynamics task and specify the integration update time
    dynProcess.addTask(scSim.CreateNewTask("dynamicsTask", macros.sec2nano(1.)))

    # create modules
    mod1 = cppModuleTemplate.CppModuleTemplate()
    mod1.ModelTag = "cppModule1"
    scSim.AddModelToTask("dynamicsTask", mod1)

    # setup message recording
    msgRec = mod1.dataOutMsg.recorder()
    scSim.AddModelToTask("dynamicsTask", msgRec)

    # Subscribe to the input message in a local scope, so that it doesn't exist here.
    addLocalStandaloneMessage(mod1)

    #  initialize Simulation:
    scSim.InitializeSimulation()

    #   configure a simulation stop time and execute the simulation run
    scSim.ConfigureStopTime(macros.sec2nano(3.0))
    scSim.ExecuteSimulation()

    # Confirm that the simulated data matches the expected
        expected = np.array([
        [11., 20., 30.],
        [12., 20., 30.],
        [13., 20., 30.],
        [14., 20., 30.]
    ])
    assert (msgRec.dataVector == expected).all()

if __name__ == "__main__":
    test_StandaloneMessageGoesOutOfScope()

The above test will fail, because the msg and msgData get destroyed after exiting the addLocalStandaloneMessage function scope. Because of this, the msgData.dataVector memory location gets re-used and overwritten with unknown data, but the mod1 module still reads from that location, so the msgRec.dataVector ends up with weird random garbage values in it.

Expected behavior

Any message that is subscribed to should not be allowed to get garbage collected. A simple fix is given below in newMessaging.ih, although I'm not sure if this is a totally appropriate fix (and I don't know if there are other places that would need to change). With this fix, the above test case passes.

diff --git a/src/architecture/messaging/newMessaging.ih b/src/architecture/messaging/newMessaging.ih
index b382c842ac..9cde823538 100644
--- a/src/architecture/messaging/newMessaging.ih
+++ b/src/architecture/messaging/newMessaging.ih
@@ -38,12 +38,18 @@ OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
             def subscribeTo(self, source):
                 if type(source) == messageType:
                     self.__subscribe_to(source)
+                    source.thisown = False
                     return
                 
                 try:
                     from Basilisk.architecture.messaging.messageType ## Payload import messageType ## _C
                     if type(source) == messageType ## _C:
                         self.__subscribe_to_C(source)
+                        source.thisown = False
                         return
                 except ImportError:
                     pass
-- 

I imagine a correct fix would need to take ownership of the subscribed message, or otherwise increment its reference count so that it doesn't get garbage collected until the subscription is removed. But I have zero experience with SWIG so I don't really know the best way to do this, nor the memory impact of keeping these objects alive.

Screenshots
If applicable, add screenshots/plots to help explain your problem.

Desktop (please complete the following information):

  • OS: Linux
  • Basilisk Version: 2.2.2
  • Python Version: 3.10.12

Additional context

Probably relevant? https://www.swig.org/Doc4.2/Python.html#Python_memory_management_member_variables

@joaogvcarneiro joaogvcarneiro added the bug Something isn't working label May 6, 2024
@joaogvcarneiro
Copy link
Contributor

I have come across this behavior before as well. My workaround in Python is to always output the message from the local function so that it's available on the main function and, therefore, doesn't get garbage collected.

@juan-g-bonilla I'm tagging you here because I vaguely remember you talking about this at some point. Maybe you can give some feedback or comment on the suggested approach.

@sassy-asjp
Copy link
Contributor

I don't think the quick fix @dpad wrote is appropriate as it basically intentionally leaks the source message. While it shouldn't pose a problem for most users, it seems wrong, and might cause problems in some niche cases. A good general solution would require doing the Python reference count bookkeeping correctly.

I took a stab at the problem and came up with this. While it isn't elegant, I believe this does the reference count bookkeeping correctly. It would be nice to figure out a way to integrate registerPyObjectSource into the subscribe functions, but I'm not sure if that's possible.

diff --git a/src/architecture/messaging/messaging.h b/src/architecture/messaging/messaging.h
index b7007b174..684fce3b3 100644
--- a/src/architecture/messaging/messaging.h
+++ b/src/architecture/messaging/messaging.h
@@ -24,6 +24,7 @@ ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 #include "architecture/utilities/bskLogging.h"
 #include <typeinfo>
 #include <stdlib.h>
+#include <Python.h>
 
 /*! forward-declare sim message for use by read functor */
 template<typename messageType>
@@ -39,6 +40,7 @@ private:
     messageType* payloadPointer;    //!< -- pointer to the incoming msg data
     MsgHeader *headerPointer;      //!< -- pointer to the incoming msg header
     bool initialized;               //!< -- flag indicating if the input message is connect to another message
+    PyObject *source;
     
 public:
     //!< -- BSK Logging
@@ -47,10 +49,16 @@ public:
 
 
     //! constructor
-    ReadFunctor() : initialized(false) {};
+    ReadFunctor() : initialized(false), source(nullptr) {};
 
     //! constructor
-    ReadFunctor(messageType* payloadPtr, MsgHeader *headerPtr) : payloadPointer(payloadPtr), headerPointer(headerPtr), initialized(true){};
+    ReadFunctor(messageType* payloadPtr, MsgHeader *headerPtr) : payloadPointer(payloadPtr), headerPointer(headerPtr), initialized(true), source(nullptr) {};
+
+    ~ReadFunctor() {
+        if (this->source) {
+            Py_DECREF(this->source);
+        }
+    }
 
     //! constructor
     const messageType& operator()(){
@@ -123,6 +131,14 @@ public:
         this->initialized = true;
     };
 
+    void registerPyObjectSource(PyObject *source) {
+        if (this->source) {
+            Py_DECREF(this->source);
+        }
+        this->source = source;
+        Py_INCREF(this->source);
+    }
+
 
     //! Check if self has been subscribed to a C message
     uint8_t isSubscribedToC(void *source){
diff --git a/src/architecture/messaging/newMessaging.ih b/src/architecture/messaging/newMessaging.ih
index b382c842a..4cfa9e677 100644
--- a/src/architecture/messaging/newMessaging.ih
+++ b/src/architecture/messaging/newMessaging.ih
@@ -38,12 +38,14 @@ OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
             def subscribeTo(self, source):
                 if type(source) == messageType:
                     self.__subscribe_to(source)
+                    self.registerPyObjectSource(source)
                     return
                 
                 try:
                     from Basilisk.architecture.messaging.messageType ## Payload import messageType ## _C
                     if type(source) == messageType ## _C:
                         self.__subscribe_to_C(source)
+                        self.registerPyObjectSource(source)
                         return
                 except ImportError:
                     pass

@juan-g-bonilla
Copy link
Contributor

Thanks for the summon @joaogvcarneiro . This has tripped us many times in the past, so I'd say looking for a good fix is valuable. In the past, I have just held references of these messages in the SimBaseClass object so that when that goes out of scope all messages will also get cleaned. However, this is too manual.

As @sassy-asjp says, the original fix would leak the messages, which should be a pretty small amount of memory but it would be best avoided anyway.

The second approach with counting the references seems like a better approach, but I am a bit unconfortable with having to import <Python.h> + do the python object manipulation as part of messaging.h. AFAIK, the core codebase is C and C++ only, and any handling of Python happens on SWIG, which is kind of an independent layer from the core codebase. Right now, it is possible to compile the C and C++ modules without <Python.h>, which some users might be doing. With this change, this would not be possible. Maybe other people can offer better insight into how much of a problem this would really pose.

@sassy-asjp
Copy link
Contributor

sassy-asjp commented May 9, 2024

@juan-g-bonilla

We could have void *source in messaging.h and doing all the reference counting logic in the extension in newMessaging.ih. I'm not sure if it would still handle decref on destruction correctly in all cases, and it would make it hard to use the reference counting feature outside of SWIG though.

diff --git a/src/architecture/messaging/messaging.h b/src/architecture/messaging/messaging.h
index b7007b174..bb30cd270 100644
--- a/src/architecture/messaging/messaging.h
+++ b/src/architecture/messaging/messaging.h
@@ -44,13 +44,14 @@ public:
     //!< -- BSK Logging
     BSKLogger bskLogger;            //!< -- bsk logging instance
     messageType zeroMsgPayload ={}; //!< -- zero'd copy of the message payload type
+    void *source;
 
 
     //! constructor
-    ReadFunctor() : initialized(false) {};
+    ReadFunctor() : initialized(false), source(nullptr) {};
 
     //! constructor
-    ReadFunctor(messageType* payloadPtr, MsgHeader *headerPtr) : payloadPointer(payloadPtr), headerPointer(headerPtr), initialized(true){};
+    ReadFunctor(messageType* payloadPtr, MsgHeader *headerPtr) : payloadPointer(payloadPtr), headerPointer(headerPtr), initialized(true), source(nullptr) {};
 
     //! constructor
     const messageType& operator()(){
@@ -123,7 +124,6 @@ public:
         this->initialized = true;
     };
 
-
     //! Check if self has been subscribed to a C message
     uint8_t isSubscribedToC(void *source){
         
diff --git a/src/architecture/messaging/newMessaging.ih b/src/architecture/messaging/newMessaging.ih
index b382c842a..21fabd6b1 100644
--- a/src/architecture/messaging/newMessaging.ih
+++ b/src/architecture/messaging/newMessaging.ih
@@ -34,16 +34,34 @@ OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 
 %template(messageType ## Reader) ReadFunctor<messageTypePayload>;
 %extend ReadFunctor<messageTypePayload> {
+        ~ReadFunctor() {
+            if (self->source) {
+                Py_DECREF(static_cast<PyObject*>(self->source));
+            }
+        }
+
+        void registerPyObjectSource(PyObject *source) {
+            if (self->source) {
+                Py_DECREF(static_cast<PyObject*>(self->source));
+            }
+            self->source = source;
+            if (source) {
+                Py_INCREF(source);
+            }
+        }
+
         %pythoncode %{
             def subscribeTo(self, source):
                 if type(source) == messageType:
                     self.__subscribe_to(source)
+                    self.registerPyObjectSource(source)
                     return
                 
                 try:
                     from Basilisk.architecture.messaging.messageType ## Payload import messageType ## _C
                     if type(source) == messageType ## _C:
                         self.__subscribe_to_C(source)
+                        self.registerPyObjectSource(source)
                         return
                 except ImportError:
                     pass

@sassy-asjp
Copy link
Contributor

@juan-g-bonilla @dpad

Okay, after further testing, I think correctly handling the reference count means doing it in messaging.h. Extending on the SWIG wrapper side doesn't create real class functions swig/swig#503 (comment) thus:

  1. SWIG extension destructors only get called when the ReadFunctor is directly created from Python, e.g., InMsg = messaging.CModuleTemplateMsgReader() but not when the ReadFunctor is created from C++ such as as a member of a C++ Module.
  2. A more complete implementation would need to handle the reference count correctly through copy, move, etc. operations, which like destructor would not work when done on the C++ side.
  3. A pure Python approach (effectively self.source = source instead of self.thisown = False) would also only work when the ReadFunctor is directly created from Python, otherwise the wrapper disappears after use, taking the Python side only reference with it.

If there is a requirement to be able to build C/C++ modules without Python dev dependency, then the options are:

  1. Accept the memory leak
  2. Use conditional compilation directives to enable/disable the reference counting code
  3. A pure C++ side reference count with a global table of void* references to Python objects where the C++ side count has reached zero that SWIG wrapper code will look at to manage the Python reference count. Please don't do this.

@juan-g-bonilla
Copy link
Contributor

@sassy-asjp Thanks for the detailed response. You are right that any changes on SWIG will not affect the original class definition, so any ReadFunctor created in C++ will be oblivious to any logic added in SWIG.

There is at least a way to to implement the reference counting as you describe without depending on <Python.h>. Instead of changing ReaderFunctor to hold a pointer to PyObject and then calling Py_INCREF and Py_DECREF, one can hold a function pointer that is a callback for an action that should be called when the reader unsubscribes from a source (and on destruction). On SWIG, we could INCREF before calling the C++ subscribe, while also passing a callback that calls DECREF on that object. That way, Reader doesn't need to know anything about python, it just callbacks blindly, which will perform the DECREF.

There is also the issue of C modules. These just hold objects of the type *Msg_C, and there is no difference between writers and readers. Moreover, there is no destructor that gets called to do the final reference count down. These are not necessarily unsolvable problems, but we do need to solve them.

@sassy-asjp
Copy link
Contributor

@juan-g-bonilla

I think the handing of reference counting would have to be smarter than just one callback function, as it would also need to be able to Py_INCREF to handle copy construction and copy assignment. The alternative would be to make ReadFunctor non-copyable.

Considering the C use case, it might be better that the the source message header itself have an optional place to store the pyobject reference, and increment and decrement function pointers instead of in the ReadFunctor.

For C, some cooperation from the module writer is going to be required to call some extra function to free, and maybe some rules about when/how to copy linked input messages.

@juan-g-bonilla
Copy link
Contributor

@sassy-asjp Then we can use more than one callback function and add as many as needed for copy operations. Not really arguing whether it's the correct way to go about it, just that it would be possible to implement this without having <Python.h> in messaging.h.

Maybe we can explore this further, but it's hard to see how much additional complexity would be added, and whether it's worth it, without seeing a concrete implementation that supports C modules.

@sassy-asjp
Copy link
Contributor

@juan-g-bonilla

I think we're going to use a hacky workaround on our internal fork for now. For our use cases, the impact of the memory leak is minimal.

If no one else picks up this issue in the mean time, I'd like to get around to writing a proper and complete solution eventually. However that might be a while sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: 👀 In review
5 participants