Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust transfer methods behaviour when interrupted #308

Merged
merged 4 commits into from
Sep 16, 2020

Conversation

rbx
Copy link
Member

@rbx rbx commented Sep 15, 2020

Currently, a Stop transition during the Running state will interrupt the transfers. Blocking transfer calls (Send/Receive) will return -2 (timeout). If another transfer is attempted before transitioning to Ready state, the behaviour of zmq/shmem transport is inconsistent. Another call to zmq transfer will attempt to complete the transfer with a certain timeout, while shmem transfer will not attempt any transfer and immediately return -2 again.

There are scenarios when doing another transfer can be useful. One such scenario is to send an end-of-stream message when Stop happens. See some discussion of such a use case here: https://alice.its.cern.ch/jira/browse/O2-1499

This PR:

  • gives names to different return values of the transfer methods: TransferResult::error, TransferResult::timeout, TransferResult::interrupted.
  • makes the behaviour of zmq/shmem transport consistent - they will attempt a transfer with a timeout even if transport has been interrupted. If such timeout occurs, they will return TransferResult::interrupted.

A transer is attempted even if the transport has been interrupted
(with a timeout). When the timeout is reached, transfer methods will
return TransferResult::interrupted (-3).
Copy link
Member

@dennisklein dennisklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes me think if we should take this change as an opportunity to move fairmq/sdk/Error.* up to fairmq/ and add the new named transfer error codes there instead of creating a new enum?

@rbx
Copy link
Member Author

rbx commented Sep 16, 2020

Fine with me, as long as they keep their numerical values of -1 and -2, for backwards-compatibility.

@rbx
Copy link
Member Author

rbx commented Sep 16, 2020

At some point we will change the return value of the transfer methods, see #85. But for now i would not break it.

@dennisklein
Copy link
Member

ok, then let's do it at a later point in time when we touch the Send/Receive function signatures anyways.

@dennisklein
Copy link
Member

161/190 Testing: SDK.Topology.AsyncSetPropertiesTimeout
161/190 Test: SDK.Topology.AsyncSetPropertiesTimeout
Command: "/Users/alfaci/dklein/FairMQ/build/test/testsuite_SDK" "--gtest_filter=Topology.AsyncSetPropertiesTimeout" "--gtest_also_run_disabled_tests"
Directory: /Users/alfaci/dklein/FairMQ/build
"SDK.Topology.AsyncSetPropertiesTimeout" start time: Sep 16 12:13 CEST
Output:
----------------------------------------------------------
Note: Google Test filter = Topology.AsyncSetPropertiesTimeout
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Topology
[ RUN      ] Topology.AsyncSetPropertiesTimeout
[ESC[01;36m12:13:32.661752ESC[0m][ESC[01;32mINFOESC[0m] $DDS_LOCATION: "/Users/alfaci/fairsoft/fairmq_dev", $DDS_CONFIG_HOME: "/Users/alfaci/dklein/FairMQ/build/test/.DDS"
[ESC[01;36m12:13:32.661794ESC[0m][ESC[01;32mINFOESC[0m] $DDS_SESSION_ID: e23cf9dc-124d-49c3-b334-9d536dceccae
[ESC[01;36m12:13:32.661801ESC[0m][ESC[01;32mINFOESC[0m] DDS topology: ExampleDDS (loaded from "/Users/alfaci/dklein/FairMQ/test/sdk/test_topo.xml")
[ESC[01;36m12:13:33.474685ESC[0m][ESC[01;32mINFOESC[0m] system:0
/Users/alfaci/dklein/FairMQ/test/sdk/_topology.cxx:367: Failure
Expected equality of these values:
  ec
    Which is: system:0
  MakeErrorCode(ErrorCode::OperationTimeout)
    Which is: fairmq:11
[  FAILED  ] Topology.AsyncSetPropertiesTimeout (1962 ms)
[----------] 1 test from Topology (1962 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (1962 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Topology.AsyncSetPropertiesTimeout

 1 FAILED TEST
<end of output>
Test time =   1.97 sec
----------------------------------------------------------
Test Failed.
"SDK.Topology.AsyncSetPropertiesTimeout" end time: Sep 16 12:13 CEST
"SDK.Topology.AsyncSetPropertiesTimeout" time elapsed: 00:00:01
----------------------------------------------------------

Looks like on the new macOS 10.15 machine, the operation completes within 1 millisecond^^

@dennisklein
Copy link
Member

diff --git a/fairmq/sdk/Topology.h b/fairmq/sdk/Topology.h
index 7f3ab9a..557d471 100644
--- a/fairmq/sdk/Topology.h
+++ b/fairmq/sdk/Topology.h
@@ -464,7 +464,7 @@ class BasicTopology : public AsioBase<Executor, Allocator>
         }
     }

-    using Duration = std::chrono::milliseconds;
+    using Duration = std::chrono::microseconds;
     using ChangeStateCompletionSignature = void(std::error_code, TopologyState);

   private:
diff --git a/test/sdk/_topology.cxx b/test/sdk/_topology.cxx
index 54027e9..183da18 100644
--- a/test/sdk/_topology.cxx
+++ b/test/sdk/_topology.cxx
@@ -361,7 +361,7 @@ TEST_F(Topology, AsyncSetPropertiesTimeout)

     topo.AsyncSetProperties({{"key1", "val1"}},
                             "",
-                            std::chrono::milliseconds(1),
+                            std::chrono::microseconds(1),
                             [=](std::error_code ec, sdk::FailedDevices) mutable {
                                 LOG(info) << ec;
                                 EXPECT_EQ(ec, MakeErrorCode(ErrorCode::OperationTimeout));

fixes it on citmac001.

@dennisklein
Copy link
Member

Another error class I see (also on current dev branch):

62/188 Testing: Device.ErrorState.static_InConnect
62/188 Test: Device.ErrorState.static_InConnect
Command: "/Users/alfaci/dklein/FairMQ/build2/test/testsuite_Device" "--gtest_filter=ErrorState.static_InConnect" "--gtest_also_run_disabled_tests"
Directory: /Users/alfaci/dklein/FairMQ/build2
"Device.ErrorState.static_InConnect" start time: Sep 16 12:40 CEST
Output:
----------------------------------------------------------
Note: Google Test filter = ErrorState.static_InConnect
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ErrorState
[ RUN      ] ErrorState.static_InConnect
[ErrorFound IN Connect]/Users/alfaci/dklein/FairMQ/build2/test/testhelper_runTestDevice --id error_state_Connect_ --control static --session 4277161072062211234 --color false
[ErrorFound IN Connect]fair::mq::tools::execute: pid: 71236
[ErrorFound IN Connect][12:40:21][INFO]
[ErrorFound IN Connect]      ______      _    _______  _________
[ErrorFound IN Connect]     / ____/___ _(_)_______   |/  /_  __ \    version 1.4.14-88-g5e97d85
[ErrorFound IN Connect]    / /_  / __ `/ / ___/__  /|_/ /_  / / /    build RelWithDebInfo
[ErrorFound IN Connect]   / __/ / /_/ / / /    _  /  / / / /_/ /     https://github.com/FairRootGroup/FairMQ
[ErrorFound IN Connect]  /_/    \__,_/_/_/     /_/  /_/  \___\_\     LGPL-3.0  © 2012-2020 GSI
[ErrorFound IN Connect]
[ErrorFound IN Connect][12:40:21][STATE] Starting FairMQ state machine --> IDLE
[ErrorFound IN Connect][12:40:21][STATE] IDLE ---> INITIALIZING DEVICE
[ErrorFound IN Connect][12:40:21][WARN] fair::mq::plugins::Config: no channels configuration provided via --mq-config or --channel-config
[ErrorFound IN Connect][12:40:21][STATE] INITIALIZING DEVICE ---> INITIALIZED
[ErrorFound IN Connect][12:40:21][STATE] INITIALIZED ---> BINDING
[ErrorFound IN Connect][12:40:21][STATE] BINDING ---> BOUND
[ErrorFound IN Connect][12:40:21][STATE] BOUND ---> CONNECTING
[ErrorFound IN Connect][12:40:21][WARN] No channels created after finishing initialization
[ErrorFound IN Connect][12:40:21][STATE] Transition AUTO incoming, but another state transition is already ongoing.
[ErrorFound IN Connect][12:40:21][STATE] CONNECTING ---> ERROR
[ErrorFound IN Connect][12:40:22][STATE] Exiting FairMQ state machine
[ErrorFound IN Connect][12:40:22][ERROR] Uncaught exception reached the top of main: Device transitioned to error state
[ErrorFound IN Connect]fair::mq::tools::execute: exit code: 1
/Users/alfaci/dklein/FairMQ/test/device/_error_state.cxx:92: Failure
Death test: RunErrorStateIn("Connect", "static")
    Result: failed to die.
 Error msg:
[  DEATH   ]
[  FAILED  ] ErrorState.static_InConnect (117 ms)
[----------] 1 test from ErrorState (117 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (117 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ErrorState.static_InConnect

 1 FAILED TEST
<end of output>
Test time =   0.12 sec
----------------------------------------------------------
Test Failed.
"Device.ErrorState.static_InConnect" end time: Sep 16 12:40 CEST
"Device.ErrorState.static_InConnect" time elapsed: 00:00:00
----------------------------------------------------------

This happens for all Device.ErrorState.* and all Device.Signal_SIGINT.* tests.

@rbx
Copy link
Member Author

rbx commented Sep 16, 2020

The errors in Device.ErrorState.* and Device.Signal_SIGINT.* we have seen before also, from time to time. But the output of the test looks correct.

Are the results from macOS10.15 uploaded to cdash?

@dennisklein
Copy link
Member

The errors in Device.ErrorState.* and Device.Signal_SIGINT.* we have seen before also, from time to time. But the output of the test looks correct.

Are the results from macOS10.15 uploaded to cdash?

Nope, something fails with the certificate. I guess the Spack cmake does not have access to the correct ca certificate list. That's why I posted logs here.

@dennisklein
Copy link
Member

diff --git a/test/device/_error_state.cxx b/test/device/_error_state.cxx
index 61a06f0..60bc6ac 100644
--- a/test/device/_error_state.cxx
+++ b/test/device/_error_state.cxx
@@ -74,7 +74,7 @@ void RunErrorStateIn(const string& state, const string& control, const string& i

     device_thread.join();

-    ASSERT_NE(string::npos, result.console_out.find("going to change to Error state from " + state + "()"));
+ //   ASSERT_NE(string::npos, result.console_out.find("going to change to Error state from " + state + "()"));

     exit(result.exit_code);
 }

This fixes it on citmac001. Somehow this assertion does not work out on macOS 10.15.

@dennisklein
Copy link
Member

I think I understand why, with newer FairLogger debug log is disabled, which is why asserted log line is missing.

@rbx
Copy link
Member Author

rbx commented Sep 16, 2020

diff --git a/fairmq/sdk/Topology.h b/fairmq/sdk/Topology.h
index 7f3ab9a..557d471 100644
--- a/fairmq/sdk/Topology.h
+++ b/fairmq/sdk/Topology.h
@@ -464,7 +464,7 @@ class BasicTopology : public AsioBase<Executor, Allocator>
         }
     }

-    using Duration = std::chrono::milliseconds;
+    using Duration = std::chrono::microseconds;
     using ChangeStateCompletionSignature = void(std::error_code, TopologyState);

   private:
diff --git a/test/sdk/_topology.cxx b/test/sdk/_topology.cxx
index 54027e9..183da18 100644
--- a/test/sdk/_topology.cxx
+++ b/test/sdk/_topology.cxx
@@ -361,7 +361,7 @@ TEST_F(Topology, AsyncSetPropertiesTimeout)

     topo.AsyncSetProperties({{"key1", "val1"}},
                             "",
-                            std::chrono::milliseconds(1),
+                            std::chrono::microseconds(1),
                             [=](std::error_code ec, sdk::FailedDevices) mutable {
                                 LOG(info) << ec;
                                 EXPECT_EQ(ec, MakeErrorCode(ErrorCode::OperationTimeout));

fixes it on citmac001.

Let's make it a template arg to the methods to avoid any breakage, like we do in FairMQDevice:

    template<typename Rep, typename Period>
    bool WaitFor(std::chrono::duration<Rep, Period> const& duration)
    {
        return !fStateMachine.WaitForPendingStateFor(std::chrono::duration_cast<std::chrono::milliseconds>(duration).count());
    }

And the internal value we can store in microseconds.

@dennisklein
Copy link
Member

    template<typename Rep, typename Period>
    bool WaitFor(std::chrono::duration<Rep, Period> const& duration)
    {
        return !fStateMachine.WaitForPendingStateFor(std::chrono::duration_cast<std::chrono::milliseconds>(duration).count());
    }

This would allow any granularity. My original idea was to require a minimal resolution. See https://godbolt.org/z/c7debW

@rbx
Copy link
Member Author

rbx commented Sep 16, 2020

Ok, minimal is enough. Didn't realize it can convert downwards automatically. Want to push it?

@dennisklein
Copy link
Member

diff --git a/test/CMakeLists.txt b/test/CMakeLists.txt
index 53df7fa..0b12f3e 100644
--- a/test/CMakeLists.txt
+++ b/test/CMakeLists.txt
@@ -12,6 +12,18 @@ include(GTestHelper)
 # FairMQ Testsuites/helpers #
 #############################

+if(FairLogger_VERSION VERSION_LESS 1.9.0 AND FairLogger_VERSION VERSION_GREATER_EQUAL 1.7.0)
+  LIST(APPEND definitions FAIR_MIN_SEVERITY=trace)
+endif()
+
+if(BUILD_OFI_TRANSPORT)
+  LIST(APPEND definitions BUILD_OFI_TRANSPORT)
+endif()
+
+if(definitions)
+  set(definitions DEFINITIONS ${definitions})
+endif()
+
 add_testhelper(runTestDevice
     SOURCES
     helper/runTestDevice.cxx
@@ -30,16 +42,9 @@ add_testhelper(runTestDevice
     helper/devices/TestExceptions.h

     LINKS FairMQ
+    ${definitions}
 )

-if(BUILD_OFI_TRANSPORT)
-  LIST(APPEND definitions BUILD_OFI_TRANSPORT)
-endif()
-
-if(definitions)
-  set(definitions DEFINITIONS ${definitions})
-endif()
-
 set(MQ_CONFIG "${CMAKE_BINARY_DIR}/test/testsuite_FairMQ.IOPatterns_config.json")
 set(RUN_TEST_DEVICE "${CMAKE_BINARY_DIR}/test/testhelper_runTestDevice")
 set(FAIRMQ_BIN_DIR ${CMAKE_BINARY_DIR}/fairmq)

This fixes the other tests.

Want to push it?

ya, I'll push both fixes in a minute.

@dennisklein
Copy link
Member

https://cdash.gsi.de/testDetails.php?test=7170379&build=246945 @rbx I have seen this one already, does not happen always.

@rbx
Copy link
Member Author

rbx commented Sep 16, 2020

https://cdash.gsi.de/testDetails.php?test=7170379&build=246945 @rbx I have seen this one already, does not happen always.

It fails to create some shmem ressource. Looks like this particular test doesn't use a unique id for the session. I'll push a fix in a bit.

@dennisklein dennisklein merged commit 4a15a38 into FairRootGroup:dev Sep 16, 2020
@rbx rbx deleted the transfer-result-interrupted branch March 2, 2021 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants