DART-MPI: call into MPI every once in a while for local put/get #712
base: development
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:
@@            Coverage Diff             @@
##           development     #712   +/- ##
===============================================
+ Coverage        84.06%   85.19%   +1.12%
===============================================
  Files              336      336
  Lines            24954    24945       -9
  Branches         11349    11540     +191
===============================================
+ Hits             20977    21251     +274
  Misses            3693     3693
+ Partials           284        1     -283
@@ -30,6 +30,11 @@
 #include <math.h>
 #include <alloca.h>

+/* the number of consecutive memcpy for local put/get before calling into MPI */
+#define NUM_CONSECUTIVE_MEMCPY 16
why 16? educated guess?
Yes, I'm open to other suggestions :D
at least document, that 16 is just an educated guess
 #define NUM_CONSECUTIVE_MEMCPY 16

+/* number of performed local memcpy between calling into MPI */
+static _Thread_local int num_local_memcpy = 0;
this needs C11, is this already ensured by CMake? I also think it's better to use `<threads.h>` and `thread_local` here.
and no `= 0` needed
You're right, we don't officially require a C11 compiler. I am not going to open the discussion on whether we should abandon support for >20 year old compilers though...
// use direct memcpy if we are on the same unit
memcpy(dest, seginfo->selfbaseptr + offset,
       nelem * dart__mpi__datatype_sizeof(dtype));
DART_LOG_DEBUG("dart_get: memcpy nelem:%zu "
               "source (coll.): offset:%lu -> dest: %p",
               nelem, offset, dest);
num_local_memcpy = 0;
return DART_OK;
}
`get_shared_mem` and `put_shared_mem` also call `memcpy`, are they also affected by this?
They do not, as that doesn't depend on progress triggered locally.
Compare: 6105977 to 151d79d
After having given this some more thought I modified the PR to always call into MPI for local memory accesses. If fast local access without progress is required the application should just dereference the native pointer provided by DASH. Once we are in DART we have lost the latency race anyway. Everything else just adds complexity.
+1
We need to call into MPI every once in a while as otherwise polling on a local variable may not trigger progress if that is needed.