Hierarchical Communicators Proposal
The original proposal relies on communicator creation to give users a way to exploit the underlying physical topology in their MPI applications. In this proposal, a communicator is created for each hardware resource used by the application processes (e.g. an L3 cache). Processes that want to leverage this resource to communicate can then do so through the corresponding communicator.
There are two types of functions proposed:
- Communicator creation functions
- Query functions
int MPI_Comm_split_type(MPI_Comm oldcomm, int split_type, int key, MPI_Info info, MPI_Comm *newcomm)
with a new value provided to split_type: MPI_COMM_TYPE_PHYSICAL_TOPOLOGY (or any suitable name, of course).
Hence, we propose to use the existing MPI_Comm_split_type function and to introduce this new value.
Here is an example of code that shows how this could be used:
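MPI_Comm newcomm[NLEVELS];
MPI_Comm oldcomm = MPI_COMM_WORLD;
int rank, idx = 0;
/* Split until no deeper level exists or the array is full. */
while (oldcomm != MPI_COMM_NULL && idx < NLEVELS) {
    MPI_Comm_rank(oldcomm, &rank);
    /* Use the current rank as key so that the rank order is preserved. */
    MPI_Comm_split_type(oldcomm, MPI_COMM_TYPE_PHYSICAL_TOPOLOGY, rank, MPI_INFO_NULL, &newcomm[idx]);
    oldcomm = newcomm[idx++];
}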
- Instead of recursively creating the communicators, why not create the whole hierarchy with a single call?
- Should we introduce several values for the split_type argument? E.g. MPI_COMM_TYPE_MEMORY, MPI_COMM_TYPE_COMPUTE, MPI_COMM_TYPE_NETWORK.
A call to MPI_Comm_split_type with this new value shall yield a communicator corresponding to the highest possible level in the hierarchy tree representing the hardware topology. This newly produced communicator can then be used as the input argument in subsequent calls to MPI_Comm_split_type to produce "children" subcommunicators that correspond to deeper levels (as seen in the code above). Also:
- The last valid communicator produced in this fashion may be identical to MPI_COMM_SELF, but not necessarily.
- Each recursively created communicator should be a strict subset of its parent (input) communicator. That is, comparing oldcomm and newcomm with MPI_Comm_compare must yield MPI_UNEQUAL. This property ensures that no unnecessary communicators are created when levels of the hardware topology are redundant. For instance, if an L3 cache and an L2 cache are shared between all processes, there is no need to create a communicator for both resources.
- If no valid communicator can be created, MPI_COMM_NULL should (obviously) be returned.
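The strict-subset rule can be illustrated without MPI. The sketch below is only a model under stated assumptions: the function name collapse_levels and the members array are hypothetical, and each entry is assumed to hold the number of processes sharing a hardware level with the calling process, from the outermost level to the innermost one. It returns the levels for which a communicator would actually be created.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not an MPI API: members[i] is the number of processes
 * of the original communicator that share hardware level i with the calling
 * process, listed from the outermost level to the innermost one. Because
 * hardware levels are nested, equal counts mean equal process sets. */
size_t collapse_levels(const int *members, size_t nlevels,
                       int parent_size, size_t *kept)
{
    size_t nkept = 0;
    int current = parent_size;   /* size of the communicator being split */

    for (size_t i = 0; i < nlevels; i++) {
        assert(members[i] >= 1 && members[i] <= current);
        if (members[i] < current) {      /* strict subset: keep this level */
            kept[nkept++] = i;
            current = members[i];
        }
        /* otherwise the level is redundant with its parent and is skipped */
    }
    return nkept;
}
```

For example, with 16 processes in the original communicator and levels shared by 8, 4, 4 and 1 processes, the third level is redundant with the second, so only three communicators would be created.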
Pictures to be added later
One other useful addition is the ability to create, at each level of the hierarchy and in the same call, an additional communicator that includes all the root processes of the hierarchical communicators. These roots communicators form a hierarchy of their own and could ease communication between the levels of the original hierarchy. This function could have the following prototype:
int MPI_Comm_hsplit_with_roots(MPI_Comm oldcomm, MPI_Info info, MPI_Comm *newcomm, MPI_Comm *rootscomm)
This function can be used in the same (recursive) way as the original MPI_Comm_split_type function. The split_type parameter is not needed here.
It is possible to create the roots communicators with the set of routines currently available in MPI. However, if the MPI library implements this, it can use its own tools (e.g. hwloc) and be more efficient (fewer collective communications).
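With the existing routines, one possible approach is a further MPI_Comm_split on the parent communicator in which only the process holding rank 0 in each new subcommunicator supplies a color, while all other processes pass MPI_UNDEFINED. The sketch below models that selection in plain C; select_roots and the color array are hypothetical names, and the choice of the lowest parent rank as root is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not an MPI API: color[r] identifies the
 * subcommunicator that rank r of the parent communicator ended up in.
 * The root of a subcommunicator is assumed to be its lowest parent rank,
 * i.e. the process holding rank 0 when the split key preserves rank order. */
size_t select_roots(const int *color, size_t nprocs, int *is_root)
{
    size_t nroots = 0;

    assert(color != NULL && is_root != NULL);
    for (size_t r = 0; r < nprocs; r++) {
        is_root[r] = 1;
        for (size_t p = 0; p < r; p++) {
            if (color[p] == color[r]) {   /* a lower rank shares the color */
                is_root[r] = 0;
                break;
            }
        }
        nroots += (size_t)is_root[r];
    }
    return nroots;
}
```

The number of roots equals the number of sibling subcommunicators, which is also the size of the resulting roots communicator.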
In order to retrieve information about a level, the following function could be called:
int MPI_Comm_get_hlevel_info(MPI_Comm comm, int *num_comms, int *index, char **type)
With:
- comm: communicator (handle)
- num_comms: number of sibling communicators (integer)
- index: communicator index (integer). It is the "rank" of the communicator among all communicators created by its parent communicator.
- type: type of communicator (string). Should be unambiguous, like L2_Cache, L3_Cache or NumaNode.
All this information should be cached by the communicator in an info object attached to it, containing a set of (key, value) pairs properly defined when the communicator comm is created with a call to MPI_Comm_split_type or MPI_Comm_hsplit_with_roots.
Creating this info object would require the use of the MPI_Comm_set_info and MPI_Comm_get_info functions.
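To make the meaning of num_comms and index concrete, here is a small model in plain C. It is only a sketch: hlevel_info_model and the color array are hypothetical, and numbering the siblings by their lowest parent rank is an assumption, since the proposal does not fix an ordering.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not an MPI API: color[r] identifies the sibling
 * communicator that rank r of the parent communicator joined. Siblings are
 * numbered in order of their lowest parent rank (an assumed convention). */
void hlevel_info_model(const int *color, size_t nprocs, size_t rank,
                       int *num_comms, int *index)
{
    assert(rank < nprocs);
    *num_comms = 0;
    *index = -1;

    for (size_t r = 0; r < nprocs; r++) {
        int first = 1;   /* is r the lowest rank with this color? */
        for (size_t p = 0; p < r; p++) {
            if (color[p] == color[r]) {
                first = 0;
                break;
            }
        }
        if (first) {
            if (color[r] == color[rank])
                *index = *num_comms;   /* position of the caller's sibling */
            (*num_comms)++;
        }
    }
}
```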
Another helpful feature would be the ability for a programmer to know the name (type) of the lowest level in the hardware hierarchy that is shared by some processes. To this end, we propose to add the following function:
int MPI_Comm_get_min_hlevel(MPI_Comm comm, int nranks, int *ranks, char **type)
With:
- comm: communicator (handle)
- nranks: number of MPI processes (integer)
- ranks: list of MPI process ranks (array)
- type: type of the resource (string)
This function returns the name of the lowest level in the hierarchy shared by all the MPI processes whose ranks in the communicator comm are listed in the ranks array.
If the rank of the calling process is not among the ranks listed in the array passed as an argument, the type returned should be Unknown or Invalid.
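The intended semantics can be sketched in plain C as a search for the deepest level at which all listed ranks sit on the same resource. Everything here is an assumption made for illustration: the three-level hierarchy, the loc table, and the names min_shared_level and level_type are hypothetical, and returning Unknown when the listed ranks share none of the modelled levels is a choice the proposal does not specify.

```c
#include <assert.h>
#include <stddef.h>

#define NLEVELS 3   /* hypothetical depth: NumaNode, L3_Cache, L2_Cache */

static const char *const level_name[NLEVELS] = {
    "NumaNode", "L3_Cache", "L2_Cache"
};

/* Hypothetical model, not an MPI API: loc[r][l] is the index of the level-l
 * resource hosting rank r, outermost level first. Returns the deepest level
 * shared by all listed ranks, or -1 if the caller is not listed or if the
 * listed ranks share none of the modelled levels. */
int min_shared_level(const int loc[][NLEVELS], int caller,
                     int nranks, const int *ranks)
{
    int listed = 0;
    int shared = -1;

    assert(nranks > 0);
    for (int i = 0; i < nranks; i++)
        if (ranks[i] == caller)
            listed = 1;
    if (!listed)
        return -1;

    for (int l = 0; l < NLEVELS; l++) {
        for (int i = 1; i < nranks; i++)
            if (loc[ranks[i]][l] != loc[ranks[0]][l])
                return shared;   /* the ranks diverge at level l */
        shared = l;
    }
    return shared;
}

/* Map a level index to the string that would be returned in type. */
const char *level_type(int level)
{
    return (level >= 0 && level < NLEVELS) ? level_name[level] : "Unknown";
}
```

Levels are compared from the outermost inward, so resource indices only need to be unique within their enclosing resource.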
Tele-con Minutes