Hierarchical Communicators Proposal

Guillaume Mercier edited this page Feb 20, 2018 · 1 revision

Original Proposal

The original proposal relies on communicator creation to let MPI applications access and exploit the underlying physical topology. In this proposal, a communicator is created for each hardware resource used by the application processes (e.g., an L3 cache). Hence, if some processes want to leverage this resource to communicate, they can do so by using the relevant communicator.

There are two types of functions proposed:

  • Communicator creation functions
  • Query functions

Communicator creation functions

Communicators could be recursively created with a call to:

int MPI_Comm_split_type(MPI_Comm oldcomm, int split_type, int key, MPI_Info info, MPI_Comm *newcomm)

With a new value provided to split_type: MPI_COMM_TYPE_PHYSICAL_TOPOLOGY (or any other suitable name). Hence, we propose to use the existing MPI_Comm_split_type function and to introduce this new value. Here is an example of code that shows how this could be used:

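MPI_Comm newcomm[NLEVELS];
MPI_Comm oldcomm = MPI_COMM_WORLD;
int rank, idx = 0;
/* Split until no deeper level exists (MPI_COMM_NULL) or the array is full. */
while (oldcomm != MPI_COMM_NULL && idx < NLEVELS) {
  MPI_Comm_rank(oldcomm, &rank);
  /* Use the rank in the parent as key to preserve the relative ordering. */
  MPI_Comm_split_type(oldcomm, MPI_COMM_TYPE_PHYSICAL_TOPOLOGY, rank, MPI_INFO_NULL, &newcomm[idx]);
  oldcomm = newcomm[idx++];
}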

Discussion:

  • Instead of recursively creating the communicators, why not create the whole hierarchy with a single call?
  • Should we introduce several values for the split_type argument? E.g.:
    • MPI_COMM_TYPE_MEMORY
    • MPI_COMM_TYPE_COMPUTE
    • MPI_COMM_TYPE_NETWORK

Communicator properties

A call to MPI_Comm_split_type with this new value shall yield a communicator corresponding to the highest possible level in the hierarchy tree representing the hardware topology. This newly produced communicator can then be used as an input argument in subsequent calls to MPI_Comm_split_type to produce other "children" subcommunicators that correspond to deeper levels (as seen in the code above). Also:

  • The last valid communicator produced in this fashion may be identical to MPI_COMM_SELF, but not necessarily.
  • Each recursively created new communicator should be a strict subset of its parent (input) communicator. That is, a call to MPI_Comm_compare(oldcomm, newcomm, &result) must set result to MPI_UNEQUAL. This property ensures that no unnecessary communicators are created when levels of the hardware topology are redundant. For instance, if an L3 cache and an L2 cache are shared between all processes, there is no need to create a communicator for both resources.
  • If no valid communicator is to be created, MPI_COMM_NULL should (obviously) be returned.
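The strict-subset rule can be illustrated without MPI. The following is only a sketch under stated assumptions: the table res, the function next_split_level and the three-level layout are hypothetical and not part of the proposal. res[p][l] holds the ID of the resource that hosts process p at level l, with IDs assumed unique within a level. The function returns the first level at which the current members stop sharing a single resource, which is the level where a split would actually produce smaller communicators; -1 stands in for MPI_COMM_NULL.

```c
#include <assert.h>
#include <stddef.h>

#define NLEVELS 3  /* hypothetical depth: 0 = L3 cache, 1 = L2 cache, 2 = core */

/* Returns the first level >= from at which the members no longer all share
 * one resource, or -1 (standing in for MPI_COMM_NULL) if no such level is
 * left. Levels shared by every member are skipped because splitting on them
 * would reproduce the parent communicator. */
int next_split_level(const int res[][NLEVELS], const int *members,
                     size_t nmembers, int from)
{
    assert(res != NULL && members != NULL && from >= 0);
    for (int level = from; level < NLEVELS; level++) {
        for (size_t i = 1; i < nmembers; i++) {
            if (res[members[i]][level] != res[members[0]][level])
                return level;  /* members diverge here: a strict split */
        }
    }
    return -1;
}
```

In this model, four processes that share one L3 cache and one L2 cache but run on different cores get level 2 as their first split level, so no communicator is created for either cache.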

Example

Pictures to be added later

Creation of roots communicators

Another useful addition is the ability to create, at each level of the hierarchy and in the same call, an additional communicator that includes all the root processes of the hierarchical communicators created at that level. These roots communicators form a hierarchy of their own and could ease communication between the levels of the original hierarchy. This function could have the following prototype:

int MPI_Comm_hsplit_with_roots(MPI_Comm oldcomm, MPI_Info info, MPI_Comm *newcomm, MPI_Comm *rootscomm)

This function can be used in the same (recursive) way as the original MPI_Comm_split_type function.

Note

The split_type parameter is not needed here.

Discussion

It is possible to create the roots communicators with the current set of routines available in MPI. However, if the MPI library implements this, it can use its own tools (e.g., hwloc) and will be more efficient (fewer collective communications).
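With existing routines, one way to do this is a further MPI_Comm_split on the parent communicator in which only rank 0 of each child passes a defined color and every other process passes MPI_UNDEFINED. The sketch below models only the membership selection, in plain C and without MPI. It assumes that children are created with key equal to the parent rank (as in the code above), so the root of a child is its member with the lowest parent rank. The names is_root, collect_roots and the group array are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* group[i] is the (hypothetical) child-communicator ID of parent rank i.
 * Parent rank i is a root if no lower parent rank belongs to the same group,
 * i.e. it would hold rank 0 in its child when key = parent rank. */
int is_root(const int *group, size_t n, size_t i)
{
    assert(group != NULL && i < n);
    for (size_t j = 0; j < i; j++)
        if (group[j] == group[i])
            return 0;
    return 1;
}

/* Fills roots[] with the parent ranks that would join the roots communicator
 * and returns how many there are. roots must have room for n entries. */
size_t collect_roots(const int *group, size_t n, int *roots)
{
    assert(roots != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (is_root(group, n, i))
            roots[count++] = (int)i;
    return count;
}
```

Done by hand, this selection costs an extra collective call per level, which is the overhead a library-side implementation could avoid.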

Query functions

Getting information for a specific hierarchical level

In order to retrieve information about a level, the following function could be called:

int MPI_Comm_get_hlevel_info(MPI_Comm comm, int *num_comms, int *index, char **type)

With:

  • comm: communicator (handle)
  • num_comms: number of sibling communicators (integer)
  • index: communicator index (integer). It is the "rank" of the communicator among all communicators created by its parent communicator.
  • type: type of communicator (string). Should be unambiguous, like L2_Cache, L3_Cache or NumaNode.
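To make num_comms and index concrete, here is a small plain-C model, not an MPI implementation. It assumes that each parent rank is described by the ID of the resource it lands on, that a communicator counts itself among its siblings, and that siblings are numbered in order of first appearance in parent rank order. The proposal does not fix these choices, and the name sibling_info is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ids[i] is the (hypothetical) resource ID that parent rank i lands on.
 * Writes the number of distinct IDs to *num_comms and the position of
 * ids[me] among them, in order of first appearance, to *index. */
void sibling_info(const int *ids, size_t n, size_t me,
                  int *num_comms, int *index)
{
    assert(ids != NULL && me < n && num_comms != NULL && index != NULL);
    *num_comms = 0;
    *index = -1;
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (ids[j] == ids[i]) {
                seen = 1;
                break;
            }
        }
        if (seen)
            continue;             /* ids[i] was already counted */
        if (ids[i] == ids[me])
            *index = *num_comms;  /* first appearance of my resource */
        (*num_comms)++;
    }
}
```

For instance, with resource IDs 7, 7, 3, 3, 9 over five parent ranks, the model reports three sibling communicators, and parent rank 2 belongs to the one with index 1.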

Discussion

All this information should be cached by the communicator in an info object attached to it. This object would contain a set of (key, value) pairs, properly defined when the communicator comm is created with a call to MPI_Comm_split_type or MPI_Comm_hsplit_with_roots. Attaching and reading this info object would rely on the MPI_Comm_set_info and MPI_Comm_get_info functions.

Getting the minimal level

Another helpful feature would be the ability for a programmer to know the name (type) of the lowest level in the hardware hierarchy that is shared by some processes. To this end, we propose to add the following function:

int MPI_Comm_get_min_hlevel(MPI_Comm comm, int nranks, int *ranks, char **type)

With:

  • comm: communicator (handle)
  • nranks: number of MPI processes (integer)
  • ranks: list of MPI process ranks (array)
  • type: type of the resource (string)

This function returns the name of the lowest level in the hierarchy shared by all the MPI processes whose ranks in the communicator comm are listed in the ranks array. If the rank of the calling process is not among the ranks listed in the array passed as an argument, the type returned should be Unknown or Invalid.
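The intended semantics can be sketched in plain C, without MPI. Everything below is an illustrative model rather than the proposed implementation: the table res (the resource ID of each rank at each level, unique within a level), the level names and the function min_shared_level are hypothetical. The sketch returns Invalid when the caller is not listed. Returning Unknown when the listed ranks share none of the modeled levels is an additional assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define NLEVELS 3  /* hypothetical depth, ordered from highest to lowest level */
static const char *const level_name[NLEVELS] = {
    "NumaNode", "L3_Cache", "L2_Cache"
};

/* res[r][l] is the (hypothetical) ID of the resource hosting rank r at
 * level l. Returns the name of the deepest level shared by all listed
 * ranks, "Invalid" if the caller me is not listed, or "Unknown" (an
 * assumption of this sketch) if they share no modeled level. */
const char *min_shared_level(const int res[][NLEVELS], const int *ranks,
                             size_t nranks, int me)
{
    assert(res != NULL && ranks != NULL);
    int listed = 0;
    for (size_t i = 0; i < nranks; i++)
        if (ranks[i] == me)
            listed = 1;
    if (!listed)
        return "Invalid";

    const char *type = "Unknown";
    for (int level = 0; level < NLEVELS; level++) {
        for (size_t i = 1; i < nranks; i++)
            if (res[ranks[i]][level] != res[ranks[0]][level])
                return type;       /* ranks diverge at this level */
        type = level_name[level];  /* still shared: go one level deeper */
    }
    return type;
}
```

In this model, two ranks on the same NUMA node and L3 cache but with different L2 caches yield L3_Cache, while two ranks on different NUMA nodes yield Unknown.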