2018 02 09 Telecon Minutes

Meeting Agenda

Hardware Topology WG Kick-Off Meeting

Participants

Guillaume Mercier
Brice Goglin
Takahiro Kawashima
Shinji Sumimoto
Julien Adam
George Bosilca
Ken Raffenetti
Julien Jaeger
Knadhu
Edgar Leon
Jean-Baptiste Besnard

Summary

Highlights

Welcome to the first Telecon

We started with a discussion of the meeting schedule and agreed that Friday is a good day to meet. The Japanese attendees pointed out that the hour is late for them, so we considered alternating between two Friday slots, one at 17h GMT+1 and one at 9h GMT+1, to give every time zone a reasonable meeting time. In this context, the WG attendees agreed on the importance of taking precise notes so that everyone can catch up on the work.

How the WG is envisioned

Guillaume presented a first outline of the kick-off meeting with an organizational focus. He then stressed that he sees the WG as a place to exchange ideas, with the clear goal of proposing candidate work for the MPI standard. As a group, we therefore have to anticipate remarks and concerns from the Forum. To this end, the WG shall do its best to provide reasonable input for the benefit of the MPI interface.

A version of Guillaume's paper on topological communicator splitting is on the WG GitHub. An extended version accounting for the network topology is expected at the same place in the near future. This work is to be seen as a first proposal to initiate WG discussions.

Guillaume noted that James Dinan had pointed out some existing tickets dealing with HW topology in terms of new split type values for MPI_COMM_SPLIT_TYPE. We should look for them and account for the previous discussions in our developments.

We insisted that the WG should be as open as possible and accept input from as many people as possible. We consider it important to start from what already exists and to account for current practices (even outside MPI) in topology management. The goal is to identify whether part of this process should be exposed inside MPI and, if so, how.

Tickets

We agreed that tickets should be used freely in the repository to keep track of our discussions, including more informal or futuristic ones.

The current Comm Split proposal

Guillaume presented the comm-split proposal detailed in the paper (found on the WG GitHub). He briefly recalled how it works: it allows communicators to be split according to topological constraints (SOCKETS, L1, ...). He also explained that the current version is based on hwloc/netloc, which makes it possible to capture switch information as well as intra-node memory hierarchy information. The question of non-hierarchical networks was raised. Brice briefly explained how they would handle this issue in netloc.
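To illustrate the idea of splitting along topological constraints, here is a minimal sketch in plain Python (no MPI involved). It is not the proposal's interface: the level names in `LEVELS`, the `split_by_level` helper, and the per-rank location tuples are all hypothetical, chosen only to show how ranks sharing the same hardware path down to a given level would end up in the same sub-communicator.

```python
from collections import defaultdict

# Hypothetical hierarchy, outermost level first. These names are illustrative
# and are not the level names used in the proposal.
LEVELS = ("switch", "node", "socket")


def split_by_level(locations, level):
    """Group ranks that share the same hardware path down to `level`.

    `locations[rank]` is a tuple with one identifier per entry in LEVELS.
    The returned dict maps each path prefix (playing the role of a split
    "color") to the list of ranks in that group, in increasing rank order.
    """
    depth = LEVELS.index(level) + 1
    groups = defaultdict(list)
    for rank, path in enumerate(locations):
        groups[path[:depth]].append(rank)
    return dict(groups)


# Assumed layout: eight ranks on two nodes behind one switch,
# two sockets per node, two ranks per socket.
locations = [
    (0, 0, 0), (0, 0, 0), (0, 0, 1), (0, 0, 1),
    (0, 1, 0), (0, 1, 0), (0, 1, 1), (0, 1, 1),
]

by_node = split_by_level(locations, "node")      # two groups of four ranks
by_socket = split_by_level(locations, "socket")  # four groups of two ranks
```

In a real implementation the location of each rank would come from a library such as hwloc/netloc rather than from a hand-written table.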

The concern that exposing topology details in MPI would never be accepted by the Forum came up. It appeared clear that an abstraction is needed to describe the hardware. This raises a further question: how can such an abstraction remain portable when moving a code between machines? The current proposal already takes these aspects into account.

A solution based on hints and keyvals provided by the process manager/runtime system to build the communicator hierarchy was also discussed. This alternative should be discussed in more detail.

The principle of query functions to gather information about the HW levels seems to have consensus. Such functions are already part of the current proposal.

This was an introductory discussion; more to come in the future.

The need for feedback

We then briefly discussed what the topology is for an MPI process. We mentioned Netloc and noted that the network topology seems to be one of the most impactful factors, in addition to inter-node discovery. On that point, we noted that the ND-torus topologies of Japanese machines already face these allocation issues and provide the end user with spatially localized cores. It would be interesting to have some feedback on how this is done.

Similarly, we acknowledged that PMI and, transitively, the CGROUPs set up by Slurm (or put your batch manager here) are how the topology is currently inherited inside the MPI job. It would be interesting to see how this relates to the exposure of the topology that we want to provide in this WG.

We are interested in an interface that lets a process query who its neighbors are. Currently, it is commonly assumed that ranks close to each other are topologically close. Is it sufficient to consider a 1D map? What about more complex topologies?
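The 1D-map question can be made concrete with a small sketch. It assumes, purely for illustration, that ranks are placed in row-major order on a torus; the `torus_hops` helper and the 4x4 dimensions are hypothetical and do not describe any particular machine. Under that assumption, the difference between two rank numbers is a poor proxy for the number of network hops between them.

```python
def torus_hops(a, b, dims):
    """Minimal hop count between ranks a and b on a torus.

    Assumes row-major rank placement: the last entry of `dims` is the
    fastest-varying coordinate. Each dimension wraps around, so the
    distance along it is the shorter of the two directions.
    """
    hops = 0
    for extent in reversed(dims):
        a, coord_a = divmod(a, extent)
        b, coord_b = divmod(b, extent)
        delta = abs(coord_a - coord_b)
        hops += min(delta, extent - delta)
    return hops


dims = (4, 4)  # assumed 4x4 torus holding ranks 0..15

# Ranks 0 and 4 are four apart in rank order but direct neighbors (1 hop).
far_in_rank_close_in_hw = torus_hops(0, 4, dims)

# Ranks 3 and 4 are adjacent in rank order but 2 hops apart.
close_in_rank_far_in_hw = torus_hops(3, 4, dims)
```

A neighbor query interface would let the application ask for this kind of distance directly instead of inferring it from rank numbers.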

It appeared clear that any input is welcome and that the WG should be open in that respect, to allow a clear understanding of HW topology expectations.

On the goal

The discussion then moved to alternative approaches, including the MPI Sessions URI model (MPI://RACK, MPI://Node, MPI://Socket, ...), which by construction provides the kind of topological communicators described in the topological split proposal. How do the two relate? Some mentioned that communicators could be lighter than groups (a comm key vs. fully descriptive lists, implementation-wise). But would the info used to build the comm or the group be enough?

We then spent some time on abstracting topological levels, as the standard cannot account for HW specificities. However, a query interface may expose these specificities. We still have to work on these aspects.

Current TODOs

Revive previous tickets linked to the topology
See how Sessions URI relate to the split proposal
Have a more detailed presentation of the split proposal. Guillaume will present it at the physical meeting in Portland in February/March
Understand how MPI is subject to the topology (and at which verbosity level)
Get input from as many people as possible
Prepare tickets on the GitHub to outline our first exploration tracks
See how we can prepare the Portland F2F meeting
