
2018 05 31 Telecon Minutes

Besnard Jean-Baptiste edited this page Jul 9, 2018 · 2 revisions

Meeting Agenda

netloc and hwloc presentation by Brice Goglin

31/05/2018

Participants

  • Guillaume Mercier (Bordeaux-INP/Inria)
  • Jean-Baptiste BESNARD (ParaTools)
  • Shinji Sumimoto (Fujitsu)
  • Julien Jaeger (CEA)
  • Julien Adam (ParaTools)
  • Brice Goglin (Inria)
  • Edgar Leon (LLNL)

Summary

In this conference call, Brice Goglin presented his work on netloc. The call was organized as an open discussion with Brice about his work on the netloc and hwloc software.

Netloc

Work on netloc started five years ago with the involvement of Jeff Squyres, Joshua Hursey, and Brice Goglin. It had a slow start: early development identified many use cases, but the interface turned out to be highly dependent on those use cases. One of the goals is to expose the complete topology of the fabric, NIC to NIC, including the multi-NIC case.

What is supported now:

  • Extraction of the Network Topology Graph
  • Regular network descriptions (e.g. 3D Torus, Coordinates)
  • InfiniBand, when not regular, is presented as a full graph
  • Information is attached to each node (IP, MAC, hostname, …) and has to be correlated to the MPI rank manually
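
The manual correlation mentioned in the last item can be sketched as follows. This is a minimal illustration and not the netloc API: the `fabric_nodes` records, the `rank_hosts` pairs, and the `ranks_by_node` helper are all hypothetical, and it assumes each MPI process reports the hostname it runs on.

```python
from collections import defaultdict

# Hypothetical per-node records, shaped like the attributes listed above.
fabric_nodes = {
    "node01": {"ip": "10.0.0.1", "mac": "00:11:22:33:44:01"},
    "node02": {"ip": "10.0.0.2", "mac": "00:11:22:33:44:02"},
}

# Hypothetical (rank, hostname) pairs, e.g. gathered from every MPI process.
rank_hosts = [(0, "node01"), (1, "node01"), (2, "node02"), (3, "node02")]


def ranks_by_node(rank_hosts, fabric_nodes):
    """Group MPI ranks under the fabric node whose hostname they reported."""
    mapping = defaultdict(list)
    for rank, host in rank_hosts:
        if host not in fabric_nodes:
            raise KeyError(f"rank {rank} runs on unknown host {host!r}")
        mapping[host].append(rank)
    return dict(mapping)


print(ranks_by_node(rank_hosts, fabric_nodes))
```

Using the hostname as the join key is only one option; the IP or MAC address attached to each node could serve the same purpose.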

The user can get the full graph with its metrics, or rely on a synthetic view, since all networks can be presented as a fat-tree. Netloc is available in hwloc 2.0 but is not enabled by default.

Some observations on how to present "generic" network topologies:

  • The representation has to be generic, which argues for favoring the fat-tree
  • No need to support IP, only fast networks
  • Two graphs: one logical (e.g. the application communication scheme) and one topological (the actual network)
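
One way to picture the two-graph idea from the last item is to weight each logical edge by the hop distance between the hosts its endpoints are placed on. The sketch below is an assumption-laden illustration, not something presented in the call: the small `topology`, the `traffic` volumes, and the `hops` and `placement_cost` helpers are all hypothetical.

```python
from collections import deque

# Topological graph (hypothetical): two switches, four hosts.
topology = {
    "h0": ["s0"], "h1": ["s0"], "h2": ["s1"], "h3": ["s1"],
    "s0": ["h0", "h1", "s1"], "s1": ["h2", "h3", "s0"],
}

# Logical graph (hypothetical): volume exchanged between pairs of ranks.
traffic = {(0, 1): 100, (0, 2): 10, (2, 3): 100}


def hops(graph, src, dst):
    """Breadth-first search hop count between two vertices."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        vertex, dist = queue.popleft()
        if vertex == dst:
            return dist
        for nxt in graph[vertex]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"{dst} unreachable from {src}")


def placement_cost(traffic, topology, placement):
    """Sum of volume x hop distance for a rank -> host placement."""
    return sum(vol * hops(topology, placement[a], placement[b])
               for (a, b), vol in traffic.items())


# Two candidate placements of ranks onto hosts.
good = {0: "h0", 1: "h1", 2: "h2", 3: "h3"}
bad = {0: "h0", 1: "h2", 2: "h1", 3: "h3"}
print(placement_cost(traffic, topology, good))
print(placement_cost(traffic, topology, bad))
```

The placement that keeps heavily communicating ranks under the same switch yields the lower cost, which is the kind of decision the logical/topological split is meant to support.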

The discussion then moved to the point that the actual topology is not a mandatory artifact: simple metrics, such as combined metrics or implicit queries, might be more efficient and more generic. As for the classes of graph supported by netloc, these are mostly fat-trees and tori, but many more classes are currently in use; how can they be accounted for? There is a prototype that maps any network type to a fat-tree with similar characteristics, but it is naturally not optimal for non-hierarchical networks.
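
As an illustration of what an implicit query could look like, the sketch below answers "how far apart are two endpoints?" without exposing any graph. It assumes an idealized tree with uniform arity and leaves numbered from left to right; `fat_tree_hops` is a hypothetical helper, not a netloc function, and real fat-trees are rarely this regular.

```python
def fat_tree_hops(a, b, arity):
    """Hops between leaves a and b of an idealized tree with uniform arity.

    Climb one level at a time until both leaves fall under the same
    ancestor; the path goes up that many levels and back down again.
    """
    level = 0
    while a != b:
        a //= arity
        b //= arity
        level += 1
    return 2 * level


# Leaves 0 and 1 share a first-level switch; leaves 0 and 4 do not.
print(fat_tree_hops(0, 1, arity=4))
print(fat_tree_hops(0, 4, arity=4))
```

A caller only needs the leaf indices and the arity, which is why such a query stays generic across networks that can be approximated by a tree.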

One current issue with netloc is the difficulty of updating the graph dynamically (e.g. when a node goes down), as it requires a complex comparison with the existing data structure to identify "what is new", leading to a full scan.
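
The comparison described above can be sketched as a set difference between two snapshots of the graph. This is a simplified illustration under the assumption that the topology is available as a list of undirected edges; `graph_diff` is a hypothetical helper and does not reflect netloc's internal data structure.

```python
def graph_diff(old_edges, new_edges):
    """Return what appeared and disappeared between two undirected edge lists."""
    old = {frozenset(edge) for edge in old_edges}
    new = {frozenset(edge) for edge in new_edges}
    old_nodes = set().union(*old)
    new_nodes = set().union(*new)
    return {
        "added_nodes": new_nodes - old_nodes,
        "removed_nodes": old_nodes - new_nodes,
        "added_edges": new - old,
        "removed_edges": old - new,
    }


# Hypothetical snapshots: host h1 went down between the two scans.
before = [("s0", "h0"), ("s0", "h1"), ("s0", "s1"), ("s1", "h2")]
after = [("s0", "h0"), ("s0", "s1"), ("s1", "h2")]
print(graph_diff(before, after))
```

Even in this simple form, both snapshots are traversed entirely, which matches the full-scan cost noted above.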

Can netloc be used for service discovery, for example to locate I/O nodes, tools, or debuggers? In theory yes, but would this not then be PMIx? PMIx already exposes hwloc keys with the topology XML. The frontier is thin, and PMIx handles all the dynamic aspects that MPI is willing to use in Sessions.

Hwloc

What is new in 2.0? The main advantage is the support for multiple memories. The discussion on how to migrate from previous versions pointed to the migration guide in the documentation, which Brice encouraged users to rely on.