Load Balancing is slower #321
An alternative, simpler setup that can reproduce the issue on a smaller scale: run both alternatives, option A and option B, from the following config.

config.xml
<?xml version='1.0' encoding='UTF-8'?>
<mardyn version="20100525" >
<refunits type="SI" >
<length unit="nm">0.1</length>
<mass unit="u">1</mass>
<energy unit="K">1</energy>
</refunits>
<simulation type="MD" >
<integrator type="Leapfrog" >
<timestep unit="reduced" >0.00182367</timestep>
</integrator>
<run>
<currenttime>0.0</currenttime>
<production>
<steps>100</steps>
</production>
</run>
<ensemble type="NVT">
<temperature unit="reduced" >110</temperature>
<domain type="box">
<lx>100</lx>
<ly>100</ly>
<lz>100</lz>
</domain>
<components>
<moleculetype id="1" name="LJ">
<site type="LJ126" id="1" name="Ar">
<coords> <x>0.0</x> <y>0.0</y> <z>0.0</z> </coords>
<mass>1</mass>
<sigma>1</sigma>
<epsilon>1</epsilon>
<shifted>true</shifted>
</site>
</moleculetype>
</components>
<phasespacepoint>
<generator name="MultiObjectGenerator">
<objectgenerator>
<filler type="GridFiller">
<lattice system="cubic" centering="face">
<vec id="a"> <x>1</x> <y>0</y> <z>0</z> </vec>
<vec id="b"> <x>0</x> <y>1</y> <z>0</z> </vec>
<vec id="c"> <x>0</x> <y>0</y> <z>1</z> </vec>
</lattice>
<basis>
<site>
<componentid>0</componentid>
<coordinate> <x>0.5</x> <y>0.5</y> <z>0.5</z> </coordinate>
</site>
</basis>
<latticeOccupancy>1</latticeOccupancy>
<density>0.000501</density>
</filler>
<object type="Cuboid">
<lower> <x>0</x> <y>0</y> <z>0</z> </lower>
<upper> <x>100</x> <y>100</y> <z>100</z> </upper>
</object>
<velocityAssigner type="EqualVelocityDistribution"></velocityAssigner>
</objectgenerator>
</generator>
</phasespacepoint>
</ensemble>
<algorithm>
<!-- ======================== RELEVANT PART START ======================== -->
<!-- OPTION A -->
<parallelisation type="DomainDecomposition">
<timerForLoad>SIMULATION_FORCE_CALCULATION</timerForLoad>
</parallelisation>
<!-- OPTION B -->
<parallelisation type="GeneralDomainDecomposition">
<timerForLoad>SIMULATION_FORCE_CALCULATION</timerForLoad>
<updateFrequency>5</updateFrequency>
<loadBalancer type="ALL"/>
</parallelisation>
<!-- ======================== RELEVANT PART END ========================== -->
<datastructure type="LinkedCells">
<cellsInCutoffRadius>1</cellsInCutoffRadius>
</datastructure>
<cutoffs type="CenterOfMass" >
<defaultCutoff unit="reduced" >2.5</defaultCutoff>
<radiusLJ unit="reduced" >2.5</radiusLJ>
</cutoffs>
<electrostatic type="ReactionField" >
<epsilon>1.0e+10</epsilon>
</electrostatic>
</algorithm>
<output>
</output>
</simulation>
</mardyn>

Run with, e.g.:
OMP_NUM_THREADS=<threads> mpirun -n 16 --oversubscribe src/MarDyn config.xml | tee output_OptX.out
Then compare the two outputs via:
grep -Erni -A 5 'Decomposition took' output*.out
The difference from the perspective of the decompositions is that, while both call DomainDecompMPIBase::exchangeMoleculesMPI, the GeneralDomainDecomposition splits the exchange into two calls where the DomainDecomposition issues one combined call. This, however, shouldn't really make a difference, because the two calls from the GeneralDomainDecomposition should do exactly the same as the one call from the DomainDecomposition.
To pin this down, counters were thrown around every single method call in the rebalance part of GeneralDomainDecomposition::balanceAndExchange, as well as around the non-rebalance exchange. Rough sketch of how this was done:
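A condensed, self-contained sketch of the instrumentation pattern (the authoritative changes are in the diff below; the helpers timeAndCount and dumpCounters are illustrative and not part of the ls1-MarDyn code base):

#include <chrono>
#include <fstream>
#include <map>
#include <string>

// Wall-clock nanoseconds accumulated per label; one map per rank.
static std::map<std::string, long> counters;

// Times a single call and adds the elapsed nanoseconds to the named counter.
template <typename Callable>
void timeAndCount(const std::string& label, Callable&& call) {
    const auto start = std::chrono::high_resolution_clock::now();
    call();
    counters[label] += std::chrono::duration_cast<std::chrono::nanoseconds>(
                           std::chrono::high_resolution_clock::now() - start).count();
}

// Dumps the counters of one rank to a per-rank file (done in destructors in the diff).
void dumpCounters(int myRank) {
    std::ofstream logFile{"counters_r" + std::to_string(myRank)};
    for (const auto& [name, value] : counters) {
        logFile << name << " = " << value << '\n';
    }
}

int main() {
    timeAndCount("Rebalancing - exampleCall", [] { /* e.g. exchangeMoleculesMPI(...) */ });
    dumpCounters(/*myRank=*/0);
    return 0;
}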
After adding extensive timers to GeneralDomainDecomposition, NeighborAcquirer, and NeighbourCommunicationScheme, the following picture emerges.

Diff to add the counters:
diff --git a/src/parallel/GeneralDomainDecomposition.cpp b/src/parallel/GeneralDomainDecomposition.cpp
index efd6aa750..e323e0c07 100644
--- a/src/parallel/GeneralDomainDecomposition.cpp
+++ b/src/parallel/GeneralDomainDecomposition.cpp
@@ -69,7 +69,16 @@ void GeneralDomainDecomposition::initializeALL() {
double GeneralDomainDecomposition::getBoundingBoxMin(int dimension, Domain* /*domain*/) { return _boxMin[dimension]; }
-GeneralDomainDecomposition::~GeneralDomainDecomposition() = default;
+GeneralDomainDecomposition::~GeneralDomainDecomposition() {
+
+ int myRank{};
+ MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
+
+ std::ofstream logFile{"counters_r" + std::to_string(myRank)};
+ for (const auto &[name, value] : counters) {
+ logFile << name << " = " << value << std::endl;
+ }
+}
double GeneralDomainDecomposition::getBoundingBoxMax(int dimension, Domain* /*domain*/) { return _boxMax[dimension]; }
@@ -80,9 +89,31 @@ bool GeneralDomainDecomposition::queryRebalancing(size_t step, size_t updateFreq
void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bool forceRebalancing,
ParticleContainer* moleculeContainer, Domain* domain) {
+
+ if (counters.empty()) {
+ counters["Rebalancing - numIterations"] = 0;
+ counters["Rebalancing - exchangeMoleculesMPI 1"] = 0;
+ counters["Rebalancing - deleteOuterParticles"] = 0;
+ counters["Rebalancing - rebalance"] = 0;
+ counters["Rebalancing - latchToGridSize"] = 0;
+ counters["Rebalancing - migrateParticles"] = 0;
+ counters["Rebalancing - initCommPartners"] = 0;
+ counters["Rebalancing - exchangeMoleculesMPI 2"] = 0;
+ counters["SendLeavingWithCopies - numIterations"] = 0;
+ counters["SendLeavingWithCopies - exchangeMoleculesMPI"] = 0;
+ }
+
const bool rebalance =
queryRebalancing(_steps, _rebuildFrequency, _initPhase, _initFrequency, lastTraversalTime) or forceRebalancing;
+ const auto moep = [&](const auto &msg) {
+ int myRank{0};
+ MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
+ if (myRank == 0 ) {
+ std::cout << msg << "\n";
+ }
+ };
if (_steps == 0) {
+ moep("_steps == 0");
// ensure that there are no outer particles
moleculeContainer->deleteOuterParticles();
// init communication partners
@@ -90,11 +121,16 @@ void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bo
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES);
} else {
if (rebalance) {
+ ++counters["Rebalancing - numIterations"];
// first transfer leaving particles
+ auto start = std::chrono::high_resolution_clock::now();
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
+ counters["Rebalancing - exchangeMoleculesMPI 1"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
// ensure that there are no outer particles
+ start = std::chrono::high_resolution_clock::now();
moleculeContainer->deleteOuterParticles();
+ counters["Rebalancing - deleteOuterParticles"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
// rebalance
Log::global_log->debug() << "rebalancing..." << std::endl;
@@ -102,13 +138,21 @@ void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bo
Log::global_log->set_mpi_output_all();
Log::global_log->debug() << "work:" << lastTraversalTime << std::endl;
Log::global_log->set_mpi_output_root(0);
+ start = std::chrono::high_resolution_clock::now();
auto [newBoxMin, newBoxMax] = _loadBalancer->rebalance(lastTraversalTime);
+ counters["Rebalancing - rebalance"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
if (_latchGridSize.has_value()) {
+ start = std::chrono::high_resolution_clock::now();
std::tie(newBoxMin, newBoxMax) = latchToGridSize(newBoxMin, newBoxMax);
+ counters["Rebalancing - latchToGridSize"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
}
// migrate the particles, this will rebuild the moleculeContainer!
Log::global_log->debug() << "migrating particles" << std::endl;
+ start = std::chrono::high_resolution_clock::now();
migrateParticles(domain, moleculeContainer, newBoxMin, newBoxMax);
+ counters["Rebalancing - migrateParticles"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
#ifndef MARDYN_AUTOPAS
// The linked cells container needs this (I think just to set the cells to valid...)
@@ -121,13 +165,24 @@ void GeneralDomainDecomposition::balanceAndExchange(double lastTraversalTime, bo
// init communication partners
Log::global_log->debug() << "updating communication partners" << std::endl;
+ start = std::chrono::high_resolution_clock::now();
initCommPartners(moleculeContainer, domain);
+ counters["Rebalancing - initCommPartners"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
Log::global_log->debug() << "rebalancing finished" << std::endl;
+ start = std::chrono::high_resolution_clock::now();
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES);
+ counters["Rebalancing - exchangeMoleculesMPI 2"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
} else {
if (sendLeavingWithCopies()) {
+ ++counters["SendLeavingWithCopies - numIterations"];
+ const auto start = std::chrono::high_resolution_clock::now();
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_AND_HALO_COPIES);
+ counters["SendLeavingWithCopies - exchangeMoleculesMPI"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
} else {
+ moep("not sendLeavingWithCopies");
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, LEAVING_ONLY);
moleculeContainer->deleteOuterParticles();
DomainDecompMPIBase::exchangeMoleculesMPI(moleculeContainer, domain, HALO_COPIES);
diff --git a/src/parallel/GeneralDomainDecomposition.h b/src/parallel/GeneralDomainDecomposition.h
index 12dd37f02..253610c38 100644
--- a/src/parallel/GeneralDomainDecomposition.h
+++ b/src/parallel/GeneralDomainDecomposition.h
@@ -13,6 +13,7 @@
#include "particleContainer/ParticleContainer.h"
#include "DomainDecompMPIBase.h"
#include "LoadBalancer.h"
+#include "map"
/**
* This decomposition is meant to be able to call arbitrary load balancers.
@@ -94,6 +95,8 @@ public:
throw std::runtime_error("GeneralDomainDecomposition::getNeighboursFromHaloRegion() not yet implemented");
}
+ std::map<std::string, long> counters{};
+
private:
/**
* Method that initializes the ALLLoadBalancer
diff --git a/src/parallel/NeighborAcquirer.cpp b/src/parallel/NeighborAcquirer.cpp
index 7f1d00784..6b74ba891 100644
--- a/src/parallel/NeighborAcquirer.cpp
+++ b/src/parallel/NeighborAcquirer.cpp
@@ -9,6 +9,30 @@
#include "Domain.h"
#include "HaloRegion.h"
+std::map<std::string, long> NeighborAcquirer::counters{
+ {"acquireNeighbors - init", 0},
+ {"acquireNeighbors - fill send buffer", 0},
+ {"acquireNeighbors - count global receive bytes (Allgather)", 0},
+ {"acquireNeighbors - build stride", 0},
+ {"acquireNeighbors - gather regions (Allgatherv)", 0},
+ {"acquireNeighbors - parse received regions", 0},
+ {"acquireNeighbors - merge char arrays", 0},
+ {"acquireNeighbors - count regions to send (Allreduce)", 0},
+ {"acquireNeighbors - send regions to send (Isend)", 0},
+ {"acquireNeighbors - receive regions and deserialize (Get_count, Recv)", 0},
+ {"acquireNeighbors - synchronize (Wait)", 0},
+ {"acquireNeighbors - barrier (Barrier)", 0},
+};
+void NeighborAcquirer::dumpCounters() {
+ int myRank{};
+ MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
+
+ std::ofstream logFile{"countersNeighborAcqu_r" + std::to_string(myRank)};
+ for (const auto &[name, value] : counters) {
+ logFile << name << " = " << value << '\n';
+ }
+}
+
/*
* 1. Initial Exchange of all desired regions.
* 2. Each process checks whether he owns parts of the desired regions and will save those regions in partners02.
@@ -21,13 +45,16 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
const std::array<double, 3> &globalDomainLength, HaloRegion *ownRegion, const std::vector<HaloRegion> &desiredRegions,
const MPI_Comm &comm, bool excludeOwnRank) {
+ auto start = std::chrono::high_resolution_clock::now();
int my_rank{}; // my rank
MPI_Comm_rank(comm, &my_rank);
int num_processes{}; // the number of processes in comm
MPI_Comm_size(comm, &num_processes);
const auto num_regions = desiredRegions.size(); // the number of regions I would like to acquire from other processes
+ counters["acquireNeighbors - init"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+ start = std::chrono::high_resolution_clock::now();
// tell the other processes how much you are going to send
// how many bytes am I going to send to all the other processes
const int num_bytes_send =
@@ -54,24 +81,31 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
memcpy(outgoingDesiredRegionsVector.data() + bufferPosition, &region.width, sizeof(double));
bufferPosition += sizeof(double);
}
+ counters["acquireNeighbors - fill send buffer"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+ start = std::chrono::high_resolution_clock::now();
// set up structure information data for the Allgatherv operation
// vector of number of bytes I am going to receive
std::vector<int> num_bytes_receive_vec(num_processes, 0);
MPI_Allgather(&num_bytes_send, 1, MPI_INT, num_bytes_receive_vec.data(), 1, MPI_INT, comm);
+ counters["acquireNeighbors - count global receive bytes (Allgather)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
// vector of offsets (=displacement in MPI) in the receive buffer
+ start = std::chrono::high_resolution_clock::now();
std::vector<int> num_bytes_displacements(num_processes, 0);
int num_bytes_receive = 0;
for (int j = 0; j < num_processes; j++) {
num_bytes_displacements[j] = num_bytes_receive;
num_bytes_receive += num_bytes_receive_vec[j];
}
+ counters["acquireNeighbors - build stride"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+ start = std::chrono::high_resolution_clock::now();
std::vector<unsigned char> incomingDesiredRegionsVector(num_bytes_receive); // the incoming byte buffer
// send your regions
MPI_Allgatherv(outgoingDesiredRegionsVector.data(), num_bytes_send, MPI_BYTE, incomingDesiredRegionsVector.data(),
num_bytes_receive_vec.data(), num_bytes_displacements.data(), MPI_BYTE, comm);
+ counters["acquireNeighbors - gather regions (Allgatherv)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
std::vector<int> numberOfRegionsToSendToRank(num_processes, 0); // outgoing row
@@ -80,8 +114,8 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
sizeof(double) * 3 + sizeof(double) * 3 + sizeof(int) * 3 + sizeof(double) + sizeof(double) * 3;
// the regions I own and want to send: ranks<regions<regionData>>
std::vector<std::vector<std::vector<unsigned char>>> sendingList(num_processes);
-
std::vector<CommunicationPartner> comm_partners02{};
+ start = std::chrono::high_resolution_clock::now();
bufferPosition = 0;
while (bufferPosition < num_bytes_receive /*== buffer length*/) {
@@ -164,7 +198,9 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
}
}
}
+ counters["acquireNeighbors - parse received regions"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+ start = std::chrono::high_resolution_clock::now();
std::vector<std::vector<unsigned char>> merged(num_processes); // Merge each list of char arrays into one char array
for (int j = 0; j < num_processes; j++) {
if (numberOfRegionsToSendToRank[j] > 0) {
@@ -177,6 +213,7 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
merged[j] = std::move(mergedRegions);
}
}
+ counters["acquireNeighbors - merge char arrays"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
// We cannot know how many regions we are going to receive from each process a-priori.
// So we need to figure this out.
@@ -202,11 +239,13 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
* In this case, process 0 will send 3 regions to process 2 and process 1 will send 2 regions to process 3.
* After the Allreduce step every process has the information how many regions it will receive.
*/
+ start = std::chrono::high_resolution_clock::now();
std::vector<int> numberOfRegionsToReceive(num_processes, 0); // how many bytes does each process expect?
MPI_Allreduce(numberOfRegionsToSendToRank.data(), numberOfRegionsToReceive.data(), num_processes, MPI_INT, MPI_SUM, comm);
+ counters["acquireNeighbors - count regions to send (Allreduce)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
// all the information for the final information exchange has been collected -> final exchange
-
+ start = std::chrono::high_resolution_clock::now();
std::vector<MPI_Request> requests(num_processes, MPI_REQUEST_NULL);
MPI_Status probe_status;
MPI_Status rec_status;
@@ -218,7 +257,9 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
&requests[j]); // tag is one
}
}
+ counters["acquireNeighbors - send regions to send (Isend)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+ start = std::chrono::high_resolution_clock::now();
std::vector<CommunicationPartner> comm_partners01; // the communication partners
// receive data (blocking)
@@ -265,14 +306,18 @@ std::tuple<std::vector<CommunicationPartner>, std::vector<CommunicationPartner>>
region.offset, enlarged);
}
}
+ counters["acquireNeighbors - receive regions and deserialize (Get_count, Recv)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+ start = std::chrono::high_resolution_clock::now();
// ensure that all sends have been finished.
for (int j = 0; j < num_processes; j++) {
if (numberOfRegionsToSendToRank[j] > 0) MPI_Wait(&requests[j], MPI_STATUS_IGNORE);
}
+ counters["acquireNeighbors - synchronize (Wait)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
- // barrier for safety.
+ start = std::chrono::high_resolution_clock::now();
MPI_Barrier(comm);
+ counters["acquireNeighbors - barrier (Barrier)"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
return std::make_tuple(squeezePartners(comm_partners01), squeezePartners(comm_partners02));
}
diff --git a/src/parallel/NeighborAcquirer.h b/src/parallel/NeighborAcquirer.h
index 0f92e35ce..2f4f6056a 100644
--- a/src/parallel/NeighborAcquirer.h
+++ b/src/parallel/NeighborAcquirer.h
@@ -34,6 +34,8 @@ public:
static std::vector<CommunicationPartner> squeezePartners(const std::vector<CommunicationPartner>& partners);
+ static void dumpCounters();
+
private:
static bool isIncluded(HaloRegion* myRegion, HaloRegion* inQuestion);
@@ -51,5 +53,7 @@ private:
static std::pair<std::vector<HaloRegion>, std::vector<std::array<double, 3>>> getPotentiallyShiftedRegions(
const std::array<double, 3>& domainLength, const HaloRegion& region);
+ static std::map<std::string, long> counters;
+
friend class NeighborAcquirerTest;
};
diff --git a/src/parallel/NeighbourCommunicationScheme.cpp b/src/parallel/NeighbourCommunicationScheme.cpp
index ab4190d75..6b7b49b53 100644
--- a/src/parallel/NeighbourCommunicationScheme.cpp
+++ b/src/parallel/NeighbourCommunicationScheme.cpp
@@ -52,6 +52,14 @@ NeighbourCommunicationScheme::NeighbourCommunicationScheme(
}
NeighbourCommunicationScheme::~NeighbourCommunicationScheme() {
+ int myRank{};
+ MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
+
+ std::ofstream logFile{"countersNeighbourCom_r" + std::to_string(myRank)};
+ for (const auto &[name, value] : counters) {
+ logFile << name << " = " << value << std::endl;
+ }
+ NeighborAcquirer::dumpCounters();
if (_pushPull) {
delete _haloExportForceImportNeighbours;
delete _haloImportForceExportNeighbours;
@@ -433,8 +441,20 @@ void NeighbourCommunicationScheme::selectNeighbours(MessageType msgType, bool im
void DirectNeighbourCommunicationScheme::initCommunicationPartners(double cutoffRadius, Domain * domain,
DomainDecompMPIBase* domainDecomp, ParticleContainer* moleculeContainer) {
+ if (counters.empty()) {
+ counters["initArrays"] = 0;
+ counters["clearVectors"] = 0;
+ counters["ownHalo"] = 0;
+ counters["getHaloSize"] = 0;
+ counters["getHaloImportForceExportRegions"] = 0;
+ counters["getLeavingExportRegions"] = 0;
+ counters["globalDomainLength"] = 0;
+ counters["acquireNeighborsHalo"] = 0;
+ counters["acquireNeighborsLeaving"] = 0;
+ }
// corners of the process-specific domain
static_assert(DIMgeom == 3); // The initialization here assumes 3 dimensions!
+ auto start = std::chrono::high_resolution_clock::now();
const std::array<double, DIMgeom> localLowerCorner{
domainDecomp->getBoundingBoxMin(0, domain),
domainDecomp->getBoundingBoxMin(1, domain),
@@ -447,37 +467,59 @@ void DirectNeighbourCommunicationScheme::initCommunicationPartners(double cutoff
};
if (_pushPull) {
+ start = std::chrono::high_resolution_clock::now();
for (unsigned int d = 0; d < _commDimms; d++) { // why free?
(*_haloExportForceImportNeighbours)[d].clear();
(*_haloImportForceExportNeighbours)[d].clear();
(*_leavingExportNeighbours)[d].clear();
(*_leavingImportNeighbours)[d].clear();
}
+ counters["initArrays"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
} else {
for (unsigned int d = 0; d < _commDimms; d++) { // why free?
(*_neighbours)[d].clear();
}
}
+ start = std::chrono::high_resolution_clock::now();
HaloRegion ownRegion = {localLowerCorner[0], localLowerCorner[1], localLowerCorner[2], localUpperCorner[0], localUpperCorner[1], localUpperCorner[2], 0, 0, 0, cutoffRadius};
+ counters["ownRegion"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
if (_pushPull) {
+ start = std::chrono::high_resolution_clock::now();
double* const cellLength = moleculeContainer->getHaloSize();
+ counters["getHaloSize"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
// halo/force regions
+ start = std::chrono::high_resolution_clock::now();
std::vector<HaloRegion> haloOrForceRegions =
_zonalMethod->getHaloImportForceExportRegions(ownRegion, cutoffRadius, _coversWholeDomain, cellLength);
+ counters["getHaloImportForceExportRegions"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
+ start = std::chrono::high_resolution_clock::now();
std::vector<HaloRegion> leavingRegions =
_zonalMethod->getLeavingExportRegions(ownRegion, cutoffRadius, _coversWholeDomain);
+ counters["getLeavingExportRegions"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+ start = std::chrono::high_resolution_clock::now();
const std::array<double, 3> globalDomainLength{domain->getGlobalLength(0), domain->getGlobalLength(1),
domain->getGlobalLength(2)};
+ counters["globalDomainLength"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
// assuming p1 sends regions to p2
+ start = std::chrono::high_resolution_clock::now();
std::tie((*_haloImportForceExportNeighbours)[0], (*_haloExportForceImportNeighbours)[0]) =
NeighborAcquirer::acquireNeighbors(globalDomainLength, &ownRegion, haloOrForceRegions,
domainDecomp->getCommunicator(), _useSequentialFallback);
+ counters["acquireNeighborsHalo"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
// p1 notes reply, p2 notes owned as haloExportForceImport
+ start = std::chrono::high_resolution_clock::now();
std::tie((*_leavingExportNeighbours)[0], (*_leavingImportNeighbours)[0]) = NeighborAcquirer::acquireNeighbors(
globalDomainLength, &ownRegion, leavingRegions, domainDecomp->getCommunicator(), _useSequentialFallback);
+ counters["acquireNeighborsLeaving"] += std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::high_resolution_clock::now() - start).count();
+
// p1 notes reply, p2 notes owned as leaving import
} else {
diff --git a/src/parallel/NeighbourCommunicationScheme.h b/src/parallel/NeighbourCommunicationScheme.h
index a9c3d5ebc..b32171a3f 100644
--- a/src/parallel/NeighbourCommunicationScheme.h
+++ b/src/parallel/NeighbourCommunicationScheme.h
@@ -123,6 +123,10 @@ protected:
bool _pushPull;
bool _useSequentialFallback{true};
+
+public:
+ std::map<std::string, long> counters{};
+
};
class DirectNeighbourCommunicationScheme: public NeighbourCommunicationScheme {
Across ranks, the consistently highest counts show up in NeighborAcquirer::acquireNeighbors. The suspected offender is the repeated all-to-all communication (Allgather, Allgatherv, Allreduce) in acquireNeighbors, which is executed again on every rebalance via initCommPartners.
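For illustration, a stripped-down, standalone sketch of the collective pattern that runs on each such neighbor acquisition (simplified; not the actual acquireNeighbors code: the payload is a dummy stand-in for the serialized regions, and the point-to-point matching of regions is omitted):

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int numProcs{};
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    // 1) Every rank announces how many bytes of region data it will contribute (all-to-all).
    const int numBytesSend = 64;  // dummy size standing in for the serialized desired regions
    std::vector<int> numBytesReceive(numProcs, 0);
    MPI_Allgather(&numBytesSend, 1, MPI_INT, numBytesReceive.data(), 1, MPI_INT, MPI_COMM_WORLD);

    // 2) Every rank gathers the region data of all ranks; the buffer grows with the rank count.
    std::vector<int> displacements(numProcs, 0);
    int totalBytes = 0;
    for (int i = 0; i < numProcs; ++i) {
        displacements[i] = totalBytes;
        totalBytes += numBytesReceive[i];
    }
    std::vector<unsigned char> sendBuffer(numBytesSend, 0);
    std::vector<unsigned char> recvBuffer(totalBytes);
    MPI_Allgatherv(sendBuffer.data(), numBytesSend, MPI_BYTE, recvBuffer.data(),
                   numBytesReceive.data(), displacements.data(), MPI_BYTE, MPI_COMM_WORLD);

    // 3) Every rank learns how many regions it will receive (another global reduction).
    std::vector<int> regionsToSend(numProcs, 0);
    std::vector<int> regionsToReceive(numProcs, 0);
    MPI_Allreduce(regionsToSend.data(), regionsToReceive.data(), numProcs, MPI_INT, MPI_SUM,
                  MPI_COMM_WORLD);

    // 4) Point-to-point exchange of the matched regions (omitted here), followed by a barrier.
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

All of these collectives involve every rank, so their cost grows with the total number of processes regardless of how local the actual change in the domain decomposition is.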
Using load balancing (GeneralDomainDecomposition) is slower than DomainDecomposition, which is the opposite of what is reported.

Used version
ls1-MarDyn version 1.2.0_master_98d77b7d3, commit 98d77b7, compiled with gcc 12
Main Problem
The problem was tested using the droplet coalescence example. The issue was found using the last script to be run, examples/DropletCoalescene/vle/config_5_droplet.xml, on 4 nodes with 72 tasks per node on HSuper.

The process should run faster if the parallelisation node is set to GeneralDomainDecomposition and the loadBalancer node is set to all; this should be faster than setting parallelisation to DomainDecomposition. The configurations used are shown in the simulation output (not in consecutive lines) for both DomainDecomposition and GeneralDomainDecomposition.

Parallel Issues
Both configurations can be directly compared by checking the output line "Decomposition took". For the case described above, two different times are measured:
Decomposition took: 32.665 sec
Decomposition took: 9.26896 sec
These times were obtained with the settings given above.
NeighbourCommunicationScheme
Starting at the DomainDecompMPIBase class, the issue can be traced to the routine exchangeMoleculesMPI, which calls the class member _neighbourCommunicationsScheme and its member function exchangeMoleculesMPI. Using the Timer class inside that function (found in parallel/NeighbourCommunicationScheme.cpp) for the DirectNeighbourCommunicationScheme derived class shows that most of the time is spent inside the if statement that checks whether msgType == LEAVING_AND_HALO_COPIES. Further investigation is required.

Adding timers
If someone wants to add timers to check this issue, proceed as follows (a minimal usage sketch follows this list):
1. In io/TimerProfiler.cpp, modify the function readInitialTimersFromFile by adding your own category (optional) to timerClasses and your own timer to timerAttrs (mandatory).
2. In Simulation.cpp, modify preSimLoopSteps by adding the output strings for your respective timers. The names have to match in both files.
3. In the code you want to measure, fetch the timer registered with the TimerProfiler class in step 1 by using global_simulation->timers()->getTimer("NAME"), then use start and stop.
4. Timers that are not assigned to a category in the TimerProfiler should appear at the end of the timer output.
then they should be at the end.The text was updated successfully, but these errors were encountered: