Consider the remaining load capacity in main loop
This changes CanRunMore() to return a size_t instead of a bool. The
return value is the "remaining load capacity", that is, the number of
new jobs that can be spawned without saturating a load limit that may
be in effect (if ninja's -l option is used). We assume that every
started edge increases the load by one. Hence the available load
capacity is the maximum allowed load minus the current load.

Previously, ninja would oversaturate the system with jobs when
multiple ninja builds were running, even though a load and job limit
was provided. This is because changes in the load average are inert:
newly started jobs do not immediately change the load average, yet
ninja assumed that new jobs are immediately reflected in it. Ninja
would retrieve the current 1-minute load average, check whether it is
below the limit, start a new job if so, and repeat. Since it takes a
while for a new job to be reflected in the load average, ninja would
often spawn jobs until the job limit ("-j") was reached. If this is
done by multiple parallel ninja builds, the system becomes
oversaturated, causing excessive context switches, which eventually
slow down each and every build process.

We can easily prevent this by considering the remaining load capacity
in ninja's main loop.

The following benchmark demonstrates how the change of this commit
helps to speed up multiple parallel builds on the same host. We
compare the total build times of 8 parallel builds of LLVM on a
256-core system using "ninja -l 258".

ninja-master:        1351 seconds
ninja-load-capacity:  920 seconds

That is, with this commit, the whole process becomes roughly 1.47× faster.

The benchmark script creates and prepares 8 build directories,
records the start time, spawns 8 subshells invoking "ninja -l 258",
awaits the termination of those subshells, and records the end
time. Besides the total running time, it also outputs /proc/loadavg,
which provides an indication of where the performance is gained:

ninja-master:         3.90  93.94 146.38 1/1936 209125
ninja-load-capacity: 92.46 210.50 199.90 1/1936 36917

So with this change, ninja uses the available hardware cores better in
the presence of competing ninja processes, while it does not overload
the system.

Finally, let us look at the two "dstat -cdgyl 60" traces of 8
parallel LLVM builds on a 256-core machine using "ninja -l 258":

ninja-master
--total-cpu-usage-- -dsk/total- ---paging-- ---system-- ---load-avg---
usr sys idl wai stl| read  writ|  in   out | int   csw | 1m   5m  15m
  1   0  99   0   0|  12k 4759k|   5B   55B|1135   455 |17.9 70.3 38.1
 38   6  56   0   0|2458B 7988k| 205B    0 |  34k   23k| 466  170 73.2
 26   3  71   0   0| 102k   94M|   0     0 |  22k 6265 | 239  156 74.3
 50   5  45   0   0|3149B   97M|   0     0 |  37k   12k| 257  191 92.2
 58   6  36   0   0|  90k   71M|   0     0 |  43k   12k| 320  224  110
 50   4  46   0   0|  52k   78M|   0     0 |  38k 6690 | 247  223  117
 50   5  45   0   0| 202k   90M|   0     0 |  37k 9876 | 239  238  130
 60   5  34   0   0| 109k   93M|   0     0 |  44k 8950 | 247  248  140
 69   5  26   0   0|5939B   93M|   0     0 |  50k   11k| 309  268  154
 49   4  47   0   0| 172k  111M|   0     0 |  36k 7835 | 283  267  161
 58   7  35   0   0|  29k  142M|   0     0 |  45k 7666 | 261  267  168
 72   4  24   0   0|  46k  281M|   0     0 |  50k   13k| 384  296  183
 49   6  46   0   0|  68B  198M|   0     0 |  37k 6847 | 281  281  185
 82   6  12   0   0|   0    97M|   0     0 |  59k   15k| 462  323  205
 31   5  63   0   0|   0   301M|   0     0 |  26k 5350 | 251  291  202
 66   7  28   0   0|  68B  254M|   0     0 |  49k 9091 | 270  292  208
 68   8  25   0   0|   0   230M|   0     0 |  51k 8186 | 287  292  213
 52   5  42   1   0|   0   407M|   0     0 |  42k 5619 | 207  271  211
 29   7  64   0   0|   0   418M|   0     0 |  27k 2801 | 131  241  205
  1   1  98   0   0| 137B  267M|   0     0 |1944   813 |55.8  199  193
  0   0 100   0   0|2253B   43M|   0     0 | 582   365 |26.8  165  181
  0   0  99   0   0|   0    68M|   0     0 | 706   414 |11.5  136  170
  4   0  96   0   0|   0    13M|   0     0 |2892   378 |10.0  113  160

ninja-load-capacity
--total-cpu-usage-- -dsk/total- ---paging-- ---system-- ---load-avg---
usr sys idl wai stl| read  writ|  in   out | int   csw | 1m   5m  15m
  1   0  98   0   0|  12k 5079k|   5B   55B|1201   470 |1.35 40.2  115
 43   6  51   0   0|3345B   78M|   0     0 |  34k   20k| 247  127  142
 71   6  23   0   0|   0    59M|   0     0 |  53k 8485 | 286  159  152
 60   5  35   0   0|  68B  118M|   0     0 |  45k 7125 | 277  178  158
 62   4  35   0   0|   0   115M|   0     0 |  45k 6036 | 248  188  163
 61   5  34   0   0|   0    96M|   0     0 |  44k 9448 | 284  212  173
 66   5  28   0   0|   9B   94M|   0     0 |  49k 5733 | 266  219  178
 64   7  29   0   0|   0   159M|   0     0 |  49k 6350 | 241  223  182
 66   6  28   0   0|   0   240M|   0     0 |  50k 9325 | 285  241  191
 68   4  27   0   0|   0   204M|   0     0 |  49k 5550 | 262  241  194
 68   8  24   0   0|   0   161M|   0     0 |  53k 6368 | 255  244  198
 79   7  14   0   0|   0   325M|   0     0 |  59k 5910 | 264  249  202
 72   6  22   0   0|   0   367M|   0     0 |  54k 6684 | 253  249  205
 71   6  22   1   0|   0   377M|   0     0 |  52k 8175 | 284  257  211
 48   8  44   0   0|   0   417M|   0     0 |  40k 5878 | 223  247  210
 23   4  73   0   0|   0   238M|   0     0 |  22k 1644 | 114  214  201
  0   0 100   0   0|   0   264M|   0     0 |1016   813 |43.3  175  189
  0   0 100   0   0|   0    95M|   0     0 | 670   480 |17.1  144  177

As one can see in the above dstat traces, ninja-master reaches a very
high 1-minute load average of up to 462. This is because ninja does
not consider the remaining load capacity when spawning new jobs, but
instead spawns new jobs until it runs into the -j limit. This, in
turn, causes an increase in context switches: the rows with a high
1-minute load average also show >10k context switches (csw). A
load-capacity-aware ninja, in contrast, avoids oversaturating the
system with excessive additional jobs.

Note that since the load average is an exponentially damped moving
sum, build systems that consult it to limit the load to the number of
available processors will always (slightly) overprovision the system
with tasks. This change reduces the aggressiveness with which ninja
schedules new jobs if the '-l' knob is used, and thereby lowers the
level of overprovisioning to a reasonable degree compared to the
status quo. It should be mentioned that an individual build using
'-l' may now be a bit slower. However, this can easily be fixed by
increasing the value passed to the '-l' argument.

The benchmarks were performed using the following script:

#!/usr/bin/env bash
set -euo pipefail

VANILLA_NINJA=~/code/ninja-master/build/ninja
LOAD_CAPACITY_AWARE_NINJA=~/code/ninja-load-capacity/build/ninja
CMAKE_NINJA_PROJECT_SOURCE=~/code/llvm-project/llvm

declare -ir PARALLEL_BUILDS=8
readonly TMP_DIR=$(mktemp --directory --tmpdir=/var/tmp)

cleanup() {
    rm -rf "${TMP_DIR}"
}
trap cleanup EXIT

BUILD_DIRS=()
echo "Preparing build directories"
for i in $(seq 1 ${PARALLEL_BUILDS}); do
	BUILD_DIR="${TMP_DIR}/${i}"
	mkdir "${BUILD_DIR}"
	(
		cd "${BUILD_DIR}"
		cmake -G Ninja "${CMAKE_NINJA_PROJECT_SOURCE}" \
			&> "${BUILD_DIR}/build.log"
	)&
	BUILD_DIRS+=("${BUILD_DIR}")
done
wait

NPROC=$(nproc)
MAX_LOAD=$(echo "${NPROC} + 2" | bc )
SLEEP_SECONDS=300

NINJA_BINS=(
    "${VANILLA_NINJA}"
    "${LOAD_CAPACITY_AWARE_NINJA}"
)
LAST_NINJA_BIN="${LOAD_CAPACITY_AWARE_NINJA}"

for NINJA_BIN in "${NINJA_BINS[@]}"; do
	echo "Cleaning build dirs"
	for BUILD_DIR in "${BUILD_DIRS[@]}"; do
		(
			"${NINJA_BIN}" -C "${BUILD_DIR}" clean &> "${BUILD_DIR}/build.log"
		)&
	done
	wait

	echo "Starting ${PARALLEL_BUILDS} parallel builds with ${NINJA_BIN} using -l ${MAX_LOAD}"
	START=$(date +%s)
	for BUILD_DIR in "${BUILD_DIRS[@]}"; do
		(
			"${NINJA_BIN}" -C "${BUILD_DIR}" -l "${MAX_LOAD}" &> "${BUILD_DIR}/build.log"
		)&
	done
	wait
	STOP=$(date +%s)

	DELTA_SECONDS=$((STOP - START))
	echo "Using ${NINJA_BIN} to perform ${PARALLEL_BUILDS} builds of ${CMAKE_NINJA_PROJECT_SOURCE}"
	echo "took ${DELTA_SECONDS} seconds on this ${NPROC} core system using -l ${MAX_LOAD}"
	echo "/proc/loadavg:"
	cat /proc/loadavg
	echo "ninja --version:"
	"${NINJA_BIN}" --version

	if [[ "${NINJA_BIN}" != "${LAST_NINJA_BIN}" ]]; then
	    echo "Sleeping ${SLEEP_SECONDS} seconds to bring system into quiescent state"
	    sleep ${SLEEP_SECONDS}
	fi
done
Flowdalic committed Sep 9, 2021
1 parent ffb47fc commit 420c81b
Showing 3 changed files with 49 additions and 17 deletions.
53 changes: 40 additions & 13 deletions src/build.cc
@@ -18,6 +18,8 @@
 #include <errno.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <climits>
+#include <stdint.h>
 #include <functional>
 
 #if defined(__SVR4) && defined(__sun)
@@ -46,16 +48,16 @@ struct DryRunCommandRunner : public CommandRunner {
   virtual ~DryRunCommandRunner() {}
 
   // Overridden from CommandRunner:
-  virtual bool CanRunMore() const;
+  virtual size_t CanRunMore() const;
   virtual bool StartCommand(Edge* edge);
   virtual bool WaitForCommand(Result* result);
 
  private:
   queue<Edge*> finished_;
 };
 
-bool DryRunCommandRunner::CanRunMore() const {
-  return true;
+size_t DryRunCommandRunner::CanRunMore() const {
+  return SIZE_MAX;
 }
 
 bool DryRunCommandRunner::StartCommand(Edge* edge) {
@@ -437,7 +439,7 @@ void Plan::Dump() const {
 struct RealCommandRunner : public CommandRunner {
   explicit RealCommandRunner(const BuildConfig& config) : config_(config) {}
   virtual ~RealCommandRunner() {}
-  virtual bool CanRunMore() const;
+  virtual size_t CanRunMore() const;
   virtual bool StartCommand(Edge* edge);
   virtual bool WaitForCommand(Result* result);
   virtual vector<Edge*> GetActiveEdges();
@@ -460,12 +462,26 @@ void RealCommandRunner::Abort() {
   subprocs_.Clear();
 }
 
-bool RealCommandRunner::CanRunMore() const {
+size_t RealCommandRunner::CanRunMore() const {
   size_t subproc_number =
       subprocs_.running_.size() + subprocs_.finished_.size();
-  return (int)subproc_number < config_.parallelism
-      && ((subprocs_.running_.empty() || config_.max_load_average <= 0.0f)
-          || GetLoadAverage() < config_.max_load_average);
+
+  long capacity = config_.parallelism - subproc_number;
+
+  if (config_.max_load_average > 0.0f) {
+    int load_capacity = config_.max_load_average - GetLoadAverage();
+    if (load_capacity < capacity)
+      capacity = load_capacity;
+  }
+
+  if (capacity < 0)
+    capacity = 0;
+
+  if (!capacity && subprocs_.running_.empty())
+    // Ensure that we make progress.
+    capacity = 1;
+
+  return capacity;
 }
 
 bool RealCommandRunner::StartCommand(Edge* edge) {
@@ -596,8 +612,13 @@ bool Builder::Build(string* err) {
   // Second, we attempt to wait for / reap the next finished command.
   while (plan_.more_to_do()) {
     // See if we can start any more commands.
-    if (failures_allowed && command_runner_->CanRunMore()) {
-      if (Edge* edge = plan_.FindWork()) {
+    if (failures_allowed) {
+      size_t capacity = command_runner_->CanRunMore();
+      while (capacity > 0) {
+        Edge* edge = plan_.FindWork();
+        if (!edge)
+          break;
+
         if (edge->GetBindingBool("generator")) {
           scan_.build_log()->Close();
         }
@@ -616,11 +637,17 @@ bool Builder::Build(string* err) {
           }
         } else {
           ++pending_commands;
-        }
 
-        // We made some progress; go back to the main loop.
-        continue;
+          --capacity;
+
+          // Re-evaluate capacity.
+          size_t current_capacity = command_runner_->CanRunMore();
+          if (current_capacity < capacity)
+            capacity = current_capacity;
+        }
       }
     }
+
+    if (!plan_.more_to_do()) break;
 
     // See if we can reap any finished commands.
2 changes: 1 addition & 1 deletion src/build.h
@@ -135,7 +135,7 @@ struct Plan {
 /// RealCommandRunner is an implementation that actually runs commands.
 struct CommandRunner {
   virtual ~CommandRunner() {}
-  virtual bool CanRunMore() const = 0;
+  virtual size_t CanRunMore() const = 0;
   virtual bool StartCommand(Edge* edge) = 0;
 
   /// The result of waiting for a command.
11 changes: 8 additions & 3 deletions src/build_test.cc
@@ -15,6 +15,8 @@
 #include "build.h"
 
 #include <assert.h>
+#include <climits>
+#include <stdint.h>
 
 #include "build_log.h"
 #include "deps_log.h"
@@ -473,7 +475,7 @@ struct FakeCommandRunner : public CommandRunner {
   max_active_edges_(1), fs_(fs) {}
 
   // CommandRunner impl
-  virtual bool CanRunMore() const;
+  virtual size_t CanRunMore() const;
   virtual bool StartCommand(Edge* edge);
   virtual bool WaitForCommand(Result* result);
   virtual vector<Edge*> GetActiveEdges();
@@ -574,8 +576,11 @@ void BuildTest::RebuildTarget(const string& target, const char* manifest,
   builder.command_runner_.release();
 }
 
-bool FakeCommandRunner::CanRunMore() const {
-  return active_edges_.size() < max_active_edges_;
+size_t FakeCommandRunner::CanRunMore() const {
+  if (active_edges_.size() < max_active_edges_)
+    return SIZE_MAX;
+
+  return 0;
 }
 
 bool FakeCommandRunner::StartCommand(Edge* edge) {
