\section{Thread Pinning}
\label{sec:rts_thread_pinning}
\status{Removed}
As explained in chapter \ref{chap:background},
Mercury manages its parallel work with userspace scheduling:
engines execute contexts, and context switching is done in userspace.
Each Mercury engine corresponds to an operating system thread, such as a
POSIX thread.
%\cite{butenhof1997:pthreads}
The operating system manages these threads, including their mapping onto
processors.
\begin{figure}
\begin{center}
\includegraphics[width=0.75\textwidth]{i7-hierarchy}
\end{center}
\caption{Memory hierarchy of the Intel i7-2600K processor and 16GB of main
memory}
\label{fig:i7_hierarchy}
\end{figure}
\plan{Introduce SMT}
The multi-processor (and multi-core) systems that our work targets are
symmetric multi-processor (SMP) systems.
This means that they have two or more equivalent processors,
and that memory is approximately the same distance from each of the
processors (that is, the system is not a non-uniform memory access (NUMA)
system).
Many current SMP systems are also simultaneous multi-threading (SMT)\footnote{
    Intel commonly calls SMT ``hyper-threading''.}
systems.
As an example, figure \ref{fig:i7_hierarchy} shows the memory hierarchy of
an Intel i7-2600K based system with 16GB of memory;
each of the four cores has two hardware threads,
labelled PU (Processing Unit) in the figure.
Each of the four physical processor cores can execute two hardware threads
\emph{concurrently}.
It is important to recognise that hardware threads on the same core are not
executed in parallel,
which is how we are used to thinking about multi-processing.
Instead, each core is able to execute more than one instruction stream at a
time.
If the execution of one instruction stream must wait for something, such as a
value being retrieved from memory,
then the core is able to execute the other instruction stream.
This improves performance by allowing the processor to make use of the time
when it would otherwise be stalled.
To allow programs to use SMT, the operating system makes the
extra hardware threads available to applications by making it appear as if
the system has more processors than it actually does.
% Allowing each core to run a second thread can increase performance,
% but it will not double it.
When talking about an SMT thread, we will say ``hardware thread'',
allowing the term ``thread'' to refer to a software thread managed by the
operating system.
\plan{Explain why we want $P$ engines when there are $P$ processors}
In order to make use of each of the system's processor cores and hardware
threads (``processors'' for brevity),
we should create at least one thread for each processor.
Creating more threads than this consumes more resources than necessary,
and when more than one thread is actively using the same processor, the
operating system must perform context switches between these threads.
Therefore, it is best to use exactly one thread per processor,
which is the default.
\plan{Explain briefly how we detect how many processors there are,}
We have been using the
\code{sysconf(\_SC\_NPROCESSORS\_ONLN)}
call to detect the number of
processors in the system.
On the Intel i7-2600K (figure \ref{fig:i7_hierarchy}),
\code{sysconf(\_SC\_NPROCESSORS\_ONLN)}
returns 8.
This call is supported on several, but not all, platforms.
If Mercury cannot detect the number of processors in the system, it defaults
to using a single engine.
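The sketch below is illustrative rather than the Mercury runtime's actual
code (the helper name \code{detect\_num\_processors()} is ours);
it shows how this detection, with its fallback to a single engine, might be
written in C.
\begin{verbatim}
#include <unistd.h>

/*
 * Return the number of online logical processors, or 1 if the number
 * cannot be determined (for example, when _SC_NPROCESSORS_ONLN is not
 * provided by the platform).
 */
static int
detect_num_processors(void)
{
#if defined(_SC_NPROCESSORS_ONLN)
    long num_procs = sysconf(_SC_NPROCESSORS_ONLN);
    if (num_procs > 0) {
        return (int) num_procs;
    }
#endif
    return 1;   /* Default to a single engine. */
}
\end{verbatim}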
\plan{Explain why we want thread pinning.}
We have been relying on the operating system to assign threads to
processors.
This is usually acceptable,
but it is generally considered more reliable to pin threads to specific
processors.
%This is beginning to become more important as processors are made with
%larger numbers of cores and therefore more complex memory hierarchies.
\plan{Explain how we get thread pinning.}
We can \emph{pin} a thread to a specific processor using the
\code{setcpuaffinity()} system call,
which is available on many Unix-like OSs.
Our initial thread pinning implementation pinned each of Mercury's engines
to a separate processor.
The goals are to ensure that two threads do not contend for the same
processor while another processor sits idle,
and that threads are not needlessly migrated between processors.
Thread pinning should also work when the user requests fewer engines
than processors.
When the user requests more engines than processors,
only the first $P$ threads are pinned (where the system has $P$ processors).
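As a concrete illustration, the hedged sketch below pins the calling thread
to a given logical processor.
It is not the Mercury runtime's code:
it assumes a Linux system and uses the GNU
\code{pthread\_setaffinity\_np()} interface,
which we take to be representative of the kind of call that
\code{setcpuaffinity()} provides.
\begin{verbatim}
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/*
 * Pin the calling thread to the logical processor numbered `cpu'.
 * Returns 0 on success or an errno value on failure.
 */
static int
pin_current_thread(int cpu)
{
    cpu_set_t cpus;

    CPU_ZERO(&cpus);
    CPU_SET(cpu, &cpus);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
}
\end{verbatim}
A runtime following the scheme above would make one such call per engine,
passing the engine numbers $0, 1, \ldots, P-1$,
and would simply skip the call for any engines numbered $P$ or higher.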
% Benchmarks are inconclusive.
%\begin{table}
%\begin{center}
%\begin{tabular}{rr|r|rrrrrrrr}
%\multicolumn{1}{c|}{Method} &
%\multicolumn{1}{c|}{Pinning} &
%\multicolumn{1}{c|}{Sequmential} &
%\multicolumn{8}{c}{Parallel w/ $N$ Engines} \\
%\Cbr{}& \Cbr{}& \Cbr{TS} & \C{1}& \C{2}& \C{3}& \C{4}& \C{5}& \C{6}&
% \C{7}& \C{8}\\
%\hline
%\hline
%\multicolumn{11}{c}{Right recursive mandelbrot (max.\ 1024 contexts)} \\
%\hline
%setcpuaffinity() & no
% & 15.1 & 15.2 & 7.6 & 5.1 & 3.8 & 3.7 & 3.5 & 3.4 & 3.3 \\
%hwloc & no
% & 15.2 & 15.2 & 7.6 & 5.1 & 3.8 & 3.7 & 3.5 & 3.4 & 3.3 \\
%setcpuaffinity() & yes
% & - & 15.2 & 7.6 & 5.1 & 3.8 & 3.7 & 3.6 & 3.4 & 3.3 \\
%hwloc & yes
% & - & 15.4 & 7.7 & 5.1 & 3.9 & 3.7 & 3.6 & 3.5 & 3.3 \\
%\end{tabular}
%\end{center}
%\end{table}
\plan{Problem with SMT}
\plan{What about SMT, not all processors are equal.}
This works well on SMP systems that do not use SMT.
On systems that use SMT as well as SMP, this causes problems.
Consider the memory hierarchy of the Intel i7-2600K processor
(figure \ref{fig:i7_hierarchy}).
The i7-2600K has four cores, each with two hardware threads,
labelled processing units (PUs) in the figure.
If Mercury successfully detects that the system has eight logical processors,
or the user specifies eight engines, then thread pinning will correctly pin
each engine to a separate hardware thread.
However,
if the user requests only four engines, which cores will be used?
Depending on how they are selected,
and how the OS's \code{setcpuaffinity()} system call labels them,
all four Mercury engines could be pinned to just two of the CPU's cores
instead of being distributed uniformly across all four cores.
Uniform distribution will result in higher performance for the Mercury
program,
but it may affect other processes running on the same system.
\plan{How do we handle SMT}
Unfortunately, there is no method provided by all operating systems for
determining the system's memory hierarchy.
Therefore, we use a third-party library
named Portable Hardware Locality (hwloc) \citep{broquedis:2010:hwloc}.
Hwloc provides some tools, such as the one that drew figure
\ref{fig:i7_hierarchy};
more importantly, it provides an API for querying the memory hierarchy of
the system and methods for pinning threads to (sets of) cores or hardware
threads.
We use hwloc to determine the number of hardware threads in the system,
and optionally to pin engines to hardware threads.
If the user specifies the number of engines to use,
then we pin each engine to a hardware thread, choosing hardware threads so
that engines are distributed among the cores as evenly as possible.
If the user specifies more engines than there are hardware threads
(which we admit would be silly), then we allow more than one engine per
hardware thread,
while keeping engines as evenly distributed as possible amongst the
hardware threads and cores.
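The following sketch illustrates this policy using the hwloc API;
it is an illustration under our assumptions rather than Mercury's actual
code: the function name \code{pin\_engine\_with\_hwloc()} is ours, error
handling is omitted, and the topology object is assumed to have been
prepared with \code{hwloc\_topology\_init()} and
\code{hwloc\_topology\_load()}.
\begin{verbatim}
#include <hwloc.h>

/*
 * Bind the calling thread, which runs engine number `e' (counted
 * from zero), to a hardware thread.  Engines are spread round-robin
 * across cores, so a core's second hardware thread is only used
 * once every core already has at least one engine.
 */
static void
pin_engine_with_hwloc(hwloc_topology_t topology, int e)
{
    int ncores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
    hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE,
        e % ncores);

    /* The hardware threads (PUs) belonging to the chosen core. */
    int npus = hwloc_get_nbobjs_inside_cpuset_by_type(topology,
        core->cpuset, HWLOC_OBJ_PU);
    hwloc_obj_t pu = hwloc_get_obj_inside_cpuset_by_type(topology,
        core->cpuset, HWLOC_OBJ_PU, (e / ncores) % npus);

    /* Bind only the calling thread, not the whole process. */
    hwloc_set_cpubind(topology, pu->cpuset, HWLOC_CPUBIND_THREAD);
}
\end{verbatim}
On the i7-2600K of figure \ref{fig:i7_hierarchy}, this scheme gives each of
four engines its own core, while a fifth engine would share a core with the
first.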
%\plan{Further work}
%As we discussed in the previous section,
%knowledge of the memory hierarchy could help make better scheduling
%decisions.
%When using thread pinning we can guarantee that a particular engine is
%running on a particular processor/hardware thread,
%combining this with information about the memory hierarchy such as shared
%caches/sockets we can know which processors share the same cache(s) with the
%current processor either prefer to steal work from them or send work to
%them.