forked from GlobalArrays/ga
-
Notifications
You must be signed in to change notification settings - Fork 0
/
NOTES
228 lines (174 loc) · 8.64 KB
/
NOTES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
BRIEF TOUBLESHOOTING NOTES
~~~~~~~~~~~~~~~~~~~~~~~~~~
NON-PLATFORM SPECIFIC PROBLEMS
==============================
1. Your make is not GNU make.
Symptoms:
Make: makefile: Must be a separator on line 2. Stop.
Fix: use GNU make.
To verify if make is a GNU make, type make -v. The output should look
like::
% make -v
GNU Make version 3.74, by Richard Stallman and Roland McGrath.
Copyright (C) 1988, 89, 90, 91, 92, 93, 94, 95 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
2. Limited IPC resources on shared memory machines
Symptoms:
GA programs fail in initialization when trying to allocate semaphores or
shared memory.
Fix:
Most often semaphore/shared memory IDs are used by processes that
terminated and left them unallocated. Run UNIX commands "ipcs" and then
"ipcrm" to remove any IDs that belong to your dead processes or contact
your sys admin for help.
Related:
On some platforms, insufficient amount of swap space might cause
shmget/shmat fail. In this case, either swap space has to be increased or
constant _SHMMAX in file global/src/shmem.c decreased. The current values
of _SHMMAX might be too optimistic. For example, on the PNNL IBM laptops
running Linux, kernel was rebuilt to allow 24 MB of shared memory.
PLATFORM SPECIFIC PROBLEMS
==========================
PLEASE NOTE:
------------
For up-to-date information regarding platform specific problems, go to Global
Arrays support page:
http://www.emsl.pnl.gov/docs/global/support.html
Systems with QSnet interconnects
--------------------------------
Due to a bug in elan library (old versions between and including 1.3 and
1.4.1), programs using the elan library can crash when transmiting 4-byte C
Floats. Please contact Quadrics support for further information about this
problem. Since ARMCI uses the elan library on the QSnet interconnect, this
elan library bug can cause a problem if your system is using the above
mentioned version of the elan library. This problem does not occur with the
new 1.4.2 version of the elan library.
Linux
-----
The release 3.1 of GA supports Intel (x86), PowerPC, Alpha, and Sparc Ultra
processors.
A kernel patch helps to cut down latency in the TCP/IP socket based
communication on the Linux clusters connected with Ethernet networks. We
strongly recommend people who use Global Arrays extensively in this
environment to apply the patch to avoid communication performance problems.
The Linux kernel has traditionally fairly small limit for the shared memory
segment size (SHMMAX). In kernels 2.2.x it is 32MB on Intel, 16MB on Sun
Ultra, and 4MB on Alpha processors. There are two ways to increase this limit:
* rebuild the kernel after changing SHMMAX in
/usr/src/linux/include/asm-i386/shmparam.h, for example, setting SHMMAX
as 0x8000000 (128MB)
* a system admin can increase the limit without rebuilding
the kernel, for example: echo "134217728" >/proc/sys/kernel/shmmax
More info on this subject:
* http://ps-ax.com/shared-mem.html and on the Linux SMP
discussion list.
* Setting very large values of SHMMAX is not recommended.
In particular, it should not be bigger than the swap space or
even the amount of physical memory in the system. For most
applications, 128-256MB values should work fine.
Issues related to Myrinet :
* On Linux/x86 clusters with Myrinet, the release of GM 1.4
leads to hangs in ARMCI and GA. This problem has been solved
in GM 1.4.1pre6 which is available on the Myricom ftp site.
Versions 1.2, 1.3, 1.4pre48 of GM do not have that problem.
* With MPICH/GM versions >1.2.3, the GM version 1.4.1pre14 or
higher must be used.
IBM SP
------
POE environment variable settings for the parallel environment PSSP 3.1
(Troutbeck): Global Arrays applications like any other LAPI-based codes must
define MP_MSG_API=lapior MP_MSG_API=mpi,lapi(when using GA and MPI)
The LAPI-based implementation of GA cannot be used on the very old (made 5-7
years ago) SP-2 systems because LAPI does not support the TB2 switch used in
these models. If in doubt which switch you got use odmget command: odmget -q
name=css0 CuDv
Under AIX 4.3.1 and later one must set environment variable AIXTHREAD_SCOPE=S
to assure correct operation of LAPI (IBM should do it in PSSP by default).
under AIX 4.3.3 and later an additional environment variable is required
RT_GRQ=ON to restore the original thread scheduling LAPI relies upon.
SGI
---
In older versions of GA (<3.1) there is a possibility of conflict between the
SGI implementation of MPI (but not others, MPICH for example) and ARMCI in
their use of the SGI specific interprocessor communication facility called
arena.The release 3.1 does not use arenas to avoid the problem.
If processors are oversubscribed (you are using more processes than
processors), the SGI spin locks used for synchronization in GA are a very bad
idea. In such a case Sys V semaphores work much better.
Sun
---
Solaris by default provides only 1MB limit for the largest shared memory
segment. You need to increase this value to do any useful work with GA. For
example to make SHMMAX= 2GB, add either of the lines to /etc/system::
set shmsys:shminfo_shmmax=0x80000000 /* hexidecimal */
set shmsys:shminfo_shmmax=2147483648 /* decimal */
After rebooting, you should be able to take advantage of the increased shared
memory limits. For more info, please refer to the article in SunWorld:
http://www.itworld.com/Comp/2378/UnixInsider/
Also see Note below.
Compaq/DEC
----------
Tru64 is another example of an OS with a pityfully small size of the shared
memory region limit. Here are instruction on how to modify shared memory max
segment size to 256MB on the Tru64 UNIX Version 4.0F:
1. create a file called /etc/sysconfig.shmmax
cat > /etc/sysconfig.shmmax << EOF
ipc:
shm-max = 268435456
EOF
You can check if the file created is OK by typing:
/sbin/sysconfigdb -l -t /etc/sysconfig.shmmax
2. Modify kernel values: sysconfigdb -a -f /etc/sysconfig.shmmax ipc
3. Reboot
4. To check new values: /sbin/sysconfig -q ipc|egrep shm-max
For more info:
http://www.tru64unix.compaq.com/docs/base_doc/DOCUMENTATION/V40F_HTML/AQ0R3GTE/CHLMTSXX.HTM
Also see Note below.
HP-UX
-----
In most HP-UX/11 installations, the default limit for the largest shared
memory segment is 64MB. A system administrator should be able to easily
increment this value to better suit the applications needs:
1. Start sam (HP System Administration Manager).
2. Go to Kernel Configuration, select Configurable Parameters.
Click on the shmmax entry on the list.
3. From Actions menu select Modify Configurable Parameter. Click on the
radio button Specify New Formula/Value and fill in a desired value for
SHMMAX in the hexadecimal format e.g., 0X8000000 (128MB).
4. Finally, select Process New Kernel from the Actions menu. It will
apply the new value of SHHMAX into the kernel after rebooting the system.
The C compiler (commercial version) under HP-UX 10.2 performs incorrect
address calculations for shared memory references. GNU gcc works fine. This
problem manifests itself in global/testing/test.x failing. If it happens, GA
has to recompiled with gcc.
Fujitsu VX/VPP
--------------
The code will not complile because of the missing header file. This file
references MPlib library, a trade secret of Fujitsu. The GA/ARMCI binaries
can be obtained from [email protected].
Cray systems
-------------
J90 and SV1 - MPI related issues
................................
* Only MPI message-passing library is supported (TCGMSG-MPI also).
* Must use "mpirun -nt" rather than "mpirun -np" command to run
GA based codes.
* The Cray MPI_Abort implementation is broken(hangs).
GA aborts using _exit().
* MPI_Initialized() is also broken - we cannot detect if some
other library already initialized MPI.
FreeBSD
-------
To increase the shared memory segments on FreeBSD the following two sysctl's
should be added to the startup scripts (e.g. /etc/rc.local):
sysctl -w kern.ipc.shmmax=67108864
sysctl -w kern.ipc.shmall=16384
the first sysctl allocates 64Mbytes of memory, the second does the same thing
in 4k pages (4k * 16384 = 64M), you must set both sysctl.
Note on SHMMAX:
---------------
Setting very large values of SHMMAX is not recommended. In particular, it
should not be bigger than the swap space or even the amount of physical memory
in the system. For most applications, 128-256MB values should work fine.