-
What does
hangwatch
do?Hangwatch periodically polls
/proc/loadavg
, and echos a user-defined set of characters into/proc/sysrq-trigger
if a user-defined load threshold is exceeded. -
Why not just run a script from
cron
?When the system is under heavy load, fork() calls may fail or hang indefinitely, making cron jobs and shell scripts useless.
-
Wouldn’t
loadwatch
be a more accurate name?Yes, it would. Many hangs are actually the result of very rapid load spikes which never show up in logs, because the applications that would be logging the load spike cannot run. Calling it
hangwatch
convinces people to run it even if they haven’t seen load spikes with their hangs. -
Why not use NMI watchdog?
NMI watchdog and hangwatch complement each other. NMI watchdog will fire when the system is completely hung, but not when it is simply too bogged down to get any real work done. Hangwatch will fire when the system is too bogged down to get work done, but probably won’t help for a hard lockup. System hangs that do not trigger either NMI watchdog or hangwatch are extremely rare in my experience.
-
How do I make hangwatch start at boot time?
The RPM installs a SysV-style init script and enables boot-time startup. The RPM also starts hangwatch immediately unless the machine is kickstarting.
To disable boot-time startup:
chkconfig hangwatch off
To stop the service:
service hangwatch stop
-
What is an appropriate threshold for
loadavg
?In general, the threshold should be at least 5X the number of CPU cores shown in
/proc/cpuinfo
. Practically, the threshold should be high enough that hangwatch does not trigger unless the system is unresponsive or almost unresponsive due to high load. -
How do I make hangwatch run with higher priority?
Set the
NICELEVEL
variable in/etc/sysconfig/hangwatch
. Seeman nice
for details. -
How do I make hangwatch run with realtime priority?
Set the
RTPRIO
variable in/etc/sysconfig/hangwatch
. Seeman chrt
for details. This variable overrides the nice level. -
How do I make hangwatch run on a particular processor?
Set the CPUS variable in
/etc/sysconfig/hangwatch
. Seeman taskset
for details.This is particularly useful if you bind hangwatch to a certain processor and bind the troublesome application to the other processors. The default /etc/sysconfig/hangwatch` configures hangwatch to run on CPU0 only.
The simplest way to force troublesome applications to other processors is to boot with the kernel parameter
isolcpus=0
. This excludes CPU0 from the process scheduler so that applications are scheduled on other CPUS only. This has effects on latency and CPU service time, so consider the performance ramifications before booting with the isolcpus parameter. Other ways to bind processes to particular CPUS includetaskset(1)
,numactl(8)
, and cpusets (seescheduler_domains.txt
in the kernel-doc). -
Where does the output of hangwatch go?
This depends on your syslog configuration, but generally it will at least go to the boot console (serial terminals and netconsole help for recording this) and will also go to
/var/log/messages
if the system is responsive enough to write to that file. -
How do I create different triggers for different load thresholds?
The default configuration shipped with the package activates a single instance of hangwatch:
OPTIONS="-s mpt -t 5 -i 1" #OPTIONS_2="-s mpt -t 15 -i 1" #OPTIONS_3="-s mpt -t 1000 -i 1"
If you uncomment the second and third options, then restart hangwatch, you can check status such as:
# service hangwatch status hangwatch1 (pid 20862) is running... hangwatch2 (pid 20873) is running... hangwatch3 (pid 20884) is running...
-
How can I collect a vmcore with hangwatch?
Put
c
at the end of your string of sysrq characters. The sysrq-c trigger causes a controlled kernel panic, which should allow netdump, diskdump, kdump, etc. to work properly. -
Where can I get the source for hangwatch?
You can use
git
to clone the hangwatch repo on github:git clone [email protected]:jumanjiman/hangwatch.git
You can also download one version of the source from http://people.redhat.com/astokes/hangwatch/
-
Do I have to compile it myself?
An RPM provides a compiled version of hangwatch and installs configuration files for you. You can use the SRPM to compile hangwatch for other architectures if desired. One version of the RPM can be downloaded from http://people.redhat.com/astokes/hangwatch/.
The source in the git repo listed above provides additional features but must be built. A good tool for building the package is
tito
, which is in Fedora. The latest version of tito is at http://github.com/dgoodwin/tito -
Will you please add feature X to hangwatch?
Maybe. Because of the circumstances under which it operates, hangwatch must always be a very lightweight program. After startup, it can’t do anything that might have problems under heavy load. That said, there are lots of other interesting pseudofiles it could potentially work with. Patches are welcome.
Hangwatch FAQ