Skip to content

Triggers a system action if a user-defined loadavg is exceeded

Notifications You must be signed in to change notification settings

jumanjiman/hangwatch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Hangwatch FAQ
  1. What does hangwatch do?

    Hangwatch periodically polls /proc/loadavg, and echos a user-defined set of characters into /proc/sysrq-trigger if a user-defined load threshold is exceeded.

  2. Why not just run a script from cron?

    When the system is under heavy load, fork() calls may fail or hang indefinitely, making cron jobs and shell scripts useless.

  3. Wouldn’t loadwatch be a more accurate name?

    Yes, it would. Many hangs are actually the result of very rapid load spikes which never show up in logs, because the applications that would be logging the load spike cannot run. Calling it hangwatch convinces people to run it even if they haven’t seen load spikes with their hangs.

  4. Why not use NMI watchdog?

    NMI watchdog and hangwatch complement each other. NMI watchdog will fire when the system is completely hung, but not when it is simply too bogged down to get any real work done. Hangwatch will fire when the system is too bogged down to get work done, but probably won’t help for a hard lockup. System hangs that do not trigger either NMI watchdog or hangwatch are extremely rare in my experience.

  5. How do I make hangwatch start at boot time?

    The RPM installs a SysV-style init script and enables boot-time startup. The RPM also starts hangwatch immediately unless the machine is kickstarting.

    To disable boot-time startup:

    chkconfig hangwatch off

    To stop the service:

    service hangwatch stop
  6. What is an appropriate threshold for loadavg?

    In general, the threshold should be at least 5X the number of CPU cores shown in /proc/cpuinfo. Practically, the threshold should be high enough that hangwatch does not trigger unless the system is unresponsive or almost unresponsive due to high load.

  7. How do I make hangwatch run with higher priority?

    Set the NICELEVEL variable in /etc/sysconfig/hangwatch. See man nice for details.

  8. How do I make hangwatch run with realtime priority?

    Set the RTPRIO variable in /etc/sysconfig/hangwatch. See man chrt for details. This variable overrides the nice level.

  9. How do I make hangwatch run on a particular processor?

    Set the CPUS variable in /etc/sysconfig/hangwatch. See man taskset for details.

    This is particularly useful if you bind hangwatch to a certain processor and bind the troublesome application to the other processors. The default /etc/sysconfig/hangwatch` configures hangwatch to run on CPU0 only.

    The simplest way to force troublesome applications to other processors is to boot with the kernel parameter isolcpus=0. This excludes CPU0 from the process scheduler so that applications are scheduled on other CPUS only. This has effects on latency and CPU service time, so consider the performance ramifications before booting with the isolcpus parameter. Other ways to bind processes to particular CPUS include taskset(1), numactl(8), and cpusets (see scheduler_domains.txt in the kernel-doc).

  10. Where does the output of hangwatch go?

    This depends on your syslog configuration, but generally it will at least go to the boot console (serial terminals and netconsole help for recording this) and will also go to /var/log/messages if the system is responsive enough to write to that file.

  11. How do I create different triggers for different load thresholds?

    The default configuration shipped with the package activates a single instance of hangwatch:

    OPTIONS="-s mpt -t 5 -i 1"
    #OPTIONS_2="-s mpt -t 15 -i 1"
    #OPTIONS_3="-s mpt -t 1000 -i 1"

    If you uncomment the second and third options, then restart hangwatch, you can check status such as:

    # service hangwatch status
    hangwatch1 (pid 20862) is running...
    hangwatch2 (pid 20873) is running...
    hangwatch3 (pid 20884) is running...
  12. How can I collect a vmcore with hangwatch?

    Put c at the end of your string of sysrq characters. The sysrq-c trigger causes a controlled kernel panic, which should allow netdump, diskdump, kdump, etc. to work properly.

  13. Where can I get the source for hangwatch?

    You can use git to clone the hangwatch repo on github:

    git clone [email protected]:jumanjiman/hangwatch.git

    You can also download one version of the source from http://people.redhat.com/astokes/hangwatch/

  14. Do I have to compile it myself?

    An RPM provides a compiled version of hangwatch and installs configuration files for you. You can use the SRPM to compile hangwatch for other architectures if desired. One version of the RPM can be downloaded from http://people.redhat.com/astokes/hangwatch/.

    The source in the git repo listed above provides additional features but must be built. A good tool for building the package is tito, which is in Fedora. The latest version of tito is at http://github.com/dgoodwin/tito

  15. Will you please add feature X to hangwatch?

    Maybe. Because of the circumstances under which it operates, hangwatch must always be a very lightweight program. After startup, it can’t do anything that might have problems under heavy load. That said, there are lots of other interesting pseudofiles it could potentially work with. Patches are welcome.

About

Triggers a system action if a user-defined loadavg is exceeded

Resources

Stars

Watchers

Forks

Packages

No packages published