Introduction
------------
This HOWTO describes how to set up a loadbalanced, redundant and
distributed network monitoring system using op5 Monitor. Note that
Merlin is a part of op5 Monitor. Non-customers will have to adjust
paths etc. used throughout this guide in order to be able to use it.
Replacing "op5 Monitor" with "nagios + merlin" will be a good start
for those of you venturing into the unknown without the aid of our
rather excellent support services. For those wishing to set up only
distributed network monitoring, or only a loadbalanced or redundant
pair, this is still a good guide.
The guide will assume that we're installing two redundant and
loadbalanced master-servers ("yoda" and "obi1"), with three
poller servers, two of which are peered with each other. The single
poller will be designated "solo". The peered pollers will be "luke"
and "leya". We'll assume that each poller has their names in the
DNS and can be looked up that way. 'solo' will be responsible for
monitoring the hostgroup 'hyperdrive'. 'luke' and 'leya' will share
responsibility in monitoring the hostgroups 'theforce' and 'tattoine'.
With this setup, communications will go like this:

  luke will talk with:  leya, obi1, yoda
  leya will talk with:  luke, obi1, yoda
  solo will talk with:  obi1, yoda
  obi1 will talk with:  yoda, luke, leya, solo
  yoda will talk with:  obi1, luke, leya, solo
Preparations
------------
The following things need to be in place for this HOWTO to be usable,
but how to obtain or set them up is outside the scope of this article.
* Make sure you have the passwords for the root accounts on all the
servers intended to be part of the monitoring network. This will
be needed in order to configure merlin.
* Open the firewalls for port 15551 (merlin's default port) and 22.
Both ends will attempt to connect on port 15551, so it's ok if only
one side of the intended connection can connect to the other. For
port 22, it's a little bit more complicated and in order to get the
full shebang of features both ends will need to be able to initiate
connections with the other. It's possible to get away with not
allowing pollers to initiate connections to the master server, but
certain recovery operations will then not be possible.
* op5 Monitor needs to be installed on all systems intended to be part
of the network.
Hello mon'!
-----------
Included in op5 Monitor is the 'mon' helper. mon is a nifty little
tool designed to help with configuring, managing and verifying the
status of a distributed Merlin installation.
Its usage is quite simple: mon <category> <command>
Just type mon and you'll get a list of all available categories and
commands. Some commands lack a category and are runnable all by
themselves, such as 'stop', 'start' and 'restart', which take care
of stopping, starting and restarting monitor and merlin using the
proper shutdown and startup sequences.
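A few examples of commands that all show up again later in this guide:

  mon                # no arguments: list all available categories and commands
  mon node list      # list the nodes configured in merlin.conf
  mon restart        # restart monitor and merlin in the proper order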
We'll soon see exactly how useful that little helper is.
Step 1 - Configure Merlin on one of the master systems
------------------------------------------------------
The aforementioned 'mon' helper has a 'node' category. This is useful
for manipulating configured nodes in merlin's configuration file,
among other things.
We'll start with configuring Merlin properly on 'yoda'. The commands
to do so will look like this (yes, there's a typo there):
mon node add obi1 type=peer
mon node add luke type=poller hostgroup=theforce,tattoine
# type=poller is default, so we don't have to spell it out
mon node add leya hostgroup=theforce,tattoine
mon node add solo hostgroup=hyperride
The 'node' category also has a 'remove' command, so when we've noticed
the typo we made when adding the poller 'solo', we can fix that by
removing the faulty node and re-adding it again:
mon node remove solo
mon node add solo type=poller hostgroup=hyperdrive
You may verify that you've done things right in a couple of different
ways. 'mon node list' lists nodes. It accepts a --type= argument, so
if we want to list all pollers and peers, we can run:
mon node list --type=poller,peer
This in conjunction with 'mon node show <name>' is excellent to use
when scripting.
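For instance, a small scripting sketch. It is hypothetical: the exact
KEY=value names that 'mon node show' prints may differ between
versions, so treat the ADDRESS key below as an assumption and check
the output on your own system first:

  # print the configured address of every poller
  for node in $(mon node list --type=poller); do
      echo -n "$node: "
      mon node show $node | grep ^ADDRESS
  done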
The contents of the configuration file, which by default resides in
/opt/monitor/op5/merlin/merlin.conf, should now include something
like this:
--8<--8<--8<--
peer obi1 {
    address = obi1
    port = 15551
}

poller luke {
    address = luke
    port = 15551
    hostgroup = theforce,tattoine
}

poller leya {
    address = leya
    port = 15551
    hostgroup = theforce,tattoine
}

poller solo {
    address = solo
    port = 15551
    hostgroup = hyperdrive
}
--8<--8<--8<--
Step 2 - Distribute ssh-keys
----------------------------
The 'mon' helper has an sshkey category. The commands in it let you
push and fetch sshkeys from remote destinations.
mon sshkey push --all
will append your ~/.ssh/id_rsa.pub file to the authorized_keys file
for the root and monitor user on all configured remote nodes.
If you don't have a public keyfile, one will be generated for you.
Please note that if you generate a keyfile yourself, it must not have
a passphrase, or configuration synchronization will fail to work.
So far we've set up one-way communication. 'yoda' can now talk to
all the other systems without having to use passwords. In order to
fetch all the keys from the remote systems, we'll use the following
command:
mon sshkey fetch --all
This will fetch all the relevant keys into ~/.ssh/authorized_keys.
So every node can talk to 'yoda', and 'yoda' can talk to every
other node. That's great. But 'luke' and 'leya' need to be able
to talk to each other as well, and all the pollers need to be
able to talk to 'obi1'. Since we have all keys except our own in
~/.ssh/authorized_keys, we can simply amend it with the key we
generated earlier and distribute the resulting file to every node.
Or we can do what we just did for 'yoda' on all the other nodes,
and simply wait until we're done configuring merlin on all nodes
and then log in to them and run the 'mon sshkey push --all' and
'mon sshkey fetch --all' commands there too.
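If you prefer the 'amend and distribute' variant, a rough sketch of it
could look like this (root-to-root only; the monitor user would need
the same treatment):

  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  for node in obi1 luke leya solo; do
      scp ~/.ssh/authorized_keys root@$node:.ssh/authorized_keys
  done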
You can verify that this works by running:
mon node ctrl -- 'echo hostname is $(hostname)'
Step 3 - Configure Merlin on the remote systems
-----------------------------------------------
I sneakily introduced the 'node ctrl' command in the last section.
This time we'll use it rather heavily, along with the 'node add'
command which will run on the remote systems.
First we add ourselves and 'obi1' as masters to all pollers:
mon node ctrl --type=poller -- mon node add yoda type=master
mon node ctrl --type=poller -- mon node add obi1 type=master
Now the poller 'solo' is actually done and configured already.
Then we add ourselves as a peer on all our peers (just 'obi1' really,
but in case you build larger networks, this will work better):
mon node ctrl --type=peer -- mon node add yoda type=peer
Then we add all pollers to 'obi1':
mon node ctrl obi1 -- mon node add luke hostgroup=theforce,tattoine
mon node ctrl obi1 -- mon node add leya hostgroup=theforce,tattoine
mon node ctrl obi1 -- mon node add solo hostgroup=hyperdrive
And finally we add luke and leya as peers to each other:
mon node ctrl leya -- mon node add luke type=peer
mon node ctrl luke -- mon node add leya type=peer
solo will have the following config:
--8<--8<--8<--
master yoda {
    address = yoda
    port = 15551
}

master obi1 {
    address = obi1
    port = 15551
}
--8<--8<--8<--
luke will have this in its config file:
--8<--8<--8<--
master yoda {
    address = yoda
    port = 15551
}

master obi1 {
    address = obi1
    port = 15551
}

peer leya {
    address = leya
    port = 15551
}
--8<--8<--8<--
leya will have this:
--8<--8<--8<--
master yoda {
    address = yoda
    port = 15551
}

master obi1 {
    address = obi1
    port = 15551
}

peer luke {
    address = luke
    port = 15551
}
--8<--8<--8<--
obi1 will have this:
--8<--8<--8<--
peer yoda {
    address = yoda
    port = 15551
}

poller luke {
    address = luke
    port = 15551
    hostgroup = theforce,tattoine
}

poller leya {
    address = leya
    port = 15551
    hostgroup = theforce,tattoine
}

poller solo {
    address = solo
    port = 15551
    hostgroup = hyperdrive
}
--8<--8<--8<--
Step 4 - Verifying configuration and ssh-key setup
--------------------------------------------------
Now that we have merlin configured properly on all our five nodes,
we can use a recursive version of the 'node ctrl' command to make
sure ssh works properly from every system to every system it needs
to talk to. Try pasting this into the console:

  mon node ctrl -- \
      'echo On $(hostname); hostname | mon node ctrl -- \
          \'from=$(cat); echo "@$(hostname) from $from"\'' \
      | grep ^@
Hard to follow? I agree, but it should produce something like this:
@yoda from obi1
@luke from obi1
@leya from obi1
@solo from obi1
@yoda from luke
@obi1 from luke
@leya from luke
@yoda from leya
@obi1 from leya
@luke from leya
@yoda from solo
@obi1 from solo
If it does, that means ssh keys are properly installed, at least for
the root user(s). If the command seems to hang somewhere in the middle,
try rerunning it without the ending 'grep' statement. If that ends up
with a password prompt appearing, you'll need to revisit the sshkey
configuration and again make sure every node that should be able to
talk to other nodes can talk to the nodes it should be able to talk
to. Simple, eh?
(XXX; Test this and make sure it actually works like this)
Step 5 - Configuring Nagios
---------------------------
Handling object configuration is very much out of scope for this
article, but there are a few rules (most of them are actually more
like guidelines, but things will be confusing if you don't follow
them, so please do) one needs to adhere to in order for Merlin to
work properly.
* Each host that is a member of a hostgroup used to distribute work
  from a master to a poller should never be a member of a hostgroup
  that is used to distribute work to a different poller. In our
  case, that means that any host that is a member of either 'theforce'
  or 'tattoine' shouldn't also be a member of 'hyperdrive'.
* Two peers absolutely must have identical object configuration.
  This is due to the way loadbalancing works in Merlin. In our case
  that means that since 'luke' is responsible for 'theforce' and
  'tattoine', its peer 'leya' must also be responsible for exactly
  those two hostgroups, and no others.
That's basically it. It's possible to circumvent these rules, but
if you do, you're on your own. No tools currently exist to enforce
them, and Merlin won't complain if you suddenly add another poller
that's responsible for 'tattoine' and 'hyperdrive', even though
such a configuration obviously breaks the rules above.
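Since nothing enforces the rules for you, a rough manual sanity check
is possible. This is only a sketch; it assumes the usual objects.cache
location (also used further down in this guide) and leaves the actual
comparison of the member lists to your eyes:

  cache=/opt/monitor/var/objects.cache
  # dump the member lists of the poller-bound hostgroups and compare them
  grep -A30 'hostgroup_name.*theforce'   $cache | grep members
  grep -A30 'hostgroup_name.*tattoine'   $cache | grep members
  grep -A30 'hostgroup_name.*hyperdrive' $cache | grep members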
Step 6 - Synchronization configuration
--------------------------------------
Configuration synchronization will be a bit easier for you if you
move all of monitor's object configuration files to a cfg_dir
instead of using the default layout of mixing object configuration
with Nagios' main configuration file and other various stuff. This
is especially true for pollers and doesn't matter nearly as much
for masters.
The quick and easy way to set that up is to run the following
commands from 'yoda':
dir=/opt/monitor/etc/oconf
conf=/opt/monitor/etc/nagios.cfg
mon node ctrl -- sed -i /^cfg_file=/d $conf
mon node ctrl -- sed -i /^log_file=/acfg_dir=$dir $conf
mon node ctrl -- mkdir -m 775 $dir
mon node ctrl -- chown monitor:apache $dir
Now, if you run:
mon node ctrl -- mon oconf hash
you should get a list of 'da39a3ee5e6b4b0d3255bfef95601890afd80709'
as output from all nodes. That means the pollers now have an empty
object configuration, which is just the way we like it since we'll
be pushing configuration from one of our two peered masters to all
the pollers.
("da 39 hash" is what you get from sha1 when you don't feed it any
input at all)
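You can verify that for yourself:

  echo -n '' | sha1sum
  da39a3ee5e6b4b0d3255bfef95601890afd80709  -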
In merlin, you can configure a script to run that takes care of
syncing configuration. This script should also restart monitor on
the receiving ends when it's done sending configuration. In the
Merlin world, this is handled by a single command that gets run
once when we detect that we have a newer configuration than any
of our peers or pollers.
mon oconf push
takes no arguments at all. It does parse merlin.conf though, and
creates complete configuration files for all the pollers, which
by default get sent to /opt/monitor/etc/oconf/from-master.cfg
on each respective poller; the poller is then restarted. Again by
default, it will also send the entire /opt/monitor/etc directory
to all its peers, using rsync --delete to make sure all systems
are fully synced. Currently though, only changes to the object
config trigger a full sync, so perhaps there's room for
improvement.
Config sync is configured either globally via an object_config
compound in the daemon section of the config file, or via those
same object_config compounds in each node if one wants to
override how one system syncs to another. It could look something
like this, for instance:
daemon {
    object_config {
        # the command to run when one or more peers or
        # pollers have older configuration than we do
        push = mon oconf push

        # the command to run when one or more masters or
        # peers have a newer configuration than we do
        #pull = some random command
    }
}
peer obi1 {
    object_config {
        # the command to run when obi1 has older config than we do
        # overrides the global command
        push = rsync -aovtr --delete /opt/monitor/etc obi1:/opt/monitor

        # command to run when obi1 has newer config than we do
        #pull = some random command
    }
}
Caveats:
* The 'pull' thing is highly untested and I'm unsure how it would
  work if one node tries to pull from another while that other node
  is pushing at the same time. Care should be taken to avoid such
  setups.
* The only supported scenario is to have the master with the most
  recently changed configuration push that config to its peers and
  pollers. This *will* create avalanche pushing if one uses peered
  pollers that in turn have pollers of their own, since all such
  pollers will try to push at the same time. Due to this, more than
  2 tiers is currently not officially supported, although it works
  just fine for everything else in our lab environment.
* Config pushing from master to poller requires the objects.cache
  file in order to split the config for each poller. Since config
  pushes should always be initiated by a running Merlin anyway, this
  isn't much of a problem once you've done the first push and
  everything is up and running, but when first setting up the system
  it can be tricky to get things to run smoothly.
The object_config compounds can contain whatever variables you
like without Merlin complaining about them, and
mon node show
will show them as
OCONF_PUSH=mon oconf push
OCONF_WHATEVER_YOU_NAMED_YOUR_VARIABLE=somevalue
so you can quite easily add some other scripted solution to support
your needs. 'mon oconf push' happens to use two such private vars,
namely 'source' and 'dest'. 'source' is really only used when pushing
configuration to peers, and 'dest' is what we end up using as target
when pushing the configuration. So if you want your peer sync to
only send the /opt/monitor/etc/oconf directory we created earlier,
you can quite easily set that up by configuring your peer thus:
peer obi1 {
    address = obi1
    port = 15551
    type = peer
    object_config {
        push = mon oconf push
        source = /opt/monitor/etc/oconf
        dest = /opt/monitor/etc
    }
}
The 'oconf push' command uses another command internally:
mon oconf nodesplit
This you can run without interfering with anything. In our case, it
would print something like this:
Created /var/cache/merlin/config/luke with 1154 objects for hostgroup
list 'theforce,tattoine'
Created /var/cache/merlin/config/leya with 1154 objects for hostgroup
list 'theforce,tattoine'
Created /var/cache/merlin/config/solo with 652 objects for hostgroup
list 'hyperdrive'.
You can inspect the files thus created and see if they seem to fit your
criteria. Note that they will be rather large, since templates aren't
sent to the poller nodes; every object is written out with its
inherited values fully expanded.
Step 7 - Starting the distributed system
----------------------------------------
Once you've inspected the configuration and you like what you see,
it's time to activate it and get some monitoring going on. Run the
following sequence of commands when you're ready:
mon restart; sleep 3; mon oconf push
This should send configuration to all the pollers and peers and then
attempt to restart monitor and merlin on those nodes. Pushing config
to masters is not yet supported, although scripting it wouldn't be
too hard for those who are interested. Do see the notes about 'pull'
above though.
Step 8 - Verifying that it works
--------------------------------
The first thing to do is to run:
mon node status
It will quite quickly become apparent that this little helper is
awesome for finding problems in your merlin setup. It connects to
the database, grabs the currently running nodes and prints a lot
of information about them, such as if they're active, when they
last sent a program_status_data event, how many checks they're
doing and what their check latency is. If, for some reason,
one node has crashed or is otherwise unable to communicate with
the node you're looking from, you'll find this out quite quickly
using this little helper.
Filing a bug report without including output from a run of this
program on all nodes is a hanging offense. You have been warned.
Step 9 - Finding out why it doesn't
-----------------------------------
These are sort-of general guidelines for troubleshooting certain issues
in Merlin. It involves digging through logfiles, running small helper
programs and generally just tinkering around, trying to figure out
what happened, what's happening and what will happen if you do this
or that. Most of it is stuff that has come up during beta-testing or
that has been problematic in the past. If new common problems arise,
I'll add more recipes to this little guide.
Problem: Loadbalancing seems to have stopped working even though
all my peers are ACTIVE and were last seen alive "3s ago"
Answer:
It can sometimes look like that if you check the output of
mon node status
after having restarted Merlin or Monitor. Most of the time, it's
because the peers switched peer-ids and are now slowly taking
over each other's checks. If the problem isn't resolving itself
so that the number of checks performed by each peer is converging
on an equal split, something else has gone wrong and a more thorough
investigation is necessary.
Checking if all peers have the same configuration is the first step:
mon node ctrl --type=peer -- mon oconf hash
mon node ctrl --type=peer -- sha1sum /opt/monitor/var/objects.cache
If they do, you might have run into the intermittent error that
some users have seen with peers in a loadbalanced setup. Restarting
the affected systems usually restores them to good working order:
mon node ctrl --type=peer -- mon restart
Problem: Logfiles are flooded with messages about 'nulling OOB pointer'
Answer:
Merlin uses a highly efficient binary protocol to transfer events
between module and daemon and across the network to other nodes.
The way the codec works makes it not-really-but-almost impossible
to support network nodes with different wordsize and byte order.
That is, 32bit and 64bit systems can't talk to each other, and
servers running i386-type cpu's can't communicate with PowerPC or
other big-endian machines. PDP11 won't work with anything but other
PDP11's. They'll do just fine with each other though.
Since merlin-0.9.0-beta5, merlin detects when a node with a different
wordsize, byte order or object structure version connects, and warns
about such incompatibilities. grep the logfiles for 'FATAL.*compat'
and you should see if that's the problem.
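For example, against both logfiles (adjust the paths if your
installation puts daemon.log and neb.log somewhere else):

  grep 'FATAL.*compat' /opt/monitor/op5/merlin/logs/daemon.log
  grep 'FATAL.*compat' /opt/monitor/op5/merlin/logs/neb.log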
If it isn't, and it's the module logfile which holds all the messages,
you've almost certainly hit a compatibility problem with Nagios, or
a concurrency issue related to threading. There shouldn't really be
any compatibility problems, since Merlin will unload itself if the
version of Nagios that loads it has a different object structure
version than we're expecting of it, but I suppose weirder things
have happened than a random malfunction in a piece of software.
Problem: Database isn't being updated
Answer:
Inserting events into the database is the job of the daemon.
Information about its problems can be found in the daemon
logfile, /opt/monitor/op5/merlin/daemon.log. If no "query failed"
messages can be found there, check the neb.log file to see if
it's sending anything, and look for disconnected peers and pollers
with:
mon node status
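For the "query failed" check mentioned above, a grep along these lines
should do (using the daemon logfile path given above):

  grep -i 'query failed' /opt/monitor/op5/merlin/daemon.log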
Problem: 'mon node status' shows one or more nodes as 'INACTIVE'
Answer:
Check that merlin and monitor are running on the remote systems.
mon node ctrl $node -- pidof monitor
mon node ctrl $node -- pidof merlind
mon node ctrl $node -- mon node status
If they're not, you've found the symptom, so check the logfiles
or look for corefiles on those systems.
If they are, try
grep -i connect /opt/monitor/op5/merlin/logs/daemon.log
If you see a lot of connection attempts to the INACTIVE node and
no "connection refused", it's almost certainly a firewall issue.
If you do see "connection refused", it's almost certainly because
merlind isn't running, or because of a misconfiguration.
Problem: I've found a corefile
Answer:
Goodie. Now do something useful with it, pretty, pretty please.
file core
will tell you which command was run to create it, so then you can
run:
# gdb -q /path/to/$offending_program core
gdb> bt
If the corefile came from monitor, you'll need to run:
mondebug core
instead, since it will otherwise include a lot of "unresolved symbols"
in the backtrace, which basically means that it's completely useless
and I would still have to re-do it. At least if the core was caused
by a module, which is something we need to know.
Send me the output of both the "file" command and the backtrace
that gdb prints, along with the corefile. This is basically everything
I do when I get a corefile anyway, but for me to do it from just the
corefile you send, I'd have to install the exact same version of
merlin, monitor and possibly a lot of system libraries too.
That's extremely cumbersome, so go the extra halfinch and grab a
backtrace while you're at it. Bugs will get fixed a billion times
faster if you do.
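For merlind corefiles (monitor ones should still go through mondebug,
as noted above), a one-shot, non-interactive backtrace can be grabbed
like this, assuming a gdb recent enough to support batch mode (any
modern one is):

  gdb -q --batch -ex 'bt full' /path/to/$offending_program core > backtrace.txt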
If the trace looks like this:
#0 0x0022c402 in ?? ()
#1 0x0062e116 in ?? ()
#2 0x0806f4fb in ?? ()
#3 0x080566cc in ?? ()
That means that either the stack has been overwritten (a bug can cause
this, and it's bad), or that there are only unresolved symbols in there.
Either way, it's fairly useless in that state, but I'll still want it
since any clue is better than no clue.
Problem: 'mon node status' claims one node hasn't been alive for a
very long time
Answer:
There should be a timestamp stating when it was last active. grep
for that timestamp in Merlin's logfiles and nagios.log. Start with
daemon.log on the system where you ran 'mon node status' and look
for disconnect messages. You'll have to check the logs on both
systems to find the most likely cause.
To look through nagios.log you can use:
mon log show --start=$when-20 --end=$when+20
although you'll have to calculate the start and end things manually,
since that command right there isn't valid shell or anything.
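In actual shell, assuming $when holds the timestamp as a plain number
of seconds, it could look like this:

  mon log show --start=$((when - 20)) --end=$((when + 20))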
mon log show --help
will provide more filtering options, since grep'ing can be tricky
without knowing what to look for.
Problem: Reports show wrong uptime/downtime/whatever
Answer:
In 99% of all cases this is due to missing data in the report_data
table. It now resides in the merlin database as opposed to the
monitor_reports database, where it used to be.
If you can find a particular period in the logs that happens to be
broken, it's not that hard to repair it, although doing so will take
time, and you have to shut down Monitor in order to pull it off.
If the database is anything but huge, it's definitely easiest to
just truncate the report_data table and recreate it from scratch.
The following sequence of commands *should* take care of doing
that, but it's been a while since I wrote them and I haven't got
anything to test with.
mon stop
mon node ctrl -- mon stop
mon log import --fetch
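Note that the sequence above doesn't show the truncation itself. If
you need to do it by hand, something along these lines should work;
this is only a sketch and assumes a MySQL database simply named
'merlin', so adjust the credentials and database name to your setup,
and run it after stopping everything but before the import:

  mysql merlin -e 'TRUNCATE TABLE report_data;'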
The 'log' category of commands is useful for importing data from
remote sites though. Snoop around a bit and see what you can find.
'sortmerge' will thrash the disks quite a lot and use a ton of
memory, but 'import' is the only one which can be potentially
dangerous. Use the --no-sql option for a dry-run first if you're
nervous.
If the report_data table *is* huge and shutting down monitor
while repairs are under way is not an option, there may be other
solutions to try but they are all situational and more or less
voodooish.
Problem: Merlin's config sync destroyed my configuration!
Answer:
If it happened on a poller system, that's by design. Nothing to
do and nothing to try. Just re-do your work and make sure you
don't do it in a file that Merlin will overwrite occasionally.
If it was on a peer system, it might be possible to save it.
Sneak a peek in /var/cache/merlin/backup and see if you can
find your config files there.
Problem: X happened and it's not a listed problem here
Answer:
Perhaps it's by design, and then again it might not be. If you
think it's wrong, check the logfiles (all three of them) on
all systems involved and look for anomalies. When that's done
and you still haven't found anything, remain calm and write a
concise report stating what you did, what you expected to happen
and what actually happened. Feel free to include logfiles and
stuff as well, since I'll almost certainly want to look at
them myself.
Problem: My peered pollers are acting up!
Answer:
It's possible that the pollers are trying to push their config
to the master server, but with the default sync command, a push
goes to all configured nodes at the same time. This can cause one
peered poller to push to the other poller, which restarts that
poller and causes it to try to push its own configuration to the
master; since that push also goes to all its peers and pollers by
default, the config gets pushed back to the first poller, which is
then restarted, and so on, etc, etc.
To fix it, it's usually enough to add an empty object_config
section to all the master node definitions on the pollers, like so:
master yoda {
    address = yoda
    port = 15551
    object_config {
    }
}

master obi1 {
    address = obi1
    port = 15551
    object_config {
    }
}
This removes the config sync command for the master nodes, so
the pollers won't try to push configuration to them.
Problem: My extra files/plugins/whatever aren't being synced!
Answer:
In order to sync files outside of the object configuration one can add
an extra compound to the node-configuration of the node one wants to
sync paths to. The config should look something like this:
poller solo {
    hostgroup = hyperdrive
    address = solo
    port = 15551
    sync {
        /path/to/file/to/sync = yes
        /path/to/file2/to/sync = /path/on/remote/system
    }
}
This will cause /path/to/file/to/sync to be sent to the same path on
the remote system, and the file /path/to/file2/to/sync to be sent to
/path/on/remote/system on the remote system. It should be possible to
list directories and not only files, but that is untested.
Caveat 1: This is not well tested, but what sparse tests I've done
worked out well.
Caveat 2: Peers normally send all of /opt/monitor/etc to each other,
so for those no extra configuration should be necessary in a normal
setup.
Caveat 3: Only the sha1 checksum of the object config, coupled with
the timestamp of the same, is used to determine which file to sync
where. There's no check to prevent a newer version of the file on
the other end from being overwritten.
Caveat 4: This will run as the same user as the merlin daemon. I
have no idea if file ownership and permissions will be preserved.
If caveat 3 bites your ass, check in /var/cache/merlin/backups (or
/var/cache/merlin/backup) for the original files.
This is sort-of supported as of merlin-1.1.7, but see caveat 1.
Problem: One of my nodes can't connect to the others!
Answer:
On the node that can't connect to other nodes you need to add a
node-option to prevent it from attempting to connect to the nodes
it can't connect to. This is useful if you know there is a firewall
that you won't be allowing new connections through in one direction,
and especially if you're using a tripwire-rigged firewall. An
example config would look like this:
master behind-firewall-we-cant-connect-through {
    address = master.behind.firewall.example.com
    connect = no
}
This will cause the merlin daemon on this node to never attempt to
connect to master.behind.firewall.example.com. Normally, both nodes
attempt to connect at startup and every so often so long as the
connection is down.
This feature was introduced in Merlin 1.1.0.
Problem: My poller is monitoring a network behind a firewall which
the master can't see through!
Answer:
There's a node-option you can use on the master node which only
works when configuring poller nodes. Here's what it would look
like:
poller watching-network-behind-firewall-where-master-cant-see {
    address = admin-net-poller.example.com
    port = 15551
    takeover = no
}

With 'takeover = no' set, the master won't try to take over this
poller's checks if the poller disappears; it couldn't run them
anyway, since it can't see through the firewall.
This feature was introduced in Merlin 0.9.1.
Problem: A peer has crashed hard in my network, but the other peer
isn't taking over the checks!
Answer:
If you have no pollers in your merlin network and you're running
merlin 1.1.14-p5 or earlier, you've most likely been hit by a
fairly ancient bug in the Merlin daemon, where it only checked
node timeout in case there were pollers attached to the network.
This was fixed in 7f2fc8f9850735ab549d3bbfa987b09af4cc1a6d, which
is part of all releases after 1.1.14 (i.e. 1.1.15-beta1 and up).
Problem: Nodes keep timing out even though I know they're still
connected. They just send very rarely and slowly!
Answer:
As of 1.1.14-p8 (ae67657bc1bd84beeac70759df5d87a2aac30fb2), you
can set the data_timeout option for your nodes and have the
merlin daemon grok it to mean how often a particular node must
send data to still be considered alive and well. The default
value here is pulse_interval * 2, so 30 seconds in total. You
can increase it a lot, or set it to 0 to avoid any checking at
all, but eventually the system tcp timeout will kick in and
kill the connection anyway.
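In merlin.conf that could look like this (the value is just an
illustration):

  poller solo {
      address = solo
      port = 15551
      # consider solo dead only after two minutes of silence
      # (default is pulse_interval * 2, i.e. 30 seconds; 0 disables the check)
      data_timeout = 120
  }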
Problem: I want feature X!
Answer:
I want icecream.
Problem: Make feature Y work like this!
Answer:
Talk to Peter.
Problem: I want to increase/decrease the size of Merlin's binlog
Answer:
When two nodes lose their connection, Merlin will start to store
check results, and will sync them once the nodes are connected
again. The data is first written to memory, and after reaching
the memory limit Merlin will write to /var/lib/merlin/binlogs/.
The old default values were 10MB to memory and 100MB to the
filesystem. The new defaults are 500MB to memory and 5000MB
to filesystem.
If you want to change these default values you can specify
binlog_max_memory_size and binlog_max_file_size in merlin.conf.
For example (in MB):
binlog_max_memory_size = 50;
binlog_max_file_size = 1337;