This section describes the requirements of the MonitoringService and the software we may choose to implement them in the future, replacing the solutions currently in use.

General requirements
  • open source
  • documented

Dashboards

This was also discussed in: 244349.

Requirements
  • user-friendly
  • publicly accessible
  • easy and lightweight, so that we post updates often and quickly
  • decentralised (e.g. ostatus does that, since statuses are reflected to all subscribers)
Nice to haves
  • controllable from IRC (say: change the topic, it changes the status)
  • nagios integration
  • mobile-device friendly

Also note that there are two different use-cases here:

  1. a dashboard to inform users of the status of services
  2. a notification system to pre-emptively announce downtimes and so on - see 18288 for those requirements

Ideally, the tool would do both, but not necessarily.

The PSNA also has good advice on this, see page 237.

Inspiration
Known projects (or ideas)
  • Cachet - a good alternative that seems much simpler to deploy than the others below, simply PHP with Laravel, demo: https://demo.cachethq.io/ test@test.com, test123 (./) Cachet was chosen because it looks good and works well, but also and above all because it is the only one that really works! See 244349 and CachetConfiguration. The dev team also responds very quickly to our requests!

  • Staytus - another product similar to Cachet, but written in Ruby. Supports email notifications (missing from Cachet), no Twitter notifications either

    • mobile-friendly
    • not distributed
    • no nagios integration
    • user-friendly - seems to be even nicer than Cachet, as there are links to individual announcements and notifications
    • no LDAP support
    • MIT-licensed
    • similar performance problems to Cachet

    • to be considered?
  • Announcements through Redmine: since the upgrade to 2.5, people can add themselves as watchers on the news feeds of projects. We could post the announcements there: it supports RSS feeds and everything. On the other hand, that would mean creating accounts for everyone... not great.
  • Identi.ca !koumbitstatus group/statusnet - current approach: broken! identi.ca switched to pump.io and groups are gone, so are RSS feeds and the twitter bridge. ouark.

  • use the wiki! it's on a separate server, and we could have a subset of the site hosted on koumbitstatus.net?
    • more specifically: make a heavily themed ikiwiki site pushed to the various "filet" sites and a twitter plugin

  • Overseer - used at Disqus.com, Python/django, user-friendly/simple, administrator non-friendly, twitter integration, Apache2 license, development stopped, Disqus replaced it with Statuspage.io

  • Stashboard - MIT license, demo, Twitter integration, REST API, abandoned by Koumbit, see: StashboardConfiguration

  • JenkinsService - a bunch of jobs could be configured to check on Nagios (and maybe other things) and display a nice user-friendly status

  • we also used Drupal (see 118 for details and ideas)

  • see also RapportsIntervention

  • CiviCRM + CiviMail, with the clients loaded in it, for targeted notifications
  • Baobab, the software used on Gandi's status page. Django based

  • cstate, hugo-based static site generator, tag-based RSS feeds, easy setup on Netlify, GitLab CI integration, badges, readonly API

Ohloh comparison

Availability

Requirements
  • proven, stable and reliable solution
  • interoperable
  • many metrics implemented
  • fast and scalable
  • integration with puppet
  • ...
Known projects
  • Merlin - a Nagios module + daemon for creating a distributed monitoring setup. see git repository

  • Shinken - nagios-like and compatible, rewritten from scratch

  • Icinga - now packaged in Debian squeeze; a fork of Nagios from 2009. "...the Nagios software itself- is maintained by a single developer in the United States and hence is developed at a slower pace." http://www.icinga.org/faq/why-a-fork/

    • demo user and password: guest

  • Sensu - RabbitMQ/AMQP monitoring system

  • Observium - a complement to Nagios, uses SNMP, autodetection, pretty graphs, to be tested! No Debian package for now (reference)

  • OpenNMS - a Nagios replacement that seems more complete, coherent and solid. Java/XML/overly complex machinery. Distributed, SNMP, scalable.

  • Zabbix - very impressive: does stats, and monitoring, SLA, escalations, RRD graphs, contacts (sms/email/etc), supports distributed setups, PHP/MySQL interface, needs C agents on nodes for disk/mem/cpu stats, SNMP support...

  • Monit - very small monitoring system good at restarting crashed stuff

  • Centreon - a Nagios distribution with AJAX, PHP and MySQL

  • Argus seems like an interesting alternative to Nagios that supports redundant setups and has a simpler configuration file syntax. <!> Not to be confused with the network monitoring tool Argus

  • portmon does exactly what the name says. sometimes simplicity is the key.

  • pung - in the same vein as portmon

  • nefu seems interesting as it is a really basic probe, with a few plugins (http, imap, ntp, etc), along with dependencies

  • sysmon is in the same spirit of simplicity, a kind of cross with portmon and nefu

  • ICMPmonitor primitive way of checking up/down status

  • Netmond supports SNMP, ping, and port probes, and has nice GUI frontends.

  • Project Observer seems to try to do everything, again, but seems to do it pretty right, haven't looked at notifications however..

  • NMIS

  • OpsView

  • Riemann - alerts, graphing, seems awesome?

  • Assimmon - autodiscovery, also does switch monitoring

  • Bosun - created by Stack Exchange, MIT license. Autodiscovery of services; coupled with scollector and opentsdb it can collect metrics and draw graphs; aggregated data monitoring (e.g. can monitor something over multiple machines)

Requirements
  • aggregation of multiple hosts in "clusters"
  • cute, with nice overviews and graphs
  • config-less setup
  • integration with Puppet
  • ...
Known projects
  • Pandora FMS - looks quite nice with lots of gizmos

  • Ganglia - high performance, distributed, cluster-oriented, see GangliaMonitoring

  • Collectd - collects data every 10 seconds and stores it in RRD format, with a simple CGI to view the graphs. Much more efficient than Munin. The package in Debian lenny kicks ass! {*} {*} {*} {*} {*} (the etch version is being tested here)

  • Zenoss seems an interesting merge between "munin" and "nagios", but it does not support PostgreSQL

  • Apan, RRDTool plugin for nagios

  • Graphite, database backed graphing application in Django. capable of LDAP authentication for users. possible to use memcached to boost performance. uses "whisper" instead of RRD for storage (capable of filling in events a posteriori). now with a package for wheezy for "carbon", the daemon that collects data

  • http://prometheus.io/

Possible roadmap
  1. Replace munin-graph by Graphite-app (keeping munin-update to do the data collecting)
  2. Replace munin-update by Carbon and Whisper (collecting data from munin-node clients with our already-existing set of plugins)
  3. Replace munin-node by collectd to do the data collection. This would give us more accurate information (10 s interval, instead of 5 min)
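
As a rough illustration of what feeding Carbon in step 2 could look like: Carbon accepts samples over a plaintext protocol, one "path value timestamp" line per sample. The sketch below only formats such a line. The helper name carbon_line, the metric name and the host are hypothetical, and how the existing munin-node plugins would actually be bridged to Carbon is not settled here.

```shell
#!/bin/sh
# carbon_line PATH VALUE [TIMESTAMP]
# Format one sample for Carbon's plaintext listener: "path value timestamp".
# carbon_line is a hypothetical helper, not part of Graphite.
carbon_line() {
    # Default to the current epoch time when no timestamp is given.
    printf '%s %s %s\n' "$1" "$2" "${3:-$(date +%s)}"
}

# Example (hypothetical metric name and host; 2003 is Carbon's default
# plaintext port):
#   carbon_line servers.example.load.shortterm 0.42 | nc graphite.example.org 2003
```

Passing an explicit timestamp makes it possible to backfill older samples, which matches the note above about whisper being able to fill in events a posteriori.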

Traffic accounting

See TrafficAccountingService for the extensive list and requirements.

Web usage statistics

There is a lot of other crap here, let's just mention:

Network monitoring

We don't have a proper LookingGlass configured in the network, and it's going to become a significant problem once the redundant RoutingService kicks in. A few possibilities here:

Intrusion Detection System (IDS)

To detect some types of attacks that happen on the network, it could be useful to run an IDS.

Intrusion detection systems

A broader category than log auditing, this can also process raw traffic and other things...

Log auditing

Log monitoring allows for acting upon certain conditions detected in log files. This also includes a central log monitoring server and related tools.

Ticket for central logging: 11878.

A lot of those were taken from a ;login: article.

Logwatch

logwatch is too verbose for our needs as it sends an email *every day*. It was disabled after a test period on shell.k.n.

Logtool

Logtool is a logfile parser / colorizer that uses the logcheck database to display, in realtime, problems or unmatched lines. It also allows simple manipulation of the stream to remove, for example, the program name or the host in the output. It also supports manual configuration of regular expressions.

The only issue with logtool is that it blindly takes the regex from logcheck and doesn't colorize the actual log data apart from the left columns... It could use the regex subgroups instead?

Logcheck

logcheck only prints out "abnormal" lines (those that don't match a database of regex patterns for normal activity) when run; the output is then sent to root by cron. It is run every hour and we can easily customize the regexes.

Installation:

apt-get install logcheck

To reduce false positives, here are example rules that are added on servers, in /etc/logcheck/ignore.d.server/koumbit-custom

# Puppet
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ puppet-agent\[[0-9]+\]: .*, which is a deprecated section. I'm assuming you meant.*$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ puppet-agent\[[0-9]+\]: \(//Apt/Exec\[/usr/bin/apt-get update .* executed successfully$

# NTP
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: peer .* now valid
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: peer .* now invalid
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: skew change .* exceeds limit
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: reply from .* not synced.*

# on vservers
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ sshd\[[0-9]+\]: error writing /proc/self/oom_adj: Permission denied$

# Asterisk - Depends on your log format defined in /etc/asterisk/logger.conf, using "dateformat=%F %T" for fail2ban
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ asterisk\[[0-9]+\]: rc_avpair_new: unknown attribute [0-9]+
\[[- :0-9]{19}\] WARNING\[[0-9]+\] app_voicemail.c: Couldn't read username
\[[- :0-9]{19}\] WARNING\[[0-9]+\] app_dial.c: Unable to create channel of type 'Zap' \(cause 34 - Circuit/channel congestion\)
\[[- :0-9]{19}\] WARNING\[[0-9]+\] chan_sip.c: Peer '.*' is now.*

# Web / suhosin
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ suhosin\[[0-9]+\]: ALERT - tried to register forbidden variable .*

# ssh
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ sshd\[[0-9]+\]: Received disconnect from .*

# bind (especially when running on an IPv6 network)
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ named\[[0-9]+\]: DNS format error from.*

How to create new rules is documented in the logcheck mails:

anarcat@shell:~$ cat /etc/logcheck/footer.txt 
If you want to remove a line from this report, add it to /etc/logcheck/ignore.d.server/local-package where
"package" is the package generating the error. Consider contributing back to http://logcheck.org/

To tune an existing regex to see why it's failing, use:

echo 'log line to go away' | egrep --color=auto 'regexp want to test'
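
To check a whole rules file against a sample of log lines, rather than one regex at a time, something like the following sketch could help. It assumes GNU grep (for \w in the rules); the function name unmatched_lines and the sample file are hypothetical, and this only approximates what logcheck does with its ignore files.

```shell
#!/bin/sh
# unmatched_lines RULES_FILE SAMPLE_FILE
# Print the lines of SAMPLE_FILE that match none of the extended regexes in
# RULES_FILE, i.e. roughly the lines that would still be reported.
# Hypothetical helper, not part of logcheck.
unmatched_lines() {
    clean=$(mktemp)
    # Keep only real patterns: comments and blank lines are not rules, and an
    # empty pattern would match every line.
    grep -E -v '^[[:space:]]*(#|$)' "$1" > "$clean"
    # -f reads one regex per line, -v keeps the lines matching none of them.
    grep -E -v -f "$clean" "$2"
    rm -f "$clean"
}

# Usage:
#   unmatched_lines /etc/logcheck/ignore.d.server/koumbit-custom sample.log
```

If a line you expected to be ignored shows up in the output, the rule that should have caught it is the one to debug with the egrep command above.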

You can also change the address that receives the e-mails in /etc/logcheck/logcheck.conf (by default they are sent to the "logcheck" mail alias, which usually goes to root):

SENDMAILTO="johndoe+logcheck@example.org"

Logcheck was tested on shell.koumbit.net and voice.koumbit.net, but was disabled there due to excessive noise.

Currently used on devpaix.koumbit.net, oxfamqc.koumbit.net.

Centralized logging

We are planning to use log.koumbit.net to collect the logs of all servers in a central location. This will allow better investigation of attacks where all local logs have been removed, for example, and will also let an operator watch *all* logs at the same time and correlate logs from different servers.

This has yet to be implemented, see 2931

Act on log entries

Swatch

Sucks. Old. badly designed, buggy. Die die die.

SEC - Simple Event Correlator

A proper implementation of the above and more.

howto

Sample config for apache crashes:

# recognize repeated crashes of apache and restart it after a certain threshold
#
# [Sun Mar 11 22:23:57 2007] [notice] child pid 23733 exit signal Segmentation fault (11)
type=SingleWithThreshold
ptype=RegExp
pattern=.*child pid \d+ exit signal Segmentation fault \(\d+\).*
desc=$0
action=shellcmd /etc/init.d/apache restart; shellcmd /bin/sh -c "echo 'apache restarted' | mail -s 'apache restarted' root"
window=10
thresh=3
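
To get a feel for what window=10 and thresh=3 mean, here is a simplified model of the counting, not SEC's actual implementation: it counts matching events inside a sliding window and "fires" once the threshold is reached. The function name threshold_fires is hypothetical, and details of SEC's real behaviour (exactly how the window slides, how further matches are handled after the action ran) are not modelled; see the SEC man page for those.

```shell
#!/bin/sh
# threshold_fires WINDOW THRESH
# Reads epoch timestamps of matching events on stdin (one per line, in
# ascending order) and prints the timestamp of each event at which the
# action would fire in this simplified model.
threshold_fires() {
    awk -v window="$1" -v thresh="$2" '
    {
        now = $1
        times[++last] = now
        # Forget events that fell out of the window ending at "now".
        while (first < last && times[first + 1] <= now - window)
            first++
        # Events still counted are times[first+1] .. times[last].
        if (last - first >= thresh) {
            print now
            first = last    # start counting from scratch after firing
        }
    }'
}

# Three segfaults within 10 seconds fire once; isolated ones never do:
#   printf "%s\n" 0 4 8 100 120 | threshold_fires 10 3
```

With the sample config above, this corresponds to restarting apache only when three segmentation faults are logged within ten seconds, while occasional isolated crashes are ignored.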

Of course, I had to write a rc.d startup script (grr) for this thing to start at boot:

#!/bin/sh
# Start or stop sec
#
# Anarcat <anarcat@koumbit.org>
# based on postfix's init.d script

PATH=/bin:/usr/bin:/sbin:/usr/sbin
NAME=sec

case "$1" in
    start)
        echo -n "Starting simple event correlator: sec"

        start-stop-daemon --start --pidfile /var/run/sec.pid --exec /usr/bin/perl --startas /usr/bin/sec --\
                -conf=/etc/sec.conf -input=/var/log/apache/error.log -pid=/var/run/sec.pid -detach -syslog=daemon 2>&1 |
                (grep -v 'rules loaded from' 1>&2 || /bin/true)

        echo "."
    ;;

    stop)
        echo -n "Stopping simple event correlator: sec"
        start-stop-daemon --stop --pidfile /var/run/sec.pid --quiet
        echo "."
    ;;

    restart)
        $0 stop || true
        $0 start
    ;;

    force-reload|reload)
    ;;

    *)
        echo "Usage: $0 {start|stop|restart|reload|force-reload}"
        exit 1
    ;;
esac

exit 0

... and hook it into the system:

root@ques:/etc/init.d# update-rc.d sec defaults
 Adding system startup for /etc/init.d/sec ...
   /etc/rc0.d/K20sec -> ../init.d/sec
   /etc/rc1.d/K20sec -> ../init.d/sec
   /etc/rc6.d/K20sec -> ../init.d/sec
   /etc/rc2.d/S20sec -> ../init.d/sec
   /etc/rc3.d/S20sec -> ../init.d/sec
   /etc/rc4.d/S20sec -> ../init.d/sec
   /etc/rc5.d/S20sec -> ../init.d/sec

MonitoringService/SoftwareComparison (last edited 2021-01-11 14:24:42 by anarcat)