MonitoringService/SoftwareComparison

This section describes the requirements of the MonitoringService and the software we may choose to implement them in the future, replacing the solutions currently in use.

General requirements

open source
documented

Contents

Dashboards
Availability
Performance trends
Traffic accounting
1. Web usage statistics
Network monitoring
1. Intrusion Detection System (IDS)
Intrusion detection systems
1. Log auditing
2. Act on log entries
  1. Swatch
  2. SEC - Simple Event Correlator
Centralized logging
1. Nerdlog

Dashboards

This was also discussed in: 244349.

Requirements

user-friendly
publicly accessible
easy and lightweight, so we do updates often and quickly
decentralised (e.g. ostatus does that, since statuses are reflected to all subscribers)

Nice to haves

controlled from IRC (say: change the topic, it changes the status)
nagios integration
mobile-device friendly

Also note that there are two different use-cases here:

a dashboard to inform users of the status of services
a notification system to pre-emptively announce downtimes and so on - see 18288 for those requirements

Ideally, the tool would do both, but not necessarily.

The PSNA also has good advice on this, see page 237.

Inspiration

État des services gandi
Amazon Service Health Dashboard
Disqus.com service status - basically https://www.statuspage.io/
GitHub status - "Battle station fully operational", auto-refresh, twitter-connected, simple color coded (see this blog post for more details), not open-source (confirmed in personal email between GitHub support and anarcat on 2013-05-02)
Wikimedia status page - based on proprietary nimsoft software
Riseup - RSS feeds
Potager.org - ikiwiki based

Known projects (or ideas)

Cachet - a good alternative that seems much simpler to deploy than the others below, just PHP with Laravel. Demo: https://demo.cachethq.io/ (test@test.com, test123). Cachet was chosen because it looks nice and works well, but above all because it is the only one that actually works! See 244349 and CachetConfiguration. The dev team also responds very quickly to our requests!
- mobile-friendly
- not decentralised/distributed: https://twitter.com/theanarcat/status/575061666532102144
- nagios integration discussed: https://github.com/cachethq/Cachet/issues/225
- user-friendly
- publicly accessible
- fairly easy to use
- aims for LDAP support
- no Twitter, Identica, IRC or XMPP support for now
- see CachetConfiguration for details.
Staytus - another product similar to Cachet, but written in Ruby. Supports email notifications (missing from Cachet); no Twitter notifications either
- mobile-friendly
- not distributed
- no nagios integration
- user-friendly - seems to be even nicer than Cachet, as there are links to individual announcements and notifications
- no LDAP support
- MIT-licensed
- performance problems similar to Cachet's
- to be considered?
Announcements through Redmine: since the upgrade to 2.5, people can set themselves as watchers on project news feeds. We could post the announcements there: it supports RSS feeds and so on. On the other hand, it would mean creating accounts for everyone... not great.
Identi.ca !koumbitstatus group/statusnet - current approach: broken! identi.ca switched to pump.io and groups are gone, so are RSS feeds and the twitter bridge. ouark.
- Inspiration: Social networks for servers - use status.net to get a page where servers regularly report their state.
- we could run our own status.net instance?
use the wiki! it's on a separate server, and we could have a subset of the site hosted on koumbitstatus.net?
- more specifically: make a heavily themed ikiwiki site pushed to the various "filet" sites and a twitter plugin
Overseer - used at Disqus.com, Python/django, user-friendly/simple, administrator non-friendly, twitter integration, Apache2 license, development stopped, Disqus replaced it with Statuspage.io
Stashboard - MIT license, demo, Twitter integration, REST API, abandoned by Koumbit, see: StashboardConfiguration
JenkinsService - a bunch of jobs could be configured to check on Nagios (and maybe other things) and display a nice user-friendly status
we also used Drupal (see 118 for details and ideas)
see also RapportsIntervention
CiviCRM + CiviMail with the clients loaded in, for targeted notifications
Baobab, the software used on Gandi's status page. Django based
cstate, hugo-based static site generator, tag-based RSS feeds, easy setup on Netlify, GitLab CI integration, badges, readonly API

Ohloh comparison

Availability

Requirements

proven, stable and reliable solution
interoperable
many metrics implemented
fast and scalable
integration with puppet
...

Known projects

Merlin - a Nagios module + daemon for creating a distributed monitoring setup. see git repository
Shinken - nagios-like and compatible, rewritten from scratch
Icinga - now packaged in Debian squeeze; a fork of Nagios started in 2009. "...the Nagios software itself- is maintained by a single developer in the United States and hence is developed at a slower pace." http://www.icinga.org/faq/why-a-fork/
- demo user and password: guest
Sensu - RabbitMQ/AMQP monitoring system
Observium - a complement to Nagios; uses SNMP, autodiscovery, pretty graphs, to be tested! No Debian package for now (see reference)
OpenNMS - a Nagios replacement that seems more complete, coherent and solid. Java/XML/heavy machinery. Distributed, SNMP, scalable.
Zabbix - very impressive: does stats, and monitoring, SLA, escalations, RRD graphs, contacts (sms/email/etc), supports distributed setups, PHP/MySQL interface, needs C agents on nodes for disk/mem/cpu stats, SNMP support...
- demo
- presentation RMLL
- insane mysql monitoring
- 21813
Monit - very small monitoring system good at restarting crashed stuff
Centreon - a Nagios distribution with AJAX, PHP and MySQL
Argus - seems like an interesting alternative to Nagios that supports redundant setups and has a simpler configuration file syntax. Not to be confused with the network monitoring tool Argus
portmon does exactly what the name says. sometimes simplicity is the key.
pung same branch
nefu seems interesting as it is a really basic probe, with a few plugins (http, imap, ntp, etc), along with dependencies
sysmon is in the same spirit of simplicity, a kind of cross with portmon and nefu
ICMPmonitor primitive way of checking up/down status
Netmond supports SNMP, ping, and port probes, and has nice GUI frontends.
Project Observer seems to try to do everything, again, but seems to do it pretty right, haven't looked at notifications however..
NMIS
OpsView
Riemann - alerts, graphing, seems awesome?
Assimmon - autodiscovery, also does switch monitoring
Bosun - created by Stack Exchange, MIT license. Autodiscovery of services; coupled with scollector and opentsdb it can collect metrics and draw graphs; aggregated data monitoring (e.g. can monitor something over multiple machines)

Performance trends

Requirements

aggregation of multiple hosts in "clusters"
cute, with nice overviews and graphs
config-less setup
integration with Puppet
...

Known projects

Pandora FMS - looks quite nice with lots of gizmos
Ganglia - high performance, distributed, cluster-oriented, see GangliaMonitoring
Collectd - collects data every 10 seconds and stores it in RRD format, with a simple CGI to view the graphs. Much more efficient than Munin. The package in Debian lenny kicks ass! (the etch version is being tested here)
Zenoss seems an interesting merge between "munin" and "nagios", but it doesn't support PostgreSQL
Apan, RRDTool plugin for nagios
Graphite, database backed graphing application in Django. capable of LDAP authentication for users. possible to use memcached to boost performance. uses "whisper" instead of RRD for storage (capable of filling in events a posteriori). now with a package for wheezy for "carbon", the daemon that collects data
Prometheus - http://prometheus.io/

Possible roadmap

Replace munin-graph by Graphite-app (keeping munin-update to do the data collecting)
Replace munin-update by Carbon and Whisper (collecting data from munin-node clients with our already-existing set of plugins)
Replace munin-node by collectd to do the data collection. This would give us more accurate information (10s interval, instead of 5mins)
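
For the Carbon steps of this roadmap, it may help to see what a single data point looks like on the wire. The sketch below formats one sample for Carbon's plaintext protocol (metric path, value, Unix timestamp, one per line, by default on TCP port 2003). The metric path, the value and the host name graphite.example.org are made up for illustration and are not part of our setup.

```shell
# Format one sample for Carbon's plaintext protocol:
#   <metric path> <value> <unix timestamp>
carbon_line() {
    printf '%s %s %s\n' "$1" "$2" "$3"
}

# Hypothetical metric path and value; the timestamp is the current time.
sample=$(carbon_line "servers.shell.load.shortterm" "0.42" "$(date +%s)")
echo "$sample"

# To actually submit it, pipe the line to the Carbon listener, for example
# (host name is made up, and nc options vary between netcat flavours):
#   echo "$sample" | nc graphite.example.org 2003
```

Any collector that can emit such lines (a munin plugin wrapper, collectd, a cron job) can feed Carbon, which is what makes the step-by-step replacement above plausible.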

Traffic accounting

See TrafficAccountingService for the extensive list and requirements.

Web usage statistics

There is a lot of other crap here, let's just mention:

Google Analytics - non-free-as-in-speech, Google, but now an industry reference

Network monitoring

We don't have a proper LookingGlass configured in the network, and it's going to become a significant problem once the redundant RoutingService kicks in. A few possibilities here:

AOL's Trigger replaces RANCID - maybe?
SmokePing: The next version of SmokePing (2.4) can also provide a LookingGlass, in the form of an AJAXy traceroute. Its packaging has stalled (bug 485977) since the package is now looking for a new maintainer (bug 568742). 2.4 also depends on the qooxdoo framework, which is not yet in Debian (bug 485975). Update: the new version doesn't have a looking glass anymore.
RANCID: there's a rancid-cgi package which also provides a looking glass, but it assumes it's talking to a Cisco or similar router with a pre-defined commandline interface. Basically, it offloads the work to the router, and collects the results. Since we're using BSD-based routers, this may require extra work to set up.
OpenBSD's OpenBGPd has a looking glass called bgplg. It also supports pings and traceroutes. That seems like the best software to use considering we're going to use OpenBGPd eventually. I found this demo, which doesn't make traceroute or ping available. Oh and there's a shell interface too.
DIS - a RANCID replacement

Intrusion Detection System (IDS)

To detect some types of attacks that happen on the network, it could be useful to run an IDS.

Snort is the best-known option
psad uses snort signatures and offers some more features. It's a project that originated from the Bastille Linux project.
OSSEC

Intrusion detection systems

A broader category than log auditing, this can also process raw traffic and other things...

Snort - the de-facto standard, with a commercial Ruby interface (https://www.threatstack.com/)
Suricata
Sagan - log monitoring, snort-like
OSSEC
ACARM-ng
Bro

Log auditing

Log monitoring allows for acting upon certain conditions detected in log files. This also includes a central log monitoring server and related tools.

Ticket for central logging: 11878.

A lot of those were taken from a ;login: article.

Tenshi
logsurfer
SEC
Splunk (commercial, with an open-source version)
Sentry - Log aggregation with a cute web interface to browse events.
- Patrick wrote a very simple plugin for rsyslogd
Sagan - log monitoring, snort-like
Loganalysis portal
Logstash - log parsing and aggregation, with search, looks great!
http://crunchtools.com/software/petit/
http://goaccess.prosoftcorp.com/

Logwatch

logwatch is too verbose for our needs as it sends an email *every day*. It was disabled after a test period on shell.k.n.

Logtool

Logtool is a logfile parser / colorizer that uses the logcheck database to display, in realtime, problems or unmatched lines. It also allows simple manipulation of the stream to remove, for example, the program name or the host in the output. It also supports manual configuration of regular expressions.

The only issue with logtool is that it blindly takes the regex from logcheck and doesn't colorize the actual log data apart from the left columns... It could use the regex subgroups instead?

Logcheck

logcheck only prints out "abnormal" lines (those that don't match a database of regex patterns for normal activity) when run; cron then mails the output to root. It runs every hour and we can easily customize the regexes.
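
The filtering idea can be sketched with plain grep: keep only the lines that match none of the ignore rules. This is only an illustration of the principle, not how logcheck is implemented end to end (the real tool also tracks which part of each log it has already read and has separate rule sets for security events). The log lines, host name and PIDs below are made up, and GNU grep is assumed for \w.

```shell
# Work in a throwaway directory so nothing under /etc or /var/log is touched.
tmp=$(mktemp -d)

# Two ignore rules, one extended regex per line, in the style of an
# ignore.d.server file.
cat > "$tmp/rules" <<'EOF'
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: peer .* now valid
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ sshd\[[0-9]+\]: Received disconnect from .*
EOF

# Made-up syslog excerpt: two "normal" lines and one that no rule covers.
cat > "$tmp/log" <<'EOF'
Mar 11 22:23:57 shell ntpd[1234]: peer 192.0.2.1 now valid
Mar 11 22:24:03 shell sshd[2345]: Received disconnect from 192.0.2.7: 11: Bye Bye
Mar 11 22:25:10 shell kernel: Out of memory: Kill process 3456 (apache2)
EOF

# -v inverts the match and -f reads the patterns from the rules file, so only
# lines matching none of the rules survive.
abnormal=$(grep -E -v -f "$tmp/rules" "$tmp/log")
printf '%s\n' "$abnormal"

rm -rf "$tmp"
```

Only the kernel line is printed here; every rule added to the ignore file removes one more class of lines from the hourly mail.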

Installation:

apt-get install logcheck

To reduce false positives, here are example rules that are added on servers, in /etc/logcheck/ignore.d.server/koumbit-custom

# Puppet
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ puppet-agent\[[0-9]+\]: .*, which is a deprecated section. I'm assuming you meant.*$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ puppet-agent\[[0-9]+\]: \(//Apt/Exec\[/usr/bin/apt-get update .* executed successfully$

# NTP
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: peer .* now valid
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: peer .* now invalid
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: skew change .* exceeds limit
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: reply from .* not synced.*

# on vservers
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ sshd\[[0-9]+\]: error writing /proc/self/oom_adj: Permission denied$

# Asterisk - Depends on your log format defined in /etc/asterisk/logger.conf, using "dateformat=%F %T" for fail2ban
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ asterisk\[[0-9]+\]: rc_avpair_new: unknown attribute [0-9]+
\[[- :0-9]{19}\] WARNING\[[0-9]+\] app_voicemail.c: Couldn't read username
\[[- :0-9]{19}\] WARNING\[[0-9]+\] app_dial.c: Unable to create channel of type 'Zap' \(cause 34 - Circuit/channel congestion\)
\[[- :0-9]{19}\] WARNING\[[0-9]+\] chan_sip.c: Peer '.*' is now.*

# Web / suhosin
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ suhosin\[[0-9]+\]: ALERT - tried to register forbidden variable .*

# ssh
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ sshd\[[0-9]+\]: Received disconnect from .*

# bind (especially when running on an IPv6 network)
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ named\[[0-9]+\]: DNS format error from.*

How to create new rules is documented in the logcheck mails:

anarcat@shell:~$ cat /etc/logcheck/footer.txt 
If you want to remove a line from this report, add it to /etc/logcheck/ignore.d.server/local-package where
"package" is the package generating the error. Consider contributing back to http://logcheck.org/

To tune an existing regex to see why it's failing, use:

echo 'log line to go away' | grep -E --color=auto 'regexp you want to test'
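
As a worked example of that check, here is one of the NTP rules from the example rules above tested against a sample line. The log line itself (host name, PID, peer address) is made up, and GNU grep is assumed for \w.

```shell
# Hypothetical syslog line; the host name, PID and peer address are made up.
line='Mar 11 22:23:57 shell ntpd[1234]: peer 192.0.2.1 now valid'

# Rule copied from the NTP section of the example rules.
rule='^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: peer .* now valid'

# -q suppresses output; only the exit status tells us whether the rule matches.
if printf '%s\n' "$line" | grep -E -q "$rule"; then
    verdict="matched: logcheck would ignore this line"
else
    verdict="no match: logcheck would still report this line"
fi
echo "$verdict"
```

When a rule does not match, shortening the regex from the right until it matches again is usually the quickest way to find the part that is wrong.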

You can also change the e-mail of the person receiving the e-mails in /etc/logcheck/logcheck.conf (by default it sends to the "logcheck" mail alias, which is usually sent to root):

SENDMAILTO="johndoe+logcheck@example.org"

Logcheck was tested on shell.koumbit.net and voice.koumbit.net, but was disabled there due to excessive noise.

Currently used on devpaix.koumbit.net, oxfamqc.koumbit.net.

Act on log entries

Swatch

Sucks. Old. badly designed, buggy. Die die die.

SEC - Simple Event Correlator

A proper implementation of the above and more.

howto

Sample config for apache crashes:

# recognize repeated crashes of apache and restart it after a certain threshold
#
# [Sun Mar 11 22:23:57 2007] [notice] child pid 23733 exit signal Segmentation fault (11)
type=SingleWithThreshold
ptype=RegExp
pattern=.*child pid \d+ exit signal Segmentation fault \(\d+\).*
desc=$0
action=shellcmd /etc/init.d/apache restart; shellcmd /bin/sh -c "echo 'apache restarted' | mail -s 'apache restarted' root"
window=10
thresh=3
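
The rule above fires its action once thresh matching lines arrive within window seconds. The sketch below mimics that counting logic with awk over made-up input (an epoch timestamp followed by the log message), to make the semantics concrete. It is a simplification for illustration, not SEC's actual implementation, whose window handling is more subtle.

```shell
window=10   # seconds, as in the rule above
thresh=3    # number of matches needed within the window

# Made-up input: "<epoch seconds> <message>".
events='100 child pid 23733 exit signal Segmentation fault (11)
103 child pid 23734 exit signal Segmentation fault (11)
104 server reached MaxClients setting
108 child pid 23735 exit signal Segmentation fault (11)
130 child pid 23736 exit signal Segmentation fault (11)'

fired=$(printf '%s\n' "$events" | awk -v window="$window" -v thresh="$thresh" '
    /child pid [0-9]+ exit signal Segmentation fault \([0-9]+\)/ {
        now = $1
        times[++n] = now
        # Forget earlier matches that have fallen out of the window.
        while (first < n && now - times[first + 1] >= window) first++
        if (n - first >= thresh) {
            print "would restart apache at t=" now
            first = n    # start counting afresh after firing
        }
    }')
echo "$fired"
```

With this input the three crashes at t=100, 103 and 108 fall within 10 seconds and trigger once; the isolated crash at t=130 does not.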

Of course, I had to write a rc.d startup script (grr) for this thing to start at boot:

#!/bin/sh
# Start or stop sec
#
# Anarcat <anarcat@koumbit.org>
# based on postfix's init.d script

PATH=/bin:/usr/bin:/sbin:/usr/sbin
NAME=sec

case "$1" in
    start)
        echo -n "Starting simple event correlator: sec"

        start-stop-daemon --start --pidfile /var/run/sec.pid --exec /usr/bin/perl --startas /usr/bin/sec --\
                -conf=/etc/sec.conf -input=/var/log/apache/error.log -pid=/var/run/sec.pid -detach -syslog=daemon 2>&1 |
                (grep -v 'rules loaded from' 1>&2 || /bin/true)

        echo "."
    ;;

    stop)
        echo -n "Stopping simple event correlator: sec"
        start-stop-daemon --stop --pidfile /var/run/sec.pid --quiet
        echo "."
    ;;

    restart)
        $0 stop || true
        $0 start
    ;;

    force-reload|reload)
    ;;

    *)
        echo "Usage: $0 {start|stop|restart|reload|force-reload}"
        exit 1
    ;;
esac

exit 0

... and hook it into the system:

root@ques:/etc/init.d# update-rc.d sec defaults
 Adding system startup for /etc/init.d/sec ...
   /etc/rc0.d/K20sec -> ../init.d/sec
   /etc/rc1.d/K20sec -> ../init.d/sec
   /etc/rc6.d/K20sec -> ../init.d/sec
   /etc/rc2.d/S20sec -> ../init.d/sec
   /etc/rc3.d/S20sec -> ../init.d/sec
   /etc/rc4.d/S20sec -> ../init.d/sec
   /etc/rc5.d/S20sec -> ../init.d/sec

Centralized logging

We are planning on using log.koumbit.net to log all servers to a central location. This will allow better investigation of attacks where all logs have been removed, for example, and also let an operator watch *all* logs at the same time and correlate logs from different servers.

This has yet to be implemented, see 2931.

Nerdlog

Nerdlog isn't really a central logging system, but it can replace one.

https://github.com/dimonomid/nerdlog

Nerdlog is a fast, remote-first, multi-host TUI log viewer with timeline histogram and no central server. Loosely inspired by Graylog/Kibana, but without the bloat. Pretty much no setup needed, either.

It's laser-focused on being efficient while querying logs from multiple remote machines simultaneously, filtering them by time range and patterns, while also drawing an interactive timeline histogram for quick visual insight.

Primary use case: reading system logs (from the files /var/log/messages or /var/log/syslog, or straight from journalctl) from one or more remote hosts. Very efficient even on large log files (like 1GB or more).

It does support some other log formats and can use any log files, but that was the primary use case which was driving the implementation: we were having our web service backend running as systemd services on a bunch of Linux instances, printing a lot of logs, and wanted to be able to read these logs efficiently and having the timeline histogram, much like tools like Graylog have.

MonitoringService/SoftwareComparison (last edited 2025-06-29 21:59:51 by hubide)
