This section describes the requirements of the MonitoringService and the software we may choose to implement them in the future, replacing the solutions currently in use.
- General requirements
- open source
- documented
Dashboards
This was also discussed in: 244349.
- Requirements
- user-friendly
- publicly accessible
- easy and lightweight, so we do updates often and quickly
- decentralised (e.g. ostatus does that, since statuses are relayed to all subscribers)
- Nice to haves
- controllable from IRC (say: change the topic, and it changes the status)
- nagios integration
- mobile-device friendly
Also note that there are two different use-cases here:
- a dashboard to inform users of the status of services
- a notification system to pre-emptively announce downtimes and so on - see 18288 for those requirements
Ideally, the tool would do both, but not necessarily.
The PSNA also has good advice on this, see page 237.
- Inspiration
Disqus.com service status - basically https://www.statuspage.io/
GitHub status - "Battle station fully operational", auto-refresh, Twitter-connected, simple color coding (see this blog post for more details), not open-source (confirmed in a personal email between GitHub support and anarcat on 2013-05-02)
Wikimedia status page - based on proprietary nimsoft software
Riseup - RSS feeds
Potager.org - ikiwiki based
- Known projects (or ideas)
Cachet - a good alternative that seems much simpler to deploy than the others below, just PHP with Laravel, demo: https://demo.cachethq.io/ (test@test.com, test123). Cachet was chosen because it looks good and works well, but also and above all because it is the only one that really works! See 244349 and CachetConfiguration. The dev team also responds very quickly to our requests!
- mobile-friendly
not decentralised/distributed: https://twitter.com/theanarcat/status/575061666532102144
nagios integration discussed: https://github.com/cachethq/Cachet/issues/225
- user-friendly
- publicly accessible
- fairly easy to use
- aims for LDAP support
- no Twitter, Identica, IRC or XMPP support for now
see CachetConfiguration for details.
Staytus - another product similar to Cachet, but written in Ruby. Supports email notifications (missing from Cachet); no Twitter notifications either
- mobile-friendly
- not distributed
- no nagios integration
- user-friendly - seems to be even nicer than Cachet, as there are links to individual announcements and notifications
- no LDAP support
- MIT-licensed
- to be considered?
- Announcements through Redmine: since the upgrade to 2.5, people can add themselves as watchers on project news feeds. We could post the announcements there: it supports RSS feeds and everything. However, that would mean creating accounts for everyone... not great.
Identi.ca !koumbitstatus group/statusnet - current approach: broken! identi.ca switched to pump.io and groups are gone, so are RSS feeds and the twitter bridge. ouark.
Inspiration: Social networks for servers - Use status.net to get a page where servers regularly report their state.
- we could run our own status.net instance?
- use the wiki! it's on a separate server, and we could have a subset of the site hosted on koumbitstatus.net?
more specifically: make a heavily themed ikiwiki site pushed to the various "filet" sites and a twitter plugin
Overseer - used at Disqus.com, Python/Django, user-friendly/simple, not administrator-friendly, Twitter integration, Apache2 license, development stopped, Disqus replaced it with Statuspage.io
Stashboard - MIT license, demo, Twitter integration, REST API, abandoned by Koumbit, see: StashboardConfiguration
JenkinsService - a bunch of jobs could be configured to check on Nagios (and maybe other things) and display a nice user-friendly status
we also used Drupal (see 118 for details and ideas)
see also RapportsIntervention
- CiviCRM + CiviMail with the clients in it, for targeted notifications
Baobab, the software used on Gandi's status page. Django based
cstate, hugo-based static site generator, tag-based RSS feeds, easy setup on Netlify, GitLab CI integration, badges, readonly API
Availability
- Requirements
- proven, stable and reliable solution
- interoperable
- many metrics implemented
- fast and scalable
- integration with puppet
- ...
- Known projects
Merlin - a Nagios module + daemon for creating a distributed monitoring setup. see git repository
Shinken - nagios-like and compatible, rewritten from scratch
Icinga - now packaged in Debian squeeze; a fork of Nagios from 2009. "...the Nagios software itself- is maintained by a single developer in the United States and hence is developed at a slower pace." http://www.icinga.org/faq/why-a-fork/
demo user and password: guest
Sensu - RabbitMQ/AMQP monitoring system
Observium - a complement to Nagios, uses SNMP, autodetection, pretty graphs, to be tested! No Debian package for now (reference)
OpenNMS - a replacement for Nagios that seems more complete, coherent and solid. Java/XML/heavyweight ("usine à gaz"). Distributed, SNMP, scalable.
Zabbix - very impressive: does stats, and monitoring, SLA, escalations, RRD graphs, contacts (sms/email/etc), supports distributed setups, PHP/MySQL interface, needs C agents on nodes for disk/mem/cpu stats, SNMP support...
Monit - very small monitoring system good at restarting crashed stuff
Centreon - a Nagios distribution with AJAX, PHP and MySQL
Argus seems like an interesting alternative to Nagios that supports redundant setups and has a simpler configuration file syntax. Not to be confused with the network monitoring tool Argus
portmon does exactly what the name says. sometimes simplicity is the key.
pung - in the same vein
nefu seems interesting as it is a really basic probe, with a few plugins (http, imap, ntp, etc), along with dependencies
sysmon is in the same spirit of simplicity, a kind of cross with portmon and nefu
ICMPmonitor primitive way of checking up/down status
Netmond supports SNMP, ping, and port probes, and has nice GUI frontends.
Project Observer seems to try to do everything, again, but seems to do it pretty right, haven't looked at notifications however..
Riemann - alerts, graphing, seems awesome?
Assimmon - autodiscovery, also does switch monitoring
Bosun - created by Stack Exchange, MIT license. Autodiscovery of services; coupled with scollector and opentsdb it can collect metrics and draw graphs; aggregated data monitoring (e.g. can monitor something over multiple machines)
Performance trends
- Requirements
- aggregation of multiple hosts in "clusters"
- cute, with nice overviews and graphs
- config-less setup
- integration with Puppet
- ...
- Known projects
Pandora FMS - looks quite nice with lots of gizmos
Ganglia - high performance, distributed, cluster-oriented, see GangliaMonitoring
Collectd - collects data every 10 seconds and stores it in RRD format, with a simple CGI to view the graphs. Much more efficient than Munin. The package in Debian lenny kicks ass! (the etch version is being tested here)
Zenoss seems an interesting merge between "munin" and "nagios", but it does not support PostgreSQL
Apan, RRDTool plugin for nagios
Graphite, database backed graphing application in Django. capable of LDAP authentication for users. possible to use memcached to boost performance. uses "whisper" instead of RRD for storage (capable of filling in events a posteriori). now with a package for wheezy for "carbon", the daemon that collects data
- Possible roadmap
- Replace munin-graph by Graphite-app (keeping munin-update to do the data collecting)
- Replace munin-update by Carbon and Whisper (collecting data from munin-node clients with our already-existing set of plugins)
- Replace munin-node by collectd to do the data collection. This would give us more accurate information (10s interval, instead of 5mins)
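As a rough illustration of what feeding Carbon involves in the roadmap above: Carbon accepts metrics over a plaintext line protocol, one `<metric path> <value> <timestamp>` line per data point, by default on TCP port 2003. The sketch below formats and sends one such line. The host name `graphite.example.org` and the metric path are hypothetical placeholders, not part of our setup, and the helper names are made up for this example.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one data point as a Carbon plaintext line (path, value, epoch seconds)."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_metric(path, value, host="graphite.example.org", port=2003, timestamp=None):
    """Send a single data point to a Carbon daemon.

    The default host is a hypothetical placeholder; 2003 is Carbon's
    default plaintext receiver port.
    """
    line = carbon_line(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return line
```

A collector (munin plugin wrapper, collectd, or a cron job) would call something like `send_metric("servers.shell.load", 0.42)` at each interval; Whisper then stores the series on the Carbon side.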
Traffic accounting
See TrafficAccountingService for the extensive list and requirements.
Web usage statistics
There is a lot of other crap here, let's just mention:
Google Analytics - non-free-as-in-speech, Google, but now an industry reference
Network monitoring
We don't have a proper LookingGlass configured in the network, and it's going to become a significant problem once the redundant RoutingService kicks in. A few possibilities here:
AOL's Trigger replaces RANCID - maybe?
SmokePing: The next version of Smokeping (2.4) can also provide a LookingGlass, in the form of an AJAXy traceroute. Its packaging has stalled (bug 485977) since the package is now looking for a new maintainer (bug 568742). 2.4 also depends on the qooxdoo framework, which is not yet in Debian (bug 485975). The new version doesn't have a LookingGlass anymore.
RANCID: there's a rancid-cgi package which also provides a looking glass, but it assumes it's talking to a Cisco or similar router with a predefined command-line interface. Basically, it offloads the work to the router, and collects the results. Since we're using BSD-based routers, this may require extra work to set up.
OpenBSD's OpenBGPd has a looking glass called bgplg. It also supports pings and traceroutes. That seems like the best software to use considering we're going to use OpenBGPd eventually. I found this demo, which doesn't make traceroute or ping available. Oh and there's a shell interface too.
DIS - a RANCID replacement
Intrusion Detection System (IDS)
To detect some types of attacks that happen on the network, it could be useful to run an IDS.
Snort is the best-known option
psad uses snort signatures and offers some more features. It's a project that originated from the Bastille Linux project.
Intrusion detection systems
A broader category than log auditing, this can also process raw traffic and other things...
Snort - the de-facto standard, with a commercial Ruby interface (https://www.threatstack.com/)
Sagan - log monitoring, snort-like
Log auditing
Log monitoring allows for acting upon certain conditions detected in log files. This also includes a central log monitoring server and related tools.
Ticket for central logging: 11878.
A lot of those were taken from a ;login: article.
Splunk (commercial, with an open-source version)
Sentry - Log aggregation with a cute web interface to browse events.
Patrick wrote a very simple plugin for rsyslogd
Sagan - log monitoring, snort-like
Logstash - log parsing and aggregation, with search, looks great!
Logwatch
logwatch is too verbose for our needs as it sends an email *every day*. It was disabled after a test period on shell.k.n.
Logtool
Logtool is a logfile parser / colorizer that uses the logcheck database to display, in realtime, problems or unmatched lines. It also allows simple manipulation of the stream to remove, for example, the program name or the host in the output. It also supports manual configuration of regular expressions.
The only issue with logtool is that it blindly takes the regex from logcheck and doesn't colorize the actual log data apart from the left columns... It could use the regex subgroups instead?
Logcheck
logcheck only prints out "abnormal" lines (those that don't match a database of regex patterns for normal activity) when run; cron then mails them to root. It runs every hour and we can easily customize the regexes.
Installation:
apt-get install logcheck
To reduce false positives, here are example rules that are added on servers, in /etc/logcheck/ignore.d.server/koumbit-custom
# Puppet
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ puppet-agent\[[0-9]+\]: .*, which is a deprecated section. I'm assuming you meant.*$
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ puppet-agent\[[0-9]+\]: \(//Apt/Exec\[/usr/bin/apt-get update .* executed successfully$
# NTP
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: peer .* now valid
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: peer .* now invalid
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: skew change .* exceeds limit
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ntpd\[[0-9]+\]: reply from .* not synced.*
# on vservers
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ sshd\[[0-9]+\]: error writing /proc/self/oom_adj: Permission denied$
# Asterisk - Depends on your log format defined in /etc/asterisk/logger.conf, using "dateformat=%F %T" for fail2ban
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ asterisk\[[0-9]+\]: rc_avpair_new: unknown attribute [0-9]+
\[[- :0-9]{19}\] WARNING[321] app_voicemail.c: Couldn't read username
\[[- :0-9]{19}\] WARNING[321] app_dial.c: Unable to create channel of type 'Zap' (cause 34 - Circuit/channel congestion)
\[[- :0-9]{19}\] WARNING[321] chan_sip.c: Peer '.*' is now.*
# Web / suhosin
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ suhosin\[[0-9]+\]: ALERT - tried to register forbidden variable .*
# ssh
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ sshd\[[0-9]+\]: Received disconnect from .*
# bind (especially when running on an IPv6 network)
^\w{3} [ :0-9]{11} [._[:alnum:]-]+ named\[[0-9]+\]: DNS format error from.*
How to create new rules is documented in the footer of the logcheck mails:
anarcat@shell:~$ cat /etc/logcheck/footer.txt
If you want to remove a line from this report, add it to
/etc/logcheck/ignore.d.server/local-package
where "package" is the package generating the error.
Consider contributing back to http://logcheck.org/
To tune an existing regex to see why it's failing, use:
echo 'log line to go away' | egrep --color=auto 'regexp want to test'
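When several rules and several log lines need checking at once, the egrep one-liner from the footer can be scripted. The following is a minimal sketch, not how logcheck itself works internally: it assumes Python's `re` module is close enough to egrep for these rules, and since `re` has no POSIX bracket classes it substitutes ASCII approximations for `[:alnum:]` and friends. The function names and the sample log lines are invented for the example.

```python
import re

# Python's re module has no POSIX bracket classes, so approximate the ones
# used in logcheck rules with ASCII ranges (an assumption of this sketch).
POSIX_CLASSES = {
    "[:alnum:]": "a-zA-Z0-9",
    "[:alpha:]": "a-zA-Z",
    "[:digit:]": "0-9",
    "[:space:]": r"\s",
}


def compile_rule(rule):
    """Compile one logcheck-style (egrep) rule into a Python regex."""
    for posix, ascii_range in POSIX_CLASSES.items():
        rule = rule.replace(posix, ascii_range)
    return re.compile(rule)


def unmatched(lines, rules):
    """Return the lines that no rule matches, i.e. what would still be reported."""
    compiled = [compile_rule(r) for r in rules]
    return [line for line in lines if not any(c.search(line) for c in compiled)]


# One rule in the style of the ssh rule shown earlier, and two made-up log lines.
rules = [
    r"^\w{3} [ :0-9]{11} [._[:alnum:]-]+ sshd\[[0-9]+\]: Received disconnect from .*",
]
lines = [
    "Mar 11 22:23:57 shell sshd[4242]: Received disconnect from 192.0.2.1: 11: Bye",
    "Mar 11 22:24:01 shell sshd[4243]: Failed password for root from 192.0.2.1",
]
for line in unmatched(lines, rules):
    print(line)
```

Only the "Failed password" line is printed, since the first line is covered by the rule. A rule that behaves differently here and under egrep should be trusted under egrep, which is what logcheck actually uses.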
You can also change the e-mail of the person receiving the e-mails in /etc/logcheck/logcheck.conf (by default it sends to the "logcheck" mail alias, which is usually sent to root):
SENDMAILTO="johndoe+logcheck@example.org"
Logcheck was tested on shell.koumbit.net and voice.koumbit.net, but was disabled there due to excessive noise.
Currently used on devpaix.koumbit.net, oxfamqc.koumbit.net.
Centralized logging
We are planning to use log.koumbit.net to collect the logs of all servers in a central location. This will allow better investigation of attacks where all local logs have been removed, for example, and also let an operator watch *all* logs at the same time and correlate logs from different servers.
This has yet to be implemented, see 2931
Act on log entries
Swatch
Sucks. Old. badly designed, buggy. Die die die.
SEC - Simple Event Correlator
A proper implementation of the above and more.
Sample config for apache crashes:
# recognize repeated crashes of apache and restart it after a certain threshold
#
# [Sun Mar 11 22:23:57 2007] [notice] child pid 23733 exit signal Segmentation fault (11)
type=SingleWithThreshold
ptype=RegExp
pattern=.*child pid \d+ exit signal Segmentation fault \(\d+\).*
desc=$0
action=shellcmd /etc/init.d/apache restart; shellcmd /bin/sh -c "echo 'apache restarted' | mail -s 'apache restarted' root"
window=10
thresh=3
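To make the `window=10` / `thresh=3` pair in the rule above concrete, here is a simplified approximation of the counting logic: fire once when 3 matching events arrive within 10 seconds, then stay quiet until that window ends. It is a sketch of the idea only, not SEC's actual implementation; the `ThresholdCounter` class is hypothetical and timestamps are plain seconds.

```python
from collections import deque


class ThresholdCounter:
    """Fire once when `thresh` events are seen within `window` seconds."""

    def __init__(self, window=10, thresh=3):
        self.window = window
        self.thresh = thresh
        self.events = deque()      # timestamps of recent matching events
        self.quiet_until = None    # end of the window in which we already fired

    def observe(self, now):
        """Record a matching event at time `now`; return True if the action should run."""
        if self.quiet_until is not None:
            if now < self.quiet_until:
                return False       # already fired in this window: ignore the event
            self.quiet_until = None
            self.events.clear()
        # Slide the window: forget events older than `window` seconds.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        self.events.append(now)
        if len(self.events) >= self.thresh:
            self.quiet_until = self.events[0] + self.window
            return True
        return False


# Three segfaults within 10 seconds trigger the action on the third one;
# the fourth falls in the same window and is ignored.
counter = ThresholdCounter(window=10, thresh=3)
fired = [counter.observe(t) for t in (0, 4, 8, 9, 30)]
print(fired)
```

In the SEC rule, the moment `observe` returns True corresponds to running the `action` line (restart apache and mail root).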
Of course, I had to write a rc.d startup script (grr) for this thing to start at boot:
# Start or stop sec
#
# Anarcat <anarcat@koumbit.org>
# based on postfix's init.d script

PATH=/bin:/usr/bin:/sbin:/usr/sbin
NAME=sec

case "$1" in
    start)
        echo -n "Starting simple event correlator: sec"
        start-stop-daemon --start --pidfile /var/run/sec.pid --exec /usr/bin/perl --startas /usr/bin/sec --\
            -conf=/etc/sec.conf -input=/var/log/apache/error.log -pid=/var/run/sec.pid -detach -syslog=daemon 2>&1 |
            (grep -v 'rules loaded from' 1>&2 || /bin/true)
        echo "."
    ;;

    stop)
        echo -n "Stopping simple event correlator: sec"
        start-stop-daemon --stop --pidfile /var/run/sec.pid --quiet
        echo "."
    ;;

    restart)
        $0 stop || true
        $0 start
    ;;

    force-reload|reload)
    ;;

    *)
        echo "Usage: $0 {start|stop|restart|reload|flush|check|abort|force-reload}"
        exit 1
    ;;
esac

exit 0
... and hook it into the system:
root@ques:/etc/init.d# update-rc.d sec defaults
 Adding system startup for /etc/init.d/sec ...
   /etc/rc0.d/K20sec -> ../init.d/sec
   /etc/rc1.d/K20sec -> ../init.d/sec
   /etc/rc6.d/K20sec -> ../init.d/sec
   /etc/rc2.d/S20sec -> ../init.d/sec
   /etc/rc3.d/S20sec -> ../init.d/sec
   /etc/rc4.d/S20sec -> ../init.d/sec
   /etc/rc5.d/S20sec -> ../init.d/sec