Server Farmer provides many monitoring capabilities, using many different systems.

Heartbeat monitoring

The sf-monitoring-heartbeat extension provides heartbeat monitoring, which can be integrated with any monitoring/alerting system that supports HTTP keyword checks, whether internal (e.g. Nagios) or external (e.g. UptimeRobot, StatusCake, Pingdom).

Every 2 minutes, the heartbeat checks for:

  • over 50 running network services (databases, mail and other popular services)
  • running Docker containers
  • running libvirt-managed virtual machines
  • mounted LUKS-encrypted partitions
  • hard drive health (if sf-monitoring-smart extension is also installed)

The list of detected services is sent to the heartbeat server, which provides a simple interface that can be queried by an alerting system. Thanks to this, an external monitoring/alerting system doesn't need network access to the inside of your infrastructure (except for the heartbeat server).
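
As an illustration of what an HTTP keyword check does on the alerting side, here is a minimal sketch. The URL and the keyword are supplied by the caller and are hypothetical. The actual interface and response format of the heartbeat server are not described here, so this shows only the general technique.

```python
import urllib.request


def body_contains_keyword(body: str, keyword: str) -> bool:
    """Return True when the expected keyword appears in the response body."""
    return keyword in body


def check_heartbeat(url: str, keyword: str, timeout: float = 10.0) -> bool:
    """Fetch the URL and report whether the keyword is present.

    The URL and keyword are assumptions made by the caller, not values
    defined by the heartbeat server. Network and HTTP errors are treated
    as a failed check.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False
    return body_contains_keyword(body, keyword)
```

An alerting system would raise an alert whenever such a check returns False.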

SNMP generic monitoring

The sf-monitoring-snmpd extension automatically installs and configures the snmpd daemon (net-snmp on RHEL). It is compatible with any SNMP monitoring system, including at least:

  • Cacti
  • collectd
  • Observium
  • Opsview
  • PRTG Network Monitor
  • Solarwinds
  • Spiceworks
  • Zabbix
  • Zenoss

Cacti integration

Apart from generic SNMP monitoring, sf-monitoring-cacti provides additional capabilities:

  • OpenVZ and LXC containers monitoring (network traffic, memory and disk usage)
  • MTA queue monitoring
  • CPU thermal monitoring (compatible with generic lm-sensors, plus some hardware not supported by lm-sensors, e.g. QNAP, Fit-PC2, Raspberry Pi)
  • external temperature monitoring using TemperNTC USB dongles
  • integrated SMART monitoring (if sf-monitoring-smart extension is also installed)
  • possibility of swapping drives between servers without any configuration changes in Cacti
  • automatic mapping of current drive letters to graphs (for systems with multiple hard drives and dynamic drive letter assignments)

These additional capabilities require ssh connectivity from the monitored servers to the Cacti server. Each monitored server has its own ssh private key, whose public key has to be manually accepted on the Cacti server (and can be immediately revoked in case of security issues etc.).

NewRelic integration

The sf-monitoring-newrelic extension provides NewRelic license key configuration.

sf-monitoring-mysql, sf-monitoring-smart and sf-monitoring-backup extensions provide several custom checks, integrating with the NewRelic platform using dedicated dashboards.

SMART drive health monitoring

The sf-monitoring-smart extension provides universal SMART drive monitoring, which can report current drive health to 3 different targets from a single SMART read:

  • heartbeat server (if sf-monitoring-heartbeat extension is also installed)
  • Cacti (if sf-monitoring-cacti extension is also installed)
  • NewRelic (if sf-monitoring-newrelic extension is also installed and NewRelic license key is configured)

All ATA/SATA drives are supported, including ones connected through MegaRAID controllers or USB bridges. Our health monitoring algorithm is based on our own experience from running our own online backup business, and on knowledge published by Backblaze and Google. It checks 11 SMART attributes:

  • Temperature_Celsius
  • Reallocated_Sector_Ct
  • End-to-End_Error
  • UDMA_CRC_Error_Count
  • Spin_Retry_Count
  • Runtime_Bad_Block
  • Current_Pending_Sector
  • Reported_Uncorrect
  • Offline_Uncorrectable
  • Calibration_Retry_Count
  • Power_On_Hours

For some attributes, non-zero values are still considered healthy (Runtime_Bad_Block up to 10, Current_Pending_Sector up to 2). Unlike Backblaze, our algorithm doesn't check the Command Timeout attribute, as checking UDMA_CRC_Error_Count, Current_Pending_Sector and Reported_Uncorrect gives a much better overview of drive health, without false alerts for some drive vendors.
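
The threshold idea can be sketched as follows. This is not the extension's actual algorithm: only the limits for Runtime_Bad_Block (10) and Current_Pending_Sector (2) come from the text above, while the zero limits for the other counters are an assumption. Temperature_Celsius and Power_On_Hours are left out, because their limits are not stated here. The names MAX_HEALTHY_RAW and failing_attributes are hypothetical.

```python
# Highest raw value still treated as healthy. The limits of 10 and 2 come
# from the documentation; the zero limits are an assumption of this sketch.
MAX_HEALTHY_RAW = {
    "Reallocated_Sector_Ct": 0,
    "End-to-End_Error": 0,
    "UDMA_CRC_Error_Count": 0,
    "Spin_Retry_Count": 0,
    "Runtime_Bad_Block": 10,
    "Current_Pending_Sector": 2,
    "Reported_Uncorrect": 0,
    "Offline_Uncorrectable": 0,
    "Calibration_Retry_Count": 0,
}


def failing_attributes(raw_values: dict[str, int]) -> list[str]:
    """Return the sorted names of attributes whose raw value exceeds its limit.

    Attributes without a known limit (e.g. Temperature_Celsius) are ignored.
    """
    return sorted(
        name
        for name, value in raw_values.items()
        if name in MAX_HEALTHY_RAW and value > MAX_HEALTHY_RAW[name]
    )
```

A drive would be reported as unhealthy when the returned list is not empty.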

Also, our SMART heartbeat monitoring allows creating exceptions (defined in the /etc/local/.config/allowed.smart file) for drives with attributes exceeding safe values that the user still wants to keep in service for some time.

External hard drive overheating prevention

Many people use external hard drives or external hard drive enclosures (connected via USB, or sometimes eSATA/Thunderbolt) as very cheap data storage for backup/archival purposes. Unfortunately, such devices tend to overheat and eventually fail when working continuously for too long. While most such devices have hardware SCT (standby condition timer) protection, it is active by default only in Windows, and only when using drivers provided by the manufacturer.

The sf-standby-monitor extension provides a simple mechanism that warns every 30 minutes if there are USB-attached drives that are not in standby mode.
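
How the extension detects the drive state is not described here. One possible way, assuming the hdparm tool is installed and the script runs as root, is to parse the output of "hdparm -C". The function names below are hypothetical, and selecting only USB-attached devices is left out of this sketch.

```python
import subprocess


def parse_drive_state(hdparm_output: str) -> str:
    """Extract the power state from `hdparm -C` output, or 'unknown'."""
    for line in hdparm_output.splitlines():
        if "drive state is:" in line:
            return line.split(":", 1)[1].strip()
    return "unknown"


def is_spun_down(device: str) -> bool:
    """Return True when the drive reports standby or sleeping.

    Assumes hdparm is available and that the caller has sufficient
    privileges to query the device.
    """
    result = subprocess.run(
        ["hdparm", "-C", device], capture_output=True, text=True, check=False
    )
    return parse_drive_state(result.stdout) in ("standby", "sleeping")
```

A periodic job could call is_spun_down() for each external drive and send a warning for every drive where it returns False.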

Disk usage monitoring

The sf-farm-inspector extension provides several farm health analysis tools, including disk usage monitoring. This extends the free disk space monitoring performed by any SNMP monitoring software by giving you insight into what exactly in your filesystem takes up so much space.
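
To illustrate the difference between free space monitoring and usage analysis, here is a minimal sketch that sums file sizes per subdirectory and lists the largest ones. It is not the sf-farm-inspector implementation, and the function names are hypothetical.

```python
import os


def tree_size(path: str) -> int:
    """Sum apparent file sizes below path, skipping symlinks and unreadable files."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                continue
    return total


def directory_sizes(root: str) -> dict[str, int]:
    """Map each immediate subdirectory of root to its total size in bytes."""
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes[entry.name] = tree_size(entry.path)
    return sizes


def largest(sizes: dict[str, int], count: int = 5) -> list[tuple[str, int]]:
    """Return the biggest entries, largest first."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:count]
```

Calling largest(directory_sizes("/var")) would show which subdirectories of /var take the most space.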

Public IP change monitoring

For many companies, it is perfectly sufficient for their LAN to sit behind NAT with a public IP address and remote access from outside, even if this public IP address isn't fixed. This is especially true when their Internet connection is of very good quality and a fixed IP address is expensive.

Services for such companies already exist, e.g. noip.com, but they are slow and paid (or have restrictions in free mode).

The sf-ip-monitor extension monitors public IP changes (e.g. every minute) and alerts about a detected change using email and/or SMS messages. Emails sent by this extension are easy to parse and process by any external system you may want to use, e.g. if you manage hundreds of customers without a fixed IP.
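
The change detection itself can be sketched as a comparison against the last stored address. This is only an illustration under assumptions: how the extension discovers the current public IP and where it keeps its state is not described here, so the address is passed in by the caller, and the state file path and function name are hypothetical.

```python
from pathlib import Path
from typing import Optional


def detect_ip_change(current_ip: str, state_file: Path) -> Optional[str]:
    """Compare current_ip with the stored address and update the state file.

    Returns the previous address when it differs from current_ip, or None
    when nothing changed or no address had been stored yet.
    """
    previous = state_file.read_text().strip() if state_file.exists() else None
    if previous == current_ip:
        return None
    state_file.write_text(current_ip + "\n")
    return previous
```

When the function returns an address, the caller would send the alert (email and/or SMS) containing both the old and the new IP.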

The sf-sms-smsapi extension provides the ability to send paid (cheap), prioritized SMS messages that won't get lost in case of GSM network congestion etc.

Syslog events monitoring

The sf-log-monitor extension configures the logcheck tool, which scans syslog (and other) logs from your server (and possibly from other servers in the farm) and notifies the system administrator every hour about any unknown, possibly suspicious events. It is used to enhance the overall security level of your server/farm.
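
The general idea of reporting only unknown events can be sketched as filtering out every log line that matches a known, harmless pattern. This is a simplified illustration, not logcheck's actual rule handling. The function name and the example patterns are hypothetical.

```python
import re


def unknown_events(lines: list[str], ignore_patterns: list[str]) -> list[str]:
    """Return the log lines that match none of the ignore patterns."""
    compiled = [re.compile(pattern) for pattern in ignore_patterns]
    return [
        line
        for line in lines
        if not any(regex.search(line) for regex in compiled)
    ]
```

For example, with the pattern r"sshd\[\d+\]: Accepted publickey" on the ignore list, successful key-based logins would be dropped, and everything else would end up in the hourly report.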