A short guide for everyone

While the term “monitoring” itself has a clear meaning, namely to “observe, control, monitor”, its use is as varied as its applications.
This article focuses on the technical and operational background of systematic monitoring of hardware and software and its importance in the corporate environment.

Main objectives

Most use cases are quite universal and are usually based on evaluating individual data sets or a series of data sets over a period of time.

Individual data sets are queried in simple, concrete cases (right now, 30 minutes ago). They can be evaluated with simple Boolean conditions (true/false, on/off) or simple numerical limits (minimum 1, maximum 10, exactly 5) and classified as a desired or an incorrect state. Several data sets are queried in more abstract cases, when single data sets fluctuate too much, and are evaluated with statistical methods (e.g. mean, median, mode, standard deviation, percentiles). The result can be used, for example, as a trigger to send notifications (mail, SMS, call, messenger message) or to generate graphical representations.
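
The two evaluation styles described above can be sketched in Python as follows. This is only a minimal illustration and is not taken from any particular monitoring system: the function names, the limits and the choice of mean plus 95th percentile are assumptions made for the example.

```python
import statistics


def classify_value(value, minimum=1, maximum=10):
    """Classify one numeric reading as a desired ("ok") or incorrect ("error") state."""
    return "ok" if minimum <= value <= maximum else "error"


def classify_series(values, max_mean, max_p95):
    """Evaluate a whole series statistically instead of trusting single readings."""
    mean = statistics.mean(values)
    # quantiles(n=20) returns 19 cut points; the last one estimates the 95th percentile.
    p95 = statistics.quantiles(values, n=20)[-1]
    state = "ok" if mean <= max_mean and p95 <= max_p95 else "error"
    return {
        "mean": mean,
        "median": statistics.median(values),
        "p95": p95,
        "state": state,
    }


if __name__ == "__main__":
    # Hypothetical response times in milliseconds with a single outlier.
    response_times = [120, 130, 125, 118, 900, 122, 127, 121, 119, 124]
    print(classify_value(5))
    print(classify_series(response_times, max_mean=250, max_p95=500))
```

The returned state could then serve as the trigger for a notification, while the mean, median and percentile values are what a chart would typically display.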

Proven procedure (Best Practice)

Specialist departments have their own requirements regarding, for example, depth of detail, form of presentation and time scale. Accordingly, the tool must be tailored to the intended use. A direct exchange between the departments is essential.

  1. Target group: defines requirement(s)

  2. Platform or product owner: provides data (interfaces)

  3. Monitoring expert: analyzes data/requirements and recommends solution(s)

Which data?

A common mistake is to collect all kinds of data (and often too much irrelevant data). Good monitoring starts with defining the requirements and with identifying and selecting the relevant data.
There are monitoring systems that specialize in particular kinds of data and some systems that can handle several data formats.

  • Single data records / time series data records / log data records

  • True / False 

  • Numerical values (float, integer)

  • Strings 

Which monitoring system?

Which system is used is determined by the requirements, the target data and the available budget. In recent years, data protection has also become an increasingly important factor in the choice of the monitoring system.

For very specific cases, a custom-made system can be the more cost-effective solution in the long run. In most cases, however, a standardized system is the best (most cost-effective) choice.

Umbrella monitoring 

Umbrella monitoring is a unified form of monitoring that bundles and processes data and messages from different monitoring systems and combines display and administration in a single system. The term is often mistakenly used for any bundling of monitoring in a single system, regardless of effectiveness and purpose. Often only a graphical user interface is meant.
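
As a rough sketch of what "bundling and processing" can mean in practice, the following Python example maps results from two differently shaped sources onto one shared severity scale. The source formats, the mappings and all names are assumptions for illustration only; real umbrella systems define their own event models.

```python
from dataclasses import dataclass

# Shared severity scale of the umbrella system, ranked from harmless to severe.
SEVERITY_RANK = {"ok": 0, "unknown": 1, "warning": 2, "critical": 3}

# Assumed source formats: one system reports plugin-style exit codes,
# the other reports log levels as strings.
EXIT_CODE_SEVERITY = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}
LOG_LEVEL_SEVERITY = {"info": "ok", "warn": "warning", "error": "critical"}


@dataclass(frozen=True)
class UnifiedEvent:
    source: str
    host: str
    severity: str
    message: str


def from_check_result(host, exit_code, output):
    """Translate a check result with a numeric exit code into a unified event."""
    severity = EXIT_CODE_SEVERITY.get(exit_code, "unknown")
    return UnifiedEvent("checks", host, severity, output)


def from_log_entry(host, level, text):
    """Translate a log entry with a textual level into a unified event."""
    severity = LOG_LEVEL_SEVERITY.get(level.lower(), "unknown")
    return UnifiedEvent("logs", host, severity, text)


def worst_first(events):
    """Order events so that the most severe ones are shown first."""
    return sorted(events, key=lambda e: SEVERITY_RANK[e.severity], reverse=True)
```

Once every source is translated into the same event type, display, sorting and notification rules only have to be written once.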

Monitoring systems

There is a variety of monitoring systems with different monitoring models. While older systems are based on simple queries of single values (ad hoc model), newer systems often collect data series and thus increase the level of detail. This is also reflected in the components the software uses: older systems often rely on classical relational database systems (or even text files), whereas newer monitoring systems store data in time series databases (e.g. InfluxDB, OpenTSDB, Prometheus).

Some monitoring systems are presented below. The list is not a complete overview of all monitoring systems; it is only meant to convey an understanding of the different monitoring philosophies and models. Only systems that collect data, perform checks, apply processing logic and notify users are considered. Many of these systems are open source and can be used free of charge, with the software companies behind them offering additional paid support. Other systems are pure SaaS and can only be used with paid licenses or data volumes.

Nagios 

The open source veteran among the monitoring systems is Nagios, which has been in use since 1999. It is the classic representative of the ad hoc model, since the checks are usually initiated by Nagios. They can be executed locally on the Nagios host or via the Nagios agent on a remote host. However, the agent and its protocol have few security features. The level of detail in this ad hoc model is low: checks usually run at intervals of several minutes (typically 5, 10, 15, 30 or 60), so in this form the model is not suitable for real-time monitoring.
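
A check in this model is typically a small program that the monitoring host runs on a schedule. Nagios-compatible plugins conventionally report their state through the exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and one line of output, optionally followed by performance data after a pipe character. The following Python sketch follows that convention; the disk usage check itself, the thresholds and the function names are assumptions for illustration.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk_usage(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to an exit code and a one-line status message."""
    if percent_used >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif percent_used >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Text after the pipe is performance data: label=value[unit];warn;crit
    message = (
        f"DISK {label} - {percent_used:.1f}% used"
        f" | used={percent_used:.1f}%;{warn:g};{crit:g}"
    )
    return code, message


def main(path="/"):
    """Run the check once and return the exit code the plugin should use."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        print(f"DISK UNKNOWN - {error}")
        return UNKNOWN
    code, message = evaluate_disk_usage(100.0 * usage.used / usage.total)
    print(message)
    return code
```

To use this as an actual plugin, the script would end by passing the return value of `main()` to `sys.exit()`, so that the scheduler sees the state as the process exit code.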

Zabbix 

Zabbix is a direct open source competitor to Nagios and has existed since 2001. Its biggest advantages compared to old Nagios versions include scalability and an autodiscovery function.

Icinga / Naemon / Shinken 

In 2009, several dissatisfied developers from the Nagios community created a fork of the Nagios software. The best-known representative is the Icinga fork, which among other things received a modern user interface. At the same time, Shinken was created as a complete rewrite of the Nagios code in Python, fully compatible with Nagios. These forks are no longer very common.

Icinga 2 

The limitations of the old Nagios base in Icinga led to the completely independent project Icinga 2 in 2014. Although many standards from the Nagios environment were kept, the entire core was rewritten and put on its own footing. This increased the performance of the system: instead of 400 hosts with 10,000 service checks, Icinga 2 can handle much larger environments. With the cluster and HA features the system scales easily, and real-time data can be collected and checked. The Icinga 2 agent communicates constantly with the Icinga 2 cluster, using certificate-based encryption. The checks are executed automatically on the respective host, which actively reports the results to the Icinga 2 instances. This frees Icinga 2 from the ad hoc model in favor of the collector model. When the agent is installed, the host registers automatically with the Icinga 2 system via an API, and template-based checks can be created automatically based on conditions and other dependencies. The Icingaweb2 user interface is intuitive, responsive and modular. User management can be done via multiple backends (e.g. Active Directory, database).
Icinga 2 and its user interface Icingaweb2 have since become one of the most popular monitoring systems, mainly due to the system’s adaptability, the strong support of the open source community and the commitment of NETWAYS GmbH, the company behind Icinga 2.
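
The difference between the ad hoc model and the collector model can be illustrated with a small Python sketch in which the agent runs its checks locally and pushes the results, while the central instance only stores them. This is a conceptual illustration only and does not reflect the Icinga 2 API or protocol; the class and method names are hypothetical.

```python
import time
from collections import defaultdict


class Collector:
    """Central instance: stores what agents report and never polls hosts itself."""

    def __init__(self):
        self.results = defaultdict(list)

    def report(self, host, check, state, timestamp):
        self.results[(host, check)].append((timestamp, state))

    def latest(self, host, check):
        return self.results[(host, check)][-1]


class Agent:
    """Runs on the monitored host, executes checks locally and pushes the results."""

    def __init__(self, host, collector, checks):
        self.host = host
        self.collector = collector
        self.checks = checks  # mapping: check name -> function returning a state

    def run_once(self, now=None):
        now = time.time() if now is None else now
        for name, check in self.checks.items():
            self.collector.report(self.host, name, check(), now)


# Hypothetical host with two trivial checks, reporting twice.
collector = Collector()
agent = Agent("web01", collector, {"load": lambda: "ok", "disk": lambda: "warning"})
agent.run_once(now=1000.0)
agent.run_once(now=1060.0)
print(collector.latest("web01", "disk"))
```

Because the agent decides when to report, the reporting interval can be much shorter than a centrally scheduled polling cycle, and the collector accumulates a series of results per check rather than a single value.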

Influxdata TICK-Stack / Prometheus / Netdata 

The products of InfluxData are representatives of the collector monitoring model. The TICK stack consists of four products: Telegraf (data collection agent), InfluxDB (time series database), Chronograf (visualization interface for time series data) and Kapacitor (time series data processing). The open source basic versions can be used free of charge but lack features such as clustering and high availability, which are only available in the Cloud and Enterprise versions. The Telegraf agent constantly collects data from the system, databases, the network and other sources and sends it to the InfluxDB instance. In the Chronograf user interface, data queries can be run and the results displayed graphically. Kapacitor is responsible for service discovery, can detect anomalies in the data series with statistical functions and can connect to ML systems. It is also responsible for alarms based on internal functions or on results from ML systems.

Prometheus is a comparable, completely open source system. InfluxDB is also suitable for long-term archiving of collected data (older data lose detail), while Prometheus is designed for detailed short-term data only. For real-time monitoring with short-lived data, Netdata is suitable: it keeps the time series completely in memory and, apart from some log data, does not write to the hard disk of the host system at all.
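
What "older data lose detail" means can be shown with a plain Python sketch of downsampling: fine-grained samples are replaced by one average per time window. This is a generic illustration in pure Python and not the mechanism or API of InfluxDB; the function name and the choice of the mean as aggregate are assumptions.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, window_seconds):
    """Reduce (unix_timestamp, value) samples to one mean value per time window."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align every sample to the start of the window it falls into.
        window_start = int(timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical CPU readings taken every 30 seconds, archived per minute.
raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 4.0)]
print(downsample(raw, window_seconds=60))
```

The archived series needs far less storage, but short spikes inside a window can no longer be reconstructed, which is why detailed short-term data and long-term archives are often kept with different resolutions.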

PRTG Network Monitor 

Unlike most other monitoring tools, the Paessler AG tool runs exclusively on Windows systems, which makes it particularly suitable for Windows environments, although it can easily monitor Linux or other operating systems as well. This is done via agents (called “sensors” in PRTG) or a direct connection to the target systems. The system supports both the ad hoc model and the collector model.
PRTG is a purely commercial tool, but offers a free version for small environments with 100 sensors included. The price for larger environments depends on the number of sensors used.

New Relic / Datadog / Dynatrace / AppDynamics / Splunk 

A number of monitoring systems exist only as SaaS and are explicitly intended for corporate use. The monitoring data is collected via an agent (or integrated directly into the company’s own software via libraries/hooks) and pushed to the provider’s cloud. All processing of the data takes place there, and visualization and alarms also run via the provider’s cloud. These systems deliver excellent results, especially for application and service monitoring, and offer additional services such as anomaly detection and machine learning that require no special knowledge from the user. If all tools and additional services are used, these systems can, for example, automatically generate maps from the collected data that show all connections between hosts and services, and they also allow individual transactions to be tracked.

When using these systems, it is extremely important to consider data protection and to define data exclusions in order to protect company secrets, as the collected data may contain protected or confidential information. Some service providers charge on the basis of the data transferred or processed, while others charge by host and service quotas.
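
One way to implement such data exclusions is to clean every record before it leaves the company network. The following Python sketch masks values of sensitive keys and e-mail addresses in free text. The key list, the placeholder strings and the function name are assumptions for illustration; commercial agents usually bring their own configuration options for this purpose.

```python
import re

# Hypothetical list of keys whose values must never leave the company network.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key"}
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(record):
    """Return a copy of the record with sensitive values masked."""
    cleaned = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, dict):
            # Nested structures are cleaned recursively.
            cleaned[key] = redact(value)
        elif isinstance(value, str):
            cleaned[key] = EMAIL_PATTERN.sub("[EMAIL]", value)
        else:
            cleaned[key] = value
    return cleaned
```

Calling `redact()` on each record right before the push keeps the rule in one place, so that new data sources are covered automatically.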

Graylog / Logstash / Fluentd 

Well-known representatives of log data monitoring are Logstash, Graylog and Fluentd. These systems often consist of a data store for log data, an optimized search system (Elasticsearch) and a display interface for simple charts. The collected data can also be used to create data histories (e.g. 10,000 messages with a specific text). Most of the tools that can be run on your own systems are open source and offer commercial support and enterprise features such as long-term archiving or fail-safe operation. With expertise, these log data systems can be enriched with time series data and, under certain circumstances, can outperform specialized tools such as InfluxDB in performance and flexibility thanks to their scalability.
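
A data history of this kind is essentially a count of matching messages per time interval. The following Python sketch shows the idea on plain text lines. The assumed line format (an ISO 8601 timestamp, a space, then the message) and the function name are illustrative only; the tools named above do this with their own query languages.

```python
from collections import Counter
from datetime import datetime


def message_history(log_lines, needle, bucket="%Y-%m-%d %H:00"):
    """Count log messages containing `needle`, grouped by hour."""
    counts = Counter()
    for line in log_lines:
        # Assumed format: "<ISO timestamp> <message>"
        timestamp_text, _, message = line.partition(" ")
        if needle in message:
            timestamp = datetime.fromisoformat(timestamp_text)
            counts[timestamp.strftime(bucket)] += 1
    return dict(sorted(counts.items()))


# Hypothetical log excerpt.
lines = [
    "2024-01-01T10:05:00 ERROR disk full",
    "2024-01-01T10:45:00 INFO backup finished",
    "2024-01-01T10:50:00 ERROR disk full",
    "2024-01-01T11:01:00 ERROR disk full",
]
print(message_history(lines, "disk full"))
```

The resulting counts per hour form a small time series, which is what allows log data to be charted or checked against thresholds like any other metric.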

In-house developments

Sometimes no existing monitoring system fulfills all requirements, be it due to limitations of your own environment or a lack of functionality in the monitoring systems. With careful planning and consideration of all requirements, in-house developments can be implemented and operated quite cost-effectively, especially when costs are projected over a longer operating period. If the data interfaces and the data format are based on standards, other monitoring systems (or visualization systems) can work with the same data if required.
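
As a minimal sketch of a standards-based data format, the following Python example writes each measurement as one JSON object per line with an ISO 8601 timestamp in UTC. The field names and the metric name are assumptions for illustration; any other widely supported format would serve the same purpose.

```python
import json
from datetime import datetime, timezone


def to_json_line(name, value, timestamp, tags=None):
    """Serialize one measurement as a single JSON line with a UTC timestamp."""
    record = {
        "name": name,
        "value": value,
        "time": timestamp.astimezone(timezone.utc).isoformat(),
        "tags": tags or {},
    }
    return json.dumps(record, sort_keys=True)


def from_json_line(line):
    """Parse a line written by to_json_line back into a dictionary."""
    record = json.loads(line)
    record["time"] = datetime.fromisoformat(record["time"])
    return record


# Hypothetical measurement from an order entry system.
moment = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(to_json_line("orders_per_minute", 42, moment, {"shop": "de"}))
```

Because JSON and ISO 8601 are understood by most visualization and monitoring tools, the same output can feed a custom backend today and a different system later without changing the producer.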

 

Dashboard of a customized order entry monitoring system.
The backend checks for criticality and sends notifications via e-mail or text-to-speech call if problems occur.
 
Monitoring Dashboards 

If additional features such as notifications or complex monitoring with business logic are not necessary, pure dashboard monitoring can be useful. The selection of open source data visualization tools is manageable, and commercial products can also be useful, depending on the intended use. The following list of some well-known dashboard systems is meant as an example and names some advantages and disadvantages; it does not cover all available systems.

Grafana 
  • Very flexible, supporting the following data sources:

      • Graphite

      • InfluxDB

      • OpenTSDB

      • MSSQL

      • MySQL

      • PostgreSQL

      • Prometheus

      • Elasticsearch

      • AWS, Azure

  • Easy handling

  • Very fast user interface

  • Platform independent

  • Extremely customizable interface

  • Extendable by plugins

  • Support in the commercial version

  • Can act as a central dashboard for different monitoring systems

Freeboard 
  • Limited to dweet.io and web-based APIs (JSON format)

DataZen 
  • Not as flexible as Grafana; DataZen Server is only available on Windows servers (based on .NET)

  • The company behind DataZen was acquired by Microsoft

  • The software has been integrated into MSSQL Server Enterprise Edition and extends the Microsoft BI strategy (2016)

  • The interface is slower than Grafana’s, and the supported data sources are less flexible:

      • MSSQL

      • OLAP

      • SharePoint

      • CSV

      • Excel

      • Web services

      • ODBC

  • Used mainly in BI environments

  • Poor documentation

  • The editor (DataZen Publisher) has bugs

  • Dashboard objects can hardly be customized

 

Conclusion

Effective monitoring cannot be implemented using a single, universal tool. There are simply too many technical differences and requirements for each kind of monitoring. However, a uniform presentation in a dashboard system makes sense and is possible. Most monitoring systems offer interfaces for retrieving the relevant data, and dashboard systems offer enough flexibility to choose the ideal form of presentation, even across different data stores.

Disclaimer

The assessment of the tools in this text is based on my experience with them. Therefore this is my personal opinion about the tools.

There may be some imprecision and strong simplification in the text. This is intentional and is meant to make the text easier to understand for readers who have not yet had much contact with these topics.

Not all tools are linked; I have limited myself to a selection here.

Suggestions for improvement and corrections are welcome!

 

Teaser Image:
Wallusy@pixabay