Monitoring Metrics (Perfcounter | Performance Metrics | Operational data)

1 - About

This section is about the collection and calculation of metrics used for real-time reporting and alerting, known as monitoring or operational intelligence.

A monitoring system implements operational intelligence: it provides a picture of what is currently happening within a system, whereas business intelligence is data gathered to analyze trends over time.

The monitoring metrics are time series data and are also known as:

  • Perfcounter (generally on Windows)
  • Performance metrics

The applications that collect and analyze this kind of data are known as event-data applications. For instance:

  • Machine data (IoT)
  • Network telemetry

They are produced by applications (the OS and third-party applications) in order to:

See also: Observability

These counters (numbers) are either averages already, or averages will be derived from them. The data alone cannot tell you whether the value moved from one point to the next along a horizontal, constant line.

2 - Type

Primitive metric types:

  • Timer: a timer measures both the count of timed events and the total time of all events timed.
  • Counter: a counter increments its value, generally by one, starting from 0. A counter drops back to zero on service restart. Representing a counter without rate aggregation over some time window is rarely useful, because the raw value is a function of both how quickly the counter is incremented and how long the service has been running.
  • Gauge: a gauge is a single value that can go up or down (gauge values are not rates).
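
The three primitives above can be sketched in a few lines. This is only an illustration: the class and method names (`Counter.increment`, `Gauge.set`, `Timer.record`) are hypothetical and do not come from any particular monitoring library.

```python
class Counter:
    """Monotonically increasing count; starts at 0 and is lost on restart."""

    def __init__(self):
        self.value = 0

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only go up")
        self.value += amount


class Gauge:
    """Current value of something that can go up or down (not a rate)."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Timer:
    """Tracks the number of timed events and their total duration."""

    def __init__(self):
        self.count = 0
        self.total_seconds = 0.0

    def record(self, seconds):
        self.count += 1
        self.total_seconds += seconds

    def mean(self):
        # Average duration, derived from the two stored values.
        return self.total_seconds / self.count if self.count else 0.0
```

Storing only the count and the total time is enough to derive an average duration later, which is why a timer keeps both.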

3 - Characteristics

3.1 - Registry

They are generally locally grouped in a registry in order to batch the data collection.
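
As a rough sketch of the idea, a registry can be a single in-process object that owns all metric values and hands them over in one batch. `MetricRegistry` and its methods are hypothetical names chosen for this example, and the metric names are made up.

```python
class MetricRegistry:
    """Hypothetical in-process registry holding every metric by name."""

    def __init__(self):
        self._values = {}

    def increment(self, name, amount=1):
        self._values[name] = self._values.get(name, 0) + amount

    def set_gauge(self, name, value):
        self._values[name] = value

    def snapshot(self):
        # One batch for the collector instead of one call per metric.
        return dict(self._values)


registry = MetricRegistry()
registry.increment("http.server.requests")
registry.increment("http.server.requests")
registry.set_gauge("memory.used.mb", 512)
batch = registry.snapshot()
```

The collector only ever needs `snapshot()`, so the code that records metrics stays independent of how and when the data is shipped.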

3.2 - Dimensionality

Time series data may be classified via:

  • dimension - the event is enriched with tag key/value pairs. (AppOptics, Atlas, Azure Monitor, Cloudwatch, Datadog, Datadog StatsD, Dynatrace, Elastic, Humio, Influx, KairosDB, New Relic, Prometheus, SignalFx, Sysdig StatsD, Telegraf StatsD, Wavefront)
  • hierarchy - the name is a flat hierarchical metric name (Graphite, Ganglia, JMX, Etsy StatsD)
  • or both

Dimensions are also known as tags.
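
To make the difference concrete, the sketch below represents one sample dimensionally (a name plus tags) and then flattens it into a single hierarchical name. The helper `hierarchical_name` and the choice to append tag values in sorted key order are assumptions made for this illustration, not a convention of any listed system.

```python
def hierarchical_name(name, tags):
    """Flatten a dimensional metric into one dot-separated name."""
    # Assumption: tag values are appended in sorted key order.
    parts = [name] + [str(tags[key]) for key in sorted(tags)]
    return ".".join(parts)


# Dimensional form: the name stays short, the context lives in the tags.
sample = {
    "name": "http.server.requests",
    "tags": {"method": "GET", "status": "200"},
    "value": 1,
}

flat = hierarchical_name(sample["name"], sample["tags"])
# flat == "http.server.requests.GET.200"
```

In the flattened form the meaning of each segment depends on its position, whereas tags keep the key next to the value.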

Hierarchical name examples:

  • Atlas (camelCase) - httpServerRequests
  • Graphite (dot separator) - http.server.requests
  • InfluxDB and Prometheus (underscore separator) - http_server_requests
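
The three spellings above can be derived from one canonical dot-separated name. The functions below are a minimal sketch with hypothetical names, and they assume the canonical name is lowercase.

```python
def to_camel_case(name):
    """http.server.requests -> httpServerRequests"""
    head, *rest = name.split(".")
    return head + "".join(part.capitalize() for part in rest)


def to_snake_case(name):
    """http.server.requests -> http_server_requests"""
    return name.replace(".", "_")


canonical = "http.server.requests"
camel = to_camel_case(canonical)
snake = to_snake_case(canonical)
```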

3.3 - Aggregation Processing

The aggregation of a set of samples over a prescribed time interval (Rate aggregation) may be performed:

  • client-side (AppOptics, Atlas, Azure Monitor, Datadog, Elastic, Graphite, Ganglia, Humio, Influx, JMX, KairosDB, New Relic, all StatsD flavors, SignalFx)
  • or server-side (Prometheus, Wavefront)

Example: conversion of discrete samples (such as counts) to a rate.
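
A minimal sketch of that conversion, assuming the samples are (timestamp in seconds, cumulative counter value) pairs ordered by time. The function name `rates_per_second` is hypothetical, and treating a drop in value as a restart from zero is an assumption of this sketch.

```python
def rates_per_second(samples):
    """Turn cumulative counter samples into per-second rates.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # A drop means the counter restarted from 0, so v1 is the whole increase.
        increase = v1 - v0 if v1 >= v0 else v1
        rates.append(increase / (t1 - t0))
    return rates


# One sample per minute; the service restarts between the 2nd and 3rd sample.
samples = [(0, 100), (60, 160), (120, 30), (180, 90)]
rates = rates_per_second(samples)
```

With these numbers the rates are 1.0, 0.5 and 1.0 events per second. The raw counter value (100, 160, 30, 90) says little on its own, while the rate stays comparable across restarts.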

Not all measurements are reported or best viewed as a rate. For example, gauge values are not rates.

3.4 - Metrics Collection

The collection of metrics may be done:

  • client side via client pushes (AppOptics, Atlas, Azure Monitor, Datadog, Elastic, Graphite, Ganglia, Humio, Influx, JMX, KairosDB, New Relic, SignalFx, Wavefront)
  • server side via server polls (Prometheus, all StatsD flavors)
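
The two collection styles differ mainly in who initiates the transfer. The sketch below uses in-memory stand-ins; `App`, `Backend`, `push_to`, `scrape` and `poll` are hypothetical names, and a real system would use a network protocol instead of direct method calls.

```python
class Backend:
    """Stand-in for a monitoring server that stores received batches."""

    def __init__(self):
        self.received = []

    def receive(self, batch):
        self.received.append(batch)

    def poll(self, app):
        # Poll model: the server decides when to read the metrics.
        self.received.append(app.scrape())


class App:
    """Stand-in for an instrumented application."""

    def __init__(self):
        self.metrics = {"http.server.requests": 0}

    def handle_request(self):
        self.metrics["http.server.requests"] += 1

    def scrape(self):
        return dict(self.metrics)

    def push_to(self, backend):
        # Push model: the client decides when to send the metrics.
        backend.receive(dict(self.metrics))


app, backend = App(), Backend()
app.handle_request()
app.push_to(backend)   # client push
app.handle_request()
backend.poll(app)      # server poll
```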

4 - Steps / Lifecycle

They are:

Alerting:

Merge with monitoring lifecycle ??

5 - Counter Category

Machine data counter example:

5.1 - Sensor

  • temperature,
  • speed,
  • voltage,
  • number of printouts

5.2 - Service Metrics

See SLI (Service Level Indicators).

5.3 - Event

Some monitoring systems can also capture events:

  • Changes: Internal code releases, builds, and build failures
  • Alerts: Internally generated alerts or third-party notifications
  • Scaling events: Adding or subtracting hosts

6 - Property

6.1 - Scale and Persistence

  • last 2 hours at 1 minute resolution,
  • last 24 hours at 10 minute resolution,
  • last 3 days at 1 hour resolution,
  • last 7 days at 2 hour resolution
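
Moving from one tier to the next means rolling fine-grained points up into coarser buckets. The sketch below averages the points in each bucket; the function name `downsample` is hypothetical, and averaging is an assumption (a real store might keep the maximum, the sum or the last value instead).

```python
def downsample(points, step):
    """Average (timestamp_seconds, value) points into buckets of `step` seconds."""
    buckets = {}
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % step
        buckets.setdefault(bucket_start, []).append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Twenty points at 1 minute resolution rolled up to 10 minute resolution.
one_minute = [(i * 60, float(i)) for i in range(20)]
ten_minute = downsample(one_minute, 600)
```

Here the twenty input points collapse into two, `(0, 4.5)` and `(600, 14.5)`. The shape of the curve inside each bucket is lost, which is the price of keeping older data at a lower resolution.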

X-scale (Minor/Major Tick)
