Monitoring Service#
Introduction#
A monitoring stack is part of the observability tools that the experimenter may need [1].
EnOSlib provides two monitoring stacks out of the box:
Telegraf [2] /InfluxDB [3] /Grafana [4] (TIG) stack. This stack follows a push model where Telegraf agents continuously push metrics to the InfluxDB collector. Grafana is used as a dashboard for visualizing the metrics.
Telegraf/Prometheus [5] /Grafana (TPG) stack. This stack follows a pull model where the Prometheus collector polls the Telegraf agents for new metrics. For instance, this model makes it possible to overcome a limitation when the deployment spans the Grid’5000 and FIT/IoT-LAB platforms.
Note that the Telegraf agents are also configured to expose NVIDIA GPU metrics if an NVIDIA GPU is detected and the NVIDIA container toolkit is found (installed with the Docker service or by your own means).
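Both stacks share the same constructor shape and lifecycle (deploy, backup, destroy), so switching between the push and pull models is a one-line change. A minimal sketch, assuming roles obtained from a provider’s init():

import enoslib as en

# roles is assumed to come from a provider, e.g.:
# roles, networks = en.G5k(conf).init()

# push model (TIG): Telegraf agents push metrics to InfluxDB
m = en.TIGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
# pull model (TPG): Prometheus polls the Telegraf agents
# m = en.TPGMonitoring(
#     collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
# )

m.deploy()
# ... run the experiment ...
m.backup()
m.destroy()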
Dstat (monitoring)#
Dstat Service Class#
Classes:
- Dstat: Deploy dstat on all hosts.
- class enoslib.service.dstat.dstat.Dstat(*, nodes: Iterable[Host], options: str = '-aT', backup_dir: Path | None = None, extra_vars: Dict | None = None)#
Deploy dstat on all hosts.
This assumes a debian/ubuntu based environment and aims at producing a quick way to deploy a simple monitoring stack based on dstat on your nodes. It’s opinionated out of the box but allows for some convenient customizations.
dstat metrics are dumped into a csv file by default (-o option) and retrieved when backing up.
- Parameters:
nodes – the nodes to install dstat on
options – options to pass to dstat.
backup_dir – path of the local backup directory (can be overwritten by backup())
extra_vars – extra vars to pass to Ansible
Examples
import logging
import time
from pathlib import Path

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()


CLUSTER = "parasilo"
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)
job_name = Path(__file__).name

# claim the resources
network = en.G5kNetworkConf(type="prod", roles=["my_network"], site=SITE)
conf = (
    en.G5kConf.from_settings(job_name=job_name, walltime="0:30:00", job_type=[])
    .add_network_conf(network)
    .add_machine(roles=["control"], cluster=CLUSTER, nodes=2, primary_network=network)
    .finalize()
)

provider = en.G5k(conf)
roles, networks = provider.init()

with en.actions(roles=roles["control"]) as a:
    a.apt(name="stress", state="present")

# Start a capture
# - for the duration of the commands
with en.Dstat(nodes=roles) as d:
    time.sleep(5)
    en.run("stress --cpu 4 --timeout 10", roles)
    time.sleep(5)


# sns.lineplot(data=result, x="epoch", y="usr", hue="host", markers=True, style="host")
# plt.show()
- backup(backup_dir: Path | None = None) Path #
Backup the dstat monitoring stack.
This fetches all the remote dstat csv files under the backup_dir.
- Parameters:
backup_dir (str) – path of the backup directory to use.
- deploy()#
Deploy the dstat monitoring stack.
- destroy()#
Destroy the dstat monitoring stack.
This kills the dstat processes on the nodes. Metric files survive a destroy.
- static to_pandas(backup_dir: Path)#
Get a pandas representation of the monitoring metrics.
Why static? You’ll probably use this method when doing post-mortem analysis, so the Dstat object might not be around anymore: you’ll be left with the dstat directory.
Internals. This works by scanning all csv files in backup_dir: this directory is assumed to have been created solely by a call to backup().
- Parameters:
backup_dir – the directory created by backup()
- Returns:
A pandas dataframe with all the metrics
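A minimal post-mortem sketch based on the above (the local backup path is illustrative):

from pathlib import Path

import enoslib as en

# the directory previously created by a call to backup()
backup_dir = Path("./dstat-backup")  # illustrative path

# post-mortem: only the backup directory is needed, not the Dstat object
result = en.Dstat.to_pandas(backup_dir)

# e.g. plot cpu usage over time, as in the commented lines of the example above
# sns.lineplot(data=result, x="epoch", y="usr", hue="host")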
Telegraf/InfluxDB/Grafana stack#
Classes:
- TIGMonitoring: Deploy a TIG stack: Telegraf, InfluxDB, Grafana.
- class enoslib.service.monitoring.monitoring.TIGMonitoring(collector: Host, agent: Iterable[Host], *, ui: Host | None = None, networks: Iterable[Network] | None = None, remote_working_dir: str = '/builds/monitoring', backup_dir: Path | None = None, collector_env: Dict | None = None, agent_conf: str | None = None, agent_env: Dict | None = None, agent_image: str = 'telegraf', ui_env: Dict | None = None, extra_vars: Dict | None = None)#
Deploy a TIG stack: Telegraf, InfluxDB, Grafana.
This assumes a debian/ubuntu base environment and aims at producing a quick way to deploy a monitoring stack on your nodes, except for the Telegraf agents on armv7 (FIT/IoT-LAB), which use a binary file.
It’s opinionated out of the box but allows for some convenient customizations.
- Parameters:
collector – the enoslib.Host where the collector will be installed
agent – list of enoslib.Host where the agents will be installed
ui – the enoslib.Host where the UI will be installed
networks – list of networks to use for the monitoring traffic. Agents will send their metrics to the collector using this IP address; in the same way, the UI will use this IP to connect to the collector. The IP address is taken from each enoslib.Host depending on this parameter: None: IP address = host.address; Iterable[Network]: the IP address available in host.extra_addresses which belongs to one of these networks. Note that this parameter requires a prior call to sync_network_info to fill the extra_addresses structure. Raises an exception if no IP address, or more than one, is found.
remote_working_dir – path to a remote location that will be used as the working directory
backup_dir – path to a local directory where the backup will be stored. This can be overwritten by backup().
collector_env – environment variables to pass in the collector process environment
agent_conf – path to an alternative configuration file
agent_env – environment variables to pass in the agent process environment
agent_image – docker image to use for the agent (telegraf)
ui_env – environment variables to pass in the ui process environment
extra_vars – extra variables to pass to Ansible
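To route the monitoring traffic over a specific network, pass the networks parameter after synchronizing the network information. A minimal sketch, assuming roles and networks come from a provider, that sync_network_info returns the updated roles, and that the "my_network" role is illustrative:

# populate host.extra_addresses so the service can pick an IP on the right network
# (assumption: sync_network_info returns the updated roles)
roles = en.sync_network_info(roles, networks)

m = en.TIGMonitoring(
    collector=roles["control"][0],
    agent=roles["compute"],
    ui=roles["control"][0],
    networks=networks["my_network"],  # metrics and UI traffic use IPs on this network
)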
Examples
import logging
from pathlib import Path

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()


CLUSTER = "parasilo"
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)
job_name = Path(__file__).name

# claim the resources
conf = en.G5kConf.from_settings(job_name=job_name, walltime="1:00:00", job_type=[])
network = en.G5kNetworkConf(id="n1", type="prod", roles=["my_network"], site=SITE)
conf.add_network_conf(network).add_machine(
    roles=["control"], cluster=CLUSTER, nodes=1, primary_network=network
).add_machine(
    roles=["compute"], cluster=CLUSTER, nodes=1, primary_network=network
).finalize()

provider = en.G5k(conf)
roles, networks = provider.init()

m = en.TIGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
m.deploy()

ui_address = roles["control"][0].address
print("The UI is available at http://%s:3000" % ui_address)
print("user=admin, password=admin")
import json
import logging
import time
from pathlib import Path

import requests

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()

job_name = Path(__file__).name


# They have GPUs in Lille!
CLUSTER = "chifflet"
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)


# claim the resources
conf = en.G5kConf.from_settings(job_name=job_name, walltime="0:30:00", job_type=[])
network = en.G5kNetworkConf(id="n1", type="prod", roles=["my_network"], site=SITE)
conf.add_network_conf(network).add_machine(
    roles=["control"], cluster=CLUSTER, nodes=1, primary_network=network
).add_machine(
    roles=["compute"], cluster=CLUSTER, nodes=1, primary_network=network
).finalize()

provider = en.G5k(conf)
roles, networks = provider.init()

# The Docker service knows how to deploy the nvidia docker runtime
d = en.Docker(agent=roles["control"] + roles["compute"])
d.deploy()

# The Monitoring service knows how to use this specific runtime
m = en.TIGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
m.deploy()

ui_address = roles["control"][0].address
print("The UI is available at http://%s:3000" % ui_address)
print("user=admin, password=admin")

# wait a bit for some metrics to come in
# and query influxdb
collector_address = roles["control"][0].address
time.sleep(10)
with en.G5kTunnel(collector_address, 8086) as (local_address, local_port, tunnel):
    url = f"http://{local_address}:{local_port}/query"
    q = (
        'SELECT mean("temperature_gpu") FROM "nvidia_smi" '
        'WHERE time > now() - 5m GROUP BY time(1m), "index", "name", "host"'
    )
    r = requests.get(url, dict(db="telegraf", q=q))
    print(json.dumps(r.json(), indent=4))
- backup(backup_dir: str | None = None)#
Backup the monitoring stack.
- Parameters:
backup_dir (str) – path of the backup directory to use. Will be used instead of the one set in the constructor.
- deploy()#
Deploy the monitoring stack.
- destroy()#
Destroy the monitoring stack.
This destroys all the containers and associated volumes.
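Since destroy removes the containers and their volumes, a typical end-of-experiment sequence fetches the data first. A minimal sketch (the local path is illustrative):

m.backup(backup_dir="./tig-backup")  # fetch the collected data locally first
m.destroy()  # then remove the containers and volumes
provider.destroy()  # finally release the testbed resources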
Telegraf/Prometheus/Grafana stack#
Classes:
- TPGMonitoring: Deploy a TPG stack: Telegraf, Prometheus, Grafana.
- class enoslib.service.monitoring.monitoring.TPGMonitoring(collector: Host, agent: Iterable[Host], *, ui: Host | None = None, networks: Iterable[Network] | None = None, remote_working_dir: str = '/builds/monitoring', backup_dir: Path | None = None)#
Deploy a TPG stack: Telegraf, Prometheus, Grafana.
This assumes a debian/ubuntu base environment and aims at producing a quick way to deploy a monitoring stack on your nodes, except for the Telegraf agents on armv7 (FIT/IoT-LAB), which use a binary file.
It’s opinionated out of the box but allows for some convenient customizations.
- Parameters:
collector – the enoslib.Host where the collector will be installed
agent – list of enoslib.Host where the agents will be installed
ui – the enoslib.Host where the UI will be installed
networks – list of networks to use for the monitoring traffic. Agents will send their metrics to the collector using this IP address; in the same way, the UI will use this IP to connect to the collector. The IP address is taken from each enoslib.Host depending on this parameter: None: IP address = host.address; Iterable[Network]: the first IP address available in host.extra_addresses which belongs to one of these networks. Note that this parameter requires a prior call to sync_network_info to fill the extra_addresses structure. Raises an exception if no IP address, or more than one, is found.
remote_working_dir – path to a remote location that will be used as the working directory
backup_dir – path to a local directory where the backup will be stored. This can be overwritten by backup().
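No dedicated example ships with this class in this section; the following minimal sketch mirrors the TIG example above with the class swapped (the cluster and roles are illustrative):

import logging
from pathlib import Path

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()

CLUSTER = "parasilo"  # illustrative cluster
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)
job_name = Path(__file__).name

# claim the resources
conf = en.G5kConf.from_settings(job_name=job_name, walltime="0:30:00", job_type=[])
network = en.G5kNetworkConf(id="n1", type="prod", roles=["my_network"], site=SITE)
conf.add_network_conf(network).add_machine(
    roles=["control"], cluster=CLUSTER, nodes=1, primary_network=network
).add_machine(
    roles=["compute"], cluster=CLUSTER, nodes=1, primary_network=network
).finalize()

provider = en.G5k(conf)
roles, networks = provider.init()

# pull model: Prometheus on the collector polls the Telegraf agents
m = en.TPGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
m.deploy()

ui_address = roles["control"][0].address
# Grafana serves the UI (port 3000 is Grafana's default, as in the TIG example)
print("The UI is available at http://%s:3000" % ui_address)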
- backup(backup_dir: str | None = None)#
Backup the monitoring stack.
- Parameters:
backup_dir (str) – path of the backup directory to use. Will be used instead of the one set in the constructor.
- deploy()#
Deploy the monitoring stack.
- destroy()#
Destroy the monitoring stack.
This destroys all the containers and associated volumes.