Monitoring Service#

Introduction#

A monitoring stack is part of the observability tools that the experimenter may need [1].

EnOSlib provides two monitoring stacks out of the box:

  • Telegraf [2] /InfluxDB [3] /Grafana [4] (TIG) stack. This stack follows a push model where Telegraf agents continuously push metrics to the InfluxDB collector. Grafana is used as a dashboard for visualizing the metrics.

  • Telegraf/Prometheus [5] /Grafana (TPG) stack. This stack follows a pull model where the Prometheus collector polls the Telegraf agents for new metrics. For instance, this model overcomes a limitation when the deployment spans the Grid’5000 and FIT/IoT-LAB platforms.

Note that the Telegraf agents are also configured to expose NVIDIA GPU metrics if an NVIDIA GPU is detected and if the nvidia container toolkit is found (installed with the Docker service or by your own means).

Dstat (monitoring)#

Dstat Service Class#

Classes:

Dstat(*, nodes[, options, backup_dir, ...])

Deploy dstat on all hosts.

class enoslib.service.dstat.dstat.Dstat(*, nodes: Iterable[Host], options: str = '-aT', backup_dir: Path | None = None, extra_vars: Dict | None = None)#

Deploy dstat on all hosts.

This assumes a Debian/Ubuntu based environment and aims at providing a quick way to deploy a simple monitoring stack based on dstat on your nodes. It’s opinionated out of the box but allows for some convenient customizations.

dstat metrics are dumped into a CSV file by default (-o option) and retrieved when backing up.

Parameters:
  • nodes – the nodes to install dstat on

  • options – options to pass to dstat.

  • backup_dir – local directory where the CSV files are saved when backing up

  • extra_vars – extra vars to pass to Ansible

Examples

import logging
import time
from pathlib import Path

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()


CLUSTER = "parasilo"
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)
job_name = Path(__file__).name

# claim the resources
network = en.G5kNetworkConf(type="prod", roles=["my_network"], site=SITE)
conf = (
    en.G5kConf.from_settings(job_name=job_name, walltime="0:30:00", job_type=[])
    .add_network_conf(network)
    .add_machine(roles=["control"], cluster=CLUSTER, nodes=2, primary_network=network)
    .finalize()
)

provider = en.G5k(conf)
roles, networks = provider.init()

with en.actions(roles=roles["control"]) as a:
    a.apt(name="stress", state="present")

# Start a capture
# - for the duration of the commands
with en.Dstat(nodes=roles) as d:
    time.sleep(5)
    en.run("stress --cpu 4 --timeout 10", roles)
    time.sleep(5)


# sns.lineplot(data=result, x="epoch", y="usr", hue="host", markers=True, style="host")
# plt.show()
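
The commented plotting lines above expect a result dataframe. A minimal post-processing sketch, assuming the CSV files were backed up under a local directory (the path below is hypothetical; use wherever backup() stored the files):

from pathlib import Path

import matplotlib.pyplot as plt
import seaborn as sns

import enoslib as en

# Hypothetical path: the directory where backup() put the dstat CSV files
backup_dir = Path("dstat-backup")

# Build a single dataframe from all the CSV files found under backup_dir
result = en.Dstat.to_pandas(backup_dir)

# One line per host: CPU user time over time
sns.lineplot(data=result, x="epoch", y="usr", hue="host", markers=True, style="host")
plt.show()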
backup(backup_dir: Path | None = None) → Path#

Backup the dstat monitoring stack.

This fetches all the remote dstat CSV files under backup_dir.

Parameters:

backup_dir (str) – path of the backup directory to use.

deploy()#

Deploy the dstat monitoring stack.

destroy()#

Destroy the dstat monitoring stack.

This kills the dstat processes on the nodes. The metric files survive a destroy.

static to_pandas(backup_dir: Path)#

Get a pandas representation of the monitoring metrics.

Why static? You’ll probably use this method when doing post-mortem analysis, so the Dstat object might not be around anymore: you’ll be left with the dstat backup directory.

Internals: this works by scanning all CSV files in backup_dir; this directory is assumed to have been created solely by a call to backup().

Parameters:

backup_dir – The directory created by backup()

Returns:

A pandas dataframe with all the metrics

Telegraf/InfluxDB/Grafana stack#

Classes:

TIGMonitoring(collector, agent, *[, ui, ...])

Deploy a TIG stack: Telegraf, InfluxDB, Grafana.

class enoslib.service.monitoring.monitoring.TIGMonitoring(collector: Host, agent: Iterable[Host], *, ui: Host | None = None, networks: Iterable[Network] | None = None, remote_working_dir: str = '/builds/monitoring', backup_dir: Path | None = None, collector_env: Dict | None = None, agent_conf: str | None = None, agent_env: Dict | None = None, agent_image: str = 'telegraf', ui_env: Dict | None = None, extra_vars: Dict | None = None)#

Deploy a TIG stack: Telegraf, InfluxDB, Grafana.

This assumes a Debian/Ubuntu base environment and aims at providing a quick way to deploy a monitoring stack on your nodes. The exception is Telegraf agents on armv7 (FIT/IoT-LAB), which are deployed as a binary file instead.

It’s opinionated out of the box but allows for some convenient customizations.

Parameters:
  • collector – enoslib.Host where the collector will be installed

  • agent – list of enoslib.Host where the agent will be installed

  • ui – enoslib.Host where the UI will be installed

  • networks – list of networks to use for the monitoring traffic. Agents will send their metrics to the collector using this IP address; in the same way, the ui will use this IP to connect to the collector. The IP address is taken from enoslib.Host, depending on this parameter: with None, IP address = host.address; with Iterable[Network], the IP address available in host.extra_addresses which belongs to one of these networks is used. Note that this parameter depends on calling sync_network_info beforehand to fill the extra_addresses structure. Raises an exception if no IP address, or more than one, is found (see the sketch after this parameter list).

  • remote_working_dir – path to a remote location that will be used as working directory

  • backup_dir – path to a local directory where the backup will be stored. This can be overwritten by backup().

  • collector_env – environment variables to pass in the collector process environment

  • agent_conf – path to an alternative configuration file

  • agent_env – environment variables to pass in the agent process environment

  • agent_image – docker image to use for the agent (telegraf)

  • ui_env – environment variables to pass in the ui process environment

  • extra_vars – extra variables to pass to Ansible
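
A minimal sketch of routing the monitoring traffic over a dedicated network. The "monitoring_net" role name and the surrounding variables are assumptions; the important part is the sync_network_info call mentioned above, which fills extra_addresses before the service resolves the IPs:

import enoslib as en

# Assumed: roles and networks come from a provider, with a secondary
# network registered under the "monitoring_net" role
roles = en.sync_network_info(roles, networks)  # fill host.extra_addresses

m = en.TIGMonitoring(
    collector=roles["control"][0],
    agent=roles["compute"],
    ui=roles["control"][0],
    # agents and the UI reach the collector through this network
    networks=networks["monitoring_net"],
)
m.deploy()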

Examples

import logging
from pathlib import Path

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()


CLUSTER = "parasilo"
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)
job_name = Path(__file__).name

# claim the resources
conf = en.G5kConf.from_settings(job_name=job_name, walltime="1:00:00", job_type=[])
network = en.G5kNetworkConf(id="n1", type="prod", roles=["my_network"], site=SITE)
conf.add_network_conf(network).add_machine(
    roles=["control"], cluster=CLUSTER, nodes=1, primary_network=network
).add_machine(
    roles=["compute"], cluster=CLUSTER, nodes=1, primary_network=network
).finalize()

provider = en.G5k(conf)
roles, networks = provider.init()

m = en.TIGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
m.deploy()

ui_address = roles["control"][0].address
print("The UI is available at http://%s:3000" % ui_address)
print("user=admin, password=admin")

A second example enables GPU metrics: the Docker service installs the nvidia container runtime, and InfluxDB is then queried for GPU temperatures through a G5kTunnel.

import json
import logging
import time
from pathlib import Path

import requests

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()

job_name = Path(__file__).name


# They have GPUs in Lille!
CLUSTER = "chifflet"
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)


# claim the resources
conf = en.G5kConf.from_settings(job_name=job_name, walltime="0:30:00", job_type=[])
network = en.G5kNetworkConf(id="n1", type="prod", roles=["my_network"], site=SITE)
conf.add_network_conf(network).add_machine(
    roles=["control"], cluster=CLUSTER, nodes=1, primary_network=network
).add_machine(
    roles=["compute"], cluster=CLUSTER, nodes=1, primary_network=network
).finalize()

provider = en.G5k(conf)
roles, networks = provider.init()

# The Docker service knows how to deploy the nvidia docker runtime
d = en.Docker(agent=roles["control"] + roles["compute"])
d.deploy()

# The Monitoring service knows how to use this specific runtime
m = en.TIGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
m.deploy()

ui_address = roles["control"][0].address
print("The UI is available at http://%s:3000" % ui_address)
print("user=admin, password=admin")

# wait a bit for some metrics to come in,
# then query influxdb
collector_address = roles["control"][0].address
time.sleep(10)
with en.G5kTunnel(collector_address, 8086) as (local_address, local_port, tunnel):
    url = f"http://{local_address}:{local_port}/query"
    q = (
        'SELECT mean("temperature_gpu") FROM "nvidia_smi" '
        'WHERE time > now() - 5m GROUP BY time(1m), "index", "name", "host"'
    )
    r = requests.get(url, dict(db="telegraf", q=q))
    print(json.dumps(r.json(), indent=4))
backup(backup_dir: str | None = None)#

Backup the monitoring stack.

Parameters:

backup_dir (str) – path of the backup directory to use. Will be used instead of the one set in the constructor.

deploy()#

Deploy the monitoring stack.

destroy()#

Destroy the monitoring stack.

This destroys all the containers and the associated volumes.

Telegraf/Prometheus/Grafana stack#

Classes:

TPGMonitoring(collector, agent, *[, ui, ...])

Deploy a TPG stack: Telegraf, Prometheus, Grafana.

class enoslib.service.monitoring.monitoring.TPGMonitoring(collector: Host, agent: Iterable[Host], *, ui: Host | None = None, networks: Iterable[Network] | None = None, remote_working_dir: str = '/builds/monitoring', backup_dir: Path | None = None)#

Deploy a TPG stack: Telegraf, Prometheus, Grafana.

This assumes a Debian/Ubuntu base environment and aims at providing a quick way to deploy a monitoring stack on your nodes. The exception is Telegraf agents on armv7 (FIT/IoT-LAB), which are deployed as a binary file instead.

It’s opinionated out of the box but allows for some convenient customizations.

Parameters:
  • collector – enoslib.Host where the collector will be installed

  • ui – enoslib.Host where the UI will be installed

  • agent – list of enoslib.Host where the agent will be installed

  • networks – list of networks to use for the monitoring traffic. Agents will send their metrics to the collector using this IP address; in the same way, the ui will use this IP to connect to the collector. The IP address is taken from enoslib.Host, depending on this parameter: with None, IP address = host.address; with Iterable[Network], the first IP address available in host.extra_addresses which belongs to one of these networks is used. Note that this parameter depends on calling sync_network_info beforehand to fill the extra_addresses structure. Raises an exception if no IP address, or more than one, is found.

  • remote_working_dir – path to a remote location that will be used as working directory

  • backup_dir – path to a local directory where the backup will be stored. This can be overwritten by backup().
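
No dedicated example ships with this class; here is a minimal deployment sketch mirroring the TIG examples above (roles are assumed to come from a provider as before):

import enoslib as en

# Assumed: roles obtained from a provider, as in the TIG examples above
m = en.TPGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
m.deploy()

# ... run the experiment ...

m.backup()   # fetch the collected metrics locally
m.destroy()  # stop the containers and remove the volumes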

backup(backup_dir: str | None = None)#

Backup the monitoring stack.

Parameters:

backup_dir (str) – path of the backup directory to use. Will be used instead of the one set in the constructor.

deploy()#

Deploy the monitoring stack.

destroy()#

Destroy the monitoring stack.

This destroys all the containers and the associated volumes.