Monitoring Service#
Introduction#
A monitoring stack is part of the observability tools that the experimenter may need [1].
EnOSlib provides two monitoring stacks out of the box:
Telegraf [2] /InfluxDB [3] /Grafana [4] (TIG) stack. This stack follows a push model where Telegraf agents continuously push metrics to the InfluxDB collector. Grafana is used as a dashboard for visualizing the metrics.
Telegraf/Prometheus [5] /Grafana (TPG) stack. This stack follows a pull model where the Prometheus collector polls the Telegraf agents for new metrics. For instance, this model makes it possible to overcome a limitation when the deployment spans the Grid’5000 and FIT/IoT-LAB platforms.
Note that the Telegraf agents are also configured to expose NVIDIA GPU metrics if an NVIDIA GPU is detected and the NVIDIA container toolkit is found (installed with the Docker service or by your own means).
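Both stacks share the same constructor shape and lifecycle (deploy, backup, destroy), so switching between the push and pull models is a one-line change. A minimal sketch, assuming roles obtained from a provider’s init():

import enoslib as en

# roles is assumed to come from a provider, e.g.:
# roles, networks = en.G5k(conf).init()

# push model (TIG): Telegraf agents push metrics to InfluxDB
m = en.TIGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
# pull model (TPG): Prometheus polls the Telegraf agents
# m = en.TPGMonitoring(
#     collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
# )

m.deploy()
# ... run the experiment ...
m.backup()
m.destroy()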
Dstat (monitoring)#
Dstat Service Class#
Classes:
- Dstat: Deploy dstat on all hosts.
- class enoslib.service.dstat.dstat.Dstat(*, nodes: Iterable[Host], options: str = '-aT', backup_dir: Path | None = None, extra_vars: Dict | None = None)#
Deploy dstat on all hosts.
This assumes a debian/ubuntu based environment and aims at producing a quick way to deploy a simple monitoring stack based on dstat on your nodes. It’s opinionated out of the box but allows for some convenient customizations.
dstat metrics are dumped into a csv file by default (-o option) and retrieved when backing up.
- Parameters:
nodes – the nodes to install dstat on
options – options to pass to dstat.
backup_dir – path of the local backup directory (can be overwritten by backup())
extra_vars – extra vars to pass to Ansible
Examples
import logging
import time
from pathlib import Path

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()


CLUSTER = "parasilo"
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)
job_name = Path(__file__).name

# claim the resources
network = en.G5kNetworkConf(type="prod", roles=["my_network"], site=SITE)
conf = (
    en.G5kConf.from_settings(job_name=job_name, walltime="0:30:00", job_type=[])
    .add_network_conf(network)
    .add_machine(roles=["control"], cluster=CLUSTER, nodes=2, primary_network=network)
    .finalize()
)

provider = en.G5k(conf)
roles, networks = provider.init()

with en.actions(roles=roles["control"]) as a:
    a.apt(name="stress", state="present")

# Start a capture
# - for the duration of the commands
with en.Dstat(nodes=roles) as d:
    time.sleep(5)
    en.run("stress --cpu 4 --timeout 10", roles)
    time.sleep(5)


# sns.lineplot(data=result, x="epoch", y="usr", hue="host", markers=True, style="host")
# plt.show()
- backup(backup_dir: Path | None = None) Path #
Backup the dstat monitoring stack.
This fetches all the remote dstat csv files under the backup_dir.
- Parameters:
backup_dir (str) – path of the backup directory to use.
- deploy()#
Deploy the dstat monitoring stack.
- destroy()#
Destroy the dstat monitoring stack.
This kills the dstat processes on the nodes. Metric files survive a destroy.
- static to_pandas(backup_dir: Path)#
Get a pandas representation of the monitoring metrics.
Why static? You’ll probably use this method when doing post-mortem analysis, so the Dstat object might not be around anymore: you’ll be left with the dstat directory.
Internals. This works by scanning all csv files in backup_dir: this directory is assumed to have been created solely by a call to backup().
- Parameters:
backup_dir – the directory created by backup()
- Returns:
A pandas dataframe with all the metrics
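A minimal post-mortem sketch based on the above (the local backup path is illustrative):

from pathlib import Path

import enoslib as en

# the directory previously created by a call to backup()
backup_dir = Path("./dstat-backup")  # illustrative path

# post-mortem: only the backup directory is needed, not the Dstat object
result = en.Dstat.to_pandas(backup_dir)

# e.g. plot cpu usage over time, as in the commented lines of the example above
# sns.lineplot(data=result, x="epoch", y="usr", hue="host")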
Telegraf/InfluxDB/Grafana stack#
Classes:
- TIGMonitoring: Deploy a TIG stack: Telegraf, InfluxDB, Grafana.
- class enoslib.service.monitoring.monitoring.TIGMonitoring(collector: Host, agent: Iterable[Host], *, ui: Host | None = None, networks: Iterable[Network] | None = None, remote_working_dir: str = '/builds/monitoring', backup_dir: Path | None = None, collector_env: Dict | None = None, agent_conf: str | None = None, agent_env: Dict | None = None, agent_image: str = 'telegraf', ui_env: Dict | None = None, extra_vars: Dict | None = None)#
Deploy a TIG stack: Telegraf, InfluxDB, Grafana.
This assumes a debian/ubuntu base environment and aims at producing a quick way to deploy a monitoring stack on your nodes, except for the Telegraf agents on armv7 (FIT/IoT-LAB), which use a binary file.
It’s opinionated out of the box but allows for some convenient customizations.
- Parameters:
collector – the enoslib.Host where the collector will be installed
agent – list of enoslib.Host where the agents will be installed
ui – the enoslib.Host where the UI will be installed
networks – list of networks to use for the monitoring traffic. Agents will send their metrics to the collector using this IP address; in the same way, the UI will use this IP to connect to the collector. The IP address is taken from each enoslib.Host depending on this parameter: None: IP address = host.address; Iterable[Network]: the IP address available in host.extra_addresses which belongs to one of these networks. Note that this parameter requires a prior call to sync_network_info to fill the extra_addresses structure. Raises an exception if no IP address, or more than one, is found.
remote_working_dir – path to a remote location that will be used as the working directory
backup_dir – path to a local directory where the backup will be stored. This can be overwritten by backup().
collector_env – environment variables to pass in the collector process environment
agent_conf – path to an alternative configuration file
agent_env – environment variables to pass in the agent process environment
agent_image – docker image to use for the agent (telegraf)
ui_env – environment variables to pass in the ui process environment
extra_vars – extra variables to pass to Ansible
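To route the monitoring traffic over a specific network, pass the networks parameter after synchronizing the network information. A minimal sketch, assuming roles and networks come from a provider, that sync_network_info returns the updated roles, and that the "my_network" role is illustrative:

# populate host.extra_addresses so the service can pick an IP on the right network
# (assumption: sync_network_info returns the updated roles)
roles = en.sync_network_info(roles, networks)

m = en.TIGMonitoring(
    collector=roles["control"][0],
    agent=roles["compute"],
    ui=roles["control"][0],
    networks=networks["my_network"],  # metrics and UI traffic use IPs on this network
)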
Examples
import logging
from pathlib import Path

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()


CLUSTER = "parasilo"
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)
job_name = Path(__file__).name

# claim the resources
conf = en.G5kConf.from_settings(job_name=job_name, walltime="1:00:00", job_type=[])
network = en.G5kNetworkConf(id="n1", type="prod", roles=["my_network"], site=SITE)
conf.add_network_conf(network).add_machine(
    roles=["control"], cluster=CLUSTER, nodes=1, primary_network=network
).add_machine(
    roles=["compute"], cluster=CLUSTER, nodes=1, primary_network=network
).finalize()

provider = en.G5k(conf)
roles, networks = provider.init()

m = en.TIGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
m.deploy()

ui_address = roles["control"][0].address
print("The UI is available at http://%s:3000" % ui_address)
print("user=admin, password=admin")
import json
import logging
import time
from pathlib import Path

import requests

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()

job_name = Path(__file__).name


# They have GPUs in Lille!
CLUSTER = "chifflet"
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)


# claim the resources
conf = en.G5kConf.from_settings(job_name=job_name, walltime="0:30:00", job_type=[])
network = en.G5kNetworkConf(id="n1", type="prod", roles=["my_network"], site=SITE)
conf.add_network_conf(network).add_machine(
    roles=["control"], cluster=CLUSTER, nodes=1, primary_network=network
).add_machine(
    roles=["compute"], cluster=CLUSTER, nodes=1, primary_network=network
).finalize()

provider = en.G5k(conf)
roles, networks = provider.init()

# The Docker service knows how to deploy the nvidia docker runtime
d = en.Docker(agent=roles["control"] + roles["compute"])
d.deploy()

# The Monitoring service knows how to use this specific runtime
m = en.TIGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
m.deploy()

ui_address = roles["control"][0].address
print("The UI is available at http://%s:3000" % ui_address)
print("user=admin, password=admin")

# wait a bit for some metrics to come in
# and query influxdb
collector_address = roles["control"][0].address
time.sleep(10)
with en.G5kTunnel(collector_address, 8086) as (local_address, local_port, tunnel):
    url = f"http://{local_address}:{local_port}/query"
    q = (
        'SELECT mean("temperature_gpu") FROM "nvidia_smi" '
        'WHERE time > now() - 5m GROUP BY time(1m), "index", "name", "host"'
    )
    r = requests.get(url, dict(db="telegraf", q=q))
    print(json.dumps(r.json(), indent=4))
- backup(backup_dir: str | None = None)#
Backup the monitoring stack.
- Parameters:
backup_dir (str) – path of the backup directory to use. Will be used instead of the one set in the constructor.
- deploy()#
Deploy the monitoring stack.
- destroy()#
Destroy the monitoring stack.
This destroys all the containers and associated volumes.
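Since destroy removes the containers and their volumes, a typical end-of-experiment sequence fetches the data first. A minimal sketch (the local path is illustrative):

m.backup(backup_dir="./tig-backup")  # fetch the collected data locally first
m.destroy()  # then remove the containers and volumes
provider.destroy()  # finally release the testbed resources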
Telegraf/Prometheus/Grafana stack#
Classes:
- TPGMonitoring: Deploy a TPG stack: Telegraf, Prometheus, Grafana.
- class enoslib.service.monitoring.monitoring.TPGMonitoring(collector: Host, agent: Iterable[Host], *, ui: Host | None = None, networks: Iterable[Network] | None = None, remote_working_dir: str = '/builds/monitoring', backup_dir: Path | None = None)#
Deploy a TPG stack: Telegraf, Prometheus, Grafana.
This assumes a debian/ubuntu base environment and aims at producing a quick way to deploy a monitoring stack on your nodes, except for the Telegraf agents on armv7 (FIT/IoT-LAB), which use a binary file.
It’s opinionated out of the box but allows for some convenient customizations.
- Parameters:
collector – the enoslib.Host where the collector will be installed
agent – list of enoslib.Host where the agents will be installed
ui – the enoslib.Host where the UI will be installed
networks – list of networks to use for the monitoring traffic. Agents will send their metrics to the collector using this IP address; in the same way, the UI will use this IP to connect to the collector. The IP address is taken from each enoslib.Host depending on this parameter: None: IP address = host.address; Iterable[Network]: the first IP address available in host.extra_addresses which belongs to one of these networks. Note that this parameter requires a prior call to sync_network_info to fill the extra_addresses structure. Raises an exception if no IP address, or more than one, is found.
remote_working_dir – path to a remote location that will be used as the working directory
backup_dir – path to a local directory where the backup will be stored. This can be overwritten by backup().
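No dedicated example ships with this class in this section; the following minimal sketch mirrors the TIG example above with the class swapped (the cluster and roles are illustrative):

import logging
from pathlib import Path

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()

CLUSTER = "parasilo"  # illustrative cluster
SITE = en.g5k_api_utils.get_cluster_site(CLUSTER)
job_name = Path(__file__).name

# claim the resources
conf = en.G5kConf.from_settings(job_name=job_name, walltime="0:30:00", job_type=[])
network = en.G5kNetworkConf(id="n1", type="prod", roles=["my_network"], site=SITE)
conf.add_network_conf(network).add_machine(
    roles=["control"], cluster=CLUSTER, nodes=1, primary_network=network
).add_machine(
    roles=["compute"], cluster=CLUSTER, nodes=1, primary_network=network
).finalize()

provider = en.G5k(conf)
roles, networks = provider.init()

# pull model: Prometheus on the collector polls the Telegraf agents
m = en.TPGMonitoring(
    collector=roles["control"][0], agent=roles["compute"], ui=roles["control"][0]
)
m.deploy()

ui_address = roles["control"][0].address
# Grafana serves the UI (port 3000 is Grafana's default, as in the TIG example)
print("The UI is available at http://%s:3000" % ui_address)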
- backup(backup_dir: str | None = None)#
Backup the monitoring stack.
- Parameters:
backup_dir (str) – path of the backup directory to use. Will be used instead of the one set in the constructor.
- deploy()#
Deploy the monitoring stack.
- destroy()#
Destroy the monitoring stack.
This destroys all the containers and associated volumes.