Kwollect Service
Kwollect Service Class
Classes:
Kwollect – Collect environmental metrics from the Grid'5000 Kwollect service
- class enoslib.service.kwollect.kwollect.Kwollect(nodes: Iterable[Host])
Collect environmental metrics from the Grid’5000 Kwollect service
This service must be called on Grid’5000 nodes.
To fetch metrics from the service, you first have to call start() and stop() to define the time range for which you want to retrieve metrics. Alternatively, you can use this service as a context manager.
Then, use get_metrics() or get_metrics_pandas() to fetch metrics for further processing, or backup() to store the raw data locally.
Some metrics are not available by default and require a specific job type, see https://www.grid5000.fr/w/Monitoring_Using_Kwollect
- Parameters:
nodes – list of enoslib.Host for which to collect metrics
Examples
    import logging
    import time
    from pathlib import Path
    from pprint import pprint

    import enoslib as en

    en.init_logging(level=logging.INFO)
    en.check()

    job_name = Path(__file__).name

    conf = (
        en.G5kConf.from_settings(job_name=job_name, walltime="0:20:00")
        .add_machine(roles=["all", "idle"], cluster="ecotype", nodes=1)
        .add_machine(roles=["all", "stress"], cluster="ecotype", nodes=1)
    )

    # This will validate the configuration, but not reserve resources yet
    provider = en.G5k(conf)

    # Get actual resources
    roles, networks = provider.init()

    # Global monitor
    monitor = en.Kwollect(nodes=roles["all"])
    monitor.deploy()

    # Run a loop of stress tests under the monitor
    monitor.start()

    duration = 20
    cores = 4
    for run in range(3):
        en.run_command(f"stress -c {cores} -t {duration}", roles=roles["stress"])
        print(f"Sleeping for {duration} seconds")
        time.sleep(duration)

    # Run an additional stress test with a nested monitor, using a context manager
    with en.Kwollect(nodes=roles["stress"]) as local_monitor:
        en.run_command(f"stress -c {cores} -t {duration}", roles=roles["stress"])
    print(f"Sleeping for {duration} seconds")
    time.sleep(duration)

    # Stop global monitor
    monitor.stop()

    # Get power metrics from global monitor
    metrics = monitor.get_metrics(metrics=["bmc_node_power_watt"])
    pprint(metrics)

    monitor.backup("./enoslib_tuto_kwollect")
    monitor.backup("./enoslib_tuto_kwollect_subset", metrics=["bmc_node_power_watt"])

    # Get CPU metrics from nested monitor
    metrics = local_monitor.get_metrics(metrics=["prom_node_cpu_scaling_frequency_hertz"])
    # Compute average CPU frequency across all cores and time for the stressed machine
    datapoints = metrics["nantes"]
    average_freq = sum(m["value"] for m in datapoints) / len(datapoints) / 1000000
    print(f"Average CPU frequency: {average_freq} MHz")

    # Available metrics are listed here:
    # https://www.grid5000.fr/w/Monitoring_Using_Kwollect#Metrics_available_in_Grid'5000

    # Release all Grid'5000 resources
    provider.destroy()
- available_metrics(nodes: Iterable[Host] | None = None) → Dict[str, List[Dict]]
Returns the description of the metrics that are theoretically available for the given nodes.
- Parameters:
nodes – optional list of nodes for which to retrieve metrics (default: all)
- Returns:
dict mapping each node to a list of metric descriptions (each description is a dict).
- Example return value:
    {"gros-46.nancy.grid5000.fr": [
        {'description': 'Power consumption of node reported by wattmetre, in watt',
         'name': 'wattmetre_power_watt', 'optional_period': 20, 'period': 1000,
         'source': {'protocol': 'wattmetre'}},
        {'description': 'Default subset of metrics from Prometheus Node Exporter',
         'name': 'prom_node_load1', 'optional_period': 15000, 'period': 0,
         'source': {'port': 9100, 'protocol': 'prometheus'}},
        …
    ]}
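As a rough illustration of how this return value could be processed, the sketch below groups metric names by their source protocol for each node. The helper name `metrics_by_protocol` is hypothetical and not part of EnOSlib, and the sample data is copied from the example return value above rather than fetched from Grid'5000.

```python
from collections import defaultdict
from typing import Dict, List


def metrics_by_protocol(available: Dict[str, List[dict]]) -> Dict[str, Dict[str, List[str]]]:
    """Group metric names by source protocol, for each node (hypothetical helper)."""
    result: Dict[str, Dict[str, List[str]]] = {}
    for node, descriptions in available.items():
        grouped = defaultdict(list)
        for desc in descriptions:
            # Fall back to "unknown" if a description carries no protocol
            protocol = desc.get("source", {}).get("protocol", "unknown")
            grouped[protocol].append(desc["name"])
        result[node] = dict(grouped)
    return result


# Sample data shaped like the documented return value of available_metrics()
sample = {
    "gros-46.nancy.grid5000.fr": [
        {"description": "Power consumption of node reported by wattmetre, in watt",
         "name": "wattmetre_power_watt", "optional_period": 20, "period": 1000,
         "source": {"protocol": "wattmetre"}},
        {"description": "Default subset of metrics from Prometheus Node Exporter",
         "name": "prom_node_load1", "optional_period": 15000, "period": 0,
         "source": {"port": 9100, "protocol": "prometheus"}},
    ]
}

grouped = metrics_by_protocol(sample)
print(grouped)
```

With a real service instance, the same helper could presumably be fed the result of `monitor.available_metrics()`.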
- backup(backup_dir: str | None = None, metrics: List[str] | None = None, nodes: Iterable[Host] | None = None, summary: bool = False)
Backup the kwollect data in JSONL format (one JSON record per line). Data for each node is stored in separate files.
- Parameters:
backup_dir (str) – path of the backup directory to use.
metrics – optional list of metrics to retrieve (default: all)
nodes – optional list of nodes for which to retrieve metrics (default: all)
summary – whether to retrieve summarized metrics (default: False)
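Since the backup is stored as JSONL, it can be read back with the standard library alone. The following is a minimal sketch: the helper `load_backup`, the `*.jsonl` file pattern and the file name `ecotype-7.jsonl` are assumptions made for illustration, as the actual file naming used by backup() is not specified here. The demo writes its own sample file to a temporary directory instead of relying on a real backup.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_backup(backup_dir: str, pattern: str = "*.jsonl") -> Dict[str, List[dict]]:
    """Load every matching file in backup_dir, keyed by file name (hypothetical helper)."""
    records: Dict[str, List[dict]] = {}
    for path in sorted(Path(backup_dir).glob(pattern)):
        with path.open() as f:
            # One JSON record per line; skip blank lines
            records[path.name] = [json.loads(line) for line in f if line.strip()]
    return records


# Demo on a temporary directory with a made-up file name and made-up records
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "ecotype-7.jsonl"
    sample_file.write_text(
        '{"metric_id": "bmc_node_power_watt", "value": 98}\n'
        '{"metric_id": "bmc_node_power_watt", "value": 101}\n'
    )
    loaded = load_backup(tmp)

print(loaded)
```

If the backup files use a different extension, the `pattern` argument can be adjusted accordingly.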
- deploy()
Validate that the nodes are usable with Kwollect, so that problems are detected early.
- destroy()
(abstract) Destroy the service.
- get_metrics(metrics: List[str] | None = None, nodes: Iterable[Host] | None = None, summary: bool = False) → Dict[str, List[Dict]]
Retrieve metrics from Kwollect
By default, all available metrics on all hosts are retrieved. To speed up metrics retrieval, it is recommended to filter on a subset of metrics and/or a subset of hosts.
Available metrics can be found with available_metrics() or are listed here: https://www.grid5000.fr/w/Monitoring_Using_Kwollect#Metrics_available_in_Grid'5000
- Parameters:
metrics – optional list of metrics to retrieve (default: all)
nodes – optional list of nodes for which to retrieve metrics (default: all)
summary – whether to retrieve summarized metrics (default: False)
- Returns:
A dict mapping each site name to a list of data points. Each data point is a dictionary. All data points are sorted in chronological order.
Example:
    {"nantes": [
        {"timestamp": "2025-04-18T18:55:33.754307+02:00",
         "device_id": "ecotype-7", "metric_id": "bmc_node_power_watt",
         "value": 98, "labels": {}},
        {"timestamp": "2025-04-18T18:55:34.732712+02:00",
         "device_id": "ecotype-6", "metric_id": "network_ifacein_bytes_total",
         "value": 152601965318,
         "labels": {"interface": "eth1", "_device_orig": ["ecotype-prod2-port-1_6"]}}
    ]}
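To show how a result of this shape might be post-processed, here is a small sketch that averages one metric per device across all sites. The function `average_by_device` is a hypothetical helper, not part of EnOSlib, and the sample data points are illustrative values modelled on the example above.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def average_by_device(metrics: Dict[str, List[dict]], metric_id: str) -> Dict[str, float]:
    """Average the values of one metric for each device (hypothetical helper)."""
    values = defaultdict(list)
    for datapoints in metrics.values():  # one list of data points per site
        for point in datapoints:
            if point["metric_id"] == metric_id:
                values[point["device_id"]].append(point["value"])
    return {device: mean(vals) for device, vals in values.items()}


# Illustrative data shaped like the documented return value of get_metrics()
sample = {
    "nantes": [
        {"timestamp": "2025-04-18T18:55:33.754307+02:00", "device_id": "ecotype-7",
         "metric_id": "bmc_node_power_watt", "value": 98, "labels": {}},
        {"timestamp": "2025-04-18T18:55:34.732712+02:00", "device_id": "ecotype-6",
         "metric_id": "network_ifacein_bytes_total", "value": 152601965318,
         "labels": {"interface": "eth1"}},
        {"timestamp": "2025-04-18T18:55:38.754307+02:00", "device_id": "ecotype-7",
         "metric_id": "bmc_node_power_watt", "value": 102, "labels": {}},
    ]
}

averages = average_by_device(sample, "bmc_node_power_watt")
print(averages)
```

Filtering on `metric_id` matters because a single result can mix data points from several metrics, as the example above shows.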
- get_metrics_pandas(*args, **kwargs)
Same as get_metrics(), but returns the result as a Pandas DataFrame. Data from all sites is aggregated in the same DataFrame, with an additional "site" column.
- Returns:
A Pandas DataFrame with all metrics data
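The aggregation across sites can be pictured without pandas. The sketch below flattens the per-site dictionary returned by get_metrics() into a single list of rows, each carrying a "site" key. It is only an illustration of the idea, not the library's actual implementation, and `flatten_with_site` is a hypothetical name.

```python
from typing import Dict, List


def flatten_with_site(metrics: Dict[str, List[dict]]) -> List[dict]:
    """Merge data points from all sites into one list, adding a "site" key."""
    rows = []
    for site, datapoints in metrics.items():
        for point in datapoints:
            rows.append({"site": site, **point})
    return rows


# Illustrative data for two sites
sample = {
    "nantes": [{"device_id": "ecotype-7", "metric_id": "bmc_node_power_watt", "value": 98}],
    "nancy": [{"device_id": "gros-46", "metric_id": "bmc_node_power_watt", "value": 75}],
}

rows = flatten_with_site(sample)
print(rows)
```

A list of dictionaries like `rows` can be passed to `pandas.DataFrame(rows)` to obtain a comparable table, although the column order and types produced by get_metrics_pandas() may differ.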
- start(start_time: float | None = None)
Define the start time for metric collection.
By default, the current time is used. Make sure your clock is synchronised.
- Parameters:
start_time – optional start time override, expressed as a Unix timestamp
- stop(stop_time: float | None = None)
Define the stop time for metric collection.
By default, the current time is used. Make sure your clock is synchronised.
- Parameters:
stop_time – optional stop time override, expressed as a Unix timestamp
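Because both overrides are Unix timestamps, an explicit time range can be computed with the standard `datetime` module. The sketch below builds a window of a given length ending at a reference time. The helper `window_ending_at` is hypothetical, and the commented-out calls at the end assume an existing `monitor` object created as in the example above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def window_ending_at(minutes: float, end: Optional[datetime] = None) -> Tuple[float, float]:
    """Return (start, stop) Unix timestamps for a window of the given length."""
    end = end or datetime.now(timezone.utc)
    begin = end - timedelta(minutes=minutes)
    return begin.timestamp(), end.timestamp()


# Fixed reference time so that the result is reproducible
reference = datetime(2025, 4, 18, 17, 0, 0, tzinfo=timezone.utc)
start_ts, stop_ts = window_ending_at(10, end=reference)
print(start_ts, stop_ts)

# Assumed usage with an existing Kwollect service instance:
# monitor.start(start_time=start_ts)
# monitor.stop(stop_time=stop_ts)
```

Using timezone-aware datetimes avoids ambiguity about local time when converting to a Unix timestamp.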