Kwollect Service

Kwollect Service Class

Classes:

Kwollect(nodes)

Collect environmental metrics from the Grid'5000 Kwollect service

class enoslib.service.kwollect.kwollect.Kwollect(nodes: Iterable[Host])

Collect environmental metrics from the Grid’5000 Kwollect service

This service must be called on Grid’5000 nodes.

To fetch metrics from the service, you first have to call start() and stop() to define the time range for which you want to retrieve metrics. Alternatively, you can use this service as a context manager.

Then, use get_metrics() or get_metrics_pandas() to fetch metrics for further processing, or backup() to store the raw data locally.

Some metrics are not available by default and require a specific job type; see https://www.grid5000.fr/w/Monitoring_Using_Kwollect for details.

Parameters:

nodes – list of enoslib.Host for which to collect metrics

Examples

import logging
import time
from pathlib import Path
from pprint import pprint

import enoslib as en

en.init_logging(level=logging.INFO)
en.check()

job_name = Path(__file__).name

conf = (
    en.G5kConf.from_settings(job_name=job_name, walltime="0:20:00")
    .add_machine(roles=["all", "idle"], cluster="ecotype", nodes=1)
    .add_machine(roles=["all", "stress"], cluster="ecotype", nodes=1)
)

# This will validate the configuration, but not reserve resources yet
provider = en.G5k(conf)

# Get actual resources
roles, networks = provider.init()

# Global monitor
monitor = en.Kwollect(nodes=roles["all"])
monitor.deploy()

# Run a loop of stress tests under the monitor
monitor.start()

duration = 20
cores = 4
for run in range(3):
    en.run_command(f"stress -c {cores} -t {duration}", roles=roles["stress"])
    print(f"Sleeping for {duration} seconds")
    time.sleep(duration)

# Run an additional stress test with a nested monitor, using a context manager
with en.Kwollect(nodes=roles["stress"]) as local_monitor:
    en.run_command(f"stress -c {cores} -t {duration}", roles=roles["stress"])
print(f"Sleeping for {duration} seconds")
time.sleep(duration)

# Stop global monitor
monitor.stop()

# Get power metrics from global monitor
metrics = monitor.get_metrics(metrics=["bmc_node_power_watt"])
pprint(metrics)

monitor.backup("./enoslib_tuto_kwollect")
monitor.backup("./enoslib_tuto_kwollect_subset", metrics=["bmc_node_power_watt"])

# Get CPU metrics from nested monitor
metrics = local_monitor.get_metrics(metrics=["prom_node_cpu_scaling_frequency_hertz"])
# Compute average CPU frequency across all cores and time for the stressed machine
datapoints = metrics["nantes"]
average_freq = sum(m["value"] for m in datapoints) / len(datapoints) / 1000000
print(f"Average CPU frequency: {average_freq} MHz")

# Available metrics are listed here:
# https://www.grid5000.fr/w/Monitoring_Using_Kwollect#Metrics_available_in_Grid'5000

# Release all Grid'5000 resources
provider.destroy()
available_metrics(nodes: Iterable[Host] | None = None) -> Dict[str, List[Dict]]

Return the descriptions of the metrics that are theoretically available for the given nodes.

Parameters:

nodes – optional list of nodes for which to retrieve metrics (default: all)

Returns:

A dict mapping each node to a list of metric descriptions (each description is itself a dict).

Example return value:
{"gros-46.nancy.grid5000.fr": [
    {'description': 'Power consumption of node reported by wattmetre, in watt',
     'name': 'wattmetre_power_watt',
     'optional_period': 20,
     'period': 1000,
     'source': {'protocol': 'wattmetre'}},
    {'description': 'Default subset of metrics from Prometheus Node Exporter',
     'name': 'prom_node_load1',
     'optional_period': 15000,
     'period': 0,
     'source': {'port': 9100, 'protocol': 'prometheus'}},
]}
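
The returned dictionary can be post-processed with plain Python. The following is a minimal sketch that groups metric names by source protocol for each node. It runs on a hand-written dictionary shaped like the example return value above rather than on a live call, and the helper name `metrics_by_protocol` is hypothetical (it is not part of EnOSlib).

```python
from collections import defaultdict
from typing import Dict, List

# Hand-written sample shaped like the example return value of available_metrics();
# on Grid'5000 you would pass the dictionary returned by the service instead.
available = {
    "gros-46.nancy.grid5000.fr": [
        {
            "description": "Power consumption of node reported by wattmetre, in watt",
            "name": "wattmetre_power_watt",
            "optional_period": 20,
            "period": 1000,
            "source": {"protocol": "wattmetre"},
        },
        {
            "description": "Default subset of metrics from Prometheus Node Exporter",
            "name": "prom_node_load1",
            "optional_period": 15000,
            "period": 0,
            "source": {"port": 9100, "protocol": "prometheus"},
        },
    ]
}


def metrics_by_protocol(
    available: Dict[str, List[dict]],
) -> Dict[str, Dict[str, List[str]]]:
    """Hypothetical helper: for each node, group metric names by source protocol."""
    result: Dict[str, Dict[str, List[str]]] = {}
    for node, descriptions in available.items():
        grouped = defaultdict(list)
        for desc in descriptions:
            # Fall back to "unknown" if a description has no protocol entry
            protocol = desc.get("source", {}).get("protocol", "unknown")
            grouped[protocol].append(desc["name"])
        result[node] = dict(grouped)
    return result


by_protocol = metrics_by_protocol(available)
for node, grouped in by_protocol.items():
    print(node, grouped)
```

The metric names collected this way can then be passed as the `metrics` argument of get_metrics() or backup().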

backup(backup_dir: str | None = None, metrics: List[str] | None = None, nodes: Iterable[Host] | None = None, summary: bool = False)

Back up the Kwollect data in JSONL format (one JSON record per line). Data for each node is stored in a separate file.

Parameters:
  • backup_dir (str) – path of the backup directory to use.

  • metrics – optional list of metrics to retrieve (default: all)

  • nodes – optional list of nodes for which to retrieve metrics (default: all)

  • summary – whether to retrieve summarized metrics (default: False)
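
Because the backup is plain JSONL, it can be read back with the standard library alone. The sketch below is an illustration under assumptions: the file naming scheme is not specified here, so the hypothetical helper `load_backup` simply parses every regular file found in the directory, and the sample file name and records are made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_backup(backup_dir: str) -> Dict[str, List[dict]]:
    """Hypothetical helper: parse every regular file in backup_dir as JSONL.

    Returns a dict mapping each file name to its list of records.
    """
    records: Dict[str, List[dict]] = {}
    for path in sorted(Path(backup_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open() as f:
            # One JSON record per non-empty line
            records[path.name] = [json.loads(line) for line in f if line.strip()]
    return records


# Self-contained demonstration on a made-up file (name and content are assumptions)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "ecotype-7.jsonl"
    sample.write_text(
        '{"device_id": "ecotype-7", "metric_id": "bmc_node_power_watt", "value": 98}\n'
        '{"device_id": "ecotype-7", "metric_id": "bmc_node_power_watt", "value": 101}\n'
    )
    loaded = load_backup(tmp)

print(loaded)
```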

deploy()

Validate that the nodes are usable with Kwollect, so that configuration problems cause an early failure.

destroy()

(abstract) Destroy the service.

get_metrics(metrics: List[str] | None = None, nodes: Iterable[Host] | None = None, summary: bool = False) -> Dict[str, List[Dict]]

Retrieve metrics from Kwollect

By default, all available metrics on all hosts are retrieved. To speed up metrics retrieval, it is recommended to filter on a subset of metrics and/or a subset of hosts.

Available metrics can be found with available_metrics() or are listed here: https://www.grid5000.fr/w/Monitoring_Using_Kwollect#Metrics_available_in_Grid'5000

Parameters:
  • metrics – optional list of metrics to retrieve (default: all)

  • nodes – optional list of nodes for which to retrieve metrics (default: all)

  • summary – whether to retrieve summarized metrics (default: False)

Returns:

A list of data points for each site. Each data point is a dictionary. All data points are sorted in chronological order.

Example:

{"nantes": [
    {"timestamp": "2025-04-18T18:55:33.754307+02:00",
     "device_id": "ecotype-7",
     "metric_id": "bmc_node_power_watt",
     "value": 98,
     "labels": {}},
    {"timestamp": "2025-04-18T18:55:34.732712+02:00",
     "device_id": "ecotype-6",
     "metric_id": "network_ifacein_bytes_total",
     "value": 152601965318,
     "labels": {"interface": "eth1", "_device_orig": ["ecotype-prod2-port-1_6"]}},
]}
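
Since the result is a dictionary of lists of data points, a common next step is to aggregate values per device. The sketch below averages one metric per device across all sites. It uses hand-written data shaped like the example above (the second power reading and the ecotype-6 power reading are made up), and `mean_per_device` is a hypothetical helper, not an EnOSlib function.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Hand-written sample shaped like the return value of get_metrics()
sample = {
    "nantes": [
        {"timestamp": "2025-04-18T18:55:33.754307+02:00", "device_id": "ecotype-7",
         "metric_id": "bmc_node_power_watt", "value": 98, "labels": {}},
        {"timestamp": "2025-04-18T18:55:34.732712+02:00", "device_id": "ecotype-6",
         "metric_id": "network_ifacein_bytes_total", "value": 152601965318,
         "labels": {"interface": "eth1", "_device_orig": ["ecotype-prod2-port-1_6"]}},
        {"timestamp": "2025-04-18T18:55:35.000000+02:00", "device_id": "ecotype-6",
         "metric_id": "bmc_node_power_watt", "value": 110, "labels": {}},
        {"timestamp": "2025-04-18T18:55:38.754307+02:00", "device_id": "ecotype-7",
         "metric_id": "bmc_node_power_watt", "value": 102, "labels": {}},
    ]
}


def mean_per_device(metrics: Dict[str, List[dict]], metric_id: str) -> Dict[str, float]:
    """Hypothetical helper: average the values of one metric for each device."""
    values = defaultdict(list)
    for datapoints in metrics.values():  # iterate over every site
        for point in datapoints:
            if point["metric_id"] == metric_id:
                values[point["device_id"]].append(point["value"])
    return {device: mean(vals) for device, vals in values.items()}


averages = mean_per_device(sample, "bmc_node_power_watt")
print(averages)
```

Filtering on `metric_id` matters when several metrics were retrieved in one call, because data points for different metrics are interleaved in chronological order.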

get_metrics_pandas(*args, **kwargs)

Same as get_metrics(), but returns the result as a pandas DataFrame. Data from all sites is aggregated in the same DataFrame, with an additional "site" column.

Returns:

A pandas DataFrame with all metrics data

start(start_time: float | None = None)

Define the start time for metric collection.

By default, the current time is used. Make sure your clock is synchronised.

Parameters:

start_time – optional start time override, expressed as a Unix timestamp

stop(stop_time: float | None = None)

Define the stop time for metric collection.

By default, the current time is used. Make sure your clock is synchronised.

Parameters:

stop_time – optional stop time override, expressed as a Unix timestamp
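
When the time range is known in advance, for instance to retrieve metrics for a past experiment, both overrides can be computed from timezone-aware datetime objects. The sketch below shows only the conversion to Unix timestamps; the dates are made up, and the calls on a `monitor` object are left as comments because they assume a Kwollect instance created on Grid'5000.

```python
from datetime import datetime, timedelta, timezone

# Made-up experiment window; an explicit timezone avoids ambiguity about local time
begin = datetime(2025, 4, 18, 16, 55, 0, tzinfo=timezone.utc)
end = begin + timedelta(minutes=5)

# datetime.timestamp() returns seconds since the Unix epoch as a float
start_time = begin.timestamp()
stop_time = end.timestamp()

print(f"window length: {stop_time - start_time} seconds")

# With a Kwollect instance named `monitor` (assumed, not created here):
# monitor.start(start_time=start_time)
# monitor.stop(stop_time=stop_time)
```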