Smoothing heterogeneity: the monitoring case#
Here we revisit the “observability notebook” in the context of mixed G5K/FIT resources.
Website: https://discovery.gitlabpages.inria.fr/enoslib/index.html
Instant chat: https://framateam.org/enoslib
Source code: https://gitlab.inria.fr/discovery/enoslib
Prerequisites#
Make sure you’ve run the one-time setup for your environment (e.g. the one-time setup for https://labs.iot-lab.info)
Monitoring options#
Experimenters rely on monitoring to get insight into their deployments. FIT and G5K each provide their own monitoring infrastructure, thus an experimenter has different choices:

- Get metrics from the infrastructure (G5K REST API / FIT OML files). This is especially interesting for environmental metrics (power …)
- Deploy their own monitoring tool
On Grid’5000#
There are different options to interact with the REST API (see https://www.grid5000.fr/w/Grid5000:Software#Experiment_management_tools), and of course curl. The REST API specification is available here: https://api.grid5000.fr/doc/3.0/

From Python you can use python-grid5000 (https://pypi.org/project/python-grid5000/).
[ ]:
from grid5000 import Grid5000
from pathlib import Path
conf = Path.home() / ".python-grid5000.yaml"
gk = Grid5000.from_yaml(conf)
[ ]:
# get the list of the available metrics for a given cluster
import json
metrics = gk.sites["lyon"].clusters["nova"].metrics
print(json.dumps(metrics, indent=4))
[ ]:
[m["name"] for m in metrics]
[ ]:
import time
metric = "wattmetre_power_watt"
now = time.time()
measurements = gk.sites["lyon"].metrics.list(nodes="nova-1,nova-2,nova-3", start_time=now - 1000, metrics=metric)
# alternatively one can pass a job_id
# measurements = gk.sites["lyon"].metrics.list(job_id=1307628, metrics=metric)
measurements[:10]
[ ]:
import pandas as pd
df = pd.DataFrame([m.to_dict() for m in measurements])
df["timestamp"] = pd.to_datetime(df["timestamp"])
import seaborn as sns
sns.relplot(data=df, x="timestamp", y="value", hue="device_id", alpha=0.7)
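Beyond plotting, the same DataFrame lends itself to quick aggregations, e.g. the mean power draw per node over the queried window. Here is a minimal, self-contained sketch; the timestamps and watt values below are synthetic, shaped like the measurements above, purely for illustration:

```python
import pandas as pd

# Synthetic sample mimicking the measurements DataFrame (hypothetical values)
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01T00:00:00", "2024-01-01T00:00:01",
        "2024-01-01T00:00:00", "2024-01-01T00:00:01",
    ]),
    "device_id": ["nova-1", "nova-1", "nova-2", "nova-2"],
    "value": [100.0, 110.0, 90.0, 95.0],
})

# Mean power draw per node over the window
mean_power = df.groupby("device_id")["value"].mean()
print(mean_power)
```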
On IOT-LAB#
One needs to attach a profile
to the experiment (either at reservation time or dynamically)
[ ]:
!ls tutorial_m3.elf || wget -q https://raw.githubusercontent.com/wiki/iot-lab/iot-lab/firmwares/tutorial_m3.elf
[ ]:
import enoslib as en
from enoslib.infra.enos_iotlab.configuration import ConsumptionConfiguration
en.init_logging()
FIT_SITE="grenoble"
fit_conf = (
en.IotlabConf.from_settings(job_name="tutorial_m3", walltime="02:00")
.add_machine(roles=["xp_fit"], archi="m3:at86rf231", site=FIT_SITE, number=1, image="tutorial_m3.elf", profile="test_profile")
.add_profile(name="test_profile", archi="m3", consumption=ConsumptionConfiguration(current=True, power=True, voltage=True, period=8244, average=4))
)
fit_conf
fit = en.Iotlab(fit_conf)
fit_roles, _ = fit.init()
[ ]:
# wait a bit for data to be collected and flushed
import time
time.sleep(20)
[ ]:
fit.collect_data_experiment()
[ ]:
import tarfile
job_id = fit.client.get_job_id()
tar = tarfile.open("%s.tar.gz" % (job_id)) # tar = tarfile.open("%s-%s.iot-lab.info.tar.gz" % (job_id, FIT_SITE))
tar.extractall(path=".")
tar.close()
[ ]:
%matplotlib widget
from oml_plot_tools import consum
from pathlib import Path
# iterate over all the *.oml files found
consumption_dir = Path(str(job_id)) / "consumption"
for consumption in consumption_dir.glob("*.oml"):
print(consumption)
data = consum.oml_load(consumption)
data = data[0:1000]
consum.consumption_plot(data, 'consumption', ('power'))
[ ]:
fit.destroy()
User defined monitoring#
The user deploys their own monitoring solution. EnOSlib provides different ways of doing that:

- a lightweight monitoring: based on independent monitoring processes running on each host
- a heavyweight monitoring: based on a classical monitoring stack (Telegraf + InfluxDB/Prometheus + Grafana)
Monitoring stacks are exposed as EnOSlib Services.
We show that these services can mix G5K and FIT nodes.
We reserve:

- 2 nodes on G5K
- 1 A8 node on FIT
[ ]:
import enoslib as en
en.init_logging()
network = en.G5kNetworkConf(type="prod", roles=["my_network"], site="rennes")
g5k_conf = (
en.G5kConf.from_settings(job_type=[], job_name="fit_g5k_monitoring")
.add_network_conf(network)
.add_machine(
roles=["xp", "collector"], cluster="paravance", nodes=1, primary_network=network
)
.add_machine(
roles=["xp", "agent"], cluster="paravance", nodes=1, primary_network=network
)
.finalize()
)
g5k_conf
[ ]:
FIT_SITE="grenoble"
fit_conf = (
en.IotlabConf.from_settings(job_name="riot_a8", walltime="02:00")
.add_machine(roles=["agent"], archi="a8:at86rf231", site=FIT_SITE, number=1)
)
fit_conf
[ ]:
# Here we set up a composite provider that will try to reserve both the G5k and Iotlab resources simultaneously
from enoslib.infra.providers import Providers
iotlab_provider = en.Iotlab(fit_conf, name="Iotlab")
g5k_provider = en.G5k(g5k_conf, name="G5k")
providers = Providers([iotlab_provider, g5k_provider])
roles, networks = providers.init(86400)
iotlab_provider, g5k_provider = providers.providers
Lightweight monitoring#
EnOSlib has a lightweight monitoring service based on Dstat.
Dstat service anatomy:
----------------------
.deploy: deploys a monitoring process in the background on each targeted node and stores the metrics in a file
.destroy: stops the monitoring process on each node
.backup: retrieves all the monitoring files back to your control node for post-mortem analysis
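The context manager used in the next cell is a convenience that wraps exactly these three calls. Schematically, with a stand-in class (not the real service) and assuming the destroy + backup order described above:

```python
class ServiceSketch:
    """Stand-in mimicking the lifecycle of an EnOSlib service such as Dstat."""

    def __init__(self):
        self.calls = []

    def deploy(self):
        self.calls.append("deploy")

    def destroy(self):
        self.calls.append("destroy")

    def backup(self):
        self.calls.append("backup")

    def __enter__(self):
        # entering the context deploys the monitoring
        self.deploy()
        return self

    def __exit__(self, *exc):
        # exiting the context tears down the monitoring and retrieves the files
        self.destroy()
        self.backup()


with ServiceSketch() as s:
    pass  # the experiment runs here while monitoring is active
print(s.calls)
```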
[ ]:
# a context manager that deploys when entering and destroys + backs up when exiting
with en.Dstat(nodes=roles["agent"]) as d:
import time
time.sleep(5)
en.run_command("apt install -y stress && stress --timeout 10 --cpu 8", roles=roles["G5k"] & roles["agent"])
time.sleep(5)
[ ]:
import seaborn as sns
sns.relplot(data=en.Dstat.to_pandas(d.backup_dir), x="epoch", y="idl", hue="host")
Heavyweight monitoring stack#
[ ]:
en.run_command("dhclient -6 br0", roles=roles["G5k"])
# mandatory for the following: refresh the hosts' network information
roles = en.sync_info(roles, networks)
A good practice to deploy Docker on G5K is to use the EnOSlib Docker service to:

- install the Docker agent on all nodes
- configure the daemon to use the Grid’5000 registry cache
- bind the Docker state directory to a place with some space
[ ]:
docker = en.Docker(agent=roles["xp"],
bind_var_docker="/tmp/docker",
registry_opts=dict(type="external", ip="docker-cache.grid5000.fr", port=80))
docker.deploy()
We’re now ready to instantiate the Monitoring Service.
We only need to make sure to bind the various clients on the IPv6 networks and make the clients use these networks.
[ ]:
def get_nets(networks, net_type):
""" Aux method to filter networks from roles """
return set([
n for net_list in networks.values() for n in net_list
if isinstance(n.network, net_type)
])
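To see what this filtering does without a live reservation, here is a self-contained run with stand-in objects; the real items are EnOSlib network wrappers whose `.network` attribute is an `ipaddress` network:

```python
from dataclasses import dataclass
from ipaddress import ip_network, IPv6Network


@dataclass(frozen=True)
class FakeNet:
    # stand-in for an EnOSlib network wrapper exposing a .network attribute
    network: object


def get_nets(networks, net_type):
    """Filter the networks attached to roles by address family."""
    return set(
        n for net_list in networks.values() for n in net_list
        if isinstance(n.network, net_type)
    )


networks = {
    "my_network": [
        FakeNet(ip_network("10.0.0.0/24")),    # IPv4: filtered out
        FakeNet(ip_network("2001:db8::/64")),  # IPv6: kept
    ],
}
v6 = get_nets(networks, IPv6Network)
print(v6)
```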
[ ]:
from ipaddress import IPv6Network
get_nets(networks, IPv6Network)
[ ]:
m = en.TIGMonitoring(
# collector is the node where the DB will be deployed and
# where the monitoring data will be sent
collector=roles["collector"][0],
# agent are the nodes where the monitoring agent will be deployed
# on g5k we deploy those using docker
# while on fit we deploy the agent from the binary
agent=roles["agent"],
# ui is the node where the dashboard will be deployed
ui=roles["collector"][0],
# networks represent the network to use (agent <-> collector communication)
networks=get_nets(networks, IPv6Network)
)
m.deploy()
[ ]:
print(f"""
Access the UI at {m.ui.address}:3000 (admin/admin)
---
tip1: create a ssh port forwarding -> ssh -NL 3000:{m.ui.address}:3000 access.grid5000.fr (and point your browser to http://localhost:3000)
tip2: use a proxy socks -> ssh -ND 2100 access.grid5000.fr (and point your browser to http://{m.ui.address}:3000)
tip3: use the G5K vpn
""")
[ ]:
# open the collector's port (InfluxDB listens on 8086) in the G5K reconfigurable firewall
g5k_provider.fw_create(proto="tcp+udp", port=8086)
[ ]:
proc = (roles["G5k"] & roles["agent"])[0].processor
cpu_stress = proc.cores * proc.count
en.run_command(f"stress --cpu {cpu_stress} --timeout 30", roles=roles["G5k"] & roles["agent"])
This is an example of the outcome in the dashboard.
Clean up#
[ ]:
providers.destroy()