Observability facilities#

Third-party software stacks to keep an eye on your experiment or gather metrics.



Prerequisites#

Make sure you’ve run the one-time setup for your environment.
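
In particular, the Grid’5000 REST API client used in the next section reads its configuration from ~/.python-grid5000.yaml. If that file does not exist yet, the sketch below shows one way to create it (illustrative only: from inside Grid’5000 an empty file is usually enough, while from the outside you need to fill in your own credentials):

from pathlib import Path

# Illustrative one-time setup: create a minimal python-grid5000 configuration.
# Replace the placeholders with your Grid'5000 credentials if you run from
# outside the platform; from the inside, an empty file is usually sufficient.
conf = Path.home() / ".python-grid5000.yaml"
if not conf.exists():
    conf.write_text("username: MYLOGIN\npassword: MYPASSWORD\n")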

Grid’5000 monitoring facilities#

Grid’5000 automatically collects metrics from its nodes. These metrics can be queried using the REST API.

[ ]:
from grid5000 import Grid5000
from pathlib import Path

conf = Path.home() / ".python-grid5000.yaml"

gk = Grid5000.from_yaml(conf)
[ ]:
# get the list of the available metrics for a given cluster
import json

metrics = gk.sites["lyon"].clusters["nova"].metrics
print(json.dumps(metrics, indent=4))
[ ]:
[m["name"] for m in metrics]
[ ]:
import time
metric = "wattmetre_power_watt"
now = time.time()
measurements = gk.sites["lyon"].metrics.list(nodes="nova-1,nova-2,nova-3", start_time=now - 1000, metrics=metric)

# alternatively one can pass a job_id
# measurements = gk.sites["lyon"].metrics.list(job_id=1307628, metrics=metric)
measurements[:10]
[ ]:
import pandas as pd

df = pd.DataFrame([m.to_dict() for m in measurements])
df["timestamp"] = pd.to_datetime(df["timestamp"])
import seaborn as sns

sns.relplot(data=df, x="timestamp", y="value", hue="device_id", alpha=0.7)
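
For a quick numeric summary of the same data, the DataFrame can also be aggregated per node (a minimal sketch; the value and device_id columns come from the measurements queried above):

[ ]:
# average and peak power per node over the captured window
df.groupby("device_id")["value"].agg(["mean", "max"])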

EnOSlib’s observability service#

A Service in EnOSlib is a third-party software stack that is commonly used among experimenters. In particular, EnOSlib provides some Services that deal with the problem of getting insight into what’s running on remote nodes.

A Service is a Python object that exposes three main methods:

  • deploy: deploys the service

  • destroy: stops and removes the service

  • backup: retrieves some state of the service (e.g. monitoring data)

Usually a service is used as follows:

service = Service(*args, **kwargs)
service.deploy()
...
# do stuff
...
service.backup()
service.destroy()

But it’s sometimes useful to use a context manager when working with a Service:

with Service(*args, **kwargs) as service:
    ...
    # do stuff
    ...

This allows for

  • running the service for a duration determined by what’s inside the context manager

  • cleaning up (and backing up) automatically at the end


Common setup#

[ ]:
import enoslib as en

# Enable rich logging
_ = en.init_logging()
[ ]:
# claim the resources
network = en.G5kNetworkConf(type="prod", roles=["my_network"], site="rennes")

conf = (
    en.G5kConf.from_settings(job_type=[], job_name="enoslib_observability")
    .add_network_conf(network)
    .add_machine(
        roles=["control", "xp"], cluster="parasilo", nodes=1, primary_network=network
    )
    .add_machine(
        roles=["agent", "xp"], cluster="parasilo", nodes=1, primary_network=network
    )
    .finalize()
)
conf
[ ]:
provider = en.G5k(conf)
roles, networks = provider.init()
roles

A simple load generator#

We’ll install a simple load generator: stress, available in the Debian packages.

[ ]:
with en.actions(roles=roles["agent"]) as a:
    a.apt(name="stress", state="present")

Monitoring with dstat#

Dstat is a simple monitoring tool: dstat-real/dstat. It runs as a single process and collects metrics from various sources, which makes it a good candidate for getting quick insight into resource consumption during an experiment.

The EnOSlib implementation lets you easily:

  • start Dstat processes on the remote machines and dump the metrics into a CSV file (that’s the purpose of the deploy method of the Dstat service)

  • retrieve all the CSV files (one per remote node) on your local machine (that’s the purpose of the backup method)

  • stop every remote Dstat process (that’s the purpose of the destroy method)

Capture#

Let’s start with a single capture implemented using a context manager. The context manager runs deploy when entering, and backup/destroy when exiting.

[ ]:
# Start a capture on all nodes
# - stress on some nodes
import time
with en.Dstat(nodes=roles["xp"]) as d:
    time.sleep(5)
    en.run_command("stress --cpu 4 --timeout 10", roles=roles["agent"])
    time.sleep(5)

Visualization#

All the CSV files are available under the backup_dir, inside subdirectories named after the corresponding remote host alias:

<backup_dir> / host1 / ... / <metrics>.csv
             / host2 / ... / <metrics>.csv

The following Python code recursively looks for any CSV file inside these directories and builds a DataFrame and a visualization.

[ ]:
import pandas as pd
import seaborn as sns


df = en.Dstat.to_pandas(d.backup_dir)
df
[ ]:
# let's show the metrics !
sns.lineplot(data=df, x="epoch", y="usr", hue="host", markers=True, style="host")

Packet sniffing with tcpdump#

Capture#

[ ]:
# start a capture
# - on all the interfaces of the nodes (ifnames=["any"])
# - dumping icmp traffic only
# - for the duration of the commands (here a client is pinging the server)
with en.TCPDump(
    hosts=roles["xp"], ifnames=["any"], options="icmp"
) as t:
    backup_dir = t.backup_dir
    _ = en.run(f"ping -c10 {roles['control'][0].address}", roles["agent"])

Visualization#

[ ]:
from scapy.all import rdpcap
import tarfile
# Examples:
# create a dictionary of (alias, if) -> list of packets decoded by scapy
decoded_pcaps = dict()
for host in roles["xp"]:
    host_dir = backup_dir / host.alias
    t = tarfile.open(host_dir / "tcpdump.tar.gz")
    t.extractall(host_dir / "extracted")
    # get all extracted pcap for this host
    pcaps = (host_dir / "extracted").rglob("*.pcap")
    for pcap in pcaps:
        decoded_pcaps.setdefault((host.alias, pcap.with_suffix("").name),
                                 rdpcap(str(pcap)))

# Displaying some packets
for (host, ifs), packets in decoded_pcaps.items():
    print(host, ifs)
    packets[0].show()
    packets[1].show()

Capture on a specific network#

You can start a capture on a dedicated network by passing it to TCPDump. This will sniff all the packets that go through an interface configured on this specific network. You need to call sync_info first to enable the translation (network logical name) -> interface name.

[ ]:
roles = en.sync_info(roles, networks)
[ ]:
# start a capture
# - on all the interfaces configured on the my_network network
# - dumping icmp traffic only
# - for the duration of the commands (here a client is pinging the server)
with en.TCPDump(
    hosts=roles["xp"], networks=networks["my_network"], options="icmp"
) as t:
    backup_dir = t.backup_dir
    _ = en.run(f"ping -c10 {roles['control'][0].address}", roles["agent"])
[ ]:
from scapy.all import rdpcap
import tarfile
# Examples:
# create a dictionary of (alias, if) -> list of packets decoded by scapy
decoded_pcaps = dict()
for host in roles["xp"]:
    host_dir = backup_dir / host.alias
    t = tarfile.open(host_dir / "tcpdump.tar.gz")
    t.extractall(host_dir / "extracted")
    # get all extracted pcap for this host
    pcaps = (host_dir / "extracted").rglob("*.pcap")
    for pcap in pcaps:
        decoded_pcaps.setdefault((host.alias, pcap.with_suffix("").name),
                                 rdpcap(str(pcap)))

# Displaying some packets
for (host, ifs), packets in decoded_pcaps.items():
    print(host, ifs)
    packets[0].show()
    packets[1].show()

Monitoring with Telegraf/[InfluxDB|Prometheus]/Grafana#

[ ]:
monitoring = en.TIGMonitoring(collector=roles["control"][0], agent=roles["agent"], ui=roles["control"][0])
monitoring
[ ]:
monitoring.deploy()
[ ]:
en.run_command("stress --cpu 24 --timeout 60", roles=roles["agent"], background=True)
[ ]:
print(f"""
Access the UI at {monitoring.ui.address}:3000 (admin/admin)
---
tip1: create an SSH port forwarding -> ssh -NL 3000:{monitoring.ui.address}:3000 access.grid5000.fr (and point your browser to http://localhost:3000)
tip2: use a SOCKS proxy -> ssh -ND 2100 access.grid5000.fr (and point your browser to http://{monitoring.ui.address}:3000)
tip3: use the G5K VPN
""")
[ ]:
# If you are running from outside g5k, you can access the dashboard by creating a tunnel
# create a tunnel to the service running inside g5k

tunnel = en.G5kTunnel(address=monitoring.ui.address, port=3000)
local_address, local_port, _ = tunnel.start()
print(f"The service is running at http://localhost:{local_port} (admin:admin)")

# wait some time
import time
time.sleep(60)


# don't forget to close it
tunnel.close()

To avoid forgetting to close the tunnel, you can use a context manager: the tunnel is closed automatically when exiting the context manager.

[ ]:
import time
with en.G5kTunnel(address=monitoring.ui.address, port=3000) as (_, local_port, _):
    print(f"The service is running at http://localhost:{local_port}")
    time.sleep(60)
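
Like any other Service, the monitoring stack can be backed up and torn down explicitly when you are done with it. This is a minimal sketch relying on the generic deploy/backup/destroy interface described above:

[ ]:
# optional: retrieve the monitoring data locally, then stop the stack
# (generic Service interface: backup followed by destroy)
monitoring.backup()
monitoring.destroy()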

Cleaning#

[ ]:
provider.destroy()