Observability service#

Third party software stack to keep an eye on your experiment or gather some metrics. Note that this tutorial is about instrumenting on your own the deployed nodes. Note that Grid’5000 also provides ways to get data from your job using a REST API.

Website: https://discovery.gitlabpages.inria.fr/enoslib/index.html
Instant chat: https://framateam.org/enoslib
Source code: https://gitlab.inria.fr/discovery/enoslib

Prerequisites#

⚠️ Make sure you’ve run the one time setup for your environment
⚠️ Make sure you’re running this notebook under the right kernel

[ ]:

import enoslib as en

en.check()

## EnOSlib’s service#

A Service in EnOSlib is a third party software stack that is commonly used among experimenters. In particular, EnOSlib has some some Services which deal with the problem of getting some insight on what’s running on remote nodes.

A Service is a python object which exposes three main methods:

deploy: which deploy the service
destroy: remove stop the service
backup: retrieve some states of the services (e.g monitoring information)

Usually a service is used as follow:

service = Service(*args, **kwargs)
service.deploy()
...
# do stuffs
...
service.backup()
service.destroy()

But it’s sometime useful to use a Context Manager when working with module:

with Service(*args, **kwargs) as service:
    ...
    # do stuffs
    ...

This allows for

running the service for some time depending on what’s inside the context manager
cleaning (and backuping) stuffs automatically at the end

There are different EnOSlib services for different purposes (network emulation, docker deployment, orchestrator deployment …). You can check the documentation.

Common setup#

[ ]:

import enoslib as en

# Enable rich logging
_ = en.init_logging()

[ ]:

conf = (
    en.G5kConf.from_settings(job_type=[], job_name="enoslib_observability")
    .add_machine(
        roles=["control", "xp"], cluster="parasilo", nodes=1
    )
    .add_machine(
        roles=["agent", "xp"], cluster="parasilo", nodes=1
    )
    .finalize()
)
conf

[ ]:

provider = en.G5k(conf)
roles, networks = provider.init()
roles

A simple load generator#

We’ll install a simple load generator: stress available in the debian packages.

[ ]:

with en.actions(roles=roles["agent"]) as a:
    a.apt(name="stress", state="present")

Monitoring with dstat#

Dstat is a simple monitoring tool: dstat-real/dstat It runs as a single process and collect metrics from various sources. That makes it a good candidate for getting a quick insight on the resources consumptions during an experiment.

The EnOSlib implementation lets you easily - start Dstat processes on remote machine and start dumping the metrics into a csv file( it’s the purpose deploy method of the Dstat service) - retrieve all the csvs file (one per remote node) on your local machine ( that’s the purpose of the backup method) - stop every remote Dstat processes (that’s the purpose of the destroy method)

Capture#

Let’s start with a single capture implemented using a context manager. The context manager runs deploy when entering, and backup/destroy when exiting.

[ ]:

# Start a capture on all nodes
# - stress on some nodes
import time
with en.Dstat(nodes=roles["xp"]) as d:
    time.sleep(5)
    en.run_command("stress --cpu 4 --timeout 10", roles=roles["agent"])
    time.sleep(5)

Visualization#

All the CSVs files are available under the backup_dir inside subdirectories named after the corresponding remote host alias:

<backup_sir> / host1 / ... / <metrics>.csv
             / host2 / ..../ <metrics>.csv

The following bunch of python lines will recursively look for any csv file inside these directories and build a DataFrame and a visualization

[ ]:

import pandas as pd
import seaborn as sns


df = en.Dstat.to_pandas(d.backup_dir)
df

[ ]:

# let's show the metrics !
sns.lineplot(data=df, x="epoch", y="usr", hue="host", markers=True, style="host")

Monitoring with Telegraf/[InfluxDB|prometheus]/grafana#

[ ]:

monitoring = en.TIGMonitoring(collector=roles["control"][0], agent=roles["agent"], ui=roles["control"][0])
monitoring

[ ]:

monitoring.deploy()

[ ]:

en.run_command("stress --cpu 24 --timeout 60", roles=roles["agent"], background=True)

💡 Accessing a service inside Grid’5000 isn’t straightforward. The following depends on your environment.

💡 Run the following on a terminal in your local computer

-> This requires that you SSH key is set. This can be done by managing your account on Grid’5000

[ ]:

print(f"""
Access the UI at {monitoring.ui.address}:3000 (admin/admin)")
---
tip1: create a ssh port forwarding -> ssh -NL 3000:{monitoring.ui.address}:3000 access.grid5000.fr (and point your browser to http://localhost:3000)
tip2: use a proxy socks -> ssh -ND 2100 access.grid5000.fr (and point your browser to http://{monitoring.ui.address}:3000)
tip3: use the G5K vpn
""")

💡 EnOSlib provides a way to programmaticaly create the tunnel if this notebook runs on your laptop. However this doesn’t apply if the notebook is running on a frontend node or a compute node inside Grid’5000.

[ ]:

# If you are running this notebook outside of Grid'5000 (e.g from your local machine), you can access the dashboard by creating a tunnel

# This doesn't apply if you are running this notebook from the frontend or a node inside Grid5000
tunnel = en.G5kTunnel(address=monitoring.ui.address, port=3000)
local_address, local_port, _ = tunnel.start()
print(f"The service is running at http://localhost:{local_port} (admin:admin)")

# wait some time
import time
time.sleep(60)


# don't forget to close it
tunnel.close()

To not forget to close the tunnel you can use a context manager: the tunnel will be closed automatically when exiting the context manager.

[ ]:

import time
with en.G5kTunnel(address=monitoring.ui.address, port=3000) as (_, local_port, _):
    print(f"The service is running at http://localhost:{local_port}")
    time.sleep(60)

Packet sniffing with tcpdump#

Capture#

[ ]:

# start a capture
# - on all the interface configured on the my_network network
# - we dump icmp traffic only
# - for the duration of the commands (here a client is pigging the server)
with en.TCPDump(
    hosts=roles["xp"], ifnames=["any"], options="icmp"
) as t:
    backup_dir = t.backup_dir
    _ = en.run(f"ping -c10 {roles['control'][0].address}", roles["agent"])

Visualization#

[ ]:

from scapy.all import rdpcap
import tarfile
# Examples:
# create a dictionnary of (alias, if) -> list of decoded packets by scapy
decoded_pcaps = dict()
for host in roles["xp"]:
    host_dir = backup_dir / host.alias
    t = tarfile.open(host_dir / "tcpdump.tar.gz")
    t.extractall(host_dir / "extracted")
    # get all extracted pcap for this host
    pcaps = (host_dir / "extracted").rglob("*.pcap")
    for pcap in pcaps:
        decoded_pcaps.setdefault((host.alias, pcap.with_suffix("").name),
                                 rdpcap(str(pcap)))

# Displaying some packets
for (host, ifs), packets in decoded_pcaps.items():
    print(host, ifs)
    packets[0].show()
    packets[1].show()

Capture on a specific network#

You can start a capture on a dedicated network by specifying it to TCPDump This will sniff all the packet that go through an interface configured in this specific network You need to call sync_info first to enable the translation (network logical name)->interface name

[ ]:

roles = en.sync_info(roles, networks)

[ ]:

# start a capture
# - on all the interface configured on the my_network network
# - we dump icmp traffic only
# - for the duration of the commands (here a client is pigging the server)
with en.TCPDump(
    hosts=roles["xp"], networks=networks["my_network"], options="icmp"
) as t:
    backup_dir = t.backup_dir
    _ = en.run(f"ping -c10 {roles['control'][0].address}", roles["agent"])

[ ]:

from scapy.all import rdpcap
import tarfile
# Examples:
# create a dictionnary of (alias, if) -> list of decoded packets by scapy
decoded_pcaps = dict()
for host in roles["xp"]:
    host_dir = backup_dir / host.alias
    t = tarfile.open(host_dir / "tcpdump.tar.gz")
    t.extractall(host_dir / "extracted")
    # get all extracted pcap for this host
    pcaps = (host_dir / "extracted").rglob("*.pcap")
    for pcap in pcaps:
        decoded_pcaps.setdefault((host.alias, pcap.with_suffix("").name),
                                 rdpcap(str(pcap)))

# Displaying some packets
for (host, ifs), packets in decoded_pcaps.items():
    print(host, ifs)
    packets[0].show()
    packets[1].show()

Cleaning#

[ ]:

provider.destroy()