{ "cells": [ { "cell_type": "markdown", "id": "adjustable-crime", "metadata": {}, "source": [ "# Observability facilities\n", "\n", "Third-party software stacks to keep an eye on your experiment or gather metrics.\n", "\n", "---\n", "\n", "- Website: https://discovery.gitlabpages.inria.fr/enoslib/index.html\n", "- Instant chat: https://framateam.org/enoslib\n", "- Source code: https://gitlab.inria.fr/discovery/enoslib\n", "\n", "---\n", "\n", "\n", "## Prerequisites\n", "\n", "
\n", " Make sure you've run the one-time setup for your environment.\n", "
\n" ] }, { "cell_type": "markdown", "id": "7b85cf55-938d-4758-9bd9-f6b4c7de2632", "metadata": {}, "source": [ "## Grid'5000 monitoring facilities\n", "\n", "Grid'5000 automatically collects metrics from its nodes.\n", "Those metrics can be queried using the REST API." ] }, { "cell_type": "code", "execution_count": null, "id": "177e81b3-ecc5-4412-b660-08dfd0a64bda", "metadata": {}, "outputs": [], "source": [ "from grid5000 import Grid5000\n", "from pathlib import Path\n", "\n", "conf = Path.home() / \".python-grid5000.yaml\"\n", "\n", "gk = Grid5000.from_yaml(conf)" ] }, { "cell_type": "code", "execution_count": null, "id": "f5f2bc99-fb11-4fb1-b2af-857214a16721", "metadata": {}, "outputs": [], "source": [ "# get the list of available metrics for a given cluster\n", "import json\n", "\n", "metrics = gk.sites[\"lyon\"].clusters[\"nova\"].metrics\n", "print(json.dumps(metrics, indent=4))" ] }, { "cell_type": "code", "execution_count": null, "id": "4c434dfa-4e9d-40f9-b69e-413f365c15b6", "metadata": {}, "outputs": [], "source": [ "[m[\"name\"] for m in metrics]" ] }, { "cell_type": "code", "execution_count": null, "id": "5f0aed1c-01ba-49b8-9571-ffbae2ec348d", "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "metric = \"wattmetre_power_watt\"\n", "now = time.time()\n", "# fetch the last ~17 minutes of power measurements for three nodes\n", "measurements = gk.sites[\"lyon\"].metrics.list(nodes=\"nova-1,nova-2,nova-3\", start_time=now - 1000, metrics=metric)\n", "\n", "# alternatively, one can pass a job_id\n", "# measurements = gk.sites[\"lyon\"].metrics.list(job_id=1307628, metrics=metric)\n", "measurements[:10]" ] }, { "cell_type": "code", "execution_count": null, "id": "e233fb27-d68d-4589-bbac-e5f564eecb30", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "\n", "df = pd.DataFrame([m.to_dict() for m in measurements])\n", "df[\"timestamp\"] = pd.to_datetime(df[\"timestamp\"])\n", "\n", "sns.relplot(data=df, x=\"timestamp\", y=\"value\", hue=\"device_id\", alpha=0.7)" ] }, { 
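"cell_type": "markdown", "id": "hypothetical-energy-estimate", "metadata": {}, "source": [
"The power samples can also be turned into a rough energy estimate. A minimal sketch, assuming a frame shaped like `df` above (`device_id`, `timestamp`, `value` in watts); the timestamps and values below are synthetic, with timestamps simplified to seconds:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"sample = pd.DataFrame({\n",
"    \"device_id\": [\"nova-1\"] * 3,\n",
"    \"timestamp\": [0.0, 1.0, 2.0],    # seconds\n",
"    \"value\": [100.0, 110.0, 120.0],  # watts\n",
"})\n",
"# rough estimate: mean power per device times the window length\n",
"span = sample[\"timestamp\"].max() - sample[\"timestamp\"].min()  # 2.0 s\n",
"energy_j = sample.groupby(\"device_id\")[\"value\"].mean() * span\n",
"print(energy_j)  # nova-1 -> 220.0 J\n",
"```\n"
] }, {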
"cell_type": "markdown", "id": "0016ac0d-1a38-409c-9899-89c8a21ff14b", "metadata": {}, "source": [ "## EnOSlib's observability services\n", "---\n", "\n", "A `Service` in EnOSlib is a third-party software stack that is commonly used by experimenters.\n", "In particular, EnOSlib ships some Services that help you get insight into what's running on remote nodes.\n", "\n", "A Service is a Python object that exposes three main methods:\n", "\n", "- `deploy`: deploys the service\n", "- `destroy`: stops and removes the service\n", "- `backup`: retrieves some state of the service (e.g. monitoring data)\n", "\n", "Usually a service is used as follows:\n", "\n", "```python\n", "service = Service(*args, **kwargs)\n", "service.deploy()\n", "...\n", "# do stuff\n", "...\n", "service.backup()\n", "service.destroy()\n", "```\n", "\n", "\n", "But it's sometimes useful to use a context manager instead:\n", "\n", "```python\n", "with Service(*args, **kwargs) as service:\n", "    ...\n", "    # do stuff\n", "    ...\n", "```\n", "\n", "This allows for:\n", "\n", "- running the service for exactly as long as the body of the context manager\n", "- cleaning up (and backing up) automatically at the end\n", "\n", "---\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "worse-equity", "metadata": {}, "source": [ "## Common setup" ] }, { "cell_type": "code", "execution_count": null, "id": "stylish-bahamas", "metadata": {}, "outputs": [], "source": [ "import enoslib as en\n", "\n", "# Enable rich logging\n", "_ = en.init_logging()" ] }, { "cell_type": "code", "execution_count": null, "id": "dressed-mentor", "metadata": {}, "outputs": [], "source": [ "# claim the resources\n", "network = en.G5kNetworkConf(type=\"prod\", roles=[\"my_network\"], site=\"rennes\")\n", "\n", "conf = (\n", "    en.G5kConf.from_settings(job_type=[], job_name=\"enoslib_observability\")\n", "    .add_network_conf(network)\n", "    .add_machine(\n", "        roles=[\"control\", 
\"xp\"], cluster=\"parasilo\", nodes=1, primary_network=network\n", "    )\n", "    .add_machine(\n", "        roles=[\"agent\", \"xp\"], cluster=\"parasilo\", nodes=1, primary_network=network\n", "    )\n", "    .finalize()\n", ")\n", "conf" ] }, { "cell_type": "code", "execution_count": null, "id": "corrected-analysis", "metadata": {}, "outputs": [], "source": [ "provider = en.G5k(conf)\n", "roles, networks = provider.init()\n", "roles" ] }, { "cell_type": "markdown", "id": "final-light", "metadata": {}, "source": [ "### A simple load generator\n", "\n", "We'll install a simple load generator: `stress`, available as a Debian package." ] }, { "cell_type": "code", "execution_count": null, "id": "marked-sport", "metadata": {}, "outputs": [], "source": [ "with en.actions(roles=roles[\"agent\"]) as a:\n", "    a.apt(name=\"stress\", state=\"present\")" ] }, { "cell_type": "markdown", "id": "satellite-burlington", "metadata": { "tags": [] }, "source": [ "## Monitoring with dstat\n", "\n", "Dstat is a simple monitoring tool: https://github.com/dstat-real/dstat#information\n", "It runs as a single process and collects metrics from various sources.\n", "That makes it a good candidate for getting quick insight into resource consumption during an experiment.\n", "\n", "\n", "The EnOSlib implementation lets you easily:\n", "- start Dstat processes on the remote machines, dumping the metrics into CSV files (that's the purpose of the `deploy` method)\n", "- retrieve all the CSV files (one per remote node) to your local machine (that's the purpose of the `backup` method)\n", "- stop all remote Dstat processes (that's the purpose of the `destroy` method)" ] }, { "cell_type": "markdown", "id": "auburn-torture", "metadata": {}, "source": [ "### Capture\n", "\n", "Let's start with a single capture implemented using a context manager.\n", "The context manager runs `deploy` when entering, and `backup/destroy` when exiting."
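, "\n", "\n", "Without the context manager, the equivalent explicit calls look like this (a sketch, using the same `roles` as below):\n", "\n", "```python\n", "d = en.Dstat(nodes=roles[\"xp\"])\n", "d.deploy()   # start a dstat process on each node, dumping metrics to CSV\n", "# ... run the workload ...\n", "d.backup()   # retrieve the CSV files locally under d.backup_dir\n", "d.destroy()  # stop the remote dstat processes\n", "```\n"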
] }, { "cell_type": "code", "execution_count": null, "id": "excellent-saudi", "metadata": {}, "outputs": [], "source": [ "# Start a capture on all nodes\n", "# - run stress on the agent nodes in the middle of the capture\n", "import time\n", "\n", "with en.Dstat(nodes=roles[\"xp\"]) as d:\n", "    time.sleep(5)\n", "    en.run_command(\"stress --cpu 4 --timeout 10\", roles=roles[\"agent\"])\n", "    time.sleep(5)" ] }, { "cell_type": "markdown", "id": "announced-basis", "metadata": {}, "source": [ "### Visualization\n", "\n", "All the CSV files are available under `backup_dir`, inside subdirectories named after the corresponding remote host alias:\n", "```bash\n", " / host1 / ... / .csv\n", " / host2 / ... / .csv\n", "```\n", "The following Python lines recursively look for CSV files inside these directories, build a DataFrame, and plot a visualization." ] }, { "cell_type": "code", "execution_count": null, "id": "adopted-vocabulary", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "\n", "df = en.Dstat.to_pandas(d.backup_dir)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "rough-campus", "metadata": {}, "outputs": [], "source": [ "# let's show the metrics!\n", "sns.lineplot(data=df, x=\"epoch\", y=\"usr\", hue=\"host\", markers=True, style=\"host\")" ] }, { "cell_type": "markdown", "id": "cardiac-turkish", "metadata": {}, "source": [ "## Packet sniffing with tcpdump\n", "\n", "### Capture" ] }, { "cell_type": "code", "execution_count": null, "id": "integrated-feedback", "metadata": {}, "outputs": [], "source": [ "# start a capture\n", "# - on every interface (\"any\")\n", "# - dumping icmp traffic only\n", "# - for the duration of the commands (here a client is pinging the server)\n", "with en.TCPDump(\n", "    hosts=roles[\"xp\"], ifnames=[\"any\"], options=\"icmp\"\n", ") as t:\n", "    backup_dir = t.backup_dir\n", "    _ = en.run(f\"ping -c10 {roles['control'][0].address}\", roles[\"agent\"])" ] }, { 
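"cell_type": "markdown", "id": "hypothetical-tcpdump-filter", "metadata": {}, "source": [
"The `options` string is handed over to `tcpdump`, so a richer pcap-filter expression should also work. A hypothetical variant (assumption: `options` accepts a full filter expression) restricting the capture to ICMP traffic involving the control node:\n",
"\n",
"```python\n",
"with en.TCPDump(\n",
"    hosts=roles[\"xp\"],\n",
"    ifnames=[\"any\"],\n",
"    options=f\"icmp and host {roles['control'][0].address}\",\n",
") as t:\n",
"    backup_dir = t.backup_dir\n",
"    _ = en.run(f\"ping -c10 {roles['control'][0].address}\", roles[\"agent\"])\n",
"```\n"
] }, {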
"cell_type": "markdown", "id": "nervous-joint", "metadata": {}, "source": [ "### Visualization" ] }, { "cell_type": "code", "execution_count": null, "id": "virtual-memorial", "metadata": {}, "outputs": [], "source": [ "from scapy.all import rdpcap\n", "import tarfile\n", "\n", "# build a dictionary of (alias, ifname) -> list of packets decoded by scapy\n", "decoded_pcaps = dict()\n", "for host in roles[\"xp\"]:\n", "    host_dir = backup_dir / host.alias\n", "    t = tarfile.open(host_dir / \"tcpdump.tar.gz\")\n", "    t.extractall(host_dir / \"extracted\")\n", "    # get all extracted pcaps for this host\n", "    pcaps = (host_dir / \"extracted\").rglob(\"*.pcap\")\n", "    for pcap in pcaps:\n", "        decoded_pcaps.setdefault((host.alias, pcap.with_suffix(\"\").name),\n", "                                 rdpcap(str(pcap)))\n", "\n", "# display some packets\n", "for (host, ifs), packets in decoded_pcaps.items():\n", "    print(host, ifs)\n", "    packets[0].show()\n", "    packets[1].show()" ] }, { "cell_type": "markdown", "id": "5eae90d4-3f5f-4879-9791-5afab6278383", "metadata": {}, "source": [ "### Capture on a specific network\n", "\n", "You can start a capture on a dedicated network by passing it to TCPDump.\n", "This will sniff all the packets that go through the interfaces configured on this specific network.\n", "You need to call `sync_info` first to enable the translation from network logical name to interface name." ] }, { "cell_type": "code", "execution_count": null, "id": "281ed635-b2f2-4548-997d-85f32e595593", "metadata": {}, "outputs": [], "source": [ "roles = en.sync_info(roles, networks)" ] }, { "cell_type": "code", "execution_count": null, "id": "3d9815b9-4602-44e8-9992-60445424cf22", "metadata": {}, "outputs": [], "source": [ "# start a capture\n", "# - on all the interfaces configured on the my_network network\n", "# - dumping icmp traffic only\n", "# - for the duration of the commands (here a client is pinging the server)\n", "with en.TCPDump(\n", "    hosts=roles[\"xp\"], networks=networks[\"my_network\"], 
options=\"icmp\"\n", ") as t:\n", "    backup_dir = t.backup_dir\n", "    _ = en.run(f\"ping -c10 {roles['control'][0].address}\", roles[\"agent\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "4480a37a-16bc-443d-bcc7-efcdbeff00b8", "metadata": {}, "outputs": [], "source": [ "from scapy.all import rdpcap\n", "import tarfile\n", "\n", "# build a dictionary of (alias, ifname) -> list of packets decoded by scapy\n", "decoded_pcaps = dict()\n", "for host in roles[\"xp\"]:\n", "    host_dir = backup_dir / host.alias\n", "    t = tarfile.open(host_dir / \"tcpdump.tar.gz\")\n", "    t.extractall(host_dir / \"extracted\")\n", "    # get all extracted pcaps for this host\n", "    pcaps = (host_dir / \"extracted\").rglob(\"*.pcap\")\n", "    for pcap in pcaps:\n", "        decoded_pcaps.setdefault((host.alias, pcap.with_suffix(\"\").name),\n", "                                 rdpcap(str(pcap)))\n", "\n", "# display some packets\n", "for (host, ifs), packets in decoded_pcaps.items():\n", "    print(host, ifs)\n", "    packets[0].show()\n", "    packets[1].show()" ] }, { "cell_type": "markdown", "id": "fatal-center", "metadata": {}, "source": [ "## Monitoring with Telegraf/[InfluxDB|Prometheus]/Grafana" ] }, { "cell_type": "code", "execution_count": null, "id": "finished-individual", "metadata": {}, "outputs": [], "source": [ "monitoring = en.TIGMonitoring(collector=roles[\"control\"][0], agent=roles[\"agent\"], ui=roles[\"control\"][0])\n", "monitoring" ] }, { "cell_type": "code", "execution_count": null, "id": "982e398e-c0da-4451-b0e8-b074262295ad", "metadata": {}, "outputs": [], "source": [ "monitoring.deploy()" ] }, { "cell_type": "code", "execution_count": null, "id": "fossil-closer", "metadata": {}, "outputs": [], "source": [ "en.run_command(\"stress --cpu 24 --timeout 60\", roles=roles[\"agent\"], background=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "3e07f4ec-064f-4896-a4a9-7b7951470718", "metadata": {}, "outputs": [], "source": [ "print(f\"\"\"\n", "Access the UI at 
{monitoring.ui.address}:3000 (admin/admin)\n", "---\n", "tip1: set up an SSH port forwarding -> ssh -NL 3000:{monitoring.ui.address}:3000 access.grid5000.fr (and point your browser to http://localhost:3000)\n", "tip2: use a SOCKS proxy -> ssh -ND 2100 access.grid5000.fr (and point your browser to http://{monitoring.ui.address}:3000)\n", "tip3: use the Grid'5000 VPN\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": null, "id": "canadian-maryland", "metadata": {}, "outputs": [], "source": [ "# If you are running things from outside Grid'5000, you can access the dashboard\n", "# by creating a tunnel to the service running inside Grid'5000\n", "import time\n", "\n", "tunnel = en.G5kTunnel(address=monitoring.ui.address, port=3000)\n", "local_address, local_port, _ = tunnel.start()\n", "print(f\"The service is running at http://localhost:{local_port} (admin:admin)\")\n", "\n", "# wait some time\n", "time.sleep(60)\n", "\n", "# don't forget to close it\n", "tunnel.close()" ] }, { "cell_type": "markdown", "id": "ethical-creation", "metadata": {}, "source": [ "To avoid forgetting to close the tunnel, you can use a context manager: the tunnel is closed automatically when the context manager exits."
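, "\n", "\n", "The same pattern is easy to replicate for your own resources with `contextlib`; a minimal, self-contained sketch (`tunnel_like` is a hypothetical stand-in, not part of EnOSlib):\n", "\n", "```python\n", "from contextlib import contextmanager\n", "\n", "@contextmanager\n", "def tunnel_like(name):\n", "    # stand-in for G5kTunnel: acquire on enter, always release on exit\n", "    handle = f\"{name}:open\"\n", "    try:\n", "        yield handle\n", "    finally:\n", "        print(f\"{name} closed\")\n", "\n", "with tunnel_like(\"demo\") as h:\n", "    print(h)  # demo:open\n", "```\n"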
] }, { "cell_type": "code", "execution_count": null, "id": "stunning-turkish", "metadata": {}, "outputs": [], "source": [ "import time\n", "with en.G5kTunnel(address=monitoring.ui.address, port=3000) as (_, local_port, _):\n", " print(f\"The service is running at http://localhost:{local_port}\")\n", " time.sleep(60)\n", " " ] }, { "cell_type": "markdown", "id": "separated-briefing", "metadata": { "tags": [] }, "source": [ "## Cleaning" ] }, { "cell_type": "code", "execution_count": null, "id": "structured-motor", "metadata": {}, "outputs": [], "source": [ "provider.destroy()" ] }, { "cell_type": "code", "execution_count": null, "id": "f20a9f58-fc28-4485-ac68-2aa4e5e6d4b9", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "interpreter": { "hash": "c41aab4ca0eaec89556f08ac68d7d063aee1184b32505cfb99eab1b047dc078c" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" }, "toc-autonumbering": false, "toc-showcode": false, "toc-showmarkdowntxt": false }, "nbformat": 4, "nbformat_minor": 5 }