.. _performance_tuning:

******************
Performance tuning
******************

This page collects techniques for reducing the deployment time of large
experiments (i.e. experiments involving many nodes).

Increase Ansible parallelism
============================

.. note::

  This is only supported since Enoslib 8.1.0.  Previous versions had a
  fixed parallelism level of 5.

By default, Enoslib configures only 5 nodes at a time using parallel
connections.  Increasing this value speeds up large experiments, but
beware of a few side effects:

- your control node must have sufficient CPU and memory capacity to handle
  a large number of forks.  On Grid'5000, this usually means using a
  dedicated Grid'5000 node (physical server) as the control node for Enoslib.

- when using an SSH jump gateway (e.g. when running outside of Grid'5000),
  there is a limit to the number of parallel connections you can open
  simultaneously on the SSH gateway, usually around 10-15.

See :ref:`global_config` for more details on configuration settings.

.. literalinclude:: performance_tuning/vmong5k_forks.py
   :language: python
   :linenos:
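
As a rough illustration of the trade-offs listed above, the helper below
picks a fork count from the number of nodes, the CPU count of the control
node, and whether an SSH gateway is in the path.  The function name, the
``forks_per_core`` ratio and the gateway cap of 10 (the low end of the range
quoted above) are illustrative assumptions, not part of Enoslib.

.. code-block:: python

    import os


    def suggest_forks(n_nodes, via_ssh_gateway, cores=None,
                      forks_per_core=4, gateway_cap=10):
        """Return a fork count (hypothetical heuristic, not an Enoslib API)."""
        if cores is None:
            # Fall back to the CPU count of the machine running this script.
            cores = os.cpu_count() or 1
        # No point in having more forks than nodes, and the control node
        # can only sustain a limited number of forks per core (assumed ratio).
        forks = min(n_nodes, cores * forks_per_core)
        if via_ssh_gateway:
            # The gateway limits simultaneous connections (assumed cap).
            forks = min(forks, gateway_cap)
        return max(1, forks)

The returned value would then be passed to the forks setting described in
:ref:`global_config`.  Measure on your own control node before settling on
a ratio.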

Using a dedicated control node on Grid'5000
===========================================

While small experiments can be started from the Grid'5000 frontend
systems, keep in mind that they are shared systems, which means that you
could quickly saturate their CPU and memory: this will not only slow down
your experiment, but also slow down other users of the platform.  Thus,
using a dedicated control node is good practice for large Grid'5000
experiments.

Dedicated control node with a Jupyter notebook
----------------------------------------------

Using the `Grid'5000 Notebook interface
<https://intranet.grid5000.fr/notebooks/>`_, you can start a Jupyter
notebook that runs on a Grid'5000 node.  You can then run the control part
of your experiment in this notebook.

Make sure to select a walltime that is long enough, while complying with
the `Usage Policy
<https://www.grid5000.fr/w/Grid5000:UsagePolicy#Rules_for_the_default_queue>`_.
After this walltime is over, your Jupyter notebook will automatically be
closed.  If you need additional time, it is possible to `extend the
walltime of an existing job
<https://www.grid5000.fr/w/Advanced_OAR#Changing_the_walltime_of_a_running_job_.28oarwalltime.29>`_,
but you should still comply with the Usage Policy.

.. image:: performance_tuning/g5k-jupyter-control-node.png


Dedicated control node with an OAR job
--------------------------------------

Using OAR, it is possible to reserve a single control node on which your
experiment script will be run automatically.  Your experiment script will
then reserve additional nodes using Enoslib.

Connect to a Grid'5000 frontend using SSH.  It can be any frontend, but
you will obtain better deployment performance if the frontend is
geographically close to your nodes (on the same site or a nearby site).

Start by preparing a virtualenv with your desired version of Enoslib:

.. code-block:: shell

    nantes$ python3 -m venv enoslib-venv
    nantes$ source enoslib-venv/bin/activate
    (enoslib-venv) nantes$ pip install -U pip
    (enoslib-venv) nantes$ pip install -U 'enoslib>=8,<9'

Then, to submit a job with your experiment, use ``oarsub`` on the same frontend:

.. code-block:: shell

    $ oarsub -l walltime=0:45 "./enoslib-venv/bin/python my_short_experiment.py"
    OAR_JOB_ID=42424242

``oarsub`` returns the job ID immediately, but your job will be started
asynchronously. Once it is running, you can monitor the console output of
your experiment using the job ID:

.. code-block:: shell

    $ tail -F OAR.42424242.stdout

To submit a job while making sure that your experiment runs during the
night (see Usage Policy), with a walltime of up to 14 hours:

.. code-block:: shell

    $ oarsub -t night -l walltime=13:55 "./enoslib-venv/bin/python my_experiment.py"

If you want to start your experiment at a specific date and time, for
instance during a week-end:

.. code-block:: shell

    $ oarsub -r "2023-08-04 19:00" -l walltime=61:55 "./enoslib-venv/bin/python my_experiment.py"

If you really care about your deployment time, you can ask for a control node with
a minimum number of CPU cores and amount of RAM, using `OAR properties <https://www.grid5000.fr/w/Advanced_OAR#Selecting_resources_using_properties>`_:

.. code-block:: shell

    $ oarsub -p "core_count >= 20 AND memnode >= 64000" -l walltime=2:30 "./enoslib-venv/bin/python my_experiment.py"
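
If you submit many such jobs, it can help to assemble the command line in
Python so that quoting is handled consistently.  The sketch below only uses
the ``oarsub`` options shown above (``-t``, ``-r``, ``-p``, ``-l``); the
``build_oarsub`` helper and its parameter names are hypothetical.

.. code-block:: python

    import shlex


    def build_oarsub(command, walltime, properties=None, job_type=None,
                     start=None):
        """Assemble an ``oarsub`` command line (hypothetical helper)."""
        args = ["oarsub"]
        if job_type:
            args += ["-t", job_type]      # e.g. "night"
        if start:
            args += ["-r", start]         # e.g. "2023-08-04 19:00"
        if properties:
            args += ["-p", properties]    # OAR properties expression
        args += ["-l", f"walltime={walltime}", command]
        # shlex.join quotes every argument that contains spaces or
        # shell metacharacters.
        return shlex.join(args)


    print(build_oarsub(
        "./enoslib-venv/bin/python my_experiment.py",
        walltime="2:30",
        properties="core_count >= 20 AND memnode >= 64000",
    ))

The resulting string can be pasted into a shell on the frontend or run from
a script with ``subprocess.run(..., shell=True)``.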


Running large Grid'5000 experiments from your laptop
====================================================

In some cases, you might really need to run an experiment from your laptop
or from a machine that is outside of the Grid'5000 network.  In this case,
we suggest that you:

- set up the `Grid'5000 VPN <https://www.grid5000.fr/w/VPN>`_

- disable the automatic SSH jump host feature (see :ref:`global_config`)

- increase the number of forks (see :ref:`global_config`)

- enable Ansible pipelining (see below)


Ansible pipelining
==================

`Ansible pipelining
<https://docs.ansible.com/ansible/latest/reference_appendices/config.html#ansible-pipelining>`_
can make runs about 2x faster when performing several short actions in a
row.  However, it may be incompatible with ``become``, ``run_as``
and ``sudo``, so check that it works in your case.

To activate it, set the ``ANSIBLE_PIPELINING`` environment variable at the
start of your experiment code:

.. code-block:: python

    import os
    import enoslib as en

    os.environ["ANSIBLE_PIPELINING"] = "True"


Designing your experiment for batch actions
===========================================

If you have many small actions to run on your nodes, prefer using a
:py:class:`~enoslib.api.actions` block instead of individual
:py:func:`~enoslib.api.run_command` calls.

See :ref:`integration-with-ansible` for more details.
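
The difference can be sketched as follows.  This is a minimal illustration,
assuming that ``roles`` was obtained from a provider and that the listed
commands are placeholders.  Enoslib is imported inside the functions only so
that the snippet can be loaded without it.

.. code-block:: python

    # Placeholder commands; replace them with your own short actions.
    COMMANDS = ["date", "uname -r", "uptime"]


    def run_one_by_one(roles, commands=COMMANDS):
        """One Ansible run per command: each call pays the setup cost."""
        import enoslib as en

        return [en.run_command(cmd, roles=roles) for cmd in commands]


    def run_batched(roles, commands=COMMANDS):
        """A single Ansible run that contains all the commands."""
        import enoslib as en

        with en.actions(roles=roles) as a:
            for cmd in commands:
                a.shell(cmd)
        # Results are available once the block has exited.
        return a.results

With many nodes and many short commands, the batched variant avoids
repeating the per-run overhead for every command.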


Various Ansible tips and tricks
===============================

- Use fact caching: https://docs.ansible.com/ansible/latest/plugins/cache.html
- Tune the default execution strategy: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html
- Switch to a more efficient execution backend with mitogen: https://mitogen.networkgenomics.com/


Other performance improvement ideas
===================================

- Build a preconfigured image (application specific)