Fix too short title underline in cli.rst
jas-ho committed Aug 7, 2023
1 parent b5c7c19 commit e04dc76
Showing 1 changed file with 46 additions and 46 deletions.
92 changes: 46 additions & 46 deletions docs/getting-started/cli.rst
@@ -96,11 +96,11 @@ The ``seals:`` prefix ensures that the seals package is imported and the environ…
Train an expert and save the rollouts explicitly, then train a policy on the saved rollouts:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First, train an expert and save the demonstrations.
Note that by default these are saved in ``<log_dir>/rollouts/final.npz``,
where for this script ``<log_dir>`` defaults to ``output/train_rl/<environment>/<timestamp>``.
However, we can pass an explicit path as the logging directory.
By default, this will use ``ppo``.

.. code-block:: bash
@@ -115,13 +115,13 @@ Now we can run the imitation script (in this case DAgger) and pass the path to the demonstrations.
    pendulum \
    dagger.total_timesteps=2000 \
    demonstrations.source=local \
    demonstrations.path=output/train_rl/Pendulum-v1/my_run/rollouts/final.npz
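
The saved rollouts can also be inspected directly from Python. Below is a minimal sketch, assuming the ``final.npz`` path produced by the run above and that ``imitation.data.serialize.load`` is used to read saved trajectories:

.. code-block:: python

    from imitation.data import serialize

    # Path written by the train_rl run above (assumed to exist locally).
    rollout_path = "output/train_rl/Pendulum-v1/my_run/rollouts/final.npz"

    # serialize.load returns a sequence of Trajectory objects.
    trajectories = serialize.load(rollout_path)
    print(f"loaded {len(trajectories)} trajectories")
    print(f"first trajectory has {len(trajectories[0])} transitions")
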
Visualise saved policies
^^^^^^^^^^^^^^^^^^^^^^^^
We can use the ``eval_policy`` script to visualise and render a saved policy.
Here we are looking at the policy saved by the previous example.

.. code-block:: bash
@@ -130,13 +130,13 @@
    expert.loader_kwargs.path=output/train_rl/Pendulum-v1/my_run/policies/final/model.zip \
    environment.num_vec=1 \
    render=True \
    environment.gym_id='Pendulum-v1'
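
Roughly the same thing can be done from Python with ``stable-baselines3`` directly. This is only a sketch: it assumes the saved ``model.zip`` is an SB3 PPO policy and that a ``gymnasium``-based ``Pendulum-v1`` is available for human rendering.

.. code-block:: python

    import gymnasium as gym
    from stable_baselines3 import PPO

    # Policy checkpoint written by the train_rl run above (path assumed).
    model = PPO.load("output/train_rl/Pendulum-v1/my_run/policies/final/model.zip")

    env = gym.make("Pendulum-v1", render_mode="human")
    obs, _ = env.reset()
    for _ in range(500):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            obs, _ = env.reset()
    env.close()
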
Comparing algorithms' performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let's use the CLI to compare the performance of different algorithms.

First, let's train an expert on the ``CartPole-v1`` environment.

@@ -147,7 +147,7 @@
    logging.log_dir=output/train_rl/CartPole-v1/expert \
    total_timesteps=10000

Now let's train a weaker agent.

.. code-block:: bash
@@ -158,7 +158,7 @@
We can evaluate each policy using the ``eval_policy`` script.
For the expert:

.. code-block:: bash
@@ -169,24 +169,24 @@
    environment.num_vec=1 \
    logging.log_dir=output/eval_policy/CartPole-v1/expert

which will return something like

.. code-block:: bash

    INFO - eval_policy - Result: {
        'n_traj': 74,
        'monitor_return_len': 74,
        'return_min': 26.0,
        'return_mean': 154.21621621621622,
        'return_std': 79.94377589657559,
        'return_max': 500.0,
        'len_min': 26,
        'len_mean': 154.21621621621622,
        'len_std': 79.94377589657559,
        'len_max': 500,
        'monitor_return_min': 26.0,
        'monitor_return_mean': 154.21621621621622,
        'monitor_return_std': 79.94377589657559,
        'monitor_return_max': 500.0
    }
    INFO - eval_policy - Completed after 0:00:12
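
Similar summary statistics can be recomputed from the monitor CSV that this run writes under its ``logging.log_dir``. A sketch, assuming ``pandas`` is installed and that the monitor file is named ``mon000.monitor.csv`` as in the examples below:

.. code-block:: python

    import pandas as pd

    # Monitor CSV written by the eval_policy run above (file name assumed);
    # the first row is JSON metadata, so skip it.
    monitor = pd.read_csv(
        "output/eval_policy/CartPole-v1/expert/monitor/mon000.monitor.csv",
        skiprows=1,
    )

    returns = monitor["r"]  # per-episode returns
    lengths = monitor["l"]  # per-episode lengths
    print(f"n_traj={len(returns)}")
    print(f"return_mean={returns.mean():.2f}, return_max={returns.max():.1f}")
    print(f"len_min={lengths.min()}, len_max={lengths.max()}")
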
@@ -207,24 +207,24 @@ For the non-expert:
.. code-block:: bash

    INFO - eval_policy - Result: {
        'n_traj': 355,
        'monitor_return_len': 355,
        'return_min': 8.0,
        'return_mean': 28.92676056338028,
        'return_std': 15.686012049373561,
        'return_max': 104.0,
        'len_min': 8,
        'len_mean': 28.92676056338028,
        'len_std': 15.686012049373561,
        'len_max': 104,
        'monitor_return_min': 8.0,
        'monitor_return_mean': 28.92676056338028,
        'monitor_return_std': 15.686012049373561,
        'monitor_return_max': 104.0
    }
    INFO - eval_policy - Completed after 0:00:17

This will save the monitor CSVs (one for each vectorised env, controlled by ``environment.num_vec``).
We can load these with ``pandas`` and use the ``imitation.testing.reward_improvement``
module to compare the performance of the two policies.

@@ -233,22 +233,22 @@
    import pandas as pd

    from imitation.testing.reward_improvement import is_significant_reward_improvement

    # Episode statistics logged by the two train_rl runs; the first row of each
    # monitor file is JSON metadata, so skip it.
    expert_monitor = pd.read_csv(
        './output/train_rl/CartPole-v1/expert/monitor/mon000.monitor.csv',
        skiprows=1,
    )
    non_expert_monitor = pd.read_csv(
        './output/train_rl/CartPole-v1/non_expert/monitor/mon000.monitor.csv',
        skiprows=1,
    )

    # Test whether the expert's episode returns (the 'r' column) are a
    # significant improvement over the non-expert's at the 0.05 level.
    is_significant_reward_improvement(
        non_expert_monitor['r'],
        expert_monitor['r'],
        0.05,
    )
.. code-block:: bash

    True
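
Here ``is_significant_reward_improvement`` returns ``True`` because the expert's returns are a statistically significant improvement over the non-expert's at the 0.05 level. If you want a quick cross-check with a standard routine, a one-sided Mann-Whitney U test from SciPy can be sketched as below; note this is a generic nonparametric test, not necessarily the same procedure the ``imitation`` helper uses internally.

.. code-block:: python

    from scipy.stats import mannwhitneyu

    # One-sided test: are expert episode returns stochastically larger than
    # the non-expert's? Reuses the monitor DataFrames loaded above.
    stat, p_value = mannwhitneyu(
        expert_monitor['r'],
        non_expert_monitor['r'],
        alternative='greater',
    )
    print(p_value < 0.05)
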
Algorithm Scripts
