diff --git a/docs/getting-started/cli.rst b/docs/getting-started/cli.rst
index 616fbd0f7..3c942fe43 100644
--- a/docs/getting-started/cli.rst
+++ b/docs/getting-started/cli.rst
@@ -96,11 +96,11 @@ The ``seals:`` prefix ensures that the seals package is imported and the environ
 
 Train an expert and save the rollouts explicitly, then train a policy on the saved rollouts:
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-First, train an expert and save the demonstrations.
+First, train an expert and save the demonstrations. Note that by default these are saved in ``<log_dir>/rollouts/final.npz``.
 
-Where for this script by default ``<log_dir>`` is ``output/train_rl/<environment>/<timestamp>/`` . 
-However, we can pass an explicit path as logging directory. 
-By default, this will use ``ppo``. 
+Where for this script by default ``<log_dir>`` is ``output/train_rl/<environment>/<timestamp>/``.
+However, we can pass an explicit path as the logging directory.
+By default, this will use ``ppo``.
 
 .. code-block:: bash
 
@@ -115,13 +115,13 @@ Now we can run the imitation script (in this case DAgger) and pass the path to t
         pendulum \
         dagger.total_timesteps=2000 \
         demonstrations.source=local \
-        demonstrations.path=output/train_rl/Pendulum-v1/my_run/rollouts/final.npz 
+        demonstrations.path=output/train_rl/Pendulum-v1/my_run/rollouts/final.npz
 
 
 Visualise saved policies
-^^^^^^^^^^^^^^^^^^^^^^^^ 
-We can use the ``eval_policy`` script to visualise and render a saved policy. 
-Here we are looking at the policy saved by the previous example. 
+^^^^^^^^^^^^^^^^^^^^^^^^
+We can use the ``eval_policy`` script to visualise and render a saved policy.
+Here we are looking at the policy saved by the previous example.
 
 .. code-block:: bash
 
@@ -130,13 +130,13 @@ Here we are looking at the policy saved by the previous example.
         expert.loader_kwargs.path=output/train_rl/Pendulum-v1/my_run/policies/final/model.zip \
         environment.num_vec=1 \
         render=True \
-        environment.gym_id='Pendulum-v1' 
+        environment.gym_id='Pendulum-v1'
 
 
 Comparing algorithms' performance
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
-Let's use the CLI to compare the performance of different algorithms. 
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Let's use the CLI to compare the performance of different algorithms.
 
 First, let's train an expert on the ``CartPole-v1`` environment.
 
@@ -147,7 +147,7 @@ First, let's train an expert on the ``CartPole-v1`` environment.
         logging.log_dir=output/train_rl/CartPole-v1/expert \
         total_timesteps=10000
 
-Now let's train a weaker agent. 
+Now let's train a weaker agent.
 
 .. code-block:: bash
 
@@ -158,7 +158,7 @@ Now let's train a weaker agent.
 
 We can evaluate each policy using the ``eval_policy`` script.
 
-For the expert: 
+For the expert:
 
 .. code-block:: bash
 
@@ -169,24 +169,24 @@ For the expert:
         environment.num_vec=1 \
         logging.log_dir=output/eval_policy/CartPole-v1/expert
 
-which will return something like 
+which will return something like
 
 .. code-block:: bash
 
     INFO - eval_policy - Result: {
-        'n_traj': 74, 
-        'monitor_return_len': 74, 
-        'return_min': 26.0, 
-        'return_mean': 154.21621621621622, 
-        'return_std': 79.94377589657559, 
-        'return_max': 500.0, 
-        'len_min': 26, 
-        'len_mean': 154.21621621621622, 
-        'len_std': 79.94377589657559, 
-        'len_max': 500, 
-        'monitor_return_min': 26.0, 
-        'monitor_return_mean': 154.21621621621622, 
-        'monitor_return_std': 79.94377589657559, 
+        'n_traj': 74,
+        'monitor_return_len': 74,
+        'return_min': 26.0,
+        'return_mean': 154.21621621621622,
+        'return_std': 79.94377589657559,
+        'return_max': 500.0,
+        'len_min': 26,
+        'len_mean': 154.21621621621622,
+        'len_std': 79.94377589657559,
+        'len_max': 500,
+        'monitor_return_min': 26.0,
+        'monitor_return_mean': 154.21621621621622,
+        'monitor_return_std': 79.94377589657559,
         'monitor_return_max': 500.0
     }
     INFO - eval_policy - Completed after 0:00:12
@@ -207,24 +207,24 @@ For the non-expert:
 .. code-block:: bash
 
     INFO - eval_policy - Result: {
-        'n_traj': 355, 
-        'monitor_return_len': 355, 
-        'return_min': 8.0, 
-        'return_mean': 28.92676056338028, 
-        'return_std': 15.686012049373561, 
-        'return_max': 104.0, 
-        'len_min': 8, 
-        'len_mean': 28.92676056338028, 
-        'len_std': 15.686012049373561, 
-        'len_max': 104, 
-        'monitor_return_min': 8.0, 
-        'monitor_return_mean': 28.92676056338028, 
-        'monitor_return_std': 15.686012049373561, 
+        'n_traj': 355,
+        'monitor_return_len': 355,
+        'return_min': 8.0,
+        'return_mean': 28.92676056338028,
+        'return_std': 15.686012049373561,
+        'return_max': 104.0,
+        'len_min': 8,
+        'len_mean': 28.92676056338028,
+        'len_std': 15.686012049373561,
+        'len_max': 104,
+        'monitor_return_min': 8.0,
+        'monitor_return_mean': 28.92676056338028,
+        'monitor_return_std': 15.686012049373561,
         'monitor_return_max': 104.0
     }
     INFO - eval_policy - Completed after 0:00:17
 
-This will save the monitor CSVs (one for each vectorised env, controlled by environment.num_vec). 
+This will save the monitor CSVs (one for each vectorised env, controlled by ``environment.num_vec``).
 We can load these with ``pandas`` and use the ``imitation.test.reward_improvement``
 module to compare the performances of the two policies.
 
@@ -233,22 +233,22 @@ module to compare the performances of the two policies.
     from imitation.testing.reward_improvement import is_significant_reward_improvement
 
     expert_monitor = pd.read_csv(
-        './output/train_rl/CartPole-v1/expert/monitor/mon000.monitor.csv', 
+        './output/train_rl/CartPole-v1/expert/monitor/mon000.monitor.csv',
         skiprows=1
     )
     non_expert_monitor = pd.read_csv(
-        './output/train_rl/CartPole-v1/non_expert/monitor/mon000.monitor.csv', 
+        './output/train_rl/CartPole-v1/non_expert/monitor/mon000.monitor.csv',
         skiprows=1
     )
 
     is_significant_reward_improvement(
-        non_expert_monitor['r'], 
-        expert_monitor['r'], 
+        non_expert_monitor['r'],
+        expert_monitor['r'],
         0.05
     )
 
 .. code-block:: bash
 
-    True 
+    True
 
 Algorithm Scripts
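
For reference, here is a self-contained sketch of the comparison step from the final hunk. It adds the ``import pandas as pd`` that lies outside the hunk's context and wraps the result in a ``print``; otherwise it mirrors the hunk's code. It assumes the two ``CartPole-v1`` training runs above have produced monitor CSVs at the paths used in the snippet.

.. code-block:: python

    import pandas as pd

    from imitation.testing.reward_improvement import is_significant_reward_improvement

    # Each monitor CSV starts with a '#'-prefixed JSON metadata line written by the
    # stable-baselines3 Monitor wrapper, hence skiprows=1; the 'r' column holds
    # per-episode returns.
    expert_monitor = pd.read_csv(
        './output/train_rl/CartPole-v1/expert/monitor/mon000.monitor.csv',
        skiprows=1,
    )
    non_expert_monitor = pd.read_csv(
        './output/train_rl/CartPole-v1/non_expert/monitor/mon000.monitor.csv',
        skiprows=1,
    )

    # True when the expert's returns are a statistically significant improvement
    # over the non-expert's at significance level 0.05.
    print(
        is_significant_reward_improvement(
            non_expert_monitor['r'], expert_monitor['r'], 0.05
        )
    )

With the expert and the weaker agent trained above, this prints ``True``, matching the output shown in the final hunk.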