Implementation of the SQIL algorithm #744

Merged
merged 60 commits into master from redtachyon/740-sqil on Aug 10, 2023
Changes from 56 commits

Commits (60)
1935c99
Initial version of the SQIL implementation
RedTachyon Jul 4, 2023
2d4151e
Pin SB3 version to 1.7.0 (#738) (#745)
RedTachyon Jul 4, 2023
993a0d7
Another redundant type warning
RedTachyon Jul 4, 2023
899a5d8
Correctly set the expert rewards to 1
RedTachyon Jul 5, 2023
73064ac
Update typing, add some tests
RedTachyon Jul 6, 2023
b6c9d26
Update sqil.py
RedTachyon Jul 6, 2023
42d5468
Style fixes
RedTachyon Jul 6, 2023
86825d8
Test updates
RedTachyon Jul 6, 2023
95a2661
Add a test to check the buffer
RedTachyon Jul 6, 2023
67662b4
Formatting, docstring
RedTachyon Jul 6, 2023
68f693b
Improve test coverage
RedTachyon Jul 6, 2023
c4b0521
Update branch to master (#749)
RedTachyon Jul 6, 2023
1b5338b
Some documentation updates (not complete)
RedTachyon Jul 6, 2023
3c78336
Add a SQIL tutorial
RedTachyon Jul 6, 2023
c303af1
Reduce tutorial runtime
RedTachyon Jul 6, 2023
bf81940
Add SQIL description in docs, try to add it to the right places
RedTachyon Jul 6, 2023
0f95524
Merge branch 'master' into redtachyon/740-sqil
RedTachyon Jul 6, 2023
5da56f3
Fix docs
RedTachyon Jul 6, 2023
e410c39
Merge remote-tracking branch 'HCAI/redtachyon/740-sqil' into redtachy…
RedTachyon Jul 6, 2023
d8f3c30
Blacken a tutorial
RedTachyon Jul 6, 2023
ae43a75
Reorder things in docs
RedTachyon Jul 7, 2023
5b23f84
Change the SQIL structure to instead subclass the replay buffer, new …
RedTachyon Jul 7, 2023
bc8152b
Add an empty line
RedTachyon Jul 7, 2023
7d56e6a
Simplify the arguments
RedTachyon Jul 7, 2023
4e3f156
Cover another edge case, another test, fixes
RedTachyon Jul 7, 2023
d018cbd
Fix a circular import issue
RedTachyon Jul 7, 2023
29cdbfa
Add a performance test - might be slow?
RedTachyon Jul 7, 2023
551fa7e
Fix coverage
RedTachyon Jul 7, 2023
fcd94b9
Improve input validation
AdamGleave Jul 8, 2023
34ddf82
Bugfix: have set_demonstrations set rather than return
AdamGleave Jul 8, 2023
cf20fbb
Move TransitionMapping from algorithms.base to data.types
AdamGleave Jul 8, 2023
ee16818
Fix typo: expert_buffer->self.expert_buffer
AdamGleave Jul 8, 2023
87876aa
Bugfix: use safe_to_numpy rather than assuming th.Tensor
AdamGleave Jul 8, 2023
12e30b1
Fix lint
AdamGleave Jul 8, 2023
90a3a79
Fix unused imports
AdamGleave Jul 8, 2023
ef0fd26
Refactor tests
AdamGleave Jul 8, 2023
34241b2
Bump # of rollouts to try to fix MacOS flakiness
AdamGleave Jul 9, 2023
ed399d3
Merge branch 'master' into redtachyon/740-sqil
ernestum Jul 18, 2023
c8e9df8
Simplify SQIL example and tutorial by 1. downloading expert trajector…
ernestum Jul 18, 2023
e4e5d9f
Improve docstring of SQILReplayBuffer.
ernestum Jul 18, 2023
b89e5d8
Set the expert_buffer in the constructor.
ernestum Jul 18, 2023
c7723e5
Consistently set expert transition reward to 1 and learner transition…
ernestum Jul 18, 2023
e0bc16d
Fix docstring of SQILReplayBuffer.sample()
ernestum Jul 18, 2023
203c89f
Switch back to the CartPole-v1 environment in the SQIL examples
ernestum Jul 18, 2023
c149385
Only train for 1k steps in the SQIL example so the doctests don't run…
ernestum Jul 18, 2023
18a6622
Fix cell metadata for tutorial notebook.
ernestum Jul 18, 2023
9c5b91c
Notebook formatting fixes.
ernestum Jul 18, 2023
f8584c3
Fix typing error in SQIL implementation.
ernestum Jul 18, 2023
02f3191
Fix isort issue.
ernestum Jul 18, 2023
649de46
Clarify that our variant of the SQIL implementation is not really "so…
ernestum Jul 19, 2023
c72b088
Fix link in experts documentation.
ernestum Jul 19, 2023
8277a5c
Remove support for transition mappings.
ernestum Jul 19, 2023
a0af5c5
Remove data_loader from SQIL test cases.
ernestum Jul 20, 2023
4ccea30
Bump number of demonstrations in SQIL performance test to reduce flak…
ernestum Jul 21, 2023
68cbce8
Adapt hyperparameters in test_sqil_performance to reduce flakiness
jas-ho Aug 8, 2023
2bf467d
Fix seeds for flaky test_sqil_performance
jas-ho Aug 8, 2023
ccda686
Increase coverage in test_sqil.py
jas-ho Aug 8, 2023
91b226a
Pass kwargs to SQIL.train to DQN.learn
jas-ho Aug 9, 2023
5cbb6b2
Pass parameters as kwargs for multi-ary methods in sqil.py
jas-ho Aug 9, 2023
d2124a2
Make test for exceptions raised by SQIL constructor more specific
jas-ho Aug 9, 2023
1 change: 1 addition & 0 deletions README.md
@@ -17,6 +17,7 @@ Currently, we have implementations of the algorithms below. 'Discrete' and 'Cont
| [Adversarial Inverse Reinforcement Learning](https://arxiv.org/abs/1710.11248) | [`algorithms.airl`](https://imitation.readthedocs.io/en/latest/algorithms/airl.html) | ✅ | ✅ |
| [Generative Adversarial Imitation Learning](https://arxiv.org/abs/1606.03476) | [`algorithms.gail`](https://imitation.readthedocs.io/en/latest/algorithms/gail.html) | ✅ | ✅ |
| [Deep RL from Human Preferences](https://arxiv.org/abs/1706.03741) | [`algorithms.preference_comparisons`](https://imitation.readthedocs.io/en/latest/algorithms/preference_comparisons.html) | ✅ | ✅ |
| [Soft Q Imitation Learning](https://arxiv.org/abs/1905.11108) | [`algorithms.sqil`](https://imitation.readthedocs.io/en/latest/algorithms/sqil.html) | ✅ | ❌ |


You can find [the documentation here](https://imitation.readthedocs.io/en/latest/).
61 changes: 61 additions & 0 deletions docs/algorithms/sqil.rst
@@ -0,0 +1,61 @@
.. _soft q imitation learning docs:

================================
Soft Q Imitation Learning (SQIL)
================================

Soft Q Imitation Learning (SQIL) learns to imitate a policy from demonstrations
by running the DQN algorithm with modified rewards. During each policy update,
half of the batch is sampled from the demonstrations and half from the agent's
own interactions with the environment. Expert transitions are assigned a reward
of 1, while the agent's own transitions are assigned a reward of 0. This
encourages the policy to imitate the demonstrations and, at the same time, to
avoid states not seen in them.
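
The reward relabelling at the heart of SQIL can be sketched in a few lines of
plain Python. The snippet below is only an illustration of the idea, not the
code path used by this PR (where the relabelling lives in the replay-buffer
subclass ``SQILReplayBuffer``); the batch and key names are hypothetical
placeholders.

import numpy as np

def mix_sqil_batch(expert_batch, learner_batch):
    # Both batches are dicts of equally sized numpy arrays with keys such as
    # "obs", "acts", "next_obs", "dones" and "rews" (illustrative names only).
    expert = dict(expert_batch, rews=np.ones_like(expert_batch["rews"]))
    learner = dict(learner_batch, rews=np.zeros_like(learner_batch["rews"]))
    # Half of every DQN update comes from the expert, half from the learner.
    return {k: np.concatenate([expert[k], learner[k]]) for k in expert}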

.. note::

This implementation is based on the DQN implementation in Stable Baselines 3,
which does not implement soft Q-learning and therefore does not support
continuous actions. As a result, this implementation only supports discrete
action spaces, and the name "soft" Q-learning is somewhat misleading.
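
If you are unsure whether a given environment is compatible, a quick sanity
check of the action space (a sketch using standard Gym/Stable Baselines 3
attributes) looks like this:

import gym
from gym import spaces
from stable_baselines3.common.vec_env import DummyVecEnv

venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])
# SQIL is DQN-based here, so it requires a discrete action space.
assert isinstance(venv.action_space, spaces.Discrete)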

Example
=======

Detailed example notebook: :doc:`../tutorials/8_train_sqil`

.. testcode::
:skipif: skip_doctests

import datasets
import gym
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv

from imitation.algorithms import sqil
from imitation.data import huggingface_utils

# Download some expert trajectories from the HuggingFace Datasets Hub.
dataset = datasets.load_dataset("HumanCompatibleAI/ppo-CartPole-v1")
rollouts = huggingface_utils.TrajectoryDatasetSequence(dataset["train"])

sqil_trainer = sqil.SQIL(
venv=DummyVecEnv([lambda: gym.make("CartPole-v1")]),
demonstrations=rollouts,
policy="MlpPolicy",
)
# Hint: set to 1_000_000 to match the expert performance.
Reviewer note (Contributor): 100_000 was already sufficient to reach expert performance (tried only a couple of times though)

sqil_trainer.train(total_timesteps=1_000)
reward, _ = evaluate_policy(sqil_trainer.policy, sqil_trainer.venv, 10)
print("Reward:", reward)

.. testoutput::
:hide:

...
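
One detail worth noting from this PR: extra keyword arguments passed to
``SQIL.train`` are forwarded to ``DQN.learn`` (see commit 91b226a), so the
training call in the example above can also carry standard DQN options. A
small sketch (``log_interval`` is a ``DQN.learn`` parameter):

sqil_trainer.train(total_timesteps=1_000, log_interval=4)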

API
===
.. autoclass:: imitation.algorithms.sqil.SQIL
:members:
:noindex:
4 changes: 3 additions & 1 deletion docs/index.rst
@@ -71,6 +71,7 @@ If you use ``imitation`` in your research project, please cite our paper to help
algorithms/density
algorithms/mce_irl
algorithms/preference_comparisons
algorithms/sqil

.. toctree::
:maxdepth: 2
@@ -85,8 +86,9 @@ If you use ``imitation`` in your research project, please cite our paper to help
tutorials/5a_train_preference_comparisons_with_cnn
tutorials/6_train_mce
tutorials/7_train_density
tutorials/8_train_custom_env
tutorials/8_train_sqil
tutorials/9_compare_baselines
tutorials/10_train_custom_env

API Reference
~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion docs/main-concepts/experts.rst
@@ -12,7 +12,7 @@ learning library.
For example, BC and DAgger can learn from an expert policy and the command line
interface of AIRL/GAIL allows one to specify an expert to sample demonstrations from.

In the :doc:`../getting-started/first-steps` tutorial, we first train an expert policy
In the :doc:`../getting-started/first_steps` tutorial, we first train an expert policy
using the stable-baselines3 library and then imitate its behavior using
:doc:`../algorithms/bc`.
In practice, you may want to load a pre-trained policy for performance reasons.
docs/tutorials/10_train_custom_env.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"[download this notebook here](https://github.com/HumanCompatibleAI/imitation/blob/master/docs/tutorials/8_train_custom_env.ipynb)\n",
"[download this notebook here](https://github.com/HumanCompatibleAI/imitation/blob/master/docs/tutorials/10_train_custom_env.ipynb)\n",
"# Train Behavior Cloning in a Custom Environment\n",
"\n",
"You can use `imitation` to train a policy (and, for many imitation learning algorithm, learn rewards) in a custom environment.\n",
157 changes: 157 additions & 0 deletions docs/tutorials/8_train_sqil.ipynb
@@ -0,0 +1,157 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[download this notebook here](https://github.com/HumanCompatibleAI/imitation/blob/master/docs/tutorials/8_train_sqil.ipynb)\n",
"# Train an Agent using Soft Q Imitation Learning\n",
"\n",
"Soft Q Imitation Learning ([SQIL](https://arxiv.org/abs/1905.11108)) is a simple algorithm that can be used to clone expert behavior.\n",
"It's fundamentally a modification of the DQN algorithm. At each training step, whenever we sample a batch of data from the replay buffer,\n",
"we also sample a batch of expert data. Expert demonstrations are assigned a reward of 1, while the agent's own transitions are assigned a reward of 0.\n",
"This approach encourages the agent to imitate the expert's behavior, but also to avoid unfamiliar states.\n",
"\n",
"In this tutorial we will use the `imitation` library to train an agent using SQIL."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we need some expert trajectories in our environment (`seals/CartPole-v0`).\n",
"Note that you can use other environments, but the action space must be discrete for this algorithm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import datasets\n",
"from stable_baselines3.common.vec_env import DummyVecEnv\n",
"\n",
"from imitation.data import huggingface_utils\n",
"\n",
"# Download some expert trajectories from the HuggingFace Datasets Hub.\n",
"dataset = datasets.load_dataset(\"HumanCompatibleAI/ppo-CartPole-v1\")\n",
"\n",
"# Convert the dataset to a format usable by the imitation library.\n",
"expert_trajectories = huggingface_utils.TrajectoryDatasetSequence(dataset[\"train\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's quickly check if the expert is any good.\n",
"We usually should be able to reach a reward of 500, which is the maximum achievable value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from imitation.data import rollout\n",
"\n",
"trajectory_stats = rollout.rollout_stats(expert_trajectories)\n",
"\n",
"print(\n",
" f\"We have {trajectory_stats['n_traj']} trajectories.\"\n",
" f\"The average length of each trajectory is {trajectory_stats['len_mean']}.\"\n",
" f\"The average return of each trajectory is {trajectory_stats['return_mean']}.\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we collected our expert trajectories, it's time to set up our behavior cloning algorithm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from imitation.algorithms import sqil\n",
"import gym\n",
"\n",
"venv = DummyVecEnv([lambda: gym.make(\"CartPole-v1\")])\n",
"sqil_trainer = sqil.SQIL(\n",
" venv=venv,\n",
" demonstrations=expert_trajectories,\n",
" policy=\"MlpPolicy\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see the untrained policy only gets poor rewards:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from stable_baselines3.common.evaluation import evaluate_policy\n",
"\n",
"reward_before_training, _ = evaluate_policy(sqil_trainer.policy, venv, 10)\n",
"print(f\"Reward before training: {reward_before_training}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After training, we can match the rewards of the expert (500):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sqil_trainer.train(\n",
" total_timesteps=1_000,\n",
") # Note: set to 1_000_000 to obtain good results\n",
"reward_after_training, _ = evaluate_policy(sqil_trainer.policy, venv, 10)\n",
"print(f\"Reward after training: {reward_after_training}\")"
]
}
],
"metadata": {
"interpreter": {
"hash": "bd378ce8f53beae712f05342da42c6a7612fc68b19bea03b52c7b1cdc8851b5f"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
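
As a possible follow-up to the tutorial above (not part of this diff), the
trained policy can be rolled out manually through the standard Stable
Baselines 3 `predict` API. This is a sketch that assumes the `venv` and
`sqil_trainer` objects defined in the notebook:

# Interact with the environment using the policy trained in the notebook.
obs = venv.reset()
for _ in range(500):
    # `predict` is the standard SB3 inference call; it returns the action
    # and an (unused here) recurrent state.
    action, _state = sqil_trainer.policy.predict(obs, deterministic=True)
    obs, rewards, dones, infos = venv.step(action)
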
4 changes: 2 additions & 2 deletions src/imitation/algorithms/adversarial/common.py
@@ -98,8 +98,8 @@ class AdversarialTrainer(base.DemonstrationAlgorithm[types.Transitions]):
If `debug_use_ground_truth=True` was passed into the initializer then
`self.venv_train` is the same as `self.venv`."""

_demo_data_loader: Optional[Iterable[base.TransitionMapping]]
_endless_expert_iterator: Optional[Iterator[base.TransitionMapping]]
_demo_data_loader: Optional[Iterable[types.TransitionMapping]]
_endless_expert_iterator: Optional[Iterator[types.TransitionMapping]]

venv_wrapped: vec_env.VecEnvWrapper

11 changes: 4 additions & 7 deletions src/imitation/algorithms/base.py
@@ -13,8 +13,6 @@
cast,
)

import numpy as np
import torch as th
import torch.utils.data as th_data
from stable_baselines3.common import policies

@@ -123,11 +121,10 @@ def __setstate__(self, state):
self.logger = state.get("_logger") or imit_logger.configure()


TransitionMapping = Mapping[str, Union[np.ndarray, th.Tensor]]
TransitionKind = TypeVar("TransitionKind", bound=types.TransitionsMinimal)
AnyTransitions = Union[
Iterable[types.Trajectory],
Iterable[TransitionMapping],
Iterable[types.TransitionMapping],
types.TransitionsMinimal,
]

@@ -190,7 +187,7 @@ class _WrappedDataLoader:

def __init__(
self,
data_loader: Iterable[TransitionMapping],
data_loader: Iterable[types.TransitionMapping],
expected_batch_size: int,
):
"""Builds _WrappedDataLoader.
@@ -202,7 +199,7 @@ def __init__(
self.data_loader = data_loader
self.expected_batch_size = expected_batch_size

def __iter__(self) -> Iterator[TransitionMapping]:
def __iter__(self) -> Iterator[types.TransitionMapping]:
"""Yields data from `self.data_loader`, checking `self.expected_batch_size`.

Yields:
@@ -230,7 +227,7 @@ def make_data_loader(
transitions: AnyTransitions,
batch_size: int,
data_loader_kwargs: Optional[Mapping[str, Any]] = None,
) -> Iterable[TransitionMapping]:
) -> Iterable[types.TransitionMapping]:
"""Converts demonstration data to Torch data loader.

Args:
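
For context on the alias being moved in this file: per the definition removed
above, a `TransitionMapping` is just a `Mapping[str, Union[np.ndarray,
th.Tensor]]`. A minimal sketch of such a batch (key names and shapes are
illustrative, chosen to resemble a CartPole batch):

import numpy as np

# An illustrative TransitionMapping-style batch of 32 transitions.
batch = {
    "obs": np.zeros((32, 4), dtype=np.float32),       # observations
    "acts": np.zeros((32,), dtype=np.int64),          # discrete actions
    "next_obs": np.zeros((32, 4), dtype=np.float32),  # successor observations
    "dones": np.zeros((32,), dtype=bool),             # episode-termination flags
}
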
12 changes: 6 additions & 6 deletions src/imitation/algorithms/bc.py
@@ -38,7 +38,7 @@ class BatchIteratorWithEpochEndCallback:
Will throw an exception when an epoch contains no batches.
"""

batch_loader: Iterable[algo_base.TransitionMapping]
batch_loader: Iterable[types.TransitionMapping]
n_epochs: Optional[int]
n_batches: Optional[int]
on_epoch_end: Optional[Callable[[int], None]]
@@ -55,8 +55,8 @@ def __post_init__(self) -> None:
"Must provide exactly one of `n_epochs` and `n_batches` arguments.",
)

def __iter__(self) -> Iterator[algo_base.TransitionMapping]:
def batch_iterator() -> Iterator[algo_base.TransitionMapping]:
def __iter__(self) -> Iterator[types.TransitionMapping]:
def batch_iterator() -> Iterator[types.TransitionMapping]:

# Note: the islice here ensures we do not exceed self.n_epochs
for epoch_num in itertools.islice(itertools.count(), self.n_epochs):
@@ -143,8 +143,8 @@ def __call__(


def enumerate_batches(
batch_it: Iterable[algo_base.TransitionMapping],
) -> Iterable[Tuple[Tuple[int, int, int], algo_base.TransitionMapping]]:
batch_it: Iterable[types.TransitionMapping],
) -> Iterable[Tuple[Tuple[int, int, int], types.TransitionMapping]]:
"""Prepends batch stats before the batches of a batch iterator."""
num_samples_so_far = 0
for num_batches, batch in enumerate(batch_it):
@@ -308,7 +308,7 @@ def __init__(
parameter `l2_weight` instead), or if the batch size is not a multiple
of the minibatch size.
"""
self._demo_data_loader: Optional[Iterable[algo_base.TransitionMapping]] = None
self._demo_data_loader: Optional[Iterable[types.TransitionMapping]] = None
self.batch_size = batch_size
self.minibatch_size = minibatch_size or batch_size
if self.batch_size % self.minibatch_size != 0:
2 changes: 1 addition & 1 deletion src/imitation/algorithms/density.py
@@ -198,7 +198,7 @@ def set_demonstrations(self, demonstrations: base.AnyTransitions) -> None:
transitions.setdefault(i, []).append(flat_trans)
elif isinstance(first_item, Mapping):
# analogous to cast above.
demonstrations = cast(Iterable[base.TransitionMapping], demonstrations)
demonstrations = cast(Iterable[types.TransitionMapping], demonstrations)

for batch in demonstrations:
transitions.update(