
v0.1.1
philipdarke committed Mar 31, 2022
1 parent 031f839 commit 4fa5c1d
Showing 12 changed files with 937 additions and 311 deletions.
2 changes: 1 addition & 1 deletion .flake8
@@ -2,7 +2,7 @@
count = True
max-line-length = 88
max-complexity = 18
ignore = E121,E123,E126,E226,E24,E704,W503,W504,E203
extend-ignore = E203
include = '\.pyi?$'
exclude =
    .venv
8 changes: 7 additions & 1 deletion .github/workflows/build.yml
@@ -40,4 +40,10 @@ jobs:
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: docs/_build/html

      - name: Release if tag
        uses: docker://antonyurchenko/git-release:latest
        if: startsWith(github.ref, 'refs/tags/')
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          CHANGELOG_FILE: "CHANGELOG.md"
          ALLOW_EMPTY_CHANGELOG: "false"
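
Under this configuration, pushing a tag presumably triggers the release step, which publishes a GitHub release using the notes for that version in `CHANGELOG.md` (and, with `ALLOW_EMPTY_CHANGELOG: "false"`, fails if no matching entry is found).
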
34 changes: 34 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,34 @@
# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.1.1] - 2022-03-31

### Added

* Missing data simulation for UEA/UCR data sets
* Support appending missing data masks and time delta channels
* `packed_sequence` collate function
* Documentation now includes a tutorial
* Automated releases using GitHub Actions
* DOI

### Changed

* Simplified training/validation/test split approach
* Default file path for PhysioNet2019 data set is now `data/physionet2019`
* Refactored `torchtime.data` to share utility functions across data classes
* Expanded unit tests
* Updated documentation

## [0.1.0] - 2022-03-28

First release to PyPI

[Unreleased]: https://github.com/philipdarke/torchtime/compare/v0.1.1..HEAD
[0.1.1]: https://github.com/philipdarke/torchtime/compare/v0.1.0..v0.1.1
[0.1.0]: https://github.com/philipdarke/torchtime/releases/tag/v0.1.0
87 changes: 49 additions & 38 deletions README.md
@@ -1,16 +1,14 @@
# Time series data sets for PyTorch

![PyPi](https://img.shields.io/pypi/v/torchtime)
[![PyPi](https://img.shields.io/pypi/v/torchtime)](https://pypi.org/project/torchtime)
[![Build status](https://img.shields.io/github/workflow/status/philipdarke/torchtime/build.svg)](https://github.com/philipdarke/torchtime/actions/workflows/build.yml)
![Coverage](https://philipdarke.com/torchtime/assets/coverage-badge.svg)
[![License](https://img.shields.io/github/license/philipdarke/torchtime.svg)](https://github.com/philipdarke/torchtime/blob/main/LICENSE)

`torchtime` provides ready-to-go time series data sets for use in PyTorch. The current list of supported data sets is:

* All data sets in the UEA/UCR classification repository [[link]](https://www.timeseriesclassification.com/)
* PhysioNet Challenge 2019 [[link]](https://physionet.org/content/challenge-2019/1.0.0/)
* PhysioNet Challenge 2019 (early prediction of sepsis) [[link]](https://physionet.org/content/challenge-2019/1.0.0/)

The package follows the *batch first* convention. Data tensors are therefore of shape (*n*, *s*, *c*) where *n* is batch size, *s* is trajectory length and *c* is the number of channels.

## Installation

@@ -20,9 +18,9 @@ $ pip install torchtime

## Using `torchtime`

The example below uses the `torchtime.data.UEA` class. The data set is specified using the `dataset` argument (see the list [here](https://www.timeseriesclassification.com/dataset.php)). The `split` argument determines whether training, validation or test data are returned. The sizes of the splits are controlled with the `train_split` and `val_split` arguments.
The example below uses the `torchtime.data.UEA` class. The data set is specified using the `dataset` argument (see the list of data sets [here](https://www.timeseriesclassification.com/dataset.php)). The `split` argument determines whether training, validation or test data are returned. The sizes of the splits are controlled with the `train_split` and `val_split` arguments. Reproducibility is achieved using the `seed` argument.

For example, to load training data for the [ArrowHead](https://www.timeseriesclassification.com/description.php?Dataset=ArrowHead) data set with a 70% training, 20% validation and 10% testing split:
For example, to load training data for the [ArrowHead](https://www.timeseriesclassification.com/description.php?Dataset=ArrowHead) data set with a 70/30 training/validation split:

```
from torch.utils.data import DataLoader
@@ -32,72 +30,85 @@ arrowhead = UEA(
dataset="ArrowHead",
split="train",
train_split=0.7,
val_split=0.2,
seed=456789,
)
dataloader = DataLoader(arrowhead, batch_size=32)
```

Batches are dictionaries of tensors `X`, `y` and `length`. `X` are the time series data with an additional time stamp in the first channel, `y` are one-hot encoded labels and `length` is the length of each trajectory.
The DataLoader returns batches as a dictionary of tensors `X`, `y` and `length`. `X` are the time series data. By default, a time stamp is appended to the data as the first channel. This package follows the *batch first* convention, therefore `X` has shape (*n*, *s*, *c*) where *n* is batch size, *s* is trajectory length and *c* is the number of channels.

ArrowHead is a univariate time series with 251 observations in each trajectory. `X` therefore has two channels: the time stamp followed by the time series. A batch size of 32 was specified above, therefore `X` has shape (32, 251, 2).
ArrowHead is a univariate time series with 251 observations in each trajectory. `X` therefore has two channels: the time stamp followed by the time series.

```
>> next(iter(dataloader))["X"].shape
torch.Size([32, 251, 2])
>> next(iter(dataloader))["X"]
tensor([[[  0.0000, -1.8295],
         [  1.0000, -1.8238],
         [  2.0000, -1.8101],
tensor([[[  0.0000, -1.8302],
         [  1.0000, -1.8123],
         [  2.0000, -1.8122],
         ...,
         [248.0000, -1.7759],
         [249.0000, -1.8088],
         [250.0000, -1.8110]],
         [248.0000, -1.7821],
         [249.0000, -1.7971],
         [250.0000, -1.8280]],
        ...,
        [[  0.0000, -2.0147],
         [  1.0000, -2.0311],
         [  2.0000, -1.9471],
        [[  0.0000, -1.8392],
         [  1.0000, -1.8314],
         [  2.0000, -1.8125],
         ...,
         [248.0000, -1.9901],
         [249.0000, -1.9913],
         [250.0000, -2.0109]]])
         [248.0000, -1.8359],
         [249.0000, -1.8202],
         [250.0000, -1.8387]]])
```

There are three classes, therefore `y` has shape (32, 3).
Labels `y` are one-hot encoded and have shape (*n*, *l*) where *l* is the number of classes.

```
>> next(iter(dataloader))["y"].shape
torch.Size([32, 3])
>> next(iter(dataloader))["y"]
tensor([[0, 0, 1],
        [1, 0, 0],
        [1, 0, 0],
        ...,
        [1, 0, 0]])
```
```
>> next(iter(dataloader))["y"]
tensor([[0, 0, 1],
        ...,
        [0, 0, 1],
        [0, 1, 0],
        [1, 0, 0]])
```

Finally, `length` is the length of each trajectory (before any padding for data sets of irregular length) and therefore has shape (32).
The `length` of each trajectory (before padding if the data set is of irregular length) is provided as a tensor of shape (*n*).

```
>> next(iter(dataloader))["length"].shape
torch.Size([32])
>> next(iter(dataloader))["length"]
tensor([251, ..., 251])
```
```
>> next(iter(dataloader))["length"]
tensor([251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251,
        251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251,
        251, 251, 251, 251])
```

## Learn more

Other features include missing data simulation for UEA data sets. See the [API](api) for more information.
Missing data can be simulated using the `missing` argument. In addition, missing data/observational masks and time delta channels can be appended using the `mask` and `delta` arguments. See the [tutorial](https://philipdarke.com/torchtime/tutorial.html) and [API](https://philipdarke.com/torchtime/api.html) for more information.
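
As a minimal sketch of these arguments (the values below are arbitrary and the exact behaviour should be checked against the API), simulating 50% missing data and appending mask and time delta channels for the ArrowHead example above might look like:

```
from torch.utils.data import DataLoader
from torchtime.data import UEA

# Sketch only: assumed usage of the missing/mask/delta arguments
arrowhead = UEA(
    dataset="ArrowHead",
    split="train",
    train_split=0.7,
    missing=0.5,  # drop 50% of observations at random (assumed)
    mask=True,    # append missing data masks
    delta=True,   # append time delta channels
    seed=456789,
)
dataloader = DataLoader(arrowhead, batch_size=32)
```

Under these assumptions, `X` for univariate ArrowHead would have four channels: the time stamp, the series, its missing data mask and its time delta.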

This work is based on some of the data processing ideas in Kidger et al., 2020 [[1]](https://arxiv.org/abs/2005.08926) and Che et al., 2018 [[2]](https://doi.org/10.1038/s41598-018-24271-9).

## References

1. Kidger, P, Morrill, J, Foster, J, *et al*. Neural Controlled Differential Equations for Irregular Time Series. *arXiv* 2005.08926 (2020). [[arXiv]](https://arxiv.org/abs/2005.08926)

1. Che, Z, Purushotham, S, Cho, K, *et al*. Recurrent Neural Networks for Multivariate Time Series with Missing Values. *Sci Rep* 8, 6085 (2018). [[doi]](https://doi.org/10.1038/s41598-018-24271-9)

1. Reyna, M, Josef, C, Jeter, R, *et al*. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019. *Critical Care Medicine* 48 (2): 210-217 (2019). [[doi]](https://doi.org/10.1097/CCM.0000000000004145)

1. Reyna, M, Josef, C, Jeter, R, *et al*. Early Prediction of Sepsis from Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019 (version 1.0.0). *PhysioNet* (2019). [[doi]](https://doi.org/10.13026/v64v-d857)

1. Goldberger, A, Amaral, L, Glass, L, *et al*. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. *Circulation* 101 (23), pp. e215–e220 (2000). [[doi]](https://doi.org/10.1161/01.cir.101.23.e215)

## Funding

This work was supported by the Engineering and Physical Sciences Research Council, Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University (grant number EP/L015358/1).

## License

16 changes: 12 additions & 4 deletions docs/source/api.md
@@ -1,15 +1,23 @@
# API

## `torchtime.data`
## Time series data sets

* [PhysioNet2019](torchtime.data.PhysioNet2019)
* [UEA](torchtime.data.UEA)

```{eval-rst}
.. automodule:: torchtime.data
   :members:
```

## `torchtime.collate`
## Custom collate functions

Data sets of variable length can be efficiently represented in PyTorch using a [`PackedSequence`](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.PackedSequence.html) object. These are formed using [`pack_padded_sequence()`](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html#torch.nn.utils.rnn.pack_padded_sequence), which by default expects the batch to be sorted by decreasing length. This is handled by the [`sort_by_length()`](torchtime.collate.sort_by_length) collate function. Alternatively, a `PackedSequence` object can be formed directly using the [`packed_sequence()`](torchtime.collate.packed_sequence) collate function.

Custom collate functions should be passed to the `collate_fn` argument of a [DataLoader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader).
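
For illustration, a sketch of a DataLoader using the `packed_sequence()` collate function described above (data set arguments as in the README example):

```
from torch.utils.data import DataLoader
from torchtime.collate import packed_sequence
from torchtime.data import UEA

arrowhead = UEA(dataset="ArrowHead", split="train", train_split=0.7)

# Each batch is collated into PackedSequence form by packed_sequence()
dataloader = DataLoader(arrowhead, batch_size=32, collate_fn=packed_sequence)
```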

```{eval-rst}
.. automodule:: torchtime.collate
   :members:
```
2 changes: 2 additions & 0 deletions docs/source/index.md
@@ -4,6 +4,8 @@
```{eval-rst}
.. toctree::
   :hidden:
   :maxdepth: 2

   tutorial
   api
```