
v0.1.1
philipdarke committed Mar 31, 2022
1 parent 031f839 commit 4fa5c1d
Showing 12 changed files with 937 additions and 311 deletions.
2 changes: 1 addition & 1 deletion .flake8
@@ -2,7 +2,7 @@
count = True
max-line-length = 88
max-complexity = 18
ignore = E121,E123,E126,E226,E24,E704,W503,W504,E203
extend-ignore = E203
include = '\.pyi?$'
exclude =
    .venv
8 changes: 7 additions & 1 deletion .github/workflows/build.yml
@@ -40,4 +40,10 @@ jobs:
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: docs/_build/html

      - name: Release if tag
        uses: docker://antonyurchenko/git-release:latest
        if: startsWith(github.ref, 'refs/tags/')
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          CHANGELOG_FILE: "CHANGELOG.md"
          ALLOW_EMPTY_CHANGELOG: "false"
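
Under this configuration, pushing a tag presumably triggers the release step, which publishes a GitHub release using the notes for that version in `CHANGELOG.md` (and, with `ALLOW_EMPTY_CHANGELOG: "false"`, fails if no matching entry is found).
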
34 changes: 34 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,34 @@
# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.1.1] - 2022-03-31

### Added

* Missing data simulation for UEA/UCR data sets
* Support appending missing data masks and time delta channels
* `packed_sequence` collate function
* Documentation now includes a tutorial
* Automated releases using GitHub Actions
* DOI

### Changed

* Simplified training/validation/test split approach
* Default file path for PhysioNet2019 data set is now `data/physionet2019`
* Refactored `torchtime.data` to share utility functions across data classes
* Expanded unit tests
* Updated documentation

## [0.1.0] - 2022-03-28

First release to PyPI

[Unreleased]: https://github.com/philipdarke/torchtime/compare/v0.1.1..HEAD
[0.1.1]: https://github.com/philipdarke/torchtime/compare/v0.1.0..v0.1.1
[0.1.0]: https://github.com/philipdarke/torchtime/releases/tag/v0.1.0
87 changes: 49 additions & 38 deletions README.md
@@ -1,16 +1,14 @@
# Time series data sets for PyTorch

![PyPi](https://img.shields.io/pypi/v/torchtime)
[![PyPi](https://img.shields.io/pypi/v/torchtime)](https://pypi.org/project/torchtime)
[![Build status](https://img.shields.io/github/workflow/status/philipdarke/torchtime/build.svg)](https://github.com/philipdarke/torchtime/actions/workflows/build.yml)
![Coverage](https://philipdarke.com/torchtime/assets/coverage-badge.svg)
[![License](https://img.shields.io/github/license/philipdarke/torchtime.svg)](https://github.com/philipdarke/torchtime/blob/main/LICENSE)

`torchtime` provides ready-to-go time series data sets for use in PyTorch. The current list of supported data sets is:

* All data sets in the UEA/UCR classification repository [[link]](https://www.timeseriesclassification.com/)
* PhysioNet Challenge 2019 [[link]](https://physionet.org/content/challenge-2019/1.0.0/)
* PhysioNet Challenge 2019 (early prediction of sepsis) [[link]](https://physionet.org/content/challenge-2019/1.0.0/)

The package follows the *batch first* convention. Data tensors are therefore of shape (*n*, *s*, *c*) where *n* is batch size, *s* is trajectory length and *c* is the number of channels.

## Installation

@@ -20,9 +18,9 @@ $ pip install torchtime

## Using `torchtime`

The example below uses the `torchtime.data.UEA` class. The data set is specified using the `dataset` argument (see the list [here](https://www.timeseriesclassification.com/dataset.php)). The `split` argument determines whether training, validation or test data are returned. The sizes of the splits are controlled with the `train_split` and `val_split` arguments.
The example below uses the `torchtime.data.UEA` class. The data set is specified using the `dataset` argument (see the list of data sets [here](https://www.timeseriesclassification.com/dataset.php)). The `split` argument determines whether training, validation or test data are returned. The sizes of the splits are controlled with the `train_split` and `val_split` arguments. Reproducibility is achieved using the `seed` argument.

For example, to load training data for the [ArrowHead](https://www.timeseriesclassification.com/description.php?Dataset=ArrowHead) data set with a 70% training, 20% validation and 10% testing split:
For example, to load training data for the [ArrowHead](https://www.timeseriesclassification.com/description.php?Dataset=ArrowHead) data set with a 70/30 training/validation split:

```
from torch.utils.data import DataLoader
@@ -32,72 +30,85 @@ arrowhead = UEA(
dataset="ArrowHead",
split="train",
train_split=0.7,
val_split=0.2,
seed=456789,
)
dataloader = DataLoader(arrowhead, batch_size=32)
```

Batches are dictionaries of tensors `X`, `y` and `length`. `X` are the time series data with an additional time stamp in the first channel, `y` are one-hot encoded labels and `length` is the length of each trajectory.
The DataLoader returns batches as a dictionary of tensors `X`, `y` and `length`. `X` are the time series data. By default, a time stamp is appended to the data as the first channel. This package follows the *batch first* convention, therefore `X` has shape (*n*, *s*, *c*) where *n* is batch size, *s* is trajectory length and *c* is the number of channels.

ArrowHead is a univariate time series with 251 observations in each trajectory. `X` therefore has two channels: the time stamp followed by the time series. A batch size of 32 was specified above, therefore `X` has shape (32, 251, 2).
ArrowHead is a univariate time series with 251 observations in each trajectory. `X` therefore has two channels: the time stamp followed by the time series.

```
>> next(iter(dataloader))["X"].shape
torch.Size([32, 251, 2])
>> next(iter(dataloader))["X"]
tensor([[[  0.0000, -1.8295],
         [  1.0000, -1.8238],
         [  2.0000, -1.8101],
tensor([[[  0.0000, -1.8302],
         [  1.0000, -1.8123],
         [  2.0000, -1.8122],
         ...,
         [248.0000, -1.7759],
         [249.0000, -1.8088],
         [250.0000, -1.8110]],
         [248.0000, -1.7821],
         [249.0000, -1.7971],
         [250.0000, -1.8280]],
        ...,
        [[  0.0000, -2.0147],
         [  1.0000, -2.0311],
         [  2.0000, -1.9471],
        [[  0.0000, -1.8392],
         [  1.0000, -1.8314],
         [  2.0000, -1.8125],
         ...,
         [248.0000, -1.9901],
         [249.0000, -1.9913],
         [250.0000, -2.0109]]])
         [248.0000, -1.8359],
         [249.0000, -1.8202],
         [250.0000, -1.8387]]])
```

There are three classes, therefore `y` has shape (32, 3).
Labels `y` are one-hot encoded and have shape (*n*, *l*) where *l* is the number of classes.

```
>> next(iter(dataloader))["y"].shape
torch.Size([32, 3])
>> next(iter(dataloader))["y"]
tensor([[0, 0, 1],
        [1, 0, 0],
        [1, 0, 0],
        ...,
        [1, 0, 0]])
```
```
>> next(iter(dataloader))["y"]
tensor([[0, 0, 1],
        ...,
        [0, 0, 1],
        [0, 1, 0],
        [1, 0, 0]])
```

Finally, `length` is the length of each trajectory (before any padding for data sets of irregular length) and therefore has shape (32).
The `length` of each trajectory (before padding if the data set is of irregular length) is provided as a tensor of shape (*n*).

```
>> next(iter(dataloader))["length"].shape
torch.Size([32])
>> next(iter(dataloader))["length"]
tensor([251, ..., 251])
```
```
>> next(iter(dataloader))["length"]
tensor([251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251,
        251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251, 251,
        251, 251, 251, 251])
```

## Learn more

Other features include missing data simulation for UEA data sets. See the [API](api) for more information.
Missing data can be simulated using the `missing` argument. In addition, missing data/observational masks and time delta channels can be appended using the `mask` and `delta` arguments. See the [tutorial](https://philipdarke.com/torchtime/tutorial.html) and [API](https://philipdarke.com/torchtime/api.html) for more information.
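
As a minimal sketch of these arguments (the values below are arbitrary and the exact behaviour should be checked against the API), simulating 50% missing data and appending mask and time delta channels for the ArrowHead example above might look like:

```
from torch.utils.data import DataLoader
from torchtime.data import UEA

# Sketch only: assumed usage of the missing/mask/delta arguments
arrowhead = UEA(
    dataset="ArrowHead",
    split="train",
    train_split=0.7,
    missing=0.5,  # drop 50% of observations at random (assumed)
    mask=True,    # append missing data masks
    delta=True,   # append time delta channels
    seed=456789,
)
dataloader = DataLoader(arrowhead, batch_size=32)
```

Under these assumptions, `X` for univariate ArrowHead would have four channels: the time stamp, the series, its missing data mask and its time delta.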

This work is based on some of the data processing ideas in Kidger et al., 2020 [[1]](https://arxiv.org/abs/2005.08926) and Che et al., 2018 [[2]](https://doi.org/10.1038/s41598-018-24271-9).

## References

1. Kidger, P, Morrill, J, Foster, J, *et al*. Neural Controlled Differential Equations for Irregular Time Series. *arXiv* 2005.08926 (2020). [[arXiv]](https://arxiv.org/abs/2005.08926)

1. Che, Z, Purushotham, S, Cho, K, *et al*. Recurrent Neural Networks for Multivariate Time Series with Missing Values. *Sci Rep* 8, 6085 (2018). [[doi]](https://doi.org/10.1038/s41598-018-24271-9)

1. Reyna, M, Josef, C, Jeter, R, *et al*. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019. *Critical Care Medicine* 48 (2): 210-217 (2019). [[doi]](https://doi.org/10.1097/CCM.0000000000004145)

1. Reyna, M, Josef, C, Jeter, R, *et al*. Early Prediction of Sepsis from Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019 (version 1.0.0). *PhysioNet* (2019). [[doi]](https://doi.org/10.13026/v64v-d857)

1. Goldberger, A, Amaral, L, Glass, L, *et al*. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. *Circulation* 101 (23), pp. e215–e220 (2000). [[doi]](https://doi.org/10.1161/01.cir.101.23.e215)

## Funding

This work was supported by the Engineering and Physical Sciences Research Council, Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University (grant number EP/L015358/1).

## License

16 changes: 12 additions & 4 deletions docs/source/api.md
@@ -1,15 +1,23 @@
# API

## `torchtime.data`
## Time series data sets

* [PhysioNet2019](torchtime.data.PhysioNet2019)
* [UEA](torchtime.data.UEA)

```{eval-rst}
.. automodule:: torchtime.data
   :members:
```

## `torchtime.collate`
## Custom collate functions

Data sets of variable length can be efficiently represented in PyTorch using a [`PackedSequence`](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.PackedSequence.html) object. These are formed using [`pack_padded_sequence()`](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html#torch.nn.utils.rnn.pack_padded_sequence), which by default expects the batch to be sorted by decreasing length. This is handled by the [`sort_by_length()`](torchtime.collate.sort_by_length) collate function. Alternatively, a `PackedSequence` object can be formed directly using the [`packed_sequence()`](torchtime.collate.packed_sequence) collate function.

Custom collate functions should be passed to the `collate_fn` argument of a [DataLoader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader).
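
For illustration, a sketch of a DataLoader using the `packed_sequence()` collate function described above (data set arguments as in the README example):

```
from torch.utils.data import DataLoader
from torchtime.collate import packed_sequence
from torchtime.data import UEA

arrowhead = UEA(dataset="ArrowHead", split="train", train_split=0.7)

# Each batch is collated into PackedSequence form by packed_sequence()
dataloader = DataLoader(arrowhead, batch_size=32, collate_fn=packed_sequence)
```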

```{eval-rst}
.. automodule:: torchtime.collate
   :members:
```
2 changes: 2 additions & 0 deletions docs/source/index.md
@@ -4,6 +4,8 @@
```{eval-rst}
.. toctree::
   :hidden:
   :maxdepth: 2

   tutorial
   api
```