Standard set of data-loaders for training and making predictions for DNA sequence-based models.
All dataloaders in kipoiseq.dataloaders decorated with @kipoi_dataloader (SeqIntervalDl and StringSeqIntervalDl) are compatible with Kipoi models and can be used directly when specifying a new model in model.yaml:
...
default_dataloader:
  defined_as: kipoiseq.dataloaders.SeqIntervalDl
  default_args:
    auto_resize_len: 1000  # override default args in SeqIntervalDl

dependencies:
  pip:
    - kipoiseq
...
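To illustrate what `auto_resize_len` is for: a sequence model usually expects fixed-length input, so each interval in the BED file has to be brought to that length. The following is a minimal sketch, assuming the interval is resized around its center. The `resize_interval` helper is hypothetical and written for illustration; it is not kipoiseq's own code.

```python
def resize_interval(start, end, width):
    """Return (start, end) of a `width`-long interval sharing the original center.

    Hypothetical helper. Coordinates are 0-based and half-open, as in BED files.
    No clipping is done, so the new start can be negative near a chromosome start.
    """
    center = (start + end) // 2
    new_start = center - width // 2
    return new_start, new_start + width


# A 200 bp interval resized to 1000 bp around the same center
print(resize_interval(1000, 1200, 1000))  # (600, 1600)
```

Whatever the input interval length, the result always spans exactly `width` bases, which is the property a fixed-input model relies on.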
pip install kipoiseq
Optional dependencies:
pip install cyvcf2 pyranges
conda install cyvcf2 pyranges
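Because these packages are optional, it can help to check which of them are present before relying on features that need them. A small sketch using only the standard library; the `missing_modules` helper is a hypothetical name introduced here, not part of kipoiseq:

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the subset of top-level module `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# cyvcf2 and pyranges are the optional dependencies listed above
print(missing_modules(["cyvcf2", "pyranges"]))
```

An empty list means both packages are importable in the current environment.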
from kipoiseq.dataloaders import SeqIntervalDl

dl = SeqIntervalDl.init_example()  # use the provided example files

# or use your own files
dl = SeqIntervalDl("intervals.bed", "genome.fa")

len(dl)  # length of the dataset
dl[0]    # get one instance; returns a dictionary:
# dict(inputs=<one-hot-encoded-array>,
#      targets=<additional columns in the bed file>,
#      metadata=dict(ranges=GenomicRanges(chr=, start, end), ...))

data = dl.load_all()  # load the whole dataset

# load batches of data in parallel using 8 workers
it = dl.batch_iter(32, num_workers=8)
# yields dictionaries with all three keys: inputs, targets, metadata

it = dl.batch_train_iter(32, num_workers=8)
# yields tuples (inputs, targets); can be used directly with Keras' `model.fit_generator`
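The access patterns above can be mimicked with a toy in-memory dataset. This is a sketch under assumptions, not kipoiseq's implementation: `ToyDataset`, its fields, and the serial batching logic are hypothetical. It only illustrates how dictionary items with `inputs`, `targets`, and `metadata` keys relate to the (inputs, targets) tuples used for training.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataloader; items are dicts with three keys."""

    KEYS = ("inputs", "targets", "metadata")

    def __init__(self, sequences, targets):
        self.sequences = sequences
        self.targets = targets

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return {
            "inputs": self.sequences[idx],
            "targets": self.targets[idx],
            "metadata": {"index": idx},
        }

    def batch_iter(self, batch_size):
        """Yield dicts whose values are lists covering up to `batch_size` items."""
        for start in range(0, len(self), batch_size):
            stop = min(start + batch_size, len(self))
            items = [self[i] for i in range(start, stop)]
            yield {key: [item[key] for item in items] for key in self.KEYS}

    def batch_train_iter(self, batch_size):
        """Yield (inputs, targets) tuples, cycling over the data indefinitely."""
        while True:
            for batch in self.batch_iter(batch_size):
                yield batch["inputs"], batch["targets"]


ds = ToyDataset(["ACGT", "TTGA", "GGCC"], [1, 0, 1])
first_batch = next(ds.batch_iter(2))
print(first_batch["inputs"])  # ['ACGT', 'TTGA']
x, y = next(ds.batch_train_iter(2))
print(y)                      # [1, 0]
```

In this sketch the training iterator never stops on its own, which suits generator-based training loops that decide themselves how many steps make up an epoch.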
More info:
- Follow the getting-started colab notebook.
- See docs
- Read the PyTorch Data Loading and Processing Tutorial to become more familiar with transforms and dataloaders
- Read the code for SeqIntervalDl in kipoiseq/dataloaders/sequence.py
  - you can skip the @kipoi_dataloader decorator and the long yaml doc-string. These are only required if you want to use dataloaders in Kipoi's model.yaml files.
- Explore the available transforms (functional, class-based) or extractors (kipoiseq, genomelake)
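The difference between functional and class-based transforms can be shown with a standalone one-hot encoder. This is a hedged sketch, not the library's code: `encode_one_hot` and `OneHotTransform` are hypothetical names, and mapping unknown bases to all-zero rows is an assumption made here.

```python
DNA_ALPHABET = "ACGT"


def encode_one_hot(seq, alphabet=DNA_ALPHABET):
    """Functional transform: encode a sequence as a list of one-hot rows.

    Characters outside the alphabet (e.g. N) become all-zero rows; that is a
    choice made for this sketch.
    """
    return [[1 if base == letter else 0 for letter in alphabet] for base in seq.upper()]


class OneHotTransform:
    """Class-based transform: stores its configuration, then acts like a function."""

    def __init__(self, alphabet=DNA_ALPHABET):
        self.alphabet = alphabet

    def __call__(self, seq):
        return encode_one_hot(seq, self.alphabet)


encode = OneHotTransform()
print(encode("ACN"))  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
```

The class-based form is convenient when a transform is configured once and then passed around or chained, while the functional form is simpler for one-off calls.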