Implement a pytorch dataloader that filter and download at run time #39

rom1504 · 2021-09-16T18:18:25Z

this is an online version of #31
Combine the whole pipeline not as a big batch job, but instead as a data loader that

query/filter in a knn index + metadata structure
download
resize
give to training

It makes sense in particular when the model training speed is low. For example dalle is such a model.
For clip it could make less sense

it could be a lot more convenient than downloading TB of webdataset if it works:

download a 16GB knn index and 50GB of metadata
write your best keyword and how much of each you'd like (with clip thresholds)
start the training on up to 400M sample

rom1504 · 2021-09-25T13:37:07Z

related rom1504/img2dataset#56

I'm thinking of implementing the download+resize inside img2dataset since these features are already there.
I think to pass it to pytorch a good way would be to add a writer to img2dataset that would take as attribute a multiprocessing queue https://docs.python.org/3/library/multiprocessing.html#pipes-and-queues and then to use that queue in an iterable dataset https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset
Since queue is process and thread safe, it would work from the producer part (img2dataset produce from multiple processes), and from the consumer part (torch dataloader could apply any resizing/batching in different processes)

img2dataset would not need to depend on pytorch since implementing an iterable dataset only requires having a class with an __iter__ method

rom1504 · 2021-09-25T13:40:15Z

the filtering / retrieving from an index part would however make more sense to live here, so clip-retrieval could depend on img2dataset and use its UrlStreamingDataset to provide a FilteredUrlStreamingDataset

let's hope this can be made to work with the same speed at img2dataset (1300 sample/s)

rom1504 · 2022-02-11T09:14:57Z

rom1504/img2dataset#82

Could be interesting to investigate this path

img2dataset is a (multi instance per machine) rest service that takes as input a path towards an url shard, and return a path towards an image shard when it's done
clip inference is a (multiple instance per machine) rest service that takes as input a path towards an image shard and return a path towards an embedding shard when it's done
autofaiss is a (multiple instance per machine) rest service that takes as input a path towards an embedding file and return a path towards an index path when it's done

The img2dataset service can also expose a shard endpoint that takes as input some url, caption files and turn them into shard files.
The autofaiss service can also expose a train endpoint and a merge endpoint.
The clip inference service can also expose a combine endpoint to turn N embeddings file into one

Then all that is needed will be an orchestrator with a metadata database, that makes sure all the shards are properly done.

Benefits:

easy separation of concern
easy deployment of the services
easy scaling
easy to combine various features
provide both streaming and batch modes with one implementation
possible to use it only to get a few shards
simpler to test
logic in each service is limited, no need to redo the orchestration every time

To check:

how to plug that kind of thing into a spark job (run services in background task in each executor ?) if needed
https://medium.com/geekculture/how-to-execute-a-rest-api-call-on-apache-spark-the-right-way-in-python-4367f2740e78
https://stackoverflow.com/questions/59216604/how-to-call-a-web-service-called-from-a-spark-job
-> probably doesn't make a lot of sense
how to start/kill such services when you run some training code

rom1504 · 2022-02-26T15:31:57Z

new idea: rethink all these tools as dataflow/stream/transformers taking an input a collection and producing an output collection with optional caching and pressure push back

reader:

url/meta in parquet, csv,.. -> shards of url/meta
images in files, tar, parquet -> shards of image/meta
embeddings in npy, parquet -> shards of embeddings
indices in .index -> shards of indices

writer:

shards of url/meta -> url/meta in parquet, csv, ..
shards of image/meta -> images in files, tar, parquet
shards of embeddings -> embeddings in npy, parquet
shards of indices -> indices in .index

transformer:

shard of url/meta -> shards of image/meta
shards of image/meta -> shards of embeddings / meta
shards of embeddings / meta -> shards of indices

These bricks could then be naturally composed to form downloaders, inferences and indexers

defining good interfaces for each subtool then making each tool a separate package, well tested and with good examples

Check if https://docarray.jina.ai/fundamentals/documentarray/ could be helpful to build this

This new structure should make it possible to make all these tools both more powerful and more reusable

rom1504 · 2022-02-26T17:58:46Z

related https://github.com/webdataset/webdataset/blob/main/notebooks/openimages.ipynb

rom1504 · 2022-02-26T19:08:10Z

let's first try and check how to read in parallel a large file with fsspec

rom1504 · 2022-02-27T00:47:00Z

reading a large file with fsspec works by seeking and reading up to a length, it's much faster

rom1504 · 2022-02-27T00:47:16Z

next step will be implementing a clean embedding-reader package

rom1504 · 2022-02-27T00:47:33Z

independently I think that https://towardsdatascience.com/data-pipelines-with-apache-beam-86cd8eb55fd8 looks good

rom1504 pinned this issue Sep 19, 2021

rom1504 added the priority label Sep 26, 2021

rom1504 added important and removed priority labels Oct 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a pytorch dataloader that filter and download at run time #39

Implement a pytorch dataloader that filter and download at run time #39

rom1504 commented Sep 16, 2021 •

edited

Loading

rom1504 commented Sep 25, 2021 •

edited

Loading

rom1504 commented Sep 25, 2021

rom1504 commented Feb 11, 2022 •

edited

Loading

rom1504 commented Feb 26, 2022

rom1504 commented Feb 26, 2022

rom1504 commented Feb 26, 2022

rom1504 commented Feb 27, 2022

rom1504 commented Feb 27, 2022

rom1504 commented Feb 27, 2022

Implement a pytorch dataloader that filter and download at run time #39

Implement a pytorch dataloader that filter and download at run time #39

Comments

rom1504 commented Sep 16, 2021 • edited Loading

rom1504 commented Sep 25, 2021 • edited Loading

rom1504 commented Sep 25, 2021

rom1504 commented Feb 11, 2022 • edited Loading

rom1504 commented Feb 26, 2022

rom1504 commented Feb 26, 2022

rom1504 commented Feb 26, 2022

rom1504 commented Feb 27, 2022

rom1504 commented Feb 27, 2022

rom1504 commented Feb 27, 2022

rom1504 commented Sep 16, 2021 •

edited

Loading

rom1504 commented Sep 25, 2021 •

edited

Loading

rom1504 commented Feb 11, 2022 •

edited

Loading