
Add zstd as a compression option? #30

Open
mariusvniekerk opened this issue Feb 14, 2019 · 5 comments

Comments

@mariusvniekerk

Would there be interest in adding zstd to partd? At lowish compression levels I've found it to achieve better compression than snappy while running around twice as fast.

@mrocklin
Member

mrocklin commented Feb 14, 2019 via email

@martindurant
Member

ref: fsspec/filesystem_spec#69

@martindurant
Member

@mariusvniekerk , seems like you'd have to make the PR if you want this to progress :)

@wanx4910

wanx4910 commented Aug 2, 2021

ref: intake/filesystem_spec#69

Does this mean Zstandard is now supported in Dask? Also, is there any documentation on how I would use it? I have some zstandard-compressed files filled with chunks of msgpack-serialized data, and I would like to use Dask (multiprocessing) to speed up the reads, or to operate on the data without reading everything into memory.

@martindurant
Member

This repo is not really about data access, but about temporary storage for dask.
However, I can still answer your question. Caveat: msgpack is not a file format as far as I know, but you can treat the contents of a file as a msgpack binary stream.

You should be able to load a single file of your data like:

import fsspec
import msgpack

def readafile(fn):
    # fsspec decompresses the zstd stream transparently
    with fsspec.open(fn, mode="rb", compression="zstd") as f:
        return msgpack.load(f)

Then you could make a set of delayed tasks for dask to work on, one chunk per file, in parallel:

import dask

# one delayed task per file; nothing is read until compute() is called
output = [dask.delayed(readafile)(fn) for fn in filenames]
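Executing that delayed list could then look like the sketch below, with a trivial stand-in for readafile so it is self-contained; swap in the real reader and your actual file list:

```python
import dask

def readafile(fn):
    # stand-in for the zstd/msgpack reader above
    return fn.upper()

filenames = ["part-0.zst", "part-1.zst"]  # hypothetical paths
output = [dask.delayed(readafile)(fn) for fn in filenames]

# Run all reads in parallel; switch to scheduler="processes" if the
# decompression/deserialization work is CPU-bound.
results = dask.compute(*output, scheduler="threads")
```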

Now the question becomes: what do you want to do with this data?

By the way: zstd supports internal streams and blocks which can, in theory, provide near random access. Dask/fsspec does not support this, so you cannot read a single file chunk-wise using the method above. However, msgpack does support streaming object-by-object, so you could change the function to work that way (with much lower memory usage) if you intend to output just aggregated values from each file.
