
Add zstd as a compression option? #30

Open
mariusvniekerk opened this issue Feb 14, 2019 · 5 comments

Comments

@mariusvniekerk

Would there be interest in adding zstd to partd? At lowish compression levels I've found it to achieve better compression than snappy while running around twice as fast.

@mrocklin
Member

mrocklin commented Feb 14, 2019 via email

@martindurant
Member

ref: fsspec/filesystem_spec#69

@martindurant
Member

@mariusvniekerk , seems like you'd have to make the PR if you want this to progress :)

@wanx4910

wanx4910 commented Aug 2, 2021

ref: intake/filesystem_spec#69

Does this mean Zstandard is now supported in Dask? Also, is there any documentation on how I would use it? I have some zstandard-compressed files filled with chunks of msgpack-serialized data, and I would like to use Dask (multiprocessing) to speed up the reads, or to operate on the data without reading everything into memory.

@martindurant
Member

This repo is not really about data access, but about temporary storage for dask.
However, I can still answer your question. Caveat: msgpack is not a file format as far as I know, but you can treat the contents of a file as a msgpack binary stream.

You should be able to load a single file of your data like:

import fsspec
import msgpack

def readafile(fn):
    # fsspec decompresses the zstd stream transparently
    with fsspec.open(fn, mode="rb", compression="zstd") as f:
        return msgpack.load(f)

Then you could make a set of delayed tasks for dask to work on, one chunk per file, in parallel:

import dask

# one delayed task per file; nothing is read until compute() is called
output = [dask.delayed(readafile)(fn) for fn in filenames]
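Executing that delayed list could then look like the sketch below, with a trivial stand-in for readafile so it is self-contained; swap in the real reader and your actual file list:

```python
import dask

def readafile(fn):
    # stand-in for the zstd/msgpack reader above
    return fn.upper()

filenames = ["part-0.zst", "part-1.zst"]  # hypothetical paths
output = [dask.delayed(readafile)(fn) for fn in filenames]

# Run all reads in parallel; switch to scheduler="processes" if the
# decompression/deserialization work is CPU-bound.
results = dask.compute(*output, scheduler="threads")
```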

Now the question becomes: what do you want to do with this data?

By the way: zstd supports internal streams and blocks which can, in theory, provide near random access. Dask/fsspec does not support this, so you cannot read a single file chunk-wise using the method above. However, msgpack does support streaming object-by-object, so you could change the function to work that way (with much lower memory usage) if you intend to output just aggregated values from each file.
