read_object() with idx parameter is slow #29

Closed
lvarriano opened this issue Nov 1, 2023 · 14 comments

Labels: lh5 (HDF5 I/O), performance (Code performance)
@lvarriano (Contributor) commented Nov 1, 2023

Using the `idx` parameter in `read_object()` makes the read about 20x slower. From the lh5_store.py code, this comes from line 732, `nda = h5f[name][source_sel]`. I suspected this might be related to the number of discontinuities in the requested data, so I tested for that; the read time does appear to depend on it, though I don't understand the sudden jump in read time.

If, instead, the entire object is read into memory before the list indexing, i.e. `nda = h5f[name][...][source_sel]`, then the speed is about the same as reading the full dataset. Note that this line does not apply when `read_object()` is passed an `obj_buf` to read the data into, so some other changes would also be needed there.

We suspect this may be related to DataLoader's performance legend-exp/pygama#521 but it does not account for all of the slowdown. This seems to be a fundamental issue with HDF5 files (from the linked issue "It looks like this performance issue could be related to these: https://forum.hdfgroup.org/t/performance-reading-data-with-non-contiguous-selection/8979 and h5py/h5py#1597").
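The two access patterns can be compared directly in plain h5py, outside of lgdo (a minimal sketch; the file name, dataset name, and shape here are arbitrary, not taken from the real files):

```python
import numpy as np
import h5py

# in-memory file, so nothing is written to disk
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("raw", data=np.arange(200_000).reshape(2000, 100))
    idx = np.delete(np.arange(2000), [10, 500, 1500])  # a few discontinuities

    slow = dset[idx]        # fancy indexing resolved inside HDF5 (the line-732 path)
    fast = dset[...][idx]   # read everything, then index in numpy

# both paths return identical data; only the speed differs
assert np.array_equal(slow, fast)
```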

This is demonstrated below on a single channel in a single file.

import lgdo.lh5_store as lh5
import numpy as np
import time

raw_file = 'l200-p03-r003-phy-20230406T045650Z-tier_raw.lh5'
out_file = 'test.lh5'

test_channel = 'ch1104002'

numtests = 10

store = lh5.LH5Store()

times_readall = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file)
    end = time.time()
    times_readall.append(end-start)

print(f"read entire: {np.mean(times_readall):.5f} +-{np.std(times_readall):.5f} s")

totind = len(chobj)
allind = np.arange(totind)

times_readallidx = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=allind)
    end = time.time()
    times_readallidx.append(end-start)

print(f"read by idx, all {len(allind)} rows: {np.mean(times_readallidx):.5f} +-{np.std(times_readallidx):.5f} s")

numbreaks = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500, 1000])

for j in range(len(numbreaks)):
    breaks = np.round(np.linspace(0, len(allind) - 1, numbreaks[j])).astype(int)
    theseind = np.delete(allind, breaks)
    times_readsome = []
    for i in range(numtests):
        start = time.time()
        chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=theseind)
        end = time.time()
        times_readsome.append(end-start)
        
    print(f"read by idx, {len(theseind)} rows to read with {numbreaks[j]} breaks: {np.mean(times_readsome):.5f} +- {np.std(times_readsome):.5f} s")

This prints:

read entire: 0.04197 +-0.00789 s
read by idx, all 1739 rows: 0.07562 +-0.04448 s
read by idx, 1738 rows to read with 1 breaks: 0.05652 +-0.00268 s
read by idx, 1737 rows to read with 2 breaks: 0.05645 +-0.00218 s
read by idx, 1734 rows to read with 5 breaks: 1.24324 +-0.03775 s
read by idx, 1729 rows to read with 10 breaks: 1.24398 +-0.07181 s
read by idx, 1719 rows to read with 20 breaks: 1.21239 +-0.03777 s
read by idx, 1689 rows to read with 50 breaks: 1.15172 +-0.00964 s
read by idx, 1639 rows to read with 100 breaks: 1.15134 +-0.02681 s
read by idx, 1539 rows to read with 200 breaks: 1.10985 +-0.00872 s
read by idx, 1239 rows to read with 500 breaks: 1.03938 +-0.07005 s
read by idx, 739 rows to read with 1000 breaks: 0.67653 +-0.03128 s

If line 732 is replaced as above (with `nda = h5f[name][...][source_sel]`) to read the whole object before list indexing, this prints instead:

read entire: 0.03629 +-0.00453 s
read by idx, all 1739 rows: 0.05294 +-0.04059 s
read by idx, 1738 rows to read with 1 breaks: 0.03781 +- 0.00088 s
read by idx, 1737 rows to read with 2 breaks: 0.03913 +- 0.00252 s
read by idx, 1734 rows to read with 5 breaks: 0.03769 +- 0.00095 s
read by idx, 1729 rows to read with 10 breaks: 0.03798 +- 0.00092 s
read by idx, 1719 rows to read with 20 breaks: 0.03732 +- 0.00080 s
read by idx, 1689 rows to read with 50 breaks: 0.03786 +- 0.00083 s
read by idx, 1639 rows to read with 100 breaks: 0.03777 +- 0.00092 s
read by idx, 1539 rows to read with 200 breaks: 0.03791 +- 0.00078 s
read by idx, 1239 rows to read with 500 breaks: 0.03743 +- 0.00121 s
read by idx, 739 rows to read with 1000 breaks: 0.03606 +- 0.00078 s
@lvarriano (Contributor, Author)
I suggest we add a warning to the documentation if we do not want to address this in the code (and possibly increase the amount of memory required to load files). The user can always load the entire object and then slice it themselves.

@gipert added the 'performance (Code performance)' and 'lh5 (HDF5 I/O)' labels Nov 6, 2023
@iguinn (Contributor) commented Nov 6, 2023

One thing that probably has an effect here is that when data is read from disk, it tends to be read in contiguous 4 KiB (4096-byte) pages. This means that if the spacing between requested entries is less than one page, the full file is still read. The waveforms are big enough that they should be spread across pages, while the other fields probably fit in only one or two pages for the full file. So a couple of other things might be worth checking:

  1. What if you just look at waveforms?
  2. What if you look at a very large file and create much larger gaps?

@gipert (Member) commented Nov 6, 2023

Did we ever think carefully about how to chunk datasets on disk? This is also relevant for HDF5 compression.

https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage

@lvarriano (Contributor, Author)
> 1. What if you just look at waveforms?

It is the same slowdown. Loading just the waveforms (using the same phy file) prints the following.

read entire: 0.02775 +-0.01266 s
read by idx, all 1739 rows: 0.02262 +-0.00170 s
read by idx, 1738 rows to read with 1 breaks: 0.02238 +- 0.00227 s
read by idx, 1737 rows to read with 2 breaks: 0.02199 +- 0.00125 s
read by idx, 1734 rows to read with 5 breaks: 1.82485 +- 0.02439 s
read by idx, 1729 rows to read with 10 breaks: 1.96193 +- 0.13411 s
read by idx, 1719 rows to read with 20 breaks: 1.82573 +- 0.01488 s
read by idx, 1689 rows to read with 50 breaks: 1.79066 +- 0.02706 s
read by idx, 1639 rows to read with 100 breaks: 1.76084 +- 0.07120 s
read by idx, 1539 rows to read with 200 breaks: 1.62478 +- 0.01129 s
read by idx, 1239 rows to read with 500 breaks: 1.33234 +- 0.00841 s
read by idx, 739 rows to read with 1000 breaks: 0.81090 +- 0.00506 s

For reference, I changed lines in the script above to e.g.

chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=theseind, field_mask=['waveform'])
> 2. What if you look at a very large file and create much larger gaps?

I used a cal file ('l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5') that has 8235 events. The slowdown is actually worse, roughly 30-40x, and it gets even worse as the gaps grow larger (up to 80x!). Keep in mind that I am reading only a single channel. The code below prints:

read entire: 0.19241 +- 0.00773 s
read by idx, all 8235 rows: 0.29639 +- 0.00475 s
read by idx, 8234 rows to read with 1 breaks of size 1: 0.29881 +- 0.00724 s
read by idx, 8233 rows to read with 2 breaks of size 1: 0.30981 +- 0.02371 s
read by idx, 8230 rows to read with 5 breaks of size 1: 8.84480 +- 0.16213 s
read by idx, 8225 rows to read with 10 breaks of size 1: 8.78127 +- 0.10855 s
read by idx, 8215 rows to read with 20 breaks of size 1: 9.05398 +- 0.36159 s
read by idx, 8225 rows to read with 1 breaks of size 10: 8.91650 +- 0.21048 s
read by idx, 8215 rows to read with 2 breaks of size 10: 9.19139 +- 0.42725 s
read by idx, 8185 rows to read with 5 breaks of size 10: 9.54173 +- 0.49595 s
read by idx, 8135 rows to read with 10 breaks of size 10: 9.45252 +- 0.59045 s
read by idx, 8035 rows to read with 20 breaks of size 10: 8.92942 +- 0.03804 s
read by idx, 8135 rows to read with 1 breaks of size 100: 9.00981 +- 0.20564 s
read by idx, 8035 rows to read with 2 breaks of size 100: 9.06565 +- 0.15981 s
read by idx, 7735 rows to read with 5 breaks of size 100: 10.00159 +- 0.59773 s
read by idx, 7235 rows to read with 10 breaks of size 100: 11.66310 +- 0.24250 s
read by idx, 6235 rows to read with 20 breaks of size 100: 15.15527 +- 0.22353 s
read by idx, 7235 rows to read with 1 breaks of size 1000: 12.01743 +- 0.36342 s
read by idx, 6235 rows to read with 2 breaks of size 1000: 15.81197 +- 0.61867 s
read by idx, 3235 rows to read with 5 breaks of size 1000: 15.53589 +- 0.40457 s
read by idx, 0 rows to read with 10 breaks of size 1000: 0.04306 +- 0.00211 s
read by idx, 0 rows to read with 20 breaks of size 1000: 0.04304 +- 0.00054 s

Again, changing line 732 in lh5_store.py restores the performance.

read entire: 0.17909 +- 0.00186 s
read by idx, all 8235 rows: 0.25028 +- 0.09423 s
read by idx, 8234 rows to read with 1 breaks of size 1: 0.21351 +- 0.00270 s
read by idx, 8233 rows to read with 2 breaks of size 1: 0.21347 +- 0.00163 s
read by idx, 8230 rows to read with 5 breaks of size 1: 0.21279 +- 0.00247 s
read by idx, 8225 rows to read with 10 breaks of size 1: 0.21245 +- 0.00310 s
read by idx, 8215 rows to read with 20 breaks of size 1: 0.21244 +- 0.00192 s
read by idx, 8225 rows to read with 1 breaks of size 10: 0.21133 +- 0.00238 s
read by idx, 8215 rows to read with 2 breaks of size 10: 0.21408 +- 0.00512 s
read by idx, 8185 rows to read with 5 breaks of size 10: 0.21166 +- 0.00131 s
read by idx, 8135 rows to read with 10 breaks of size 10: 0.21438 +- 0.00354 s
read by idx, 8035 rows to read with 20 breaks of size 10: 0.21336 +- 0.00366 s
read by idx, 8135 rows to read with 1 breaks of size 100: 0.21335 +- 0.00283 s
read by idx, 8035 rows to read with 2 breaks of size 100: 0.21340 +- 0.00229 s
read by idx, 7735 rows to read with 5 breaks of size 100: 0.21315 +- 0.00322 s
read by idx, 7235 rows to read with 10 breaks of size 100: 0.20953 +- 0.00246 s
read by idx, 6235 rows to read with 20 breaks of size 100: 0.20589 +- 0.00156 s
read by idx, 7235 rows to read with 1 breaks of size 1000: 0.21116 +- 0.00428 s
read by idx, 6235 rows to read with 2 breaks of size 1000: 0.20491 +- 0.00171 s
read by idx, 3235 rows to read with 5 breaks of size 1000: 0.19387 +- 0.00156 s
read by idx, 0 rows to read with 10 breaks of size 1000: 0.18268 +- 0.00153 s
read by idx, 0 rows to read with 20 breaks of size 1000: 0.18392 +- 0.00217 s
import lgdo.lh5_store as lh5
import numpy as np
import time

raw_file = 'l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5'

test_channel = 'ch1104002'

numtests = 10

store = lh5.LH5Store()

times_readall = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file)
    end = time.time()
    times_readall.append(end-start)

print(f"read entire: {np.mean(times_readall):.5f} +- {np.std(times_readall):.5f} s")

totind = len(chobj)
allind = np.arange(totind)

times_readallidx = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=allind)
    end = time.time()
    times_readallidx.append(end-start)

print(f"read by idx, all {len(allind)} rows: {np.mean(times_readallidx):.5f} +- {np.std(times_readallidx):.5f} s")

numbreaks = np.array([1, 2, 5, 10, 20])
breaksizes = np.array([1, 10, 100, 1000])

for k in range(len(breaksizes)):
    for j in range(len(numbreaks)):
        breaks = np.round(np.linspace(0, len(allind) - 1, breaksizes[k]*numbreaks[j])).astype(int)
        theseind = np.delete(allind, breaks)
        times_readsome = []
        for i in range(numtests):
            start = time.time()
            chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=theseind)
            end = time.time()
            times_readsome.append(end-start)
            
        print(f"read by idx, {len(theseind)} rows to read with {numbreaks[j]} breaks of size {breaksizes[k]}: {np.mean(times_readsome):.5f} +- {np.std(times_readsome):.5f} s")

@lvarriano (Contributor, Author)
> I suggest we add a warning to the documentation if we do not want to address this in the code and possibly increase the amount of memory required to load files. The user can always load the entire object and then perform a slice themselves.

I take this back: I think we should change lh5_store.py to read the whole object in. In most (all?) cases, the user reads the data channel by channel in some loop (I'm not familiar enough to know if it can be done differently), so the amount of memory used at any single point to load all the data is actually quite small.

@gipert (Member) commented Nov 7, 2023

From the h5py docs:

> An HDF5 dataset created with the default settings will be contiguous.

And we are not setting any chunking option!

ds = group.create_dataset(
    name, data=nda, maxshape=maxshape, **comp_kwargs
)

(maxshape is set to the actual full object length above, correct me if I am wrong). So I don't expect any performance/memory gain in indexed reading, compared to reading the full dataset in and then indexing. I don't understand how it can be slower though. Later in the docs section:

> Chunking has performance implications. It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.

So we should definitely do something about it (even if turning on HDF5 compression should automatically do it? But how?). Continuing:

> Since picking a chunk shape can be confusing, you can have h5py guess a chunk shape for you:
> dset = f.create_dataset("autochunk", (1000, 1000), chunks=True)

Should we use this feature?
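For reference, the chunk shape h5py picks can be inspected after creation through the dataset's .chunks attribute (a minimal sketch; the file and dataset names are arbitrary). Note that a resizable maxshape forces chunking even without chunks=True, which is exactly our situation:

```python
import numpy as np
import h5py

# in-memory file, nothing written to disk
with h5py.File("chunk_test.h5", "w", driver="core", backing_store=False) as f:
    # chunks=True asks h5py to guess a chunk shape ("autochunking")
    auto = f.create_dataset("autochunk", (1000, 1000), chunks=True)
    # a maxshape containing None (resizable) implicitly forces chunking too
    grow = f.create_dataset("growable", data=np.zeros((100, 10)),
                            maxshape=(None, 10))
    auto_chunks, grow_chunks = auto.chunks, grow.chunks

print("autochunk chunks:", auto_chunks)
print("growable chunks:", grow_chunks)
```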

@oschulz might also have an opinion about all this.

@oschulz commented Nov 7, 2023

We're doing something on the Julia side to give the user control over chunking, but it's not really finished yet (legend-exp/LegendHDF5IO.jl#35).

In general, non-chunked files are faster to read, since memory-mapping can be used, which will be good for ML applications. For "standard" data storage, chunking makes it possible to write files bit by bit, so the writer doesn't need to have a large amount of memory available. Also, as you point out, compression (I think) always implies chunking in HDF5.

@gipert (Member) commented Nov 7, 2023

So I was wrong. We do always set maxshape in order to be able to append to datasets (?):

maxshape = (None,) + nda.shape[1:]

so (as per the h5py docs) all our datasets are chunked. I don't know how they are chunked (I could not find documentation). Maybe the "autochunking" feature is used? I still have no clue what that means.

@gipert (Member) commented Nov 7, 2023

Note that this implies: it's impossible to write contiguous datasets to disk with legend-pydataobj at the moment.

@iguinn (Contributor) commented Nov 8, 2023

Here's a Stack Overflow question that was answered by one of the h5py developers: https://stackoverflow.com/questions/21766145/h5py-correct-way-to-slice-array-datasets. Apparently fancy indexing is not directly supported in HDF5 (unlike slicing, which is), so h5py rolled its own implementation, which the developer admits is very slow for more than ~1000 elements ("abysmal" is his word of choice). So I think this supports the idea that we may want to accept higher memory usage and do fancy indexing through numpy instead.

If we really wanted to optimize this, we could try to implement it with chunks in mind: look at our index list, figure out which indices fall in each chunk, grab those from the chunk, and then move on to the next one. This would limit the extra memory burden to the (smaller) chunk size. It would also let us skip chunks that contain no indices from our fancy index, which could speed up sparse data reading (the waveform browser could potentially benefit from this). Of course, this would be a lot of extra work, so for now let's just read in the whole group and then assess whether this would be worth it.
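The chunk-aware scheme described above might look something like this numpy-only illustration (a sketch, not a proposed patch; a numpy array stands in for the h5py dataset, and the chunk size is arbitrary):

```python
import numpy as np

def chunked_fancy_index(dataset, idx, chunk_rows=256):
    """Gather the rows listed in sorted `idx`, one chunk at a time.

    `dataset` is anything sliceable along axis 0. Only chunks that
    actually contain requested rows are read, and at most one chunk is
    held in memory beyond the output buffer.
    """
    idx = np.asarray(idx)
    out = np.empty((len(idx),) + dataset.shape[1:], dtype=dataset.dtype)
    chunk_ids = idx // chunk_rows          # which chunk each row falls into
    pos = 0
    for cid in np.unique(chunk_ids):
        rows = idx[chunk_ids == cid]       # requested rows in this chunk
        start = cid * chunk_rows
        chunk = dataset[start:start + chunk_rows]  # one contiguous chunk read
        out[pos:pos + len(rows)] = chunk[rows - start]
        pos += len(rows)
    return out

data = np.arange(5000).reshape(1000, 5)
idx = np.array([3, 4, 10, 700, 701, 999])
assert np.array_equal(chunked_fancy_index(data, idx), data[idx])
```

Chunks with no requested rows are never sliced at all, which is where the sparse-read speedup would come from.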

@jasondet (Contributor) commented Nov 8, 2023

Looks like this behavior is even properly documented; I just ignored it, or didn't worry about it, when we implemented this long ago.

Louis's fix seems to handle all use cases with just a moderate burden from memory reallocation, and it's way better than what we have. I propose we just accept his fix for now and come back to this when/if it becomes the bottleneck again.

@gipert (Member) commented Nov 8, 2023

What a pity! I agree we clearly need Louis' workaround at this point. I'm curious to see the effect on DataLoader.load().

@lvarriano (Contributor, Author)
For interest/posterity, here's a comparison of the new implementation and the previous one (accessed via the new use_h5idx flag). The script is below and uses the memory-profiler package. This reads a single channel from 3 cal raw files on my laptop, which has an NVMe SSD and an ext4 file system (in case those are relevant for future comparisons). The partial reads are a random selection of 5000 rows. The memory penalty is ~130 MB, the size of the channel object in this cal file. I don't understand all of the spikes in the new implementation; there seem to be one or two extras compared to what I would expect. The speed-up for the partial reads is ~50x in this example.

There's currently a 2x speed penalty when the user passes an idx corresponding to all events, compared to just reading everything without any idx. In the future, we could add a comparison or some other way to detect this case; for now, it is up to the user. Maybe this is relevant for the DataLoader?

[plot: compare_idxreads, memory profile of new vs. original implementation]

import lgdo.lh5_store as lh5
import numpy as np

raw_file = 'l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5'
test_channel = 'ch1104002'

store = lh5.LH5Store()

chobj, _ = store.read_object(test_channel+'/raw', raw_file)
allind = np.arange(len(chobj))
theseind = np.sort(np.random.choice(allind, size=5000, replace=False))
del chobj

@profile
def new_readall(files):
    chobj, _ = store.read_object(test_channel+'/raw', files)

@profile
def orig_readall(files): 
    chobj, _ = store.read_object(test_channel+'/raw', files, use_h5idx=True)

@profile
def new_readallidx(files, idxs):   
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs)  

@profile
def orig_readallidx(files, idxs): 
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs, use_h5idx=True)

@profile
def new_readsomeidx(files, idxs):   
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs)

@profile
def orig_readsomeidx(files, idxs):   
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs, use_h5idx=True)

if __name__ == '__main__':
    new_readall([raw_file, raw_file, raw_file])
    orig_readall([raw_file, raw_file, raw_file])
    new_readallidx([raw_file, raw_file, raw_file], [allind, allind, allind])
    orig_readallidx([raw_file, raw_file, raw_file], [allind, allind, allind])
    new_readsomeidx([raw_file, raw_file, raw_file], [theseind, theseind, theseind])
    orig_readsomeidx([raw_file, raw_file, raw_file], [theseind, theseind, theseind])

@lvarriano (Contributor, Author)
In pull request #35, Luigi gave me the idea to convert idx to a slice when possible (i.e. when the elements of idx all increase by 1). This restores some speed if the user happens to pass an idx corresponding to all rows: now there is a ~30% speed penalty instead of 2x.
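The conversion amounts to a simple contiguity check (a minimal sketch, not the exact code merged in #35; the helper name is made up):

```python
import numpy as np

def idx_to_slice(idx):
    """Return an equivalent slice if idx is contiguous, else None."""
    idx = np.asarray(idx)
    # contiguous means idx is exactly arange(first, last + 1)
    if len(idx) > 0 and np.array_equal(idx, np.arange(idx[0], idx[-1] + 1)):
        return slice(int(idx[0]), int(idx[-1]) + 1)
    return None

assert idx_to_slice(np.arange(5, 12)) == slice(5, 12)   # contiguous -> slice
assert idx_to_slice(np.array([1, 3, 4])) is None        # gap -> keep fancy idx
```

Slicing goes through HDF5's fast hyperslab path, so whenever this returns a slice the expensive fancy-indexing machinery is skipped entirely.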

[plot: compare_idxreads2, memory profile with the idx-to-slice conversion applied]
