read_object() with idx parameter is slow #29

Closed
lvarriano opened this issue Nov 1, 2023 · 14 comments

Labels: lh5 (HDF5 I/O), performance (Code performance)
@lvarriano (Contributor) commented Nov 1, 2023

Using the `idx` parameter in `read_object()` makes the read about 20x slower. From the lh5_store.py code, this comes from line 732, `nda = h5f[name][source_sel]`. I suspected this might be related to the number of discontinuities in the requested data, so I tested for that; the read time does appear to depend on it, though I don't understand the sudden jump in read time.

If, instead, the entire object is read into memory before the list indexing, i.e. `nda = h5f[name][...][source_sel]`, then the speed is about the same as reading the full dataset. Note that this line does not apply when `read_object()` is passed an `obj_buf` to read the data into, so some other changes would also be needed there.

We suspect this may be related to DataLoader's performance legend-exp/pygama#521 but it does not account for all of the slowdown. This seems to be a fundamental issue with HDF5 files (from the linked issue "It looks like this performance issue could be related to these: https://forum.hdfgroup.org/t/performance-reading-data-with-non-contiguous-selection/8979 and h5py/h5py#1597").
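The two access patterns can be compared directly in plain h5py, outside of lgdo (a minimal sketch; the file name, dataset name, and shape here are arbitrary, not taken from the real files):

```python
import numpy as np
import h5py

# in-memory file, so nothing is written to disk
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("raw", data=np.arange(200_000).reshape(2000, 100))
    idx = np.delete(np.arange(2000), [10, 500, 1500])  # a few discontinuities

    slow = dset[idx]        # fancy indexing resolved inside HDF5 (the line-732 path)
    fast = dset[...][idx]   # read everything, then index in numpy

# both paths return identical data; only the speed differs
assert np.array_equal(slow, fast)
```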

This is demonstrated below on a single channel in a single file.

import lgdo.lh5_store as lh5
import numpy as np
import time

raw_file = 'l200-p03-r003-phy-20230406T045650Z-tier_raw.lh5'
out_file = 'test.lh5'

test_channel = 'ch1104002'

numtests = 10

store = lh5.LH5Store()

times_readall = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file)
    end = time.time()
    times_readall.append(end-start)

print(f"read entire: {np.mean(times_readall):.5f} +-{np.std(times_readall):.5f} s")

totind = len(chobj)
allind = np.arange(totind)

times_readallidx = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=allind)
    end = time.time()
    times_readallidx.append(end-start)

print(f"read by idx, all {len(allind)} rows: {np.mean(times_readallidx):.5f} +-{np.std(times_readallidx):.5f} s")

numbreaks = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500, 1000])

for j in range(len(numbreaks)):
    breaks = np.round(np.linspace(0, len(allind) - 1, numbreaks[j])).astype(int)
    theseind = np.delete(allind, breaks)
    times_readsome = []
    for i in range(numtests):
        start = time.time()
        chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=theseind)
        end = time.time()
        times_readsome.append(end-start)
        
    print(f"read by idx, {len(theseind)} rows to read with {numbreaks[j]} breaks: {np.mean(times_readsome):.5f} +- {np.std(times_readsome):.5f} s")

This prints:

read entire: 0.04197 +-0.00789 s
read by idx, all 1739 rows: 0.07562 +-0.04448 s
read by idx, 1738 rows to read with 1 breaks: 0.05652 +-0.00268 s
read by idx, 1737 rows to read with 2 breaks: 0.05645 +-0.00218 s
read by idx, 1734 rows to read with 5 breaks: 1.24324 +-0.03775 s
read by idx, 1729 rows to read with 10 breaks: 1.24398 +-0.07181 s
read by idx, 1719 rows to read with 20 breaks: 1.21239 +-0.03777 s
read by idx, 1689 rows to read with 50 breaks: 1.15172 +-0.00964 s
read by idx, 1639 rows to read with 100 breaks: 1.15134 +-0.02681 s
read by idx, 1539 rows to read with 200 breaks: 1.10985 +-0.00872 s
read by idx, 1239 rows to read with 500 breaks: 1.03938 +-0.07005 s
read by idx, 739 rows to read with 1000 breaks: 0.67653 +-0.03128 s

If line 732 is replaced as above (with `nda = h5f[name][...][source_sel]`) to read the whole object before list indexing, this prints instead:

read entire: 0.03629 +-0.00453 s
read by idx, all 1739 rows: 0.05294 +-0.04059 s
read by idx, 1738 rows to read with 1 breaks: 0.03781 +- 0.00088 s
read by idx, 1737 rows to read with 2 breaks: 0.03913 +- 0.00252 s
read by idx, 1734 rows to read with 5 breaks: 0.03769 +- 0.00095 s
read by idx, 1729 rows to read with 10 breaks: 0.03798 +- 0.00092 s
read by idx, 1719 rows to read with 20 breaks: 0.03732 +- 0.00080 s
read by idx, 1689 rows to read with 50 breaks: 0.03786 +- 0.00083 s
read by idx, 1639 rows to read with 100 breaks: 0.03777 +- 0.00092 s
read by idx, 1539 rows to read with 200 breaks: 0.03791 +- 0.00078 s
read by idx, 1239 rows to read with 500 breaks: 0.03743 +- 0.00121 s
read by idx, 739 rows to read with 1000 breaks: 0.03606 +- 0.00078 s
@lvarriano (Contributor, Author)
I suggest we add a warning to the documentation if we do not want to address this in the code (and possibly increase the amount of memory required to load files). The user can always load the entire object and then slice it themselves.

@gipert added the 'performance (Code performance)' and 'lh5 (HDF5 I/O)' labels Nov 6, 2023
@iguinn (Contributor) commented Nov 6, 2023

One thing that probably has an effect here is that when data is read from disk, it tends to be read in contiguous 4 KiB (4096-byte) pages. This means that if the spacing between requested entries is less than one page, the full file is still read. The waveforms are big enough that they should be spread across pages, while the other fields probably fit in only one or two pages for the full file. So a couple of other things might be worth checking:

  1. What if you just look at waveforms?
  2. What if you look at a very large file and create much larger gaps?

@gipert (Member) commented Nov 6, 2023

Did we ever think carefully about how to chunk datasets on disk? This is also relevant for HDF5 compression.

https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage

@lvarriano (Contributor, Author)
> 1. What if you just look at waveforms?

It is the same slowdown. Loading just the waveforms (using the same phy file) prints the following.

read entire: 0.02775 +-0.01266 s
read by idx, all 1739 rows: 0.02262 +-0.00170 s
read by idx, 1738 rows to read with 1 breaks: 0.02238 +- 0.00227 s
read by idx, 1737 rows to read with 2 breaks: 0.02199 +- 0.00125 s
read by idx, 1734 rows to read with 5 breaks: 1.82485 +- 0.02439 s
read by idx, 1729 rows to read with 10 breaks: 1.96193 +- 0.13411 s
read by idx, 1719 rows to read with 20 breaks: 1.82573 +- 0.01488 s
read by idx, 1689 rows to read with 50 breaks: 1.79066 +- 0.02706 s
read by idx, 1639 rows to read with 100 breaks: 1.76084 +- 0.07120 s
read by idx, 1539 rows to read with 200 breaks: 1.62478 +- 0.01129 s
read by idx, 1239 rows to read with 500 breaks: 1.33234 +- 0.00841 s
read by idx, 739 rows to read with 1000 breaks: 0.81090 +- 0.00506 s

For reference, I changed lines in the script above to e.g.

chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=theseind, field_mask=['waveform'])
> 2. What if you look at a very large file and create much larger gaps?

I used a cal file ('l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5') that has 8235 events. The slowdown is actually worse, roughly 30-40x, and it gets even worse as the gaps grow larger (up to 80x!). Keep in mind that I am reading only a single channel. The code below prints:

read entire: 0.19241 +- 0.00773 s
read by idx, all 8235 rows: 0.29639 +- 0.00475 s
read by idx, 8234 rows to read with 1 breaks of size 1: 0.29881 +- 0.00724 s
read by idx, 8233 rows to read with 2 breaks of size 1: 0.30981 +- 0.02371 s
read by idx, 8230 rows to read with 5 breaks of size 1: 8.84480 +- 0.16213 s
read by idx, 8225 rows to read with 10 breaks of size 1: 8.78127 +- 0.10855 s
read by idx, 8215 rows to read with 20 breaks of size 1: 9.05398 +- 0.36159 s
read by idx, 8225 rows to read with 1 breaks of size 10: 8.91650 +- 0.21048 s
read by idx, 8215 rows to read with 2 breaks of size 10: 9.19139 +- 0.42725 s
read by idx, 8185 rows to read with 5 breaks of size 10: 9.54173 +- 0.49595 s
read by idx, 8135 rows to read with 10 breaks of size 10: 9.45252 +- 0.59045 s
read by idx, 8035 rows to read with 20 breaks of size 10: 8.92942 +- 0.03804 s
read by idx, 8135 rows to read with 1 breaks of size 100: 9.00981 +- 0.20564 s
read by idx, 8035 rows to read with 2 breaks of size 100: 9.06565 +- 0.15981 s
read by idx, 7735 rows to read with 5 breaks of size 100: 10.00159 +- 0.59773 s
read by idx, 7235 rows to read with 10 breaks of size 100: 11.66310 +- 0.24250 s
read by idx, 6235 rows to read with 20 breaks of size 100: 15.15527 +- 0.22353 s
read by idx, 7235 rows to read with 1 breaks of size 1000: 12.01743 +- 0.36342 s
read by idx, 6235 rows to read with 2 breaks of size 1000: 15.81197 +- 0.61867 s
read by idx, 3235 rows to read with 5 breaks of size 1000: 15.53589 +- 0.40457 s
read by idx, 0 rows to read with 10 breaks of size 1000: 0.04306 +- 0.00211 s
read by idx, 0 rows to read with 20 breaks of size 1000: 0.04304 +- 0.00054 s

Again, changing line 732 in lh5_store.py restores the performance.

read entire: 0.17909 +- 0.00186 s
read by idx, all 8235 rows: 0.25028 +- 0.09423 s
read by idx, 8234 rows to read with 1 breaks of size 1: 0.21351 +- 0.00270 s
read by idx, 8233 rows to read with 2 breaks of size 1: 0.21347 +- 0.00163 s
read by idx, 8230 rows to read with 5 breaks of size 1: 0.21279 +- 0.00247 s
read by idx, 8225 rows to read with 10 breaks of size 1: 0.21245 +- 0.00310 s
read by idx, 8215 rows to read with 20 breaks of size 1: 0.21244 +- 0.00192 s
read by idx, 8225 rows to read with 1 breaks of size 10: 0.21133 +- 0.00238 s
read by idx, 8215 rows to read with 2 breaks of size 10: 0.21408 +- 0.00512 s
read by idx, 8185 rows to read with 5 breaks of size 10: 0.21166 +- 0.00131 s
read by idx, 8135 rows to read with 10 breaks of size 10: 0.21438 +- 0.00354 s
read by idx, 8035 rows to read with 20 breaks of size 10: 0.21336 +- 0.00366 s
read by idx, 8135 rows to read with 1 breaks of size 100: 0.21335 +- 0.00283 s
read by idx, 8035 rows to read with 2 breaks of size 100: 0.21340 +- 0.00229 s
read by idx, 7735 rows to read with 5 breaks of size 100: 0.21315 +- 0.00322 s
read by idx, 7235 rows to read with 10 breaks of size 100: 0.20953 +- 0.00246 s
read by idx, 6235 rows to read with 20 breaks of size 100: 0.20589 +- 0.00156 s
read by idx, 7235 rows to read with 1 breaks of size 1000: 0.21116 +- 0.00428 s
read by idx, 6235 rows to read with 2 breaks of size 1000: 0.20491 +- 0.00171 s
read by idx, 3235 rows to read with 5 breaks of size 1000: 0.19387 +- 0.00156 s
read by idx, 0 rows to read with 10 breaks of size 1000: 0.18268 +- 0.00153 s
read by idx, 0 rows to read with 20 breaks of size 1000: 0.18392 +- 0.00217 s
import lgdo.lh5_store as lh5
import numpy as np
import time

raw_file = 'l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5'

test_channel = 'ch1104002'

numtests = 10

store = lh5.LH5Store()

times_readall = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file)
    end = time.time()
    times_readall.append(end-start)

print(f"read entire: {np.mean(times_readall):.5f} +- {np.std(times_readall):.5f} s")

totind = len(chobj)
allind = np.arange(totind)

times_readallidx = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=allind)
    end = time.time()
    times_readallidx.append(end-start)

print(f"read by idx, all {len(allind)} rows: {np.mean(times_readallidx):.5f} +- {np.std(times_readallidx):.5f} s")

numbreaks = np.array([1, 2, 5, 10, 20])
breaksizes = np.array([1, 10, 100, 1000])

for k in range(len(breaksizes)):
    for j in range(len(numbreaks)):
        breaks = np.round(np.linspace(0, len(allind) - 1, breaksizes[k]*numbreaks[j])).astype(int)
        theseind = np.delete(allind, breaks)
        times_readsome = []
        for i in range(numtests):
            start = time.time()
            chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=theseind)
            end = time.time()
            times_readsome.append(end-start)
            
        print(f"read by idx, {len(theseind)} rows to read with {numbreaks[j]} breaks of size {breaksizes[k]}: {np.mean(times_readsome):.5f} +- {np.std(times_readsome):.5f} s")

@lvarriano (Contributor, Author)
> I suggest we add a warning to the documentation if we do not want to address this in the code and possibly increase the amount of memory required to load files. The user can always load the entire object and then perform a slice themselves.

I take this back: I think we should change lh5_store.py to read the whole object in. In most (all?) cases, the user reads the data channel by channel in some loop (I'm not familiar enough to know if it can be done differently), so the amount of memory used at any single point to load all the data is actually quite small.

@gipert (Member) commented Nov 7, 2023

From the h5py docs:

> An HDF5 dataset created with the default settings will be contiguous.

And we are not setting any chunking option!

ds = group.create_dataset(
    name, data=nda, maxshape=maxshape, **comp_kwargs
)

(maxshape is set to the actual full object length above, correct me if I am wrong). So I don't expect any performance/memory gain in indexed reading, compared to reading the full dataset in and then indexing. I don't understand how it can be slower though. Later in the docs section:

> Chunking has performance implications. It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.

So we should definitely do something about it (even if turning on HDF5 compression should automatically do it? But how?). Continuing:

> Since picking a chunk shape can be confusing, you can have h5py guess a chunk shape for you:
> dset = f.create_dataset("autochunk", (1000, 1000), chunks=True)

Should we use this feature?
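For reference, the chunk shape h5py picks can be inspected after creation through the dataset's .chunks attribute (a minimal sketch; the file and dataset names are arbitrary). Note that a resizable maxshape forces chunking even without chunks=True, which is exactly our situation:

```python
import numpy as np
import h5py

# in-memory file, nothing written to disk
with h5py.File("chunk_test.h5", "w", driver="core", backing_store=False) as f:
    # chunks=True asks h5py to guess a chunk shape ("autochunking")
    auto = f.create_dataset("autochunk", (1000, 1000), chunks=True)
    # a maxshape containing None (resizable) implicitly forces chunking too
    grow = f.create_dataset("growable", data=np.zeros((100, 10)),
                            maxshape=(None, 10))
    auto_chunks, grow_chunks = auto.chunks, grow.chunks

print("autochunk chunks:", auto_chunks)
print("growable chunks:", grow_chunks)
```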

@oschulz might also have an opinion about all this.

@oschulz commented Nov 7, 2023

We're doing something on the Julia side to give the user control over chunking, but it's not really finished yet (legend-exp/LegendHDF5IO.jl#35).

In general, non-chunked files are faster to read, since memory-mapping can be used, which will be good for ML applications. For "standard" data storage, chunking makes it possible to write files bit by bit, so the writer doesn't need to have a large amount of memory available. Also, as you point out, compression (I think) always implies chunking in HDF5.

@gipert (Member) commented Nov 7, 2023

So I was wrong. We do always set maxshape in order to be able to append to datasets (?):

maxshape = (None,) + nda.shape[1:]

so (as per the h5py docs) all our datasets are chunked. I don't know how they are chunked (I could not find documentation). Maybe the "autochunking" feature is used? I still have no clue what that means.

@gipert (Member) commented Nov 7, 2023

Note that this implies: it's impossible to write contiguous datasets to disk with legend-pydataobj at the moment.

@iguinn (Contributor) commented Nov 8, 2023

Here's a Stack Overflow question that was answered by one of the h5py developers: https://stackoverflow.com/questions/21766145/h5py-correct-way-to-slice-array-datasets. Apparently fancy indexing is not directly supported in HDF5 (unlike slicing, which is), so h5py rolled its own implementation, which the developer admits is very slow for more than ~1000 elements ("abysmal" is his word of choice). So I think this supports the idea that we may want to accept higher memory usage and do fancy indexing through numpy instead.

If we really wanted to optimize this, we could try to implement it with chunks in mind: look at our index list, figure out which indices fall in each chunk, grab those from the chunk, and then move on to the next one. This would limit the extra memory burden to the (smaller) chunk size. It would also let us skip chunks that contain no indices from our fancy index, which could speed up sparse data reading (the waveform browser could potentially benefit from this). Of course, this would be a lot of extra work, so for now let's just read in the whole group and then assess whether this would be worth it.
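The chunk-aware scheme described above might look something like this numpy-only illustration (a sketch, not a proposed patch; a numpy array stands in for the h5py dataset, and the chunk size is arbitrary):

```python
import numpy as np

def chunked_fancy_index(dataset, idx, chunk_rows=256):
    """Gather the rows listed in sorted `idx`, one chunk at a time.

    `dataset` is anything sliceable along axis 0. Only chunks that
    actually contain requested rows are read, and at most one chunk is
    held in memory beyond the output buffer.
    """
    idx = np.asarray(idx)
    out = np.empty((len(idx),) + dataset.shape[1:], dtype=dataset.dtype)
    chunk_ids = idx // chunk_rows          # which chunk each row falls into
    pos = 0
    for cid in np.unique(chunk_ids):
        rows = idx[chunk_ids == cid]       # requested rows in this chunk
        start = cid * chunk_rows
        chunk = dataset[start:start + chunk_rows]  # one contiguous chunk read
        out[pos:pos + len(rows)] = chunk[rows - start]
        pos += len(rows)
    return out

data = np.arange(5000).reshape(1000, 5)
idx = np.array([3, 4, 10, 700, 701, 999])
assert np.array_equal(chunked_fancy_index(data, idx), data[idx])
```

Chunks with no requested rows are never sliced at all, which is where the sparse-read speedup would come from.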

@jasondet (Contributor) commented Nov 8, 2023

Looks like this behavior is even properly documented; I just ignored it, or didn't worry about it, when we implemented this long ago.

Louis's fix seems to handle all use cases with just a moderate burden from memory reallocation, and it's way better than what we have. I propose we just accept his fix for now and come back to this when/if it becomes the bottleneck again.

@gipert (Member) commented Nov 8, 2023

What a pity! I agree we clearly need Louis' workaround at this point. I'm curious to see the effect on DataLoader.load().

@lvarriano (Contributor, Author)
For interest/posterity, here's a comparison of the new implementation and the previous one (accessed via the new use_h5idx flag). The script is below and uses the memory-profiler package. This reads a single channel from 3 cal raw files on my laptop, which has an NVMe SSD and an ext4 file system (in case those are relevant for future comparisons). The partial reads are a random selection of 5000 rows. The memory penalty is ~130 MB, the size of the channel object in this cal file. I don't understand all of the spikes in the new implementation; there seem to be one or two extras compared to what I would expect. The speed-up for the partial reads is ~50x in this example.

There's currently a 2x speed penalty when the user passes an idx corresponding to all events, compared to just reading everything without any idx. In the future, we could add a comparison or some other way to detect this case; for now, it is up to the user. Maybe this is relevant for the DataLoader?

[plot: compare_idxreads, memory profile of new vs. original implementation]

import lgdo.lh5_store as lh5
import numpy as np

raw_file = 'l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5'
test_channel = 'ch1104002'

store = lh5.LH5Store()

chobj, _ = store.read_object(test_channel+'/raw', raw_file)
allind = np.arange(len(chobj))
theseind = np.sort(np.random.choice(allind, size=5000, replace=False))
del chobj

@profile
def new_readall(files):
    chobj, _ = store.read_object(test_channel+'/raw', files)

@profile
def orig_readall(files): 
    chobj, _ = store.read_object(test_channel+'/raw', files, use_h5idx=True)

@profile
def new_readallidx(files, idxs):   
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs)  

@profile
def orig_readallidx(files, idxs): 
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs, use_h5idx=True)

@profile
def new_readsomeidx(files, idxs):   
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs)

@profile
def orig_readsomeidx(files, idxs):   
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs, use_h5idx=True)

if __name__ == '__main__':
    new_readall([raw_file, raw_file, raw_file])
    orig_readall([raw_file, raw_file, raw_file])
    new_readallidx([raw_file, raw_file, raw_file], [allind, allind, allind])
    orig_readallidx([raw_file, raw_file, raw_file], [allind, allind, allind])
    new_readsomeidx([raw_file, raw_file, raw_file], [theseind, theseind, theseind])
    orig_readsomeidx([raw_file, raw_file, raw_file], [theseind, theseind, theseind])

@lvarriano (Contributor, Author)
In pull request #35, Luigi gave me the idea to convert idx to a slice when possible (i.e. when the elements of idx all increase by 1). This restores some speed if the user happens to pass an idx corresponding to all rows: now there is a ~30% speed penalty instead of 2x.
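The conversion amounts to a simple contiguity check (a minimal sketch, not the exact code merged in #35; the helper name is made up):

```python
import numpy as np

def idx_to_slice(idx):
    """Return an equivalent slice if idx is contiguous, else None."""
    idx = np.asarray(idx)
    # contiguous means idx is exactly arange(first, last + 1)
    if len(idx) > 0 and np.array_equal(idx, np.arange(idx[0], idx[-1] + 1)):
        return slice(int(idx[0]), int(idx[-1]) + 1)
    return None

assert idx_to_slice(np.arange(5, 12)) == slice(5, 12)   # contiguous -> slice
assert idx_to_slice(np.array([1, 3, 4])) is None        # gap -> keep fancy idx
```

Slicing goes through HDF5's fast hyperslab path, so whenever this returns a slice the expensive fancy-indexing machinery is skipped entirely.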

[plot: compare_idxreads2, memory profile with the idx-to-slice conversion applied]
