
DataLoader performance is bad? #521

Open
patgo25 opened this issue Oct 27, 2023 · 5 comments

Labels: flow High-level data management, performace Code performance

@patgo25 (Contributor) commented Oct 27, 2023

At the analysis workshop, it was argued that the slowness of the DataLoader is due only to IO speed. When I loaded data with "low-level routines", it never felt that slow.
To test this quantitatively, I wrote a quick script comparing low-level loading with DataLoader loading.

The script is:

```python
import glob
import re
import time

import numpy as np

import lgdo.lh5_store as store
from legendmeta import LegendMetadata
from pygama.flow import DataLoader


def low_level_load(period="p03", run="r000", ndets=10):
    start = time.time()
    meta_path = "/data2/public/prodenv/prod-blind/ref/v01.06/inputs"
    f_hit = sorted(glob.glob(
        "/data2/public/prodenv/prod-blind/ref/v01.06/generated/tier/hit/phy/"
        + period + "/" + run + "/*.lh5"
    ))
    f_tcm = [e.replace("hit", "tcm") for e in f_hit]
    dt_files = time.time() - start
    print(f"time to locate files: \t {dt_files:.3f} s")

    lmeta = LegendMetadata(path=meta_path)
    chmap = lmeta.channelmap(re.search(r"\d{8}T\d{6}Z", f_hit[0]).group(0))
    geds = list(chmap.map("system", unique=False).geds.map("daq.rawid").keys())[:ndets]
    dt_meta = time.time() - start - dt_files
    print(f"time to load meta: \t {dt_meta:.3f} s")

    # load TCM data to define an event
    nda = store.load_nda(f_tcm, ["array_id", "array_idx", "cumulative_length"], "hardware_tcm_1/")
    clt = nda["cumulative_length"]
    # file boundaries are where the concatenated cumulative length decreases
    split = clt[np.diff(clt, append=[clt[-1]]) < 0]
    ids = np.split(nda["array_id"], np.cumsum(split))
    idx = np.split(nda["array_idx"], np.cumsum(split))
    dt_tcm = time.time() - start - dt_files - dt_meta
    print(f"time to load tcm: \t {dt_tcm:.3f} s")

    for ch in geds:
        idx_ch = [idx[i][ids[i] == ch] for i in range(len(idx))]
        nda = store.load_nda(f_hit, ["cuspEmax_ctc_cal", "AoE_Classifier"],
                             f"ch{ch}/hit/", idx_list=idx_ch)
        mask = nda["cuspEmax_ctc_cal"] > 25
        nda["AoE_Classifier"][mask]

    dt_data = time.time() - start - dt_files - dt_meta - dt_tcm
    print(f"time to load data: \t {dt_data:.3f} s")


def data_loader_load(period="p03", run="r000", ndets=10):
    start = time.time()
    prodenv = "/data2/public/prodenv"
    dl = DataLoader(f"{prodenv}/prod-blind/ref/v01.06[setups/l200/dataloader]")
    file_query = f"period == '{period}' and run == '{run}' and datatype == 'phy'"
    dl.reset()
    dl.data_dir = prodenv
    dl.set_files(file_query)
    dt_files = time.time() - start
    print(f"time to locate files: \t {dt_files:.3f} s")

    first_key = dl.get_file_list().iloc[0].timestamp
    lmeta = LegendMetadata(f"{prodenv}/prod-blind/ref/v01.06/inputs")
    chmap = lmeta.channelmap(on=first_key)  # get the channel map
    geds = list(chmap.map("system", unique=False).geds.map("daq.rawid").keys())[:ndets]
    dl.set_datastreams(geds, "ch")
    dt_meta = time.time() - start - dt_files
    print(f"time to load meta: \t {dt_meta:.3f} s")

    dl.set_cuts({"hit": "trapEmax_ctc_cal > 25"})
    dl.set_output(columns=["AoE_Classifier"], fmt="lgdo.Table")
    geds_el = dl.build_entry_list(tcm_level="tcm")
    dt_tcm = time.time() - start - dt_files - dt_meta
    print(f"time to load tcm: \t {dt_tcm:.3f} s")

    dl.load(geds_el)
    dt_data = time.time() - start - dt_files - dt_meta - dt_tcm
    print(f"time to load data: \t {dt_data:.3f} s")


if __name__ == "__main__":
    print("Try low level:")
    low_level_load()
    print("Try Data loader:")
    data_loader_load()
```

(Note: the per-step timings in the original version of the script subtracted only the previous step's delta, not all prior ones; the deltas above are corrected, which shifts the reported numbers by at most a second.)

First, can you find any unfair treatment of one routine or the other?
Both routines should:

  1. Use data from period 03, run 000, for 10 detectors
  2. Apply a cut on energy > 25 keV
  3. Load the AoE values for the 10 geds passing the cut, using the TCM to define events
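
For readers less familiar with the TCM's flattened vector-of-vectors layout, here is a minimal synthetic example of the unpacking done in `low_level_load`; the field names mirror `hardware_tcm_1`, but the numbers are invented:

```python
import numpy as np

# Synthetic stand-in for a TCM (numbers invented): event 0 has hits in
# channels 1 and 2, event 1 has a single hit in channel 2.
array_id = np.array([1, 2, 2])        # channel (rawid) of each flattened hit
array_idx = np.array([0, 0, 1])       # row of that hit in its channel's hit table
cumulative_length = np.array([2, 3])  # running hit count after each event

# Event boundaries are the cumulative lengths, except the last one
# (np.split keeps whatever follows the final boundary automatically).
ids_per_event = np.split(array_id, cumulative_length[:-1])
idx_per_event = np.split(array_idx, cumulative_length[:-1])

# Rows to read for channel 2, event by event (same pattern as idx_ch above)
ch2_rows = [idx[ids == 2] for ids, idx in zip(ids_per_event, idx_per_event)]
print([list(r) for r in ch2_rows])  # → [[0], [1]]
```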

The result of the script on the LNGS cluster is:

```
> python speed_test.py
Try low level:
time to locate files:    0.003 s
time to load meta:       0.931 s
time to load tcm:        1.605 s
time to load data:       10.845 s
Try Data loader:
time to locate files:    0.918 s
time to load meta:       0.763 s
time to load tcm:        172.463 s
time to load data:       7.972 s
```
@jasondet (Collaborator) commented

What is going on with the tcm loading!

Okay, thanks for testing, will check this asap.

@gipert added the "tests Testing the package" and "flow High-level data management" labels on Oct 31, 2023
@SamuelBorden (Collaborator) commented

It looks like this performance issue could be related to these: https://forum.hdfgroup.org/t/performance-reading-data-with-non-contiguous-selection/8979 and h5py/h5py#1597

We're going to try the same workaround in read_object in order to speed up calls when an idx is provided.
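
The workaround can be sketched as follows. This is only an illustration with a throwaway file, not the actual `read_object` patch: instead of handing a sparse index list to h5py (which triggers slow HDF5 point selection), read one contiguous slice covering the indices and fancy-index the in-memory NumPy array.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a throwaway file with a single 1D dataset.
fname = os.path.join(tempfile.mkdtemp(), "demo.lh5")
with h5py.File(fname, "w") as f:
    f.create_dataset("hit/energy", data=np.arange(100_000, dtype="f8"))

# A sparse, sorted selection (h5py requires increasing indices).
idx = np.sort(np.random.default_rng(0).choice(100_000, size=1_000, replace=False))

with h5py.File(fname, "r") as f:
    ds = f["hit/energy"]
    # slow path: HDF5 point selection, one request per region
    #   vals = ds[idx]
    # faster path: one contiguous read, then NumPy indexing in memory
    lo, hi = int(idx[0]), int(idx[-1]) + 1
    vals = ds[lo:hi][idx - lo]
```

The trade-off is reading (and allocating) the covering slice even where it is not needed, which pays off when the selection is dense enough.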

@gipert (Member) commented Nov 2, 2023

Where do we do indexed reading in the data loader other than in .load(), when an entry list is used?

@jasondet (Collaborator) commented Nov 3, 2023

The idx read appears to have been a red herring. I found a factor of ~3 speed-up in the tcm-read step (really: building the entry list) in data_loader and another factor of ~2 in lgdo.table.get_dataframe. However, the overall read is still a factor of ~3 slower than Patrick's low-level read:

```
Singularity> python kraus_speed_test.py
Try low level:
time to locate files:    0.003 s
time to load meta:       1.050 s
time to load tcm:        1.632 s
time to load data:       9.426 s
Try Data loader:
time to locate files:    0.726 s
time to load meta:       0.891 s
time to load tcm:        26.584 s
time to load data:       7.684 s
```

There is a lot of complexity in data_loader related to multi-level cuts and other generalizations, but I hope it can be sped up further with some refactoring. It looks like there is still a lot going on in the inner loops.
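
As a generic illustration of the kind of inner-loop hoisting that tends to help here (illustrative only, not actual data-loader code): replace a per-entry Python loop with a single vectorized mask.

```python
import numpy as np

# Toy data: channel id and row index for each flattened hit.
array_id = np.array([3, 1, 3, 2, 3, 0])
array_idx = np.arange(len(array_id))

# per-entry Python loop (slow when it sits in an event builder's inner loop)
rows_loop = [i for i, ch in zip(array_idx, array_id) if ch == 3]

# vectorized equivalent: one boolean mask, evaluated in C
rows_vec = array_idx[array_id == 3]

print(list(rows_vec))  # → [0, 2, 4]
```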

@gipert (Member) commented Nov 3, 2023

Nice! Performance of the LGDO conversion methods is also a topic for legend-exp/legend-pydataobj#30 by @MoritzNeuberger.

Did anyone try profiling the code to spot straightforward bottlenecks?
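
A quick way to do that with the standard library, sketched on a stand-in function (substitute e.g. the call to dl.build_entry_list):

```python
import cProfile
import io
import pstats

def build_entry_list_stub():
    # stand-in for the expensive call being investigated
    return sum(i * i for i in range(100_000))

# Collect a profile around just the call of interest.
pr = cProfile.Profile()
pr.enable()
build_entry_list_stub()
pr.disable()

# Print the ten most expensive functions by cumulative time.
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(10)
report = s.getvalue()
print(report)
```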

@gipert added the "performace Code performance" label and removed the "tests Testing the package" label on Nov 13, 2023