Move expensive parse_header's to read #1411
Hi @bendichter. For some formats, getting the number of samples can take a while because we need to parse the entire file with a very bad loop, jumping from data block to data block. We need this number of samples before the read (`get_analog_signal_chunk`). In short, what you propose is decomposing the header parsing into two parts: all the metadata first, and then the number of samples.
Thanks for the additional context. Let me explain why this is important for us. When we built NeuroConv and NWB GUIDE on top of NEO, we imagined that construction of an interface would be fast. In the GUIDE, we extract as much metadata as we can from the source files and then have the user fill in the rest themselves. If you want the session start time for a Plexon session, the current workflow is to first initialize the converter, then call:

```python
interface = PlexonInterface(file_path="file.plx")
interface.get_metadata()["NWBFile"]["session_start_time"]
```
What if, in python-neo/neo/rawio/plexonrawio.py (lines 377 to 381 at 0c175b4), we added this guard before the expensive step:

```python
if self._data_blocks is None:
    self._data_blocks = self._parse_data_blocks()
```
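That guard pattern can be sketched as a self-contained toy (a hypothetical class, not the actual `PlexonRawIO`):

```python
class LazyBlockReader:
    """Sketch: defer the expensive data-block scan until the first read."""

    def __init__(self, file_path):
        self.file_path = file_path
        self._data_blocks = None  # NOT parsed in the constructor

    def _parse_data_blocks(self):
        # stand-in for the expensive full-file traversal
        return {"block_count": 42}

    def read_chunk(self):
        # parse lazily, exactly once, on the first read
        if self._data_blocks is None:
            self._data_blocks = self._parse_data_blocks()
        return self._data_blocks["block_count"]
```

Construction is instant; only the first `read_chunk()` pays the parsing cost, and subsequent reads reuse the cached result.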
A couple more points.
These thoughts are a little inchoate, so my apologies; I'm just trying to brainstorm right now, since splitting up the header parsing would be both an API change and a schema change. For my workflows (with neo and with spikeinterface) I'm okay with a long wait right at the beginning, but I totally understand why you wouldn't want to wait around.
The GUIDE is a graphical wrapper around NeuroConv operations, which has the following multi-step workflow:

i) initialize the SpikeInterface recording extractor (I believe the contentious thing here is that this initialization is slow);
ii) (optional) fetch default metadata from the interface — for neo-reliant interfaces, this simply retrieves the already-parsed metadata;
iii) (optional) the user modifies or confirms metadata by passing it back to the interface (including file-wide metadata, or cross-object metadata, not necessarily just the recording-specific stuff) — the default just passes back everything that was fetched automatically in (ii);
iv) the user runs the conversion.

The GUIDE makes (ii) and (iii) far more intuitive, since we can visually display all information with detailed explanations of what various fields mean, which makes it a very interactive process.
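The multi-step workflow described here can be sketched with a toy interface; the class and field names below are illustrative, not the actual NeuroConv API:

```python
class ToyInterface:
    """Minimal stand-in for a NeuroConv-style data interface (names are hypothetical)."""

    def __init__(self, file_path):
        # (i) construction; the contentious part is when this step is expensive
        self.file_path = file_path
        self._file_metadata = {"session_start_time": "2023-01-01T00:00:00"}

    def get_metadata(self):
        # (ii) fetch defaults already parsed from the source file
        return dict(self._file_metadata)

    def run_conversion(self, metadata):
        # (iv) the expensive write step, using the (iii) user-edited metadata
        return {"written": True, **metadata}

interface = ToyInterface("file.plx")
metadata = interface.get_metadata()          # (ii) fetch defaults
metadata["experimenter"] = "Jane Doe"        # (iii) user edits/confirms
result = interface.run_conversion(metadata)  # (iv) run the conversion
```

The point of contention in the thread is step (i): if the constructor does expensive work, every session's (ii) blocks on it.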
I'm not sure what kind of workflow you're envisioning; the parsing of a lightweight metadata header is, as I understand it, itself a part of the 'collecting metadata' step (the values in the file provide many defaults, such as channel names, electrode group structure, gains, even the session start time). So we can't go from 'select Plexon interface' to 'specify source file for Plexon' (which then initializes the neo object) to 'edit metadata for Plexon' asynchronously, since the steps are dependent.
I'll point out that maybe the confusion is due to a mix-up in understanding what `parse_header` is expected to do. As I read it, for something like SpikeGLX this should just mean 'read the .meta file and form a dictionary out of it', or for Intan, 'read the first X bytes at the top of the file and form a dictionary out of it'. But my impression for Plexon is that it also scans all the packets of data collecting some summary info, and that is the expensive step, right? So it's doing something fundamentally different from what other, more straightforward header parsing does. I've not delved into the code, though, so let me know if I'm way off on this. Also: a really cheap way of getting better speed without going to C is to offer a parallelized version of the procedure, if it's parallelizable.
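For the SpikeGLX-style case, 'read the .meta file and form a dictionary out of it' is roughly the following (a simplified sketch, not the actual neo implementation):

```python
def parse_meta_text(text):
    """Parse SpikeGLX-style key=value lines into a dict (simplified sketch)."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blanks and malformed lines
        key, _, value = line.partition("=")
        meta[key.strip()] = value.strip()
    return meta

# illustrative keys only; a real .meta file has many more fields
example = """nSavedChans=385
imSampRate=30000.0
fileTimeSecs=120.5"""
meta = parse_meta_text(example)
```

This kind of parse is O(header size) and independent of the data payload, which is why it completes in milliseconds regardless of recording length.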
In Ben's message I thought he said you were collecting some metadata from the user. So you could have the 20-minute-long parsing of the file-based metadata + the rest of neo's parsing (number of samples, memmap, etc.) run while you also collect the user-based metadata, so that inputting various info occupies the same time, rather than waiting the full 20 minutes to finish before being able to work on (the rest of) the metadata. But based on your (iii), maybe it is actually that you allow them to modify the metadata rather than just collect metadata not saved in the file? (NB: I'm referring to file-based metadata, which neo grabs, and user-based metadata, like animal genotype or name of the experiment, which only the user could provide, as two separate things.)
Again, @samuelgarcia is the authority on this part of the code.
This is beyond me, so if Sam knew how to do this it might be possible, but I would have to read more to feel like I could safely do this. Currently the idea is that you read a block of the file using numpy functions serially; with a parallel approach I don't know how I could make sure we are safely going through the data blocks. Hope you're well @CodyCBakerPhD :)
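Why the traversal is hard to parallelize can be seen in a toy block layout: each block's header stores its payload length, so block N+1's offset is only known after reading block N's header (a simplified sketch, not the actual .plx layout):

```python
import struct

def traverse_blocks(buf):
    """Walk variable-length blocks serially.

    Each block starts with a 4-byte little-endian length field followed by
    that many payload bytes; the next block's offset cannot be computed
    without first reading the current header."""
    offsets = []
    pos = 0
    while pos + 4 <= len(buf):
        (length,) = struct.unpack_from("<I", buf, pos)
        offsets.append(pos)
        pos += 4 + length  # must read this header before jumping ahead
    return offsets

# two blocks: a 3-byte payload, then a 2-byte payload
data = struct.pack("<I", 3) + b"abc" + struct.pack("<I", 2) + b"xy"
```

Because the offsets form a dependency chain, naive parallel workers cannot safely be assigned byte ranges up front; parallelization would need an index pass or fixed-size blocks first.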
It doesn't take 20 minutes to type in the info for those forms; and unless neo somehow provides us with what fields are in the files prior to parsing, we can't let the user know 'this field will be inferred from your Plexon files'. The long, expensive part of the conversion process comes after you hit the 'run conversion' button, which is why we're pushing to have anything that needs to iterate over data packets occur then.
Thanks for the explanation. Shocking to hear that you don't have sufficient information to form a memory map from a very small amount of top-level text values (all you ought to need is dtype, column order, and shape?). Could you get the shape from a single packet, then use the packet size as a heuristic to multiply by the number of packets? Otherwise, what about a pre-step that compiles and caches the neo-required fields from an inefficient format, such that the primary workflow can then always expect the information in either a lightweight text file (like SpikeGLX) or a precomputed lightweight file for the others?
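The shape heuristic suggested here could look like this (a sketch; it assumes a single fixed header offset and a uniform interleaved sample layout, which, as discussed below, does not hold for formats like Plexon):

```python
def estimate_shape(file_size, header_offset, n_channels, itemsize):
    """Infer (n_samples, n_channels) from the file size alone.

    Works only when the data region is one contiguous block of interleaved
    samples: data_bytes must divide evenly by one frame (n_channels * itemsize).
    """
    data_bytes = file_size - header_offset
    n_samples, remainder = divmod(data_bytes, n_channels * itemsize)
    if remainder:
        raise ValueError("file size inconsistent with the assumed layout")
    return (n_samples, n_channels)

# e.g. a 1 MiB header followed by 30,000 frames of 64 int16 channels
shape = estimate_shape(file_size=1_048_576 + 64 * 2 * 30_000,
                       header_offset=1_048_576, n_channels=64, itemsize=2)
```

With dtype, column order, shape, and offset in hand, a `numpy.memmap` over the data region follows directly; the heuristic breaks down as soon as a format interleaves multiple streams or sub-headers into the data region.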
I know. Just trying to occupy some of the time so it doesn't feel as long. But not knowing what to fill in until after parsing would make this unhelpful.
For simple file formats, that is all you need (honestly, we always seem to have 'C' column order; I don't know why, but it makes sense that most files would just use 'C' order), plus an offset to account for the header. So: dtype, column order, shape, and offset. But for complicated formats you need to traverse the file to get this info. Heberto was telling me that Sam told him (so lots of hearsay) that one of the files neo supports has its header in the middle of the file, which means the file needs to be searched to find the header, which will take time as well (people should really follow the convention that the header is at the head of the file). So a single "packet" is not always representative of all packets.

That being said, I ran a couple of tests just now: a 100 MB Intan file parsed in 100 ms vs. a 10 MB Plexon file taking 1.56 seconds. So we are looking at roughly 100x slower if the files were the same size (assuming they scale). I think the issue is that some file formats have nice packets and some don't. For Plexon, the different sets of data each sit behind their own sub-headers, with the dtype information spread across different parts of those headers. Again just comparing Intan and Plexon: Intan makes one memory map with only one global offset for the header, whereas Plexon makes 5 memory maps where we need to adjust the offset based not just on the global header but on the sub-headers. Then we do the final data cleanup, which I assume is part of the slow part since it is a nested for loop.

Some other quick facts: parsing the Intan header (for metadata) took ~18 ms and making the memmap took 200 microseconds, so the other ~80 ms in this case is coming from rawio overhead. It might be a decent idea to run a more sophisticated profile on your Plexon example data so we can better figure out where things are getting hung up.
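The "more sophisticated profile" suggested here can be run with the standard library's cProfile; a minimal, self-contained example (`slow_header_scan` is a stand-in, not neo code):

```python
import cProfile
import io
import pstats

def slow_header_scan():
    # stand-in for an expensive parse_header-style traversal
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
slow_header_scan()
profiler.disable()

# render the top entries sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```

In practice you would wrap the real `parse_header()` call and sort by cumulative time to see whether the data-block loop, the memmap construction, or the cleanup dominates.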
Just ran cProfile and got the breakdown for Plexon at about 2 s total (with the cProfile overhead).
Forgot to respond to this part. That could work, but again, that's a bit more of an API-level change. Could be nice, though. We would really need to profile Plexon first.
I get the idea, but also note we are talking 20 minutes per session, and there could easily be 20 sessions in a conversion.
Sorry, one last note on profiling, then I'm done for now. If we get rid of the data-block organization within the header parsing, we cut the time to 1 second (so a 2x speed-up). So I think the issue with Plexon vs. something like Intan is that Intan makes its one memmap and then only really touches it during the actual read. If we want to enhance the header parsing, we should think about limiting the calls to the memmaps that are generated. Getting rid of the data_block preparation caused a speed-up of 2.67x for the header parsing.
Thanks for looking into this. Yes, that was my suggestion. Also, I would expect the speed-up to be way more than 2.67x on a real-sized recording file.
If you are willing to share a file, I can do some testing on my end. I was just playing around with the Plexon website's free sample data (which is 10 MB), but if you have something bigger I could also profile on that (it might be worth testing, because if this scales, I don't know whether going from 20 min to 10 min helps enough to make the user wait around). I just find it super interesting that Plexon as-is is between 10-100x slower than Intan, and if I comment out the data_blocks parsing with the small file it is still 5-50x slower than Intan. Which makes me think this is a combo of how we are doing it + some inefficiencies in the .plx format.
Sounds good. I sent you an invitation to a file on Google Drive via your email. |
I suppose if it is really not feasible to change this on the NEO side, we could create a new type of data interface that delays the initialization of the neo reader and uses custom code to extract metadata. Then all neo readers that have a constructor which takes more time than is acceptable for our use-cases can use this other data interface instead of our current one. |
Wouldn't even need to be a 'new kind' of interface; just override the metadata-fetching step. The greatest downside is that if there's a problem reading the file data, we usually know that right away from the upfront parse.
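A minimal sketch of that delayed-initialization idea (all class and method names here are hypothetical, not the actual NeuroConv or neo API):

```python
class DelayedReaderInterface:
    """Sketch: extract lightweight metadata with custom code and only
    build the expensive neo reader on first data access."""

    def __init__(self, file_path):
        self.file_path = file_path
        self._reader = None  # trade-off: file errors surface late, on first access

    def get_metadata(self):
        # cheap custom extraction instead of the full parse_header
        return {"source": self.file_path}

    def _build_reader(self):
        # stand-in for the 20-minute PlexonRawIO construction
        return object()

    @property
    def reader(self):
        # lazily construct and cache the heavy reader
        if self._reader is None:
            self._reader = self._build_reader()
        return self._reader
```

The trade-off flagged above is visible in the structure: `get_metadata()` never touches `_build_reader()`, so an unreadable file is only discovered when `reader` is first accessed.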
Yes, you could do it that way, but I don't like skipping the upfront parsing.
I'm downloading it now. I'm doing recordings the rest of the day, but I'll profile it tomorrow and post here. That way you can decide if you want to change things on your side, and we can decide how updates would scale to large files on the neo side.
You could …
@CodyCBakerPhD yes, that's an option, though I would prefer that the …
So, with the file I was testing (just so we have a baseline and the changes), I ran cProfile, and I figured I can just display the raw results. But to summarize first: the total parse_header time, with the cProfile overhead, was 15:39. Of note, I did not use tqdm.
And for me, doing a dummy removal of the organization of the data_blocks (@bendichter's idea is to do this later, although for my test I was just removing it), we get a total time of 6:31.468. (Again, no tqdm.)
The final result of delaying is an overall speed-up of 2.4x, and it seems that improvement would roughly scale. What do you think @bendichter and @CodyCBakerPhD?
Is your feature request related to a problem? Please describe.
Initializing `PlexonRawIO` can take 20 minutes to parse the headers for a real file. This is a problem e.g. in NWB GUIDE, where we need to get metadata. This requires initialization of every session, which would amount to making the user wait for hours in the middle of building the conversion just to fetch the metadata from each session.

Describe the solution you'd like
I would like to keep the metadata parsing in the constructor and move the expensive header parsing to the read command. This would use caching to ensure it is only executed once, even when the read command is used multiple times. This would not improve the speed, but it would delay the long wait until a time that is much better for usability.
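The caching behavior requested here can be sketched with the standard library's `functools.cached_property` (a toy class, not the actual neo code; `scan_count` exists only to demonstrate the single execution):

```python
from functools import cached_property

class ReaderSketch:
    """Sketch of the proposal: cheap metadata in the constructor,
    with the expensive scan cached so repeated reads pay for it once."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.scan_count = 0  # counts how many times the expensive scan ran
        self.metadata = {"file": file_path}  # cheap part stays in __init__

    @cached_property
    def data_blocks(self):
        # the expensive full-file traversal would live here;
        # cached_property guarantees it runs at most once per instance
        self.scan_count += 1
        return list(range(3))

    def read(self, i):
        # every read path goes through the cached property
        return self.data_blocks[i]
```

The first `read()` triggers the scan; all later reads hit the cached `data_blocks`, matching the "only executed once" requirement.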
Describe alternatives you've considered
It would be great to also improve the efficiency of the header parsing code if possible.