provide a countLines method #16

nathandunn · 2019-02-28T22:58:00Z

All I really want / need is a quick way to count the total number of hits (or lines) for all chunks. String parsing slows it down.

https://github.com/GMOD/tabix-js/blob/master/src/tabixIndexedFile.js#L118

Related to this: GMOD/jbrowse#1322

garrettjstevens · 2019-03-01T21:01:49Z

The problem is that without parsing the string, you can't get an accurate line count. For example, let's say you have a query for a very small region: chromosome 1, start 100, end 102. Using the tabix index, you can get the chunk of the file that covers that region. But that chunk may have a lot more lines than just your query. There may be 100 lines in that chunk that don't match your query while only a single line actually matches it.
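To illustrate the point, here is a minimal sketch of why an exact count needs the coordinates parsed. It assumes a BED-like layout (reference name, start, end in the first three tab-separated columns, half-open coordinates); `countOverlapping` is a hypothetical helper, not part of tabix-js.

```javascript
// Hypothetical helper: count the lines in an already-decoded chunk that
// overlap the query region [start, end). Column layout is an assumption.
function countOverlapping(chunkText, refName, start, end) {
  let count = 0
  for (const line of chunkText.split('\n')) {
    if (line === '' || line.startsWith('#')) continue // skip blanks and headers
    const [ref, s, e] = line.split('\t')
    const featStart = Number(s)
    const featEnd = Number(e)
    if (ref === refName && featStart < end && featEnd > start) count += 1
  }
  return count
}

// A chunk returned for a small query can hold many unrelated lines.
const chunk = [
  'chr1\t10\t20',
  'chr1\t50\t60',
  'chr1\t100\t101',
  'chr1\t500\t600',
].join('\n')

console.log(chunk.split('\n').length) // lines in the chunk
console.log(countOverlapping(chunk, 'chr1', 100, 102)) // lines matching the query
```

Here the chunk holds four lines but only one overlaps the query, so counting newlines alone would overestimate.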

Assuming that queries are always going to be large regions and approximate line counts are ok, though, you could skip parsing those lines, but you still have the bottleneck of converting the chunk of the file to a string so it can be split at newlines. That's probably the biggest current bottleneck, more than the string parsing I would bet. There's an idea of how to do that faster in #10 which might be worth further investigation.
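For the approximate case, one option would be to count newline bytes directly in the decompressed buffer and skip the string conversion entirely. This is only a sketch: `countNewlines` is a hypothetical helper, and it assumes the chunk is already available as a `Uint8Array` (or Node `Buffer`). Whether it is actually faster would need to be measured.

```javascript
// Count newline bytes (0x0a) in a byte buffer without calling toString().
// `bytes` is assumed to be an already-decompressed chunk.
function countNewlines(bytes) {
  let count = 0
  for (let i = 0; i < bytes.length; i += 1) {
    if (bytes[i] === 10) count += 1
  }
  return count
}

// Example with three tab-separated lines encoded as bytes.
const sample = new TextEncoder().encode(
  'chr1\t100\t101\nchr1\t150\t151\nchr1\t200\t201\n',
)
console.log(countNewlines(sample))
```

The result is an upper bound on the number of matching lines, since the chunk may contain lines outside the query region.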

nathandunn · 2019-03-01T23:36:22Z

@garrettjstevens Estimates are perfectly good.

Are the chunks ordered (I assume they would have to be)? If you knew the start of each chunk and the number of lines per chunk, you could quickly scan the chunks and determine which ones have to be parsed per bin, without fully parsing all of them: look for the first instance of "\n", parse that line, and return its position as the chunk's start; if the next chunk's start is also part of the same bin, you can just add that chunk's line count. I think that is what you are saying, except that some chunks would not have to be fully parsed.
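A rough sketch of the "parse only the first line" step, to make the idea concrete. It assumes the chunk begins at the start of a line and that the reference name and start coordinate are in the first two tab-separated columns; `firstLineStart` is a hypothetical name, not an existing tabix-js function.

```javascript
// Decode and parse only the first line of a chunk, leaving the rest as bytes.
// Assumes columns: refName, start, ... (tab-separated).
function firstLineStart(bytes) {
  const newline = bytes.indexOf(10) // first '\n' byte, or -1 if none
  const end = newline === -1 ? bytes.length : newline
  const line = new TextDecoder().decode(bytes.subarray(0, end))
  const cols = line.split('\t')
  return { refName: cols[0], start: Number(cols[1]) }
}

const chunkBytes = new TextEncoder().encode(
  'chr1\t100\t101\nchr1\t150\t151\nchr1\t200\t201\n',
)
console.log(firstLineStart(chunkBytes))
```

Only the bytes up to the first newline are converted to a string, so the cost does not grow with the size of the chunk.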

I think that #10 is one possible solution that will help and we should definitely explore that. The toString() and trim() functions actually add a lot. I'll post the performance snapshots as I get them.

I also wonder whether the parse() function could automatically build an index on the fly that provides a rough estimate.

nathandunn · 2019-03-02T00:03:49Z

You can see the network lag on the LHS followed by the processing.

Zooming in, the toString() (slowToString?) chews up most of the CPU.

nathandunn · 2019-03-02T00:04:22Z

@garrettjstevens Let me know if you need any sample data.

garrettjstevens · 2019-03-02T16:19:42Z

@nathandunn I'm not sure I understand what you're saying. Let's sync up next week to talk about it.

In the meantime, I tried a couple things related to #10 and couldn't see any noticeable performance improvement.

@rbuels, do you have any ideas for an efficient getApproxLineCount() method?

cmdcolin · 2019-03-02T16:22:20Z

Something like indexcov would be cool https://www.ncbi.nlm.nih.gov/m/pubmed/29048539/

nathandunn · 2019-03-02T16:30:40Z

All, yes, let's definitely sync up next week.

In the very worst case (if nothing works at all), I could autogenerate a histogram BigWig (similar to the SNP density) and provide an accompanying track. What is done for VCFs and BAMs (which I would anticipate have similar issues)?

@cmdcolin I'm also not opposed to providing an alternate index, either stored on the filesystem like this (better) or within the database (worse, we are now back to GBrowse).

cmdcolin · 2019-03-02T16:38:07Z

The idea of indexcov is that it works on the normal index.
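For context, a rough sketch of that idea as I understand it: the linear part of a BAM or tabix index stores one virtual file offset per 16 kb window, and the difference between the compressed offsets of consecutive windows gives a proxy for how much data falls in each window. The function names and sample offsets below are made up for illustration; this is not indexcov's code or the tabix-js API.

```javascript
// A virtual offset is a 64-bit value: the upper 48 bits are the offset of a
// compressed BGZF block in the file, the lower 16 bits are the position
// inside that block once decompressed. BigInt keeps all 64 bits exact.
function compressedOffset(virtualOffset) {
  return virtualOffset >> 16n
}

// Given the linear index for one reference sequence (one virtual offset per
// 16 kb window), estimate the compressed bytes of data in each window as the
// distance to the next window's offset.
function estimateBytesPerWindow(linearIndex) {
  const sizes = []
  for (let i = 0; i + 1 < linearIndex.length; i += 1) {
    const delta =
      compressedOffset(linearIndex[i + 1]) - compressedOffset(linearIndex[i])
    sizes.push(Number(delta))
  }
  return sizes
}

// Made-up offsets: windows 1 and 2 start in the same compressed block.
const linearIndex = [0n, 4096n << 16n, (4096n << 16n) | 100n, 20000n << 16n]
console.log(estimateBytesPerWindow(linearIndex))
```

The output is in compressed bytes, not lines, so turning it into a line count would need a scaling factor, for example an average bytes-per-line figure sampled from one parsed chunk.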

nathandunn · 2019-03-02T16:58:21Z

Nice. Would you use something like gopherjs to port their implementation or just write a native one?

nathandunn mentioned this issue Mar 1, 2019

WIP::Speedup tabix histogram GMOD/jbrowse#1322

Closed
