provide a countLines method #16

Open
nathandunn opened this issue Feb 28, 2019 · 9 comments

Comments

@nathandunn

All I really want/need is a quick way to count the total number of hits (or lines) across all chunks. String parsing slows it down.

https://github.com/GMOD/tabix-js/blob/master/src/tabixIndexedFile.js#L118

Related to this: GMOD/jbrowse#1322

@garrettjstevens
Contributor

The problem is that without parsing the string, you can't get an accurate line count. For example, let's say you have a query for a very small region: chromosome 1, start 100, end 102. Using the tabix index, you can get the chunk of the file that covers that region. But that chunk may have a lot more lines than just your query. There may be 100 lines in that chunk that don't match your query while only a single line actually matches it.
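To make that concrete, here is a minimal sketch of what an exact count involves. It is not tabix-js code: `countLinesInRegion` is a hypothetical helper, and it assumes tab-separated lines with the reference name in column 1 and a single 1-based position in column 2, and it treats lines starting with `#` as comments.

```javascript
// Hypothetical helper (not part of tabix-js): exact count of lines in a
// decoded chunk that fall inside the queried region.
// Assumed layout: column 1 = reference name, column 2 = 1-based position.
function countLinesInRegion(chunkText, refName, start, end) {
  let count = 0;
  for (const line of chunkText.split('\n')) {
    // Skip blank lines and (assumed) '#' comment lines.
    if (line === '' || line.startsWith('#')) continue;
    const fields = line.split('\t');
    const pos = Number(fields[1]);
    if (fields[0] === refName && pos >= start && pos <= end) count += 1;
  }
  return count;
}

// The chunk holds four lines, but only one lies in chr1:100-102.
const exampleChunk = [
  'chr1\t50\tA',
  'chr1\t101\tB',
  'chr1\t150\tC',
  'chr1\t900\tD',
].join('\n');
console.log(countLinesInRegion(exampleChunk, 'chr1', 100, 102)); // prints 1
```

Every line has to be split and compared, which is the parsing cost this issue wants to avoid.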

If we assume that queries will always cover large regions and that approximate line counts are OK, you could skip parsing those lines. You would still have the bottleneck of converting the chunk of the file to a string so it can be split at newlines, though. I would bet that conversion is the biggest current bottleneck, more than the string parsing. There's an idea for doing it faster in #10 that might be worth further investigation.
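One way an approximate count could avoid the string conversion entirely is to count newline bytes in the raw buffer. This is only a sketch under assumptions: `countNewlineBytes` is a hypothetical name, not an existing tabix-js method, and it assumes the chunk has already been decompressed into a `Uint8Array` that begins at the start of a line.

```javascript
// Hypothetical helper (not part of tabix-js): approximate line count for a
// decompressed chunk, found by scanning bytes instead of building a string.
function countNewlineBytes(bytes) {
  const NEWLINE = 0x0a; // byte value of '\n'
  let count = 0;
  for (let i = 0; i < bytes.length; i += 1) {
    if (bytes[i] === NEWLINE) count += 1;
  }
  // A final line with no trailing newline still counts as a line.
  if (bytes.length > 0 && bytes[bytes.length - 1] !== NEWLINE) count += 1;
  return count;
}

const sampleBytes = new TextEncoder().encode('chr1\t100\tA\nchr1\t150\tB\nchr1\t200\tC');
console.log(countNewlineBytes(sampleBytes)); // prints 3
```

The result counts every line in the chunk, including comment lines and lines outside the query, so it is an upper bound rather than an exact hit count.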

@nathandunn
Author

@garrettjstevens Estimates are perfectly good.

Are the chunks ordered (I assume they have to be)? If you knew the start of each chunk and the number of lines per chunk, you could quickly scan the chunks and determine which ones you had to parse per bin, without fully parsing them: just look for the first instance of "\n", parse that line, and return that as the start. If the next chunk's start is also part of the same bin, you can just add that number. I think that is what you are saying. However, you wouldn't have to fully parse some chunks.
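Roughly, the "peek at the first line" step could look like the sketch below. The names are hypothetical and not from tabix-js; it assumes the decompressed chunk is a `Uint8Array` that begins at the start of a line, and that the start coordinate is in the second tab-separated column (index 1).

```javascript
// Hypothetical helper (not part of tabix-js): decode only the first line of
// a chunk and return its reference name and start coordinate.
function peekFirstLineStart(bytes, startColumn = 1) {
  const newlineAt = bytes.indexOf(0x0a); // first '\n', or -1 if none
  const end = newlineAt === -1 ? bytes.length : newlineAt;
  // Only the first line is converted to a string; the rest stays as bytes.
  const firstLine = new TextDecoder().decode(bytes.subarray(0, end));
  const fields = firstLine.split('\t');
  return { refName: fields[0], start: Number(fields[startColumn]) };
}

const peekBytes = new TextEncoder().encode('chr1\t100\tA\nchr1\t200\tB\n');
console.log(peekFirstLineStart(peekBytes)); // { refName: 'chr1', start: 100 }
```

With the start of each chunk known, a chunk whose start and whose successor's start both fall inside the query could be counted without parsing each of its lines, leaving only the boundary chunks to parse fully.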

I think that #10 is one possible solution that will help, and we should definitely explore it. The toString() and trim() calls actually add a lot of overhead. I'll post the performance snapshots as I get them.

I am also wondering whether the parse() function could automatically build an index on the fly that provides a rough estimate.

@nathandunn
Author

You can see the network lag on the left-hand side, followed by the processing.

[Screenshot, 2019-03-01 at 3:57:04 pm]

Zooming in, the toString() (slowToString?) call chews up most of the CPU:

[Screenshot, 2019-03-01 at 3:58:51 pm]

@nathandunn
Author

@garrettjstevens Let me know if you need any sample data.

@garrettjstevens
Contributor

@nathandunn I'm not sure I understand what you're saying. Let's sync up next week to talk about it.

In the meantime, I tried a couple of things related to #10 and couldn't see any noticeable performance improvement.

@rbuels, do you have any ideas for an efficient getApproxLineCount() method?

@cmdcolin
Contributor

cmdcolin commented Mar 2, 2019

Something like indexcov would be cool: https://www.ncbi.nlm.nih.gov/m/pubmed/29048539/

@nathandunn
Author

All, yes, let's definitely sync up next week.

In the very worst case (if nothing works at all), I could autogenerate a histogram BigWig (similar to the SNP density) and provide an accompanying track. What is done for VCFs and BAMs (which I would anticipate have similar issues)?

@cmdcolin I'm also not opposed to providing an alternate index, either stored on the filesystem like this (better) or within the database (worse, we are now back to GBrowse).

@cmdcolin
Contributor

cmdcolin commented Mar 2, 2019

The idea of indexcov is that it works on the normal index.

@nathandunn
Author

nathandunn commented Mar 2, 2019 via email
