provide a countLines method #16
The problem is that without parsing the string, you can't get an accurate line count. For example, say you have a query for a very small region: chromosome 1, start 100, end 102. Using the tabix index, you can get the chunk of the file that covers that region, but that chunk may contain many more lines than match your query: there may be 100 lines in the chunk while only a single line actually overlaps the region. If queries are always going to cover large regions and approximate line counts are okay, you could skip parsing those lines, but you still have the bottleneck of converting the chunk of the file to a string so it can be split at newlines. I would bet that's the biggest current bottleneck, more than the string parsing. There's an idea of how to do that faster in #10 which might be worth further investigation.
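To make that byte-level idea concrete, here is a minimal sketch (hypothetical, not part of the tabix-js API) that counts the lines in a decompressed chunk by scanning raw bytes for newlines, skipping the string conversion entirely. It counts every line in the chunk, including lines outside the query, so it only makes sense when approximate counts are acceptable:

```js
// Hypothetical sketch: count lines in a decompressed chunk by scanning its
// raw bytes for '\n' (0x0a), avoiding the toString()/split('\n') step.
// Counts all lines in the chunk, not just those matching the query region.
function countChunkLines(chunkBytes /* Uint8Array */) {
  let lines = 0
  for (let i = 0; i < chunkBytes.length; i += 1) {
    if (chunkBytes[i] === 10) lines += 1
  }
  return lines
}
```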
@garrettjstevens Estimates are perfectly good. Are the chunks ordered (I would assume they would have to be)? If you knew the start of each chunk and the number of lines per chunk, you could quickly scan the chunks and determine which ones had to be parsed per bin, without fully parsing them: just look for the first instance of "\n", parse that one line, and return it as the start; if the next chunk's start is also part of the same bin, you can just add that chunk's line count. I think that is what you are saying, except that you wouldn't have to fully parse some chunks. I think #10 is one possible solution that will help, and we should definitely explore it; the toString() and trim() calls actually add a lot. I'll post the performance snapshots as I get them. I also wonder whether the parse() function could build a rough-estimate index on the fly.
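A sketch of that "parse only the chunk boundary" idea (hypothetical helper name; assumes tab-delimited records with the start coordinate in a fixed column, as in BED), decoding only the single line needed rather than the whole chunk:

```js
// Hypothetical sketch: recover the start coordinate of the first complete
// line in a chunk by decoding only that one line, so a precomputed per-chunk
// line count can be assigned to a bin without parsing the whole chunk.
function firstRecordStart(chunkBytes /* Uint8Array */, startColumn = 1) {
  const nl1 = chunkBytes.indexOf(10) // first '\n'; bytes before it may be a partial record
  if (nl1 === -1) return undefined
  const nl2 = chunkBytes.indexOf(10, nl1 + 1)
  if (nl2 === -1) return undefined
  const line = new TextDecoder('utf-8').decode(chunkBytes.subarray(nl1 + 1, nl2))
  return Number(line.split('\t')[startColumn])
}
```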
@garrettjstevens Let me know if you need any sample data.
@nathandunn I'm not sure I understand what you're saying. Let's sync up next week to talk about it. In the meantime, I tried a couple of things related to #10 and couldn't see any noticeable performance improvement. @rbuels, do you have any ideas for an efficient countLines?
Something like indexcov would be cool: https://www.ncbi.nlm.nih.gov/m/pubmed/29048539/
All, yes, let's definitely sync up next week. Very worst case (if nothing works at all), I could autogenerate a histogram BigWig (similar to the SNP density track) and provide it as an accompanying track. What is done for VCFs and BAMs (which I would anticipate have similar issues)? @cmdcolin I'm also not opposed to providing an alternate index, either stored on the filesystem like this (better) or within the database (worse; then we are back to GBrowse).
The idea of indexcov is that it works on the normal index.
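For context, the indexcov trick reads only the index: the linear index stores one file offset per 16 kb window of the reference, and the gap between consecutive offsets approximates how much compressed data (and therefore roughly how many records) falls in each window. A minimal sketch, assuming the virtual file offsets have already been converted to plain byte numbers:

```js
// Hypothetical sketch of an indexcov-style estimate: differences between
// consecutive linear-index offsets approximate the compressed bytes per
// 16 kb window, which tracks record density without touching the data file.
function windowDensities(linearOffsets /* number[] */) {
  const densities = []
  for (let i = 0; i + 1 < linearOffsets.length; i += 1) {
    densities.push(linearOffsets[i + 1] - linearOffsets[i])
  }
  return densities
}
```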
Nice. Would you use something like gopherjs to port their implementation or just write a native one?
Nathan
All I really want/need is a quick way to count the total number of hits (or lines) for all chunks. String parsing slows it down.
https://github.com/GMOD/tabix-js/blob/master/src/tabixIndexedFile.js#L118
Related to this: GMOD/jbrowse#1322
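Putting the pieces above together, a self-contained sketch of what a countLines-style method might look like (hypothetical; readChunk is an assumed helper that fetches and decompresses one chunk as a Uint8Array): sum newline bytes across every chunk the index returns for a query, with no string conversion and no per-field parsing.

```js
// Hypothetical sketch of a countLines-style method: approximate the total
// hits for a query by counting newline bytes across all of its chunks.
async function countLines(chunks, readChunk /* (chunk) => Promise<Uint8Array> */) {
  let total = 0
  for (const chunk of chunks) {
    const bytes = await readChunk(chunk)
    for (let i = 0; i < bytes.length; i += 1) {
      if (bytes[i] === 10) total += 1 // 0x0a === '\n'
    }
  }
  return total
}
```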