Improve support for very long input lines (> 2Gbyte) #1542
Merged
Some changes to make reading and writing very long lines work better. It is a partial fix for #1539: the overflow mentioned there is dealt with, but more work will be needed to make very long VCF records readable.

- Updates `bgzf_getline()` to avoid integer overflow.
- Updates `tbx_parse1()` and `vcf_write()` so they can handle long lines (albeit only on 64-bit platforms).
- Updates the `read()` and `write()` calls in `fd_read()` and `fd_write()` to ensure they really have processed the expected amount of data. This is especially important on Linux, which has a limit on the amount of data that will be read or written in a single call.

This is mostly useful for `tabix`, which doesn't do much interpretation of its input. With these changes it will happily index and return long lines if it has enough memory.

This does not change the maximum size of a SAM record, as that is limited by `bam1_t::l_data`, which is an `int`.

The situation for VCF is a bit more complicated, and it may be possible to get very slightly over 2Gbytes as various limits apply to different parts of the record. The size of the sample data, which is likely to be the biggest part, is currently restricted to 2Gbytes by this check, and by the size of `bcf_fmt_t::p_off`, which is 31 bits. It might be possible to work around the `p_off` limit by abusing the ability of `bcf_fmt_t` to store edited data in a separate buffer, but doing so would be a bit hairy and would need a lot of thought and testing.
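
For reviewers unfamiliar with the short read/write issue mentioned for `fd_read()` and `fd_write()` above, here is a minimal sketch of the general technique, not the code in this PR. The helper names `write_fully()` and `read_fully()` are hypothetical. On Linux a single `read()` or `write()` transfers at most 0x7ffff000 bytes, so a request for more than 2Gbytes always comes back short and the caller has to loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not htslib's implementation): call write()
 * repeatedly until all len bytes are written or a real error occurs.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t write_fully(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR) continue;   /* interrupted, retry */
            return -1;
        }
        done += (size_t) n;                 /* may be a short write */
    }
    return (ssize_t) done;
}

/* Hypothetical helper: call read() repeatedly until len bytes have been
 * read, end of file is reached, or a real error occurs.
 * Returns the number of bytes read (less than len only at EOF), or -1. */
static ssize_t read_fully(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);
    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR) continue;   /* interrupted, retry */
            return -1;
        }
        if (n == 0) break;                  /* end of file */
        done += (size_t) n;                 /* may be a short read */
    }
    return (ssize_t) done;
}
```

Keeping the running total in a `size_t` rather than an `int` is what lets the count go past 2Gbytes on 64-bit platforms; the `ssize_t` return value still assumes `len` does not exceed `SSIZE_MAX`.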