Models
This module stores models for entities so they can be handled in the same way, independently of the format of the file they were read from. The most commonly used fields are explicitly specified, and at the same time the entities provide mechanisms for preserving all the information of a certain format. For a variant, the specified fields would be (among others) chromosome, position, reference and alternatives; if a VCF file is being stored, then columns such as INFO are also saved in a key-value data structure.
A variant is uniquely identified by the tuple (chromosome, start, reference allele, alternate allele). The full list of fields that model a variant is in the Variant.java file:
chromosome : string
start : integer
end : integer
length : integer
reference : string
alternate : string
hgvs : set of strings
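As an illustration of how these fields and the identifying tuple fit together, here is a minimal sketch. It is not the project's actual Variant.java: the class name `VariantSketch` and its constructor are hypothetical, and the only behaviour it demonstrates is that equality depends solely on (chromosome, start, reference, alternate).

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch, not the real Variant.java.
final class VariantSketch {
    final String chromosome;
    final int start;
    final int end;
    final int length;
    final String reference;
    final String alternate;
    final Set<String> hgvs = new LinkedHashSet<>();

    VariantSketch(String chromosome, int start, int end, int length,
                  String reference, String alternate) {
        this.chromosome = chromosome;
        this.start = start;
        this.end = end;
        this.length = length;
        this.reference = reference;
        this.alternate = alternate;
    }

    // Identity uses only the (chromosome, start, reference, alternate) tuple;
    // end, length and hgvs are deliberately ignored.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof VariantSketch)) return false;
        VariantSketch v = (VariantSketch) o;
        return start == v.start
                && chromosome.equals(v.chromosome)
                && reference.equals(v.reference)
                && alternate.equals(v.alternate);
    }

    @Override
    public int hashCode() {
        return Objects.hash(chromosome, start, reference, alternate);
    }
}
```

With this design, two objects built from different files compare as equal whenever they describe the same variant, regardless of any extra annotations they carry.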
File-specific information is stored in a separate attribute, whose class is implemented in the ArchivedVariantFile.java file. For a VCF file, this covers the FILTER, QUAL, INFO and FORMAT columns, as well as all the samples.
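To make the key-value idea concrete, the following sketch parses a VCF INFO column into a map. The class and method names (`InfoColumnSketch`, `parseInfo`) are hypothetical and do not come from ArchivedVariantFile.java; storing flags (keys without a value) as empty strings is an assumption made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the actual codebase.
final class InfoColumnSketch {

    // Splits an INFO column such as "NS=3;DP=14;DB" into key-value pairs.
    static Map<String, String> parseInfo(String info) {
        Map<String, String> attributes = new LinkedHashMap<>();
        // "." is the VCF placeholder for a missing value.
        if (info == null || info.isEmpty() || info.equals(".")) {
            return attributes;
        }
        for (String entry : info.split(";")) {
            if (entry.isEmpty()) {
                continue;
            }
            int eq = entry.indexOf('=');
            if (eq < 0) {
                // Flag without a value (assumption: stored as an empty string).
                attributes.put(entry, "");
            } else {
                attributes.put(entry.substring(0, eq), entry.substring(eq + 1));
            }
        }
        return attributes;
    }
}
```

A `LinkedHashMap` is used so that the keys keep the order in which they appeared in the file.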
There is an issue in how indels (insertions/deletions) are represented in VCF files: they do not contain only the bases that have been mutated. Let's illustrate this situation with two variants extracted from the 1000 Genomes dataset:
X 152301 . G GCA
X 173473 . TG T
We can see how bases that were not mutated are included: in the first record the G is unchanged and CA is inserted after it, and in the second the T is unchanged and only the G is deleted. This problem is aggravated by the presence of multi-allelic and co-located variants, because a single record can then mix SNVs and indels, as well as variants that do not start at the same position.
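One way to recover only the mutated bases is to strip the prefix shared by both alleles and shift the start position accordingly. The sketch below assumes that prefix trimming alone is enough, which holds for the two records above; the class `AlleleTrimSketch` is hypothetical and the project's real conversion may differ.

```java
// Hypothetical sketch: trims the bases shared at the start of both alleles.
final class AlleleTrimSketch {
    final int start;
    final String reference;
    final String alternate;

    AlleleTrimSketch(int start, String reference, String alternate) {
        this.start = start;
        this.reference = reference;
        this.alternate = alternate;
    }

    static AlleleTrimSketch trim(int position, String ref, String alt) {
        // Count the leading bases that are identical in both alleles.
        int shared = 0;
        int max = Math.min(ref.length(), alt.length());
        while (shared < max && ref.charAt(shared) == alt.charAt(shared)) {
            shared++;
        }
        // Drop the shared bases and move the start past them.
        return new AlleleTrimSketch(position + shared,
                ref.substring(shared), alt.substring(shared));
    }
}
```

For `X 152301 G GCA` this yields start 152302 with an empty reference and alternate `CA`; for `X 173473 TG T` it yields start 173474 with reference `G` and an empty alternate.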
It is straightforward to convert VCF records to our variant model when they have only one alternate allele, but what happens when multi-allelic variants are read? They are represented ambiguously in the VCF format, because they can be stored in one or several lines. There are also "co-located variants": overlapping variants that do not even have to start at the same position. Sometimes, however, their alleles are "formatted" in the VCF file so that they fit in a single line, at the same position.
Let's see an example from the VCF specification itself. It combines a deletion (TC) that starts at position 20:1234568 and an insertion (T) that starts at position 20:1234570, yet both are reported at position 20:1234567.
20 1234567 microsat1 GTC G,GTCT
But the specification would also allow them to be represented in two lines, like the following:
20 1234567 microsat1 GTC G 50
20 1234569 microsat1 C CT 50
This forces queries to depend on how the variant caller decided to write the variant to the file! Information must be stored in a homogeneous format, so we concluded that splitting a multi-allelic record into (reference, alternate) pairs was the only way to preserve all the information while fulfilling that condition. It is also important to note that alleles must be re-formatted to store only the bases that represent the real variant.
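The splitting step can be sketched as follows: each alternate allele of a record is paired with the reference, and each pair is then reduced to its mutated bases by removing the shared leading bases. The names `MultiAllelicSplitSketch` and `Pair` are hypothetical, and prefix-only trimming is an assumption that happens to suffice for the microsat1 example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one (reference, alternate) pair per alternate allele.
final class MultiAllelicSplitSketch {

    static final class Pair {
        final int start;
        final String reference;
        final String alternate;

        Pair(int start, String reference, String alternate) {
            this.start = start;
            this.reference = reference;
            this.alternate = alternate;
        }
    }

    // altColumn is the raw ALT column, e.g. "G,GTCT".
    static List<Pair> split(int position, String ref, String altColumn) {
        List<Pair> pairs = new ArrayList<>();
        for (String alt : altColumn.split(",")) {
            // Skip the leading bases common to the reference and this alternate.
            int shared = 0;
            int max = Math.min(ref.length(), alt.length());
            while (shared < max && ref.charAt(shared) == alt.charAt(shared)) {
                shared++;
            }
            pairs.add(new Pair(position + shared,
                    ref.substring(shared), alt.substring(shared)));
        }
        return pairs;
    }
}
```

Calling `split(1234567, "GTC", "G,GTCT")` produces two pairs: a deletion of `TC` starting at 1234568 and an insertion of `T` starting at 1234570, matching the description of the microsat1 record regardless of whether the file stored it in one line or two.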
Two different conversions must be run: the first one regarding the reference and alternate alleles, and the second one regarding the samples (genotypes and likelihoods).