Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "P" special Number character for FORMAT #442

Closed
wants to merge 1 commit into from

Conversation

dancooke
Copy link

Specifies that the number of values is the sample ploidy (i.e. the number of GT entries).

One possible example use case is to specify the frequency of alleles (or haplotypes):

...
##FORMAT=<ID=HF,Number=P,Type=Float,Description="Haplotype frequency">
...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA12878
1	100	.	A	C	.	.	SOMATIC	GT:PS:HF	0|0|1:100:0.4,0.4,0.2
1	150	.	G	T	.	.	.	GT:PS:HF	1|0|1:100:0.4,0.4,0.2
1	200	.	T	A	.	.	.	GT:PS:HF	1|0:200:0.5,0.5

Specifies that the number of values is the sample ploidy (i.e. the number of GT entries).
@hts-specs-bot
Copy link

Changed PDFs as of b717f40: VCFv4.3 (diff).

@jmmut
Copy link
Contributor

jmmut commented Aug 30, 2019

I agree that any field with "ploidy cardinality" at the moment needs Number=., so I'm not against this in principle, but shouldn't allele/haplotype frequencies be a summary of all the samples, and not a property of each sample? If so, it should go in the INFO, right?

And the problem with putting it in INFO is that not all samples may have the same ploidy, for instance human male/female samples on chrX.

Can we come up with other real-world use cases for this?

@dancooke
Copy link
Author

but shouldn't allele/haplotype frequencies be a summary of all the samples, and not a property of each sample? If so, it should go in the INFO, right?

The Allele/haplotype frequencies that I'm suggesting here are a property of the sample. This totally makes sense in the case of tumour samples (as suggested in the example that I gave) where the allele frequencies are usually unknown (e.g. due to subclonality, tumour impurity, copy-number variation), or other such samples where unknown allele frequencies are unknown and vary between samples (e.g. polyclonal bacterial or viral samples).

@lbergelson
Copy link
Member

This seems like a good idea although I'm not sure it will see a lot of use.

The example you gave is a bit strange because the numbers you specified would be better represented with an A count. I see the point about haplotype frequency in somatic samples though. Are there any tools that can generate that?

We do know of at least one additional use case as well. The proposed PSL and PSQ tags in #421 are haplotype specific phasing information and have this count but they didn't define a new ordering for it.

This would be a breaking change, existing parsers would not understand this. So I think it would probably have to go into 4.4 instead of 4.3.

@dancooke
Copy link
Author

@lbergelson Allele count doesn't work if you want to refer to each unique haplotype. For example, you might have something like

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NORMAL	TUMOUR
1	100	.	A	C,G	.	.	SOMATIC	GT:PS:HF	1|1:100:0.5,0.5	1|1|2:100:0.3,0.4,0.3
1	150	.	G	T	.	.	.	GT:PS:HF	1|0:100:0.5,0.5	1|0|1:100:0.3,0.4,0.3

Now there are three unique haplotypes in sample TUMOUR, but max two alt alleles.

Octopus generates genotypes like this for somatic samples and infers frequencies for each haplotype. Hence my interest in seeing this added.

@lbergelson
Copy link
Member

@dancooke Right. I agree allele doesn't work if you are referring to haplotypes. I was just saying the example wasn't that convincing because it seems like it could be better represented with just the allele information.

I'm in favor of this idea. We need a way of representing per haplotype information and supporting this is not a difficult change.

I'm just not certain if it should go into 4.3.

dancooke pushed a commit to luntergroup/octopus that referenced this pull request Aug 30, 2019
Rather than reporting a single statistic (or pair), report statistics for each called haplotype. This addresses the issue faced in issue #81.

TODO: update header rows once the VCF spec adds special Number "P" (see samtools/hts-specs#442).
@pd3
Copy link
Member

pd3 commented Sep 1, 2019

I agree that if added, it should go into the next version. I am not prinicipally opposed, but unsure if another special Number=P case is really needed as Number=. can serve the same purpose. I am wondering how widely is it going to be used? I'd suggest not to rush with addition of this and wait if it gains momentum before changing the specs.

@dancooke
Copy link
Author

dancooke commented Sep 1, 2019

@pd3 It won't gain momentum if it's not in the spec... I'd like to have Octopus generating VCFs using this but am currently forced to use .. I'd also speculate that the need to refer to haplotypes will only increase since more tools are adding phasing information to VCFs.

@pd3 pd3 added the vcf label Sep 1, 2019
@pd3
Copy link
Member

pd3 commented Sep 1, 2019

@dancooke It sometimes takes surprisingly long for libraries and programs to adopt new versions of VCF, in fact, some never do. I witness it firsthand, it's a problem for both users and developers.

To explain better what I meant: adopting a breaking feature like this may not be such a big win for Octopus because it may end up generating VCFs that other programs refuse to work with. There must be a strong support to implement the feature. On the other hand, if you use the Number=. convention in the meantime, nothing breaks and everyone is happy.

Hence my cautious support for this addition - if requests like this really become more frequent as you are predicting.

@dancooke
Copy link
Author

dancooke commented Sep 1, 2019

@pd3 I understand your concern, but I think if people get used to using . then that's what will stick (even though P is clearly better); changing it later will only be harder. I'm not expecting this to go into 4.3, but I don't see why it can't go into 4.4, which presumably has a bunch of other breaking changes.

@pd3
Copy link
Member

pd3 commented Sep 1, 2019

@dancooke My suggestion was to give it more time to get some feeling for how widely this would be used and what benefits Number=P brings compared to the existing Number=.. For now I don't think it's necessary to introduce yet another specifier, but I am happy to be convinced otherwise.

In general, the more unnecessary breaking changes are introduced, the less likely it is that the new version of the format gets adopted. In past I forced changes to BCF which made life more difficult for everyone, and although I still believe they were necessary, it certainly took a toll on adoption of the format.

@lh3
Copy link
Member

lh3 commented Sep 1, 2019

I agree with @pd3. It would be good to have the feature at the beginning of VCF, but adding it now brings more troubles than benefit. Furthermore, if I were an Octopus developer, I wouldn't consider "P" anyway. Doing that would only make most existing tools reject Octopus VCF.

@lbergelson
Copy link
Member

I think this is a simple enough change that there's no reason to keep it out of a new version of vcf due to difficulty of adoption. t would need to go in vcf 4.4 and as a developer I probably wouldn't start emitting it until 4.4 exists, but I think it would be worthwhile.

We haven't had much need for something like this because support for phasing has been very limited and spotty. We already have at least 1 other active specs PR that would benefit from this right now though. With new long read technology we're going to be seeing more tools generate phased data and needing a way to associate values with each haplotype so it seems likely that this will be increasingly used if we ratify it.

@pd3
Copy link
Member

pd3 commented Sep 7, 2019

Features shouldn't be added to the specification just because they are easy to add, but because they add new functionality with benefits. In this thread, I've been hoping to hear more about the latter.

So far, nothing has been said that couldn't be addressed with the existingNumber=.. The existing Number=R, A, G are needed to allow automatic processing of tags which depend on the number and the order of alleles. For example, merging VCFs, adding or removing samples, can change the number and order of ALT alleles. Thanks to this, programs can do the right thing.

I can imagine that with the proposed convention, phasing tools can automatically rearrange Number=P tags when the order of alleles in the genotype changes. Do you envisage operations which need to reorder alleles in Octopus output or change ploidy of genotypes?

@lbergelson
Copy link
Member

I agree that we shouldn't add features that have no use just because they are easy, but I think moderately useful features with low implementation cost should be a much easier sell than moderately useful features with very difficult implementation cost.
In this case, I have a pretty strong aversion to Number=., so to me reducing the number of fields that have that rely on it is a pretty good feature. It allows automated error checking of gross errors in these fields, as well as supporting automated haplotype reordering, filtering, etc.

@lh3
Copy link
Member

lh3 commented Sep 7, 2019

supporting automated haplotype reordering, filtering, etc.

This is a good point. However, phasing in the long read era will be very different from what we are thinking now. It is not worth breaking forward compatibility for hypothetical operations we haven’t implemented in htslib or htsjdk. I don’t hold a strong opinion on this, though, mostly because bcftools silently ignores unrecognized Number anyway.

@yfarjoun
Copy link
Contributor

yfarjoun commented Nov 3, 2019

The non-diploid phasing notation in #421 could use a P number-type....

NOT true...the length of the PSL is the number of phased alleles in the GT field...not the length of the GT field.

@lbergelson
Copy link
Member

@yfarjoun Would it make sense to have PSL correspond to the length of GT and explicitly specify . for pieces that aren't phased. That would make it simpler to parse and it could actually replace the need for changing the pipe notation to allow additional pipes. I.e. just check the phase set to see if it's phased.

@lbergelson lbergelson added this to the VCF v4.4 milestone Feb 14, 2020
@yfarjoun
Copy link
Contributor

yes...that's a good idea. moving it to that thread.

@d-cameron
Copy link
Contributor

If this feature is designed to support somatic samples, shouldn't we include support for subclone at the same time? Currently everything is designed around pure samples with integer copy numbers for everything.

@yfarjoun
Copy link
Contributor

yfarjoun commented Feb 26, 2020 via email

@d-cameron
Copy link
Contributor

Seems reasonable. We'd have to make CN a float instead of a int but that's forr aanother PR.

@LiterallyUniqueLogin
Copy link

I'm confused where this conversation stands, but I'd like to add another vote in favor of Number=P for format fields. I'm from the lab which supports the GangSTR tandem repeat caller. Currently we output the length of the corresponding allele for each called haplotype, and can't represent the number of integers in this format field appropriately without Number=P.

That use-case is for convenience, and could be handled by just Number=R fields and some lookups. However, we may want to emit a quality metric for each haplotype call. I don't see how the current system can support that without Number=. and the exact same thing, just by less-than-clear convention instead of specification.

Also, a lot of the talk on this page is with regards to somatic variation. The use-case we want this for is simpler: 2/1 calls of X chromosomes and 0/1 calls of Y chromosomes in women/men. That seems like a fundamental use-case that should be well supported.

@d-cameron
Copy link
Contributor

Also, a lot of the talk on this page is with regards to somatic variation. The use-case we want this for is simpler: 2/1 calls of X chromosomes and 0/1 calls of Y chromosomes in women/men. That seems like a fundamental use-case that should be well supported.

Unfortunately, even human sex chromosome ploidy is messier than you describe with ~0.5% of individuals neither XX nor XY. When running across thousands of samples, you need to be able to handle this. Similarly, if/when GangSTR decides to expand it's scope to call somatic STRs (since MSI is important in cancer genomics), then you'll likely consider somatic analysis to be a common use case.

A well designed specifications makes common uses cases easy, and uncommon use cases possible. If we don't consider all potential use cases when designing a feature, we end up in the situation in which yet another file format is required for the scenarios that VCF doesn't handle. We're in that situation right now for aneuploidy & subclonality - many CNV callers write their own custom file format.

@d-cameron
Copy link
Contributor

Number=P has been added to VCF 4.5

@d-cameron d-cameron closed this Apr 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants