# PhD Research topic based
## CNV calling
### Breakpoint detection
- Four CNV breakpoint detection methods (2021-07-17 group meeting)
    1. CHISEL: https://www.nature.com/articles/s41587-020-0661-6#Sec8 (see the global clustering subsection)
        - Seemingly no breakpoint detection; instead, a global clustering (i.e. entry-wise on a bin-by-cell matrix), so the CNV resolution is the bin size (5 Mb).
    2. Alleloscope: https://www.nature.com/articles/s41587-021-00911-w#Sec10 (see the segmentation subsection)
        - HMM on pooled cells (pseudo-bulk?) with pre-defined Gaussian means and variances for each state.
    3. InferCNV: https://github.com/broadinstitute/inferCNV/wiki/inferCNV-HMM-based-CNV-Prediction-Methods
        - The i6-HMM generates in silico spike-ins; it seemingly defines CNV regions (segments) per cluster rather than per cell, but applies the noise model to each cell (not entirely clear from the documentation).
    4. CopyKat: https://www.nature.com/articles/s41587-020-00795-2#Sec9
        - KS test to decide whether two neighbouring bins should be joined, using samples from the Gamma-Poisson posterior. Seemingly uses a noise model on each cell within a cluster.
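
The bin-joining idea noted for CopyKat can be sketched roughly as below. This is only an illustration of the general technique, not CopyKat's implementation: the function names, the Gamma(1, 1) prior, and the cut-off of 0.3 on the KS statistic (instead of a p-value) are all assumptions made for the example.

```python
import random
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def gamma_poisson_posterior_draws(counts, n_draws, rng, prior_shape=1.0, prior_rate=1.0):
    """Sample the Poisson rate of one bin from its conjugate Gamma posterior."""
    shape = prior_shape + sum(counts)   # posterior shape = prior shape + total count
    rate = prior_rate + len(counts)     # posterior rate = prior rate + number of observations
    return [rng.gammavariate(shape, 1.0 / rate) for _ in range(n_draws)]


def should_merge(counts_a, counts_b, rng, n_draws=200, threshold=0.3):
    """Join two neighbouring bins when their posterior samples look alike.

    The threshold on the KS statistic is a hypothetical choice for this sketch.
    """
    a = gamma_poisson_posterior_draws(counts_a, n_draws, rng)
    b = gamma_poisson_posterior_draws(counts_b, n_draws, rng)
    return ks_statistic(a, b) < threshold
```

Calling `should_merge` on two bins with similar counts should return `True`, while bins whose counts differ several-fold should stay separate, which is where a breakpoint would be placed.
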
FALCON
https://academic.oup.com/nar/article/43/4/e23/2410993
![Overview of HATCHet algorithm](./figs/CNV/HATCHet.jpg)
https://www.nature.com/articles/s41467-020-17967-y/figures/1
**a** HATCHet takes as input DNA sequencing data from multiple bulk tumor samples of the same patient and has five steps.
**b** First, HATCHet calculates the RDRs and BAFs in bins of the reference genome (black squares). Here, we show two tumor samples p and q.
**c** Second, HATCHet clusters the bins based on RDRs and BAFs globally along the entire genome and jointly across samples p and q. Each cluster (color) includes bins with the same copy-number state within each clone present in p or q.
**d** Third, HATCHet estimates two values for the fractional copy number of each cluster by scaling RDRs. If there is no WGD, the identification of the cluster (magenta) with copy-number state (1, 1) is sufficient and RDRs are scaled correspondingly.
If a WGD occurs, HATCHet identifies an additional cluster with identical copy-number state in all tumor clones. Dashed black horizontal lines in the scaled BAF-RDR plot represent values of fractional copy numbers that correspond to clonal CNAs.
**e** Fourth, HATCHet factors the allele-specific fractional copy numbers $F^A$, $F^B$ into the allele-specific copy numbers A, B, respectively, and the clone proportions U. Here, there is a normal clone and 3 tumor clones.
**f** Last, HATCHet’s model-selection criterion identifies the matrices A, B, and U in the factorization while evaluating the fit according to both the inferred number of clones and presence/absence of a WGD.
**g** HATCHet outputs allele- and clone-specific copy numbers (with the color of the corresponding clone) and clone proportions (in the top right part of each plot) for each sample.
Clusters are classified according to the inference of unique/different copy-number states in each sample (sample-clonal/subclonal) and across all tumor clones (tumor-clonal/subclonal).
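
The factorization in step **e** can be read in the forward direction: the fractional copy number of a cluster in a sample is the clone-proportion-weighted average of the clones' integer copy numbers. The sketch below shows only that forward model under this reading; it is not HATCHet's factorization algorithm, and the function name and toy matrices are made up for illustration.

```python
def fractional_copy_numbers(copy_numbers, proportions):
    """Compute F = C x U.

    copy_numbers: rows are bin clusters, columns are clones (integer copy numbers).
    proportions:  rows are clones, columns are samples (each column sums to 1).
    Returns a clusters-by-samples matrix of fractional copy numbers.
    """
    n_clones = len(proportions)
    n_samples = len(proportions[0])
    return [
        [sum(row[i] * proportions[i][p] for i in range(n_clones)) for p in range(n_samples)]
        for row in copy_numbers
    ]


# Toy example (hypothetical values): one normal clone and one tumor clone, two samples.
A = [[1, 1],   # cluster 1: one copy of allele A in both clones
     [1, 2]]   # cluster 2: the tumor clone gained a copy of allele A
U = [[0.5, 0.2],   # normal clone proportion in samples p and q
     [0.5, 0.8]]   # tumor clone proportion in samples p and q
F_A = fractional_copy_numbers(A, U)
```

In this toy case the second cluster has a fractional copy number of about 1.5 in sample p and about 1.8 in sample q, so the same clonal gain appears at different non-integer levels depending on tumor purity.
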
![Overview of chisel algorithm](./figs/CNV/chisel.jpg)
https://www.nature.com/articles/s41587-020-0661-6/figures/1
**a**, CHISEL computes RDRs and BAFs in low-coverage (<0.05× per cell) single-cell DNA sequencing data (top left). Read counts from 2,000 individual cells (rows) in 5-Mb genomic bins (columns) across three chromosomes (gray rectangles in first row) are shown.
For each bin in each cell, CHISEL computes the RDR (top) by normalizing the observed read counts. CHISEL computes the BAF in each bin and cell (bottom) by first performing reference-based phasing of germline SNPs in 50-kb haplotype blocks (magenta and green) and then phasing all these blocks jointly across all cells.
**b**, CHISEL clusters RDRs and BAFs globally along the genome and jointly across all cells resulting here in five clusters of genomic bins (red, blue, purple, yellow and gray) with distinct copy-number states.
**c**, CHISEL infers a pair $\{\hat{c}_t, \check{c}_t\}$ of allele-specific copy numbers for each cluster by determining whether the allele-specific copy numbers of the largest balanced (BAF of ~0.5) cluster are equal to {1, 1} (diploid), {2, 2} (tetraploid) or are higher ploidy.
**d**, CHISEL infers haplotype-specific copy numbers $(a_t, b_t)$ by phasing the allele-specific copy numbers $\{\hat{c}_t, \check{c}_t\}$ consistently across all cells.
**e**, CHISEL clusters tumor cells into clones according to their haplotype-specific copy numbers. Here, a diploid clone (light gray) and two tumor clones (red and blue) are obtained.
A phylogenetic tree describes the evolution of these clones. Somatic SNVs are derived from pseudo-bulk samples and placed on the branches of the tree.
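
The two per-bin signals in panel **a** can be illustrated with a small sketch. The caption only says the RDR is obtained by normalizing read counts; normalizing against a matched-normal reference, as done here, is an assumption, and both function names are hypothetical.

```python
def read_depth_ratios(cell_counts, normal_counts):
    """RDR per bin: the cell's share of reads in a bin divided by the normal reference's share.

    Assumes every bin has a non-zero count in the normal reference.
    """
    cell_total = sum(cell_counts)
    normal_total = sum(normal_counts)
    return [
        (c / cell_total) / (n / normal_total)
        for c, n in zip(cell_counts, normal_counts)
    ]


def b_allele_frequency(hap_a_reads, hap_b_reads):
    """BAF of one bin from phased read counts; None when no allelic reads cover the bin."""
    total = hap_a_reads + hap_b_reads
    if total == 0:
        return None
    return hap_b_reads / total
```

For example, `read_depth_ratios([10, 20, 10], [100, 100, 100])` gives a ratio near 1.5 for the middle bin and near 0.75 for the others, and `b_allele_frequency(30, 10)` gives 0.25. At the very low per-cell coverage mentioned in the caption, most single SNPs have no reads, which is why counts are aggregated over phased haplotype blocks before computing the BAF.
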
![Overview of TITAN local clustering for segmentation](./figs/CNV/TITAN.jpg)
![Overview of inferCNV i3HMM](./figs/CNV/infercnv_i3HMM_model.png)
![Overview of inferCNV i6HMM](./figs/CNV/infercnv_i6HMM_model.png)
![Overview of HoneyBADGER method](./figs/CNV/HoneyBADGER.png)
![Overview of Alleloscope algorithm](./figs/CNV/Alleloscope.jpg)
![Overview of copykat algorithm](./figs/CNV/copykat.jpg)
### Related research
**clonealign**: statistical integration of independent single-cell RNA and DNA sequencing data from human cancers
However, independently sampled single-cell measurements introduce a new analytical challenge of how to associate cells across each modality. Assuming a population structure with a fixed number of clones, this can be expressed as a mapping problem, whereby cells measured with transcriptome assays must be aligned to those measured with a genome assay.
![clonealign overview](./figs/CNV/clonealign.jpg)
In order to relate the independent measurements, we assume that an increase in the copy number of a gene will result in a corresponding increase in that gene's expression and vice versa (Fig. 1b), a relationship previously observed in joint RNA-DNA assays in bulk tissues [12] and at the single-cell level [9, 10, 13].
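
The dosage assumption above suggests a simple way to think about the mapping problem: score each cell's expression against each clone's copy-number profile and pick the best match. The sketch below is a deliberately simplified stand-in, not the clonealign model; the Poisson likelihood, the `base_rate` parameter, and the function names are assumptions made for illustration.

```python
import math


def poisson_log_likelihood(counts, expected):
    """Sum of Poisson log-probabilities of observed counts given expected means."""
    return sum(
        k * math.log(mu) - mu - math.lgamma(k + 1)
        for k, mu in zip(counts, expected)
    )


def assign_cell_to_clone(expression, clone_copy_numbers, base_rate=5.0):
    """Return the clone whose copy-number profile best explains one cell's gene counts.

    Expected expression is taken to be proportional to copy number (the dosage
    assumption); base_rate is a hypothetical per-copy expression level.
    Copy numbers must be at least 1 so that the expected mean stays positive.
    """
    scores = {}
    for clone, copy_numbers in clone_copy_numbers.items():
        expected = [base_rate * cn for cn in copy_numbers]
        scores[clone] = poisson_log_likelihood(expression, expected)
    return max(scores, key=scores.get)
```

With two hypothetical clones `{"A": [2, 2, 4], "B": [2, 4, 2]}`, a cell with counts `[10, 11, 19]` is assigned to clone A and a cell with counts `[9, 21, 10]` to clone B.
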
**CHISEL**
**inferCNV**
**copyKAT**
**CaSpER**
**HoneyBADGER**
### Smoothing strategies in RDR
- TODO
### PUBMON
[Precise identification of cancer cells from allelic imbalances in single cell transcriptomes](https://www.biorxiv.org/content/10.1101/2021.11.25.469995v1)
This paper uses BAF information to identify cancer cells; it seems quite relevant.
**2022-03-09**
[Evolutionary tracking of cancer haplotypes at single-cell resolution](https://www.biorxiv.org/content/10.1101/2021.06.04.447031v1.full)
[Haplotype-enhanced inference of somatic copy number profiles from single-cell transcriptomes](https://www.biorxiv.org/content/10.1101/2022.02.07.479314v1.full)
- **Numbat**
- Introduction
Existing approaches for CNV detection from scRNA-seq do not utilize the prior knowledge of haplotypes, or the individual-specific configuration of variant alleles on the two homologous chromosomes, which can enable more sensitive detection of allelic imbalance.
The utility of phasing in detecting CNV signals from scRNA-based assays, however, has not been explored.
We therefore developed a computational method, Numbat, which integrates expression, allele, and haplotype information derived from population-based phasing to comprehensively characterize the CNV landscape in single-cell transcriptomes.
Numbat does not require sample-matched DNA data or a priori genotyping, and is applicable to a wide range of experimental settings and cancer types.
- Results
**Enhanced detection of subclonal allelic imbalances using population-based haplotype phasing**
Prior phasing information can effectively amplify weak allelic imbalance signals of individual SNPs induced by the CNV, by exposing joint behavior of entire haplotype sequences and thereby increasing the statistical power.
The ability to infer phasing between genes is particularly useful for CNV inference, as it provides means to overcome stochastic allele-specific expression effects which give rise to bursts of gene-specific allelic imbalances in individual cells.
The differential phasing accuracy from within and between genes reflects the fact that the strength of genetic linkage decays with increasing distance (Supplementary Figure 1a).
To reflect the decay in phasing strength over longer genetic distances, we introduced site-specific transition probabilities between haplotype states in the Numbat allele HMM (see Methods).
**Accurate copy number inference from single-cell transcriptomes**
To increase robustness, Numbat models gene expression as integer read counts using a discrete Poisson Lognormal mixture distribution, and accounts for excess variance in the allele frequency (e.g. due to allele-specific detection or transcriptional bursts) using a Beta-Binomial distribution.
**Iterative strategy to decompose tumor clonal architecture**
**Reliable identification of cancer cells in the tumor microenvironment**
**Allele-specific CNV analysis reveals additional subclonal complexity**
**Unraveling the interplay between genetic and transcriptional heterogeneity in tumor evolution**
- Discussion
- Methods
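
Two ideas from the Numbat results noted above can be sketched numerically: pooling phased SNPs to amplify a weak allelic imbalance, and letting the phase-switch probability grow with genetic distance. This is a minimal sketch under stated assumptions, not Numbat's model: the Haldane-style switch function, the `theta` and `concentration` parameters, and all function names are hypothetical choices for the example.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k, n, alpha, beta):
    """Log-probability of k reads from one haplotype out of n, allowing overdispersion."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def switch_probability(distance_cm, max_prob=0.5):
    """Assumed phase-switch probability between two SNPs (Haldane map function).

    Close to 0 for tightly linked SNPs and approaching max_prob at long distances.
    """
    return max_prob * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def imbalance_log_likelihood_ratio(major_hap_counts, totals, theta=0.7, concentration=20.0):
    """Log-likelihood ratio of imbalanced (mean theta) vs balanced (mean 0.5) allele counts,
    summed over SNPs that have been phased onto the same haplotype."""
    llr = 0.0
    for k, n in zip(major_hap_counts, totals):
        imbalanced = beta_binomial_logpmf(k, n, theta * concentration, (1 - theta) * concentration)
        balanced = beta_binomial_logpmf(k, n, 0.5 * concentration, 0.5 * concentration)
        llr += imbalanced - balanced
    return llr
```

A single SNP with 3 of 4 reads on one haplotype gives only weak evidence of imbalance, but summing the log-likelihood ratio over several phased SNPs with similar skew gives a clearly larger value, which is the sense in which phasing increases statistical power.
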