forked from digitalcytometry/ecotyper
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
executable file
·1844 lines (1247 loc) · 128 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
output:
html_document:
toc: true
toc_float: true
github_document:
math_method:
engine: webtex
url: http://chart.apis.google.com/chart?cht=tx&chl=
pdf_document: default
word_document: default
md_document: default
---
```{r, setup, include=FALSE}
knitr::opts_knit$set(root.dir = '.')
knitr::opts_knit$set(fig.path = '.')
```
<p align="center">
<img width="300" src="utils/EcoTyper_Logo.png">
</p>
## Introduction
[EcoTyper](https://ecotyper.stanford.edu/) is a machine learning framework for large-scale identification of cell type-specific transcriptional states and their co-association patterns from bulk and single-cell (scRNA-seq) expression data.
We have already defined cell states and ecotypes across **carcinomas** ([Luca/Steen et al., Cell 2021](https://doi.org/10.1016/j.cell.2021.09.014)) and in **diffuse large B cell lymphoma (DLBCL)** ([Steen/Luca et al., Cancer Cell 2021](https://doi.org/10.1016/j.ccell.2021.08.011)). The current version of EcoTyper allows users to recover the cell states and ecotypes for these two tumor categories in their own data. Additionally, it allows users to discover and recover cell states and ecotypes in their system of interest, including **directly** from scRNA-seq data (see [Tutorial 5](#tutorial-5-de-novo-discovery-of-cell-states-and-ecotypes-in-scrna-seq-data)). Below we illustrate each of these functionalities.
## Citation
If EcoTyper software, data, and/or website are used in your publication, please cite the following paper(s):
* [Luca/Steen et al., Cell 2021](https://doi.org/10.1016/j.cell.2021.09.014) (detailed description of EcoTyper and application to carcinomas).
* [Steen/Luca et al., Cancer Cell 2021](https://doi.org/10.1016/j.ccell.2021.08.011) (application of EcoTyper to lymphoma).
## Setup
The latest version of EcoTyper source code can be found on [EcoTyper GitHub repository](https://github.com/digitalcytometry/ecotyper) and [Ecotyper website](https://ecotyper.stanford.edu/). To set up EcoTyper, please download this folder locally:
```{bash, eval = F}
git clone https://github.com/digitalcytometry/ecotyper
cd ecotyper
```
or:
```{bash, eval = F}
wget https://github.com/digitalcytometry/ecotyper/archive/refs/heads/master.zip
unzip master.zip
cd ecotyper-master
```
### Basic resources
The R packages listed below are required for running EcoTyper. The version numbers indicate the package versions used for developing and testing the EcoTyper code. Other R versions might work too:
* R (v3.6.0 and v4.1.0).
* R packages: ComplexHeatmap (v2.2.0 and v2.8.0), NMF (v0.21.0 and v0.23.0), RColorBrewer (v1.1.2), cluster (v2.1.0 and v2.1.2)), circlize (v0.4.10 and v0.4.12), cowplot (v1.1.0 and v1.1.1), data.table (base package R v3.6.0 and v4.1.0), doParallel (v1.0.15 and v1.0.16), ggplot2 (v3.3.2, v3.3.3), grid (base package R v3.6.0 and v4.1.0), reshape2 (v1.4.4), viridis (v0.5.1 and v0.6.1), config (v0.3.1), argparse (v2.0.3), colorspace (v1.4.1 and v2.0.1), plyr (v1.8.6), Biobase (v2.40.0).
These packages, together with the other resources pre-stored in the EcoTyper folder, allow users to:
* perform the recovery of previously defined cell states and ecotypes in their own bulk RNA-seq, microarray and scRNA-seq data (Tutorials 1 and 2).
* perform cell state and ecotype discovery in scRNA-seq and pre-sorted cell type-specific profiles (Tutorials 5 and 6).
Besides these packages, the additional resources described in the next section are needed for analyses described in Tutorials 3 and 4. Moreover, Mac users might need xquartz.
### Additional resources
For some use cases, such as cell state and ecotype recovery in spatial transcriptomics assays ([Tutorial 3](#tutorial-3-recovery-of-cell-states-and-ecotypes-in-spatial-transcriptomics-data)) and *de novo* identification of cell states and ecotypes from bulk expression data ([Tutorial 4](#tutorial-4-de-novo-discovery-of-cell-states-and-ecotypes-in-bulk-data)), EcoTyper relies on CIBERSORTx ([Newman et al., Nature Biotechnology 2019](https://www.nature.com/articles/s41587-019-0114-2), a digital cytometry framework for enumerating cell types in bulk data and performing *in silico* deconvolution of cell type specific expression profiles. In these situations, the following additional resources are needed for running EcoTyper:
* Docker or Singularity.
* CIBERSORTx executables than can be downloaded from the [CIBERSORTx website](https://cibersortx.stanford.edu/), as Docker images. Specifically, EcoTyper requires the **CIBERSORTx Fractions** and **CIBERSORTx HiRes** modules. Please follow the instructions on the [Download](https://cibersortx.stanford.edu/download.php) section of the website to download the Docker images and obtain the Docker tokens necessary for running them. If Singularity is used, the Docker images need to be [converted to Singularity Image Files (SIF)](https://sylabs.io/guides/3.0/user-guide/build_a_container.html).
## EcoTyper implementation
EcoTyper is a standalone software, implemented in R (not an R package). Some of the EcoTyper functions are computationally intensive, especially for the cell state discovery step described in Tutorials 4-6. Therefore, EcoTyper is designed as a collection of modular command-line R scripts, that can be run in parallel on a multi-processor server or a high-performance cluster. Each script is designed such that its instances can typically be run on a single core.
We provide wrappers over these scripts that encapsulate the typical EcoTyper workflows (*Tutorials 1-6*). These wrappers can be run on a multi-core system, and allow users to discover cell states and ecotypes in their own bulk, scRNA-seq and FACS-sorted data, as well as recover previously discovered cell states and ecotypes in bulk tissue expression profiles, spatial transcriptomics assays, and single-cell RNA-seq data.
## EcoTyper overview
EcoTyper performs two major types of analysis: discovery of cell states and ecotypes, starting from bulk, scRNA-seq and pre-sorted cell type specific expression profiles (e.g. FACS-sorted or deconvolved *in silico*); and recovery of previously defined cell states and ecotypes in new bulk, scRNA-seq and spatial transcriptomics data.
When the input is bulk data, EcoTyper performs the following major steps for discovering cell states and ecotypes:
* *In silico* purification: This step enables imputation of cell type-specific gene expression profiles from bulk tissue transcriptomes, using CIBERSORTx ([Newman et al., Nature Biotechnology 2019](https://www.nature.com/articles/s41587-019-0114-2)).
* Cell state discovery: This step enables identification and quantitation of cell type-specific transcriptional states.
* Ecotype discovery: This step enables co-assignment of cell states into multicellular communities (ecotypes).
When the input is scRNA-seq or bulk-sorted cell type-specific profiles (e.g., FACS-purified), EcoTyper performs the following major steps for discovering cell states and ecotypes:
* Gene filtering: This step filters out genes that do not show cell type specificity.
* Cell state discovery: This step enables identification and quantitation of cell type-specific transcriptional states.
* Ecotype discovery: This step enables co-assignment of cell states into multicellular communities (ecotypes).
Regardless of the input type used for deriving cell states and ecotypes, EcoTyper can perform cell state and ecotype recovery in external expression datasets. The recovery can be performed in bulk, scRNA-seq and spatial transcriptomics data.
An upcoming book chapter describing how to use EcoTyper in detail can be found [here](https://github.com/digitalcytometry/ecotyper/blob/master/EcoTyper_MiMB_Chapter_2022.pdf). Additionally, we provide below 6 tutorials illustrating these functionalities. The first three demonstrate how the recovery of cell states and ecotypes can be performed with various input types. The last three demonstrate how the recovery of cell states and ecotypes can be performed with various input types:
* [**Tutorial 1:** Recovery of Cell States and Ecotypes in User-Provided Bulk Data](#tutorial-1-recovery-of-cell-states-and-ecotypes-in-user-provided-bulk-data)
* [**Tutorial 2:** Recovery of Cell States and Ecotypes in User-Provided scRNA-seq Data](#tutorial-2-recovery-of-cell-states-and-ecotypes-in-user-provided-scrna-seq-data)
* [**Tutorial 3:** Recovery of Cell States and Ecotypes in Visium Spatial Gene Expression Data](#tutorial-3-recovery-of-cell-states-and-ecotypes-in-spatial-transcriptomics-data)
* [**Tutorial 4:** *De novo* Discovery of Cell States and Ecotypes in Bulk Expression Data](#tutorial-4-de-novo-discovery-of-cell-states-and-ecotypes-in-bulk-data)
* [**Tutorial 5:** *De novo* Discovery of Cell States and Ecotypes in **scRNA-seq Data**](#tutorial-5-de-novo-discovery-of-cell-states-and-ecotypes-in-scrna-seq-data)
* [**Tutorial 6.** *De novo* Discovery of Cell States and Ecotypes in Pre-Sorted Data](#tutorial-6-de-novo-discovery-of-cell-states-and-ecotypes-in-pre-sorted-data)
A schema of the tutorials is presented below:
```{r echo = F, message=T, out.width = "100%", fig.align = 'center'}
knitr::include_graphics("utils/schema.png")
```
## **Tutorial 1:** Recovery of Cell States and Ecotypes in User-Provided Bulk Data
EcoTyper comes pre-loaded with the resources necessary for the reference-guided recovery of cell states and ecotypes previously defined in carcinoma or lymphoma, in user-provided bulk expression data. In the [carcinoma EcoTyper paper](https://doi.org/10.1016/j.cell.2021.09.014), we demonstrate that prior deconvolution of bulk data using *CIBERSORTx HiRes* is not necessary for high-fidelity recovery of cell states in bulk-tissue expression data. We can proceed to the recovery of states based on bulk data only.
In this tutorial, we illustrate how EcoTyper can be used to recover the cell states and ecotypes that we defined across **carcinomas** and in **diffuse large B cell lymphoma** (DLBCL), in a set of the bulk samples from lung adenocarcinoma (LUAD) from TCGA and [bulk samples](https://www.nature.com/articles/s41591-018-0016-8) from diffuse large-cell lymphoma (DLBCL), respectively. Plese note that the recovery procedure described in this tutorial can also be applied on user-defined cell states and ecotypes, derived as described in Tutorials 4-6.
### 1.1. Recovery of **Carcinoma** Cell States and Ecotypes in Bulk Data
For this section, we used a subset of the TCGA bulk samples from lung adenocarcinoma (LUAD), available in `example_data/bulk_lung_data.txt`, together with the sample annotation file `example_data/bulk_lung_annotation.txt`.
The script used to perform recovery in bulk data is called `EcoTyper_recovery_bulk.R`:
```{bash}
Rscript EcoTyper_recovery_bulk.R -h
```
The script takes the following arguments:
* *-d*/*--discovery*: The name of the discovery dataset used for defining cell states. By default, the only accepted values are *Carcinoma* and *Lymphoma* (case sensitive), which will recover the cell states that we already defined across carcinomas and in lymphoma, respectively. If the user defined cell states in their own data (*Tutorials 4-6*), the name of the discovery dataset is the value provided in the *Discovery dataset name* field of the configuration file used for running cell state discovery. In our tutorial, the name of the discovery dataset is *Carcinoma*.
* *-m*/*--matrix*: Path to the input expression matrix. The expression matrix should be in the TPM or FPKM space for bulk RNA-seq and **non-logarithmic** (exponential) space for microarrays. It should have gene symbols on the first column and gene counts for each sample on the next columns. Column (sample) names should be unique. Also, we recommend that the column names do not contain special characters that are modified by the R function *make.names*, *e.g.* having digits at the beginning of the name or containing characters such as *space*, *tab* or *-*. The lung cancer scRNA-seq data used in this tutorial looks as follows:
```{r}
data = read.delim("example_data/bulk_lung_data.txt", nrow = 5)
head(data[,1:5])
```
* *-a*/*--annotation*: Path to a tab-delimited annotation file (not required). If provided, this file should contain a column called *ID* with the same values as the columns of the expression matrix. Additionally, this file can contain any number of columns, that can be used for plotting as color bars in the output heatmaps (see argument *-c*/*--columns*).
```{r}
data = read.delim("example_data/bulk_lung_annotation.txt")
head(data)
```
* *-c*/*--columns*: A comma-separated list of column names from the annotation file (see argument *-a*/*--annotation*) to be plotted as color bars in the output heatmaps. By default, the output heatmaps contain as color bar the cell state label each cell is assigned to. The column names indicated by this argument will be added to that color bar.
* *-t*/*--threads*: Number of threads. Default: 10.
* *-o*/*--output*: Output folder. The output folder will be created if it does not exist.
The command line for recovering the carcinoma cell states and ecotypes in the example bulk data is:
```{bash, eval = F}
Rscript EcoTyper_recovery_bulk.R -d Carcinoma -m example_data/bulk_lung_data.txt -a example_data/bulk_lung_annotation.txt -c Tissue -o RecoveryOutput
```
The output of this script for each cell type includes:
* The abundance (fraction) of each cell state in each sample:
```{r}
data = read.delim("RecoveryOutput/bulk_lung_data/Fibroblasts/state_abundances.txt")
head(data[,1:5])
```
* The assignment of samples to the state with highest abundance. If the cell state with the highest abundance is one of the cell states filtered by the automatic QC filters of EcoTyper, the sample is considered unassigned and filtered out from this table. For more information about the sample filtering procedure please see the *Cell state quality control* section of the [EcoTyper paper](https://doi.org/10.1016/j.cell.2021.09.014) methods:
```{r}
data = read.delim("RecoveryOutput/bulk_lung_data/Fibroblasts/state_assignment.txt")
head(data[,c("ID", "State")])
```
* Two heatmaps: the heatmap representing the expression of "marker" genes for each state (See Tutorial 4 for more details) in the discovery dataset and in the user-provided bulk dataset:
```{r echo = T, message=T, out.width = "100%", fig.align = 'center'}
knitr::include_graphics("RecoveryOutput/bulk_lung_data/Fibroblasts/state_assignment_heatmap.png")
```
* The expression matrix used for plotting the heatmap of user-provided bulk dataset:
```{r}
data = read.delim("RecoveryOutput/bulk_lung_data/Fibroblasts/heatmap_data.txt")
dim(data)
head(data[,1:5])
```
* The meta-information data frame used for plotting the color bar in the heatmap of user-provided bulk dataset:
```{r}
data = read.delim("RecoveryOutput/bulk_lung_data/Fibroblasts/heatmap_top_ann.txt")
dim(data)
head(data[,1:5])
```
The output for ecotypes includes:
* The abundance (fraction) of each ecotype in each sample:
```{r}
assign = read.delim("RecoveryOutput/bulk_lung_data/Ecotypes/ecotype_abundance.txt")
dim(assign)
head(assign[,1:5])
```
* The assignment of samples to the carcinoma ecotype with the highest abundance. If the cell state fractions from the dominant ecotype are not significantly higher than the other cell state fractions in a given sample, the sample is considered unassigned and filtered out from this table. For more information about the sample filtering procedure please see the *Ecotype discovery* section of the [EcoTyper paper](https://doi.org/10.1016/j.cell.2021.09.014) methods:
```{r}
discrete_assignments = read.delim("RecoveryOutput/bulk_lung_data/Ecotypes/ecotype_abundance.txt")
dim(discrete_assignments)
head(discrete_assignments[,1:5])
```
* A heatmap of cell state abundances across the samples assigned to ecotypes. Rows correspond to the cell states forming ecotypes, while columns correspond to the samples assigned to ecotypes:
```{r echo = T, message=T, out.width = "100%", fig.align = 'center'}
knitr::include_graphics("RecoveryOutput/bulk_lung_data/Ecotypes/heatmap_assigned_samples_viridis.png")
```
### 1.2. Recovery of **Lymphoma** Cell States and Ecotypes in Bulk Data
For this section, we used a subset of the [bulk samples](https://www.nature.com/articles/s41591-018-0016-8) from diffuse large-cell lymphoma (DLBCL), available in `example_data/bulk_lymphoma_data.txt`, together with the sample annotation file `example_data/bulk_lymphoma_annotation.txt`.
The script used to perform recovery in bulk data is called `EcoTyper_recovery_bulk.R`:
```{bash}
Rscript EcoTyper_recovery_bulk.R -h
```
The script takes the following arguments:
* *-d*/*--discovery*: The name of the discovery dataset used for defining cell states. By default, the only accepted values are *Carcinoma* and *Lymphoma* (case sensitive), which will recover the cell states that we already defined across carcinomas and in lymphoma, respectively. If the user defined cell states in their own data (*Tutorials 4-6*), the name of the discovery dataset is the value provided in the *Discovery dataset name* field of the configuration file used for running cell state discovery.
* *-m*/*--matrix*: Path to the input expression matrix. The expression matrix should be in the TPM or FPKM space for bulk RNA-seq and **non-logarithmic** (exponential) space for microarrays. It should have gene symbols on the first column and gene counts for each sample on the next columns. Column (sample) names should be unique. Also, we recommend that the column names do not contain special characters that are modified by the R function *make.names*, *e.g.* having digits at the beginning of the name or containing characters such as *space*, *tab* or *-*. The bulk data used in this tutorial looks as follows:
```{r}
data = read.delim("example_data/bulk_lymphoma_data.txt", nrow = 5)
head(data[,1:5])
```
* *-a*/*--annotation*: Path to a tab-delimited annotation file (not required). If provided, this file should contain a column called **ID** with the same values as the columns of the expression matrix. Additionally, this file can contain any number of columns, that can be used for plotting as color bars in the output heatmaps (see argument *-c*/*--columns*).
```{r}
data = read.delim("example_data/bulk_lymphoma_annotation.txt")
head(data)
```
* *-c*/*--columns*: A comma-separated list of column names from the annotation file (see argument *-a*/*--annotation*) to be plotted as color bars in the output heatmaps. By default, the output heatmaps contain as color bar the cell state label each cell is assigned to. The column names indicated by this argument will be added to that color bar.
* *-t*/*--threads*: Number of threads. Default: 10.
* *-o*/*--output*: Output folder. The output folder will be created if doesn't exist.
The command line for recovering the lymphoma cell states and ecotypes in the example bulk data is:
```{bash, eval = F}
Rscript EcoTyper_recovery_bulk.R -d Lymphoma -m example_data/bulk_lymphoma_data.txt -a example_data/bulk_lymphoma_annotation.txt -c schmitz_labels,COO -o RecoveryOutput
```
The output of this script for each cell type includes:
* The abundance (fraction) of each cell state in each sample:
```{r}
data = read.delim("RecoveryOutput/bulk_lymphoma_data/B.cells/state_abundances.txt")
head(data[,1:5])
```
* The assignment of samples to the state with highest abundance. If the cell state with the highest abundance is one of the cell states filtered by the automatic QC filters of EcoTyper, the sample is considered unassigned and filtered out from this table. For more information about the sample filtering procedure please see the *Cell state quality control* section of the [EcoTyper paper] (https://doi.org/10.1016/j.cell.2021.09.014) methods:
```{r}
data = read.delim("RecoveryOutput/bulk_lymphoma_data/B.cells/state_assignment.txt")
head(data[,c("ID", "State")])
```
* Two heatmaps: the heatmap representing the expression of "marker" genes for each state (See [Tutorial 3](#tutorial-3-recovery-of-cell-states-and-ecotypes-in-spatial-transcriptomics-data) for more details) in the discovery dataset and in the user-provided bulk dataset:
```{r echo = T, message=T, out.width = "100%", fig.align = 'center'}
knitr::include_graphics("RecoveryOutput/bulk_lymphoma_data/B.cells/state_assignment_heatmap.png")
```
* The expression matrix used for plotting the heatmap of user-provided bulk dataset:
```{r}
data = read.delim("RecoveryOutput/bulk_lymphoma_data/B.cells/heatmap_data.txt")
dim(data)
head(data[,1:5])
```
* The meta-information data frame used for plotting the color bar in the heatmap of user-provided bulk dataset:
```{r}
data = read.delim("RecoveryOutput/bulk_lymphoma_data/B.cells/heatmap_top_ann.txt")
dim(data)
head(data[,1:5])
```
The output for ecotypes includes:
* The abundance (fraction) of each ecotype in each sample:
```{r}
assign = read.delim("RecoveryOutput/bulk_lymphoma_data/Ecotypes/ecotype_abundance.txt")
dim(assign)
head(assign[,1:5])
```
* The assignment of samples to the lymphoma ecotype with the highest abundance. If the cell state fractions from the dominant ecotype are not significantly higher than the other cell state fractions in a given sample, the sample is considered unassigned and filtered out from this table. For more information about the sample filtering procedure please see the *Ecotype discovery* section of the [EcoTyper paper](https://doi.org/10.1016/j.cell.2021.09.014) methods:
```{r}
discrete_assignments = read.delim("RecoveryOutput/bulk_lymphoma_data/Ecotypes/ecotype_abundance.txt")
dim(discrete_assignments)
head(discrete_assignments[,1:5])
```
* A heatmap of cell state abundances across the samples assigned to ecotypes. Rows correspond to the cell states forming ecotypes, while columns correspond to the samples assigned to ecotypes:
```{r echo = T, message=T, out.width = "100%", fig.align = 'center'}
knitr::include_graphics("RecoveryOutput/bulk_lymphoma_data/Ecotypes/heatmap_assigned_samples_viridis.png")
```
## **Tutorial 2:** Recovery of Cell States and Ecotypes in User-Provided scRNA-seq Data
EcoTyper comes pre-loaded with the resources necessary for the reference-guided recovery of cell states and ecotypes previously defined in carcinoma and lymphoma, in user-provided scRNA-seq data.
In this tutorial, we illustrate how EcoTyper can be used to recover the cell states and ecotypes, that we defined across **carcinomas** and in **diffuse large B cell lymphoma** (DLBCL), in a downsampled version of a [scRNA-seq dataset](https://www.nature.com/articles/s41588-020-0636-z) from colorectal cancer specimens, and a downsampled version of a scRNA-seq dataset from lymphoma specimens, respectively. Plese note that the recovery procedure described in this tutorial can also be applied on user-defined cell states and ecotypes, derived as described in Tutorials 4-6.
### 2.1. Recovery of **Carcinoma** Cell States and Ecotypes in scRNA-seq Data
In this section we illustrate how carcinoma cell states can be recovered in a scRNA-seq dataset from colorectal cancer specimens. The expression data used in this tutorial can be found in `example_data/scRNA_CRC_data.txt`, and its corresponding sample annotation in `example_data/scRNA_CRC_annotation.txt`.
The script used to perform recovery in scRNA-seq data is `EcoTyper_recovery_scRNA.R`:
```{bash}
Rscript EcoTyper_recovery_scRNA.R -h
```
The script takes the following arguments:
* *-d*/*--discovery*: The name of the discovery dataset used for defining cell states. By default, the only accepted values are *Carcinoma* and *Lymphoma* (case sensitive), which will recover the cell states that we already defined across carcinomas and in lymphoma, respectively. If the user defined cell states in their own data (*Tutorial 4-6*), the name of the discovery dataset is the value provided in the *Discovery dataset name* field of the configuration file used for running cell state discovery. For this tutorial, we set the name of the discovery dataset to *Carcinoma*.
* *-m*/*--matrix*: Path to the input scRNA-seq matrix. The scRNA-seq expression matrix should be a tab-delimited file, with gene symbols on the first column and cells on the next columns. It should have cell identifiers (e.g. barcodes) as column names, and should be in TPM, CPM, FPKM or any other suitable count format. Gene symbols and cell identifiers should be unique. Moreover, we recommend that the column names do not contain special characters that are modified by the R function *make.names*, *e.g.* having digits at the beginning of the name or containing characters such as *space*, *tab* or *-*. The CRC cancer scRNA-seq data used in this tutorial looks as follows:
```{r}
data = read.delim("example_data/scRNA_CRC_data.txt", nrow = 5)
head(data[,1:5])
```
* *-a*/*--annotation*: Path to a tab-delimited annotation file. This file should contain at least two columns: *ID* with the same values as the columns of the expression matrix, and *CellType* (case sensitive) which contains the cell type for each cell. The values in column *CellType* should indicate for each cell its cell type. These values are limited to the set of cell types analyzed in the discovery dataset. If the argument *-d*/*--discovery* is set to *Carcinoma* (as is the case for this tutorial), then the accepted values for column *CellType* are: 'B.cells', 'CD4.T.cells', 'CD8.T.cells', 'Dendritic.cells', 'Endothelial.cells', 'Epithelial.cells', 'Fibroblasts', 'Mast.cells', 'Monocytes.and.Macrophages', 'NK.cells', 'PCs' and 'PMNs'. If the argument *-d*/*--discovery* is set to *Lymphoma*, then the accepted values for column *CellType* are: 'B.cells', 'Plasma.cells', 'T.cells.CD8', 'T.cells.CD4', 'T.cells.follicular.helper', 'Tregs', 'NK.cells', 'Monocytes.and.Macrophages', 'Dendritic.cells', 'Mast.cells', 'Neutrophils', 'Fibroblasts', 'Endothelial.cells'. All other values will be ignored for these two cases. The annotation file can contain a column called *Sample*. If this column is present, the ecotype recovery will be performed, in addition to cell state recovery. Moreover, this file can contain any number of columns, that can be used for plotting color bars in the output heatmaps (see argument *-c*/*--columns*).
```{r}
data = read.delim("example_data/scRNA_CRC_annotation.txt")
head(data)
```
* *-c*/*--columns*: A comma-separated list of column names from the annotation file (see argument *-a*/*--annotation*) to be plotted as color bars in the output heatmaps. By default, the output heatmaps contain as color bar the cell state label each cell is assigned to. The column names indicated by this argument will be added to that color bar.
* *-z*/*--z-score*: Flag indicating whether the significance quantification procedure should be run (default is *FALSE*). This procedure allows users to determine whether cell states are significantly recovered in a given dataset. Please note that this procedure can be very slow, as the NMF model is applied 30 times on the same dataset.
* *-s*/*--subsample*: An integer specifying the number of cells each cell type will be downsampled to. For values <50, no downsampling will be performed. Default: -1 (no downsampling).
* *-t*/*--threads*: Number of threads. Default: 10.
* *-o*/*--output*: Output folder. The output folder will be created if it does not exist.
The command line for recovering the carcinoma cell states in the example scRNA-seq data is:
```{bash, eval = F}
Rscript EcoTyper_recovery_scRNA.R -d Carcinoma -m example_data/scRNA_CRC_data.txt -a example_data/scRNA_CRC_annotation.txt -o RecoveryOutput
```
The outputs of this script include the following files, for each cell type provided:
* The assignment of single cells to states:
```{r}
data = read.delim("RecoveryOutput/scRNA_CRC_data/Fibroblasts/state_assignment.txt")
head(data[,c("ID", "State")])
```
* Two heatmaps: a heatmap representing the expression of cell state marker genes (see [Tutorial 4](#tutorial-4-de-novo-discovery-of-cell-states-and-ecotypes-in-bulk-data) for more details) in the discovery dataset, and a heatmap with the expression of the same marker genes in the scRNA-seq dataset, smoothed to mitigate the impact of scRNA-seq dropout:
```{r echo = T, message=T, out.width = "100%", fig.align = 'center'}
knitr::include_graphics("RecoveryOutput/scRNA_CRC_data/Fibroblasts/state_assignment_heatmap.png")
```
* The expression matrix used for plotting the heatmap of user-provided scRNA-seq dataset:
```{r}
data = read.delim("RecoveryOutput/scRNA_CRC_data/Fibroblasts/heatmap_data.txt")
dim(data)
head(data[,1:5])
```
* The meta-information data frame used for plotting the color bar in the heatmap of user-provided scRNA-seq dataset:
```{r}
data = read.delim("RecoveryOutput/scRNA_CRC_data/Fibroblasts/heatmap_top_ann.txt")
dim(data)
head(data[,1:5])
```
* If the statistical significance quantification method is applied, the resulting z-scores for each cell state are output in the same directory:
```{r}
#data = read.delim("RecoveryOutput/scRNA_CRC_data/Epithelial.cells/recovery_z_scores.txt")
#head(data[,c("State", "Z")])
```
The output for ecotypes includes:
* The abundance (fraction) of each ecotype in each sample:
```{r}
assign = read.delim("RecoveryOutput/scRNA_CRC_data/Ecotypes/ecotype_abundance.txt")
dim(assign)
head(assign[,1:5])
```
* The assignment of samples to the carcinoma ecotype with the highest abundance. If the cell state fractions from the dominant ecotype are not significantly higher than the other cell state fractions in a given sample, the sample is considered unassigned and filtered out from this table. For more information about the sample filtering procedure please see the *Ecotype discovery* section of the [EcoTyper paper](https://doi.org/10.1016/j.cell.2021.09.014) methods:
```{r}
discrete_assignments = read.delim("RecoveryOutput/scRNA_CRC_data/Ecotypes/ecotype_abundance.txt")
dim(discrete_assignments)
head(discrete_assignments[,1:5])
```
* A heatmap of cell state abundances across the samples assigned to ecotypes. Rows correspond to the cell states forming ecotypes, while columns correspond to the samples assigned to ecotypes:
```{r echo = T, message=T, out.width = "100%", fig.align = 'center'}
knitr::include_graphics("RecoveryOutput/scRNA_CRC_data/Ecotypes/heatmap_assigned_samples_viridis.png")
```
### 2.2. Recovery of **Lymphoma** Cell States and Ecotypes in scRNA-seq Data
In this section we illustrate how lymphoma cell states can be recovered in the scRNA-seq dataset from lymphoma specimens. The expression data used in this tutorial can be found in `example_data/scRNA_lymphoma_data.txt`, and sample annotation in `example_data/scRNA_lymphoma_annotation.txt`.
The script used to perform recovery in scRNA-seq data is called `EcoTyper_recovery_scRNA.R`:
```{bash}
Rscript EcoTyper_recovery_scRNA.R -h
```
The script takes the following arguments:
* *-d*/*--discovery*: The name of the discovery dataset used for defining cell states. By default, the only accepted values are *Carcinoma* and *Lymphoma* (case sensitive), which will recover the cell states that we defined in carcinoma and lymphoma, respectively. If the user defined cell states in their own data (**Tutorials 4-6**), the name of the discovery dataset is the value provided in the 'Discovery dataset name' filed of the configuration file used for running EcoTyper discovery ('EcoTyper_discovery_bulk.R') script. In our tutorial, the name of the discovery dataset is *Lymphoma*.
* *-m*/*--matrix*: Path to the input scRNA-seq matrix. The scRNA-seq expression matrix should be a tab-delimited file, with gene symbols on the first column and cells on the next columns. It should have cell identifiers (e.g. barcodes) as column names, and should be in TPM, CPM, FPKM or any other suitable count format. Gene symbols and cell identifiers should be unique. Moreover, we recommend that the column names do not contain special characters that are modified by the R function *make.names*, *e.g.* having digits at the beginning of the name or containing characters such as *space*, *tab* or *-*. The scRNA-seq data used in this tutorial looks as follows:
```{r}
data = read.delim("example_data/scRNA_lymphoma_data.txt", nrow = 5)
head(data[,1:5])
```
* *-a*/*--annotation*: Path to a tab-delimited annotation file. This file should contain at least two columns, **ID** with the same values as the columns of the expression matrix, and **CellType** which contains the cell type for each cell. The values in column **CellType** should be the same as the cell types analyzed in the discovery dataset. If the argument *-d*/*--discovery* is set to *Lymphoma* (as is the case for this tutorial), then the acceptable values for column **CellType** are: 'B.cells', 'Dendritic.cells', 'Endothelial.cells', 'Fibroblasts', 'Mast.cells', 'Monocytes.and.Macrophages', 'Neutrophils', 'NK.cells', 'Plasma.cells', 'T.cells.CD4', 'T.cells.CD8', 'T.cells.follicular.helper', and 'Tregs'. All the other values will be ignored. The annotation file can contain a column called *Sample*. If this column is present, the ecotype recovery will be performed, in addition to cell state recovery. Moreover, this file can contain any number of columns, that can be used for plotting color bars in the output heatmaps (see argument *-c*/*--columns*).
```{r}
data = read.delim("example_data/scRNA_lymphoma_annotation.txt")
head(data)
```
* *-c*/*--columns*: A comma-separated list of column names from the annotation file (see argument *-a*/*--annotation*) to be plotted as color bars in the output heatmaps. By default, the output heatmaps contain as color bar the cell state label each cell is assigned to. The column names indicated by this argument will be added to that color bar.
* *-z*/*--z-score*: Flag indicating whether the significance quantification procedure should be run (default is *FALSE*). This procedure allows users to determine whether cell states are significantly recovered in a given dataset. Please note that this procedure can be very slow, as the NMF model is applied 30 times on the same dataset.
* *-s*/*--subsample*: An integer specifying the number of cells each cell type will be downsampled to. For values <50, no downsampling will be performed. Default: -1 (no downsampling).
* *-t*/*--threads*: Number of threads. Default: 10.
* *-o*/*--output*: Output folder. The output folder will be created if it does not exist.
The command line for recovering the lymphoma cell states in the example scRNA-seq data is:
```{bash, eval = F}
Rscript EcoTyper_recovery_scRNA.R -d Lymphoma -m example_data/scRNA_lymphoma_data.txt -a example_data/scRNA_lymphoma_annotation.txt -o RecoveryOutput -c Tissue
```
The outputs of this script include the following files, for each cell type provided:
* The assignment of single cells to states:
```{r}
data = read.delim("RecoveryOutput/scRNA_lymphoma_data/B.cells/state_assignment.txt")
head(data[,c("ID", "State")])
```
* Two heatmaps: a heatmap representing the expression of cell state marker genes (see Tutorial 4 for more details) in the discovery dataset, and a heatmap with the expression of the same marker genes in the scRNA-seq dataset, smoothed to mitigate the impact of scRNA-seq dropout:
```{r echo = T, message=T, out.width = "100%", fig.align = 'center'}
knitr::include_graphics("RecoveryOutput/scRNA_lymphoma_data/B.cells/state_assignment_heatmap.png")
```
* The expression matrix used for plotting the heatmap of user-provided scRNA-seq dataset:
```{r}
data = read.delim("RecoveryOutput/scRNA_lymphoma_data/B.cells/heatmap_data.txt")
dim(data)
head(data[,1:5])
```
* The meta-information data frame used for plotting the color bar in the heatmap of user-provided scRNA-seq dataset:
```{r}
data = read.delim("RecoveryOutput/scRNA_lymphoma_data/B.cells/heatmap_top_ann.txt")
dim(data)
head(data[,1:5])
```
* If the statistical significance quantification method is applied, the resulting z-score for each cell state are output in the same directory:
```{r}
#data = read.delim("RecoveryOutput/scRNA_lymphoma_data/B.cells/recovery_z_scores.txt")
#head(data[,c("State", "Z")])
```
Since in this case the annotation file did not contain a column called *Sample*, ecotype recovery was not performed.
## **Tutorial 3:** Recovery of Cell States and Ecotypes in Spatial Transcriptomics data
EcoTyper comes pre-loaded with the resources necessary for the reference-guided recovery of cell states and ecotypes previously defined in carcinoma and lymphoma, in user-provided expression data. The recovery procedure described in this tutorial can also be applied on user-defined cell states and ecotypes, derived as described in Tutorials 4-6.
Here we illustrate how one can perform cell state and ecotype recovery in Visium Spatial Gene Expression arrays from [10x Genomics](https://www.10xgenomics.com/products/spatial-gene-expression). For this tutorial we recover cell states and ecotypes defined across carcinomas in whole transcriptome spatial transcriptomics data from [breast cancer](https://support.10xgenomics.com/spatial-gene-expression/datasets/1.1.0/V1_Breast_Cancer_Block_A_Section_1).
### 3.1. Checklist before performing cell states and ecotypes recovery in Visium data
In order for EcoTyper to perform cell states and ecotypes recovery in Visium data, the following resources need to be available:
* the filtered feature-barcode matrices `barcodes.tsv.gz`, `features.tsv.gz` and `matrix.mtx.gz`, in the format provided by [10x Genomics](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/output/matrices), and the `tissue_positions_list.csv` file produced by the [run summary images pipeline](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/output/images), containing the spatial position of barcodes.
* if the major cell populations expected in the system to be analyzed **are** recapitulated by the cell populations analyzed in the EcoTyper carcinoma paper (B cells, CD4 T cells, CD8 T cells, dendritic cells, endothelial cells, epithelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells, neutrophils), or the EcoTyper lymphoma paper (B cells, CD4 T cells, CD8 T cells, follicular helper T cells, Tregs, dendritic cells, endothelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells, neutrophils), then the user needs:
* Docker
* Docker containers for **CIBERSORTx Fractions** and **CIBERSORTx HiRes** modules, both of which can be obtained from the [CIBERSORTx website](https://cibersortx.stanford.edu/). Please follow the instructions from the website to install them.
* A token required for running the docker containers, which can also be obtained from the [CIBERSORTx website](https://cibersortx.stanford.edu/download.php).
* if the major cell populations expected in the system to be analyzed **are not** recapitulated by the cell populations analyzed in the EcoTyper carcinoma paper (B cells, CD4 T cells, CD8 T cells, dendritic cells, endothelial cells, epithelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells, neutrophils), or the EcoTyper lymphoma paper (B cells, CD4 T cells, CD8 T cells, follicular helper T cells, Tregs, dendritic cells, endothelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells, neutrophils), then the user needs to provide their own cell type proportion estimations for these populations (see more details below).
The script that does cell type and ecotype discovery is:
```{bash}
Rscript EcoTyper_recovery_visium.R -h
```
### 3.2. The configuration file
This script takes as input file a configuration file in [YAML](https://yaml.org/) format. The configuration file for this tutorial is available in `config_recovery_visium.yml`:
```{cat, class.source='yaml'}
default :
Input :
Discovery dataset name : "Carcinoma"
Recovery dataset name : "VisiumBreast"
Input Visium directory : "example_data/VisiumBreast"
#Path to a file containing the precomputed cell fractions for the visium array
Recovery cell type fractions : "NULL"
Background cell type : "Epithelial.cells"
CIBERSORTx username : "<Please use your username from the CIBERSORTx website>"
CIBERSORTx token : "<Please obtain a token from the CIBERSORTx website>"
Output :
Output folder : "VisiumOutput"
Pipeline settings :
Number of threads : 10
CIBERSORTx fractions Singularity path : NULL
```
The configuration file has three sections, *Input*, *Pipeline settings*, and *Output*. We next will describe the expected content in each of these sections, and instruct the user how to set the appropriate settings in their applications.
#### Input section
The *Input* section contains settings regarding the input data.
#### Discovery dataset name
```{cat, class.source='yaml'}
Discovery dataset name : "Carcinoma"
```
*Discovery dataset name* should contain the name of the discovery dataset used for defining cell states. By default, the only accepted values are *Carcinoma* and *Lymphoma* (case sensitive), which will recover the cell states that we defined across carcinomas and in lymphoma, respectively. If the user defined cell states in their own data (**Tutorials 4-6**), the name of the discovery dataset is the value provided in the *Discovery dataset name* field of the configuration file used for running discovery. For this tutorial, we set the name of the discovery dataset to *Carcinoma*.
#### Recovery dataset name
```{cat, class.source='yaml'}
Recovery dataset name : "VisiumBreast"
```
*Recovery dataset name* is the identifier used by EcoTyper to internally save and retrieve the information about the cell states/ecotypes abundances. Any value that contains alphanumeric characters and '_' is accepted for this field.
#### Input Visium directory
```{cat, class.source='yaml'}
Input Visium directory : "example_data/VisiumBreast"
```
There are 4 input files needed for recovery on the visium data:
```{r}
list.files("example_data/VisiumBreast")
```
The filtered feature-barcode matrices `barcodes.tsv.gz`, `features.tsv.gz` and `matrix.mtx.gz`, in the format provided by [10x Genomics](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/output/matrices), and the `tissue_positions_list.csv` file produced by the [run summary images pipeline](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/output/images), containing the spatial position of barcodes.
#### Recovery cell type fractions
```{cat, class.source='yaml'}
Recovery cell type fractions : "NULL"
```
*Recovery cell type fractions* should contain the path to a file containing the cell type fraction estimations for each spot on the visium array. This field is ignored when the discovery dataset is *Carcinoma* or *Lymphoma* or when the discovery has been performed as described in [*Tutorial 4*](#tutorial-4-de-novo-discovery-of-cell-states-and-ecotypes-in-bulk-data), using *Carcinoma_Fractions* or *Lymphoma_Fractions*. It is only used when users provided their own cell type fractions for deriving cell states and ecotypes in [*Tutorial 4*](#tutorial-4-de-novo-discovery-of-cell-states-and-ecotypes-in-bulk-data). In this case, the user needs to provide a path to a tab-delimited file for this field. The file should contain in the first column the same sample names used as column names in the input expression matrix, and in the next columns, the cell type fractions for the same cell populations used for discovering cell states and ecotypes. These fractions should sum up to 1 for each row. An example of such a file is provided in:
```{r}
data = read.delim("example_data/visium_fractions_example.txt", nrow = 5)
dim(data)
data
```
Since in this tutorial we use the *Carcinoma* dataset as the discovery dataset, this field is not required. However, if it needs to be provided, it can be set as follows:
```{cat, class.source='yaml'}
Recovery cell type fractions : "example_data/visium_fractions_example.txt"
```
#### Background cell type
```{cat, class.source='yaml'}
Background cell type : "Epithelial.cells"
```
The cell of origin population for the cancer type being analyzed, amongst the cell types used for discovery. This field is used for plotting a gray background in the resulting output plot, with the intensity of gray depicting the abundance of the cell of origin population in each spot. It is not used when the discovery dataset is *Carcinoma* or *Lymphoma* or when the discovery has been performed as described in *Tutorials 4-6*, using *Carcinoma_Fractions* or *Lymphoma_Fractions*. In these cases, the malignant cells are automatically considered to be originating from *Epithelial.cells* or *B.cells*, respectively. Otherwise, this field can be set to one of the column names in the file provided in *Recovery cell type fractions* field, corresponding to the appropriate cell type of origin. If this field is not provided, or if the value provided is not found in the column names of the fractions file, the background will be uniformly set to gray.
#### CIBERSORTx username and token
```{cat, class.source='yaml'}
CIBERSORTx username : "<Please use your username from the CIBERSORTx website>"
CIBERSORTx token : "<Please obtain a token from the CIBERSORTx website>"
```
The fields **CIBERSORTx username** and **CIBERSORTx token** should contain the username on the CIBERSORTx website and the token necessary to run the CIBERSORTx source code. The token can be obtained from the [CIBERSORTx website](https://cibersortx.stanford.edu/download.php).
#### The output section
The *Output* section contains a single field, *Output folder*, which specifies the path where the final output will be saved. This folder will be created if it does not exist.
```{cat, class.source='yaml'}
Output folder : "VisiumOutput"
```
#### Number of threads
The last section, *Pipeline settings*, contains only one argument, the number of threads used for performing recovery:
```{cat, class.source='yaml'}
Number of threads : 10
```
#### CIBERSORTx fractions Singularity path
```{cat, class.source='yaml'}
CIBERSORTx fractions Singularity path : NULL
```
The path to the Singularity container (a .SIF file) for the CIBERSORTx fractions module. If this path is provided, cell fraction estimation at step 2 will be performed using Sngularity. Otherwise it will be performed using Docker.
### 3.3. The command line
After editing the configuration file (`config_recovery_visium.yml`), the command line for recovering the cell states and ecotypes in Visium Spatial Gene Expression data looks as illustrated below. Please note that this script might take up to two hours to run on 10 threads. Also, since CIBERSORTx is run on each spot, the memory requirements might exceed the memory available on a typical laptop. We recommend that this tutorial is run on a server with >32GB of RAM.
```{r, eval = F}
Rscript EcoTyper_recovery_visium.R -c config_recovery_visium.yml
```
### 3.4. The output format
EcoTyper generates for each cell type the following outputs:
* Cell state abundances:
```{r}
data = read.delim("VisiumOutput/VisiumBreast/state_abundances.txt")
dim(data)
head(data[,1:10])
```
* Plots illustrating the cell state abundance across state from each cell type. The intensity of charcoal represents the cell state abundance. The intensity of gray represents the fraction of the cancer cell of origin population:
```{r echo = T, message=T, out.width = "100%", fig.align = 'center'}
knitr::include_graphics("VisiumOutput/VisiumBreast/Fibroblasts_spatial_heatmaps.png")
```
* Ecotype abundances:
```{r}
data = read.delim("VisiumOutput/VisiumBreast/ecotype_abundances.txt")
dim(data)
head(data[,1:10])
```
* Plots illustrating the ecotype abundances. The intensity of charcoal represents the cell state abundance. The intensity of gray represents the fraction of the cancer cell of origin population:
```{r echo = T, message=T, out.width = "100%", fig.align = 'center'}
knitr::include_graphics("VisiumOutput/VisiumBreast/Ecotype_spatial_heatmaps.png")
```
## **Tutorial 4.** *De novo* Discovery of Cell States and Ecotypes in Bulk Data
In this tutorial we illustrate how one can perform *de novo* identification of cell states and ecotypes, starting from a bulk-tissue expression matrix. For illustration purposes, we use as discovery dataset a downsampled version of the TCGA samples from lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), available in `example_data/bulk_lung_data.txt`, together with the sample annotation file `example_data/bulk_lung_annotation.txt`.
### 4.1. Overview of the EcoTyper workflow for discovering cell states
EcoTyper derives cell states and ecotypes in a sequence of steps:
1. **Cell type fraction estimation**: EcoTyper relies on cell abundance estimations of the major cell lineages expected to be present in the tissue analyzed, for each sample in the discovery dataset.
<br/>
One way of estimating cell type abundances in bulk tissue specimens is by using *CIBERSORTx Fractions* module. *CIBERSORTx Fractions* leverages sets of barcode genes, termed *signature matrix*, to estimate cell fractions. Complete tutorials about how signature matrices can be derived are available on the [CIBERSORTx website](https://cibersortx.stanford.edu/tutorial.php). In the EcoTyper carcinoma and lymphoma papers, we serially apply two signature matrices, to get a comprehensive representation of cell types typically found in these malignancies. We make these strategies automatically available in EcoTyper. If, however, the tissue/system being analyzed is expected to have different cell populations, then the user needs to estimate the appropriate fractions themselves (see details below).
2. **Cell type expression purification**: To impute cell type-specific gene expression profiles from bulk tissue transcriptomes, EcoTyper employs CIBERSORTx HiRes module. CIBERSORTx HiRes takes as input the bulk expression matrix of the discovery dataset and the fractions of the cell populations obtained at step 1. It produces cell-type specific expression profiles, at single-sample resolution, for each cell population.
3. **Cell state discovery**: EcoTyper leverages nonnegative matrix factorization (NMF) to identify transcriptionally-defined cell states from expression profiles purified by CIBERSORTx HiRes (step 2). Given c cell types, let $V_i$ be a $g×n$ cell type-specific expression matrix for cell type $i$ consisting of $g$ rows (the number of genes) and $n$ columns (the number of samples). The primary objective of NMF is to factorize $V_i$ into two non-negative matrices: a $g×k$ matrix, $W$, and a $k×n$ matrix, $H$, where $k$ is a user-specified rank (i.e., number of clusters). The basis matrix, $W$, encodes a representative expression level for each gene in each cell state. The mixture coefficients matrix $H$, scaled to sum to 1 across cell states, encodes the representation (relative abundance) of each cell state in each sample.
<br/>
EcoTyper applies NMF on the top 1000 genes with highest relative dispersion across samples. If less than 1000 genes are available, all genes are selected. If less than 50 genes are imputed for a given cell type, that cell type is not used for cell state identification. Prior to NMF, each gene is scaled to mean 0 and unit variance. To satisfy the non-negativity requirement of NMF, cell type-specific expression matrices are individually processed using *posneg* transformation. This function converts an input expression matrix $V_i$ into two matrices, one containing only positive values and the other containing only negative values with the sign inverted. These two matrices are subsequently concatenated to produce $V_i^*$.
<br/>
For each cell type, EcoTyper applies NMF across a range of ranks (number of cell states), by default 2-20 states. For each rank, the NMF algorithm is applied multiple times (we recommend at least 50) with different starting seeds, for robustness.
4. **Choosing the number of cell states**: Cluster (state) number selection is an important consideration in NMF applications. We found that previous approaches that rely on minimizing error measures (e.g., RMSE, KL divergence) or optimizing information-theoretic metrics either failed to converge or were dependent on the number of genes imputed. In contrast, the cophenetic coefficient quantifies the classification stability for a given rank (i.e., the number of clusters) and ranges from 0 to 1, with 1 being maximally stable. While the rank at which the cophenetic coefficient starts decreasing is typically selected, this approach is challenging to apply in situations where the cophenetic coefficient exhibits a multi-modal shape across ranks, as we found for some cell types. Therefore, we developed a heuristic approach more suitable for such settings. In each case, the rank was automatically chosen based on the cophenetic coefficient evaluated in the range 2–20 clusters (by default). Specifically, we determined the first occurrence in the interval 2–20 for which the cophenetic coefficient dropped below 0.95 (by default), having been above this level for at least two consecutive ranks. We then selected the rank immediately adjacent to this crossing point which was closest to 0.95 (by default).
5. **Extracting cell state information**: The NMF output resulting from step 4 is parsed and cell state information is extracted for the downstream analyses.
6. **Cell state QC filter**: Although posneg transformation is required to satisfy the non-negativity constraint of NMF following standardization, it can lead to the identification of spurious cell states driven by features with more negative values than positive ones. To combat this, we devised an adaptive false positive index (AFI), a novel index defined as the ratio between the sum of weights from the W matrix corresponding to the negative and positive features. EcoTyper automatically filters the states with $AFI >= 1$.
7. **Advanced cell state QC filter**: When the discovery dataset is comprised of multiple tumor types, we recommend using this advanced filter. This filter identifies poor-quality cell states using a dropout score, which flags states whose marker genes exhibit anomalously low variance and high expression across the discovery cohort, generally an artefact of CIBEROSRTx HiRes. To calculate the dropout score for each marker gene (i.e., genes with maximal log2 fold change in each state relative to other states within a given cell type), EcoTyper determines the maximum fraction of samples for which the gene has the same value. It also calculates the average log2 expression of the gene across samples. It averages each quantity, scaled to unit variance across states, within each state, converts them to z-scores, and removes states with a mean Z >1.96 (P < 0.05).
8. **Ecotype (cellular community) discovery**: *Ecotypes* or *cellular communities* are derived by identifying patterns of co-occurrence of cell states across samples. First, EcoTyper leverages the Jaccard index to quantify the degree of overlap between each pair of cell states across samples in the discovery cohort. Toward this end, it discretizes each cell state $q$ into a binary vector $a$ of length $l$, where $l$ denotes the number of samples in the discovery cohort. Collectively, these vectors comprise binary matrix $A$, with same number of rows as cell states across cell types and $l$ columns (samples). Given sample $s$, if state $q$ is the most abundant state among all states in cell type $i$, EcoTyper sets $A_(q,s)$ to 1; otherwise $A_(q,s) ← 0$. It then computes all pairwise Jaccard indices on the rows (states) in matrix $A$, yielding matrix $J$. Using the hypergeometric test, it evaluates the null hypothesis that any given pair of cell states $q$ and $k$ have no overlap. In cases where the hypergeometric p-value is >0.01, the Jaccard index for $J_(q,k)$ is set to 0 (i.e., no overlap). To identify communities while accommodating outliers, the updated Jaccard matrix $J^'$ is hierarchically clustered using average linkage with Euclidean distance (hclust in the R stats package). The optimal number of clusters is then determined via silhouette width maximization. Clusters with less than 3 cell states are eliminated from further analysis.
### 4.2. Checklist before performing cell states and ecotypes discovery
In order for EcoTyper to perform cell states and ecotypes discovery, the following resources need to be available:
* docker containers for **CIBERSORTx Fractions** and **CIBERSORTx HiRes** modules, both of which can be obtained from the [CIBERSORTx website](https://cibersortx.stanford.edu/). Please follow the instructions from the website to install them.
* a token required for running the docker containers, which can also be obtained from the [CIBERSORTx website](https://cibersortx.stanford.edu/download.php).
* a user-provided bulk tissue expression matrix (RNA-seq or microarray), on which the discovery will be performed (a discovery cohort). For this tutorial, we will use the example data in `example_data/bulk_lung_data.txt`.
* if the major cell populations expected in the system to be analyzed are **not** recapitulated by the cell populations analyzed in the EcoTyper carcinoma paper (B cells, CD4 T cells, CD8 T cells, dendritic cells, endothelial cells, epithelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells, neutrophils), or the EcoTyper lymphoma paper (B cells, CD4 T cells, CD8 T cells, follicular helper T cells, Tregs, dendritic cells, endothelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells, neutrophils), then the user needs to provide their own cell type proportion estimations for these populations (see more details below).
* optionally, a sample annotation file, such as the one provided in `example_data/bulk_lung_annotation.txt`, can be supplied to EcoTyper. The information in this file can be used for heatmap plotting purposes, and also to instruct EcoTyper to find cell states/ecotypes common across different biological batches (e.g. tumor types), as detailed below.
### 4.3. Cell states and ecotypes discovery
The script that does cell type and ecotype discovery is:
```{bash}
Rscript EcoTyper_discovery_bulk.R -h
```
This script takes as input file a configuration file in [YAML](https://yaml.org/) format. The configuration file for this tutorial is available in `config_discovery_bulk.yml`:
```{cat, class.source='yaml'}
default :
Input :
Discovery dataset name : "MyDiscovery"
Expression matrix : "example_data/bulk_lung_data.txt"
#Possible values: "Carcinoma_Fractions", "Lymphoma_Fractions" or a path to a file containing the precomputed cell fractions
Cell type fractions : "Carcinoma_Fractions"
#Possible values: "RNA-seq", "Affymetrix", "Other"
Expression type : "RNA-seq"
#This field can also be set to "NULL"
Annotation file : "example_data/bulk_lung_annotation.txt"
#This field can also be set to "NULL"
Annotation file column to scale by : "Histology"
#This field can also be set to "NULL"
Annotation file column(s) to plot : ["Histology", "Tissue"]
CIBERSORTx username : "<Please use your username from the CIBERSORTx website>"
CIBERSORTx token : "<Please obtain a token from the CIBERSORTx website>"
Output :
Output folder : "DiscoveryOutput"
Pipeline settings :
#Pipeline steps:
# step 1 (cell type fraction estimation)
# step 2 (cell type expression purification)
# step 3 (cell state discovery)
# step 4 (choosing the number of cell states)
# step 5 (extracting cell state information)
# step 6 (cell state QC filter)
# step 7 (advanced cell state QC filter)
# step 8 (ecotype discovery)
Pipeline steps to skip : [7] # by default, step 7 is skipped
Number of threads : 10
Number of NMF restarts : 5
Maximum number of states per cell type : 20
Cophenetic coefficient cutoff : 0.95
CIBERSORTx fractions Singularity path : NULL
CIBERSORTx hires Singularity path : NULL
Minimum number of states in ecotypes : 3
```
The configuration file has three sections, *Input*, *Output* and *Pipeline settings*. We next will describe the expected content in each of these three sections, and instruct the user how to set the appropriate settings in their applications.
#### Input section
The *Input* section contains settings regarding the input data.
#### Discovery dataset name
*Discovery dataset name* is the identifier used by EcoTyper to internally save and retrieve the information about the cell states/ecotypes defined on this discovery dataset. It is also the name to be provided to the *-d*/*--discovery* argument of scripts `EcoTyper_recovery_scRNA.R` and `EcoTyper_recovery_bulk.R`, when performing cell state/ecotypes recovery. Any value that contains alphanumeric characters and '_' is accepted for this field.
```{cat, class.source='yaml'}
Discovery dataset name : "MyDiscovery"
```
#### Expression matrix
```{cat, class.source='yaml'}
Expression matrix : "example_data/bulk_lung_data.txt"
```
*Expression matrix* field should contain the path to a tab-delimited file containing the expression data, with genes as rows and samples as columns. The expression matrix should be in the TPM or FPKM space for bulk RNA-seq and **non-logarithmic** (exponential) space for microarrays. It should have gene symbols on the first column and gene counts for each sample on the next columns. Column (sample) names should be unique. Also, we recommend that the column names do not contain special characters that are modified by the R function *make.names*, *e.g.* having digits at the beginning of the name or containing characters such as *space*, *tab* or *-*:
The expected format for the expression matrix is:
```{r}
data = read.delim("example_data/bulk_lung_data.txt", nrow = 5)
dim(data)
head(data[,1:5])
```
#### Cell type fractions
*Cell type fractions* field instructs EcoTyper on how to compute cell type fractions on the discovery dataset:
```{cat, class.source='yaml'}
#Possible values: "Carcinoma_Fractions", "Lymphoma_Fractions" or a path to a file containing the precomputed cell fractions
Cell type fractions : "Carcinoma_Fractions"
```
If the major cell populations expected in the user-provided discovery dataset are recapitulated by the cell populations analyzed in the EcoTyper carcinoma paper (B cells, CD4 T cells, CD8 T cells, dendritic cells, endothelial cells, epithelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells and neutrophils), then this field can be set to **Carcinoma_Fractions** (case sensitive), and EcoTyper will automatically estimate fractions for these populations, in step 1 of the workflow. Similarly, if the cell populations analyzed in the EcoTyper lymphoma paper (B cells , CD4 T cells, CD8 T cells, follicular helper T cells, Tregs, dendritic cells, endothelial cells, fibroblasts, mast cells, monocytes/macrophages, NK cells, plasma cells and neutrophils) are appropriate, then the user can set this field to **Lymphoma_Fractions** and cell fractions will be automatically calculated.
In each of these cases the fractions are being estimated by serially applying two signature matrices on the discovery dataset. The first signature matrix, denoted TR4, is available in `utils/signature_matrices/TR4/TR4`. TR4 was obtained from FACS-sorted profiles of epithelial cells (EPCAM+), fibroblasts (CD10+), endothelial cells (CD31+) and immune cells (CD45+), obtained from lung cancer specimens ([Newman et al., Nature Biotechnology 2019](https://www.nature.com/articles/s41587-019-0114-2)). The second signature matrix, LM22, available in `utils/signature_matrices/LM22/LM22`, was published with [Newman et al., Nature Methods 2015](https://www.nature.com/articles/nmeth.3337), and is able to deconvolve 22 immune subsets. In the EcoTyper carcinoma paper, we first collapse the fractions for 22 subsets to obtain the representation of the 9 major cell types (B cells, plasma cells, CD4 T cells, CD8 T cells, NK cells, monocytes/macrophages, dendritic cells, mast cells, and neutrophils). We then replace the TR4 immune cell fractions with the fractions of the 9 cell lineages. This way we obtain cell abundance estimations for 12 cell populations used in that paper. An analogous process is used to obtain the lymphoma fractions.
If neither of these cases apply, the user needs to provide a path to a tab-delimited file containing the cell type proportion estimations for the expected populations. The file should contain in the first column the same sample names used for column names in the input expression matrix, and in the next columns, the cell type fractions for each cell population. These fractions should sum up to 1 for each row. An example of such a file is provided in:
```{r}
data = read.delim("example_data/bulk_fractions_example.txt", nrow = 5)
dim(data)
data
```
This path can provided in the configuration file as follows:
```{cat, class.source='yaml'}
#Possible values: "Carcinoma_Fractions", "Lymphoma_Fractions" or a path to a file containing the precomputed cell fractions
Cell type fractions : "example_data/bulk_fractions_example.txt"
```
#### Expression type
```{cat, class.source='yaml'}
#Possible values: "RNA-seq", "Affymetrix", "Other"
Expression type : "RNA-seq"
```
*Expression type* field specifies the platform used to generate the data provided in the expression matrix. The accepted values are **RNA-seq** for bulk RNA-seq data, **Affymetrix** for data profiled using Affymetrix microarray platforms, and **Other** for data from non-Affymetrix microarray platform. This argument is relevant only if the cell type fractions are being estimated automatically by EcoTyper (i.e. values **Carcinoma_Fractions** or **Lymphoma_Fractions** are being provided in the field **Cell type fractions** of the configuration file, as described above). Based on this field, EcoTyper determines the appropriate parameters for the CIBERSORTx fractions module, when estimating cell type fractions using the TR4 and LM22 signatures (see above). If **RNA-seq** is provided, CIBERSORTx with no batch correction is applied on TR4 and with B-mode batch correction on LM22. If **Affymetrix** is provided CIBERSORTx fractions with B-mode batch correction is applied on TR4 and with no batch correction on LM22. If **Other** is provided, CIBERSORTx fractions with B-mode batch correction is applied on both signatures
#### Annotation file
```{cat, class.source='yaml'}
Annotation file : "example_data/bulk_lung_annotation.txt"
```
A path to an annotation file can be provided in the field *Annotation file*. If provided, this file should contain a column called **ID** with the same names as the columns of the expression matrix, and any number of additional columns. The additional columns can be used for defining sample batches (see Section *Annotation file column to scale by* below) and for plotting color bars in the heatmaps output (see Section *Annotation file column(s) to plot* below). If not provided, this field needs to be set to **"NULL"**. For the current example, the annotation file has the following format:
```{r}
annotation = read.delim("example_data/bulk_lung_annotation.txt", nrow = 5)
dim(annotation)
head(annotation)
```
#### Annotation file column to scale by
```{cat, class.source='yaml'}
Annotation file column to scale by : "Histology"
```
In order to discover pan-carcinoma cell states and ecotypes in the EcoType carcinoma paper, we standardize genes to mean 0 and unit 1 *within* each tumor type (histology). Field *Annotation file column to scale by* allows users to specify a column name in the annotation file, by which the samples will be grouped when performing standardization. The example discovery dataset used in this tutorial has samples from lung adenocarcinoma and lung squamous cell carcinoma. Therefore, for this tutorial we will use the **Histology** column to perform standardization.
However, this is an analytical choice, depending on the purpose of the analysis. If the users are interested in defining cell states and ecotypes regardless of tumor type-specificity, this argument can be set to **"NULL"**. In this case, the standardization will be applied across all samples in the discovery cohort. The same will happen if the annotation file is not provided.
#### Annotation file column(s) to plot
```{cat, class.source='yaml'}
Annotation file column(s) to plot : ["Histology", "Tissue"]
```
*Annotation file column(s) to plot* field specifies which columns in the annotation file will be used as color bar in the output heatmaps, in addition to the cell state label or ecotype label column, plotted by default.
#### CIBERSORTx username and token
```{cat, class.source='yaml'}
CIBERSORTx username : "<Please use your username from the CIBERSORTx website>"
CIBERSORTx token : "<Please obtain a token from the CIBERSORTx website>"
```
The fields **CIBERSORTx username** and **CIBERSORTx token** should contain the username on the CIBERSORTx website and the token necessary to run the CIBERSORTx source code. The token can be obtained from the [CIBERSORTx website](https://cibersortx.stanford.edu/getoken.php).
#### The output section
The *Output* section contains a single field, *Output folder*, which specifies the path where the final output will be saved. This folder will be created if it does not exist.
```{cat, class.source='yaml'}
Output folder : "DiscoveryOutput"
```
#### Pipeline settings
The last section, *Pipeline settings*, contains settings controlling how EcoTyper is run.
#### Pipeline steps to skip
```{cat, class.source='yaml'}
#Pipeline steps:
# step 1 (cell type fraction estimation)
# step 2 (cell type expression purification)
# step 3 (cell state discovery)
# step 4 (choosing the number of cell states)
# step 5 (extracting cell state information)
# step 6 (cell state QC filter)
# step 7 (advanced cell state QC filter)
# step 8 (ecotype discovery)
Pipeline steps to skip : [7] # by default, step 7 is skipped
```
The *Pipeline steps to skip* option allows user to skip some of the steps outlined in section *Overview of the EcoTyper workflow for discovering cell states*. Please note that this option is only intended for cases when the pipeline had already been run once, and small adjustments are made to the parameters. For example, if the Cophenetic coefficient cutoff used in step 3 needs adjusting, the user might want to skip steps 1-2 and re-run from step 3 onwards.
#### Number of threads
```{cat, class.source='yaml'}
Number of threads : 10
```
The number of threads EcoTyper will be run on.
#### Number of NMF restarts
```{cat, class.source='yaml'}
Number of NMF restarts : 5
```
The NMF approach used by EcoTyper (Brunet et al.), can give slightly different results, depending on the random initialization of the algorithm. To obtain a stable solution, NMF is generally run multiple times with different seeds, and the solution that best explains the discovery data is chosen. Additionally, the variation of NMF solutions across restarts with different seeds is quantified using Cophenetic coefficients and used in step 4 of EcoTyper for selecting the number of states. The parameter *Number of NMF restarts* specifies how many restarts with different seed should EcoTyper perform for each rank selection, in each cell type. Since this is a very time consuming process, in this example we only use 5. **However, for publication-quality results, we recommend at least 50 restarts**.
#### Maximum number of states per cell type
```{cat, class.source='yaml'}
Maximum number of states per cell type : 20
```
**Maximum number of states per cell type** specifies the upper end of the range for the number of states possible for each cell type. The lower end is 2.
#### Cophenetic coefficient cutoff
```{cat, class.source='yaml'}
Cophenetic coefficient cutoff : 0.95
```
This field indicates the Cophenetic coefficient cutoff, in the range [0, 1], used for automatically determining the number of states in step 4. Lower values generally lead to more clusters being identified.
#### CIBERSORTx fractions Singularity path
```{cat, class.source='yaml'}
CIBERSORTx fractions Singularity path : NULL
```
The path to the Singularity container (a .SIF file) for the CIBERSORTx fractions module. If this path is provided, cell fraction estimation at step 1 will be performed using Sngularity. Otherwise it will be performed using Docker.
#### CIBERSORTx hires Singularity path
```{cat, class.source='yaml'}
CIBERSORTx hires Singularity path : NULL
```
The path to the Singularity container (a .SIF file) for the CIBERSORTx HiRes module. If this path is provided, in silico cell purification at stepp 2 will be performed using Sngularity. Otherwise it will be performed using Docker.