-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Giotto Object to STdeconvolve, and general questions about workflow, choosing K, cell type annotation, etc #35
Comments
Hi @chrkuo, Thanks for reaching out and your interest in using
You can first create a
where Then you can extract the counts matrix:
I'm sure Thanks! |
Thank you so much. I spent the whole morning going through the documentation of for 10x and it's been quite helpful but I ran into an error with a few questions.
data_path_EWS4512_h5 = "/Users/chrkuo/Library/CloudStorage/OneDrive-ChildrensHospitalLosAngeles/Chris/Visium\ Data/EWS4512/outs/raw_feature_bc_matrix.h5" data_path_EWS4512_tissue_positions = "/Users/chrkuo/Library/CloudStorage/OneDrive-ChildrensHospitalLosAngeles/Chris/Visium\ Data/EWS4512/outs/spatial/tissue_positions_list.csv" data_path_EWS4512_image = "/Users/chrkuo/Library/CloudStorage/OneDrive-ChildrensHospitalLosAngeles/Chris/Visium\ Data/EWS4512/outs/spatial/tissue_hires_image.png" data_path_EWS4512_scalefactors = "/Users/chrkuo/Library/CloudStorage/OneDrive-ChildrensHospitalLosAngeles/Chris/Visium\ Data/EWS4512/outs/spatial/scalefactors_json.json" create Giotto objects#object has data in data frames - in columns and rows I would appreciate it if there's a way to somehow utilize the Giotto object to run STDeconvolve?
I'm wondering if there's a way to make it the same size or bigger or perhaps just see first 4 genes?
right after trying to create the data.frame using the louvian clustering. #Let’s visualize the communities in terms of the spatial positions of the barcodes: dat <- data.frame("emb1" = pos[,"x"],
Thank you so much for your time and assistance and I am a clinician just getting into bioinformatics so i sincerely apologize if a lot of these questions are basic code questions that I just don't know the answers to. |
Hi @chrkuo, It looks like you have made some progress with As for your other questions, I'll try to get to them tomorrow, but given that you were able to deconvolve cell types, it seems that loading the data into In terms of the initial loading of data, you want the unnormalized/unfiltered data, which is non-negative integer counts of genes in the barcodes. The reason why STdeconvolve requires non-negative integers basically boils down to the fact that latent Dirichlet allocation requires frequency counts of words or terms, specified as a matrix of nonnegative integers. In this case, our terms are genes, but the same idea holds. LDA looks for terms that form clusters in each barcode or capture location but the read depths don't need to be normalized as one might do for scRNA-seq data via CPM normalization, for example. Will look into the plotting and other errors tomorrow. Hope this helps for now, |
@bmill3r thank you so much. I hope it's okay that I add a few more questions in the mean time. Aside from filtering via pixel are there any additional methods to filter? Is there a way to graph out the fitted model K's vs. deconvolved cell types and perplexity so I can select the best K? It wasn't in the documentation My current data has quite high perplexity although the # of cell type on the left has been minimal (I tested various Ks up to 30 and seems like the perplexity ranges around 300-400). Is this just the inherent nature of tumor? Does this reflect poor quality? It seems like the pixel positions object (pos) has 1437 rows and the emb object has 1375. Is this discrepancy due to the fact that some of the pixels have genes that were filtered out? I played around a bit and used this code to circumvent the error: pos_subset <- pos[tempCom,] But the graph I got was extremely erroneous I'm only graphing out 1 point per cluster which is extremely weird. And it's not on my tissue. seems like there are duplicates in the data frame? Thank you again - I realize this became more of a trouble shooting instead of Giotto Object question now. |
@bmill3r wanted to check in and follow up Sorry I know i asked a few questions. Truly appreciate your time and assistance!!! |
Hi @chrkuo No problem about the questions - here I'll try to go through them based on my understanding of what I think you're asking:
STdeconvolve doesn't have a way to directly interface with Giotto objects, but it seems that you were able to obtain the pixels (i.e., barcodes) x genes count matrix and supply this as input into STdeconvolve to obtain predicted cell type proportions in the pixels
In terms of the initial loading of data, you want the unnormalized/unfiltered data, which is non-negative integer counts of genes in the barcodes. The reason why STdeconvolve requires non-negative integers basically boils down to the fact that latent Dirichlet allocation requires frequency counts of words or terms, specified as a matrix of nonnegative integers. In this case, our terms are genes, but the same idea holds. LDA looks for terms that form clusters in each barcode or capture location but the read depths don't need to be normalized as one might do for scRNA-seq data via CPM normalization, for example. STdeconvolve does have a pre-filtering step to remove poorly captured or non-informative genes.
I agree that is odd behavior but likely has something to do with the ggplot2 parameters when drawing the different panels. There are a lot of parameters for the plotting so it's hard for me to diagnose this without seeing your code. Are you able to provide the code you ran to produce this plot? Depending on how you are generating this plot, there are several ways you can make it larger. If you are trying to save it as a pdf, you can do something like:
Also, for example in the
section of the code in the Analysis of 10X Visium data tutorial, you could try specifying plot heights and widths by adding parameters If you just want to see the first 4 genes for example, then in this same plotting example in the 10X tutorial, Alternatively, you can also just run the function
do you have any suggestions on how to work through this error? it seems like the emb matrix has 1375 rows but the pos has 1437 rows so they are not matching up. wondering why this is the case To get the communities from transcriptionally clustering the pixels, the input is the PCs. From the 10X Visium example:
Recall that
So my guess is that If
I would recommend checking out the Annotating deconvolved cell-types tutorial.
The function
Once you've picked an optimal LDA model and obtain the beta and theta matrices,
Yes - in the tutorial Getting started with STdeconvolve you'll see that typically one will fit multiple LDA models to the data using the function
In general, perplexity is how well the model fits the data where the lower the perplexity, the better. However, when looking at a specific dataset, the absolute perplexity range doesn't matter that much - it's more about choosing a model with the lowest perplexity while balancing a relatively low number of rare cell types. As the K increases, perplexity tends to decrease, but the number of rare cell types also increases, which suggests over splitting of the data. So it's a balance between these two metrics but one that each user will ultimately need to decide on. That said, your observed perplexity range of 300-400 is higher than what I have seen more more organized tissues like brain samples, so perhaps this is also reflective of the inherent heterogeneous and undifferentiated state of cancer cells (with less distinct cell types, it becomes harder for the model to deconvolve them). That said, it looks like the model is still able to deconvolve distinct cell types so the data is of sufficient quality and there are distinct cell type signatures that are detectable.
See response to question 4. Hope this helps, |
@bmill3r wanted to follow up regarding how to generate such matrix or gene sets for deconvolution annotation. I'm sort of stuck at this step. would really appreciate it if there's some step by step guidelines on how to generate such matrix between transcriptional approach and GSEA approach. I've explored other vignettes/open issues/blog linked here as well (https://jef.works/blog/2022/08/30/annotating-STdeconvolve-cell-types-with-asctb-tables/) but all of them see to upload some sort of reference matrix. If I have a set list of genes ie. can i use this to use as ground truths? or as GSEA. if so how do i do so? |
Hi @chrkuo To answer your questions:
It is not the heterogeneity but more the transcriptional "distinctness" of the cell types. STdeconvolve is looking for groups of genes that occur together at high probability and uses these groups to identify topics, or cell types. Ideally, there will be genes that specifically cluster together to form the cell type topics we're looking for among the capture locations. In reality, however, while a group of genes may occur together at high probability, they also probably occur at some frequency with genes of other clusters. Cases where genes are spread across, or co-occur with lots of other genes, make it harder for the model to assign genes to specific topics with high confidence. If we think about this in terms of cell types, cell types that have very distinct transcription profiles will likely have specific genes that strongly co-occur with each other (i.e., the genes being expressed specifically by the given cell type) but not with other genes expressed by other cell types. These will be easier for the model to detect, assuming that these cell type marker genes were efficiently captured in the spatial experiment. Likewise, cell types that are transcriptionally similar will likely share many genes that co-occur together across cell types, making it more difficult for LDA to assign these genes into specific groups to identify these cell types. Some cancers can be undifferentiated, and these cells, having been transformed into a less distinct transcriptional state, may be more difficult for STdeconvolve to detect. Indeed, we demonstrate this idea here: Examples of when STdeconvolve may fail.
Because STdeconvolve identifies cell types in a reference-free manner, users can use any method(s) they would like to identify the deconvolve cell types. If you have a
In the Annotating deconvolve cell-types tutorial, there is no reference for the mOB dataset, but for demonstration purposes I create one for the tissue layers just to show how one can perform a Pearson correlation analysis between the deconvolved cell types and some other matrix of cell type transcriptional profiles. A scRNA-seq dataset should have this gene count matrix for the individual cells. With cell type annotations, you can sum together the gene counts for cells belonging to the same cell type.
As you've seen, instead of performing correlations between transcriptional profiles, one can look for enrichment of certain cell type marker genes in the deconvolved transcriptional profiles to help identify the deconvolved cell types. In the Annotating deconvolve cell-types tutorial, for demonstration purposes, I use the top log2 fold-change genes for each tissue layer of the mOB dataset to generate my list of marker genes, which is used as the known gene set input for the function It actually looks like you are almost there with your provided gene sets for the different cell types you have provided. You can just turn that into a list via:
It looks like you have been following the Analysis of 10X Visium data tutorial. Note that the section that contains the plot you have provided is part of the analysis belonging to the "Compare to transcriptional clustering" section. This part is not about annotating the deconvolved cell types in the Visium Cortex data but rather comparing and contrasting the deconvolved cell types to the traditional way of just transcriptionally clustering the barcodes of the dataset. You can learn more about these differences in another demo we provide here. Hope this helps clarify things, |
Thank you so much Brendan @bmill3r, as always really appreciate your prompt and thoughtful responses. This has been extremely helpful. and yes your answers absolutely makes sense. As I work through it I have a few more questions (apologize in advance). I generated a graph looking at perplexity and cell types with mean proportion <5% and I don't see a gray zone area indicating alpha >1 which means that any of these k values that I pick should be good to explore correct? I picked a K of 12 and subsequently had 12 celltypes identified as shown here: As you have suggested I have been able to curate a list of genes that I deem to be cell type specific and input them into a matrix as such:
What is interesting is that the output seem to show a few correlated "GSEA" types as how I annotated. but certain "topics" were NA. even though topic 3 is "NA" when I type in this code to look at
M2_Macrophage_metastasis.character 0.00149985 0.0059994 -1.166917 0.03693869 Do you know why there is a discrepancy if I can call out the results for topic 3 why isn't it listed? the P-value seems to be significant as well? My other question is that there are several topics that are NA including: 3, 8, 9, 10, 12: are these not identifiable because I don't have a matching set list of genes that can best correlate with the ouput? is there a way to look at perhaps the top 5-10 differentially expressed genes in these topics and determine what these cells are sort of treating it like scRNAseq data? The GSEA that I am inputting if certain genes are not expressed within that gene set it will still pick up within the topic right? for example "CD8Macrophage" = c("CD163", "CD8", "MRC1") if CD163 and CD8 are not expressed within the pixel but MRC1 is highly expressed then it will still list it as CD8 Macrophage. Or is it looking for conditions where all 3 are highly expressed? Lastly, topic 1, 4, 6, 7 all seem to indicate the same "cell type" why is that? |
Hi @chrkuo
Alpha is a parameter governing the shape or concentration of the Dirichlet distribution in the fitted model. An alpha < 1 means that the cell type probability distributions of the capture locations are concentrated towards the "edges" of the Dirichlet, and so the capture locations are expected to be enriched in some of the cell types but not all of them. This is what we want and you can get a better visual sense of what I mean by this here: Examples of when STdeconvolve may fail. Because all of the different models have an alpha <1, this is a good sign that there are distinct cell types that are heterogeneously distributed throughout the capture locations that the model is picking up on and you could use any of the models with different Ks. However, the lower the perplexity, the better the model fits the data and so picking a model with a lower perplexity is recommended. However, this is also balanced between the increasing number of rare cell types, and an increase in the number of rare cell types is an indicator that the model might be overfitting the data. So it is recommended that you pick a model with a K that balances a decreasing perplexity with an increasing number of rare cell types. Your choice of K=12 is a reasonable one.
I believe this is because the and in the package this function is built around here: liger
There might be several reasons why they are not positively enriched in any of the gene sets you tested. But you can definitely look at the top differentially expressed genes in these topics to get a better sense of what cell types they might be representing. You might want to check out the tutorial Getting Started with STdeconvolve. There's a section where we look at the genes with the top log2FC for different devonvolved cell types. Additionally, we provide a function:
which will return the top 10 genes for each topic, where
I would recommend looking at the original GSEA paper to get a better idea as to how the algorithm is looking for gene set enrichment in the data.
It is likely that the top significantly enriched gene set for each of these topics was the same. But it might be worth looking at the entire gene set enrichment results for each of these topics to see if there were some cell type gene sets that were very close in terms of the GSEA results. Remember that the analysis is trying to find the most similar cell type based on the provided genes in the gene set but there could be several cell types that were significantly enriched. Also, maybe these topics are capturing different cell states or subtypes. For example, maybe several topics are enriched in "neuronal" genes, but they each could be different neuronal subtypes. It depends on how specific the genes in the gene sets are and how well represented they are in the deconvolved transcriptional profiles. Hope this helps a bit, |
Hello, Please, I am trying to annotate my cells using the GSEA function with I list I generated myself as seen below, but it is throwing errors: Could you please advise on how I can fix this problem? Create a nested list
Error in gset[[celltype]]$character : Thank you. Regards, |
Thank you so much for creating this tool for deconvolution without scRNAseq data.
I am working with an extremely rare pediatric tumor where the samples are few and far between and there are basically no publicly available scRNAseq reference dataset.
I am currently running my samples through Visium platform for SRT and using Giottosuite to ultimate analyze my data. The output is a Giotto object. I have read through the documentation using 10x data but I wasn't sure how to utilize Giotto object in the code for setting up to perform STdeconvolve. Can I please get some assistance.
Thank you so much I cannot wait to use STdeconvolve
The text was updated successfully, but these errors were encountered: