getting a ton of column name issues #16

Open
aelyaderani opened this issue Sep 23, 2020 · 14 comments

Comments

@aelyaderani

brain.integrated <- addPercentMtRibo(brain.integrated, organism='mm',gene_nomenclature='name')
[10:53:30] No mitochondrial genes found in data set.
[10:53:30] Calculate percentage of 83 ribosomal transcript(s) present in the data set...
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
brain.integrated <- getMostExpressedGenes(brain.integrated)
Error: Sample column (column_sample) could not be found in data. Please provide an existing column name or NULL if you want to skip calculation of most expressed genes for samples.
brain.integrated <- getMarkerGenes(brain.integrated)
Error: Cannot find specified column (object@meta.data$sample) that is supposed to contain sample information.
brain.integrated <- getEnrichedPathways(brain.integrated)
Error: No marker genes found. Please run 'getMarkerGenes()' first.
exportFromSeurat(brain.integrated, 'my_experiment.crb')
Error: Column specified in column_sample not found in meta data.

The Seurat object named "brain.integrated" has had SCTransform and integration performed on it, but I'm using the RNA assay for the conversion. Any idea what's going on?

@romanhaa
Owner

Hi @aelyaderani,

You ran multiple commands there and the problem is always the same. The functions getMostExpressedGenes(), getMarkerGenes(), getEnrichedPathways() and exportFromSeurat() expect you to specify the column_sample and column_cluster parameters. Those are the columns in the meta.data slot of your Seurat object that hold the sample and cluster assignments for each cell in your data set. The default values are sample and cluster, but it seems that in your case the sample column doesn't exist, which results in an error.

See the error messages, they are pretty clear:

brain.integrated <- getMostExpressedGenes(brain.integrated)
# Error: Sample column (column_sample) could not be found in data. Please provide an existing column name or NULL if you want to skip calculation of most expressed genes for samples.

and

brain.integrated <- getMarkerGenes(brain.integrated)
# Error: Cannot find specified column (object@meta.data$sample) that is supposed to contain sample information.

The getEnrichedPathways() function failed because it expected the marker genes from the getMarkerGenes() function, which couldn't be executed because of the missing sample column. Again, if you read the error message, it's pretty clear what the problem is.

brain.integrated <- getEnrichedPathways(brain.integrated)
# Error: No marker genes found. Please run 'getMarkerGenes()' first.

Finally, when you want to export your data set, again the sample column couldn't be found in the data set, and so also exportFromSeurat() fails:

exportFromSeurat(brain.integrated, 'my_experiment.crb')
# Error: Column specified in column_sample not found in meta data.

I suggest you have a look at the cerebroApp website, with a particular focus on the role of the column_sample and column_cluster parameters, for example in the getMostExpressedGenes() function. Once you specify existing meta data columns, you should be able to export your data.
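One way to check this before calling any of the functions is to compare the expected column names against the meta data. The following is only a minimal sketch: it uses a small stand-in data frame instead of a real meta.data slot, and the column names orig.ident and seurat_clusters as well as the helper find_missing_columns() are assumptions made for illustration, not part of cerebroApp.

```r
# Stand-in for the meta.data slot of a Seurat object (hypothetical values).
meta_data <- data.frame(
  orig.ident = c("A", "A", "B", "B"),
  seurat_clusters = factor(c(0, 1, 0, 1)),
  nCount_RNA = c(5783, 6036, 4653, 14761)
)

# Hypothetical helper: return the requested columns that are absent.
find_missing_columns <- function(meta_data, columns) {
  setdiff(columns, colnames(meta_data))
}

# The defaults 'sample' and 'cluster' do not exist in this meta data ...
find_missing_columns(meta_data, c("sample", "cluster"))

# ... while these two columns do, so they could be passed as
# column_sample and column_cluster instead.
find_missing_columns(meta_data, c("orig.ident", "seurat_clusters"))
```

With a real object, the same check would be run on the object's meta data, and the names that exist would then be passed on, e.g. getMostExpressedGenes(brain.integrated, column_sample = 'orig.ident', column_cluster = 'seurat_clusters'), assuming those are the columns that hold the sample and cluster assignments in your data.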

I should also mention that I will soon release v1.3 of cerebroApp, which comes with a re-design of the grouping variables. Namely, the column_sample and column_cluster variables will be replaced by a generic groups parameter that (1) has no default and (2) can take any number of grouping variables present in the data set, e.g. only samples, or clusters and samples, or clusters and cell types, etc. An article that explains how to use this logic will be published along with the new release.

@aelyaderani
Author

@romanhaa Oh OK, yes, I add my own metadata after I create the Seurat object. Do you have an example Seurat object that works with cerebroApp? I just want to see how the meta data is set up in an object that's been tested with Cerebro :)

@romanhaa
Owner

romanhaa commented Sep 25, 2020

Of course. Here is an example of the meta.data slot of a Seurat object that I often use to make examples:

glimpse(seurat@meta.data)
# Rows: 5,697
# Columns: 15
# $ orig.ident                                   <fct> SeuratProject, SeuratPro…
# $ nCount_RNA                                   <dbl> 5783, 6036, 4653, 14761,…
# $ nFeature_RNA                                 <int> 1654, 1396, 1298, 2544, …
# $ sample                                       <fct> A, A, A, A, A, A, A, A, …
# $ nCount_SCT                                   <dbl> 4845, 4842, 4440, 5079, …
# $ nFeature_SCT                                 <int> 1653, 1396, 1298, 1065, …
# $ SCT_snn_res.0.8                              <fct> 16, 6, 6, 8, 5, 7, 0, 16…
# $ seurat_clusters                              <fct> 16, 6, 6, 8, 5, 7, 0, 16…
# $ S.Score                                      <dbl> -0.0593824754, -0.000904…
# $ G2M.Score                                    <dbl> -0.105379466, -0.0606620…
# $ cell_cycle_seurat                            <fct> G1, G1, G1, G2M, G2M, G1…
# $ cell_type_singler_blueprintencode_main       <chr> "HSC", "CD8+ T-cells", "…
# $ cell_type_singler_blueprintencode_main_score <dbl> 0.3857094, 0.4382698, 0.…
# $ percent_mt                                   <dbl> 0.028013142, 0.032140490…
# $ percent_ribo                                 <dbl> 0.38077123, 0.48939695, …

In this example, you would set column_sample = "sample" and column_cluster = "seurat_clusters" in the calls to functions such as getMostExpressedGenes(), getMarkerGenes(), and exportFromSeurat(), because they contain the assignments of cells to samples and clusters.

As I mentioned, in the upcoming release you would do the same by setting groups = c("sample", "seurat_clusters") instead of using the column_sample and column_cluster parameters. This has the benefit that you can include any grouping variable you like. In the example above, maybe you are interested in using the cell type as a grouping variable, to find marker genes for each of its levels (e.g. erythrocytes). You could then do this:

groups = c("sample", "seurat_clusters", "cell_type_singler_blueprintencode_main")

In other cases, you might only have a single sample in the data set, so it wouldn't make any sense to identify marker genes for samples. You could then run the command only for clusters, like this:

groups = c("seurat_clusters")
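If you want to build the groups vector programmatically, a rough sketch could look like the one below. It keeps only grouping variables that exist in the meta data and have at least two distinct levels. The stand-in data frame, the column name cell_type and the helper select_groups() are assumptions for illustration; cerebroApp does not provide such a function.

```r
# Stand-in meta data (hypothetical values) with only a single sample.
meta_data <- data.frame(
  sample = factor(rep("A", 6)),
  seurat_clusters = factor(c(0, 0, 1, 1, 2, 2)),
  cell_type = c("HSC", "HSC", "T-cells", "T-cells", "B-cells", "B-cells")
)

# Hypothetical helper: keep only grouping variables that are present in the
# meta data and have at least two distinct levels.
select_groups <- function(meta_data, candidates) {
  present <- intersect(candidates, colnames(meta_data))
  n_levels <- vapply(
    present,
    function(column) length(unique(meta_data[[column]])),
    integer(1)
  )
  present[n_levels >= 2]
}

# 'sample' is dropped because it has only one level.
groups <- select_groups(meta_data, c("sample", "seurat_clusters", "cell_type"))
groups
```

The resulting vector could then be passed as the groups parameter, under the assumption that every remaining column really is a grouping you care about.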

@aelyaderani
Author

@romanhaa Perfect!!! That helped a lot :) I got it to work! This is amazing!! I do have another question: how were you able to add a plotly 3D plot to your plot list in the example on your page? Any time I run UMAP again with 3 dimensions in order to pass it to plotly 3D, my original UMAP gets overwritten. :(

@romanhaa
Owner

Great, I'm happy you were able to make it work.

Regarding the dimensional reductions, if I understand correctly all you have to do is store the dimensional reductions under different names and keys, e.g. like this:

seurat <- RunUMAP(
  seurat,
  reduction.name = 'UMAP',
  reduction.key = 'UMAP_',
  n.components = 2
)

seurat <- RunUMAP(
  seurat,
  reduction.name = 'UMAP_3D',
  reduction.key = 'UMAP3D_',
  n.components = 3
)

Then, they will be stored separately and both exported to Cerebro.

@aelyaderani
Author

@romanhaa I have some bad news :(
I have a sample set that has 200K cells and the '.crb' file is 1.74 GB.
Every time I try to upload it into Cerebro, I get the same error message: "Maximum upload size exceeded"
[Screenshot]

@romanhaa
Owner

romanhaa commented Sep 27, 2020

Wow, that's a pretty massive data set. I never tested Cerebro with more than 50k cells, so it should be interesting to see whether it can handle 200k. Anyway, Cerebro has a built-in file size limit. It looks like you are using the standalone version on macOS, is that correct? If that's the case, you should navigate to the directory where the Cerebro app is stored, right-click the app and choose "Show Package Contents". Then, go to "Contents", "Resources", "app", and open the "app.R" file. In there is only one line of code: cerebroApp::launchCerebro(). All you need to do is change this to cerebroApp::launchCerebro(maxFileSize = 2000). The number 2000 corresponds to a maximum file size of 2,000 MB, which should be fine in your case.

If you are using the standalone version of Cerebro, I recommend switching to the one shipped in the cerebroApp R package. It will use fewer resources and you can control the maxFileSize parameter directly without modifying the Cerebro app. To launch Cerebro from R, just run the same command as above (cerebroApp::launchCerebro(maxFileSize = 2000)) from a normal R session. Another reason why this is the recommended way is that I can no longer produce the standalone version for the new cerebroApp version. So, if you want to use cerebroApp v1.3 (which is now public, you can check the cerebroApp website for info), you will have to launch Cerebro from within R because there is no standalone version of it anymore.
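If you don't want to guess the limit, you could derive it from the size of the file itself. This is just a sketch: the helper suggest_max_file_size() and the 10% headroom are my own assumptions, and it assumes that 1 MB is counted as 1024^2 bytes.

```r
# Hypothetical helper: suggest a maxFileSize value (in MB) for a given file,
# with some headroom on top. Assumes 1 MB = 1024^2 bytes.
suggest_max_file_size <- function(path, headroom = 1.1) {
  size_mb <- file.size(path) / 1024^2
  ceiling(size_mb * headroom)
}

# Demonstration with a temporary file of exactly 2 MB.
tmp <- tempfile(fileext = ".crb")
writeBin(raw(2 * 1024^2), tmp)
suggested <- suggest_max_file_size(tmp)
suggested
unlink(tmp)
```

For a real file, the value could be passed straight on, e.g. cerebroApp::launchCerebro(maxFileSize = suggest_max_file_size('my_experiment.crb')), where the file name is taken from the export call earlier in this thread.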

EDIT: I'd love to perform some tests on your data set. If you have permission to share it with me - I promise that I won't look at the data itself and won't share it - it would be great if you could do so. But I know that people can be very protective of their data, so no worries if sharing is not possible.

@aelyaderani
Author

@romanhaa Unfortunately these are human samples and we can't share them :( However, we are in the middle of publishing a paper on them. Once the paper is published, we plan on sharing the data through http://synapse.org/

I'll let you know once that happens... Also, thanks for the fix :) but it looks like I'm running into some memory handling issues. :(
It's hard to use the one shipped in the R package, because I'm running everything on a Linux server... but more importantly, I am using the standalone version because our collaborators don't know how to use R haha. So I want to make sure it works on my local computer before I pass them the results.
[Screenshot]

@aelyaderani
Author

@romanhaa So I used the second method (using Cerebro in the R package) and it worked :) but it's super slow loading everything lol. I think it took about 52 minutes to upload the data, and switching from one UMAP to the other takes about 10 minutes 😅 .... Changing "color cells by" isn't too bad :) only a few seconds... at most maybe a minute.
[Screenshot]

@romanhaa
Owner

Wow, I'm impressed that it works but at the same time it's a bummer that it's so slow. What you could also try is storing the expression data, which I assume occupies most of your memory but isn't needed in most tabs, as a delayed array. Check the reference page for the exportFromSeurat() function. Since cerebroApp v1.3, there is a parameter called use_delayed_array which is set to FALSE by default. By setting it to TRUE, the expression data will be stored in a delayed array, which means it's not loaded into memory. When you then want to check expression values in the "Gene (set) expression" tab, the data will be loaded from the disk. This will of course take a bit longer than having the data in memory, but in your case it might be worth a try. Actually, I made that option precisely for cases like yours. However, since I didn't have a data set that large in my hands, I wasn't able to test whether this makes a big difference. Would be nice to hear your feedback.

@aelyaderani
Author

@romanhaa I haven't had the chance to try that method yet. I'll have more time tomorrow to give it a try. Also, is there a way to host Cerebro? Similar to CellxGene? https://cellxgene.foundinpd.org/

@romanhaa
Owner

Yes, people have done it on their own servers. If you don't have one, you can also use the shinyapps.io service. I wrote an article about what you have to do in order to upload Cerebro there: https://romanhaa.github.io/cerebroApp/articles/host_cerebro_on_shinyapps.html. If you don't pay, the resources are relatively limited, but at least you can try it for free.

@aelyaderani
Author

@romanhaa Thanks man!! I ran into an issue: I'm using a server to host the shiny app, but I get an error when I run
rsconnect::deployApp('~/test_cerebro_shinyapps/', appName = 'Cerebro')

[Screenshot]
