nClus parameter not working #14

Open
emmanuelaaaaa opened this issue Aug 4, 2020 · 3 comments

Comments

@emmanuelaaaaa

Hello again :),

I have been running into a weird issue: when I specify the number of clusters as e.g. 20, FlowSOM runs with nClus=20 and the CV plot looks OK, but the training only uses 10 clusters, so it prints "Processing cluster 1..." up to 10. The same happens with the actual normalisation, which also seems to use only 10 clusters.
Any idea what's happening there?

Many thanks and best wishes,
Emma

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.10 (Final)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
[1] LC_CTYPE=en_GB.ISO-8859-1 LC_NUMERIC=C LC_TIME=en_GB.ISO-8859-1 LC_COLLATE=en_GB.ISO-8859-1
[5] LC_MONETARY=en_GB.ISO-8859-1 LC_MESSAGES=en_GB.ISO-8859-1 LC_PAPER=en_GB.ISO-8859-1 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] flowCore_1.52.1 FlowSOM_1.18.0 igraph_1.2.5 dplyr_1.0.0 CytoNorm_0.0.5 optparse_1.6.6

loaded via a namespace (and not attached):
[1] Biobase_2.46.0 splines_3.6.3 jsonlite_1.6.1 ConsensusClusterPlus_1.50.0 R.utils_2.9.2
[6] ellipse_0.4.2 gtools_3.8.2 RcppParallel_5.0.1 stats4_3.6.3 latticeExtra_0.6-29
[11] RBGL_1.62.1 flowWorkspace_3.34.1 yaml_2.2.1 robustbase_0.93-6 pillar_1.4.4
[16] lattice_0.20-41 glue_1.3.2 digest_0.6.25 RColorBrewer_1.1-2 colorspace_1.4-1
[21] ggcyto_1.14.1 Matrix_1.2-18 R.oo_1.23.0 plyr_1.8.6 pcaPP_1.9-73
[26] XML_3.99-0.3 pkgconfig_2.0.3 pheatmap_1.0.12 tsne_0.1-3 fda_5.1.4
[31] zlibbioc_1.32.0 purrr_0.3.4 corpcor_1.6.9 mvtnorm_1.1-1 scales_1.1.1
[36] jpeg_0.1-8.1 getopt_1.20.3 openCyto_1.24.0 flowStats_3.44.0 tibble_3.0.1
[41] generics_0.0.2 ggplot2_3.3.1 ellipsis_0.3.1 flowViz_1.50.0 BiocGenerics_0.32.0
[46] hexbin_1.28.1 mnormt_1.5-6 magrittr_1.5 crayon_1.3.4 IDPmisc_1.1.20
[51] mclust_5.4.6 ks_1.11.7 R.methodsS3_1.8.0 MASS_7.3-51.6 graph_1.64.0
[56] tools_3.6.3 data.table_1.12.8 ncdfFlow_2.32.0 flowClust_3.24.0 lifecycle_0.2.0
[61] matrixStats_0.56.0 stringr_1.4.0 munsell_0.5.0 cluster_2.1.0 compiler_3.6.3
[66] rlang_0.4.6 grid_3.6.3 base64enc_0.1-3 gtable_0.3.0 rrcov_1.5-2
[71] R6_2.4.1 gridExtra_2.3 clue_0.3-57 CytoML_1.12.1 KernSmooth_2.23-17
[76] Rgraphviz_2.30.0 stringi_1.4.6 parallel_3.6.3 Rcpp_1.0.4.6 vctrs_0.3.0
[81] png_0.1-7 DEoptimR_1.0-8 tidyselect_1.1.0

@emmanuelaaaaa
Author

Hello,
I have found what the issue was, so I thought I'd update here too. CytoNorm writes a tmp folder containing the FlowSOM clustering of the training data from prepareFlowSOM. I was running it in the same directory with different parameters (nClus), so even though I re-ran prepareFlowSOM each time with the new nClus, CytoNorm.train found the tmp directory already there and replaced the fsom object I had just computed with the previously saved one:

    if (!file.exists(file.path(outputDir, "CytoNorm_FlowSOM.RDS"))) {
...
    } else {
        fsom <- readRDS(file.path(outputDir, "CytoNorm_FlowSOM.RDS"))
        warning("Reusing previously saved FlowSOM result.")
    }

Easy fix: I now move into a subdirectory Norm_nClus# every time I run the CytoNorm.train step.
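A minimal sketch of that workaround in base R. The helper name `fresh_output_dir` is hypothetical (it is not part of CytoNorm); the only things taken from the snippet above are the cache file name `CytoNorm_FlowSOM.RDS` and the `outputDir` variable. Passing the directory to `CytoNorm.train` through an `outputDir` argument is an assumption, so check it against your installed version.

```r
# Hypothetical helper (not part of CytoNorm): one output directory per nClus,
# with any stale cached FlowSOM result removed before training.
fresh_output_dir <- function(base_dir, nClus) {
  out_dir <- file.path(base_dir, paste0("Norm_nClus", nClus))
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

  # File name taken from the CytoNorm snippet quoted above.
  cached <- file.path(out_dir, "CytoNorm_FlowSOM.RDS")
  if (file.exists(cached)) {
    file.remove(cached)
  }
  out_dir
}

out_dir <- fresh_output_dir(tempdir(), nClus = 20)

# Assumed usage (argument name not confirmed in this thread):
# model <- CytoNorm.train(..., outputDir = out_dir)
```

Keeping the directory name tied to nClus means a leftover cache from one run can never be picked up by a run with a different setting.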

Now there is still one thing that I don't fully understand, and it looks a bit suspicious. Even though I'm training and fitting with different numbers of clusters, I get exactly the same warnings, with exactly the same proportions of cells that are far away from their cluster centers.
For example with nClus=5 I get:

There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  887 cells (2.65%) seem far from their cluster centers.
2: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  2382 cells (2.73%) seem far from their cluster centers.
3: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  1021 cells (6.28%) seem far from their cluster centers.
4: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  4241 cells (4.58%) seem far from their cluster centers.
5: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  3813 cells (9.64%) seem far from their cluster centers.
6: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  3816 cells (24.13%) seem far from their cluster centers.
7: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  671 cells (2.97%) seem far from their cluster centers.
8: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  2111 cells (7.73%) seem far from their cluster centers.
9: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  857 cells (2.19%) seem far from their cluster centers.
10: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  1370 cells (6.58%) seem far from their cluster centers.

... And exactly the same with nClus=20:

There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  887 cells (2.65%) seem far from their cluster centers.
2: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  2382 cells (2.73%) seem far from their cluster centers.
3: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  1021 cells (6.28%) seem far from their cluster centers.
4: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  4241 cells (4.58%) seem far from their cluster centers.
5: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  3813 cells (9.64%) seem far from their cluster centers.
6: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  3816 cells (24.13%) seem far from their cluster centers.
7: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  671 cells (2.97%) seem far from their cluster centers.
8: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  2111 cells (7.73%) seem far from their cluster centers.
9: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  857 cells (2.19%) seem far from their cluster centers.
10: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  1370 cells (6.58%) seem far from their cluster centers.

I admit that this might be a coincidence, with just the first cluster being the same, but I was wondering if you have any ideas on how to explore this further.
Thanks,
Emma

@SofieVG
Member

SofieVG commented Aug 11, 2020 via email

@tomashhurst

@emmanuelaaaaa that tmp folder thing is a subtle trap, so well done for noticing it! Always worth checking to see if it's still there, which might happen if CytoNorm gets interrupted.

In terms of the later warning you mention:

There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  887 cells (2.65%) seem far from their cluster centers.

It would be the same each time because, as @SofieVG said, the first level of clustering generates the same number of clusters (~100) whatever nClus is, and the metaclustering then groups those into 5 or 20 metaclusters, etc. One reason it might happen is if your data is very variable between batches, so the clusters are capturing cells that are actually quite spread out. You could try increasing the number of first-level clusters (by increasing the 'grid size': xdim = 10 and ydim = 10 results in 10 x 10 = 100 clusters) to capture this. If your data has small batch effects, then it is more likely that you are capturing cells from different populations in each first-level cluster, and the solution would again be to try an increased grid size.
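To illustrate the two levels described above, here is a toy base R sketch. It is not FlowSOM itself: a fixed 10 x 10 grid of points stands in for the trained SOM nodes, and `hclust` plus `cutree` stand in for the metaclustering. All names and the simulated data are made up for illustration. The point is only that each cell's distance to its first-level node is computed before, and independently of, the choice of nClus.

```r
set.seed(1)
cells <- matrix(rnorm(2000 * 2), ncol = 2)   # simulated 2-marker data

# First level: xdim * ydim nodes (a fixed grid stands in for a trained SOM).
xdim <- 10
ydim <- 10
nodes <- as.matrix(expand.grid(seq(-2, 2, length.out = xdim),
                               seq(-2, 2, length.out = ydim)))

# Squared distance from every cell to every node, then the nearest node.
d2 <- outer(rowSums(cells^2), rowSums(nodes^2), "+") - 2 * cells %*% t(nodes)
node_of_cell <- max.col(-d2, ties.method = "first")
dist_to_node <- sqrt(pmax(d2[cbind(seq_len(nrow(cells)), node_of_cell)], 0))

# Second level: group the same 100 nodes into nClus metaclusters
# (hierarchical clustering stands in for FlowSOM's metaclustering).
tree <- hclust(dist(nodes))
meta5 <- cutree(tree, k = 5)
meta20 <- cutree(tree, k = 20)

# nClus changes the metacluster labels, not the nodes or the distances.
nrow(nodes)            # 100 first-level clusters in both cases
length(unique(meta5))  # 5
length(unique(meta20)) # 20
```

Any "far from cluster center" check based on `dist_to_node` gives the same result for nClus = 5 and nClus = 20; only changing `xdim` and `ydim` alters the nodes and therefore the distances.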
