Replies: 2 comments
-
Hi Xi, Thanks for these wonderful questions! I have to admit the distinction between AnnData and AnnDataSet is a bit confusing. I'll write more documentation on this front soon. To address your questions:
As you've noticed, when you create AnnDataSet, information in
But I agree this has to be better documented.
The best practice in my mind is to perform QC and plotting for individual samples before merging them into AnnDataSets. But I'm open to suggestions.
The current API doesn't allow you to get an individual AnnData back from an AnnDataSet. If you want this feature, feel free to open a feature request. For now, the way to do it is to open the sample's h5ad file directly.
By design, an AnnDataSet can only be created from backed AnnData objects. If your AnnData objects are in non-backed mode, this usually means you don't need AnnDataSet. Instead, you can simply merge them into one big AnnData using: https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.concatenate.html. So why do we need AnnDataSet? Imagine a situation where you have 100+ AnnData objects and reading all of them into memory may cost you 500+ GB of memory, depending on the nature of the data. Using AnnDataSet, you can access these data in backed mode and merge them into an AnnDataSet for downstream analysis, which normally uses less than 1 GB of memory regardless of the number of files. Plus, AnnDataSet achieves this almost instantly, while it can take several minutes for
-
Hi Kai, Thanks very much for the clarification. You have explained the reasons very well. Now I understand, and it is very clear. I haven't really got big datasets yet, so I will stick to I will have more ideas and problems after I play with snapatac2 more in the future. I will post them in due course. Thanks again. Xi
-
Hello there,
I always feel that we need a good and functional scATAC-seq analysis package in python, and snapatac2 really fills in the gap. Thanks very much for developing this wonderful tool.
I only recently started using snapatac2, so my questions might be naive.
First, I'm a bit confused about the concept of `AnnData` and `AnnDataSet`. Take the "Multi-sample Pipeline" in the tutorial as an example. When we read all samples into `adatas`, we see that an `AnnData` is created for each sample. In the meantime, `obs: 'n_fragment', 'frac_dup', 'frac_mito'` is automatically created for each sample. Then, `tsse` is created for each sample. However, when we combine them into `data` as an `AnnDataSet`, those four pieces of information are lost. The object `data` only has `obs: 'sample'`. Now if we want to look at the QC of Number of Unique Fragments vs TSS Enrichment Score, we cannot just do `snap.pl.tsse(data, interactive=False)`, since there is no `n_fragment` or `tsse` in `data.obs`. We realise they can be accessed via `data.adatas.obs`. So, I'm currently doing:
Is this the best way of doing the plot? What is the recommended way? On a related note, is there a way of generating those QC plots for each sample? I don't see a `groupby` or the like in those plotting functions. Should I just use a for loop to do that individually?
The 2nd related question is that `data` now has 5 `AnnData` in it. I see the message:
How can I access a specific sample from here? Say, I want to take the sample `colon_transverse_SM-A9HOW` out for some specific analysis. Do I just read the backed `h5ad` file for that sample? What if I don't have a backed file? What is the recommended way of doing this?
The last question is related to the `obs` and the like. If I want to look at the content in it as a table or data frame, I normally just do `data.obs` or `data.obs.head()` in scanpy. However, this does not work in snapatac2. If I do `type(data.obs)`, it says it is a `PyDataFrameElem`, which does not seem to be a pandas dataframe. Is there a way to view or print `data.obs` as a table?
Thanks again for developing and actively maintaining the tool.
Regards,
Xi