Usage principles on AnnDataSet #303

dbrg77 · 2024-05-07T02:46:54Z


Hello there,

I always feel that we need a good and functional scATAC-seq analysis package in python, and snapatac2 really fills in the gap. Thanks very much for developing this wonderful tool.

I only recently started using snapatac2, so my questions might be naive.

First, I'm a bit confused about the concepts of AnnData and AnnDataSet. Take the "Multi-sample Pipeline" in the tutorial as an example. When we read all samples into adatas, an AnnData is created for each sample, and obs: 'n_fragment', 'frac_dup', 'frac_mito' is automatically created for each one. Then tsse is computed for each sample. However, when we combine them into data as an AnnDataSet, those four pieces of information are lost: the object data only has obs: 'sample'. Now if we want to look at the QC plot of Number of Unique Fragments vs TSS Enrichment Score, we cannot just do snap.pl.tsse(data, interactive=False), since there is no n_fragment or tsse in data.obs. We realise they can be accessed via data.adatas.obs, so I'm currently doing:

data.obs['n_fragment'] = data.adatas.obs['n_fragment']
data.obs['tsse'] = data.adatas.obs['tsse']

Is this the best way of doing the plot? What is the recommended way? On a related note, is there a way of generating those QC plots for each sample? I don't see a groupby or the like in those plotting functions. Should I just use a for loop to do that individually?

The 2nd related question is that data now has 5 AnnData in it. I see the message:

AnnDataSet object ... contains 5 AnnData objects with keys: 'colon_transverse_SM-A9HOW', 'colon_transverse_SM-A9VP4', ...

How can I access a specific sample from here? Say, I want to take the sample colon_transverse_SM-A9HOW out for some specific analysis. Do I just read the backed h5ad file for that sample? What if I don't have a backed file? What is the recommended way of doing this?

The last question is related to obs and the like. If I want to look at its content as a table or data frame, I normally just do data.obs or data.obs.head() in scanpy. However, this does not work in snapatac2. If I do type(data.obs), it says it is a PyDataFrameElem, which does not seem to be a pandas dataframe. Is there a way to view or print data.obs as a table?

Thanks again for developing and actively maintaining the tool.

Regards,
Xi

kaizhang (Maintainer) · 2024-05-07T05:01:35Z

Hi Xi,

Thanks for these wonderful questions! I have to admit the distinction between AnnData and AnnDataSet is a bit confusing. I'll write more documentation on this front soon. To address your questions:

"However, when we combine them into data as an AnnDataSet, those four pieces of information are lost. "

As you've noticed, when you create an AnnDataSet, the information in obs is not copied from the AnnData(s). This is intentional, as the motivation for developing AnnDataSet is to handle extremely large datasets, e.g., 10 million cells or beyond. In such cases, copying everything from obs may be too costly. Instead, we want to leave it to users to choose whether to copy this information. It is as simple as what you have done:

data.obs['n_fragment'] = data.adatas.obs['n_fragment']
data.obs['tsse'] = data.adatas.obs['tsse']

But I agree this has to be better documented.
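
In the meantime, the copy step above can be wrapped in a small helper. This is only a minimal sketch: `copy_obs_columns` is a hypothetical name, not part of snapatac2, and it assumes nothing beyond the `data.obs[...] = data.adatas.obs[...]` assignment shown above. The `demo` object and its values are made-up stand-ins so that the snippet runs without snapatac2 installed.

```python
from types import SimpleNamespace


def copy_obs_columns(data, columns):
    """Copy the chosen per-cell columns from data.adatas.obs into data.obs.

    Hypothetical helper: it only repeats the assignment used in this thread,
    once per requested column, so nothing is copied unless you ask for it.
    """
    for col in columns:
        data.obs[col] = data.adatas.obs[col]
    return data


# Stand-in for an AnnDataSet (made-up values); with a real AnnDataSet you
# would pass `data` itself instead of `demo`.
demo = SimpleNamespace(
    obs={"sample": ["a", "a", "b"]},
    adatas=SimpleNamespace(
        obs={"n_fragment": [1200, 800, 1500], "tsse": [9.1, 4.3, 12.0]}
    ),
)

copy_obs_columns(demo, ["n_fragment", "tsse"])
```

With a real AnnDataSet, the call would be `copy_obs_columns(data, ["n_fragment", "tsse"])`, after which `snap.pl.tsse(data, interactive=False)` should find the columns it needs in `data.obs`.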

"Is this the best way of doing the plot? What is the recommended way? On a related note, is there a way of generating those QC plots for each sample?"

The best practice in my mind is to perform QC and plotting for individual samples before merging them into AnnDataSets. But I'm open to suggestions.
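
A plain for loop is enough for the per-sample step. The sketch below shows the pattern under stated assumptions: `run_per_sample` is a hypothetical helper, the sample names are invented, and plain dicts stand in for the per-sample AnnData objects so that the snippet runs without snapatac2.

```python
def run_per_sample(samples, qc_fn):
    """Apply qc_fn to every (name, adata) pair and return results by sample name."""
    results = {}
    for name, adata in samples:
        results[name] = qc_fn(adata)
    return results


# Stand-in samples (made-up values); in practice these would be the
# per-sample AnnData objects, paired with their names, before merging.
samples = [
    ("sample_A", {"n_fragment": [1200, 800]}),
    ("sample_B", {"n_fragment": [1500]}),
]

# Example QC function: count the cells in each sample.
n_cells = run_per_sample(samples, lambda adata: len(adata["n_fragment"]))
```

With real data, `qc_fn` could be something like `lambda adata: snap.pl.tsse(adata, interactive=False)`, the same call used earlier in this thread, applied to each sample in turn.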

"How can I access a specific sample from here? Say, I want to take the sample colon_transverse_SM-A9HOW out for some specific analysis. Do I just read the backed h5ad file for that sample? "

The current API doesn't allow you to get an individual AnnData back from an AnnDataSet. If you want this feature, feel free to open a feature request. The way to do it now is to open the h5ad file for that sample.

"What if I don't have a backed file? What is the recommended way of doing this?"

By design, an AnnDataSet can only be created from backed AnnData(s). If you have AnnData(s) in non-backed mode, this usually means you don't have to use AnnDataSet. Instead, you can simply merge the AnnData(s) into one big AnnData using AnnData.concatenate: https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.concatenate.html

So why do we need AnnDataSet? Imagine a situation where you have 100+ AnnData(s), and reading all of them into memory may cost you 500+ GB of memory, depending on the nature of the data. With AnnDataSet, you can open these files in backed mode and merge them into an AnnDataSet for downstream analysis, which normally uses less than 1 GB of memory regardless of the number of files. Plus, AnnDataSet achieves this almost instantly, while it can take several minutes for the anndata package to read a giant h5ad file.


dbrg77 (Author) · 2024-05-07T07:51:31Z

Hi Kai,

Thanks very much for the clarification. You have explained the reasons very well, and it is all clear now. I don't have any really big datasets yet, so I will stick to AnnData throughout the pipeline.

I will have more ideas and problems after I play with snapatac2 more in the future. I will post them in due course.

Thanks again.

Xi

