Confusion regarding Intronic Sequence Generation in new 'nac' Workflow #225

bio-info-guy · 2023-11-24T06:41:46Z

Thank you for making this wonderful package to facilitate RNA velocity analysis. I do have some confusion regarding the new workflow 'nac' (in v.0.28.0) and how it is generating intronic sequences.

In ref.py: line 937 -949

          if not nascent:
                intron_temp_path = ngs.fasta.split_genomic_fasta_to_intron(
                    fasta_path,
                    intron_temp_path,
                    gene_infos,
                    transcript_infos,
                    flank=flank if flank is not None else k -
                    1 if k is not None else 30
                )
           else:
                intron_temp_path = ngs.fasta.split_genomic_fasta_to_nascent(
                    fasta_path, intron_temp_path, gene_infos
                )

If nascent sequences are not needed (not nascent == True), it would generate intronic only sequences and not include intronic-exonic junctions, however the subsequent function has flanking length as an argument. Because nascent is default True, would the default behavior not include flanking sequences in the generated intronic fasta? Furthermore, I don't think nascent is assigned in main.py either.

Thank you,
Best

The text was updated successfully, but these errors were encountered:

Yenaled · 2023-11-24T07:35:15Z

In any kb-python run, nascent will ALWAYS equal true. The split_genomic_fasta_to_intron (with a flank) is no longer supported by the current workflow for reasons I described in pachterlab/kallisto#411

That "if not nascent" block of code is only present in the codebase for legacy reasons.

See our most recent preprint on how the nac workflow should be used for an RNA velocity type analysis: https://www.biorxiv.org/content/10.1101/2023.11.21.568164v1 -- and feel free to ask any questions about it!

bio-info-guy · 2023-11-24T09:03:48Z

Thank you very much for the quick response, I have taken a look at the preprint and saw you mentioned that each gene now generates cDNA fasta (no change here) and a single 'nascent' fasta (which includes all exons and introns of the gene). I do still have a few questions listed below:

Instead of previously generating a fasta for each intron+flank, basically each intron is its own transcript, now its 1 fasta sequence for the entire genomic sequence of exons and introns (1 transcript to account for the nascent transcript), is my understanding of this correct?
When counting expression, would this inflate nascent expression, since it now includes both exons and introns, or is there some form of subtraction involved (i.e. subtracting cDNA expression from nascent expression to obtain the final nascent expression)?
Apart from the reduction in memory usage and increases in speed, what effect would this have on the quantification of unspliced vs spliced estimates compared to the previous lamanno workflow?
I am currently trying to run this on SMART-SEQ2 data, which workflow would you recommend: lammano or nac?

Thank you again for the response
Best

Yenaled · 2023-11-24T09:30:00Z

Correct
No, because three matrices are generated: nascent (N), mature (M), and ambiguous (A). Since both nascent and mature RNA contain exons, those regions are considered "ambiguous". See the example on page 22-23 which lets you know which matrices to use for RNA velocity and other "biophysical models". Essentially, you use M+A (i.e. cells_x_genes.cell.mtx) as your "spliced" counts and N (i.e. cells_x_genes.nascent.mtx) as your "unspliced" counts.
Using what I had written in point 2, this will provide similar (actually more accurate) estimates to the unspliced vs. spliced counts, especially as unspliced exon-intron spanning reads are now handled much better than having some "flank" parameter.
Always use nac from now on; lamanno is deprecated and is no longer supported for any data type.

bio-info-guy · 2023-11-24T21:39:20Z

Thank you for the explanation and suggestion, my confusion regarding the issue has been resolved. I was confused at first due to the loom file matrices of 'unspliced' and 'spliced' showing very different proportions than pervious version where I used the lammano workflow, but upon adding the ambiguous matrix to the unspliced matrix, the proportions are now similar to what it was before. I believe this issue can now be closed.

Yenaled closed this as completed Nov 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Confusion regarding Intronic Sequence Generation in new 'nac' Workflow #225

Confusion regarding Intronic Sequence Generation in new 'nac' Workflow #225

bio-info-guy commented Nov 24, 2023 •

edited

Loading

Yenaled commented Nov 24, 2023

bio-info-guy commented Nov 24, 2023

Yenaled commented Nov 24, 2023

bio-info-guy commented Nov 24, 2023

Confusion regarding Intronic Sequence Generation in new 'nac' Workflow #225

Confusion regarding Intronic Sequence Generation in new 'nac' Workflow #225

Comments

bio-info-guy commented Nov 24, 2023 • edited Loading

Yenaled commented Nov 24, 2023

bio-info-guy commented Nov 24, 2023

Yenaled commented Nov 24, 2023

bio-info-guy commented Nov 24, 2023

bio-info-guy commented Nov 24, 2023 •

edited

Loading