Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confusion regarding Intronic Sequence Generation in new 'nac' Workflow #225

Closed
bio-info-guy opened this issue Nov 24, 2023 · 4 comments
Closed

Comments

@bio-info-guy
Copy link

bio-info-guy commented Nov 24, 2023

Thank you for making this wonderful package to facilitate RNA velocity analysis. I do have some confusion regarding the new workflow 'nac' (in v.0.28.0) and how it is generating intronic sequences.

In ref.py: line 937 -949

          if not nascent:
                intron_temp_path = ngs.fasta.split_genomic_fasta_to_intron(
                    fasta_path,
                    intron_temp_path,
                    gene_infos,
                    transcript_infos,
                    flank=flank if flank is not None else k -
                    1 if k is not None else 30
                )
           else:
                intron_temp_path = ngs.fasta.split_genomic_fasta_to_nascent(
                    fasta_path, intron_temp_path, gene_infos
                )

If nascent sequences are not needed (not nascent == True), it would generate intronic only sequences and not include intronic-exonic junctions, however the subsequent function has flanking length as an argument. Because nascent is default True, would the default behavior not include flanking sequences in the generated intronic fasta? Furthermore, I don't think nascent is assigned in main.py either.

Thank you,
Best

@Yenaled
Copy link
Collaborator

Yenaled commented Nov 24, 2023

In any kb-python run, nascent will ALWAYS equal true. The split_genomic_fasta_to_intron (with a flank) is no longer supported by the current workflow for reasons I described in pachterlab/kallisto#411

That "if not nascent" block of code is only present in the codebase for legacy reasons.

See our most recent preprint on how the nac workflow should be used for an RNA velocity type analysis: https://www.biorxiv.org/content/10.1101/2023.11.21.568164v1 -- and feel free to ask any questions about it!

@bio-info-guy
Copy link
Author

Thank you very much for the quick response, I have taken a look at the preprint and saw you mentioned that each gene now generates cDNA fasta (no change here) and a single 'nascent' fasta (which includes all exons and introns of the gene). I do still have a few questions listed below:

  1. Instead of previously generating a fasta for each intron+flank, basically each intron is its own transcript, now its 1 fasta sequence for the entire genomic sequence of exons and introns (1 transcript to account for the nascent transcript), is my understanding of this correct?
  2. When counting expression, would this inflate nascent expression, since it now includes both exons and introns, or is there some form of subtraction involved (i.e. subtracting cDNA expression from nascent expression to obtain the final nascent expression)?
  3. Apart from the reduction in memory usage and increases in speed, what effect would this have on the quantification of unspliced vs spliced estimates compared to the previous lamanno workflow?
  4. I am currently trying to run this on SMART-SEQ2 data, which workflow would you recommend: lammano or nac?

Thank you again for the response
Best

@Yenaled
Copy link
Collaborator

Yenaled commented Nov 24, 2023

  1. Correct
  2. No, because three matrices are generated: nascent (N), mature (M), and ambiguous (A). Since both nascent and mature RNA contain exons, those regions are considered "ambiguous". See the example on page 22-23 which lets you know which matrices to use for RNA velocity and other "biophysical models". Essentially, you use M+A (i.e. cells_x_genes.cell.mtx) as your "spliced" counts and N (i.e. cells_x_genes.nascent.mtx) as your "unspliced" counts.
  3. Using what I had written in point 2, this will provide similar (actually more accurate) estimates to the unspliced vs. spliced counts, especially as unspliced exon-intron spanning reads are now handled much better than having some "flank" parameter.
  4. Always use nac from now on; lamanno is deprecated and is no longer supported for any data type.

@bio-info-guy
Copy link
Author

Thank you for the explanation and suggestion, my confusion regarding the issue has been resolved. I was confused at first due to the loom file matrices of 'unspliced' and 'spliced' showing very different proportions than pervious version where I used the lammano workflow, but upon adding the ambiguous matrix to the unspliced matrix, the proportions are now similar to what it was before. I believe this issue can now be closed.

@Yenaled Yenaled closed this as completed Nov 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants