Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supported opposite-strand protocols (for metagene) #23

Open
sarahhcarl opened this issue Mar 13, 2018 · 5 comments
Open

Supported opposite-strand protocols (for metagene) #23

sarahhcarl opened this issue Mar 13, 2018 · 5 comments

Comments

@sarahhcarl
Copy link

Hi Plastid team,

I have found a lot of the functionality provided by plastid very useful, but I got frustratingly stuck today when I was trying to run metagene count on data originating from a stranded protocol that produces reads on the opposite of the original strand. I know in your FAQ you suggest reverse-complementing the reads and re-aligning them, but honestly I feel like this is a lot more work than necessary, especially for a big project with many samples.

This issue was almost a deal-breaker for me, but then I realized I could get the same effect by simply reversing the strand of all of my window annotations in my ROI file. I did it with sed, but it would seem to be a pretty straightforward feature to add as a parameter in the actual command, either at the stage of generating the ROI or on the fly during counting. Such a feature would be a relatively easy way to expand usage of plastid by supporting a large number of users with this type of data.

Thanks!
Sarah

@joshuagryphon
Copy link
Owner

Hi @sarahhcarl ,

Thank you for your thoughts. I agree that re-aligning and reprocessing is more work than it is worth, especially for large datasets, and especially when dUTP-based and other reverse-strand sequencing methodologies are common.

I have written kind of a long post in response (below); the TLDR is, I have been thinking about how to do this for a while, and am taking suggestions on how best to support this. Please read on if you have thoughts!

Thanks & best,
Josh


There are a couple of reasons we haven't included a strand-flipping function yet — even though I actually started working on it two years ago (!), without finishing it — but the main theme is that we haven't figured out the best way to solve the problem, "best" meaning:

  • in the least annoying way possible from a user's point of view
  • in a way that is most consistent throughout the code base, so that we don't surprise ourselves by creating counterintuitive behaviors in existing or future functionality

If you have thoughts on how best to do this, I welcome your feedback. Here's what I have been considering:

  1. As you note, this sort of flipping functionality could be accomplished by flipping the strands of (a) features (e.g. genes, as you did), (b) read alignments, or (c) enabling both, allowing the user to choose.

    • Code-wise, flipping the features is the fastest, easiest, and most isolated
    • However, flipping reads seems to be the most intuitive (at least to me), because as a biologist, I think of reads as deriving from a given mRNA or strand of DNA, and want the reads to be represented on that strand in e.g. a genome browser, or in other analyses.
  2. If we go with the read flipping strategy, there are a bunch of places where code changes could be made to support that feature:

    • in the mapping functions (what I'm leaning towards). This would be the most explicit, and, philosophically, is most in tune with Plastid's data model. But doing so would mess up functionality / require code duplication in Plastid's BAMGenomeArray, among other places
    • inside the various classes of GenomeArray objects. This would result in fewer code changes, and cleaner code overall, but probably makes less sense to the user because flipping the strand is really in the purview of the mapping function.
    • inside the script itself, like metagene. I'd rather not do this, because (a) the feature would have to be duplicated across all of the scripts, and (b) the feature wouldn't be usable inside interactive environments (e.g. Jupyter notebooks, which a lot of people use)
  3. When implementing this feature, I want to solve another related problem (and a common ask), to support reads from legacy unstranded sequencing protocols. To minimize risk of having to make breaking changes later on, it makes sense to come up with a design that deals with both of these issues holistically, and I don't have a sense of what the best set of behaviors would be.

@joshuagryphon joshuagryphon changed the title Supported opposite-strand protocols for metagene Supported opposite-strand protocols (for metagene) Mar 13, 2018
@sarahhcarl
Copy link
Author

sarahhcarl commented Mar 15, 2018

Hi Josh,

Thanks for the thoughtful reply. After thinking about it some more and reading the genome_array.py code (not exhaustively, so I may have missed something!), it seems to me like the most intuitive way to go would be to neither explicitly flip the strands of the reads nor the features, but rather to change the logic of read filtering. Maybe this is what you were getting at in point 2a ("in the mapping functions").

For example, it seems like one critical point in the get_reads_and_counts function is here:

 reads = itertools.chain.from_iterable((X.fetch(reference=chrom,
                                               start=start,
                                               end=end,
                                               # until_eof=True, # this could speed things up. need to test/investigate
                                               ) for X in self.bamfiles))
            
        # filter by strand
        if strand == "+":
            reads = ifilter(lambda x: x.is_reverse is False,reads)
        elif strand == "-":
            reads = ifilter(lambda x: x.is_reverse is True,reads)

What about making an additional mapping parameter that would allow this filtering step to either choose reads on the same strand, reads on the opposite strand, or all reads regardless of strand? I don't know how many functions would need this to be implemented (how much code would be duplicated), but to me it seems like a straightforward solution that also supports unstranded protocols.

Cheers,
Sarah

@joshuagryphon
Copy link
Owner

Hi Sarah,

Thank you for your thoughts! I have given this a little bit of work and have a working implementation. I moved the logic you highlighted above into the mapping functions (FivePrimeMapFactory(), and the like), which now take an extra parameter called flip_strand.

I just need to write a few more tests to make sure there are no hidden surprises, and make a few other changes to keep up with updates to pysam. I'll let you know when the changes are done and have been merged back into the develop branch.

Cheers,
Josh

@joshuagryphon
Copy link
Owner

Hi Sarah-
I just wanted to let you know that this hasn't fallen off the radar. I haven't had a chance to get back to my tests due to work pressures. If it's still useful, I can push the feature branch containing my implementation up here, with the caveat that until I'm done building the test datasets & rewriting the unit tests, the implementation won't be proven

@sarahhcarl
Copy link
Author

sarahhcarl commented Jun 21, 2018 via email

@joshuagryphon joshuagryphon added this to the Plastid 0.5.0 milestone Nov 29, 2019
@joshuagryphon joshuagryphon modified the milestones: Plastid 0.5.0, Plastid 0.6.0 May 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants