
nPhase on one chromosome #12

Open
elsasverris opened this issue Oct 13, 2021 · 3 comments

Comments

@elsasverris

Hi,
I am working on tetraploid potato and trying to run nPhase on my dataset, but as you note yourself it is very time-consuming, and I had to terminate the process after running it for 8 days. I was wondering if there is any way to parallelize the process, for example running it one chromosome at a time as you present in your paper, though I haven't figured out how to do so. Is there an option in the algorithm, or did you just subset the dataset before running it?
Best regards,
Elsa

@OmarOakheart
Owner

Hi Elsa,

For my paper I did just subset the data. To run nPhase on one chromosome at a time, you can simply split your reference genome into one fasta file per chromosome, then run one instance of nPhase per chromosome, changing the reference sequence each time.

A quick way to separate your genome into one fasta file per chromosome can be to use pyfaidx:

pip install pyfaidx
faidx -x sequences.fa

Source: https://www.biostars.org/p/173723/

That said, there are a few details to consider. For example, if you map your full dataset to chromosome 1 alone, some reads that would have mapped better to another chromosome may still map to it, which would lead to some inaccuracies. A more cumbersome but cleaner approach is to map and variant call all of your reads first, then manually subset them by chromosome and use them as input to nPhase partial, again with one fasta file per chromosome. nPhase doesn't currently ship any tools to help with that, though.
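If it helps, here is a minimal sketch of that per-chromosome VCF subsetting step in Python. The file names are hypothetical, it assumes a plain-text (uncompressed) VCF, and dedicated tools like bcftools can do the same thing more robustly on real data:

```python
import os

def split_vcf_by_chrom(vcf_path, out_dir):
    """Write one <chrom>.vcf per chromosome, each with the full header."""
    os.makedirs(out_dir, exist_ok=True)
    header = []
    handles = {}
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                header.append(line)  # collect header lines for every output
                continue
            chrom = line.split("\t", 1)[0]
            if chrom not in handles:
                handles[chrom] = open(os.path.join(out_dir, chrom + ".vcf"), "w")
                handles[chrom].writelines(header)
            handles[chrom].write(line)
    for h in handles.values():
        h.close()
```

The same idea applies to subsetting your mapped reads: group records by the chromosome in their reference column before feeding each subset to nPhase partial.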

I also manually subset the VCF file to reduce the heterozygosity level to 1%, which you may find useful to do as well, and I think that for a tetraploid, any more than 60-80X coverage would be excessive, so feel free to downsample if you have a crazy amount of data.
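For the downsampling step, tools like seqtk sample (or samtools view -s for BAMs) are the usual choice; purely to illustrate the idea, here is a small Python sketch that keeps each FASTQ record with a fixed probability. File paths are hypothetical and an uncompressed FASTQ is assumed:

```python
import random

def downsample_fastq(in_path, out_path, keep_fraction, seed=0):
    """Keep each 4-line FASTQ record with probability keep_fraction.

    A fixed seed makes the subsample reproducible, which matters if
    you later want to re-run the pipeline on the same reads.
    """
    rng = random.Random(seed)
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if rng.random() < keep_fraction:
                fout.writelines(record)
                kept += 1
    return kept
```

To go from, say, 200X to 70X coverage, a keep_fraction of 0.35 would be the right ballpark.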

There's a lot that can be done to improve how fast nPhase runs; keep an eye on the following issues to see if any significant updates have been made to help with genomes of that size.

#11
#10

Keep me posted if there's anything else I can help with, hope you'll get good results!

Best,
Omar

@elsasverris
Author

elsasverris commented Oct 14, 2021

Hi Omar,
Great, thanks for your suggestions, I will try to run nPhase partial. I have run into an error with the partial pipeline, though, and I don't know whether it's due to the way it's installed on our servers or to the script itself:

nphase partial --sampleName kingedward --reference /srv/KLN/users/esv/ST4.03ch02.fasta --output /srv/KLN/users/esv/KRISPS/nPhase/kingedward_ch02/ --longReads /srv/KLN/users/kln/KRISPS/BGI_data/F20FTSEUHT0162_POTilnD/KingEdwardA/pacbio.KingEdwardA.subreads.fastq.gz --vcf /srv/KLN/users/esv/KRISPS/freebayes/results/variants/vcfs/variants.ST4.03ch02.vcf --mappedLongReads /srv/KLN/users/esv/KRISPS/Pacbio/Mapping/BAM_sorted/kingedward.sorted.REF_ST4.03ch02.bam  --threads 20
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
Traceback (most recent call last):
  File "/space/sharedbin_ubuntu_14_04/software/nPhase/1.1.3-foss-2020b-Python-3.8.6/bin/nphase", line 8, in <module>
    sys.exit(main())
  File "/space/sharedbin_ubuntu_14_04/software/nPhase/1.1.3-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 560, in main
    partialPipeline(args)
  File "/space/sharedbin_ubuntu_14_04/software/nPhase/1.1.3-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 438, in partialPipeline
    nPhaseFunctions.assignLongReadToSNPs(cleanLongReadSamFile,shortReadSNPsBedFilePath,args.reference,minQ,minMQ,minAln,longReadPositionNTFile)
  File "/space/sharedbin_ubuntu_14_04/software/nPhase/1.1.3-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/bin/nPhasePipelineFunctions.py", line 207, in assignLongReadToSNPs
    for line in samFile:
  File "/space/sharedbin_ubuntu_14_04/software/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

@OmarOakheart
Owner

Hi Elsa,

nPhase partial requires a sorted SAM file, not a BAM file: my code reads the file line by line and doesn't use a library that can parse the binary BAM format. Normally, running the full nPhase pipeline generates the correct long-read SAM file, which you can then reuse for nphase partial.
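For what it's worth, the 0x8b in the UnicodeDecodeError above is the second byte of the gzip magic number (0x1f 0x8b); BAM files are BGZF (gzip) compressed, which is why the text decoder fails on the very first bytes. A quick sanity check you could run on an input before passing it to nphase partial (a small sketch, not part of nPhase itself):

```python
def looks_like_bam_or_gzip(path):
    # BAM files are BGZF-compressed, so they begin with the gzip
    # magic number 0x1f 0x8b -- the 0x8b is the exact byte the
    # UTF-8 decoder choked on in the traceback above.
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"
```

If it returns True, something like `samtools view -h reads.bam | samtools sort -O sam -o reads.sorted.sam -` should produce a sorted SAM (exact flags may vary with your samtools version).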
