
nPhase on one chromosome #12

Open
elsasverris opened this issue Oct 13, 2021 · 3 comments

Comments

@elsasverris

Hi,
I am working on tetraploid potato and trying to run nPhase on my dataset, but as you note yourself it is very time-consuming, and I had to terminate the process after running it for 8 days. I was wondering if there is any way to parallelize the process, for example running it one chromosome at a time as you present in your paper, though I haven't figured out how to do so. Is there an option in the algorithm, or did you just subset the dataset before running it?
Best regards,
Elsa

@OmarOakheart
Owner

Hi Elsa,

For my paper I did just subset the data. To run nPhase on one chromosome at a time, you can simply split your reference genome into one fasta file per chromosome, then run one instance of nPhase per chromosome, changing the reference sequence each time.

A quick way to separate your genome into one fasta file per chromosome can be to use pyfaidx:

pip install pyfaidx
faidx -x sequences.fa

Source: https://www.biostars.org/p/173723/

That said, there are a few details to consider. For example, if you map your full dataset to chromosome 1 alone, some reads that would have mapped better to another chromosome may still map to it, which would lead to some inaccuracies. A more cumbersome but cleaner approach is to map and variant call all of your reads first, then manually subset them by chromosome and use them as input to nPhase partial, again with one fasta file per chromosome. nPhase doesn't currently ship any tools to help with that, though.
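If it helps, here is a minimal sketch of that per-chromosome VCF subsetting step in Python. The file names are hypothetical, it assumes a plain-text (uncompressed) VCF, and dedicated tools like bcftools can do the same thing more robustly on real data:

```python
import os

def split_vcf_by_chrom(vcf_path, out_dir):
    """Write one <chrom>.vcf per chromosome, each with the full header."""
    os.makedirs(out_dir, exist_ok=True)
    header = []
    handles = {}
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                header.append(line)  # collect header lines for every output
                continue
            chrom = line.split("\t", 1)[0]
            if chrom not in handles:
                handles[chrom] = open(os.path.join(out_dir, chrom + ".vcf"), "w")
                handles[chrom].writelines(header)
            handles[chrom].write(line)
    for h in handles.values():
        h.close()
```

The same idea applies to subsetting your mapped reads: group records by the chromosome in their reference column before feeding each subset to nPhase partial.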

I also manually subset the VCF file to reduce the heterozygosity level to 1%, which you may find useful to do as well, and I think that for a tetraploid, any more than 60-80X coverage would be excessive, so feel free to downsample if you have a crazy amount of data.
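For the downsampling step, tools like seqtk sample (or samtools view -s for BAMs) are the usual choice; purely to illustrate the idea, here is a small Python sketch that keeps each FASTQ record with a fixed probability. File paths are hypothetical and an uncompressed FASTQ is assumed:

```python
import random

def downsample_fastq(in_path, out_path, keep_fraction, seed=0):
    """Keep each 4-line FASTQ record with probability keep_fraction.

    A fixed seed makes the subsample reproducible, which matters if
    you later want to re-run the pipeline on the same reads.
    """
    rng = random.Random(seed)
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if rng.random() < keep_fraction:
                fout.writelines(record)
                kept += 1
    return kept
```

To go from, say, 200X to 70X coverage, a keep_fraction of 0.35 would be the right ballpark.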

There's a lot that can be done to improve how fast nPhase runs; keep an eye on the following issues to see if any significant updates have been made to help with genomes of that size.

#11
#10

Keep me posted if there's anything else I can help with, hope you'll get good results!

Best,
Omar

@elsasverris
Author

elsasverris commented Oct 14, 2021

Hi Omar,
Great, thanks for your suggestions, I will try to run nPhase partial. I have run into an error with the partial pipeline, though, and I don't know whether it's due to the way it's installed on our servers or to the script itself:

nphase partial --sampleName kingedward --reference /srv/KLN/users/esv/ST4.03ch02.fasta --output /srv/KLN/users/esv/KRISPS/nPhase/kingedward_ch02/ --longReads /srv/KLN/users/kln/KRISPS/BGI_data/F20FTSEUHT0162_POTilnD/KingEdwardA/pacbio.KingEdwardA.subreads.fastq.gz --vcf /srv/KLN/users/esv/KRISPS/freebayes/results/variants/vcfs/variants.ST4.03ch02.vcf --mappedLongReads /srv/KLN/users/esv/KRISPS/Pacbio/Mapping/BAM_sorted/kingedward.sorted.REF_ST4.03ch02.bam  --threads 20
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
Traceback (most recent call last):
  File "/space/sharedbin_ubuntu_14_04/software/nPhase/1.1.3-foss-2020b-Python-3.8.6/bin/nphase", line 8, in <module>
    sys.exit(main())
  File "/space/sharedbin_ubuntu_14_04/software/nPhase/1.1.3-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 560, in main
    partialPipeline(args)
  File "/space/sharedbin_ubuntu_14_04/software/nPhase/1.1.3-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 438, in partialPipeline
    nPhaseFunctions.assignLongReadToSNPs(cleanLongReadSamFile,shortReadSNPsBedFilePath,args.reference,minQ,minMQ,minAln,longReadPositionNTFile)
  File "/space/sharedbin_ubuntu_14_04/software/nPhase/1.1.3-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/bin/nPhasePipelineFunctions.py", line 207, in assignLongReadToSNPs
    for line in samFile:
  File "/space/sharedbin_ubuntu_14_04/software/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

@OmarOakheart
Owner

Hi Elsa,

nPhase partial requires a sorted SAM file, not a BAM file: my code reads the file line by line and doesn't use a library that can parse the binary BAM format. Normally, running the full nPhase pipeline generates the correct long-read SAM file, which you can then reuse for nphase partial.
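For what it's worth, the 0x8b in the UnicodeDecodeError above is the second byte of the gzip magic number (0x1f 0x8b); BAM files are BGZF (gzip) compressed, which is why the text decoder fails on the very first bytes. A quick sanity check you could run on an input before passing it to nphase partial (a small sketch, not part of nPhase itself):

```python
def looks_like_bam_or_gzip(path):
    # BAM files are BGZF-compressed, so they begin with the gzip
    # magic number 0x1f 0x8b -- the 0x8b is the exact byte the
    # UTF-8 decoder choked on in the traceback above.
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"
```

If it returns True, something like `samtools view -h reads.bam | samtools sort -O sam -o reads.sorted.sam -` should produce a sorted SAM (exact flags may vary with your samtools version).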
