I only got different results for sORF peptides. Below I've included the full output files, but here is a snippet of the differences (nt seqs are truncated):
Short first:
peptide_id start end peptide_type peptide_class prediction_tool nlpprecursor_class_score nlpprecursor_cleavage_score protein_sequence nucleotide_sequence
petx0wholefemale_NODE_618917_length_73_cov_2.401639_g519643_i0 NA NA sORF NA plmutils NA NA MPTLSATAMFTQFTLPL ATGCCAACCTTGAGTGCTACTGCCATGTTTACACAGTTTACCTTGCCA>
petx0wholefemale_NODE_618920_length_72_cov_3.090909_g519646_i0 NA NA sORF NA plmutils NA NA LTLRAKRPKDTTNQTS CTGACTTTGCGAGCCAAACGACCAAAAGACACAACTAACCAAACAAGC
Transcript_10 NA NA sORF NA plmutils NA NA VSGEPVAKHKGLASFLELYCENCAFPEKVISRAYTSWRVTAGKDESKSAARAYDSGSSCESFTVNVKAVVVARSFGIRYQQLMVQEVSGDSGFFYG GTGTCTGGGGAACCGG>
Transcript_30 NA NA sORF NA plmutils NA NA MGSSAVSLAGRRRRMKAILYSSIALSSLLLMLSQSNLQNRVRLLYFFSELCSVGYFLLVGSTVKKMH ATGGGAAGTTCTGCCGTGAGCCTCGCCGGCAGAAGGCGTCGGATGAAA>
Transcript_31 NA NA sORF NA plmutils NA NA MFVQWRTQNSNNSSSVDCCDVWDVNLSRKCVIYACKYFKNRLRTWFSAMKMFNVVAGGQVSKLLEIIHYYSRLQLRNEAANVPQRSPSSQWS ATGTTTGTGCAGTGGCGGACACAA>
Transcript_42 NA NA sORF NA plmutils NA NA MHFYTGRVDLALVFGVDFSHSEMPKFPSWLGAGLEVVGELCRRCVPMARLAFGNRAGFSSPVCWAASMSVQSAM ATGCACTTTTATACAGGGAGAGTAGATCTTGCTTTAGTCT>
Transcript_45 NA NA sORF NA plmutils NA NA LGCILLLLLTLLLFLCDYTRECVVDINMCIKDTARTMAVVCSCFENAVWCAVLVECFWYVLDICGVLCARSGIATEGSRLAGWLEDEDDVSSFSWKNS CTGGGATGTATCTTGT>
Transcript_53 NA NA sORF NA plmutils NA NA LHGGACKPEILLTQYGPFMSYQGNTKTESRLIRMFCVGACGSGNCKNKEIAPKCCCVPLLHKSLATNFFPATVCVRLLLVLRLFFAVYFLLTNFL TTGCATGGCGGCGCCTGCAAGCCG>
Transcript_56 NA NA sORF NA plmutils NA NA VSQRCVLCLLFFVSSVALLWVMISETKVVVSAGYCNLVRSTYTILLAPCSLRHLLRTTFRRTSPQRTLHSLKNAVAVTT GTGAGTCAGCGCTGTGTGTTGTGTCTCCTCTTCTTTGTCT>
Transcript_57 NA NA sORF NA plmutils NA NA LSAEMNGPNLSDEYAASVLPLFPTGTAFKNSSLLRVGRSIELYVSSLGAPEFFVSARISFLRAFVIEGFKCSELVTVTSAK TTGTCGGCTGAAATGAACGGCCCCAACTTGTC>
Transcript_58 NA NA sORF NA plmutils NA NA LGAMRLLLTDDSHHYRTLEPILLFRHIAACFRGSFDPFYTHFTPILPKGPSLCALAMGMATTKPLQ TTGGGAGCGATGCGCCTGCTGCTTACCGATGATTCGCATCACTACAGA>
Transcript_66 NA NA sORF NA plmutils NA NA LPHFFFATEAEGANQERRHCHSHAIYSYRARLLVKHKSSLVVPSSRIKKLGIPLCHA CTGCCACATTTTTTTTTTGCAACTGAAGCGGAAGGTGCCAACCAGGAGCGCCGTCA>
Transcript_67 NA NA sORF NA plmutils NA NA MGYLTAQCAIWLEICVQFFTQASCSMNMLECDCFSYAFEDPSKHTCTLYDVKQHTKGHMLALSLLMYTCVSAISSLLSILWLPSIT ATGGGATACTTAACTGCACAGTGCGCTATTTG>
Transcript_80 NA NA sORF NA plmutils NA NA LIQCLRTYSVWTHGRKARPYLEERNSYMRMSKLNASCFIILRHTVVMETRKLSLHLQRGTKSTKP TTGATACAATGTCTCAGAACATACTCCGTCTGGACGCACGGTCGGAAG>
Transcript_81 NA NA sORF NA plmutils NA NA LVQRTNNSQLNSRHCCLSCTFTQVQGLHSSFHAQPFLFGQMDKNAVTLINRQALYIKEVFFK CTGGTTCAAAGGACAAATAACAGTCAGCTAAACAGCCGGCACTGCTGTCTGTCGTG
Long first:
peptide_id start end peptide_type peptide_class prediction_tool nlpprecursor_class_score nlpprecursor_cleavage_score protein_sequence nucleotide_sequence
Transcript_53 NA NA sORF NA plmutils NA NA LHGGACKPEILLTQYGPFMSYQGNTKTESRLIRMFCVGACGSGNCKNKEIAPKCCCVPLLHKSLATNFFPATVCVRLLLVLRLFFAVYFLLTNFL TTGCATGGCGGCGCCTGCAAGCCG>
Transcript_54 NA NA sORF NA plmutils NA NA VRHEKTEISSPLLHSLSFWLLRKAGFSPIMNNNHEAVVISAFLHASHDRKTHRPSQPSFSY GTGAGGCATGAAAAAACTGAAATATCATCTCCGCTTCTACATTCGTTGTCATTCTG>
Transcript_67 NA NA sORF NA plmutils NA NA MGYLTAQCAIWLEICVQFFTQASCSMNMLECDCFSYAFEDPSKHTCTLYDVKQHTKGHMLALSLLMYTCVSAISSLLSILWLPSIT ATGGGATACTTAACTGCACAGTGCGCTATTTG>
Transcript_79 NA NA sORF NA plmutils NA NA LWRIIIAAQFSLKSGDHCLVFHQLRLLRCETVPEFFFFAVIQLFVLRIVQIFFSVLEVLINVVSVH TTGTGGAGAATCATCATTGCTGCACAGTTTTCTCTTAAATCAGGCGAC>
Transcript_80 NA NA sORF NA plmutils NA NA LIQCLRTYSVWTHGRKARPYLEERNSYMRMSKLNASCFIILRHTVVMETRKLSLHLQRGTKSTKP TTGATACAATGTCTCAGAACATACTCCGTCTGGACGCACGGTCGGAAG>
Transcript_83 NA NA sORF NA plmutils NA NA MWKLNNTLLRDDVYYRAVKDEIGKINPCKNLKIWQQWELSKESLKIKAIERATCIRYKEKNEAELRALLETLLKQECKEPRKWI ATGTGGAAGCTAAACAACACGCTTCTTCGCGA>
Transcript_85 NA NA sORF NA plmutils NA NA LVLRLRGGAKKRKKKNYSTPKKIKHKRKKVKLAVLKYYKVDENGKIHRLRRECTSESCGAGVFMAAHEDRHYCGKCHLTLVYSKQEDK CTGGTGCTTCGCCTGCGCGGTGGC>
Transcript_88 NA NA sORF NA plmutils NA NA VPLFKAPSDNVVLEKWRRAIPRADRTLMPTDHVCAKHFAEDAISRAYYAELDKSATLRGRNARAFQRCSSYITVADG GTGCCATTATTCAAAGCTCCGTCCGACAATGTTGTTTTGG>
Transcript_90 NA NA sORF NA plmutils NA NA LTVALPTSHLLNGILCLLSSLAGVGKQPSEVYHICHLSRLQHRVFSTVTPT TTGACAGTAGCATTACCCACTTCTCATTTATTAAACGGCATTCTGTGCCTTCTTAGTTCTCTTG>
Transcript_91 NA NA sORF NA plmutils NA NA TRTNGSPSSLKPRIIGRNFRYSIYTLQLKLHAVTSAALKTITHG ACGCGTACTAATGGAAGCCCGAGTTCACTGAAACCCCGCATAATTGGGAGGAACTTTCGGTATTCCATTTAT>
Transcript_92 NA NA sORF NA plmutils NA NA MLSNRKCVYTNMFTADGIYLQPVPLLSIRGACCTTGDCSISDVWAAYHHSVLAVCITQLTHILRPANHLNPILHNGPTRSFAAVYNR ATGCTGAGTAACAGAAAATGCGTTTATACAAA>
Transcript_96 NA NA sORF NA plmutils NA NA LKCATSAHLKLKKRNIADACFPHALKKGFLEKYNDNVNLQAVRLQSSGYPILFFGSVVENRLQHIV TTGAAATGCGCTACCTCAGCGCATTTAAAGCTGAAAAAACGCAACATT>
Transcript_98 NA NA sORF NA plmutils NA NA LIAHSRDPPCSRSRSFKQRSDQCRCVRMTKVFHKPRFSHISRPLRCSLLN CTGATAGCGCACTCCAGGGATCCCCCGTGCTCTCGAAGCCGTTCATTCAAACAGCGCTCCGACC>
petx0wholefemale_NODE_618928_length_70_cov_1.512605_g519654_i0 NA NA sORF NA plmutils NA NA THLWSISSYRCHTTTRQYF ACGCACCTTTGGTCAATTTCATCCTATCGCTGTCACACGACGACACGA>
petx0wholefemale_NODE_618932_length_68_cov_3.153846_g519658_i0 NA NA sORF NA plmutils NA NA LPPCSAFFLSLFNCVVNY TTGCCGCCATGCTCGGCTTTTTTTTTATCTTTGTTTAACTGTGTGGT
This tar'd archive includes the demo outputs for both cases: the "correct" outputs (short contigs concatenated first) and the "wrong" outputs (long contigs concatenated first).
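For anyone who wants to compare two runs like these, here is a minimal sketch that keys each table by peptide_id and reports which IDs appear in only one run. It assumes the outputs are tab-separated with the header shown above; the helper names (`read_peptides`, `compare_runs`) are made up for illustration, and the toy rows reuse IDs from the snippets with protein sequences truncated to five residues.

```python
import csv
import io


def read_peptides(tsv_text):
    """Map peptide_id -> protein_sequence from tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["peptide_id"]: row["protein_sequence"] for row in reader}


def compare_runs(run_a, run_b):
    """Return IDs only in run_a, IDs only in run_b, and shared IDs whose sequences differ."""
    only_a = sorted(set(run_a) - set(run_b))
    only_b = sorted(set(run_b) - set(run_a))
    changed = sorted(k for k in set(run_a) & set(run_b) if run_a[k] != run_b[k])
    return only_a, only_b, changed


# Toy inputs (assumption: real files would be read from disk instead).
short_first = read_peptides(
    "peptide_id\tprotein_sequence\nTranscript_10\tVSGEP\nTranscript_53\tLHGGA\n"
)
long_first = read_peptides(
    "peptide_id\tprotein_sequence\nTranscript_53\tLHGGA\nTranscript_54\tVRHEK\n"
)

only_short, only_long, changed = compare_runs(short_first, long_first)
print(only_short, only_long, changed)
```

A shared ID with an unchanged sequence (like Transcript_53 here) shows up in none of the three lists, so the output highlights only the rows that differ between runs.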
We chatted offline, but for the record: I looked into this bug using the files attached above in bug.tar.gz, and I'm almost certain that it is caused by an order-preservation bug in plm-utils. The rows of the embeddings matrix generated by the plmutils embed command do not follow the order of the sequences in the input FASTA file. Because the predictions are generated from the embeddings matrix, they end up matched to the wrong sequence IDs in the plmutils_predictions.csv file output by the plmutils_predict rule of the peptigate Snakefile.
This bug is fixed in a PR in the plm-utils repo here.
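To illustrate the kind of mismatch described above, here is a minimal sketch of an order check between FASTA IDs and embedding rows. It is not the plm-utils implementation or the fix in the PR: the function names are hypothetical, and it assumes the embedding rows come with their own list of sequence IDs, which may not be how plm-utils stores them.

```python
def parse_fasta_ids(fasta_text):
    """Return sequence IDs in the order they appear in FASTA-formatted text."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]


def realign_rows(rows, row_ids, fasta_ids):
    """Reorder `rows` (labelled by `row_ids`) so that row i belongs to fasta_ids[i]."""
    if sorted(row_ids) != sorted(fasta_ids):
        raise ValueError("embedding IDs and FASTA IDs do not match")
    index_by_id = {seq_id: i for i, seq_id in enumerate(row_ids)}
    return [rows[index_by_id[seq_id]] for seq_id in fasta_ids]


# Toy FASTA with three records (sequences shortened for the example).
fasta_text = ">Transcript_10\nGTGTCT\n>Transcript_30\nATGGGA\n>Transcript_31\nATGTTT\n"
fasta_ids = parse_fasta_ids(fasta_text)

# Hypothetical embedding rows that came back in a different order than the FASTA.
row_ids = ["Transcript_30", "Transcript_31", "Transcript_10"]
rows = [[0.30], [0.31], [0.10]]

# If this is False, pairing rows with FASTA IDs by position mislabels predictions.
order_preserved = row_ids == fasta_ids
aligned = realign_rows(rows, row_ids, fasta_ids)
print(order_preserved, aligned)
```

A check like `row_ids == fasta_ids` before writing predictions would catch this class of bug early; the sketch assumes sequence IDs are unique.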
Description of the bug
Over in #47, I noticed that if I ran
I got different sORF predictions than if I ran
Command used and terminal output
Relevant files
bug.tar.gz
These are the GitHub links to the short and long contig files used as inputs for this run:
https://github.com/Arcadia-Science/peptigate/blob/main/demo/contigs_longer_than_r2t_minimum_length.fa
https://github.com/Arcadia-Science/peptigate/blob/main/demo/contigs_shorter_than_r2t_minimum_length.fa
System information
I ran peptigate on a Linux EC2 instance. The compute specifications are reported here: https://github.com/Arcadia-Science/peptigate?tab=readme-ov-file#compute-specifications