Dangling edges for Panther #446

Open
RichardBruskiewich opened this issue Apr 24, 2023 · 7 comments

Comments

@RichardBruskiewich
Collaborator

RichardBruskiewich commented Apr 24, 2023

Dangling edges found for Panther. See https://monarch-initiative.github.io/monarch-qc/ for the report; https://data.monarchinitiative.org/monarch-kg-dev/ for the data.

Find out why and suggest a repair.

@RichardBruskiewich RichardBruskiewich self-assigned this Apr 24, 2023
@RichardBruskiewich
Collaborator Author

The simplest explanation here is that many (most) of the dangling edges have ENSEMBL subject or object gene identifiers.

A brute force solution - tempting to apply - is simply to have the ingest script filter out all edges with ENSEMBL prefixed identifiers.

That said, a simple grep of the dangling edges against the ingest output file itself, isolating single deviant Panther protein groups, shows a slight discrepancy in the counts. Oddly enough, the ‘dangling edges’ file has many entries that totally lack a (mapped) original_subject or original_object node identifier (i.e. the column value is empty for the edge), but some of these entries still have one non-blank identifier, which generally seems to be from the ENSEMBL namespace (so it was not removed from the edge during mapping?). This oddity likely explains the count difference, at least in part.
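
One way to put numbers on this observation is to tally the CURIE prefixes, and the blanks, in the node identifier columns of the dangling edges file. The following is only a minimal sketch: the column names `subject` and `object` are assumptions about the TSV header, and `tally_prefixes` / `tally_file` are hypothetical helpers, not part of the ingest.

```python
import csv
import gzip
from collections import Counter


def tally_prefixes(lines, columns=("subject", "object")):
    """Count CURIE prefixes per column; empty values are counted as '(blank)'.

    `lines` is any iterable of TSV text lines whose first line is a header.
    The default column names are assumptions about the dangling-edges header.
    """
    counts = {column: Counter() for column in columns}
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for column in columns:
            value = (row.get(column) or "").strip()
            # Everything before the first colon is taken as the prefix.
            prefix = value.split(":", 1)[0] if value else "(blank)"
            counts[column][prefix] += 1
    return counts


def tally_file(path, columns=("subject", "object")):
    """Apply tally_prefixes to a gzip-compressed TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        return tally_prefixes(handle, columns)
```

Calling `tally_file("monarch-kg-dangling-edges.tsv.gz")` and comparing the `(blank)` and `ENSEMBL` counts per column would show how much of the discrepancy the half-empty edges account for.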

Unless we think otherwise, the necessary patch of the Panther ingest is simply to filter out ENSEMBL identifiers in either the subject or object node. I can issue a PR to do this and we could rerun the ingest to see if this handles most of the dangling edges.
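
For illustration, the filter could look roughly like the sketch below. It is not the actual change proposed for the ingest: the edge is assumed to be a plain dict with `subject` and `object` keys, and `has_prefix` / `drop_edges_with_prefix` are hypothetical names.

```python
def has_prefix(curie, prefix="ENSEMBL"):
    """True if the CURIE is non-empty and its prefix equals `prefix`."""
    return bool(curie) and curie.split(":", 1)[0] == prefix


def drop_edges_with_prefix(edges, prefix="ENSEMBL"):
    """Yield only edges whose subject and object both lack the given prefix.

    Each edge is assumed to be a dict with 'subject' and 'object' keys.
    """
    for edge in edges:
        if has_prefix(edge.get("subject"), prefix) or has_prefix(edge.get("object"), prefix):
            continue  # brute force: the whole edge is discarded
        yield edge
```

Note that this discards every ENSEMBL-prefixed edge, whether or not a matching node exists, which is exactly the risk of losing legitimate edges.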

I don’t know if we’ll lose any legitimate edges - I guess if we don’t commonly rely on ENSEMBL identifiers for gene nodes, but rather, model organism curated nodes only, then we should be fine.

@RichardBruskiewich
Collaborator Author

Attempted resolution in #456 by filtering out edges that contain ENSEMBL identifiers; however, after @kevinschaper and I reviewed this, we concluded that simply discarding ENSEMBL gene identifiers is not the best solution.

That said, @kevinschaper has made some progress in reducing the dangling edges(?).

@RichardBruskiewich
Collaborator Author

RichardBruskiewich commented Jun 19, 2023

Taking a fresh look today (June 19, 2023):

The file downloaded today shows this as the first Panther dangling edge record:

$ gunzip -c monarch-kg-dangling-edges.tsv.gz |grep panther |less
uuid:b401c46f-0dd0-11ee-bd34-f39d5ac7a30a               biolink:orthologous_to          biolink:GeneToGeneHomologyAssociation   infores:monarchinitiative       PANTHER.FAMILY:PTHR15464        infores:panther panther_genome_orthologs_edges                                                                         HGNC:11629      ENSEMBL:ENSSSCG00070024292

Searching the latest gene2ensembl file:

$ gunzip -c gene2ensembl.gz |grep ENSSSCG |grep 24292
9823    100516001       ENSSSCG00000007596      XM_003124292.3  ENSSSCT00000008336.5    XP_003124340.1  ENSSSCP00000008117.2
9823    100521003       ENSSSCG00000026422      XM_003124244.5  ENSSSCT00000022811.4    XP_003124292.1  ENSSSCP00000027235.2
9823    100624292       ENSSSCG00000028172      XM_013993942.2  ENSSSCT00000023759.4    XP_013849396.1  ENSSSCP00000020943.1
9823    100736682       ENSSSCG00000004854      XM_021098883.1  ENSSSCT00000024292.4    XP_020954542.1  ENSSSCP00000019275.3

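None of these rows has ENSSSCG00070024292 in the Ensembl gene column (the grep only matched 24292 inside other accessions), so the gene record is simply missing from the gene2ensembl file.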
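
Since a substring grep also hits RefSeq accessions and NCBI Gene IDs, an exact match on the Ensembl gene column gives a cleaner answer. A minimal sketch, assuming the standard NCBI gene2ensembl layout in which the third tab-separated column is the Ensembl gene identifier (consistent with the rows above); `find_gene_rows` and `find_gene_in_file` are hypothetical helper names.

```python
import gzip

# Third column (index 2) of gene2ensembl is assumed to hold the Ensembl gene ID.
ENSEMBL_GENE_COLUMN = 2


def find_gene_rows(lines, ensembl_gene_id):
    """Return rows whose Ensembl gene column equals the identifier exactly."""
    matches = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip the header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) > ENSEMBL_GENE_COLUMN and fields[ENSEMBL_GENE_COLUMN] == ensembl_gene_id:
            matches.append(fields)
    return matches


def find_gene_in_file(path, ensembl_gene_id):
    """Apply find_gene_rows to a gzip-compressed gene2ensembl file."""
    with gzip.open(path, "rt") as handle:
        return find_gene_rows(handle, ensembl_gene_id)
```

An empty list from `find_gene_in_file("gene2ensembl.gz", "ENSSSCG00070024292")` would confirm the absence without the false positives seen in the grep output.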

That said, a direct search at https://www.ebi.ac.uk/ebisearch/search using this identifier brings up the following (note: the first TCF-19 link is broken, but the other one works).

UniProtKB entry Q9TSV4 does have a link to the Ensembl gene record ENSSSCG00070024292.

Genomes & metagenomes (1 results)

Source: Ensembl Gene (ID: ENSSSCG00070024292)
[TCF19](https://www.ensembl.org/pig_usmarc/geneview?gene=ENSSSCG00070024292)

transcription factor 19 [Source:NCBI gene;Acc:100152381]

Cross References: Samples & ontologies (3) Nucleotide sequences (2) Protein sequences (2)

Protein sequences (1 results)

Source: UniProtKB (ID: TCF19_PIG)
[Q9TSV4](https://www.uniprot.org/uniprot/Q9TSV4)

Transcription factor 19 TCF-19
Sus scrofa(Reviewed)
Secondary accession number(s): O19083

Cross References: Protein families (22) Bioactive molecules (8) Protein sequences (6) show more

Formats:[ in FASTA format ](https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=uniprotkb&id=Q9TSV4&format=fasta&style=raw)in Feature Viewer in Interpro Matches

@RichardBruskiewich
Collaborator Author

RichardBruskiewich commented Jun 19, 2023

Another random use case: ANIA loci (Dictyostelium genomic loci).

$ gunzip -c monarch-kg-dangling-edges.tsv.gz |grep ANIA_ |wc -l
29950

A modest subset of the dangling edges.

For example, we look at the first one at the top of the list: ANIA_10586

$ gunzip -c monarch-kg-dangling-edges.tsv.gz |grep ANIA_ |head -1
uuid:b401c492-0dd0-11ee-bd34-f39d5ac7a30a               biolink:orthologous_to          biolink:GeneToGeneHomologyAssociation   infores:monarchinitiative       PANTHER.FAMILY:PTHR43765        infores:panther panther_genome_orthologs_edges                                                                         SGD:S000002605   ENSEMBL:ANIA_10586

Again, assuming that this is ENSEMBL, nothing is found in gene2ensembl.gz:

$ gunzip -c gene2ensembl.gz |grep ANIA_|wc -l
0

However, a UniProtKB search that ignores the ENSEMBL prefix has a hit: C8VAR7, including a Panther family mapping: PTHR43765.

Thus, for these pseudo-ENSEMBL CURIEs whose object IDs begin with ANIA_ (locus identifiers from the original Dictyostelium gene set?), we'd simply want to strip off the ENSEMBL prefix and match directly against UniProtKB. One caveat is that this entry is unreviewed (TrEMBL) rather than curated.
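
The prefix stripping could be sketched as follows. The regular expression is an assumption about what a canonical Ensembl stable gene ID looks like ("ENS", an optional species code, "G", then 11 digits); not every identifier served by Ensembl follows that pattern, so treat this as a heuristic. `uniprot_search_term` is a hypothetical helper.

```python
import re

# Assumed shape of a canonical Ensembl stable gene ID, e.g. ENSSSCG00070024292.
STABLE_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}$")


def split_curie(curie):
    """Split 'PREFIX:local_id' into (prefix, local_id)."""
    prefix, _, local_id = curie.partition(":")
    return prefix, local_id


def uniprot_search_term(curie):
    """Return the bare local ID for an ENSEMBL-prefixed CURIE that does not
    look like an Ensembl stable gene ID; return None in every other case."""
    prefix, local_id = split_curie(curie)
    if prefix != "ENSEMBL" or not local_id:
        return None
    if STABLE_GENE_ID.match(local_id):
        return None  # looks like a real Ensembl gene ID; handle separately
    return local_id
```

With this heuristic, `uniprot_search_term("ENSEMBL:ANIA_10586")` returns `"ANIA_10586"`, while `uniprot_search_term("ENSEMBL:ENSSSCG00070024292")` returns `None`.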

@RichardBruskiewich
Collaborator Author

RichardBruskiewich commented Jun 19, 2023

The common thread between the two examples I chose so far seems to be to search in UniProtKB for the identifier mappings. Note, however, that the search is slightly different for each one since the original identifiers are distinct in character.

We are already using UniProtKB but, given the size of the ID mapping file (>11 GB?), we likely need to be a bit clever: perhaps work iteratively, handling one subset of missing identifiers at a time as each use case is found.
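
One way to stay within memory is to stream the mapping file once and keep only rows for the identifiers currently under investigation. A minimal sketch, assuming the three-column, tab-separated layout of UniProt's idmapping.dat (UniProtKB accession, ID type, external ID); which ID type carries a given kind of identifier, and whether version suffixes appear, is not confirmed here. `collect_mappings` and `collect_from_file` are hypothetical helpers.

```python
import gzip


def collect_mappings(lines, wanted_ids):
    """Map each wanted external ID to the UniProtKB accessions that cite it.

    Only rows whose third column is in `wanted_ids` are kept, so memory use
    scales with the number of hits rather than with the size of the file.
    """
    wanted = set(wanted_ids)
    found = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed rows
        accession, id_type, external_id = fields
        if external_id in wanted:
            found.setdefault(external_id, []).append((accession, id_type))
    return found


def collect_from_file(path, wanted_ids):
    """Stream a gzip-compressed idmapping file through collect_mappings."""
    with gzip.open(path, "rt") as handle:
        return collect_mappings(handle, wanted_ids)
```

The `wanted_ids` set would be built from one subset of the dangling edges at a time, for example the bare ANIA_ locus names, so each pass over the file answers one use case.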

@RichardBruskiewich
Collaborator Author

Related to monarch-initiative/monarch-app#351 which was closed?

Donkey: "Are we there yet?" Shrek: "Shut up!"

@RichardBruskiewich RichardBruskiewich removed their assignment Jan 24, 2024
@RichardBruskiewich
Collaborator Author

@sagehrke @kevinschaper, I don't have any more insights to add beyond the above analyses. The ENSEMBL team - if I recall - didn't seem to think that the identifiers in question are missing at their end.

The most fruitful approach here may be to leverage UniProtKB in an SSSOM kind of way? I leave this with you...
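
To make the SSSOM idea concrete, resolved identifier pairs could be written out as a small mapping table. This is only a sketch: it emits the four core SSSOM columns without the metadata header a complete SSSOM file carries, and the default predicate and justification values are placeholder assumptions that would need a deliberate choice. `to_sssom_tsv` is a hypothetical helper.

```python
import csv
import io

SSSOM_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def to_sssom_tsv(pairs, predicate_id="skos:closeMatch",
                 justification="semapv:UnspecifiedMatching"):
    """Render (subject_id, object_id) pairs as SSSOM-style TSV text.

    The predicate and justification defaults are placeholders, not a
    recommendation for how gene-to-protein mappings should be typed.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(SSSOM_COLUMNS)
    for subject_id, object_id in pairs:
        writer.writerow([subject_id, predicate_id, object_id, justification])
    return buffer.getvalue()
```

For the example above, `to_sssom_tsv([("ENSEMBL:ANIA_10586", "UniProtKB:C8VAR7")])` would produce a header line plus one mapping row that a later normalization step could consume.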
