Missing IRIs and metadata from many transforms #66

Open
caufieldjh opened this issue Jul 26, 2022 · 7 comments

Comments

@caufieldjh
Collaborator

Many transforms appear to be missing descriptions, IRIs, and possibly other fields populated in the previous set of transforms.
Will need to verify the JSON -> TSV step is populating fields as expected, particularly name and description.
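
One way to check the JSON -> TSV step is to count how many rows actually carry a value in each column. The following is a minimal sketch, not part of the project's code: it assumes a KGX-style tab-separated node file with a header row, and the name `field_fill_counts` is made up for illustration.

```python
import csv
import io
from collections import Counter


def field_fill_counts(tsv_text):
    """Count non-empty values per column in a tab-separated node table."""
    # QUOTE_NONE keeps quote characters inside descriptions from being
    # interpreted as CSV quoting.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Start every header column at zero so empty columns still show up.
    counts = Counter({name: 0 for name in (reader.fieldnames or [])})
    for row in reader:
        for name, value in row.items():
            # name is None for surplus cells; value is None for short rows.
            if name is not None and value and value.strip():
                counts[name] += 1
    return dict(counts)


sample = (
    "id\tcategory\tname\tdescription\tprovided_by\n"
    "ICD10PCS:0WJ34Z\tbiolink:Procedure\t\t\tBioPortal\n"
    "ICD10PCS:079430Z\tbiolink:Procedure\t\t\tBioPortal\n"
)
print(field_fill_counts(sample))
```

A column whose count is zero (here `name` and `description`) is a candidate for a field that the transform failed to populate.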

@caufieldjh
Collaborator Author

caufieldjh commented Jul 26, 2022

Example with ICD10PCS.

Previously:

id	category	name	provided_by	aggregator_knowledge_source	iri	object	predicate	primary_knowledge_source	relation	same_as	subject
ICD10PCS:0WJ34Z	biolink:Procedure|biolink:OntologyClass		BioPortal		http://purl.bioontology.org/ontology/ICD10PCS/0WJ34Z
ICD10PCS:079430Z	biolink:Procedure|biolink:OntologyClass		BioPortal		http://purl.bioontology.org/ontology/ICD10PCS/079430Z
ICD10PCS:0FPD4KZ	biolink:Procedure|biolink:OntologyClass		BioPortal		http://purl.bioontology.org/ontology/ICD10PCS/0FPD4KZ
ICD10PCS:2W3HX3Z	biolink:Procedure|biolink:OntologyClass		BioPortal		http://purl.bioontology.org/ontology/ICD10PCS/2W3HX3Z
ICD10PCS:2W56X1Z	biolink:Procedure|biolink:OntologyClass		BioPortal		http://purl.bioontology.org/ontology/ICD10PCS/2W56X1Z
ICD10PCS:01QC3ZZ	biolink:Procedure|biolink:OntologyClass		BioPortal		http://purl.bioontology.org/ontology/ICD10PCS/01QC3ZZ
ICD10PCS:2W0MX7Z	biolink:Procedure|biolink:OntologyClass		BioPortal		http://purl.bioontology.org/ontology/ICD10PCS/2W0MX7Z
ICD10PCS:0SJL3Z	biolink:Procedure|biolink:OntologyClass		BioPortal		http://purl.bioontology.org/ontology/ICD10PCS/0SJL3Z

Currently:

$ head transformed/ontologies/ICD10PCS/ICD10PCS_21_nodes.tsv 
id      category        name    description     provided_by
ICD10PCS:0WJ34Z biolink:Procedure                       BioPortal
ICD10PCS:079430Z        biolink:Procedure                       BioPortal
ICD10PCS:0FPD4KZ        biolink:Procedure                       BioPortal
ICD10PCS:2W3HX3Z        biolink:Procedure                       BioPortal
ICD10PCS:2W56X1Z        biolink:Procedure                       BioPortal
ICD10PCS:01QC3ZZ        biolink:Procedure                       BioPortal
ICD10PCS:2W0MX7Z        biolink:Procedure                       BioPortal
ICD10PCS:0SJL3Z biolink:Procedure                       BioPortal
ICD10PCS:2W6CX0Z        biolink:Procedure                       BioPortal

@caufieldjh
Collaborator Author

This may also be a good time to check whether the values added to edge files under primary_knowledge_source can be used in the node lists too.

@caufieldjh caufieldjh linked a pull request Jul 26, 2022 that will close this issue
@caufieldjh
Collaborator Author

Another example, with BFO.

Previous transform:

id	category	name	description	provided_by	aggregator_knowledge_source	iri	object	predicate	primary_knowledge_source	relation	same_as	subject
BFO:0000019	biolink:OntologyClass	quality		BioPortal		http://purl.obolibrary.org/obo/BFO_0000019
BFO:0000015	biolink:OntologyClass	process	p is a process = Def. p is an occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t. (axiom label in BFO2 Reference: [083-003])	BioPortal		http://purl.obolibrary.org/obo/BFO_0000015
BFO:0000016	biolink:OntologyClass	disposition		BioPortal		http://purl.obolibrary.org/obo/BFO_0000016
BFO:0000017	biolink:OntologyClass	realizable entity		BioPortal		http://purl.obolibrary.org/obo/BFO_0000017
BFO:0000018	biolink:OntologyClass	zero-dimensional spatial region		BioPortal		http://purl.obolibrary.org/obo/BFO_0000018
BFO:0000011	biolink:OntologyClass	spatiotemporal region		BioPortal		http://purl.obolibrary.org/obo/BFO_0000011
IAO:0000116	biolink:OntologyClass	editor note		BioPortal		http://purl.obolibrary.org/obo/IAO_0000116
IAO:0000117	biolink:OntologyClass	term editor		BioPortal		http://purl.obolibrary.org/obo/IAO_0000117
BFO:0000134	biolink:OntologyClass			BioPortal		http://purl.obolibrary.org/obo/BFO_0000134
BFO:0000179	biolink:OntologyClass	BFO OWL specification label	Relates an entity in the ontology to the name of the variable that is used to represent it in the code that generates the BFO OWL file from the lispy specification.	BioPortal		http://purl.obolibrary.org/obo/BFO_0000179
IAO:0000115	biolink:OntologyClass	definition		BioPortal		http://purl.obolibrary.org/obo/IAO_0000115
IAO:0000112	biolink:OntologyClass	example of usage		BioPortal		http://purl.obolibrary.org/obo/IAO_0000112
IAO:0000111	biolink:OntologyClass	editor preferred term		BioPortal		http://purl.obolibrary.org/obo/IAO_0000111
IAO:0000232	biolink:OntologyClass	curator note		BioPortal		http://purl.obolibrary.org/obo/IAO_0000232
BFO:0000008	biolink:OntologyClass	temporal region		BioPortal		http://purl.obolibrary.org/obo/BFO_0000008

Current transform:

id	category	name	description	provided_by
BFO:0000019	biolink:OntologyClass	quality		Basic Formal Ontology
BFO:0000015	biolink:OntologyClass	process	p is a process = Def. p is an occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t. (axiom label in BFO2 Reference: [083-003])	Basic Formal Ontology
BFO:0000016	biolink:OntologyClass	disposition		Basic Formal Ontology
BFO:0000017	biolink:OntologyClass	realizable entity		Basic Formal Ontology
BFO:0000018	biolink:OntologyClass	zero-dimensional spatial region		Basic Formal Ontology
BFO:0000011	biolink:OntologyClass	spatiotemporal region		Basic Formal Ontology
IAO:0000116	biolink:OntologyClass	editor note		Basic Formal Ontology
IAO:0000117	biolink:OntologyClass	term editor		Basic Formal Ontology
BFO:0000134	biolink:OntologyClass			Basic Formal Ontology
BFO:0000179	biolink:OntologyClass	BFO OWL specification label	Relates an entity in the ontology to the name of the variable that is used to represent it in the code that generates the BFO OWL file from the lispy specification.	Basic Formal Ontology
IAO:0000115	biolink:OntologyClass	definition		Basic Formal Ontology
IAO:0000112	biolink:OntologyClass	example of usage		Basic Formal Ontology
IAO:0000111	biolink:OntologyClass	editor preferred term		Basic Formal Ontology
IAO:0000232	biolink:OntologyClass	curator note		Basic Formal Ontology
BFO:0000008	biolink:OntologyClass	temporal region		Basic Formal Ontology

The name field is still populated, which is great. Three things have changed, though: provided_by is now the name of the ontology instead of the aggregator knowledge source (probably fine, but it should include the version too); the extra column headings are different (an improvement, and perhaps something KGX is doing?); and iri isn't there at all. I would really prefer to have IRIs present so that nodes can be mapped back to their source BioPortal ontologies.
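
If the iri column stays absent, the IRIs could in principle be rebuilt from the node ids. The following is a minimal sketch under stated assumptions: the prefix map is inferred only from the rows quoted in this thread, `curie_to_iri` is a hypothetical helper, and this is not how KGX itself populates iri.

```python
from typing import Optional

# Assumption: base IRIs inferred from the "previous transform" rows above.
PREFIX_TO_BASE = {
    "ICD10PCS": "http://purl.bioontology.org/ontology/ICD10PCS/",
    "BFO": "http://purl.obolibrary.org/obo/BFO_",
    "IAO": "http://purl.obolibrary.org/obo/IAO_",
}


def curie_to_iri(curie: str, prefix_map: dict = PREFIX_TO_BASE) -> Optional[str]:
    """Expand a CURIE such as 'BFO:0000019' to an IRI, or return None."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return None  # not a CURIE at all
    base = prefix_map.get(prefix)
    if base is None:
        return None  # unknown prefix: do not guess
    return base + local_id


print(curie_to_iri("BFO:0000019"))
print(curie_to_iri("ICD10PCS:0WJ34Z"))
```

Returning None for unknown prefixes is deliberate: a guessed IRI would be worse than a missing one when mapping nodes back to their source ontologies.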

@caufieldjh caufieldjh changed the title Missing descriptions and IRIs from many transforms Missing IRIs and metadata from many transforms Jul 26, 2022
@caufieldjh
Collaborator Author

This may be due to a difference in bmt or in Biolink Model itself.

@caufieldjh
Collaborator Author

Here's one confirmed difference: if I run a transform like the following

    kgx.cli.transform(inputs=[repaired_outpath],
                      input_format='obojson',
                      output=outpath,
                      output_format='tsv',
                      stream=True,
                      knowledge_sources=[("aggregator_knowledge_source", "BioPortal"),
                                         ("primary_knowledge_source", primary_knowledge_source)])

then aggregator_knowledge_source is not added to either the node file or the edge file. primary_knowledge_source is added to the edge file, but the corresponding values end up under provided_by.
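
A quick way to confirm which of these columns made it into an output file is to inspect its header line. This is a minimal sketch that assumes tab-separated output with a header row; `EXPECTED_COLUMNS` and `missing_columns` are names chosen here for illustration, not part of kgx.

```python
# Assumption: the knowledge-source columns this thread cares about.
EXPECTED_COLUMNS = (
    "provided_by",
    "aggregator_knowledge_source",
    "primary_knowledge_source",
)


def missing_columns(header_line, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from a TSV header."""
    present = set(header_line.rstrip("\r\n").split("\t"))
    # Preserve the order of `expected` so the report is stable.
    return [name for name in expected if name not in present]


node_header = "id\tcategory\tname\tdescription\tprovided_by\n"
print(missing_columns(node_header))
```

For real output, the header can be obtained by reading the first line of each node and edge file and passing it to the function.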

@caufieldjh caufieldjh removed a link to a pull request Aug 2, 2022
@caufieldjh
Collaborator Author

This isn't really a blocker - the transforms should merge perfectly well without IRIs present - so if it's related to kgx or bmt then perhaps it can be solved as part of the kg-bioportal merge.

@caufieldjh
Collaborator Author

caufieldjh commented May 10, 2023

Metadata is missing in new transforms; provided_by is back to providing only the source file name.
Example from ODNAE:

id	category	name	description	provided_by
CHEBI:25698	biolink:ChemicalSubstance	ether	A compound ROR (where R is not H).	ODNAE_3_relaxed.json
GO:0010646	biolink:BiologicalProcess	regulation of cell communication	Any process that modulates the frequency, rate or extent of cell communication. Cell communication is the process that mediates interactions between a cell and its surroundings. Encompasses interactions such as signaling or attachment between one cell and another cell, between a cell and an extracellular matrix, or between a cell and any other aspect of its environment.	ODNAE_3_relaxed.json
GO:0010647	biolink:BiologicalProcess	positive regulation of cell communication	Any process that increases the frequency, rate or extent of cell communication. Cell communication is the process that mediates interactions between a cell and its surroundings. Encompasses interactions such as signaling or attachment between one cell and another cell, between a cell and an extracellular matrix, or between a cell and any other aspect of its environment.	ODNAE_3_relaxed.json
ODNAE:0000100	biolink:NamedThing	zidovudine (Retrovir)-associated neuropathy AE		ODNAE_3_relaxed.json
DRON:00021698	biolink:Drug	Disulfiram Oral Tablet		ODNAE_3_relaxed.json

Will make this its own issue because I think I have a solution.
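
For spotting this regression across many transforms, a check on the provided_by values may help. This is only a sketch for detection, not the fix mentioned above: the list of file extensions is an assumption, and `looks_like_filename` is a hypothetical helper.

```python
import re

# Assumption: extensions that suggest a source file name rather than
# an ontology or aggregator name.
FILE_LIKE = re.compile(r"\.(json|owl|obo|tsv)$", re.IGNORECASE)


def looks_like_filename(value):
    """Return True if a provided_by value ends in a known file extension."""
    return bool(FILE_LIKE.search(value.strip()))


for value in ("ODNAE_3_relaxed.json", "BioPortal", "Basic Formal Ontology"):
    print(value, looks_like_filename(value))
```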
