Large Language Model-based creation of knowledge graph from a publication #235

dexterpratt · 2024-01-18T21:05:23Z

Background

This is a new project, with the goal of creating a Python application that creates a network of interactions based on LLM analysis of a published paper. Statements of relationships in the paper are identified and then expressed using standardized interaction and entity types. The LLM analysis also creates text explaining each interaction and distinguishes (1) background information presented in the paper from (2) the original findings of the paper.

Goal

Create a Python application:
Input: academic paper that presents findings that include molecular interactions
Output: a network in which the nodes are biological entities and edges represent interactions and other relationships between the entities
Format: CX2, https://cytoscape.org/cx/cx2/specification/cytoscape-exchange-format-specification-(version-2)/
Action: upload to NDEx https://www.ndexbio.org/index.html#/
NDEx API is accessed via https://pypi.org/project/ndex2/

(note that the network in NDEx will be editable in Cytoscape www.cytoscape.org)
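
To make the target output concrete, here is a minimal sketch of how a list of extracted interactions might be serialized as a CX2-style aspect list using only the standard library. The aspect and key names (`CXVersion`, `metaData`, `attributeDeclarations`, `nodes`, `edges`, `s`, `t`, `v`) follow my reading of the CX2 specification linked above and have not been validated against it; the function name `build_cx2` and the attribute names `name` and `interaction` are assumptions chosen for illustration. A real implementation would more likely build and upload the network through the ndex2 package.

```python
import json


def build_cx2(network_name, interactions):
    """Build a simplified CX2-style aspect list from (source, interaction, target) triples.

    Illustrative only: the structure is an assumption based on the CX2
    specification and should be checked against it before real use.
    """
    node_ids = {}  # entity name -> integer node id, in order of first appearance
    edges = []
    for source, interaction, target in interactions:
        for entity in (source, target):
            node_ids.setdefault(entity, len(node_ids))
        edges.append({
            "id": len(edges),
            "s": node_ids[source],
            "t": node_ids[target],
            "v": {"interaction": interaction},
        })
    nodes = [{"id": node_id, "v": {"name": name}} for name, node_id in node_ids.items()]
    return [
        {"CXVersion": "2.0", "hasFragments": False},
        {"metaData": [
            {"name": "attributeDeclarations", "elementCount": 1},
            {"name": "networkAttributes", "elementCount": 1},
            {"name": "nodes", "elementCount": len(nodes)},
            {"name": "edges", "elementCount": len(edges)},
        ]},
        {"attributeDeclarations": [{
            "networkAttributes": {"name": {"d": "string"}},
            "nodes": {"name": {"d": "string"}},
            "edges": {"interaction": {"d": "string"}},
        }]},
        {"networkAttributes": [{"name": network_name}]},
        {"nodes": nodes},
        {"edges": edges},
        {"status": [{"error": "", "success": True}]},
    ]


if __name__ == "__main__":
    demo = build_cx2("demo network", [("AKT1", "binds", "GSK3B")])
    print(json.dumps(demo, indent=2))
```

Keeping the serialization separate from the LLM step means the extraction output can be inspected as plain triples before anything is uploaded.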

Example interactions:
AKT1 binds GSK3B
Activation of AKT1 can increase Cell Proliferation

The interactions in the network are created by LLM-based analysis in which statements of relationships in the paper are identified and then expressed using standardized interaction and entity types. The LLM analysis also creates text explaining each interaction and distinguishes (1) background information presented in the paper from (2) the original findings of the paper.
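
The pieces described above (standardized entity and interaction types, an explanation per interaction, and a background-versus-finding label) could be carried in a small record type. This is only a sketch: the class name `Interaction`, its field names, the `provenance` values, and the helper `to_graph` are all hypothetical, and the explanation strings and provenance labels in the example are placeholders rather than real extraction results.

```python
from dataclasses import dataclass

# Assumed labels for the two kinds of statement the project distinguishes.
ALLOWED_PROVENANCE = {"background", "finding"}


@dataclass(frozen=True)
class Interaction:
    subject: str      # source entity, e.g. a gene or protein name
    predicate: str    # standardized interaction type
    obj: str          # target entity or process
    explanation: str  # LLM-generated text explaining the interaction
    provenance: str   # "background" or "finding"

    def __post_init__(self):
        if self.provenance not in ALLOWED_PROVENANCE:
            raise ValueError(f"unknown provenance: {self.provenance!r}")


def to_graph(interactions):
    """Return (nodes, edges): sorted unique entity names and attributed edges."""
    nodes = sorted({name for i in interactions for name in (i.subject, i.obj)})
    edges = [
        (i.subject, i.predicate, i.obj,
         {"explanation": i.explanation, "provenance": i.provenance})
        for i in interactions
    ]
    return nodes, edges


# Placeholder records based on the example interactions in this post.
examples = [
    Interaction("AKT1", "binds", "GSK3B", "placeholder explanation", "background"),
    Interaction("AKT1", "increases", "Cell Proliferation", "placeholder explanation", "finding"),
]
nodes, edges = to_graph(examples)
```

Storing provenance as an edge attribute would let the two kinds of statement be styled or filtered separately once the network is in Cytoscape.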

Difficulty Level: Medium

Medium, Hard for "stretch goals"
The programming for this problem is medium-level
The biological knowledge required can be basic.

However, there will be many opportunities to go beyond the minimal requirements. For example, if participants have more advanced biological knowledge, they might generate graphs expressing much more complex interactions.

Size and Length of Project

large: 350 hours
12+ weeks

Skills

Essential skills:
Python

Nice to have skills:

LLM APIs, Langchain (https://www.langchain.com/), or related packages
Cytoscape and NDEx familiarity - including ndex2 python package
Knowledge Graphs
Molecular interactions, protein function, and protein modifications, pathways.
A likely format is MiTAB https://psicquic.github.io/MITAB28Format.html

Public Repository

TBD, will be created in the Cytoscape organization

Potential Mentors

Dexter Pratt, Jing Chen

Foxtrot-14 · 2024-01-27T06:28:55Z

Hello, I'm Noaman, a CS undergrad from the class of 2025. I've worked on issue 223, and I find this project impactful. I'd love to continue contributing to it.

To kick off, I plan to delve into sample papers that feature graphs. This will help me understand what prompts to provide to the LLM. (If you have any suggestions for reference papers, please share.)
The next step will be configuring the LLM. I have some exposure to the GPT-3.5 API, but feeding the document to the LLM presents a challenge. The first obstacle that comes to mind is the word limit for prompts (15,000 characters). If we provide the document in chunks, we'll be making multiple API requests per document, increasing the cost.

Let's discuss these points.

I'm eager to work on this project for GSOC 2024. I've also read about your work on NDEx, and it's fascinating. Looking forward to hearing from you.

khanspers · 2024-02-22T16:19:58Z

NRNB has been accepted as a mentoring organization for GSoC 2024. The contributor application period is March 18 – April 2. Here are some useful links:

GSoC contributor guide
NRNB project proposal template
Eligibility requirements
Full program timeline

dexterpratt · 2024-03-01T16:40:47Z

The biology is probably the steepest slope for coming up to speed. The problem is that you would need to be able to have some idea of whether the extraction is producing sense or nonsense as you develop.

- Dexter

From: Abdul Mateen Mulla
Date: Sunday, February 25, 2024 at 2:12 AM
Subject: Re: [nrnb/GoogleSummerOfCode] Large Language Model-based creation of knowledge graph from a publication (Issue #235)

Hi @dexterpratt @jingjingbic

I'm interested in Large Language Model-based creation of knowledge graph from a publication project.

My Skills with regard to Project
Hands-on: Python, Langchain/LLM
Areas to Explore: Cytoscape, NDEx, Graphs, Biology, MiTAB

Do you think this will be enough for me to grasp the project's working over the next two months before the submission? I'm open to diving deeper into these skills if needed.

Galvanized-Heart · 2024-03-06T23:33:35Z

Hey there, @dexterpratt, my name is Maxim Kirby. I did my B.Sc. in Biochemistry at the University of Waterloo and graduated in 2023. I've been coding in Python for about a year, and I've been getting really into deep learning since I am hoping to do graduate work in deep learning for protein engineering. I saw this project in Google Summer of Code and thought I could apply my skills well here because I have a strong biology and chemistry background. I'd still need to brush up on Cytoscape, NDEx, knowledge graphs for this application, and MiTAB. I think I'll still dig around and try to make some smaller contributions first, though!

Yayi0117 · 2024-03-11T01:50:33Z

Hello, @dexterpratt @jingjingbic, I'm Yayi Wang. With a BSc in Biotechnology and programming experience, I have a suitable biological background and the programming skills to contribute to this program.
After going through the related materials, I think the main challenge of this program is writing proper prompts so that the LLM generates the right entities and relations. My preliminary idea is:

By integrating iterative experimentation, refined design, and the RLHF approach, we can precisely guide LLMs to accomplish specific tasks, continuously optimizing prompts through human feedback and model output, thereby enhancing the accuracy and efficiency of information extraction and knowledge graph construction.

How feasible is this proposal? I would appreciate any guidance from you.

Favourj-bit · 2024-03-12T14:14:17Z

Hello @dexterpratt @jingjingbic
I hope this message finds you well. My name is Favour James, and I recently came across your project on the list of ideas for GSOC. I am particularly interested in this issue and I would like to contribute to this project during the GSOC'24 program. I participated under NRNB last year and worked on this project: #217

Before I begin my application, I have a few questions I hope you could clarify:

Are there specific areas within the project’s domain you recommend I focus on to better prepare myself before the project starts?
Are there some getting started tasks you would suggest I do to get familiar with the project before the contribution phase starts?
Is there a preferred development environment or setup for working on this project?
Do I need domain expertise in knowledge graphs before I can contribute to this project?
Could you suggest some academic papers that you would be considering during the project phase so I can start looking at them and experimenting with the LLM?

Thank you for your time and attention, and I look forward to hearing back from you soon.

dexterpratt · 2024-03-12T21:31:03Z

"To kick off, I plan to delve into sample papers that feature graphs. This will help me understand what prompts to provide to the LLM. (If you have any suggestions for reference papers, please share.)"

The papers don't need to be about graphs. Rather, the idea is that we use an LLM to understand a paper that reports interactions between biological entities and then express that understanding as a knowledge graph.

dexterpratt · 2024-03-12T21:33:27Z

"The next step will be configuring the LLM. I have some exposure to the GPT-3.5 API, but feeding the document to the LLM presents a challenge. The first obstacle that comes to mind is the word limit for prompts (15,000 characters). If we provide the document in chunks, we'll be making multiple API requests per document, increasing the cost."

We will handle setting up a service to use for the project which will in turn provide access to multiple LLMs. The word limits are much larger than 15K now, but it is an open question as to whether you get better results chunking or not.
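
One way to explore the chunking question is to split the paper text into overlapping windows and compare the results with a single full-text prompt. The following is a minimal sketch assuming simple character-based limits; the function name `chunk_text` and the default sizes are illustrative and not part of any project code. Token-based or section-based splitting may well work better in practice.

```python
def chunk_text(text, max_chars=15000, overlap=500):
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters so that a relationship
    stated across a chunk boundary is less likely to be cut in half.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    chunks = []
    start = 0
    step = max_chars - overlap
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

Each chunk costs one request, so the number of chunks gives a rough handle on the cost trade-off raised earlier in the thread.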

dexterpratt · 2024-03-12T21:43:36Z

"Are there specific areas within the project’s domain you recommend I focus on to better prepare myself before the project starts?
Are there some getting started tasks you would suggest I do to get familiar with the project before the contribution phase starts?
Is there a preferred development environment or setup for working on this project?
Do I need domain expertise in knowledge graphs before I can contribute to this project?
Could you suggest some academic papers that you would be considering during the project phase so I can start looking at them and experimenting with the LLM?"

We don't have a specific area of biology in mind. But, now that you ask, we are doing some LLM work in the host response to viral infection. So that might be a nice overlap. But any paper that discusses interactions such as described in the original post is a reasonable input.

We would typically be using VS Code with GitHub Copilot in Anaconda environments.

Prior experience with knowledge graphs isn't a requirement. Experience with simpler biological networks, as described in the original post, is important.

Getting started tasks - simply try prompting one of the LLMs in a public interface (e.g., ChatGPT) with text from a paper and instructions to extract the important relationships as a knowledge graph. See how far the naive experiment gets you.
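
The naive experiment can also be scripted once you move past the public chat interface. Below is a minimal sketch that builds a prompt and parses a reply, assuming a hypothetical pipe-delimited output format (`subject | interaction | object`); the prompt wording and the names `PROMPT_TEMPLATE`, `build_prompt`, and `parse_triples` are illustrative assumptions, and no particular LLM API is called.

```python
# Hypothetical prompt; the wording and the output format are assumptions.
PROMPT_TEMPLATE = (
    "Extract the important relationships between biological entities from the "
    "text below. Return one relationship per line in the form:\n"
    "subject | interaction | object\n\n"
    "Text:\n{text}"
)


def build_prompt(paper_text):
    """Fill the template with text copied from a paper."""
    return PROMPT_TEMPLATE.format(text=paper_text)


def parse_triples(response):
    """Parse 'subject | interaction | object' lines from an LLM reply.

    Lines that do not have exactly three non-empty fields are skipped, so
    stray commentary in the reply does not end up in the graph.
    """
    triples = []
    for line in response.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples
```

Paste the output of `build_prompt` into the chat interface, then pass the reply to `parse_triples` to see how many lines come back in a usable form.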

Favourj-bit · 2024-03-13T13:25:51Z

Thank you for your response.
Another thing I am trying to understand is what the format and action under the goals refer to.
Finally, may I add you to my proposal draft once I start working on it?

dexterpratt · 2024-03-15T13:01:52Z

This is not a specific proposal.

(Sent by email in reply to Yayi0117's message of Sunday, March 10, 2024.)

Yayi0117 · 2024-03-17T01:42:34Z

Thanks for your reply.
Apart from iteratively refining prompts through human feedback, I haven't figured out a more specific plan to guide the LLM to generate correct entities and relations. Could you provide some guidance on this?

Foxtrot-14 · 2024-03-29T19:57:59Z

@dexterpratt, I have shared my GSOC proposal for early feedback (to your ucsd email id which I found on the IDEKER LAB website). Kindly take a look and let me know if any changes are needed.

khanspers · 2024-05-02T16:03:00Z

This is an active GSoC 2024 project. Closing this project idea as it is no longer available to other contributors.

khanspers assigned dexterpratt and jingjingbic Jan 19, 2024

khanspers added Python Difficulty: Medium Size: 350h LLM labels Jan 19, 2024

Foxtrot-14 mentioned this issue Jan 26, 2024

rollup-config-added cytoscape/cytoscape.js-automove#34

Closed

khanspers closed this as completed May 2, 2024
