Large Language Model-based creation of knowledge graph from a publication #235

Closed
dexterpratt opened this issue Jan 18, 2024 · 14 comments

Comments

@dexterpratt

dexterpratt commented Jan 18, 2024

Background

This is a new project, with the goal of creating a Python application that creates a network of interactions based on LLM analysis of a published paper. Statements of relationships in the paper are identified and then expressed using standardized interaction and entity types. The LLM analysis also creates text explaining each interaction and distinguishes (1) background information presented in the paper from (2) the original findings of the paper.

Goal

Create a Python application:
Input: academic paper that presents findings that include molecular interactions
Output: a network in which the nodes are biological entities and edges represent interactions and other relationships between the entities
Format: CX2, https://cytoscape.org/cx/cx2/specification/cytoscape-exchange-format-specification-(version-2)/
Action: upload to NDEx https://www.ndexbio.org/index.html#/
NDEx API is accessed via https://pypi.org/project/ndex2/

(note that the network in NDEx will be editable in Cytoscape www.cytoscape.org)

Example interactions:
AKT1 binds GSK3B
Activation of AKT1 can increase Cell Proliferation

The interactions in the network are created by LLM-based analysis in which statements of relationships in the paper are identified and then expressed using standardized interaction and entity types. The LLM analysis also creates text explaining each interaction and distinguishes (1) background information presented in the paper from (2) the original findings of the paper.
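One way to picture the data this describes, as a minimal sketch only: each extracted statement becomes a record holding the two entities, the interaction type, the explanatory text, and a flag separating background from original findings. The names `Interaction`, `provenance`, and `build_network` are hypothetical, the explanation strings and provenance values are placeholders, and the resulting dictionary is a plain intermediate form rather than CX2. Conversion to CX2 and upload to NDEx through `ndex2` are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One relationship statement extracted from a paper (hypothetical schema)."""
    source: str       # entity name, e.g. "AKT1"
    relation: str     # standardized interaction type, e.g. "binds"
    target: str       # entity name, e.g. "GSK3B"
    explanation: str  # LLM-written text explaining the interaction
    provenance: str   # "background" or "finding" (assumed labels)


def build_network(interactions):
    """Collapse interactions into de-duplicated nodes plus one edge per statement."""
    node_ids = {}
    edges = []
    for item in interactions:
        for name in (item.source, item.target):
            # Node ids are assigned in order of first appearance.
            node_ids.setdefault(name, len(node_ids))
        edges.append({
            "source": node_ids[item.source],
            "target": node_ids[item.target],
            "interaction": item.relation,
            "explanation": item.explanation,
            "provenance": item.provenance,
        })
    nodes = [{"id": node_id, "name": name} for name, node_id in node_ids.items()]
    return {"nodes": nodes, "edges": edges}


# The two example interactions from the post, simplified to triples.
# Provenance values and explanations are made up purely for illustration.
examples = [
    Interaction("AKT1", "binds", "GSK3B", "placeholder explanation", "background"),
    Interaction("AKT1", "increases", "Cell Proliferation", "placeholder explanation", "finding"),
]
network = build_network(examples)
```

Keeping provenance on the edge rather than on the node lets the same entity appear in both background statements and new findings.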

Difficulty Level: Medium

Medium, Hard for "stretch goals"
The programming for this problem is medium-level
The biological knowledge required can be basic.

However, there will be many opportunities to go beyond the minimal requirements. For example, if participants have more advanced biological knowledge, they might generate graphs expressing much more complex interactions.

Size and Length of Project

large: 350 hours
12+ weeks

Skills

Essential skills:
Python

Nice to have skills:

Public Repository

TBD, will be created in the Cytoscape organization

Potential Mentors

Dexter Pratt, Jing Chen

@Foxtrot-14

Hello, I'm Noaman, a CS undergrad from the class of 2025. I've worked on issue 223, and I find this project impactful. I'd love to continue contributing to it.

  1. To kick off, I plan to delve into sample papers that feature graphs. This will help me understand what prompts to provide to the LLM. (If you have any suggestions for reference papers, please share.)

  2. The next step will be configuring the LLM. I have some exposure to the GPT-3.5 API, but feeding the document to the LLM presents a challenge. The first obstacle that comes to mind is the word limit for prompts (15,000 characters). If we provide the document in chunks, we'll be making multiple API requests per document, increasing the cost.

Let's discuss these points.

I'm eager to work on this project for GSOC 2024. I've also read about your work on NDEx, and it's fascinating. Looking forward to hearing from you.

@khanspers
Contributor

khanspers commented Feb 22, 2024

NRNB has been accepted as a mentoring organization for GSoC 2024. The contributor application period is March 18 – April 2. Here are some useful links:

GSoC contributor guide
NRNB project proposal template
Eligibility requirements
Full program timeline

@dexterpratt
Author

dexterpratt commented Mar 1, 2024 via email

@Galvanized-Heart

Hey there, @dexterpratt, my name is Maxim Kirby. I did my B.Sc. in Biochemistry at the University of Waterloo and graduated in 2023. I've been coding in Python for about a year, and I've been getting really into deep learning since I am hoping to pursue graduate work in deep learning for protein engineering. I saw this project in Google Summer of Code and thought I could apply my skills well here because I have a strong biology and chemistry background. I'd still need to brush up on Cytoscape, NDEx, knowledge graphs for this application, and MITAB. I think I'll dig around and try to make some smaller contributions first, though!

@Yayi0117

Hello, @dexterpratt @jingjingbic, I'm Yayi Wang. With a BSc in Biotechnology and programming experience, I have a suitable biological background and the programming skills to contribute to this project.
After going through the related materials, I think the main challenge of this project is writing prompts that lead the LLM to generate the right entities and relations. My preliminary idea is:

By integrating iterative experimentation, refined design, and the RLHF approach, we can precisely guide LLMs to accomplish specific tasks, continuously optimizing prompts through human feedback and model output, thereby enhancing the accuracy and efficiency of information extraction and knowledge graph construction.

How feasible is this proposal? I would appreciate any guidance from you.

@Favourj-bit
Member

Hello @dexterpratt @jingjingbic
I hope this message finds you well. My name is Favour James, and I recently came across your project on the list of ideas for GSOC. I am particularly interested in this issue and I would like to contribute to this project during the GSOC'24 program. I participated under NRNB last year and worked on this project: #217

Before I begin my application, I have a few questions I hope you could clarify:

  1. Are there specific areas within the project’s domain you recommend I focus on to better prepare myself before the project starts?
  2. Are there some getting started tasks you would suggest I do to get familiar with the project before the contribution phase starts?
  3. Is there a preferred development environment or setup for working on this project?
  4. Do I need domain expertise in knowledge graphs before I can contribute to this project?
  5. Could you suggest some academic papers that you would consider during the project phase so I can start looking at them and experimenting with the LLM?

Thank you for your time and attention, and I look forward to hearing back from you soon.

@dexterpratt
Author

"To kick off, I plan to delve into sample papers that feature graphs. This will help me understand what prompts to provide to the LLM. (If you have any suggestions for reference papers, please share.)"

It's not the papers that are about graphs. Rather, the idea is that we use an LLM to understand a paper that reports interactions between biological entities and then express that understanding as a knowledge graph.

@dexterpratt
Author

"The next step will be configuring the LLM. I have some exposure to the GPT-3.5 API, but feeding the document to the LLM presents a challenge. The first obstacle that comes to mind is the word limit for prompts (15,000 characters). If we provide the document in chunks, we'll be making multiple API requests per document, increasing the cost."

We will set up a service for the project that will in turn provide access to multiple LLMs. The prompt size limits are much larger than 15K now, but it is an open question whether you get better results by chunking or not.
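For anyone who wants to try the chunked option, here is a minimal sketch of splitting a paper's text into overlapping character windows. The function name `chunk_text`, the 15,000-character default (borrowed from the figure quoted above purely as an illustration), and the 500-character overlap are all assumptions. The sketch makes no claim about whether chunking gives better results than sending the whole paper.

```python
def chunk_text(text, max_chars=15000, overlap=500):
    """Split text into character windows that overlap by `overlap` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")

    chunks = []
    start = 0
    step = max_chars - overlap  # how far each new window advances
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Small demonstration with tiny limits so the windows are easy to inspect.
demo_chunks = chunk_text("a" * 100, max_chars=40, overlap=10)
```

The overlap is there so that a relationship statement cut at a window boundary still appears whole in one of the two neighbouring chunks. Splitting on sentence or section boundaries instead of raw character counts would be a natural refinement.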

@dexterpratt
Author

  • "Are there specific areas within the project’s domain you recommend I focus on to better prepare myself before the project starts?
  • Are there some getting started tasks you would suggest I do to get familiar with the project before the contribution phase starts?
  • Is there a preferred development environment or setup for working on this project?
  • Do I need domain expertise in knowledge graphs before I can contribute to this project?
  • Could you suggest some academic papers that you would consider during the project phase so I can start looking at them and experimenting with the LLM?"

We don't have a specific area of biology in mind. But, now that you ask, we are doing some LLM work in the host response to viral infection. So that might be a nice overlap. But any paper that discusses interactions such as described in the original post is a reasonable input.

We would typically be using VS Code with GitHub Copilot in Anaconda environments.

Prior experience with knowledge graphs isn't a requirement. Experience with simpler biological networks, as described in the original post, is important.

Getting started tasks - simply try prompting one of the LLMs in a public interface (e.g. ChatGPT) with text from a paper and instructions to extract the important relationships as a knowledge graph. See how far the naive experiment gets you.
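If it helps to make the naive experiment repeatable, the prompt and the handling of the reply could be sketched roughly as below. The prompt wording, the JSON schema with `source`, `relation`, and `target` keys, and the helper names `build_prompt` and `parse_triples` are all assumptions. No LLM is called here: the reply would be pasted in from whichever public interface is used, and `sample_reply` is a hand-written stand-in rather than real model output.

```python
import json

# Hypothetical prompt wording; adjust freely when experimenting.
PROMPT_TEMPLATE = (
    "Extract the important relationships between biological entities from the "
    "text below. Respond with only a JSON list of objects with the keys "
    '"source", "relation", and "target".\n\nText:\n{text}'
)


def build_prompt(paper_text):
    """Insert the paper text into the extraction prompt."""
    return PROMPT_TEMPLATE.format(text=paper_text)


def parse_triples(response_text):
    """Parse a model reply, keeping only well-formed (source, relation, target) triples."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return []  # the reply was not valid JSON
    if not isinstance(data, list):
        return []
    keys = ("source", "relation", "target")
    triples = []
    for item in data:
        # Skip entries that are missing a key or hold non-string values.
        if isinstance(item, dict) and all(isinstance(item.get(k), str) for k in keys):
            triples.append(tuple(item[k] for k in keys))
    return triples


prompt = build_prompt("AKT1 binds GSK3B.")
# Hand-written stand-in for a model reply, not real model output.
sample_reply = '[{"source": "AKT1", "relation": "binds", "target": "GSK3B"}, {"source": "AKT1"}]'
triples = parse_triples(sample_reply)
```

Dropping malformed entries instead of raising keeps the experiment moving when a reply is only partly usable, at the cost of silently losing some statements.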

@Favourj-bit
Member

Thank you for your response.
Another thing I am trying to understand is what "Format" and "Action" under the Goal section refer to.
Finally, may I add you to my proposal draft once I start working on it?

@dexterpratt
Author

dexterpratt commented Mar 15, 2024 via email

@Yayi0117

"This is not a specific proposal"

Thanks for your reply.
Apart from iteratively refining prompts through human feedback, I haven't figured out a more specific plan to guide the LLM to generate correct entities and relations. Could you provide some guidance on this?

@Foxtrot-14

@dexterpratt, I have shared my GSoC proposal for early feedback (sent to your UCSD email address, which I found on the Ideker Lab website). Kindly take a look and let me know if any changes are needed.

@khanspers
Contributor

This is an active GSoC 2024 project. Closing this project idea as it is no longer available to other contributors.
