Large Language Model-based creation of knowledge graph from a publication #235
Comments
Hello, I'm Noaman, a CS undergrad from the class of 2025. I've worked on issue 223, and I find this project impactful. I'd love to continue contributing to it.
Let's discuss these points. I'm eager to work on this project for GSoC 2024. I've also read about your work on NDEx, and it's fascinating. Looking forward to hearing from you. |
NRNB has been accepted as a mentoring organization for GSoC 2024. The contributor application period is March 18 – April 2. Here are some useful links: GSoC contributor guide |
The biology is probably the steepest part of the learning curve. The problem is that, as you develop, you need to be able to tell whether the extraction is producing sense or nonsense.
* Dexter
From: Abdul Mateen Mulla
Date: Sunday, February 25, 2024 at 2:12 AM
To: nrnb/GoogleSummerOfCode
Cc: Dexter Pratt, Mention
Subject: Re: [nrnb/GoogleSummerOfCode] Large Language Model-based creation of knowledge graph from a publication (Issue #235)
Hi @dexterpratt @jingjingbic
I'm interested in the "Large Language Model-based creation of knowledge graph from a publication" project. My skills with regard to the project:
Hands-on: Python, Langchain/LLM
Areas to Explore: Cytoscape, NDEx, Graphs, Biology, MiTAB
Do you think this will be enough for me to grasp the project's working over the next two months before the submission? I'm open to diving deeper into these skills if needed.
|
Hey there, @dexterpratt, my name is Maxim Kirby. I did my B.Sc. in Biochemistry at the University of Waterloo and graduated in 2023. I've been coding in Python for about a year, and I've been getting really into deep learning since I am hoping to do graduate work in deep learning for protein engineering. I saw this project in Google Summer of Code and thought I could apply my skills well here because I have a strong biology and chemistry background. I'd still need to brush up on Cytoscape, NDEx, knowledge graphs for this application, and MiTAB. I think I'll still dig around and try to make some smaller contributions first though! |
Hello, @dexterpratt @jingjingbic, I'm Yayi Wang. With a BSc in Biotechnology and programming experience, I have a suitable biological background and the programming skills to contribute to this project. By integrating iterative experimentation, refined design, and the RLHF approach, we can precisely guide LLMs to accomplish specific tasks, continuously optimizing prompts through human feedback and model output, thereby enhancing the accuracy and efficiency of information extraction and knowledge graph construction. How feasible is this proposal? I would appreciate any guidance from you. |
Hello @dexterpratt @jingjingbic Before I begin my application, I have a few questions I hope you could clarify:
Thank you for your time and attention, and I look forward to hearing back from you soon. |
"To kick off, I plan to delve into sample papers that feature graphs. This will help me understand what prompts to provide to the LLM. (If you have any suggestions for reference papers, please share.)" It's not the papers that are about graphs. Rather, the idea is that we use an LLM to understand a paper that reports interactions between biological entities and then express that understanding as a knowledge graph. |
"The next step will be configuring the LLM. I have some exposure to the GPT-3.5 API, but feeding the document to the LLM presents a challenge. The first obstacle that comes to mind is the word limit for prompts (15,000 characters). If we provide the document in chunks, we'll be making multiple API requests per document, increasing the cost." We will handle setting up a service to use for the project which will in turn provide access to multiple LLMs. The word limits are much larger than 15K now, but it is an open question as to whether you get better results chunking or not. |
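As a rough illustration of the chunking option discussed above, the sketch below splits paper text into overlapping character windows. The function name `chunk_text`, the default limit of 15,000 characters, and the overlap size are assumptions made for illustration; they are not part of the project or of any LLM service.

```python
def chunk_text(text: str, max_chars: int = 15000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters so that a statement
    cut at a window boundary still appears whole in one of the chunks.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")

    chunks = []
    step = max_chars - overlap  # how far the window start advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

Each chunk would then be sent as a separate request, which is the cost trade-off raised in the quoted question. Whether chunked or whole-document prompting gives better extraction remains the open question noted above.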
We don't have a specific area of biology in mind. But, now that you ask, we are doing some LLM work on the host response to viral infection, so that might be a nice overlap. Any paper that discusses interactions such as those described in the original post is a reasonable input. We would typically be using VS Code with GitHub Copilot in Anaconda environments. Prior experience with knowledge graphs isn't a requirement. Experience with simpler biological networks, as described in the original post, is important. Getting-started task: simply try prompting one of the LLMs in a public interface (e.g., ChatGPT) with text from a paper and instructions to extract the important relationships as a knowledge graph. See how far the naive experiment gets you. |
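A minimal sketch of the naive experiment described above, assuming the prompt is pasted into a public chat interface by hand and the reply is copied back. The prompt wording, the `subject | relationship | object` line format, and the helper names `build_prompt` and `parse_triples` are all hypothetical choices for illustration, not something the mentors specified.

```python
# Hypothetical prompt wording; adjust freely when experimenting.
PROMPT_TEMPLATE = (
    "Extract the important relationships between biological entities "
    "from the text below. Return one relationship per line in the form\n"
    "subject | relationship | object\n\n"
    "Text:\n{paper_text}"
)


def build_prompt(paper_text: str) -> str:
    """Fill the template with text copied from a paper."""
    return PROMPT_TEMPLATE.format(paper_text=paper_text.strip())


def parse_triples(reply: str) -> list[tuple[str, str, str]]:
    """Keep only reply lines that have exactly three non-empty fields."""
    triples = []
    for line in reply.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples
```

Lines in the reply that do not match the requested format are silently dropped, which makes it easy to see how often the model strays from the instructions.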
Thank you for your response. |
This is not a specific proposal
From: Yayi0117
Date: Sunday, March 10, 2024 at 9:50 PM
To: nrnb/GoogleSummerOfCode
Cc: Dexter Pratt, Mention
Subject: Re: [nrnb/GoogleSummerOfCode] Large Language Model-based creation of knowledge graph from a publication (Issue #235)
Hello, @dexterpratt @jingjingbic, I'm Yayi Wang. With a BSc in Biotechnology and programming experience, I have a suitable biological background and the programming skills to contribute to this project.
After going through related materials, I think the main challenge of this project is writing prompts that lead the LLM to generate the right entities and relations. My preliminary idea is:
By integrating iterative experimentation, refined design, and the RLHF approach, we can precisely guide LLMs to accomplish specific tasks, continuously optimizing prompts through human feedback and model output, thereby enhancing the accuracy and efficiency of information extraction and knowledge graph construction.
How feasible is this proposal? I would appreciate any guidance from you.
|
Thanks for your reply. |
@dexterpratt, I have shared my GSoC proposal for early feedback (to your UCSD email address, which I found on the Ideker Lab website). Kindly take a look and let me know if any changes are needed. |
This is an active GSoC 2024 project. Closing this project idea as it is no longer available to other contributors. |
Background
This is a new project, with the goal of creating a Python application that creates a network of interactions based on LLM analysis of a published paper. Statements of relationships in the paper are identified and then expressed using standardized interaction and entity types. The LLM analysis also creates text explaining each interaction and distinguishes (1) background information presented in the paper from (2) the original findings of the paper.
Goal
Create a Python application:
Input: academic paper that presents findings that include molecular interactions
Output: a network in which the nodes are biological entities and edges represent interactions and other relationships between the entities
Format: CX2, https://cytoscape.org/cx/cx2/specification/cytoscape-exchange-format-specification-(version-2)/
Action: upload to NDEx https://www.ndexbio.org/index.html#/
NDEx API is accessed via https://pypi.org/project/ndex2/
(note that the network in NDEx will be editable in Cytoscape www.cytoscape.org)
Example interactions:
AKT1 binds GSK3B
Activation of AKT1 can increase Cell Proliferation
The interactions in the network are created by LLM-based analysis in which statements of relationships in the paper are identified and then expressed using standardized interaction and entity types. The LLM analysis also creates text explaining each interaction and distinguishes (1) background information presented in the paper from (2) the original findings of the paper.
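One possible way to hold the extracted interactions before export is sketched below. It is a simplified, assumed structure built with the standard library only; it is not the CX2 format, and the conversion to CX2 and the upload to NDEx through the `ndex2` package are deliberately left out. The `Interaction` class, its field names, and `build_network` are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Interaction:
    """One extracted statement; all field names are illustrative."""
    source: str
    relation: str
    target: str
    explanation: str           # text explaining the interaction
    is_original_finding: bool  # False means background information


def build_network(interactions: list[Interaction]) -> dict:
    """Collect unique entities as nodes and statements as edges."""
    node_ids: dict[str, int] = {}
    edges = []
    for item in interactions:
        for name in (item.source, item.target):
            # Assign the next integer id the first time an entity is seen.
            node_ids.setdefault(name, len(node_ids))
        edges.append({
            "source": node_ids[item.source],
            "target": node_ids[item.target],
            "interaction": item.relation,
            "explanation": item.explanation,
            "evidence": "finding" if item.is_original_finding else "background",
        })
    nodes = [{"id": node_id, "name": name} for name, node_id in node_ids.items()]
    return {"nodes": nodes, "edges": edges}


if __name__ == "__main__":
    # The explanations and evidence flags here are made up for the demo.
    demo = [
        Interaction("AKT1", "binds", "GSK3B", "Example explanation.", False),
        Interaction("AKT1", "increases", "Cell Proliferation", "Example explanation.", True),
    ]
    print(json.dumps(build_network(demo), indent=2))
```

Keeping the background/finding distinction as an edge attribute means it can travel with the network into whatever export format is used later.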
Difficulty Level: Medium
Medium, Hard for "stretch goals"
The programming for this problem is medium-level
The biological knowledge required can be basic.
However, there will be many opportunities to go beyond the minimal requirements. For example, if participants have more advanced biological knowledge, they might generate graphs expressing much more complex interactions.
Size and Length of Project
large: 350 hours
12+ weeks
Skills
Essential skills:
Python
Nice to have skills:
Public Repository
TBD, will be created in the Cytoscape organization
Potential Mentors
Dexter Pratt, Jing Chen