MoleculeGPT: Dataset+Model+Unit tests+Example #9698

Open
Tracked by #9694
puririshi98 opened this issue Oct 8, 2024 · 3 comments

@puririshi98
Contributor

puririshi98 commented Oct 8, 2024

🚀 The feature, motivation and pitch

Paper: https://ai4d3.github.io/papers/34.pdf
Part of the community sprint #9694
The goal of this project is to reproduce the work done in MoleculeGPT while tying it as closely as possible to the existing GNN+LLM frameworks in PyG. We recommend using as many existing PyG features as possible. Additional features that you feel will be reusable for other workflows should be added to PyG. One-off functions that are specific to this workflow can be left inside the example.
Most of the effort will likely go into building a PyG dataset that matches the one described in the paper. At a high level, the dataset is a collection of Q+A pairs from the molecular domain, with the matching molecules as context. These Q+A pairs focus on molecular property prediction.

Alternatives

No response

Additional context

No response

@puririshi98 puririshi98 changed the title MolculeGPT example+dataset MolculeGPT example+dataset Oct 8, 2024
@puririshi98 puririshi98 changed the title MolculeGPT example+dataset MolculeGPT example+dataset+model+unit tests Oct 8, 2024
@xnuohz
Contributor

xnuohz commented Oct 12, 2024

I would like to contribute to this project. I have listed what needs to be done below; some of the details need discussion ^^.

Dataset

  • Format: <SMILES, Instruction, Response>
  • Is there an existing dataset, or do we need to extract and clean the data from PubChem from scratch?
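
To make the <SMILES, Instruction, Response> format concrete, here is a minimal sketch of what one record and its text prompt might look like. The names `MoleculeQA`, `to_prompt`, `is_valid`, and the `<mol>` placeholder are hypothetical and chosen only for illustration; the example record is hand-written and not taken from the paper's dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoleculeQA:
    """One <SMILES, Instruction, Response> triple (hypothetical container)."""
    smiles: str
    instruction: str
    response: str


def to_prompt(record: MoleculeQA, mol_placeholder: str = "<mol>") -> str:
    # The placeholder marks where molecule embeddings would be spliced in;
    # the template itself is an assumption, not the paper's prompt.
    return f"{mol_placeholder}\nInstruction: {record.instruction}\nResponse:"


def is_valid(record: MoleculeQA) -> bool:
    # Basic cleaning rule: drop triples with any empty field.
    return all(s.strip() for s in (record.smiles, record.instruction, record.response))


# Illustrative record (ethanol), written by hand for this sketch.
example = MoleculeQA(
    smiles="CCO",
    instruction="What is the molecular formula of this molecule?",
    response="C2H6O",
)
print(to_prompt(example))
```

A real PyG dataset would additionally convert each SMILES string into a graph object; that step is left out here because it depends on the preprocessing choices still under discussion.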

Model

  • 2D Graph Branch
    • GraphMVP: GIN for 2D, SchNet for 3D, so I think we can use GINConv directly
    • QFormer: Implement torch_geometric.nn.attention.qformer
  • 1D Graph Branch
    • ChemBERTa-2: Use torch_geometric.nn.nlp.llm
    • QFormer: Implement torch_geometric.nn.attention.qformer
  • LLM
    • vicuna-7B-v1.5: Use torch_geometric.nn.nlp.llm
    • Not clear how to fit 1D+2D embedding and instructions to LLM
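
One possible way to feed the 1D and 2D embeddings to the LLM, sketched below, is to project the Q-Former query outputs of both branches to the LLM hidden size and prepend them as soft prompt tokens to the embedded instruction. This is an assumption modeled on prefix-style multimodal LLMs and is not confirmed by the paper; `SoftPromptFusion` and all dimensions are hypothetical, and only plain PyTorch is used.

```python
import torch
from torch import nn


class SoftPromptFusion(nn.Module):
    """Hypothetical helper: prepend projected branch queries to instruction embeddings."""

    def __init__(self, qformer_dim: int, llm_dim: int) -> None:
        super().__init__()
        # One linear projection per branch into the LLM embedding space.
        self.proj_2d = nn.Linear(qformer_dim, llm_dim)
        self.proj_1d = nn.Linear(qformer_dim, llm_dim)

    def forward(self, q_2d, q_1d, instr_emb, instr_mask):
        # q_2d, q_1d: [batch, num_queries, qformer_dim]
        # instr_emb:  [batch, num_tokens, llm_dim]
        # instr_mask: [batch, num_tokens], 1 for real tokens, 0 for padding
        prefix = torch.cat([self.proj_2d(q_2d), self.proj_1d(q_1d)], dim=1)
        inputs_embeds = torch.cat([prefix, instr_emb], dim=1)
        # Prefix positions are always attended to.
        prefix_mask = torch.ones(
            prefix.shape[:2], dtype=instr_mask.dtype, device=instr_mask.device
        )
        attention_mask = torch.cat([prefix_mask, instr_mask], dim=1)
        return inputs_embeds, attention_mask


# Toy sizes for illustration only.
fusion = SoftPromptFusion(qformer_dim=16, llm_dim=32)
embeds, mask = fusion(
    torch.randn(2, 4, 16),                 # 2D branch queries
    torch.randn(2, 4, 16),                 # 1D branch queries
    torch.randn(2, 5, 32),                 # embedded instruction tokens
    torch.ones(2, 5, dtype=torch.long),    # instruction attention mask
)
print(embeds.shape, mask.shape)
```

The resulting embeddings and mask could then be passed to a causal LLM that accepts pre-computed input embeddings. During training, the label positions covering the prefix would need to be excluded from the loss.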

@rusty1s rusty1s changed the title MolculeGPT example+dataset+model+unit tests MolculeGPT: Dataset+Model+Unit tests+Example Oct 13, 2024
@rusty1s rusty1s changed the title MolculeGPT: Dataset+Model+Unit tests+Example MoleculeGPT: Dataset+Model+Unit tests+Example Oct 13, 2024
@akihironitta
Member

> Is there an existing dataset, or do we need to extract and clean the data from PubChem from scratch?

Hey @xnuohz, sorry for the delay! I just had a quick look at the paper, and it looks like the authors haven't published the code or the dataset they curated. As a general goal, we should aim to reproduce the results from the paper by re-implementing the dataset, preprocessing, and model, together with an example script.

We can also discuss this in PyG Slack :)

(cc'ing @puririshi98 for when he's back)

@zechengz
Member

@xnuohz I think they follow the data preprocessing steps from https://github.com/chao1224/MoleculeSTM/tree/main/data as described in section 3.2.
Also, the 1D Graph Branch should be the 1D SMILES Branch, which uses an encoder designed for SMILES strings: https://github.com/seyonechithrananda/bert-loves-chemistry
