Tutorial for fine-tuning protein LLM #4947
Conversation
> <hands-on-title>Fetch data from Zenodo</hands-on-title>
>
> 1. Create a new folder named `fine-tuning` alongside other folders such as "data", "outputs", "elyra" or you can use your favourite folder name.
@anuprulez what do you think about extending the Jupyter IT to take a git URL? Then when the notebooks start, we clone the repo.
We could even trigger an installation of requirements.yml ... similar to Binder if we like.
> what do you think about extending the Jupyter IT to take a git URL? Then when the notebooks start, we clone the repo.
> We could even trigger an installation of requirements.yml ... similar to Binder if we like.

These are nice ideas!! Thanks!
But I need to think about how to take a Git URL and requirements.yml
The notebooks are also welcome to live in the GTN next to the tutorial, if you do not want to risk them getting out of sync.
> The notebooks are also welcome to live in the GTN next to the tutorial, if you do not want to risk them getting out of sync.

Thanks for the idea. Where should we keep them? Should we create a new folder named `notebooks` alongside the `images` or `workflows` folders? The notebook requires two FASTA files as well. Can we keep them alongside the notebook?
The advent of [large language models](https://en.wikipedia.org/wiki/Large_language_model) has transformed the field of natural language processing, enabling machines to comprehend and generate human-like language with unprecedented accuracy. Pre-trained language models, such as [BERT](https://arxiv.org/abs/1810.04805), [RoBERTa](https://arxiv.org/abs/1907.11692), and their variants, have achieved state-of-the-art results on various tasks, from sentiment analysis and question answering to language translation and text classification. Moreover, the emergence of transformer-based models, such as Generative Pre-trained Transformer ([GPT](https://openai.com/index/gpt-2-1-5b-release/)) and its variants, has enabled the creation of highly advanced language models to generate coherent and context-specific text. The latest iteration of these models, [ChatGPT](https://openai.com/index/chatgpt/), has taken the concept of conversational AI to new heights, allowing users to engage in natural-sounding conversations with machines. However, despite their impressive capabilities, these models are imperfect, and their performance can be significantly improved through fine-tuning. Fine-tuning involves adapting the pre-trained model to a specific task or domain by adjusting its parameters to optimise its performance on a target dataset. This process allows the model to learn task-specific features and relationships that may not be captured by the pre-trained model alone, resulting in highly accurate and specialised language models that can be applied to a wide range of applications.

In this tutorial, we will discuss and do hands-on to fine-tune large language model trained on protein sequences [ProtT5](https://github.com/agemagician/ProtTrans/tree/master/Fine-Tuning), exploring the benefits and challenges of this approach, as well as the various techniques and strategies such as low ranking adaptations (LoRA) that can be employed to fit large language models with billions of parameters on regular GPUs.

[Protein large language models](https://ieeexplore.ieee.org/document/9477085) (LLMs) represent a significant advancement in Bioinformatics, leveraging the power of deep learning to understand and predict the behaviour of proteins at an unprecedented scale. These models, exemplified by the [ProtTrans](https://github.com/agemagician/ProtTrans) suite, are inspired by natural language processing (NLP) techniques, applying similar methodologies to biological sequences. ProtTrans models, including BERT and T5 adaptations, are trained on vast datasets of protein sequences from databases such as [UniProt](https://www.uniprot.org/) and [BFD](https://bfd.mmseqs.com/), storing millions of protein sequences and enabling them to capture the complex patterns and functions encoded within amino acid sequences. By interpreting these sequences much like languages, protein LLMs offer transformative potential in drug discovery, disease understanding, and synthetic biology, bridging the gap between computational predictions and experimental biology. In this tutorial, we will fine-tune the ProtT5 pre-trained model for [dephosphorylation](https://en.wikipedia.org/wiki/Dephosphorylation) site prediction, a binary classification task.
(low) Instead of:
we will discuss and do hands-on to fine-tune large language model
say
we will discuss and fine-tune a large language model
Please consider using the "Suggestion Mode" feature of GitHub (see step 6).
By providing a suggestion using the proper suggestion mode:
- For authors, it is unambiguous what you are proposing
- It's also easier for them to simply accept the suggestion, PR authors prefer suggestions!
- You get credited in the Git commit helping us properly track attribution
The protein large language model has been developed using PyTorch and the model weights are stored at HuggingFace. Therefore, packages such as PyTorch, Transformers, and SentencePiece must be installed in the notebook to recreate the model. Additional packages such as Scikit-learn, Pandas, Matplotlib and Seaborn are also required for data preprocessing, manipulation and visualisation of model training and test performances. All the necessary packages are installed in the notebook using the `!pip install` command. Note: the installed packages have a lifespan equal to the notebook session. When a new session of JupyterLab is created, all the packages need to be installed again.
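For illustration, a notebook cell along the following lines covers the packages named above (the exact package list and versions in the tutorial notebook may differ; the `peft` package for the LoRA layers is an assumption on my part):

```python
# Install the required packages inside the running JupyterLab session.
# Note: these installs only last for the lifetime of the notebook session.
!pip install torch transformers sentencepiece
!pip install peft  # assumed here for the LoRA layers; the notebook may implement LoRA differently
!pip install scikit-learn pandas matplotlib seaborn
```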
### Fetch and split data
After installing all the packages and importing necessary Python packages, protein sequences (available as a FASTA file) and their labels are read into the notebook. These sequences are further divided into training and validation sets. The training set is used for fine-tuning the protein large language model, and the validation set is used for model evaluation after each training epoch.
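As a rough sketch of this step (file names and the label scheme are illustrative, not the tutorial's; since the conversation above mentions two FASTA files, one file per class is assumed here):

```python
# Minimal sketch: read sequences from two FASTA files and split into train/validation sets.
from sklearn.model_selection import train_test_split

def read_fasta(path):
    """Return a list of sequences from a FASTA file."""
    sequences, current = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current:
                    sequences.append("".join(current))
                    current = []
            elif line:
                current.append(line)
        if current:
            sequences.append("".join(current))
    return sequences

# Hypothetical file names: one FASTA file per class (positive/negative sites).
pos = read_fasta("fine-tuning/dephosphorylation_positive.fasta")
neg = read_fasta("fine-tuning/dephosphorylation_negative.fasta")
sequences = pos + neg
labels = [1] * len(pos) + [0] * len(neg)

# Hold out part of the data for validation after each training epoch.
train_seqs, valid_seqs, train_labels, valid_labels = train_test_split(
    sequences, labels, test_size=0.2, stratify=labels, random_state=42
)
```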
(low) Instead of:
After installing all the packages and importing necessary Python packages
say
After installing and importing all the necessary packages
### Define configurations for LoRA with transformer (ProtT5) model
The protein large language model (ProtT5) used in this tutorial has over 1.2 billion parameters (1,209,193,474). Training such a large model on any commercial GPU with 15GB of memory is impossible. The low-ranking adaption, [LoRA](https://arxiv.org/abs/2106.09685), the technique has been devised to make the fine-tuning process feasible on such GPUs. LoRA learns low-rank matrices and, when multiplied, takes the shape of a matrix of the original large language model. While fine-tuning, the weight matrices of the original large language model are kept frozen (not updated), and only these low-rank matrices are updated. Once fine-tuning is finished, these low-rank matrices are combined with the original frozen weight matrices to update the model. The low-rank matrices contain all the knowledge obtained by fine-tuning a small dataset. This approach helps retain the original knowledge of the model while adding the additional knowledge from the fine-tuning dataset. When LoRA is applied to the ProtT5 model, the trainable parameters become a little over 3 million (3,559,426), making it possible to fine-tune on a commercial GPU with at least around 10 GB of memory. The following figure compares [fine-tuning with and without LoRA](https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch). Fine-tuning without LoRA requires additional weight matrices to be the same size as the original model, which needs much more computational resources than LoRA, where much smaller weight matrices are learned.
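As a sketch of what such a configuration can look like with the `peft` library (an assumption; the tutorial notebook may define the low-rank layers differently, and the rank and scaling values below are purely illustrative):

```python
# Minimal LoRA configuration sketch using the peft library (values are illustrative).
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,                                  # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["q", "k", "v", "o"],  # attention projections inside the T5 blocks
    bias="none",
)
```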
(low) Instead of:
The low-ranking adaption, LoRA, the technique has been
say:
LoRA, the low-ranking adaption technique, has been
(low) Instead of:
While fine-tuning, the weight matrices of the original large language model are kept frozen (not updated), and only these low-rank matrices are updated.
say:
During fine-tuning, the weight matrices of the original large language model are kept frozen (not updated) while only these low-rank matrices are updated.
The ProtT5 model (inspired by [T5](https://huggingface.co/docs/transformers/en/model_doc/t5)) has two significant components: the [encoder and sequence classifier](https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/PT5_LoRA_Finetuning_per_prot.ipynb). The encoder learns a representation of protein sequences, and the classifier is used for downstream classification of the learned representations of sequences. The self-attention technique is used to learn sequence representations by computing weights of highly interacting regions in sequences, thereby establishing long-range dependencies. Amino acids in protein sequences are represented in vector spaces in combination with positional embedding to maintain the order of amino acids in sequences.
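To make the "sequences as language" idea concrete, this is roughly how ProtT5-style models tokenise a protein before it reaches the encoder (the checkpoint name and example sequence are assumptions; the tutorial's exact checkpoint may differ):

```python
# Sketch of ProtT5-style tokenisation: residues are treated like words, so they are
# separated by spaces, and rare amino acids (U, Z, O, B) are commonly mapped to X.
import re
from transformers import T5Tokenizer

checkpoint = "Rostlab/prot_t5_xl_uniref50"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative protein sequence
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

encoded = tokenizer(prepared, return_tensors="pt")
print(encoded["input_ids"].shape)  # token ids that are fed to the encoder
```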
### Create a model training method and train
Once the model architecture is created, the weights of the pre-trained ProtT5 are downloaded from [HuggingFace](https://huggingface.co/Rostlab/ProstT5). HuggingFace provides an openly available repository of pre-trained weights of many LLM-like architectures such as ProtT5, [Llama](https://huggingface.co/meta-llama/Meta-Llama-3-8B), [BioGPT](https://huggingface.co/microsoft/BioGPT-Large) and so on. The download of the pre-trained weights is facilitated by a Python package, `Transformers`, which provides methods for downloading weight matrices and tokenisers. After downloading the model weights and tokeniser, the original model is modified by adding LoRA layers to have low-rank matrices and the original weights are frozen. This brings down the number of parameters of the original ProtT5 model from 1.2 billion to 3.5 million. The LoRA updated model is then trained for several epochs when the error rate stops decreasing, signifying training stabilisation. The fine-tuned model is then saved to a file for later reuse for prediction.
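A condensed sketch of this flow with the `transformers` and `peft` libraries (checkpoint name, hyperparameters, and output path are illustrative; the tutorial notebook follows the linked ProtTrans fine-tuning notebook more closely):

```python
# Condensed sketch: download pre-trained weights, inject LoRA layers, then train and save.
from transformers import T5Tokenizer, T5EncoderModel
from peft import LoraConfig, get_peft_model

checkpoint = "Rostlab/prot_t5_xl_uniref50"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)
encoder = T5EncoderModel.from_pretrained(checkpoint)

# Inject the low-rank (LoRA) layers; the original ProtT5 weights stay frozen.
lora_config = LoraConfig(r=4, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q", "k", "v", "o"], bias="none")
model = get_peft_model(encoder, lora_config)
model.print_trainable_parameters()  # only a few million trainable parameters remain

# ... a classification head on top of the encoder output and a standard PyTorch training
# loop over several epochs would follow here, stopping once the validation loss stabilises ...

# Save the small set of LoRA adapter weights for later reuse in prediction.
model.save_pretrained("fine-tuning/prott5_lora_adapter")
```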
(low) Instead of:
The LoRA updated model is then trained for several epochs when the error rate stops decreasing, signifying training stabilisation. The fine-tuned model is then saved to a file for later reuse for prediction.
say:
Then, the LoRA updated model is trained for several epochs until the error rate stops decreasing which signifies training stabilisation. Next, the fine-tuned model is saved to a file where it can be reused for prediction.
![confusion_matrix](images/confusion_matrix.png "Confusion matrix of prediction on test sequences showing performance for both classes.")
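For reference, a confusion matrix like the one above can be produced with Scikit-learn and Seaborn roughly as follows (the label and prediction variables are placeholders for the notebook's own test results):

```python
# Sketch of plotting a confusion matrix for the binary classification task.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

test_labels = [0, 1, 1, 0, 1, 0, 1, 1]       # illustrative ground-truth labels
test_predictions = [0, 1, 0, 0, 1, 0, 1, 1]  # illustrative model predictions

cm = confusion_matrix(test_labels, test_predictions)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["negative", "positive"],
            yticklabels=["negative", "positive"])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Dephosphorylation site prediction on test sequences")
plt.show()
```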
## Conclusion
In the tutorial, we have discussed an approach to fine-tune a large language model trained on millions of protein sequences to classify dephosphorylation sites. Using low-ranking adaptation technique, it becomes possible to fine-tune a model having 1.2 billion trainable parameters by reducing it to contain just 3.5 million ones. The avialablity of the fine-tuning notebook provided with the tutorial and the GPU-JupyterLab infrastructure in Galaxy simplify the complex process of fine-tuning on different datasets. In addition to classification, it is also possible to extract embeddings/representations of entire protein sequences and individual amino acids in protein sequences.
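As a pointer for that last remark, per-residue and per-protein embeddings can be pulled from the encoder roughly like this (checkpoint and sequence are illustrative; in practice the fine-tuned encoder would be loaded instead):

```python
# Sketch of extracting per-residue and per-protein embeddings from the ProtT5 encoder.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

checkpoint = "Rostlab/prot_t5_xl_uniref50"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)
encoder = T5EncoderModel.from_pretrained(checkpoint).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative protein sequence
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(prepared, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape: (1, sequence length + 1, hidden size)

per_residue = hidden[0, : len(sequence)]  # one embedding vector per amino acid
per_protein = per_residue.mean(dim=0)     # single vector for the whole sequence
```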
(low) Instead of:
avialablity
say:
availability
@anuprulez This tutorial looks really good, I look forward to testing it out soon. Low priority comments only :)
@hujambo-dunia thank you for reviewing the tutorial. I will fix these minor things today. Additionally, your request for using GPU-JupyterLab has been approved, which will give you access probably beginning next week. However, there are a few ongoing issues with the underlying GPU machines that this tool uses, and therefore the tool is not functional at the moment. We are currently working on it.
I think all the comments are fixed :)
The tool is fixed and functional now :)
Thanks @anuprulez. Great tutorial!
The PR adds a tutorial for fine-tuning the ProtT5 model (a protein LLM) using Galaxy Europe's GPU-JupyterLab tool.
Can you have a look at the tutorial? ping @kxk302 @hujambo-dunia
Thanks a lot!