Tutorial for fine-tuning protein LLM #4947
Conversation
> <hands-on-title>Fetch data from Zenodo</hands-on-title>
>
> 1. Create a new folder named `fine-tuning` alongside other folders such as "data", "outputs", "elyra" or you can use your favourite folder name.
@anuprulez what do you think about extending the Jupyter IT to take a git URL? Then when the notebooks start, we clone the repo.
We could even trigger an installation of requirements.yml ... similar to Binder if we like.
> what do you think about extending the Jupyter IT to take a git URL? Then when the notebooks start, we clone the repo.
> We could even trigger an installation of requirements.yml ... similar to Binder if we like.

These are nice ideas!! Thanks!
But I need to think about how to take a Git URL and requirements.yml
The notebooks are also welcome to live in the GTN next to the tutorial, if you do not want to risk them getting out of sync.
> The notebooks are also welcome to live in the GTN next to the tutorial, if you do not want to risk them getting out of sync.

Thanks for the idea. Where should we keep them? Should we create a new folder named `notebooks` alongside the `images` or `workflows` folders? The notebook requires two FASTA files as well. Can we keep them alongside the notebook?
The advent of [large language models](https://en.wikipedia.org/wiki/Large_language_model) has transformed the field of natural language processing, enabling machines to comprehend and generate human-like language with unprecedented accuracy. Pre-trained language models, such as [BERT](https://arxiv.org/abs/1810.04805), [RoBERTa](https://arxiv.org/abs/1907.11692), and their variants, have achieved state-of-the-art results on various tasks, from sentiment analysis and question answering to language translation and text classification. Moreover, the emergence of transformer-based models, such as Generative Pre-trained Transformer ([GPT](https://openai.com/index/gpt-2-1-5b-release/)) and its variants, has enabled the creation of highly advanced language models to generate coherent and context-specific text. The latest iteration of these models, [ChatGPT](https://openai.com/index/chatgpt/), has taken the concept of conversational AI to new heights, allowing users to engage in natural-sounding conversations with machines. However, despite their impressive capabilities, these models are imperfect, and their performance can be significantly improved through fine-tuning. Fine-tuning involves adapting the pre-trained model to a specific task or domain by adjusting its parameters to optimise its performance on a target dataset. This process allows the model to learn task-specific features and relationships that may not be captured by the pre-trained model alone, resulting in highly accurate and specialised language models that can be applied to a wide range of applications.

In this tutorial, we will discuss and do hands-on to fine-tune large language model trained on protein sequences [ProtT5](https://github.com/agemagician/ProtTrans/tree/master/Fine-Tuning), exploring the benefits and challenges of this approach, as well as the various techniques and strategies such as low ranking adaptations (LoRA) that can be employed to fit large language models with billions of parameters on regular GPUs.

[Protein large language models](https://ieeexplore.ieee.org/document/9477085) (LLMs) represent a significant advancement in Bioinformatics, leveraging the power of deep learning to understand and predict the behaviour of proteins at an unprecedented scale. These models, exemplified by the [ProtTrans](https://github.com/agemagician/ProtTrans) suite, are inspired by natural language processing (NLP) techniques, applying similar methodologies to biological sequences. ProtTrans models, including BERT and T5 adaptations, are trained on vast datasets of protein sequences from databases such as [UniProt](https://www.uniprot.org/) and [BFD](https://bfd.mmseqs.com/), storing millions of protein sequences and enabling them to capture the complex patterns and functions encoded within amino acid sequences. By interpreting these sequences much like languages, protein LLMs offer transformative potential in drug discovery, disease understanding, and synthetic biology, bridging the gap between computational predictions and experimental biology. In this tutorial, we will fine-tune the ProtT5 pre-trained model for [dephosphorylation](https://en.wikipedia.org/wiki/Dephosphorylation) site prediction, a binary classification task.
(low) Instead of:
we will discuss and do hands-on to fine-tune large language model
say
we will discuss and fine-tune a large language model
Please consider using the "Suggestion Mode" feature of GitHub (see step 6).
By providing a suggestion using the proper suggestion mode:
- For authors, it is unambiguous what you are proposing
- It's also easier for them to simply accept the suggestion, PR authors prefer suggestions!
- You get credited in the Git commit helping us properly track attribution
The protein large language model has been developed using PyTorch and the model weights are stored at HuggingFace. Therefore, packages such as PyTorch, Transformers, and SentencePiece must be installed in the notebook to recreate the model. Additional packages such as Scikit-learn, Pandas, Matplotlib and Seaborn are also required for data preprocessing, manipulation and visualisation of model training and test performances. All the necessary packages are installed in the notebook using the `!pip install` command. Note: the installed packages have a lifespan equal to the notebook session. When a new session of JupyterLab is created, all the packages need to be installed again.
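For illustration, a notebook cell along the following lines covers the packages named above (the exact package list and versions in the tutorial notebook may differ; the `peft` package for the LoRA layers is an assumption on my part):

```python
# Install the required packages inside the running JupyterLab session.
# Note: these installs only last for the lifetime of the notebook session.
!pip install torch transformers sentencepiece
!pip install peft  # assumed here for the LoRA layers; the notebook may implement LoRA differently
!pip install scikit-learn pandas matplotlib seaborn
```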
### Fetch and split data
After installing all the packages and importing necessary Python packages, protein sequences (available as a FASTA file) and their labels are read into the notebook. These sequences are further divided into training and validation sets. The training set is used for fine-tuning the protein large language model, and the validation set is used for model evaluation after each training epoch.
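As a rough sketch of this step (file names and the label scheme are illustrative, not the tutorial's; since the conversation above mentions two FASTA files, one file per class is assumed here):

```python
# Minimal sketch: read sequences from two FASTA files and split into train/validation sets.
from sklearn.model_selection import train_test_split

def read_fasta(path):
    """Return a list of sequences from a FASTA file."""
    sequences, current = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current:
                    sequences.append("".join(current))
                    current = []
            elif line:
                current.append(line)
        if current:
            sequences.append("".join(current))
    return sequences

# Hypothetical file names: one FASTA file per class (positive/negative sites).
pos = read_fasta("fine-tuning/dephosphorylation_positive.fasta")
neg = read_fasta("fine-tuning/dephosphorylation_negative.fasta")
sequences = pos + neg
labels = [1] * len(pos) + [0] * len(neg)

# Hold out part of the data for validation after each training epoch.
train_seqs, valid_seqs, train_labels, valid_labels = train_test_split(
    sequences, labels, test_size=0.2, stratify=labels, random_state=42
)
```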
(low) Instead of:
After installing all the packages and importing necessary Python packages
say
After installing and importing all the necessary packages
### Define configurations for LoRA with transformer (ProtT5) model
The protein large language model (ProtT5) used in this tutorial has over 1.2 billion parameters (1,209,193,474). Training such a large model on any commercial GPU with 15GB of memory is impossible. The low-ranking adaption, [LoRA](https://arxiv.org/abs/2106.09685), the technique has been devised to make the fine-tuning process feasible on such GPUs. LoRA learns low-rank matrices and, when multiplied, takes the shape of a matrix of the original large language model. While fine-tuning, the weight matrices of the original large language model are kept frozen (not updated), and only these low-rank matrices are updated. Once fine-tuning is finished, these low-rank matrices are combined with the original frozen weight matrices to update the model. The low-rank matrices contain all the knowledge obtained by fine-tuning a small dataset. This approach helps retain the original knowledge of the model while adding the additional knowledge from the fine-tuning dataset. When LoRA is applied to the ProtT5 model, the trainable parameters become a little over 3 million (3,559,426), making it possible to fine-tune on a commercial GPU with at least around 10 GB of memory. The following figure compares [fine-tuning with and without LoRA](https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch). Fine-tuning without LoRA requires additional weight matrices to be the same size as the original model, which needs much more computational resources than LoRA, where much smaller weight matrices are learned.
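As a sketch of what such a configuration can look like with the `peft` library (an assumption; the tutorial notebook may define the low-rank layers differently, and the rank and scaling values below are purely illustrative):

```python
# Minimal LoRA configuration sketch using the peft library (values are illustrative).
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,                                  # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["q", "k", "v", "o"],  # attention projections inside the T5 blocks
    bias="none",
)
```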
(low) Instead of:
The low-ranking adaption, LoRA, the technique has been
say:
LoRA, the low-ranking adaption technique, has been
(low) Instead of:
While fine-tuning, the weight matrices of the original large language model are kept frozen (not updated), and only these low-rank matrices are updated.
say:
During fine-tuning, the weight matrices of the original large language model are kept frozen (not updated) while only these low-rank matrices are updated.
The ProtT5 model (inspired by [T5](https://huggingface.co/docs/transformers/en/model_doc/t5)) has two significant components: the [encoder and sequence classifier](https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/PT5_LoRA_Finetuning_per_prot.ipynb). The encoder learns a representation of protein sequences, and the classifier is used for downstream classification of the learned representations of sequences. The self-attention technique is used to learn sequence representations by computing weights of highly interacting regions in sequences, thereby establishing long-range dependencies. Amino acids in protein sequences are represented in vector spaces in combination with positional embedding to maintain the order of amino acids in sequences.
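To make the "sequences as language" idea concrete, this is roughly how ProtT5-style models tokenise a protein before it reaches the encoder (the checkpoint name and example sequence are assumptions; the tutorial's exact checkpoint may differ):

```python
# Sketch of ProtT5-style tokenisation: residues are treated like words, so they are
# separated by spaces, and rare amino acids (U, Z, O, B) are commonly mapped to X.
import re
from transformers import T5Tokenizer

checkpoint = "Rostlab/prot_t5_xl_uniref50"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative protein sequence
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

encoded = tokenizer(prepared, return_tensors="pt")
print(encoded["input_ids"].shape)  # token ids that are fed to the encoder
```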
### Create a model training method and train
Once the model architecture is created, the weights of the pre-trained ProtT5 are downloaded from [HuggingFace](https://huggingface.co/Rostlab/ProstT5). HuggingFace provides an openly available repository of pre-trained weights of many LLM-like architectures such as ProtT5, [Llama](https://huggingface.co/meta-llama/Meta-Llama-3-8B), [BioGPT](https://huggingface.co/microsoft/BioGPT-Large) and so on. The download of the pre-trained weights is facilitated by a Python package, `Transformers`, which provides methods for downloading weight matrices and tokenisers. After downloading the model weights and tokeniser, the original model is modified by adding LoRA layers to have low-rank matrices and the original weights are frozen. This brings down the number of parameters of the original ProtT5 model from 1.2 billion to 3.5 million. The LoRA updated model is then trained for several epochs when the error rate stops decreasing, signifying training stabilisation. The fine-tuned model is then saved to a file for later reuse for prediction.
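A condensed sketch of this flow with the `transformers` and `peft` libraries (checkpoint name, hyperparameters, and output path are illustrative; the tutorial notebook follows the linked ProtTrans fine-tuning notebook more closely):

```python
# Condensed sketch: download pre-trained weights, inject LoRA layers, then train and save.
from transformers import T5Tokenizer, T5EncoderModel
from peft import LoraConfig, get_peft_model

checkpoint = "Rostlab/prot_t5_xl_uniref50"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)
encoder = T5EncoderModel.from_pretrained(checkpoint)

# Inject the low-rank (LoRA) layers; the original ProtT5 weights stay frozen.
lora_config = LoraConfig(r=4, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q", "k", "v", "o"], bias="none")
model = get_peft_model(encoder, lora_config)
model.print_trainable_parameters()  # only a few million trainable parameters remain

# ... a classification head on top of the encoder output and a standard PyTorch training
# loop over several epochs would follow here, stopping once the validation loss stabilises ...

# Save the small set of LoRA adapter weights for later reuse in prediction.
model.save_pretrained("fine-tuning/prott5_lora_adapter")
```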
(low) Instead of:
The LoRA updated model is then trained for several epochs when the error rate stops decreasing, signifying training stabilisation. The fine-tuned model is then saved to a file for later reuse for prediction.
say:
Then, the LoRA updated model is trained for several epochs until the error rate stops decreasing which signifies training stabilisation. Next, the fine-tuned model is saved to a file where it can be reused for prediction.
![confusion_matrix](images/confusion_matrix.png "Confusion matrix of prediction on test sequences showing performance for both classes.")
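For reference, a confusion matrix like the one above can be produced with Scikit-learn and Seaborn roughly as follows (the label and prediction variables are placeholders for the notebook's own test results):

```python
# Sketch of plotting a confusion matrix for the binary classification task.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

test_labels = [0, 1, 1, 0, 1, 0, 1, 1]       # illustrative ground-truth labels
test_predictions = [0, 1, 0, 0, 1, 0, 1, 1]  # illustrative model predictions

cm = confusion_matrix(test_labels, test_predictions)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["negative", "positive"],
            yticklabels=["negative", "positive"])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Dephosphorylation site prediction on test sequences")
plt.show()
```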
## Conclusion
In the tutorial, we have discussed an approach to fine-tune a large language model trained on millions of protein sequences to classify dephosphorylation sites. Using low-ranking adaptation technique, it becomes possible to fine-tune a model having 1.2 billion trainable parameters by reducing it to contain just 3.5 million ones. The avialablity of the fine-tuning notebook provided with the tutorial and the GPU-JupyterLab infrastructure in Galaxy simplify the complex process of fine-tuning on different datasets. In addition to classification, it is also possible to extract embeddings/representations of entire protein sequences and individual amino acids in protein sequences.
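As a pointer for that last remark, per-residue and per-protein embeddings can be pulled from the encoder roughly like this (checkpoint and sequence are illustrative; in practice the fine-tuned encoder would be loaded instead):

```python
# Sketch of extracting per-residue and per-protein embeddings from the ProtT5 encoder.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

checkpoint = "Rostlab/prot_t5_xl_uniref50"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)
encoder = T5EncoderModel.from_pretrained(checkpoint).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative protein sequence
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(prepared, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape: (1, sequence length + 1, hidden size)

per_residue = hidden[0, : len(sequence)]  # one embedding vector per amino acid
per_protein = per_residue.mean(dim=0)     # single vector for the whole sequence
```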
(low) Instead of:
avialablity
say:
availability
@anuprulez This tutorial looks really good, I look forward to testing it out soon. Low priority comments only :)
@hujambo-dunia thank you for reviewing the tutorial. I will fix these minor things today. Additionally, your request for using GPU-JupyterLab has been approved, which will give you access probably beginning next week. However, there are a few ongoing issues with the underlying GPU machines that this tool uses, and therefore the tool is not functional at the moment. We are currently working on it.
I think all the comments are fixed :)
The tool is fixed and functional now :)
Thanks @anuprulez. Great tutorial!
The PR adds a tutorial for fine-tuning the ProtT5 model (a protein LLM) using Galaxy Europe's GPU-JupyterLab tool.
Can you have a look at the tutorial? ping @kxk302 @hujambo-dunia
Thanks a lot!