- Harsha Vardhan (hv2237)
- Alex Brebenel (ab5181)
This repo contains several programs that test and compare the runtime speed and accuracy of a model before and after applying three different compression techniques:
- Pruning
- Knowledge Distillation
- Quantization
We use the RoBERTa-Large model as the starting template and datasets from the GLUE benchmark. For each technique, we analyze and compare how the template model and the newly compressed model run.
- Dependencies
- sklearn
- transformers
- datasets
- torch
- textpruner
- evaluate
- matplotlib
1. Update the environment variables for the cache directory and the GPU to be used.
2. Use the given notebook to download and preprocess the SST-2 evaluation dataset from the HuggingFace datasets library.
3. Initialize the RoBERTa-Large model from HuggingFace, fine-tuned on SST-2.
4. Run multiple iterations of pruning with TextPruner, varying the FFN dimension size and the number of attention heads, and plot the accuracy for each setting.
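A minimal sketch of this workflow is shown below; the cache path, checkpoint name, batch size, and pruning targets are illustrative assumptions, not the exact values used in the notebook:

```python
import os

# Environment variables for the HuggingFace cache and the GPU to use
# (illustrative values -- adjust for your machine).
os.environ["TRANSFORMERS_CACHE"] = "/tmp/hf_cache"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
import textpruner
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from textpruner import TransformerPruner, TransformerPruningConfig

# SST-2 evaluation split from the GLUE benchmark.
dataset = load_dataset("glue", "sst2", split="validation")

# Assumption: any RoBERTa-Large checkpoint fine-tuned on SST-2 works here.
model_name = "roberta-large"  # replace with your fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def collate(examples):
    # Tokenize a batch of sentences and attach labels so the pruner can
    # run forward/backward passes to estimate head/FFN importance.
    enc = tokenizer([ex["sentence"] for ex in examples],
                    padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor([ex["label"] for ex in examples])
    return enc

dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate)

# One pruning setting: shrink the FFN dimension and halve the heads.
# The notebook sweeps these two targets and plots accuracy at each point.
pruning_config = TransformerPruningConfig(
    target_ffn_size=2048,        # down from 4096 in RoBERTa-Large
    target_num_of_heads=8,       # down from 16
    pruning_method="iterative",
    n_iters=4)
pruner = TransformerPruner(model, transformer_pruning_config=pruning_config)
pruner.prune(dataloader=dataloader, save_model=True)

print(textpruner.summary(model))  # parameter counts after pruning
```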
- Dependencies
- sklearn
- transformers
- datasets
- torch
- textbrewer
- evaluate
1. Update the environment variables for the cache directory and the GPU to be used.
2. Use the given notebook to download and preprocess the SST-2 dataset from the HuggingFace datasets library.
3. Initialize a student and a teacher model using RobertaConfig from transformers.
4. Use the TextBrewer library to specify the distillation loss and the intermediate losses, and train the student model.
5. Summarize the model sizes and compute accuracies for the student and the teacher.
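A minimal sketch of the TextBrewer setup follows. It assumes a 6-layer student, an SST-2 `train_dataloader` built as in the pruning section, and illustrative temperature and layer-matching choices; none of these values are necessarily the ones used in the notebook:

```python
import torch
from transformers import RobertaConfig, RobertaForSequenceClassification
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

# Teacher: full 24-layer RoBERTa-Large; student: same width, fewer layers.
# output_hidden_states=True exposes hidden layers for the intermediate losses.
teacher = RobertaForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=2, output_hidden_states=True)
student_config = RobertaConfig.from_pretrained(
    "roberta-large", num_labels=2, num_hidden_layers=6,
    output_hidden_states=True)  # assumption: 6-layer student
student = RobertaForSequenceClassification(student_config)

def simple_adaptor(batch, model_outputs):
    # Tells TextBrewer which model outputs feed which distillation losses.
    return {"logits": model_outputs.logits, "hidden": model_outputs.hidden_states}

distill_config = DistillationConfig(
    temperature=8,         # soften the teacher's logits
    hard_label_weight=0,   # pure distillation, no hard-label loss
    kd_loss_type="ce",
    # Intermediate losses: match selected student layers to teacher layers.
    intermediate_matches=[
        {"layer_T": 8,  "layer_S": 2, "feature": "hidden", "loss": "hidden_mse", "weight": 1},
        {"layer_T": 16, "layer_S": 4, "feature": "hidden", "loss": "hidden_mse", "weight": 1},
    ])
train_config = TrainingConfig(device="cuda" if torch.cuda.is_available() else "cpu")

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher, model_S=student,
    adaptor_T=simple_adaptor, adaptor_S=simple_adaptor)

with distiller:
    distiller.train(optimizer, train_dataloader, num_epochs=3, callback=None)
```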
- Dependencies
- transformers
- numpy
- pandas
- nlp
- matplotlib
- torch
- nltk
- pytorch_lightning
- math
- os
- torchmetrics
There are two ways to run and compare the effects of 8-bit integer quantization.
Download the Before and After .ipynb files and open them in Google Colab. Next, download the raw directory contents and place them in the local repository directory in Colab. Then run all cells. After both notebooks finish running, you can compare their results.
Your Colab directory should look something like this:
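For example (the notebook and data file names here are assumptions based on the description above):

```
content/
├── Before.ipynb
├── After.ipynb
└── raw/
    └── <dataset files>
```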
Download the regular and quantize .py files. Make sure to install the following:
- !pip install transformers numpy pandas nlp matplotlib torch
- !pip install nltk pytorch-lightning torchmetrics

Note that math and os are part of the Python standard library and do not need to be installed.
Also make sure to have a ./raw directory in the same location as your two .py files.
If you are using an IDE, just run each module.
If you are on the terminal, run `python regular.py`, then run `python quantize.py`.
When both methods finish, compare the results.
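One common way to implement the 8-bit step is PyTorch's post-training dynamic quantization, which stores the weights of nn.Linear layers as 8-bit integers and quantizes activations on the fly at inference time. Whether quantize.py uses this exact API is an assumption; the checkpoint name and size-measurement helper below are illustrative:

```python
import os
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

# Load the float32 model (replace with the fine-tuned checkpoint you trained).
model = AutoModelForSequenceClassification.from_pretrained("roberta-large")
model.eval()

# Post-training dynamic quantization: every nn.Linear gets int8 weights.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

def size_on_disk_mb(m, path="tmp.pt"):
    # Serialize the state dict, report its size in MB, then clean up.
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"fp32 model: {size_on_disk_mb(model):.1f} MB")
print(f"int8 model: {size_on_disk_mb(quantized_model):.1f} MB")
```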
Pruning the attention heads shows that many of the heads contain similar information, as pruning half of them did not drop performance significantly. From hyperparameters like the number of attention heads, we see that such elaborate architectures may contain many redundant parameters. The decrease in model size with respect to the FFN dimension shows that a significant fraction of the parameters live in the feed-forward layers, but decreasing the dimension too much drastically affects performance.
Despite a 6% drop in accuracy, there is a significant decrease in model size and a corresponding improvement in inference throughput (samples per second).
TensorBoard graphs/runtime of the regular model:
TensorBoard graphs/runtime of the quantized model:
TensorBoard result example and code result/runtime of the quantized model:
We observe a roughly 5 percent speedup in runtime when training for 30 epochs, while the training loss and validation loss increased only slightly.
This shows that we have lightened the processing load and sped up the runtime of training and inference without sacrificing much in terms of loss.