GitHub - laurabbb/bert-finetuning-catalyst: Code for BERT classifier finetuning for multiclass text classification

Instructions:

install Poetry, then run poetry install to create an environment and install all dependencies (you might need to adapt the PyTorch version in pyproject.toml to match your CUDA version)
specify your data, model, and training parameters in config.yml
if needed, customize the code for data processing in src/data.py
specify your model in src/model.py; by default, it is DistilBERT for sequence classification
run poetry run python src/train.py

Video tutorial

I explain the pipeline in detail in a video tutorial, which consists of four parts:

Intro: an overview of this pipeline, an introduction to the classification task, and an overview of the previous talk, "Firing a cannon at sparrows: BERT vs. logreg"
Data preparation for training: from CSV files to PyTorch DataLoaders
The model: understanding the BERT classifier model by HuggingFace, digging into the code of the transformers library
Training: running the pipeline with Catalyst and GPUs
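
The data-preparation part (from CSV rows to batches) can be sketched without PyTorch. The following is a minimal, standard-library-only illustration and not the code in src/data.py: the column names text and label, the sample rows, and the helpers read_examples and batches are all hypothetical. In the actual pipeline, a PyTorch Dataset and DataLoader would play these roles, with a tokenizer turning each text into token IDs.

```python
import csv
import io

# Hypothetical CSV content; a real run would read a file from the data folder.
SAMPLE_CSV = """text,label
great phone case,electronics
soft cotton shirt,clothing
fast usb charger,electronics
warm wool socks,clothing
sturdy hiking boots,clothing
"""


def read_examples(handle):
    """Read (text, label) rows and map each label string to an integer id."""
    rows = list(csv.DictReader(handle))
    # Sort label names so the mapping is deterministic across runs.
    label_names = sorted({row["label"] for row in rows})
    label_to_id = {name: i for i, name in enumerate(label_names)}
    examples = [(row["text"], label_to_id[row["label"]]) for row in rows]
    return examples, label_to_id


def batches(examples, batch_size):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples, label_to_id = read_examples(io.StringIO(SAMPLE_CSV))
batch_list = list(batches(examples, batch_size=2))
print(label_to_id)
print([len(batch) for batch in batch_list])
```

Mapping labels to integer ids up front matters because a sequence-classification model is trained against class indices, not label strings.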

Also, see other tutorials/talks on the topic:

multi-class classification: classifying Amazon product reviews into categories, Kaggle Notebook
multi-label classification: identifying toxic comments, Kaggle Notebook
an overview of this pipeline is given in the video "Firing a cannon at sparrows: BERT vs. logreg"
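
The multi-class and multi-label settings in the list above differ mainly in how raw model scores (logits) become predictions. A minimal sketch, using only the standard library, with made-up logits and hypothetical helper names: in multi-class classification a softmax yields one distribution and the single most likely class is chosen, while in multi-label classification each label gets an independent sigmoid and every label above a threshold is kept. The 0.5 threshold is an assumption for illustration.

```python
import math


def softmax(logits):
    """Turn logits into probabilities that sum to 1 (multi-class)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Independent probability for a single label (multi-label)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_multiclass(logits):
    """Return the index of the single most probable class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def predict_multilabel(logits, threshold=0.5):
    """Return the indices of all labels whose probability reaches the threshold."""
    return [i for i, x in enumerate(logits) if sigmoid(x) >= threshold]


logits = [2.0, -1.0, 0.5]  # made-up scores for three classes/labels
print(predict_multiclass(logits))
print(predict_multilabel(logits))
```

With the same scores, the multi-class rule returns exactly one index, whereas the multi-label rule may return several or none.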
