Instruction:
- install Poetry, and run `poetry install` to create an environment and install all dependencies (you might need to adapt the PyTorch version in `pyproject.toml` w.r.t. your CUDA version)
- specify your data, model, and training parameters in `config.yml`
- if needed, customize the code for data processing in `src/data.py`
- specify your model in `src/model.py`; by default it's DistilBERT for sequence classification
- run `poetry run python src/train.py`
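If you are unsure whether the installed PyTorch matches your CUDA setup, a quick check like the one below may help. It is only a sketch: the helper name `describe_torch_build` is made up for illustration and is not part of this repository. It relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
import torch


def describe_torch_build():
    """Collect basic facts about the installed PyTorch build (hypothetical helper)."""
    return {
        # Version string of the installed PyTorch package
        "torch_version": torch.__version__,
        # CUDA version PyTorch was compiled against; None for CPU-only builds
        "built_for_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime
        "cuda_available": torch.cuda.is_available(),
    }


info = describe_torch_build()
print(info)
```

Saved to a file, this can be run inside the Poetry environment with `poetry run python <file>`. If `cuda_available` is `False` on a machine with a GPU, the PyTorch version pinned in `pyproject.toml` is a likely place to look.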
Video-tutorial
I explain the pipeline in detail in a four-part video tutorial:
- Intro: overview of this pipeline, an introduction to the classification task, and an overview of the previous talk Firing a cannon at sparrows: BERT vs. logreg
- Data preparation for training: from CSV files to PyTorch DataLoaders
- The model: understanding the BERT classifier model by HuggingFace, digging into the code of the transformers library
- Training: running the pipeline with Catalyst and GPUs
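To give a rough idea of the data preparation step (from CSV files to PyTorch DataLoaders), here is a minimal sketch. It is not the code from `src/data.py`: the class `TextClassificationDataset`, the `collate` function, the column names `text` and `label`, and the toy CSV are all assumptions made for illustration, and tokenization is left out.

```python
import csv
import io

import torch
from torch.utils.data import DataLoader, Dataset


class TextClassificationDataset(Dataset):
    """Hypothetical dataset: one (text, label) pair per CSV row."""

    def __init__(self, csv_file, text_column="text", label_column="label"):
        rows = list(csv.DictReader(csv_file))
        self.texts = [row[text_column] for row in rows]
        # Map string labels to integer ids in sorted order
        label_names = sorted({row[label_column] for row in rows})
        self.label_to_id = {name: i for i, name in enumerate(label_names)}
        self.labels = [self.label_to_id[row[label_column]] for row in rows]

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        return self.texts[index], self.labels[index]


def collate(batch):
    """Keep texts as a list of strings and stack labels into a tensor."""
    texts, labels = zip(*batch)
    return list(texts), torch.tensor(labels, dtype=torch.long)


# Toy in-memory CSV standing in for a real file (assumed column names)
toy_csv = io.StringIO(
    "text,label\n"
    "great phone,positive\n"
    "broken screen,negative\n"
    "works fine,positive\n"
)

dataset = TextClassificationDataset(toy_csv)
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=collate)
batches = list(loader)

for texts, labels in batches:
    print(texts, labels)
```

In a BERT pipeline, the collate step would typically also run a HuggingFace tokenizer over the texts, for example `tokenizer(texts, padding=True, truncation=True, return_tensors="pt")`, so that each batch carries `input_ids` and `attention_mask` tensors alongside the labels.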
Also, see other tutorials/talks on the topic:
- multi-class classification: classifying Amazon product reviews into categories, Kaggle Notebook
- multi-label classification: identifying toxic comments, Kaggle Notebook
- an overview of this pipeline is given in the video Firing a cannon at sparrows: BERT vs. logreg