docs: added a new model training tutorial
percevalw committed Oct 2, 2024
1 parent 7a214cc commit 5e0fe85
Showing 5 changed files with 308 additions and 11 deletions.
4 changes: 1 addition & 3 deletions docs/assets/overrides/main.html
@@ -1,7 +1,5 @@
{% extends "base.html" %}

{% block announce %}
-EDS-NLP v0.11 introduces <a href="/concepts/pipeline#creating-a-pipeline">a new way</a>
-to add pipes to your models, an <a href="/pipes/trainable/span-linker">entity linker</a>
-and many other <a href="/changelog">features</a> !
+Check out the new <a href="/tutorials/model-training">Model Training tutorial</a>!
{% endblock %}
16 changes: 8 additions & 8 deletions docs/tutorials/make-a-training-script.md
@@ -1,6 +1,6 @@
-# Making a training script
+# Custom training script

-In this tutorial, we'll see how we can train a deep learning model with EDS-NLP. We will implement a script to train a named-entity recognition (NER) model.
+In this tutorial, we'll see how we can write our own deep learning model training script with EDS-NLP. We will implement a script to train a named-entity recognition (NER) model.

## Step-by-step walkthrough

@@ -179,16 +179,16 @@ training loop
Finally, the model is evaluated on the validation dataset and saved at regular intervals.

```{ .python .no-check }
-from edsnlp.scorers.ner import create_ner_exact_scorer
+from edsnlp.metrics.ner import create_ner_exact_metric
from copy import deepcopy
-scorer = create_ner_exact_scorer(nlp.pipes.ner.target_span_getter)
+metric = create_ner_exact_metric(nlp.pipes.ner.target_span_getter)
...
if (step % 100) == 0:
    with nlp.select_pipes(enable=["ner"]):  # (1)
-        print(scorer(val_docs, nlp.pipe(deepcopy(val_docs))))  # (2)
+        print(metric(val_docs, nlp.pipe(deepcopy(val_docs))))  # (2)
    nlp.to_disk("model")  # (3)
```
@@ -217,7 +217,7 @@ Let's wrap the training code in a function, and make it callable from the comman

import edsnlp, edsnlp.pipes as eds
from edsnlp import registry, Pipeline
-from edsnlp.scorers.ner import create_ner_exact_scorer
+from edsnlp.metrics.ner import create_ner_exact_metric


@registry.adapters.register("ner_adapter")
@@ -277,7 +277,7 @@ Let's wrap the training code in a function, and make it callable from the comman
shuffle=True,
)

-scorer = create_ner_exact_scorer(nlp.pipes.ner.target_span_getter)
+metric = create_ner_exact_metric(nlp.pipes.ner.target_span_getter)

optimizer = torch.optim.AdamW(
params=nlp.parameters(),
@@ -305,7 +305,7 @@ Let's wrap the training code in a function, and make it callable from the comman
# Evaluating the model
if (step % 100) == 0:
    with nlp.select_pipes(enable=["ner"]):
-        print(scorer(val_docs, nlp.pipe(deepcopy(val_docs))))
+        print(metric(val_docs, nlp.pipe(deepcopy(val_docs))))

nlp.to_disk("model")

294 changes: 294 additions & 0 deletions docs/tutorials/training.md
@@ -0,0 +1,294 @@
# Training a Named Entity Recognition model

In this tutorial, we'll see how we can train a deep learning model with EDS-NLP.
We also recommend looking at an existing project as a reference, such as [eds-pseudo](https://github.com/aphp/eds-pseudo) or [mlnorm](https://github.com/percevalw/mlnorm).

!!! warning "Hardware requirements"

Training a modern deep learning model requires a lot of computational resources. We recommend using a machine with a GPU, ideally with at least 16GB of VRAM. If you don't have access to a GPU, you can use a cloud service like [Google Colab](https://colab.research.google.com/), [Kaggle](https://www.kaggle.com/), [Paperspace](https://www.paperspace.com/) or [Vast.ai](https://vast.ai/).
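A quick way to check that PyTorch sees your GPU before launching a long run (a minimal sanity check, not part of the project files):

```{ .python .no-check }
import torch

# Training will be very slow without a CUDA device
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected, training will run on CPU")
```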

If you need a high level of control over the training procedure, we suggest you read the next ["Custom training script"](../make-a-training-script) tutorial.

## Creating a project

If you have already installed `edsnlp[ml]` and do not want to set up a project, you can skip to the [next section](#training-the-model).

Create a new project:

```{ .bash data-md-color-scheme="slate" }
mkdir my_ner_project
cd my_ner_project

touch README.md pyproject.toml
mkdir -p configs data/dataset
```
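
These commands produce the following layout:

```{ .text }
my_ner_project/
├── README.md
├── pyproject.toml
├── configs/
└── data/
    └── dataset/
```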

Add a standard `pyproject.toml` file with the following content. This
file will be used to manage the dependencies of the project and its versioning.

```{ .toml title="pyproject.toml"}
[project]
name = "my_ner_project"
version = "0.1.0"
description = ""
authors = [
    { name="Firstname Lastname", email="[email protected]" }
]
readme = "README.md"
requires-python = ">3.7.1,<4.0"

dependencies = [
    "edsnlp[ml]>=0.13.0",
    "sentencepiece>=0.1.96"
]

[project.optional-dependencies]
dev = [
    "dvc>=2.37.0; python_version >= '3.8'",
    "pandas>=1.1.0,<2.0.0; python_version < '3.8'",
    "pandas>=1.4.0,<2.0.0; python_version >= '3.8'",
    "pre-commit>=2.18.1",
    "accelerate>=0.21.0; python_version >= '3.8'",
    "rich-logger>=0.3.0"
]
```

We recommend using a virtual environment ("venv") to isolate the dependencies of your project and using [uv](https://docs.astral.sh/uv/) to install the dependencies:

```{ .bash data-md-color-scheme="slate" }
pip install uv
# skip the next two lines if you do not want a venv
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]" -p $(uv python find)
```
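
You can then check that the installation succeeded:

```{ .bash data-md-color-scheme="slate" }
python -c "import edsnlp; print(edsnlp.__version__)"
```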

## Training the model

??? note "A word about Confit"

EDS-NLP makes heavy use of [Confit](https://aphp.github.io/confit/), a configuration library that lets you call functions from Python or the CLI, and validates and optionally casts their arguments.

The EDS-NLP function used in this script is the `train` function of the `edsnlp.train` module. When a dict is passed to a type-hinted argument (either from a `config.cfg` file or when calling the function in Python), Confit instantiates the correct class with the arguments provided in the dict. For instance, we pass a dict to the `val_data` parameter, which is actually type-hinted as a `SampleGenerator`, so you could also instantiate a `SampleGenerator` object directly and pass it to the function.

You can also tell Confit explicitly which class to instantiate by adding an `@register_name = "name_of_the_registered_class"` key in a dict or config section. We make heavy use of this mechanism to build pipeline architectures.
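
For instance, a pipe declared in the config with an `@factory` key is resolved into the same object you would build yourself in Python (a sketch of the mechanism, using the NER pipe from this tutorial):

```{ .python .no-check }
import edsnlp.pipes as eds

# A config section such as
#   [components.ner]
#   @factory = "eds.ner_crf"
#   mode = "joint"
# is resolved by Confit into the equivalent Python call:
ner_pipe = eds.ner_crf(mode="joint")
```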

=== "From the command line"

Create a `config.cfg` file in the `configs` folder with the following content:

```{ .toml title="configs/config.cfg" }
# 🤖 PIPELINE DEFINITION

[nlp]
# Word-level tokenization: use the "eds" tokenizer
lang = "eds"
# Our pipeline will contain a single NER pipe
pipeline = ["ner"]
batch_size = 1
components = ${components}

# The NER pipe will be a CRF model
[components.ner]
@factory = "eds.ner_crf"
mode = "joint"
target_span_getter = ${vars.gold_span_group}
# Store the predicted spans both in doc.ents and in one span group per label
span_setter = [ "ents", "*" ]
infer_span_setter = true

# The CRF model will use a CNN to re-contextualize embeddings
[components.ner.embedding]
@factory = "eds.text_cnn"
kernel_sizes = [3]

# The base embeddings will be computed by a transformer
# with a sliding window to reduce memory usage, increase
# speed and allow for sequences longer than 512 wordpieces
[components.ner.embedding.embedding]
@factory = "eds.transformer"
model = "camembert-base"
window = 128
stride = 96

# 📈 SCORERS

# that we will use to evaluate our model
[scorer.ner]
@metrics = "eds.ner_exact"
span_getter = ${vars.gold_span_group}

# Some variables grouped here, we could also
# put their values directly in the config
[vars]
train = "./data/dataset/train"
dev = "./data/dataset/test"
gold_span_group = "gold_spans"

# 🚀 TRAIN SCRIPT OPTIONS
# -> python -m edsnlp.train --config configs/config.cfg

[train]
nlp = ${nlp}
max_steps = 2000
validation_interval = ${train.max_steps//10}
warmup_rate = 0.1
# Adapt to the VRAM of your GPU
grad_accumulation_max_tokens = 48000
batch_size = 2000 words
transformer_lr = 5e-5
task_lr = 1e-4
scorer = ${scorer}
output_path = "artifacts/model-last"

[train.train_data]
randomize = true
# Documents will be split into sub-documents of 384 words
# at most, covering multiple sentences. This makes the
# assumption that entities do not span more than 384 words.
max_length = 384
multi_sentence = true
[train.train_data.reader]
# In what kind of files (i.e. their extensions) is our
# training data stored
@readers = "standoff"
path = ${vars.train}
# What schema is used in the data files
converter = "standoff"  # the default when @readers = "standoff"
span_setter = ${vars.gold_span_group}

[train.val_data]
[train.val_data.reader]
@readers = "standoff"
path = ${vars.dev}
span_setter = ${vars.gold_span_group}

# 📦 PACKAGE SCRIPT OPTIONS
# -> python -m edsnlp.package --config configs/config.cfg

[package]
pipeline = ${train.output_path}
name = "my_ner_model"
```
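
The `standoff` reader expects BRAT-style annotations: for each document, a `.txt` file with the raw text next to a `.ann` file with the annotated entities. For example, for a `.txt` file containing "paracetamol 500mg", the `.ann` file could look like this (the `drug` and `dose` labels are made up for illustration):

```{ .text title="data/dataset/train/doc-1.ann" }
T1	drug 0 11	paracetamol
T2	dose 12 17	500mg
```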

To train the model, you can use the following command:

```{ .bash data-md-color-scheme="slate" }
python -m edsnlp.train --config configs/config.cfg --seed 42
```

*Any option can also be set either via the CLI or in `config.cfg` under `[train]`.*
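
For example, to double the number of training steps without editing the file (assuming Confit's dotted-key override syntax for CLI arguments):

```{ .bash data-md-color-scheme="slate" }
python -m edsnlp.train --config configs/config.cfg --seed 42 --train.max_steps 4000
```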

=== "From a script or a notebook"

Create a notebook with the following content:

```{ .python .no-check }
import edsnlp
from edsnlp.train import train
from edsnlp.metrics.ner import NerExactMetric
import edsnlp.pipes as eds

# 🤖 PIPELINE DEFINITION
nlp = edsnlp.blank("eds")
nlp.add_pipe(
    # The NER pipe will be a CRF model
    eds.ner_crf(
        mode="joint",
        target_span_getter="gold_spans",
        # Store the predicted spans both in doc.ents and in one span group per label
        span_setter=["ents", "*"],
        infer_span_setter=True,
        # The CRF model will use a CNN to re-contextualize embeddings
        embedding=eds.text_cnn(
            kernel_sizes=[3],
            # The base embeddings will be computed by a transformer
            embedding=eds.transformer(
                model="camembert-base",
                window=128,
                stride=96,
            ),
        ),
    )
)

# 📈 SCORERS
ner_metric = NerExactMetric(span_getter="gold_spans")

# 📚 DATA
train_data_reader = edsnlp.data.read_standoff(
    path="./data/dataset/train", span_setter="gold_spans"
)
val_data_reader = edsnlp.data.read_standoff(
    path="./data/dataset/test", span_setter="gold_spans"
)

# 🚀 TRAIN
train(
    nlp=nlp,
    max_steps=2000,
    validation_interval=200,
    warmup_rate=0.1,
    # Adapt to the VRAM of your GPU
    grad_accumulation_max_tokens=48000,
    batch_size=2000,
    transformer_lr=5e-5,
    task_lr=1e-4,
    scorer={"ner": ner_metric},
    output_path="artifacts/model-last",
    train_data={
        "randomize": True,
        # Documents will be split into sub-documents of at most 384 words,
        # covering multiple sentences. This assumes that entities do not
        # span more than 384 words.
        "max_length": 384,
        "multi_sentence": True,
        "reader": train_data_reader,
    },
    val_data={
        "reader": val_data_reader,
    },
)
```

or use the config file:

```{ .python .no-check }
from edsnlp.train import train
import edsnlp
import confit

cfg = confit.Config.from_disk(
    "configs/config.cfg", resolve=True, registry=edsnlp.registry
)
nlp = train(**cfg["train"])
```

## Use the model

You can now load the model and use it to process some text:

```{ .python .no-check }
import edsnlp

nlp = edsnlp.load("artifacts/model-last")
doc = nlp("Some sample text")
for ent in doc.ents:
    print(ent, ent.label_)
```
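
Since the pipe was configured with `span_setter=["ents", "*"]`, predictions are also written to one span group per label, which you can inspect via `doc.spans`:

```{ .python .no-check }
# Entities are also grouped by label in doc.spans
for label, spans in doc.spans.items():
    print(label, spans)
```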

## Packaging the model

To package the model and share it with friends or family (if the model does not contain sensitive data), you can use the following command:

```{ .bash data-md-color-scheme="slate" }
python -m edsnlp.package --pipeline artifacts/model-last/ --name my_ner_model --distributions sdist
```

*The packaging step can be parametrized either via the CLI or in `config.cfg` under `[package]`.*

The model saved at the training script's output path (`artifacts/model-last`) will be packaged under the name `my_ner_model` and written to the `dist` folder. You can upload it to a package registry or install it directly with

```{ .bash data-md-color-scheme="slate" }
pip install dist/my_ner_model-0.1.0.tar.gz
```
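
Once installed, the model should be loadable by its package name like any other EDS-NLP model (an assumption to verify against the packaging docs for your version):

```{ .python .no-check }
import edsnlp

# Load the installed package instead of the local artifacts folder
nlp = edsnlp.load("my_ner_model")
```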
4 changes: 4 additions & 0 deletions edsnlp/pipes/trainable/embeddings/text_cnn/text_cnn.py
@@ -27,6 +27,10 @@ class TextCnnEncoder(WordContextualizerComponent):
The `eds.text_cnn` component is a simple 1D convolutional network to contextualize
word embeddings (as computed by the `embedding` component passed as argument).
+To be memory efficient when handling batches of variable-length sequences, this
+module employs sequence packing, while taking care of avoiding contamination between
+the different docs.
+
Parameters
----------
nlp : PipelineProtocol
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -51,6 +51,7 @@ nav:
- tutorials/endlines.md
- tutorials/aggregating-results.md
- advanced-tutorials/fastapi.md
+ - tutorials/training.md
- tutorials/make-a-training-script.md
- tutorials/quick-examples.md
- Pipes:
