docs: add training tutorial, update docs & run pre-commit
percevalw committed Oct 11, 2023
1 parent e19f23a commit daed632
Showing 34 changed files with 607 additions and 234 deletions.
20 changes: 9 additions & 11 deletions README.md
@@ -5,13 +5,11 @@
[![Codecov](https://img.shields.io/codecov/c/github/aphp/edsnlp?logo=codecov&style=flat-square)](https://codecov.io/gh/aphp/edsnlp)
[![DOI](https://zenodo.org/badge/467585436.svg)](https://zenodo.org/badge/latestdoi/467585436)

# EDS-NLP
EDS-NLP is a collaborative NLP framework that aims to extract information from French clinical notes.
At its core, it is a collection of components, or pipes, that are either rule-based functions or
[deep learning modules](https://aphp.github.io/concepts/torch-component). These components are organized into a novel, efficient and modular [pipeline system](https://aphp.github.io/concepts/pipeline), built for hybrid and multi-task models. We use [spaCy](https://spacy.io) to represent documents and their annotations, and [PyTorch](https://pytorch.org/) as a deep-learning backend for trainable components.

EDS-NLP provides a set of spaCy components that are used to extract information from clinical notes written in French.

Check out the interactive [demo](https://aphp.github.io/edsnlp/demo/)!

If it's your first time with spaCy, we recommend you familiarise yourself with some of their key concepts by looking at the "[spaCy 101](https://aphp.github.io/edsnlp/latest/tutorials/spacy101/)" page in the documentation.
Although initially designed for French clinical notes, the architecture of EDS-NLP is versatile and can be used on any document. The rule-based components are fully compatible with spaCy's pipelines, and vice versa, which makes it easy to integrate and extend with other NLP tools. This library is a product of collaborative effort, and we encourage further contributions to enhance its capabilities. Check out our interactive [demo](https://aphp.github.io/edsnlp/demo/) to see EDS-NLP in action.

## Quick start

@@ -34,29 +32,29 @@ pip install edsnlp==0.9.1
Once you've installed the library, let's begin with a very simple example that extracts mentions of COVID19 in a text, and detects whether they are negated.

```python
import spacy
import edsnlp

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")

terms = dict(
covid=["covid", "coronavirus"],
)

# Sentencizer component, needed for negation detection
# Split the documents into sentences; this is needed for negation detection
nlp.add_pipe("eds.sentences")
# Matcher component
nlp.add_pipe("eds.matcher", config=dict(terms=terms))
# Negation detection
nlp.add_pipe("eds.negation")

# Process your text in one call!
doc = nlp("Le patient est atteint de covid")
doc = nlp("Le patient n'est pas atteint de covid")

doc.ents
# Out: (covid,)

doc.ents[0]._.negation
# Out: False
# Out: True
```

## Documentation
4 changes: 2 additions & 2 deletions docs/advanced-tutorials/fastapi.md
@@ -12,9 +12,9 @@ Let's create a simple NLP model, that can:
You know the drill:

```python title="pipeline.py"
import spacy
import edsnlp

nlp = spacy.blank('fr')
nlp = edsnlp.blank('fr')

nlp.add_pipe("eds.sentences")

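Since the diff hides the rest of `pipeline.py`, here is a minimal, hedged sketch of how such a model might be exposed behind FastAPI. It is not the tutorial's actual code: the `/process` route, the `Query` model and the response shape are illustrative assumptions.

```python
# Hedged sketch only — not the tutorial's code. Assumes pipeline.py exposes the
# `nlp` object built above; the route and response schema are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

from pipeline import nlp  # assumption: pipeline.py defines `nlp` as shown above

app = FastAPI(title="EDS-NLP demo API")


class Query(BaseModel):
    text: str


@app.post("/process")
def process(query: Query):
    # Run the EDS-NLP pipeline on the submitted text
    doc = nlp(query.text)
    # Return one record per extracted entity
    return {"ents": [{"label": ent.label_, "text": ent.text} for ent in doc.ents]}
```

Saving this as `app.py` (an assumed filename) and running `uvicorn app:app --reload` would let you POST a JSON payload such as `{"text": "Le patient est atteint de covid"}` to the endpoint.
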
20 changes: 10 additions & 10 deletions docs/concepts/torch-component.md
@@ -17,31 +17,31 @@ In the trainable pipes of EDS-NLP, preprocessing and postprocessing are decouple
??? details "Methods of a trainable component"

### `preprocess` {: #edsnlp.core.torch_component.TorchComponent.preprocess }

::: edsnlp.core.torch_component.TorchComponent.preprocess
options:
heading_level: 4
show_source: false
show_toc: false

### `collate` {: #edsnlp.core.torch_component.TorchComponent.collate }

::: edsnlp.core.torch_component.TorchComponent.collate
options:
heading_level: 4
show_source: false
show_toc: false

### `forward` {: #edsnlp.core.torch_component.TorchComponent.forward }

::: edsnlp.core.torch_component.TorchComponent.forward
options:
heading_level: 4
show_source: false
show_toc: false

### `postprocess` {: #edsnlp.core.torch_component.TorchComponent.postprocess }

::: edsnlp.core.torch_component.TorchComponent.postprocess
options:
heading_level: 4
@@ -50,10 +50,10 @@ In the trainable pipes of EDS-NLP, preprocessing and postprocessing are decouple


Additionally, there is a fifth method:


### `post_init` {: #edsnlp.core.torch_component.TorchComponent.post_init }

::: edsnlp.core.torch_component.TorchComponent.post_init
options:
heading_level: 3
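
To make the decoupled design above more concrete, here is a self-contained toy illustration — not EDS-NLP's actual `TorchComponent` base class, whose real signatures are documented above — showing how the four methods cooperate: `preprocess` extracts features from one document, `collate` batches them, `forward` runs the network, and `postprocess` maps predictions back onto the documents. Every name and shape below is an illustrative assumption.

```python
# Toy illustration of the preprocess -> collate -> forward -> postprocess flow.
# This is NOT EDS-NLP's TorchComponent; all names here are illustrative.
import torch


class ToyTrainableComponent(torch.nn.Module):
    def __init__(self, vocab_size: int = 256, dim: int = 8):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, dim)
        self.classifier = torch.nn.Linear(dim, 2)

    def preprocess(self, doc: str) -> dict:
        # One integer per character, standing in for real feature extraction
        return {"tokens": torch.tensor([ord(c) % 256 for c in doc])}

    def collate(self, samples: list) -> dict:
        # Pad the per-document tensors into a single batch tensor
        tokens = torch.nn.utils.rnn.pad_sequence(
            [s["tokens"] for s in samples], batch_first=True
        )
        return {"tokens": tokens}

    def forward(self, inputs: dict) -> dict:
        # Mean-pool the embeddings and score each document
        scores = self.classifier(self.embedding(inputs["tokens"]).mean(dim=1))
        return {"scores": scores}

    def postprocess(self, docs: list, outputs: dict) -> list:
        # Attach the predicted class to each input document
        labels = outputs["scores"].argmax(dim=-1).tolist()
        return list(zip(docs, labels))


docs = ["Le patient est atteint de covid", "RAS"]
component = ToyTrainableComponent()
batch = component.collate([component.preprocess(d) for d in docs])
print(component.postprocess(docs, component(batch)))
```
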
24 changes: 16 additions & 8 deletions docs/index.md
@@ -1,8 +1,10 @@
# Getting started

EDS-NLP provides a set of spaCy components that are used to extract information from clinical notes written in French.
EDS-NLP is a collaborative NLP framework that aims to extract information from French clinical notes.
At its core, it is a collection of components, or pipes, that are either rule-based functions or
[deep learning modules](https://aphp.github.io/concepts/torch-component). These components are organized into a novel, efficient and modular [pipeline system](https://aphp.github.io/concepts/pipeline), built for hybrid and multi-task models. We use [spaCy](https://spacy.io) to represent documents and their annotations, and [PyTorch](https://pytorch.org/) as a deep-learning backend for trainable components.

If it's your first time with spaCy, we recommend you familiarise yourself with some of their key concepts by looking at the "[spaCy 101](tutorials/spacy101.md)" page.
Although initially designed for French clinical notes, the architecture of EDS-NLP is versatile and can be used on any document. The rule-based components are fully compatible with spaCy's pipelines, and vice versa, which makes it easy to integrate and extend with other NLP tools. This library is a product of collaborative effort, and we encourage further contributions to enhance its capabilities. Check out our interactive [demo](https://aphp.github.io/edsnlp/demo/) to see EDS-NLP in action.

## Quick start

@@ -31,9 +33,9 @@ pip install edsnlp==0.9.1
Once you've installed the library, let's begin with a very simple example that extracts mentions of COVID19 in a text, and detects whether they are negated.

```python
import spacy
import edsnlp

nlp = spacy.blank("eds") # (1)
nlp = edsnlp.blank("eds") # (1)

terms = dict(
covid=["covid", "coronavirus"], # (2)
@@ -47,23 +49,29 @@ nlp.add_pipe("eds.matcher", config=dict(terms=terms)) # (4)
nlp.add_pipe("eds.negation")

# Process your text in one call!
doc = nlp("Le patient est atteint de covid")
doc = nlp("Le patient n'est pas atteint de covid")

doc.ents # (5)
# Out: (covid,)

doc.ents[0]._.negation # (6)
# Out: False
# Out: True
```

1. We only need spaCy's French tokenizer.
1. 'eds' is the name of the language, which defines the [tokenizer](/tokenizers).
2. This example terminology provides a very simple, and by no means exhaustive, list of synonyms for COVID19.
3. In spaCy, pipelines are added via the [`nlp.add_pipe` method](https://spacy.io/api/language#add_pipe). EDS-NLP pipelines are automatically discovered by spaCy.
4. See the [matching tutorial](tutorials/matching-a-terminology.md) for more details.
5. spaCy stores extracted entities in the [`Doc.ents` attribute](https://spacy.io/api/doc#ents).
6. The `eds.negation` component adds a `negation` custom attribute.

This example is complete, it should run as-is. Check out the [spaCy 101 page](tutorials/spacy101.md) if you're not familiar with spaCy.
This example is complete; it should run as-is.
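
As a small, hedged follow-up (not part of the original page), you can loop over the extracted entities and read the flag set by `eds.negation`:

```python
# Inspect each extracted entity and its negation status
for ent in doc.ents:
    print(ent, ent._.negation)
# Out: covid True
```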

## Tutorials

To learn more about EDS-NLP, we have prepared a series of tutorials that should cover the main features of the library.

--8<-- "docs/tutorials/overview.md:tutorials"

## Available pipeline components

4 changes: 2 additions & 2 deletions docs/pipelines/core/contextual-matcher.md
@@ -145,9 +145,9 @@ This parameter can be set to `True` **only for a single assign key per dictionary
## Examples

```python
import spacy
import edsnlp

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")

nlp.add_pipe("sentences")
nlp.add_pipe("normalizer")
32 changes: 16 additions & 16 deletions docs/pipelines/core/normalizer.md
@@ -34,10 +34,10 @@ The normaliser can act on the input text in five dimensions:
The normalisation is handled by the single `eds.normalizer` pipeline. The following code snippet is complete, and should run as is.

```python
import spacy
import edsnlp
from edsnlp.matchers.utils import get_text

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer")

# Notice the special character used for the apostrophe and the quotes
@@ -74,7 +74,7 @@ The `eds.lowercase` pipeline transforms every token to lowercase. It is not conf
Consider the following example:

```python
import spacy
import edsnlp
from edsnlp.matchers.utils import get_text

config = dict(
@@ -85,7 +85,7 @@ config = dict(
pollution=False,
)

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer", config=config)

text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"
@@ -105,7 +105,7 @@ making it more predictable than using a library such as `unidecode`.
Consider the following example:

```python
import spacy
import edsnlp
from edsnlp.matchers.utils import get_text

config = dict(
@@ -116,7 +116,7 @@ config = dict(
pollution=False,
)

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer", config=config)

text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"
@@ -135,7 +135,7 @@ Apostrophes and quotation marks can be encoded using unpredictable special chara
Consider the following example:

```python
import spacy
import edsnlp
from edsnlp.matchers.utils import get_text

config = dict(
@@ -146,7 +146,7 @@ config = dict(
pollution=False,
)

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer", config=config)

text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"
@@ -169,7 +169,7 @@ matching.
`ignore_space_tokens` parameter token to True in a downstream component.

```python
import spacy
import edsnlp

config = dict(
lowercase=False,
@@ -179,7 +179,7 @@ config = dict(
pollution=False,
)

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer", config=config)

doc = nlp("Phrase avec des espaces \n et un retour à la ligne")
@@ -194,7 +194,7 @@ The pollution pipeline uses a set of regular expressions to detect pollutions (i
Consider the following example:

```python
import spacy
import edsnlp
from edsnlp.matchers.utils import get_text

config = dict(
@@ -205,7 +205,7 @@ config = dict(
pollution=True,
)

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer", config=config)

text = "Pneumopathie à NBNbWbWbNbWbNBNbNbWbW `coronavirus'"
@@ -231,9 +231,9 @@ Pollution can come in various forms in clinical texts. We provide a small set of
For instance, if we consider biology tables as pollution, we only need to instantiate the `normalizer` pipe as follows:

```python
import spacy
import edsnlp

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
nlp.add_pipe(
"eds.normalizer",
config=dict(
@@ -260,9 +260,9 @@ If you want to exclude specific patterns, you can provide them as a RegEx (or a
For instance, to consider text between "AAA" and "ZZZ" as pollution you might use:

```python
import spacy
import edsnlp

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
nlp.add_pipe(
"eds.normalizer",
config=dict(
4 changes: 2 additions & 2 deletions docs/pipelines/ner/behaviors/overview.md
@@ -40,9 +40,9 @@ Some general considerations about those components:
## Usage

```{ .python .no-check }
import spacy
import edsnlp
nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.sentences")
nlp.add_pipe(
"eds.normalizer",
4 changes: 2 additions & 2 deletions docs/pipelines/overview.md
@@ -41,9 +41,9 @@ EDS-NLP provides easy-to-use pipeline components (aka pipes).
You can add them to your pipeline by simply calling `add_pipe`, for instance:

```python
import spacy
import edsnlp

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")
nlp.add_pipe("eds.tnm")
8 changes: 8 additions & 0 deletions docs/pipelines/trainable/embeddings/span_pooler.md
@@ -0,0 +1,8 @@
# Span Pooler {: #edsnlp.pipelines.trainable.embeddings.span_pooler.factory.create_component }

::: edsnlp.pipelines.trainable.embeddings.span_pooler.factory.create_component
options:
heading_level: 2
show_bases: false
show_source: false
only_class_level: true
1 change: 1 addition & 0 deletions docs/pipelines/trainable/overview.md
@@ -12,6 +12,7 @@ All trainable components implement the [`TorchComponent`][edsnlp.core.torch_comp
|----------------------|----------------------------------------------------------------------|
| `eds.transformer` | Embed text with a transformer model |
| `eds.text_cnn` | Contextualize embeddings with a CNN |
| `eds.span_pooler` | A span embedding component that aggregates word embeddings |
| `eds.ner_crf` | A trainable component to extract entities |
| `eds.span_qualifier` | A trainable component for multi-class multi-label span qualification |

3 changes: 2 additions & 1 deletion docs/scripts/griffe_ext.py
@@ -68,7 +68,8 @@ def on_instance(self, node: Union[ast.AST, ObjectNode], obj: Object) -> None:
return

callee = (
runtime_obj.__init__ if hasattr(runtime_obj, "__init__")
runtime_obj.__init__
if hasattr(runtime_obj, "__init__")
else runtime_obj
)
spec = inspect.getfullargspec(callee)
8 changes: 4 additions & 4 deletions docs/tokenizers.md
@@ -21,15 +21,15 @@ To instantiate one of the two languages, you can call the `spacy.blank` method.
=== "EDSLanguage"

```python
import spacy
import edsnlp

nlp = spacy.blank("eds")
nlp = edsnlp.blank("eds")
```

=== "FrenchLanguage"

```python
import spacy
import edsnlp

nlp = spacy.blank("fr")
nlp = edsnlp.blank("fr")
```