Refactor to use Pytorch for training models #202

Merged (89 commits) on Dec 4, 2023
Conversation

@percevalw (Member) commented Apr 4, 2023

Description

This PR refactors EDS-NLP so that models can be trained and run for inference with PyTorch as the deep-learning backend. Rather than a thin spaCy-based wrapper around PyTorch, it is a new framework for building hybrid multi-task models.

To achieve this, instead of patching spaCy's pipeline, we implemented a new pipeline, similar to the one in aphp/edspdf#12. The new pipeline tries to preserve the existing API, especially for non-machine-learning uses such as rule-based components. Users can therefore keep using the library as before (spacy.blank('xx'), nlp.add_pipe(...)), with the added option of training models with PyTorch. We still use spaCy data structures such as Doc and Span to represent texts and their annotations.

Note that this is a work in progress and will require further testing before it can be released. Should we release it under an alpha version number first? Once testing is complete, it will be released as a stable version.

Core changes / new features:

  • Use the confit package (soon to be published) to instantiate components
  • Language.factory -> edsnlp.registry.factory.register (confit registry)
  • Lazy loading of components from their entry points (this required patching spacy.Language.__init__), which avoids wrapping every import torch statement for pure rule-based use cases. Hence, torch is not a required dependency
  • Training script with PyTorch only (tests/training/)
  • Re-implemented the trainable NER component with the new system under eds.ner
  • New efficient implementation for eds.transformer (to be used in place of spacy-transformer)
  • New eds.text_cnn embedding contextualizer
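
To illustrate the lazy-loading idea from the list above, here is a minimal, standard-library-only sketch. It is not the confit or EDS-NLP implementation: `LazyFactoryRegistry`, `register_lazy`, the factory names and the `some_heavy_backend:build` entry point are all hypothetical, chosen only to show how a registry can defer an import until a component is actually requested.

```python
import importlib
from typing import Any, Callable, Dict


class LazyFactoryRegistry:
    """Hypothetical registry (not the confit API): maps a factory name to a
    "module:attribute" string and imports the module only on first use."""

    def __init__(self) -> None:
        self._entry_points: Dict[str, str] = {}
        self._loaded: Dict[str, Callable[..., Any]] = {}

    def register_lazy(self, name: str, entry_point: str) -> None:
        # Only the string is stored here; nothing is imported yet.
        self._entry_points[name] = entry_point

    def is_loaded(self, name: str) -> bool:
        return name in self._loaded

    def get(self, name: str) -> Callable[..., Any]:
        # Import the target module the first time the factory is requested,
        # then cache the resolved callable for later calls.
        if name not in self._loaded:
            module_name, _, attr = self._entry_points[name].partition(":")
            module = importlib.import_module(module_name)
            self._loaded[name] = getattr(module, attr)
        return self._loaded[name]


registry = LazyFactoryRegistry()
# A stdlib function stands in for a lightweight, rule-based component.
registry.register_lazy("demo.mean", "statistics:mean")
# Hypothetical heavy backend: registering it does not import it, so this
# line works even though no such module is installed.
registry.register_lazy("demo.heavy", "some_heavy_backend:build")

mean_before = registry.is_loaded("demo.mean")
mean = registry.get("demo.mean")
result = mean([1, 2, 3])
```

In this sketch, a component backed by an uninstalled package only fails when it is requested through `get`, which is the property that lets a heavy dependency stay optional for users who never ask for it.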

Checklist

  • Publish confit
  • Add Span sourcing options to eds.ner (from_ents, from_span_groups)
  • Add a training recipe?
  • Re-implement the span qualifier from SpanQualifier trainable component #193
  • Update the documentation for NER
  • Add documentation for embedding components (eds.transformer, eds.text_cnn)
  • Add documentation for the new pipeline system
  • Add unit tests for the new pipeline
  • Update changelog

@codecov (bot) commented Aug 8, 2023

Codecov Report

Attention: 36 lines in your changes are missing coverage. Please review.

Comparison is base (1b62d35) 94.76% compared to head (df2bf0a) 96.58%.

❗ Current head df2bf0a differs from pull request most recent head 3ec32ab. Consider uploading reports for the commit 3ec32ab to get more accurate results

Files                               Patch %   Lines
edsnlp/optimization.py              91.89%    6 Missing ⚠️
edsnlp/core/pipeline.py             98.45%    5 Missing ⚠️
edsnlp/data/base.py                 87.17%    5 Missing ⚠️
edsnlp/data/brat.py                  0.00%    5 Missing ⚠️
edsnlp/core/torch_component.py      97.84%    4 Missing ⚠️
edsnlp/data/standoff.py             98.25%    3 Missing ⚠️
edsnlp/core/registry.py             98.34%    2 Missing ⚠️
edsnlp/data/json.py                 97.77%    2 Missing ⚠️
edsnlp/pipes/ner/adicap/models.py   85.71%    2 Missing ⚠️
edsnlp/data/converters.py           99.47%    1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #202      +/-   ##
==========================================
+ Coverage   94.76%   96.58%   +1.81%     
==========================================
  Files         233      254      +21     
  Lines        6099     8356    +2257     
==========================================
+ Hits         5780     8071    +2291     
+ Misses        319      285      -34     


@percevalw percevalw force-pushed the core-refacto branch 2 times, most recently from 62c7fbc to d06dde6 Compare August 9, 2023 12:56
@percevalw percevalw force-pushed the core-refacto branch 4 times, most recently from 440779e to a17230e Compare August 25, 2023 22:55
@percevalw percevalw marked this pull request as ready for review October 11, 2023 07:17
@percevalw percevalw force-pushed the core-refacto branch 8 times, most recently from 8d5a3d7 to d17e677 Compare October 16, 2023 17:26
@percevalw percevalw mentioned this pull request Oct 18, 2023
6 tasks
@percevalw percevalw force-pushed the core-refacto branch 4 times, most recently from 7aa37ef to a6b7e0b Compare October 26, 2023 15:05
@percevalw percevalw merged commit b9b496e into master Dec 4, 2023
8 checks passed