Fixes on tokeniser, normalisation, qualifiers and CI #329
base: master
Span in BaseQualifier.process
Coverage Report
Files without new missing coverage
263 files skipped due to complete coverage. Coverage failure: total of 97.77% is less than 97.78% ❌
Quality Gate failed
assert not (max_steps and max_epochs), "Use only steps or epochs"
if max_epochs:
    max_steps = int(0.9 * (4464 / batch_size[0]))
This looks oddly specific 🤔
Description
Regarding tokenization:
In texts, long words can be split across lines with a "-". This can impede matching: `dia-\nbete` won't be matched by a simple "diabete" regex. To this end:
- `EDS.Tokenizer` now treats `-\n` as a token by itself
- `eds.pollution` can tag this token as to-be-discarded

Regarding `ignore_space_tokens`
With `ignore_space_tokens=True`, using `edsnlp.utils.doc_to_text.get_text` (which is used under the hood by e.g. the regex matcher) will remove linebreaks, which can be problematic in texts containing enumerations without trailing spaces. E.g., `get_text("Tabac\nAlcool\nSport", "TEXT", ignore_space_tokens=True)` would output `"TabacAlcoolSport"`. Now, we replace this `\n` with a space when necessary.

Regarding the status mapping of behavior/disorder pipes
For entities matched by those pipes, there is:
- a `_.status` attribute, set to 1 by default, but which can also take the value 2
- a `_.detailed_status` attribute, which is actually a getter that uses a mapping dictionary to get the human-readable status

When loading already-annotated docs, a status can end up being automatically set to None. To avoid a `KeyError`, we now handle this `status=None` case.

Regarding CI
`ubuntu-latest` doesn't support Python 3.7 anymore, so we should use `ubuntu-22`.
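For reviewers who want to see the three text-level behaviors from the description in isolation, here is a minimal sketch in plain Python. It does not use or reproduce the EDS-NLP implementation: `join_hyphenated`, `newlines_to_spaces`, `detailed_status` and the labels in `STATUS_LABELS` are hypothetical names and values chosen only for illustration.

```python
import re

# Hypothetical helper (not EDS-NLP code): drop a "-\n" that splits a word,
# so that "dia-\nbete" can be matched by a plain "diabete" regex.
HYPHEN_BREAK = re.compile(r"(?<=\w)-\n(?=\w)")


def join_hyphenated(text: str) -> str:
    return HYPHEN_BREAK.sub("", text)


# Hypothetical helper: replace line breaks that directly separate two
# non-space characters with a single space, instead of dropping them.
def newlines_to_spaces(text: str) -> str:
    return re.sub(r"(?<=\S)\n+(?=\S)", " ", text)


# Placeholder labels: the real mapping lives in each behavior/disorder pipe.
STATUS_LABELS = {1: "PRESENT", 2: "OTHER"}


def detailed_status(status):
    # dict.get returns None for unknown keys (including None itself),
    # so a missing status no longer raises a KeyError.
    return STATUS_LABELS.get(status)


print(join_hyphenated("dia-\nbete"))
print(newlines_to_spaces("Tabac\nAlcool\nSport"))
print(detailed_status(None))
```

Note that blindly removing `-\n` would also glue genuinely hyphenated compounds together, which is presumably why the PR keeps `-\n` as a separate token and lets `eds.pollution` decide whether to discard it.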
Checklist