What's the difference between Document and Sentence in Spark NLP #1312
maziyarpanahi started this conversation in General
We usually use these terms interchangeably when referring to inputs in Spark NLP.
**Document**: This is the output of `DocumentAssembler`. A column in a DataFrame is the input to this annotator, and the result is the same text plus some extra metadata, which we call a `DOCUMENT` annotation, or more simply a document. This annotator doesn't care how the text in that column is structured: whether there are multiple sentences, only one single string, or anything else. The output is identical to the input unless you use one of the cleanup modes to remove new lines, etc.
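Here is a minimal Python sketch of that step (the sample text and the cleanup mode are illustrative, not from the original post):

```python
import sparknlp
from sparknlp.base import DocumentAssembler

spark = sparknlp.start()

data = spark.createDataFrame(
    [["Spark NLP is an NLP library. It runs on Apache Spark."]]
).toDF("text")

document_assembler = (
    DocumentAssembler()
    .setInputCol("text")        # the raw text column in the DataFrame
    .setOutputCol("document")   # the new DOCUMENT annotation column
    .setCleanupMode("shrink")   # optional: collapse new lines and extra whitespace
)

document_assembler.transform(data).select("document.result").show(truncate=False)
# The full string comes back as a single DOCUMENT annotation, no matter
# how many sentences it contains.
```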
**Sentence**: This is the output of either `SentenceDetector`, a rule-based annotator that detects sentences, or `SentenceDetectorDL`, which detects sentences more accurately by using a Deep Learning model trained on English and multilingual content. The input to these two annotators is the output of `DocumentAssembler`, meaning the text is broken into multiple chunks, each of which is called a sentence.
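Continuing the sketch above, both detectors plug in the same way (the pretrained model name is the public default and is my assumption, not something named in the post):

```python
from sparknlp.annotator import SentenceDetector, SentenceDetectorDLModel

# Rule-based sentence splitting
sentence_detector = (
    SentenceDetector()
    .setInputCols(["document"])
    .setOutputCol("sentence")
)

# Deep-learning-based sentence splitting (downloads a pretrained model)
sentence_detector_dl = (
    SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")
    .setInputCols(["document"])
    .setOutputCol("sentence")
)

assembled = document_assembler.transform(data)
sentence_detector.transform(assembled).select("sentence.result").show(truncate=False)
# One row per input, with the text split into its individual sentences.
```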
You can decide whether you want other annotators to annotate based on either `document` or `sentence`. Some use cases (a combined sketch follows this list):
- For instance, if you select `Document` as one of the `inputCols` in `NerDLModel`, then the entities are annotated for the whole document. But if you select `Sentence` as one of the `inputCols` to `NerDLModel`, you get entities for each sentence separately. This way you can calculate entities per sentence if that matters to you.
- If you are dealing with document classification, then you need to use the `DocumentAssembler` output as the `inputCols` to those annotators, because each document should have one or multiple labels.
- If you are using word embeddings that are limited by a max sequence length, such as BERT, ALBERT, XLNet, etc., then it's better to use `Sentence` as input, because sentences are usually smaller than a whole document, so you have a better chance of your inputs not being trimmed. (A document can have 200 tokens while `maxSentenceLength` is set to 60. Not getting trimmed content is one advantage; in addition, the longer the sequence, the sparser the vectors get, so they lose their meaning and context.)
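Putting it together, here is a sketch of the NER use case (the pretrained model names `glove_100d` and `ner_dl` come from the public Spark NLP models hub and are my choice of example; swap in whatever fits your setup):

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, NerDLModel

tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

# Entities per sentence; change "sentence" to "document" in both annotators
# below to annotate entities over the whole document instead.
embeddings = (
    WordEmbeddingsModel.pretrained("glove_100d")
    .setInputCols(["sentence", "token"])
    .setOutputCol("embeddings")
)
ner = (
    NerDLModel.pretrained("ner_dl")
    .setInputCols(["sentence", "token", "embeddings"])
    .setOutputCol("ner")
)

pipeline = Pipeline(stages=[
    document_assembler, sentence_detector, tokenizer, embeddings, ner
])
pipeline.fit(data).transform(data).select("ner.result").show(truncate=False)
```

For the max-sequence-length case, transformer annotators such as `BertEmbeddings` expose `setMaxSentenceLength`, which is why sentence-level inputs are less likely to be trimmed.

To be continued 😊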