Performance issues on processing huge collections -> revise multithreading implementation #45
Comments
with (might not mean much, load varies)
With tokenisation:
Reducing threads to 4 instead of 40 (no tokenisation, no stdout):
Single-threaded:
Another minor feature request for debugging:
So I think the conclusion is to go for parallelisation of document processing when
Some timestamps have been added already.
I did a quick examination of the possibility of handling several files in parallel in Frog, but this is quite difficult to accomplish. A more direct way might be to create a wrapper that uses the Frog Server and feeds it several files in parallel. My guess is that such a wrapper is workable and faster to accomplish. For now, I don't consider this a bug, but a design decision.
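To make the wrapper idea a bit more concrete, here is a minimal Python sketch of the fan-out part only. It is not Frog's API: `process_in_parallel` and `fake_frog_worker` are hypothetical names, and the worker is a stand-in for whatever would actually send one document to a running Frog server and return its output. The sketch only shows how several files could be handled concurrently while one failing document does not stop the batch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def process_in_parallel(
    paths: Iterable[Path],
    worker: Callable[[Path], str],
    max_workers: int = 4,
) -> Tuple[Dict[Path, str], Dict[Path, Exception]]:
    """Run `worker` on every path concurrently; collect results and failures."""
    results: Dict[Path, str] = {}
    failures: Dict[Path, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted job back to the document it belongs to.
        futures = {pool.submit(worker, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad document must not stop the batch
                failures[path] = exc
    return results, failures


def fake_frog_worker(path: Path) -> str:
    """Hypothetical stand-in: a real worker would send `path` to a Frog server."""
    if path.suffix != ".xml":
        raise ValueError(f"not a FoLiA XML file: {path}")
    return f"processed {path.name}"


if __name__ == "__main__":
    docs = [
        Path("input/a.folia.xml"),
        Path("input/b.folia.xml"),
        Path("input/notes.txt"),
    ]
    done, failed = process_in_parallel(docs, fake_frog_worker, max_workers=4)
    print(len(done), "ok,", len(failed), "failed")
```

Threads are used here on the assumption that each worker mostly waits on the server; if the worker did the heavy lifting locally, a process pool would probably be the better fit.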
Processing huge amounts of pre-tokenised FoLiA documents (for Nederlab) is unexpectedly slow, despite disabling various modules (--skip=mcpa). In about 24 hours, about 90 documents have been processed.
Frog is called on a directory as follows (to eliminate initialisation overhead):
frog --skip=mcpa -override tokenizer.rulesFile=tokconfig-nld-historical --xmldir "." --threads 40 --testdir input/ -x
Log excerpt of a single document (rarity:/scratch/proycon/morr001cryp01_01.tok.folia.xml) in a long-running batch (days if not weeks):
(full log in rarity:/scratch/proycon/frog.log)
Comparison: a standalone run on only the highlighted document (without -nostdout):
I have some minor suggestions for better debugging:
Other possibility for testdir: