Make Frog options available in PiCCL just as in Frog web app #25

Closed
martinreynaert opened this issue Mar 23, 2018 · 7 comments

@martinreynaert
Collaborator

Hi proycon,
Right now, if one chooses Frog in the PICCL workflow, it runs every module it has.
Could you replicate Frog's option selection/deselection in the PICCL workflow, as you have it in the Frog web application?
Thanks!
Martin

@proycon
Member

proycon commented Mar 23, 2018

Yeah, that would be possible.

@zeusttu

zeusttu commented Apr 10, 2018

+1! It would be much appreciated if at least the dependency parser could be disabled. Running Frog with all options enabled on a book of several hundred pages requires an immense amount of memory, to the point of being inadvisable in production environments.

Currently a Piccl job that includes Ticcl and Frog runs out of memory and crashes when given a pre-OCR'ed PDF of the book Max Havelaar (downloaded from Google Books, in case you want to reproduce this yourself). Monitoring the memory usage of the machine during this job reveals that Ticcl does not use much memory at all, but that Frog ends up eating all 12 gigabytes of memory the machine has, plus the 800 megabytes of available swap space, shortly before it crashes.

I have attached a memory usage graph. The axis labels are in Dutch (sorry for that). The blue line represents the machine's total memory usage, minus the idle-state offset (determined as the minimum memory usage over the measured interval). The orange line represents the swap usage. The vertical red lines mark the start of the job, the transition from Ticcl to Frog and (less visible) the moment where the job crashes according to the log file, respectively.

[Figure: memory and swap usage while running PICCL with Ticcl and Frog on Max Havelaar]
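
For anyone wanting to reproduce the blue line, the offset correction described above amounts to subtracting the minimum sample from every sample. A minimal sketch in Python; the numbers are made up for illustration and are not the measurements from this job:

```python
# Hypothetical samples of total memory usage in megabytes, one per interval.
# These values are illustrative only, not the data behind the graph.
total_memory_mb = [1500, 1520, 2100, 6400, 11900]

# The idle-state offset is taken as the minimum over the measured interval.
idle_offset_mb = min(total_memory_mb)

# Subtracting the offset leaves the usage attributable to the job itself.
job_memory_mb = [sample - idle_offset_mb for sample in total_memory_mb]
print(job_memory_mb)  # [0, 20, 600, 4900, 10400]
```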

@proycon
Member

proycon commented Apr 10, 2018

Hmm, I thought I had already disabled the dependency parser by default in the current version, but indeed that seems not to be the case. I'll implement this right away then.

Thanks for the graph, that's quite insightful. Even without the dependency parser, Frog remains a memory-based system, so memory usage will be on the higher end. For our purposes I consider 12GB quite a low amount of memory and wouldn't really suggest a system with less than 32GB (our production server has 512GB, although that is shared with all other services).

I think the speed could be improved here as well (on a proper multicore system); we might have a recurrence of #13 here. I'll see if I can improve the pipeline a bit, as it wasn't optimized yet.

@zeusttu

zeusttu commented Apr 10, 2018

Wow that's a fast response, thanks! From Frog's help text I think it should be as simple as passing --skip=p to frog.

@proycon
Member

proycon commented Apr 10, 2018

Yes, indeed, it's a matter of passing a simple option; the frog.nf workflow in PICCL takes the same arguments.

I now implemented this in the latest development version (git master branch), but haven't tested it yet. I implemented it the other way round; users can select which modules they want to run, with a few preselected.
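
Roughly, the "other way round" idea (users select modules, and whatever is left unselected gets skipped) could look like the sketch below. This is not the actual PICCL code: the function name, module names and default selection are hypothetical, and only the `p` code for the dependency parser is confirmed in this thread; the other letter codes are assumptions to check against Frog's help text.

```python
# Letter codes for Frog's --skip option. Only "p" (dependency parser) is
# confirmed in this thread; the others are assumptions that should be
# verified against `frog --help` before use.
SKIP_CODES = {
    "parser": "p",
    "ner": "n",      # assumed code
    "chunker": "c",  # assumed code
    "mwu": "m",      # assumed code
}

# Hypothetical preselection: everything except the dependency parser.
DEFAULT_SELECTED = {"ner", "chunker", "mwu"}


def build_skip_argument(selected=None):
    """Return a --skip=... argument covering the modules NOT selected."""
    if selected is None:
        selected = DEFAULT_SELECTED
    unknown = set(selected) - set(SKIP_CODES)
    if unknown:
        raise ValueError(f"unknown modules: {sorted(unknown)}")
    skipped = sorted(
        code for name, code in SKIP_CODES.items() if name not in selected
    )
    # An empty string means nothing needs to be skipped.
    return "--skip=" + "".join(skipped) if skipped else ""


print(build_skip_argument())            # --skip=p
print(build_skip_argument({"parser"}))  # --skip=cmn
```

Inverting the selection in one place keeps the user-facing options positive ("run this") while still producing the negative flag that frog expects.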

@zeusttu

zeusttu commented Apr 10, 2018

I have not tested it either but looking at the diff I think it should work 🙂

I've got one small tip about Python though: dictionaries have a get method, which is similar to indexing but returns a default value if the key does not exist. So some_dict.get("x", 5) is equivalent to some_dict["x"] if "x" in some_dict else 5. The default has to be passed positionally (some_dict.get("x", default=5) raises a TypeError); if you leave it out, it defaults to None. So "x" in some_dict and some_dict["x"] could be rewritten as simply some_dict.get("x"), as long as only the truthiness of the result matters (a missing key gives None rather than False). I hope you appreciate my unsolicited code review; if not, then I apologise 😅
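
A quick sketch of the behaviour, using a made-up settings dictionary (the names are purely illustrative). Note that in CPython the default can only be given positionally:

```python
# Hypothetical settings dictionary, used only for illustration.
settings = {"parser": True}

# get() returns the stored value when the key exists...
assert settings.get("parser", False) is True
# ...and the supplied default when it does not.
assert settings.get("ner", False) is False
# Without a default, a missing key yields None.
assert settings.get("ner") is None

# The membership-then-index idiom and get() agree in truthiness but not in
# value: the idiom gives False for a missing key, get() gives None.
old_style = "ner" in settings and settings["ner"]
assert old_style is False
assert bool(old_style) == bool(settings.get("ner"))

# Passing the default as a keyword argument is rejected.
try:
    settings.get("ner", default=False)
except TypeError:
    keyword_default_ok = False
else:
    keyword_default_ok = True
```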

@proycon
Member

proycon commented Apr 10, 2018

> I have not tested it either but looking at the diff I think it should work :)

Yes, that's what I thought too, but that is usually when things go wrong in my experience ;)

I know about the dict.get() method yes, I just forget to use it and get stuck in old habits sometimes ;) So perhaps this helped ;)

@proycon proycon added the ready label Apr 10, 2018
@proycon proycon closed this as completed May 19, 2018