How to combine tabular and non-tabular content from a PDF? #498

Open
tpanza opened this issue May 22, 2024 · 1 comment
Comments

tpanza commented May 22, 2024

Thanks for a great tool. I haven't seen this addressed anywhere, so I'll ask it here.

I have some large PDFs that consist of tables and some "regular" text. What I'd like to do is convert each PDF to a single HTML (or Markdown) file, using simple text extraction for the non-tabular parts and Camelot for the tabular parts, while keeping the overall order of the document intact.

Basically, keep all of the content in order, but with the tabular data appropriately formatted in HTML/Markdown. For my situation, I want to keep the surrounding context before and after the tables.

Is there a way to do this? If not, might someone point me to where in the Camelot code would be a good place to insert such a patch?
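
To make the goal concrete, here is a rough sketch of the merge step I have in mind. It is not Camelot code and uses no Camelot API: `TextBlock`, `TableBlock`, `table_to_markdown`, and `merge_in_reading_order` are hypothetical names made up for illustration. It assumes the text blocks come from some separate text extractor, that each table's rows come from something like Camelot's `table.df`, and that a page number and vertical extent are available for every block. It also assumes `top` is measured downward from the top of the page; PDF coordinates usually run bottom-up, so real values would probably need flipping first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    page: int
    top: float  # assumed: distance from the top of the page
    text: str


@dataclass
class TableBlock:
    page: int
    top: float
    bottom: float
    rows: List[List[str]]  # first row is treated as the header


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a Markdown pipe table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def merge_in_reading_order(texts: List[TextBlock], tables: List[TableBlock]) -> str:
    """Interleave text and tables by (page, vertical position)."""

    def inside_table(block: TextBlock) -> bool:
        # Text that starts inside a table region is most likely cell content
        # that the table already covers, so it is dropped to avoid duplicates.
        return any(
            t.page == block.page and t.top <= block.top <= t.bottom
            for t in tables
        )

    kept = [b for b in texts if not inside_table(b)]
    ordered = sorted(kept + tables, key=lambda b: (b.page, b.top))
    parts = [
        table_to_markdown(b.rows) if isinstance(b, TableBlock) else b.text
        for b in ordered
    ]
    return "\n\n".join(parts)
```

The open part of my question is how best to get the table positions out of Camelot so that a merge like this is possible, or whether there is a better place inside Camelot to do it.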

bulrush15 commented Sep 19, 2024

I'd love this too! I haven't found a free tool that does this yet.
