From bca743611055d22290a7a34660d4b6e24e53e19d Mon Sep 17 00:00:00 2001
From: Jamie Lemon
Date: Mon, 11 Nov 2024 14:33:08 +0000
Subject: [PATCH] Docs: Updates find_tables() method with some extra notes about Pandas.

---
 docs/page.rst | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/docs/page.rst b/docs/page.rst
index 366021dd2..c976aaa45 100644
--- a/docs/page.rst
+++ b/docs/page.rst
@@ -491,7 +491,7 @@ In a nutshell, this is what you can do with PyMuPDF:
             * `bbox`: the bounding box of the table as a tuple `(x0, y0, x1, y1)`.
             * `cells`: bounding boxes of the table's cells (list of tuples). A cell may also be `None`.
             * `extract()`: this method returns the text content of each table cell as a list of list of strings.
-            * `to_markdown()`: this method returns the table as a **string in markdown format** (compatible to Github). Supporting viewers can render the string as a table. This output is optimized for **small token** sizes, which is especially beneficial for LLM/RAG feeds. Pandas DataFrames (see method `to_pandas()` below) offer an equivalent markdown table output which however is better readable for the human eye.
+            * ``to_markdown()``: this method returns the table as a **string in markdown format** (compatible to Github). Supporting viewers can render the string as a table. This output is optimized for **small token** sizes, which is especially beneficial for LLM/RAG feeds. Pandas DataFrames (see method `to_pandas()` below) offer an equivalent markdown table output which however is better readable for the human eye.
             * `to_pandas()`: this method returns the table as a `pandas `_ `DataFrame `_. DataFrames are very versatile objects allowing a plethora of table manipulation methods and outputs to almost 20 well-known formats, among them Excel files, CSV, JSON, markdown-formatted tables and more. `DataFrame.to_markdown()` generates a Github-compatible markdown format optimized for human readability. This method however requires the package `tabulate <https://pypi.org/project/tabulate/>`_ to be installed in addition to pandas itself.
             * ``header``: a `TableHeader` object containing header information of the table.
             * `col_count`: an integer containing the number of table columns.
@@ -507,10 +507,22 @@ In a nutshell, this is what you can do with PyMuPDF:
 
       Please have a look at these `Jupyter notebooks `_, which cover standard situations like multiple tables on one page or joining table fragments across multiple pages.
 
-      .. caution:: The lifetime of the TableFinder object, as well as that of all its tables **equals the lifetime of the page**. If the page object is deleted or reassigned, all tables are no longer valid.
+      .. caution:: The lifetime of the `TableFinder` object, as well as that of all its tables **equals the lifetime of the page**. If the page object is deleted or reassigned, all tables are no longer valid.
 
       The only way to keep table content beyond the page's availability is to **extract it** via methods `Table.to_markdown()`, `Table.to_pandas()` or a copy of `Table.extract()` (e.g. `Table.extract()[:]`).
 
+      .. note::
+
+         Once a table has been extracted to a **Pandas DataFrame** with `to_pandas()` it is easy to convert to other file types with the **Pandas API**:
+
+         - table to Markdown, use: `to_markdown `_
+         - table to JSON, use: `to_json `_
+         - table to Excel, use: `to_excel `_
+         - table to CSV, use: `to_csv `_
+         - table to HTML, use: `to_html `_
+         - table to SQL, use: `to_sql `_
+
+
       |history_begin|
 
       * New in version 1.23.0
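
The conversions listed in the new note could be sketched roughly as follows. This is a minimal illustration, not part of the patch: the DataFrame is built by hand as a stand-in for the result of `Table.to_pandas()`, and the column names, values and the SQL table name `items` are all made up for the example. `DataFrame.to_markdown()` is left out because, as the docs above state, it needs the extra `tabulate` package.

```python
import json
import sqlite3

import pandas as pd

# Stand-in for `page.find_tables().tables[0].to_pandas()`.
# The columns and values are invented for illustration only.
df = pd.DataFrame({"Item": ["apple", "pear"], "Qty": [3, 5]})

csv_text = df.to_csv(index=False)         # CSV text, header row first
json_text = df.to_json(orient="records")  # JSON array with one object per row
html_text = df.to_html(index=False)       # HTML <table> markup

# Parse the JSON back to confirm it round-trips to plain Python objects.
records = json.loads(json_text)

# to_sql() accepts a sqlite3 connection; "items" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
df.to_sql("items", conn, index=False)
rows = conn.execute("SELECT Item, Qty FROM items").fetchall()
conn.close()
```

Because each call works on the DataFrame alone, the results stay usable after the page object is gone, which is the point made in the caution above.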