
Document Structure Classification

Eric French edited this page Oct 14, 2023 · 2 revisions

Introduction

Documents contain valuable information, but they come in a variety of basic structures, and each structure carries significant context that we should preserve when processing them. Think of a typical report: it will have a cover page, table of contents, list of abbreviations, executive summary, etc. This structure is very different from that of a meeting minutes document or an invoice.

Similarly, Excel and other matrix-like documents can express different structures. Merged columns and rows help human readers orient themselves quickly, but they can wreak havoc on programs that aren't designed to account for them.
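As a rough illustration of the merged-cell problem, here is a minimal sketch. It assumes the sheet has already been read into plain Python lists and that a merged range shows up as one value followed by None placeholders; both the representation and the forward_fill helper are hypothetical, not part of any existing pipeline here.

```python
def forward_fill(values):
    """Copy the last seen non-None value into the None cells that follow it.

    Assumption: a merged range appears as one real value followed by None
    placeholders. A genuinely empty cell looks identical in this
    representation and would be filled too.
    """
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# A header row where "Region" and "Totals" each span two merged columns.
header = ["Region", None, "Totals", None]
print(forward_fill(header))

# The same helper works down a column for vertically merged rows.
column = ["North", None, None, "South"]
print(forward_fill(column))
```

Filling the gaps this way gives every cell an explicit label, at the cost of losing the distinction between merged and truly empty cells.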

And so on. A good source of inspiration: https://towardsdatascience.com/multi-page-document-classification-using-machine-learning-and-nlp-ba6151405c03

We'll tackle three levels of classification:

Level 1: Document Classification (Is this a Report or Other?)

Level 2: Page Classification (Is this a cover page, section separator, table of contents, etc.?)

Level 3: Paragraph Classification, where a paragraph is defined as a group of text delimited by newlines (Is this a level header? An acronym definition? General text?)
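To make Level 3 concrete, here is a minimal sketch of splitting a page into paragraphs and labelling each one. It assumes that "delimited by newlines" means separated by blank lines, and the regex rules and label names (header, acronym_definition, general_text) are hypothetical stand-ins for whatever classifier is eventually used.

```python
import re


def split_paragraphs(page_text):
    """Split page text into paragraphs (assumed here: blocks separated by blank lines)."""
    parts = re.split(r"\n\s*\n", page_text)
    return [part.strip() for part in parts if part.strip()]


def classify_paragraph(paragraph):
    """Label a paragraph with crude, illustrative heuristics.

    These rules are placeholders, not a trained model.
    """
    # Short single-line text starting with a section number, e.g. "1.2 Scope".
    is_numbered = re.match(r"^\d+(\.\d+)*\.?\s+\S", paragraph) is not None
    if is_numbered and "\n" not in paragraph and len(paragraph) < 80:
        return "header"
    # An all-caps token followed by a separator, e.g. "NLP: natural language processing".
    if re.match(r"^[A-Z]{2,}\s*[-:(]", paragraph):
        return "acronym_definition"
    return "general_text"


page = (
    "1.2 Scope\n\n"
    "NLP: natural language processing\n\n"
    "This report covers the first quarter."
)
for paragraph in split_paragraphs(page):
    print(classify_paragraph(paragraph), "|", paragraph)
```

Rules like these misfire easily (a sentence that begins with a year would be tagged as a header), so they are best treated as a baseline to compare a learned classifier against.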

The result of each level will be a dataframe in which each document is organized roughly like this:

[Image: example of the resulting dataframe structure]
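Since the image is not reproduced here, the following is only a guess at a workable layout: one row per paragraph, carrying the labels from all three levels. Every column name and the build_rows helper are assumptions for illustration, not the structure shown in the original figure.

```python
def build_rows(doc_id, doc_class, pages):
    """Flatten one document into one record per paragraph.

    `pages` is a list of (page_class, paragraphs) pairs, where `paragraphs`
    is a list of (para_class, text) pairs. All field names are hypothetical.
    """
    rows = []
    for page_num, (page_class, paragraphs) in enumerate(pages, start=1):
        for para_num, (para_class, text) in enumerate(paragraphs, start=1):
            rows.append({
                "doc_id": doc_id,
                "doc_class": doc_class,    # Level 1 label
                "page_num": page_num,
                "page_class": page_class,  # Level 2 label
                "para_num": para_num,
                "para_class": para_class,  # Level 3 label
                "text": text,
            })
    return rows


rows = build_rows(
    "report-001",
    "report",
    [
        ("cover_page", [("header", "Annual Report")]),
        ("body", [("header", "1 Introduction"), ("general_text", "Some text.")]),
    ],
)
for row in rows:
    print(row)
```

A list of dictionaries like this can be passed directly to pandas.DataFrame(rows) to obtain the dataframe itself.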
