Releases · opendatalab/MinerU

15 Nov 11:27

myhloli

magic_pdf-0.9.3-released

845a3ff

magic_pdf-0.9.3-released Latest

What's Changed

feat(model): add xycut algorithm for block sorting by @myhloli in #898
refactor(pdf_parse): adjust line count threshold for layoutreader by @myhloli in #902
Feat/add en docs by @icecraft in #906
feat: using next_docs by @icecraft in #907
feat(table): integrate RapidTable model for table recognition by @myhloli in #910
fix(gradio-app): add missing file type in upload by @myhloli in #911
refactor(magic_pdf_parse_main): optimize model data handling and JSON output by @myhloli in #912
Modify the test directory by @DTwz in #913
test(table): improve ppTableModel test coverage by @myhloli in #914
feat(table): add RapidOCR support for RapidTable model by @myhloli in #915
Add DocLayout-YOLO hyperlink by @qiangqiang199 in #889
fix: remove classes hierarchy diagram by @icecraft in #919
refactor(model download script) by @myhloli in #922
docs(readme): update table recognition configuration and documentation by @myhloli in #924
docs(README_ja-JP.md): update warning message and remove outdated content by @myhloli in #925
Update para_split_v3.py by @hyastar in #923
Style/docs by @icecraft in #927
docs: rewrite zh_cn docs without translate by @icecraft in #928
fix: typo by @icecraft in #931
fix: correct the download_models.py script path in the Dockerfile by @kimi360 in #938
build(Dockerfile): update model download script and dependencies by @myhloli in #941
fix(ocr_mkcontent): improve handling of single-character content #937 by @myhloli in #943
feat: tune docs by @icecraft in #948
fix(parse_pipeline): Resolve post-processing exceptions caused by partial PDFs due to file corruption or non-standard format by forcing a re-print. by @myhloli in #957
refactor(model): rename and restructure model modules by @myhloli in #964
docs: update docs for 0.9.3 by @myhloli in #965
docs(README): update project references and translations by @myhloli in #967
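
For readers unfamiliar with the XY-cut idea named in #898, here is a minimal sketch of a recursive XY-cut used to put layout blocks into reading order. It is not MinerU's implementation: the function names (xy_cut, _split_on_gaps) and the box format (x0, y0, x1, y1) are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) with the origin at the top-left.
Box = Tuple[float, float, float, float]


def _split_on_gaps(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group boxes into runs separated by empty gaps along one axis.

    axis=0 projects onto x (vertical cuts), axis=1 onto y (horizontal cuts).
    """
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # furthest extent covered so far
    for box in ordered[1:]:
        if box[axis] >= reach:
            groups.append([box])  # gap in the projection: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box], axis: int = 1) -> List[Box]:
    """Return the boxes in reading order using a recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    groups = _split_on_gaps(boxes, axis)
    if len(groups) == 1:
        # No cut is possible along this axis, so try the other one.
        axis = 1 - axis
        groups = _split_on_gaps(boxes, axis)
        if len(groups) == 1:
            # No cut on either axis: fall back to top-left ordering.
            return sorted(boxes, key=lambda b: (b[1], b[0]))
    ordered: List[Box] = []
    for group in groups:
        ordered.extend(xy_cut(group, 1 - axis))
    return ordered
```

With a full-width title above two columns, the sketch reads the title first, then the left column top to bottom, then the right column. Real layouts often need tolerance for small overlaps, which this sketch does not handle.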

New Contributors

@DTwz made their first contribution in #913
@qiangqiang199 made their first contribution in #889
@hyastar made their first contribution in #923
@kimi360 made their first contribution in #938

Full Changelog: magic_pdf-0.9.2-released...magic_pdf-0.9.3-released

Contributors

kimi360, myhloli, and 4 other contributors

06 Nov 10:18

myhloli

magic_pdf-0.9.2-released

b25ff7a

magic_pdf-0.9.2-released

What's Changed

fix: add ci repository by @dt-yy in #869
fix(table_model_init): remove unused code by @myhloli in #882
docs(README): update version number and improve documentation formatting by @myhloli in #884

Full Changelog: magic_pdf-0.9.1-released...magic_pdf-0.9.2-released

Contributors

myhloli and dt-yy

06 Nov 04:07

myhloli

magic_pdf-0.9.1-released

069bcfe

magic_pdf-0.9.1-released

What's Changed

Feat/tune docs by @icecraft in #833
fix(ocr_mkcontent): improve content handling for different languages and equation types by @myhloli in #839
feat(list): improve list detection algorithm & fix(list): improve list identification accuracy by @myhloli in #843
docs(tutorial): update magic-pdf command with output directory by @myhloli in #844
feat(para_split_v3): improve list identification with block aspect ratio by @myhloli in #845
fix(dict2md): improve text concatenation logic by @myhloli in #847
Update pdf_extract_kit.py by @CiaranYoung in #853
feat(table): upgrade StructEqTable model and integrate into PDF Extract Kit by @myhloli in #854
feat(model): add HTML minification to StructTableModel by @myhloli in #855
chore: add .gitattributes to configure file linguist attributes by @myhloli in #856
fix(merge_text): add ligature replacement functionality #305 #241 by @myhloli in #857
chore: add CSS and SCSS files to linguist-vendored (update .gitattributes to mark CSS and SCSS files as vendored) by @myhloli in #858
docs(README): update Colab demo link by @myhloli in #860
fix(table): improve table image processing by @myhloli in #866
docs(faq): add troubleshooting for illegal instruction error on Linux servers by @myhloli in #867
feat: replace the mineru_demo API documentation with a link by @LollipopsAndWine in #871
test(table): improve HTML validation for table extraction by @myhloli in #874
docs: update arXiv paper link in README files by @myhloli in #875
docs(README): update changelog for v0.9.1 release by @myhloli in #877
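
As an illustration of what a ligature replacement such as the one named in #857 might involve, here is a minimal sketch. It assumes the goal is to expand single-codepoint Latin ligatures (common in text extracted from PDFs) into plain letters. The mapping and the function name replace_ligatures are hypothetical and are not taken from the MinerU source.

```python
# Hypothetical mapping covering the Latin ligatures in the Unicode
# Alphabetic Presentation Forms block (U+FB00 to U+FB06).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature, written here as "st"
    "\ufb06": "st",
}

# str.translate needs a table keyed by code point; maketrans builds it.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Expand single-codepoint ligatures into their component letters."""
    return text.translate(_LIGATURE_TABLE)
```

An explicit table only touches the listed characters. unicodedata.normalize("NFKC", text) would also expand them, but it rewrites many other compatibility characters as well.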

New Contributors

@CiaranYoung made their first contribution in #853

Full Changelog: magic_pdf-0.9.0-released...magic_pdf-0.9.1-released

Contributors

myhloli, icecraft, and 2 other contributors

01 Nov 11:04

myhloli

magic_pdf-0.9.0-released

3a42ebb

magic_pdf-0.9.0-released

What's Changed

Update README_zh-CN.md (#404) by @drunkpig in #409
feat: add dockerfile by @Lincyaw in #189
fix(ocr_mkcontent): improve language detection and content formatting by @myhloli in #458
fix(self_modify): merge detection boxes for optimized text region detection by @myhloli in #448
fix(pdf-extract): adjust box threshold for OCR detection to fix lines being lost in OCR mode by @myhloli in #447
feat: rename the file generated by command line tools by @icecraft in #401
fix(ocr_mkcontent): revise table caption output by @myhloli in #397
build(docker): update docker build step by @myhloli in #471
upload an introduction about chemical formula and update readme.md by @GDDGCZ518 in #489
fix: remove the default value of output option in tools/cli.py and to… by @icecraft in #494
feat: add test case by @dt-yy in #499
fixes #492 decrease span threshold for block filling by @myhloli in #500
fix(detect_all_bboxes): remove small overlapping blocks by merging by @myhloli in #501
feat(cli&analyze&pipeline): add start_page and end_page args for pagination by @myhloli in #507
Feat/support rag by @icecraft in #510
feat(gradio): add app by gradio by @myhloli in #512
fix: replace \u0002, \u0003 in common text by @drunkpig in #521
fix(end_page_id): fix the issue where end_page_id is corrected to len-1 when its input is 0 by @myhloli in #518
fix(para): When an English line ends with a hyphen, do not add a space at the end. by @drunkpig in #523
Release: Release 0.7.1 version, update dev by @dt-yy in #527
Hotfix readme 0.7.1 by @Focusshang in #529
fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 by @papayalove in #542
fix: typo error in markdown by @icecraft in #536
fix(gradio): remove unused imports and simplify pdf display by @myhloli in #534
Feat/support footnote in figure by @icecraft in #532
refactor(pdf_extract_kit): implement singleton pattern for atomic models by @myhloli in #533
feat: mineru_web by @LollipopsAndWine in #555
features@add mineru gpu&web_api by @yanqiangmiffy in #568
docs(models_download): update model download instructions to use python script by @myhloli in #560
fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 by @papayalove in #574
feat(ocr): supports minority languages by @myhloli in #577
refactor(pdf_extract_kit): update model config and weight paths for UniMERNet-0.2.0 by @myhloli in #584
feat(gradio_app): add web app with PDF processing as a project by @myhloli in #579
fix: web_api by @LollipopsAndWine in #580
Release 0.8.0 by @drunkpig in #587
fix: 1. resolve incorrect pair relation of figure and footnote, 2. re… by @icecraft in #603
fix: recover the lang option in tools/cli.py by @icecraft in #604
fix: solve conflicts by @myhloli in #607
fix: remove useless files by @myhloli in #608
feat(gradio_app): add examples accordion to the PDF conversion interface by @myhloli in #597
feat(pipeline): pass language parameter for parsing and markdown conversion by @myhloli in #602
feat(ocr_mkcontent): support drop reason in none_with_reason mode by @myhloli in #630
feat(UNIPipe): change default drop_mode to NONE_WITH_REASON by @myhloli in #631
refactor(pdf_extract): use Image.crop directly with layout detection by @myhloli in #635
fix(pdf-extract): ensure model is set to evaluation mode before processing by @myhloli in #636
fix(pdf_extract_kit):change unimernet base -> small by @myhloli in #639
feat: add test case by @dt-yy in #645
feat: integrate the front-end interface and configure one-click startup by @LollipopsAndWine in #668
feat: remove unused files and update the front-end style by @LollipopsAndWine in #669
docs: update project lists in README files to include web_api by @myhloli in #670
feat: add layoutreader to sort blocks by @myhloli in #672
refactor(model): improve timing information and performance by @myhloli in #690
feat: add arXiv paper link to header and adjust PDF parsing logic by @myhloli in #693
perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity by @myhloli in #694
fix: caption or footnote match algorithm by @icecraft in #695
fix: caption|footnote match algorithm by @icecraft in #696
feat(layoutreader): support local model directory and improve model loading by @myhloli in #698
feat(docs): automate model download and configuration by @myhloli in #699
docs: add filename to wget command in model download scripts by @myhloli in #700
docs: update CUDA acceleration guides and README content by @myhloli in #701
Update README_Windows_CUDA_Acceleration_en_US.md by @myhloli in #706
feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support by @myhloli in #716
Update how_to_download_models_zh_cn.md by @myhloli in #717
fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks by @myhloli in #718
feat: manage docs with Sphinx by @icecraft in #737
feat(list&index block): detect and merge list and index blocks by @myhloli in #740
refactor(para_split_v3): merge list and index block detection by @myhloli in #743
fix(para_split_v3): refine list block detection in paragraph splitting by @myhloli in #744
update example files by @myhloli in #747
refactor(ocr): increase the dilation factor in OCR to address the issue of word concatenation by @myhloli in #753
refactor(para): improve paragraph splitting algorithm by @myhloli in #765
docs: update the driver requirements on the Ubuntu system by @myhloli in #766
update: update config json by @myhloli in #769
feat(model): add support for DocLayout-YOLO model by @myhloli in #773
build(setup): add doclayout_yolo dependency by @myhloli in #774
build(docker): add doclayout-yolo dependency by @myhloli in #776
feat: add support for non-PDF file conversion to PDF by @myhloli in #777
Feat/data api by @icecraft in #782
Feat/new table caption match by @icecraft in #784
refactor(parse_core): improve image and table block handling by @myhloli in #785
refactor(ocr): adjust OCR processing parameters by @myhloli in #786
fix: add init to magic_pdf.config by @myhloli in #788
fix: add init to magic_pdf.utils by @myhloli in #789
feat(draw_bbox): update bounding box drawing for tables and images by @myhloli in #791
Add multi_gpu process project by @randydl in #79...
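
The hyphen rule described in #523 (no space is added after an English line that ends with a hyphen) can be sketched as follows. This is a simplified illustration, not the project's paragraph code. The function name join_lines is hypothetical, and keeping the hyphen itself is an assumption, since the entry only says that no space is added.

```python
from typing import Iterable


def join_lines(lines: Iterable[str]) -> str:
    """Join text lines into a single paragraph string.

    A space separates consecutive lines, except when the previous line
    ends with a hyphen: the next line is then appended directly.
    The hyphen is kept (an assumption; the changelog does not say).
    """
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore empty lines
        if parts and not parts[-1].endswith("-"):
            parts.append(" ")
        parts.append(line)
    return "".join(parts)
```

For example, the lines "state-of-the-" and "art results" join to "state-of-the-art results" rather than "state-of-the- art results".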

Contributors

myhloli, icecraft, and 9 other contributors

09 Oct 08:58

myhloli

magic_pdf-0.8.1-update-docs

62aa1cb

magic_pdf-0.8.1-update-docs

What's Changed

refactor(docs): update model download instructions and configuration process by @myhloli in #707

Full Changelog: magic_pdf-0.8.1-released...magic_pdf-0.8.1-update-docs

Contributors

myhloli

12 Sep 14:00

myhloli

magic_pdf-0.8.1-released

c95f381

magic_pdf-0.8.1-released

What's Changed

fix:

resolve incorrect pair relation of figure and footnote
resolve incorrect pair relation of table and caption #590 by @icecraft in #599

Full Changelog: magic_pdf-0.8.0-released...magic_pdf-0.8.1-released

Contributors

icecraft

10 Sep 12:20

myhloli

magic_pdf-0.8.0-released

9f352df

magic_pdf-0.8.0-released

What's Changed

feat:

Add a RAG API
Integration of RAG into the llama_index project
Update the Dockerfile
Fine-grained model singletons, reducing memory usage and speeding up initialization
The CLI and API now accept parsing-range parameters, allowing custom start and end pages
Support image footnotes

bugfix:

Retain the boundary information of the smaller overlapping block when it is removed
Lower the span threshold for block filling from 0.6 to 0.3
Fix the loss of low-score lines in the OCR DET stage
Merge multiple spans of a single line in the OCR DET stage
Improve the segmentation logic for run-together English words
Fix inaccurate layout boxes
Fix the merging of words broken by line breaks
Fix certain special characters appearing in the final output
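
To illustrate the start and end page parameters mentioned above, here is a minimal sketch of how a page range might be normalized. It assumes zero-based, inclusive page indices and uses None to mean "through the last page". Both conventions, and the name resolve_page_range, are hypothetical rather than taken from the magic-pdf API. The sketch also shows a pitfall of the kind the 0.9.0 notes describe in #518, where an end page of 0 was corrected to len-1.

```python
from typing import Optional, Tuple


def resolve_page_range(
    start_page: int, end_page: Optional[int], page_count: int
) -> Tuple[int, int]:
    """Clamp a requested page range to the pages that actually exist.

    Indices are zero-based and inclusive (an assumption of this sketch).
    """
    if page_count <= 0:
        raise ValueError("document has no pages")
    last = page_count - 1
    # Writing `end_page or last` here would be a bug: 0 is falsy, so a
    # request for page 0 only would silently become "all pages".
    end = last if end_page is None or end_page < 0 else min(end_page, last)
    start = max(start_page, 0)
    if start > end:
        raise ValueError("start_page is after end_page")
    return start, end
```

Calling resolve_page_range(0, 0, 10) keeps the single-page request as (0, 0), while resolve_page_range(2, None, 10) extends to the last page.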

Full Changelog: magic_pdf-0.7.1-released...magic_pdf-0.8.0-released

02 Sep 12:34

myhloli

magic_pdf-0.7.1-released

1dc915a

magic_pdf-0.7.1-released

What's Changed

feat: add tablemaster_paddle by @papayalove in #463
(para_split_v2): index out of range issue of span_text first char by @papayalove in #396

Full Changelog: magic_pdf-0.7.0b1-released...magic_pdf-0.7.1-released

Contributors

papayalove

09 Aug 13:37

myhloli

magic_pdf-0.7.0b1-released

fa3475a

magic_pdf-0.7.0b1-released

What's Changed

feat: add table recognition success detect by @papayalove in #354
fix: #366 by @icecraft in #371
fix&refactor(pdf-extract-kit): table recognition and ocr by @myhloli in #374
fix(doc-analyze): adjust image scaling limit to 9000 pixels by @myhloli in #379
feat(draw_bbox): add model bbox drawing functionality by @myhloli in #386

New Contributors

@zuanzuanshao made their first contribution in #355

Full Changelog: magic_pdf-0.7.0a1-released...magic_pdf-0.7.0b1-released

Contributors

myhloli, icecraft, and 2 other contributors

31 Jul 09:56

myhloli

magic_pdf-0.6.2b1-released

3aec9c6

magic_pdf-0.6.2b1-released

What's Changed

Optimized model loading logic, now requiring only a single load during batch processing.
Command-line interface now supports batch input.
When import fails, prints complete error messages to facilitate troubleshooting.
Fixed a bug where overlapping spans were incorrectly removed multiple times.
Improved OCR recognition areas, doubling the OCR speed.
Embedded language identification models within the whl package for easier offline deployment.
Replaced interline_equation_blocks with interline_equations to enhance interline formula recognition capabilities in non-academic paper scenarios.
Added page number indexing to the output results of content_list.
Locked some dependency versions and adjusted the dependency installation logic to reduce conflicts and redundant installations, cutting down the number of packages by 30% and improving the initial installation success rate.
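
The "single load during batch processing" item above can be pictured with a cached loader. This is a generic sketch, not the project's code: DummyModel, get_model, process_batch, and the LOADS list are all hypothetical stand-ins, and real model loading would replace the dummy constructor.

```python
from functools import lru_cache
from typing import List

LOADS: List[str] = []  # records every real load, for demonstration only


class DummyModel:
    """Stand-in for an expensive model (hypothetical)."""

    def __init__(self, name: str) -> None:
        LOADS.append(name)  # pretend this is the slow weight-loading step
        self.name = name

    def predict(self, path: str) -> str:
        return f"{self.name}:{path}"


@lru_cache(maxsize=None)
def get_model(name: str) -> DummyModel:
    """Return a cached model so repeated calls reuse one instance."""
    return DummyModel(name)


def process_batch(paths: List[str]) -> List[str]:
    """Process many files while loading the model only once."""
    return [get_model("layout").predict(path) for path in paths]
```

After process_batch runs over several files, LOADS holds one entry, because only the first get_model("layout") call constructs the model.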

New Contributors

@yzztin made their first contribution in #214
@eltociear made their first contribution in #231

Full Changelog: magic_pdf-0.6.1-released...magic_pdf-0.6.2b1-released

Contributors

eltociear and yzztin
