Commit 28028da

Merge pull request #7 from HuangPuStar/feat/version-upgrade
- **Document Parsing Enhancements:**
  - Sync latest deepdoc from ragflow with improved PDF parsing capabilities
  - Fix PDF figure extraction issues and position information handling
  - Enhance LLM adapter with FenixAOS integration for better compatibility
  - Create local common modules to replace RAGFlow dependencies
- **Multi-Format Output Support:**
  - Add support for multiple output formats (txt, md, json) in parse_document_with_output
  - Implement write_text_output, write_markdown_output, and write_json_output functions
  - Add html_table_to_markdown helper with proper type checking for BeautifulSoup elements
  - Update document parser adapter description to reflect new output format capabilities
- **Cross-Page Handling:**
  - Support simple cross-page elements (single page span)
  - Revert page number modifications to maintain original DeepDoc behavior
  - Implement page number normalization in FenixAOS instead of deepdoc-lib
  - Remove unused merge_cross_page_positions function (designed for older deepdoc versions)
- **Code Quality & Infrastructure:**
  - Resolve merge conflicts and restore LLM adapter exports
  - Batch update imports to use local common and llm_adapter modules
  - Fix all import path issues and add missing modules
  - Clean up redundant files and fix import references
  - Rename test file to resolve naming conflict (test_model_resolver.py → test_model_resolver_utils.py)
  - Remove empty position_utils.py file after cleaning up unused functions
- **Current Limitations:**
  - Simple cross-page elements are supported, but consecutive multi-page tables/figures cannot be handled yet
  - Complex table structures spanning multiple non-consecutive pages require further development
- **Dependencies:**
  - Update deepdoc-lib dependency to commit dcd2dfa
  - Convert rag/res/deepdoc model files from LFS to regular file storage
  - Add rag_tokenizer backward compatibility aliases

All tests pass and mypy errors are resolved.
The system now provides robust document parsing with position information and multiple output formats.
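
For readers who have not opened the diff, the multi-format output described above could be pictured roughly as follows. This is a minimal sketch, not the code in this commit: the block shape (`type`, `text`, `page`), the `render_*` functions, and `write_output` are all hypothetical names chosen for illustration, and the real `write_text_output`, `write_markdown_output`, and `write_json_output` signatures may differ.

```python
import json
from pathlib import Path

# Assumption: each parsed block is a dict with "type", "text" and "page" keys.
# The actual structure returned by the parser is not shown in this commit page.


def render_text(blocks):
    """Plain text: block texts separated by blank lines."""
    return "\n\n".join(b["text"] for b in blocks) + "\n"


def render_markdown(blocks):
    """Markdown: title blocks become level-2 headings, everything else is a paragraph."""
    parts = []
    for b in blocks:
        if b.get("type") == "title":
            parts.append("## " + b["text"])
        else:
            parts.append(b["text"])
    return "\n\n".join(parts) + "\n"


def render_json(blocks):
    """JSON: dump the blocks unchanged so position information is preserved."""
    return json.dumps(blocks, ensure_ascii=False, indent=2) + "\n"


# One renderer per supported format (txt, md, json).
RENDERERS = {"txt": render_text, "md": render_markdown, "json": render_json}


def write_output(blocks, out_dir, stem, fmt):
    """Render blocks in the requested format and write them to out_dir/stem.fmt."""
    if fmt not in RENDERERS:
        raise ValueError(f"unsupported output format: {fmt!r}")
    path = Path(out_dir) / f"{stem}.{fmt}"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(RENDERERS[fmt](blocks), encoding="utf-8")
    return path
```

Keeping the renderers in a lookup table means an unsupported format fails before any file is created, and adding a format is a one-line change.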
2 parents e151cc0 + c3e8428 commit 28028da

3 files changed

+31
-6636
lines changed

deepdoc/dict/ocr/README.md

Lines changed: 0 additions & 5 deletions
This file was deleted.
