Commit 28028da
Merge pull request #7 from HuangPuStar/feat/version-upgrade
- **Document Parsing Enhancements:**
  - Sync latest deepdoc from ragflow with improved PDF parsing capabilities
  - Fix PDF figure extraction issues and position information handling
  - Enhance LLM adapter with FenixAOS integration for better compatibility
  - Create local common modules to replace RAGFlow dependencies
- **Multi-Format Output Support:**
  - Add support for multiple output formats (txt, md, json) in parse_document_with_output
  - Implement write_text_output, write_markdown_output, and write_json_output functions
  - Add html_table_to_markdown helper with proper type checking for BeautifulSoup elements
  - Update document parser adapter description to reflect new output format capabilities
- **Cross-Page Handling:**
  - Support simple cross-page elements (single page span)
  - Revert page number modifications to maintain original DeepDoc behavior
  - Implement page number normalization in FenixAOS instead of deepdoc-lib
  - Remove unused merge_cross_page_positions function (designed for older deepdoc versions)
- **Code Quality & Infrastructure:**
  - Resolve merge conflicts and restore LLM adapter exports
  - Batch update imports to use local common and llm_adapter modules
  - Fix all import path issues and add missing modules
  - Clean up redundant files and fix import references
  - Rename test file to resolve naming conflict (test_model_resolver.py → test_model_resolver_utils.py)
  - Remove empty position_utils.py file after cleaning up unused functions
- **Current Limitations:**
  - Simple cross-page elements are supported, but consecutive multi-page tables/figures cannot be handled yet
  - Complex table structures spanning multiple non-consecutive pages require further development
- **Dependencies:**
  - Update deepdoc-lib dependency to commit dcd2dfa
  - Convert rag/res/deepdoc model files from LFS to regular file storage
  - Add rag_tokenizer backward compatibility aliases
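
For readers unfamiliar with the multi-format output listed above, here is a minimal sketch of how a format-keyed writer dispatch could look. Only the names write_text_output, write_markdown_output, and write_json_output come from this commit; their signatures, the chunk shape (a dict with "text" and optional "positions"), and the write_output dispatcher are assumptions made for illustration and may not match the actual code.

```python
import json
from pathlib import Path
from typing import Any, Callable

# Assumed chunk shape: a dict with "text" and an optional "positions" list.
# The real adapter's data model may differ.
Chunk = dict[str, Any]


def write_text_output(chunks: list[Chunk], path: Path) -> None:
    """Write chunk texts separated by blank lines."""
    text = "\n\n".join(c["text"] for c in chunks)
    path.write_text(text + "\n", encoding="utf-8")


def write_markdown_output(chunks: list[Chunk], path: Path) -> None:
    """Write a title heading followed by one paragraph per chunk."""
    body = "\n\n".join(c["text"].strip() for c in chunks)
    path.write_text(f"# {path.stem}\n\n{body}\n", encoding="utf-8")


def write_json_output(chunks: list[Chunk], path: Path) -> None:
    """Serialize the chunks, including any position information, as JSON."""
    payload = json.dumps(chunks, ensure_ascii=False, indent=2)
    path.write_text(payload, encoding="utf-8")


# Map each supported format to its writer (hypothetical registry).
WRITERS: dict[str, Callable[[list[Chunk], Path], None]] = {
    "txt": write_text_output,
    "md": write_markdown_output,
    "json": write_json_output,
}


def write_output(chunks: list[Chunk], path: Path, output_format: str) -> Path:
    """Hypothetical dispatcher: pick a writer by format, return the written path."""
    try:
        writer = WRITERS[output_format]
    except KeyError:
        raise ValueError(f"unsupported output format: {output_format!r}") from None
    target = path.with_suffix(f".{output_format}")
    writer(chunks, target)
    return target
```

Keeping the writers in a dictionary keyed by format means an unsupported format fails early with a clear error, and a new format only needs one more entry.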
All tests pass and mypy errors are resolved. The system now provides robust document parsing with position information and multiple output formats.

3 files changed (+31, -6636 lines)