
Commit 7543331

Feat: implements better model files configuration (#10)

* feat: switch model artifacts to model store resolver
* refactor: add model config
* fix: exceptions on close
* refactor: publish to pypi
* feat: add model download script
* refactor: use default config for parsers
* feat: implement download and configure tiktoken cache

1 parent: 356098f


41 files changed (+1520, −7069 lines)

.github/workflows/publish.yml

Lines changed: 66 additions & 86 deletions
@@ -1,96 +1,76 @@
-name: Publish Python 🐍 distribution 📦 to GitHub Pages
+name: Publish Python 🐍 distribution 📦 to PyPI and TestPyPI
 
-on:
-  push:
-    branches:
-      - master
-  workflow_dispatch:
-
-permissions:
-  contents: write
-  pages: write
-  id-token: write
-
-concurrency:
-  group: "pages"
-  cancel-in-progress: false
+on: push
 
 jobs:
-  build:
-    name: Build distribution 📦
-    runs-on: ubuntu-latest
-
-    steps:
-    - uses: actions/checkout@v4
-      with:
-        persist-credentials: false
-        fetch-depth: 0
-
-    - name: Set up Python
-      uses: actions/setup-python@v5
-      with:
-        python-version: "3.x"
-
-    - name: Install pypa/build
-      run: |
-        python3 -m pip install build --user
-
-    - name: Build a binary wheel and a source tarball
-      run: python3 -m build
-
-    - name: Store the distribution packages
-      uses: actions/upload-artifact@v4
-      with:
-        name: python-package-distributions
-        path: dist/
-
-  publish-to-github-pages:
-    name: Publish 📦 to GitHub Pages
-    needs: build
-    runs-on: ubuntu-latest
-    environment:
-      name: github-pages
-      url: ${{ steps.deployment.outputs.page_url }}
+  build:
+    name: Build distribution 📦
+    runs-on: ubuntu-latest
 
-    steps:
-    - name: Checkout
-      uses: actions/checkout@v4
-      with:
-        fetch-depth: 0
+    steps:
+    - uses: actions/checkout@v4
+      with:
+        persist-credentials: false
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: "3.x"
+    - name: Install pypa/build
+      run: >-
+        python3 -m
+        pip install
+        build
+        --user
+    - name: Build a binary wheel and a source tarball
+      run: python3 -m build
+    - name: Store the distribution packages
+      uses: actions/upload-artifact@v4
+      with:
+        name: python-package-distributions
+        path: dist/
 
-    - name: Download all the dists
-      uses: actions/download-artifact@v4
-      with:
-        name: python-package-distributions
-        path: dist/
+  publish-to-pypi:
+    name: >-
+      Publish Python 🐍 distribution 📦 to PyPI
+    if: startsWith(github.ref, 'refs/tags/') # only publish to PyPI on tag pushes
+    needs:
+    - build
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/p/deepdoc-lib
+    permissions:
+      id-token: write # IMPORTANT: mandatory for trusted publishing
 
-    - name: Install dumb-pypi
-      run: pip install dumb-pypi
+    steps:
+    - name: Download all the dists
+      uses: actions/download-artifact@v4
+      with:
+        name: python-package-distributions
+        path: dist/
+    - name: Publish distribution 📦 to PyPI
+      uses: pypa/gh-action-pypi-publish@release/v1
 
-    - name: Create package index
-      run: |
-        # Put wheels in packages directory
-        mkdir -p index/packages
-        cp dist/*.whl index/packages/
-
-        # Create package list
-        ls index/packages/*.whl | xargs -n 1 basename > package_list.txt
-
-        # Generate index pointing to ../packages/
-        dumb-pypi \
-          --package-list package_list.txt \
-          --packages-url ../../packages/ \
-          --output-dir index \
-          --title "Deepdoc PyPI"
+  publish-to-testpypi:
+    name: Publish Python 🐍 distribution 📦 to TestPyPI
+    needs:
+    - build
+    runs-on: ubuntu-latest
 
-    - name: Setup Pages
-      uses: actions/configure-pages@v5
+    environment:
+      name: testpypi
+      url: https://test.pypi.org/p/deepdoc-lib
 
-    - name: Upload artifact
-      uses: actions/upload-pages-artifact@v3
-      with:
-        path: 'index'
+    permissions:
+      id-token: write # IMPORTANT: mandatory for trusted publishing
 
-    - name: Deploy to GitHub Pages
-      id: deployment
-      uses: actions/deploy-pages@v4
+    steps:
+    - name: Download all the dists
+      uses: actions/download-artifact@v4
+      with:
+        name: python-package-distributions
+        path: dist/
+    - name: Publish distribution 📦 to TestPyPI
+      uses: pypa/gh-action-pypi-publish@release/v1
+      with:
+        repository-url: https://test.pypi.org/legacy/
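The `if: startsWith(github.ref, 'refs/tags/')` condition is what lets TestPyPI receive every push while PyPI only receives tagged releases. A minimal Python sketch of that predicate (the function name is mine, not part of the workflow):

```python
def should_publish_to_pypi(github_ref: str) -> bool:
    """Mirror of the workflow gate `if: startsWith(github.ref, 'refs/tags/')`.

    GitHub Actions sets `github.ref` to `refs/tags/<tag>` for tag pushes and
    `refs/heads/<branch>` for branch pushes, so only tag builds pass the gate.
    """
    return github_ref.startswith("refs/tags/")
```

The `publish-to-testpypi` job carries no such condition, so it runs on every push that builds successfully.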

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -219,3 +219,6 @@ __marimo__/
 .streamlit/secrets.toml
 
 .DS_Store
+
+# Tokenizer trie cache
+deepdoc/dict/*.trie
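The new ignore rule targets trie caches under `deepdoc/dict/` only. A rough stdlib approximation of the pattern (gitignore and `fnmatch` globbing differ in edge cases, and `example.trie` is a hypothetical file name):

```python
from fnmatch import fnmatch

PATTERN = "deepdoc/dict/*.trie"

# A generated tokenizer cache under deepdoc/dict/ matches the pattern...
assert fnmatch("deepdoc/dict/example.trie", PATTERN)
# ...while a trie file in another directory does not.
assert not fnmatch("deepdoc/parser/example.trie", PATTERN)
```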

MANIFEST.in

Lines changed: 3 additions & 4 deletions
@@ -11,10 +11,9 @@ recursive-include deepdoc *.json
 recursive-include deepdoc *.yaml
 recursive-include deepdoc *.yml
 recursive-include deepdoc *.csv
-recursive-include deepdoc *.onnx
-recursive-include deepdoc *.model
-recursive-include deepdoc *.res
-recursive-include deepdoc *.trie
+
+# Exclude heavyweight model artifacts (resolved at runtime)
+prune deepdoc/rag/res
 
 # Exclude unwanted files
 global-exclude *.pyc
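The four dropped extensions amount to a simple suffix check. An illustrative helper (not part of the package, and the example paths are hypothetical) showing which files the revised MANIFEST keeps out of source distributions:

```python
from pathlib import PurePosixPath

# Artifact types no longer shipped in the sdist; resolved at runtime instead
MODEL_ARTIFACT_SUFFIXES = {".onnx", ".model", ".res", ".trie"}

def is_model_artifact(path: str) -> bool:
    """True if the path looks like a heavyweight model artifact."""
    return PurePosixPath(path).suffix in MODEL_ARTIFACT_SUFFIXES
```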

README.md

Lines changed: 104 additions & 14 deletions
@@ -5,13 +5,13 @@
 CPU-only (default):
 
 ```bash
-pip install git+https://github.com/xorbitsai/deepdoc-lib
+pip install deepdoc-lib
 ```
 
 GPU (Linux x86_64 only):
 
 ```bash
-pip install "deepdoc-lib[gpu] @ git+https://github.com/xorbitsai/deepdoc-lib"
+pip install deepdoc-lib[gpu]
 ```
 
 Note: `onnxruntime` (CPU) and `onnxruntime-gpu` should not be installed together. If you're switching an existing environment to GPU, uninstall CPU ORT first:
@@ -24,22 +24,113 @@ pip install onnxruntime-gpu==1.19.2
 ### Parser Usage
 
 ```python
-from deepdoc import PdfParser, DocxParser, ExcelParser
-
-# 解析 PDF
-pdf_parser = PdfParser()
+from deepdoc import (
+    DocxParser,
+    ExcelParser,
+    HtmlParser,
+    PdfModelConfig,
+    PdfParser,
+    TokenizerConfig,
+)
+
+# Build configs
+# Method 1: Explicit configuration (offline mode)
+tokenizer_cfg = TokenizerConfig(
+    offline=True,
+    nltk_data_dir="/path/to/nltk_data",
+)
+pdf_model_cfg = PdfModelConfig(
+    vision_model_dir="/path/to/models/vision",
+    xgb_model_dir="/path/to/models/xgb",
+    model_provider="local",
+)
+
+# Method 2: Empty configuration (auto-download models and nltk_data)
+# tokenizer_cfg = TokenizerConfig()
+# pdf_model_cfg = PdfModelConfig()
+
+
+# Parse PDF
+pdf_parser = PdfParser(model_cfg=pdf_model_cfg, tokenizer_cfg=tokenizer_cfg)
 result = pdf_parser("document.pdf")
 
-# 解析 Word
-docx_parser = DocxParser()
-result = docx_parser("document.docx")
+# Parse DOCX / HTML (tokenizer only)
+docx_parser = DocxParser(tokenizer_cfg=tokenizer_cfg)
+html_parser = HtmlParser(tokenizer_cfg=tokenizer_cfg)
 
-# 解析 Excel
+# Parse Excel (no model/tokenizer dependency)
 excel_parser = ExcelParser()
 with open("data.xlsx", "rb") as f:
     result = excel_parser(f.read())
 ```
 
+Or use explicit env factories:
+
+```python
+tokenizer_cfg = TokenizerConfig.from_env()
+pdf_model_cfg = PdfModelConfig.from_env()
+pdf_parser = PdfParser(model_cfg=pdf_model_cfg, tokenizer_cfg=tokenizer_cfg)
+```
+
+Or rely on defaults (env + cache). Deepdoc will look for cached bundles under
+`$DEEPDOC_MODEL_HOME` (or `~/.cache/deepdoc`) and only download missing files
+when the provider allows remote access:
+
+```python
+pdf_parser = PdfParser()
+```
+
+env definitions:
+
+```bash
+# provider: auto | local | modelscope
+export DEEPDOC_MODEL_PROVIDER=auto
+
+# shared model cache root (default: ~/.cache/deepdoc)
+export DEEPDOC_MODEL_HOME=/path/to/deepdoc-models
+
+# optional bundle-specific local directories
+export DEEPDOC_VISION_MODEL_DIR=/path/to/vision
+export DEEPDOC_XGB_MODEL_DIR=/path/to/xgb
+
+# single combined ModelScope repo (all bundles in one repo)
+# (default: Xorbits/deepdoc)
+export DEEPDOC_MODELSCOPE_REPO=Xorbits/deepdoc
+# optional shared revision (default: master)
+export DEEPDOC_MODELSCOPE_REVISION=master
+
+# offline mode for tokenizer NLTK auto-download
+export DEEPDOC_OFFLINE=0
+
+# optional NLTK data controls for tokenizer
+export DEEPDOC_NLTK_DATA_DIR=/path/to/nltk_data
+```
+
+### Download model artifacts
+
+To pre-download all model bundles (vision/xgb/tokenizer) into the default cache directory (`~/.cache/deepdoc`), run:
+
+```bash
+deepdoc-download-models
+# or (from source checkout)
+python -m deepdoc.download_models
+```
+
+If you want to override the cache location, set `DEEPDOC_MODEL_HOME`:
+
+```bash
+export DEEPDOC_MODEL_HOME=./models
+deepdoc-download-models
+```
+
+By default this also downloads the required NLTK resources into `~/.cache/deepdoc/nltk_data` (or `$DEEPDOC_MODEL_HOME/nltk_data`) and the cached `cl100k_base` tiktoken file into `~/.cache/deepdoc/tiktoken_cache` (or `$DEEPDOC_MODEL_HOME/tiktoken_cache`). `deepdoc.common.token_utils` automatically points `TIKTOKEN_CACHE_DIR` at the same location unless you override it with `DEEPDOC_TIKTOKEN_CACHE_DIR` or `TIKTOKEN_CACHE_DIR`.
+
+If you want to skip either optional offline asset, use:
+
+```bash
+deepdoc-download-models --no-nltk --no-tiktoken
+```
+
 
 ### Vision Model Usage
 
@@ -50,15 +141,15 @@ from deepdoc import create_vision_model
 - Use Environment Variable
 
 ```bash
-# 视觉模型配置
+# Vision model configs
 export DEEPDOC_VISION_PROVIDER="qwen"
 export DEEPDOC_VISION_API_KEY="your-api-key"
 export DEEPDOC_VISION_MODEL="qwen-vl-max"
 export DEEPDOC_VISION_LANG="Chinese"
 export DEEPDOC_VISION_BASE_URL="http://your_base_url"
 
-# 其他配置
-export DEEPDOC_LIGHTEN=0 # 是否使用轻量模式
+# Other configs
+export DEEPDOC_LIGHTEN=0 # Whether to use lighten mode
 ```
 
 ``` python
@@ -99,4 +190,3 @@ vision_model = create_vision_model("/path/to/deepdoc_config.yaml")
 with open("image.jpg", "rb") as f:
     result = vision_model.describe_with_prompt(f.read())
 ```
-
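The README's documented lookup order (an explicit `$DEEPDOC_MODEL_HOME` wins, otherwise `~/.cache/deepdoc`) can be sketched in a few lines. The helper name is mine; the real resolver lives in `deepdoc.common.model_store` and may differ:

```python
import os
from pathlib import Path

def default_model_home() -> Path:
    """Sketch of the documented cache lookup: honor an explicit
    $DEEPDOC_MODEL_HOME override, else fall back to ~/.cache/deepdoc."""
    override = os.environ.get("DEEPDOC_MODEL_HOME")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "deepdoc"
```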

deepdoc/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -19,6 +19,7 @@
 
 from .parser import *
 from .depend.simple_cv_model import *
+from .config import PdfModelConfig, TokenizerConfig, ParserRuntimeConfig
 from .llm_adapter import LLMAdapter, LLMType, vision_llm_chunk
 
 __all__ = [
@@ -32,6 +33,9 @@
     "JsonParser",
     "MarkdownParser",
     "TxtParser",
+    "TokenizerConfig",
+    "PdfModelConfig",
+    "ParserRuntimeConfig",
     # LLM Adapter exports
     "LLMAdapter",
     "LLMType",
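The additions to `__all__` are what expose the new config classes through `from deepdoc import *`. A self-contained demonstration of that mechanism, using a synthetic module rather than deepdoc itself:

```python
import sys
import types

# Build a throwaway module: one name exported via __all__, one not
demo = types.ModuleType("allowlist_demo")
exec(
    "__all__ = ['TokenizerConfig']\n"
    "TokenizerConfig = object\n"
    "ParserRuntimeConfig = object  # defined, but not listed in __all__\n",
    demo.__dict__,
)
sys.modules["allowlist_demo"] = demo

ns = {}
exec("from allowlist_demo import *", ns)
assert "TokenizerConfig" in ns          # listed in __all__: imported
assert "ParserRuntimeConfig" not in ns  # unlisted: skipped by star-import
```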

deepdoc/common/__init__.py

Lines changed: 14 additions & 0 deletions
@@ -11,6 +11,13 @@
 from .connection_utils import timeout
 from .config_utils import get_base_config, get_config_value
 from .settings import PARALLEL_DEVICES, check_and_install_torch
+from .model_store import (
+    resolve_bundle_dir,
+    resolve_tokenizer_dict_prefix,
+    resolve_vision_model_dir,
+    resolve_xgb_model_dir,
+    validate_bundle_dir,
+)
 
 __all__ = [
     # file_utils
@@ -35,4 +42,11 @@
     # settings
     "PARALLEL_DEVICES",
     "check_and_install_torch",
+
+    # model_store
+    "resolve_bundle_dir",
+    "resolve_tokenizer_dict_prefix",
+    "resolve_vision_model_dir",
+    "resolve_xgb_model_dir",
+    "validate_bundle_dir",
 ]
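The newly exported resolver names suggest a precedence of explicit directories over the shared cache. A hypothetical sketch of that shape (the real `deepdoc.common.model_store` signatures may differ):

```python
import os
from pathlib import Path
from typing import Optional

def resolve_bundle_dir_sketch(bundle: str, explicit_dir: Optional[str] = None) -> Path:
    """Hypothetical resolver: an explicit directory wins; otherwise the
    bundle lives under $DEEPDOC_MODEL_HOME (default ~/.cache/deepdoc)."""
    if explicit_dir:
        return Path(explicit_dir)
    home = os.environ.get("DEEPDOC_MODEL_HOME") or str(Path.home() / ".cache" / "deepdoc")
    return Path(home) / bundle
```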
