Commit 874c7ca

docs(uspto): improve documentation of USPTO XML parser security config
Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
1 parent 075fa69 commit 874c7ca

1 file changed: +29 -16 lines changed

docling/backend/xml/uspto_backend.py

Lines changed: 29 additions & 16 deletions
@@ -5,22 +5,35 @@
 The original files can be found in https://bulkdata.uspto.gov.
 
 Security Note:
-This module uses defusedxml.sax.make_parser() with customized security settings
-to protect against XML External Entity (XXE) attacks while allowing USPTO XML files
-to be parsed. In addition, it includes safeguards against entity expansion attacks
-and entity nesting depth. USPTO files contain DTD declarations that defusedxml
-blocks by default, so we configure the parser with:
-
-- feature_external_ges: False (blocks external general entities)
-- feature_external_pes: False (blocks external parameter entities)
-- forbid_dtd: False (allows DTD declarations in the XML)
-- forbid_entities: False (allows entity declarations)
-- forbid_external: False (allows external references in declarations)
-
-This configuration permits DTD declarations (required for USPTO files) while the
-disabled external entity features prevent actual fetching of external resources,
-effectively blocking XXE attacks. The parser processes the XML structure without
-accessing any external files or URLs.
+This module uses defusedxml.sax.make_parser() with security settings to protect
+against XML External Entity (XXE) attacks and entity expansion attacks (Billion
+Laughs/CWE-776). The parser is configured with:
+
+- feature_external_ges: False (blocks external general entity resolution)
+- feature_external_pes: False (blocks external parameter entity resolution)
+- forbid_dtd: False (allows DTD declarations required by USPTO XML format)
+- forbid_entities: False (allows entity declarations including NDATA)
+- forbid_external: False (allows SYSTEM declarations in DTD)
+
+Security Analysis:
+1. XXE Prevention: While external entities can be declared (forbid_external=False),
+they are never resolved or fetched due to feature_external_ges=False and
+feature_external_pes=False. This prevents XXE attacks.
+
+2. Billion Laughs Mitigation: defusedxml's built-in entity expansion limits
+(MAX_ENTITY_EXPANSION=10,000) prevent exponential entity expansion from
+causing memory exhaustion. While not completely blocking entity expansion,
+this limit prevents the worst-case denial-of-service scenarios.
+
+3. NDATA Entities: USPTO files use NDATA entities for image references
+(e.g., <!ENTITY img SYSTEM "file.tif" NDATA TIF>). These are unparsed
+entities that don't expand inline and aren't fetched due to the external
+entity resolution being disabled.
+
+This configuration balances security with USPTO format compatibility. The key
+insight is that defusedxml distinguishes between entity declaration (allowed)
+and entity resolution/fetched (blocked), providing protection while allowing
+the required DTD structure.
 """
 
 import html
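
The XXE and NDATA points in the new docstring can be illustrated with a minimal sketch. It is not the code from uspto_backend.py: it assumes the standard library's xml.sax in place of defusedxml.sax so that it runs without extra dependencies, and it sets only the two feature flags named in the docstring. The Collector class, the parse_without_external_entities helper, and the sample document are hypothetical names introduced for illustration.

```python
import io
import pathlib
import tempfile
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    DTDHandler,
    feature_external_ges,
    feature_external_pes,
)


class Collector(ContentHandler, DTDHandler):
    """Hypothetical handler: records character data and NDATA declarations."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.unparsed = []

    def characters(self, content):
        self.text.append(content)

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        # Reported for declarations such as
        # <!ENTITY img SYSTEM "file.tif" NDATA TIF>; nothing is fetched here.
        self.unparsed.append((name, systemId, ndata))


def parse_without_external_entities(xml_bytes):
    # Assumption: stdlib parser stands in for defusedxml.sax.make_parser().
    parser = xml.sax.make_parser()
    # Declarations are still allowed; only resolution is switched off.
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    handler = Collector()
    parser.setContentHandler(handler)
    parser.setDTDHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(handler.text), handler.unparsed


with tempfile.TemporaryDirectory() as tmp:
    # A local file that an XXE payload would try to read.
    secret = pathlib.Path(tmp) / "secret.txt"
    secret.write_text("SECRET")
    doc = f"""<?xml version="1.0"?>
<!DOCTYPE doc [
  <!NOTATION TIF SYSTEM "image/tiff">
  <!ENTITY img SYSTEM "file.tif" NDATA TIF>
  <!ENTITY xxe SYSTEM "{secret.as_uri()}">
]>
<doc>before&xxe;after</doc>""".encode("utf-8")
    text, unparsed = parse_without_external_entities(doc)

print(text)
print(unparsed)
```

If the external entity were resolved, the collected text would contain the contents of secret.txt. With both features disabled the reference is skipped, while the NDATA declaration is still reported to the DTD handler as a declaration only.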
