Summary
We're using docx-rs (v0.4) in office2pdf to parse DOCX files. When processing ~2,800 real-world DOCX files (from LibreOffice and Apache POI test suites), 19 files fail. 16 of these return "Failed to read from zip" even though the zip archives are valid — the error actually comes from XML parsing failures within the zip entries.
Related to #832.
Failing files (19 files)
"Failed to read from zip" (16 files)
These are valid ZIP files (verified independently). The error message is misleading — the actual failure is in XML parsing:
cloud.docx, sdt_after_section_break.docx (LibreOffice)
tdf108350.docx, tdf108408.docx, tdf108714.docx, tdf108806.docx, tdf108849.docx, tdf109306.docx, tdf109524.docx, tdf111550.docx, tdf111964.docx, tdf124670.docx, tdf129659.docx (LibreOffice regression tests)
tdf171025_pageAfter.docx, tdf171038_pageAfter.docx (page break after section)
xml_space.docx (xml:space attribute handling)
"Failed to read xml" (1 file)
math-malformed_xml.docx — intentionally malformed math XML
ParseFloatError panic (1 file)
tdf79272_strictDxa.docx — unwrap() on parsing table column width as float for Strict OOXML dxa unit values
Deep nesting failure (1 file)
deep-table-cell.docx (Apache POI) — deeply nested table cells cause parse failure
Issues
- Misleading error message: "Failed to read from zip" when the actual problem is XML parsing. It would help if the error distinguished between zip I/O errors and XML parsing errors.
- Panic on Strict OOXML: The dxa parsing calls unwrap() on parse::<f64>(), which panics on Strict OOXML unit formats.
- Limited OOXML support: Many LibreOffice-specific OOXML extensions (structured document tags, complex section breaks) are not handled.
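To illustrate the first point, here is a minimal sketch of an error type that keeps zip I/O failures and XML parsing failures apart. The name `ReadError`, its variants, and the message wording are hypothetical and are not part of the docx-rs API; the only assumption is that the reader knows which zip entry it was parsing when the failure occurred.

```rust
use std::fmt;

/// Hypothetical error type (not docx-rs API): one variant per failure source.
#[derive(Debug)]
pub enum ReadError {
    /// The archive could not be opened, or an entry could not be read from it.
    Zip(String),
    /// The entry was read successfully, but its XML could not be parsed.
    Xml { entry: String, message: String },
}

impl fmt::Display for ReadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReadError::Zip(msg) => write!(f, "failed to read from zip: {msg}"),
            ReadError::Xml { entry, message } => {
                // Naming the entry tells the caller where to look.
                write!(f, "failed to parse XML in {entry}: {message}")
            }
        }
    }
}

impl std::error::Error for ReadError {}
```

With something along these lines, the 16 files above would surface as XML errors naming the offending entry instead of as zip errors.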
Expected behavior
- XML parsing errors should be reported as such, not masked as zip errors
- No panics: parsing failures should return Err
- Unsupported OOXML features should be skipped with a warning rather than causing total parse failure
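As a sketch of the "no panics" point, the width parsing could return a Result instead of calling unwrap(). The helper `parse_width_twips` and the `WidthError` type are hypothetical, not docx-rs code. It also assumes that the Strict file writes widths with a unit suffix such as `pt` or `in`; that is a guess at why parse::<f64>() fails on tdf79272_strictDxa.docx, not something verified against the file.

```rust
use std::num::ParseFloatError;

/// Hypothetical error type for width parsing (not docx-rs API).
#[derive(Debug, PartialEq)]
pub enum WidthError {
    /// The numeric part was missing or not a valid float.
    Number(ParseFloatError),
    /// The unit suffix is not one this sketch knows how to convert.
    UnknownUnit(String),
}

/// Parse a width into twips (dxa), returning Err instead of panicking.
/// A bare number is taken to be twips already; "pt" and "in" are converted.
pub fn parse_width_twips(raw: &str) -> Result<f64, WidthError> {
    let raw = raw.trim();
    // Split at the first ASCII letter: "12.5pt" -> ("12.5", "pt").
    let split = raw
        .find(|c: char| c.is_ascii_alphabetic())
        .unwrap_or(raw.len());
    let (number, unit) = raw.split_at(split);
    let value: f64 = number.parse().map_err(WidthError::Number)?;
    let twips_per_unit = match unit {
        "" => 1.0,      // already in twips
        "pt" => 20.0,   // 1 point = 20 twips
        "in" => 1440.0, // 1 inch = 72 points = 1440 twips
        other => return Err(WidthError::UnknownUnit(other.to_string())),
    };
    Ok(value * twips_per_unit)
}
```

A caller can then decide whether to propagate the error or fall back to a default width and log a warning, which also fits the last expectation above.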
Environment
- docx-rs v0.4
- office2pdf v0.3.0
- Rust 1.85+
- Files from LibreOffice and Apache POI test suites (publicly available)