Parse failures on valid DOCX files: 'Failed to read from zip' masks XML parsing errors #873

@developer0hye

Summary

We're using docx-rs (v0.4) in office2pdf to parse DOCX files. When processing ~2,800 real-world DOCX files (from LibreOffice and Apache POI test suites), 19 files fail. 16 of these return "Failed to read from zip" even though the zip archives are valid — the error actually comes from XML parsing failures within the zip entries.

Related to #832.

Failing files (19 files)

"Failed to read from zip" (16 files)

These are valid ZIP files (verified independently). The error message is misleading — the actual failure is in XML parsing:

  • cloud.docx, sdt_after_section_break.docx (LibreOffice)
  • tdf108350.docx, tdf108408.docx, tdf108714.docx, tdf108806.docx, tdf108849.docx, tdf109306.docx, tdf109524.docx, tdf111550.docx, tdf111964.docx, tdf124670.docx, tdf129659.docx (LibreOffice regression tests)
  • tdf171025_pageAfter.docx, tdf171038_pageAfter.docx (page break after section)
  • xml_space.docx (xml:space attribute handling)

"Failed to read xml" (1 file)

  • math-malformed_xml.docx — intentionally malformed math XML

ParseFloatError panic (1 file)

  • tdf79272_strictDxa.docx: unwrap() panics when parsing a table column width as a float for Strict OOXML dxa unit values

Deep nesting failure (1 file)

  • deep-table-cell.docx (Apache POI) — deeply nested table cells cause parse failure

Issues

  1. Misleading error message: "Failed to read from zip" when the actual problem is XML parsing. It would help if the error distinguished between zip I/O errors and XML parsing errors.
  2. Panic on Strict OOXML: The dxa parsing calls unwrap() on parse::<f64>(), which panics on Strict OOXML unit formats.
  3. Limited OOXML support: Several OOXML features exercised by the LibreOffice test files (structured document tags, complex section breaks) are not handled.
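
To illustrate the first point, here is a minimal sketch of an error type that keeps zip I/O failures and XML parsing failures apart. `ReadError`, its variants, and the message wording are hypothetical names chosen for this example; they are not taken from the docx-rs API.

```rust
use std::fmt;

/// Hypothetical error type for illustration only (not the docx-rs API).
#[derive(Debug)]
pub enum ReadError {
    /// The archive could not be opened, or an entry could not be read from it.
    Zip { entry: String, reason: String },
    /// The entry was read from the archive, but its XML could not be parsed.
    Xml { entry: String, reason: String },
}

impl fmt::Display for ReadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ReadError::Zip { entry, reason } => {
                write!(f, "failed to read '{}' from zip: {}", entry, reason)
            }
            ReadError::Xml { entry, reason } => {
                write!(f, "failed to parse XML in '{}': {}", entry, reason)
            }
        }
    }
}

impl std::error::Error for ReadError {}
```

With a split like this, a caller such as office2pdf could match on the variant and report which archive entry failed and why, instead of treating every failure as a corrupt zip.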

Expected behavior

  • XML parsing errors should be reported as such, not masked as zip errors
  • No panics — parsing failures should return Err
  • Unsupported OOXML features should be skipped with a warning rather than causing total parse failure
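
A rough sketch of the last two points, assuming the panic comes from a width value that is not a plain number. `parse_width`, `parse_widths`, and the sample value `"120pt"` are hypothetical and only stand in for whatever the Strict OOXML file actually contains; this is not the docx-rs code.

```rust
/// Hypothetical helper: parse a width attribute without panicking.
/// Returns a descriptive error instead of calling unwrap().
pub fn parse_width(raw: &str) -> Result<f64, String> {
    raw.trim()
        .parse::<f64>()
        .map_err(|e| format!("invalid width '{}': {}", raw, e))
}

/// Hypothetical caller: keep the widths that parse, and collect a warning
/// for each value that does not, rather than failing the whole document.
pub fn parse_widths(raw_values: &[&str]) -> (Vec<f64>, Vec<String>) {
    let mut widths = Vec::new();
    let mut warnings = Vec::new();
    for raw in raw_values {
        match parse_width(raw) {
            Ok(width) => widths.push(width),
            Err(message) => warnings.push(message),
        }
    }
    (widths, warnings)
}
```

For example, `parse_widths(&["2400", "120pt", "1200"])` would keep the two numeric widths and return one warning for the assumed unit-suffixed value, so the rest of the document could still be processed.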

Environment

  • docx-rs v0.4
  • office2pdf v0.3.0
  • Rust 1.85+
  • Files from LibreOffice and Apache POI test suites (publicly available)
