Skip to content

Add optional XMLReader streaming mode for XLSX cell data parsing#4830

Draft
kemo wants to merge 12 commits intoPHPOffice:masterfrom
kemo:perf/xlsx-xmlreader-streaming
Draft

Add optional XMLReader streaming mode for XLSX cell data parsing#4830
kemo wants to merge 12 commits intoPHPOffice:masterfrom
kemo:perf/xlsx-xmlreader-streaming

Conversation

@kemo
Copy link
Copy Markdown
Contributor

@kemo kemo commented Mar 10, 2026

Summary

  • Adds an opt-in $useStreamingReader flag (default false) to the XLSX reader
  • When enabled, uses XMLReader instead of SimpleXML for the main cell data parsing loop, processing rows one at a time instead of loading the entire sheet XML into memory
  • Handles all core data types: shared strings, numbers, booleans, formulas, inline strings, errors
  • Keeps the existing SimpleXML path as default for full backward compatibility

Changes

  • src/PhpSpreadsheet/Reader/Xlsx.php: Added setUseStreamingReader()/getUseStreamingReader(), loadSheetDataWithXmlReader() method with helpers
  • tests/PhpSpreadsheetTests/Reader/Xlsx/StreamingReadTest.php: 10 tests (78 assertions) covering all data types, cell-by-cell comparison with SimpleXML, styles, inline strings
  • tests/PhpSpreadsheetTests/Benchmark/XlsxStreamingReaderBenchmark.php: Benchmark with 2000x15 mixed-type XLSX comparing SimpleXML vs XMLReader timing and memory

Test plan

  • Verify streaming flag defaults to false
  • Verify all data types load correctly in streaming mode
  • Verify cell-by-cell parity between SimpleXML and XMLReader modes
  • Verify existing XLSX test files load correctly with streaming enabled
  • Run benchmark: vendor/bin/phpunit --group benchmark --filter XlsxStreamingReaderBenchmark --stderr

@kemo
Copy link
Copy Markdown
Contributor Author

kemo commented Mar 11, 2026

Benchmark Results (xlsx-xmlreader-streaming)

Targeted benchmark: read XLSX files of various sizes, including generated large files.

Benchmark Master Branch Delta
issue.3654c (1.7MB) 372ms 387ms +4%
issue.2362 (253K) 1,059ms 1,043ms -2%
Generated 5kx15 (0.4MB) 3,316ms 3,268ms -1%
Generated 20kx15 (1.8MB) 13,421ms 14,076ms +5%

No improvement on these test files. Slight regression on the largest generated file (+5%). Memory usage unchanged. The XMLReader streaming approach adds overhead for files that fit comfortably in memory — the benefit would need to be demonstrated on files large enough to cause memory pressure with the DOM-based parser.

@oleibman
Copy link
Copy Markdown
Collaborator

Thank you for the summaries. For purposes of prioritization, I am going to put the PRs with less demonstrable improvement, including this one, in draft status for now. This does not mean that I will not return to them.

@oleibman oleibman marked this pull request as draft March 11, 2026 14:38
@kemo
Copy link
Copy Markdown
Contributor Author

kemo commented Mar 11, 2026

Thank you for the summaries. For purposes of prioritization, I am going to put the PRs with less demonstrable improvement, including this one, in draft status for now. This does not mean that I will not return to them.

Thank you for going through them, we're pretty heavy users of the library so it is great to give back.

@oleibman
Copy link
Copy Markdown
Collaborator

I have restructured your change. What you are trying to do is replace a large block of code with an even larger block of code. Nothing wrong with that per se, but now we have two large blocks of code. It seems like you would be better served if I made it easier to extend the Xlsx Reader class by putting the original block in a protected method, allowing your change to just extend the class with your new methods. Xlsx Reader is desparately in need of refactoring; I might be interested in advancing this change if only because it finally got me to do a small bit of refactoring.

I admittedly have made the interface for the new method hideous. I will see if I can do better, but didn't want to delay getting this change out there.

The results do not look promising. I am consistently seeing Simple perform in about 80% of the time of Streaming, and neither seems to report a memory delta greater than 0. For that reason, I have taken the new class that I made from your code out of the src tree, and your new test out of the test\PhpSpreadsheetTests tree, and moved them both to test\Benchmark. If you continue to work with these tests and can demonstrate a benefit, I can always move those back to the more preferred locations. If you can't come up with a small test but your changes are more effective with your production code, I am, as I said above, willing to merge this as it is now, so that you can take advantage of it more easily than you do now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants