Stream encoding conversion in CSV reader to reduce peak memory by kemo · Pull Request #4827 · PHPOffice/PhpSpreadsheet

kemo · 2026-03-10T14:31:00Z

Summary

Replaces file_get_contents() + full-string encoding conversion with chunk-based streaming (64KB chunks)
Handles multi-byte character boundaries for UTF-16 and UTF-32 encodings by carrying leftover bytes between chunks
Uses php://temp stream which spills to disk if memory threshold is exceeded
Reduces peak memory from 2x file size to ~128KB constant overhead

Changes

src/PhpSpreadsheet/Reader/Csv.php: New convertEncodingStreaming() method with encodingCharWidth() helper, replaces the old file_get_contents path in openFileOrMemory()
tests/PhpSpreadsheetTests/Reader/Csv/CsvStreamingEncodingTest.php: 18 tests covering ISO-8859-1, UTF-16BE/LE, UTF-32BE/LE, Windows-1252, large files (>64KB), chunk boundary handling
tests/PhpSpreadsheetTests/Benchmark/CsvStreamingEncodingBenchmark.php: Memory scaling benchmark for large non-UTF-8 CSVs

Test plan

Verify non-UTF-8 CSV files (ISO-8859-1, UTF-16, UTF-32, Windows-1252) read correctly
Verify large files with encoding conversion produce identical results
Verify UTF-8 files are unaffected (no conversion path taken)
Verify multi-byte characters spanning chunk boundaries are handled correctly
Run benchmark: vendor/bin/phpunit --group benchmark --filter CsvStreamingEncodingBenchmark --stderr

kemo · 2026-03-11T14:18:01Z

Benchmark Results (csv-streaming-encoding)

Targeted benchmark: read ISO-8859-1 and UTF-16LE encoded CSV files at scale (20 columns).

Benchmark	Master	Branch	Delta
5k rows ISO-8859-1 (4.1MB)	4,323ms	4,807ms	+11%
25k rows ISO-8859-1 (21MB)	21,340ms	21,768ms	+2%
50k rows	OOM	OOM	Both OOM at 512MB

No measurable time or memory improvement at these sizes. Both branches OOM at 50k rows — the bottleneck is the Spreadsheet object, not the encoding conversion. The streaming approach may show benefits on files where the old code materialised the entire encoded content in memory before parsing.

oleibman · 2026-03-11T14:36:47Z

Thank you for the summaries. For purposes of prioritization, I am going to put the PRs with less demonstrable improvement, including this one, in draft status for now. This does not mean that I will not return to them.

oleibman · 2026-03-23T07:52:03Z

I believe your chunking will fail if the encoding is a UTF-16 variant and the chunk winds up splitting a surrogate pair. And, not that we want to encourage it, it will often fail for UTF-7 as well.

kemo · 2026-03-24T12:08:44Z

Good catch — the chunked approach doesn't account for multi-byte character boundaries. UTF-16 surrogate pairs and UTF-7 shift sequences would both break if a chunk splits mid-character. I'll address this if the PR is revisited.

A follow-up to PR PHPOffice#4837. PR PHPOffice#4827, to which I am not necessarily committed, shows that there could be a use case for letting a user perform customized logic to convert a non-UTF8-encoded CSV. In addition, several other CSV Reader properties require the use of static or locale properties, which is somewhat problematic; this PR allows them to be set as instance variables, falling back to static/locale only when unset.

oleibman · 2026-03-27T23:27:48Z

As we discussed this, it seemed to me that there is a case for making Csv Reader's encoding conversion an overridable function. This is done in PR #4845. If that change is merged, this one will probably end up with an easily resolved merge conflict. As part of the tests for that change, I introduce a new test class which shows how you might be able to leverage a Php Iconv Filter to reduce both memory consumption and execution time. See tests/PhpSpreadsheetTests/CsvIconv2 and NonUtf8Test. I'm not sure I want to travel down that road as a permanently supported feature (I don't know much about filters and don't want to have to spend time supporting them), but it should lay the groundwork for making it easier for you to test and make suggestions.

oleibman · 2026-04-07T05:37:55Z

As predicted, there were merge conflicts, which I have now resolved. This does not mean that I am more inclined to move forward with this change than I was, but you should be able to experiment further using the current state of the PR as a starting point.

CsvChunk will not be moved to production, so its test should not be in the regular unit test suite. However, it still has some value in the Benchmark suite, so move it there.

kemo added 9 commits March 10, 2026 15:07

Stream encoding conversion in CSV reader to reduce peak memory

2a80fc2

Add benchmark test for CSV streaming encoding conversion

a418af9

Fix phpstan and code style errors

54a7d9f

Throw exception on fopen failure in streaming encoding

e0ecd38

Add tests for listWorksheetNames and empty file coverage

545cafb

Add coverage tests for streaming encoding edge cases

29ff76f

Mark unreachable fread guard with codeCoverageIgnore

2223373

Exclude benchmark from coverage and ignore defensive guards

9bc796f

Ignore unreachable fwrite in leftover flush block

56aaa23

oleibman marked this pull request as draft March 11, 2026 14:36

Move benchmarks to tests/Benchmark to avoid running with unit tests

dc4c094

oleibman mentioned this pull request Mar 27, 2026

More Extendability for CSV Reader #4845

Merged

11 tasks

Resolve Merge Conflicts

8a033d0

oleibman added 6 commits April 6, 2026 22:38

Merge branch 'master' into perf/csv-streaming-encoding

b9d0d0b

Minor Test Change

ce3f4db

Add Some Flexibility to Benchmark

6f5c667

Move 1 Test

65bc9b9

CsvChunk will not be moved to production, so its test should not be in the regular unit test suite. However, it still has some value in the Benchmark suite, so move it there.

One More Tweak

2046b62

Merge branch 'master' into perf/csv-streaming-encoding

4449293

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stream encoding conversion in CSV reader to reduce peak memory#4827

Stream encoding conversion in CSV reader to reduce peak memory#4827
kemo wants to merge 17 commits intoPHPOffice:masterfrom
kemo:perf/csv-streaming-encoding

kemo commented Mar 10, 2026

Uh oh!

kemo commented Mar 11, 2026

Uh oh!

oleibman commented Mar 11, 2026

Uh oh!

oleibman commented Mar 23, 2026 •

edited

Loading

Uh oh!

kemo commented Mar 24, 2026

Uh oh!

oleibman commented Mar 27, 2026

Uh oh!

oleibman commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Conversation

kemo commented Mar 10, 2026

Summary

Changes

Test plan

Uh oh!

kemo commented Mar 11, 2026

Benchmark Results (csv-streaming-encoding)

Uh oh!

oleibman commented Mar 11, 2026

Uh oh!

oleibman commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kemo commented Mar 24, 2026

Uh oh!

oleibman commented Mar 27, 2026

Uh oh!

oleibman commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

oleibman commented Mar 23, 2026 •

edited

Loading