Stream encoding conversion in CSV reader to reduce peak memory #4827 (Draft)

kemo wants to merge 17 commits into PHPOffice:master from kemo:perf/csv-streaming-encoding

Conversation

kemo (Contributor) commented Mar 10, 2026

Summary

  • Replaces file_get_contents() + full-string encoding conversion with chunk-based streaming (64KB chunks)
  • Handles multi-byte character boundaries for UTF-16 and UTF-32 encodings by carrying leftover bytes between chunks
  • Uses php://temp stream which spills to disk if memory threshold is exceeded
  • Reduces peak memory from 2x file size to ~128KB constant overhead
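The steps above can be sketched roughly as follows. The method name matches the PR description, but the body is illustrative, not the actual diff; chunk size, the `match` table, and variable names are assumptions:

```php
<?php

// Illustrative sketch of chunked encoding conversion into php://temp.
// Not the actual PR implementation.
function convertEncodingStreaming(string $filename, string $fromEncoding)
{
    // php://temp keeps data in memory up to a threshold (2MB by default),
    // then transparently spills to disk.
    $temp = fopen('php://temp', 'r+b');
    $in = fopen($filename, 'rb');

    // Fixed-width encodings must not be split mid-character, so track the
    // code-unit width (the encodingCharWidth() helper in the PR).
    $charWidth = match (strtoupper($fromEncoding)) {
        'UTF-16', 'UTF-16LE', 'UTF-16BE' => 2,
        'UTF-32', 'UTF-32LE', 'UTF-32BE' => 4,
        default => 1,
    };

    $carry = '';
    while (($chunk = fread($in, 65536)) !== false && $chunk !== '') {
        $chunk = $carry . $chunk;
        // Carry any trailing partial character into the next chunk.
        $remainder = strlen($chunk) % $charWidth;
        $carry = $remainder > 0 ? substr($chunk, -$remainder) : '';
        if ($remainder > 0) {
            $chunk = substr($chunk, 0, -$remainder);
        }
        fwrite($temp, mb_convert_encoding($chunk, 'UTF-8', $fromEncoding));
    }
    // A non-empty carry here would mean the input ended mid-character.
    // Note: width alignment alone does not protect UTF-16 surrogate pairs.
    fclose($in);
    rewind($temp);

    return $temp;
}
```

Only one 64KB chunk (plus the converted output buffered in `php://temp`) is live at a time, which is where the ~128KB constant-overhead figure comes from.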

Changes

  • src/PhpSpreadsheet/Reader/Csv.php: New convertEncodingStreaming() method with encodingCharWidth() helper, replaces the old file_get_contents path in openFileOrMemory()
  • tests/PhpSpreadsheetTests/Reader/Csv/CsvStreamingEncodingTest.php: 18 tests covering ISO-8859-1, UTF-16BE/LE, UTF-32BE/LE, Windows-1252, large files (>64KB), chunk boundary handling
  • tests/PhpSpreadsheetTests/Benchmark/CsvStreamingEncodingBenchmark.php: Memory scaling benchmark for large non-UTF-8 CSVs

Test plan

  • Verify non-UTF-8 CSV files (ISO-8859-1, UTF-16, UTF-32, Windows-1252) read correctly
  • Verify large files with encoding conversion produce identical results
  • Verify UTF-8 files are unaffected (no conversion path taken)
  • Verify multi-byte characters spanning chunk boundaries are handled correctly
  • Run benchmark: vendor/bin/phpunit --group benchmark --filter CsvStreamingEncodingBenchmark --stderr

kemo (Contributor, Author) commented Mar 11, 2026

Benchmark Results (csv-streaming-encoding)

Targeted benchmark: read ISO-8859-1 and UTF-16LE encoded CSV files at scale (20 columns).

| Benchmark | Master | Branch | Delta |
|---|---|---|---|
| 5k rows ISO-8859-1 (4.1MB) | 4,323ms | 4,807ms | +11% |
| 25k rows ISO-8859-1 (21MB) | 21,340ms | 21,768ms | +2% |
| 50k rows | OOM | OOM | Both OOM at 512MB |

No measurable time or memory improvement at these sizes. Both branches OOM at 50k rows — the bottleneck is the Spreadsheet object, not the encoding conversion. The streaming approach may show benefits on files where the old code materialised the entire encoded content in memory before parsing.

oleibman (Collaborator)

Thank you for the summaries. For purposes of prioritization, I am going to put the PRs with less demonstrable improvement, including this one, in draft status for now. This does not mean that I will not return to them.

@oleibman oleibman marked this pull request as draft March 11, 2026 14:36
oleibman (Collaborator) commented Mar 23, 2026

I believe your chunking will fail if the encoding is a UTF-16 variant and the chunk winds up splitting a surrogate pair. And, not that we want to encourage it, it will often fail for UTF-7 as well.
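The failure mode is easy to reproduce in isolation. An astral-plane character such as 😀 (U+1F600) is a surrogate pair, i.e. two 2-byte code units in UTF-16, so a split at an even, "aligned" offset can still land between the high and low surrogate:

```php
<?php

// 'a😀b' in UTF-16LE is 8 bytes: 'a' (2) + surrogate pair (4) + 'b' (2).
$utf16 = mb_convert_encoding('a😀b', 'UTF-16LE', 'UTF-8');

// Split at byte 4: even and code-unit-aligned, yet inside the pair.
$first = substr($utf16, 0, 4);  // 'a' + lone high surrogate
$second = substr($utf16, 4);    // lone low surrogate + 'b'

// Converting each half separately corrupts the character: mbstring
// substitutes the invalid lone surrogates instead of reassembling them.
$broken = mb_convert_encoding($first, 'UTF-8', 'UTF-16LE')
        . mb_convert_encoding($second, 'UTF-8', 'UTF-16LE');
$whole = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE');

var_dump($broken === $whole); // bool(false)
```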

kemo (Contributor, Author) commented Mar 24, 2026

Good catch — the chunked approach doesn't account for multi-byte character boundaries. UTF-16 surrogate pairs and UTF-7 shift sequences would both break if a chunk splits mid-character. I'll address this if the PR is revisited.
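One way the carry logic could be extended for UTF-16 is to also hold back a trailing lone high surrogate, so the pair is always converted together. A hypothetical sketch (function name and shape are illustrative, not a proposed API):

```php
<?php

// Split a UTF-16 byte buffer so the kept part never ends mid-character:
// neither on a half code unit nor on an unpaired high surrogate.
function safeUtf16ChunkSplit(string $chunk, bool $littleEndian = true): array
{
    // First keep the chunk a whole number of 2-byte code units.
    $remainder = strlen($chunk) % 2;
    $carry = $remainder > 0 ? substr($chunk, -$remainder) : '';
    if ($remainder > 0) {
        $chunk = substr($chunk, 0, -$remainder);
    }

    if ($chunk !== '') {
        // Decode the final code unit; high surrogates are 0xD800-0xDBFF.
        $lastUnit = substr($chunk, -2);
        $value = $littleEndian
            ? (ord($lastUnit[1]) << 8) | ord($lastUnit[0])
            : (ord($lastUnit[0]) << 8) | ord($lastUnit[1]);
        if ($value >= 0xD800 && $value <= 0xDBFF) {
            // Carry the lone high surrogate into the next chunk.
            $carry = $lastUnit . $carry;
            $chunk = substr($chunk, 0, -2);
        }
    }

    return [$chunk, $carry];
}
```

UTF-7 is harder: its shift sequences are variable-length and base64-encoded, so there is no fixed lookback that guarantees a safe split point.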

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this pull request Mar 27, 2026
A follow-up to PR PHPOffice#4837. PR PHPOffice#4827, to which I am not necessarily committed, shows that there could be a use case for letting a user perform customized logic to convert a non-UTF8-encoded CSV. In addition, several other CSV Reader properties require the use of static or locale properties, which is somewhat problematic; this PR allows them to be set as instance variables, falling back to static/locale only when unset.
@oleibman oleibman mentioned this pull request Mar 27, 2026
oleibman (Collaborator)

As we discussed this, it seemed to me that there is a case for making the CSV Reader's encoding conversion an overridable function. This is done in PR #4845. If that change is merged, this one will probably end up with an easily resolved merge conflict. As part of the tests for that change, I introduce a new test class which shows how you might be able to leverage a PHP iconv stream filter to reduce both memory consumption and execution time. See tests/PhpSpreadsheetTests/CsvIconv2 and NonUtf8Test. I'm not sure I want to travel down that road as a permanently supported feature (I don't know much about filters and don't want to have to spend time supporting them), but it should lay the groundwork for making it easier for you to test and make suggestions.
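For reference, the core of the iconv stream-filter technique looks like the following. This is an exploratory sketch, not a committed PhpSpreadsheet API; the filter transcodes lazily as the stream is read, so the file is never fully materialised in either encoding:

```php
<?php

// Build a small UTF-16LE CSV fixture to read through the filter.
$file = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($file, mb_convert_encoding("a,b\n1,2\n", 'UTF-16LE', 'UTF-8'));

$in = fopen($file, 'rb');
// convert.iconv.* is a built-in PHP stream filter; reads from $in now
// yield UTF-8 bytes, converted incrementally.
stream_filter_append($in, 'convert.iconv.UTF-16LE/UTF-8', STREAM_FILTER_READ);

// Parse directly off the filtered stream ($escape passed explicitly to
// avoid the PHP 8.4 deprecation of its implicit default).
while (($row = fgetcsv($in, null, ',', '"', '')) !== false) {
    echo implode('|', $row), "\n"; // prints a|b then 1|2
}
fclose($in);
```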

oleibman (Collaborator) commented Apr 7, 2026

As predicted, there were merge conflicts, which I have now resolved. This does not mean that I am more inclined to move forward with this change than I was, but you should be able to experiment further using the current state of the PR as a starting point.

oleibman added 6 commits April 6, 2026 22:38
CsvChunk will not be moved to production, so its test should not be in the regular unit test suite. However, it still has some value in the Benchmark suite, so move it there.