Stream encoding conversion in CSV reader to reduce peak memory #4827

kemo wants to merge 17 commits into PHPOffice:master
Conversation
Benchmark Results (csv-streaming-encoding)

Targeted benchmark: read ISO-8859-1 and UTF-16LE encoded CSV files at scale (20 columns).
No measurable time or memory improvement at these sizes. Both branches OOM at 50k rows — the bottleneck is the Spreadsheet object, not the encoding conversion. The streaming approach may show benefits on files where the old code materialised the entire encoded content in memory before parsing.
Thank you for the summaries. For purposes of prioritization, I am going to put the PRs with less demonstrable improvement, including this one, in draft status for now. This does not mean that I will not return to them.
I believe your chunking will fail if the encoding is a UTF-16 variant and the chunk winds up splitting a surrogate pair. And, not that we want to encourage it, it will often fail for UTF-7 as well.
Good catch — the chunked approach doesn't account for multi-byte character boundaries. UTF-16 surrogate pairs and UTF-7 shift sequences would both break if a chunk splits mid-character. I'll address this if the PR is revisited.
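For anyone revisiting this later: a minimal sketch of one way to handle the boundary problem for fixed-width encodings. All names here are illustrative, not the PR's actual code — reads are aligned to the encoding's code-unit width, and for UTF-16 a trailing high surrogate is carried over into the next chunk so `mb_convert_encoding()` never sees a split pair. (UTF-7 shift sequences are variable-length and would still need a different strategy.)

```php
<?php
// Hypothetical helper: bytes per code unit for fixed-width encodings.
function codeUnitWidth(string $encoding): int
{
    $enc = strtoupper($encoding);
    if (str_starts_with($enc, 'UTF-16')) {
        return 2;
    }
    if (str_starts_with($enc, 'UTF-32')) {
        return 4;
    }
    return 1; // single-byte encodings (ISO-8859-1, Windows-1252, ...)
}

// Convert a stream chunk-by-chunk, aligning reads to the code-unit width
// and carrying a trailing UTF-16 high surrogate into the next chunk.
function convertStreamChunks($in, $out, string $fromEncoding, int $chunkSize = 65536): void
{
    $width = codeUnitWidth($fromEncoding);
    $chunkSize = max($width, $chunkSize - ($chunkSize % $width)); // keep reads aligned
    $carry = '';
    while (!feof($in)) {
        $data = $carry . fread($in, $chunkSize);
        $carry = '';
        // For UTF-16, avoid splitting a surrogate pair: if the chunk ends
        // on a high surrogate (0xD800-0xDBFF), hold those 2 bytes back.
        if ($width === 2 && strlen($data) >= 2) {
            $lastUnit = substr($data, -2);
            $cp = str_contains(strtoupper($fromEncoding), 'BE')
                ? (ord($lastUnit[0]) << 8) | ord($lastUnit[1])
                : (ord($lastUnit[1]) << 8) | ord($lastUnit[0]);
            if ($cp >= 0xD800 && $cp <= 0xDBFF) {
                $carry = $lastUnit;
                $data = substr($data, 0, -2);
            }
        }
        if ($data !== '') {
            fwrite($out, mb_convert_encoding($data, 'UTF-8', $fromEncoding));
        }
    }
}
```

Using a deliberately tiny chunk size in tests forces the boundary case (a surrogate pair split across two reads) to actually occur.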
A follow-up to PR PHPOffice#4837. PR PHPOffice#4827, to which I am not necessarily committed, shows that there could be a use case for letting a user perform customized logic to convert a non-UTF8-encoded CSV. In addition, several other CSV Reader properties require the use of static or locale properties, which is somewhat problematic; this PR allows them to be set as instance variables, falling back to static/locale only when unset.
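The instance-with-static-fallback behaviour described above can be sketched roughly as follows. The class and property names are illustrative only (the real reader has several such properties, falling back to static or locale values):

```php
<?php
// Hypothetical sketch of the pattern: a per-instance setting that falls
// back to a static default when it has not been set on the instance.
class CsvReaderSketch
{
    private static string $defaultDelimiter = ',';
    private ?string $delimiter = null;

    public static function setDefaultDelimiter(string $delimiter): void
    {
        self::$defaultDelimiter = $delimiter;
    }

    public function setDelimiter(?string $delimiter): self
    {
        $this->delimiter = $delimiter; // null clears the instance override
        return $this;
    }

    public function getDelimiter(): string
    {
        // Instance value wins; fall back to the static default when unset.
        return $this->delimiter ?? self::$defaultDelimiter;
    }
}
```

This keeps existing static-based callers working unchanged while letting two reader instances in the same process use different settings.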
As we discussed this, it seemed to me that there is a case for making Csv Reader's encoding conversion an overridable function. This is done in PR #4845. If that change is merged, this one will probably end up with an easily resolved merge conflict. As part of the tests for that change, I introduce a new test class which shows how you might be able to leverage a Php Iconv Filter to reduce both memory consumption and execution time. See tests/PhpSpreadsheetTests/CsvIconv2 and NonUtf8Test. I'm not sure I want to travel down that road as a permanently supported feature (I don't know much about filters and don't want to have to spend time supporting them), but it should lay the groundwork for making it easier for you to test and make suggestions.
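For reference, the iconv filter idea mentioned above boils down to PHP's built-in `convert.iconv.*` stream filter, which converts the encoding as bytes are read, so the full file is never held in memory as one string. A minimal self-contained sketch (not the actual PhpSpreadsheet test code — the file setup here is just for demonstration):

```php
<?php
// Create a small ISO-8859-1 CSV for demonstration (assumption: in the real
// use case, $path would point at the user's non-UTF-8 CSV file).
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, mb_convert_encoding("name,city\nJosé,Málaga\n", 'ISO-8859-1', 'UTF-8'));

$handle = fopen($path, 'r');
// Convert ISO-8859-1 -> UTF-8 transparently as the stream is read.
stream_filter_append($handle, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);

$rows = [];
while (($row = fgetcsv($handle)) !== false) {
    $rows[] = $row; // fields arrive already converted to UTF-8
}
fclose($handle);
unlink($path);
```

Because the filter runs inside the stream layer, `fgetcsv()` itself never needs to know the source encoding.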
As predicted, there were merge conflicts, which I have now resolved. This does not mean that I am more inclined to move forward with this change than I was, but you should be able to experiment further using the current state of the PR as a starting point.
CsvChunk will not be moved to production, so its test should not be in the regular unit test suite. However, it still has some value in the Benchmark suite, so move it there.
Summary
- Replaces `file_get_contents()` + full-string encoding conversion with chunk-based streaming (64KB chunks)
- Converted output goes to a `php://temp` stream which spills to disk if a memory threshold is exceeded

Changes
- `src/PhpSpreadsheet/Reader/Csv.php`: New `convertEncodingStreaming()` method with `encodingCharWidth()` helper; replaces the old `file_get_contents` path in `openFileOrMemory()`
- `tests/PhpSpreadsheetTests/Reader/Csv/CsvStreamingEncodingTest.php`: 18 tests covering ISO-8859-1, UTF-16BE/LE, UTF-32BE/LE, Windows-1252, large files (>64KB), chunk boundary handling
- `tests/PhpSpreadsheetTests/Benchmark/CsvStreamingEncodingBenchmark.php`: Memory scaling benchmark for large non-UTF-8 CSVs

Test plan
```
vendor/bin/phpunit --group benchmark --filter CsvStreamingEncodingBenchmark --stderr
```
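The chunk-and-spill approach from the Summary can be sketched as below. The function name, chunk handling, and memory threshold are illustrative assumptions, not the PR's actual `convertEncodingStreaming()` implementation (which lives inside the reader and also handles multi-byte chunk boundaries, per the surrogate-pair discussion above):

```php
<?php
// Hypothetical sketch: read the source in fixed-size chunks, convert each
// chunk to UTF-8, and write the result to a php://temp stream. PHP keeps
// php://temp in memory up to maxmemory bytes, then transparently spills
// the stream to a temporary file on disk.
function convertEncodingToTempStream(string $path, string $fromEncoding, int $chunkSize = 65536)
{
    // Assumption: keep at most ~4MB in memory before spilling to disk.
    $out = fopen('php://temp/maxmemory:' . (4 * 1024 * 1024), 'r+b');
    $in = fopen($path, 'rb');
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === '' || $chunk === false) {
            break;
        }
        // Caveat: for multi-byte source encodings, the chunk boundary must
        // not split a character (see the surrogate-pair discussion above);
        // this sketch is only safe as-is for single-byte encodings.
        fwrite($out, mb_convert_encoding($chunk, 'UTF-8', $fromEncoding));
    }
    fclose($in);
    rewind($out);
    return $out; // caller parses CSV from this UTF-8 stream
}
```

Peak memory is then bounded by one chunk plus the `maxmemory` threshold, rather than the whole converted file.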