CASSANDRA-21134: Direct I/O for background SSTable writes#4815
CASSANDRA-21134: Direct I/O for background SSTable writes#4815samueldlightfoot wants to merge 3 commits into
Conversation
3349322 to
005f4e1
Compare
b95f990 to
3d95be1
Compare
Adds an opt-in O_DIRECT write path for background SSTable producers, bypassing the OS page cache for data that is unlikely to be re-read soon after being written. Memtable flushes remain buffered. Enabled via two new YAML knobs: - background_write_disk_access_mode: standard (default) | direct - direct_write_buffer_size: 1MiB (default; aligned up to FS block size, auto-grown to chunk_length) The path is gated by config, table compression being enabled, and an OperationType allowlist in DataComponent. The allowlist is exhaustive: any new OperationType with writesData=true that is not classified will fail static initialization. Operations on the DIO path: COMPACTION, MAJOR_COMPACTION, TOMBSTONE_COMPACTION, ANTICOMPACTION, GARBAGE_COLLECT, CLEANUP, UPGRADE_SSTABLES, WRITE, STREAM (chunked receiver only). Operations off the DIO path: - FLUSH (policy: just-flushed data is hot, keep in page cache) - SCRUB (correctness: tryAppend needs mark/resetAndTruncate) - Zero-Copy Streaming (bypasses DataComponent.buildWriter) - Uncompressed writers (only CompressedSequentialWriter has a DIO subclass in this change) StartupChecks fails fast if 'direct' is requested on a platform/FS that does not support O_DIRECT. patch by Sam Lightfoot; reviewed by <reviewers> for CASSANDRA-21134
a49806c to
ca8ef09
Compare
Fail fast on unsupported background_write_disk_access_mode values, drop "scrub" from the DIO operation docs (it's UNSUPPORTED_CORRECTNESS), and tighten AntiCompactionTest/CompactionsTest to assert Direct I/O actually engages rather than passing spuriously.
…st.yaml Add background_write_disk_access_mode: direct to cassandra_latest.yaml so new-install/latest configurations exercise Direct I/O by default, and keep the test latest config in sync (test/conf/latest_diff.yaml and the DTEST_JVM_DTESTS_USE_LATEST block in InstanceConfig.java). The cassandra.yaml default remains standard.
aweisberg
left a comment
There was a problem hiding this comment.
Mostly just nits but at least one or two things worth doing and then I shared a pastebin with you of an analysis of the DirectCompressedSequentialWriterTest code and boundary coverage that is worth considering.
| # (compaction, streaming, scrub, cleanup, repair, etc.). The allowed values are: | ||
| # - standard: use buffered I/O (default) | ||
| # - direct: use direct I/O, bypassing the OS page cache | ||
| # Note: Only applies to compressed tables. Uncompressed tables always use buffered I/O. |
There was a problem hiding this comment.
You don't really need to prefix with Note:. It's not really information bearing.
|
|
||
| if (providedMode == DiskAccessMode.auto) | ||
| { | ||
| providedMode = DiskAccessMode.standard; |
There was a problem hiding this comment.
Auto seems pretty unconditional here so the check above to log that it changed doesn't seem to mean much? I guess it clarifies that AUTO always resolves to STANDARD. Not sure when it would change in the future.
It's fine just something I noted.
|
|
||
| this.sstableMetadataCollector = sstableMetadataCollector; | ||
| crcMetadata = new ChecksumWriter(new DataOutputStream(Channels.newOutputStream(channel))); | ||
| crcMetadata = new ChecksumWriter(new DataOutputStream(Channels.newOutputStream(this.channel))); |
There was a problem hiding this comment.
Why add this? I think our code style even mentions avoiding this when it isn't needed.
| && isDirectWriteSupported(operationType)) | ||
| { | ||
| if (directWriteLogged.add(operationType)) | ||
| logger.info("Direct I/O writer activated for {}", operationType); |
There was a problem hiding this comment.
This is a bit quirky for an info level log? It means it happened once, but doesn't tell you if other attempts in different paths would also have succeeded.
| // fit contiguously. flushCompleteBlocks aligns down to blockSize, leaving up to | ||
| // (blockSize - 1) bytes carried over via compact(); the floor accounts for that. | ||
| private ByteBuffer writeBuffer; | ||
| private int writeBufferPosition = 0; |
There was a problem hiding this comment.
Trying to figure out why this is tracked separately from ByteBuffer.position()
| } | ||
|
|
||
| @Test | ||
| public void testInitializeBackgroundWriteDiskAccessModeRejectsZeroBufferSize() |
There was a problem hiding this comment.
Technically haven't exercised the checks of the paths of isDirectIOSupported which would require being able to intercept that and provide test answers to whether paths support that. Maybe something that can be done with jimfs.
The auto path which we discussed might be worth changing.
| + ") WITH compression = {'enabled':'true', 'class':'LZ4Compressor'}"; | ||
| String insert = "INSERT INTO " + qualifiedTable + " (k, v) VALUES (?, ?)"; | ||
|
|
||
| DirectIoTestUtils.activatedDirectWriteOperations().clear(); |
There was a problem hiding this comment.
Oof. I get the pain of trying to observe this side effect. This is something jimfs might be able to help with, a mock file system would let you spy on what it is doing.
|
|
||
| if (DatabaseDescriptor.getBackgroundWriteDiskAccessMode() == DiskAccessMode.direct) | ||
| { | ||
| Set<OperationType> activated = DirectIoTestUtils.activatedDirectWriteOperations(); |
There was a problem hiding this comment.
Should assert the negative as well when non-direct is requested?
|
|
||
| import static org.junit.Assert.assertTrue; | ||
|
|
||
| public class DataComponentDirectWriteSelectionTest |
There was a problem hiding this comment.
The negative checks are enumerated instead of being derived from including everything that is not in SUPPORTED_OPS so it's not going to exercise everything as new components are added.
It also doesn't fail if an unrecognized component should succeed.
|
|
||
| import static org.assertj.core.api.Assertions.assertThat; | ||
|
|
||
| public class StreamingDirectWriteTest extends TestBaseImpl |
There was a problem hiding this comment.
Curious why Streaming needed a dedicated test. Incoming buffers is just incoming buffers right? Not complaining about the coverage though.
CASSANDRA-21134: Direct I/O for background SSTable writes
Summary
Opt-in
O_DIRECTwrite path for background SSTable producers, bypassing the OS page cache for write-once read-never data. Memtable flushes remain buffered (hot data benefits from the cache).Gated by (1) config, (2) table compression enabled, (3) an
OperationTypeallowlist (DataComponent#DIRECT_WRITE_SUPPORT). Selection is central inDataComponent.buildWriter; producers are unchanged.Performance
Benchmark results are attached to the JIRA. Significant p99 read latency improvements under throttled compaction.
Operations covered (DIO eligible)
WRITECQLSSTableWriterDaemonTest(parameterised on disk mode)COMPACTIONCompactionsTest(parameterised on disk mode)MAJOR_COMPACTIONCompactionsTest.testCompactionWithSizeLimitedRewriterCLEANUP,GARBAGE_COLLECT,TOMBSTONE_COMPACTION,UPGRADE_SSTABLESCompactionsTest(transitive)ANTICOMPACTIONAntiCompactionTest.testAntiCompactionWithCompressedTableAndDirectWritesSTREAMStreamingDirectWriteTestThe allowlist is exhaustive: any new
OperationTypewithwritesData == truethat is not classified fails static initialization (AssertionError).Operations NOT covered
FLUSH(memtable)UNSUPPORTED_POLICYDataComponentDirectWriteSelectionTestSCRUBUNSUPPORTED_CORRECTNESStryAppendneedsmark()/resetAndTruncate(), which DIO cannot satisfy.DataComponentDirectWriteSelectionTestDataComponent.buildWriter.StreamingDirectWriteTest(disables ZCS)CompressedSequentialWriterhas a DIO subclass.DataComponentDirectWriteSelectionTest(compression gate)Removing an
UNSUPPORTED_CORRECTNESSentry requires code changes;UNSUPPORTED_POLICYis a policy decision.Key code
io/DirectIoSupport.java— eligibility enum (SUPPORTED/UNSUPPORTED_CORRECTNESS/UNSUPPORTED_POLICY/NOT_APPLICABLE).io/sstable/format/DataComponent.java— selection, allowlist, exhaustiveness check;per-op first-activation log.
io/compress/DirectCompressedSequentialWriter.java— new writer; aligned buffers,mark()/resetAndTruncate()unsupported.io/compress/CompressedSequentialWriter.java— refactored so the DIO subclass canoverride the write-chunk path;
writeChunkcontract documented and asserted.config/Config.java,config/DatabaseDescriptor.java— new knobs, validation, startupwiring; buffer size aligned to FS block size, auto-grown to chunk length.
service/StartupChecks.java— fails fast ifdirectis requested on a platform/FSthat does not support
O_DIRECT.Tests introduced
vs. the buffered writer over compressors × chunk lengths × random payload sizes;
seed-logged for repro (
DirectCompressedSequentialWriterTest).flushCompleteBlocks(DirectCompressedSequentialWriterTest).OperationTypeeligibility, allowlist exhaustiveness,compression gate, config-mode gate (
DataComponentDirectWriteSelectionTest).WRITE(CQLSSTableWriterDirectWriteTest),STREAM(in-JVM dtest,StreamingDirectWriteTest), and the compaction family +ANTICOMPACTION(extendedCompactionsTest,AntiCompactionTest).block-size rejection, once-per-JVM undersized-buffer warn, SCRUB-gating canaries
(
DirectCompressedSequentialWriterTest).BufferPoolMXBeancheck that the off-heap alignedbuffer is returned on close (
DirectCompressedSequentialWriterTest).in
DatabaseDescriptorTest.Not in scope
Reviewer notes
Findings from the Cassandra bug-hunting skills (Opus 4.7 xhigh & kimi-k2.6:cloud) were addressed prior to
review.