<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>DevZip</title>
<link rel="stylesheet" href="README.css" />
</head>
<body>
<div class="page-shell">
<header class="hero">
<div class="hero-copy">
<p class="eyebrow">Windows-first archival engine</p>
<h1>DevZip</h1>
<p class="lede">
DevZip is an experimental archiver built around the <span class="code">.dvz</span> container, with
size-first compression for large mixed datasets. The goal is to beat strong
desktop archivers on representative corpora while keeping the shipping engine
commercially usable. Success is measured by end size first; the Weissman score
is kept only as a secondary comparison.
</p>
</div>
<div class="hero-panel">
<p class="panel-label">Compression overhaul — per-type sweep</p>
<h2><span class="code">-18.0%</span> vs 7z-lzma2</h2>
<p class="panel-value">44.561 MB</p>
<p class="panel-caption">
DevZIP <span class="code">--level max</span> compresses the per-type sweep to
<span class="code">44.561 MB</span> vs <span class="code">54.317 MB</span> for
<span class="code">7z-lzma2</span>, an aggregate <span class="code">-18.0%</span> win,
and now leads in every category (text, code, executables, JPEG, PNG).
All results round-trip byte-exact.
</p>
</div>
</header>
<main>
<section class="metrics-grid" aria-label="Benchmark summary cards">
<article class="metric-card">
<p class="metric-label">Baseline lane</p>
<p class="metric-value"><span class="code">7z-lzma2</span></p>
<p class="metric-note">Aggregate <span class="code">54.317 MB</span> across the per-type sweep (max-compression tuned).</p>
</article>
<article class="metric-card">
<p class="metric-label">DevZIP <span class="code">max</span></p>
<p class="metric-value">44.561 MB</p>
<p class="metric-note"><span class="code">-18.0%</span> vs 7z-lzma2 (<span class="code">-17.5%</span> vs best 7-Zip method per type).</p>
</article>
<article class="metric-card">
<p class="metric-label">DevZIP <span class="code">balanced</span> (default)</p>
<p class="metric-value">47.977 MB</p>
<p class="metric-note"><span class="code">-11.7%</span> vs 7z-lzma2 without the slow preflate PNG path.</p>
</article>
<article class="metric-card">
<p class="metric-label">Categories won</p>
<p class="metric-value">5 / 5</p>
<p class="metric-note">Text, code, executables, JPEG, and PNG all beat 7-Zip; every row round-trips byte-exact.</p>
</article>
</section>
<section class="content-section">
<div class="section-heading">
<p class="eyebrow">Benchmark snapshot</p>
<h2>Per-type sweep vs 7-Zip</h2>
<p>
Compression-overhaul sweep mirrored from
<span class="code">docs/benchmarks/baseline-results.md</span>. Sizes in MB; "Win"
is negative when DevZIP is smaller. Every DevZIP figure was extracted and compared
byte-for-byte against the source.
</p>
</div>
<div class="table-wrap">
<table>
<thead>
<tr>
<th>Category</th>
<th>7z-lzma2 (MB)</th>
<th>DevZIP balanced (MB)</th>
<th>Win</th>
<th>DevZIP max (MB)</th>
<th>Win</th>
</tr>
</thead>
<tbody>
<tr><td>Text / structured</td><td>1.520</td><td>0.901</td><td>-40.7%</td><td>0.901</td><td>-40.7%</td></tr>
<tr><td>JPEG photos</td><td>24.337</td><td>19.018</td><td>-21.9%</td><td>19.018</td><td>-21.9%</td></tr>
<tr><td>PNG lossless</td><td>21.002</td><td>20.960</td><td>-0.2%</td><td>17.550</td><td>-16.4%</td></tr>
<tr><td>Code</td><td>0.193</td><td>0.162</td><td>-16.1%</td><td>0.162</td><td>-15.9%</td></tr>
<tr><td>Executables (PE)</td><td>7.265</td><td>6.936</td><td>-4.5%</td><td>6.930</td><td>-4.6%</td></tr>
<tr class="highlight-row"><td><strong>Aggregate</strong></td><td>54.317</td><td>47.977</td><td>-11.7%</td><td>44.561</td><td>-18.0%</td></tr>
</tbody>
</table>
</div>
<div class="notes-grid">
<article class="note-card">
<h3>How to read it</h3>
<p>Lower end size is better; a negative "Win" means DevZIP is smaller than 7z-lzma2.</p>
<p><span class="code">balanced</span> is the default level; <span class="code">max</span> adds the preflate PNG path. Against the best 7-Zip method per type, <span class="code">max</span> is still <span class="code">-17.5%</span> in aggregate.</p>
</article>
<article class="note-card">
<h3>Current status</h3>
<p>Format <span class="code">.dvz</span> v4 ships per-format recompressors (brunsli JPEG transcode, preflate deflate-undo), architecture-aware BCJ filters, similarity-ordered solid packing, and a <span class="code">best-of-N</span> backend that runs LZMA2, ZPAQ-5, PPMd, and libbsc BWT against each other per solid group and keeps the smallest output.</p>
<p>The overhaul flipped the categories that used to lag: JPEG (<span class="code">-1.5% -> -21.9%</span>), PNG (<span class="code">-0.95% -> -16.4%</span>), and binaries (<span class="code">+0.32% loss -> -4.6% win</span>), moving the aggregate from break-even into a double-digit win. All transforms self-verify at compress time and round-trip byte-exact.</p>
</article>
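<article class="note-card">
<h3>Worked example: the "Win" column</h3>
<p>A minimal sketch of the arithmetic behind the "Win" column, assuming it is the relative size change against <span class="code">7z-lzma2</span>. The function name <span class="code">win_percent</span> is illustrative and is not part of the harness; the sizes are taken from the aggregate row of the table above.</p>
<pre><code>def win_percent(devzip_mb, baseline_mb):
    """Relative size change in percent; negative means DevZIP is smaller."""
    return (devzip_mb - baseline_mb) / baseline_mb * 100.0


baseline = 54.317  # 7z-lzma2 aggregate (MB)
for level, size_mb in (("balanced", 47.977), ("max", 44.561)):
    print(f"{level}: {win_percent(size_mb, baseline):+.1f}%")
</code></pre>
<p>With the rounded sizes shown in the table this prints <span class="code">-11.7%</span> for <span class="code">balanced</span> and <span class="code">-18.0%</span> for <span class="code">max</span>.</p>
</article>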
</div>
</section>
<section class="content-section">
<div class="section-heading">
<p class="eyebrow">Levels</p>
<h2>Speed / size tradeoff</h2>
<p>
<span class="code">--level</span> selects how hard the engine works. The container
format is identical across levels; higher levels enable more (and more expensive)
recompressors and backends. Default is <span class="code">balanced</span>.
</p>
</div>
<div class="table-wrap">
<table>
<thead>
<tr>
<th>Level</th>
<th>Transforms</th>
<th>Backend pool</th>
<th>Use when</th>
</tr>
</thead>
<tbody>
<tr><td><span class="code">fast</span></td><td>BCJ + delta filters</td><td>LZMA2</td><td>Speed first</td></tr>
<tr class="highlight-row"><td><span class="code">balanced</span></td><td>+ brunsli, code dict, PNG IDAT strip</td><td>LZMA2 + ZPAQ-5</td><td>Wins every category, no 5-minute PNG path</td></tr>
<tr><td><span class="code">max</span></td><td>+ preflate (PNG / zip / gzip)</td><td>+ PPMd</td><td>Smallest size</td></tr>
<tr><td><span class="code">insane</span></td><td>+ create-time roundtrip verify</td><td>+ libbsc BWT</td><td>Smallest size, paranoid verification</td></tr>
</tbody>
</table>
</div>
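<div class="notes-grid">
<article class="note-card">
<h3>Sketch: reading the levels table</h3>
<p>The "+" entries in the table read as additions on top of the previous level. The snippet below is one way to express that reading, assuming the backend pool is strictly cumulative. The names <span class="code">LEVELS</span>, <span class="code">ADDED_BACKENDS</span>, and <span class="code">backend_pool</span> are hypothetical; the engine's real configuration structures are not shown here.</p>
<pre><code># Hypothetical mirror of the levels table above, not the engine's own config.
LEVELS = ("fast", "balanced", "max", "insane")

# Backends each level adds on top of the level before it.
ADDED_BACKENDS = {
    "fast": ["LZMA2"],
    "balanced": ["ZPAQ-5"],
    "max": ["PPMd"],
    "insane": ["libbsc BWT"],
}


def backend_pool(level):
    """Return the cumulative backend pool for a level, in table order."""
    pool = []
    for name in LEVELS[: LEVELS.index(level) + 1]:
        pool.extend(ADDED_BACKENDS[name])
    return pool


for level in LEVELS:
    print(level, ":", ", ".join(backend_pool(level)))
</code></pre>
</article>
</div>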
<div class="section-heading" style="margin-top: 2rem;">
<h2>Compression time (seconds)</h2>
<p>DevZIP trades time for size: ZPAQ-5 context mixing and the preflate PNG path are the dominant costs. Codecs run concurrently within each solid group, and groups compress in parallel across cores, so wall-clock time tracks the slowest codec rather than the sum of all codecs.</p>
</div>
<div class="table-wrap">
<table>
<thead>
<tr><th>Category</th><th>7z-lzma2</th><th>DevZIP balanced</th><th>DevZIP max</th></tr>
</thead>
<tbody>
<tr><td>Text</td><td>2.7</td><td>94.5</td><td>83.0</td></tr>
<tr><td>Code</td><td>0.1</td><td>2.1</td><td>2.1</td></tr>
<tr><td>Executables</td><td>6.7</td><td>91.1</td><td>89.6</td></tr>
<tr><td>JPEG</td><td>1.8</td><td>33.4</td><td>89.1</td></tr>
<tr><td>PNG</td><td>1.2</td><td>54.1</td><td>309.7</td></tr>
<tr class="highlight-row"><td><strong>Total</strong></td><td>12.5</td><td>275.2</td><td>573.8</td></tr>
</tbody>
</table>
</div>
<div class="notes-grid">
<article class="note-card">
<h3>What changed in the overhaul</h3>
<p>New per-format recompressors (brunsli JPEG, preflate deflate-undo), architecture-aware BCJ for PE binaries, a content-adaptive LZMA2 literal model, similarity-ordered solid packing, and a <span class="code">best-of-N</span> backend (LZMA2 / ZPAQ-5 / PPMd / libbsc).</p>
</article>
<article class="note-card">
<h3>What it means</h3>
<p>DevZIP now beats 7-Zip in every measured category and is <span class="code">18.0%</span> smaller in aggregate at <span class="code">max</span>. Use <span class="code">--level fast</span> when speed matters more than the last few percent of size.</p>
</article>
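<article class="note-card">
<h3>Sketch: best-of-N selection</h3>
<p>The <span class="code">best-of-N</span> idea can be illustrated in a few lines. This is a sketch under stated assumptions, not the engine's implementation: Python's standard-library codecs stand in for the real pool (LZMA2, ZPAQ-5, PPMd, libbsc), and <span class="code">best_of_n</span> is a hypothetical name. Each codec compresses the same solid group in its own worker, the smallest output wins, and the winner is decoded once to confirm a byte-exact round trip.</p>
<pre><code>import bz2
import lzma
import zlib
from concurrent.futures import ThreadPoolExecutor

# Stand-in codecs: name -> (compress, decompress). Not the engine's pool.
CODECS = {
    "deflate": (lambda data: zlib.compress(data, 9), zlib.decompress),
    "bzip2": (lambda data: bz2.compress(data, 9), bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def best_of_n(group):
    """Compress one solid group with every codec and keep the smallest."""
    with ThreadPoolExecutor(max_workers=len(CODECS)) as pool:
        futures = {
            name: pool.submit(compress, group)
            for name, (compress, _) in CODECS.items()
        }
        packed = {name: future.result() for name, future in futures.items()}
    winner = min(packed, key=lambda name: len(packed[name]))
    # Self-verify: the winning stream must decode back to the exact input.
    decompress = CODECS[winner][1]
    if decompress(packed[winner]) != group:
        raise ValueError(f"{winner} failed the round-trip check")
    return winner, packed[winner]


sample = b"DevZip solid group sample. " * 4096
name, blob = best_of_n(sample)
print(name, len(sample), len(blob))
</code></pre>
<p>Because every candidate runs to completion before one is chosen, the cost of a larger pool is time, not size: adding a codec can only keep the result the same or make it smaller.</p>
</article>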
</div>
</section>
<section class="content-section">
<div class="section-heading">
<p class="eyebrow">Competitive analysis</p>
<h2>Versus the best-of-the-best</h2>
<p>Benchmarked against the strongest compressors available on the same corpora, with every result round-trip verified. General-purpose codecs compress a solid TAR per corpus; <span class="code">cjxl</span>/<span class="code">zopflipng</span> run per file. Figures are aggregate end size in MB; smaller is better.</p>
</div>
<div class="table-wrap">
<table>
<thead>
<tr><th>Tool</th><th>Aggregate (MB)</th><th>vs DevZIP max</th></tr>
</thead>
<tbody>
<tr class="highlight-row"><td><strong>DevZIP max</strong></td><td><strong>44.561</strong></td><td>—</td></tr>
<tr><td>kanzi -l9 (context mixing)</td><td>52.503</td><td>+17.8%</td></tr>
<tr><td>7z-lzma2</td><td>54.317</td><td>+21.9%</td></tr>
<tr><td>zstd --ultra -22</td><td>55.803</td><td>+25.2%</td></tr>
<tr><td>brotli -11</td><td>57.365</td><td>+28.7%</td></tr>
</tbody>
</table>
</div>
<div class="table-wrap">
<table>
<thead>
<tr><th>Corpus</th><th>7z-lzma2</th><th>zstd -22</th><th>brotli -11</th><th>kanzi -l9</th><th>per-type</th><th>DevZIP max</th></tr>
</thead>
<tbody>
<tr><td>text</td><td>1.520</td><td>1.604</td><td>1.599</td><td>0.918</td><td>—</td><td><strong>0.901</strong></td></tr>
<tr><td>code</td><td>0.193</td><td>0.199</td><td>0.192</td><td><em>0.151</em></td><td>—</td><td>0.162</td></tr>
<tr><td>exe</td><td>7.265</td><td>8.679</td><td>8.253</td><td><em>6.792</em></td><td>—</td><td>6.930</td></tr>
<tr><td>jpeg</td><td>24.337</td><td>24.323</td><td>25.579</td><td>23.661</td><td>cjxl 20.912</td><td><strong>19.018</strong></td></tr>
<tr><td>png</td><td>21.002</td><td>20.998</td><td>21.742</td><td>20.981</td><td>zopflipng 19.629</td><td><strong>17.550</strong></td></tr>
<tr class="highlight-row"><td><strong>Aggregate</strong></td><td>54.317</td><td>55.803</td><td>57.365</td><td>52.503</td><td>—</td><td><strong>44.561</strong></td></tr>
</tbody>
</table>
</div>
<div class="notes-grid">
<article class="note-card">
<h3>Why DevZIP wins</h3>
<p>Content-aware transforms, not just a stronger coder: brunsli beats JPEG XL (<span class="code">cjxl</span>) on JPEG (19.018 vs 20.912 MB) and preflate beats <span class="code">zopflipng</span> on PNG (17.550 vs 19.629 MB) while staying byte-exact.</p>
</article>
<article class="note-card">
<h3>The honest gap</h3>
<p>On raw code/exe streams, <span class="code">kanzi -l9</span> (Apache-2.0 context mixing) edges DevZIP's current backend, and the <span class="code">paq8px</span> ceiling on code is 0.109 MB. A kanzi-class CM codec is the next <span class="code">best-of-N</span> candidate. See <span class="code">docs/benchmarks/competitive-landscape.md</span>.</p>
</article>
</div>
</section>
<section class="content-section">
<div class="section-heading">
<p class="eyebrow">Pre-overhaul reference</p>
<h2>Other categories (older mixed corpus)</h2>
<p>
These categories were not part of the overhaul sweep above and use the older
standardized datasets. Behavior for them is unchanged by the overhaul (the raw
delta filter still applies), so they are kept here as a reference. Text, code,
JPEG, PNG, and executables are covered by the per-type sweep at the top of this
page. Lower end size is better.
</p>
</div>
<h3>Raw bitmaps (uncompressed)</h3>
<p class="table-desc">Kodak Photo CD 24-bit BMPs (8 images, 144 MB raw).</p>
<div class="table-wrap">
<table>
<thead>
<tr><th>Tool</th><th>End Size (MB)</th><th>Time (s)</th><th>Delta vs 7z-lzma2</th></tr>
</thead>
<tbody>
<tr class="highlight-row"><td><span class="code">devzip-native</span></td><td>59.244</td><td>181.9</td><td>-23.84%</td></tr>
<tr><td><span class="code">7z-ppmd</span></td><td>66.346</td><td>60.2</td><td>-14.71%</td></tr>
<tr><td><span class="code">7z-bzip2</span></td><td>71.591</td><td>114.1</td><td>-7.96%</td></tr>
<tr><td><span class="code">7z-lzma2</span></td><td>77.786</td><td>103.1</td><td>+0.00%</td></tr>
<tr><td><span class="code">winrar</span></td><td>85.847</td><td>210.5</td><td>+10.36%</td></tr>
<tr><td><span class="code">7z-deflate</span></td><td>98.053</td><td>147.9</td><td>+26.05%</td></tr>
<tr><td><span class="code">windows-zip</span></td><td>105.510</td><td>7.0</td><td>+35.64%</td></tr>
</tbody>
</table>
</div>
<h3>Video (raw RGB)</h3>
<p class="table-desc">Synthetic uncompressed video — 320×240 RGB, 75 frames (1 file, 16.5 MB raw).</p>
<div class="table-wrap">
<table>
<thead>
<tr><th>Tool</th><th>End Size (MB)</th><th>Time (s)</th><th>Delta vs 7z-lzma2</th></tr>
</thead>
<tbody>
<tr><td><span class="code">winrar</span></td><td>0.032</td><td>1.2</td><td>-98.27%</td></tr>
<tr class="highlight-row"><td><span class="code">devzip-native</span></td><td>1.854</td><td>15.0</td><td>-0.94%</td></tr>
<tr><td><span class="code">7z-lzma2</span></td><td>1.871</td><td>3.4</td><td>+0.00%</td></tr>
<tr><td><span class="code">7z-bzip2</span></td><td>5.231</td><td>11.8</td><td>+179.53%</td></tr>
<tr><td><span class="code">7z-ppmd</span></td><td>8.422</td><td>3.4</td><td>+350.06%</td></tr>
<tr><td><span class="code">7z-deflate</span></td><td>11.136</td><td>13.7</td><td>+495.07%</td></tr>
<tr><td><span class="code">windows-zip</span></td><td>12.539</td><td>1.5</td><td>+570.05%</td></tr>
</tbody>
</table>
</div>
<h3>Random / high-entropy</h3>
<p class="table-desc">Pseudo-random binary data — worst case for all compressors (5 files, 10 MB raw).</p>
<div class="table-wrap">
<table>
<thead>
<tr><th>Tool</th><th>End Size (MB)</th><th>Time (s)</th><th>Delta vs 7z-lzma2</th></tr>
</thead>
<tbody>
<tr class="highlight-row"><td><span class="code">devzip-native</span></td><td>10.486</td><td>10.0</td><td>-0.01%</td></tr>
<tr><td><span class="code">7z-lzma2</span></td><td>10.487</td><td>3.4</td><td>+0.00%</td></tr>
<tr><td><span class="code">7z-deflate</span></td><td>10.487</td><td>8.4</td><td>+0.00%</td></tr>
<tr><td><span class="code">windows-zip</span></td><td>10.490</td><td>1.1</td><td>+0.03%</td></tr>
<tr><td><span class="code">winrar</span></td><td>10.506</td><td>2.3</td><td>+0.19%</td></tr>
<tr><td><span class="code">7z-bzip2</span></td><td>10.515</td><td>9.6</td><td>+0.27%</td></tr>
<tr><td><span class="code">7z-ppmd</span></td><td>10.683</td><td>6.0</td><td>+1.87%</td></tr>
</tbody>
</table>
</div>
<div class="notes-grid">
<article class="note-card">
<h3>Key findings</h3>
<p>On the overhaul sweep, DevZIP now leads <span class="code">7z-lzma2</span> in every category: text (<span class="code">-40.7%</span>), JPEG (<span class="code">-21.9%</span>), PNG (<span class="code">-16.4%</span>), code (<span class="code">-16.1%</span>), executables (<span class="code">-4.6%</span>) — aggregate <span class="code">-18.0%</span>.</p>
<p>The pre-overhaul reference categories above (raw bitmaps, video, random) were not re-measured. Raw bitmaps remain the original strong win (<span class="code">~-24%</span>) via the delta filter, and high-entropy random data stays at parity, as expected.</p>
</article>
<article class="note-card">
<h3>Dataset locations</h3>
<p>Overhaul sweep corpora live under <span class="code">sample-data/bench/</span> (text, code, exe, jpeg, png). The older reference datasets are under <span class="code">sample-data/</span>.</p>
<p>Rebuild the sweep corpus: <span class="code">python benchmarks/tools/build_overhaul_corpus.py</span></p>
<p>Rerun the sweep: <span class="code">powershell -File benchmarks/tools/overhaul_sweep.ps1</span></p>
</article>
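<article class="note-card">
<h3>Sketch: a delta filter on raw pixels</h3>
<p>The raw-bitmap win is attributed above to the delta filter. As a rough illustration of that kind of filter, assuming a byte-wise delta with a stride of 3 (one 24-bit pixel), the snippet below subtracts each byte from the same channel of the previous pixel before compression. The function names and the stride are assumptions for this example, and Python's <span class="code">lzma</span> module stands in for the engine's backend.</p>
<pre><code>import lzma


def delta_encode(data, stride):
    """Replace each byte with its difference from the byte `stride` back."""
    out = bytearray(data)
    for i in range(stride, len(data)):
        out[i] = (data[i] - data[i - stride]) % 256
    return bytes(out)


def delta_decode(data, stride):
    """Invert delta_encode by accumulating the differences front to back."""
    out = bytearray(data)
    for i in range(stride, len(out)):
        out[i] = (out[i] + out[i - stride]) % 256
    return bytes(out)


# Synthetic 24-bit gradient: three interleaved channels that each change
# by a fixed step from pixel to pixel.
pixels = bytes(((i // 3) * (1 + i % 3)) % 256 for i in range(3 * 64 * 64))
filtered = delta_encode(pixels, stride=3)
assert delta_decode(filtered, stride=3) == pixels  # byte-exact round trip
print("plain:", len(lzma.compress(pixels)))
print("delta:", len(lzma.compress(filtered)))
</code></pre>
<p>On smooth image data the filtered stream is dominated by small, repeating differences, which is what gives a general-purpose backend more to work with. On high-entropy input the filter has nothing to exploit.</p>
</article>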
</div>
</section>
<section class="content-section">
<div class="section-heading">
<p class="eyebrow">Project layout</p>
<h2>What the project includes</h2>
</div>
<div class="feature-grid">
<article class="feature-card">
<h3><span class="code">apps/windows-ui/DevZip.App</span></h3>
<p>WPF shell for drag-and-drop compression and extraction.</p>
</article>
<article class="feature-card">
<h3><span class="code">native/engine</span></h3>
<p>Native C++ compression engine, CLI, and tests.</p>
</article>
<article class="feature-card">
<h3><span class="code">benchmarks</span></h3>
<p>Corpus manifests, wrappers, and comparison harness.</p>
</article>
<article class="feature-card">
<h3><span class="code">docs/format</span></h3>
<p><span class="code">.dvz</span> container documentation.</p>
</article>
<article class="feature-card">
<h3><span class="code">docs/specs</span></h3>
<p>Design, licensing, and product notes.</p>
</article>
</div>
</section>
<section class="content-section split-layout">
<div class="stack">
<div class="section-heading">
<p class="eyebrow">Development notes</p>
<h2>What currently ships</h2>
</div>
<ul class="bullet-list">
<li>A native C++ engine and CLI that write readable <span class="code">.dvz</span> v4 archives, with per-format recompressors (brunsli JPEG, preflate deflate-undo), architecture-aware BCJ filters, and a <span class="code">best-of-N</span> backend that runs LZMA2, ZPAQ-5, PPMd, and libbsc BWT against each other per solid group and keeps the smallest result.</li>
<li><span class="code">--level fast | balanced | max | insane</span> tunes the speed/size tradeoff (default <span class="code">balanced</span>), exposed in both the CLI and the WPF UI.</li>
<li>A WPF launcher that shells out to <span class="code">devzip_cli.exe</span>.</li>
<li>A corpus benchmark harness for tracking aggregate end size, timing, and optional Weissman context.</li>
</ul>
</div>
<div class="stack">
<div class="section-heading">
<p class="eyebrow">Prerequisites</p>
<h2>What you need</h2>
</div>
<div class="subpanel">
<h3>Full native/UI build</h3>
<ul class="bullet-list">
<li>.NET 8 SDK</li>
<li>CMake 3.28+</li>
<li>MSVC or LLVM with C++20 support</li>
</ul>
</div>
<div class="subpanel">
<h3>Benchmark harness</h3>
<ul class="bullet-list">
<li>Python 3.11+</li>
<li>Optional tools on <span class="code">PATH</span> or in standard install locations: <span class="code">7z</span>, <span class="code">Rar.exe</span></li>
</ul>
</div>
</div>
</section>
<section class="content-section">
<div class="section-heading">
<p class="eyebrow">Quick start</p>
<h2>Run the benchmark harness</h2>
<p>
Build <span class="code">devzip_cli.exe</span> once before running the full matrix so the
<span class="code">devzip-native</span> lane is included in the results.
</p>
</div>
<div class="command-grid">
<article class="command-card">
<h3>Build the native lane</h3>
<pre><code>cmake --build native/engine/build --target devzip_cli devzip_tests</code></pre>
</article>
<article class="command-card">
<h3>Basic run</h3>
<pre><code>python benchmarks/run_benchmarks.py --manifest benchmarks/manifests/mixed-large.json</code></pre>
</article>
<article class="command-card">
<h3>Write the markdown report too</h3>
<pre><code>python benchmarks/run_benchmarks.py --manifest benchmarks/manifests/mixed-large.json --markdown-out docs/benchmarks/baseline-results.md</code></pre>
</article>
<article class="command-card">
<h3>Override tool paths</h3>
<pre><code>python benchmarks/run_benchmarks.py --manifest benchmarks/manifests/mixed-large.json --tool 7z-lzma2="C:\Program Files\7-Zip\7z.exe" --tool winrar="C:\Program Files\WinRAR\Rar.exe"</code></pre>
</article>
</div>
</section>
<section class="content-section boundary-panel">
<div class="section-heading">
<p class="eyebrow">Tooling boundary</p>
<h2>What stays out of shipping binaries</h2>
</div>
<p>
<span class="code">7-Zip</span> algorithm lanes, <span class="code">WinRAR</span>, and
native Windows ZIP are benchmark peers only. They must not be linked into shipping
binaries.
</p>
<p>
The native engine's <span class="code">balanced</span> default runs LZMA2 and ZPAQ-5 against each other per solid group and keeps the smaller result; <span class="code">max</span> and <span class="code">insane</span> add PPMd and libbsc BWT to the pool. All codecs and recompressors stay inside the <span class="code">.dvz</span> engine and are permissively licensed.
</p>
<p>
Weissman scoring remains enabled in the harness, but the benchmark gate is still based
on aggregate end size only.
</p>
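<p>
For context, the sketch below uses the commonly cited form of the Weissman score, a tool's compression ratio relative to a reference, scaled by the ratio of log times. The harness's exact formula and constants are not shown in this README, so the function name, the <span class="code">alpha</span> default, and the choice of natural log are assumptions. The inputs are the raw-bitmap figures quoted earlier on this page.
</p>
<pre><code>import math


def weissman(ratio, seconds, ref_ratio, ref_seconds, alpha=1.0):
    """Weissman score of a tool relative to a reference compressor."""
    return alpha * (ratio / ref_ratio) * (math.log(ref_seconds) / math.log(seconds))


raw_mb = 144.0  # Kodak BMP corpus size quoted in the raw-bitmap section
score = weissman(
    ratio=raw_mb / 59.244, seconds=181.9,          # devzip-native
    ref_ratio=raw_mb / 77.786, ref_seconds=103.1,  # 7z-lzma2
)
print(f"{score:.2f}")
</code></pre>
<p>
The log terms make the score unstable for runs near one second, which is one reason to keep it as context rather than as the gate.
</p>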
</section>
</main>
</div>
</body>
</html>