Releases: JuliaGPU/cuTile.jl
v0.2.1
cuTile v0.2.1
Merged pull requests:
v0.2.0
cuTile v0.2.0
Breaking changes: see https://juliagpu.org/post/2026-04-08-cutile_0.2/
Merged pull requests:
- Add alias-aware token threading for memory operations. (#89) (@shreyas-omkar)
- Matmul: Switch to trailing batch dims and allow mat-vec, vec-mat (#132) (@AntonOresten)
- Register ghost CGVals for ghost-type kernel arguments (#133) (@maleadt)
- Add mapreduce. (#134) (@maleadt)
- Fix @device_code_tiled crash on kernels with reduce. (#135) (@maleadt)
- Add atomic max, min, or, and, xor operations (#136) (@maleadt)
- Clean-up and optimize mapreduce (#137) (@maleadt)
- Support Constant(nothing) as kernel argument (#138) (@maleadt)
- Align keyword argument with cuTile Python (#139) (@maleadt)
- Switch to EnumX.jl. (#140) (@maleadt)
- Use atomic reductions to eliminate two-pass overhead (#141) (@maleadt)
- Store tile shapes in row-major order (#142) (@maleadt)
- Keep shapes until bytecode emission. (#143) (@maleadt)
- Update examples (#144) (@maleadt)
- Align FFT examples. (#145) (@maleadt)
- Declarative IR rewrite pattern infrastructure (#147) (@maleadt)
- Simplify using new IRStructurizer.jl tools. (#148) (@maleadt)
- Add loop parallel store optimization and DCE pass (#149) (@maleadt)
- Add normalize pass to lower Julia intrinsics to cuTile equivalents. (#150) (@maleadt)
- Update benchmarks scripts and results. (#151) (@maleadt)
- Improve the rewriter (#152) (@maleadt)
- Fix divisibility computation exceeding max_divisor. (#153) (@maleadt)
- Add simple constant folding (#154) (@maleadt)
- Add algebraic simplifications to get rid of 1-based index IR bloat (#156) (@maleadt)
- Extend reflection: support constants, add kwargs. (#157) (@maleadt)
- Canonicalize the IR before optimization (#158) (@maleadt)
- Add reshape to transparent ops in rewrite pattern matching (#159) (@maleadt)
- Constant propagation and identity fold for the rewrite framework (#160) (@maleadt)
- Add `cmp` strength reduction (#161) (@maleadt)
- Update benchmark times. (#162) (@maleadt)
- Add Mixture of Experts (MoE) example (#163) (@maleadt)
- Fix alias analysis not tracking pointer aliases through getfield/offset (#164) (@maleadt)
- Add a LICM pass (#165) (@maleadt)
- Optimize `gather` bounds mask calculation (#166) (@maleadt)
- Rewriter: intermediate type inference and transitive worklist propagation (#167) (@maleadt)
- Rewriter: add some algebra rules (#168) (@maleadt)
- Add AttentionFMHA example (#170) (@maleadt)
- Get rid of ScalarShape, represent as RowMajorShape. (#171) (@maleadt)
- Replace per-op floating-point mode args with a context macro. (#172) (@maleadt)
- Add printing functionality (+ rewriter splat functionality) (#173) (@maleadt)
- Use Julia-native for loops (#174) (@maleadt)
- Add support for debug info. (#175) (@maleadt)
- Add pow2 strength reduction (#177) (@maleadt)
- Add `isnan` overlay and fix ordering (#179) (@maleadt)
- README update and ct.where removal (#180) (@maleadt)
- Add support for `Type` arguments and fix static parameter codegen (#181) (@AntonOresten)
- Support for 1.13 (#182) (@maleadt)
Closed issues:
- Alias-aware token threading for better parallelism (#1)
- Port additional examples (#14)
- Support for overflow options in Integer cases (#59)
- Matmul broadcasting (#115)
- TagBot: Manual intervention needed for releases (#131)
- Layernorm regression: Token threading requires loop parallel store optimization (#146)
- Broadcast involving scalar fails (#169)
v0.1.2
cuTile v0.1.2
Merged pull requests:
- BFloat16 scalar support (#90) (@AntonOresten)
- Insert trailing 1s instead of leading before broadcasting (#91) (@AntonOresten)
- Allow early returns (#92) (@AntonOresten)
- Generic support for ghost types in `launch` (#93) (@AntonOresten)
- Fix scalar indexing on TileArrays and add codegen test (#94) (@AntonOresten)
- Add tile-indexed atomic operations (#96) (@AntonOresten)
- Support runtime values in ct.full (Intrinsics.constant) (#100) (@0xtaruhi)
- Fix bitwise ops encoding and add execution tests (#106) (@0xtaruhi)
- Add not_int intrinsic for for-loop iteration support (#107) (@0xtaruhi)
- Add CI (#110) (@maleadt)
- Fixes for CUDA 13.2 (#111) (@maleadt)
- Update execution requirements (#113) (@AntonOresten)
- Add Grid Definition to Quick Start (#116) (@dearmachi)
- Support reductions without dims arg. (#118) (@maleadt)
- Update to CompilerCaching v0.2. (#119) (@maleadt)
- Add architecture-specific optimization annotations (#122) (@maleadt)
- Overlay `fill`, `zeros`, and `ones` from Base (#123) (@AntonOresten)
- Allow `transpose` on 1D tiles, disallow on >2D (#124) (@AntonOresten)
- Fix codegen crash when SSAValue appears as a statement (#125) (@maleadt)
- Upgrade to IRStructurizer v0.2 (#126) (@maleadt)
- Fix @device_code_tiled for kernels with Constant arguments (#127) (@maleadt)
- Support destructuring arbitrary arguments (#128) (@maleadt)
- Provide host-level broadcast (#129) (@maleadt)
Closed issues:
- Launching kernels with arrays outside GPU (#98)
- Compiler failures with nested while loops and runtime values in ct.full (#99)
- TagBot trigger issue (#101)
- Nested while loops produce wrong results (#102)
- [Feature Request] Support `for` loops within the kernel (#103)
- InvalidTerminatorError: dot-broadcast inside `while` loop causes yield type mismatch (#104)
- Bitwise operations on tiles crash `tileiras` with `ProcessExited(3)` (#105)
- Pre-Blackwell support (#109)
- Architecture-specific configuration (#112)
- `transpose` semantics on `Tile` (#114)
- Splatted tile sizes to align with Julia functions (#117)
- TagBot doesn't work (#120)
- Downgrading deps makes tests fail (#130)
v0.1.1
What's Changed
- BFloat16 scalar support by @AntonOresten in #90
- Generic support for ghost types in `launch` by @AntonOresten in #93
- Fix scalar indexing on TileArrays and add codegen test by @AntonOresten in #94
- Support runtime values in ct.full (Intrinsics.constant) by @0xtaruhi in #100
- Allow early returns by @AntonOresten in #92
- Add tile-indexed atomic operations by @AntonOresten in #96
- Insert trailing 1s instead of leading before broadcasting by @AntonOresten in #91
Full Changelog: v0.1.0...v0.1.1
v0.1.0
What's Changed
- Port the batched matmul example. by @maleadt in #8
- IRStructurizer: Switch to `code_ircode` by @maleadt in #6
- Add and clean-up intrinsics by @maleadt in #12
- Lay out SSAArray as a StructOfArrays by @vchuravy in #5
- Deduce view index type from array fields. by @maleadt in #17
- Permute during reshape to support column-major storage. by @maleadt in #18
- Simplify type extraction for tile-only intrinsics. by @maleadt in #19
- Validate tile sizes to be pow2. by @maleadt in #20
- Fix index calculation in 2D gather/scatter by @AntonOresten in #23
- Add initial CI (only codegen tests) by @maleadt in #28
- Simplify constant emission. by @maleadt in #29
- Fix FFT example and add benchmark. by @maleadt in #30
- Expose entry hints through `launch` by @AntonOresten in #27
- Remove bad matmul-related overrides. by @maleadt in #33
- Expose load/store optimization hints by @AntonOresten in #32
- Refactor examples by @maleadt in #35
- Add UInt8 support to julia_to_tile_dtype by @arhik in #38
- Allow redefinition of kernel methods by @AntonOresten in #31
- Remove Int64 from `encode_signed_varint!` signature by @arhik in #42
- replace unsigned only and float only varint calls with `encode_varint!` by @arhik in #43
- IRStructurizer: handle merge phis for if-then regions by @AntonOresten in #53
- Support BFloat16 by @AntonOresten in #34
- Require terminators + validate them by @maleadt in #54
- Move IRStructurizer and FileCheck subpackages out of the repo. by @maleadt in #55
- Integrate with CompilerCaching.jl by @maleadt in #46
- Remove unnecessary rtol specifications. by @maleadt in #56
- Add more broadcastable operators. by @maleadt in #57
- Add and shorten some tests by @maleadt in #58
- Support Float8 types by @AntonOresten in #36
- feat: Add integer reduction support for reduce ops by @arhik in #37
- Add support for `PermutedDimsArray` by @AntonOresten in #48
- Add scan (prefix sum) operations support by @arhik in #39
- Switch to an idiomatic reduce/scan API by @maleadt in #60
- Add some more reduction-like operators by @maleadt in #61
- Support `map` and generalize `broadcast` by @maleadt in #62
- Compiler simplifications. by @maleadt in #63
- Use Base.transpose. by @maleadt in #64
- Fix and add reflection macros. by @maleadt in #65
- Replace ct.permute with permutedims by @maleadt in #67
- Encode ArraySpec fields as typevars. by @maleadt in #69
- Fix benchmark scripts. by @maleadt in #70
- Make sure constructed ghosts yield SSA values. by @maleadt in #73
- Add `order` kwarg to `load`/`store` for dimension reordering. by @maleadt in #72
- Fix constants and switch intrinsics to constant value inputs + `tfunc`s by @maleadt in #79
- Auto-match tile rank in load/store by @AntonOresten in #74
- Emit GlobalRef constants eagerly (#77) by @maleadt in #80
- Add scalar ops on Tile and TileArray. by @maleadt in #76
- Switch to `Base.size` for `TileArray` by @AntonOresten in #75
- Replace astype by idiomatic convert/broadcast. by @maleadt in #81
- Add support for assertions. by @maleadt in #83
- Split tests for better parallelism. by @maleadt in #84
- Rework intrinsics by @maleadt in #85
- Sanitize kernel names. by @maleadt in #86
- Support broadcasting unsafe_trunc and trunc by @maleadt in #88
- Pass constants as scalars, infer as constants. by @maleadt in #87
New Contributors
- @maleadt made their first contribution in #8
- @vchuravy made their first contribution in #5
- @AntonOresten made their first contribution in #23
- @arhik made their first contribution in #38
Full Changelog: https://github.com/JuliaGPU/cuTile.jl/commits/v0.1.0