Summary
Add an autotuning infrastructure in TornadoVM to enable automatic exploration and selection of high-performance kernel execution configurations, similar in concept to triton.autotune in the Triton GPU programming framework.
See: triton-lang.org – triton.autotune
Motivation
Triton’s autotuner allows developers to define a set of candidate kernel configurations (e.g., block tile sizes, warp count) and automatically benchmark them at runtime to select the best-performing variant for a given data shape and hardware. This enables near-optimal performance without manual tuning, especially across diverse input sizes and GPU architectures.
In contrast, TornadoVM currently relies on static or manually configured execution parameters (e.g., grid/block dimensions, work-group sizes). While TornadoVM can generate efficient GPU code, there is no built-in mechanism to automatically determine the best parameters for specific workloads at runtime. As a result, performance tuning remains a manual and time-consuming task.
Implementing a Triton-like autotuning feature within TornadoVM would provide:
- Automatic exploration of execution parameters (work-group/grid sizes, memory tiling, etc.)
- Runtime performance benchmarking to select the best variant per input/workload
- Improved performance portability across hardware generations
- A better developer experience by reducing the manual performance-tuning burden
What “Autotune” Should Do (High-Level)
1. Configuration Definition API
Provide a programmatic way for users (and potential compiler passes) to define multiple kernel configuration variants, e.g., differing work dimensions, resource usage estimates, and heuristics.
2. Execution & Benchmarking
During a first execution (or optionally an offline/profile mode), run each candidate configuration on representative data to collect performance metrics.
3. Selection & Caching
Select the best-performing configuration for the current workload and optionally cache results keyed by problem size and device properties to avoid re-benchmarking across runs.
4. Integration with TornadoVM APIs
Seamlessly integrate with TornadoVM’s task graph, loop-parallel, and kernel APIs so that autotuned variants can transparently replace default kernels without additional user intervention.
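The four steps above can be condensed into a minimal plain-Java sketch. `KernelConfig` and `Autotuner` are hypothetical names, not existing TornadoVM classes, and a real implementation would launch TornadoVM tasks and read device-side timers rather than `System.nanoTime()`:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class AutotunerSketch {

    /** Step 1: one candidate configuration (work-group shape plus tuning knobs). */
    public record KernelConfig(int localX, int localY, int unroll) {}

    public static class Autotuner {
        private final List<KernelConfig> candidates;
        private final Consumer<KernelConfig> kernel;        // stand-in for a kernel launch
        // Step 3: cache keyed by problem size (a real key would also include a device id).
        private final Map<Long, KernelConfig> cache = new HashMap<>();

        public Autotuner(List<KernelConfig> candidates, Consumer<KernelConfig> kernel) {
            this.candidates = candidates;
            this.kernel = kernel;
        }

        /** Step 4: users just call run(); tuning and caching happen transparently. */
        public KernelConfig run(long problemSize) {
            KernelConfig best = cache.computeIfAbsent(problemSize, s -> benchmark());
            kernel.accept(best);
            return best;
        }

        /** Step 2: time every candidate once after a warm-up run; keep the fastest. */
        private KernelConfig benchmark() {
            KernelConfig best = null;
            long bestNanos = Long.MAX_VALUE;
            for (KernelConfig c : candidates) {
                kernel.accept(c);                           // warm-up, not timed
                long t0 = System.nanoTime();
                kernel.accept(c);
                long dt = System.nanoTime() - t0;
                if (dt < bestNanos) { bestNanos = dt; best = c; }
            }
            return best;
        }
    }

    public static void main(String[] args) {
        Autotuner tuner = new Autotuner(
                List.of(new KernelConfig(8, 8, 1), new KernelConfig(16, 16, 2)),
                cfg -> { /* launch the kernel with cfg here */ });
        KernelConfig first = tuner.run(1 << 20);            // benchmarks, then caches
        KernelConfig again = tuner.run(1 << 20);            // cache hit: no re-benchmark
        System.out.println(first + " / cached=" + (first == again));
    }
}
```

The key design point is that the second `run()` for the same problem size performs exactly one launch: the benchmarking cost is paid once per cache key.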
Benefits
- Performance Portability: Different GPUs (e.g., NVIDIA, AMD) and future hardware benefit from automated configuration selection rather than hard-coded tuning.
- Reduced Manual Optimization: Users can write portable kernels and rely on the autotuner to select optimal execution parameters.
- Adaptive Execution: Autotuning results can adapt dynamically to input sizes or workload patterns at runtime.
Example Use Case (Hypothetical)
Consider a matrix multiplication kernel expressed via TornadoVM’s task graph API. With autotuning enabled, TornadoVM would:
- Generate a set of candidate work-group and grid configurations (e.g., different tile sizes).
- Execute and benchmark each configuration on a representative matrix size.
- Cache the best configuration for that matrix shape on the current device.
- Automatically apply the cached configuration for subsequent runs.
This mirrors Triton’s decorator-based autotuning pattern, where multiple triton.Config entries are evaluated to find the best configuration per scenario.
See: triton-lang.org – triton.autotune
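Under those assumptions, the flow can be prototyped in plain Java with a tiled matrix multiply standing in for the GPU kernel. `matmulTiled` and `tuneTile` are illustrative names, and a real integration would benchmark TornadoVM task-graph executions on the device rather than CPU loops:

```java
public class MatmulAutotuneSketch {

    /** Tiled C = A * B for n x n row-major matrices; tile is the tuning parameter. */
    static void matmulTiled(float[] a, float[] b, float[] c, int n, int tile) {
        java.util.Arrays.fill(c, 0f);
        for (int ii = 0; ii < n; ii += tile)
            for (int kk = 0; kk < n; kk += tile)
                for (int jj = 0; jj < n; jj += tile)
                    for (int i = ii; i < Math.min(ii + tile, n); i++)
                        for (int k = kk; k < Math.min(kk + tile, n); k++) {
                            float aik = a[i * n + k];
                            for (int j = jj; j < Math.min(jj + tile, n); j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }

    /** Best tile size cached per matrix size, so each size is tuned only once. */
    static final java.util.Map<Integer, Integer> BEST_TILE = new java.util.HashMap<>();

    static int tuneTile(int n) {
        return BEST_TILE.computeIfAbsent(n, size -> {
            float[] a = new float[size * size], b = new float[size * size];
            float[] c = new float[size * size];
            java.util.Arrays.fill(a, 1f);
            java.util.Arrays.fill(b, 2f);
            int best = 0;
            long bestTime = Long.MAX_VALUE;
            for (int tile : new int[] {8, 16, 32, 64}) {     // candidate configurations
                long t0 = System.nanoTime();
                matmulTiled(a, b, c, size, tile);
                long dt = System.nanoTime() - t0;
                if (dt < bestTime) { bestTime = dt; best = tile; }
            }
            return best;
        });
    }

    public static void main(String[] args) {
        int n = 128;
        int tile = tuneTile(n);                              // first run: benchmarks all candidates
        System.out.println("best tile for n=" + n + ": " + tile);
        System.out.println("cache hit: " + (tile == tuneTile(n)));
    }
}
```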
Considerations
- Overhead: Autotuning introduces runtime overhead due to benchmarking multiple configurations. A robust caching mechanism is essential to amortize this cost.
- Heuristics: Heuristic-guided pruning can reduce the configuration search space for large parameter sets.
- API Design: A clean, expressive API is needed to specify configurations and tuning criteria while integrating naturally with existing TornadoVM abstractions.
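Heuristic pruning in particular can be sketched as a simple pre-filter over the candidate set, applied before any benchmarking. The device limits used below (`maxWorkGroupSize`, a local-memory budget) are made-up illustrative values, not queried from real hardware:

```java
import java.util.List;
import java.util.stream.Collectors;

public class PruningSketch {

    public record Config(int localX, int localY, int tile) {}

    /** Keep only configurations that fit within the (assumed) device limits. */
    public static List<Config> prune(List<Config> all, int maxWorkGroupSize, int maxLocalMemBytes) {
        return all.stream()
                .filter(c -> c.localX() * c.localY() <= maxWorkGroupSize)
                // assume two float tiles (for A and B) staged in local memory:
                .filter(c -> 2 * c.tile() * c.tile() * Float.BYTES <= maxLocalMemBytes)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Config> all = List.of(
                new Config(8, 8, 16),
                new Config(32, 32, 32),
                new Config(64, 32, 64));   // 2048 work-items: over the assumed limit
        System.out.println(prune(all, 1024, 32 * 1024));
    }
}
```

Infeasible configurations are rejected for free, so benchmark time is spent only on candidates that could actually win.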
Request
Please consider adding a kernel autotuning framework to TornadoVM that:
- Defines an API for configurable kernel performance variants
- Performs runtime evaluation and selection
- Caches optimal configurations per input size and device
- Integrates cleanly with existing task and parallel abstractions
Such a feature could significantly improve performance outcomes and make TornadoVM more competitive with modern GPU DSLs that provide built-in autotuning support.