
Proposal: Implement Triton-style Autotuning Support in TornadoVM #778

@mikepapadim

Description


Summary

Add autotuning infrastructure to TornadoVM to enable automatic exploration and selection of high-performance kernel execution configurations, similar in concept to triton.autotune in the Triton GPU programming framework.
See: triton-lang.org – triton.autotune


Motivation

Triton’s autotuner allows developers to define a set of candidate kernel configurations (e.g., block tile sizes, warp count) and automatically benchmark them at runtime to select the best-performing variant for a given data shape and hardware. This enables near-optimal performance without manual tuning, especially across diverse input sizes and GPU architectures.
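The pattern can be illustrated with a small framework-agnostic sketch (the `Config` and `autotune` names below are stand-ins, not Triton's actual API): each candidate configuration is timed once on the first call, and the fastest one is reused for all subsequent calls.

```python
import time


class Config:
    """A candidate kernel configuration (e.g., a block/tile size)."""
    def __init__(self, **params):
        self.params = params


def autotune(configs):
    """Decorator: time each config on the first call, then reuse the winner."""
    def wrap(kernel):
        best = {}

        def tuned(*args, **kwargs):
            if "config" not in best:
                timings = []
                for cfg in configs:
                    start = time.perf_counter()
                    kernel(*args, **cfg.params, **kwargs)
                    timings.append((time.perf_counter() - start, cfg))
                best["config"] = min(timings, key=lambda t: t[0])[1]
            return kernel(*args, **best["config"].params, **kwargs)

        tuned.best = best
        return tuned
    return wrap


@autotune(configs=[Config(block_size=64), Config(block_size=256)])
def vector_add(a, b, block_size):
    # Stand-in for a real kernel; block_size would steer the launch config.
    return [x + y for x, y in zip(a, b)]
```

In real Triton, the decorated kernel is a GPU kernel and the configurations also carry launch parameters such as the number of warps; the sketch only captures the benchmark-and-select mechanism.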

In contrast, TornadoVM currently relies on static or manually configured execution parameters (e.g., grid/block dimensions, work-group sizes). While TornadoVM can generate efficient GPU code, there is no built-in mechanism to automatically determine the best parameters for specific workloads at runtime. As a result, performance tuning remains a manual and time-consuming task.

Implementing a Triton-like autotuning feature within TornadoVM would provide:

  • Automatic exploration of execution parameters (work-group/grid sizes, memory tiling, etc.)
  • Runtime performance benchmarking to select the best variant per input/workload
  • Improved performance portability across hardware generations
  • A better developer experience by reducing the manual performance-tuning burden

What “Autotune” Should Do (High-Level)

1. Configuration Definition API

Provide a programmatic way for users (and potential compiler passes) to define multiple kernel configuration variants that differ in work dimensions, resource-usage estimates, and heuristics.
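One possible shape for such a configuration record, sketched in Python for brevity (a TornadoVM implementation would live in Java, and every field name here is an assumption, not an existing API):

```python
from dataclasses import dataclass, field


@dataclass
class KernelConfig:
    """One candidate execution configuration for a kernel.

    All fields are illustrative; a real TornadoVM API would expose
    whatever parameters its runtime actually supports.
    """
    global_work: tuple          # global work / grid dimensions
    local_work: tuple = None    # work-group size (None = let the driver pick)
    tile: int = 0               # memory-tiling factor, 0 = untiled
    hints: dict = field(default_factory=dict)  # free-form heuristic hints


# A user (or a compiler pass) would enumerate candidate variants:
candidates = [
    KernelConfig(global_work=(1024, 1024), local_work=(16, 16), tile=16),
    KernelConfig(global_work=(1024, 1024), local_work=(32, 8), tile=32),
]
```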

2. Execution & Benchmarking

During the first execution (or, optionally, in an offline profiling mode), run each candidate configuration on representative data to collect performance metrics.
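The measurement step could look like the following sketch (names and defaults are assumptions): warmup iterations absorb JIT/compilation cost so that steady-state time is measured, and the minimum over repeats is taken to reduce scheduler noise.

```python
import time


def benchmark(run, configs, warmup=2, repeats=5):
    """Time each candidate configuration; return {config_index: best_time}.

    `run(cfg)` executes the kernel once under configuration `cfg`.
    """
    results = {}
    for i, cfg in enumerate(configs):
        for _ in range(warmup):
            run(cfg)  # discard warmup timings (JIT, caches)
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(cfg)
            times.append(time.perf_counter() - start)
        results[i] = min(times)  # min is robust to one-off noise
    return results
```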

3. Selection & Caching

Select the best-performing configuration for the current workload and optionally cache results keyed by problem size and device properties to avoid re-benchmarking across runs.
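A minimal sketch of such a cache, keyed exactly as described (problem size plus a device identifier; all names are illustrative):

```python
class TuningCache:
    """Cache best configurations keyed by (problem size, device identifier)."""

    def __init__(self):
        self._best = {}

    def lookup_or_tune(self, shape, device, configs, benchmark):
        """Return the cached best config, benchmarking only on a cache miss."""
        key = (tuple(shape), device)
        if key not in self._best:
            timings = {cfg: benchmark(cfg) for cfg in configs}
            self._best[key] = min(timings, key=timings.get)
        return self._best[key]
```

A persistent variant could serialize this map to disk so tuning survives across JVM runs, which is what amortizes the benchmarking cost in practice.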

4. Integration with TornadoVM APIs

Seamlessly integrate with TornadoVM’s task graph, loop-parallel, and kernel APIs so that autotuned variants can transparently replace default kernels without additional user intervention.


Benefits

  • Performance Portability
    Different GPUs (e.g., NVIDIA, AMD) and future hardware benefit from automated configuration selection rather than hard-coded tuning.

  • Reduced Manual Optimization
    Users can write portable kernels and rely on the autotuner to select optimal execution parameters.

  • Adaptive Execution
    Autotuning results can adapt dynamically to input sizes or workload patterns at runtime.


Example Use Case (Hypothetical)

Consider a matrix multiplication kernel expressed via TornadoVM’s task graph API. With autotuning enabled, TornadoVM would:

  1. Generate a set of candidate work-group and grid configurations (e.g., different tile sizes).
  2. Execute and benchmark each configuration on a representative matrix size.
  3. Cache the best configuration for that matrix shape on the current device.
  4. Automatically apply the cached configuration for subsequent runs.

This mirrors Triton’s decorator-based autotuning pattern, where multiple triton.Config entries are evaluated to find the best configuration per scenario.
See: triton-lang.org – triton.autotune


Considerations

  • Overhead
    Autotuning introduces runtime overhead due to benchmarking multiple configurations. A robust caching mechanism is essential to amortize this cost.

  • Heuristics
    Heuristic-guided pruning can reduce the configuration search space for large parameter sets.

  • API Design
    A clean, expressive API is needed to specify configurations and tuning criteria while integrating naturally with existing TornadoVM abstractions.
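As a concrete example of the heuristic pruning mentioned above (illustrative Python; the predicates would in practice encode device limits such as the maximum work-group size reported by the driver):

```python
def prune(configs, n, max_group=256):
    """Drop candidates that cannot be valid for an n-sized problem.

    Illustrative heuristics: a tile larger than the problem wastes work,
    and a work-group larger than the device limit cannot launch at all.
    """
    return [cfg for cfg in configs
            if cfg["tile"] <= n and cfg["group"] <= max_group]
```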


Request

Please consider adding a kernel autotuning framework to TornadoVM that:

  • Defines an API for configurable kernel performance variants
  • Performs runtime evaluation and selection
  • Caches optimal configurations per input size and device
  • Integrates cleanly with existing task and parallel abstractions

Such a feature could significantly improve performance outcomes and make TornadoVM more competitive with modern GPU DSLs that provide built-in autotuning support.
