
Proposal: Implement Triton-style Autotuning Support in TornadoVM #778

@mikepapadim

Description


Summary

Add autotuning infrastructure to TornadoVM to enable automatic exploration and selection of high-performance kernel execution configurations, similar in concept to triton.autotune in the Triton GPU programming framework.
See: triton-lang.org – triton.autotune


Motivation

Triton’s autotuner allows developers to define a set of candidate kernel configurations (e.g., block tile sizes, warp count) and automatically benchmark them at runtime to select the best-performing variant for a given data shape and hardware. This enables near-optimal performance without manual tuning, especially across diverse input sizes and GPU architectures.
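The pattern can be illustrated with a small framework-agnostic sketch (the `Config` and `autotune` names below are stand-ins, not Triton's actual API): each candidate configuration is timed once on the first call, and the fastest one is reused for all subsequent calls.

```python
import time


class Config:
    """A candidate kernel configuration (e.g., a block/tile size)."""
    def __init__(self, **params):
        self.params = params


def autotune(configs):
    """Decorator: time each config on the first call, then reuse the winner."""
    def wrap(kernel):
        best = {}

        def tuned(*args, **kwargs):
            if "config" not in best:
                timings = []
                for cfg in configs:
                    start = time.perf_counter()
                    kernel(*args, **cfg.params, **kwargs)
                    timings.append((time.perf_counter() - start, cfg))
                best["config"] = min(timings, key=lambda t: t[0])[1]
            return kernel(*args, **best["config"].params, **kwargs)

        tuned.best = best
        return tuned
    return wrap


@autotune(configs=[Config(block_size=64), Config(block_size=256)])
def vector_add(a, b, block_size):
    # Stand-in for a real kernel; block_size would steer the launch config.
    return [x + y for x, y in zip(a, b)]
```

In real Triton, the decorated kernel is a GPU kernel and the configurations also carry launch parameters such as the number of warps; the sketch only captures the benchmark-and-select mechanism.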

In contrast, TornadoVM currently relies on static or manually configured execution parameters (e.g., grid/block dimensions, work-group sizes). While TornadoVM can generate efficient GPU code, there is no built-in mechanism to automatically determine the best parameters for specific workloads at runtime. As a result, performance tuning remains a manual and time-consuming task.

Implementing a Triton-like autotuning feature within TornadoVM would provide:

  • Automatic exploration of execution parameters (work-group/grid sizes, memory tiling, etc.)
  • Runtime performance benchmarking to select the best variant per input/workload
  • Improved performance portability across hardware generations
  • A better developer experience by reducing the manual performance-tuning burden

What “Autotune” Should Do (High-Level)

1. Configuration Definition API

Provide a programmatic way for users (and potential compiler passes) to define multiple kernel configuration variants that differ in work dimensions, resource-usage estimates, and heuristics.
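One possible shape for such a configuration record, sketched in Python for brevity (a TornadoVM implementation would live in Java, and every field name here is an assumption, not an existing API):

```python
from dataclasses import dataclass, field


@dataclass
class KernelConfig:
    """One candidate execution configuration for a kernel.

    All fields are illustrative; a real TornadoVM API would expose
    whatever parameters its runtime actually supports.
    """
    global_work: tuple          # global work / grid dimensions
    local_work: tuple = None    # work-group size (None = let the driver pick)
    tile: int = 0               # memory-tiling factor, 0 = untiled
    hints: dict = field(default_factory=dict)  # free-form heuristic hints


# A user (or a compiler pass) would enumerate candidate variants:
candidates = [
    KernelConfig(global_work=(1024, 1024), local_work=(16, 16), tile=16),
    KernelConfig(global_work=(1024, 1024), local_work=(32, 8), tile=32),
]
```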

2. Execution & Benchmarking

During the first execution (or, optionally, in an offline profiling mode), run each candidate configuration on representative data to collect performance metrics.
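The measurement step could look like the following sketch (names and defaults are assumptions): warmup iterations absorb JIT/compilation cost so that steady-state time is measured, and the minimum over repeats is taken to reduce scheduler noise.

```python
import time


def benchmark(run, configs, warmup=2, repeats=5):
    """Time each candidate configuration; return {config_index: best_time}.

    `run(cfg)` executes the kernel once under configuration `cfg`.
    """
    results = {}
    for i, cfg in enumerate(configs):
        for _ in range(warmup):
            run(cfg)  # discard warmup timings (JIT, caches)
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(cfg)
            times.append(time.perf_counter() - start)
        results[i] = min(times)  # min is robust to one-off noise
    return results
```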

3. Selection & Caching

Select the best-performing configuration for the current workload and optionally cache results keyed by problem size and device properties to avoid re-benchmarking across runs.
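A minimal sketch of such a cache, keyed exactly as described (problem size plus a device identifier; all names are illustrative):

```python
class TuningCache:
    """Cache best configurations keyed by (problem size, device identifier)."""

    def __init__(self):
        self._best = {}

    def lookup_or_tune(self, shape, device, configs, benchmark):
        """Return the cached best config, benchmarking only on a cache miss."""
        key = (tuple(shape), device)
        if key not in self._best:
            timings = {cfg: benchmark(cfg) for cfg in configs}
            self._best[key] = min(timings, key=timings.get)
        return self._best[key]
```

A persistent variant could serialize this map to disk so tuning survives across JVM runs, which is what amortizes the benchmarking cost in practice.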

4. Integration with TornadoVM APIs

Seamlessly integrate with TornadoVM’s task graph, loop-parallel, and kernel APIs so that autotuned variants can transparently replace default kernels without additional user intervention.


Benefits

  • Performance Portability
    Different GPUs (e.g., NVIDIA, AMD) and future hardware benefit from automated configuration selection rather than hard-coded tuning.

  • Reduced Manual Optimization
    Users can write portable kernels and rely on the autotuner to select optimal execution parameters.

  • Adaptive Execution
    Autotuning results can adapt dynamically to input sizes or workload patterns at runtime.


Example Use Case (Hypothetical)

Consider a matrix multiplication kernel expressed via TornadoVM’s task graph API. With autotuning enabled, TornadoVM would:

  1. Generate a set of candidate work-group and grid configurations (e.g., different tile sizes).
  2. Execute and benchmark each configuration on a representative matrix size.
  3. Cache the best configuration for that matrix shape on the current device.
  4. Automatically apply the cached configuration for subsequent runs.

This mirrors Triton’s decorator-based autotuning pattern, where multiple triton.Config entries are evaluated to find the best configuration per scenario.
See: triton-lang.org – triton.autotune


Considerations

  • Overhead
    Autotuning introduces runtime overhead due to benchmarking multiple configurations. A robust caching mechanism is essential to amortize this cost.

  • Heuristics
    Heuristic-guided pruning can reduce the configuration search space for large parameter sets.

  • API Design
    A clean, expressive API is needed to specify configurations and tuning criteria while integrating naturally with existing TornadoVM abstractions.
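As a concrete example of the heuristic pruning mentioned above (illustrative Python; the predicates would in practice encode device limits such as the maximum work-group size reported by the driver):

```python
def prune(configs, n, max_group=256):
    """Drop candidates that cannot be valid for an n-sized problem.

    Illustrative heuristics: a tile larger than the problem wastes work,
    and a work-group larger than the device limit cannot launch at all.
    """
    return [cfg for cfg in configs
            if cfg["tile"] <= n and cfg["group"] <= max_group]
```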


Request

Please consider adding a kernel autotuning framework to TornadoVM that:

  • Defines an API for configurable kernel performance variants
  • Performs runtime evaluation and selection
  • Caches optimal configurations per input size and device
  • Integrates cleanly with existing task and parallel abstractions

Such a feature could significantly improve performance outcomes and make TornadoVM more competitive with modern GPU DSLs that provide built-in autotuning support.
