
Torch-TensorRT v2.11.0


@lanluo-nvidia lanluo-nvidia released this 07 Apr 17:03
· 82 commits to main since this release
0cc00aa

Torch-TensorRT 2.11.0 Linux x86-64 and Windows targets

PyTorch 2.11, CUDA 12.6/12.8/12.9/13.0, TensorRT 10.15, Python 3.10–3.13

Torch-TensorRT Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10–3.13 is available via PyPI

CUDA 12.6/12.8/12.9/13.0 + Python 3.10–3.13 is also available via the PyTorch index

aarch64 SBSA Linux and Jetson Thor:
CUDA 13.0 + Python 3.10–3.13 + Torch 2.11 + TensorRT 10.15

Jetson Orin

  • There is no torch_tensorrt 2.9/2.10/2.11 release for Jetson Orin
  • Please continue using the torch_tensorrt 2.8 release

Torch-TensorRT-RTX 2.11.0 Linux x86-64 and Windows targets

PyTorch 2.11, CUDA 12.9/13.0, TensorRT-RTX 1.3, Python 3.10–3.13

Torch-TensorRT-RTX Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10–3.13 is available via PyPI

CUDA 12.9/13.0 + Python 3.10–3.13 is also available via the PyTorch index

Note: the tensorrt-rtx 1.3 wheel is not on PyPI yet, so please download the tarball from https://developer.nvidia.com/tensorrt-rtx and install the wheel from the tarball.

IAttention Layer

In this release, TensorRT's native IAttention layer is used by default to handle various attention-related ATen ops, including SDPA, Flash-SDPA, Efficient-SDPA, and cuDNN-SDPA. This integration enables more efficient execution and can improve model performance. To explicitly enable this behavior, set decompose_attention=False in the compile() function; the native TensorRT implementation is then used for optimized attention computation.
However, due to current TensorRT limitations, certain operations such as compute_log_sumexp and Grouped Query Attention (GQA) are not yet supported. If these cases are encountered, an informational message is printed during compilation. Alternatively, you can set decompose_attention=True to decompose the attention ops into multiple basic ATen ops. Although this approach may not achieve the same level of performance optimization, it offers broader operator coverage and greater compatibility across model architectures.
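To make the trade-off concrete, here is what the decomposition means in plain PyTorch (a sketch of the equivalence, not the actual Torch-TensorRT lowering): with decompose_attention=False the fused SDPA op is handed to TensorRT's IAttention layer as-is, while decompose_attention=True lowers it to basic ATen ops such as matmul and softmax.

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim)
q = torch.randn(1, 4, 16, 8)
k = torch.randn(1, 4, 16, 8)
v = torch.randn(1, 4, 16, 8)

# Fused SDPA op: the form IAttention consumes when decompose_attention=False.
fused = F.scaled_dot_product_attention(q, k, v)

# Decomposed form: roughly what decompose_attention=True lowers to.
scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
decomposed = torch.softmax(scores, dim=-1) @ v

assert torch.allclose(fused, decomposed, atol=1e-4)
```

The two forms compute the same result; the difference is only in which ops the converter sees and how TensorRT can optimize them.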

Improvements to the Symbolic Shape System

Two key improvements have been made to the symbolic shape system used to track mutations of dynamic dimensions throughout the body of the graph.

  1. A shape propagation formula is recorded as metadata for every compiled engine.

Previously, key tasks that require fake-tensor propagation, such as serialization and retracing, needed to instantiate the engine. Now we record the shape relation between inputs and outputs for every TRT subgraph at compile time and store it as metadata; the torch.ops.tensorrt.execute_engine meta kernel simply replays this function in the new shape environment.
This should enable more seamless integration with the rest of the torch.compile ecosystem as meta kernels in Torch-TensorRT will now work in the same way as other meta kernels.
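The replay idea can be sketched in a few lines (illustrative only; the subgraph and shapes are hypothetical, not the actual Torch-TensorRT code):

```python
# Illustrative sketch: a shape-propagation formula stored as engine metadata
# lets the meta kernel compute output shapes without instantiating the engine.

def compile_subgraph_shape_formula():
    # Suppose a TRT subgraph flattens (batch, seq, 64) -> (batch, seq * 64).
    # At compile time we record the input->output shape relation:
    return lambda batch, seq: (batch, seq * 64)

formula = compile_subgraph_shape_formula()  # stored alongside the engine

# Later, the meta kernel replays the formula in a new shape environment
# instead of running the engine:
assert formula(2, 8) == (2, 512)
assert formula(4, 16) == (4, 1024)
```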

  2. For unbounded shape ranges, we now select sane defaults.

Dynamic shapes are lazily inserted into Dynamo graphs. This is most noticeable when using Torch-TensorRT as a backend for torch.compile. Here, when the first inference call is made to a boxed function:

trt_mod = torch.compile(mod, backend="tensorrt")
trt_mod(*inputs)

the shapes of intermediate tensors are considered static and are derived eagerly from the shapes of the input tensors.

If the input shapes then change:

trt_mod(*other_sized_inputs)

TorchDynamo will start marking dimensions where shapes differ as dynamic. However, it does not assume upper bounds and, critically, has no "optimal" or target size as required by TensorRT.

As such, when we see such [FIXED_SIZE, inf) ranges, we set a sane upper bound (max_int / max_dims) and take the optimal size to be the midpoint of that range. We highly recommend that users explicitly set dynamic shape bounds for both torch.export and torch.compile use cases, but this system can serve as a fallback. See https://docs.pytorch.org/TensorRT/user_guide/compilation/dynamic_shapes.html
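The fallback selection can be sketched in a few lines (illustrative only; MAX_DIMS and the exact bound are assumptions, not the actual implementation):

```python
import sys

MAX_DIMS = 8  # hypothetical cap used to bound each dimension

def resolve_range(min_size, max_size=None):
    """Pick (min, opt, max) for one dynamic dimension.

    For an unbounded [min_size, inf) range, clamp the upper bound to a
    sane default and take the midpoint as the TensorRT 'optimal' size.
    Sketch of the fallback described above, not the exact code.
    """
    if max_size is None:  # unbounded range from TorchDynamo
        max_size = sys.maxsize // MAX_DIMS
    opt = (min_size + max_size) // 2
    return min_size, opt, max_size

# Explicit bounds pass through, with the midpoint as the optimal size:
assert resolve_range(1, 32) == (1, 16, 32)

# An unbounded range gets a finite, ordered profile:
lo, opt, hi = resolve_range(4)
assert lo == 4 and lo <= opt <= hi
```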

Torch-TensorRT-RTX

TensorRT-RTX is a JIT-focused version of TensorRT that allows users to target many different hardware platforms with one artifact, making it easy for developers to deliver performance to their users across the many variations of RTX GPUs. Previous versions of Torch-TensorRT provided source-code support for using TensorRT-RTX as a backend, giving users access to the same workflows as standard Torch-TensorRT with a more JIT-oriented optimization approach.

With 2.11, Torch-TensorRT-RTX has graduated to its own package, which you can install with pip install torch-tensorrt-rtx. This package uses all the same APIs as Torch-TensorRT, just with a different backend. torch-tensorrt-rtx 2.11 targets TensorRT-RTX 1.3. For 2.11, TensorRT-RTX must be installed via a wheel distributed on developer.nvidia.com: https://developer.nvidia.com/tensorrt-rtx

Known Limitations:

  • bf16 precision is generally supported; however, some models may still exhibit numerical accuracy issues. This will be addressed in future versions of TensorRT-RTX
  • There is a known accuracy issue when running Grouped Query Attention in TensorRT-RTX, which will be addressed in a future release of TensorRT-RTX

run_llm int8 quantization

We have added support for performing post-training quantization in int8 precision from the command line using the run_llm tool.
You can apply int8 quantization backed by the TensorRT-Model-Optimizer-Toolkit using --quant_format int8:

python run_llm.py --model meta-llama/Llama-3.1-8B --quant_format int8 --prompt "What is parallel programming?" --model_precision FP16 --num_tokens 128

Empty Tensor

The Torch-TensorRT runtime now supports providing empty tensors (tensors with one or more zero-sized dimensions) as inputs to Torch-TensorRT compiled programs.

Under the hood, we use TensorRT's native empty-tensor semantics: empty tensors are marked by a 1-byte placeholder input to the engine. Both the Python and C++ runtimes support this feature.
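In plain PyTorch terms, an "empty" input is simply a tensor with a zero-sized dimension, which ordinary ops already propagate; compiled programs now accept such inputs as well. A minimal illustration of the shape semantics (the module here is a hypothetical stand-in for a compiled program):

```python
import torch

lin = torch.nn.Linear(8, 16)   # stand-in for any compiled module
x = torch.zeros(0, 8)          # empty tensor: zero-sized batch dimension
y = lin(x)

assert tuple(y.shape) == (0, 16)  # the output shape is still propagated
assert y.numel() == 0             # but the tensor holds no elements
```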

What's Changed

New Contributors

Full Changelog: v2.10.0...v2.11.0