
Torch-TensorRT v2.11.0


@lanluo-nvidia lanluo-nvidia released this 07 Apr 17:03
· 82 commits to main since this release
0cc00aa

Torch-TensorRT 2.11.0 Linux x86-64 and Windows targets

PyTorch 2.11, CUDA 12.6/12.8/12.9/13.0, TensorRT 10.15, Python 3.10–3.13

Torch-TensorRT Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10–3.13 is available via PyPI

CUDA 12.6/12.8/12.9/13.0 + Python 3.10–3.13 is also available via the PyTorch index

aarch64 SBSA Linux and Jetson Thor:
CUDA 13.0 + Python 3.10–3.13 + Torch 2.11 + TensorRT 10.15

Jetson Orin

  • There is no torch_tensorrt 2.9/2.10/2.11 release for Jetson Orin
  • Please continue using the torch_tensorrt 2.8 release

Torch-TensorRT-RTX 2.11.0 Linux x86-64 and Windows targets

PyTorch 2.11, CUDA 12.9/13.0, TensorRT-RTX 1.3, Python 3.10–3.13

Torch-TensorRT-RTX Wheels are available:

x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10–3.13 is available via PyPI

CUDA 12.9/13.0 + Python 3.10–3.13 is also available via the PyTorch index

Note: the tensorrt-rtx 1.3 wheel is not on PyPI yet, so please download the tarball from https://developer.nvidia.com/tensorrt-rtx and install the wheel from the tarball.

IAttention Layer

In this release, TensorRT's native IAttention layer is used by default to handle various attention-related ATen ops, including SDPA, Flash-SDPA, Efficient-SDPA, and cuDNN-SDPA. This integration enables more efficient execution and can improve model performance. To explicitly enable this behavior, set decompose_attention=False in the compile() function; the native TensorRT implementation is then used for optimized attention computation.
However, due to current TensorRT limitations, certain operations such as compute_log_sumexp and Grouped Query Attention (GQA) are not yet supported. If these cases are encountered, an informational message is printed during compilation. Alternatively, you can set decompose_attention=True to decompose the attention ops into multiple basic ATen ops. Although this approach may not achieve the same level of performance optimization, it offers broader operator coverage and greater compatibility across model architectures.
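To make the trade-off concrete, here is what the decomposition means in plain PyTorch (a sketch of the equivalence, not the actual Torch-TensorRT lowering): with decompose_attention=False the fused SDPA op is handed to TensorRT's IAttention layer as-is, while decompose_attention=True lowers it to basic ATen ops such as matmul and softmax.

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim)
q = torch.randn(1, 4, 16, 8)
k = torch.randn(1, 4, 16, 8)
v = torch.randn(1, 4, 16, 8)

# Fused SDPA op: the form IAttention consumes when decompose_attention=False.
fused = F.scaled_dot_product_attention(q, k, v)

# Decomposed form: roughly what decompose_attention=True lowers to.
scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
decomposed = torch.softmax(scores, dim=-1) @ v

assert torch.allclose(fused, decomposed, atol=1e-4)
```

The two forms compute the same result; the difference is only in which ops the converter sees and how TensorRT can optimize them.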

Improvements to the Symbolic Shape System

Two key improvements have been made to the symbolic shape system used to track mutations of dynamic dimensions throughout the body of the graph.

  1. A shape propagation formula is recorded as metadata for every compiled engine.

Previously, key tasks that require fake-tensor propagation, such as serialization and retracing, needed to instantiate the engine. Now we record the shape relation between inputs and outputs for every TRT subgraph at compile time and store it as metadata; the torch.ops.tensorrt.execute_engine meta kernel simply replays this function in the new shape environment.
This should enable more seamless integration with the rest of the torch.compile ecosystem as meta kernels in Torch-TensorRT will now work in the same way as other meta kernels.
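The replay idea can be sketched in a few lines (illustrative only; the subgraph and shapes are hypothetical, not the actual Torch-TensorRT code):

```python
# Illustrative sketch: a shape-propagation formula stored as engine metadata
# lets the meta kernel compute output shapes without instantiating the engine.

def compile_subgraph_shape_formula():
    # Suppose a TRT subgraph flattens (batch, seq, 64) -> (batch, seq * 64).
    # At compile time we record the input->output shape relation:
    return lambda batch, seq: (batch, seq * 64)

formula = compile_subgraph_shape_formula()  # stored alongside the engine

# Later, the meta kernel replays the formula in a new shape environment
# instead of running the engine:
assert formula(2, 8) == (2, 512)
assert formula(4, 16) == (4, 1024)
```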

  2. For unbounded shape ranges, we now select sane defaults.

Dynamic shapes are lazily inserted into Dynamo graphs. This is most noticeable when using Torch-TensorRT as a backend for torch.compile. Here, when the first inference call is made to a boxed function:

trt_mod = torch.compile(mod, backend="tensorrt")
trt_mod(*inputs)

the shapes of intermediate tensors are considered static and are derived eagerly from the shapes of the input tensors.

If the input shapes then change:

trt_mod(*other_sized_inputs)

TorchDynamo will start marking dimensions where shapes differ as dynamic. However, it does not assume upper bounds and, critically, has no "optimal" or target size as required by TensorRT.

As such, when we see such [FIXED_SIZE, inf) ranges, we set a sane upper bound (max_int / max_dims) and take the optimal size to be the midpoint of that range. We highly recommend that users explicitly set dynamic shape bounds for both torch.export and torch.compile use cases, but this system can serve as a fallback. See https://docs.pytorch.org/TensorRT/user_guide/compilation/dynamic_shapes.html
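The fallback selection can be sketched in a few lines (illustrative only; MAX_DIMS and the exact bound are assumptions, not the actual implementation):

```python
import sys

MAX_DIMS = 8  # hypothetical cap used to bound each dimension

def resolve_range(min_size, max_size=None):
    """Pick (min, opt, max) for one dynamic dimension.

    For an unbounded [min_size, inf) range, clamp the upper bound to a
    sane default and take the midpoint as the TensorRT 'optimal' size.
    Sketch of the fallback described above, not the exact code.
    """
    if max_size is None:  # unbounded range from TorchDynamo
        max_size = sys.maxsize // MAX_DIMS
    opt = (min_size + max_size) // 2
    return min_size, opt, max_size

# Explicit bounds pass through, with the midpoint as the optimal size:
assert resolve_range(1, 32) == (1, 16, 32)

# An unbounded range gets a finite, ordered profile:
lo, opt, hi = resolve_range(4)
assert lo == 4 and lo <= opt <= hi
```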

Torch-TensorRT-RTX

TensorRT-RTX is a JIT-focused version of TensorRT that allows users to target many different hardware platforms with one artifact, making it easy for developers to deliver performance to their users across the many variations of RTX GPUs. Previous versions of Torch-TensorRT provided source-code support for using TensorRT-RTX as a backend, giving users access to the same workflows as standard Torch-TensorRT with a more JIT-oriented optimization approach.

With 2.11, Torch-TensorRT-RTX has graduated to its own package, which you can install with pip install torch-tensorrt-rtx. This package uses all the same APIs as Torch-TensorRT, just with a different backend. torch-tensorrt-rtx 2.11 targets TensorRT-RTX 1.3. For 2.11, TensorRT-RTX must be installed via a wheel distributed on developer.nvidia.com: https://developer.nvidia.com/tensorrt-rtx

Known Limitations:

  • bf16 precision is generally supported; however, some models may still exhibit numerical accuracy issues. This will be addressed in future versions of TensorRT-RTX
  • There is a known accuracy issue when running Grouped Query Attention in TensorRT-RTX, which will be addressed in a future release of TensorRT-RTX

run_llm int8 quantization

We have added support for performing post-training quantization in int8 precision from the command line using the run_llm tool.
You can apply int8 quantization backed by the TensorRT-Model-Optimizer-Toolkit using --quant_format int8:

python run_llm.py --model meta-llama/Llama-3.1-8B --quant_format int8 --prompt "What is parallel programming?" --model_precision FP16 --num_tokens 128

Empty Tensor

The Torch-TensorRT runtime now supports providing empty tensors (tensors with one or more zero-sized dimensions) as inputs to Torch-TensorRT compiled programs.

Under the hood, we use TensorRT's native empty-tensor semantics: empty tensors are marked by a 1-byte placeholder input to the engine. Both the Python and C++ runtimes support this feature.
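In plain PyTorch terms, an "empty" input is simply a tensor with a zero-sized dimension, which ordinary ops already propagate; compiled programs now accept such inputs as well. A minimal illustration of the shape semantics (the module here is a hypothetical stand-in for a compiled program):

```python
import torch

lin = torch.nn.Linear(8, 16)   # stand-in for any compiled module
x = torch.zeros(0, 8)          # empty tensor: zero-sized batch dimension
y = lin(x)

assert tuple(y.shape) == (0, 16)  # the output shape is still propagated
assert y.numel() == 0             # but the tensor holds no elements
```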

What's Changed

New Contributors

Full Changelog: v2.10.0...v2.11.0