Torch-TensorRT 2.11.0 Linux x86-64 and Windows targets
PyTorch 2.11, CUDA 12.6/12.8/12.9/13.0, TensorRT 10.15, Python 3.10–3.13
Torch-TensorRT Wheels are available:
x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10–3.13 is available via PyPI
CUDA 12.6/12.8/12.9/13.0 + Python 3.10–3.13 is also available via the PyTorch index
aarch64 SBSA Linux and Jetson Thor:
CUDA 13.0 + Python 3.10–3.13 + Torch 2.11 + TensorRT 10.15
- Available via PyPI: https://pypi.org/project/torch-tensorrt/
- Available via PyTorch index: https://download.pytorch.org/whl/torch-tensorrt
Jetson Orin:
- There is no torch_tensorrt 2.9/2.10/2.11 release for Jetson Orin
- Please continue using the torch_tensorrt 2.8 release
Torch-TensorRT-RTX 2.11.0 Linux x86-64 and Windows targets
PyTorch 2.11, CUDA 12.9/13.0, TensorRT-RTX 1.3, Python 3.10–3.13
Torch-TensorRT-RTX Wheels are available:
x86-64 Linux and Windows:
CUDA 13.0 + Python 3.10–3.13 is available via PyPI
CUDA 12.9/13.0 + Python 3.10–3.13 is also available via the PyTorch index
Note: the tensorrt-rtx 1.3 wheel is not on PyPI yet; please download the tarball from https://developer.nvidia.com/tensorrt-rtx and install the wheel from the tarball.
IAttention Layer
In this release, TensorRT's native IAttention layer is used by default to handle various attention-related ATen ops, including SDPA, Flash-SDPA, Efficient-SDPA, and cuDNN-SDPA. This integration enables more efficient execution and can improve model performance. To explicitly enable this behavior, set decompose_attention=False in the compile() function; the native TensorRT implementation is then used for optimized attention computation.
However, due to current TensorRT limitations, certain operations such as compute_log_sumexp and Grouped Query Attention (GQA) are not yet supported. If these cases are encountered, an informational message is printed during compilation. Alternatively, you can set decompose_attention=True to decompose the attention ops into multiple basic ATen ops. Although this approach may not achieve the same level of performance optimization, it offers broader operator coverage and greater compatibility across model architectures.
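To illustrate what decomposition into basic ops means, here is a minimal pure-Python sketch (no torch or TensorRT dependency, not the actual lowering pass) of scaled dot-product attention expressed as the primitive ops matmul, scale, and softmax:

```python
# Conceptual sketch of decompose_attention=True: SDPA written as
# basic ops (matmul -> scale -> softmax -> matmul). Illustrative only.
import math

def matmul(a, b):
    # Naive matrix multiply: rows of a against columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def sdpa_decomposed(q, k, v):
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]         # k transposed
    scale = 1.0 / math.sqrt(d)
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]  # q @ k.T / sqrt(d)
    probs = [softmax(row) for row in scores]     # row-wise softmax
    return matmul(probs, v)                      # probs @ v
```

With decompose_attention=False, the whole computation above is instead handed to TensorRT's IAttention layer as a single fused op.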
Improvements to the Symbolic Shape System
Two key improvements have been made to the symbolic shape system used to track mutations of dynamic dimensions throughout the body of the graph.
- A shape-prop formula is recorded as metadata for every compiled engine.
Previously, key tasks that require fake tensor propagation, such as serialization and retracing, required instantiating the engine. Now we record the shape relation between inputs and outputs for every TRT subgraph at compile time and store it as metadata. The torch.ops.tensorrt.execute_engine meta kernel simply replays this function in the new shape environment.
This should enable more seamless integration with the rest of the torch.compile ecosystem as meta kernels in Torch-TensorRT will now work in the same way as other meta kernels.
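The idea can be sketched in a few lines of plain Python. The names here (record_shape_formula, formula) are illustrative, not the real Torch-TensorRT API:

```python
# Hypothetical sketch: record a shape-propagation formula for a subgraph at
# compile time, then replay it later without instantiating the engine.
def record_shape_formula():
    # e.g. a subgraph computing a batched matmul (B, M, K) @ (B, K, N) -> (B, M, N)
    def formula(lhs_shape, rhs_shape):
        b, m, _k = lhs_shape
        _b, _k2, n = rhs_shape
        return (b, m, n)
    return formula

# At compile time the formula is stored as engine metadata; the meta kernel
# replays it in a new shape environment instead of running the engine.
formula = record_shape_formula()
out_shape = formula((8, 16, 32), (8, 32, 64))  # (8, 16, 64)
```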
- For unbounded shape ranges, we now select sane defaults
Dynamic shapes are lazily inserted into Dynamo graphs. This is most noticeable when using Torch-TensorRT as a backend for torch.compile. Here, when the first inference call is made to a boxed function

```python
trt_mod = torch.compile(mod, backend="tensorrt")
trt_mod(*inputs)
```

the shapes of intermediate tensors are considered static and derived eagerly from the shapes of the input tensors.
If input shapes then change

```python
trt_mod(*other_sized_inputs)
```

TorchDynamo will start marking dimensions where the shapes differ as dynamic dimensions. However, it does not assume upper bounds and, critically, has no "optimal" or target size as required by TensorRT.
As such, when we see such [FIXED_SIZE, inf) ranges, we set a sane upper bound (max_int / max_dims) and take the optimal size to be the midpoint of that range. We highly recommend that users explicitly set dynamic shape bounds for both torch.export and torch.compile use cases, but this system can serve as a fallback: https://docs.pytorch.org/TensorRT/user_guide/compilation/dynamic_shapes.html
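A small sketch of how such a default range could be derived (illustrative only; MAX_DIMS and the exact formula are assumptions, not the actual Torch-TensorRT code):

```python
# Given an unbounded [min_size, inf) dynamic dimension, pick default
# max and opt sizes so TensorRT has a complete (min, opt, max) profile.
import sys

MAX_DIMS = 8  # assumed maximum tensor rank for this sketch

def default_shape_range(min_size):
    max_size = sys.maxsize // MAX_DIMS      # sane upper bound: max_int / max_dims
    opt_size = (min_size + max_size) // 2   # optimal size = midpoint of the range
    return min_size, opt_size, max_size
```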
Torch-TensorRT-RTX
TensorRT-RTX is a JIT-focused version of TensorRT that allows users to target many different hardware platforms with one artifact, making it easy for developers to deliver performance across the many variations of RTX GPUs. Previous versions of Torch-TensorRT provided source-code support for using TensorRT-RTX as a backend, giving users access to the same workflows as standard Torch-TensorRT with a more JIT-oriented optimization approach.
With 2.11 Torch-TensorRT-RTX has graduated to its own package that you can install with pip install torch-tensorrt-rtx. This package uses all the same APIs as Torch-TensorRT, just with a different backend. torch-tensorrt-rtx 2.11 targets TensorRT-RTX 1.3. For 2.11, TensorRT-RTX must be installed via a wheel distributed on developer.nvidia.com: https://developer.nvidia.com/tensorrt-rtx
Known Limitations:
- bf16 precision is generally supported; however, some models may still exhibit numerical accuracy issues. This will be addressed in future versions of TensorRT-RTX.
- There is a known accuracy issue when running Grouped Query Attention in TensorRT-RTX, which will be addressed in a future release of TensorRT-RTX.
run_llm int8 quantization
We have added support for performing post-training quantization in int8 precision from the command line using the run_llm tool.
You can apply int8 quantization backed by the TensorRT Model Optimizer toolkit using --quant_format int8:
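Conceptually, int8 PTQ maps float values to 8-bit integers with a scale factor. A minimal pure-Python sketch of symmetric per-tensor quantization (the actual quantization is performed by the TensorRT Model Optimizer toolkit, not this code):

```python
# Symmetric per-tensor int8 quantization: one scale maps floats
# into the signed range [-127, 127]. Illustrative only.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

q, s = quantize_int8([0.5, -1.27, 0.02])
approx = dequantize(q, s)  # close to the original values, within one scale step
```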
```shell
python run_llm.py --model meta-llama/Llama-3.1-8B --quant_format int8 --prompt "What is parallel programming?" --model_precision FP16 --num_tokens 128
```

Empty Tensor
We have added support for providing empty tensors (tensors with one or more zero-sized dimensions) as inputs to Torch-TensorRT compiled programs.
Under the hood, we use TensorRT's native empty tensor semantics. Empty tensors are marked by a 1-byte placeholder input to the engine. Both the Python and C++ runtimes support this feature.
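The placeholder idea can be sketched as follows (illustrative only, not the actual runtime code): a tensor with any zero-sized dimension has zero elements, but the engine binding still needs a non-null buffer, so at least one byte is allocated.

```python
# Sketch of the 1-byte placeholder for empty-tensor engine bindings.
from math import prod

def binding_size_bytes(shape, itemsize):
    numel = prod(shape)              # 0 if any dimension is 0
    return max(1, numel * itemsize)  # empty tensor -> 1-byte placeholder
```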
What's Changed
- filter out unsupported cuda versions by @lanluo-nvidia in #3990
- changing the setting_to_be_engine_invariant from tuple to set by @apbose in #3984
- fix the job name issue in Actions UI by @lanluo-nvidia in #3992
- Graph break overhead by @cehongwang in #3946
- upgrade torch from 2.10.dev to 2.11.dev by @lanluo-nvidia in #3989
- Added debugger example by @cehongwang in #3997
- Improve documentation after trying on a new machine. by @SandSnip3r in #4002
- Add venv install & cuda driver info to documentation by @SandSnip3r in #4016
- fix: Skip setting output tensor ownership in dryrun mode by @SandSnip3r in #4014
- DLFW 26.01 changes to main by @apbose in #4004
- Support modelopt pre-quantized model in llm by @lanluo-nvidia in #4003
- Dynamic memory allocation by @cehongwang in #3727
- Fix the converter issue caused by this missing unset_fake_temporarily by @wenbingl in #4006
- lowering pass: fully remove SymInt by @zewenli98 in #4001
- fix the layer info test failure and deal with potential segfault by @narendasan in #4042
- cherry pick 4033: skip llm test if modelopt is not installed from release branch to main by @lanluo-nvidia in #4034
- cherry pick 4038 from ngc release branch to main: skip failed test on orin until issue 3982 is fixed by @lanluo-nvidia in #4039
- cherry pick 4028: fix resource partitioner issue from release branch to main by @lanluo-nvidia in #4031
- cherry pick 4029: upgrade aarch64 base image from release branch to main by @lanluo-nvidia in #4030
- fix: torchtrtc precision setting logic by @yeetypete in #3883
- Empty tensor handling by @apbose in #3891
- fix: example argument issue raised in 4070 by @zewenli98 in #4071
- create torch_tensorrt_rtx wheel by @lanluo-nvidia in #4077
- upgrade trt from 10.14.1 to 10.15.1 by @lanluo-nvidia in #4075
- fix the cannot find libnvrtc-builtins.so.13.0 issue by @lanluo-nvidia in #4078
- fix: Refactor the cat converter and seperate out the mixed use by @narendasan in #4059
- scatter.src and scatter.value dynamic case by @apbose in #4062
- Fixed the bug caused by cpu offloading by @cehongwang in #4063
- Resource partitioner CI fix by @cehongwang in #4005
- Rank based logging for distributed examples by @apbose in #4081
- add int8 quantization support for llm models by @lanluo-nvidia in #4086
- upgrade rtx from 1.2 to 1.3 by @lanluo-nvidia in #4084
- A bunch of test fixes by @narendasan in #4088
- fix typo by @lanluo-nvidia in #4091
- 2.11 release cut by @lanluo-nvidia in #4092
- add torch_tensorrt_rtx to nightly and release ci/cd by @lanluo-nvidia in #4094
- cherry pick 4079 from main to 2.11 release by @lanluo-nvidia in #4096
- add rtx release build ci by @lanluo-nvidia in #4098
- Cherry Pick fused_rums_norm_lowering from release 2.10 to release 2.11 (#4057) by @lanluo-nvidia in #4109
- cherry pick 4104 from main to release 2.11 by @lanluo-nvidia in #4116
- add rtx documentation in release 2.11 by @lanluo-nvidia in #4112
- fix the llm test failing issue + add back cu126/cu128 by @lanluo-nvidia in #4118
- only support cu130 for aarch64 by @lanluo-nvidia in #4122
- cherry pick 4124 from main to 2.11 by @lanluo-nvidia in #4125
- cherrypick: split index.Tensor converter for bool vs int indexing (#4123) by @lanluo-nvidia in #4133
- cherry pick 4131 from main to release 2.11 by @lanluo-nvidia in #4149
New Contributors
- @SandSnip3r made their first contribution in #4002
- @yeetypete made their first contribution in #3883
Full Changelog: v2.10.0...v2.11.0