Nvidia says its new open-source TensorRT-LLM software can dramatically boost the performance of large language models (LLMs) on its GPUs. According to the company, TensorRT-LLM doubles the performance of its H100 compute GPU on GPT-J, an LLM with six billion parameters. Importantly, the software delivers this improvement without re-training the model.
Nvidia developed TensorRT-LLM specifically to speed up LLM inference, and performance graphs provided by Nvidia indeed show a 2X speed boost for its H100 thanks to software optimizations. A particular standout feature of Nvidia’s TensorRT-LLM is its innovative in-flight batching technique. This method addresses the dynamic and diverse workloads of LLMs, whose requests can vary greatly in their computational demands.
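To illustrate why batching strategy matters when request sizes vary, here is a toy Python simulation. It is only a sketch and does not use TensorRT-LLM or reflect Nvidia's implementation. It assumes that in-flight batching means a finished request leaves the batch right away and a waiting request takes the free slot, and it measures cost in abstract "decode steps" (one generated token per active request per step). The function names and the sample request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch must run until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        steps += max(batch)  # short requests idle until the longest finishes
    return steps


def inflight_batching_steps(lengths, batch_size):
    """Steps needed when free slots are refilled as soon as a request ends."""
    waiting = deque(lengths)   # requests not yet started (tokens to generate)
    active = []                # remaining tokens for requests in the batch
    steps = 0
    while waiting or active:
        # Fill any free slots from the waiting queue before the next step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Every active request generates one token; finished ones drop out.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


# Hypothetical workload: output lengths (in tokens) of four requests.
requests = [2, 8, 3, 8]
print("static:", static_batching_steps(requests, batch_size=2))
print("in-flight:", inflight_batching_steps(requests, batch_size=2))
```

In this toy model the gap between the two counts grows as request lengths become more uneven, which is the kind of workload variation the technique is said to target. Real speedups depend on the hardware, the model, and scheduler details that the sketch ignores.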