Integrating NVIDIA TensorRT-LLM with the Databricks Inference Stack


The Databricks Mosaic R&D team launched the first iteration of our inference service architecture only seven months ago; since then, we’ve made tremendous strides toward a scalable, modular, and performant platform that is ready to integrate new advances in the fast-moving generative AI landscape. In January 2024, we will start serving Large Language Models (LLMs) with a new inference engine built on NVIDIA TensorRT-LLM.

Introducing NVIDIA TensorRT-LLM

TensorRT-LLM is an open source library for state-of-the-art LLM inference. It consists of several components: first-class integration with NVIDIA’s TensorRT deep learning compiler, optimized kernels for key operations in language models, and communication primitives for efficient multi-GPU serving. These optimizations work seamlessly on inference services powered by NVIDIA Tensor Core GPUs and are a key part of how we deliver state-of-the-art performance.
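To make the end-to-end flow concrete, here is a minimal sketch of compiling and querying a model through TensorRT-LLM’s high-level Python API. The names used here (LLM, SamplingParams, tensor_parallel_size) follow recent TensorRT-LLM releases and should be treated as assumptions; the exact API surface varies by version.

```python
# A minimal sketch of compiling and querying a model with TensorRT-LLM's
# high-level Python API. Names (LLM, SamplingParams, tensor_parallel_size)
# follow recent releases and may differ in the version you have installed.
from tensorrt_llm import LLM, SamplingParams

# Constructing the LLM compiles the Hugging Face checkpoint into an optimized
# TensorRT engine; tensor_parallel_size > 1 exercises the multi-GPU
# communication primitives described above.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1)

params = SamplingParams(max_tokens=64, temperature=0.8, top_p=0.95)
for output in llm.generate(["Summarize TensorRT-LLM in one sentence."], params):
    print(output.outputs[0].text)
```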

Figure 1: Inference requests from multiple clients are aggregated by the TensorRT-LLM server. The server must solve a complex many-to-many optimization problem: incoming requests are dynamically grouped into batched tensors, and those tensors are then distributed across many GPUs.
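As an illustration of the first half of that problem, the toy sketch below (plain PyTorch, with made-up request inputs; it is not the TensorRT-LLM batch manager) shows how variable-length requests can be padded into a single batched tensor:

```python
# Illustrative only: a toy version of the "group requests into batched tensors"
# step, using plain PyTorch. The real batch manager also tracks per-request
# state and decides how batches are placed across GPUs.
import torch
from torch.nn.utils.rnn import pad_sequence

def batch_requests(token_id_lists, pad_id=0):
    """Pad a list of variable-length token-id lists into one batch."""
    seqs = [torch.tensor(ids, dtype=torch.long) for ids in token_id_lists]
    input_ids = pad_sequence(seqs, batch_first=True, padding_value=pad_id)
    attention_mask = (input_ids != pad_id).long()
    return input_ids, attention_mask

# Three concurrent requests of different lengths become one (3, 4) batch.
input_ids, attention_mask = batch_requests([[5, 7, 9], [3, 1], [8, 2, 6, 4]])
print(input_ids.shape)  # torch.Size([3, 4])
```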

For the last six months, we’ve been collaborating with NVIDIA to integrate TensorRT-LLM with our inference service, and we are excited about what we’ve been able to accomplish. Using TensorRT-LLM, we are able to deliver significant improvements in both time to first token and time per output token. As we discussed in an earlier post, these metrics are key indicators of the quality of the user experience when working with LLMs.
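For readers who want to track the same metrics, here is a small sketch of how they can be measured against any streaming endpoint; stream_tokens is a hypothetical stand-in for your serving client and is not part of TensorRT-LLM or the Databricks API.

```python
# A sketch of measuring time to first token (TTFT) and time per output token
# (TPOT) for any streaming generate call. stream_tokens is a hypothetical
# callable that yields tokens for a prompt; it assumes at least one token.
import time

def measure_latency(stream_tokens, prompt):
    start = time.perf_counter()
    first_token_time = None
    num_tokens = 0
    for _ in stream_tokens(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()
        num_tokens += 1
    end = time.perf_counter()
    ttft = first_token_time - start
    # Average decode latency over the tokens after the first one.
    tpot = (end - first_token_time) / max(num_tokens - 1, 1)
    return ttft, tpot
```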

Our collaboration with NVIDIA has been mutually advantageous. During the early access phase of the TensorRT-LLM project, our team contributed MPT model conversion scripts, making it faster and easier to serve an MPT model, whether downloaded directly from Hugging Face or pre-trained or fine-tuned on your own data using the MPT architecture. In turn, NVIDIA’s team augmented MPT model support by adding installation instructions and introducing quantization and FP8 support on H100 Tensor Core GPUs. We’re thrilled to have first-class support for the MPT architecture in TensorRT-LLM, as this collaboration not only benefits our team and customers, but also empowers the broader community to freely adapt MPT models for their specific needs with state-of-the-art inference performance.
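As a starting point, MPT checkpoints are standard Hugging Face models; the sketch below shows only the Hugging Face side of the workflow using the transformers library. The conversion scripts that turn such a checkpoint into a TensorRT-LLM engine live in the TensorRT-LLM repository’s MPT example, and their exact names and flags vary by version, so they are not reproduced here.

```python
# Loading an MPT checkpoint with standard Hugging Face transformers usage.
# TensorRT-LLM's MPT example consumes a checkpoint like this and converts it
# into an optimized engine; those conversion commands are version-specific
# and omitted here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mosaicml/mpt-7b"  # or your own pre-trained/fine-tuned MPT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,  # MPT ships custom modeling code
    torch_dtype="auto",
)
```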

Flexibility Through Plugins

Extending TensorRT-LLM with newer model architectures has been a smooth process. The inherent flexibility of TensorRT-LLM and its ability to add optimizations through plugins enabled our engineers to quickly modify it to support our unique modeling needs. This flexibility has not only accelerated our development process but also reduced the burden on the NVIDIA team of single-handedly supporting every user requirement.
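The sketch below is a deliberately simplified, purely conceptual illustration of the plugin idea; TensorRT-LLM’s real plugins are TensorRT plugins implemented in C++/CUDA and enabled through its build configuration, so none of the names below come from its API.

```python
# Purely conceptual: a toy registry that illustrates the design idea of
# swapping an optimized implementation in for a default one without changing
# the surrounding model code. This is not TensorRT-LLM's plugin mechanism.
import math

_PLUGINS = {}

def register_plugin(op_name):
    """Decorator that registers an optimized implementation for an op."""
    def decorator(fn):
        _PLUGINS[op_name] = fn
        return fn
    return decorator

def run_op(op_name, default_impl, *args, **kwargs):
    """Use a registered plugin if one exists, otherwise the default."""
    return _PLUGINS.get(op_name, default_impl)(*args, **kwargs)

def naive_softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# A team can later register a fused "softmax" kernel without touching callers.
print(run_op("softmax", naive_softmax, [0.1, 2.0, -1.0]))
```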

Python API for Easier Integration

TensorRT-LLM’s strong offline inference performance becomes even more valuable when used in tandem with its native in-flight (continuous) batching support. We’ve found that in-flight batching is a crucial component of maintaining high request throughput in high-traffic settings. Recently, the NVIDIA team has been working on Python bindings for the C++ batch manager, allowing TensorRT-LLM to be integrated seamlessly into our backend web server.

 

Figure 2: An illustration of in-flight (aka continuous) batching. Rather than waiting for the whole batch to finish (which would be held up by the length of Seq 2), the batch manager starts processing the next sequences in the queue (Seq 4 and Seq 5) as soon as other slots free up. (Source: NVIDIA.com)
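To show the scheduling policy from Figure 2 in runnable form, here is a toy, single-threaded sketch; it is not the TensorRT-LLM batch manager or its API, just an illustration of the idea that a freed slot is refilled from the queue immediately.

```python
# A toy sketch of the scheduling policy in Figure 2: a freed slot is refilled
# from the queue right away instead of waiting for the longest sequence in the
# batch to finish. Not the TensorRT-LLM batch manager.
from collections import deque

def inflight_batching(requests, max_slots=3):
    """requests: list of (name, number_of_tokens_to_generate)."""
    queue = deque(requests)
    slots = {}  # slot index -> [name, tokens_remaining]
    step = 0
    while queue or slots:
        # Fill any free slots from the queue (the "in-flight" part).
        for slot in range(max_slots):
            if slot not in slots and queue:
                name, length = queue.popleft()
                slots[slot] = [name, length]
        # One decoding step for every active sequence in the batch.
        for slot in list(slots):
            slots[slot][1] -= 1
            if slots[slot][1] == 0:
                print(f"step {step}: {slots[slot][0]} finished, slot {slot} freed")
                del slots[slot]
        step += 1

inflight_batching([("Seq 1", 2), ("Seq 2", 6), ("Seq 3", 3), ("Seq 4", 2), ("Seq 5", 2)])
```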

Ready to Begin Experimenting?

If you’re a Databricks customer, you can use our inference server via our AI Playground (currently in public preview) today. Just log in and find the Playground item in the left navigation bar under Machine Learning.

We want to thank the team at NVIDIA for being terrific collaborators throughout the journey of integrating TensorRT-LLM as the inference engine for hosting LLMs. We’ll be leveraging TensorRT-LLM in upcoming releases of Databricks inference products, and we’re looking forward to sharing our platform’s performance improvements over previous implementations. Stay tuned for an upcoming blog post next month with a deeper dive into the performance details.

 
