Bing is advancing its search technology by integrating Large Language Models (LLMs) and Small Language Models (SLMs) to enhance search capabilities. The growing complexity of search queries has prompted the need for more efficient models: LLMs are powerful, but they can be costly to serve and slow at inference time. SLMs, by contrast, deliver approximately 100x higher throughput than LLMs while still processing search queries precisely.
Optimizing with TensorRT-LLM
To tackle the latency and cost challenges of larger models, Bing has incorporated NVIDIA TensorRT-LLM into its workflow to optimize SLM inference performance. This optimization is particularly visible in Deep Search, which uses SLMs to deliver optimal web results. The pipeline centers on understanding user intent and ensuring the relevance of results while balancing speed and quality; TensorRT-LLM reduces model inference time, improving the user experience without compromising result quality.
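To give a sense of what serving an SLM through TensorRT-LLM can look like, here is a minimal sketch using the library's high-level Python LLM API. The model path and prompts are placeholders, and the exact API surface varies across TensorRT-LLM releases; this is a shape, not Bing's actual serving code.

```python
# Minimal sketch: running a small language model through TensorRT-LLM's
# high-level Python API. The checkpoint path is a placeholder.
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT engine for the given checkpoint.
llm = LLM(model="path/to/slm-checkpoint")

params = SamplingParams(max_tokens=64, temperature=0.0)

# A batch of search-style inputs; batching is where much of the
# throughput advantage over per-request LLM calls comes from.
prompts = [
    "Rewrite the query 'best budget laptop 2024' for web retrieval",
    "Extract the user intent from: 'tensorrt-llm int8 smoothquant'",
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```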
Before optimization, the original Transformer model had a 95th percentile latency of 4.76 seconds per batch and a throughput of 4.2 queries per second. After implementing TensorRT-LLM, the 95th percentile latency dropped to 3.03 seconds per batch (roughly a 36% reduction) and throughput rose to 6.6 queries per second (roughly a 57% gain), resulting in a 57% reduction in operational costs.
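Metrics like these are straightforward to collect. The sketch below shows one way to measure per-batch p95 latency and query throughput; `run_batch` is a hypothetical stand-in for a model inference call, not part of any real API.

```python
import statistics
import time

def benchmark(run_batch, batches):
    """Measure p95 per-batch latency and query throughput.

    `run_batch` is a hypothetical inference callable; each element
    of `batches` is a list of queries.
    """
    latencies = []
    total_queries = 0
    start = time.perf_counter()
    for batch in batches:
        t0 = time.perf_counter()
        run_batch(batch)
        latencies.append(time.perf_counter() - t0)
        total_queries += len(batch)
    elapsed = time.perf_counter() - start

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95_latency = statistics.quantiles(latencies, n=100)[94]
    throughput_qps = total_queries / elapsed
    return p95_latency, throughput_qps
```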
Optimization Technique
The key optimization is SmoothQuant, a post-training quantization technique introduced in the research paper of the same name. It enables INT8 inference for both activations and weights while preserving accuracy by migrating quantization difficulty from outlier-heavy activations into the weights via a per-channel smoothing factor. TensorRT-LLM includes scripts for preprocessing model weights to apply this method.
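To make the idea concrete, here is a small NumPy sketch of SmoothQuant's core trick, reimplemented from the published formula. It is illustrative only, not TensorRT-LLM's actual preprocessing script: scaling activations down and weights up by the same per-channel factor leaves the layer's output unchanged but flattens activation outliers so both tensors quantize well to INT8.

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    """SmoothQuant-style smoothing for a linear layer y = x @ w.

    x: activations, shape (tokens, in_features)
    w: weights, shape (in_features, out_features)
    Per input channel j: s_j = max|x_j|**alpha / max|w_j|**(1 - alpha).
    Dividing x and multiplying w by s keeps x @ w identical while
    shifting quantization difficulty from activations into weights.
    """
    act_max = np.clip(np.abs(x).max(axis=0), 1e-5, None)
    wgt_max = np.clip(np.abs(w).max(axis=1), 1e-5, None)
    s = act_max**alpha / wgt_max ** (1.0 - alpha)
    return x / s, w * s[:, None]

def quantize_int8(t):
    """Symmetric per-tensor INT8 quantization (illustrative)."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale).astype(np.int8), scale

# The transform is mathematically a no-op on the layer output:
# (x / s) @ (s * w) == x @ w, but the smoothed tensors have far
# smaller dynamic range, so INT8 rounding error drops.
x = np.random.randn(16, 8).astype(np.float32)
x[:, 3] *= 50.0  # simulate an outlier activation channel
w = np.random.randn(8, 4).astype(np.float32)
xs, ws = smooth(x, w)
assert np.allclose(xs @ ws, x @ w, atol=1e-3)
```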
Benefits for Users
The transition to SLMs and the integration of TensorRT-LLM offer several advantages:
- Faster Search Results: Users experience quicker response times.
- Improved Accuracy: Enhanced SLM capabilities provide more accurate and contextualized results.
- Cost Efficiency: Reduced operational costs enable continued investment in innovations.
Looking Ahead
Bing remains committed to refining its search technology and enhancing the user experience through ongoing development of its LLMs and SLMs and deeper TensorRT-LLM integration, with further advancements planned that will continue to push the boundaries of search technology.