Bing is advancing its search technology by integrating Large Language Models (LLMs) and Small Language Models (SLMs) to enhance search capabilities. The growing complexity of search queries has prompted the need for more efficient models: LLMs are powerful, but they can be costly to serve and slow at inference time. SLMs, by contrast, deliver approximately 100x higher throughput than LLMs while still processing search queries precisely.
Optimizing with TensorRT-LLM
To tackle the latency and cost challenges of larger models, Bing has incorporated NVIDIA TensorRT-LLM into its workflow to optimize SLM inference performance. This optimization is particularly visible in Deep Search, which uses SLMs to deliver optimal web results. The pipeline centers on understanding user intent and ensuring the relevance of results while balancing speed and quality; TensorRT-LLM reduces model inference time, improving the user experience without compromising result quality.
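To give a sense of what serving an SLM through TensorRT-LLM can look like, here is a minimal sketch using the library's high-level Python LLM API. The model path and prompts are placeholders, and the exact API surface varies across TensorRT-LLM releases; this is a shape, not Bing's actual serving code.

```python
# Minimal sketch: running a small language model through TensorRT-LLM's
# high-level Python API. The checkpoint path is a placeholder.
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT engine for the given checkpoint.
llm = LLM(model="path/to/slm-checkpoint")

params = SamplingParams(max_tokens=64, temperature=0.0)

# A batch of search-style inputs; batching is where much of the
# throughput advantage over per-request LLM calls comes from.
prompts = [
    "Rewrite the query 'best budget laptop 2024' for web retrieval",
    "Extract the user intent from: 'tensorrt-llm int8 smoothquant'",
]

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```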
Before optimization, the original Transformer model had a 95th percentile latency of 4.76 seconds per batch and a throughput of 4.2 queries per second. After implementing TensorRT-LLM, the 95th percentile latency dropped to 3.03 seconds per batch (roughly a 36% reduction) and throughput rose to 6.6 queries per second (roughly a 57% gain), resulting in a 57% reduction in operational costs.
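Metrics like these are straightforward to collect. The sketch below shows one way to measure per-batch p95 latency and query throughput; `run_batch` is a hypothetical stand-in for a model inference call, not part of any real API.

```python
import statistics
import time

def benchmark(run_batch, batches):
    """Measure p95 per-batch latency and query throughput.

    `run_batch` is a hypothetical inference callable; each element
    of `batches` is a list of queries.
    """
    latencies = []
    total_queries = 0
    start = time.perf_counter()
    for batch in batches:
        t0 = time.perf_counter()
        run_batch(batch)
        latencies.append(time.perf_counter() - t0)
        total_queries += len(batch)
    elapsed = time.perf_counter() - start

    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95_latency = statistics.quantiles(latencies, n=100)[94]
    throughput_qps = total_queries / elapsed
    return p95_latency, throughput_qps
```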
Optimization Technique
The key optimization is SmoothQuant, a post-training quantization technique introduced in the research paper of the same name. It enables INT8 inference for both activations and weights while preserving accuracy by migrating quantization difficulty from outlier-heavy activations into the weights via a per-channel smoothing factor. TensorRT-LLM includes scripts for preprocessing model weights to apply this method.
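To make the idea concrete, here is a small NumPy sketch of SmoothQuant's core trick, reimplemented from the published formula. It is illustrative only, not TensorRT-LLM's actual preprocessing script: scaling activations down and weights up by the same per-channel factor leaves the layer's output unchanged but flattens activation outliers so both tensors quantize well to INT8.

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    """SmoothQuant-style smoothing for a linear layer y = x @ w.

    x: activations, shape (tokens, in_features)
    w: weights, shape (in_features, out_features)
    Per input channel j: s_j = max|x_j|**alpha / max|w_j|**(1 - alpha).
    Dividing x and multiplying w by s keeps x @ w identical while
    shifting quantization difficulty from activations into weights.
    """
    act_max = np.clip(np.abs(x).max(axis=0), 1e-5, None)
    wgt_max = np.clip(np.abs(w).max(axis=1), 1e-5, None)
    s = act_max**alpha / wgt_max ** (1.0 - alpha)
    return x / s, w * s[:, None]

def quantize_int8(t):
    """Symmetric per-tensor INT8 quantization (illustrative)."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale).astype(np.int8), scale

# The transform is mathematically a no-op on the layer output:
# (x / s) @ (s * w) == x @ w, but the smoothed tensors have far
# smaller dynamic range, so INT8 rounding error drops.
x = np.random.randn(16, 8).astype(np.float32)
x[:, 3] *= 50.0  # simulate an outlier activation channel
w = np.random.randn(8, 4).astype(np.float32)
xs, ws = smooth(x, w)
assert np.allclose(xs @ ws, x @ w, atol=1e-3)
```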
Benefits for Users
The transition to SLMs and the integration of TensorRT-LLM offer several advantages:
- Faster Search Results: Users experience quicker response times.
- Improved Accuracy: Enhanced SLM capabilities provide more accurate and contextualized results.
- Cost Efficiency: Reduced operational costs enable continued investment in innovations.
Looking Ahead
Bing remains committed to refining its search technology and enhancing the user experience through ongoing development of its LLMs and SLMs and deeper TensorRT-LLM integration, with further advancements planned that will continue to push the boundaries of search technology.