LinkedIn has introduced a semantic capability in its content search engine to improve search results for complex queries. This enhancement addresses the limitations of the previous system, which struggled with queries that used natural language or included complex concepts. The new system aims to provide high-quality, engaging posts by optimizing two key metrics: on-topic rate and long-dwells.
Objectives
- On-topic rate: The percentage of returned posts that are well written and answer the query.
- Long-dwells: The time a searcher spends on each returned post, a signal of engagement.
High-Level Design
The content search engine consists of two layers:
- Retrieval Layer: Selects a few thousand candidate posts from billions of posts.
- Multi-Stage Ranking Layer: Scores these candidate posts in two stages and returns a ranked list.
Retrieval Layer
The retrieval layer includes two retrievers:
- Token-Based Retriever (TBR): Selects posts containing the exact keywords from the query.
- Embedding-Based Retriever (EBR): Uses a two-tower AI model to select posts based on semantic matching. This model pre-computes post embeddings and stores them for efficient retrieval.
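As a rough illustration of the embedding-based retrieval idea, the sketch below precomputes post embeddings and scores a query against them by cosine similarity. The `embed` function and all names are assumptions standing in for the query and post towers of the real two-tower model:

```python
import numpy as np

# Hypothetical embedding function standing in for the query/post towers
# of the two-tower model; the hashing trick and dimension are assumptions.
def embed(text: str, dim: int = 8) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)  # unit-normalize so dot product = cosine

# Post embeddings are precomputed offline and stored for retrieval.
posts = ["how to negotiate salary", "python tips", "career growth advice"]
post_embeddings = np.stack([embed(p) for p in posts])

def ebr_retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    scores = post_embeddings @ q          # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]         # indices of the k best posts
    return [posts[i] for i in top]
```

In production the post tower runs offline over billions of posts, so only the query tower and a nearest-neighbor lookup run at query time.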
Multi-Stage Ranking Layer
This layer scores the much smaller candidate set in real time with models that capture interactions between query and post features. Ranking proceeds in two stages:
- L1 Ranking Stage: Uses a simple model to score and filter posts.
- L2 Ranking Stage: Uses a complex model to score the filtered posts and prepare the final search results.
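A minimal sketch of the funnel above, assuming illustrative stand-in scoring functions (the feature names, weights, and cutoff are hypothetical, not LinkedIn's actual models):

```python
# Hypothetical two-stage ranking: a cheap L1 model prunes candidates,
# then a heavier L2 model scores the survivors.
def l1_score(post: dict) -> float:
    # Cheap signal, e.g. a keyword-overlap proxy (illustrative only).
    return post["keyword_overlap"]

def l2_score(post: dict) -> float:
    # More expensive blend of on-topicness and long-dwell predictions;
    # the 0.6/0.4 weights are arbitrary for this sketch.
    return 0.6 * post["on_topic"] + 0.4 * post["long_dwell"]

def rank(candidates: list[dict], l1_keep: int = 2) -> list[dict]:
    survivors = sorted(candidates, key=l1_score, reverse=True)[:l1_keep]
    return sorted(survivors, key=l2_score, reverse=True)
```

The point of the split is cost control: the simple L1 model can afford to score thousands of posts, while the complex L2 model only sees the few that survive.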
Models and Features
- On-topicness Prediction Model: Uses query and post text embeddings to produce an on-topicness score.
- Long-Dwell Prediction Model: Uses a variety of features, including query text, post text, searcher and author features, to produce a long-dwell score.
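Since the on-topicness model consumes query and post text embeddings, a simple stand-in for its output is the cosine similarity of those two embeddings (the real model is learned; this function is only an assumption for illustration):

```python
import numpy as np

def on_topicness(query_emb: np.ndarray, post_emb: np.ndarray) -> float:
    # Cosine similarity of query and post text embeddings, used here as
    # a hypothetical stand-in for the learned on-topicness score.
    return float(
        query_emb @ post_emb
        / (np.linalg.norm(query_emb) * np.linalg.norm(post_emb))
    )
```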
Efficient Serving
To ensure low latency, several optimizations are made:
- Limiting the number of posts scanned during the approximate nearest neighbor search.
- Precomputing text embeddings of all posts and storing them in a key-value store.
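Both optimizations can be sketched together: a key-value map of precomputed embeddings, and a hard cap on how many posts the nearest-neighbor search scans. All names and the scan strategy are assumptions; a real ANN index (e.g. HNSW or IVF) would choose which posts to visit far more cleverly:

```python
import numpy as np

# Precomputed post embeddings in a key-value map (post_id -> vector),
# a stand-in for the production key-value store.
embedding_store: dict[str, np.ndarray] = {
    f"post_{i}": np.random.default_rng(i).normal(size=4) for i in range(1000)
}

def approx_nn(query_vec: np.ndarray, max_scan: int = 100, k: int = 5):
    # Cap the number of posts scanned to bound latency; this naive sketch
    # simply truncates the scan rather than using a proper ANN index.
    scanned = list(embedding_store.items())[:max_scan]
    scored = [(pid, float(vec @ query_vec)) for pid, vec in scanned]
    return sorted(scored, key=lambda x: -x[1])[:k]
```

Bounding the scan trades a little recall for predictable latency, which is the usual deal in approximate nearest neighbor serving.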
Results and Future Work
The new content search engine has improved both the on-topic rate and long-dwells by more than 10%, leading to increased engagement on LinkedIn.
LinkedIn plans to evolve the on-topic rate metric to better capture quality expectations across different types of queries, in part by leveraging large language models (LLMs) in the ranking layer.