Earlybird: Real-Time Search at Twitter — A Summary (Part 1)

Sep 6, 2024

Introduction

Web search is a mature technology, built on many techniques for ranking pages and surfacing relevant results. Search engines are capable of querying billions of web pages within a few milliseconds.
It builds on decades of research, covering aspects such as:

Machine-Learned Ranking Algorithms: These algorithms use large datasets to predict the relevance of web pages to user queries.
Exploitation of Log Data: Techniques that use user behavior data (like clicks and queries) to improve search relevance.
Web Crawling: The process of systematically browsing the web to collect and index pages.
Systems Engineering for Large-Scale Search Engines: The architectural design and operational strategies that allow search engines to scale globally.

The web today is shifting from static pages toward social and real-time signals. This shift introduces new challenges and opportunities for information retrieval, which the authors of the Earlybird paper identify as the next frontier in the field.

Social and real-time signals are the data and interactions generated on social media platforms (likes, shares, comments, retweets, etc.) and other real-time channels (breaking news, live events, etc.) that reflect the current activities, behaviors, and interests of users. These signals are crucial for understanding trends, identifying relevant content, and providing personalized experiences in real time.
Real-time signals are critical for systems like Twitter’s search engine, where users expect to find the latest and most relevant information immediately. For example, during a major event, users might search for the most recent tweets or updates, and the search engine needs to prioritize the latest content over older posts.

Earlybird is the retrieval engine at the core of Twitter’s real-time search service.

Requirements of Real-Time Search

1. Low-Latency, High-Throughput Query Evaluation

To meet user expectations, the search engine must ensure that it can evaluate and return results quickly, even as new data (like tweets) is continuously being added.

2. High Ingestion Rate and Immediate Data Availability

Ingestion refers to the process of collecting, processing, and indexing new data so that it can be searched. In a real-time search engine, data needs to be ingested at a very high rate and made available for search almost immediately after it is created. Users expect to find the most recent content, and any delay in indexing new data can lead to outdated or irrelevant search results.

3. Concurrent Reads and Writes

In real-time search, the search engine must allow for simultaneous reading (querying) and writing (indexing) of data. This means that the system must be able to update its indexes with new data while simultaneously handling search queries without performance degradation or data inconsistency.

4. Dominance of the Temporal Signal

The temporal signal refers to the time at which a document (or in this case, a tweet) was created. In real-time search, the recency of the content is a crucial factor in determining its relevance. Giving recency this much weight ensures that users see the most current information first.
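
To make the idea of a dominant temporal signal concrete, here is a minimal sketch that returns matching tweets newest first. The `Tweet` class, its fields, and the `rank_by_recency` function are hypothetical names chosen for illustration; this is not Earlybird's actual ranking logic, which also takes other signals into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    created_at: int  # seconds since epoch (hypothetical field)
    text: str


def rank_by_recency(tweets, query_term):
    """Return the tweets that contain query_term, newest first."""
    term = query_term.lower()
    matches = [t for t in tweets if term in t.text.lower().split()]
    # Sort by creation time, descending; break ties on the larger tweet_id.
    return sorted(matches, key=lambda t: (t.created_at, t.tweet_id), reverse=True)


tweets = [
    Tweet(1, 1000, "earthquake reported downtown"),
    Tweet(2, 1500, "coffee time"),
    Tweet(3, 2000, "another earthquake aftershock"),
]
ranked = rank_by_recency(tweets, "earthquake")
print([t.tweet_id for t in ranked])  # [3, 1]
```

The older tweet still matches the query, but it is listed after the newer one because creation time is the primary sort key.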

Background and Related Work

1. Distributed Search Architectures

Geographically Distributed Systems: A web search service has a distributed architecture. User queries are routed to the closest data centers based on factors like query load, network latency, and even electricity costs. These data centers consist of large clusters of servers, each responsible for a portion of the overall task.

Document-Partitioned Distributed Search Architecture: In this model, the document collection (e.g., web pages) is divided into segments (shards), with each server responsible for a specific segment.
A broker server coordinates the queries, forwarding them to the appropriate servers and aggregating the results.
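
The broker pattern described above can be sketched as a simple scatter-gather loop. This is only a toy, in-process illustration: the `Shard` and `Broker` classes and the term-count scoring are assumptions made for the example, and a real system would send the query to remote servers in parallel.

```python
import heapq


class Shard:
    """One partition server holding a subset of the documents."""

    def __init__(self, docs):
        self.docs = dict(docs)  # doc_id -> text

    def search(self, term, k):
        # Toy scoring (assumption): how often the term occurs in the document.
        term = term.lower()
        hits = []
        for doc_id, text in self.docs.items():
            score = text.lower().split().count(term)
            if score > 0:
                hits.append((score, doc_id))
        return heapq.nlargest(k, hits)


class Broker:
    """Forwards a query to every shard and merges the partial results."""

    def __init__(self, shards):
        self.shards = shards

    def search(self, term, k):
        partial = []
        for shard in self.shards:  # scatter
            partial.extend(shard.search(term, k))
        return heapq.nlargest(k, partial)  # gather: global top k


shards = [
    Shard({1: "real time search", 2: "batch indexing"}),
    Shard({3: "search search engines", 4: "web crawling"}),
]
broker = Broker(shards)
print(broker.search("search", 2))  # [(2, 3), (1, 1)]
```

Each shard only needs to return its own top k, because no document outside a shard's local top k can appear in the global top k.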

Replication and Caching: To ensure robustness and efficiency, search engines replicate data across multiple servers and use caching techniques to store frequently accessed data or query results.

2. Query Evaluation

Query evaluation refers to the process of determining which documents (or tweets, in the case of Twitter) match a user’s query and how to rank those results. Typically, query evaluation occurs in two phases:

First Phase: A fast, approximate algorithm generates a candidate list of potentially relevant documents, scoring them with inexpensive signals such as a relevance measure like BM25 combined with a query-independent measure like PageRank.
Second Phase: A more sophisticated, often machine-learned algorithm re-ranks the candidates based on richer features such as term proximity or anchor text.
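
The two phases can be sketched as follows. This is a simplified illustration rather than a real ranking pipeline: term overlap stands in for the cheap first-phase score (instead of BM25), and a hand-written term-proximity feature stands in for the machine-learned second phase. All function names are hypothetical.

```python
def first_phase(query_terms, docs, k):
    """Cheap candidate generation: count how many query terms each doc contains."""
    scored = []
    for doc_id, text in docs.items():
        tokens = set(text.lower().split())
        overlap = sum(1 for term in query_terms if term in tokens)
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by smaller doc_id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def proximity_score(query_terms, text):
    """Richer feature: query terms that appear close together score higher."""
    tokens = text.lower().split()
    positions = [
        [i for i, tok in enumerate(tokens) if tok == term] for term in query_terms
    ]
    if any(not p for p in positions):
        return 0.0
    # Sum the smallest gap between occurrences of consecutive query terms.
    total_gap = 0
    for left, right in zip(positions, positions[1:]):
        total_gap += min(abs(a - b) for a in left for b in right)
    return 1.0 / (1 + total_gap)


def second_phase(query_terms, docs, candidates):
    """Re-rank the candidates using the more expensive proximity feature."""
    return sorted(
        candidates, key=lambda d: (-proximity_score(query_terms, docs[d]), d)
    )


docs = {
    1: "search is hard in real life and takes time",
    2: "real time search at twitter",
    3: "cooking recipes",
}
query = ["real", "time"]
candidates = first_phase(query, docs, k=2)
print(candidates)                             # [1, 2]
print(second_phase(query, docs, candidates))  # [2, 1]
```

Both documents contain both query terms, so the first phase cannot separate them. The second phase moves document 2 to the top because "real" and "time" are adjacent there.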

Inverted Index: The backbone of most retrieval systems, an inverted index maps terms to the documents in which they appear. These indexes are typically sorted by document IDs or impact scores (a pre-computed score), enabling efficient query evaluation.
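
As a small illustration, the sketch below builds an inverted index with postings lists sorted by document ID and answers a two-term AND query by intersecting two lists. The function names are illustrative, and the index is a plain in-memory dictionary, not the structure Earlybird uses.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of doc IDs that contain it, sorted by doc ID."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


def intersect(a, b):
    """Merge-style intersection of two postings lists sorted by doc ID."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


docs = {1: "real time search", 2: "web search engine", 3: "real time web"}
index = build_inverted_index(docs)
print(index["search"])                          # [1, 2]
print(intersect(index["real"], index["time"]))  # [1, 3]
```

Keeping postings sorted by document ID is what allows the intersection to run in a single linear pass over both lists.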

3. Other Considerations

In traditional web search, content acquisition involves crawling the web, extracting links, and continuously updating the index. This is a complex process that balances latency, throughput, and freshness. The web is not homogeneous, so different types of content (e.g., news, images, academic papers) often require specialized handling. For instance, news content is crawled more frequently, and its relevance is more sensitive to recency. These content types divide the web into so-called “verticals”.
Integrating different types of search results (e.g., mixing news articles with general web search results) is a challenge that traditional search engines face. This is known as vertical integration. Twitter has a similar challenge in deciding when and where to display tweets versus user profiles or other relevant content.

Architecture

Ingestion Pipeline: This pipeline is responsible for processing incoming tweets. Tweets enter the ingestion pipeline after being posted by users on Twitter. The pipeline performs several preprocessing tasks, such as tokenizing the text (breaking it down into words) and annotating the tweet with metadata (like language or user information).
Updater: The primary role of the Updater is to continuously update the resonance signals of tweets. Resonance signals refer to dynamic, real-time metrics that evolve over time, such as the number of retweets, likes (favorites), replies, or other forms of user engagement with a tweet.
The Updater component pushes these dynamic updates (i.e., resonance signals) to the Earlybird servers, ensuring that the relevance ranking of tweets is adjusted in real-time. Even after a tweet has been indexed, its ranking in search results can change as its engagement metrics fluctuate.
Earlybird Servers: The Earlybird servers are the core of the real-time search architecture. They are responsible for indexing the tweets and serving search queries. After the tweets are processed in the ingestion pipeline, they are distributed across multiple Earlybird servers using a hash partitioning method.
Blender Servers (Front-End): The Blender servers handle user search queries. When a user searches for something on Twitter, the query is first sent to a Blender server, which parses it and distributes it to multiple Earlybird servers.
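
The hash partitioning step mentioned for the Earlybird servers can be sketched as below. The summary does not say which key or hash function is used, so hashing the tweet ID with MD5 is purely an assumption for illustration, as are the names `partition_for` and `assign`.

```python
import hashlib


def partition_for(tweet_id, num_partitions):
    """Pick a partition for a tweet from a stable hash of its ID (assumed key)."""
    digest = hashlib.md5(str(tweet_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def assign(tweet_ids, num_partitions):
    """Group tweet IDs by the partition they hash to."""
    partitions = {p: [] for p in range(num_partitions)}
    for tweet_id in tweet_ids:
        partitions[partition_for(tweet_id, num_partitions)].append(tweet_id)
    return partitions


partitions = assign(range(1, 11), num_partitions=3)
for partition, ids in partitions.items():
    print(partition, ids)
```

A stable hash is used instead of Python's built-in `hash()` so that the same tweet ID maps to the same partition across processes and restarts.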

Key Flow of Data

Tweet Creation: A user posts a tweet, which enters the ingestion pipeline.
Processing: The tweet is tokenized and annotated with metadata.
Hash Partitioning: The tweet is distributed to an appropriate Earlybird server based on hash partitioning.
Indexing: The Earlybird server indexes the tweet in real-time.
Query Handling: When a user performs a search, the Blender server parses the query and sends it to multiple Earlybird servers.
Personalization: The query may include personalized information based on the user’s social connections.
Result Aggregation: Earlybird servers return the most relevant tweets, and the Blender server merges and re-ranks these results before sending them back to the user.
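
The final aggregation step can be sketched as a k-way merge. This assumes, for illustration only, that each Earlybird server returns `(score, tweet_id)` pairs already sorted from highest to lowest score. The `blend` function is a hypothetical name, and the real Blender also re-ranks results, which this sketch leaves out.

```python
import heapq
from itertools import islice


def blend(per_server_results, k):
    """Merge descending-sorted (score, tweet_id) lists and keep the top k."""
    merged = heapq.merge(*per_server_results, reverse=True)
    return list(islice(merged, k))


server_a = [(0.9, 101), (0.4, 102)]
server_b = [(0.8, 201), (0.7, 202)]
server_c = [(0.5, 301)]
print(blend([server_a, server_b, server_c], k=3))
# [(0.9, 101), (0.8, 201), (0.7, 202)]
```

Because `heapq.merge` is lazy, the merge stops as soon as k results have been produced and never reads the tails of the per-server lists.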

That is all for this part. In the next part, we will go through an overview of Earlybird.
