Index, the conference for engineers building search, analytics and AI applications at scale, took place last Thursday, November 2, with attendees packing out the Computer History Museum’s learning lab as well as the Index livestream.
The conference was a wonderful celebration of all the engineering innovation that goes into building the apps that permeate our lives. Many of the talks showcased real-world applications, such as search, recommendation engines and chatbots, and discussed the iterative processes through which they were implemented, tuned and scaled. We even had the opportunity to mark the 10th anniversary of RocksDB with a panel of engineers who worked on RocksDB early in its life. Index was truly a time for builders to learn from the experiences of others–through the session content or through impromptu conversations.
Design Patterns for Next-Gen Apps
The day kicked off with Venkat Venkataramani of Rockset setting the stage with lessons learned from building at scale, highlighting picking the right stack, developer velocity and the need to scale efficiently. He was joined by Confluent CEO Jay Kreps to discuss the convergence of data streaming and GenAI. A key consideration is getting the data needed to the right place at the right time for these apps. Incorporating the most recent activity–new facts about the business or customers–and indexing the data for retrieval at runtime using a RAG architecture is crucial for powering AI apps that need to be up to date with the business.
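To make the RAG idea concrete, here is a minimal sketch of the retrieval step: new facts are ingested into an index as they arrive, then retrieved at query time to ground the model's prompt. The class and function names, the toy keyword-overlap scoring, and the sample facts are all illustrative assumptions, not any speaker's actual implementation.

```python
# Minimal RAG retrieval sketch: index fresh facts, retrieve at query time,
# and prepend them as context. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class FactIndex:
    facts: list[str] = field(default_factory=list)

    def ingest(self, fact: str) -> None:
        # In a real system this would update a search index in near real time.
        self.facts.append(fact)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Toy relevance: rank facts by keyword overlap with the query.
        terms = set(query.lower().split())
        return sorted(
            self.facts,
            key=lambda f: len(terms & set(f.lower().split())),
            reverse=True,
        )[:k]

def build_prompt(index: FactIndex, question: str) -> str:
    # Retrieved, up-to-date context is placed ahead of the question
    # before the prompt is sent to the model.
    context = "\n".join(index.retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"

index = FactIndex()
index.ingest("Order 1042 shipped this morning")
index.ingest("The Q3 pricing page was updated yesterday")
print(build_prompt(index, "When did order 1042 ship?"))
```

The key property is that the index can absorb the most recent business activity continuously, so answers reflect current state rather than a stale precomputed snapshot.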
Venkat and Jay were followed by a slew of distinguished speakers, often going into deep technical details while sharing their experiences and takeaways from building and scaling search and AI applications at companies like Uber, Pinterest and Roblox. As the conference went on, several themes emerged from their talks.
Several presenters referenced an evolution within their organizations over the last several years toward real-time search, analytics and AI. Nikhil Garg of Fennel succinctly described how real time means two things: (1) low-latency online serving and (2) serving updated, not precomputed, results. Both matter.

In other talks, JetBlue’s Sai Ravruru and Ashley Van Name spoke about how streaming data is essential for their internal operational analytics and customer-facing app and website, while Girish Baliga described how Uber builds an entire path for their live updates, involving live ingestion through Flink and the use of live indexes to supplement their base indexes. Yexi Jiang highlighted how the freshness of content is critical in Roblox’s homepage recommendations because of the synergy across heterogeneous content, such as in instances where new friend connections or recently played games affect what is recommended for a user. At Whatnot, Emmanuel Fuentes shared how they face a multitude of real-time challenges–ephemeral content, channel surfing and the need for low end-to-end latency–in personalizing their livestream feed.
Shu Zhang of Pinterest recounted their journey from push-based home feeds ordered by time and relevance to real-time, pull-based ranking at query time. Shu provided some insight into the latency requirements Pinterest operates with on the ad serving side, such as being able to score 500 ads within 100ms. The benefits of real-time AI also go beyond the user experience and, as Nikhil and Jaya Kawale from Tubi pointed out, can result in more efficient use of compute resources when recommendations are generated in real time, only when needed, instead of being precomputed.
The need for real time is ubiquitous, and a number of speakers interestingly highlighted RocksDB as the storage engine or inspiration they turned to for delivering real-time performance.
Separation of Indexing and Serving
When operating at scale where performance matters, organizations have taken to separating indexing from serving to minimize the performance impact that compute-intensive indexing can have on queries. Sarthank Nandi explained that this was a challenge with the Elasticsearch deployment they had at Yelp, where every Elasticsearch data node was both an indexer and a searcher, resulting in indexing pressure slowing down search. Increasing the number of replicas does not solve the problem, as all the replica shards need to perform indexing as well, leading to a heavier indexing load overall.
Yelp rearchitected their search platform to overcome these performance challenges such that in their current platform, indexing requests go to a primary and search requests go to replicas. Only the primary performs indexing and segment merging, and replicas need only copy over the merged segments from the primary. In this architecture, indexing and serving are effectively separated, and replicas can service search requests without contending with indexing load.
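The separation described above can be sketched in a few lines: writes land only on a primary, which alone performs the compute-intensive merge, while replicas copy merged segments and serve searches without doing any indexing work. This is a toy model under stated assumptions; the class and method names are illustrative, not Yelp's actual API.

```python
# Sketch of indexing/serving separation: the primary indexes and merges,
# replicas only copy merged segments and answer searches.
class Primary:
    def __init__(self):
        self.pending = []   # documents indexed but not yet merged
        self.segments = []  # merged, immutable segments

    def index(self, doc: str) -> None:
        self.pending.append(doc)  # indexing load lands here only

    def merge_segments(self) -> list:
        # Compute-intensive segment merging happens on the primary alone.
        if self.pending:
            self.segments.append(" | ".join(self.pending))
            self.pending = []
        return self.segments

class Replica:
    def __init__(self):
        self.segments = []

    def sync(self, primary: Primary) -> None:
        # Replicas just copy over merged segments; no re-indexing.
        self.segments = list(primary.merge_segments())

    def search(self, term: str) -> bool:
        # Serving is unaffected by whatever indexing the primary is doing.
        return any(term in seg for seg in self.segments)

primary, replica = Primary(), Replica()
primary.index("deep dish pizza")
replica.sync(primary)
print(replica.search("pizza"))  # True
```

The design choice worth noting is that replicas receive already-merged segments, so adding replicas scales read capacity without multiplying the indexing load, unlike the every-node-indexes model.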
Uber faced a similar situation where indexing load on their serving system could affect query performance. In Uber’s case, their live indexes are periodically written to snapshots, which are then propagated back to their base search indexes. The snapshot computations caused CPU and memory spikes, which required additional resources to be provisioned. Uber solved this by splitting their search platform into a serving cluster and a cluster dedicated to computing snapshots, so that the serving system only needs to handle query traffic and queries can run fast without being impacted by index maintenance.
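A rough sketch of the live-plus-base index pattern described above: recent writes go into a small live index and are merged with base-index results at query time, while a periodic snapshot step folds live updates back into the base. All names are illustrative, and in the split architecture the snapshot step would run on a separate cluster rather than on the serving node modeled here.

```python
# Toy model of serving from a base index supplemented by a live index.
class SearchNode:
    def __init__(self):
        self.base = {}  # snapshot of older documents
        self.live = {}  # recent updates not yet folded into a snapshot

    def write(self, doc_id: str, text: str) -> None:
        self.live[doc_id] = text  # live ingestion path

    def snapshot(self) -> None:
        # Periodically fold live updates into the base index. In the
        # split architecture, this expensive step runs on a dedicated
        # cluster so serving nodes only handle query traffic.
        self.base.update(self.live)
        self.live.clear()

    def search(self, term: str) -> list:
        # Query both indexes; live entries shadow base entries.
        merged = {**self.base, **self.live}
        return [doc_id for doc_id, text in merged.items() if term in text]

node = SearchNode()
node.base["d1"] = "airport pickup"
node.write("d2", "downtown dropoff")
print(sorted(node.search("down")))  # ['d2']
```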
Architecting for Scale
Multiple presenters discussed some of their realizations and the changes they had to implement as their applications grew and scaled. Jaya shared that when Tubi had a small catalog, ranking the entire catalog for all users was possible using offline batch jobs. As their catalog grew, this became too compute intensive, and Tubi limited the number of candidates ranked or moved to real-time inference. At Glean, an AI-powered workplace search app, T.R. Vishwanath and James Simonsen discussed how greater scale gave rise to longer crawl backlogs on their search index. In meeting this challenge, they had to design for different aspects of their system scaling at different rates. They took advantage of asynchronous processing to allow different parts of their crawl to scale independently while also prioritizing what to crawl in situations when their crawlers were saturated.
Cost is a common concern when operating at scale. Describing storage tradeoffs in recommendation systems, Nikhil from Fennel explained that fitting everything in memory is cost prohibitive. Engineering teams should plan for disk-based alternatives, of which RocksDB is a good candidate, and when SSDs become costly, S3 tiering is needed. In Yelp’s case, their team invested in deploying search clusters in stateless mode on Kubernetes, which allowed them to avoid ongoing maintenance costs and autoscale to align with consumer traffic patterns, resulting in greater efficiency and a ~50% reduction in costs.
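The memory/disk/object-storage tradeoff can be pictured as a tiered lookup: hot keys served from memory, a fallback to a disk-based store (RocksDB in practice), and a final fallback to an S3-like cold tier. The dict-backed tiers and key names below are made-up stand-ins for illustration only.

```python
# Illustrative tiered lookup: memory first, then disk, then object storage.
class TieredStore:
    def __init__(self, memory: dict, disk: dict, s3: dict):
        # Each tier is modeled as a dict; in production these would be an
        # in-process cache, a RocksDB instance, and an object store.
        self.tiers = [("memory", memory), ("disk", disk), ("s3", s3)]

    def get(self, key: str):
        # Walk tiers from cheapest-to-read to most expensive.
        for name, tier in self.tiers:
            if key in tier:
                return tier[key], name  # value plus the tier that served it
        return None, None

store = TieredStore(
    memory={"user:1": "hot profile"},
    disk={"user:2": "warm profile"},
    s3={"user:3": "cold profile"},
)
print(store.get("user:2"))  # ('warm profile', 'disk')
```

The cost logic mirrors the talk: each successive tier is cheaper per byte but slower to read, so only the hottest working set needs to live in the expensive tiers.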
These were just some of the scaling experiences shared in the talks, and while not all scaling challenges may be evident from the start, it behooves organizations to be mindful of at-scale considerations early on and think through what it takes to scale in the longer term.
Want to Learn More?
The inaugural Index Conference was a great forum to hear from all these engineering leaders who are at the forefront of building, scaling and productionizing search and AI applications. Their presentations were full of learning opportunities for participants, and there’s a lot more knowledge that was shared in their complete talks.