Serving LLMs and Batch Inference at Scale - Adopting Ray at Tripadvisor
I was fortunate enough to be invited to speak at last year's Ray Summit 2025 about our journey integrating Ray at Tripadvisor. Ray currently supports our AI services and LLM inference, as well as our offline batch inference, model training and feature engineering pipelines. In this talk I discuss the compute, cost and platform efficiencies we gained by integrating Ray clusters into our ML platform stack.
For a bit more context, Ray is an open-source distributed compute framework that has become one of the default substrates for ML workloads at scale. What made it compelling for us is that we could scale heterogeneous hardware easily, which let us serve very different shapes of work (latency-sensitive online inference alongside throughput-sensitive batch jobs) without having to maintain separate systems for each. Great for reliability and maintenance.
On the AI-services and LLM-inference side, we're still developing in this space, but Ray lets us treat a model server as a first-class distributed application: scaling replicas, co-locating pre- and post-processing with the model, and mixing CPU and GPU stages in one pipeline, rather than stitching a bespoke serving stack on top of a generic container platform. For latency-sensitive LLM traffic in particular, Ray Serve scales replicas on queue depth and in-flight requests rather than CPU or memory, which tracks LLM load far more accurately than a generic Horizontal Pod Autoscaler would. We're also looking at Ray Serve with vLLM for continuous batching and KV-cache-efficient serving, which should push utilisation on the LLM side up another notch.
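To make the queue-depth idea concrete, here's a minimal stdlib-only sketch of the scaling rule this style of autoscaler applies: pick a target number of ongoing requests per replica, then size the replica count from total in-flight work. The function name, the target of 5, and the replica bounds are illustrative assumptions, not Ray Serve's actual implementation.

```python
import math

def desired_replicas(in_flight: int, target_per_replica: int = 5,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Size a deployment from in-flight requests rather than CPU/memory.

    Replicas scale with queued + ongoing work, which follows LLM load
    (long, bursty generations) far better than host-level CPU
    utilisation does.
    """
    wanted = math.ceil(in_flight / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

# A burst of 37 concurrent generations at a target of 5 per replica
# scales out to 8 replicas; an idle deployment stays at the floor.
print(desired_replicas(37))  # 8
print(desired_replicas(0))   # 1
```

The key design point is the signal, not the arithmetic: in-flight requests capture how long generations actually take, where CPU-based signals sit near-idle while the GPU is saturated.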
The batch inference, model training and feature engineering pipelines benefited from the same underlying ideas. Feature engineering and batch inference in particular are embarrassingly parallel but memory-heavy, and running them on Ray meant we could lean on its streaming execution model to decouple CPU-bound steps (data loading and preprocessing, for example) from GPU-bound steps (embedding generation, model forward passes, etc.), whilst keeping the expensive hardware saturated. This was particularly important when processing billions of rows across many boxes in AWS, where cost efficiency is key! Model training also benefited from these same scheduling primitives without us having to maintain a second system, which is great for reliability.
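As a rough illustration of that decoupling (a stdlib-only sketch, not Ray Data itself), the shape is a two-stage streaming pipeline: a CPU-bound preprocessing stage feeds a bounded queue, and a separate "GPU" stage drains it, so batches flow through without ever materialising the whole dataset. Stage names, the batch size, and the stand-in workloads are all illustrative.

```python
import queue
import threading

SENTINEL = object()  # signals end of stream

def preprocess_stage(rows, out_q):
    """CPU-bound stage: normalise rows in small batches and stream them on."""
    batch = []
    for row in rows:
        batch.append(row.strip().lower())  # stand-in for real preprocessing
        if len(batch) == 4:
            out_q.put(batch)
            batch = []
    if batch:
        out_q.put(batch)
    out_q.put(SENTINEL)

def inference_stage(in_q, results):
    """'GPU' stage: consume batches as they arrive, keeping the device busy."""
    while True:
        batch = in_q.get()
        if batch is SENTINEL:
            break
        results.extend(len(text) for text in batch)  # stand-in for a forward pass

rows = [f"  Row-{i}  " for i in range(10)]
handoff = queue.Queue(maxsize=2)  # bounded: backpressure if inference falls behind
results = []

producer = threading.Thread(target=preprocess_stage, args=(rows, handoff))
consumer = threading.Thread(target=inference_stage, args=(handoff, results))
producer.start(); consumer.start()
producer.join(); consumer.join()

print(len(results))  # 10 rows processed end to end
```

The bounded queue is what keeps memory flat at billions of rows: the cheap CPU stage can only run a couple of batches ahead of the expensive stage, so neither stage stalls and neither buffers the dataset.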
If you work in inference infrastructure, LLM serving, or large-scale ML systems, it would be great to hear how you're approaching the same tradeoffs! The video above goes into more depth on the architecture and a few of the production pipelines we have, though we've progressed significantly from the time of this talk! Hopefully I'll get to speak at this year's one too about the progress we've made, who knows.