From MapReduce to Real-Time ML: The Streaming Revolution in Feature Platforms

In 2004, Jeff Dean and Sanjay Ghemawat published the MapReduce paper, which cemented a mental model that shaped data infrastructure for the next decade: collect everything, process it all at once, write out the results. It was elegant, it fit the infrastructure of the time, it scaled to Google's web index, and it convinced a generation of engineers that the world could be modelled as a series of batch jobs.

Eventually, machine learning arrived and introduced a whole host of additional complexity for data engineering and infrastructure. Batch was ingrained, but ML exposed its limits. This post traces the arc from MapReduce through the streaming revolution to the machine learning feature platforms available today. I'll close by introducing Thyme, the streaming-first feature platform I've been building.

The Batch Era

MapReduce gave us a programming model where you wrote a map function and a reduce function, pointed them at a cluster, and waited. The framework handled distribution, fault tolerance, and scheduling, so you didn't need to think about where your data lived or how many machines processed it. Hadoop brought this to the masses in the 2006-2008 era, Spark massively accelerated it from 2014 onwards, and by the mid-2010s a recognisable pattern had emerged for production machine learning data processing:

  1. Events flow into blob storage / data lake throughout the day
  2. A nightly job aggregates those events into feature tables
  3. The feature table gets copied to a serving layer
  4. Models read those features at prediction time
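As a toy illustration of the four steps above (pure Python, with plain dicts standing in for the data lake, feature table, and online store - all names here are invented for illustration):

```python
from collections import defaultdict
from datetime import date

# Step 1: toy events as they might land in a data lake during the day.
events = [
    {"user": "u1", "amount": 30.0, "day": date(2024, 3, 1)},
    {"user": "u1", "amount": 12.5, "day": date(2024, 3, 1)},
    {"user": "u2", "amount": 99.0, "day": date(2024, 3, 1)},
]

def nightly_feature_job(events):
    """Step 2: aggregate the day's events into a feature table."""
    totals = defaultdict(float)
    for e in events:
        totals[e["user"]] += e["amount"]
    return dict(totals)

# Step 3: "copy" the feature table to an online store (here, just a dict).
online_store = nightly_feature_job(events)

# Step 4: the model reads a possibly stale feature at prediction time.
print(online_store["u1"])  # 42.5
```

The staleness problem is visible even in the toy: anything that happens after the nightly job runs is invisible to the model until the next run.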

The core issue with this structure is that features were hours old at best and a full day old at worst. Feature staleness was a fact of life, and largely still is.

Business questions, meanwhile, are real-time: banking fraud may have occurred in the last five minutes, which is no good when your underlying data is a day old. The infrastructure of the time answered "what happened yesterday?".

Batch staleness gap

Stream processing did exist in the early 2010s (Storm was open-sourced in September 2011; Samza entered the Apache Incubator in mid-2013), but the MapReduce mental model was so dominant that the ambition stopped at "retraining models more often" or "faster batch jobs" rather than rethinking the architecture of feature engineering.

The Streaming Breakthrough

MillWheel (Google, VLDB 2013) was the first system to implement event-time processing, watermarks, and exactly-once semantics at internet scale. Tyler Akidau, who was on that team, then wrote the excellent blog post Streaming 101: The world beyond batch, reframing data processing and giving the broader industry a conceptual vocabulary for streaming. The core insight was deceptively simple: there are two kinds of time, event time (when something happened) and processing time (when your system sees it). Batch processing sidesteps this by pretending the problem doesn't exist, i.e. it waits for a "complete" dataset, but completeness is a fiction in distributed systems (late-arriving events from mobile devices, network delays, buffering, and so on).

Event time skew

So Akidau laid out a framework for thinking about this properly. He framed it around four questions that every data processing system has to answer:

  1. What results are being computed? (the transformations i.e. sums, averages, joins, whatever)
  2. Where in event time are results grouped? (windowing - how you slice an infinite stream into finite chunks)
  3. When in processing time are results materialised? (watermarks and triggers - when do you actually emit output?)
  4. How do refinements relate to each other? (accumulation - is each emission a replacement, a delta, or a retraction?)

If we think about machine learning feature computation, the windowing question is particularly important - a fixed window groups events into non-overlapping time buckets (e.g. all product ratings for the month of March, and then April). A sliding window gives you a rolling aggregate (the last 30 days of product ratings, continually updated). A session window groups events by activity, with gaps defining boundaries. Most machine learning features are sliding windows (e.g. the average purchase amount for a user over the last 7 days or click through rate in the last hour).
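A minimal sketch of the sliding-window case in pure Python, assuming input arrives in event-time order (the `SlidingWindowFeature` class is invented for illustration):

```python
from collections import deque

class SlidingWindowFeature:
    """Rolling sum over the last `window_seconds` of event time."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (event_time, value), assumed in event-time order

    def update(self, event_time, value):
        self.events.append((event_time, value))
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= event_time - self.window:
            self.events.popleft()
        return sum(v for _, v in self.events)

# "Total spend over the last 7 days", recomputed on every event.
feature = SlidingWindowFeature(window_seconds=7 * 86400)
print(feature.update(0, 10.0))          # 10.0
print(feature.update(3 * 86400, 20.0))  # 30.0
print(feature.update(10 * 86400, 5.0))  # 5.0 (older events aged out)
```

Real streaming engines do the same eviction incrementally and fault-tolerantly, but the window semantics are the same.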

For the "when" question, a watermark is the system's heuristic estimate of completeness - "I believe I've seen all events with event time up to T". When the watermark passes the end of a window, the system can declare that window complete. The issue is that watermarks are heuristics: they can be wrong, and late data will still arrive. Triggers then determine when to actually emit results: per-element (lowest latency, highest volume), at the watermark (better completeness), or at regular processing-time intervals (a pragmatic middle ground).
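To make watermarks and triggers concrete, here's a toy tumbling-window sketch (all names invented; the watermark heuristic is simply the maximum event time seen minus an allowed-lateness bound, and the trigger fires at the watermark):

```python
class WatermarkedWindow:
    """Tumbling event-time windows, closed by a heuristic watermark."""
    def __init__(self, window, allowed_lateness):
        self.window = window
        self.lateness = allowed_lateness
        self.open = {}       # window start -> event count
        self.closed = {}     # emitted results
        self.max_event_time = 0

    def observe(self, event_time):
        start = (event_time // self.window) * self.window
        if start in self.closed:
            return "late"    # arrived after its window was declared complete
        self.open[start] = self.open.get(start, 0) + 1
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.lateness
        # Trigger: emit any window whose end the watermark has passed.
        for s in [s for s in self.open if s + self.window <= watermark]:
            self.closed[s] = self.open.pop(s)
        return "ok"

w = WatermarkedWindow(window=60, allowed_lateness=30)
w.observe(10); w.observe(45)   # window [0, 60) still open
w.observe(150)                 # watermark = 120, so [0, 60) closes with count 2
print(w.closed)                # {0: 2}
print(w.observe(20))           # "late": its window has already closed
```

The final line is exactly the failure mode Akidau describes: the watermark was a guess, and an event from the "complete" window arrived anyway.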

A follow-up, Streaming 102, fleshed out these concepts with worked examples (another great read, by the way). Just prior, The Dataflow Model paper (VLDB 2015) formalised it all, and then Apache Beam (2016), Flink, and Kafka Streams made it practical. The key takeaway is that these primitives (event time, watermarks, windows, triggers) are the vocabulary you need to reason about feature freshness. Every feature computation maps onto them.

Lambda vs Kappa

Lambda

Before streaming-first became viable, engineers tried to bridge the gap between batch and real-time with two competing architectures.

Nathan Marz proposed the Lambda architecture around 2011 as a pragmatic compromise, with the idea being that two systems could run in parallel:

  • a batch layer - periodically processing the complete historical dataset, producing correct, comprehensive results
  • a speed layer (stream processor) - handling recent data in real-time, producing approximate results

A serving layer merges the outputs, preferring batch results when available and falling back to the speed layer for recent data.

Companies shipped Lambda systems that improved on the pure batch systems that preceded them. The operational cost, however, was onerous: you maintained two separate codepaths computing essentially the same business logic. The batch job was typically in Spark, while the speed layer used Storm or Samza. That meant different bugs, failure modes, and testing strategies, and every time the business logic changed, you changed it in two places and hoped they agreed. This state of affairs lasted for quite some time before some of the larger tech companies began shifting to a "lambda-less" Lambda architecture, such as LinkedIn - Lambda to Lambda-less Architecture.

LinkedIn's developers had to build, deploy, and maintain two pipelines that produced largely the same data. The two processing flows needed to stay in sync in terms of business logic, and these challenges incurred significant costs in developer time. On top of that, the nearline job would lag in message processing and the offline jobs would fail from time to time, problems the team became all too familiar with. Ultimately the upkeep wasn't worth the value: it significantly slowed the team's development velocity. They migrated to a "Lambda-less" architecture using only Samza (streaming), roughly halved most measures of development time, and cut maintenance overheads by more than half.

Kappa

"The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be."

Jay Kreps, one of the creators of Apache Kafka, responded with the Kappa architecture in 2014. The logic: if you have a replayable log (i.e. Kafka, who'd have thought!), you don't need a separate batch layer. Streaming handles the real-time processing, and reprocessing is handled by replaying the log through a new version of your streaming pipeline. Everything runs in one codebase with one set of semantics. The elegance of Kappa is that it treats batch as a special case of streaming, specifically streaming over a bounded dataset. As Akidau put it, there's nothing special about "all of yesterday's data": it's just a stream with a known start and end. If your streaming pipeline handles unbounded data correctly, it handles bounded data trivially.
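A minimal sketch of the Kappa idea: one pipeline function with one set of semantics, applied unchanged to a "live" stream and to a replayed log (names invented for illustration):

```python
def build_features(event_stream, state=None):
    """One pipeline, one codebase: a running per-user event count."""
    state = state or {}
    for event in event_stream:
        state[event["user"]] = state.get(event["user"], 0) + 1
        yield event["user"], state[event["user"]]

# "Live" processing: consume events as they arrive.
live = [{"user": "u1"}, {"user": "u2"}, {"user": "u1"}]
print(list(build_features(live)))          # [('u1', 1), ('u2', 1), ('u1', 2)]

# Reprocessing: replay the retained log through the *same* code.
# Bounded input is just a stream with a known start and end.
replayed_log = iter(live)
print(list(build_features(replayed_log)))  # identical results
```

There is no second batch codepath to keep in sync; "batch" is simply this function run over a bounded replay.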

This isn't to say Kappa was entirely practical at the time: retaining a full replayable log was prohibitively expensive. Kafka's tiered storage and similar features made it more viable at scale as time went on, and we're now at a point where it is much more practical.

Feature Platforms of Three Generations

Feature stores emerged because the ML industry collectively realised it had been reinventing data infrastructure from scratch. Every team was building its own bespoke pipeline to compute, store, and serve features, and at first nothing was shared, versioned, or consistent between training and serving. Looking at the history of feature platforms, they fall broadly into three generations, each recapitulating a phase of the broader streaming evolution.

  • Gen 1: Batch. Uber's Michelangelo (launched internally in early 2016; the blog post about it popularised the "feature store" concept); Hopsworks (late 2018, the first open-source feature store); Feast v0 (2019, Gojek + Google Cloud). Features were defined as Spark/SQL transformations, batch-computed on a schedule, with results landing in an offline store for training and copied to an online store (Redis/DynamoDB) for serving. This solved sharing and versioning, but features were still hours or days stale - the online store was essentially a cache of the last batch run.
  • Gen 2: Hybrid. Tecton (ish - it almost reaches gen 3) and Feast v2. These added streaming without fully rethinking the architecture: keep batch pipelines for slow features, add streaming pipelines for fast ones. Tecton was a marked improvement, though - a single declarative Python SDK where features are defined once, delegating to different engines under the hood (Spark for batch, Spark Structured Streaming/Flink for streaming), so the user didn't maintain two separate codepaths. The operational complexity lay in orchestrating multiple engines rather than in code duplication, and freshness improved to seconds/minutes.
  • Gen 3: Streaming-first. Fennel (founded 2022, ex-Meta team, Rust-powered, Kappa architecture with streaming all the time and batch as a special case). Like Tecton, a single declarative definition, but compiled to a streaming execution plan with no separate batch path. Historical computations are done via log replay, and train/serve skew is eliminated by construction.
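A vendor-neutral, purely illustrative sketch of the gen-3 ergonomics (the `feature` decorator and registry below are invented for this post, not any platform's actual API):

```python
# One declarative definition that a gen-3 platform could compile to a
# streaming execution plan. Hypothetical names throughout.
FEATURE_REGISTRY = {}

def feature(source, window):
    """Register a feature definition: source stream, window, transform."""
    def register(fn):
        FEATURE_REGISTRY[fn.__name__] = {
            "source": source, "window": window, "transform": fn,
        }
        return fn
    return register

@feature(source="transactions", window="7d")
def avg_purchase_amount(amounts):
    return sum(amounts) / len(amounts)

# The same single definition serves both paths: live events for serving,
# log replay for training. There is no second batch codepath to drift.
spec = FEATURE_REGISTRY["avg_purchase_amount"]
print(spec["window"], spec["transform"]([10.0, 20.0]))  # 7d 15.0
```

The point of the decorator style is that the user states *what* the feature is; the platform owns *how* and *when* it is computed.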

Feature platform generations

Real-time machine learning

Real-time data beats slow data. Traditional feature engineering for machine learning has steadily trended towards requiring fresher and fresher data. Take fraud detection (the kind of thing Stripe's Radar system handles): a stolen credit card is used for 15 purchases across different merchants in 10 minutes. A batch feature like "total spend in the last 30 days", updated nightly, would show the cardholder's normal spending pattern and therefore nothing suspicious. The fraud model sees a normal customer making a normal purchase.

If you introduce streaming, it tells a different story. "Transaction count in the last 5 minutes: 8." "Unique merchants in the last hour: 12." "Spend velocity (last 10 minutes vs 30-day average): 47x." These features, computed per-event with sub-second latency, light up the fraud model and give a very strong signal to take action. By the time a nightly batch catches the pattern, the thief has maxed out the card, goods have shipped, and chargebacks are in flight. The difference between catching fraud in the moment and catching it the morning after is the difference here between a blocked transaction and a chargeback.

LLM and agentic workflows are also driving a step change in demand for fresh, real-time features. Take Xebia's architecture blueprint for context-aware agentic AI ("Beyond RAG: AI Agents With A Real-Time Context"). Stale features can corrupt an entire conversation. The streaming layer described in that post is, in essence, a bespoke feature pipeline: it computes aggregated context per customer, per event, and injects it into the agent's reasoning window. It's exactly the kind of infrastructure a streaming-first feature platform is designed to generalise. In classical ML, a stale fraud score lets one marginal transaction through. In an agentic system, a stale context window produces a fluent, confident, wrong conversation, and the customer doesn't blame the feature pipeline, they blame the company.

So given all this, where do we stand? The Kappa / "gen 3" style of feature platform is the right direction, but the options are thin. Fennel and Tecton were acquired, Feast is architecturally batch-first, and other platforms introduce vendor lock-in across the entire data stack.

Thyme

The history described in this post led me to start building Thyme, and its design stands on the shoulders (and lessons learned) of giants. Thyme is a streaming data platform for real-time machine learning and AI features, with the following high-level design goals:

  • Ease of use for data scientists and machine learning / AI engineers - remove the operational complexity of creating and deploying new features; let Thyme take care of it.
  • Cost - we aim to be as effective as any in-house system, and cheaper.
  • Governance and management of features - teams can work in tandem, lineage is available between all sources and sinks, and we aim to avoid duplicated or conflicting feature definitions.

Alongside some key technical features:

  • A Kappa-based architecture with exactly-once processing, eliminating operational complexity whilst maintaining performance.
  • Python ergonomics, Rust performance - machine learning teams write Python, while Rust, with its tight control over CPU, asynchronous processing, and GC-free memory efficiency, forms the basis of our streaming engine.
  • Complex pipelines built from multiple sources, with multiple dependencies and multiple sinks. You build the features, Thyme controls the infrastructure.
  • A rich UI/UX for governance and maintenance of data and features - teams can work in tandem without treading on each other's toes, and the feature catalog makes features discoverable.

I'm extremely excited to be building Thyme, and I'll be covering its architecture and case studies in future blog posts. Stay tuned!
