Presentation: Taming Large State: Lessons From Building Stream Processing

Track: Modern Data Architectures

Location: Ballroom BC

Time: 1:40pm - 2:30pm

What You’ll Learn

  1. Find out what it takes to convert a batch job into a stream processing app.
  2. Learn some of the fundamental constructs used in a streaming app.
  3. Hear about the challenges Netflix faced in converting a batch job into a streaming app.

Abstract

Streaming engines like Apache Flink are redefining ETL and data processing. Data can be extracted, transformed, filtered, and written out in real time with an ease matching that of batch processing. However, the real challenge of matching the prowess of batch ETL remains in doing joins, maintaining state, and dynamically pausing or restarting the processing.

At Netflix, microservices serve and record many different kinds of user interactions with the product. Some of these live services generate millions of events per second, all carrying meaningful but often partial information. Things start to get exciting when the company wants to combine the events coming from one high-traffic microservice with those from another. Joining these raw events generates rich datasets that are used to train the machine learning models that serve Netflix recommendations.

Historically, Netflix has done this joining of large-volume datasets in batch. Recently, the company asked: if the data is being generated in real time, why can't it be processed downstream in real time? Why wait a full day to get information from an event that was generated a few minutes ago?

This talk describes how we solved a complex join of two high-volume event streams at Netflix using Flink; a brief code sketch of the join pattern follows the list below. You'll learn about:

  • Managing out-of-order events and processing late-arriving data
  • Exploring keyed state for maintaining large state
  • Fault tolerance of a stateful application
  • Strategies for failure recovery
  • Schema evolution in a stateful real-time application
  • Data validation: batch vs. streaming
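The talk page itself contains no code, so the following is only a minimal sketch of the join pattern those bullets describe, using Flink's DataStream API in Java. All specifics are assumptions for illustration: the (requestId, eventTime) event shape, the five-minute out-of-orderness bound, and the one-hour state timeout are hypothetical, not the values used at Netflix.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class StreamJoinSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Events are (requestId, eventTimeMillis); tolerate 5 minutes of disorder.
        WatermarkStrategy<Tuple2<String, Long>> watermarks =
                WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                        .withTimestampAssigner((event, ts) -> event.f1);

        // Stand-ins for the two high-volume event streams (Kafka in production).
        DataStream<Tuple2<String, Long>> impressions = env
                .fromElements(Tuple2.of("req-1", 1_000L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .assignTimestampsAndWatermarks(watermarks);
        DataStream<Tuple2<String, Long>> plays = env
                .fromElements(Tuple2.of("req-1", 9_000L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .assignTimestampsAndWatermarks(watermarks);

        // Key both streams by the shared join key so their state is co-located.
        impressions.keyBy(e -> e.f0)
                .connect(plays.keyBy(e -> e.f0))
                .process(new BufferingJoin())
                .print();

        env.execute("impression-play-join-sketch");
    }

    /** Buffers whichever side arrives first in keyed state; a timer bounds the wait. */
    static class BufferingJoin extends
            KeyedCoProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>, String> {

        private static final long TIMEOUT_MS = Duration.ofHours(1).toMillis();

        private transient ValueState<Long> pendingImpression;
        private transient ValueState<Long> pendingPlay;

        @Override
        public void open(Configuration parameters) {
            pendingImpression = getRuntimeContext()
                    .getState(new ValueStateDescriptor<>("impression", Long.class));
            pendingPlay = getRuntimeContext()
                    .getState(new ValueStateDescriptor<>("play", Long.class));
        }

        @Override
        public void processElement1(Tuple2<String, Long> imp, Context ctx,
                                    Collector<String> out) throws Exception {
            if (pendingPlay.value() != null) {        // the other side already arrived
                out.collect("joined " + ctx.getCurrentKey());
                pendingPlay.clear();
            } else {                                  // buffer, but not forever
                pendingImpression.update(imp.f1);
                ctx.timerService().registerEventTimeTimer(imp.f1 + TIMEOUT_MS);
            }
        }

        @Override
        public void processElement2(Tuple2<String, Long> play, Context ctx,
                                    Collector<String> out) throws Exception {
            if (pendingImpression.value() != null) {
                out.collect("joined " + ctx.getCurrentKey());
                pendingImpression.clear();
            } else {
                pendingPlay.update(play.f1);
                ctx.timerService().registerEventTimeTimer(play.f1 + TIMEOUT_MS);
            }
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
            // Expire unmatched entries so the keyed state stays bounded.
            pendingImpression.clear();
            pendingPlay.clear();
        }
    }
}
```

The sketch maps directly onto the bullets above: watermarks bound out-of-order and late-arriving events, keyed ValueState holds the large join state, and the event-time timer expires unmatched entries so state does not grow without bound.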
Question: 

Tell me a bit about the team that you're all on and the problems you're solving in Netflix.

Answer: 

We are part of Netflix Data Science and Engineering; specifically, our team supports the creation and maintenance of data products that are used for personalization. We process data at huge scale, making it available for the training of machine learning models. We've been experimenting with creating a lot of low-latency datasets to increase the frequency at which the models can be trained and to provide faster feedback to the models.

Question: 

Tell me a bit about the talk that you're going to do.

Answer: 

The core problem that we're trying to address is that, while stream processing has exploded over the past three or four years, it has typically been associated with low-latency event processing. It's not as commonly used for stateful aggregation of data, which is what batch is typically used for. That's the core data problem we're trying to address: it is possible to do stateful processing and produce low-latency data that is usually produced through batch. Doing it in streaming is what we're trying to solve.

Question: 

The first sentence of your abstract mentions Apache Flink. Is this a Flink talk or is it a data architecture talk?

Answer: 

I would say it's largely a data architecture talk, because a lot of the constructs and the challenges we faced were the result of taking a really large batch dataset that required holding a lot of state and doing it in streaming, with all the problems that come with that: fault tolerance, data recovery, data reprocessing. It is about how you think about processing this in streaming. That said, we solved it in Flink, so some of the things we've used are unique to it; the features that we've leveraged to solve these problems are features of Flink. But I'm sure other engines also have similar features.
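For context on the Flink features this answer alludes to, here is a minimal configuration sketch of what makes a large-state job recoverable: checkpointing plus a durable state backend. The interval, storage path, and backend choice are assumptions for illustration, not the team's actual settings.

```java
import java.time.Duration;

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all keyed state periodically with exactly-once semantics.
        env.enableCheckpointing(Duration.ofMinutes(10).toMillis(), CheckpointingMode.EXACTLY_ONCE);

        // Keep large state on local disk (RocksDB, from flink-statebackend-rocksdb)
        // and snapshot it incrementally, so checkpoint cost scales with the change
        // set rather than the total state size.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints"); // hypothetical path

        // Retain the last checkpoint when the job is cancelled, so it can be
        // restarted from that state instead of reprocessing from scratch.
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    }
}
```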

Question: 

Are you talking to a data engineer or are you talking to an architect?

Answer: 

This talk would appeal to a wide variety of audiences. As an architect, it would give you good insight into how to think about creating a system that can handle this level of scale. As a data engineer, it will give you enough points to consider which kinds of jobs are good candidates for streaming, because not everything needs that kind of investment. One caveat: we will cover a lot of technical depth, so prior knowledge of what it takes to build these large-scale distributed systems would be good to have.

Question: 

What do you want someone to leave the talk with?

Answer: 

At a high level, what it entails to convert a large batch job into a low-latency stream processing job, and the different constructs involved in creating a streaming application. We will also talk about the challenges that were encountered along the way and how we went about solving them.

Question: 

If I'm dealing with batch ETL jobs at moderate scale, is that enough background, or should I be pretty familiar with a streaming tool like Flink?

Answer: 

Some basic knowledge of how stream processing engines interact with queues like Kafka, plus some basic terminology: what it means to do data processing, or what it means to handle duplicates and checkpoints. Knowing that terminology and those constructs is enough. I don't think anyone needs to know the specific features or constructs of Flink or Spark, just a broader knowledge of what stream data processing is.
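As a concrete picture of that background knowledge, here is a minimal sketch of a Flink job reading from Kafka. The broker, topic, and group names are made up for illustration.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaReadSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The engine tracks its position in the queue via consumer offsets; together
        // with checkpoints, this is what failure recovery replays from.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")   // hypothetical broker
                .setTopics("playback-events")         // hypothetical topic
                .setGroupId("stream-join-demo")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Replays after a failure can re-deliver events, so downstream logic has to
        // tolerate duplicates (e.g. via idempotent writes or deduplication).
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events")
                .print();

        env.execute("kafka-read-sketch");
    }
}
```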

Speaker: Sonali Sharma

Data Engineering and Analytics @Netflix

Sonali Sharma is a data engineer on the data personalization team at Netflix, which, among other things, delivers recommendations made for each user. The team is responsible for the data that goes into training and scoring of the various machine learning models that power the Netflix home page. They have been working on moving some of the company's core datasets from being processed in a once-a-day batch ETL to being processed in near real time using Apache Flink. A UC Berkeley graduate, Sonali has worked on a variety of problems involving big data. Previously, she worked on the mail monetization and data insights engineering team at Yahoo, where she focused on building data-driven products for large-scale unstructured data extraction, recommendation systems, and audience insights for targeting, using technologies like Spark, the Hadoop ecosystem (Pig, Hive, MapReduce), Solr, Druid, and Elasticsearch.

Speaker: Shriya Arora

Senior Software Engineer @Netflix

Shriya works on the personalization data engineering team at Netflix. The team is responsible for all the data products that are used for training and scoring of the various machine learning models that power the Netflix homepage. On the team, she has been focusing on moving some of the core datasets from being processed in a once-a-day batch ETL to being processed in near real time, for both technical and business wins. Before Netflix, she was at Walmart Labs, where she helped build and architect the new generation of item setup, moving from batch processing to real-time item processing.
