QCon San Francisco 2021 November 1-5, 2021 | | Monitoring and Tracing @Netflix Streaming Data Infrastructure

This presentation is now available to view on InfoQ.com

What You’ll Learn

Find out why observability is useful for a streaming data infrastructure.
Hear how Netflix is monitoring and doing message tracing for their streaming data infrastructure which enables critical automations.
Listen on the importance of building observability tools and how quickly they pay off the investment in developing them.

Abstract

Netflix streaming data infrastructure transports trillions of events per day and supports hundreds of streaming processing jobs. The team behind it is small and there is no separate operations team. To efficiently manage and operate this huge infrastructure and reduce operational burden for everyone, we developed a set of tools that enables automated operations and mitigations. Our Kafka monitoring tools provide comprehensive signals and great insights into the health of our Kafka brokers and consumers, from which we derived ways to automate error handling that improves stability of brokers and stream processing jobs. For data streams that have high consistency requirements, instead of purely relying on aggregated counts that may be misleading, we trace individual events along their transporting path. Enabled by stream processing with minimal resources, tracing provides insight into end-to-end data loss, duplicates and latency at near real time and with high accuracy. These results helped us to further improve our service quality and validate design trade-offs.

The talk will give the design and implementation details of these dev/ops tools and highlight the critical roles they play in operating our data infrastructure. It will showcase how active and targeted tools development for operational use can quickly payoff with improved product quality and overall agility.

Question:

What is the work you're doing today?

Answer:

My major responsibility at Netflix is to create a scalable and high quality streaming data infrastructure that handles trillions of messages per day. The first thing that comes to my mind is that we need to make sure the architecture is scalable for this high volume of streaming services. I have created a multi-cluster Kafka architecture that is fault tolerant and scalable as I have shown in the past QCon London conference. We spend a lot of time integrating this on our client side and create a higher level of abstraction that makes it easy for everybody to use. The second and equally important job that I'm doing daily is to create operational tools to increase the visibility and operability. We are a team of a little more than 10 people and we do everything from design, coding to deployment and operation. We need to make sure that we can operate this thing efficiently and spend most of our time not dealing with operations.

Question:

What are the goals of the talk?

Answer:

There have been quite a few talks covering microservices from an operational perspective. But when I look at the past conferences there are very few talks about operating data infrastructure. In the past we have built years of experience of design and operating this data infrastructures. So I would like to fill the gap and share with the audience the tools and principles that have enabled us to efficiently operate this huge data infrastructure.

Question:

Will there be general lessons to learn if I am not a Kafka user?

Answer:

Sure. Observability plays an important role in streaming data infrastructure. So regardless of the scale, observability plays a vital role there because it enables the critical automation and increases the transparency that earns trust from people that use this infrastructure. That's why we spend a lot of time to increase observability with targeted tools development. These tools help tremendously to improve efficiency and quality. Creating these tools is not trivial. In our case they are usually a multi quarter effort and requires smart thinking which I will cover extensively in my talk, but I find that once they are up and running the payoff is pretty quick. So I believe that investing in these tooling improves the overall agility of the team and the product for data infrastructure. That's the top point that I wish that everybody can take away.

Question:

Are the tools available as open source?

Answer:

Unfortunately no. We have a plan to open source them but obviously there are some high priority tasks that keep us very busy. But in general we want to open source them.

Question:

What do you want people to leave the talk with?

Answer:

As I mentioned before, I want them to know that improving observability is important. People should spend effort and invest on this kind of tooling because it will quickly pay off. Also in my talk, there will be some details about Kafka and stream processing with Flink and how we utilize those tools to do message tracing and loss detection. That's something I want to share with the people, and hopefully they can take away some learning.

Speaker: Allen Wang

Architect & Engineer in Real Time Data Infrastructure Team @Netflix

Allen Wang is an architect and engineer in Real Time Data Infrastructure team at Netflix. He architected the multi-cluster Kafka infrastructure for Netflix in cloud environment and is heavily involved in developing the tools needed for operating the streaming data infrastructure. He is an open source contributor for Apache Kafka and NetflixOSS and a frequent speaker for Kafka.