QCon San Francisco 2021 November 1-5, 2021 | | High Resolution Performance Telemetry at Scale

This presentation is now available to view on InfoQ.com

What You’ll Learn

Hear how Twitter uses Rezolus to monitor their systems to find out if there are performance problems.
Learn how to use Rezolus to pinpoint bottlenecks even in smaller systems.

Abstract

One of the most critical aspects of running large distributed systems is understanding and quantifying performance. Without telemetry it is challenging to diagnose performance issues, plan for capacity needs, and tune for maximum efficiency. Even when we have telemetry, the resolution is insufficient to capture anomalies and bursty behaviors that are typical in microservice architectures.

In this talk, we explore the issues of resolution in performance monitoring, cover sources of performance telemetry including hardware performance and eBPF, and learn some tricks for getting high resolution telemetry without high costs.

Question:

What is the work you’re doing today?

Answer:

I work on a team that's focused on infrastructure optimization and performance. As part of that we quantify workloads, measure the performance of systems, and come up with tuning changes that can help either increase the performance just so we can get more or so we can actually start to reduce the amount of resources allocated to specific services to reduce costs. As part of that, I work with a lot of like benchmarking and measuring runtime performance. Luckily here at Twitter I get to do this in open source capacity and I manage two open source projects. One is called rpc-perf, which is a benchmarking tool for in-memory caches. And my newer project which is a systems performance telemetry agent called Rezolus. This talk winds up centered more on the telemetry component of this and how we use Rezolus at Twitter.

Question:

You use the word telemetry. Why that word?

Answer:

Telemetry just fits really nicely. Because with telemetry it's basically the remote transmission metrics, which are just numbers. So it's a way for us to grab metrics from our fleet-wide systems words. Basically we want to record the runtime performance characteristics of our systems so we can go back and look at how they have been performing, how that's changed over time. We use it for runtime performance diagnostics and for being able to judge tuning changes so we can actually know whether we're moving the gauge up or down.

Question:

What's the goal of the talk?

Answer:

We're going to talk about the challenges of measuring performance at scale especially in distributed microservice architecture. A web request is hundreds of milliseconds. Having a lot of traditional telemetry collection happens at much coarser time scales and a lot of that is due to the cost of aggregation and processing those time series. We found that the traditional resolution that we had was insufficient to capture the actual performance anomalies. And it was at the point where it was interfering with my ability to do tuning work. That got me thinking about ways that we could capture bursty behavior without necessarily spending whatever X to increase the resolution.

Question:

How did you come up with a sampling rate that worked?

Answer:

Essentially it comes down to summarization. Thinking about the questions that you're trying to answer with telemetry and whether you actually need that high resolution as your end-product or whether you just need that to get to your end-product. The talk is about how we did that. Essentially it's about using percentile metrics. Basically you can sample at a really high rate and then do some metrics processing on the fly and then export summary metrics instead of exporting a second layer, 10 times per second or 100 times per second time series. And as long as you have that sampling rate short you can still have a minutely time series as your end-product and have a hint at the subminute behavior.

Question:

What do you want someone to leave your talk with?

Answer:

I would like for people to leave the talk with a deeper appreciation of the complexities of measuring performance. There are behaviors that we're just not aware of due to blind spots in how telemetry is collected today. There have been a lot of efforts to enable people to diagnose that in a more hands on fashion. I think really one of the core ideas here is that you can do it fleet-wide without necessarily spending X million dollars more to store really high resolution samples. Inspire people that this is possible and maybe that they would like to check out Rezolus as an open source project and contribute to it.

Question:

How do you answer someone who says we don't operate at Twitter scale? Is this going to have important takeaways for me at normal scale?

Answer:

Even in smaller environments systems performance is very important. And it can even be more so at really small shops where you don't have the budget to just throw money at the problem, where you need to squeeze out the most performance. And you might not have a team who can develop a sophisticated observability system. I think they would be able to leverage something like Rezolus to help capture runtime performance issues without having to fund a performance team. It's like the tool dovetails nicely into the rest of the open source observability ecosystem with Prometheus and stuff like that. I think it could provide people of different skills the ability to go to runtime performance diagnostics which I know at least when I used to work at a small shop one of the common things was the CTO coming and saying, the website feels slow now, and then having to go figure out why. It would have been really nice to have the visibility into what was happening.

Speaker: Brian Martin

Software Developer @Twitter

Brian is a Staff SRE at Twitter. He works on infrastructure optimization and performance. His work with tuning high performance services led him to discovering a need for better performance telemetry. He is the author and maintainer of Rezolus, Twitter's high resolution systems performance telemetry agent.