Presentation: Controlled Chaos: Taming Organic, Federated Growth of Microservices
This presentation is now available to view on InfoQ.com
What You’ll Learn
- Hear about some of the complexity of deploying and operating microservices.
- Learn some patterns to apply in a microservices architecture to preserve stability and security over time.
Abstract
The success with which enterprises execute on microservice strategies and the degree to which cloud-native technologies boost developer productivity leave operations and security teams with an organically growing landscape of federated services that is increasingly difficult to control. As a result, failures mount, resilience declines, and innovation dies.
In this talk, I focus on the challenges that result from organic, federated growth as well as the patterns, such as bulkheads, backpressure, and quarantines, that can be applied to monitor and control these dynamic systems from both an operational and a security perspective. I illustrate how visibility and control of the surrounding environment become more important than the observability of individual threads of execution and how the nature of this organic architectural style necessitates a shift to remediating behaviors in real time.
What is the work you're doing today?
I am the co-founder and CEO of Glasnostic, a startup that provides an operations solution for rapidly evolving networks of services that lets enterprises gain control over the unpredictable behaviors that such environments exhibit. My background is in tech. I was the co-founder of the company that became Red Hat OpenShift and, like everybody else, have always focused on how to optimally support the building of applications until I realized that successful applications don’t typically become successful because they are well-engineered. They become successful because they are operated well. And this is what I’ve set out to do at Glasnostic: help fix the massive operations crisis that we are facing today.
What are the goals for the talk?
I want to raise awareness of how deeply enterprise agility and microservices change the way we architect and operate service architectures. On the business side, there is this new, agile operating model with small, self-managing teams that work in rapid decision and learning cycles. On the technical side, businesses execute a microservices-first strategy, developer productivity is buoyed by cloud-native technologies and development itself is scaled out with many teams deploying in parallel. These factors leave architects and operators with a continually evolving landscape of sprawling and increasingly connected services that, while being of great benefit to the business, is inherently unstable and insecure and thus difficult to control. And, as we all know, if we can’t control, we can’t operate, and if we can’t operate, innovation dies. That’s the problem that we are solving.
Now, the key difference between applications and such federated, evolving service landscapes is that failure in service landscapes occurs overwhelmingly not because of code defects in individual threads of execution but due to environmental factors. This is also the reason why service landscapes truly represent a new technological paradigm. The old model of looking at code execution to find a “root cause” that will fix the current failure is irrelevant in a world where my code is significantly more likely to be impacted by noisy neighbors, grey failures and other random occurrences. And because I, as a developer, am ultimately responsible for only a handful of services within a much larger landscape, it becomes evident how monitoring, tracing, and in general, any local “observability” is much less useful than we’ve been trained to think. We can all engineer stand-alone, single-blueprint applications. The difficulties arise once you string decomposed applications together to form a federated landscape of services.
Because service landscapes represent a new paradigm where the impact of environmental factors outweighs code defects and because these environmental factors are inherently unpredictable, we need to change how we operate these architectures. We need to be able to remediate in real time, with near-zero mean time to repair (MTTR). This means, for instance, that, from an observability perspective, we can’t be looking at petabytes of machine data, with or without ML. We need to look at “golden signals” such as requests, latencies, concurrencies and bandwidth that are relevant to the environment and that are universally observable so we can quickly detect and identify failures, grey or not. And on the remediation side, we need a playbook of predictable operational patterns, such as backpressure, bulkheads, or quarantines, that we can apply in real time.
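To make one of these patterns concrete, here is a minimal sketch of a bulkhead in Python. It is an illustration under stated assumptions, not the speaker's implementation: the names `Bulkhead` and `BulkheadFull` are hypothetical, and the sketch assumes a single process in which calls to one dependency can be capped with a semaphore. When the compartment is full, the call is rejected immediately instead of queued, which also gives callers a simple backpressure signal.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the compartment has no free slot; callers should back off."""


class Bulkhead:
    """Hypothetical bulkhead: caps concurrent calls into one dependency."""

    def __init__(self, max_concurrent: int):
        # One semaphore slot per call allowed in flight at the same time.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject right away instead of waiting, so a slow dependency
        # cannot tie up every worker in the calling service.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("compartment saturated")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

A caller would typically create one `Bulkhead` per downstream dependency and treat `BulkheadFull` as a cue to shed load or retry later.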
Can you give me an example of quarantining, how that would work in practice?
Sure. If you run a service landscape and deploy a new service, it will be connected to a number of existing services and, because it has an API, you don’t know who will call it an hour, a day or a week from now. So, fundamentally, you won’t be able to engage in big-design-up-front architecture. And because you can’t architect up front, you can’t be sure that this new service will work within the landscape. As a result, you’ll want to ease the service into the landscape. Also, you can’t stage a service landscape in any meaningful way due to its complexity, and staging without production traffic and production scale is moot anyway. So, because you can’t stage, you’ll want to ease the deployment into production. And one operational pattern you can apply to that effect is the quarantine pattern.
At its most basic level, the quarantine pattern involves restricting the upstream traffic of a service or a group of services, keeping it very limited, and then removing that restriction slowly so that the operations team can observe the effects that the deployment has on the wider architecture. So, curiously, the quarantine pattern is often a way to implement a governor pattern. It is a powerful pattern because the vast majority of failures are triggered when changes are introduced in the landscape. As a result, because quarantining is so useful in reducing deployment risk, we often see it applied automatically.
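The "restrict, then slowly relax" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the class `QuarantineRamp`, its parameters, and the doubling step are hypothetical, and the sketch only computes the traffic cap. Enforcing the cap (for example in a proxy or gateway) and deciding whether the landscape looks healthy are assumed to happen elsewhere.

```python
class QuarantineRamp:
    """Hypothetical quarantine policy: cap traffic to a new service,
    then relax the cap step by step while the landscape stays healthy."""

    def __init__(self, start_rps: float, full_rps: float, step_factor: float = 2.0):
        self.start_rps = start_rps      # initial, very limited request rate
        self.full_rps = full_rps        # rate at which the quarantine is lifted
        self.step_factor = step_factor  # multiplier applied per relaxation step
        self.limit_rps = start_rps      # current cap, to be enforced elsewhere

    def relax(self, healthy: bool) -> float:
        """Raise the cap after a healthy observation window;
        fall back to the initial cap if the window looked unhealthy."""
        if healthy:
            self.limit_rps = min(self.limit_rps * self.step_factor, self.full_rps)
        else:
            self.limit_rps = self.start_rps
        return self.limit_rps

    @property
    def lifted(self) -> bool:
        """True once the cap has reached the full rate."""
        return self.limit_rps >= self.full_rps
```

An operator, or an automated controller, would call `relax` once per observation window, passing in whatever health verdict the golden signals support for that window.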
What do you want people to leave the talk with?
I want people to leave with two insights. First, that service landscapes genuinely represent a new paradigm. If you are looking to become a digital enterprise, and if you, therefore, strive for agility and execute a microservices-first strategy, then you will find yourself running a service landscape. It is not a matter of “if” but “when.” You simply can’t continue to build distributed applications. In fact, distributed systems engineering is probably the worst antipattern today because it is so slow and expensive and involves waterfall-like big design up-front.
Second, I want people to leave with the insight that failures in service landscapes, i.e., stability and security issues, are overwhelmingly due to complex environmental behaviors, not to individual threads of execution, which is where we as engineers have been brought up to look. These environmental behaviors are also fundamentally unpredictable, which necessitates an entirely different, “mission control” mindset when it comes to operating such landscapes. We need to remediate in real time, which means we’ll have to look at golden signals, not get lured into the abyss that is high-cardinality observability. And armed with such actionable visibility, we need to be able to apply predictable operational patterns. Finally, we need to be able to actually do something in real time, not run a half-day incident response process.
It is this mission control operational mindset that enables enterprises to innovate rapidly and successfully in the digital domain. The crew of Apollo 13 didn’t make it back to Earth because the mission was well engineered. They made it back because the mission was operated well.