Presentation: The Highs and Lows of Stateful Containers
This presentation is now available to view on InfoQ.com
What You’ll Learn
- Hear about some of the challenges encountered when running a stateful system in a container.
- Learn about some of the features Kubernetes has for running stateful systems, helpful patterns and some of the pitfalls to avoid.
- Find out what is still missing and what the future might bring to stateful containers.
Abstract
As modern organizations have rapidly embraced containers in recent years, stateful applications have proven tougher to transition into this brave new world than other workloads. When persistent state is involved, more is required both of the container orchestration system and of the stateful application itself to ensure the safety and availability of the data.
This talk will walk through my experiences trying to reliably run a distributed database on Kubernetes, optimize its performance, and help others do the same in their heterogeneous environments. We’ll look at what kinds of stateful applications can most easily be run in containers, which Kubernetes features and usage patterns are most helpful for running them, and a number of pitfalls I encountered along the way. Finally, we’ll ponder what’s missing and what the future may hold for stateful containers.
Tell me about the work that you do today.
I work on the open source CockroachDB database: the product itself, performance, and stability of the core system, and then making sure it runs really well in all the environments where users and customers want to run it. There are a lot of people trying to run it on Kubernetes, in a single cluster or across multiple regions. I've had a lot of exposure to trying to make a stateful system work in these various orchestrated container environments.
What's the big challenge when you're talking about stateful systems with an orchestrator?
Orchestrated systems don't provide all the same guarantees you'd expect when you're running something directly on your own VMs. A lot of the challenges come from not properly understanding how these systems work and what guarantees they’re providing, which are a little different from what you might be used to in more traditional environments. You need to understand the environment and the system that you're running a little better than you would if you were running something directly on bare metal.
Can you give me an example of that?
Particularly early on, every container orchestration system assumed that all your containers are fungible. There was no need to think of one container as being any different from another, and there was also no need to care about where the containers were placed. Every container could be moved from machine to machine, or killed, without any concern for what it was doing. These kinds of assumptions don't work for many stateful systems. A container could be the only instance holding a certain piece of data, or it may have an expensive bootup process where it has to reload a bunch of data into memory.
Who are you targeting with this talk?
It's primarily for people who are architecting their applications, deciding where they want to run their systems and whether to put stateful workloads into containers. I’ll be sharing different problems that I've run into, both myself and while helping others run stateful applications in these systems, so attendees can have a better sense of what it's really like. I'll also try to cut through the marketing hype and give a better understanding of how to overcome some of the common problems encountered.
Can you illustrate one of the common problems that you might talk through that somebody will walk away with a pattern for?
One of the biggest mistakes people make when they try to move their stateful workloads into containers is not understanding what they can rely on from the provider. People can get themselves into situations where they actually lose their data because they assume that the storage they're running on will always be there. It works great when they set up the demo or follow the steps from a blog post, and it works well for a week or two. Then an unexpected failure hits and they realize they didn't plan for recovery, which is important in any orchestrator. If you don't plan for failure and pick your storage medium properly, you're going to have a really bad time a week or two down the road when the first failure occurs.
When you say 'pick your storage correctly', what does it mean in this context?
Take Kubernetes, for example: you have a number of different options for where to store your data. You can choose to store it inside the container itself, on the host's disk, or on remote network-attached storage, and there are multiple varieties of each if you're running in the cloud. If you don't think about what you're doing, you're probably defaulting to an incorrect choice that's going to leave you with lost data.
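To make the storage choice concrete, here is a minimal sketch (not taken from the talk) that builds a Kubernetes StatefulSet manifest in Python. Each replica gets its own PersistentVolumeClaim through `volumeClaimTemplates` rather than writing into the container's own filesystem. The application name, image, mount path, and volume size are hypothetical placeholders, and which storage backs the claim depends on the cluster's StorageClass, which this sketch leaves at the cluster default. The helper `unbacked_mounts` is an illustrative check, not a Kubernetes API.

```python
import json


def build_statefulset(name: str, image: str, replicas: int, storage: str) -> dict:
    """Return a StatefulSet manifest in which each pod gets its own claim."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # headless Service giving pods stable names
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Hypothetical data directory inside the container.
                        "volumeMounts": [{"name": "datadir", "mountPath": "/data"}],
                    }],
                },
            },
            # One PersistentVolumeClaim is created per replica from this template.
            "volumeClaimTemplates": [{
                "metadata": {"name": "datadir"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }


def unbacked_mounts(manifest: dict) -> list:
    """List mount names not backed by a claim template or a PVC volume."""
    spec = manifest["spec"]
    pod_spec = spec["template"]["spec"]
    backed = {t["metadata"]["name"] for t in spec.get("volumeClaimTemplates", [])}
    backed |= {
        v["name"] for v in pod_spec.get("volumes", [])
        if "persistentVolumeClaim" in v
    }
    return [
        mount["name"]
        for container in pod_spec["containers"]
        for mount in container.get("volumeMounts", [])
        if mount["name"] not in backed
    ]


if __name__ == "__main__":
    # All argument values here are placeholders for illustration.
    manifest = build_statefulset("exampledb", "example/db:1.0", 3, "10Gi")
    print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest could in principle be piped to `kubectl apply -f -`, though a real deployment would also need the headless Service and a deliberate StorageClass choice.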
Are you saying 'Just use network storage and you're done', or you need to be intentional about what you're selecting?
You first need to be aware of the most likely mistakes. The defaults are usually wrong, so you have to make a conscious choice to avoid them. Beyond that, for performance reasons and to handle different types of failures, you have to make an intelligent choice between the various options, such as local disk or a more advanced network solution.
Is this going to be specifically for CockroachDB or is this applicable to any stateful system you want to deploy with containers?
The same issues are applicable to any system that you're deploying in containers.