A production readiness review is used by software companies to determine whether the design and implementation of the system is ready to be released to its customers. The process is used to identify and address the reliability of a service, sufficiency of the coverage of privacy and security needs, and the ease of the operability. This track explores what types of aspects of software need to be prepared to start taking on full production load with customer’s data. Topics include observability, emergency response, capacity planning, release processes, and SLOs for availability and latency.
Track: Production Readiness: Building Resilient Systems
Location: Ballroom BC
Day of week:
Track Host: Michelle Brush
10:35am - 11:25am
Monitoring and Tracing @Netflix Streaming Data Infrastructure
Netflix streaming data infrastructure transports trillions of events per day and supports hundreds of streaming processing jobs. The team behind it is small and there is no separate operations team. To efficiently manage and operate this huge infrastructure and reduce operational burden for everyone, we developed a set of tools that enables automated operations and mitigations. Our Kafka monitoring tools provide comprehensive signals and great insights into the health of our Kafka brokers and consumers, from which we derived ways to automate error handling that improves stability of brokers and stream processing jobs. For data streams that have high consistency requirements, instead of purely relying on aggregated counts that may be misleading, we trace individual events along their transporting path. Enabled by stream processing with minimal resources, tracing provides insight into end-to-end data loss, duplicates and latency at near real time and with high accuracy. These results helped us to further improve our service quality and validate design trade-offs.
The talk will give the design and implementation details of these dev/ops tools and highlight the critical roles they play in operating our data infrastructure. It will showcase how active and targeted tools development for operational use can quickly payoff with improved product quality and overall agility.
11:50am - 12:40pm
Observability in the Development Process: Not Just for Ops Anymore
Monitoring has been historically considered an afterthought of the software development cycle: something owned by the ops side of the room. But instead of trying to predict the various ways something might go sideways right before release, what might it look like instead to learn about our production systems in order to figure out what to build, and how to build it, and whom for?
Observability is all about asking new questions of your systems -- and is something that should be built into the process of crafting software from the very beginning. In this talk, we'll explore what it looks like in practice, so that production stops being just where our development code runs into issues: it becomes where part of our development process lives.
1:40pm - 2:30pm
Building Confidence in Healthcare Systems Through Chaos Engineering
Healthcare demands resilient software. Healthcare systems are resistant to change, as change can be viewed as a threat to system availability. To scale and modernize these systems, software engineers have to build confidence in how they can continually introduce change.
This talk will cover how Cerner evolved their service workloads and applied gameday exercises to improve their resiliency. It will focus on how they transitioned their Java services from traditional enterprise application servers to a container deployment on Kubernetes using Spinnaker. It will share how they standardized their service deployment to have consistent instrumentation to get deep insight into the overall behavior of their system. It will explain strategies for how they applied traffic management approaches to safely introduce chaos engineering experiments, improving their overall understanding of the system.
2:55pm - 3:45pm
How to Invest in Technical Infrastructure
Deciding what to work on is always difficult and is especially treacherous for folks working as infrastructure engineers and leaders. Will Larson unpacks the process of picking and prioritizing technical infrastructure work, which is essential to long-term company success but discussed infrequently. Will shares Stripe's approaches to prioritizing infrastructure as your company scales, justifying—and maybe even expanding—your company's spend on technical infrastructure, exploring the whole range of possible areas to invest into infrastructure, adapting your approach between periods of firefighting and periods of innovation, and balancing investment in supporting existing products and enabling new product development.
4:10pm - 5:00pm
Stop Talking & Listen; Practices for Creating Effective Customer SLOs
In this data-driven age we are constantly collecting and analyzing monumental quantities of data. We want to know everything about our product, how our customers use it, how long they use it and more importantly is the product even working? With all this data, we should be able to answer all of these questions. But turns out, that’s not always the case. In this talk, we’ll discuss some of the common pitfalls that arise from collecting and analyzing service data such as only using 'out-of-the-box' metrics and not having feedback loops. Then we'll discuss some practical tips for reducing noise and increasing effective customer signals with SLOs and analyzing customer pain points.
Last Year's Tracks
Monday, 1 November
-
Microservices / Serverless Patterns & Practices
Evolving, observing, persisting, and building modern microservices
-
Practices of DevOps & Lean Thinking
Practical approaches using DevOps & Lean Thinking
-
JavaScript & Web Tech
Beyond JavaScript in the Browser. Exploring WebAssembly, Electron, & Modern Frameworks
-
Modern CS in the Real World
Thoughts pushing software forward, including consensus, CRDT's, formal methods, & probabilistic programming
-
Modern Operating Systems
Applied, practical, & real-world deep-dive into industry adoption of OS, containers and virtualization, including Linux on Windows, LinuxKit, and Unikernels
-
Optimizing You: Human Skills for Individuals
Better teams start with a better self. Learn practical skills for IC
-
Open Spaces
Tuesday, 2 November
-
Architectures You've Always Wondered About
Next-gen architectures from the most admired companies in software, such as Netflix, Google, Facebook, Twitter, & more
-
21st Century Languages
Lessons learned from languages like Rust, Go-lang, Swift, Kotlin, and more.
-
Emerging Trends in Data Engineering
Showcasing DataEng tech and highlighting the strengths of each in real-world applications.
-
Bare Knuckle Performance
Killing latency and getting the most out of your hardware
-
Socially Conscious Software
Building socially responsible software that protects users privacy & safety
-
Delivering on the Promise of Containers
Runtime containers, libraries, and services that power microservices
-
Open Spaces
Wednesday, 3 November
-
Applied AI & Machine Learning
Applied machine learning lessons for SWEs, including tech around TensorFlow, TPUs, Keras, PyTorch, & more
-
Production Readiness: Building Resilient Systems
More than just building software, building deployable production ready software
-
Developer Experience: Level up your Engineering Effectiveness
Improving the end to end developer experience - design, dev, test, deploy, operate/understand.
-
Security: Lessons Attacking & Defending
Security from the defender's AND the attacker's point of view
-
Future of Human Computer Interaction
IoT, voice, mobile: Interfaces pushing the boundary of what we consider to be the interface
-
Enterprise Languages
Workhorse languages found in modern enterprises. Expect Java, .NET, & Node in this track