Track: Production Readiness: Building Resilient Systems

Location: Ballroom BC

Day of week:

A production readiness review is how software companies determine whether a system's design and implementation are ready to be released to customers. The process identifies and addresses gaps in a service's reliability, its coverage of privacy and security needs, and its ease of operation. This track explores which aspects of software must be prepared before it takes on full production load with customers' data. Topics include observability, emergency response, capacity planning, release processes, and SLOs for availability and latency.

Track Host: Michelle Brush

Engineering Manager, SRE @Google

Michelle Brush is a math geek turned computer geek with 20 years of software development experience. She has developed algorithms and data structures for pathfinding, search, compression, and data mining in embedded as well as distributed systems. In her current role as an SRE Manager for Google, she leads the teams of SREs that ensure GCP's APIs are reliable. Previously, she served as the Director of HealtheIntent Architecture for Cerner Corporation, responsible for the data processing platform for Cerner’s Population Health solutions. Prior to her time at Cerner, she was the lead engineer for Garmin's automotive routing algorithm.

10:35am - 11:25am

Monitoring and Tracing @Netflix Streaming Data Infrastructure

Netflix's streaming data infrastructure transports trillions of events per day and supports hundreds of stream processing jobs. The team behind it is small and there is no separate operations team. To efficiently manage and operate this huge infrastructure and reduce the operational burden for everyone, we developed a set of tools that enable automated operations and mitigations. Our Kafka monitoring tools provide comprehensive signals and insight into the health of our Kafka brokers and consumers, from which we derived ways to automate error handling that improve the stability of brokers and stream processing jobs. For data streams that have high consistency requirements, instead of relying purely on aggregated counts that may be misleading, we trace individual events along their transport path. Enabled by stream processing with minimal resources, tracing provides insight into end-to-end data loss, duplicates, and latency in near real time and with high accuracy. These results helped us further improve our service quality and validate design trade-offs.

The talk will cover the design and implementation details of these dev/ops tools and highlight the critical roles they play in operating our data infrastructure. It will showcase how active and targeted tool development for operational use can quickly pay off with improved product quality and overall agility.

Allen Wang, Architect & Engineer in Real Time Data Infrastructure Team @Netflix

11:50am - 12:40pm

Observability in the Development Process: Not Just for Ops Anymore

Monitoring has historically been considered an afterthought of the software development cycle: something owned by the ops side of the room. But instead of trying to predict, right before release, the various ways something might go sideways, what might it look like to learn from our production systems in order to figure out what to build, how to build it, and for whom?

Observability is all about asking new questions of your systems -- and is something that should be built into the process of crafting software from the very beginning. In this talk, we'll explore what it looks like in practice, so that production stops being just where our development code runs into issues: it becomes where part of our development process lives.

Christine Yen, Cofounder @honeycombio

1:40pm - 2:30pm

Building Confidence in Healthcare Systems Through Chaos Engineering

Healthcare demands resilient software. Healthcare systems are resistant to change, as change can be viewed as a threat to system availability. To scale and modernize these systems, software engineers have to build confidence in how they can continually introduce change.

This talk will cover how Cerner evolved their service workloads and applied gameday exercises to improve their resiliency. It will focus on how they transitioned their Java services from traditional enterprise application servers to a container deployment on Kubernetes using Spinnaker. It will share how they standardized their service deployment to have consistent instrumentation to get deep insight into the overall behavior of their system. It will explain strategies for how they applied traffic management approaches to safely introduce chaos engineering experiments, improving their overall understanding of the system.

Carl Chesser, Principal Engineer @Cerner

2:55pm - 3:45pm

How to Invest in Technical Infrastructure

Deciding what to work on is always difficult and is especially treacherous for folks working as infrastructure engineers and leaders. Will Larson unpacks the process of picking and prioritizing technical infrastructure work, which is essential to long-term company success but discussed infrequently. Will shares Stripe's approaches to prioritizing infrastructure as your company scales, justifying (and maybe even expanding) your company's spend on technical infrastructure, exploring the whole range of possible areas of infrastructure investment, adapting your approach between periods of firefighting and periods of innovation, and balancing investment between supporting existing products and enabling new product development.

Will Larson, Foundation Engineering @Stripe

4:10pm - 5:00pm

Stop Talking & Listen; Practices for Creating Effective Customer SLOs

In this data-driven age we are constantly collecting and analyzing monumental quantities of data. We want to know everything about our product: how our customers use it, how long they use it, and, more importantly, whether the product is even working. With all this data, we should be able to answer all of these questions. But it turns out that's not always the case. In this talk, we'll discuss some of the common pitfalls that arise from collecting and analyzing service data, such as only using 'out-of-the-box' metrics and not having feedback loops. Then we'll discuss some practical tips for reducing noise and increasing effective customer signals with SLOs, and for analyzing customer pain points.

Cindy Quach, Site Reliability Engineer @Google

Last Year's Tracks

Monday, 1 November
Microservices / Serverless Patterns & Practices

Evolving, observing, persisting, and building modern microservices
Practices of DevOps & Lean Thinking

Practical approaches using DevOps & Lean Thinking
JavaScript & Web Tech

Beyond JavaScript in the Browser. Exploring WebAssembly, Electron, & Modern Frameworks
Modern CS in the Real World

Thoughts pushing software forward, including consensus, CRDTs, formal methods, & probabilistic programming
Modern Operating Systems

Applied, practical, & real-world deep-dive into industry adoption of OS, containers and virtualization, including Linux on Windows, LinuxKit, and Unikernels
Optimizing You: Human Skills for Individuals

Better teams start with a better self. Learn practical skills for IC
Open Spaces
Tuesday, 2 November
Architectures You've Always Wondered About

Next-gen architectures from the most admired companies in software, such as Netflix, Google, Facebook, Twitter, & more
21st Century Languages

Lessons learned from languages like Rust, Go, Swift, Kotlin, and more.
Emerging Trends in Data Engineering

Showcasing DataEng tech and highlighting the strengths of each in real-world applications.
Bare Knuckle Performance

Killing latency and getting the most out of your hardware
Socially Conscious Software

Building socially responsible software that protects users' privacy & safety
Delivering on the Promise of Containers

Runtime containers, libraries, and services that power microservices
Open Spaces
Wednesday, 3 November
Applied AI & Machine Learning

Applied machine learning lessons for SWEs, including tech around TensorFlow, TPUs, Keras, PyTorch, & more
Production Readiness: Building Resilient Systems

More than just building software: building deployable, production-ready software
Developer Experience: Level up your Engineering Effectiveness

Improving the end-to-end developer experience: design, dev, test, deploy, operate/understand.
Security: Lessons Attacking & Defending

Security from the defender's AND the attacker's point of view
Future of Human Computer Interaction

IoT, voice, mobile: Interfaces pushing the boundary of what we consider to be the interface
Enterprise Languages

Workhorse languages found in modern enterprises. Expect Java, .NET, & Node in this track
