Track: The Art of Chaos Engineering

Location: Ballroom BC

Chaos Engineering is an emerging discipline, but the underlying concepts are not. Failure is going to happen. Are you ready? Put simply, Chaos Engineering is one approach to “breaking things on purpose” that teaches us new information about our systems through experimentation. By triggering incidents intentionally in a controlled way, we gain confidence that our systems can deal with those failures before they occur in production. Come learn from those just starting this journey as well as the experts pushing the state of the art. We will hear war stories from those putting out the fires in the middle of the night, as well as those starting the fires during the day! In the end, we’ll learn how to build systems and organizations that improve in the face of failure.

Track Host: Kolton Andrus

Founder of Gremlin Inc, former Netflix

Kolton is the founder of Gremlin, which helps companies build more robust services. He was a Chaos Engineer at Netflix, focused on the resilience of the Edge services. He designed and built FIT, Netflix’s failure injection service. Prior to that, he improved the performance and reliability of the Amazon Retail website. At both companies he has served as a ‘Call Leader’, managing the resolution of company-wide incidents. Kolton is passionate about building resilient systems, primarily as it lets him break things for fun and profit.

Chaos Architecture

Even perfectly engineered resilient systems can be broken by confused operators when those systems behave differently in response to underlying failures. Highly available applications need to be resilient to failures in infrastructure, networks, applications and operators. Chaos engineering is needed to exercise the incident handling mechanisms at every level, including people and processes. This talk will look at best practices and challenges in getting to a chaos architecture mindset.

Adrian Cockcroft, VP Cloud Architecture Strategy @AWSCloud & Microservices Pioneer

Chaos: The Last Stand Against Our Robot Overlords

As the complexity and criticality of our software systems rapidly increase, our ability and available methodologies to ensure their determinism and correctness are often nascent or sometimes even non-existent. We see the effects of this paradox as we advance the role and responsibility of software in society. Often the evidence is observed in service outages, security breaches, financial market "flash crashes", and now the ever-shortening length of time between the development and eventual production of autonomous vehicles.


The pursuit of automating aspects of our lives is often stifled simply by chaos, i.e. our best-laid plans coming in contact with the unexpected. An essential element of working with the chaos present in every system is to first be able to effectively characterize it. Chaos Engineering and chaos experiments on the complex data, interfaces, and algorithms used in autonomous vehicles should be a minimum requirement in validating operational safety. Taking it a step further, Chaos Engineering could be the beginning of bringing to Software Engineering the kind of determinism, predictability, and assurance we often take for granted every day from disciplines like Structural, Mechanical, and Electrical Engineering. We need to begin to shift towards working with chaos instead of against it, in order to build safe, reliable, and increasingly deterministic complex systems. The way we engineer software for large-scale consumption is shifting from "It might work, but I wouldn't bet my life on it" to "I know this will work; I'd bet my life on it."

Failure at Netflix Velocity

Netflix is a strong believer in Chaos Engineering and the Velocity of Innovation. Most of the time, our customers never notice the former and appreciate the latter. Occasionally, however…

Cannot connect to Netflix. You press play and it doesn't work. You can't log in. Nothing is on the screen and Stranger Things Season 2 was just released!

This talk is a behind-the-scenes look at how Netflix engineering teams think about failure: the tools, techniques, and training we use to shorten the inevitable failures of our systems and reduce their impact on our customers. Come hear why we believe chaos is your friend, failure is guaranteed, and why our organization is better off having both.

Dave Hahn, Sr SRE, Reliability and Chaos Engineering @Netflix

Chaos Engineering on a Budget

As the systems that support internet-scale services grow larger and ever more complex, chaos engineering has emerged as an industry best practice for ensuring system resiliency. Many companies maintain entire teams devoted to chaos testing their products. But what can you do if you don't have these kinds of resources to devote to the problem? How can you get started with chaos engineering without hiring an entire team of experts?

This is the story of implementing chaos testing on a small product, and how several small and targeted early investments in chaos engineering saved huge amounts of time and effort down the road.

Heather Nakama, Software Engineer @Microsoft - Azure Search

The Art of Chaos Engineering Panel

Kolton Andrus, Founder of Gremlin Inc, former Netflix
Willie Wheeler, Principal Application Engineer @Expedia
Sahar Samiei, Senior Product Manager @Expedia
Nathan Äschbacher
Dave Hahn, Sr SRE, Reliability and Chaos Engineering @Netflix
Adrian Cockcroft, VP Cloud Architecture Strategy @AWSCloud & Microservices Pioneer
Heather Nakama, Software Engineer @Microsoft - Azure Search

Expedia’s Journey Toward Site Resiliency

Those coming from product-driven organizations—where product features are often prioritized over resiliency-related concerns—will understand how challenging it can be to convince teams to do resiliency work. In this presentation we’ll share Expedia’s resiliency journey, starting with resiliency as an afterthought and progressing toward resiliency as a first-class concern. Attendees will learn about the importance of partnering with the teams experiencing operational struggles, and equipping them with the data to make the right investments at the right time.

Willie Wheeler, Principal Application Engineer @Expedia
Sahar Samiei, Senior Product Manager @Expedia
