Track: Stream Processing In The Modern Age

Location: Bayview AB

Day of week:

Stream processing pipelines have become essential to building engaging experiences on the web today. Whether you enjoy personalized news feeds on LinkedIn and Facebook, profit from near real time updates to search engines and recommender systems, or benefit from near-realtime fraud detection on a lost or stolen credit card, you have come to rely on the fruits of stream processing as an end user. As a ops-focused engineer, you may employ stream processing to understand complex call trees in your microservice-based infrastructure with the aim to eliminate redundant system load or improve mobile and web application performance. As an business analyst, you may want to execute a SQL query on a live stream of data to reveal some insights. Come learn about interesting applications of stream processing as well as recent advances in the field.

Track Host: Tyler Akidau

Engineer @Google & Founder/Committer on Apache Beam

Tyler Akidau is a senior staff software engineer at Google Seattle. He leads technical infrastructure’s internal data processing teams in Seattle (MillWheel & Flume), is a founding member of the Apache Beam PMC, and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems the seamless merging between the two. He is the author of the 2015 Dataflow Model paper, the Streaming 101 and Streaming 102 articles, and the upcoming Streaming Systems book. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Custom, Complex Windows @Scale Using Apache Flink

100 Million members in over 190 countries leads to more than 1 Trillion events and 3 PB of data flowing through Netflix’s real-time data infrastructure each day. We’ve built a data pipeline in the cloud that reliably collects and routes these events to a variety of sinks. The data in these events are are used in several ways; from personalizing the customer experience to business intelligence.

The windowing capabilities offered by most stream processing engines are limited to aligned windows of a fixed duration. However, many real-world event processing use cases don’t fit this rigid structure, resulting in awkward processing pipelines. There haven’t been good alternatives, until recently that is. Apache Flink* offers a rich Window API that supports implementing unaligned windows of varying duration. In this talk, Matt Zimmer will discuss using this API to aggregate events into windows customized along varying definitions of a session. He will talk about implementation details such as:

  • Handling out-of-order events
  • Limiting state build-up while aggregating a subset of events from an event stream
  • Periodically emitting early results
  • Creating windows bounded by a type of event

Attendees will leave this talk with practical techniques and knowledge to implement their own custom windows in Apache Flink.

* Apache Flink (https://flink.apache.org/) is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.

Matt Zimmer, Real-time Data Infrastructure Senior Engineer @Netflix

Predictive Datacenter Analytics With Strymon

A modern enterprise datacenter is a complex, multi-layered system whose components often interact in unpredictable ways. Yet, to keep operational costs low and maximize efficiency, we would like to foresee the impact of changing workloads, updating configurations, modifying policies, or deploying new services.

In this talk, I will share our research group’s ongoing work on Strymon: a system for predicting datacenter behavior in hypothetical scenarios using queryable online simulation. Strymon leverages existing logging and monitoring pipelines of modern production datacenters to ingest cross-layer events in a streaming fashion and predict possible effects of such events in what-if scenarios. Predictions are made online by simulating the hypothetical datacenter state alongside the real one. Driven by a real-use case from our industrial partners, I will highlight the challenges we are facing in building Strymon to support a diverse set of data representations, input sources, query languages, and execution models.

Finally, I will share our initial design decisions and give an overview of Timely Dataflow; a high-performance distributed streaming engine and our platform of choice for Strymon’s core implementation.

Vasia Kalavri, PMC Member of Apache Flink (Core Developer Graph Processing API) & Postdoctoral researcher at the ETH Zurich Systems group

Streaming SQL Foundation: Why I ❤ Streams+Tables

What does it mean to execute robust streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing conceptually, or different? And how does all of this relate to the programmatic frameworks like we’re all familiar with? This talk will address all of those questions in two parts.

First, we’ll explore the relationship between the Beam Model (as described in The Dataflow Model paper and the Streaming 101 and Streaming 102 blog posts) and stream & table theory (as popularized by Martin Kleppmann and Jay Kreps, amongst others, but essentially originating out of the database world). It turns out that stream & table theory does an illuminating job of describing the low-level concepts that underlie the Beam Model.

Second, we’ll apply our clear understanding of that relationship towards explaining what is required to provide robust stream processing support in SQL. We’ll discuss concrete efforts that have been made in this area by the Apache Beam, Calcite, and Flink communities, compare to other offerings such as Apache Kafka’s KSQL and Apache Spark’s Structured streaming, and talk about new ideas yet to come.

In the end, you can expect to have a much better understanding of the key concepts underpinning data processing, regardless of whether that data processing batch or streaming, SQL or programmatic, as well as a concrete notion of what robust stream processing in SQL looks like.

Tyler Akidau, Engineer @Google & Founder/Committer on Apache Beam

The Power of Distributed Snapshots in Apache Flink

Come learn how Apache Flink is handles stateful stream processing and how to manage distributed stream processing and data driven applications efficiently with Flink's checkpoints and savepoints.

Over the last years, data stream processing has redefined how many of us build data pipelines. Apache Flink is one of the systems at the forefront of that development: With its versatile APIs (event-time streaming, Stream SQL, events/state) and powerful execution model, Flink has been part of re-defining what stream processing can do. By now, Apache Flink powers some of the largest data stream processing pipelines in open source data stream processing. Ranging from batch and streaming pipelines and analytics to microservices and applications, Flink has been used for a wide range of applications that can be unified under the paradigm of data stream processing. A key ingredient to that flexibility is Flink's handling of Streams and State. In the talk we will show how these are handled in Flink today: The types of state, why we picked distributed snapshots as the core consistency model, and how these checkpoints/savepoints form an increadibly powerful base to manage applications, including upgrades, rollbacks, reinstatements, migrations, forking, or blue/green deployments. Demo included.

Stephan Ewen, Committer @ApacheFlink, CTO @dataArtisans

Data Decisions With Realtime Stream Processing

At Facebook, we can move fast and iterate because of our ability to make data-driven decisions. Data from our stream processing systems provide real-time data analytics and insights; the system is also implemented into various Facebook products, which have to aggregate data from many sources. In this talk, we cover:

  1. the difficulties of stream processing at scale
  2. the solutions we've created to date
  3. three case studies on improving the time to deliver insights with data via stream processing

Our case studies include examples from search product development, accelerating daily pipelines in the Data Warehouse, and seamless integration with our machine learning platforms. Each case study shows how we can deliver value to more teams while continuing to abstract the details of stream processing from various teams at Facebook. We conclude by speaking to the future of stream processing.

Serhat Yilmaz, Software Engineer @Facebook

Panel: SQL Over Streams, Ask the Experts

Queries over streams are generally "continuous," executing for long periods of time and returning incremental results. Yet operations over streams must have the ability to be monotonic. New Generation of Stream Processing Engines has added support for Stream SQL. This AMA / panel features a discussion with thought leaders evolving and shaping the space.

Julian Hyde, Original Developer @ApacheCalcite, Co-Founder SQLstream, & Architect @Hortonworks
Tyler Akidau, Engineer @Google & Founder/Committer on Apache Beam
Jay Kreps, Co-Founder and CEO @Confluent
Michael Armbrust, Initial Author of Apache Spark SQL & Leads Streaming Team @Databricks
Stephan Ewen, Committer @ApacheFlink, CTO @dataArtisans

Last Year's Tracks

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.