10:35am - 11:25am
The current generation of data engineering has left us with data pipelines, data warehouses, and machine learning platforms that are largely batch-based and centrally managed. They're often manually operated, and integrating new systems can be cumbersome. Over the next few years, a number of trends are going to require us to rethink how and what we build. Data is now real time, companies are running many database technologies, teams are demanding more control of their data, and regulatory policy has begun dictating how and when we store data. This talk will present a vision of what it will take for data engineers to deliver a next-generation data ecosystem.
11:50am - 12:40pm
Many enterprises are investing in their next generation data platform, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale.
In this talk, Zhamak shares her observations on the failure modes of the centralized paradigm of the data lake, and of its predecessor, the data warehouse.
She introduces Data Mesh, the next generation of data platforms, which shifts to a paradigm drawn from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.
1:40pm - 2:30pm
Streaming engines like Apache Flink are redefining ETL and data processing. Data can be extracted, transformed, filtered, and written out in real time with an ease matching that of batch processing. However, the real challenge of matching the prowess of batch ETL remains in doing joins, maintaining state, and dynamically pausing or resting the data.
At Netflix, micro-services serve and record many different kinds of user interactions with the product. Some of these live services generate millions of events per second, all carrying meaningful but often partial information. Things start to get exciting when the company wants to combine the events coming from one high-traffic micro-service with those coming from another. Joining these raw events generates rich datasets that are used to train the machine learning models that serve Netflix recommendations.
Historically, Netflix has done this joining of large-volume datasets in batch. Recently, the company asked: if the data is being generated in real time, why can't it be processed downstream in real time? Why wait a full day to get information from an event that was generated a few minutes ago?
This talk describes how we solved a complex join of two high-volume event streams at Netflix using Flink; a minimal sketch of such a keyed join follows the list below. You'll learn about:
- Managing out-of-order events and processing late-arriving data
- Using keyed state to maintain large state
- Fault tolerance of a stateful application
- Strategies for failure recovery
- Schema evolution in a stateful real-time application
- Data validation: batch vs. streaming
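To make the join concrete, here is a minimal Flink sketch rather than the actual Netflix pipeline: it assumes two hypothetical event streams (impressions and playbacks) that share a session ID as the join key, buffers whichever side arrives first in keyed state, and emits a joined record once the other side shows up. Late-data handling, timers for expiring unmatched state, and schema evolution, all covered in the talk, are omitted here.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class StreamJoinSketch {

    // Hypothetical event types; real pipelines would use much richer schemas.
    public static class Impression {
        public String sessionId; public String payload;
        public Impression() {}
        public Impression(String sessionId, String payload) { this.sessionId = sessionId; this.payload = payload; }
    }
    public static class Playback {
        public String sessionId; public String payload;
        public Playback() {}
        public Playback(String sessionId, String payload) { this.sessionId = sessionId; this.payload = payload; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder sources; in practice both streams would come from Kafka.
        DataStream<Impression> impressions = env.fromElements(new Impression("s1", "impression-data"));
        DataStream<Playback> playbacks = env.fromElements(new Playback("s1", "playback-data"));

        impressions
            .connect(playbacks)
            .keyBy(i -> i.sessionId, p -> p.sessionId)   // co-partition both streams on the join key
            .process(new JoinFunction())
            .print();

        env.execute("two-stream keyed join (sketch)");
    }

    // Buffers whichever side of the join arrives first in keyed state and emits
    // a joined record when the other side arrives. A production job would also
    // register timers to expire state for keys that never find a match.
    public static class JoinFunction
            extends KeyedCoProcessFunction<String, Impression, Playback, String> {

        private transient ValueState<Impression> pendingImpression;
        private transient ValueState<Playback> pendingPlayback;

        @Override
        public void open(Configuration parameters) {
            pendingImpression = getRuntimeContext().getState(
                new ValueStateDescriptor<>("pendingImpression", Impression.class));
            pendingPlayback = getRuntimeContext().getState(
                new ValueStateDescriptor<>("pendingPlayback", Playback.class));
        }

        @Override
        public void processElement1(Impression imp, Context ctx, Collector<String> out) throws Exception {
            Playback pb = pendingPlayback.value();
            if (pb != null) {
                out.collect(imp.sessionId + ": " + imp.payload + " + " + pb.payload);
                pendingPlayback.clear();
            } else {
                pendingImpression.update(imp);   // wait for the matching playback event
            }
        }

        @Override
        public void processElement2(Playback pb, Context ctx, Collector<String> out) throws Exception {
            Impression imp = pendingImpression.value();
            if (imp != null) {
                out.collect(imp.sessionId + ": " + imp.payload + " + " + pb.payload);
                pendingImpression.clear();
            } else {
                pendingPlayback.update(pb);
            }
        }
    }
}
```

Co-partitioning both streams on the same key keeps the buffered state local to each parallel task, so Flink's checkpoints make the join fault tolerant, which is where the talk's points about keyed state, failure recovery, and schema evolution come in.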
2:55pm - 3:45pm
Session details to follow.
4:10pm - 5:00pm
We have been served well by Zookeeper over the years, but it is time for Kafka to stand on its own. This is a talk on the ongoing effort to replace the use of Zookeeper in Kafka: why we want to do it and how it will work. We will discuss the limitations we have found and how Kafka benefits, in terms of both stability and scalability, by bringing consensus in-house. This effort will not be completed overnight, but we will discuss our progress, what work remains, and how contributors can help.
5:25pm - 6:15pm
Debezium (noun | de·be·zi·um | /dɪ:ˈbɪ:ziːəm/) - Secret Sauce for Change Data Capture
Apache Kafka is a highly popular option for asynchronous event propagation between microservices. Things get challenging though when adding a service’s database to the picture: How can you avoid inconsistencies between Kafka and the database?
Enter change data capture (CDC) and Debezium. By capturing changes from the log files of the database, Debezium gives you both reliable and consistent inter-service messaging via Kafka and instant read-your-own-write semantics for services themselves.
In this session you’ll see how to leverage CDC for reliable microservices integration, e.g. using the outbox pattern, as well as many other CDC applications, such as maintaining audit logs, automatically keeping your full-text search index in sync, and driving streaming queries. We’ll also discuss practical matters, e.g. HA set-ups, best practices for running Debezium in production on and off Kubernetes, and the many use cases enabled by Kafka Connect's single message transformations.
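As a rough illustration of the outbox pattern mentioned above (the table and column names below are illustrative, not anything Debezium prescribes), a service writes its business row and the event describing it in the same database transaction. Debezium then captures the new outbox row from the transaction log and publishes it to Kafka, typically routed by its outbox event router transformation, so the database write and the Kafka message cannot diverge.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class OutboxSketch {

    // Inserts an order and the event describing it in one transaction.
    // Debezium tails the database log, picks up the new outbox row, and
    // publishes it to Kafka, so consumers see exactly what was committed.
    public static void placeOrder(String jdbcUrl, String customerId, long amountCents) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);
            String orderId = UUID.randomUUID().toString();

            try (PreparedStatement insertOrder = conn.prepareStatement(
                     "INSERT INTO orders (id, customer_id, amount_cents) VALUES (?, ?, ?)");
                 PreparedStatement insertEvent = conn.prepareStatement(
                     "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) "
                   + "VALUES (?, ?, ?, ?, ?)")) {

                insertOrder.setString(1, orderId);
                insertOrder.setString(2, customerId);
                insertOrder.setLong(3, amountCents);
                insertOrder.executeUpdate();

                insertEvent.setString(1, UUID.randomUUID().toString());
                insertEvent.setString(2, "Order");
                insertEvent.setString(3, orderId);
                insertEvent.setString(4, "OrderPlaced");
                insertEvent.setString(5,
                    "{\"orderId\":\"" + orderId + "\",\"customerId\":\"" + customerId
                  + "\",\"amountCents\":" + amountCents + "}");
                insertEvent.executeUpdate();

                conn.commit();   // both rows become visible atomically
            } catch (Exception e) {
                conn.rollback(); // neither the order nor the event is persisted
                throw e;
            }
        }
    }
}
```

Because the event is persisted through the same transactional log that Debezium reads, the service gets read-your-own-write semantics locally while downstream consumers receive a reliable, ordered stream of changes via Kafka.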