Presentation: The Whys and Hows of Database Streaming

Track: Emerging Trends in Data Engineering

Location: Bayview AB

Time: 5:25pm - 6:15pm

Day of week:


Level: Intermediate

Persona: Architect, Backend Developer, Data Engineering, Developer, General Software

This presentation is now available to view on InfoQ.com


Abstract

Batch-style ETL pipelines have long been the de facto method for getting data from OLTP to OLAP database systems. At WePay, when we first built our data pipeline from MySQL to BigQuery, we adopted this tried-and-true approach. However, as our company scaled and our business needs grew, we saw growing demand to make data available for analytics in real time. This led us to redesign our pipeline around a streaming approach using open-source technologies such as Debezium and Kafka.


This talk covers the central design pattern behind database streaming, change data capture (CDC), and its advantages over alternative approaches such as triggers and event sourcing. To solidify the concept, we will walk through our MySQL-to-BigQuery streaming pipeline in detail, explaining the core components involved and how we built the pipeline to be resilient to failure. Finally, we will expand on our ongoing work on the additional challenges of streaming from peer-to-peer distributed databases (e.g., Cassandra) and some potential solutions to them.
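
To picture the consuming side of such a CDC pipeline, here is a minimal sketch (not WePay's actual implementation) that reads Debezium change events for a single MySQL table from Kafka and streams the row images into BigQuery. The topic and table names are hypothetical, the kafka-python and google-cloud-bigquery client libraries are assumptions chosen for illustration, and the sketch assumes Debezium's JSON converter with schemas disabled, so each message value is the raw change-event envelope.

```python
# Minimal sketch: consume Debezium change events from Kafka and append the
# row images to a BigQuery table. Names below are illustrative only.
import json

from kafka import KafkaConsumer          # pip install kafka-python
from google.cloud import bigquery        # pip install google-cloud-bigquery

TOPIC = "dbserver1.inventory.customers"                 # hypothetical Debezium topic
BQ_TABLE = "my-project.analytics.customers_changelog"   # hypothetical BigQuery table

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    group_id="bq-loader",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,            # commit only after a successful load
)
bq = bigquery.Client()

for message in consumer:
    event = message.value or {}          # tombstones arrive as null values
    # With the JSON converter and schemas disabled, the value is the Debezium
    # envelope: {"before": ..., "after": ..., "op": "c|u|d|r", "ts_ms": ...}.
    op = event.get("op")
    row = event.get("after") if op in ("c", "u", "r") else event.get("before")
    if row is None:
        continue
    row["_op"] = op                      # keep the operation type for auditing
    row["_ts_ms"] = event.get("ts_ms")
    errors = bq.insert_rows_json(BQ_TABLE, [row])        # streaming insert
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
    consumer.commit()
```

Committing offsets only after a successful insert gives at-least-once delivery, so the BigQuery side should tolerate occasional duplicate change events.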

Speaker: Joy Gao

Sr. Software Engineer @WePay

Joy is a senior software engineer at WePay. She works on the data infrastructure team, building streaming and batch data pipelines with open-source software. She is a FOSS enthusiast and a committer on Apache Airflow.

