Presentation: Practical Change Data Streaming Use Cases With Apache Kafka & Debezium
What You’ll Learn
- Hear about change data capture (CDC) and the Debezium project.
- Find out about the use cases of CDC.
- Learn about the outbox pattern.
Abstract
Debezium (noun | de·be·zi·um | /dɪˈbiːziəm/) - Secret Sauce for Change Data Capture
Apache Kafka is a highly popular option for asynchronous event propagation between microservices. Things get challenging though when adding a service’s database to the picture: How can you avoid inconsistencies between Kafka and the database?
Enter change data capture (CDC) and Debezium. By capturing changes from the log files of the database, Debezium gives you both reliable and consistent inter-service messaging via Kafka and instant read-your-own-write semantics for services themselves.
In this session you’ll see how to leverage CDC for reliable microservices integration, e.g. using the outbox pattern, as well as many other CDC applications, such as maintaining audit logs, automatically keeping your full-text search index in sync, and driving streaming queries. We’ll also discuss practical matters, e.g. HA set-ups, best practices for running Debezium in production on and off Kubernetes, and the many use cases enabled by Kafka Connect's single message transformations.
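To make the capture side concrete, here is a minimal sketch using Debezium's embedded engine to tail a local MySQL instance and print each change event; all connection values are illustrative, and the property names follow Debezium 1.x conventions:

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

public class ChangeLogTailer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "orders-engine");
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        // Where the engine remembers its position in the database's log:
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("offset.flush.interval.ms", "10000");
        // Illustrative connection details:
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "3306");
        props.setProperty("database.user", "debezium");
        props.setProperty("database.password", "dbz");
        props.setProperty("database.server.id", "5400");
        props.setProperty("database.server.name", "dbserver1");
        props.setProperty("database.history", "io.debezium.relational.history.FileDatabaseHistory");
        props.setProperty("database.history.file.filename", "/tmp/dbhistory.dat");

        // Every committed row change is delivered as a JSON change event.
        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> System.out.println(event.value()))
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine); // runs until the engine is closed
    }
}
```

In a Kafka-based deployment you would instead register the connector with Kafka Connect, which publishes the same change events to Kafka topics.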
What is the work you're doing today?
I work as a software engineer at Red Hat, where I'm the lead of the Debezium project, a tool for change data capture.
What are the goals you have for the talk?
I would like to first and foremost familiarize people with the concepts and ideas of change data capture. What is it about? But most importantly, what are the use cases for it? Why would you want to use change data capture? There are many use cases, like data replication or data exchange between microservices, and you could use it to enable streaming queries or auditing. I would like to familiarize people with those concepts. It's liberation for your data: you have data sitting in a database, and CDC allows you to react to changes in that data. This liberation of data is what I would like to talk about.
Could you briefly describe what the outbox pattern is?
The idea there is that very often people have the requirement to update multiple things from within their application. Let's say they need to process a purchase order, but at the same time they would also like to update the search index, or they would like to send a message to Kafka to notify any downstream consumers about the order. Typically those two things, the database and Kafka, cannot be updated atomically within one global transaction. If you try to do this, you're bound to fail and will end up with inconsistencies. The outbox pattern is essentially a way to avoid this. How does it work? You don't only update the business tables in your database; within the same transaction you also insert an event record into an outbox table. You then capture the inserts from the outbox table and stream these change events to downstream consumers.
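As a sketch, the pattern boils down to two inserts in one local transaction; the table and column names here are hypothetical, but the structure is what matters:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

// Hypothetical schema: purchase_order(id, customer_id, total_cents) and
// outbox(id, aggregate_type, aggregate_id, event_type, payload).
public class OrderService {

    public void placeOrder(String customerId, long totalCents) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/orders", "app", "secret")) {
            conn.setAutoCommit(false); // one local transaction covers both writes
            String orderId = UUID.randomUUID().toString();

            // 1. Update the business table as usual.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO purchase_order (id, customer_id, total_cents) VALUES (?, ?, ?)")) {
                ps.setString(1, orderId);
                ps.setString(2, customerId);
                ps.setLong(3, totalCents);
                ps.executeUpdate();
            }

            // 2. Within the SAME transaction, record the event in the outbox table.
            //    Debezium picks up this insert from the log and streams it to Kafka.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) "
                            + "VALUES (?, ?, ?, ?, ?)")) {
                ps.setString(1, UUID.randomUUID().toString());
                ps.setString(2, "order");
                ps.setString(3, orderId);
                ps.setString(4, "OrderCreated");
                ps.setString(5, "{\"orderId\":\"" + orderId + "\",\"totalCents\":" + totalCents + "}");
                ps.executeUpdate();
            }

            conn.commit(); // both rows become visible atomically, or neither does
        }
    }
}
```

Because both inserts commit or roll back together, the event can never diverge from the business data, which is exactly the inconsistency that dual writes to the database and Kafka would risk.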
It's a two-step transaction?
In the end, you could say it's that. It essentially gives you instant read-your-own-writes semantics for your own changes: you could go to the database and see the newly persisted purchase order. But at the same time, it also gives you eventually consistent eventing to downstream consumers.
What advantages does having an event bus like Kafka in the middle of that flow give you?
First of all, there is this notion of decoupling. It's all asynchronous. Even if you cannot reach Kafka for some time, eventually you'll be able to send events to Kafka again, with the source application not being impacted by the downtime. Also, if any consumers of those events are not available, let's say we cannot access our search index for some reason, we are not bothered by that, because Kafka sits in between and decouples them. Then, one of the things which I really like about Kafka is that it's a durable log. We can keep change events in Kafka topics for as long as we want, and you can reread topics from the beginning. This means we could add a new consumer down the road, long after those change events were produced. For instance, we could add a consumer which takes the data and writes it to a data warehouse, and maybe we didn't even think about this use case when we were producing those events originally.
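As an illustration of that replay property, a consumer added long after the events were produced only needs a fresh group id and auto.offset.reset=earliest to read the topic from the first retained event; the topic name here is a hypothetical Debezium topic:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class WarehouseLoader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // A brand-new group id plus "earliest" makes this late-added consumer
        // start from the oldest change event still retained in the topic.
        props.put("group.id", "dwh-loader");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dbserver1.inventory.orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // In a real loader, transform the event and write it to the warehouse here.
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```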
What do you want people to leave the talk with?
Three things, mostly. One touches a bit on the outbox pattern: friends don't let friends do dual writes. That's one of the points I would like to talk about. The next one: people should get an understanding of what's in it for them if they use change data capture. What are the use cases? How could they benefit in their jobs by using it? And then finally, I would like to run them through some practical matters. How can you use this on things like Kubernetes, and what are typical topologies? Sometimes people would like to stream changes from a secondary database in a cluster. Those practical matters.