Track: Emerging Trends in Data Engineering

Location: Bayview AB

Data Engineering is becoming increasingly relevant to our highly connected, AI-driven world. In the past, software engineers focused their efforts on developing scalable web architectures, until they realized that their biggest headache was their data architecture. For most of us, data architecture simply meant running an RDBMS for all of our needs, from transactional read-write workloads to ad-hoc point and scan analytics loads. As our data grew, so did our use cases for data-driven products (e.g. fraud detection systems, recommender systems, personalization services), and these two rising trends combined to stress our RDBMSs beyond their capabilities. Data engineers entered the field to solve these problems by introducing specialized data stores (e.g. search engines, graph engines, NoSQL stores, large-scale data processing frameworks such as Spark, and stream processing engines such as Beam, Flink, and Spark) and the machinery to glue them together (e.g. ETL pipelines, Kafka, Sqoop, Flume). Today, data architectures are as vast and varied as the use cases they support. What are some emerging technologies and trends in this space, and how are some cutting-edge companies solving their problems? Come to this track to learn more.

Track Host: Sid Anand

Hacker at Large, Co-chair @QCon & Data Council, PMC & Committer @ApacheAirflow

Sid Anand recently served as PayPal's Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions including Agari's Data Architect, a Technical Lead in Search & Data Analytics @ LinkedIn, Netflix’s Cloud Data Architect, Etsy’s VP of Engineering, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. When not working, Sid enjoys spending time with family and friends.

10:35am - 11:25am

Data Engineering Open Space

11:50am - 12:40pm

Massively scaling MySQL using Vitess

Are you dealing with the challenges of rapid growth? Are you thinking about how to scale your database layer? Should you use NoSQL? Should you shard your relational database? If you are facing these kinds of problems, this session is for you. Vitess is a database solution for deploying, scaling and managing large clusters of MySQL instances. It's architected to run as effectively in a public or private cloud architecture as it does on dedicated hardware. It combines and extends many important MySQL features with the scalability of a NoSQL database. This session gives an overview of the salient features of Vitess, and at the end, we'll cover some advanced features with a demo.
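For a sense of what this looks like from the application side, here is a minimal, hypothetical sketch (not taken from the talk): Vitess's query router, vtgate, speaks the MySQL wire protocol, so an ordinary MySQL client can query a sharded keyspace without code changes. The host, port, credentials, and table names below are illustrative assumptions.

    # Hypothetical sketch: querying a sharded Vitess keyspace through vtgate,
    # which speaks the MySQL wire protocol. Connection details are assumed.
    import pymysql

    conn = pymysql.connect(
        host="vtgate.example.internal",  # vtgate endpoint (assumed)
        port=15306,                      # vtgate MySQL port used in the local examples
        user="app_user",
        password="app_password",
        database="commerce",             # a Vitess keyspace, exposed as a schema
    )

    with conn.cursor() as cur:
        # Vitess routes the query to the right shard(s) using the keyspace's
        # VSchema; the application code looks like plain MySQL access.
        cur.execute(
            "SELECT customer_id, email FROM customer WHERE customer_id = %s", (42,)
        )
        print(cur.fetchone())

    conn.close()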

Sugu Sougoumarane, Co-Founder / CTO @planetscaledata & Co-Creator @vitessio

1:40pm - 2:30pm

Transaction Processing in FoundationDB

FoundationDB provides users strongly consistent transactions without a two-phase commit protocol. This talk will go through the architecture of FoundationDB and describe what is happening in the internals of the database when a client commits a transaction.
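As a rough illustration of the client-side model the talk builds on, here is a minimal sketch using FoundationDB's official Python bindings: the @fdb.transactional decorator runs the function body as a single strictly serializable transaction and retries it on conflict. The ('balance', account) key layout and the amounts are illustrative assumptions, not details from the talk.

    # Minimal sketch of a FoundationDB transaction using the Python bindings.
    import fdb

    fdb.api_version(630)   # pin the client API version
    db = fdb.open()        # uses the default cluster file

    # Seed two illustrative accounts (each assignment commits its own transaction).
    db[fdb.tuple.pack(('balance', 'alice'))] = fdb.tuple.pack((100,))
    db[fdb.tuple.pack(('balance', 'bob'))] = fdb.tuple.pack((0,))

    @fdb.transactional
    def transfer(tr, src, dst, amount):
        # All reads and writes in this function commit atomically; on a
        # conflict, the decorator retries the whole function.
        src_key = fdb.tuple.pack(('balance', src))
        dst_key = fdb.tuple.pack(('balance', dst))
        src_bal = fdb.tuple.unpack(tr[src_key])[0]
        dst_bal = fdb.tuple.unpack(tr[dst_key])[0]
        tr[src_key] = fdb.tuple.pack((src_bal - amount,))
        tr[dst_key] = fdb.tuple.pack((dst_bal + amount,))

    transfer(db, 'alice', 'bob', 10)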

Evan Tschannen, Lead Developer/Committer FoundationDB

2:55pm - 3:45pm

Patterns of Streaming Applications

Stream processing engines are becoming pivotal in analyzing data. They have evolved from simple data transport and processing machinery into engines capable of complex processing. The necessary features and building blocks of these engines are well known, and most capable engines offer a familiar Dataflow-based programming model.

As with any new paradigm, building streaming applications requires a different mindset and approach. Hence, there is a need to identify and describe patterns and anti-patterns for building these applications; at present, that shared knowledge is scarce.

Drawing on my experience working with several engineers within and outside of Netflix, this talk will present the following:

  • A blueprint for streaming data architectures and a review of desirable features of a streaming engine
  • Streaming Application patterns and anti-patterns
  • Use cases and concrete examples using Flink

Attendees will come away with patterns that can be applied to any capable stream processing framework such as Apache Flink.
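As a small taste of the programming model, the sketch below shows one of the simplest streaming patterns, a keyed running count, written against the PyFlink DataStream API. It is a hypothetical illustration rather than an example from the talk, and the event data is made up.

    # Hypothetical PyFlink sketch: a keyed running count over a small stream.
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    # In a real pipeline this would be a Kafka or other streaming source.
    events = env.from_collection([
        ("play", 1), ("pause", 1), ("play", 1), ("play", 1),
    ])

    counts = (
        events
        .key_by(lambda e: e[0])                      # partition by event type
        .reduce(lambda a, b: (a[0], a[1] + b[1]))    # maintain a running count per key
    )

    counts.print()
    env.execute("event_counts")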

Monal Daxini, Distributed Systems Engineer / Leader @Netflix

4:10pm - 5:00pm

Training Deep Learning Models at Scale on Kubernetes

Deep learning has recently become very important for all kinds of AI applications, from conversational chatbots to self-driving cars. In this talk, we will discuss how we use deep learning for natural language processing, train deep learning models with TensorFlow, run TensorFlow on top of Kubernetes, and make use of GPUs.

We need to train deep learning models for each conversational bot that we deploy on our platform. Training individual bots on one-off systems using ad-hoc processes is no longer feasible, as it does not scale with the number of bots in our system. To address these requirements, we have built a framework for running long-running jobs that leverages our existing Kubernetes infrastructure. We have designed our jobs framework to provide the following key benefits:

  1. Jobs can be executed on a fixed schedule, by a manual trigger, or by an automated trigger (i.e., some other event in our system can trigger a job).
  2. High availability of job workers.
  3. The ability to scale the number of workers for each job type up or down based on need.
  4. The ability to assign specific attributes to specific workers. For example, we ensure that our training workers always execute on GPU nodes so that they can take full advantage of the GPU resources available in our infrastructure.
  5. Simplified job management, including the ability to monitor, audit, and debug each job that was executed. Further, using our systems for centralized logging and monitoring, we can quickly understand key results from a job. For model training jobs, for example, we can quickly look at the confusion matrix to decide whether a trained model should be promoted to our production systems.

In the talk, we will present how we have leveraged Kubernetes to realize each of the above benefits.
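As a rough, hypothetical sketch of how such a framework might submit work (not PassageAI's actual implementation), the snippet below uses the official Kubernetes Python client to create a batch Job that is pinned to GPU nodes via a node selector and a GPU resource limit. The image name, node label, and namespace are assumptions.

    # Hypothetical sketch: submitting a training run as a Kubernetes Job
    # that is scheduled onto GPU nodes. Image, labels, namespace are assumed.
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when running in-cluster

    container = client.V1Container(
        name="train-bot-model",
        image="registry.example.com/bot-trainer:latest",      # assumed image
        args=["--bot-id", "support-bot-42"],                   # assumed arguments
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    )

    pod_spec = client.V1PodSpec(
        containers=[container],
        restart_policy="Never",
        node_selector={"accelerator": "gpu"},  # only schedule onto GPU nodes (assumed label)
    )

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="train-support-bot-42"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=2,  # retry a failed worker a couple of times
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="training", body=job)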

Deepak Bobbarjung, Founding Engineer @PassageAI
Mitul Tiwari, CTO @PassageAI

5:25pm - 6:15pm

The Whys and Hows of Database Streaming

Batch-style ETL pipelines have long been the de facto method for getting data from OLTP to OLAP database systems. At WePay, when we first built our data pipeline from MySQL to BigQuery, we adopted this tried-and-true approach. However, as our company scaled and our business needs grew, we observed a stronger demand for making data available for analytics in real time. This led us to redesign our pipeline around a streaming approach built on open-source technologies such as Debezium and Kafka.


This talk goes over the central design pattern behind database streaming, change data capture (CDC), and its advantages over alternative approaches such as triggers or event sourcing. To solidify the concept, we will go through our MySQL-to-BigQuery streaming pipeline in detail, explaining the core components involved and how we built the pipeline to be resilient to failure. Finally, we will expand on some of our ongoing work around the additional challenges we face when streaming from peer-to-peer distributed databases such as Cassandra, and some potential solutions to them.
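To make the CDC idea concrete, here is a minimal, hypothetical sketch of the consuming side of such a pipeline (not WePay's implementation): reading Debezium change events from Kafka and appending the row images to a BigQuery table. The topic, table, and connection details are assumptions; with Debezium's default JSON converter, each change event carries a "payload" envelope with "op", "before", and "after" fields.

    # Hypothetical sketch: Kafka consumer of Debezium change events feeding BigQuery.
    import json
    from kafka import KafkaConsumer
    from google.cloud import bigquery

    consumer = KafkaConsumer(
        "mysql.payments.transactions",            # Debezium topic (assumed)
        bootstrap_servers=["kafka:9092"],          # assumed broker address
        value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v is not None else None,
        group_id="bq-loader",
    )
    bq = bigquery.Client()
    table_id = "my-project.payments.transactions_changelog"   # assumed table

    for message in consumer:
        if message.value is None:                  # skip tombstone records
            continue
        envelope = message.value["payload"]        # Debezium change envelope
        if envelope["op"] in ("c", "u", "r"):      # create, update, snapshot read
            row = envelope["after"]                # row image after the change
            errors = bq.insert_rows_json(table_id, [row])
            if errors:
                raise RuntimeError(f"BigQuery insert failed: {errors}")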

Joy Gao, Sr. Software Engineer @WePay
