Presentation: Training Deep Learning Models at Scale on Kubernetes
Abstract
Deep learning has become central to a wide range of AI applications, from conversational chatbots to self-driving cars. In this talk, we describe how we use deep learning for natural language processing, train our models with TensorFlow, run TensorFlow on top of Kubernetes, and make use of GPUs.
We need to train deep learning models for each conversational bot that we deploy on our platform. Training individual bots on one-off systems with ad hoc processes is no longer feasible, because it does not scale with the number of bots in our system. To address this, we have built a framework for long-running jobs that leverages our existing Kubernetes infrastructure. We designed the jobs framework to provide the following key benefits:
- Jobs can be executed on a fixed schedule, by a manual trigger, or by an automated trigger (i.e., some other event in our system can trigger a job).
- High availability of job workers.
- Scale up (or down) the number of workers for each job type based on need.
- We can assign specific attributes to specific workers. For example, we ensure that our training workers are always executed on GPU nodes so that they can take full advantage of the GPU resources available in our infrastructure.
- Simplified job management. This includes the ability to monitor, audit, and debug each job that was executed. Further, using our systems for centralized logging and monitoring, we can quickly understand key results from a job. For example, for model training jobs, we can look at the confusion matrix to decide whether the trained model should be promoted to our production systems.
In the talk, we will present how we have leveraged Kubernetes to realize each of the above benefits.
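To make the scheduling and GPU-placement points concrete, here is a minimal sketch of how a per-bot training job could be described to Kubernetes. It is not the framework presented in the talk: the function name `training_cronjob`, the `accelerator: gpu` node label, the `job-type` label, and the `--bot-id` argument are all hypothetical. It assumes GPUs are exposed through the NVIDIA device plugin (resource name `nvidia.com/gpu`) and that the cluster serves `CronJob` under `batch/v1` (older clusters use `batch/v1beta1`).

```python
import json


def training_cronjob(bot_id: str, image: str, schedule: str, gpus: int = 1) -> dict:
    """Build a Kubernetes CronJob manifest (as a plain dict) for one bot's training run.

    Assumes bot_id is already a valid lowercase DNS-style name fragment.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {
            "name": f"train-{bot_id}",
            "labels": {"job-type": "training"},  # hypothetical label
        },
        "spec": {
            "schedule": schedule,           # standard cron syntax
            "concurrencyPolicy": "Forbid",  # skip a run if the previous one is still going
            "jobTemplate": {
                "spec": {
                    "backoffLimit": 2,      # retry a failed training pod at most twice
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            # Hypothetical node label marking GPU nodes.
                            "nodeSelector": {"accelerator": "gpu"},
                            "containers": [
                                {
                                    "name": "trainer",
                                    "image": image,
                                    "args": ["--bot-id", bot_id],  # hypothetical flag
                                    # Requesting a GPU also keeps the pod off nodes without one.
                                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                                }
                            ],
                        }
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = training_cronjob("bot42", "example.com/trainer:latest", "0 2 * * *")
    print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest could be piped into `kubectl apply -f -`. A manual or event-driven trigger would create a plain `Job` from the same pod template instead of relying on the schedule.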