Building Scalable Data Pipelines With Argo Workflows | David Joyce | Conf42 Kube Native 2022
At Spectrum Labs, our deep learning models are trained and validated against large datasets that can only be processed in a distributed manner. Data Scientists require the ability to invoke batch jobs as part of the model development lifecycle, and these jobs must run in a scalable, fault-tolerant manner. We chose Argo Workflows as our data pipeline framework because it provides a container-native workflow engine that lets us build on our existing Kubernetes deployment. We use Apache Spark as our distributed processing engine, which we deploy and manage ourselves on Kubernetes. The integration of these two technologies gives us a batch job framework that meets our Data Scientists' needs.

In this session, we will provide an overview of Argo Workflows and how we use it within Spectrum Labs to execute Spark jobs that process datasets containing over 100 million records. We will demo our current pipelines and describe how we have built a framework that allows us to deploy a scalable Spark application with only a few lines of configuration. We will also provide an overview of Argo Events, which allows us to orchestrate our pipelines in an automated manner. Lastly, we will discuss some of the advantages and challenges of using the Argo framework.
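The talk itself demos the actual pipelines; as a rough illustration of the kind of Spark batch job the abstract refers to, the sketch below reads a large partitioned dataset, aggregates it, and writes the result back for downstream steps. The paths, column names, and application name are hypothetical and are not taken from Spectrum Labs' pipelines.

    # Hypothetical PySpark sketch of a distributed batch job over a large dataset.
    # Assumes a Spark deployment (e.g. Spark on Kubernetes) and object storage access;
    # all names below are illustrative only.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("example-batch-job")
        .getOrCreate()
    )

    # Read a partitioned Parquet dataset (potentially 100M+ records).
    records = spark.read.parquet("s3a://example-bucket/records/")

    # Example transformation: count records per label for validation reporting.
    summary = (
        records
        .groupBy("label")
        .agg(F.count("*").alias("record_count"))
    )

    # Write the results where later pipeline steps (e.g. model training) can pick them up.
    summary.write.mode("overwrite").parquet("s3a://example-bucket/summaries/")

    spark.stop()

In a setup like the one described, a job of this shape would be packaged in a container image and invoked as a step of an Argo Workflow, with the workflow engine handling scheduling and retries on Kubernetes.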