Designing an intuitive framework for complex pipelines in PySpark by Sebastian Ånerud
When creating complex, pipelined jobs in PySpark, your code quickly becomes unstructured and virtually impossible to read, especially when working with the DataFrame API. As part of a large-scale project, we developed a flexible code structure, built on the Spark SQL interface, which allowed easy addition of new jobs, increased readability, and made collaboration easier. In this presentation we share our key findings on how to structure your project for readability, flexibility, and maintainability.
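As a rough illustration of the kind of structure the abstract alludes to, here is a minimal sketch in which each pipeline step is a named SQL transformation registered as a temporary view, so adding a job means adding one entry to a list. All names here (STEPS, run_pipeline, the example tables) are hypothetical, not the speaker's actual framework.

```python
# Minimal sketch: a pipeline expressed as named Spark SQL steps over temp views.
# Names and table layout are illustrative assumptions, not from the talk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Each step is (output view name, SQL text); later steps can reference
# earlier views by name, so a new job is just another entry here.
STEPS = [
    ("clean_events", """
        SELECT user_id, event_type, CAST(ts AS timestamp) AS ts
        FROM raw_events
        WHERE user_id IS NOT NULL
    """),
    ("daily_counts", """
        SELECT user_id, DATE(ts) AS day, COUNT(*) AS n_events
        FROM clean_events
        GROUP BY user_id, DATE(ts)
    """),
]

def run_pipeline(spark, steps):
    """Run each SQL step and register its result as a temp view."""
    for view_name, sql in steps:
        spark.sql(sql).createOrReplaceTempView(view_name)
    # The last view holds the pipeline's final result.
    return spark.table(steps[-1][0])

# Usage: register a source table, then run the pipeline end to end.
raw = spark.createDataFrame(
    [("u1", "click", "2024-01-01 10:00:00"),
     (None, "view", "2024-01-01 11:00:00")],
    ["user_id", "event_type", "ts"],
)
raw.createOrReplaceTempView("raw_events")
run_pipeline(spark, STEPS).show()
```

One appeal of this shape is that each step reads as plain SQL with an explicit name, which keeps long chains of DataFrame method calls out of the main flow and makes steps easy to review in isolation.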