List of videos

Talk: Javier Jorge Cano - “Sorry, Could you repeat that again?” - Speech Recognition with Python

Presented by: Javier Jorge Cano Nowadays, we are surrounded by devices that can listen to us: Alexa, Siri, Cortana, etc., and interaction with them has become easier and more intuitive. The first challenge in communicating colloquially with all these devices is converting the voice signal to text. To do this, several approaches based on search methods, algorithmic techniques, and machine learning are combined in very smart and interesting ways. In this talk, I will introduce the speech recognition systems that underpin these devices. This will be illustrated with a guided example in which we will develop a system to recognize isolated words in Python. Finally, I will show how we are implementing these and more advanced techniques in our production systems, providing transcriptions for different companies and institutions, using Python in different parts of the process. Slides: https://drive.google.com/drive/folders/109zntIEpvth370iOZAr9-IiBvPHc1UB7?usp=sharing
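A classic baseline for the isolated-word task the talk describes is template matching with dynamic time warping (DTW), which aligns two utterances of different lengths frame by frame. This sketch assumes feature extraction (e.g. MFCCs) has already been done and is not necessarily the method used in the talk:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a, b: 2-D arrays of shape (frames, features), e.g. MFCC frames.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame of a
                                 cost[i, j - 1],      # skip a frame of b
                                 cost[i - 1, j - 1])  # match frames
    return cost[n, m]

def recognize(utterance, templates):
    """Return the label of the template closest to the utterance."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
```

Given one recorded template per word, `recognize` picks the word whose template warps onto the input with the lowest total cost.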

Watch
Talk: Robson Luis Monteiro Junior - Polyglot data with python: Introducing Pandas and Apache Arrow

Presented by: Robson Luis Monteiro Junior Nowadays Python is synonymous with data, but it is not necessarily the best choice for every data task. For example, exchanging data between different ecosystems is one of the challenges for Python. Pandas and NumPy are efficient, de facto tools for handling a reasonable amount of data with good performance, but they are limited outside of the Python ecosystem. Acquiring and exchanging data can be painful: to talk to other ecosystems you either write slow conversion code or generate unnecessarily large files, such as huge CSVs. Apache Arrow together with Pandas is a great option: these technologies handle such problems with excellent performance and play natively with Python. This talk aims to show how to work in a heterogeneous environment where data coming from another ecosystem is handled inside the Python ecosystem and sent back to the other ecosystem transparently.

Watch
Talk: Daniel Imberman - Bridging Data Science and Data Infrastructure with Apache Airflow

Presented by: Daniel Imberman When supporting a data science team, data engineers are tasked with building a platform that keeps a wide range of stakeholders happy. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports. Collaboration between these stakeholders can be difficult, as every data science pipeline has a unique set of constraints and system requirements (compute resources, network connectivity, etc.). For these reasons, data engineers strive to give their data scientists as much flexibility as possible, while maintaining an observable and resilient infrastructure. In recent years, Apache Airflow (a Python-based task orchestrator developed at Airbnb) has gained popularity as a collaborative platform between data-centric Pythonistas and infrastructure engineers looking to spare their users from verbose and rigid YAML files. Apache Airflow exposes a flexible Pythonic interface that can be used as a collaboration point between data engineers and data scientists. Data engineers can build custom operators that abstract details of the underlying system, and data scientists can use those operators (and many more) to build a diverse range of data pipelines. In this 30-minute talk, we will take an idea from a single-machine Jupyter Notebook to a cross-service Spark + TensorFlow pipeline, to a canary-tested, hyper-parameter-tuned, production-ready model served on Google Cloud Functions. We will show how Apache Airflow can connect all layers of a data team to deliver rapid results.
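The "Pythonic interface" in question is a DAG definition file. A minimal sketch (using Airflow 2 import paths; the DAG id, schedule, and callables are illustrative, not from the talk) looks like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull data from source systems

def train():
    ...  # fit the model on the extracted data

with DAG(
    dag_id="notebook_to_production",   # hypothetical pipeline name
    start_date=datetime(2020, 4, 1),
    schedule_interval="@daily",
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    train_task = PythonOperator(task_id="train", python_callable=train)

    # The >> operator declares the dependency graph:
    # train runs only after extract succeeds.
    extract_task >> train_task
```

Custom operators built by data engineers slot into the same `task_id`/dependency structure, which is what makes the file a shared artifact between the two groups.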

Watch
Talk: Matthew Rocklin - Deploying Python at Scale with Dask

Presented by: Matthew Rocklin This talk discusses the challenges and options to scale Python with Dask on distributed hardware. We particularly focus on how Dask gets deployed on cluster resource managers like Kubernetes, YARN, and the cloud today. The Python data science stack (NumPy, Pandas, scikit-learn and others) has become the gold standard in most data-centric fields due to a combination of intuitive high-level APIs and efficient low-level code. However, these libraries were not originally designed to scale beyond a single CPU or data that fits in memory. Over the last few years the Dask library has worked with these libraries to provide scalable variants, which do run on multi-core workstations or on distributed clusters. This has allowed advanced users the ability to scale Python to handle 100+TB datasets. However, deploying Dask within an institution remains a challenge. How do we balance load across many machines? How do we share with other distributed systems running on those same machines? How do we control access and provide authentication and security? As more institutions adopt Python to handle scalable computation these questions arise with greater urgency. This talk discusses the options today to deploy Dask securely within an institution on distributed hardware, and dives into some examples where this has had a large positive social impact.
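The "scalable variants" idea can be seen on a single machine before any deployment questions arise. A minimal sketch with `dask.array` (sizes are arbitrary): the same code runs unchanged on a cluster once a `dask.distributed` client is connected, and deploying that cluster is what the talk addresses.

```python
import dask.array as da

# A 10,000 x 10,000 array split into 1,000 x 1,000 chunks; each chunk
# is a plain NumPy array, and operations run on chunks in parallel.
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))

# Nothing is computed yet -- Dask only builds a task graph...
total = (x + 1).sum()

# ...and .compute() executes the graph on threads, processes,
# or a distributed cluster.
result = total.compute()
```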

Watch
Talk: Colin Carroll - Getting started with automatic differentiation

Presented by: Colin Carroll The derivative is a concept from calculus which gives you the rate of change of a function: for a small change in an input, how much does the output change? This idea turns out to be very important in natural sciences, and is used in many optimization algorithms, which find the maximum or minimum of functions. Automatic differentiation is a technique for computing the derivative of a function. Python has a number of libraries implementing automatic differentiation, many of which are put to use for deep learning, but can be used on their own. In this talk I will give intuition for the derivative and its high dimensional sibling, the gradient. We will take a tour of applications, including optimization and computational art, with examples using jax, TensorFlow, and PyTorch. We conclude with a brief description of alternative ways of computing derivatives in Python, and their relative strengths.
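Libraries like JAX, TensorFlow, and PyTorch implement this at scale (including reverse mode), but the core idea of forward-mode automatic differentiation fits in a short pure-Python sketch using dual numbers, where arithmetic propagates a value and its derivative together:

```python
class Dual:
    """A value together with its derivative w.r.t. a chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f and df/dx at x in a single forward pass."""
    return f(Dual(x, 1.0)).deriv
```

For example, `derivative(lambda x: 3 * x * x + 2 * x, 4.0)` evaluates d/dx (3x² + 2x) = 6x + 2 at x = 4, giving 26, without any symbolic manipulation or finite differences.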

Watch
Sponsor Workshop: Google: Charles Engelke - Serverless Python Applications with Google Cloud

Presented by: Charles Engelke You know how to program in Python. Now learn how to apply that skill to building distributed serverless applications in the cloud. You won’t write just one program, you’ll create several and connect them through a shared database, network requests, message queues, or triggering events. This is a hands-on workshop–bring your own PC and leave having built and deployed a significant cloud application. You’ll see not only how to use different cloud services to run your code, but why you would choose each one for an application’s specific needs. The workshop uses Google Cloud Platform services, including Cloud Functions, Cloud Run, App Engine, Firestore, PubSub, and Identity-Aware Proxy, but the concepts covered can be applied to any major cloud platform. Additional materials (slides, source code, codelabs, email addresses) available at https://serverlessworkshop.dev/
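The smallest of the services covered, Cloud Functions, runs a plain Python function per HTTP request. A minimal sketch (the function name and greeting are illustrative; on Google Cloud, `request` is a Flask request object):

```python
# main.py -- an HTTP-triggered Cloud Function.

def hello_http(request):
    """Respond to an HTTP request with a greeting.

    `request` is a flask.Request when run on Cloud Functions or under
    the local functions-framework; it may carry a ?name= query arg.
    """
    name = "World"
    if request is not None and request.args and "name" in request.args:
        name = request.args["name"]
    return f"Hello, {name}!"
```

Deployed with `gcloud functions deploy hello_http --runtime python38 --trigger-http`, the platform handles scaling, routing, and TLS, which is what distinguishes this from the Cloud Run and App Engine options in the workshop.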

Watch
Sponsor Workshop: Microsoft: Tania Allard - Easy Data Processing with Azure Functions

Serverless, or Function as a Service (FaaS), enables you to focus more on your code while making it easier to deploy your applications. FaaS can be a great tool for data processing scenarios. Azure Functions lets you leverage Azure's robust, managed, and scalable cloud computing platform. In this workshop, Tania Allard will teach you how to get started with Azure Functions and Python for data processing scenarios. ✨ Check the tutorial at https://aka.ms/pycon2020-azurefunctions ✨ Slides: https://aka.ms/pycon2020-azurefunctions-slides ✨ GitHub repository https://github.com/trallard/pycon2020-azure-functions 🐍 PyCon schedule description: https://us.pycon.org/2020/schedule/presentation/230/
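A data processing scenario of the kind the workshop covers boils down to a pure transformation you would place inside the function's handler, letting the platform handle triggering and scaling. A stdlib-only sketch (the payload shape and field names are hypothetical):

```python
import json
from collections import defaultdict
from statistics import mean

def summarize(payload: str) -> str:
    """Aggregate a JSON list of sensor readings into per-sensor averages.

    This is the kind of logic an HTTP- or queue-triggered Azure
    Function handler would delegate to.
    """
    readings = json.loads(payload)
    by_sensor = defaultdict(list)
    for r in readings:
        by_sensor[r["sensor"]].append(r["value"])
    return json.dumps({s: mean(v) for s, v in by_sensor.items()})
```

Keeping the transformation separate from the trigger wiring also makes it trivially unit-testable outside the cloud.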

Watch
Sponsor Workshop: Louise Grandjonc - Optimize Python & Django apps with Postgres superpowers

Building Django apps that perform well & scale can be a challenge. In this video, you will learn Postgres superpowers to help optimize performance of Python and Django apps, using a sample Django application which runs on Azure Database for PostgreSQL. This talk walks through how to tackle the most common pain point: slow queries. You’ll dive into the Django ORM to identify performance bottlenecks, and we’ll analyze both Python and SQL code. We’ll explore the usefulness of pg_stat_statements, django-debug-toolbar, Postgres logs, keyset pagination—as well as Query Performance Insight and Performance Recommendations on Azure. You’ll learn how to eradicate loops in your code (this is important); how to limit what you select; and how to scale out Postgres horizontally using Hyperscale (Citus) on Azure Database for PostgreSQL, in a way that is transparent to your app and does not require any re-architecture. Just like chocolate and peanut butter go better together, so do Python and Postgres. These Postgres optimization techniques (what I call “superpowers”) will empower you to improve the performance of your Python and Django apps. (Also featuring fish I hand-painted many years ago, before I became a Django and Postgres developer.) 
#Python #PostgreSQL #Django ✅ More demos, workshops, and labs from our PyCon 2020 team: https://aka.ms/pycon2020 ✅ Hyperscale (Citus) Quickstart Docs: https://aka.ms/hyperscale-citus-quickstart ✅ GitHub repo for Citus open source extension to Postgres: https://aka.ms/citus ✅ Slides for this PyCon talk on Optimizing your Django & Python apps with Postgres superpowers: https://aka.ms/optimize-django-postgres-superpowers-slides ✅ Talk on Debugging, “When it all goes Wrong (with Postgres)” https://aka.ms/all-goes-wrong-PG-talk ✅ Louise Grandjonc’s DjangoCon talk on Postgres Index Types: https://aka.ms/django-PG-index-talk ✅ GitHub repo for Django Ad Application used in this video: https://aka.ms/django-PG-ad-app ✅ Join Citus monthly technical newsletter, made with love by the Postgres team at Microsoft: https://aka.ms/citus-newsletter
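One of the techniques named above, keyset pagination, can be sketched with the stdlib `sqlite3` module standing in for Postgres (the table and columns are hypothetical; in Django you would express the same `WHERE id > ?` filter through the ORM):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO ads (title) VALUES (?)",
                 [(f"ad {i}",) for i in range(1, 101)])

def next_page(conn, last_id, page_size=10):
    """Fetch the page after `last_id` using the primary-key index.

    Unlike OFFSET, which scans and discards every skipped row, a
    keyset query seeks straight to the right place in the index,
    so page 1,000 costs the same as page 1.
    """
    return conn.execute(
        "SELECT id, title FROM ads WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

page1 = next_page(conn, 0)               # rows 1-10
page2 = next_page(conn, page1[-1][0])    # rows 11-20
```

The trade-off is that keyset pagination needs a stable ordering column and cannot jump to an arbitrary page number.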

Watch
Sponsor Workshop: Mark Ibrahim - Facebook: Machine Learning on Encrypted Data with CrypTen

Presented by: Mark Ibrahim CrypTen is a machine learning framework built on PyTorch that enables you to easily study and develop machine learning models using secure computing techniques. CrypTen allows you to develop models with the PyTorch API while performing computations on encrypted data, without revealing the protected information. Different parties can contribute information to the model or measurement without revealing what they contributed. In this workshop, we will teach participants how to use CrypTen using interactive notebooks; participants should bring a laptop with Jupyter Notebook installed. We will work through four common use scenarios for privacy-preserving machine learning, using secure multiparty computation to allow learning without sharing data:

- Feature Aggregation: multiple parties hold distinct sets of features, and want to perform computations over the joint feature set.
- Data Labeling: one party holds feature data while another party holds corresponding labels, and they would like to learn a relationship between the features and labels.
- Dataset Augmentation: several parties each hold a small number of observations, and would like to use all the observations in order to improve the statistical power of a measurement or model.
- Model Hiding: one party has access to a trained model, while another party would like to apply that model to its own data.

What we'll cover:

- Installation / Setup (20 min)
- Machine Learning and CrypTen (10 min)
- Secure Multiparty Compute and Tensors in CrypTen (15 min)
- Training a Machine Learning Model on Encrypted Data (45 min)

Tutorial slides: https://github.com/facebookresearch/CrypTen
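The multiparty idea underlying all four scenarios can be illustrated with additive secret sharing, the building block of secure multiparty computation. This pure-Python sketch only shows the concept; CrypTen itself applies it to fixed-point-encoded PyTorch tensors:

```python
import random

P = 2**31 - 1  # public prime modulus; all arithmetic is done mod P

def share(secret, n_parties=3):
    """Split `secret` into n additive shares that sum to it mod P.

    Any n-1 shares together look uniformly random, so no strict
    subset of parties learns anything about the secret.
    """
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reveal(shares):
    """Recombine shares to recover the secret."""
    return sum(shares) % P

def add_shared(x_shares, y_shares):
    """Each party adds its local shares; the result is a sharing
    of x + y, computed with no communication and no decryption."""
    return [(x + y) % P for x, y in zip(x_shares, y_shares)]
```

Addition is the easy case; secure multiplication (and hence model training) requires extra protocol machinery, which is exactly what frameworks like CrypTen package up.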

Watch