Migrating to Multi Cluster Managed Kafka - 0 Downtime | Natan Silnitsky | Conf42 Cloud Native 2022

Conference: Conf42 Cloud Native 2022

Year: 2022

As Wix's Kafka usage grew to 1.5B messages per day, over 10K topics, and over 100K leader partitions serving 2,000 microservices, we decided to migrate from a self-run cluster per data center to a managed cloud service (Confluent Cloud) with a multi-cluster setup. This talk covers how we migrated successfully with 0 downtime under full traffic, and the lessons we learned along the way. These lessons include:

1. Automation, Automation, Automation - at this scale, the entire process has to be completely automated.
2. Prefer a gradual approach - e.g. migrate topics in small chunks rather than all at once. This reduces the risk if things go wrong.
3. First migrate test topics with relayed real traffic - the data is real but does not affect production.
4. Clean up first - avoid migrating unused topics or topics with too many unnecessary partitions.
5. Adapt to Confluent Cloud APIs - e.g. for lag monitoring.

Other talks at this conference 🚀🪐 https://www.conf42.com/cloud2022

0:00 Intro
1:27 Talk
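
As a rough illustration of lessons 2 and 4 (migrate in small chunks, and clean up before migrating), here is a minimal Python sketch. It is not the tooling described in the talk: the `Topic` fields, the `messages_last_30d` usage metric, and the function names are all hypothetical, and the actual migration of each chunk is left out.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Topic:
    name: str
    partitions: int
    messages_last_30d: int  # hypothetical usage metric, not from the talk


def drop_unused(topics: List[Topic]) -> List[Topic]:
    """Cleanup first: keep only topics that saw traffic recently."""
    return [t for t in topics if t.messages_last_30d > 0]


def in_chunks(topics: List[Topic], size: int) -> Iterator[List[Topic]]:
    """Gradual approach: yield topics in small batches, preserving order."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(topics), size):
        yield topics[start:start + size]


def plan_migration(topics: List[Topic], chunk_size: int) -> List[List[Topic]]:
    """Filter out unused topics, then split the rest into migration batches."""
    return list(in_chunks(drop_unused(topics), chunk_size))
```

Each batch returned by `plan_migration` could then be migrated and verified before the next one starts, so a failure only affects a small set of topics.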