Operating a multi-tenant distributed system at scale
Operating a distributed system at scale is challenging. Here we will provide insight into how we at Vespa.ai manage Vespa Cloud, our globally distributed stateful system, looking at both the technical aspects and how we work as a team.
This talk focuses on how we quickly mitigate operational impact and deliver new features daily while maintaining 10,000 containers across 3,000 machine instances handling 400,000 queries per second.
Intended audience
This talk primarily targets engineers (and their managers) providing hosted services to their customers. The talk will also be valuable for product owners and other roles involved with product planning. Using operational issues and lessons learned as input to future work helps build systems that are more stable and easier to operate, freeing the team to focus on the next great feature.
Eirik Nygaard is a principal software engineer at Yahoo! working on vespa.ai, the big data serving engine. He believes in continuous improvement to build resilient systems and teams that excel. Working with team structure and processes and being on the front lines of system failure have given him insights into the dos and don'ts of operating software at scale.