aalan

Bug in Google Cloud System That Screwed Spotify For 2.5 hours

This is how a bug in Google Cloud screwed Spotify for 2.5 hours.

Spotify has its services hosted on Google Cloud, and they use multiple service discovery technologies for microservices to talk to each other.

What is service discovery?

In a microservice architecture, when you have a number of microservices talking to each other, there must be a way for a microservice to find another. Different ways of accomplishing this communication between microservices are termed service discovery.

For example, if a delivery microservice talks to a payment microservice, the delivery microservice must be able to find the machine on which the payment service is running.

Spotify used simple yet powerful DNS-based service discovery for some microservices, and for others it used Google Traffic Director, a production-grade service mesh solution.

What is a DNS-based discovery system? For a given service, you look up the service's hostname in DNS, get back the IP of a machine running that service, and use that IP as the contact point for the service.
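To make the DNS lookup concrete, here is a minimal sketch in Python. The function name `resolve_service` and the hostname `payment.internal.example` are made up for illustration; the only real API used is the standard library's `socket.getaddrinfo`. The `resolver` parameter is also an assumption, added so the lookup can be swapped out in tests.

```python
import socket


def resolve_service(hostname, port, resolver=socket.getaddrinfo):
    """Return the unique IP addresses that DNS reports for a service hostname.

    `resolver` defaults to the standard library lookup; it is injectable
    only so the function can be exercised without a real DNS server.
    """
    infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # sockaddr is (ip, port, ...) for IPv4 and IPv6
        if ip not in addresses:
            addresses.append(ip)
    return addresses


# Hypothetical usage: the caller picks one of the returned IPs as the
# contact point for the payment service.
# ips = resolve_service("payment.internal.example", 8080)
```

The caller needs nothing but a hostname and a DNS server, which is what makes this approach simple.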

What is a service mesh system? Whenever an instance comes up, it registers itself with a service registry, and each microservice has a client that talks to this registry to find the other microservices it needs.
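The register-then-lookup flow described above can be sketched with a tiny in-memory registry. This is a toy model for illustration only, not how Traffic Director is implemented; the class `ServiceRegistry` and its method names are hypothetical.

```python
import random


class ServiceRegistry:
    """Toy in-memory registry: service name -> set of instance addresses."""

    def __init__(self):
        self._instances = {}

    def register(self, service, address):
        # Called by an instance when it comes up.
        self._instances.setdefault(service, set()).add(address)

    def deregister(self, service, address):
        # Called when an instance shuts down or is found unhealthy.
        self._instances.get(service, set()).discard(address)

    def lookup(self, service):
        # Called by a client that needs to reach `service`.
        candidates = self._instances.get(service)
        if not candidates:
            raise LookupError(f"no instances registered for {service!r}")
        return random.choice(sorted(candidates))


registry = ServiceRegistry()
registry.register("payment", "10.0.0.5:8080")
registry.register("payment", "10.0.0.6:8080")
address = registry.lookup("payment")  # one of the two registered addresses
```

Note that every lookup depends on the registry being reachable, so a registry outage leaves clients with no way to find each other.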

As part of a migration to a new architecture, Google updated the existing Traffic Director code.

Tests were failing both for the migration they were planning to push to prod and for some unreleased features.

They assumed all of the test failures belonged to the new unreleased features and concluded that the migration was successful.

In other words, the failures attributed to unreleased features masked the real failures caused by the ongoing migration.

Result?

On March 8, 2022, Traffic Director suffered an outage, and all the Spotify services that had Traffic Director configured as their discovery system stopped responding.

Since Spotify had both DNS and service mesh solutions in place, they were able to move the backends managed by the service mesh over to DNS-based service discovery.
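The general idea of keeping a second discovery path available can be sketched as follows. The post does not say how Spotify actually performed the switch, so this is only an illustration of the fallback pattern; `discover`, `primary`, and `fallback` are hypothetical names.

```python
def discover(service, primary, fallback):
    """Try the primary discovery mechanism, then fall back to the second one.

    `primary` and `fallback` are callables that take a service name and
    return an address, e.g. a mesh lookup and a DNS lookup.
    """
    try:
        return primary(service)
    except (LookupError, OSError):
        # OSError covers ConnectionError and TimeoutError, which is what a
        # client would typically see when the discovery backend is down.
        return fallback(service)
```

With a mesh lookup as `primary` and a DNS lookup as `fallback`, an outage in the mesh degrades to DNS-based discovery instead of a total failure.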

What are the lessons learned?

1. Expect the worst and be prepared for it. Always keep a backup plan.
2. Always remember that if any library or service you depend on has an issue, it is your issue too, because it can affect you.
3. Outages are bound to happen. What matters is how effectively you can mitigate them with minimal business impact.

Kudos to the Spotify team for building such a good architecture and sailing through this emergency.
