Azure Outage Post Mortem Report

It’s been a tough week for Microsoft: outage after outage has hit its cloud services, resulting in global disruption.

At the start of this week, a number of Microsoft customers worldwide were impacted by a cascading series of problems that left many unable to access their Microsoft apps and services. Microsoft released a note on this outage.

Customers reported that they couldn’t sign in to Microsoft and third-party applications that use Azure Active Directory (Azure AD) for authentication. Microsoft acknowledged that the issue was caused by a mishap in its SDP (Safe Deployment Program).

Azure AD is designed to be geo-distributed and deployed with multiple partitions across multiple data centers around the world, and is built with isolation boundaries. Microsoft normally applies changes first to a validation ring that doesn’t include customer data, followed by four additional rings over the course of several days before the changes are fully in production. But this week, due to a defect, the SDP didn’t correctly target the validation ring, and all rings were targeted concurrently, causing service availability to degrade.
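To make the ring idea concrete, here is a minimal sketch of a staged rollout that refuses to touch more than the next ring in sequence. It is an illustration only, not Microsoft's SDP implementation: the ring names, the `Rollout` class and the `healthy` callback are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical ring names: one validation ring plus four further rings.
RINGS = ["validation", "ring1", "ring2", "ring3", "ring4"]


class RolloutError(Exception):
    """Raised when a deployment request breaks the ring order."""


@dataclass
class Rollout:
    change_id: str
    completed: list = field(default_factory=list)

    def next_ring(self):
        # The only ring that may be targeted right now, or None when done.
        done = len(self.completed)
        return RINGS[done] if done < len(RINGS) else None

    def deploy(self, targets, healthy=lambda ring: True):
        expected = self.next_ring()
        if expected is None:
            raise RolloutError("rollout already complete")
        # Guard: reject any request that skips ahead or targets several
        # rings at once (the failure mode described above).
        if list(targets) != [expected]:
            raise RolloutError(
                f"expected only {expected!r}, got {list(targets)!r}"
            )
        # Only mark the ring complete if it looks healthy afterwards.
        if not healthy(expected):
            raise RolloutError(f"{expected!r} unhealthy, halting rollout")
        self.completed.append(expected)
        return expected
```

With this guard, a request such as `Rollout("change-1").deploy(RINGS)` raises `RolloutError` instead of reaching every ring at once.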

Microsoft engineering knew within five minutes that something was wrong. Over the next 30 minutes, Microsoft started taking steps to expedite mitigation: scaling out some Azure AD services to handle the load once a mitigation had been applied, and failing over certain workloads to a backup Azure AD authentication system. But the rollback failed because of corruption in the backup SDP metadata, so the configuration had to be fixed manually.
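One general way to notice corrupted rollback metadata before relying on it is to store a checksum alongside each snapshot and verify it at rollback time. The sketch below illustrates that idea under assumed names (`make_snapshot`, `rollback`, and the snapshot layout are hypothetical); it does not describe how Microsoft's rollback system works.

```python
import hashlib
import json


def checksum(config: dict) -> str:
    # Stable SHA-256 over a canonical JSON encoding of the configuration.
    encoded = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_snapshot(config: dict) -> dict:
    # Hypothetical snapshot layout: the last known-good config plus its hash.
    return {"config": dict(config), "sha256": checksum(config)}


def rollback(snapshot: dict):
    """Return (config, mode).

    mode is "automatic" when the snapshot verifies, and "manual" when the
    metadata is missing or corrupted and an operator has to step in.
    """
    config = snapshot.get("config")
    if isinstance(config, dict) and checksum(config) == snapshot.get("sha256"):
        return config, "automatic"
    return None, "manual"
```

Running the same verification during regular rollback drills would surface a corrupted snapshot before an incident rather than during one.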

Microsoft fixed the latent code defect in the Azure AD backend SDP system, fixed the existing rollback system, and expanded the scope and frequency of rollback operation drills. The team still needs to apply more protections to the Azure AD SDP system to prevent these kinds of issues. It also needs to expedite the rollout of the Azure AD backup authentication system to all key services, and to onboard Azure AD scenarios to the automated communications pipeline.

Microsoft’s report also doesn’t mention that, over the past couple of days, customers in various geographies have been reporting problems with Exchange Online and Outlook on their mobile devices. Microsoft attributed that problem to a situation involving Exchange ActiveSync, stating that “a recent configuration update to components that route user requests was the cause of impact.”

On 1st October, another outage of cloud services was noticed, this time for a shorter period.
