Azure Outage Post Mortem Report

It has been a tough week for Microsoft: one outage after another hit its cloud services, with global impact.

At the start of this week, a number of Microsoft customers worldwide were impacted by a cascading series of problems that left many unable to access their Microsoft apps and services. Microsoft has released a note on this outage.

Customers reported that they could not sign in to Microsoft and third-party applications that use Azure Active Directory (Azure AD) for authentication. Microsoft acknowledged that the issue stemmed from a failure in its SDP (Safe Deployment Process) system.

Azure AD is designed to be geo-distributed, deployed with multiple partitions across multiple data centers around the world, and built with isolation boundaries. Microsoft normally applies changes first to a validation ring that doesn’t include customer data, followed by four additional rings over the course of several days before they hit production. This week, however, a defect meant the SDP didn’t correctly target the validation ring, and all rings were targeted concurrently, causing service availability to degrade.
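To make the ring idea concrete, here is a minimal sketch of a staged rollout with a guard against the failure mode described above. It is not Microsoft's SDP implementation: the ring names, the `resolve_targets` callback (standing in for whatever interprets deployment metadata), and the `deploy` and `is_healthy` callbacks are all hypothetical, chosen only for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical ring names: one validation ring plus four additional rings.
RINGS: List[str] = ["validation", "ring-1", "ring-2", "ring-3", "ring-4"]


class RolloutHalted(Exception):
    """Raised when a rollout must stop before reaching further rings."""


def staged_rollout(
    change: str,
    resolve_targets: Callable[[int], Sequence[str]],
    deploy: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
    rings: Sequence[str] = RINGS,
) -> List[str]:
    """Deploy `change` one ring at a time and return the rings completed."""
    completed: List[str] = []
    for step, expected in enumerate(rings):
        targets = list(resolve_targets(step))
        # Guard: each step may touch exactly one ring, the next one in order.
        # A step that resolves to several rings is rejected before deploying.
        if targets != [expected]:
            raise RolloutHalted(
                f"step {step} resolved to {targets}, expected only {expected}"
            )
        deploy(change, expected)
        # Gate: do not move on until the current ring reports healthy.
        if not is_healthy(expected):
            raise RolloutHalted(f"{change} is unhealthy in {expected}; stopping")
        completed.append(expected)
    return completed


if __name__ == "__main__":
    touched: List[str] = []
    done = staged_rollout(
        "config-v2",
        resolve_targets=lambda step: [RINGS[step]],
        deploy=lambda change, ring: touched.append(ring),
        is_healthy=lambda ring: True,
    )
    print(done)
```

The point of the guard is that the target list is checked independently of the code that produced it, so a defect in target resolution stops the rollout instead of reaching every ring at once.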

Microsoft engineering knew within five minutes that something was wrong. During the next 30 minutes, Microsoft began taking steps to expedite mitigation: scaling out some Azure AD services to handle the load once a mitigation had been applied, and failing over certain workloads to a backup Azure AD authentication system. However, the rollback failed because of corruption in the SDP metadata, so the configuration had to be rolled back manually.

Microsoft fixed the latent code defect in the Azure AD backend SDP system, fixed the existing rollback system, and expanded the scope and frequency of rollback operation drills. The team still needs to apply more protections to the Azure AD SDP system to prevent these kinds of issues. It also needs to expedite the rollout of the Azure AD backup authentication system to all key services, and to onboard Azure AD scenarios to the automated communications pipeline.
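A rollback drill is only useful if it proves the rollback data is still usable. The sketch below shows one simple way to do that: record a SHA-256 digest when the last known good configuration is saved, then verify the digest and parse the payload during each drill. The `RollbackSnapshot` type and both functions are hypothetical and are not taken from Microsoft's tooling.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackSnapshot:
    payload: bytes  # serialized last known good configuration
    sha256: str     # digest recorded when the snapshot was taken


def take_snapshot(config: dict) -> RollbackSnapshot:
    """Serialize the configuration and record its digest."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return RollbackSnapshot(payload, hashlib.sha256(payload).hexdigest())


def rollback_drill(snapshot: RollbackSnapshot) -> dict:
    """Return the restorable configuration, or raise ValueError if unusable."""
    # Detect silent corruption of the stored rollback data.
    if hashlib.sha256(snapshot.payload).hexdigest() != snapshot.sha256:
        raise ValueError("digest mismatch: rollback snapshot is corrupted")
    try:
        return json.loads(snapshot.payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("rollback snapshot is not valid JSON") from exc


if __name__ == "__main__":
    snap = take_snapshot({"auth_endpoint": "primary", "version": 41})
    print(rollback_drill(snap))
```

Running a check like this on a schedule means corruption is found during a drill, not in the middle of an incident when the automated rollback is needed.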

Microsoft’s report also doesn’t mention that, over the past couple of days, customers in various geographies have been reporting problems with Exchange Online and Outlook on their mobile devices. Microsoft attributed that problem to a situation involving Exchange ActiveSync, stating that “a recent configuration update to components that route user requests was the cause of impact.”

On 1 October, another outage of cloud services was observed, this time for a shorter period.

Active Directory: The Heart of the Business Needs a Proper DR Plan

As the name suggests, if the business needs to stay active, Active Directory must be actively protected with proper care.

Business vitality depends on AD. Every detail, from login information to email information, relies heavily on AD. It is therefore vital to maintain proper hygiene to secure it from external attacks, given the long history of intruders contaminating, encrypting, and erasing data.

As the gatekeeper to critical applications and data in 90% of organizations worldwide, AD has become a prime target for widespread cyberattacks that have crippled businesses and wreaked havoc on governments and non-profit organizations.

If a disaster does happen, there should be an escape route to restore AD. The key considerations are elaborated below:

  • Minimize Active Directory’s attack surface: Lock down administrative access to the Active Directory service by implementing administrative tiering and secure administrative workstations, apply recommended policies and settings, and scan regularly for misconfigurations – accidental or malicious – that potentially expose your forest to abuse or attack.
  • Monitor Active Directory for signs of compromise and roll back unauthorized changes: Enable both basic and advanced auditing and periodically review key events via a centralized console. Monitor object and attribute changes at the directory level and changes shared across domain controllers.
  • Implement a scorched-earth recovery strategy in the event of a large-scale compromise: Widespread encryption of your network, including Active Directory, requires a solid, highly automated recovery strategy that includes offline backups for all your infrastructure components as well as the ability to restore from backups without reintroducing any malware that might be on them.
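The monitoring point above, tracking object and attribute changes, can be illustrated with a small sketch that compares two snapshots of directory data and reports what changed. It assumes the snapshots have already been exported into plain dictionaries keyed by distinguished name. The export step, the `diff_snapshots` function, and the example names are hypothetical, and no real Active Directory API is used here.

```python
from typing import Any, Dict, List, Tuple

Snapshot = Dict[str, Dict[str, Any]]  # distinguished name -> attributes
Change = Tuple[str, str, Any, Any]    # (dn, attribute, old value, new value)


def diff_snapshots(before: Snapshot, after: Snapshot) -> List[Change]:
    """List every attribute whose value differs between two snapshots.

    Attributes of added objects show an old value of None; attributes of
    removed objects show a new value of None. An object with no attributes
    produces no entries.
    """
    changes: List[Change] = []
    for dn in sorted(set(before) | set(after)):
        old_attrs = before.get(dn, {})
        new_attrs = after.get(dn, {})
        for attr in sorted(set(old_attrs) | set(new_attrs)):
            old, new = old_attrs.get(attr), new_attrs.get(attr)
            if old != new:
                changes.append((dn, attr, old, new))
    return changes


if __name__ == "__main__":
    admins = "CN=Domain Admins,CN=Users,DC=example,DC=com"
    before = {admins: {"member": ["CN=alice"]}}
    after = {admins: {"member": ["CN=alice", "CN=mallory"]}}
    for change in diff_snapshots(before, after):
        print(change)
```

A report like this is the raw material for both alerting (an unexpected new member of a privileged group) and rollback (the old value is captured alongside the new one).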