Lessons Learnt from Microsoft’s Azure Outage

In recent years, cloud computing has revolutionised the way businesses operate, offering access to resources and services anywhere, at any time. Businesses have become increasingly dependent on online platforms for remote work and e-commerce. However, the recent Microsoft Azure outage highlights the dangers of over-reliance on cloud services. With an uptime guarantee of 99.9% that seems high on paper, few could have anticipated that the service would fail not once, but twice within a short span of time.
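To put the 99.9% figure in perspective, the short calculation below converts an uptime percentage into the downtime it still permits. It is an illustrative sketch only: the helper name `allowed_downtime_hours` is hypothetical, a 365-day year is assumed, and real service-level agreements vary in how they measure and credit downtime.

```python
# Illustrative only: converts an uptime percentage into permitted downtime.
# The function name and the 365-day year are assumptions for this sketch.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted within the given period."""
    return period_hours * (1 - uptime_percent / 100)


if __name__ == "__main__":
    yearly = allowed_downtime_hours(99.9)
    monthly = allowed_downtime_hours(99.9, period_hours=30 * 24)
    print(f"Per year:  about {yearly:.2f} hours")
    print(f"Per month: about {monthly * 60:.0f} minutes")
```

On these assumptions, a 99.9% guarantee still leaves room for roughly 8.76 hours of downtime a year, so an outage lasting several hours is not necessarily outside what such a guarantee allows.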

On February 8, 2023, Microsoft Azure, one of the world’s leading cloud platforms, experienced a massive global outage, causing widespread disruption that lasted for several hours. Investigations traced the root cause to a utility power surge in the Southeast Asia region that tripped a subset of cooling units in a data centre. As a result, many businesses that rely heavily on Microsoft Azure for critical operations were cut off. In Singapore alone, the websites of the Central Provident Fund (CPF) Board, EZ-Link, the Esplanade, and Nanyang Technological University (NTU) all experienced service interruptions.

Just two weeks earlier, on January 25, 2023, Microsoft had suffered another major cloud outage when an untested Wide-Area Network (WAN) routing change led to a global disruption for Microsoft 365 users. Numerous Microsoft cloud services became inaccessible, including Outlook, Microsoft Teams, SharePoint Online, and OneDrive for Business. While Microsoft did not disclose the number of users affected, data from the outage tracking website Downdetector showed thousands of incident reports across continents.

In today’s interconnected business world, breakdowns in cloud infrastructure and service disruptions are all the more damaging. Yet outages among Big Tech platforms are not uncommon: Google, Meta and Amazon have all seen service disruptions recently. While these cloud providers invest heavily in resilient operations, outages can still occur from time to time, arising from configuration changes made by the providers themselves, unforeseen natural events, and, in some instances, long-standing issues such as power cuts.

These incidents serve as a powerful reminder that even large multinational corporations (MNCs) can face technical glitches, so it is important for businesses to take proactive steps to minimise the damage at their own end. A key takeaway is the importance of having a robust incident management and business continuity plan in place. On business continuity, the Monetary Authority of Singapore (MAS) has released a set of guidelines on business continuity management (BCM) directed towards Financial Institutions (FIs). Although the guidelines are aimed at FIs, businesses in any industry can apply their best practices to minimise the impact of possible downtime.

While cloud services are designed to be highly available and reliable, disruptions may still occur. It is critical for businesses to have a ‘circuit breaker’ plan in place to ensure network resiliency and the ability to recover from a problem as quickly as possible.
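In software terms, one common reading of a ‘circuit breaker’ is a small guard around calls to a cloud dependency: after repeated failures it stops calling the failing service for a while and serves a fallback instead. The sketch below is a minimal illustration of that pattern, not a prescribed implementation; the class name, thresholds, and the `primary` and `fallback` callables are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal circuit breaker sketch (all names and defaults are assumptions)."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at: Optional[float] = None     # None means closed

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open and within the timeout, skip the primary service entirely.
        if (self._opened_at is not None
                and self._clock() - self._opened_at < self.reset_timeout):
            return fallback()
        # Closed, or timeout elapsed: try the primary (a single trial if open).
        try:
            result = primary()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            return fallback()
        # Success: close the breaker and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result


if __name__ == "__main__":
    def fetch_from_cloud() -> str:  # hypothetical cloud call that is down
        raise ConnectionError("cloud region unavailable")

    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)
    for _ in range(3):
        print(breaker.call(fetch_from_cloud, lambda: "cached copy"))
```

A guard like this only helps if the fallback (a cached copy, a secondary region, a manual process) has been planned in advance, which is where the business continuity plan comes in.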

