It’s Time to Rethink Outage Reports

This article has been indexed from CircleID: Cyberattack

Much has been said about the small coterie of large-scale content distribution platforms and their critical role in today’s Internet. These days, when one of this small set of core content platforms experiences a service outage, it’s mainstream news, as we saw in June of this year with outages reported at both Fastly and Akamai. In the case of Akamai, the June outage impacted three of Australia’s largest banks, the national postal service, the country’s reserve bank, and one airline operator. Further afield, the outage impacted the Hong Kong Stock Exchange and some US airlines. The roll call of services impacted by this Akamai incident appeared to reach some 500. With Fastly’s outage earlier in the month, we saw a set of popular services disappear for an hour or more. The list of impacted services included Twitch, Pinterest, Reddit, Spotify, the New York Times, and the BBC, to name just a few. And now, at the end of July, Akamai has managed to do it again on a grander scale.

I’ve already talked about the increasing criticality of cloud and content service providers and the vulnerabilities that come with the high degree of provider aggregation in this space. With so many enterprises all over the Internet forced to choose among just a handful of viable content distribution platforms for their content and services, nobody should be surprised when a single platform’s outage has a massive service impact. But that’s not what prompted me to write this note.

Akamai’s report of the incident was unusual. I’ll reproduce it here in full:


[07:35 UTC on July 24, 2021] Update:

Root Cause:

This configuration directive was sent as part of preparation for independent load balancing control of a forthcoming product. Updates to the configuration directive for this load balancing component have routinely been made on approximately a weekly basis. (Further changes to this configuration channel have been blocked until additional safety measures have been implemented, as noted in Corrective and Preventive Actions.)

The load balancing configuration directive included a formatting error. As a safety measure, the load balancing component disregarded the improper configuration and fell back to a minimal configuration. In this minimal state, based on a VIP-only configuration, it did not

[…]