Akamai’s Commitment to Reliability

For more than 20 years, Akamai has worked hard to earn the trust and confidence of our customers and partners by developing services that can be relied upon to be available and secure at all times. We have scaled our business to handle many trillions of web requests quickly and securely each day by holding our services and operations to a very high standard of reliability. We know that errors can happen in any platform of this scale, which is why we have invested in many systems and processes to catch mistakes quickly and prevent them from causing widespread harm. Reliability lies at the core of our mission, our culture, and everything we aspire to do. We know that thousands of major enterprises and billions of people depend on us around the clock to enable their businesses and lives online. That is why it is especially painful when we experience a major service disruption.

Within the last two months, Akamai experienced serious disruptions in two of our services: one that directs users’ browsers to websites, and one that protects websites against distributed denial-of-service attacks. (Detailed explanations of what happened are posted here and here.) As a result, many of our customers experienced interruptions in service, and for that I sincerely apologize. Any downtime is unacceptable, and all of us at Akamai deeply regret the impact that these disruptions had on the people and organizations that depend on us.

As you would expect, we have conducted a thorough review and root cause analysis of both incidents. We have determined that although the direct causes of the incidents were different, our platform maintenance processes played a contributing role in both cases. In particular, the safety mechanisms we had put in place to prevent problems associated with updates to these services did not perform in the manner necessary to prevent a disruption.

As a result, we are now performing a full audit of all tools, systems, and processes associated with updates for all of our services. Until the audit is successfully completed for a given service, all updates to that service will receive additional manual supervision to help ensure that an update does not result in a disruption. We are also taking numerous other actions to improve reliability on both a near-term and long-term basis, including increased investment in our automation, production release tooling, and monitoring, as well as increased engagement among, and oversight of, everyone involved in making and signing off on changes to our platform.

We will continue to keep customers informed of our progress, through engagement with their account teams as well as through a series of blog updates from Akamai’s Chief Operating Officer, Adam Karon. Of course, we always welcome your input on how we can further improve to better serve your needs.

Our 8,000+ employees around the world have worked very hard over the last 23 years to earn your trust. We acknowledge the unacceptable situation these incidents have caused and our need to restore your confidence. We are now, and will continue to be, working even harder to support you in the manner that you have come to expect from us. 


Tom Leighton

Chief Executive Officer and Co-Founder

*** This is a Security Bloggers Network syndicated blog from The Akamai Blog authored by Tom Leighton. Read the original post at: