Akamai’s DNS Contribution to Internet Resiliency

Background

Akamai Technologies recently contributed its “Serve Stale” DNS algorithm to Version 9 of the Internet Systems Consortium’s (ISC) Berkeley Internet Name Domain (BIND) open source Domain Name System (DNS) project.

As the Internet’s most widely used DNS implementation, BIND runs throughout the Internet. The continued availability of answers from BIND servers is critical to keeping the Internet usable for many users.

Akamai operates the world’s largest and most widely used content delivery network (CDN), and is a leader in both the development and implementation of DNS technology. The Akamai Intelligent Platform leverages DNS protocols for content mapping, and DNS is a key underlying component in several of the services used by Akamai’s customers. As a contributor to DNS standards, and as an important citizen within the Internet community, Akamai believes that a fast, reliable and secure Internet is in everyone’s best interest, not only Akamai’s or its customers’.

Serve Stale is a DNS feature that Akamai implemented to keep its own services resilient, both during widespread name service failures and when a specific customer domain name becomes unresolvable.

In the wake of DDoS attacks that caused public DNS outages, which in turn led to the unavailability of many of the Internet’s most popular websites, Akamai decided to contribute its Serve Stale algorithm to the BIND project. This article provides a high-level overview of how DNS and Serve Stale work, and explains how Akamai’s contribution may help make the Internet a little more resilient.

How DNS Works

At a high level, the Internet’s DNS consists of two types of services: recursive and authoritative. Authoritative DNS (aDNS) servers store many types of data about domain names, including the network addresses of servers in the domain. When someone or something wants to visit a website such as www.akamai.com, a DNS query is sent to a recursive DNS (rDNS) server, known as the resolver, whose responsibility is to perform a lookup that ultimately answers the query, in this case with the IP address of the website. The rDNS server uses a multi-step process to find the aDNS servers it needs to query; once it has the answer, it returns it to the client that made the request. This is shown in simplified fashion by the diagram below.

[Diagram: Contribution to DNS Resiliency.png]
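
The multi-step lookup described above can be sketched in a few lines of Python. This is a toy model, not how BIND or any real resolver is written: the server names, the zone table, and the address (taken from the 192.0.2.0/24 documentation range) are all made up for illustration, and real resolution involves network queries, caching, and many record types.

```python
# Toy model of recursive resolution. All zone data below is invented
# for illustration; 192.0.2.0/24 is a documentation-only address range.
ROOT = "root"

# Each hypothetical authoritative server either refers the resolver to a
# more specific server ("referral") or gives the final answer ("answer").
AUTHORITATIVE = {
    "root": {"com": ("referral", "com-servers")},
    "com-servers": {"akamai.com": ("referral", "akamai-servers")},
    "akamai-servers": {"www.akamai.com": ("answer", "192.0.2.10")},
}


def query(server, name):
    """Ask one authoritative server about a name; return its best match."""
    zone = AUTHORITATIVE[server]
    labels = name.split(".")
    # Try the full name first, then progressively shorter suffixes.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in zone:
            return zone[suffix]
    raise LookupError(f"{server} has no data for {name}")


def resolve(name):
    """Follow referrals from the root until some server returns an answer."""
    server, path = ROOT, []
    while True:
        path.append(server)
        kind, value = query(server, name)
        if kind == "answer":
            return value, path
        server = value  # referral: continue with the more specific server


address, path = resolve("www.akamai.com")
print(address, "via", " -> ".join(path))
```

The resolver here does all the walking on the client's behalf, which is the sense in which it is "recursive": the client asks one question and gets one answer back.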

When the DNS experiences an outage, Internet users become unable to locate the websites or other named Internet resources they normally use, because resolvers can’t get answers from authoritative servers. This explains why DNS servers have become a favorite target for cyber attackers.

How Serve Stale Works

All rDNS resolvers cache the resource records (RRs) used for answers to improve DNS query response time and to reduce traffic between the resolver and the authoritative servers. The authoritative servers specify a time-to-live (TTL) value for each RR, which is the maximum amount of time the record can be used by a resolver before it should be refreshed from the authoritative servers. Normally, an RR is flushed when its TTL expires; with Serve Stale, the RR is instead kept in the cache and marked as stale. If a new lookup is unable to refresh the data because the authoritative servers cannot be reached, the previous answer can be provided to the client on the assumption that it is highly likely to still work. A user can then still connect to a website when the recursive DNS lookup would traditionally have failed.
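
As a rough illustration of that behavior, here is a minimal Python sketch of a cache that keeps expired records and falls back to them when a refresh fails. It is not Akamai's algorithm or BIND's code: the `StaleCache` class, the `upstream` callable, the use of `ConnectionError` to signal unreachable authoritative servers, and the example name and address are all assumptions made for the example.

```python
import time


class StaleCache:
    """Resolver-side cache that keeps expired records and treats them as stale."""

    def __init__(self, upstream, clock=time.monotonic):
        self.upstream = upstream  # hypothetical callable: name -> (address, ttl_seconds)
        self.clock = clock
        self.records = {}         # name -> (address, expiry_time)

    def lookup(self, name):
        """Return (address, status); status is 'fresh', 'refreshed' or 'stale'."""
        now = self.clock()
        cached = self.records.get(name)
        if cached and now < cached[1]:
            return cached[0], "fresh"        # TTL has not expired yet
        try:
            address, ttl = self.upstream(name)
        except ConnectionError:
            if cached:
                return cached[0], "stale"    # refresh failed: reuse the old answer
            raise                            # nothing cached: the lookup fails
        self.records[name] = (address, now + ttl)
        return address, "refreshed"


# Demonstration with a fake clock and an upstream that goes down.
now = [0.0]
upstream_up = [True]


def fake_upstream(name):
    if not upstream_up[0]:
        raise ConnectionError("authoritative servers unreachable")
    return "192.0.2.10", 300  # documentation-range address, 300-second TTL


cache = StaleCache(fake_upstream, clock=lambda: now[0])
first = cache.lookup("www.example.com")    # cache miss, fetched upstream
now[0] = 100
second = cache.lookup("www.example.com")   # still within the TTL
now[0] = 400
upstream_up[0] = False
third = cache.lookup("www.example.com")    # TTL expired and refresh fails
print(first, second, third)
```

Note that the stale record is only used after a refresh attempt has failed; while the authoritative servers are reachable, the cache behaves like an ordinary TTL-respecting cache.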

The implementation logs when stale answers are being returned by the rDNS server, so that DNS operators can track how often problems are occurring and for which domains. In its default configuration, stale answers can be used for several days to allow enough time for manual intervention. Eventually, the unrefreshed RRs are removed from the cache, which limits how long a broken condition can persist before hard errors are returned.
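
The bounded stale window and the per-domain logging can be sketched as follows. The three-day `MAX_STALE` value is an assumption chosen only to match "several days"; it is not a documented default, and the function names, logger name, and counter are hypothetical.

```python
import logging
from collections import Counter

log = logging.getLogger("serve_stale_sketch")
stale_hits = Counter()  # domain -> number of stale answers returned

MAX_STALE = 3 * 24 * 3600  # assumed 3-day window, not a documented default


def classify(age, ttl, max_stale=MAX_STALE):
    """Classify a cached record by its age in seconds since it was fetched."""
    if age < ttl:
        return "fresh"
    if age < ttl + max_stale:
        return "stale"
    return "expired"  # dropped from the cache; a failed lookup is a hard error


def answer_on_failure(domain, age, ttl):
    """Decide what to return when the authoritative servers cannot be reached."""
    state = classify(age, ttl)
    if state == "expired":
        return None
    if state == "stale":
        stale_hits[domain] += 1
        log.warning("serving stale answer for %s (age %ds, TTL %ds)",
                    domain, age, ttl)
    return state


r1 = answer_on_failure("example.com", 600, 300)              # just past the TTL
r2 = answer_on_failure("example.com", 86400, 300)            # one day old
r3 = answer_on_failure("example.com", 300 + MAX_STALE, 300)  # window exhausted
print(r1, r2, r3, dict(stale_hits))
```

Counting stale answers per domain gives an operator the two signals the text mentions: how often the fallback is being used, and which domains are affected.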

The algorithm is described in more detail in “Serving Stale Data to Improve DNS Resiliency”, an IETF Internet-Draft which is being pursued in the DNSOP Working Group to update the basic DNS standards to bless this use of data beyond TTL expiry.

Summary

With Serve Stale active, a functioning rDNS resolver can still deliver a valid response to a DNS query, even when the requested domain’s aDNS services become unavailable, which means that the website essentially remains open for business, education, entertainment, or whatever purpose it serves.

Because the Internet has become an essential part of so many people’s lives, and because the DNS is an essential part of the Internet, malicious actors will undoubtedly continue to attack the DNS, as well as other critical Internet resources, whether for financial, political, egotistical, or other reasons.

Because BIND is so widely used throughout the Internet, the implementation of Serve Stale within BIND resolvers could play an important role in keeping the Internet up and running for millions of users, even when its infrastructure is subject to attack. And that benefits us all.

This is a Security Bloggers Network syndicated blog post authored by Tale Lawrence. Read the original post at: The Akamai Blog