Zero-Trust, the Service Mesh and Linkerd
By now, everyone working in the cloud-native world has probably heard something about “zero-trust”. It’s a ubiquitous buzzword: Even the White House is getting in on the action, and the buzz has resulted in a ton of marketing hype and vendor noise. But don’t write it off because of that hype: Some very important realities lie behind the buzzword, and the concept really does matter for network security.
While a zero-trust approach can be difficult to adopt, the good news for Kubernetes users, at least, is that much of zero-trust can be accomplished quickly by using the open source CNCF-graduated Linkerd service mesh. The past several releases of Linkerd have focused on adding powerful zero-trust primitives and on providing them to users in a way that fits into existing Kubernetes concepts and patterns.
So let’s take a look at what zero-trust is, how Linkerd can help and what this all means for Kubernetes users.
What is Zero-Trust?
Zero-trust is a model for reasoning about one of the most important questions in security: Do we trust a given entity to interact with some other entity?
This is not exactly a new question. It’s been central to security for a very long time, and there are a great many models dealing with it. The critical idea that sets zero-trust apart from the others is that trust must be explicit: The “zero” isn’t referring to having no trust, but to assuming no trust. Trust always starts from zero, and a given entity should be given the least amount of trust that lets it do its job.
This means that zero-trust rejects the entire concept of perimeter security and replaces it with the idea that, instead, you check every access, every time. As the DoD’s Zero Trust Reference Architecture summarizes:
“[N]o actor, system, network, or service operating outside or within the security perimeter is trusted. Instead, we must verify anything and everything attempting to establish access. It is a dramatic paradigm shift in philosophy of how we secure our infrastructure, networks, and data, from verify once at the perimeter to continual verification of each user, device, application, and transaction.”
This is a profound departure from previous practice; it’s worth a closer look.
What’s Wrong With Perimeter Security?
Perimeter security is a very, very old idea: If you want to protect things, put a wall around them! What’s inside the wall is assumed to be trustworthy; what’s outside the wall is assumed to be dangerous and scary. So you can let entities inside freely interact with each other, and you can freely let them send things from inside to outside. However, anything trying to come in from the dangerous, scary world outside needs careful vetting before being allowed in.
This worked beautifully back in the pre-cloud days, when servers were real, physical servers, networks were real networks and applications were tied to real hardware. Applications tended to be mostly-monolithic affairs, with long release cycles, running on physical hardware that we owned, located in buildings over which we had complete control. It was very easy to point to a certain set of hardware and say authoritatively that anything running on these boxes was inside the perimeter and anything else was outside. Then we could configure firewalls to enforce that.
The cloud isn’t like that.
In the cloud-native world, we rent rather than own, and we don’t have anything like the kind of control we used to. Instead of buying servers, we pay for a certain number of CPU cycles and a certain amount of RAM. Instead of buying networking equipment, we pay for a certain amount of I/O and for load balancers. Not only do we not have any physical control over the hardware, we usually don’t even know where in the physical world it is.
Additionally, chances are very high that whatever hardware our application is running on, our competitors have code running on the same hardware. We can (mostly) trust containers and virtualization to isolate our workloads from those of other tenants, and we can (mostly) trust our cloud provider to deliver the resources we’ve paid for, but the very idea of a physical perimeter goes up in smoke. This doesn’t mean we should throw away firewalls entirely—defense-in-depth is still very much a thing—but they are no longer sufficient.
The only area where we still retain control is the software we’ve written and are running ourselves. This is the world in which the zero-trust model emerged.
Trust Without the Perimeter
To make trust work without this concept of a defensive perimeter, let’s start by thinking about the essentials we really must have:
- Identity. We need to know who, exactly, is trying to do something.
- Policy. We need to know what we think is OK for them to do.
- Enforcement. We need a way to stop things from happening if we don’t think they’re OK.
Service meshes are actually ideally positioned to tackle these elements in the cloud-native world.
Identity
In a very real sense, identity—knowing who, exactly, is making a request—is the foundation of trust. Every trust model has this concept, even though not every trust model gives it this name.
The perimeter security model divides the world according to whether you’re inside the perimeter or outside; effectively, this is using location in the network as a proxy for identity, with “inside” and “outside” being the only two identities in the system. This isn’t ideal, but it does function as long as you have control over the network. The same constraint applies to other models that use network addresses as proxies for identity (e.g., IPsec or WireGuard): If you don’t control the network, you can’t control whether these proxies make sense.
The challenge in the cloud-native world, of course, is that we don’t control the network. The orchestration layer can (and does) swap addresses out from under us whenever it decides it needs to. There’s no way to be certain which workload sent or received a given packet, and no way to verify that the workload at a given IP address hasn’t changed (never mind ways of spoofing addresses). Network-based identity simply doesn’t work in the cloud.
Instead, we need a form of identity that’s more intrinsic to a workload, plus a way to verify it, so that we can check identity for every request. It’s often also important to distinguish between a workload making a request and an end user on whose behalf the request is being made. For example, an end user’s web browser might talk to an ingress controller, which then forwards the request to the workload that can provide the page requested. Knowing that the ingress controller is the workload making the request can be as important as knowing that the end user has successfully logged in.
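For Kubernetes users, the most natural building block for this kind of intrinsic identity is the ServiceAccount, which we’ll come back to below. As a minimal sketch (the `widgets` namespace and `widget-service` names are purely illustrative), giving a workload its own ServiceAccount means its identity travels with it rather than with whatever address the network happens to assign:

```yaml
# Purely illustrative names. A dedicated ServiceAccount gives the workload an
# identity that is intrinsic to it, not to whatever IP address it currently has.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: widget-service
  namespace: widgets
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: widget-service
  namespace: widgets
spec:
  replicas: 2
  selector:
    matchLabels:
      app: widget-service
  template:
    metadata:
      labels:
        app: widget-service
    spec:
      # The identity travels with the workload, wherever it gets scheduled.
      serviceAccountName: widget-service
      containers:
        - name: widget-service
          image: example.com/widget-service:1.2.3  # hypothetical image
          ports:
            - containerPort: 8080
```

On its own, of course, this is just a name; the interesting part, covered below, is turning that name into something that can be cryptographically verified on every request.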
Policy
With an identity mechanism in place, we next need policy: what’s OK for a given identity to do, and what isn’t. Again, in the perimeter security model, this was usually handled in a fairly straightforward way: “Inside” had access to everything; “outside” had access to everything except the services considered most sensitive.
This functions, mostly, but it tends to grant most actors much more access than they need. For example, a given “inside” actor rarely needs unlimited access: All it really needs is access to the other services it actually uses. And “outside” actors should really only have access to what we’d now call the ingress controller; anything else is unnecessary. This is the principle of least privilege, long a foundational concept in security.
Doing zero-trust properly, in fact, requires adhering to least privilege; and, perhaps counterintuitively, least privilege can require complex policy descriptions. It’s not enough to say “any actor inside the perimeter can access the webserver”; instead, you need things like “the ingress controller is permitted to request widget lists from the widget service, but only if the ingress controller can also present credentials from a logged-in user”.
It would be possible to add code to every workload to perform these checks, but it would be expensive, fragile and probably fraught with errors. It would also be insane to express all this policy in application code, since changing a policy would then mean updating the workloads themselves! A much better plan is to describe policies independently of the application itself and have mechanisms to enforce them.
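To make this concrete, here’s a rough sketch of what the workload-identity half of that rule can look like when expressed with the policy resources recent Linkerd releases (2.12 and later) provide, rather than in application code. The namespace, port and ServiceAccount names are illustrative, and the logged-in-user check would still be layered on separately (at the ingress, for example):

```yaml
# Describe the widget service's inbound port to the mesh.
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: widget-api
  namespace: widgets
spec:
  podSelector:
    matchLabels:
      app: widget-service
  port: 8080
  proxyProtocol: HTTP/1
---
# Describe an acceptable client identity: mesh (mTLS) traffic coming from the
# ingress controller's ServiceAccount. Names here are illustrative.
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: ingress-only
  namespace: widgets
spec:
  identityRefs:
    - kind: ServiceAccount
      name: ingress-nginx
      namespace: ingress
---
# Tie the two together: requests to the widget-api Server are authorized only
# when they carry the ingress controller's verified identity.
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: widget-api-from-ingress
  namespace: widgets
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: widget-api
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: MeshTLSAuthentication
      name: ingress-only
```

Because the rule lives in its own resources rather than in the widget service itself, it can be reviewed, audited and changed without touching or redeploying the application.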
Enforcement
Given mechanisms for identity and policy, we still have to enforce them. Recalling that zero-trust means checking every access, every time, “enforcement” here really means both authentication and authorization of every request.
Authentication is verifying identity; authorization is validating that policy allows the requested action. Once again, it’s possible to do these things by adding code to the application; in some cases, in fact, that’s the only way to do it. However, it’s far better to push the enforcement checks down into the infrastructure so that they happen uniformly across the entire application. The checks are more likely to be done correctly, application developers are free to focus on the application itself, and auditing everything becomes much easier.
Zero-Trust, Kubernetes and Linkerd
Faced with the need to completely rethink security models for the cloud-native world, building a new identity mechanism, capturing policy in an application-independent way and adding enforcement mechanisms up and down the protocol stack, one could be forgiven a certain amount of anxiety—especially given that, for example, U.S. federal agencies are required to implement zero-trust by 2024!
Fortunately for Kubernetes users, the combination of Kubernetes and Linkerd can make adopting zero-trust considerably easier than it might seem at first glance. Kubernetes provides well-defined primitives and extension mechanisms that make it possible for Linkerd to build out workload identity, policy and enforcement at very low cost, without much configuration:
- Kubernetes already associates every workload with a ServiceAccount, which Linkerd uses to bootstrap cryptographic identity for every workload.
- Traffic between workloads can then use industry-standard mTLS, providing a simple mechanism to encrypt data in transit, protect it from tampering, and verify identity on both ends of the connection.
- Authorization policy is defined using Kubernetes Custom Resource Definitions (CRDs), providing an application-independent mechanism for defining and auditing policy.
- Finally, Linkerd performs enforcement for individual pods uniformly throughout the entire application. Each pod has a proxy mediating network traffic and performing authentication and authorization for every request, every time, without ever trusting anything about the network and without even requiring workloads to be in the same cluster. (A sketch of a deny-by-default setup follows this list.)
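As a hedged example of what “every request, every time” can look like in practice, recent Linkerd releases also let you make denial the default for a namespace via the config.linkerd.io/default-inbound-policy annotation, so that traffic flows only where a policy resource (like the ones sketched earlier) explicitly authorizes it:

```yaml
# Illustrative namespace. With a default inbound policy of "deny", the proxies
# of meshed pods in this namespace refuse any inbound connection that no
# authorization policy explicitly allows.
apiVersion: v1
kind: Namespace
metadata:
  name: widgets
  annotations:
    config.linkerd.io/default-inbound-policy: deny
```

Depending on the Linkerd version, a similar default can also be set mesh-wide at install time.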
Linkerd can do this, in part, because Kubernetes provides a mechanism for injecting functionality into a running application in the form of sidecars: Additional containers that run inside the application’s pods, alongside the application’s own containers. In Linkerd’s case, the sidecars are lightweight, high-performance Rust micro-proxies that mediate network traffic for the application, making it possible for them to provide much of what zero-trust requires entirely independently of application code.
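As a minimal sketch of how that injection is typically requested (reusing the hypothetical workload from earlier), annotating a workload’s pod template with Linkerd’s standard linkerd.io/inject annotation asks the mesh to add its proxy to every pod, with no application code changes:

```yaml
# Annotating the pod template asks Linkerd's proxy injector to add its sidecar
# to every pod of this workload. Names and image are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: widget-service
  namespace: widgets
spec:
  selector:
    matchLabels:
      app: widget-service
  template:
    metadata:
      labels:
        app: widget-service
      annotations:
        linkerd.io/inject: enabled
    spec:
      serviceAccountName: widget-service
      containers:
        - name: widget-service
          image: example.com/widget-service:1.2.3  # hypothetical image
```

The same annotation can also be applied to a namespace so that every workload in it is meshed by default.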
The sidecar model provides a clear security boundary and unparalleled operational simplicity. Linkerd’s unique micro-proxy approach allows it to take advantage of this operational simplicity to provide security, reliability and observability for the application, without the resource cost of bulky, memory-hungry proxies like Envoy, or the complexity of “ambient” or eBPF-based approaches that muddle the security story and make operations more unpredictable.
This is not to say that Linkerd—or any other service mesh!—is a silver bullet that can solve all your security issues without any thought, of course. Linkerd and Kubernetes itself are great tools, but you’ll still need to understand your threat model and what the tools can and cannot do to use them effectively.
Conclusion
Security has always been a difficult and subtle problem; that was true long before computer security, and it’ll be true for a long time to come. The zero-trust model isn’t a magic wand that changes this, but if you set aside the marketing hype, you’ll see that it is an elegant, powerful and useful response to the realities of the cloud-native world.
Zero-trust does require some profound changes to the way we approach security: In particular, rethinking core ideas like identity takes some real learning and effort. The good news is that users of Kubernetes and of Linkerd should find it much easier to thrive in this world where there are no perimeters, and where we instead verify every access, every time.