Note: This is the second in a series of posts, focused on the software architecture aspects of service mesh. To learn more about how we got here, check out The Path to Service Mesh.
If you are building your software and teams around microservices, you’re looking for ways to iterate faster and scale flexibly. A service mesh can help you do that while maintaining (or enhancing) visibility and control. In this post, I’ll talk about what’s actually in a service mesh and what to consider when choosing and deploying one.
So, what is a service mesh? How is it different from what’s already in your stack? A service mesh is a communication layer that rides on top of request/response, unlocking some patterns essential for healthy microservices. A few of my favorites:
- Zero-trust security that doesn’t assume a trusted perimeter
- Tracing that shows you how and why every microservice talked to another microservice
- Fault injection and tolerance that lets you experimentally verify the resilience of your application
- Advanced routing that lets you do things like A/B testing, rapid versioning and deployment, and request shadowing
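To make the routing item concrete, here is a minimal sketch of the kind of weighted split an A/B test relies on. It is not how any particular mesh implements routing; the function name `pick_version`, the request ID, and the version labels are all assumptions made up for illustration. The idea is that a stable hash of the request ID picks a bucket, so the same request always lands on the same version while traffic overall follows the weights.

```python
import hashlib


def pick_version(request_id, weights):
    """Deterministically map a request to a version, in proportion to weights.

    `weights` is a dict like {"v1": 90, "v2": 10} (hypothetical version names).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")

    # Hash the request ID into a stable bucket in the range [0, total).
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total

    # Walk the cumulative weights until the bucket falls inside one of them.
    cumulative = 0
    for version, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return version
    raise AssertionError("unreachable: bucket is always below the total weight")
```

Calling `pick_version("req-123", {"v1": 90, "v2": 10})` repeatedly returns the same version for that ID, which is what keeps an experiment consistent for a given request or user.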
Why a new term?
Looking at that list, you may think “I can do all of that without a Service Mesh”, and you’re correct. The same logic applies to sliding window protocols or request framing. But once there’s an emerging standard that does what you want, it’s more efficient to rely on that layer instead of implementing it yourself. Service Mesh is that emerging layer for microservices patterns.
Service mesh is still nascent enough that codified standards have yet to emerge, but there is enough experience that some best practices are becoming clear. As the bleeding-edge leaders develop their own approaches, it is often useful to compare notes and distill best practices. We’ve seen Kubernetes emerge as the standard way to run containers for production web applications. My favorite standards are emergent rather than forced: It’s definitely a fine art to be neither too early nor too late to agree on common APIs, protocols and concepts.
Think about the history of computer networking. After the innovation of best-effort packet-switched networks, we found out that many of us were creating virtual circuits over them - using handshaking, retransmission and internetworking to turn a pile of packets into an ordered stream of bytes. For the sake of interoperability and simplicity, a “best practice” stream-over-packets emerged: TCP (the Introduction of RFC 675 does a good job of explaining what it layers on top of). There are alternatives - I’ve used the Licklider Transmission Protocol in space networks where distributed congestion control is neither necessary nor efficient. Your browser might already be using QUIC. Standardizing on TCP, however, freed a generation of programmers from fiddling with implementations of sliding windows, retries, and congestion control (well, except for those packetheads that implemented it).
Next, we found a lot of request/response protocols running on top of TCP. Many of these eventually migrated to HTTP (or sequels like HTTP/2 or gRPC). If you can factor your communication into “method, metadata, body”, you should be looking at an HTTP-like protocol to manage framing, separate metadata from body, and address head-of-line blocking. This extends beyond just browser apps - databases like Mongo provide HTTP interfaces because the ubiquity of HTTP unlocks a huge amount of tooling and developer knowledge.
You can think about service mesh as being the lexicon, API and implementation around the next tier of communication patterns for microservices.
OK, so where does that layer live? You have a couple of choices:
- In a Library that your microservices applications import and use.
- In a Node Agent or daemon that services all of the containers on a particular node/machine.
- In a Sidecar container that runs alongside your application container.
The library approach is the original. It is simple and straightforward. In this case, each microservice application includes library code that implements service mesh features. Libraries like Hystrix and Ribbon would be examples of this approach.
This works well for apps that are exclusively written in one language by the teams that run them (so that it’s easy to insert the libraries). The library approach also doesn’t require much cooperation from the underlying infrastructure - the container runner (like Kubernetes) doesn’t need to be aware that you’re running a Hystrix-enhanced app.
There is some work on multilanguage libraries (reimplementations of the same concepts). The challenge here is the complexity and effort involved in replicating the same behavior over and over again.
We see very limited adoption of the library model in our user base. Most of our users are running applications written in many different languages (polyglot), and they are also running at least a few applications that they didn’t write, so injecting libraries isn’t feasible.
This model has an advantage in work accounting: the code performing work on behalf of the microservice is actually running in that microservice. The trust boundary is also small - you only have to trust calling a library in your own process, not necessarily a remote service somewhere out over the network. That code only has as many privileges as the one microservice it is performing work on behalf of. That work is also performed in the context of the microservice, so it’s easy to fairly allocate resources like CPU time or memory for that work - the OS probably does it for you.
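To make the library model concrete, here is a minimal sketch of a circuit breaker of the sort a resilience library provides. It is not Hystrix’s API; the class name, the thresholds and the injectable `clock` parameter are assumptions chosen for illustration. Note how the work accounting falls out naturally: the breaker’s state and CPU time live inside the calling microservice’s own process.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable so tests can fake time
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

An app would wrap each outbound call, for example `breaker.call(fetch_inventory, item_id)`, where `fetch_inventory` stands in for whatever client function the app already has. Every language in a polyglot shop needs its own copy of this logic, which is exactly the replication cost described above.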
The node agent model is the next alternative. In this architecture, there’s a separate agent (often a userspace process) running on every node, servicing a heterogeneous mix of workloads. For purposes of our comparison, it’s the opposite of the library model: it doesn’t care about the language of your application but it serves many different microservice tenants.
Linkerd’s recommended deployment in Kubernetes works like this, as do F5’s Application Service Proxy (ASP) and the Kubernetes default kube-proxy.
Since you need one node agent on every node, this deployment requires some cooperation from the infrastructure - this model doesn’t work without a bit of coordination. By analogy, most applications can’t just choose their own TCP stack, guess an ephemeral port number, and send or receive TCP packets directly - they delegate that to the infrastructure (operating system).
Instead of good work accounting, this model emphasizes resource sharing - if a node agent allocates some memory to buffer data for my microservice, it might turn around and use that buffer for data for your service in a few seconds. This can be very efficient, but there’s an avenue for abuse. If my microservice asks for all the buffer space, the node agent needs to make sure your microservice still gets a shot at buffer space. You need a bit more code to manage this for each shared resource.
Another resource that benefits from sharing is configuration information. It’s cheaper to distribute one copy of the configuration to each node than to distribute one copy of the configuration to each pod on each node.
A lot of the functionality that containerized microservices rely on is provided by a node agent or something topologically equivalent. Think about kubelet initializing your pod, your favorite CNI daemon like flanneld, or, stretching your brain a bit, even the operating system kernel itself as following this node agent model.
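Here is a minimal sketch of that “bit more code”: a shared buffer pool that caps how much any one tenant can hold, so a greedy microservice can’t starve its neighbors. The class name, the `max_share` parameter and the tenant names are assumptions for illustration; a real node agent would likely use a more nuanced policy than a fixed per-tenant cap.

```python
class SharedBufferPool:
    """Shared pool of buffer bytes with a per-tenant cap (illustrative policy only)."""

    def __init__(self, total_bytes, max_share=0.5):
        if total_bytes <= 0 or not 0 < max_share <= 1:
            raise ValueError("total_bytes must be positive and max_share in (0, 1]")
        self.total_bytes = total_bytes
        self.per_tenant_cap = int(total_bytes * max_share)
        self._used = {}  # tenant name -> bytes currently held

    @property
    def in_use(self):
        return sum(self._used.values())

    def acquire(self, tenant, nbytes):
        """Try to reserve nbytes for tenant; return True on success, False otherwise."""
        if nbytes <= 0:
            raise ValueError("nbytes must be positive")
        held = self._used.get(tenant, 0)
        if held + nbytes > self.per_tenant_cap:
            return False  # this tenant would exceed its fair share
        if self.in_use + nbytes > self.total_bytes:
            return False  # the pool itself is exhausted
        self._used[tenant] = held + nbytes
        return True

    def release(self, tenant, nbytes):
        """Return nbytes previously acquired by tenant to the pool."""
        held = self._used.get(tenant, 0)
        if nbytes <= 0 or nbytes > held:
            raise ValueError("cannot release more than the tenant holds")
        self._used[tenant] = held - nbytes
```

With `SharedBufferPool(100, max_share=0.5)`, my microservice can take at most 50 bytes no matter how early it asks, which leaves room for yours. Memory released by one tenant becomes immediately available to another, which is the efficiency the node agent model is after.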
Sidecar is the new kid on the block. This is the model used by Istio with Envoy. Conduit also uses a sidecar approach. In Sidecar deployments, you have one adjacent container deployed for every application container. For a service mesh, the sidecar handles all the network traffic in and out of the application container.
This approach is in between the library and node agent approaches for many of the tradeoffs I discussed so far. For instance, you can deploy a sidecar service mesh without having to run a new agent on every node (so you don’t need infrastructure-wide cooperation to deploy that shared agent), but you’ll be running multiple copies of an identical sidecar. Another take on this: I can install one service mesh for a group of microservices, and you could install a different one, and (with some implementation-specific caveats) we don’t have to coordinate. This is powerful in the early days of service mesh, where you and I might share the same Kubernetes cluster but have different goals, require different feature sets, or have different tolerances for bleeding-edge vs. tried-and-true.
Sidecar is advantageous for work accounting, especially in some security-related aspects. Here’s an example: suppose I’m using a service mesh to provide zero-trust style security. I want the service mesh to verify both ends (client and server) of a connection cryptographically. Let’s first consider using a node agent: When my pod wants to be the client of another server pod, the node agent is going to authenticate on behalf of my pod. The node agent is also serving other pods, so it must be careful that another pod cannot trick it into authenticating on my pod’s behalf. If we think about the sidecar case, my pod’s sidecar does not serve other pods. We can follow the principle of least privilege and give it the bare minimum it needs for the one pod it is serving in terms of authentication keys, memory and network capabilities.
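The difference can be sketched in a few lines. This is a toy model, not how Istio or any other mesh authenticates: HMAC with a shared secret stands in for real certificate-based authentication, and the class names and the `verified_caller` parameter (an identity the infrastructure has already established, as opposed to one the request merely claims) are assumptions for illustration.

```python
import hashlib
import hmac


class NodeAgentSigner:
    """Holds keys for every pod on the node, so it must check who is asking."""

    def __init__(self, keys_by_pod):
        self._keys = dict(keys_by_pod)

    def sign(self, verified_caller, claimed_pod, message):
        # Guard against a pod asking the agent to authenticate as a different pod.
        if verified_caller != claimed_pod:
            raise PermissionError("caller may not authenticate as another pod")
        if claimed_pod not in self._keys:
            raise PermissionError("no key provisioned for this pod")
        return hmac.new(self._keys[claimed_pod], message, hashlib.sha256).digest()


class SidecarSigner:
    """Holds exactly one key, for the one pod it serves."""

    def __init__(self, key):
        self._key = key

    def sign(self, message):
        return hmac.new(self._key, message, hashlib.sha256).digest()
```

The node agent’s correctness depends on that identity check being right every time, for every tenant. The sidecar has no such check to get wrong, because it was never given anyone else’s key.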
So, from the outside the sidecar has the same privileges as the app it is attached to. On the other hand, the sidecar needs to intervene between the app and the outside. This creates some security tension: you want the sidecar to have as little privilege as possible, but you need to give it enough privilege to control traffic to/from the app. For example, in Istio, the init container responsible for setting up the sidecar has the NET_ADMIN permission currently (to set up the iptables rules necessary). That initialization uses good security practices - it does the minimum amount necessary and then goes away, but everything with NET_ADMIN represents attack surface. (Good news - smart people are working on enhancing this further).
Once the sidecar is attached to the app, it’s very proximate from a security perspective. Not as close as a function call in your process (like library) but usually closer than calling out to a multi-tenant node agent. When using Istio in Kubernetes your app container talks to the sidecar over a loopback interface inside of the network namespace shared with your pod - so other pods and node agents generally can’t see that communication.
Most Kubernetes clusters have more than one pod per node (and therefore more than one sidecar per node). If each sidecar needs to know “the entire config” (whatever that means for your context), then you’ll need more bandwidth to distribute that config (and more memory to store copies of it). So it can be powerful to limit the scope of configuration that you have to give to each sidecar - but again there’s an opposing tension: something (in Istio’s case, Pilot) has to spend more effort computing that reduced configuration for each sidecar.
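As a rough sketch of that tradeoff, the snippet below computes a reduced configuration for each sidecar from a map of which services each one calls, then measures how many bytes would need to be shipped. This is not Pilot’s algorithm; the function names, the dependency map and the use of JSON as the wire format are assumptions for illustration.

```python
import json


def scoped_configs(full_config, dependencies):
    """Return, for each service, only the config entries for the services it calls.

    full_config:  dict of service name -> config for reaching that service
    dependencies: dict of service name -> list of services it calls
    """
    scoped = {}
    for service, deps in dependencies.items():
        scoped[service] = {dep: full_config[dep] for dep in deps if dep in full_config}
    return scoped


def total_bytes(configs):
    """Bytes needed to ship every config in the iterable, serialized as JSON."""
    return sum(len(json.dumps(c, sort_keys=True).encode("utf-8")) for c in configs)
```

Comparing `total_bytes(scoped_configs(full, deps).values())` against `total_bytes([full] * len(deps))` shows the bandwidth saved by scoping. The price is that the control plane has to run something like `scoped_configs` again whenever the configuration or the dependency map changes.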
Other things that happen to be replicated across sidecars accrue a similar bill. Good news - the container runtimes will reuse things like container disk images when they’re identical and you’re using the right drivers, so the disk penalty is not especially significant in many cases, and memory like code pages can also often be shared. But each sidecar’s process-specific memory will be unique to that sidecar so it’s important to keep this under control and avoid making your sidecar “heavy weight” by doing a bunch of replicated work in each sidecar.
Service meshes that rely on sidecars provide a good balance between a full set of features and a lightweight footprint.
Will the node agent or sidecar model prevail?
I think you’re likely to see some of both. Now seems like a perfect time for sidecar service mesh: nascent technology, fast iteration and gradual adoption. As service mesh matures and the rate-of-change decreases, we’ll see more applications of the node agent model.
Advantages of the node agent model are particularly important as service mesh implementations mature and clusters get big:
- Less overhead (especially memory) for things that could be shared across a node
- Easier to scale distribution of configuration information
- A well-built node agent can efficiently shift resources from serving one application to another
Sidecar is a novel way of providing services (like a high-level communication proxy a la Service Mesh) to applications. It is especially well-adapted for containers and Kubernetes. Some of its greatest advantages include:
- Can be gradually added to an existing cluster without central coordination
- Work performed for an app is accounted to that app
- App-to-sidecar communication is easier to secure than app-to-agent
As Shawn talked about in his post, we’ve been thinking about how microservices change the requirements from network infrastructure for a few years now. The swell of support and uptake for Istio demonstrated to us that there’s a community ready to develop and coalesce on policy specs, with a well-architected implementation to go along with it.
Istio is advancing state-of-the-art microservices communication, and we’re excited to help make that technology easy to operate, reliable, and well-suited for your team’s workflow in private cloud, public cloud or hybrid.