#3 May 15, 2018

gVisor, with Nicolas Lacasse and Yoshi Tamura

Hosts: Craig Box, Adam Glick

On this week's Kubernetes Podcast, Adam and Craig talk to Nicolas Lacasse and Yoshi Tamura from Google Cloud about gVisor, a user-space kernel, written in Go, that implements a substantial portion of the Linux system surface. It provides an isolation boundary between the application and the host kernel and integrates with Docker and Kubernetes, making it simple to run sandboxed containers.

Do you have something cool to share? Some questions? Let us know: find us on Twitter @kubernetespod, or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: Hi, and welcome to the Kubernetes Podcast from Google. I'm Craig Box.

ADAM GLICK: And I'm Adam Glick.

[MUSIC PLAYING]

Hey, Craig. How are you doing? Recovering from KubeCon last week?

CRAIG BOX: Yeah. It's taken a few days to settle back in and unload everything that needs to be unloaded. But not a week after arriving, I'll be packing my suitcase again and heading off for a conference in Melbourne and some customer meetings back home in New Zealand. So it will be a busy few weeks, and I'll be excited to continue the conversation from the road. How about yourself?

ADAM GLICK: Oh, yes, I enjoyed KubeCon. It was so incredible to see all that happened there and meet with a lot of people who are really excited about what's going on in the community. I had a great time.

I've been doing I/O this week, which is a whole different set of folks. And I'm impressed at the number of people who are asking about Kubernetes and want to know more about it. I actually made the mistake, I forgot to bring stickers. Someone stopped by like, do you have a Kubernetes sticker? And I was like, oh, man. So I had to go find Paris Pittman, who, of course, is always well-prepared with stickers. So if you're ever looking for a Kubernetes sticker.

CRAIG BOX: They are very popular.

ADAM GLICK: Yep. Go find Paris. She always has a sticker.

CRAIG BOX: And if you're upset that the sticker doesn't tessellate nicely with all the rest of the stickers on the back of your laptop, have a chat with Tim Hockin who designed something with seven sides just to mess with people and the back of their laptops.

ADAM GLICK: Did you see that at KubeCon? There were some folks giving out fidget spinners that were the wheel, the captain's wheel for Kubernetes. And it had eight spokes on it. I said, eight spokes? And they were like, yeah. Someone pointed that out to us, too.

CRAIG BOX: Everyone can get an eight-spoke fidget spinner. I know the [INAUDIBLE] have had to go pretty much all the way to China and say, no, we want seven spokes. Can we get a custom designed one? And thank you for their attention to detail.

ADAM GLICK: True enough. That's all good stuff. Why don't we get into the news of the week?

CRAIG BOX: Last week saw Microsoft's annual Build Conference. As part of the conference, they made several Kubernetes-related announcements.

First up, their Azure Container Service, which they abbreviated AKS, has now been properly renamed to the Azure Kubernetes Service. Microsoft stated they expect the service to reach general availability in, quote, "the next few weeks."

They announced four new features for AKS. It's now possible to deploy your nodes into custom VNets, Azure's version of VPC, using their CNI plugin. AKS now supports DNS endpoints for Kubernetes ingresses, promising to automatically configure DNS records and name servers for services that use the new Azure ingress.

Azure Monitor now has support for AKS, which shows control plane telemetry, log aggregation, and container health monitoring information. No word yet on Prometheus metrics.

And finally, Windows Containers on top of AKS are now available in private preview.

ADAM GLICK: Microsoft and Red Hat announced the upcoming managed OpenShift on Azure. The new offering will provide a managed deployment of OpenShift on Azure with joint support from Microsoft and Red Hat. No public availability date was provided, but a link to an online sign-up form was posted for people interested in receiving an email when the service is available as a public beta.

CRAIG BOX: Meanwhile, at the Red Hat Summit, Red Hat announced their plan for integrating CoreOS into Red Hat's products. First of all, the Tectonic self-hosting upgrade machinery is going to be brought to the OpenShift Container Platform. With the acquisition, they now have two container operating systems, the Fedora-based Atomic Host and the Gentoo-based CoreOS Container Linux.

Container Linux will continue to be supported, but its eventual replacement will be something they're calling Red Hat CoreOS, a new iteration of Container Linux built from the Fedora parts instead of the Gentoo parts, which will also eventually succeed Atomic Host.

Clayton Coleman from Red Hat says that he expects it to be built from Ignition, OSTree, and the Omaha Update Server from Chrome OS.

ADAM GLICK: Mirantis has announced Virtlet, which enables customers to run VMs as pods in a Kubernetes cluster. Virtlet enables you to run VMs on Kubernetes clusters as if they were plain pods, so you can use standard kubectl commands to manage them, bring them onto the cluster network as first-class citizens, and make it possible to build higher-level Kubernetes objects, such as Deployments, StatefulSets, or DaemonSets, composed of them. Virtlet achieves this by implementing the Container Runtime Interface, CRI.

Mirantis calls out that this can be useful in cases where you might need to do network function virtualization, run non-Linux operating systems, use unikernel applications, or provide greater isolation. GCP's support for nested virtualization is called out as a key enabling technology for Virtlet.

CRAIG BOX: Kong announced an ingress controller for their open source API gateway, traffic control, and microservice management layer. You can now publish a service and immediately connect it to Kong, which was previously a manual process.

ADAM GLICK: TechCrunch took a look at how Kubernetes is creating a broad ecosystem for startups. In the article, TechCrunch identifies what they feel are three indicators that the Kubernetes ecosystem is ripe for growth.

The first is major vendor support, which Kubernetes achieved last year when AWS, Microsoft, Oracle, VMware, and others joined Google as members of the CNCF, the foundation that Google donated Kubernetes to.

Second, they identified developer adoption, noting that over 400 projects have been built on Kubernetes, 771 developers have contributed to the project, and there have been 19,000 commits to the project since its v1 launch in 2015.

The article then goes on to point out that given these two important forces, funding for startups can be substantial. They estimated that over $4 billion has been invested in Kubernetes-related projects in just the past few years. All of this is to say what you probably know already, that Kubernetes is growing fast and there is a lot of opportunity in the ecosystem.

CRAIG BOX: And that's the news.

Our guests this week are Yoshi Tamura, a product manager, and Nic Lacasse, an engineer on the gVisor project. Welcome to the studio.

NICOLAS LACASSE: Thank you for having us.

YOSHI TAMURA: Yeah, nice to meet you.

ADAM GLICK: Great to have you guys here.

CRAIG BOX: Thank you very much for allowing me the privilege of announcing your new project on stage at KubeCon.

NICOLAS LACASSE: Thank you. Really appreciate it. I have to say, you gave the demo. And even though I know it was a recorded demo, and you're not going to play a demo that doesn't work, I was still on the edge of my seat like, I hope this works. I hope this works. I hope this works.

CRAIG BOX: All right. Of course, thank you very much to Ian for producing that lovely demo. Tell us a little bit about gVisor. Tell us what it is and what it does.

YOSHI TAMURA: Yeah, absolutely. So gVisor is a new type of sandbox, which is very exciting because it is lightweight, yet provides very strong isolation.

I started thinking of it as more like carbon fiber, or perhaps titanium-- it has that nice combination of agility and speed, and at the same time it's very, very solid.

We wanted to open source this because we wanted to advance the container isolation field. And it integrates well with Docker and Kubernetes already today.
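For a sense of what that Docker integration can look like in practice, here is a rough sketch, not official install documentation: it assumes a runsc binary already sits at /usr/local/bin/runsc and that Docker picks up additional runtimes from /etc/docker/daemon.json.

```sh
# Register runsc as an extra Docker runtime. If you already have a
# daemon.json, merge the "runtimes" key into it rather than overwriting.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc"
        }
    }
}
EOF
sudo systemctl restart docker

# Opt a single container into the gVisor sandbox at run time:
docker run --runtime=runsc --rm hello-world
```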

ADAM GLICK: So I normally think of containers as being an isolation mechanism that helps me shield what I'm running there from the rest of the operating system. How should I think about what gVisor does, and how that helps me with isolation as opposed to the isolation I get just from using containers?

NICOLAS LACASSE: I think that that's interesting. I think a lot of people do assume that containers provide isolation. And they do, to some extent. But when you run an application in a container, that application is still talking to the host kernel in much the same way-- if it's a Linux container, it's still a regular Linux process talking to a Linux kernel. With the container, Linux imposes some restrictions on what the application can do, but it's still talking to the host kernel.

So if there's a vulnerability in the host, the application can exploit it, even if it's in a container. I think a lot of people sort of assume that containers are like VMs and provide the isolation that a VM does, but they don't. The downside of VMs, however, is that they're typically kind of heavyweight. They have a lot of resource requirements. You have to specify the memory and the number of CPUs upfront. They can't be easily dynamically resized.

So with gVisor, we want to take sort of the isolation guarantees that VMs provide, but also make them lightweight, dynamic, more resource-friendly, and make them operate more like a container.

CRAIG BOX: Tell us about the features of the Linux kernel that make it possible to intercept the system calls and pipe them through the gVisor kernel.

NICOLAS LACASSE: Let me say a little bit about the gVisor architecture, I guess. So we have our own kernel that we wrote from scratch in Go. It's not Linux. We wrote it ourselves, but it does implement most of the Linux syscall API.

So when you run a container with gVisor, you actually run the gVisor kernel and your application together. And when your application makes a syscall, we have a couple of different mechanisms of detecting that the application made a syscall and rerouting that syscall into gVisor instead of to the host.

So one mechanism relies on ptrace, which is a feature that's been in Linux for a little while. It was originally meant for debugging purposes. But you can use ptrace to redirect those syscalls into gVisor.

We also have a way to use the KVM module, which is also in most Linux kernels, to do the syscall redirection.

ADAM GLICK: And what are the pros and cons of both approaches?

NICOLAS LACASSE: So ptrace is great because it will run in most environments. It can run inside of a VM, but it has performance overhead. It's not that fast. It's also not quite as secure as the version that uses KVM.

KVM is, in most cases, faster than ptrace, but it requires virtualization. So if you want to run the KVM platform inside of a VM, you have to have nested virtualization, which you can enable on GCE VMs, but not all VMs have nested virtualization enabled.
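To make the two platforms concrete, here is a hedged sketch of how you might choose between them. It assumes runsc accepts a --platform flag with values ptrace and kvm, passed through Docker's runtimeArgs, and it builds on the daemon.json snippet sketched earlier.

```sh
# Two runsc runtime entries that differ only in platform. The kvm platform
# needs /dev/kvm on the host; inside a VM that means nested virtualization.
# Add to the "runtimes" section of /etc/docker/daemon.json:
#
#   "runsc":     { "path": "/usr/local/bin/runsc",
#                  "runtimeArgs": ["--platform=ptrace"] },
#   "runsc-kvm": { "path": "/usr/local/bin/runsc",
#                  "runtimeArgs": ["--platform=kvm"] }

sudo systemctl restart docker

# Pick a platform per container by choosing the runtime name:
docker run --runtime=runsc --rm alpine uname -a       # ptrace platform
docker run --runtime=runsc-kvm --rm alpine uname -a   # KVM platform
```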

CRAIG BOX: But if I think about it traditionally, I'd use KVM to boot an entire virtual machine. I don't have that overhead-- I'm not running a whole separate guest kernel if I'm using gVisor in this case?

NICOLAS LACASSE: Right. So that's important. We do use some features of the KVM module, but it's not a full virtual machine like you're probably used to when you run KVM and QEMU.

ADAM GLICK: So you mentioned something interesting earlier on when you were talking about that you wrote this in Go. And when I typically think about things written at the kernel level, I usually think about much more low-level languages, like writing something in [INAUDIBLE]. Why did you choose to write in Go versus some of the more fundamental languages?

NICOLAS LACASSE: One answer is that we like Go. But I think a better answer is that gVisor is a security project. One of the features of a higher-level language like Go is that it's mostly memory-safe and mostly type-safe. So there's a whole class of bugs and vulnerabilities that you don't get with Go-- use-after-free, buffer overflows, those types of things. Not to say that they're impossible, but not like with C, where you have to code very, very defensively. Go lets you worry about other things than those types of bugs.

CRAIG BOX: In the demo that you showed at KubeCon recently, there was the Dirty COW, copy-on-write, vulnerability. Is that a rule that you had to teach gVisor-- that that vulnerability exists-- or is it something where, had gVisor been public before this, people would not have been affected by it?

NICOLAS LACASSE: Well, so the Dirty COW is a vulnerability in Linux. There's a race condition in the copy-on-write semantics. And gVisor is not Linux. We have our own memory management. We do implement copy-on-write. And just because it's a new code base, it doesn't have that race condition. We're not vulnerable to Dirty COW just because we re-implemented that memory management ourselves.

CRAIG BOX: Do you think there will be a class of vulnerabilities that gVisor is vulnerable to as opposed to the Linux kernel?

NICOLAS LACASSE: Absolutely. Yeah, without a doubt. But it's all about the number of boundaries.

So even if you are able to compromise the gVisor kernel, if you've got a malicious application that's able to compromise the gVisor kernel, you end up still as isolated as a container. In fact, maybe more isolated, because we run the entire gVisor sandbox, the kernel and the application, in a very restricted [INAUDIBLE] sandbox in an empty user namespace.

So for example, nothing in the sandbox is able to open a file or open a socket. So even if you do compromise the gVisor kernel, you're still unable to open a file or a socket. You would need to find yet another vulnerability in the Linux kernel. So it's all about having multiple layers--

CRAIG BOX: Defense in depth.

NICOLAS LACASSE: Defense in depth. Multiple layers between an attacker and the host.

ADAM GLICK: So part of this technology feels like a layer below what many people who use Kubernetes are likely to be interfacing with. Who do you see are the people who are most likely to use gVisor? And who are the folks that you're writing this for who you think will get the most interesting uses out of it?

NICOLAS LACASSE: So I think the primary use case, in Kubernetes at least, is sandbox pods. That's the term we're using. So anytime you are running a workload that you don't trust. Maybe if you're a PaaS provider and your customers are giving you applications to run, something that you don't trust that could be malicious. Or even if it's maybe semi-trusted. Maybe it's an application that you know is not explicitly malicious, but you don't know everything that it's going to do. Maybe you don't have the source code. Maybe it's a binary that you got from somewhere else. Those are good use cases for running that application in a sandbox.

ADAM GLICK: So that'd be like multitenant environments?

NICOLAS LACASSE: Yep, multitenant environments is the perfect example. Anything that handles user-facing data. So even if it's an application that you wrote yourself, if it accepts data from users, maybe a user could send malicious data to break your application. That's another use case where you might want to run the entire thing in a sandbox.

CRAIG BOX: We've seen a lot of great press over the last few weeks about gVisor. What are the things that people have not picked up on? What do you think is a great feature that people aren't yet talking about?

NICOLAS LACASSE: Yeah. I mean, there's a lot of things in gVisor that we're really excited about. One feature we haven't talked too much about is the save/restore capability that it has.

So with gVisor, it can be running your application. And you can tell it to stop and save the entire state of the application and the kernel itself. It'll serialize the entire process, all of the processes running, the entire state of the kernel to disk. And you can later, potentially on a different machine, restart the application and the kernel and it picks up exactly where it left off.

I think that's really exciting. And I'm excited to see what people want to use it for. I think one thing that comes to mind to me is like live migration of pods--

CRAIG BOX: Absolutely.

NICOLAS LACASSE: --is a feature that that enables. But I think that there's all kinds of interesting things that it can do.

ADAM GLICK: What about debugging error states? If you've got a container that's in a particular state, or perhaps something has been compromised but you don't know how. Restarting the container may get rid of that particular piece, but will not help you with the forensics of figuring it out. Would this be something that might be able to--

NICOLAS LACASSE: That's totally interesting. I hadn't thought about that, but that's great. I mean, debugging in general. If you can get your container into a state that you want to experiment with, you can get into that state, save it, and then recreate that state almost instantaneously anywhere.

YOSHI TAMURA: You could also consider continuously taking snapshots like that and associating them with audit logging, so you have better visibility into not only the logs but all of the state. You have much more visibility into what actually happened at the particular event you're interested in.

CRAIG BOX: So there's obviously a lot of great features that you get for running your workload inside the gVisor sandbox. Do you want all workloads running inside the sandbox?

NICOLAS LACASSE: No. It doesn't make sense for all workloads. There is a performance penalty-- the sandbox isn't free. If it's something that you wrote yourself and you totally trust, there's no reason for you to pay the sandbox's performance penalty.

CRAIG BOX: How big is that penalty?

NICOLAS LACASSE: So it depends on the workload right now. You pay the price whenever you make a syscall, and it varies a lot based on the syscall and on whether you're using the ptrace platform or the KVM platform. It really depends on the workload.

CRAIG BOX: What about by comparison to traditional VM-based sandboxing?

NICOLAS LACASSE: Again, it depends on the workload. One of the things we're hoping to learn from open sourcing is what types of applications our users are running, so that we can do a better job of deciding where to focus the optimizations we want to do.

Back to your question though about all workloads. There are also workloads that you would not want to run in a sandbox because you actually need to access host resources. So Kubernetes has this concept of privileged pods that need to configure the host network, say. That doesn't make sense to run in a sandbox. That needs to be running directly on the host.

CRAIG BOX: So we'll end up in a world where in your pod spec, you specify I want to use this particular runtime rather than the default?

NICOLAS LACASSE: Yeah. The Kubernetes community is sort of working on a spec right now for sandbox pods. And so exactly what that API looks like is still being defined. Whether it's something that the owner of the pod chooses, or whether it's something that the cluster administrator chooses, those details are still being worked out.
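Because that API was still an open question when this conversation was recorded, the following is speculative: a sketch of one shape the sandbox-pod spec could take (roughly the shape the later RuntimeClass proposal settled on), where a cluster administrator registers a named runtime handler and the pod owner opts into it.

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor          # name chosen by the cluster administrator
handler: runsc          # must match a runtime configured on the nodes
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-web
spec:
  runtimeClassName: gvisor   # the pod owner opts into the sandbox
  containers:
  - name: web
    image: nginx
EOF
```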

CRAIG BOX: Does gVisor have a version number?

NICOLAS LACASSE: Not yet.

CRAIG BOX: Would you consider this a 1.0 release of something we run internally at Google?

NICOLAS LACASSE: Certainly not. You know, we've been open source a very short amount of time. We definitely do want to have releases. That's something we're going to look at. Exactly how often and what they look like is still to be determined.

One nice thing, though, about gVisor-- the fact that it's written in Go helps with this-- is that it's just a single statically-linked binary. So it's really easy to deploy, really easy to start using. I think one thing we could probably do relatively quickly is start our releases by building a few of these binaries for a couple of different architectures and putting them on our GitHub Releases page.
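To illustrate how little "deployment" there is for a single statically-linked binary, here is a minimal sketch, assuming you already have a runsc binary built or downloaded locally:

```sh
# Installing is just putting the executable on the node's PATH.
chmod a+rx ./runsc
sudo cp ./runsc /usr/local/bin/runsc

# There are no shared libraries or language runtime to install alongside it;
# ldd should report that it is not a dynamic executable.
ldd /usr/local/bin/runsc
```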

CRAIG BOX: Have you given thought to running gVisor on Windows?

NICOLAS LACASSE: No. So right now, gVisor only runs on Linux. But I think it would be really exciting, especially if someone in the open source community wanted to port it to Windows or OS X. I think that that would be awesome.

CRAIG BOX: Yoshi, tell me about the relationship that gVisor has with other security platforms, like Kata Containers, for example.

YOSHI TAMURA: That is a very good question. So as Nic mentioned, these are both interesting approaches. Because of the use cases that we discussed, Kubernetes and containers have taken off significantly.

We've been seeing a lot of these new use cases, especially in the multi-tenancy field. So we're, first of all, very excited that there is Kata Containers, which is more of a VM-based approach. And we also have gVisor, which is another new approach.

If you look at our announcement blog post, you'll see the great quotes that we got from the Kata Containers community. So I think we are all excited to see what sort of collaboration we can have in the open source community. And that will also be happening in the Kubernetes community, as a sandbox API.

So Kubernetes users and container users are about to see this all come together. I think that will be the exciting moment to see, and then we keep going with more new use cases and new technology. This is just happening.

ADAM GLICK: We've chatted a bit about different use cases. What about for people, if you're not working necessarily with the container pieces directly, but maybe as a researcher working on operating systems, would this also be something that would be an interesting product to kind of play around with?

NICOLAS LACASSE: Yeah, that's something that we're really excited about, too. I mean obviously, we see the most immediate use in sandboxing containers, sandboxing pods. But gVisor, it's a kernel. And it's really easy to experiment with it. So I hope that if you're an OS researcher and you've got some new idea for some new operating system feature or some different semantics, rather than having to hack it into Linux, Go is a very friendly language. gVisor is pretty easy to develop with. You could experiment by putting your feature into gVisor and building an application and see how it runs. It makes that type of stuff really easy.

Also, gVisor right now looks like Linux because we're trying to run Linux apps. But with some work, you could make gVisor look like a different OS or look like your own custom OS. It's pretty flexible.

CRAIG BOX: Right now, you say there are a number of applications for which some system calls are not supported. I imagine you could just pass those system calls through to the kernel and make them available? And would that be beneficial-- a quick way to make those applications run with some security, but not all?

NICOLAS LACASSE: Not really. Because the syscalls tie into other features in the kernel. So if we were to just pass them to the host, it might not make sense because the host doesn't have all of the context for the application. They really have to be integrated into the gVisor kernel.

From a security standpoint, we really want to make sure gVisor is a security product. Secure by default is a big thing for us. And so even if it did work, even if you could pass some of those syscalls through, it might not be a good idea. It's not something I want to recommend.

CRAIG BOX: What sort of coverage do you have with syscalls today?

NICOLAS LACASSE: So Linux has over 300 syscalls-- 350, 360, something like that. We've implemented over 200. A lot of syscalls take lots of different options and arguments. We're very data-driven. So you know, we sort of run an application, see what it needs, and then we implement those features as needed. So there are plenty of syscalls we don't implement yet. But for the most part, they're not very commonly used.

There are some applications that use them and they don't work yet, but we feel like we have pretty good coverage with most applications right now. And we're working to close the gap for the applications that don't.

CRAIG BOX: If I were a Go programmer and I had an app that didn't run on gVisor, first of all, would I be able to get a list of the syscalls that are causing it to not succeed?

NICOLAS LACASSE: Yes. You can run gVisor with like an strace mode, and you can see every single syscall that the application is making. That's how we debug. So when we get a bug report that says, I tried application $FOO and it didn't work, we run it in strace. Usually, it's easy to tell, like, oh, here's a syscall that it made, and then it stopped working after that point. And that tells us what we need to do. Or it made a syscall that we do support, but it called it with some option that we don't support yet.
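Here is a rough sketch of turning that tracing on through Docker; the flag names (--debug, --debug-log, --strace) are taken from the project's documentation at the time and should be treated as assumptions rather than gospel.

```sh
# In /etc/docker/daemon.json, add debug flags to the runsc runtime entry:
#
#   "runsc": {
#     "path": "/usr/local/bin/runsc",
#     "runtimeArgs": ["--debug", "--debug-log=/tmp/runsc/", "--strace"]
#   }

sudo systemctl restart docker

# Reproduce the failure under the sandbox (my-failing-app is a placeholder):
docker run --runtime=runsc --rm my-failing-app

# Then read the logs: the trace lists every syscall the application made,
# so look at the last calls before things went wrong, or for any calls
# flagged as unsupported.
ls /tmp/runsc/
less /tmp/runsc/*
```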

CRAIG BOX: Is there a general order of magnitude as to how much effort it takes to implement a new syscall in gVisor? Are all the easy ones done and now we're saying each new one will take 10 years to do?

NICOLAS LACASSE: The easy ones are done. A lot of hard ones are done, too.

CRAIG BOX: Right.

NICOLAS LACASSE: It's hard to say. I mean, you really have to dig in. We've spent a lot of time reading the Linux kernel code and trying to emulate as much as possible what Linux does. It all depends.

YOSHI TAMURA: Maybe that's exactly why-- that's exactly what we're expecting from the open source community-- just running your favorite applications on gVisor and filing a bug on GitHub. That's already a great contribution. If you can actually fix it and implement it, that would be awesome.

CRAIG BOX: Have you had vendors reach out to you to say, I want to help implement whatever I need to make my own application run?

NICOLAS LACASSE: Not quite vendors, but we have had a lot of people already who have interest in running-- we have in our issue tracker, a list of the applications that don't run. And people are already jumping in to say, I'll help implement that syscall, which is fantastic. I love to see that.

ADAM GLICK: So let's say someone's just a regular user, but they are a Go programmer since we know it's written in Go. They want to get involved in it. Where should they go to learn more, to get involved, to see if there are issues that maybe they can help out with?

NICOLAS LACASSE: Our GitHub. It's github.com/google/gvisor. We've got a fairly-- I won't say complete, but a good introductory README.

We've got plenty of issues filed, and more by the day. And also, you can see there the list of applications that are known to work and the ones that aren't working at this moment.

YOSHI TAMURA: Within 48 hours after the announcement and the launch, the GitHub stars surpassed 3,000. So I think we have a lot of attention, and we appreciate that.

ADAM GLICK: Excellent.

NICOLAS LACASSE: There are also links in the README to a mailing list. And you can join that, be part of the discussion.

ADAM GLICK: Yoshi. Nic. It was great to have you both.

NICOLAS LACASSE: Thank you very much.

YOSHI TAMURA: Thank you very much.

CRAIG BOX: Thank you. That's about all we have time for this week. If you want to learn more about gVisor, there's a great README at the GitHub page, github.com/google/gvisor.

ADAM GLICK: Thank you for listening. As always, if you've enjoyed this show, please help us spread the word and tell a friend. If you have any feedback for us, you can find us on Twitter, @kubernetespod, or reach us by email at kubernetespodcast@google.com.

CRAIG BOX: You can also check out our website and find our show notes at kubernetespodcast.com. Until next time, have a great week.

ADAM GLICK: Take care.

[MUSIC PLAYING]