Welcome to Cilium’s documentation!¶
The documentation is divided into the following sections:
- Getting Started Guides: Provides a simple tutorial for running a small Cilium setup on your laptop. Intended as an easy way to get your hands dirty applying Cilium security policies between containers.
- Concepts: Describes the components of the Cilium architecture, and the different models for deploying Cilium. Provides the high-level understanding required to run a full Cilium deployment and understand its behavior.
- Installation Guides: Detailed instructions for installing, configuring, and troubleshooting Cilium in different deployment modes.
- Policy Enforcement Modes: Detailed walkthrough of the policy language structure and the supported formats.
- Monitoring & Metrics: Instructions for configuring metrics collection from Cilium.
- Troubleshooting: Describes how to troubleshoot Cilium in different deployment modes.
- BPF and XDP Reference Guide: Provides a technical deep dive into BPF and XDP technology, aimed primarily at developers.
- API Reference: Details the Cilium agent API for interacting with a local Cilium instance.
- Developer / Contributor Guide: Gives background to those looking to develop and contribute modifications to the Cilium code or documentation.
Introduction to Cilium¶
What is Cilium?¶
Cilium is open source software for transparently securing the network connectivity between application services deployed using Linux container management platforms like Docker and Kubernetes.
At the foundation of Cilium is a new Linux kernel technology called BPF, which enables the dynamic insertion of powerful security visibility and control logic within Linux itself. Because BPF runs inside the Linux kernel, Cilium security policies can be applied and updated without any changes to the application code or container configuration.
Why Cilium?¶
The development of modern datacenter applications has shifted to a service-oriented architecture often referred to as microservices, wherein a large application is split into small independent services that communicate with each other via APIs using lightweight protocols like HTTP. Microservices applications tend to be highly dynamic, with individual containers getting started or destroyed as the application scales out / in to adapt to load changes and during rolling updates that are deployed as part of continuous delivery.
This shift toward highly dynamic microservices presents both a challenge and an opportunity in terms of securing connectivity between microservices. Traditional Linux network security approaches (e.g., iptables) filter on IP address and TCP/UDP ports, but IP addresses frequently churn in dynamic microservices environments. The highly volatile life cycle of containers causes these approaches to struggle to scale side by side with the application, as load-balancing tables and access control lists carrying hundreds of thousands of rules must be updated with continuously growing frequency. Protocol ports (e.g. TCP port 80 for HTTP traffic) can no longer be used to differentiate between application traffic for security purposes, as the same port is used for a wide range of messages across services.
An additional challenge is the ability to provide accurate visibility, as traditional systems use IP addresses as the primary identification vehicle, and in microservices architectures these may have a drastically reduced lifetime of just a few seconds.
By leveraging Linux BPF, Cilium retains the ability to transparently insert security visibility and enforcement, but does so in a way that is based on service / pod / container identity (in contrast to IP address identification in traditional systems) and can filter at the application layer (e.g. HTTP). As a result, Cilium not only makes it simple to apply security policies in a highly dynamic environment by decoupling security from addressing, but it can also provide stronger security isolation by operating at the HTTP layer in addition to providing traditional Layer 3 and Layer 4 segmentation.
The use of BPF enables Cilium to achieve all of this in a way that is highly scalable even for large-scale environments.
Functionality Overview¶
Protect and secure APIs transparently¶
Ability to secure modern application protocols such as REST/HTTP, gRPC and Kafka. Traditional firewalls operate at Layers 3 and 4: a protocol running on a particular port is either completely trusted or blocked entirely. Cilium provides the ability to filter on individual application protocol requests such as:
- Allow all HTTP requests with method GET and path /public/.*. Deny all other requests.
- Allow service1 to produce on Kafka topic topic1 and service2 to consume on topic1. Reject all other Kafka messages.
- Require the HTTP header X-Token: [0-9]+ to be present in all REST calls.
See the section Layer 7 Policy in our documentation for the latest list of supported protocols and examples of how to use them.
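For example, the first rule above could be expressed as a CiliumNetworkPolicy roughly as follows. This is a minimal sketch: the labels app: myservice and app: myclient and the port are illustrative assumptions, not taken from a real deployment.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-public-get"
spec:
  endpointSelector:
    matchLabels:
      app: myservice          # illustrative label
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: myclient         # illustrative label
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/public/.*"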
Secure service to service communication based on identities¶
Modern distributed applications rely on technologies such as application containers to facilitate agility in deployment and scale out on demand. This results in a large number of application containers being started in a short period of time. Typical container firewalls secure workloads by filtering on source IP addresses and destination ports. This concept requires the firewalls on all servers to be manipulated whenever a container is started anywhere in the cluster.
In order to avoid this situation, which limits scale, Cilium assigns a security identity to groups of application containers which share identical security policies. The identity is then associated with all network packets emitted by the application containers, allowing the identity to be validated at the receiving node. Security identity management is performed using a key-value store.
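As a rough sketch of what this looks like to the user, the following policy selects workloads purely by label; the same rule keeps applying as containers are started, stopped, and rescheduled, regardless of their IP addresses. The labels role: frontend and role: backend are assumptions for illustration.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-to-backend"
spec:
  endpointSelector:
    matchLabels:
      role: backend           # illustrative label
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend        # illustrative label; no IP addresses appear anywhere in the rule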
Secure access to and from external services¶
Label-based security is the tool of choice for cluster-internal access control. In order to secure access to and from external services, traditional CIDR-based security policies for both ingress and egress are supported. This allows access to and from application containers to be limited to particular IP ranges.
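For instance, egress from a group of containers to an external service could be limited to an IP range along these lines; the label app: crawler and the 192.0.2.0/24 range are assumptions for illustration.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-external-range"
spec:
  endpointSelector:
    matchLabels:
      app: crawler            # illustrative label
  egress:
  - toCIDR:
    - 192.0.2.0/24            # illustrative external IP range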
Simple Networking¶
A simple flat Layer 3 network with the ability to span multiple clusters connects all application containers. IP allocation is kept simple by using host scope allocators. This means that each host can allocate IPs without any coordination between hosts.
The following multi node networking models are supported:
Overlay: Encapsulation-based virtual network spanning all hosts. Currently VXLAN and Geneve are baked in, but all encapsulation formats supported by Linux can be enabled.
When to use this mode: This mode has minimal infrastructure and integration requirements. It works on almost any network infrastructure, as the only requirement is IP connectivity between hosts, which is typically already the case.
Native Routing: Use of the regular routing table of the Linux host. The network is required to be capable of routing the IP addresses of the application containers.
When to use this mode: This mode is for advanced users and requires some awareness of the underlying networking infrastructure. This mode works well with:
- Native IPv6 networks
- In conjunction with cloud network routers
- If you are already running routing daemons
Load balancing¶
Distributed load balancing for traffic between application containers and to external services. The load balancing is implemented in BPF using efficient hash tables, allowing for almost unlimited scale, and supports direct server return (DSR) if the load-balancing operation is not performed on the source host. Note: load balancing requires connection tracking to be enabled. This is the default.
Monitoring and Troubleshooting¶
The ability to gain visibility and to troubleshoot issues is fundamental to the
operation of any distributed system. While we learned to love tools like
tcpdump
and ping
and while they will always find a special place in our
hearts, we strive to provide better tooling for troubleshooting. This includes
tooling to provide:
- Event monitoring with metadata: When a packet is dropped, the tool doesn’t just report the source and destination IP of the packet; it provides the full label information of both the sender and receiver, among a lot of other information.
- Policy decision tracing: Why is a packet being dropped or a request rejected? The policy tracing framework allows the policy decision process to be traced both for running workloads and based on arbitrary label definitions.
- Metrics export via Prometheus: Key metrics are exported via Prometheus for integration with your existing dashboards.
Integrations¶
- Network plugin integrations: CNI, libnetwork
- Container runtime events: containerd
- Kubernetes: NetworkPolicy, Labels, Ingress, Service
- Logging: syslog, fluentd
Getting Started Guides¶
This document serves as the easiest introduction to using Cilium. If you are new to Cilium it is recommended to read the Introduction to Cilium section first to learn about the basic concepts and motivation.
The following guides, each of which takes an estimated 10-15 minutes to complete, will help you get started in your area of choice:
Getting Started Using Minikube¶
This guide uses minikube to demonstrate deployment and operation of Cilium in a single-node Kubernetes cluster. The minikube VM requires approximately 5GB of RAM and supports hypervisors like VirtualBox that run on Linux, macOS, and Windows.
If you instead want to understand the details of deploying Cilium on a full-fledged Kubernetes cluster, then go straight to Installation Guide.
If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.
The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.
Step 0: Install kubectl & minikube¶
- Install kubectl version >= 1.7.0 as described in the Kubernetes Docs.
- Install one of the hypervisors supported by minikube.
- Install minikube >= 0.22.3 as described on minikube’s github page.
With the default Docker runtime:
$ minikube start --network-plugin=cni --extra-config=kubelet.network-plugin=cni --memory=5120
With the CRI-O container runtime:
$ minikube start --network-plugin=cni --container-runtime=cri-o --extra-config=kubelet.network-plugin=cni --memory=5120
After minikube has finished setting up your new Kubernetes cluster, you can check the status of the cluster by running kubectl get cs:
$ kubectl get cs
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-0 Healthy {"health": "true"}
- Install etcd as a dependency of cilium in minikube by running:
$ kubectl create -n kube-system -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/addons/etcd/standalone-etcd.yaml
service "etcd-cilium" created
statefulset.apps "etcd-cilium" created
To check that all pods are Running and 100% ready, including kube-dns and etcd-cilium-0, run:
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-cilium-0 1/1 Running 0 1m
kube-system etcd-minikube 1/1 Running 0 3m
kube-system kube-addon-manager-minikube 1/1 Running 0 4m
kube-system kube-apiserver-minikube 1/1 Running 0 3m
kube-system kube-controller-manager-minikube 1/1 Running 0 3m
kube-system kube-dns-86f4d74b45-lhzfv 3/3 Running 0 4m
kube-system kube-proxy-tcd7h 1/1 Running 0 4m
kube-system kube-scheduler-minikube 1/1 Running 0 4m
kube-system storage-provisioner 1/1 Running 0 4m
If you see output similar to this, you are ready to proceed to the next step.
Note
The output might differ between minikube versions; you should expect all pods to be in the READY / Running state before continuing.
Step 1: Install Cilium¶
The next step is to install Cilium into your Kubernetes cluster.
Cilium installation leverages the Kubernetes Daemon Set
abstraction, which will deploy one Cilium pod per cluster node. This
Cilium pod will run in the kube-system
namespace along with all
other system relevant daemons and services. The Cilium pod will run
both the Cilium agent and the Cilium CNI plugin.
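The actual manifest is provided by the installation instructions referenced below; purely as a generic sketch of the shape of such a DaemonSet (API version, labels, and image are placeholders, not Cilium's real manifest):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cilium
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: cilium
  template:
    metadata:
      labels:
        k8s-app: cilium
    spec:
      hostNetwork: true
      containers:
      - name: cilium-agent
        image: cilium/cilium          # placeholder; use the image and version from the install guide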
Choose the installation instructions for the environment in which you are deploying Cilium.
Docker Based¶
CRI-O Based¶
Step 2: Deploy the Demo Application¶
Now that we have Cilium deployed and kube-dns operating correctly, we can deploy our demo application.
In our Star Wars-inspired example, there are three microservices applications: deathstar, tiefighter, and xwing. The deathstar runs an HTTP webservice on port 80, which is exposed as a Kubernetes Service to load-balance requests to deathstar across two pod replicas. The deathstar service provides landing services to the empire’s spaceships so that they can request a landing port. The tiefighter pod represents a landing-request client service on a typical empire ship and xwing represents a similar service on an alliance ship. They exist so that we can test different security policies for access control to deathstar landing services.
Application Topology for Cilium and Kubernetes

The file http-sw-app.yaml contains a Kubernetes Deployment for each of the three services. Each deployment is identified using the Kubernetes labels (org=empire, class=deathstar), (org=empire, class=tiefighter), and (org=alliance, class=xwing). It also includes a deathstar-service, which load-balances traffic to all pods with label (org=empire, class=deathstar).
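As a condensed sketch (not the full linked file), the deathstar pieces of that manifest look roughly like this; the container image is an assumption, so refer to http-sw-app.yaml for the real definitions:
apiVersion: v1
kind: Service
metadata:
  name: deathstar
spec:
  ports:
  - port: 80
  selector:
    org: empire
    class: deathstar
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deathstar
spec:
  replicas: 2
  selector:
    matchLabels:
      org: empire
      class: deathstar
  template:
    metadata:
      labels:
        org: empire
        class: deathstar
    spec:
      containers:
      - name: deathstar
        image: docker.io/cilium/starwars   # image name is an assumption; see the linked file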
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/minikube/http-sw-app.yaml
service "deathstar" created
deployment "deathstar" created
deployment "tiefighter" created
deployment "xwing" created
Kubernetes will deploy the pods and service in the background. Running kubectl get pods,svc will inform you about the progress of the operation. Each pod will go through several states until it reaches Running, at which point the pod is ready.
$ kubectl get pods,svc
NAME READY STATUS RESTARTS AGE
po/deathstar-76995f4687-2mxb2 1/1 Running 0 1m
po/deathstar-76995f4687-xbgnl 1/1 Running 0 1m
po/tiefighter 1/1 Running 0 1m
po/xwing 1/1 Running 0 1m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/deathstar ClusterIP 10.109.254.198 <none> 80/TCP 3h
svc/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 3h
Each pod will be represented in Cilium as an Endpoint. We can invoke the cilium tool inside the Cilium pod to list them:
$ kubectl -n kube-system get pods -l k8s-app=cilium
NAME READY STATUS RESTARTS AGE
cilium-1c2cz 1/1 Running 0 26m
$ kubectl -n kube-system exec cilium-1c2cz -- cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
7624 Disabled Disabled 9919 k8s:class=deathstar f00d::a0f:0:0:1dc8 10.15.185.9 ready
k8s:io.kubernetes.pod.namespace=default
k8s:org=empire
10900 Disabled Disabled 32353 k8s:class=xwing f00d::a0f:0:0:2a94 10.15.92.254 ready
k8s:io.kubernetes.pod.namespace=default
k8s:org=alliance
11010 Disabled Disabled 9919 k8s:class=deathstar f00d::a0f:0:0:2b02 10.15.197.34 ready
k8s:io.kubernetes.pod.namespace=default
k8s:org=empire
50240 Disabled Disabled 12904 k8s:class=tiefighter f00d::a0f:0:0:c440 10.15.28.62 ready
k8s:io.kubernetes.pod.namespace=default
k8s:org=empire
52921 Disabled Disabled 4 reserved:health f00d::a0f:0:0:ceb9 10.15.126.89 ready
Ingress and egress policy enforcement are both still disabled on all of these pods because no network policy has been imported yet that selects any of them.
Step 3: Check Current Access¶
From the perspective of the deathstar service, only the ships with label org=empire are allowed to connect and request landing. Since we have no rules enforced, both xwing and tiefighter will be able to request landing. To test this, use the commands below.
$ kubectl exec xwing -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed
$ kubectl exec tiefighter -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed
Step 4: Apply an L3/L4 Policy¶
When using Cilium, endpoint IP addresses are irrelevant when defining security policies. Instead, you can use the labels assigned to the pods to define security policies. The policies will be applied to the right pods based on the labels, irrespective of where or when they are running within the cluster.
We’ll start with the basic policy restricting deathstar landing requests to only the ships that have label (org=empire). This will not allow any ships that don’t have the org=empire label to even connect with the deathstar service.
This is a simple policy that filters only on IP protocol (network layer 3) and TCP protocol (network layer 4), so it is often referred to as an L3/L4 network security policy.
Note: Cilium performs stateful connection tracking, meaning that if policy allows the frontend to reach backend, it will automatically allow all required reply packets that are part of backend replying to frontend within the context of the same TCP/UDP connection.
L4 Policy with Cilium and Kubernetes

We can achieve that with the following CiliumNetworkPolicy:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "L3-L4 policy to restrict deathstar access to empire ships only"
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      org: empire
      class: deathstar
  ingress:
  - fromEndpoints:
    - matchLabels:
        org: empire
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
CiliumNetworkPolicies match on pod labels using an “endpointSelector” to identify the sources and destinations to which the policy applies.
The above policy whitelists traffic sent from any pods with label (org=empire) to deathstar pods with label (org=empire, class=deathstar) on TCP port 80.
To apply this L3/L4 policy, run:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/minikube/sw_l3_l4_policy.yaml
Now if we run the landing requests again, only the tiefighter pods with the label org=empire will succeed. The xwing pods will be blocked!
$ kubectl exec tiefighter -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed
This works as expected. Now the same request run from an xwing pod will fail:
$ kubectl exec xwing -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
This request will hang, so press Control-C to kill the curl request, or wait for it to time out.
Step 5: Inspecting the Policy¶
If we run cilium endpoint list again, we will see that the pods with the label org=empire and class=deathstar now have ingress policy enforcement enabled, as per the policy above.
$ kubectl -n kube-system exec cilium-1c2cz -- cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
7624 Enabled Disabled 9919 k8s:class=deathstar f00d::a0f:0:0:1dc8 10.15.185.9 ready
k8s:io.kubernetes.pod.namespace=default
k8s:org=empire
10900 Disabled Disabled 32353 k8s:class=xwing f00d::a0f:0:0:2a94 10.15.92.254 ready
k8s:io.kubernetes.pod.namespace=default
k8s:org=alliance
11010 Enabled Disabled 9919 k8s:class=deathstar f00d::a0f:0:0:2b02 10.15.197.34 ready
k8s:io.kubernetes.pod.namespace=default
k8s:org=empire
50240 Disabled Disabled 12904 k8s:class=tiefighter f00d::a0f:0:0:c440 10.15.28.62 ready
k8s:io.kubernetes.pod.namespace=default
k8s:org=empire
52921 Disabled Disabled 4 reserved:health f00d::a0f:0:0:ceb9 10.15.126.89 ready
You can also inspect the policy details via kubectl
$ kubectl get cnp
NAME AGE
rule1 42s
$ kubectl describe cnp rule1
Name: rule1
Namespace: default
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"cilium.io/v2","description":"L3-L4 policy to restrict deathstar access to empire ships only","kind":"CiliumNetworkPolicy","metadata":{"a...
API Version: cilium.io/v2
Kind: CiliumNetworkPolicy
Metadata:
Cluster Name:
Creation Timestamp: 2018-04-14T07:38:51Z
Generation: 0
Resource Version: 24334
Self Link: /apis/cilium.io/v2/namespaces/default/ciliumnetworkpolicies/rule1
UID: e025de8b-3fb6-11e8-ab5f-08002737c671
Spec:
Endpoint Selector:
Match Labels:
Any : Class: deathstar
Any : Org: empire
Ingress:
From Endpoints:
Match Labels:
Any : Org: empire
To Ports:
Ports:
Port: 80
Protocol: TCP
Status:
Nodes:
Minikube:
Enforcing: true
Last Updated: 2018-04-14T07:38:55.174693943Z
Local Policy Revision: 87
Ok: true
Events: <none>
Step 6: Apply and Test HTTP-aware L7 Policy¶
In the simple scenario above, it was sufficient to either give tiefighter / xwing full access to deathstar’s API or no access at all. But to provide the strongest security (i.e., enforce least-privilege isolation) between microservices, each service that calls deathstar’s API should be limited to making only the set of HTTP requests it requires for legitimate operation.
For example, consider that the deathstar service exposes some maintenance APIs which should not be called by random empire ships. To see this run:
$ kubectl exec tiefighter -- curl -s -XPUT deathstar.default.svc.cluster.local/v1/exhaust-port
Panic: deathstar exploded
goroutine 1 [running]:
main.HandleGarbage(0x2080c3f50, 0x2, 0x4, 0x425c0, 0x5, 0xa)
/code/src/github.com/empire/deathstar/
temp/main.go:9 +0x64
main.main()
/code/src/github.com/empire/deathstar/
temp/main.go:5 +0x85
While this is an illustrative example, unauthorized access such as above can have adverse security repercussions.
L7 Policy with Cilium and Kubernetes

Cilium is capable of enforcing HTTP-layer (i.e., L7) policies to limit what URLs the tiefighter is allowed to reach. Here is an example policy file that extends our original policy by limiting tiefighter to making only a POST /v1/request-landing API call, but disallowing all other calls (including PUT /v1/exhaust-port).
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "L7 policy to restrict access to specific HTTP call"
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      org: empire
      class: deathstar
  ingress:
  - fromEndpoints:
    - matchLabels:
        org: empire
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "POST"
          path: "/v1/request-landing"
Update the existing rule to apply L7-aware policy to protect deathstar using:
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/minikube/sw_l3_l4_l7_policy.yaml
We can now re-run the same test as above, but we will see a different outcome:
$ kubectl exec tiefighter -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed
and
$ kubectl exec tiefighter -- curl -s -XPUT deathstar.default.svc.cluster.local/v1/exhaust-port
Access denied
As you can see, with Cilium L7 security policies, we are able to permit tiefighter to access only the required API resources on deathstar, thereby implementing a “least privilege” security approach for communication between microservices.
You can observe the L7 policy via kubectl:
$ kubectl describe ciliumnetworkpolicies
Name: rule1
Namespace: default
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"cilium.io/v2","description":"L3-L4 policy to restrict deathstar access to empire ships only","kind":"CiliumNetworkPolicy","metadata":{"a...
API Version: cilium.io/v2
Kind: CiliumNetworkPolicy
Metadata:
Cluster Name:
Creation Timestamp: 2018-04-14T07:38:51Z
Generation: 0
Resource Version: 26083
Self Link: /apis/cilium.io/v2/namespaces/default/ciliumnetworkpolicies/rule1
UID: e025de8b-3fb6-11e8-ab5f-08002737c671
Spec:
Endpoint Selector:
Match Labels:
Any : Class: deathstar
Any : Org: empire
Ingress:
From Endpoints:
Match Labels:
Any : Org: empire
To Ports:
Ports:
Port: 80
Protocol: TCP
Rules:
Http:
Method: POST
Path: /v1/request-landing
Status:
Nodes:
Minikube:
Enforcing: true
Last Updated: 2018-04-14T08:13:12.094961363Z
Local Policy Revision: 93
Ok: true
Events: <none>
and via the cilium CLI:
$ kubectl -n kube-system exec cilium-1c2cz -- cilium policy get
[
{
"endpointSelector": {
"matchLabels": {
"any:class": "deathstar",
"any:org": "empire",
"k8s:io.kubernetes.pod.namespace": "default"
}
},
"ingress": [
{
"fromEndpoints": [
{
"matchLabels": {
"any:org": "empire",
"k8s:io.kubernetes.pod.namespace": "default"
}
}
],
"toPorts": [
{
"ports": [
{
"port": "80",
"protocol": "TCP"
}
],
"rules": {
"http": [
{
"path": "/v1/request-landing",
"method": "POST"
}
]
}
}
]
}
],
"labels": [
{
"key": "io.cilium.k8s.policy.name",
"value": "rule1",
"source": "unspec"
},
{
"key": "io.cilium.k8s.policy.namespace",
"value": "default",
"source": "unspec"
}
]
}
]
Revision: 10
We hope you enjoyed the tutorial. Feel free to play more with the setup, read the rest of the documentation, and reach out to us on the Cilium Slack channel with any questions!
Step 7: Clean-Up¶
You have now installed Cilium, deployed a demo app, and tested both L3/L4 and L7 network security policies.
$ minikube delete
After this, you can re-run this guide from Step 1.
Extra: Metrics¶
To try out the metrics exported by Cilium, simply install the example Prometheus spec file:
Pick the cilium.yaml that matches your Kubernetes version (1.7 through 1.11 below):
Kubernetes 1.7:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/addons/prometheus/prometheus.yaml
$ kubectl replace --force -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.7/cilium.yaml
Kubernetes 1.8:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/addons/prometheus/prometheus.yaml
$ kubectl replace --force -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.8/cilium.yaml
Kubernetes 1.9:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/addons/prometheus/prometheus.yaml
$ kubectl replace --force -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.9/cilium.yaml
Kubernetes 1.10:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/addons/prometheus/prometheus.yaml
$ kubectl replace --force -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.10/cilium.yaml
Kubernetes 1.11:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/addons/prometheus/prometheus.yaml
$ kubectl replace --force -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.11/cilium.yaml
This will create a barebones Prometheus installation that you can use to inspect metrics from the agent, and then restart Cilium so it can consume the new Prometheus configuration. Navigate to the web UI with:
$ minikube service prometheus -n monitoring
Getting Started Using Istio¶
This document serves as an introduction to using Cilium to enforce security policies in Kubernetes micro-services managed with Istio. It is a detailed walk-through of getting a single-node Cilium + Istio environment running on your machine.
If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.
The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.
Step 0: Install kubectl & minikube¶
- Install kubectl version >= 1.7.0 as described in the Kubernetes Docs.
- Install one of the hypervisors supported by minikube.
- Install minikube >= 0.22.3 as described on minikube’s github page.
With the default Docker runtime:
$ minikube start --network-plugin=cni --extra-config=kubelet.network-plugin=cni --memory=5120
With the CRI-O container runtime:
$ minikube start --network-plugin=cni --container-runtime=cri-o --extra-config=kubelet.network-plugin=cni --memory=5120
After minikube has finished setting up your new Kubernetes cluster, you can check the status of the cluster by running kubectl get cs:
$ kubectl get cs
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-0 Healthy {"health": "true"}
- Install etcd as a dependency of cilium in minikube by running:
$ kubectl create -n kube-system -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/addons/etcd/standalone-etcd.yaml
service "etcd-cilium" created
statefulset.apps "etcd-cilium" created
To check that all pods are Running and 100% ready, including kube-dns and etcd-cilium-0, run:
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-cilium-0 1/1 Running 0 1m
kube-system etcd-minikube 1/1 Running 0 3m
kube-system kube-addon-manager-minikube 1/1 Running 0 4m
kube-system kube-apiserver-minikube 1/1 Running 0 3m
kube-system kube-controller-manager-minikube 1/1 Running 0 3m
kube-system kube-dns-86f4d74b45-lhzfv 3/3 Running 0 4m
kube-system kube-proxy-tcd7h 1/1 Running 0 4m
kube-system kube-scheduler-minikube 1/1 Running 0 4m
kube-system storage-provisioner 1/1 Running 0 4m
If you see output similar to this, you are ready to proceed to the next step.
Note
The output might differ between minikube versions; you should expect all pods to be in the READY / Running state before continuing.
Step 1: Install Cilium¶
The next step is to install Cilium into your Kubernetes cluster.
Cilium installation leverages the Kubernetes Daemon Set
abstraction, which will deploy one Cilium pod per cluster node. This
Cilium pod will run in the kube-system
namespace along with all
other system relevant daemons and services. The Cilium pod will run
both the Cilium agent and the Cilium CNI plugin.
Choose the installation instructions for the environment in which you are deploying Cilium.
Docker Based¶
CRI-O Based¶
Step 2: Install Istio¶
Install the Helm client.
Download Istio version 1.0.0:
$ export ISTIO_VERSION=1.0.0
$ curl -L https://git.io/getLatestIstio | sh -
$ export ISTIO_HOME=`pwd`/istio-${ISTIO_VERSION}
$ export PATH="$PATH:${ISTIO_HOME}/bin"
Create a copy of Istio’s Helm charts in order to customize them:
$ cp -r ${ISTIO_HOME}/install/kubernetes/helm/istio istio-cilium-helm
Configure the Cilium-specific variant of Pilot to inject the Cilium network policy filters into each Istio sidecar proxy:
$ curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/cilium-pilot.awk > cilium-pilot.awk
$ awk -f cilium-pilot.awk \
< istio-cilium-helm/charts/pilot/templates/deployment.yaml \
> istio-cilium-helm/charts/pilot/templates/deployment.yaml.new && \
mv istio-cilium-helm/charts/pilot/templates/deployment.yaml.new istio-cilium-helm/charts/pilot/templates/deployment.yaml
Configure Istio’s sidecar injection to set up the transparent proxy mode (TPROXY) required by Cilium’s proxy filters:
$ sed -e 's,#interceptionMode: .*,interceptionMode: TPROXY,' \
< istio-cilium-helm/templates/configmap.yaml \
> istio-cilium-helm/templates/configmap.yaml.new && \
mv istio-cilium-helm/templates/configmap.yaml.new istio-cilium-helm/templates/configmap.yaml
Modify the Istio sidecar injection template to use Cilium’s proxy Docker images and mount Cilium’s API Unix domain sockets into each sidecar, so that Cilium’s Envoy filters can query the Cilium agent for policy configuration:
$ curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/cilium-kube-inject.awk > cilium-kube-inject.awk
$ awk -f cilium-kube-inject.awk \
< istio-cilium-helm/templates/sidecar-injector-configmap.yaml \
> istio-cilium-helm/templates/sidecar-injector-configmap.yaml.new && \
mv istio-cilium-helm/templates/sidecar-injector-configmap.yaml.new istio-cilium-helm/templates/sidecar-injector-configmap.yaml
Create an Istio deployment spec, which configures the Cilium-specific variant of Pilot, and disables unused services:
$ helm template istio-cilium-helm --name istio --namespace istio-system \
--set pilot.image=docker.io/cilium/istio_pilot:1.0.0 \
--set sidecarInjectorWebhook.enabled=true \
--set global.controlPlaneSecurityEnabled=true \
--set global.mtls.enabled=true \
--set global.proxy.image=proxy_debug \
--set ingress.enabled=false \
--set egressgateway.enabled=false \
> istio-cilium.yaml
Deploy Istio onto Kubernetes:
$ kubectl create namespace istio-system
$ kubectl create -f istio-cilium.yaml
Check the progress of the deployment (every service should have an AVAILABLE count of 1):
$ kubectl get deployments -n istio-system
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
istio-citadel 1 1 1 1 1m
istio-egressgateway 1 1 1 1 1m
istio-galley 1 1 1 1 1m
istio-ingressgateway 1 1 1 1 1m
istio-pilot 1 1 1 1 1m
istio-policy 1 1 1 1 1m
istio-sidecar-injector 1 1 1 1 1m
istio-statsd-prom-bridge 1 1 1 1 1m
istio-telemetry 1 1 1 1 1m
prometheus 1 1 1 1 1m
Once all Istio pods are ready, we are ready to install the demo application.
Step 3: Deploy the Bookinfo Application V1¶
Now that we have Cilium and Istio deployed, we can deploy version v1 of the services of the Istio Bookinfo sample application.
The BookInfo application is broken into four separate microservices:
- productpage. The productpage microservice calls the details and reviews microservices to populate the page.
- details. The details microservice contains book information.
- reviews. The reviews microservice contains book reviews. It also calls the ratings microservice.
- ratings. The ratings microservice contains book ranking information that accompanies a book review.
In this demo, each specific version of each microservice is deployed into Kubernetes using separate YAML files which define the following (a condensed sketch follows the list):
- A Kubernetes Service.
- A Kubernetes Deployment specifying the microservice’s pods, specific to each service version.
- A Cilium Network Policy limiting the traffic to the microservice, specific to each service version.
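Condensed, each such file bundles the three resources roughly as follows (shown for a hypothetical reviews v1; the image and ports are illustrative, and the linked files below contain the real definitions):
apiVersion: v1
kind: Service
metadata:
  name: reviews
spec:
  ports:
  - port: 9080
  selector:
    app: reviews
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reviews
      version: v1
  template:
    metadata:
      labels:
        app: reviews
        version: v1
    spec:
      containers:
      - name: reviews
        image: istio/examples-bookinfo-reviews-v1   # illustrative image
---
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: reviews-v1
spec:
  endpointSelector:
    matchLabels:
      "k8s:app": reviews
      "k8s:version": v1
  ingress:
  - fromEndpoints:
    - matchLabels:
        "k8s:app": productpage
    toPorts:
    - ports:
      - port: "9080"
        protocol: TCP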

First create a policy to explicitly allow the sidecar proxies to access the Istio services while the pods are initializing:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/istio-sidecar-init-policy.yaml
ciliumnetworkpolicy "istio-sidecar" created
Create an Istio ingress gateway for the productpage service:
$ curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/bookinfo-productpage-ingress.yaml | \
  istioctl create -f -
Created config gateway/default/productpage at revision ...
Created config virtual-service/default/productpage at revision ...
To package the Istio sidecar proxy and generate final YAML specifications, run:
$ for service in productpage-service productpage-v1 details-v1 reviews-v1; do \
    curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/bookinfo-${service}.yaml | \
    istioctl kube-inject -f - | \
    kubectl create --validate=false -f - ; done
service "productpage" created
ciliumnetworkpolicy "productpage-v1" created
deployment "productpage-v1" created
service "details" created
ciliumnetworkpolicy "details-v1" created
deployment "details-v1" created
service "reviews" created
ciliumnetworkpolicy "reviews-v1" created
deployment "reviews-v1" created
Check the progress of the deployment (every service should have an AVAILABLE count of 1):
$ kubectl get deployments -n default
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
details-v1 1 1 1 1 6m
productpage-v1 1 1 1 1 6m
reviews-v1 1 1 1 1 6m
To obtain the URL to the frontend productpage service, run:
$ export PRODUCTPAGE=`minikube service istio-ingressgateway -n istio-system --url | head -n 1`
$ echo "Open URL: ${PRODUCTPAGE}/productpage"
Open that URL in your web browser and check that the application has been successfully deployed. It may take several seconds before all services become accessible in the Istio service mesh, so you may have to reload the page.
Step 4: Canary and Deploy the Reviews Service V2¶
We will now deploy version v2 of the reviews service. In addition to providing reviews from readers, reviews v2 queries a new ratings service for book ratings, and displays each rating as 1 to 5 black stars.
As a precaution, we will use Istio’s service routing feature to canary the v2 deployment to prevent breaking the end-to-end application completely if it is faulty.
Before deploying v2, to prevent any traffic from being routed to it for now, we will create these Istio route rules to route 100% of the reviews traffic to v1:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 100
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

Apply this route rule:
$ curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/route-rule-reviews-v1.yaml | \
  istioctl create -f -
Created config virtual-service/default/reviews at revision ...
Created config destination-rule/default/reviews at revision ...
Deploy the ratings v1 and reviews v2 services:
$ for service in ratings-v1 reviews-v2; do \
    curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/bookinfo-${service}.yaml | \
    istioctl kube-inject -f - | \
    kubectl create --validate=false -f - ; done
service "ratings" created
ciliumnetworkpolicy "ratings-v1" created
deployment "ratings-v1" created
ciliumnetworkpolicy "reviews-v2" created
deployment "reviews-v2" created
Check the progress of the deployment (every service should have an AVAILABLE count of 1):
$ kubectl get deployments -n default
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
details-v1 1 1 1 1 6m
productpage-v1 1 1 1 1 6m
ratings-v1 1 1 1 1 57s
reviews-v1 1 1 1 1 6m
reviews-v2 1 1 1 1 57s
Check in your web browser that no stars are appearing in the Book Reviews, even after refreshing the page several times. This indicates that all reviews are retrieved from reviews v1 and none from reviews v2.

The ratings-v1 CiliumNetworkPolicy explicitly whitelists access to the ratings API only from productpage and reviews v2:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: ratings-v1
  namespace: default
specs:
- endpointSelector:
    matchLabels:
      "k8s:app": ratings
      "k8s:version": v1
  ingress:
  - fromEndpoints:
    - matchLabels:
        "k8s:app": productpage
        "k8s:version": v1
    toPorts:
    - ports:
      - port: "9080"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/ratings/[0-9]*"
  - fromEndpoints:
    - matchLabels:
        "k8s:app": reviews
        "k8s:version": v2
    toPorts:
    - ports:
      - port: "9080"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/ratings/[0-9]*"
Check that reviews v1 is not able to access the ratings service, even if it were compromised or suffered from a bug, by running curl from within the pod:
$ export POD_REVIEWS_V1=`kubectl get pods -n default -l app=reviews,version=v1 -o jsonpath='{.items[0].metadata.name}'`
$ kubectl exec ${POD_REVIEWS_V1} -c istio-proxy -ti -- curl --connect-timeout 5 --fail http://ratings:9080/ratings/0
curl: (22) The requested URL returned error: 503 Service Unavailable
command terminated with exit code 22
Update the Istio route rule to send 50% of reviews traffic to v1 and 50% to v2:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 50
    - destination:
        host: reviews
        subset: v2
      weight: 50

Apply this route rule:
$ curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/route-rule-reviews-v1-v2.yaml | \
  istioctl replace -f -
Updated config virtual-service/default/reviews to revision ...
Check in your web browser that stars are appearing in the Book Reviews roughly 50% of the time. You may need to refresh the page several times to observe this. Queries to reviews v2 result in reviews containing ratings displayed as black stars:

Finally, update the route rule to send 100% of reviews traffic to v2:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
      weight: 100

Apply this route rule:
$ curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/route-rule-reviews-v2.yaml | \
  istioctl replace -f -
Updated config virtual-service/default/reviews to revision ...
Refresh the product page in your web browser several times to verify that stars are now appearing in the Book Reviews on every page refresh. All the reviews are now retrieved from reviews v2 and none from reviews v1.
Step 5: Deploy the Product Page Service V2¶
We will now deploy version v2 of the productpage service, which brings two changes:
- It is deployed with a more restrictive CiliumNetworkPolicy, which restricts access to a subset of the HTTP URLs, at Layer-7.
- It implements a new authentication audit log that is written into Kafka.

The policy for v1 currently allows read access to the full HTTP REST API, under the /api/v1 HTTP URI path:
- /api/v1/products : Returns the list of books and their details.
- /api/v1/products/<id> : Returns details about a specific book.
- /api/v1/products/<id>/reviews : Returns reviews for a specific book.
- /api/v1/products/<id>/ratings : Returns ratings for a specific book.
Check that the full REST API is currently accessible in v1 and returns valid JSON data:
$ export PRODUCTPAGE=`minikube service istio-ingressgateway -n istio-system --url | head -n 1`
$ for APIPATH in /api/v1/products /api/v1/products/0 /api/v1/products/0/reviews /api/v1/products/0/ratings; do echo ; curl -s -S "${PRODUCTPAGE}${APIPATH}" ; echo ; done
[{"descriptionHtml": "<a href=\"https://en.wikipedia.org/wiki/The_Comedy_of_Errors\">Wikipedia Summary</a>: The Comedy of Errors is one of <b>William Shakespeare's</b> early plays. It is his shortest and one of his most farcical comedies, with a major part of the humour coming from slapstick and mistaken identity, in addition to puns and word play.", "id": 0, "title": "The Comedy of Errors"}]
{"publisher": "PublisherA", "language": "English", "author": "William Shakespeare", "id": 0, "ISBN-10": "1234567890", "ISBN-13": "123-1234567890", "year": 1595, "type": "paperback", "pages": 200}
{"reviews": [{"reviewer": "Reviewer1", "rating": {"color": "black", "stars": 5}, "text": "An extremely entertaining play by Shakespeare. The slapstick humour is refreshing!"}, {"reviewer": "Reviewer2", "rating": {"color": "black", "stars": 4}, "text": "Absolutely fun and entertaining. The play lacks thematic depth when compared to other plays by Shakespeare."}], "id": "0"}
{"ratings": {"Reviewer2": 4, "Reviewer1": 5}, "id": 0}
We realized that the REST API to get the book reviews and ratings was meant only for consumption by other internal services, and will be blocked from external clients using the updated Layer-7 CiliumNetworkPolicy in productpage v2, i.e. only the /api/v1/products and /api/v1/products/<id> HTTP URLs will be whitelisted:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: productpage-v2
  namespace: default
specs:
- endpointSelector:
    matchLabels:
      "k8s:app": productpage
      "k8s:version": v2
  ingress:
  - toPorts:
    - ports:
      - port: "9080"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/"
        - method: GET
          path: "/index.html"
        - method: POST
          path: "/login"
        - method: GET
          path: "/logout"
        - method: GET
          path: "/productpage"
        - method: GET
          path: "/api/v1/products"
        - method: GET
          path: "/api/v1/products/[0-9]*"
        # - method: GET
        #   path: "/api/v1/products/[0-9]*/reviews"
        # - method: GET
        #   path: "/api/v1/products/[0-9]*/ratings"
Because productpage v2 sends messages into Kafka, we must first deploy a Kafka broker:
$ curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/kafka-v1-destrule.yaml | \
  istioctl create -f -
Created config destination-rule/default/kafka-disable-mtls at revision ...
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/kafka-v1.yaml
service "kafka" created
ciliumnetworkpolicy "kafka-authaudit" created
statefulset "kafka-v1" created
ciliumnetworkpolicy "kafka-from-init" created
Wait until the kafka-v1-0 pod is ready, i.e. until it has a READY count of 1/1:
$ kubectl get pods -n default -l app=kafka
NAME READY STATUS RESTARTS AGE
kafka-v1-0 1/1 Running 0 21m
Create the authaudit Kafka topic, which will be used by productpage v2:
$ kubectl exec kafka-v1-0 -c kafka -- bash -c '/opt/kafka_2.11-0.10.1.0/bin/kafka-topics.sh --zookeeper localhost:2181/kafka --create --topic authaudit --partitions 1 --replication-factor 1'
Created topic "authaudit".
We are now ready to deploy productpage v2.
Create the productpage v2 service and its updated CiliumNetworkPolicy, and delete productpage v1:
$ curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/bookinfo-productpage-v2.yaml | \
  istioctl kube-inject -f - | \
  kubectl create --validate=false -f -
ciliumnetworkpolicy "productpage-v2" created
deployment "productpage-v2" created
$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/bookinfo-productpage-v1.yaml
ciliumnetworkpolicy "productpage-v1" deleted
deployment "productpage-v1" deleted
productpage v2 implements authorization audit logging. On every user login or logout, it produces into the Kafka topic authaudit a JSON-formatted message which contains the following information:
- event: login or logout
- username
- client IP address
- timestamp
To observe the Kafka messages sent by productpage, we will run an additional authaudit-logger service. This service fetches and prints out all messages from the authaudit Kafka topic. Start this service:
$ curl -s https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-istio/authaudit-logger-v1.yaml | \
  istioctl kube-inject -f - | \
  kubectl apply --validate=false -f -
deployment "authaudit-logger-v1" created
Check the progress of the deployment (every service should have an AVAILABLE count of 1):
$ kubectl get deployments -n default
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
authaudit-logger-v1 1 1 1 1 20s
details-v1 1 1 1 1 22m
productpage-v2 1 1 1 1 4m
ratings-v1 1 1 1 1 19m
reviews-v1 1 1 1 1 22m
reviews-v2 1 1 1 1 19m
Check that the product REST API is still accessible, and that Cilium now denies at Layer-7 any access to the reviews and ratings REST API:
$ export PRODUCTPAGE=`minikube service istio-ingressgateway -n istio-system --url | head -n 1`
$ for APIPATH in /api/v1/products /api/v1/products/0 /api/v1/products/0/reviews /api/v1/products/0/ratings; do echo ; curl -s -S "${PRODUCTPAGE}${APIPATH}" ; echo ; done
[{"descriptionHtml": "<a href=\"https://en.wikipedia.org/wiki/The_Comedy_of_Errors\">Wikipedia Summary</a>: The Comedy of Errors is one of <b>William Shakespeare's</b> early plays. It is his shortest and one of his most farcical comedies, with a major part of the humour coming from slapstick and mistaken identity, in addition to puns and word play.", "id": 0, "title": "The Comedy of Errors"}]
{"publisher": "PublisherA", "language": "English", "author": "William Shakespeare", "id": 0, "ISBN-10": "1234567890", "ISBN-13": "123-1234567890", "year": 1595, "type": "paperback", "pages": 200}
Access denied
Access denied
This demonstrated that requests to the /api/v1/products/<id>/reviews and /api/v1/products/<id>/ratings URIs now result in Cilium returning HTTP 403 Forbidden responses.
Every login and logout on the product page will result in a line in this service’s log:
$ export POD_LOGGER_V1=`kubectl get pods -n default -l app=authaudit-logger,version=v1 -o jsonpath='{.items[0].metadata.name}'`
$ kubectl logs ${POD_LOGGER_V1} -c authaudit-logger
...
{"timestamp": "2017-12-04T09:34:24.341668", "remote_addr": "10.15.28.238", "event": "login", "user": "richard"}
{"timestamp": "2017-12-04T09:34:40.943772", "remote_addr": "10.15.28.238", "event": "logout", "user": "richard"}
{"timestamp": "2017-12-04T09:35:03.096497", "remote_addr": "10.15.28.238", "event": "login", "user": "gilfoyle"}
{"timestamp": "2017-12-04T09:35:08.777389", "remote_addr": "10.15.28.238", "event": "logout", "user": "gilfoyle"}
As you can see, the user-identifiable information sent by productpage in every Kafka message is sensitive, so access to this Kafka topic must be protected using Cilium. The CiliumNetworkPolicy configured on the Kafka broker enforces that:
- only productpage v2 is allowed to produce messages into the authaudit topic;
- only authaudit-logger can fetch messages from this topic;
- no service can access any other topic.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: kafka-authaudit
specs:
- endpointSelector:
    matchLabels:
      "k8s:app": kafka
  ingress:
  - fromEndpoints:
    - matchLabels:
        "k8s:app": productpage
        "k8s:version": v2
    toPorts:
    - ports:
      - port: "9092"
        protocol: TCP
      rules:
        kafka:
        - apiKey: "produce"
          topic: "authaudit"
        - apiKey: "apiversions"
        - apiKey: "metadata"
        - apiKey: "heartbeat"
  - fromEndpoints:
    - matchLabels:
        app: kafka
  - fromEndpoints:
    - matchLabels:
        "k8s:app": authaudit-logger
    toPorts:
    - ports:
      - port: "9092"
        protocol: TCP
      rules:
        kafka:
        - apiKey: "fetch"
          topic: "authaudit"
        - apiKey: "apiversions"
        - apiKey: "metadata"
        - apiKey: "findcoordinator"
        - apiKey: "joingroup"
        - apiKey: "leavegroup"
        - apiKey: "syncgroup"
        - apiKey: "offsets"
        - apiKey: "offsetcommit"
        - apiKey: "offsetfetch"
        - apiKey: "heartbeat"
Check that Cilium prevents the authaudit-logger service from writing into the authaudit topic (enter a message followed by ENTER, e.g. test message):
$ export POD_LOGGER_V1=`kubectl get pods -n default -l app=authaudit-logger,version=v1 -o jsonpath='{.items[0].metadata.name}'`
$ kubectl exec ${POD_LOGGER_V1} -c authaudit-logger -ti -- /opt/kafka_2.11-0.10.1.0/bin/kafka-console-producer.sh --broker-list=kafka:9092 --topic=authaudit
test message
[2017-12-07 02:13:47,020] ERROR Error when sending message to topic authaudit with key: null, value: 12 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [authaudit]
This demonstrated that Cilium sent a response with an authorization error for any Produce request from this service.
Create another topic named credit-card-payments, meant to transmit highly-sensitive credit card payment requests:
$ kubectl exec kafka-v1-0 -c kafka -- bash -c '/opt/kafka_2.11-0.10.1.0/bin/kafka-topics.sh --zookeeper localhost:2181/kafka --create --topic credit-card-payments --partitions 1 --replication-factor 1'
Created topic "credit-card-payments".
Check that Cilium prevents the authaudit-logger service from fetching messages from this topic:
$ export POD_LOGGER_V1=`kubectl get pods -n default -l app=authaudit-logger,version=v1 -o jsonpath='{.items[0].metadata.name}'`
$ kubectl exec ${POD_LOGGER_V1} -c authaudit-logger -ti -- /opt/kafka_2.11-0.10.1.0/bin/kafka-console-consumer.sh --bootstrap-server=kafka:9092 --topic=credit-card-payments
[2017-12-07 03:08:54,513] WARN Not authorized to read from topic credit-card-payments. (org.apache.kafka.clients.consumer.internals.Fetcher)
[2017-12-07 03:08:54,517] ERROR Error processing message, terminating consumer process: (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [credit-card-payments]
Processed a total of 0 messages
This demonstrated that Cilium sent a response with an authorization error for any Fetch request from this service for any topic other than authaudit.
Step 6: Clean Up¶
You have now installed Cilium and Istio, deployed a demo app, and tested both Cilium’s L3-L7 network security policies and Istio’s service route rules. To clean up, run:
$ minikube delete
After this, you can re-run the tutorial from Step 0.
Getting Started Securing Kafka¶
This document serves as an introduction to using Cilium to enforce Kafka-aware security policies. It is a detailed walk-through of getting a single-node Cilium environment running on your machine. It is designed to take 15-30 minutes.
If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.
The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.
Step 0: Install kubectl & minikube¶
- Install kubectl version >= 1.7.0 as described in the Kubernetes Docs.
- Install one of the hypervisors supported by minikube.
- Install minikube >= 0.22.3 as described on minikube’s github page.
With the default Docker runtime:
$ minikube start --network-plugin=cni --extra-config=kubelet.network-plugin=cni --memory=5120
With the CRI-O container runtime:
$ minikube start --network-plugin=cni --container-runtime=cri-o --extra-config=kubelet.network-plugin=cni --memory=5120
After minikube has finished setting up your new Kubernetes cluster, you can check the status of the cluster by running kubectl get cs:
$ kubectl get cs
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-0 Healthy {"health": "true"}
- Install etcd as a dependency of cilium in minikube by running:
$ kubectl create -n kube-system -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/addons/etcd/standalone-etcd.yaml
service "etcd-cilium" created
statefulset.apps "etcd-cilium" created
To check that all pods are Running and 100% ready, including kube-dns and etcd-cilium-0, run:
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-cilium-0 1/1 Running 0 1m
kube-system etcd-minikube 1/1 Running 0 3m
kube-system kube-addon-manager-minikube 1/1 Running 0 4m
kube-system kube-apiserver-minikube 1/1 Running 0 3m
kube-system kube-controller-manager-minikube 1/1 Running 0 3m
kube-system kube-dns-86f4d74b45-lhzfv 3/3 Running 0 4m
kube-system kube-proxy-tcd7h 1/1 Running 0 4m
kube-system kube-scheduler-minikube 1/1 Running 0 4m
kube-system storage-provisioner 1/1 Running 0 4m
If you see output similar to this, you are ready to proceed to the next step.
Note
The output might differ between minikube versions; you should expect all pods to be in the READY / Running state before continuing.
Step 1: Install Cilium¶
The next step is to install Cilium into your Kubernetes cluster.
Cilium installation leverages the Kubernetes Daemon Set
abstraction, which will deploy one Cilium pod per cluster node. This
Cilium pod will run in the kube-system
namespace along with all
other system relevant daemons and services. The Cilium pod will run
both the Cilium agent and the Cilium CNI plugin.
Choose the installation instructions for the environment in which you are deploying Cilium.
Docker Based¶
CRI-O Based¶
Step 2: Deploy the Demo Application¶
Now that we have Cilium deployed and kube-dns
operating correctly we can
deploy our demo Kafka application. Since our first demo of Cilium + HTTP-aware security
policies was Star Wars-themed we decided to do the same for Kafka. While the
HTTP-aware Cilium Star Wars demo
showed how the Galactic Empire used HTTP-aware security policies to protect the Death Star from the
Rebel Alliance, this Kafka demo shows how the lack of Kafka-aware security policies allowed the
Rebels to steal the Death Star plans in the first place.
Kafka is a powerful platform for passing datastreams between different components of an application. A cluster of “Kafka brokers” connect nodes that “produce” data into a data stream, or “consume” data from a datastream. Kafka refers to each datastream as a “topic”. Because scalable and highly-available Kafka clusters are non-trivial to run, the same cluster of Kafka brokers often handles many different topics at once (read this Introduction to Kafka for more background).
In our simple example, the Empire uses a Kafka cluster to handle two different topics:
- empire-announce : Used to broadcast announcements to sites spread across the galaxy
- deathstar-plans : Used by a small group of sites coordinating on building the ultimate battlestation.
To keep the setup small, we will just launch a small number of pods to represent this setup:
- kafka-broker : A single pod running Kafka and Zookeeper representing the Kafka cluster (label app=kafka).
- empire-hq : A pod representing the Empire’s Headquarters, which is the only pod that should produce messages to empire-announce or deathstar-plans (label app=empire-hq).
- empire-backup : A secure backup facility located in Scarif , which is allowed to “consume” from the secret deathstar-plans topic (label app=empire-backup).
- empire-outpost-8888 : A random outpost in the empire. It needs to “consume” messages from the empire-announce topic (label app=empire-outpost).
- empire-outpost-9999 : Another random outpost in the empire that “consumes” messages from the empire-announce topic (label app=empire-outpost).
All pods other than kafka-broker are Kafka clients, which need access to the kafka-broker container on TCP port 9092 in order to send Kafka protocol messages.

The file kafka-sw-app.yaml contains a Kubernetes Deployment for each of the pods described above, as well as a Kubernetes Service for both Kafka and Zookeeper.
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-kafka/kafka-sw-app.yaml
deployment "kafka-broker" created
deployment "zookeeper" created
service "zook" created
service "kafka-service" created
deployment "empire-hq" created
deployment "empire-outpost-8888" created
deployment "empire-outpost-9999" created
deployment "empire-backup" created
Kubernetes will deploy the pods and service in the background. Running kubectl get svc,pods will inform you about the progress of the operation. Each pod will go through several states until it reaches Running, at which point the setup is ready.
$ kubectl get svc,pods
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kafka-service ClusterIP None <none> 9092/TCP 2m
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 10m
zook ClusterIP 10.97.250.131 <none> 2181/TCP 2m
NAME READY STATUS RESTARTS AGE
empire-backup-6f4567d5fd-gcrvg 1/1 Running 0 2m
empire-hq-59475b4b64-mrdww 1/1 Running 0 2m
empire-outpost-8888-78dffd49fb-tnnhf 1/1 Running 0 2m
empire-outpost-9999-7dd9fc5f5b-xp6jw 1/1 Running 0 2m
kafka-broker-b874c78fd-jdwqf 1/1 Running 0 2m
zookeeper-85f64b8cd4-nprck 1/1 Running 0 2m
Step 3: Setup Client Terminals¶
First we will open a set of windows to represent the different Kafka clients discussed above. For consistency, we recommend opening them in the pattern shown in the image below, but this is optional.

In each window, copy-paste the corresponding command below to open a shell inside that pod.
empire-hq terminal:
$ HQ_POD=$(kubectl get pods -l app=empire-hq -o jsonpath='{.items[0].metadata.name}') && kubectl exec -it $HQ_POD -- sh -c "PS1=\"empire-hq $\" /bin/bash"
empire-backup terminal:
$ BACKUP_POD=$(kubectl get pods -l app=empire-backup -o jsonpath='{.items[0].metadata.name}') && kubectl exec -it $BACKUP_POD -- sh -c "PS1=\"empire-backup $\" /bin/bash"
outpost-8888 terminal:
$ OUTPOST_8888_POD=$(kubectl get pods -l outpostid=8888 -o jsonpath='{.items[0].metadata.name}') && kubectl exec -it $OUTPOST_8888_POD -- sh -c "PS1=\"outpost-8888 $\" /bin/bash"
outpost-9999 terminal:
$ OUTPOST_9999_POD=$(kubectl get pods -l outpostid=9999 -o jsonpath='{.items[0].metadata.name}') && kubectl exec -it $OUTPOST_9999_POD -- sh -c "PS1=\"outpost-9999 $\" /bin/bash"
Step 4: Test Basic Kafka Produce & Consume¶
First, let’s start the consumer clients listening to their respective Kafka topics. All of the consumer commands below will hang intentionally, waiting to print data they consume from the Kafka topic:
In the empire-backup window, start listening on the top-secret deathstar-plans topic:
$ ./kafka-consume.sh --topic deathstar-plans
In the outpost-8888 window, start listening to empire-announcement:
$ ./kafka-consume.sh --topic empire-announce
Do the same in the outpost-9999 window:
$ ./kafka-consume.sh --topic empire-announce
Now from the empire-hq, first produce a message to the empire-announce topic:
$ echo "Happy 40th Birthday to General Tagge" | ./kafka-produce.sh --topic empire-announce
This message will be posted to the empire-announce topic, and will show up in both the outpost-8888 and outpost-9999 windows, which consume that topic. It will not show up in empire-backup.
empire-hq can also post a version of the top-secret deathstar plans to the deathstar-plans topic:
$ echo "deathstar reactor design v3" | ./kafka-produce.sh --topic deathstar-plans
This message shows up in the empire-backup window, but not for the outposts.
Congratulations, Kafka is working as expected :)
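For context, the kafka-produce.sh and kafka-consume.sh helpers used above are convenience scripts baked into the demo containers. Conceptually they behave like the standard Kafka console tools; a rough equivalent (an assumption about the scripts, not their actual contents, and exact flags depend on the Kafka version in the image) would be:
$ echo "Happy 40th Birthday to General Tagge" | kafka-console-producer.sh --broker-list kafka-service:9092 --topic empire-announce
$ kafka-console-consumer.sh --bootstrap-server kafka-service:9092 --topic empire-announce
Either way, every client speaks plain Kafka protocol to kafka-service on TCP port 9092, which is exactly the traffic the Cilium policy in Step 6 will inspect.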
Step 5: The Danger of a Compromised Kafka Client¶
But what if a rebel spy gains access to any of the remote outposts that act as Kafka clients? Since every client has access to the Kafka broker on port 9092, it can do some bad stuff. For starters, the outpost container can actually switch roles from a consumer to a producer, sending “malicious” data to all other consumers on the topic.
To prove this, kill the existing kafka-consume.sh command in the outpost-9999 window by typing control-C and instead run:
$ echo "Vader Booed at Empire Karaoke Party" | ./kafka-produce.sh --topic empire-announce
Uh oh! Outpost-8888 and all of the other outposts in the empire have now received this fake announcement.
But even nastier from a security perspective is that the outpost container can access any topic on the kafka-broker.
In the outpost-9999 container, run:
$ ./kafka-consume.sh --topic deathstar-plans
"deathstar reactor design v3"
We see that any outpost can actually access the secret deathstar plans. Now we know how the rebels got access to them!
Step 6: Securing Access to Kafka with Cilium¶
Obviously, it would be much more secure to limit each pod’s access to the Kafka broker to be least privilege (i.e., only what is needed for the app to operate correctly and nothing more).
We can do that with the following Cilium security policy. As with Cilium HTTP policies, we can write policies that identify pods by labels, and then limit the traffic in/out of this pod. In this case, we’ll create a policy that identifies the exact traffic that should be allowed to reach the Kafka broker, and deny the rest.
As an example, a policy could limit containers with label app=empire-outpost to only be able to consume topic empire-announce, but would block any attempt by a compromised container (e.g., empire-outpost-9999) to produce to empire-announce or consume from deathstar-plans.

Here is the CiliumNetworkPolicy rule that limits access of pods with label app=empire-outpost to only consume on topic empire-announce:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "enable outposts to consume empire-announce"
metadata:
name: "rule2"
spec:
endpointSelector:
matchLabels:
app: kafka
ingress:
- fromEndpoints:
- matchLabels:
app: empire-outpost
toPorts:
- ports:
- port: "9092"
protocol: TCP
rules:
kafka:
- role: "consume"
topic: "empire-announce"
A CiliumNetworkPolicy contains a list of rules that define allowed requests, meaning that requests that do not match any rules are denied as invalid.
The above rule applies to inbound (i.e., “ingress”) connections to kafka-broker pods (as indicated by “app: kafka” in the “endpointSelector” section). The rule will apply to connections from pods with label “app: empire-outpost” as indicated by the “fromEndpoints” section. The rule explicitly matches Kafka connections destined to TCP 9092, and allows consume/produce actions on the listed topics; in this case, consuming from topic empire-announce.
The full policy adds two additional rules that permit the legitimate “produce” (topic empire-announce and topic deathstar-plans) from empire-hq and the legitimate consume (topic = “deathstar-plans”) from empire-backup. The full policy can be reviewed by opening the URL in the command below in a browser.
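As a rough sketch (the rule name here is illustrative; the linked kafka-sw-security-policy.yaml is authoritative), those two additional rules follow the same pattern as the rule above:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "sketch: allow empire-hq to produce and empire-backup to consume deathstar-plans"
metadata:
  name: "rules-1-and-3-sketch"
spec:
  endpointSelector:
    matchLabels:
      app: kafka
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: empire-hq
    toPorts:
    - ports:
      - port: "9092"
        protocol: TCP
      rules:
        kafka:
        - role: "produce"
          topic: "empire-announce"
        - role: "produce"
          topic: "deathstar-plans"
  - fromEndpoints:
    - matchLabels:
        app: empire-backup
    toPorts:
    - ports:
      - port: "9092"
        protocol: TCP
      rules:
        kafka:
        - role: "consume"
          topic: "deathstar-plans"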
Apply this Kafka-aware network security policy using kubectl in the main window:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-kafka/kafka-sw-security-policy.yaml
If we then again try to produce a message from outpost-9999 to empire-announce, it is denied. Type control-c and then run:
$ echo "Vader Trips on His Own Cape" | ./kafka-produce.sh --topic empire-announce
>>[2018-04-10 23:50:34,638] ERROR Error when sending message to topic empire-announce with key: null, value: 27 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [empire-announce]
This is because the policy does not allow messages with role = “produce” for topic “empire-announce” from containers with label app = empire-outpost. It’s worth noting that we don’t simply drop the message (which could easily be confused with a network error), but rather we respond with the Kafka “access denied” error (similar to how HTTP would return a 403 error code).
Likewise, if the outpost container ever tries to consume from topic deathstar-plans, it is denied, as role = consume is only allowed for topic empire-announce.
To test, from the outpost-9999 terminal, run:
$ ./kafka-consume.sh --topic deathstar-plans
[2018-04-10 23:51:12,956] WARN Error while fetching metadata with correlation id 2 : {deathstar-plans=TOPIC_AUTHORIZATION_FAILED} (org.apache.kafka.clients.NetworkClient)
This is blocked as well, thanks to the Cilium network policy. Imagine how different things would have been if the empire had been using Cilium from the beginning!
Step 7: Clean Up¶
You have now installed Cilium, deployed a demo app, and tested L7 Kafka-aware network security policies. To clean up, run:
$ minikube delete
After this, you can re-run the tutorial from Step 1.
Getting Started Securing gRPC¶
This document serves as an introduction to using Cilium to enforce gRPC-aware security policies. It is a detailed walk-through of getting a single-node Cilium environment running on your machine. It is designed to take 15-30 minutes.
If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.
The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.
Step 0: Install kubectl & minikube¶
- Install kubectl version >= 1.7.0 as described in the Kubernetes Docs.
- Install one of the hypervisors supported by minikube.
- Install minikube >= 0.22.3 as described on minikube’s github page.
Then boot a minikube cluster; use the second command instead if you want to use the CRI-O container runtime:
$ minikube start --network-plugin=cni --extra-config=kubelet.network-plugin=cni --memory=5120
$ minikube start --network-plugin=cni --container-runtime=cri-o --extra-config=kubelet.network-plugin=cni --memory=5120
After minikube has finished setting up your new Kubernetes cluster, you can check the status of the cluster by running kubectl get cs:
$ kubectl get cs
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-0 Healthy {"health": "true"}
- Install etcd as a dependency of cilium in minikube by running:
$ kubectl create -n kube-system -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/addons/etcd/standalone-etcd.yaml
service "etcd-cilium" created
statefulset.apps "etcd-cilium" created
To check that all pods are Running and 100% ready, including kube-dns and etcd-cilium-0, run:
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-cilium-0 1/1 Running 0 1m
kube-system etcd-minikube 1/1 Running 0 3m
kube-system kube-addon-manager-minikube 1/1 Running 0 4m
kube-system kube-apiserver-minikube 1/1 Running 0 3m
kube-system kube-controller-manager-minikube 1/1 Running 0 3m
kube-system kube-dns-86f4d74b45-lhzfv 3/3 Running 0 4m
kube-system kube-proxy-tcd7h 1/1 Running 0 4m
kube-system kube-scheduler-minikube 1/1 Running 0 4m
kube-system storage-provisioner 1/1 Running 0 4m
If you see output similar to this, you are ready to proceed to the next step.
Note
The output might differ between minikube versions; in all cases, every pod should be READY and Running before you continue.
Step 1: Install Cilium¶
The next step is to install Cilium into your Kubernetes cluster.
Cilium installation leverages the Kubernetes Daemon Set abstraction, which will deploy one Cilium pod per cluster node. This Cilium pod will run in the kube-system namespace along with all other system relevant daemons and services. The Cilium pod will run both the Cilium agent and the Cilium CNI plugin.
Choose the installation instructions for the environment in which you are deploying Cilium.
Docker Based¶
CRI-O Based¶
It is important for this demo that kube-dns is working correctly. To know the status of kube-dns you can run the following command:
$ kubectl get deployment kube-dns -n kube-system
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kube-dns 1 1 1 1 13h
At least one pod should be available.
Step 2: Deploy the Demo Application¶
Now that we have Cilium deployed and kube-dns operating correctly, we can deploy our demo gRPC application. Since our first demo of Cilium + HTTP-aware security policies was Star Wars-themed, we decided to do the same for gRPC. While the HTTP-aware Cilium Star Wars demo showed how the Galactic Empire used HTTP-aware security policies to protect the Death Star from the Rebel Alliance, this gRPC demo shows how the lack of gRPC-aware security policies allowed Leia, Chewbacca, Lando, C-3PO, and R2-D2 to escape from Cloud City, which had been overtaken by empire forces.
gRPC is a high-performance RPC framework built on top of the protobuf serialization/deserialization library popularized by Google. There are gRPC bindings for many programming languages, and the efficiency of the protobuf parsing as well as the advantages of leveraging HTTP/2 as a transport make it a popular RPC framework for those building new microservices from scratch.
For those unfamiliar with the details of the movie, Leia and the other rebels are fleeing storm troopers and trying to reach the space port platform where the Millennium Falcon is parked, so they can fly out of Cloud City. However, the door to the platform is closed, and the access code has been changed. Fortunately, R2-D2 is able to access the Cloud City computer system via a public terminal, and disable this security, opening the door and letting the Rebels reach the Millennium Falcon just in time to escape.

In our example, Cloud City’s internal computer system is built as a set of gRPC-based microservices (who knew that gRPC was actually invented a long time ago, in a galaxy far, far away?).
With gRPC, each service is defined using a language independent protocol buffer definition. Here is the definition for the system used to manage doors within Cloud City:
package cloudcity;

// The door manager service definition.
service DoorManager {
  // Get human readable name of door.
  rpc GetName(DoorRequest) returns (DoorNameReply) {}
  // Find the location of this door.
  rpc GetLocation (DoorRequest) returns (DoorLocationReply) {}
  // Find out whether door is open or closed
  rpc GetStatus(DoorRequest) returns (DoorStatusReply) {}
  // Request maintenance on the door
  rpc RequestMaintenance(DoorMaintRequest) returns (DoorActionReply) {}
  // Set Access Code to Open / Lock the door
  rpc SetAccessCode(DoorAccessCodeRequest) returns (DoorActionReply) {}
}
To keep the setup small, we will just launch two pods to represent this setup:
- cc-door-mgr: A single pod running the gRPC door manager service (label app=cc-door-mgr).
- terminal-87: One of the public network access terminals scattered across Cloud City. R2-D2 plugs into terminal-87 as the rebels are desperately trying to escape. This terminal uses the gRPC client code to communicate with the door manager service (label app=public-terminal).

The file cc-door-app.yaml contains a Kubernetes Deployment for the door manager service, a Kubernetes Pod representing terminal-87, and a Kubernetes Service for the door manager service. To deploy this example app, run:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-grpc/cc-door-app.yaml
deployment "cc-door-mgr" created
service "cc-door-server" created
pod "terminal-87" created
Kubernetes will deploy the pods and service in the background. Running kubectl get svc,pods will inform you about the progress of the operation. Each pod will go through several states until it reaches Running, at which point the setup is ready.
$ kubectl get pods,svc
NAME READY STATUS RESTARTS AGE
po/cc-door-mgr-3590146619-cv4jn 1/1 Running 0 1m
po/terminal-87 1/1 Running 0 1m
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/cc-door-server 10.0.0.72 <none> 50051/TCP 1m
svc/kubernetes 10.0.0.1 <none> 443/TCP 6m
Step 3: Test Access Between gRPC Client and Server¶
First, let’s confirm that the public terminal can properly act as a client to the door service. We can test this by running a Python gRPC client for the door service that exists in the terminal-87 container.
We’ll invoke the ‘cc_door_client’ with the name of the gRPC method to call, and any parameters (in this case, the door-id):
$ kubectl exec terminal-87 -- python3 /cloudcity/cc_door_client.py GetName 1
Door name is: Spaceport Door #1
$ kubectl exec terminal-87 -- python3 /cloudcity/cc_door_client.py GetLocation 1
Door location is lat = 10.222200393676758 long = 68.87879943847656
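For reference, cc_door_client.py simply makes ordinary gRPC stub calls against the cc-door-server service on port 50051. Below is a minimal sketch of such a call; the generated module names (cloudcity_pb2, cloudcity_pb2_grpc) and the DoorRequest field name (door_id) are assumptions based on the service definition above, not the demo's actual source.
# sketch_door_client.py -- minimal gRPC call against the DoorManager service
# (requires the Python stubs generated from the cloudcity proto shown earlier)
import grpc
import cloudcity_pb2           # assumed name of the generated message module
import cloudcity_pb2_grpc      # assumed name of the generated stub module

# Connect to the door manager service exposed by the cc-door-server Service.
channel = grpc.insecure_channel('cc-door-server:50051')
stub = cloudcity_pb2_grpc.DoorManagerStub(channel)

# Call GetName for door 1; the request field name is an assumption.
reply = stub.GetName(cloudcity_pb2.DoorRequest(door_id=1))
print(reply)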
Exposing this information to public terminals seems quite useful, as it helps travelers new to Cloud City identify and locate different doors. But recall that the door service also exposes several other methods, including SetAccessCode. If access to the door manager service is protected only using traditional IP and port-based firewalling, the TCP port of the service (50051 in this example) will be wide open to allow legitimate calls like GetName and GetLocation, which also leaves more sensitive calls like SetAccessCode exposed. It is this mismatch between the coarse granularity of traditional firewalls and the fine-grained nature of gRPC calls that R2-D2 exploited to override the security and help the rebels escape.
To see this, run:
$ kubectl exec terminal-87 -- python3 /cloudcity/cc_door_client.py SetAccessCode 1 999
Successfully set AccessCode to 999
Step 4: Securing Access to a gRPC Service with Cilium¶
Once the legitimate owners of Cloud City recover the city from the empire, how can they use Cilium to plug this key security hole and block requests to SetAccessCode and GetStatus while still allowing GetName, GetLocation, and RequestMaintenance?

Since gRPC builds on top of HTTP, this can be achieved easily by understanding how a gRPC call is mapped to an HTTP URL, and then applying a Cilium HTTP-aware filter to allow public terminals to invoke only a subset of the gRPC methods exposed by the door service.
Each gRPC method is mapped to an HTTP POST call to a URL of the form /cloudcity.DoorManager/<method-name>.
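Applying that mapping to the service definition above, the five DoorManager methods correspond to the following HTTP requests:
POST /cloudcity.DoorManager/GetName
POST /cloudcity.DoorManager/GetLocation
POST /cloudcity.DoorManager/GetStatus
POST /cloudcity.DoorManager/RequestMaintenance
POST /cloudcity.DoorManager/SetAccessCode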
As a result, the following CiliumNetworkPolicy rule limits access of pods with label app=public-terminal to only invoke GetName, GetLocation, and RequestMaintenance on the door service, identified by label app=cc-door-mgr:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "L7 policy to allow public terminals to call GetName, GetLocation, and RequestMaintenance, but not GetState, or SetAccessCode on the Door Manager Service"
metadata:
name: "rule1"
spec:
endpointSelector:
matchLabels:
app: cc-door-mgr
ingress:
- fromEndpoints:
- matchLabels:
app: public-terminal
toPorts:
- ports:
- port: "50051"
protocol: TCP
rules:
http:
- method: "POST"
path: "/cloudcity.DoorManager/GetName"
- method: "POST"
path: "/cloudcity.DoorManager/GetLocation"
- method: "POST"
path: "/cloudcity.DoorManager/RequestMaintenance"
A CiliumNetworkPolicy contains a list of rules that define allowed requests, meaning that requests that do not match any rules (e.g., SetAccessCode) are denied as invalid.
The above rule applies to inbound (i.e., “ingress”) connections to cc-door-mgr pods (as indicated by app: cc-door-mgr in the “endpointSelector” section). The rule will apply to connections from pods with label app: public-terminal, as indicated by the “fromEndpoints” section. The rule explicitly matches gRPC connections destined to TCP 50051, and white-lists only the permitted URLs.
Apply this gRPC-aware network security policy using kubectl in the main window:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-grpc/cc-door-ingress-security.yaml
After this security policy is in place, access to the innocuous calls like GetLocation still works as intended:
$ kubectl exec terminal-87 -- python3 /cloudcity/cc_door_client.py GetLocation 1
Door location is lat = 10.222200393676758 long = 68.87879943847656
However, if we then again try to invoke SetAccessCode, it is denied:
$ kubectl exec terminal-87 -- python3 /cloudcity/cc_door_client.py SetAccessCode 1 999
Traceback (most recent call last):
File "/cloudcity/cc_door_client.py", line 71, in <module>
run()
File "/cloudcity/cc_door_client.py", line 53, in run
door_id=int(arg2), access_code=int(arg3)))
File "/usr/local/lib/python3.4/dist-packages/grpc/_channel.py", line 492, in __call__
return _end_unary_response_blocking(state, call, False, deadline)
File "/usr/local/lib/python3.4/dist-packages/grpc/_channel.py", line 440, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.CANCELLED, Received http2 header with status: 403)>
This is now blocked, thanks to the Cilium network policy. And notice that unlike a traditional firewall, which would just drop packets in a way indistinguishable from a network failure, because Cilium operates at the API layer it can explicitly reply with a custom HTTP 403 error, indicating that the request was intentionally denied for security reasons.
Thank goodness that the empire IT staff hadn’t had time to deploy Cilium on Cloud City’s internal network prior to the escape attempt, or things might have turned out quite differently for Leia and the other Rebels!
Step 5: Clean-Up¶
You have now installed Cilium, deployed a demo app, and tested L7 gRPC-aware network security policies. To clean up, run:
$ minikube delete
After this, you can re-run the tutorial from Step 1.
Getting Started Securing Elasticsearch¶
This document serves as an introduction for using Cilium to enforce Elasticsearch-aware security policies. It is a detailed walk-through of getting a single-node Cilium environment running on your machine. It is designed to take 15-30 minutes.
If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.
The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.
Step 0: Install kubectl & minikube¶
- Install kubectl version >= 1.7.0 as described in the Kubernetes Docs.
- Install one of the hypervisors supported by minikube.
- Install minikube >= 0.22.3 as described on minikube’s github page.
Then boot a minikube cluster; use the second command instead if you want to use the CRI-O container runtime:
$ minikube start --network-plugin=cni --extra-config=kubelet.network-plugin=cni --memory=5120
$ minikube start --network-plugin=cni --container-runtime=cri-o --extra-config=kubelet.network-plugin=cni --memory=5120
After minikube has finished setting up your new Kubernetes cluster, you can check the status of the cluster by running kubectl get cs:
$ kubectl get cs
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-0 Healthy {"health": "true"}
- Install etcd as a dependency of cilium in minikube by running:
$ kubectl create -n kube-system -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/addons/etcd/standalone-etcd.yaml
service "etcd-cilium" created
statefulset.apps "etcd-cilium" created
To check that all pods are Running and 100% ready, including kube-dns and etcd-cilium-0, run:
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-cilium-0 1/1 Running 0 1m
kube-system etcd-minikube 1/1 Running 0 3m
kube-system kube-addon-manager-minikube 1/1 Running 0 4m
kube-system kube-apiserver-minikube 1/1 Running 0 3m
kube-system kube-controller-manager-minikube 1/1 Running 0 3m
kube-system kube-dns-86f4d74b45-lhzfv 3/3 Running 0 4m
kube-system kube-proxy-tcd7h 1/1 Running 0 4m
kube-system kube-scheduler-minikube 1/1 Running 0 4m
kube-system storage-provisioner 1/1 Running 0 4m
If you see output similar to this, you are ready to proceed to the next step.
Note
The output might differ between minikube versions; in all cases, every pod should be READY and Running before you continue.
Step 1: Install Cilium¶
The next step is to install Cilium into your Kubernetes cluster.
Cilium installation leverages the Kubernetes Daemon Set abstraction, which will deploy one Cilium pod per cluster node. This Cilium pod will run in the kube-system namespace along with all other system relevant daemons and services. The Cilium pod will run both the Cilium agent and the Cilium CNI plugin.
Choose the installation instructions for the environment in which you are deploying Cilium.
Docker Based¶
CRI-O Based¶
Step 2: Deploy the Demo Application¶
Following the Cilium tradition, we will use a Star Wars-inspired example. The Empire has a large scale Elasticsearch cluster which is used for storing a variety of data including:
- index: troop_logs : Stormtrooper performance logs collected from every outpost, which are used to identify and eliminate weak performers!
- index: spaceship_diagnostics : Spaceship diagnostics data collected from every spaceship, which is used for R&D and improvement of the spaceships.
Every outpost has an Elasticsearch client service to upload the Stormtroopers logs. And every spaceship has a service to upload diagnostics. Similarly, the Empire headquarters has a service to search and analyze the troop logs and spaceship diagnostics data. Before we look into the security concerns, let’s first create this application scenario in minikube.
Deploy the app using the command below, which will create:
- An elasticsearch service with the selector label component:elasticsearch and a pod running Elasticsearch.
- Three Elasticsearch clients, one each for empire-hq, outpost and spaceship.
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-es/es-sw-app.yaml
serviceaccount "elasticsearch" created
service "elasticsearch" created
replicationcontroller "es" created
role "elasticsearch" created
rolebinding "elasticsearch" created
pod "outpost" created
pod "empire-hq" created
pod "spaceship" created
$ kubectl get svc,pods
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/elasticsearch NodePort 10.111.238.254 <none> 9200:30130/TCP,9300:31721/TCP 2d
svc/etcd-cilium NodePort 10.98.67.60 <none> 32379:31079/TCP,32380:31080/TCP 9d
svc/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 9d
NAME READY STATUS RESTARTS AGE
po/empire-hq 1/1 Running 0 2d
po/es-g9qk2 1/1 Running 0 2d
po/etcd-cilium-0 1/1 Running 0 9d
po/outpost 1/1 Running 0 2d
po/spaceship 1/1 Running 0 2d
Step 3: Security Risks for Elasticsearch Access¶
For Elasticsearch clusters the least privilege security challenge is to give clients access only to particular indices, and to limit the operations each client is allowed to perform on each index. In this example, the outpost Elasticsearch clients only need access to upload troop logs, and the empire-hq client only needs search access to both indices. From a security perspective, the outposts are weak spots and susceptible to being captured by the rebels. Once compromised, the clients can be used to search and manipulate the critical data in Elasticsearch. We can simulate this attack, but first let’s run the commands for legitimate behavior for all the client services.
outpost client uploading troop logs:
$ kubectl exec outpost -- python upload_logs.py
Uploading Stormtroopers Performance Logs
created : {'_index': 'troop_logs', '_type': 'log', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, 'created': True}
spaceship uploading diagnostics:
$ kubectl exec spaceship -- python upload_diagnostics.py
Uploading Spaceship Diagnostics
created : {'_index': 'spaceship_diagnostics', '_type': 'stats', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, 'created': True}
empire-hq running search queries for logs and diagnostics:
$ kubectl exec empire-hq -- python search.py
Searching for Spaceship Diagnostics
Got 1 Hits:
{'_index': 'spaceship_diagnostics', '_type': 'stats', '_id': '1', '_score': 1.0, \
'_source': {'spaceshipid': '3459B78XNZTF', 'type': 'tiefighter', 'title': 'Engine Diagnostics', \
'stats': '[CRITICAL] [ENGINE BURN @SPEED 5000 km/s] [CHANCE 80%]'}}
Searching for Stormtroopers Performance Logs
Got 1 Hits:
{'_index': 'troop_logs', '_type': 'log', '_id': '1', '_score': 1.0, \
'_source': {'outpost': 'Endor', 'datetime': '33 ABY 4AM DST', 'title': 'Endor Corps 1: Morning Drill', \
'notes': '5100 PRESENT; 15 ABSENT; 130 CODE-RED BELOW PAR PERFORMANCE'}}
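Under the hood, these Python clients issue plain Elasticsearch REST calls, and it is these HTTP requests that the Cilium policy in the next step will match on. Based on the request paths that appear in the policy and in the error output later in this guide, they look roughly like:
PUT /troop_logs/log/1                 (outpost: upload_logs.py)
PUT /spaceship_diagnostics/stats/1    (spaceship: upload_diagnostics.py)
GET /spaceship_diagnostics/_search    (empire-hq: search.py)
GET /troop_logs/_search               (empire-hq: search.py)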
Now imagine an outpost captured by the rebels. In the commands below, the rebels first search all the indices and then manipulate the diagnostics data from a compromised outpost.
$ kubectl exec outpost -- python search.py
Searching for Spaceship Diagnostics
Got 1 Hits:
{'_index': 'spaceship_diagnostics', '_type': 'stats', '_id': '1', '_score': 1.0, \
'_source': {'spaceshipid': '3459B78XNZTF', 'type': 'tiefighter', 'title': 'Engine Diagnostics', \
'stats': '[CRITICAL] [ENGINE BURN @SPEED 5000 km/s] [CHANCE 80%]'}}
Searching for Stormtroopers Performance Logs
Got 1 Hits:
{'_index': 'troop_logs', '_type': 'log', '_id': '1', '_score': 1.0, \
'_source': {'outpost': 'Endor', 'datetime': '33 ABY 4AM DST', 'title': 'Endor Corps 1: Morning Drill', \
'notes': '5100 PRESENT; 15 ABSENT; 130 CODE-RED BELOW PAR PERFORMANCE'}}
Rebels manipulate spaceship diagnostics data so that the spaceship defects are not known to the empire-hq! (Hint: the rebels have changed the stats for the tiefighter spaceship, a change that is hard to detect but has adverse impact!)
$ kubectl exec outpost -- python update.py
Uploading Spaceship Diagnostics
{'_index': 'spaceship_diagnostics', '_type': 'stats', '_id': '1', '_score': 1.0, \
'_source': {'spaceshipid': '3459B78XNZTF', 'type': 'tiefighter', 'title': 'Engine Diagnostics', \
'stats': '[OK] [ENGINE OK @SPEED 5000 km/s]'}}
Step 4: Securing Elasticsearch Using Cilium¶

Following the least privilege security principle, we want to allow the following legitimate actions and nothing more:
- The outpost service only has upload access to index: troop_logs
- The spaceship service only has upload access to index: spaceship_diagnostics
- The empire-hq service only has search access for both indices
Fortunately, the Empire DevOps team is using Cilium for their Kubernetes cluster. Cilium provides L7 visibility and security policies to control Elasticsearch API access. Cilium follows the white-list, least privilege model for security. That is to say, a CiliumNetworkPolicy contains a list of rules that define allowed requests and any request that does not match the rules is denied.
In this example, the policy rules are defined for inbound traffic (i.e., “ingress”) connections to the elasticsearch service. Note that endpoints selected as backend pods for the service are defined by the selector labels. Selector labels use the same concept as Kubernetes to define a service. In this example, label component: elasticsearch defines the pods that are part of the elasticsearch service in Kubernetes.
In the policy file below, you will see the following rules for controlling index access and the actions performed:
- fromEndpoints with labels app:spaceship : only HTTP PUT is allowed on paths matching regex ^/spaceship_diagnostics/stats/.*$
- fromEndpoints with labels app:outpost : only HTTP PUT is allowed on paths matching regex ^/troop_logs/log/.*$
- fromEndpoints with labels app:empire-hq : only HTTP GET is allowed on paths matching regexes ^/spaceship_diagnostics/_search/??.*$ and ^/troop_logs/_search/??.*$
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: secure-empire-elasticsearch
  namespace: default
specs:
  - endpointSelector:
      matchLabels:
        component: elasticsearch
    ingress:
    - fromEndpoints:
      - matchLabels:
          app: spaceship
      toPorts:
      - ports:
        - port: "9200"
          protocol: TCP
        rules:
          http:
          - method: ^PUT$
            path: ^/spaceship_diagnostics/stats/.*$
    - fromEndpoints:
      - matchLabels:
          app: empire-hq
      toPorts:
      - ports:
        - port: "9200"
          protocol: TCP
        rules:
          http:
          - method: ^GET$
            path: ^/spaceship_diagnostics/_search/??.*$
          - method: ^GET$
            path: ^/troop_logs/_search/??.*$
    - fromEndpoints:
      - matchLabels:
          app: outpost
      toPorts:
      - ports:
        - port: "9200"
          protocol: TCP
        rules:
          http:
          - method: ^PUT$
            path: ^/troop_logs/log/.*$
  - egress:
    - toEndpoints:
      - matchExpressions:
        - key: k8s:io.kubernetes.pod.namespace
          operator: Exists
    - toEntities:
      - cluster
      - host
    endpointSelector: {}
    ingress:
    - {}
Apply this Elasticsearch-aware network security policy using kubectl:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes-es/es-sw-policy.yaml
ciliumnetworkpolicy "secure-empire-elasticsearch" created
Let’s test the security policies. First, search access is blocked for both outpost and spaceship, so from a compromised outpost the rebels will not be able to search and obtain knowledge about troops and spaceship diagnostics. Second, the outpost clients don’t have access to create or update the index spaceship_diagnostics.
$ kubectl exec outpost -- python search.py
GET http://elasticsearch:9200/spaceship_diagnostics/_search [status:403 request:0.008s]
...
...
elasticsearch.exceptions.AuthorizationException: TransportError(403, 'Access denied\r\n')
command terminated with exit code 1
$ kubectl exec outpost -- python update.py
PUT http://elasticsearch:9200/spaceship_diagnostics/stats/1 [status:403 request:0.006s]
...
...
elasticsearch.exceptions.AuthorizationException: TransportError(403, 'Access denied\r\n')
command terminated with exit code 1
We can re-run any of the below commands to show that the security policy still allows all legitimate requests (i.e., no 403 errors are returned).
$ kubectl exec outpost -- python upload_logs.py
...
$ kubectl exec spaceship -- python upload_diagnostics.py
...
$ kubectl exec empire-hq -- python search.py
...
Step 5: Clean Up¶
You have now installed Cilium, deployed a demo app, and finally deployed & tested Elasticsearch-aware network security policies. To clean up, run:
$ minikube delete
Getting Started Using Mesos/Marathon¶
This tutorial leverages Vagrant and VirtualBox to deploy Apache Mesos, Marathon and Cilium. You will run Cilium to apply a simple policy between a simulated web-service and clients. This tutorial can be run on any operating system supported by Vagrant including Linux, macOS, and Windows.
For more information on Apache Mesos and Marathon orchestration, check out the Mesos and Marathon GitHub pages, respectively.
If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.
The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.
Step 0: Install Vagrant¶
You need to run at least Vagrant version 1.8.3 or you will run into issues booting the Ubuntu 17.04 base image. You can verify by running vagrant --version.
If you don’t already have Vagrant installed, follow the Vagrant Install Instructions or see Download Vagrant for newer versions.
Step 1: Download the Cilium Source Code¶
Download the latest Cilium source code and unzip the files.
Alternatively, if you are a developer, feel free to clone the repository:
$ git clone https://github.com/cilium/cilium
Step 2: Starting a VM with Cilium¶
Open a terminal and navigate into the top of the cilium source directory. Then navigate into examples/mesos and run vagrant up:
$ cd examples/mesos
$ vagrant up
The script usually takes a few minutes depending on the speed of your internet connection. Vagrant will set up a VM, install Mesos & Marathon, run Cilium with the help of Docker compose, and start up the Mesos master and slave services. When the script completes successfully, it will print:
==> default: Creating cilium-kvstore
Creating cilium-kvstore ... done
==> default: Creating cilium ...
==> default: Creating cilium
Creating cilium ... done
==> default: Installing loopback driver...
==> default: Installing cilium-cni to /host/opt/cni/bin/ ...
==> default: Installing new /host/etc/cni/net.d/00-cilium.conf ...
==> default: Deploying Vagrant VM + Cilium + Mesos...done
$
If the script exits with an error message, do not attempt to proceed with the tutorial, as later steps will not work properly. Instead, contact us on the Cilium Slack channel.
Step 3: Accessing the VM¶
After the script has successfully completed, you can log into the VM using vagrant ssh:
$ vagrant ssh
All commands for the rest of the tutorial below should be run from inside this Vagrant VM. If you end up disconnecting from this VM, you can always reconnect by going to the examples/mesos directory and then running the command vagrant ssh.
Step 4: Confirm that Cilium is Running¶
The Cilium agent is now running and you can interact with it using the cilium CLI client. Check the status of the agent by running cilium status:
$ cilium status
KVStore: Ok Consul: 172.18.0.2:8300
ContainerRuntime: Ok docker daemon: OK
Kubernetes: Disabled
Cilium: Ok OK
NodeMonitor: Disabled
Cilium health daemon: Ok
IPv4 address pool: 3/65535 allocated
IPv6 address pool: 2/65535 allocated
Controller Status: 10/10 healthy
Proxy Status: OK, ip 10.15.0.1, port-range 10000-20000
Cluster health: 1/1 reachable (2018-06-19T15:10:28Z)
The status indicates that all necessary components are operational.
Step 5: Run Script to Start Marathon¶
Start Marathon inside the Vagrant VM:
$ ./start_marathon.sh
Starting marathon...
...
...
...
...
Done
Step 6: Simulate a Web-Server and Clients¶
Use curl to submit a task to Marathon for scheduling, with data to run the simulated web-server provided by web-server.json. The web-server simply responds to requests on a particular port.
$ curl -i -H 'Content-Type: application/json' -d @web-server.json 127.0.0.1:8080/v2/apps
You should see output similar to the following:
HTTP/1.1 201 Created
...
Marathon-Deployment-Id: [UUID]
...
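For reference, web-server.json is a standard Marathon application definition. A rough sketch of the kind of fields it contains is shown below; the values are illustrative assumptions, and the real file ships in examples/mesos:
{
  "id": "web-server",
  "cmd": "<command that starts the demo web-server>",
  "cpus": 0.1,
  "mem": 32,
  "instances": 1,
  "labels": {"id": "web-server"}
}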
Confirm that Cilium sees the new workload. The output should return the endpoint with label mesos:id=web-server and the assigned IP:
$ cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
20928 Disabled Disabled 59281 mesos:id=web-server f00d::a0f:0:0:51c0 10.15.137.206 ready
23520 Disabled Disabled 4 reserved:health f00d::a0f:0:0:5be0 10.15.162.64 ready
Test that the web-server provides an OK response:
$ export WEB_IP=`cilium endpoint list | grep web-server | awk '{print $7}'`
$ curl $WEB_IP:8181/api
OK
Run a script to create two client tasks (“good client” and “bad client”) that will attempt to access the web-server. The output of these tasks will be used to validate the Cilium network policy enforcement later in the exercise. The script will generate goodclient.json and badclient.json files for the client tasks, respectively:
$ ./generate_client_file.sh goodclient
$ ./generate_client_file.sh badclient
Then submit the client tasks to Marathon, which will generate GET /public and GET /private requests:
$ curl -i -H 'Content-Type: application/json' -d @goodclient.json 127.0.0.1:8080/v2/apps
$ curl -i -H 'Content-Type: application/json' -d @badclient.json 127.0.0.1:8080/v2/apps
You can observe the newly created endpoints in Cilium, similar to the following output:
$ cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
20928 Disabled Disabled 59281 mesos:id=web-server f00d::a0f:0:0:51c0 10.15.137.206 ready
23520 Disabled Disabled 4 reserved:health f00d::a0f:0:0:5be0 10.15.162.64 ready
37835 Disabled Disabled 15197 mesos:id=goodclient f00d::a0f:0:0:93cb 10.15.152.208 ready
51053 Disabled Disabled 5113 mesos:id=badclient f00d::a0f:0:0:c76d 10.15.34.97 ready
Marathon runs the tasks as batch jobs with stdout logged to task-specific files located in /var/lib/mesos. To simplify the retrieval of the stdout log, use the tail_client.sh script to output each of the client logs. In a new terminal, go to examples/mesos, start a new ssh session to the Vagrant VM with vagrant ssh and tail the goodclient logs:
$ ./tail_client.sh goodclient
and in a separate terminal, do the same thing with vagrant ssh and observe the badclient logs:
$ ./tail_client.sh badclient
Make sure both tail logs continuously print the result of the clients accessing the /public and /private API of the web-server:
...
---------- Test #X ----------
Request: GET /public
Reply: OK
Request: GET /private
Reply: OK
-------------------------------
...
Note that both clients are able to access the web-server and retrieve both URLs because no Cilium policy has been applied yet.
Step 7: Apply L3/L4 Policy with Cilium¶
Apply an L3/L4 policy only allowing the goodclient to access the web-server. The L3/L4 json policy looks like:
[{
    "labels": [{"key": "name", "value": "l3-l4-rule"}],
    "endpointSelector": {"matchLabels":{"id":"web-server"}},
    "ingress": [{
        "fromEndpoints": [
            {"matchLabels":{"id":"goodclient"}}
        ],
        "toPorts": [{
            "ports": [{"port": "8181", "protocol": "TCP"}]
        }]
    }]
}]
In your original terminal session, use the cilium CLI to apply the L3/L4 policy above, saved in the l3-l4-policy.json file on the VM:
$ cilium policy import l3-l4-policy.json
Revision: 1
L3/L4 Policy with Cilium and Mesos

You can observe that the policy is applied via the cilium CLI, as the POLICY ENFORCEMENT column changed from Disabled to Enabled:
$ cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
20928 Enabled Disabled 59281 mesos:id=web-server f00d::a0f:0:0:51c0 10.15.137.206 ready
23520 Disabled Disabled 4 reserved:health f00d::a0f:0:0:5be0 10.15.162.64 ready
37835 Disabled Disabled 15197 mesos:id=goodclient f00d::a0f:0:0:93cb 10.15.152.208 ready
51053 Disabled Disabled 5113 mesos:id=badclient f00d::a0f:0:0:c76d 10.15.34.97 ready
You should also observe that the goodclient logs continue to output the web-server responses, whereas the badclient requests no longer reach the web-server because of policy enforcement; its logging output will look similar to the below.
...
---------- Test #X ----------
Request: GET /public
Reply: Timeout!
Request: GET /private
Reply: Timeout!
-------------------------------
...
Remove the L3/L4 policy in order to give badclient access to the web-server again.
$ cilium policy delete --all
Revision: 2
The badclient logs should resume outputting the web-server’s responses, and Cilium is no longer configured to enforce policy:
$ cilium endpoint list
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
29898 Disabled Disabled 37948 reserved:health f00d::a0f:0:0:74ca 10.15.242.54 ready
33115 Disabled Disabled 38072 mesos:id=web-server f00d::a0f:0:0:815b 10.15.220.6 ready
38061 Disabled Disabled 46430 mesos:id=badclient f00d::a0f:0:0:94ad 10.15.0.173 ready
64189 Disabled Disabled 31645 mesos:id=goodclient f00d::a0f:0:0:fabd 10.15.152.27 ready
Step 8: Apply L7 Policy with Cilium¶
Now, apply an L7 Policy that only allows access for the goodclient to the /public API, included in the l7-policy.json file:
[{
    "labels": [{"key": "name", "value": "l7-rule"}],
    "endpointSelector": {"matchLabels":{"id":"web-server"}},
    "ingress": [{
        "fromEndpoints": [
            {"matchLabels":{"id":"goodclient"}}
        ],
        "toPorts": [{
            "ports": [{"port": "8181", "protocol": "TCP"}],
            "rules": {
                "HTTP": [{
                    "method": "GET",
                    "path": "/public"
                }]
            }
        }]
    }]
}]
Apply using the cilium CLI:
$ cilium policy import l7-policy.json
Revision: 3
L7 Policy with Cilium and Mesos

In the terminal sessions tailing the goodclient and badclient logs, check the goodclient’s log to see that /private is no longer accessible, and that the badclient’s requests still produce the same results as under the policy enforced in the previous step.
...
---------- Test #X ----------
Request: GET /public
Reply: OK
Request: GET /private
Reply: Access Denied
-------------------------------
...
(optional) Remove the policy and notice that the access to /private is unrestricted again:
$ cilium policy delete --all
Revision: 4
Step 9: Clean-Up¶
Exit the vagrant VM by typing exit in the original terminal session. When you want to tear down the Cilium + Mesos VM and destroy all local state (e.g., the VM disk image), ensure you are in the cilium/examples/mesos directory and type:
$ vagrant destroy
You can always re-create the VM using the steps described above. If instead you just want to shut down the VM but may use it later, vagrant halt default will work, and you can start it again later.
Troubleshooting¶
For assistance on any of the Getting Started Guides, please reach out and ask a question on the Cilium Slack channel.
Getting Started Using Docker Compose¶
This tutorial leverages Vagrant and VirtualBox, thus should run on any operating system supported by Vagrant, including Linux, macOS, and Windows.
If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.
The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.
Step 0: Install Vagrant¶
If you don’t already have Vagrant installed, refer to the Developer / Contributor Guide for links to installation instructions for Vagrant.
Step 1: Download the Cilium Source Code¶
Download the latest Cilium source code and unzip the files.
Alternatively, if you are a developer, feel free to clone the repository:
$ git clone https://github.com/cilium/cilium
Step 2: Starting the Docker + Cilium VM¶
Open a terminal and navigate into the top of the cilium source directory. Then navigate into examples/getting-started and run vagrant up:
$ cd examples/getting-started
$ vagrant up
The script usually takes a few minutes depending on the speed of your internet connection. Vagrant will set up a VM, install the Docker container runtime and run Cilium with the help of Docker Compose. When the script completes successfully, it will print:
==> cilium-1: Creating cilium-kvstore
==> cilium-1: Creating cilium
==> cilium-1: Creating cilium-docker-plugin
$
If the script exits with an error message, do not attempt to proceed with the tutorial, as later steps will not work properly. Instead, contact us on the Cilium Slack channel.
Step 3: Accessing the VM¶
After the script has successfully completed, you can log into the VM using vagrant ssh:
$ vagrant ssh
All commands for the rest of the tutorial below should be run from inside this Vagrant VM. If you end up disconnecting from this VM, you can always reconnect in a new terminal window just by running vagrant ssh again from the Cilium directory.
Step 4: Confirm that Cilium is Running¶
The Cilium agent is now running as a system service and you can interact with it using the cilium CLI client. Check the status of the agent by running cilium status:
$ cilium status
KVStore: Ok Consul: 172.18.0.2:8300
ContainerRuntime: Ok
Kubernetes: Disabled
Cilium: Ok OK
NodeMonitor: Listening for events on 1 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
Controller Status: 6/6 healthy
Proxy Status: OK, ip 10.15.28.238, port-range 10000-20000
Cluster health: 1/1 reachable (2018-04-05T16:08:22Z)
The status indicates that all components are operational with the Kubernetes integration currently being disabled.
Step 5: Create a Docker Network of Type Cilium¶
Cilium integrates with local container runtimes, which in the case of this demo means Docker. With Docker, native networking is handled via a component called libnetwork. In order to steer Docker to request networking of a container from Cilium, a container must be started with a network of driver type “cilium”.
With Cilium, all containers are connected to a single logical network, with isolation added not based on IP addresses but based on container labels (as we will do in the steps below). So with Docker, we simply create a single network named ‘cilium-net’ for all containers:
$ docker network create --ipv6 --subnet ::1/112 --driver cilium --ipam-driver cilium cilium-net
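You can verify that the network exists and uses the cilium driver with the standard Docker commands:
$ docker network ls
$ docker network inspect cilium-net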
Step 6: Start an Example Service with Docker¶
In this tutorial, we’ll use a container running a simple HTTP server to represent a microservice application which we will refer to as app1. As a result, we will start this container with the label “id=app1”, so we can create Cilium security policies for that service.
Use the following command to start the app1 container connected to the Docker network managed by Cilium:
$ docker run -d --name app1 --net cilium-net -l "id=app1" cilium/demo-httpd
e5723edaa2a1307e7aa7e71b4087882de0250973331bc74a37f6f80667bc5856
This has launched a container running an HTTP server which Cilium is now managing as an Endpoint. A Cilium endpoint is one or more application containers which can be addressed by an individual IP address.
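You can confirm this by running cilium endpoint list inside the VM (the same command used in the Mesos guide above); the app1 container should show up as an endpoint with its id=app1 label (prefixed with its label source) and an assigned IP:
$ cilium endpoint list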
Step 7: Apply an L3/L4 Policy With Cilium¶
When using Cilium, endpoint IP addresses are irrelevant when defining security policies. Instead, you can use the labels assigned to the containers to define security policies, which are automatically applied to any container with that label, no matter where or when it is run within a container cluster.
We’ll start with an overly simple example where we create two additional apps, app2 and app3, and we want app2 containers to be able to reach app1 containers, but app3 containers should not be allowed to reach app1 containers. Additionally, we only want to allow app1 to be reachable on port 80, but no other ports. This is a simple policy that filters only on IP address (network layer 3) and TCP port (network layer 4), so it is often referred to as an L3/L4 network security policy.
Cilium performs stateful “connection tracking”, meaning that if policy allows app2 to contact app1, it will automatically allow return packets that are part of app1 replying to app2 within the context of the same TCP/UDP connection.
L4 Policy with Cilium and Docker

We can achieve that with the following Cilium policy:
[{
    "labels": [{"key": "name", "value": "l3-rule"}],
    "endpointSelector": {"matchLabels":{"id":"app1"}},
    "ingress": [{
        "fromEndpoints": [
            {"matchLabels":{"id":"app2"}}
        ],
        "toPorts": [{
            "ports": [{"port": "80", "protocol": "TCP"}]
        }]
    }]
}]
Save this JSON to a file named l3_l4_policy.json in your VM, and apply the policy by running:
$ cilium policy import l3_l4_policy.json
Revision: 1
Step 8: Test L3/L4 Policy¶
You can now launch additional containers that represent other services attempting to access app1. Any new container with label “id=app2” will be allowed to access app1 on port 80, otherwise the network request will be dropped.
To test this out, we’ll make an HTTP request to app1 from a container with the label “id=app2” :
$ docker run --rm -ti --net cilium-net -l "id=app2" cilium/demo-client curl -m 20 http://app1
<html><body><h1>It works!</h1></body></html>
We can see that this request was successful, as we get a valid HTTP response.
Now let’s run the same HTTP request to app1 from a container that has label “id=app3”:
$ docker run --rm -ti --net cilium-net -l "id=app3" cilium/demo-client curl -m 10 http://app1
You will see no reply as all packets are dropped by the Cilium security policy. The request will time-out after 10 seconds.
So with this we see Cilium’s ability to segment containers based purely on a container-level identity label. This means that the end user can apply security policies without knowing anything about the IP address of the container or requiring some complex mechanism to ensure that containers of a particular service are assigned an IP address in a particular range.
Step 9: Apply and Test an L7 Policy with Cilium¶
In the simple scenario above, it was sufficient to either give app2 / app3 full access to app1’s API or no access at all. But to provide the strongest security (i.e., enforce least-privilege isolation) between microservices, each service that calls app1’s API should be limited to making only the set of HTTP requests it requires for legitimate operation.
For example, consider a scenario where app1 has two API calls:
- GET /public
- GET /private
Continuing with the example from above, if app2 requires access only to the GET /public API call, the L3/L4 policy alone has no visibility into the HTTP requests, and therefore would allow any HTTP request from app2 (since all HTTP is over port 80).
To see this, run:
$ docker run --rm -ti --net cilium-net -l "id=app2" cilium/demo-client curl 'http://app1/public'
{ 'val': 'this is public' }
and
$ docker run --rm -ti --net cilium-net -l "id=app2" cilium/demo-client curl 'http://app1/private'
{ 'val': 'this is private' }
Cilium is capable of enforcing HTTP-layer (i.e., L7) policies to limit what URLs app2 is allowed to reach. Here is an example policy file that extends our original policy by limiting app2 to making only a GET /public API call, but disallowing all other calls (including GET /private).
L7 Policy with Cilium and Docker

The following Cilium policy file achieves this goal:
[{
    "labels": [{"key": "name", "value": "l7-rule"}],
    "endpointSelector": {"matchLabels":{"id":"app1"}},
    "ingress": [{
        "fromEndpoints": [
            {"matchLabels":{"id":"app2"}}
        ],
        "toPorts": [{
            "ports": [{"port": "80", "protocol": "TCP"}],
            "rules": {
                "HTTP": [{
                    "method": "GET",
                    "path": "/public"
                }]
            }
        }]
    }]
}]
Create a file with these contents and name it l7_aware_policy.json. Then import this policy to Cilium by running:
$ cilium policy delete --all
Revision: 2
$ cilium policy import l7_aware_policy.json
Revision: 3
$ docker run --rm -ti --net cilium-net -l "id=app2" cilium/demo-client curl -si 'http://app1/public'
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 28
Date: Tue, 31 Oct 2017 14:30:56 GMT
Etag: "1c-54bb868cec400"
Last-Modified: Mon, 27 Mar 2017 15:58:08 GMT
Server: Apache/2.4.25 (Unix)
Content-Type: text/plain; charset=utf-8
{ 'val': 'this is public' }
and
$ docker run --rm -ti --net cilium-net -l "id=app2" cilium/demo-client curl -si 'http://app1/private'
HTTP/1.1 403 Forbidden
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 31 Oct 2017 14:31:09 GMT
Content-Length: 14
Access denied
As you can see, with Cilium L7 security policies, we are able to permit app2 to access only the required API resources on app1, thereby implementing a “least privilege” security approach for communication between microservices.
We hope you enjoyed the tutorial. Feel free to play more with the setup, read the rest of the documentation, and reach out to us on the Cilium Slack channel with any questions!
Step 10: Clean-Up¶
Exit the vagrant VM by typing exit.
When you are done with the setup and want to tear-down the Cilium + Docker VM, and destroy all local state (e.g., the VM disk image), open a terminal in the cilium/examples/getting-started directory and type:
$ vagrant destroy cilium-1
You can always re-create the VM using the steps described above.
If instead you just want to shut down the VM but may use it later, vagrant halt cilium-1 will work, and you can start it again later.
Step 1: Install Cilium¶
The next step is to install Cilium into your Kubernetes cluster.
Cilium installation leverages the Kubernetes Daemon Set abstraction, which will deploy one Cilium pod per cluster node. This Cilium pod will run in the kube-system namespace along with all other system relevant daemons and services. The Cilium pod will run both the Cilium agent and the Cilium CNI plugin.
Choose the installation instructions for the environment in which you are deploying Cilium.
Cilium, Kubernetes and CRI-O¶
Step 1: Install Cilium with CRI-O¶
The next step is to install Cilium into your Kubernetes cluster.
Cilium installation leverages the Kubernetes Daemon Set abstraction, which will deploy one Cilium pod per cluster node. This Cilium pod will run in the kube-system namespace along with all other system-relevant daemons and services. The Cilium pod will run both the Cilium agent and the Cilium CNI plugin.
To deploy Cilium, run:
Choose the manifest that matches your Kubernetes version:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.7/cilium-crio.yaml
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.8/cilium-crio.yaml
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.9/cilium-crio.yaml
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.10/cilium-crio.yaml
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.11/cilium-crio.yaml
In each case the output is the same:
configmap "cilium-config" created
secret "cilium-etcd-secrets" created
daemonset.extensions "cilium" created
clusterrolebinding.rbac.authorization.k8s.io "cilium" created
clusterrole.rbac.authorization.k8s.io "cilium" created
serviceaccount "cilium" created
Kubernetes is now deploying Cilium with its RBAC settings, ConfigMap and DaemonSet as a pod on minikube. This operation is performed in the background.
Run the following command to check the progress of the deployment:
$ kubectl get daemonsets -n kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
cilium 1 1 0 1 1 <none> 3m
kube-proxy 1 1 1 1 1 <none> 8m
Wait until the cilium DaemonSet shows a CURRENT count of 1 like above (a READY value of 0 is OK for this tutorial).
Since CRI-O does not automatically detect that a new CNI plugin has been installed, you will need to restart the CRI-O daemon for it to pick up the Cilium CNI configuration.
First make sure cilium is running:
$ kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE
cilium-mqtdz 1/1 Running 0 3m 10.0.2.15 minikube
After that you can restart CRI-O:
$ minikube ssh -- sudo systemctl restart crio
Finally, you need to restart the Cilium pod so it can re-mount /var/run/crio.sock, which was recreated by CRI-O:
$ kubectl delete -n kube-system pod cilium-mqtdz
Cilium, Kubernetes and Docker¶
Step 1: Install Cilium with Docker¶
The next step is to install Cilium into your Kubernetes cluster.
Cilium installation leverages the Kubernetes Daemon Set abstraction, which will deploy one Cilium pod per cluster node. This Cilium pod will run in the kube-system namespace along with all other system-relevant daemons and services. The Cilium pod will run both the Cilium agent and the Cilium CNI plugin.
To deploy Cilium, run:
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.7/cilium.yaml
configmap "cilium-config" created
secret "cilium-etcd-secrets" created
daemonset.extensions "cilium" created
clusterrolebinding.rbac.authorization.k8s.io "cilium" created
clusterrole.rbac.authorization.k8s.io "cilium" created
serviceaccount "cilium" created
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.8/cilium.yaml
configmap "cilium-config" created
secret "cilium-etcd-secrets" created
daemonset.extensions "cilium" created
clusterrolebinding.rbac.authorization.k8s.io "cilium" created
clusterrole.rbac.authorization.k8s.io "cilium" created
serviceaccount "cilium" created
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.9/cilium.yaml
configmap "cilium-config" created
secret "cilium-etcd-secrets" created
daemonset.extensions "cilium" created
clusterrolebinding.rbac.authorization.k8s.io "cilium" created
clusterrole.rbac.authorization.k8s.io "cilium" created
serviceaccount "cilium" created
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.10/cilium.yaml
configmap "cilium-config" created
secret "cilium-etcd-secrets" created
daemonset.extensions "cilium" created
clusterrolebinding.rbac.authorization.k8s.io "cilium" created
clusterrole.rbac.authorization.k8s.io "cilium" created
serviceaccount "cilium" created
$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.11/cilium.yaml
configmap "cilium-config" created
secret "cilium-etcd-secrets" created
daemonset.extensions "cilium" created
clusterrolebinding.rbac.authorization.k8s.io "cilium" created
clusterrole.rbac.authorization.k8s.io "cilium" created
serviceaccount "cilium" created
Kubernetes is now deploying Cilium with its RBAC settings, ConfigMap and DaemonSet as a pod on minikube. This operation is performed in the background.
Run the following command to check the progress of the deployment:
$ kubectl get daemonsets -n kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
cilium 1 1 0 1 1 <none> 3m
kube-proxy 1 1 1 1 1 <none> 8m
Wait until the cilium DaemonSet shows a CURRENT count of 1 like above (a READY value of 0 is OK for this tutorial).
The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.
Concepts¶
The goal of this document is to describe the components of the Cilium architecture, and the different models for deploying Cilium within your datacenter or cloud environment. It focuses on the higher-level understanding required to run a full Cilium deployment. You can then use the more detailed Installation Guides to understand the details of setting up Cilium.
Component Overview¶

A deployment of Cilium consists of the following components running on each Linux container node in the container cluster:
- Cilium Agent (Daemon): Userspace daemon that interacts with the container runtime and orchestration systems such as Kubernetes via Plugins to set up networking and security for containers running on the local server. Provides an API for configuring network security policies, extracting network visibility data, etc.
- Cilium CLI Client: Simple CLI client for communicating with the local Cilium Agent, for example, to configure network security or visibility policies.
- Linux Kernel BPF: Integrated capability of the Linux kernel to accept compiled bytecode that is run at various hook / trace points within the kernel. Cilium compiles BPF programs and has the kernel run them at key points in the network stack to have visibility and control over all network traffic in / out of all containers.
- Container Platform Network Plugin: Each container platform (e.g., Docker, Kubernetes) has its own plugin model for how external networking platforms integrate. In the case of Docker, each Linux node runs a process (cilium-docker) that handles each Docker libnetwork call and passes data / requests on to the main Cilium Agent.
In addition to the components that run on each Linux container host, Cilium leverages a key-value store to share data between Cilium Agents running on different nodes. The currently supported key-value stores are:
- etcd
- consul
Cilium Agent¶
The Cilium agent (cilium-agent) runs on each Linux container host. At a high-level, the agent accepts configuration that describes service-level network security and visibility policies. It then listens to events in the container runtime to learn when containers are started or stopped, and it creates custom BPF programs which the Linux kernel uses to control all network access in / out of those containers. In more detail, the agent:
- Exposes APIs to allow operations / security teams to configure security policies (see below) that control all communication between containers in the cluster. These APIs also expose monitoring capabilities to gain additional visibility into network forwarding and filtering behavior.
- Gathers metadata about each new container that is created. In particular, it queries identity metadata like container / pod labels, which are used to identify endpoints in Cilium security policies.
- Interacts with the container platform's network plugin to perform IP address management (IPAM), which controls which IPv4 and IPv6 addresses are assigned to each container. IPAM is managed by the agent in a pool shared between all plugins, which means that the Docker and CNI network plugins can run side by side, allocating from a single address pool.
- Combines its knowledge about container identity and addresses with the already configured security and visibility policies to generate highly efficient BPF programs that are tailored to the network forwarding and security behavior appropriate for each container.
- Compiles the BPF programs to bytecode using clang/LLVM and passes them to the Linux kernel to run for all packets in / out of the container’s virtual ethernet device(s).
Cilium CLI Client¶
The Cilium CLI Client (cilium) is a command-line tool that is installed along with the Cilium Agent. It gives a command-line interface to interact with all aspects of the Cilium Agent API. This includes inspecting Cilium’s state about each network endpoint (i.e., container), configuring and viewing security policies, and configuring network monitoring behavior.
Linux Kernel BPF¶
Berkeley Packet Filter (BPF) is a Linux kernel bytecode interpreter originally introduced to filter network packets, e.g. for tcpdump and socket filters. It has since been extended with additional data structures such as hashtables and arrays, as well as additional actions to support packet mangling, forwarding, encapsulation, etc. An in-kernel verifier ensures that BPF programs are safe to run and a JIT compiler converts the bytecode to CPU architecture specific instructions for native execution efficiency. BPF programs can be run at various hooking points in the kernel such as for incoming packets, outgoing packets, system calls, kprobes, etc.
BPF continues to evolve and gain additional capabilities with each new Linux release. Cilium leverages BPF to perform core datapath filtering, mangling, monitoring and redirection, and requires BPF capabilities that are present in any Linux kernel version 4.8.0 or newer. Given that 4.8.x has already been declared end of life and 4.9.x has been nominated as a stable release, we recommend running at least kernel 4.9.17 (the latest stable Linux kernel as of this writing is 4.10.x).
Cilium is capable of probing the Linux kernel for available features and will automatically make use of more recent features as they are detected.
Linux distros that focus on being a container runtime (e.g., CoreOS, Fedora Atomic) typically already ship kernels that are newer than 4.8, but even recent versions of general purpose operating systems such as Ubuntu 16.10 ship fairly recent kernels. Some Linux distributions still ship older kernels but many of them allow installing recent kernels from separate kernel package repositories.
For more detail on kernel versions, see: Linux Kernel.
Key-Value Store¶
The Key-Value (KV) Store is used for the following state:
- Policy Identities: list of labels <=> policy identity identifier
- Global Services: global service id to VIP association (optional)
- Encapsulation VTEP mapping (optional)
To simplify things in a larger deployment, the key-value store can be the same one used by the container orchestrator (e.g., Kubernetes using etcd).
Assurances¶
If Cilium loses connectivity with the KV-Store, it guarantees that:
- Normal networking operations will continue;
- If policy enforcement is enabled, the existing endpoints will still have their policies enforced, but you will lose the ability to add additional containers that belong to security identities which are unknown on the node;
- If services are enabled, you will lose the ability to add additional services / loadbalancers;
- When the connectivity is restored to the KV-Store, Cilium can take up to 5 minutes to re-sync the out-of-sync state with the KV-Store.
Cilium will keep running even if it is out-of-sync with the KV-Store.
If Cilium crashes, or the DaemonSet is accidentally deleted, the following are guaranteed:
- When running Cilium as a DaemonSet / container, with the specification files provided in the documentation Installation Guide, the endpoints / containers which are already running will not lose any connectivity, and they will keep running with the policy loaded before Cilium stopped unexpectedly.
- When running Cilium in any other way, just make sure the BPF filesystem is mounted; see Mounting the BPF FS (Optional).
Terminology¶
Labels¶
Labels are a generic, flexible and highly scalable way of addressing a large set of resources as they allow for arbitrary grouping and creation of sets. Whenever something needs to be described, addressed or selected, this is done based on labels:
- Endpoints are assigned labels as derived from the container runtime, the orchestration system, or other sources.
- Network policies select pairs of endpoints which are allowed to communicate based on labels. The policies themselves are identified by labels as well.
What is a Label?¶
A label is a pair of strings consisting of a key and value. A label can be formatted as a single string with the format key=value. The key portion is mandatory and must be unique. This is typically achieved by using the reverse domain name notation, e.g. io.cilium.mykey=myvalue. The value portion is optional and can be omitted, e.g. io.cilium.mykey.
Key names should typically consist of the character set [a-z0-9-.].
When using labels to select resources, both the key and the value must match, e.g. when a policy should be applied to all endpoints with the label my.corp.foo then the label my.corp.foo=bar will not match the selector.
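For illustration, here is a minimal (hypothetical) selector fragment in the YAML policy format shown later in this document; following the matching rule above, it selects endpoints carrying the key-only label my.corp.foo, but not endpoints labeled my.corp.foo=bar:
# Hypothetical endpointSelector fragment, shown only to illustrate the
# label matching semantics described above.
endpointSelector:
  matchLabels:
    my.corp.foo: ""   # matches the key-only label my.corp.foo
                      # does not match my.corp.foo=bar (the value differs)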
Label Source¶
A label can be derived from various sources. For example, an endpoint will derive the labels associated to the container by the local container runtime as well as the labels associated with the pod as provided by Kubernetes. As these two label namespaces are not aware of each other, this may result in conflicting label keys.
To resolve this potential conflict, Cilium prefixes all label keys with source: to indicate the source of the label when importing labels, e.g. k8s:role=frontend, container:user=joe, k8s:role=backend. This means that when you run a Docker container using docker run [...] -l foo=bar, the label container:foo=bar will appear on the Cilium endpoint representing the container. Similarly, a Kubernetes pod started with the label foo: bar will be represented with a Cilium endpoint associated with the label k8s:foo=bar. A unique name is allocated for each potential source. The following label sources are currently supported:
- container: for labels derived from the local container runtime
- k8s: for labels derived from Kubernetes
- mesos: for labels derived from Mesos
- reserved: for special reserved labels, see Special Identities.
- unspec: for labels with unspecified source
When using labels to identify other resources, the source can be included to limit matching of labels to a particular type. If no source is provided, the label source defaults to any: which will match all labels regardless of their source. If a source is provided, the source of the selecting and matching labels need to match.
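As a sketch (the labels here are illustrative), a selector can pin the source by prefixing the label key; without a prefix the selection defaults to the any: source:
# Hypothetical selector fragment illustrating label sources: the k8s: prefix
# restricts the match to the Kubernetes-derived label role=frontend.
endpointSelector:
  matchLabels:
    "k8s:role": frontend
# Writing just 'role: frontend' instead would default to the any: source and
# match role=frontend regardless of where the label came from.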
Endpoint¶
Cilium makes application containers available on the network by assigning them IP addresses. Multiple application containers can share the same IP address; a typical example for this model is a Kubernetes Pod. All application containers which share a common address are grouped together in what Cilium refers to as an endpoint.
Allocating individual IP addresses enables the use of the entire Layer 4 port range by each endpoint. This essentially allows multiple application containers running on the same cluster node to all bind to well-known ports such as 80 without causing any conflicts.
The default behavior of Cilium is to assign both an IPv6 and IPv4 address to every endpoint. However, this behavior can be configured to only allocate an IPv6 address with the --disable-ipv4 option. If both an IPv6 and IPv4 address are assigned, either address can be used to reach the endpoint. The same behavior will apply with regard to policy rules, load-balancing, etc. See Address Management for more details.
Identification¶
For identification purposes, Cilium assigns an internal endpoint id to all endpoints on a cluster node. The endpoint id is unique within the context of an individual cluster node.
Endpoint Metadata¶
An endpoint automatically derives metadata from the application containers associated with the endpoint. The metadata can then be used to identify the endpoint for security/policy, load-balancing and routing purposes.
The source of the metadata will depend on the orchestration system and container runtime in use. The following metadata retrieval mechanisms are currently supported:
System | Description |
---|---|
Kubernetes | Pod labels (via k8s API) |
Mesos | Labels (via CNI) |
containerd (Docker) | Container labels (via Docker API) |
Metadata is attached to endpoints in the form of Labels.
The following example launches a container with the label app=benchmark which is then associated with the endpoint. The label is prefixed with container: to indicate that the label was derived from the container runtime.
$ docker run --net cilium -d -l app=benchmark tgraf/netperf
aaff7190f47d071325e7af06577f672beff64ccc91d2b53c42262635c063cf1c
$ cilium endpoint list
ENDPOINT POLICY IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT
62006 Disabled 257 container:app=benchmark f00d::a00:20f:0:f236 10.15.116.202 ready
An endpoint can have metadata associated from multiple sources. A typical example is a Kubernetes cluster which uses containerd as the container runtime. Endpoints will derive Kubernetes pod labels (prefixed with the k8s: source prefix) and containerd labels (prefixed with the container: source prefix).
Identity¶
All Endpoints are assigned an identity. The identity is what is used to enforce basic connectivity between endpoints. In traditional networking terminology, this would be equivalent to Layer 3 enforcement.
An identity is identified by Labels and is given a cluster-wide unique identifier. The endpoint is assigned the identity which matches the endpoint's Security Relevant Labels, i.e. all endpoints which share the same set of Security Relevant Labels will share the same identity. This concept allows policy enforcement to scale to a massive number of endpoints, as many individual endpoints will typically share the same set of security Labels as applications are scaled.
What is an Identity?¶
The identity of an endpoint is derived from the Labels associated with the pod or container, which are inherited by the endpoint. When a pod or container is started, Cilium creates an endpoint based on the event received from the container runtime to represent the pod or container on the network. As a next step, Cilium resolves the identity of the created endpoint. Whenever the Labels of the pod or container change, the identity is re-confirmed and automatically modified as required.
Security Relevant Labels¶
Not all Labels associated with a container or pod are meaningful when deriving the Identity. Labels may be used to store metadata such as the timestamp when a container was launched. Cilium needs to know which labels are meaningful and are subject to being considered when deriving the identity. For this purpose, the user is required to specify a list of string prefixes of meaningful labels. The standard behavior is to include all labels which start with the prefix id., e.g. id.service1, id.service2, id.groupA.service44. The list of meaningful label prefixes can be specified when starting the agent.
Special Identities¶
All endpoints which are managed by Cilium will be assigned an identity. In order to allow communication to network endpoints which are not managed by Cilium, special identities exist to represent those. Special reserved identities are prefixed with the string reserved:.
Identity | Description |
---|---|
reserved:host | The host network namespace on which the pod or container is running. |
reserved:cluster | Any network endpoint inside of the cluster that is not managed by Cilium. Does not include reserved:host. |
reserved:world | Any network endpoint outside of the cluster |
Identity Management in the Cluster¶
Identities are valid in the entire cluster which means that if several pods or containers are started on several cluster nodes, all of them will resolve and share a single identity if they share the identity relevant labels. This requires coordination between cluster nodes.

The operation to resolve an endpoint identity is performed with the help of the distributed key-value store, which allows performing atomic operations of the form "generate a new unique identifier if the given value has not been seen before". This allows each cluster node to create the identity-relevant subset of labels and then query the key-value store to derive the identity. Depending on whether the set of labels has been queried before, either a new identity will be created, or the identity of the initial query will be returned.
Node¶
Cilium refers to a node as an individual member of a cluster. Each node must be running the cilium-agent and will operate in a mostly autonomous manner. Synchronization of state between Cilium agents running on different nodes is kept to a minimum for simplicity and scale. It occurs exclusively via the Key-Value store or with packet metadata.
Node Address¶
Cilium will automatically detect the node's IPv4 and IPv6 address. The detected node address is printed out when the cilium-agent starts:
Local node-name: worker0
Node-IPv6: f00d::ac10:14:0:1
External-Node IPv4: 172.16.0.20
Internal-Node IPv4: 10.200.28.238
Address Management¶
The address management is designed with simplicity and resilience in mind. This is achieved by delegating the address allocation for endpoints to each individual node in the cluster. Each cluster node is assigned a node address allocation prefix out of an overarching cluster address prefix and will allocate IPs for endpoints independently.
This simplifies address handling and allows one to make a fundamental assumption:
- No state needs to be synchronized between cluster nodes to allocate IP addresses and to determine whether an IP address belongs to an endpoint of the cluster and whether that endpoint resides on the local cluster node.
Note
If you are using Kubernetes, the allocation of the node address prefix can simply be delegated to Kubernetes by specifying the --allocate-node-cidrs flag to kube-controller-manager. Cilium will automatically use the IPv4 node CIDR allocated by Kubernetes.
The following values are used by default if the cluster prefix is left unspecified. These are meant for testing and need to be adjusted according to the needs of your environment.
Type | Cluster | Node Prefix |
---|---|---|
IPv4 | 10.0.0.0/8 | 10.X.0.0/16 where X is derived using the last 8 bits of the first IPv4 address in the list of global scope addresses on the cluster node. |
IPv6 | f00d::/48 | Note: Only 16 bits out of the |
The size of the IPv4 cluster prefix can be changed with the --ipv4-cluster-cidr-mask-size option. The size of the IPv6 cluster prefix is currently fixed at /48. The node allocation prefixes can be specified manually with the options --ipv4-range and --ipv6-range respectively.
Multi Host Networking¶
Cilium is in full control over both ends of the connection for connections inside the cluster. It can thus transmit state and security context information between two container hosts by embedding the information in encapsulation headers or even unused bits of the IPv6 packet header. This allows Cilium to transmit the security context of where the packet originates, which allows tracing back which container labels are assigned to the origin container.
Note
As the packet headers contain security sensitive information, it is highly recommended to either encrypt all traffic or run Cilium in a trusted network environment.
Cilium keeps the networking concept as simple as possible. There are two networking models to choose from.
Regardless of the option chosen, the container itself has no awareness of the underlying network it runs on; it only contains a default route which points to the IP address of the cluster node. Given the removal of the routing cache in the Linux kernel, this reduces the amount of state to keep in the per-connection flow cache (TCP metrics), which makes it possible to terminate millions of connections in each container.
Overlay Network Mode¶
When no configuration is provided, Cilium automatically runs in this mode.
In this mode, all cluster nodes form a mesh of tunnels using the UDP-based encapsulation protocols VXLAN or Geneve. All container-to-container network traffic is routed through these tunnels. This mode has several major advantages:
- Simplicity: The network which connects the cluster nodes does not need to be made aware of the cluster prefix. Cluster nodes can spawn multiple routing or link-layer domains. The topology of the underlying network is irrelevant as long as cluster nodes can reach each other using IP/UDP.
- Auto-configuration: When running together with an orchestration system such as Kubernetes, the list of all nodes in the cluster including their associated allocation prefix is made available to each agent automatically. This means that if Kubernetes is being run with the --allocate-node-cidrs option, Cilium can form an overlay network automatically without any configuration by the user. New nodes joining the cluster will automatically be incorporated into the mesh.
- Identity transfer: Encapsulation protocols allow for the carrying of arbitrary metadata along with the network packet. Cilium makes use of this ability to transfer metadata such as the source security identity and load balancing state to perform direct-server-return.
Direct / Native Routing Mode¶
Note
This is an advanced networking mode which requires the underlying network to be made aware of container IPs. You can enable this mode by running Cilium with the option --tunnel disabled.
In direct routing mode, Cilium will hand all packets which are not addressed to another local endpoint to the routing subsystem of the Linux kernel. This means that the packet will be routed as if a local process had emitted it. As a result, the network connecting the cluster nodes must be aware that each of the node IP prefixes is reachable by using the node's primary IP address as an L3 next hop address.
Cilium automatically enables IP forwarding in Linux when direct mode is configured, but it is up to the container cluster administrator to ensure that each routing element in the underlying network has a route that describes each node IP as the IP next hop for the corresponding node prefix.
This is typically achieved using two methods:
- Operation of a routing protocol such as OSPF or BGP via a routing daemon such as zebra, bird or bgpd. The routing protocols will announce the node allocation prefix via the node's IP to all other nodes.
- Use of the cloud provider's routing functionality. Refer to the documentation of your cloud provider for additional details (e.g., AWS VPC Route Tables or GCE Routes). These APIs can be used to associate each node prefix with the appropriate next hop IP each time a container node is added to the cluster. If you are running Kubernetes with the --cloud-provider option in combination with the --allocate-node-cidrs option then this is configured automatically for IPv4 prefixes.
Note
Use of direct routing mode currently only offers identity based security policy enforcement for IPv6 where the security identity is stored in the flowlabel. IPv4 is currently not supported and thus security must be enforced using CIDR policy rules.
Container Communication with External Hosts¶
Container communication with the outside world has two primary modes:
- Containers exposing API services for consumption by hosts outside of the container cluster.
- Containers making outgoing connections. Examples include connecting to 3rd-party API services like Twilio or Stripe as well as accessing private APIs that are hosted elsewhere in your enterprise datacenter or cloud deployment.
In the Direct / Native Routing Mode described before, if container IP addresses are routable outside of the container cluster, communication with external hosts requires little more than enabling L3 forwarding on each of the Linux nodes.
External Network Connectivity¶
If the destination of a packet lies outside of the cluster, Cilium will delegate routing to the routing subsystem of the cluster node to use the default route which is installed on the node of the cluster.
As the IP addresses used for the cluster prefix are typically allocated from RFC1918 private address blocks and are not publicly routable, Cilium will automatically masquerade the source IP address of all traffic that is leaving the cluster. This behavior can be disabled by running cilium-agent with the option --masquerade=false.
Public Endpoint Exposure¶
In direct routing mode, endpoint IPs can be publicly routable IPs and no additional action needs to be taken.
In overlay mode, endpoints that are accepting inbound connections from cluster-external clients will likely want to be exposed via some kind of load-balancing layer. Such a load-balancer will have a public external address that is not part of the Cilium network. This can be achieved by having a load-balancer container that both has a public IP on an externally reachable network and a private IP on a Cilium network. However, many container orchestration frameworks, like Kubernetes, have built-in abstractions to handle this "ingress" load-balancing capability, which achieve the same effect: Cilium handles forwarding and security only for "internal" traffic between different services.
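In Kubernetes, for example, such an ingress load-balancing layer is typically declared as a Service; a minimal sketch (names and ports are illustrative) could look like this:
apiVersion: v1
kind: Service
metadata:
  name: frontend-public        # illustrative name
spec:
  type: LoadBalancer           # requests an externally reachable address from the platform
  selector:
    role: frontend             # pods backing this service on the Cilium network
  ports:
  - port: 80                   # externally exposed port
    targetPort: 8080           # port the pods listen on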
Security¶
Cilium provides security on multiple levels. Each can be used individually or combined together.
- Identity based Connectivity Access Control: Connectivity policies between endpoints (Layer 3), e.g. any endpoint with label role=frontend can connect to any endpoint with label role=backend.
- Restriction of accessible ports (Layer 4) for both incoming and outgoing connections, e.g. endpoints with the label role=frontend can only make outgoing connections on port 443 (https) and endpoints with the label role=backend can only accept connections on port 443 (https).
- Fine grained access control on application protocol level to secure HTTP and remote procedure call (RPC) protocols, e.g. the endpoint with label role=frontend can only perform the REST API call GET /userdata/[0-9]+, all other API interactions with role=backend are restricted.
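As a sketch (labels, port and path are illustrative), the three levels above can be combined in a single CiliumNetworkPolicy using the resource format described later in this document:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: frontend-to-backend       # illustrative name
spec:
  endpointSelector:
    matchLabels:
      role: backend               # the policy applies to backend endpoints
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend            # Layer 3: only frontend identities may connect
    toPorts:
    - ports:
      - port: "443"               # Layer 4: only port 443
        protocol: TCP
      rules:
        http:
        - method: "GET"           # Layer 7: only this REST call is permitted
          path: "/userdata/[0-9]+"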
Currently on the roadmap, to be added soon:
- Authentication: Any endpoint which wants to initiate a connection to an endpoint with the label role=backend must have a particular security certificate to authenticate itself before being able to initiate any connections. See GH issue 502 for additional details.
- Encryption: Communication between any endpoint with the label role=frontend to any endpoint with the label role=backend is automatically encrypted with a key that is automatically rotated. See GH issue 504 to track progress on this feature.
Identity based Connectivity Access Control¶
Container management systems such as Kubernetes deploy a networking model which assigns an individual IP address to each pod (group of containers). This ensures simplicity in architecture, avoids unnecessary network address translation (NAT) and provides each individual container with a full range of port numbers to use. The logical consequence of this model is that depending on the size of the cluster and total number of pods, the networking layer has to manage a large number of IP addresses.
Traditionally, security enforcement architectures have been based on IP address filters. Let's walk through a simple example: If all pods with the label role=frontend should be allowed to initiate connections to all pods with the label role=backend then each cluster node which runs at least one pod with the label role=backend must have a corresponding filter installed which allows all IP addresses of all role=frontend pods to initiate a connection to the IP addresses of all local role=backend pods. All other connection requests should be denied. This could look like this: If the destination address is 10.1.1.2 then allow the connection only if the source address is one of the following [10.1.2.2,10.1.2.3,20.4.9.1].
Every time a new pod with the label role=frontend or role=backend is either started or stopped, the rules on every cluster node which runs any such pods must be updated by either adding or removing the corresponding IP address from the list of allowed IP addresses. In large distributed applications, this could imply updating thousands of cluster nodes multiple times per second depending on the churn rate of deployed pods. Worse, the starting of new role=frontend pods must be delayed until all servers running role=backend pods have been updated with the new security rules as otherwise connection attempts from the new pod could be mistakenly dropped. This makes it difficult to scale efficiently.
In order to avoid these complications which can limit scalability and flexibility, Cilium entirely separates security from network addressing. Instead, security is based on the identity of a pod, which is derived through labels. This identity can be shared between pods. This means that when the first role=frontend pod is started, Cilium assigns an identity to that pod which is then allowed to initiate connections to the identity of the role=backend pod. The subsequent start of additional role=frontend pods only requires resolving this identity via a key-value store; no action has to be performed on any of the cluster nodes hosting role=backend pods. The starting of a new pod must only be delayed until the identity of the pod has been resolved, which is a much simpler operation than updating the security rules on all other cluster nodes.

Policy Enforcement¶
All security policies are described assuming stateful policy enforcement for session based protocols. This means that the intent of the policy is to describe the allowed direction of connection establishment. If the policy allows A => B then reply packets from B to A are automatically allowed as well. However, B is not automatically allowed to initiate connections to A. If that outcome is desired, then both directions must be explicitly allowed.
Security policies may be enforced at ingress or egress. For ingress, this means that each cluster node verifies all incoming packets and determines whether the packet is allowed to be transmitted to the intended endpoint. Correspondingly, for egress each cluster node verifies outgoing packets and determines whether the packet is allowed to be transmitted to its intended destination.
In order to enforce identity based security in a multi host cluster, the identity of the transmitting endpoint is embedded into every network packet that is transmitted in between cluster nodes. The receiving cluster node can then extract the identity and verify whether a particular identity is allowed to communicate with any of the local endpoints.
Default Security Policy¶
If no policy is loaded, the default behavior is to allow all communication unless policy enforcement has been explicitly enabled. As soon as the first policy rule is loaded, policy enforcement is enabled automatically and any communication must then be white listed or the relevant packets will be dropped.
Similarly, if an endpoint is not subject to an L4 policy, communication from and to all ports is permitted. Associating at least one L4 policy to an endpoint will block all connectivity to ports unless explicitly allowed.
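For example, a minimal L4-only rule (labels are illustrative) associated with role=backend endpoints permits inbound connections on port 443 and, per the behavior above, blocks all other ports on those endpoints:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: backend-allow-443         # illustrative name
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - toPorts:                      # no fromEndpoints given, so any peer may use port 443
    - ports:
      - port: "443"
        protocol: TCP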
Orchestration System Specifics¶
Kubernetes¶
Cilium regards each deployed Pod as an endpoint with regards to networking and security policy enforcement. Labels associated with pods can be used to define the identity of the endpoint.
When two pods communicate via a service construct, then the labels of the origin pod apply to determine the identity.
Getting Help¶
Cilium is a project with a growing community. There are numerous ways to get help with Cilium if needed:
Cilium Frequently Asked Questions (FAQ): Cilium uses GitHub tags to maintain a list of questions asked by users. We suggest checking to see if your question is already answered.
Chat: The best way to get immediate help if you get stuck is to ask in one of the Cilium Slack channels.
Bug Tracker: All the issues are addressed in the GitHub issue tracker. If you want to report a bug or a new feature please file the issue according to the GitHub template.
Contributing: If you want to contribute, reading the Developer / Contributor Guide should help you.
Security: If you want to report any security issues within Cilium, please send an email to security@cilium.io.
Kubernetes¶
Cilium provides seamless integration into Kubernetes. The following guidance may help you to navigate this documentation section:
- If you are already a Kubernetes, Service and NetworkPolicy expert: Quick Start.
- If you are looking for a simple and safe playground to experiment with Cilium and Kubernetes: Getting Started Using Minikube.
- If you want to learn more about Cilium on Kubernetes first: Introduction.
- If you want to run Cilium on your kops cluster: Kubernetes Kops Installation Guide.
The following sections describe the Kubernetes integration in detail:
Quick Start¶
If you know what you are doing, then the following quick instructions get you started in the shortest time possible. If you require additional details or are looking to customize the installation then read the remaining sections of this chapter.
- Mount the BPF filesystem on all k8s worker nodes. There are many ways to achieve this, see section Mounting the BPF FS (Optional) for more details.
mount bpffs /sys/fs/bpf -t bpf
- Download the DaemonSet template cilium.yaml and specify the etcd address:
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.7/cilium.yaml
$ vim cilium.yaml [adjust the etcd address]
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.8/cilium.yaml
$ vim cilium.yaml [adjust the etcd address]
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.9/cilium.yaml
$ vim cilium.yaml [adjust the etcd address]
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.10/cilium.yaml
$ vim cilium.yaml [adjust the etcd address]
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.11/cilium.yaml
$ vim cilium.yaml [adjust the etcd address]
- Deploy cilium with your local changes:
$ kubectl create -f ./cilium.yaml
clusterrole "cilium" created
serviceaccount "cilium" created
clusterrolebinding "cilium" created
configmap "cilium-config" created
secret "cilium-etcd-secrets" created
daemonset "cilium" created
$ kubectl get ds --namespace kube-system
NAME DESIRED CURRENT READY NODE-SELECTOR AGE
cilium 1 1 1 <none> 2m
You now have Cilium deployed in your cluster and ready to use.
Introduction¶
What does Cilium provide in your Kubernetes Cluster?¶
The following functionality is provided as you run Cilium in your Kubernetes cluster:
- CNI plugin support to provide pod connectivity with Multi Host Networking.
- Identity based implementation of the NetworkPolicy resource to isolate pod to pod connectivity on Layer 3 and 4.
- An extension to NetworkPolicy in the form of a CustomResourceDefinition which extends policy control to add:
  - Layer 7 policy enforcement on ingress and egress for the following application protocols:
    - HTTP
    - Kafka
  - Egress support for CIDRs to secure access to external services
  - Enforcement to external headless services to automatically restrict to the set of Kubernetes endpoints configured for a service.
- ClusterIP implementation to provide distributed load-balancing for pod to pod traffic.
- Fully compatible with the existing kube-proxy model
Pod-to-Pod Connectivity¶
In Kubernetes, containers are deployed within units referred to as Pods, which include one or more containers reachable via a single IP address. With Cilium, each Pod gets an IP address from the node prefix of the Linux node running the Pod. See Address Management for additional details. In the absence of any network security policies, all Pods can reach each other.
Pod IP addresses are typically local to the Kubernetes cluster. If pods need to reach services outside the cluster as a client, the network traffic is automatically masqueraded as it leaves the node. You can find additional information in the section External Network Connectivity.
Service Load-balancing¶
Kubernetes has developed the Services abstraction which provides the user with the ability to load balance network traffic to different pods. This abstraction allows pods to reach other pods via a single virtual IP address, without needing to know all of the pods that are running that particular service.
Without Cilium, kube-proxy is installed on every node; it watches for endpoint and service additions and removals on the kube-master, which allows it to apply the necessary enforcement in iptables. Thus, traffic received by and sent from the pods is properly routed to the node and port serving that service. For more information you can check out the Kubernetes user guide for Services.
When implementing ClusterIP, Cilium acts on the same principles as kube-proxy: it watches for service additions or removals, but instead of doing the enforcement in iptables, it updates BPF map entries on each node. For more information, see the Pull Request.
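For reference, a minimal ClusterIP Service (names and ports are illustrative); Cilium resolves the virtual IP of such a service to the backing pod addresses through BPF map entries rather than iptables rules:
apiVersion: v1
kind: Service
metadata:
  name: backend                   # illustrative name
spec:
  type: ClusterIP                 # the default service type
  selector:
    role: backend                 # pods that receive the load-balanced traffic
  ports:
  - port: 80                      # port of the virtual IP
    targetPort: 8080              # port the pods listen on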
Further Reading¶
The Kubernetes documentation contains more background on the Kubernetes Networking Model and Kubernetes Network Plugins .
Installation Guide¶
Note
This is the detailed installation guide aimed at production installations. If you are looking to get started quickly, the Getting Started Using Minikube or the Quick Start guide may be better options.
This section describes how to install and run Cilium on Kubernetes. The deployment method we are using is called DaemonSet which is the easiest way to deploy Cilium in a Kubernetes environment. It will request Kubernetes to automatically deploy and run a cilium/cilium container image as a pod on all Kubernetes worker nodes.
Should you encounter any issues during the installation, please refer to the Troubleshooting section and / or seek help on the Slack channel. See the Kubernetes Compatibility section for Kubernetes API version compatibility.
Kubernetes Requirements¶
Enable automatic node CIDR allocation (Recommended)¶
Kubernetes has the capability to automatically allocate and assign a per-node IP allocation CIDR. Cilium automatically uses this feature if enabled. This is the easiest method to handle IP allocation in a Kubernetes cluster. To enable this feature, simply add the following flag when starting kube-controller-manager:
--allocate-node-cidrs
This option is not required but highly recommended.
Running Kubernetes with CRD Validation (Recommended)¶
Custom Resource Validation was introduced in Kubernetes version 1.8.0. It is still considered an alpha feature in Kubernetes 1.8.0 and beta in Kubernetes 1.9.0.
Since Cilium v1.0.0-rc3, Cilium will create, or update in case it exists, the Cilium Network Policy (CNP) Resource Definition with the embedded validation schema. This allows the validation of CiliumNetworkPolicy to be done on the kube-apiserver when the policy is imported, with an ability to provide direct feedback when importing the resource.
To enable this feature, the flag --feature-gates=CustomResourceValidation=true must be set when starting kube-apiserver. Cilium itself will automatically make use of this feature and no additional flag is required.
Note
In case there is an invalid CNP before updating to Cilium v1.0.0-rc3, which contains the validator, the kube-apiserver validator will prevent Cilium from updating that invalid CNP with Cilium node status. By checking the Cilium logs for unable to update CNP, retrying..., it is possible to determine which Cilium Network Policies are considered invalid after updating to Cilium v1.0.0-rc3.
To verify that the CNP resource definition contains the validation schema, run the following command:
kubectl get crd ciliumnetworkpolicies.cilium.io -o json
kubectl get crd ciliumnetworkpolicies.cilium.io -o json | grep -A 12 openAPIV3Schema
"openAPIV3Schema": {
"oneOf": [
{
"required": [
"spec"
]
},
{
"required": [
"specs"
]
}
],
In case the user writes a policy that does not conform to the schema, Kubernetes will return an error, e.g.:
cat <<EOF > ./bad-cnp.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "Policy to test multiple rules in a single file"
metadata:
name: my-new-cilium-object
spec:
endpointSelector:
matchLabels:
app: details
track: stable
version: v1
ingress:
- fromEndpoints:
- matchLabels:
app: reviews
track: stable
version: v1
toPorts:
- ports:
- port: '65536'
protocol: TCP
rules:
http:
- method: GET
path: "/health"
EOF
kubectl create -f ./bad-cnp.yaml
...
spec.ingress.toPorts.ports.port in body should match '^(6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[0-9]{1,4})$'
In this case, the policy has a port out of the 0-65535 range.
Mounting the BPF FS (Optional)¶
This step is optional but recommended. It allows the cilium-agent
to pin
BPF resources to a persistent filesystem and make them persistent across
restarts of the agent. If the BPF filesystem is not mounted in the host
filesystem, Cilium will automatically mount the filesystem in the mount
namespace of the container when the agent starts. This will allow operation of
Cilium but will result in unmounting of the filesystem when the pod is
restarted. This in turn will cause resources such as the connection tracking
table of the BPF programs to be released which will cause all connections into
local containers to be dropped. Mounting the BPF filesystem in the host mount
namespace will ensure that the agent can be restarted without affecting
connectivity of any pods.
In order to mount the BPF filesystem, the following command must be run in the host mount namespace. The command must only be run once during the boot process of the machine.
mount bpffs /sys/fs/bpf -t bpf
A portable way to achieve this with persistence is to add the following line to /etc/fstab and then run mount /sys/fs/bpf. This will cause the filesystem to be automatically mounted when the node boots.
bpffs /sys/fs/bpf bpf defaults 0 0
If you are using systemd to manage the kubelet, another option is to add a systemd mount unit on all hosts. Due to how systemd mounts filesystems, the mount point path must be reflected in the unit filename.
cat <<EOF | sudo tee /etc/systemd/system/sys-fs-bpf.mount
[Unit]
Description=Cilium BPF mounts
Documentation=http://docs.cilium.io/
DefaultDependencies=no
Before=local-fs.target umount.target
After=swap.target
[Mount]
What=bpffs
Where=/sys/fs/bpf
Type=bpf
[Install]
WantedBy=multi-user.target
EOF
CNI Configuration¶
CNI - Container Network Interface is the plugin layer used by Kubernetes to delegate networking configuration. You can find additional information on the CNI project website.
Note
Kubernetes >= 1.3.5 requires the loopback CNI plugin to be installed on all worker nodes. The binary is typically provided by most Kubernetes distributions. See section Installing CNI and loopback for instructions on how to install CNI in case the loopback binary is not already installed on your worker nodes.
CNI configuration is automatically taken care of when deploying Cilium via the provided DaemonSet. The script cni-install.sh is automatically run via the postStart mechanism when the cilium pod is started.
Note
In order for the cni-install.sh script to work properly, the kubelet task must either be running on the host filesystem of the worker node, or the /etc/cni/net.d and /opt/cni/bin directories must be mounted into the container where kubelet is running. This can be achieved with Volumes mounts.
The CNI auto installation is performed as follows:
- The /etc/cni/net.d and /opt/cni/bin directories are mounted from the host filesystem into the pod where Cilium is running.
- The file /etc/cni/net.d/05-cilium.conf is written in case it does not exist yet.
- The binary cilium-cni is installed to /opt/cni/bin. Any existing binary with the name cilium-cni is overwritten.
Installing CNI and loopback¶
Since Kubernetes v1.3.5 the loopback CNI plugin must be installed. There are many ways to install CNI; the following is an example:
sudo mkdir -p /opt/cni
wget https://storage.googleapis.com/kubernetes-release/network-plugins/cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz
sudo tar -xvf cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz -C /opt/cni
rm cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz
Adjusting CNI configuration¶
The CNI installation can be configured with environment variables. These environment variables can be specified in the DaemonSet file like this:
env:
- name: "CNI_CONF_NAME"
value: "05-cilium.conf"
The following variables are supported:
Option | Description | Default |
---|---|---|
HOST_PREFIX | Path prefix of all host mounts | /host |
CNI_DIR | Path to mounted CNI directory | ${HOST_PREFIX}/opt/cni |
CNI_CONF_NAME | Name of configuration file | 05-cilium.conf |
If you want to further adjust the CNI configuration you may do so by creating the CNI configuration /etc/cni/net.d/05-cilium.conf manually:
sudo mkdir -p /etc/cni/net.d
sudo tee /etc/cni/net.d/05-cilium.conf <<EOF
{
    "name": "cilium",
    "type": "cilium-cni"
}
EOF
Cilium will use any existing /etc/cni/net.d/05-cilium.conf file if it already exists on a worker node and only creates it if it does not exist yet.
Deploying the DaemonSet¶
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.7/cilium.yaml
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.8/cilium.yaml
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.9/cilium.yaml
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.10/cilium.yaml
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.11/cilium.yaml
Adjusting the ConfigMap¶
After downloading the cilium.yaml file, open it with your text editor and change the ConfigMap based on the following instructions.
Adjusting etcd-config¶
First, make sure the etcd-config endpoints have the correct addresses of your etcd nodes.
If you are running more than one node, simply specify the complete list of endpoints. The list of endpoints can accept either domain names or IP addresses. Make sure you specify the correct port used in your etcd nodes.
If etcd is running with TLS, there are a couple of changes that you need to make:
- Make sure you have https in all endpoints;
- Uncomment the line #ca-file: '/var/lib/etcd-secrets/etcd-ca' so that the certificate authority of the servers is known to Cilium;
- Create a Kubernetes secret with the certificate authority file in Kubernetes:
  - Use the certificate authority file, with the name ca.crt, that was used to set up etcd;
  - Create the secret by executing:
$ kubectl create secret generic -n kube-system cilium-etcd-secrets \
    --from-file=etcd-ca=ca.crt
If etcd is running with client to server authentication, you need to make a few more changes to the ConfigMap:
- Uncomment both lines #key-file: '/var/lib/etcd-secrets/etcd-client-key' and #cert-file: '/var/lib/etcd-secrets/etcd-client-crt';
- Create a Kubernetes secret with the client.key and client.crt files in Kubernetes:
  - Use the file with the name client.key that contains the client key;
  - Use the file with the name client.crt that contains the client certificate;
  - Create the secret by executing:
$ kubectl create secret generic -n kube-system cilium-etcd-secrets \
    --from-file=etcd-ca=ca.crt \
    --from-file=etcd-client-key=client.key \
    --from-file=etcd-client-crt=client.crt
Note
If you have set up the secret before, you might see the error Error from server (AlreadyExists): secrets "cilium-etcd-secrets" already exists. You can simply delete it with kubectl delete secret -n kube-system cilium-etcd-secrets and re-create it.
Note
When creating the Kubernetes secret, just make sure you create it with all necessary files, ca.crt, client.key and client.crt, in a single kubectl create.
Regarding the etcd configuration, that is all you need to change in the ConfigMap.
Adjusting Cilium Options¶
In the ConfigMap there are a couple of options that can be changed according to your needs:
- debug - Sets Cilium to run in full debug mode; it can be changed at runtime.
- disable-ipv4 - Disables IPv4 in Cilium and endpoints managed by Cilium.
- clean-cilium-state - Removes any Cilium state, e.g. BPF policy maps, before starting the Cilium agent.
- legacy-host-allows-world - If true, the policy with the entity reserved:host allows traffic from world. If false, the policy needs to explicitly have the entity reserved:world to allow traffic from world. It is recommended to set it to false. This option provides compatibility with Cilium 1.0 which was not able to differentiate between NodePort traffic and traffic from the host.
Any changes that you perform in the Cilium ConfigMap and in the cilium-etcd-secrets Secret will require you to restart any existing Cilium pods in order for them to pick up the latest configuration.
The following ConfigMap is an example where the etcd cluster is running on 2 nodes, node-1 and node-2, with TLS and client to server authentication enabled.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # The etcd-config value contains the etcd endpoints of your cluster.
  etcd-config: |-
    ---
    endpoints:
    - https://node-1:31079
    - https://node-2:31079
    #
    # In case you want to use TLS in etcd, uncomment the 'ca-file' line
    # and create a kubernetes secret by following the tutorial in
    # https://cilium.link/etcd-config
    ca-file: '/var/lib/etcd-secrets/etcd-ca'
    #
    # In case you want client to server authentication, uncomment the following
    # lines and create a kubernetes secret by following the tutorial in
    # https://cilium.link/etcd-config
    key-file: '/var/lib/etcd-secrets/etcd-client-key'
    cert-file: '/var/lib/etcd-secrets/etcd-client-crt'

  # If you want to run cilium in debug mode change this value to true
  debug: "false"
  disable-ipv4: "false"
  # If you want to clean cilium state; change this value to true
  clean-cilium-state: "false"
  legacy-host-allows-world: "false"
After configuring the ConfigMap in cilium.yaml it is time to deploy it using kubectl:
$ kubectl create -f cilium.yaml
Kubernetes will deploy the cilium DaemonSet as a pod in the kube-system namespace on all worker nodes. This operation is performed in the background.
Run the following command to check the progress of the deployment:
$ kubectl --namespace kube-system get ds
NAME DESIRED CURRENT READY NODE-SELECTOR AGE
cilium 4 4 4 <none> 2m
As the pods are deployed, the number in the ready column will increase and eventually reach the desired count.
$ kubectl --namespace kube-system describe ds cilium
Name: cilium
Image(s): cilium/cilium:stable
Selector: io.cilium.admin.daemon-set=cilium,name=cilium
Node-Selector: <none>
Labels: io.cilium.admin.daemon-set=cilium
name=cilium
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Misscheduled: 0
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
35s 35s 1 {daemon-set } Normal SuccessfulCreate Created pod: cilium-2xzqm
We can now check the logfile of a particular cilium agent:
$ kubectl --namespace kube-system get pods
NAME READY STATUS RESTARTS AGE
cilium-2xzqm 1/1 Running 0 41m
$ kubectl --namespace kube-system logs cilium-2xzqm
INFO _ _ _
INFO ___|_| |_|_ _ _____
INFO | _| | | | | | |
INFO |___|_|_|_|___|_|_|_|
INFO Cilium 0.8.90 f022e2f Thu, 27 Apr 2017 23:17:56 -0700 go version go1.7.5 linux/amd64
INFO clang and kernel versions: OK!
INFO linking environment: OK!
[...]
Deploying to selected nodes¶
To deploy Cilium only to a selected list of worker nodes, you can add a NodeSelector to the cilium.yaml file like this:
spec:
  template:
    spec:
      nodeSelector:
        with-network-plugin: cilium
And then label each node where Cilium should be deployed:
kubectl label node worker0 with-network-plugin=cilium
kubectl label node worker1 with-network-plugin=cilium
kubectl label node worker2 with-network-plugin=cilium
Networking For Existing Pods¶
In case pods were already running before the Cilium DaemonSet
was deployed,
these pods will still be connected using the previous networking plugin
according to the CNI configuration. A typical example for this is the
kube-dns
service which runs in the kube-system
namespace by default.
A simple way to change networking for such existing pods is to rely on the fact that Kubernetes automatically restarts pods in a Deployment if they are deleted. We can therefore delete the original kube-dns pod, and the replacement pod started immediately afterwards will have its networking managed by Cilium. In a production deployment, this step could be performed as a rolling update of the kube-dns pods to avoid downtime of the DNS service.
$ kubectl --namespace kube-system delete pods -l k8s-app=kube-dns
pod "kube-dns-268032401-t57r2" deleted
Running kubectl get pods
will show you that Kubernetes started a new set of
kube-dns
pods while at the same time terminating the old pods:
$ kubectl --namespace kube-system get pods
NAME READY STATUS RESTARTS AGE
cilium-5074s 1/1 Running 0 58m
kube-addon-manager-minikube 1/1 Running 0 59m
kube-dns-268032401-j0vml 3/3 Running 0 9s
kube-dns-268032401-t57r2 3/3 Terminating 0 57m
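For the rolling-update approach mentioned above, one possible sketch (assuming kube-dns is managed by a Deployment named kube-dns; the annotation key used here is arbitrary) is to patch the pod template so that Kubernetes replaces the pods one by one:
$ # Changing the pod template triggers a rolling update of the kube-dns pods
$ kubectl --namespace kube-system patch deployment kube-dns \
    --patch '{"spec":{"template":{"metadata":{"annotations":{"restarted-for-cilium":"'$(date +%s)'"}}}}}'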
Network Policy¶
If you are running Cilium on Kubernetes, you can benefit from Kubernetes distributing policies for you. In this mode, Kubernetes is responsible for distributing the policies across all nodes and Cilium will automatically apply the policies. Two formats are available to configure network policies natively with Kubernetes:
- The standard NetworkPolicy resource which, at the time of this writing, supports specifying L3/L4 ingress policies with limited egress support marked as beta.
- The extended CiliumNetworkPolicy format which is available as a ThirdPartyResource and CustomResourceDefinition and supports specifying policies at Layers 3-7 for both ingress and egress.
It is recommended to only use one of the above policy types at a time to minimize unintended effects arising from the interaction between the policies.
NetworkPolicy¶
For more information, see the official NetworkPolicy documentation.
Known missing features for Kubernetes Network Policy:
Feature | Tracking Issue |
---|---|
Use of named ports | https://github.com/cilium/cilium/issues/2942 |
Ingress CIDR-based L4 policy | https://github.com/cilium/cilium/issues/1684 |
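As a quick orientation, here is a minimal sketch of a standard NetworkPolicy that Kubernetes distributes and Cilium enforces. It allows ingress to pods labeled app=backend on TCP port 80 only from pods labeled app=frontend (both labels are hypothetical):
$ kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  # Pods this policy applies to
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  # Only allow TCP/80 from pods labeled app=frontend
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 80
EOF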
CiliumNetworkPolicy¶
The CiliumNetworkPolicy is very similar to the standard NetworkPolicy. Its purpose is to provide the functionality which is not yet supported in NetworkPolicy. Ideally, all of this functionality will be merged into the standard resource format and this CRD will no longer be required.
The raw specification of the resource in Go looks like this:
type CiliumNetworkPolicy struct {
    metav1.TypeMeta `json:",inline"`
    // +optional
    Metadata metav1.ObjectMeta `json:"metadata"`
    // Spec is the desired Cilium specific rule specification.
    Spec *api.Rule `json:"spec,omitempty"`
    // Specs is a list of desired Cilium specific rule specification.
    Specs api.Rules `json:"specs,omitempty"`
    // Status is the status of the Cilium policy rule
    // +optional
    Status CiliumNetworkPolicyStatus `json:"status"`
}
- Metadata
  - Describes the policy. This includes:
    - Name of the policy, unique within a namespace
    - Namespace into which the policy has been injected
    - Set of labels to identify the resource in Kubernetes
- Spec
  - Field which contains a single rule (see Rule Basics)
- Specs
  - Field which contains a list of rules (see Rule Basics). This field is useful if multiple rules must be added or removed atomically.
- Status
  - Provides visibility into whether the policy has been successfully applied
Examples¶
See Layer 3 Examples for a detailed list of example policies.
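As a minimal sketch of what this CRD format allows beyond NetworkPolicy, the following policy (hypothetical labels and path) restricts ingress to endpoints labeled app=backend to HTTP GET /public on port 80 from endpoints labeled app=frontend:
$ kubectl apply -f - <<EOF
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-get-public
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      # Layer 7 (HTTP) rule applied by the Cilium proxy
      rules:
        http:
        - method: "GET"
          path: "/public"
EOF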
Cilium Endpoint Custom Resource Definition¶
When managing pods in Kubernetes, Cilium will create a Custom Resource
Definition (CRD) of Kind CiliumEndpoint
. One CiliumEndpoint
is created
for each pod managed by Cilium, with the same name and in the same namespace.
The CiliumEndpoint
objects contain the same information as the json output
of cilium endpoint get
under the .status
field, but can be fetched for
all pods in the cluster. Adding the -o json flag will export more information
about each endpoint. This includes the endpoint’s labels, security identity and
the policy in effect on it.
For example:
$ kubectl get ciliumendpoints --all-namespaces
NAMESPACE NAME AGE
default app1-55d7944bdd-l7c8j 1h
default app1-55d7944bdd-sn9xj 1h
default app2 1h
default app3 1h
kube-system cilium-health-minikube 1h
kube-system microscope 1h
Note
Each cilium-agent pod will create a CiliumEndpoint to represent its
own inter-agent health-check endpoint. These are not pods in
Kubernetes and are in the kube-system
namespace. They are named as
cilium-health-<node-name>
Kubernetes Compatibility¶
Cilium is compatible with multiple Kubernetes API Groups. Some are deprecated or beta, and may only be available in specific versions of Kubernetes.
All Kubernetes versions listed are compatible with Cilium:
k8s Version | k8s NetworkPolicy API | CiliumNetworkPolicy |
---|---|---|
1.7 | networking.k8s.io/v1 | cilium.io/v2 has a CustomResourceDefinition |
1.8, 1.9, 1.10, 1.11 | networking.k8s.io/v1 | cilium.io/v2 has a CustomResourceDefinition |
Troubleshooting¶
Verifying the installation¶
Check the status of the DaemonSet
and verify that all desired instances are in
“ready” state:
$ kubectl --namespace kube-system get ds
NAME DESIRED CURRENT READY NODE-SELECTOR AGE
cilium 1 1 0 <none> 3s
In this example, we see a desired state of 1 with 0 being ready. This indicates
a problem. The next step is to list all cilium pods by matching on the label
k8s-app=cilium
and also sort the list by the restart count of each pod to
easily identify the failing pods:
$ kubectl --namespace kube-system get pods --selector k8s-app=cilium \
--sort-by='.status.containerStatuses[0].restartCount'
NAME READY STATUS RESTARTS AGE
cilium-813gf 0/1 CrashLoopBackOff 2 44s
Pod cilium-813gf
is failing and has already been restarted 2 times. Let’s
print the logfile of that pod to investigate the cause:
$ kubectl --namespace kube-system logs cilium-813gf
INFO _ _ _
INFO ___|_| |_|_ _ _____
INFO | _| | | | | | |
INFO |___|_|_|_|___|_|_|_|
INFO Cilium 0.8.90 f022e2f Thu, 27 Apr 2017 23:17:56 -0700 go version go1.7.5 linux/amd64
CRIT kernel version: NOT OK: minimal supported kernel version is >= 4.8
In this example, the cause for the failure is a Linux kernel running on the worker node which is not meeting System Requirements.
If the cause for the problem is not apparent based on these simple steps,
please come and seek help on our Slack channel
.
Migrating Cilium TPR to CRD¶
Prior to Kubernetes 1.7, Cilium Network Policy (CNP) objects were imported as Kubernetes ThirdPartyResources (TPRs).
In Kubernetes >= 1.7.0, TPRs are deprecated and will be removed in Kubernetes 1.8. TPRs are replaced by Custom Resource Definitions (CRDs). Thus, as part of the upgrade process to Kubernetes 1.7, Kubernetes has provided documentation for migrating TPRs to CRDs.
The following instructions document how to migrate CiliumNetworkPolicies existing as TPRs from a Kubernetes cluster which was previously running versions < 1.7.0 to CRDs on a Kubernetes cluster running versions >= 1.7.0. This is meant to correspond to steps 4-6 of the aforementioned guide.
Cilium adds the CNP CRD automatically; check to see that the CNP CRD has been added by Cilium:
$ kubectl get customresourcedefinition
NAME KIND
ciliumnetworkpolicies.cilium.io CustomResourceDefinition.v1beta1.apiextensions.k8s.io
Save your existing CNPs which were previously added as TPRs:
$ kubectl get ciliumnetworkpolicies --all-namespaces -o yaml > cnps.yaml
Change the version of the Cilium API from v1 to v2 in the YAML file to which you just saved your old CNPs. The Cilium API is versioned to account for the change from TPR to CRD:
$ cp cnps.yaml cnps.yaml.new
$ # Edit the version
$ vi cnps.yaml.new
$ # The diff of the old vs. new YAML file should be similar to the output below.
$ diff cnps.yaml cnps.yaml.new
3c3
< - apiVersion: cilium.io/v1
---
> - apiVersion: cilium.io/v2
10c10
< selfLink: /apis/cilium.io/v1/namespaces/default/ciliumnetworkpolicies/guestbook-web-deprecated
---
> selfLink: /apis/cilium.io/v2/namespaces/default/ciliumnetworkpolicies/guestbook-web-deprecated
Delete your old CNPs:
$ kubectl delete ciliumnetworkpolicies --all
$ kubectl delete thirdpartyresource cilium-network-policy.cilium.io
Add the changed CNPs back as CRDs:
$ kubectl create -f cnps.yaml.new
Check that your CNPs are added:
$ kubectl get ciliumnetworkpolicies
NAME KIND
guestbook-web-deprecated CiliumNetworkPolicy.v2.cilium.io
multi-rules-deprecated CiliumNetworkPolicy.v2.cilium.io Policy to test multiple rules in a single file 2 item(s)
Now if you try to create a CNP as a TPR, you will get an error:
Error from server (BadRequest): error when creating "cilium-tpr.yaml": the API version in the data (cilium.io/v1) does not match the expected API version (cilium.io/v2)
Istio¶
Cilium can be deployed alongside Istio to provide L3-L7 network filtering that complements Istio's microservice mesh features. A quick guide is available to walk you through the process step by step.
For more information on Istio, check out the Istio website.
Docker¶
Cilium can be integrated with Docker in two ways:
- via the CNI interface. This method is used by Kubernetes and Mesos.
- via Docker's libnetwork plugin interface, if networking is to be managed by the Docker runtime. This method is used, for example, by Docker Compose.
To run Cilium with Docker's libnetwork, create a single logical Docker
network of type cilium with an IPAM-driver of type cilium. The
IPAM-driver delegates control over IPv4 and IPv6 address management and network
connectivity to Cilium for all containers attached to this network. Each Docker
container is allocated an IP address from the node prefix of the node running
that container.
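As a sketch, such a network can be created with the standard Docker CLI (the network name cilium-net and the container label id.app1 are arbitrary):
$ docker network create --ipam-driver cilium --driver cilium cilium-net
$ # Containers attached to this network get their connectivity managed by Cilium, e.g.:
$ docker run -d --name app1 --net cilium-net -l "id.app1" nginx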
When deployed with Docker, each Linux node must also run a cilium-docker
agent that receives libnetwork calls from Docker and then communicates with the
Cilium Agent to control container networking.
Security policies controlling connectivity between the Docker containers can be written in terms of the Docker container labels passed to Docker when creating the container. These policies can be created and updated via the Cilium agent API or by using the Cilium CLI client.
Follow this guide for a step by step introduction on how to use Cilium with Docker Compose:
System Requirements¶
Before installing Cilium, please ensure that your system meets the minimum requirements below. Most modern Linux distributions already do.
Summary¶
When running Cilium using the container image cilium/cilium
, the host
system must meet these requirements:
- Linux kernel >= 4.9.17
- Key-Value store etcd >= 3.1.0 or consul >= 0.6.4
When running Cilium as a native process on your host (i.e. not running the
cilium/cilium
container image) these additional requirements must be met:
- clang+LLVM >=3.7.1
- iproute2 >= 4.8.0
Requirement | Minimum Version | In cilium container |
---|---|---|
Linux kernel | >= 4.9.17 | no |
Key-Value store (etcd) | >= 3.1.0 | no |
Key-Value store (consul) | >= 0.6.4 | no |
clang+LLVM | >= 3.7.1 | yes |
iproute2 | >= 4.8.0 | yes |
Linux Distribution Compatibility Matrix¶
The following table lists Linux distributions that are known to work well with Cilium.
Distribution | Minimum Version |
---|---|
CoreOS | stable (>= 1298.5.0) |
Debian | >= 9 Stretch |
Fedora Atomic/Core | >= 25 |
LinuxKit | all |
Ubuntu | >= 16.04.2, >= 16.10 |
Opensuse | Tumbleweed, >=Leap 15.0 |
Note
The above list is based on feedback by users. If you find an unlisted Linux distribution that works well, please let us know by opening a GitHub issue or by creating a pull request that updates this guide.
Linux Kernel¶
Cilium leverages and builds on the kernel BPF functionality as well as various subsystems which integrate with BPF. Therefore, host systems are required to run Linux kernel version 4.8.0 or later to run a Cilium agent. More recent kernels may provide additional BPF functionality that Cilium will automatically detect and use on agent start.
In order for the BPF feature to be enabled properly, the following kernel configuration options must be enabled. This is typically the case with distribution kernels. When an option can be built as a module or statically linked, either choice is valid.
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_CLS_BPF=y
CONFIG_BPF_JIT=y
CONFIG_NET_CLS_ACT=y
CONFIG_NET_SCH_INGRESS=y
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_USER_API_HASH=y
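A quick way to verify these options on a running host is to grep the kernel configuration; where it is exposed varies by distribution (a sketch, paths differ between systems):
$ # If the kernel exposes its config via /proc (CONFIG_IKCONFIG_PROC):
$ zgrep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=|CONFIG_NET_CLS_BPF=|CONFIG_BPF_JIT=|CONFIG_NET_CLS_ACT=|CONFIG_NET_SCH_INGRESS=|CONFIG_CRYPTO_SHA1=|CONFIG_CRYPTO_USER_API_HASH=' /proc/config.gz
$ # Otherwise, most distributions ship the config next to the kernel image:
$ grep -E 'CONFIG_BPF=|CONFIG_NET_CLS_BPF=' /boot/config-$(uname -r)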
Key-Value store¶
Cilium uses a distributed Key-Value store to manage, synchronize and distribute security identities across all cluster nodes. The following Key-Value stores are currently supported:
- etcd >= 3.1.0
- consul >= 0.6.4
See Key-Value Store for details on how to configure the
cilium-agent
to use a Key-Value store.
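As a rough sketch of what this looks like when running the agent directly (flag names should be verified against cilium-agent --help for your release; the config file path is hypothetical):
$ # etcd, with endpoints and TLS settings kept in a separate config file:
$ cilium-agent --kvstore etcd --kvstore-opt etcd.config=/var/lib/cilium/etcd-config.yml
$ # or consul, pointing at a local consul agent:
$ cilium-agent --kvstore consul --kvstore-opt consul.address=127.0.0.1:8500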
clang+LLVM¶
Note
This requirement is only needed if you run cilium-agent
natively.
If you are using the Cilium container image cilium/cilium
,
clang+LLVM is included in the container image.
LLVM is the compiler suite that Cilium uses to generate BPF bytecode programs
to be loaded into the Linux kernel. The minimum supported version of LLVM
available to cilium-agent
should be >=3.7.1. The version of clang installed
must be compiled with the BPF backend enabled.
See https://releases.llvm.org/ for information on how to download and install LLVM.
Note
Beginning with clang 3.9.x, the minimum kernel version is >= 4.9.17.
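To sanity-check the toolchain (a sketch; output formats differ between LLVM releases), confirm the clang version and that the bpf target is registered:
$ clang --version
$ llc --version | grep -i bpf    # the list of registered targets should include "bpf"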
iproute2¶
Note
iproute2 is only needed if you run cilium-agent
directly on the
host machine. iproute2 is included in the cilium/cilium
container
image.
iproute2 is a low level tool used to configure various networking related
subsystems of the Linux kernel. Cilium uses iproute2 to configure networking
and tc
, which is part of iproute2, to load BPF programs into the kernel.
The minimum version of iproute2 must be >= 4.8.0. Please see https://www.kernel.org/pub/linux/utils/net/iproute2/ for documentation on how to install iproute2.
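To check which iproute2 version is installed (a sketch; the exact version string format varies by distribution and may be date-based on older releases):
$ ip -V
$ tc -V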
Installation Guides¶
These guides describe the various ways to install and configure Cilium in different deployment modes. They focus on a full deployment of Cilium within a datacenter or public cloud. If you are just looking for a simple way to experiment, we highly recommend trying out the Getting Started Guides instead.
Kubernetes Installation Guide (Generic)¶
Please refer to the detailed installation instructions in the Installation Guide.
Installation From Source¶
If for some reason you do not want to run Cilium as a container image, installing it from source is possible as well. It does come with additional dependencies, which are described in System Requirements.
- Requirements:
Install go-bindata:
$ go get -u github.com/cilium/go-bindata/...
Add $GOPATH/bin to your $PATH:
$ # To add $GOPATH/bin in your $PATH run
$ export PATH=$GOPATH/bin:$PATH
You can also add it in your ~/.bashrc
file:
if [ -d $GOPATH/bin ]; then
    export PATH=$PATH:$GOPATH/bin
fi
- Download & extract the latest Cilium release from the Releases page
$ go get -d github.com/cilium/cilium
$ cd $GOPATH/src/github.com/cilium/cilium
- Build & install the Cilium binaries to
bindir
$ git checkout v1.1.0
$ # We are pointing to $GOPATH/bin as well since it's where go-bindata is
$ # installed
$ make build
$ sudo make install
- Optional: Install systemd init files:
sudo cp contrib/systemd/*.service /lib/systemd/system
sudo cp contrib/systemd/sys-fs-bpf.mount /lib/systemd/system
sudo mkdir -p /etc/sysconfig/cilium && sudo cp contrib/systemd/cilium /etc/sysconfig/cilium
sudo service cilium start
Advanced Options¶
This guide covers advanced installation options in a generic way that can be applied on top of all other installation guides.
The following sections will describe runtime options that can be passed on to the agent. Depending on your chosen form of installation, the steps required to modify the agent options will be different:
- Modify the DaemonSet file if you are using Kubernetes.
- Modify the relevant unit or configuration file on all nodes or adjust your configuration management scripts if you are using systemd or another init system.
Running the agent on a node without a container runtime¶
If you want to run the Cilium agent on a node that will not host any
application containers, then that node may not have a container runtime
installed at all. You may still want to run the Cilium agent on the node to
ensure that local processes on that node can reach application containers on
other nodes. The default behavior of Cilium on startup when no container
runtime has been found is to abort startup. To avoid this abort, you can run
the cilium-agent
with the following option.
--container-runtime=none
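When Cilium runs as a Kubernetes DaemonSet, one way to add this option is a JSON patch that appends it to the agent's arguments (a sketch, assuming the cilium-agent container is the first container in the pod spec and its flags are passed via args):
$ kubectl -n kube-system patch daemonset cilium --type=json \
    -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--container-runtime=none"}]'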
Kubernetes Kops Installation Guide¶
As of kops
1.9 release, Cilium can be plugged into kops
-deployed clusters as the CNI plugin. This guide provides steps to create a Kubernetes cluster on AWS using kops
and Cilium as the CNI plugin. Note, the kops
deployment will automate several deployment features in AWS by default, including AutoScaling, Volumes, VPCs, etc.
Prerequisites¶
AmazonEC2FullAccess
AmazonRoute53FullAccess
AmazonS3FullAccess
IAMFullAccess
AmazonVPCFullAccess
Installing kops¶
curl -LO https://github.com/kubernetes/kops/releases/download/$(curl -s https://api.github.com/repos/kubernetes/kops/releases/latest | grep tag_name | cut -d '"' -f 4)/kops-linux-amd64
chmod +x kops-linux-amd64
sudo mv kops-linux-amd64 /usr/local/bin/kops
# If you are on MacOS, you can use:
brew update && brew install kops
Setting up IAM Group and User¶
Assuming you have all the prerequisites, run the following commands to create the kops
user and group:
# Create IAM group named kops and grant access
aws iam create-group --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonRoute53FullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/IAMFullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonVPCFullAccess --group-name kops
aws iam create-user --user-name kops
aws iam add-user-to-group --user-name kops --group-name kops
aws iam create-access-key --user-name kops
kops requires the creation of a dedicated S3 bucket in order to store the state and representation of the cluster. You will need to change the bucket name and provide your own unique bucket name (for example, a reversed FQDN with a short description of the cluster appended). Also make sure to use the region where you will be deploying the cluster.
aws s3api create-bucket --bucket prefix-example-com-state-store --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2
export KOPS_STATE_STORE=s3://prefix-example-com-state-store
The above steps are sufficient for getting a working cluster installed. Please consult kops aws documentation for more detailed setup instructions.
Cilium Prerequisites¶
- Ensure the System Requirements are met, particularly the Linux kernel and key-value store versions.
- (Recommended) Kubernetes with CRD validation (more on this after initial setup of the cluster)
In this guide, we will use etcd version 3.1.11 and the latest CoreOS stable image which satisfies the minimum kernel version requirement of Cilium. To get the latest CoreOS ami
image, you can change the region value to your choice in the command below.
aws ec2 describe-images --region=us-west-2 --owner=595879546273 --filters "Name=virtualization-type,Values=hvm" "Name=name,Values=CoreOS-stable*" --query 'sort_by(Images,&CreationDate)[-1].{id:ImageLocation}'
{
"id": "595879546273/CoreOS-stable-1745.5.0-hvm"
}
Creating a Cluster¶
- Note that you will need to specify the --master-zones and --zones for creating the master and worker nodes. The number of master zones should be odd (1, 3, ...) for HA. For simplicity, you can just use one region.
- The cluster NAME variable should end with k8s.local to use the gossip protocol. If creating multiple clusters using the same kops user, make the cluster name unique by adding a prefix such as com-company-emailid-.
export NAME=com-company-emailid-cilium.k8s.local
export KOPS_FEATURE_FLAGS=SpecOverrideFlag
kops create cluster --state=${KOPS_STATE_STORE} --node-count 3 --node-size t2.medium --master-size t2.medium --topology private --master-zones us-west-2a,us-west-2b,us-west-2c --zones us-west-2a,us-west-2b,us-west-2c --image 595879546273/CoreOS-stable-1745.5.0-hvm --networking cilium --override "cluster.spec.etcdClusters[*].version=3.1.11" --kubernetes-version 1.10.3 --cloud-labels "Team=Dev,Owner=Admin" ${NAME}
You may be prompted to create an SSH public-private key pair:
ssh-keygen
(Please see Deleting a Cluster)
Kubernetes with CRD validation¶
Cilium recommends using CRD validation in Kubernetes. In order to enable the flag --feature-gates=CustomResourceValidation=true
, edit the cluster yaml:
kops edit cluster --name=${NAME}
and append the following kubeAPIServer: snippet to the spec: section:
spec:
  ...
  kubeAPIServer:
    featureGates:
      CustomResourceValidation: "true"
After successful editing, apply changes using kops update cluster
.
kops update cluster ${NAME} --yes
kops validate cluster
Upgrading Cilium¶
The default Cilium version deployed by kops is old. Upgrade the Cilium DaemonSet to a newer version with the following commands. The commands below illustrate the upgrade process for Kubernetes v1.10, since that is the version we created, and upgrade Cilium to v1.0.3; you can substitute any stable version vX.Y.Z. (Please consult the Cilium Upgrade Guide for more details.)
Note: In subsequent releases of kops, there will be an option to specify the Cilium version. This PR is tracking additional options for configuring Cilium CNI in a kops cluster.
kubectl delete crd ciliumendpoints.cilium.io # this ensures older CEP objects do not persist
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/HEAD/examples/kubernetes/1.10/cilium-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/HEAD/examples/kubernetes/1.10/cilium-ds.yaml
kubectl set image daemonset/cilium -n kube-system cilium-agent=docker.io/cilium/cilium:v1.0.3
kubectl rollout status daemonset/cilium -n kube-system
Testing Cilium¶
Follow the Cilium getting started guide example to test that the cluster is setup properly and that Cilium CNI and security policies are functional.
Deleting a Cluster¶
To undo the dependencies and other deployment features in AWS from the kops
cluster creation, use kops
to destroy a cluster immediately with the parameter --yes
:
kops delete cluster ${NAME} --yes
Appendix: Details of kops flags used in cluster creation¶
The following section explains all the flags used in create cluster command.
- KOPS_FEATURE_FLAGS=SpecOverrideFlag : This flag is used to override the etcd version from 2.X (the kops default) to 3.1.x (a requirement of Cilium).
- --state=${KOPS_STATE_STORE} : kops uses an S3 bucket to store the state and representation of your cluster.
- --node-count 3 : Number of worker nodes in the Kubernetes cluster.
- --node-size t2.medium : The size of the AWS EC2 instances for worker nodes.
- --master-size t2.medium : The size of the AWS EC2 instances for master nodes.
- --topology private : The cluster will be created with private topology, which means all masters/nodes will be launched in a private subnet in the VPC.
- --master-zones us-west-2a,us-west-2b,us-west-2c : The 3 zones ensure HA of the master nodes, each belonging to a different availability zone.
- --zones us-west-2a,us-west-2b,us-west-2c : Zones where the worker nodes will be deployed.
- --image 595879546273/CoreOS-stable-1745.5.0-hvm : Image to be deployed (Cilium requires kernel version 4.8 and above, so ensure you use a suitable OS for the workers).
- --networking cilium : Networking CNI plugin to be used: cilium.
- --override "cluster.spec.etcdClusters[*].version=3.1.11" : Overrides the etcd version to be used.
- --kubernetes-version 1.10.3 : Kubernetes version to be installed. Please note that kops 1.9 officially supports k8s version 1.9.
- --cloud-labels "Team=Dev,Owner=Admin" : Labels for your cluster.
- ${NAME} : Name of the cluster. Make sure the name ends with k8s.local for a gossip-based cluster.
Setting up the Cluster Mesh¶
This is a step-by-step guide on how to build a mesh of Kubernetes clusters by connecting them together and enabling pod-to-pod connectivity across all clusters while providing label-based security policies.
Note
This is a beta feature introduced in Cilium 1.2.
Prerequisites¶
- All nodes in all clusters must have IP connectivity. This requirement is typically met by establishing peering between the networks of the machines forming each cluster.
- No encryption is performed by Cilium for the connectivity between nodes. The feature is on the roadmap (details) but not implemented yet. It is however possible to set up IPSec-based encryption between all nodes using a standard guide. It is also possible and common to establish VPNs between networks if clusters are connected across untrusted networks.
- All worker nodes must have connectivity to the etcd clusters of all remote clusters. Security is implemented by connecting to etcd using TLS/SSL certificates.
- Cilium must be configured to use etcd as the kvstore. Consul is not supported by cluster mesh.
Getting Started¶
Step 1: Prepare the individual clusters¶
Setting the cluster name and ID¶
Each cluster must be assigned a unique human-readable name. The name will be
used to group nodes of a cluster. The cluster name is specified with the
--cluster-name=NAME
argument or cluster-name
ConfigMap option.
To ensure scalability of identity allocation and policy enforcement, each
cluster continues to manage its own identity allocation. In order to guarantee
compatibility with identities across clusters, each cluster is configured with
a unique cluster ID configured with the --cluster-id=ID
argument or
cluster-id
ConfigMap option.
$ kubectl -n kube-system edit cm cilium-config
[ add/edit ]
cluster-name: default
cluster-id: 1
Provide unique values for the cluster name and ID for each cluster.
Step 2: Create Secret to provide access to remote etcd¶
Clusters are connected together by providing connectivity information to the etcd key-value store to each individual cluster. This allows Cilium to synchronize state between clusters and provide cross-cluster connectivity and policy enforcement.
The connectivity details of a remote etcd typically include certificates to enable the use of TLS, which is why the entire ClusterMesh configuration is stored in a Kubernetes Secret.
- Create an etcd configuration file for each remote cluster you want to connect to (a minimal example file is sketched after this list). The syntax is that of the official etcd configuration file and is identical to the syntax used in the cilium-config ConfigMap.
- Create a secret cilium-clustermesh from all of the configuration files you have created:
$ kubectl create secret generic cilium-clustermesh --from-file=./cluster5 --from-file=./cluster7
Cilium will automatically ignore any configuration referring to its own cluster so you can create a single secret and import it into all your clusters to establish connectivity between all clusters.
cluster 1: $ kubectl -n kube-system get secret cilium-clustermesh -o yaml > clustermesh.yaml
cluster 2: $ kubectl apply -f clustermesh.yaml
Step 3: Restart the cilium agent¶
Restart Cilium in each cluster to pick up the new cluster name, cluster ID, and clustermesh secret configuration. Cilium will automatically establish connectivity between the clusters.
$ kubectl -n kube-system delete pods -l k8s-app=cilium
Step 4: Test the connectivity between clusters¶
Run cilium node list
to see the full list of nodes discovered. You can run
this command inside any Cilium pod in any cluster:
$ kubectl -n kube-system exec -ti cilium-g6btl -- cilium node list
Name IPv4 Address Endpoint CIDR IPv6 Address Endpoint CIDR
cluster5/ip-172-0-117-60.us-west-2.compute.internal 172.0.117.60 10.2.2.0/24 <nil> f00d::a02:200:0:0/112
cluster5/ip-172-0-186-231.us-west-2.compute.internal 172.0.186.231 10.2.3.0/24 <nil> f00d::a02:300:0:0/112
cluster5/ip-172-0-50-227.us-west-2.compute.internal 172.0.50.227 10.2.0.0/24 <nil> f00d::a02:0:0:0/112
cluster5/ip-172-0-51-175.us-west-2.compute.internal 172.0.51.175 10.2.1.0/24 <nil> f00d::a02:100:0:0/112
cluster7/ip-172-0-121-242.us-west-2.compute.internal 172.0.121.242 10.4.2.0/24 <nil> f00d::a04:200:0:0/112
cluster7/ip-172-0-58-194.us-west-2.compute.internal 172.0.58.194 10.4.1.0/24 <nil> f00d::a04:100:0:0/112
cluster7/ip-172-0-60-118.us-west-2.compute.internal 172.0.60.118 10.4.0.0/24 <nil> f00d::a04:0:0:0/112
$ kubectl exec -ti pod-cluster5-xxx -- curl <pod-ip-cluster7>
[...]
Using kube-router to run BGP¶
This guide explains how to configure Cilium and kube-router to cooperate, using kube-router for BGP peering and route propagation and Cilium for policy enforcement and load-balancing.
Note
This is a beta feature. Please provide feedback and file a GitHub issue if you experience any problems.
Deploy kube-router¶
Download the kube-router DaemonSet template:
curl -LO https://raw.githubusercontent.com/cloudnativelabs/kube-router/v0.2.0-beta.7/daemonset/generic-kuberouter-only-advertise-routes.yaml
Open the file generic-kuberouter-only-advertise-routes.yaml and edit the
args: section. The following arguments are required to be set to
exactly these values:
- --run-router=true
- --run-firewall=false
- --run-service-proxy=false
- --enable-cni=false
- --enable-pod-egress=false
The following arguments are optional and may be set according to your needs. For the purpose of keeping this guide simple, the following values are being used which require the least preparations in your cluster. Please see the kube-router user guide for more information.
- --enable-ibgp=true
- --enable-overlay=true
- --advertise-cluster-ip=true
- --advertise-external-ip=true
- --advertise-loadbalancer-ip=true
The following arguments are optional and should be set if you want BGP peering with an external router. This is useful if you want externally routable Kubernetes Pod and Service IPs. Note the values used here should be changed to whatever IPs and ASNs are configured on your external router.
- --cluster-asn=65001
- --peer-router-ips=10.0.0.1,10.0.0.2
- --peer-router-asns=65000,65000
Apply the DaemonSet file to deploy kube-router and verify it has come up correctly:
$ kubectl apply -f generic-kuberouter-only-advertise-routes.yaml
$ kubectl -n kube-system get pods -l k8s-app=kube-router
NAME READY STATUS RESTARTS AGE
kube-router-n6fv8 1/1 Running 0 10m
kube-router-nj4vs 1/1 Running 0 10m
kube-router-xqqwc 1/1 Running 0 10m
kube-router-xsmd4 1/1 Running 0 10m
Deploy Cilium¶
In order for routing to be delegated to kube-router, tunneling/encapsulation
must be disabled. This is done by setting the tunnel=disabled
in the
ConfigMap cilium-config
or by adjusting the DaemonSet to run the
cilium-agent
with the argument --tunnel=disabled
:
# Encapsulation mode for communication between nodes
# Possible values:
# - disabled
# - vxlan (default)
# - geneve
tunnel: "disabled"
You can then install Cilium according to the instructions in section Deploying the DaemonSet.
Ensure that Cilium is up and running:
$ kubectl -n kube-system get pods -l k8s-app=cilium
NAME READY STATUS RESTARTS AGE
cilium-fhpk2 1/1 Running 0 45m
cilium-jh6kc 1/1 Running 0 44m
cilium-rlx6n 1/1 Running 0 44m
cilium-x5x9z 1/1 Running 0 45m
Verify Installation¶
Verify that kube-router has installed routes:
$ kubectl -n kube-system exec -ti cilium-fhpk2 -- ip route list scope global
default via 172.0.32.1 dev eth0 proto dhcp src 172.0.50.227 metric 1024
10.2.0.0/24 via 10.2.0.172 dev cilium_host src 10.2.0.172
10.2.1.0/24 via 172.0.51.175 dev eth0 proto 17
10.2.2.0/24 dev tun-172011760 proto 17 src 172.0.50.227
10.2.3.0/24 dev tun-1720186231 proto 17 src 172.0.50.227
In the above example, we see three categories of routes that have been installed:
- Local PodCIDR: This route points to all pods running on the host and makes these pods reachable:
  10.2.0.0/24 via 10.2.0.172 dev cilium_host src 10.2.0.172
- BGP route: This type of route is installed if kube-router determines that the remote PodCIDR can be reached via a router known to the local host. It will instruct pod to pod traffic to be forwarded directly to that router without requiring any encapsulation:
  10.2.1.0/24 via 172.0.51.175 dev eth0 proto 17
- IPIP tunnel route: If no direct routing path exists, kube-router will fall back to using an overlay and establish an IPIP tunnel between the nodes:
  10.2.2.0/24 dev tun-172011760 proto 17 src 172.0.50.227
  10.2.3.0/24 dev tun-1720186231 proto 17 src 172.0.50.227
You can test connectivity by deploying the following connectivity checker pods:
$ kubectl create -f |SCM_WEB|/examples/kubernetes/connectivity-check/connectivity-check.yaml
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
echo-7d9f9564df-2vbpw 1/1 Running 0 26m
echo-7d9f9564df-ff8xh 1/1 Running 0 26m
echo-7d9f9564df-pnbgc 1/1 Running 0 26m
echo-7d9f9564df-sbrxh 1/1 Running 0 26m
echo-7d9f9564df-wzfrc 1/1 Running 0 26m
probe-8689f6579-7l7w7 1/1 Running 0 27m
probe-8689f6579-fvqp8 1/1 Running 0 27m
probe-8689f6579-lvjlh 1/1 Running 0 27m
probe-8689f6579-m26g8 1/1 Running 0 27m
probe-8689f6579-tzbjk 1/1 Running 0 27m
Upgrade Guide¶
Kubernetes Cilium Upgrade¶
This Cilium Kubernetes upgrade guide is divided into two sections:
- General Upgrade Workflow: Provides a conceptual understanding of how Cilium upgrade works in Kubernetes.
- Specific Upgrade Instructions: Details considerations and instructions for specific upgrades between recent Cilium versions.
General Upgrade Workflow¶
The configuration of a standard Cilium Kubernetes deployment consists of several Kubernetes resources:
- A DaemonSet resource: describes the Cilium pod that is deployed to each Kubernetes node. This pod runs the cilium-agent and associated daemons. The configuration of this DaemonSet includes the image tag indicating the exact version of the Cilium docker container (e.g., v1.0.0) and command-line options passed to the cilium-agent.
- A ConfigMap resource: describes common configuration values that are passed to the cilium-agent, such as the kvstore endpoint and credentials, enabling/disabling debug mode, etc.
- ServiceAccount, ClusterRole, and ClusterRoleBindings resources: the identity and permissions used by cilium-agent to access the Kubernetes API server when Kubernetes RBAC is enabled.
- A Secret resource: describes the credentials used to access the etcd kvstore, if required.
If you have followed the installation guide from Deploying the DaemonSet, all of the
above resources were installed via a single cilium.yaml
file.
All upgrades require at least updating the DaemonSet to point to the newer
Cilium image tag. However, safely upgrading Cilium may also require
additional changes to the DaemonSet, ConfigMap or RBAC related
resources. This depends on your current and target Cilium versions, so it is
critical to read the Specific Upgrade Instructions below referring to your target
Cilium version.
In general, the easiest way to ensure you are making all required updates to Cilium related resources is to download new versions of those resources, and apply them to your Kubernetes environment. The recommended high-level workflow is:
- Download a new cilium-cm.yaml file associated with your target version of Cilium, manually edit the file with any configuration options specific to your environment, and then apply the file to your cluster using kubectl.
- Update the ServiceAccount, ClusterRole, and ClusterRoleBindings resources by using kubectl to apply a new cilium-rbac.yaml associated with your target version of Cilium.
- Update the DaemonSet resource by applying the cilium-ds.yaml associated with your target version of Cilium.
If there are no changes required to the resources between two Cilium versions (e.g., between two patch releases in the same minor version), then it is possible to upgrade Cilium simply by editing the cilium image tag in the DaemonSet. However, this short-cut should only be done if the Specific Upgrade Instructions instructions below confirm that it is safe.
When upgrading from one minor release to another minor release, for example 1.x to 1.y, it is generally safer to upgrade first to the latest release in the 1.x.z series, then subsequently upgrade to 1.y. This way, if there is any unexpected issue during the upgrade, it can be safely rolled back to the latest good version.
Below we will show examples of how Cilium should be upgraded using Kubernetes
rolling upgrade functionality in order to preserve any existing Cilium
configuration changes (e.g., etcd configuration) and minimize network
disruptions for running workloads. These instructions upgrade Cilium to version
“queue-rtd-theme-fixups” by updating the ConfigMap
, RBAC
rules and
DaemonSet
files separately. Rather than installing all configuration in one
cilium.yaml
file, which could override any custom ConfigMap
configuration, installing these files separately allows upgrade to be staged
and for user configuration to not be affected by the upgrade.
Upgrade ConfigMap (Recommended)¶
Upgrading the ConfigMap
should be done manually before upgrading the RBAC
and the DaemonSet
. Upgrading the ConfigMap
first will not affect current
Cilium pods as the new ConfigMap
configurations are only used when a pod is
restarted.
- To update your current ConfigMap, store it locally so you can modify it:
$ kubectl get configmap -n kube-system cilium-config -o yaml --export > cilium-cm-old.yaml
$ cat ./cilium-cm-old.yaml
apiVersion: v1
data:
  clean-cilium-state: "false"
  debug: "true"
  disable-ipv4: "false"
  etcd-config: |-
    ---
    endpoints:
    - https://192.168.33.11:2379
    #
    # In case you want to use TLS in etcd, uncomment the 'ca-file' line
    # and create a kubernetes secret by following the tutorial in
    # https://cilium.link/etcd-config
    ca-file: '/var/lib/etcd-secrets/etcd-ca'
    #
    # In case you want client to server authentication, uncomment the following
    # lines and add the certificate and key in cilium-etcd-secrets below
    key-file: '/var/lib/etcd-secrets/etcd-client-key'
    cert-file: '/var/lib/etcd-secrets/etcd-client-crt'
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: cilium-config
  selfLink: /api/v1/namespaces/kube-system/configmaps/cilium-config
In the ConfigMap above, we can verify that Cilium has debug set to true, that it has an etcd endpoint configured with TLS, and that etcd is set up for client-to-server authentication.
- Download the
ConfigMap
with the changes for “queue-rtd-theme-fixups”:
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.7/cilium-cm.yaml
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.8/cilium-cm.yaml
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.9/cilium-cm.yaml
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.10/cilium-cm.yaml
$ wget https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.11/cilium-cm.yaml
Verify its contents:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # This etcd-config contains the etcd endpoints of your cluster. If you use
  # TLS please make sure you follow the tutorial in https://cilium.link/etcd-config
  etcd-config: |-
    ---
    endpoints:
    - http://127.0.0.1:31079
    #
    # In case you want to use TLS in etcd, uncomment the 'ca-file' line
    # and create a kubernetes secret by following the tutorial in
    # https://cilium.link/etcd-config
    #ca-file: '/var/lib/etcd-secrets/etcd-ca'
    #
    # In case you want client to server authentication, uncomment the following
    # lines and create a kubernetes secret by following the tutorial in
    # https://cilium.link/etcd-config
    #key-file: '/var/lib/etcd-secrets/etcd-client-key'
    #cert-file: '/var/lib/etcd-secrets/etcd-client-crt'
  # If you want to run cilium in debug mode change this value to true
  debug: "false"
  disable-ipv4: "false"
  # If you want to clean cilium state; change this value to true
  clean-cilium-state: "false"
  legacy-host-allows-world: "false"
  # If you want cilium monitor to aggregate tracing for packets, set this level
  # to "low", "medium", or "maximum". The higher the level, the less packets
  # that will be seen in monitor output.
  monitor-aggregation-level: "none"
  # Regular expression matching compatible Istio sidecar istio-proxy
  # container image names
  sidecar-istio-proxy-image: "cilium/istio_proxy"
  # Encapsulation mode for communication between nodes
  # Possible values:
  #   - disabled
  #   - vxlan (default)
  #   - geneve
  tunnel: "vxlan"
  # Name of the cluster. Only relevant when building a mesh of clusters.
  cluster-name: default
  # Unique ID of the cluster. Must be unique across all connected clusters and
  # in the range of 1 and 255. Only relevant when building a mesh of clusters.
  #cluster-id: 1
3. Add the new options manually to your old ConfigMap
, and make the necessary
changes.
In this example, the debug option is meant to remain true, the
etcd-config is kept unchanged, and legacy-host-allows-world is a new
option whose value, after reading the Upgrade notes, was kept at the default.
After making the necessary changes, the old ConfigMap
was migrated with the
new options while keeping the configuration that we wanted:
$ cat ./cilium-cm-old.yaml
apiVersion: v1
data:
  debug: "true"
  disable-ipv4: "false"
  # If you want to clean cilium state; change this value to true
  clean-cilium-state: "false"
  legacy-host-allows-world: "false"
  etcd-config: |-
    ---
    endpoints:
    - https://192.168.33.11:2379
    #
    # In case you want to use TLS in etcd, uncomment the 'ca-file' line
    # and create a kubernetes secret by following the tutorial in
    # https://cilium.link/etcd-config
    ca-file: '/var/lib/etcd-secrets/etcd-ca'
    #
    # In case you want client to server authentication, uncomment the following
    # lines and add the certificate and key in cilium-etcd-secrets below
    key-file: '/var/lib/etcd-secrets/etcd-client-key'
    cert-file: '/var/lib/etcd-secrets/etcd-client-crt'
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: cilium-config
  selfLink: /api/v1/namespaces/kube-system/configmaps/cilium-config
After adding the options, manually save the file with your changes and install
the ConfigMap
in the kube-system
namespace of your cluster.
$ kubectl apply -n kube-system -f ./cilium-cm-old.yaml
Once the ConfigMap is successfully upgraded, we can start upgrading the Cilium
DaemonSet and RBAC, which will pick up the latest configuration from the
ConfigMap.
Upgrade Cilium DaemonSet and RBAC¶
There are two methods to upgrade the Cilium DaemonSet
:
- Full upgrade of the
RBAC
andDaemonSet
resources: This is the safest option, which pulls in the latest configuration options for the Cilium daemon. If in doubt, use this approach. - Set the version in the existing
DaemonSet
: A simpler upgrade procedure which does not update the Daemon options.
Refer to the section Specific Upgrade Instructions for more details on which approach is relevant for the target Cilium version.
The following sections describe how to upgrade using either of the above approaches, then how to monitor (and if necessary, roll back) the upgrade process.
Full Upgrade of RBAC and DaemonSet¶
Simply pick your Kubernetes version and run kubectl apply
for the RBAC
and the DaemonSet
.
Both files are dedicated to “queue-rtd-theme-fixups” for each Kubernetes version.
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.7/cilium-rbac.yaml
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.7/cilium-ds.yaml
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.8/cilium-rbac.yaml
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.8/cilium-ds.yaml
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.9/cilium-rbac.yaml
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.9/cilium-ds.yaml
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.10/cilium-rbac.yaml
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.10/cilium-ds.yaml
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.11/cilium-rbac.yaml
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/queue-rtd-theme-fixups/examples/kubernetes/1.11/cilium-ds.yaml
Direct version upgrade¶
Note
Direct version upgrade is not recommended for major or minor version upgrades. Upgrade using the Full Upgrade of RBAC and DaemonSet instructions instead.
You can alternatively substitute the desired Cilium version number for
vX.Y.Z in the command below, but be aware that the copy of the spec file
stored in Kubernetes might run out of sync with the CLI flags, or options,
expected by each Cilium version.
$ kubectl set image daemonset/cilium -n kube-system cilium-agent=docker.io/cilium/cilium:vX.Y.Z
Monitor the upgrade procedure¶
To monitor the rollout and confirm it is complete, run:
$ kubectl rollout status daemonset/cilium -n kube-system
During the upgrade roll-out, Cilium will typically continue to forward traffic at L3/L4, and all endpoints and their configuration will be preserved across the upgrade. However, because the L7 proxies implementing HTTP, gRPC, and Kafka-aware filtering currently reside in the same Pod as Cilium, they are removed and re-installed as part of the rollout. As a result, any proxied connections will be lost and clients must reconnect.
Rolling back the upgrade (Typically unnecessary)¶
Occasionally, it may be necessary to undo the rollout because a step was missed or something went wrong during upgrade. To undo the rollout via rollback, run:
$ kubectl rollout undo daemonset/cilium -n kube-system
This will revert the latest changes to the Cilium DaemonSet
and return
Cilium to the state it was in prior to the upgrade.
Specific Upgrade Instructions¶
This section documents the specific steps required for upgrading from one version of Cilium to another. The Cilium developers suggest particular version transitions to avoid known issues during upgrade; the sections that follow cover specific upgrade transitions, ordered by version.
The table below lists suggested upgrade transitions, from a specified current
version running in a cluster to a specified target version. If a specific
combination is not listed in the table below, then it may not be safe. In that
case, consider staging the upgrade, for example upgrading from 1.1.x
to the
latest 1.1.y
release before subsequently upgrading to 1.2.z
.
Current version | Target version | DaemonSet upgrade | L3 impact | L7 impact |
---|---|---|---|---|
1.0.x | 1.0.y | Not required | N/A | Clients must reconnect[1] |
1.0.x | 1.1.y | Required | N/A | Clients must reconnect[1] |
1.1.x | 1.1.y | Not required | N/A | Clients must reconnect[1] |
1.1.x, x >= 3 | 1.2.y | Required | Temporary disruption[2] | Clients must reconnect[1] |
Annotations:
- Clients must reconnect: Any traffic flowing via a proxy (for example, because an L7 policy is in place) will be disrupted during upgrade. Endpoints communicating via the proxy must reconnect to re-establish connections.
- Temporary disruption: All traffic may be temporarily disrupted during upgrade. Connections should successfully re-establish without requiring clients to reconnect.
Upgrading to the Cilium 1.2 series¶
The latest version in the Cilium 1.2 series can be found here
Upgrading to Cilium 1.2.x from Cilium 1.2.y¶
Set the version to the desired release per Direct version upgrade.
Upgrading to Cilium 1.2.x from Cilium 1.1.y or 1.0.z¶
Note
Users running Linux 4.10 or earlier with Cilium CIDR policies may face Restrictions on unique prefix lengths for CIDR policy rules.
- Upgrade to Cilium 1.1.3 or later using the instructions below.
- Upgrade ConfigMap (Recommended). New options in Cilium 1.2:
  - cluster-name
  - cluster-id
  - monitor-aggregation-level
See the example ConfigMap for more details.
Upgrading to the Cilium 1.1 series¶
The latest version in the Cilium 1.1 series can be found here
Upgrading to Cilium 1.1.x from Cilium 1.1.y¶
Set the version to the desired release per Direct version upgrade.
Upgrading to Cilium 1.1.x from Cilium 1.0.y¶
Note
Users running Linux 4.10 or earlier with Cilium CIDR policies may face Restrictions on unique prefix lengths for CIDR policy rules.
Follow the guide in MTU handling behavior change in Cilium 1.1 to update the MTU of existing endpoints.
Upgrade ConfigMap (Recommended).
New options in Cilium 1.1:
legacy-host-allows-world
: This is recommended to be set to false. For more information, see Traffic from world to endpoints is classified as from host.sidecar-istio-proxy-image
Deprecated options in Cilium 1.1:
sidecar-http-proxy
See the example Cilium 1.1 ConfigMap for more details.
Follow the instructions for Full Upgrade of RBAC and DaemonSet, using the desired version rather than “queue-rtd-theme-fixups”.
Downgrading to Cilium 1.1.x from Cilium 1.2.y¶
When downgrading from Cilium 1.2, the target version must be Cilium 1.1.3 or later.
Check whether you have any DNS policy rules installed:
$ kubectl get cnp --all-namespaces -o yaml | grep "fqdn"
If any DNS rules exist, these must be removed prior to downgrade as these rules are not supported by Cilium 1.1.
Follow the instructions for Full Upgrade of RBAC and DaemonSet, using the desired version rather than “queue-rtd-theme-fixups”.
Downgrading to Cilium 1.1.x from Cilium 1.1.y¶
Set the version to the desired release per Direct version upgrade.
Upgrading to the Cilium 1.0 series¶
The latest version in the Cilium 1.0 series can be found here
Upgrading to Cilium 1.0.x from Cilium 1.0.y¶
Set the version to the desired release per Direct version upgrade.
Upgrading to Cilium 1.0.x from older versions¶
Versions of Cilium older than 1.0.0 are unsupported for upgrade. The General Upgrade Workflow may work; however, it may be more reliable to start again from the Installation Guides.
Downgrading to Cilium 1.0.x from Cilium 1.1.y¶
Check whether you have any CIDR policy rules installed:
$ kubectl get cnp --all-namespaces -o yaml
- If any CIDR rules match on the CIDR prefix /0, these must be removed prior to downgrade as these rules are not supported by Cilium 1.0.
- If any CIDR rules also specify a toPorts section, these must be removed prior to downgrade as these rules are not supported by Cilium 1.0.
Follow the instructions for Full Upgrade of RBAC and DaemonSet, using the desired version rather than “queue-rtd-theme-fixups”.
Downgrading to Cilium 1.0.x from Cilium 1.0.y¶
Set the version to the desired release per Direct version upgrade.
Downgrade¶
Occasionally, when encountering issues with a particular version of Cilium, it may be useful to downgrade an instance or deployment instead. The above instructions may be used, replacing the "queue-rtd-theme-fixups" version with the desired version.
Particular versions of Cilium may introduce new features, however, so if Cilium is configured with the newer feature, and a downgrade is performed, then the downgrade may leave Cilium in a bad state. Below is a table of features which have been introduced in later versions of Cilium. If you are using a feature in the below table, then a downgrade cannot be safely implemented unless you also disable the usage of the feature.
Feature | Minimum version | Mitigation | Feature Link |
---|---|---|---|
CIDR policies matching on IPv6 prefix ranges | v1.0.2 | Remove policies that contain IPv6 CIDR rules | PR 4004 |
CIDR policies matching on default prefix | v1.1.0 | Remove policies that match a /0 prefix | PR 4458 |
CIDR-dependent L4 policies | v1.1.0 | Remove rules with both toPorts and toCIDR from policy | PR 3835 |
Monitor Aggregation | v1.2.0 | Re-install DaemonSet from target version | PR 5118 |
Upgrade notes¶
This section describes known issues and limitations with released versions of Cilium which may require user interaction to mitigate or remediate.
Restrictions on unique prefix lengths for CIDR policy rules¶
The Linux kernel applies limitations on the complexity of BPF code that is loaded into the kernel so that the code may be verified as safe to execute on packets. Over time, Linux releases become more intelligent about the verification of programs which allows more complex programs to be loaded. However, the complexity limitations affect some features in Cilium depending on the kernel version that is used with Cilium.
One such limitation affects Cilium’s configuration of CIDR policies. On Linux kernels 4.10 and earlier, this manifests as a restriction on the number of unique prefix lengths supported in CIDR policy rules.
Unique prefix lengths are counted by looking at the prefix portion of CIDR
rules and considering which prefix lengths are unique. For example, in the
following policy example, the toCIDR
section specifies a /32
, and the
toCIDRSet
section specifies a /8
with a /12
removed from it. In
addition, three prefix lengths are always counted: the host prefix length for
the protocol (IPv4: /32
, IPv6: /128
), the default prefix length
(/0
), and the cluster prefix length (default IPv4: /8
, IPv6: /64
).
All in all, the following example counts as seven unique prefix lengths in IPv4:
- /32 (from toCIDR, also from host prefix)
- /12 (from toCIDRSet)
- /11 (from toCIDRSet)
- /10 (from toCIDRSet)
- /9 (from toCIDRSet)
- /8 (from cluster prefix)
- /0 (from default prefix)
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "cidr-rule"
spec:
  endpointSelector:
    matchLabels:
      app: myService
  egress:
  - toCIDR:
    - 20.1.1.1/32
  - toCIDRSet:
    - cidr: 10.0.0.0/8
      except:
      - 10.96.0.0/12
[{
  "labels": [{"key": "name", "value": "cidr-rule"}],
  "endpointSelector": {"matchLabels": {"app": "myService"}},
  "egress": [{
    "toCIDR": [
      "20.1.1.1/32"
    ]
  }, {
    "toCIDRSet": [{
      "cidr": "10.0.0.0/8",
      "except": [
        "10.96.0.0/12"
      ]
    }]
  }]
}]
Affected versions¶
- Any version of Cilium running on Linux 4.10 or earlier
When a CIDR policy with too many unique prefix lengths is imported, Cilium will reject the policy with a message like the following:
$ cilium policy import too_many_cidrs.json
Error: Cannot import policy: [PUT /policy][500] putPolicyFailure too many
egress CIDR prefix lengths (128/40)
$ cilium policy import too_many_cidrs.json
Error: Cannot import policy: [PUT /policy][500] putPolicyFailure Adding
specified prefixes would result in too many prefix lengths (current: 3,
result: 32, max: 18)
The supported count of unique prefix lengths may differ between Cilium minor releases; for example, Cilium 1.1 supports 20 unique prefix lengths on Linux 4.10 or older, while Cilium 1.2 only supports 18 (for IPv4) or 4 (for IPv6).
Mitigation¶
Users may construct CIDR policies that use fewer unique prefix lengths. This can be achieved by composing or decomposing adjacent prefixes: for example, the two /9 prefixes 10.0.0.0/9 and 10.128.0.0/9 can be composed into the single /8 prefix 10.0.0.0/8.
Solution¶
Upgrade the host Linux version to 4.11 or later. This step is beyond the scope of the Cilium guide.
Traffic from world to endpoints is classified as from host¶
In Cilium 1.0, all traffic from the host, including from local processes and
traffic that is masqueraded from the outside world to the host IP, would be
classified as from the host
entity (reserved:host
label).
Furthermore, to allow Kubernetes agents to perform health checks over IP into
the endpoints, the host is allowed by default. This means that all traffic from
the outside world is also allowed by default, regardless of security policy.
Affected versions¶
- Cilium 1.0 or earlier deployed using the DaemonSet and ConfigMap YAMLs provided with that release, or
- Later versions of Cilium deployed using the YAMLs provided with Cilium 1.0 or earlier.
Affected environments will see no output for one or more of the below commands:
$ kubectl get ds cilium -n kube-system -o yaml | grep -B 3 -A 2 -i legacy-host-allows-world
$ kubectl get cm cilium-config -n kube-system -o yaml | grep -i legacy-host-allows-world
Unaffected environments will see the following output (note the configMapKeyRef key in the Cilium DaemonSet and the legacy-host-allows-world: "false"
setting of the ConfigMap):
$ kubectl get ds cilium -n kube-system -o yaml | grep -B 3 -A 2 -i legacy-host-allows-world
- name: CILIUM_LEGACY_HOST_ALLOWS_WORLD
valueFrom:
configMapKeyRef:
name: cilium-config
optional: true
key: legacy-host-allows-world
$ kubectl get cm cilium-config -n kube-system -o yaml | grep -i legacy-host-allows-world
legacy-host-allows-world: "false"
Mitigation¶
Users who are not reliant upon IP-based health checks for their kubernetes pods
may mitigate this issue on earlier versions of Cilium by adding the argument
--allow-localhost=policy
to the Cilium DaemonSet for the Cilium container.
This prevents the automatic insertion of L3 allow policy in kubernetes
environments. Note however that with this option, if the Cilium Network Policy
allows traffic from the host, then it will still allow access from the outside
world.
$ kubectl edit ds cilium -n kube-system
(Edit the "args" section to add the option "--allow-localhost=policy")
$ kubectl rollout status daemonset/cilium -n kube-system
(Wait for kubernetes to redeploy Cilium with the new options)
Solution¶
Cilium 1.1 and later only classify traffic from a process on the local host as
from the host
entity; other traffic that is masqueraded to the host IP is
now classified as from the world
entity (reserved:world
label).
Fresh deployments using the Cilium 1.1 YAMLs are not affected.
Affected users are recommended to upgrade using the steps below.
Upgrade steps¶
Redeploy the Cilium DaemonSet with the YAMLs provided with the Cilium 1.1 or later release. The instructions for this are found at the top of the Upgrade Guide.
Add the config option
legacy-host-allows-world: "false"
to the Cilium ConfigMap under the “data” section:
$ kubectl edit configmap cilium-config -n kube-system
(Add a new line with the config option above in the "data" section)
(Optional) Update the Cilium Network Policies to allow specific traffic from the outside world. For more information, see Network Policy.
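As an example of the optional last step, the following sketch admits traffic from outside the cluster to endpoints carrying a hypothetical role=public label (see the Entities Based section of the Network Policy chapter for details on the world entity):
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-world-to-public"
spec:
  endpointSelector:
    matchLabels:
      role: public
  ingress:
  - fromEntities:
    - world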
MTU handling behavior change in Cilium 1.1¶
Cilium 1.0 by default configured the MTU of all Cilium-related devices and
endpoint devices to 1450 bytes, to guarantee that packets sent from an endpoint
would remain below the MTU of a tunnel. This had the side-effect that when a
Cilium-managed pod made a request to an outside (world) IP, if the response
came back in 1500B chunks, then it would be fragmented when transmitted to the
cilium_host
device. These fragments then pass through the Cilium policy
logic. Subsequent IP fragments would not contain L4 ports, so if any L4 or L4+L7
policy was applied to the destination endpoint, then the fragments would be
dropped. This could cause disruption to network traffic.
Mitigation¶
There is no known mitigation for users running Cilium 1.0 at this time.
Solution¶
Cilium 1.1 fixes the above issue by increasing the MTU of the Cilium-related devices and endpoint devices to 1500B (or larger based on container runtime settings), then configuring a route within the endpoint at a lower MTU to ensure that transmitted packets will fit within tunnel encapsulation. This addresses the above issue for all new pods.
The MTU for endpoints deployed on Cilium 1.0 must be updated to remediate this issue. Users are recommended to follow the below upgrade instructions prior to upgrading to Cilium 1.1 to prepare the endpoints for the new MTU behavior.
Upgrade Steps¶
The mtu-update tool is provided as a Kubernetes DaemonSet
to assist the
live migration of applications from the Cilium 1.0 MTU handling behavior to the
Cilium 1.1 or later MTU handling behavior. To prevent any packet loss during
upgrade, these steps should be followed before upgrading to Cilium 1.1;
however, they are also safe to run after upgrade.
To deploy the mtu-update DaemonSet:
$ kubectl create -f https://raw.githubusercontent.com/cilium/mtu-update/v1.1/mtu-update.yaml
This will deploy the mtu-update daemon on each node in your cluster, where it will proceed to search for Cilium-managed pods and update the MTU inside these pods to match the Cilium 1.1 behavior.
To determine whether this was successful:
$ kubectl get ds mtu-update -n kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
mtu-update 1 1 1 1 1 <none> 18s
When the DESIRED
count matches the READY
count, the MTU has been
successfully updated for running pods. It is now safe to remove the
mtu-update daemonset:
$ kubectl delete -f https://raw.githubusercontent.com/cilium/mtu-update/v1.1/mtu-update.yaml
For more information, visit the mtu-update website.
Network Policy¶
This chapter documents the policy language used to configure network policies in Cilium. Security policies can be specified and imported via the following mechanisms:
- Using Kubernetes NetworkPolicy and CiliumNetworkPolicy resources. See the section Network Policy for more details. In this mode, Kubernetes will automatically distribute the policies to all agents.
- Directly imported into the agent via the CLI or the API Reference of the agent (see the example below). This method does not automatically distribute policies to all agents; it is the responsibility of the user to import the policy into all required agents.
New in version future: Use of the KVstore to distribute security policies is on the roadmap but has not been implemented yet.
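For the second mechanism, a policy written as a JSON file can be loaded into a single agent with the cilium CLI. A minimal sketch (the file name is illustrative), which has to be repeated on every agent that should enforce the rule:
$ cilium policy import my-policy.json
$ cilium policy get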
Policy Enforcement Modes¶
The configuration of the Cilium agent and the Cilium Network Policy determines whether an endpoint accepts traffic from a source or not. The agent can be put into the following three policy enforcement modes:
- default
This is the default behavior for policy enforcement when Cilium is launched without any specified value for the policy enforcement configuration. The following rules apply:
- If any rule selects an Endpoint and the rule has an ingress section, the endpoint goes into default deny at ingress.
- If any rule selects an Endpoint and the rule has an egress section, the endpoint goes into default deny at egress.
This means that endpoints start without any restrictions; as soon as a rule restricts their ability to receive traffic on ingress or to transmit traffic on egress, the endpoint goes into whitelisting mode and all traffic must be explicitly allowed.
- always
- With always mode, policy enforcement is enabled on all endpoints even if no rules select specific endpoints.
- never
- With never mode, policy enforcement is disabled on all endpoints, even if rules do select specific endpoints. In other words, all traffic is allowed from any source (on ingress) or destination (on egress).
To configure the policy enforcement mode at runtime for all endpoints managed by a Cilium agent, use:
$ cilium config PolicyEnforcement={default,always,never}
If you want to configure the policy enforcement mode at start-time for a particular agent, provide the following flag when launching the Cilium daemon:
$ cilium-agent --enable-policy={default,always,never} [...]
Similarly, you can enable the policy enforcement mode across a Kubernetes cluster by including the parameter above in the Cilium DaemonSet.
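A sketch of how the flag might be added to the cilium-agent container in the DaemonSet (the container name and the surrounding fields are abbreviated assumptions; consult the DaemonSet YAML shipped with your Cilium release for the full spec):
spec:
  template:
    spec:
      containers:
      - name: cilium-agent
        args:
        - "--enable-policy=always"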
Rule Basics¶
All policy rules are based upon a whitelist model, that is, each rule in the policy allows traffic that matches the rule. If two rules exist, and one would match a broader set of traffic, then all traffic matching the broader rule will be allowed. If there is an intersection between two or more rules, then traffic matching the union of those rules will be allowed. Finally, if traffic does not match any of the rules, it will be dropped pursuant to the Policy Enforcement Modes.
Policy rules share a common base type which specifies which endpoints the rule applies to and common metadata to identify the rule. Each rule is split into an ingress section and an egress section. The ingress section contains the rules which must be applied to traffic entering the endpoint, and the egress section contains rules applied to traffic coming from the endpoint matching the endpoint selector. Either ingress, egress, or both can be provided. If both ingress and egress are omitted, the rule has no effect.
type Rule struct {
// EndpointSelector selects all endpoints which should be subject to
// this rule. Cannot be empty.
EndpointSelector EndpointSelector `json:"endpointSelector"`
// Ingress is a list of IngressRule which are enforced at ingress.
// If omitted or empty, this rule does not apply at ingress.
//
// +optional
Ingress []IngressRule `json:"ingress,omitempty"`
// Egress is a list of EgressRule which are enforced at egress.
// If omitted or empty, this rule does not apply at egress.
//
// +optional
Egress []EgressRule `json:"egress,omitempty"`
// Labels is a list of optional strings which can be used to
// re-identify the rule or to store metadata. It is possible to lookup
// or delete strings based on labels. Labels are not required to be
// unique, multiple rules can have overlapping or identical labels.
//
// +optional
Labels labels.LabelArray `json:"labels,omitempty"`
// Description is a free form string, it can be used by the creator of
// the rule to store human readable explanation of the purpose of this
// rule. Rules cannot be identified by comment.
//
// +optional
Description string `json:"description,omitempty"`
}
- endpointSelector
- Selects the endpoints which the policy rules apply to. The policy rules will be applied to all endpoints which match the labels specified in the Endpoint Selector. See the Endpoint Selector section for additional details.
- ingress
- List of rules which must apply at ingress of the endpoint, i.e. to all network packets which are entering the endpoint.
- egress
- List of rules which must apply at egress of the endpoint, i.e. to all network packets which are leaving the endpoint.
- labels
- Labels are used to identify the rule. Rules can be listed and deleted by labels. Policy rules which are imported via Kubernetes automatically get the label io.cilium.k8s.policy.name=NAME assigned, where NAME corresponds to the name specified in the NetworkPolicy or CiliumNetworkPolicy resource.
- description
- Description is a string which is not interpreted by Cilium. It can be used to describe the intent and scope of the rule in a human readable form.
Endpoint Selector¶
The Endpoint Selector is based on the Kubernetes LabelSelector. It is called Endpoint Selector because it only applies to labels associated with an Endpoint.
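Since the Endpoint Selector follows the Kubernetes LabelSelector semantics, both the matchLabels and the matchExpressions forms can be used. A sketch using matchExpressions (the labels and the policy name are illustrative):
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "match-expressions-example"
spec:
  endpointSelector:
    # Select endpoints whose role label is either backend or worker.
    matchExpressions:
    - key: role
      operator: In
      values:
      - backend
      - worker
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend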
Layer 3 Examples¶
The layer 3 policy establishes the base connectivity rules regarding which endpoints can talk to each other. Layer 3 policies can be specified using the following methods:
- Labels Based: This is used to describe the relationship if both endpoints are managed by Cilium and are thus assigned labels. The big advantage of this method is that IP addresses are not encoded into the policies and the policy is completely decoupled from the addressing.
- Services based: This is an intermediate form between Labels and CIDR and makes use of the services concept in the orchestration system. A good example of this is the Kubernetes concept of Service endpoints, which are automatically maintained to contain all backend IP addresses of a service. This avoids hardcoding IP addresses into the policy even if the destination endpoint is not controlled by Cilium.
- Entities Based: Entities are used to describe remote peers which can be categorized without knowing their IP addresses. This includes connectivity to the local host serving the endpoints or all connectivity to outside of the cluster.
- IP/CIDR based: This is used to describe the relationship to or from external services if the remote peer is not an endpoint. This requires hardcoding either IP addresses or subnets into the policies. This construct should be used as a last resort as it requires stable IP or subnet assignments.
- DNS based: Selects remote, non-cluster peers using DNS names converted to IPs via DNS lookups. It shares all limitations of the IP/CIDR based rules above. The current implementation simply polls the listed DNS targets without regard for TTLs, and allows traffic from IPs listed in the DNS responses.
Labels Based¶
Label-based L3 policy is used to establish policy between endpoints inside the cluster managed by Cilium. Label-based L3 policies are defined by using an Endpoint Selector inside a rule to choose what kind of traffic can be received (on ingress) or sent (on egress). An empty Endpoint Selector allows all traffic. The examples below demonstrate this in further detail.
Note
Kubernetes: See section Namespaces for details on how the Endpoint Selector applies in a Kubernetes environment with regard to namespaces.
Ingress¶
An endpoint is allowed to receive traffic from another endpoint if at least one
ingress rule exists which selects the destination endpoint with the
Endpoint Selector in the endpointSelector
field. To restrict traffic upon
ingress to the selected endpoint, the rule selects the source endpoint with the
Endpoint Selector in the fromEndpoints
field.
Simple Ingress Allow¶
The following example illustrates how to use a simple ingress rule to allow
communication from endpoints with the label role=frontend
to endpoints with
the label role=backend
.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "l3-rule"
spec:
endpointSelector:
matchLabels:
role: backend
ingress:
- fromEndpoints:
- matchLabels:
role: frontend
[{
"labels": [{"key": "name", "value": "l3-rule"}],
"endpointSelector": {"matchLabels": {"role":"backend"}},
"ingress": [{
"fromEndpoints": [
{"matchLabels":{"role":"frontend"}}
]
}]
}]
Ingress Allow All¶
An empty Endpoint Selector will select all endpoints, thus writing a rule that will allow all ingress traffic to an endpoint may be done as follows:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "allow-all-to-victim"
spec:
endpointSelector:
matchLabels:
role: victim
ingress:
- fromEndpoints:
- {}
[{
"labels": [{"key": "name", "value": "allow-all-to-victim"}],
"endpointSelector": {"matchLabels": {"role":"victim"}},
"ingress": [{
"fromEndpoints": [
{"matchLabels":{}}
]
}]
}]
Note that while the above examples allow all ingress traffic to an endpoint, this does not mean that all endpoints are allowed to send traffic to this endpoint per their policies. In other words, policy must be configured on both sides (sender and receiver).
Egress¶
An endpoint is allowed to send traffic to another endpoint if at least one
egress rule exists which selects the destination endpoint with the
Endpoint Selector in the endpointSelector
field. To restrict traffic upon
egress to the selected endpoint, the rule selects the destination endpoint with
the Endpoint Selector in the toEndpoints
field.
Simple Egress Allow¶
The following example illustrates how to use a simple egress rule to allow
communication to endpoints with the label role=backend
from endpoints with
the label role=frontend
.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "l3-egress-rule"
spec:
endpointSelector:
matchLabels:
role: frontend
egress:
- toEndpoints:
- matchLabels:
role: backend
[{
"labels": [{"key": "name", "value": "l3-egress-rule"}],
"endpointSelector": {"matchLabels": {"role":"frontend"}},
"egress": [{
"toEndpoints": [
{"matchLabels":{"role":"backend"}}
]
}]
}]
Egress Allow All¶
An empty Endpoint Selector will select all endpoints, thus writing a rule that will allow all egress traffic from an endpoint may be done as follows:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "allow-all-from-frontend"
spec:
endpointSelector:
matchLabels:
role: frontend
egress:
- toEndpoints:
- {}
[{
"labels": [{"key": "name", "value": "allow-all-from-frontend"}],
"endpointSelector": {"matchLabels": {"role":"frontend"}},
"egress": [{
"toEndpoints": [
{"matchLabels":{}}
]
}]
}]
Note that while the above examples allow all egress traffic from an endpoint, the receivers of the egress traffic may have ingress rules that deny the traffic. In other words, policy must be configured on both sides (sender and receiver).
Ingress/Egress Default Deny¶
An endpoint can be put into the default deny mode at ingress or egress if a rule selects the endpoint and contains the respective rule section ingress or egress.
Note
Any rule selecting the endpoint will have this effect; this example illustrates how to put an endpoint into default deny mode without whitelisting other peers at the same time.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "deny-all-egress"
spec:
endpointSelector:
matchLabels:
role: restricted
egress:
- {}
[{
"labels": [{"key": "name", "value": "deny-all-egress"}],
"endpointSelector": {"matchLabels": {"role":"restricted"}},
"egress": [{}]
}]
Additional Label Requirements¶
It is often necessary to apply the principle of separation of concerns when defining policies. For this reason, an additional construct exists which allows establishing base requirements that must be met for any connectivity to happen.
For this purpose, the fromRequires
field can be used to establish label
requirements which serve as a foundation for any fromEndpoints
relationship. fromRequires
is a list of additional constraints which must
be met in order for the selected endpoints to be reachable. These additional
constraints do not grant access privileges by themselves, so to allow traffic
there must also be rules which match fromEndpoints
. The same applies for
egress policies, with toRequires
and toEndpoints
.
The purpose of this rule is to allow establishing base requirements such as: any
endpoint in env=prod
can only be accessed if the source endpoint also carries
the label env=prod
.
This example shows how to require every endpoint with the label env=prod
to
be only accessible if the source endpoint also has the label env=prod
.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "For endpoints with env=prod, only allow if source also has label env=prod"
metadata:
name: "requires-rule"
specs:
- endpointSelector:
matchLabels:
env: prod
ingress:
- fromRequires:
- matchLabels:
env: prod
[{
"labels": [{"key": "name", "value": "requires-rule"}],
"endpointSelector": {"matchLabels": {"env":"prod"}},
"ingress": [{
"fromRequires": [
{"matchLabels":{"env":"prod"}}
]
}]
}]
Services based¶
Services running in your cluster can be whitelisted in Egress rules. Currently, Kubernetes Services without a Selector are supported; they can be referenced either by name and namespace or by a label selector. Future versions of Cilium will support specifying non-Kubernetes services and Kubernetes services which are backed by pods.
This example shows how to allow all endpoints with the label id=app2
to talk to all endpoints of kubernetes service myservice
in kubernetes
namespace default
.
Note
These rules will only take effect on Kubernetes services without a selector.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "service-rule"
spec:
endpointSelector:
matchLabels:
id: app2
egress:
- toServices:
- k8sService:
serviceName: myservice
namespace: default
[{
"labels": [{"key": "name", "value": "service-rule"}],
"endpointSelector": {
"matchLabels": {
"id": "app2"
}
},
"egress": [
{
"toServices": [
{
"k8sService": {
"serviceName": "myservice",
"namespace": "default"
}
}
]
}
]
}]
This example shows how to allow all endpoints with the label id=app2
to talk to all endpoints of all kubernetes headless services which
have head:none
set as the label.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "service-labels-rule"
spec:
endpointSelector:
matchLabels:
id: app2
egress:
- toServices:
- k8sServiceSelector:
selector:
matchLabels:
head: none
[{
"labels": [{"key": "name", "value": "service-labels-rule"}],
"endpointSelector": {
"matchLabels": {
"id": "app2"
}
},
"egress": [
{
"toServices": [
{
"k8sServiceSelector": {
"selector": {
"matchLabels": {
"head": "none"
}
}
}
}
]
}
]
}
]
Entities Based¶
fromEntities
is used to describe the entities that can access the selected
endpoints. toEntities
is used to describe the entities that can be accessed
by the selected endpoints.
The following entities are defined:
- host
- The local host serving the endpoint. On ingress, this also includes the host of other Cilium cluster nodes.
- world
- All traffic outside of the cluster.
- all
- All traffic both within the cluster and outside of the cluster.
New in version future: Allowing users to define custom identities is on the roadmap but has not been implemented yet.
Access to/from local host¶
Allow all endpoints with the label env=dev
to access the host that is
serving the particular endpoint.
Note
Kubernetes will automatically allow all communication from and to the
local host of all local endpoints. You can run the agent with the
option --allow-localhost=policy
to disable this behavior which
will give you control over this via policy.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "dev-to-host"
spec:
endpointSelector:
matchLabels:
env: dev
egress:
- toEntities:
- host
[{
"labels": [{"key": "name", "value": "dev-to-host"}],
"endpointSelector": {"matchLabels": {"env":"dev"}},
"egress": [{
"toEntities": ["host"]
}]
}]
Access to/from outside cluster¶
This example shows how to enable access from outside of the cluster to all
endpoints that have the label role=public
.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "from-world-to-role-public"
spec:
endpointSelector:
matchLabels:
role: public
ingress:
- fromEntities:
- world
[{
"labels": [{"key": "name", "value":"from-world-to-role-public"}],
"endpointSelector": {"matchLabels": {"role":"public"}},
"ingress": [{
"fromEntities": ["world"]
}]
}]
IP/CIDR based¶
CIDR policies are used to define policies to and from endpoints which are not managed by Cilium and thus do not have labels associated with them. These are typically external services, VMs or metal machines running in particular subnets. CIDR policy can also be used to limit access to external services, for example to limit external access to a particular IP range. CIDR policies can be applied at ingress or egress.
CIDR rules apply if Cilium cannot map the source or destination to an identity derived from endpoint labels, i.e. the Special Identities. For example, CIDR rules will apply to traffic where one side of the connection is:
- A network endpoint outside the cluster
- The host network namespace where the pod is running.
- Within the cluster prefix, but with an IP whose networking is not provided by Cilium.
Note
When running Cilium on Linux 4.10 or earlier, there are Restrictions on unique prefix lengths for CIDR policy rules.
Ingress¶
- fromCIDR
- List of source prefixes/CIDRs that are allowed to talk to all endpoints
selected by the
endpointSelector
. - fromCIDRSet
- List of source prefixes/CIDRs that are allowed to talk to all endpoints
selected by the
endpointSelector
, along with an optional list of prefixes/CIDRs per source prefix/CIDR that are subnets of the source prefix/CIDR from which communication is not allowed.
Egress¶
- toCIDR
- List of destination prefixes/CIDRs that endpoints selected by
endpointSelector
are allowed to talk to. Note that endpoints which are selected by afromEndpoints
are automatically allowed to talk to their respective destination endpoints. - toCIDRSet
- List of destination prefixes/CIDRs that are allowed to talk to all endpoints
selected by the
endpointSelector
, along with an optional list of prefixes/CIDRs per destination prefix/CIDR that are subnets of the destination prefix/CIDR to which communication is not allowed.
Allow to external CIDR block¶
This example shows how to allow all endpoints with the label app=myService
to talk to the external IP 20.1.1.1
, as well as the CIDR prefix 10.0.0.0/8
,
but not CIDR prefix 10.96.0.0/12.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "cidr-rule"
spec:
endpointSelector:
matchLabels:
app: myService
egress:
- toCIDR:
- 20.1.1.1/32
- toCIDRSet:
- cidr: 10.0.0.0/8
except:
- 10.96.0.0/12
[{
"labels": [{"key": "name", "value": "cidr-rule"}],
"endpointSelector": {"matchLabels":{"app":"myService"}},
"egress": [{
"toCIDR": [
"20.1.1.1/32"
]
}, {
"toCIDRSet": [{
"cidr": "10.0.0.0/8",
"except": [
"10.96.0.0/12"
]}
]
}]
}]
DNS based¶
toFQDNs
simplifies specifying egress policy to the IPs of remote, external
peers. The DNS lookup for each matchName
is done periodically by
cilium-agent
and the result is used to regenerate endpoint policy. This
allows tracking changing IPs or sets of IPs that may not be known a priori.
Despite the naming, the matchName
field does not have to be a
fully-qualified domain name. In cases where search domains are configured, the
DNS lookups from cilium
will not be qualified and will utilize the search
list.
The DNS lookups are repeated with an interval of 5 seconds, and are made for
A(IPv4) and AAAA(IPv6) addresses. Should a lookup fail, the most recent IP data
is used instead. An IP change will trigger a regeneration of the cilium
policy for each endpoint, and the updated IPs can be seen in the response from
cilium policy get
. Each update will also increment the per cilium-agent
policy repository revision.
toFQDNs
rules cannot contain any other L3 rules, such as toEndpoints
(under Labels Based) and toCIDRs
(under CIDR Based). They can contain
L4/L7 rules, such as toPorts
(see Layer 4 Examples) and, optionally,
with HTTP
and Kafka
sections (see Layer 7 Examples).
Note
toFQDNs
rules are marked on import with a
cilium-generated:ToFQDN-UUID
label. This is for internal
bookkeeping and can be safely ignored.
Note
The DNS resolver must be explicitly whitelisted to allow cilium-agent to perform the DNS polling. This is illustrated in the example below.
Example¶
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "to-fqdn"
spec:
endpointSelector:
matchLabels:
app: test-app
egress:
- toEndpoints:
- matchLabels:
"k8s:io.cilium.k8s.policy.serviceaccount": kube-dns
"k8s:io.kubernetes.pod.namespace": kube-system
"k8s:k8s-app": kube-dns
- toFQDNs:
- matchName: "my-remote-service.com"
[
{
"endpointSelector": {
"matchLabels": {
"app": "test-app"
}
},
"egress": [
{
"toEndpoints": [
{
"matchLabels": {
"app-type": "dns"
}
}
]
},
{
"toFQDNs": [
{
"matchName": "my-remote-service.com"
}
]
}
]
}
]
Limitations¶
The current toFQDNs
implementation is very limited. It may not behave as expected.
- The DNS polling is done from the
cilium-agent
process. This may result in different IPs being returned in the DNS response than those seen by an endpoint or pod. - The IP response is used as-is. For DNS responses that return a new IP on every query this may result in a different IP being whitelisted than the one used for current connections.
- The lookups from cilium follow the configuration of the environment it is in via /etc/resolv.conf. When running as a pod, the contents of resolv.conf are controlled via the dnsPolicy field of the pod spec (see the sketch below). When running directly on a host, it will use the host’s file. Irrespective of how the DNS lookups are configured, TTLs and caches on the resolver will impact the IPs seen by the cilium-agent lookups.
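For instance, when cilium-agent runs as a pod with host networking, routing its lookups through the cluster DNS would typically be done by setting dnsPolicy on the DaemonSet pod template. A sketch of the relevant excerpt, assuming standard Kubernetes pod spec semantics (other fields omitted):
spec:
  template:
    spec:
      hostNetwork: true
      # ClusterFirstWithHostNet sends the agent's DNS lookups to the cluster
      # DNS service even though the pod uses the host network namespace.
      dnsPolicy: ClusterFirstWithHostNet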
Layer 4 Examples¶
Limit ingress/egress ports¶
Layer 4 policy can be specified in addition to layer 3 policies or independently. It restricts the ability of an endpoint to emit and/or receive packets on a particular port using a particular protocol. If no layer 4 policy is specified for an endpoint, the endpoint is allowed to send and receive on all layer 4 ports and protocols including ICMP. If any layer 4 policy is specified, then ICMP will be blocked unless it’s related to a connection that is otherwise allowed by the policy. Layer 4 policies apply to ports after service port mapping has been applied.
Layer 4 policy can be specified at both ingress and egress using the
toPorts
field. The toPorts
field takes a PortProtocol
structure
which is defined as follows:
// PortProtocol specifies an L4 port with an optional transport protocol
type PortProtocol struct {
// Port is an L4 port number. For now the string will be strictly
// parsed as a single uint16. In the future, this field may support
// ranges in the form "1024-2048".
Port string `json:"port"`
// Protocol is the L4 protocol. If omitted or empty, any protocol
// matches. Accepted values: "TCP", "UDP", ""/"ANY"
//
// Matching on ICMP is not supported.
//
// +optional
Protocol string `json:"protocol,omitempty"`
}
Example (L4)¶
The following rule limits all endpoints with the label app=myService
to
only be able to emit packets using TCP on port 80, to any layer 3 destination:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "l4-rule"
spec:
endpointSelector:
matchLabels:
app: myService
egress:
- toPorts:
- ports:
- port: "80"
protocol: TCP
[{
"labels": [{"key": "name", "value": "l4-rule"}],
"endpointSelector": {"matchLabels":{"app":"myService"}},
"egress": [{
"toPorts": [
{"ports":[ {"port": "80", "protocol": "TCP"}]}
]
}]
}]
Labels-dependent Layer 4 rule¶
This example enables all endpoints with the label role=frontend
to
communicate with all endpoints with the label role=backend
, but they must
communicate using TCP on port 80. Endpoints with other labels will not be
able to communicate with the endpoints with the label role=backend
, and
endpoints with the label role=frontend
will not be able to communicate with
role=backend
on ports other than 80.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "l4-rule"
spec:
endpointSelector:
matchLabels:
role: backend
ingress:
- fromEndpoints:
- matchLabels:
role: frontend
toPorts:
- ports:
- port: "80"
protocol: TCP
[{
"labels": [{"key": "name", "value": "l4-rule"}],
"endpointSelector": {"matchLabels":{"role":"backend"}},
"ingress": [{
"fromEndpoints": [
{"matchLabels":{"role":"frontend"}}
],
"toPorts": [
{"ports":[ {"port": "80", "protocol": "TCP"}]}
]
}]
}]
CIDR-dependent Layer 4 Rule¶
This example enables all endpoints with the label role=crawler
to
communicate with all remote destinations inside the CIDR 192.0.2.0/24
, but
they must communicate using TCP on port 80. The policy does not allow Endpoints
without the label role=crawler
to communicate with destinations in the CIDR
192.0.2.0/24
. Furthermore, endpoints with the label role=crawler
will
not be able to communicate with destinations in the CIDR 192.0.2.0/24
on
ports other than port 80.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "cidr-l4-rule"
spec:
endpointSelector:
matchLabels:
role: crawler
egress:
- toCIDR:
- 192.0.2.0/24
toPorts:
- ports:
- port: "80"
protocol: TCP
[{
"labels": [{"key": "name", "value": "cidr-l4-rule"}],
"endpointSelector": {"matchLabels":{"role":"crawler"}},
"egress": [{
"toCIDR": [
"192.0.2.0/24"
],
"toPorts": [
{"ports":[ {"port": "80", "protocol": "TCP"}]}
]
}]
}]
Layer 7 Examples¶
Layer 7 policy rules are embedded into layer 4 rules (see Layer 4 Examples) and can be specified for ingress and egress. The L7Rules structure is a base type containing an enumeration of protocol-specific fields.
// L7Rules is a union of port level rule types. Mixing of different port
// level rule types is disallowed, so exactly one of the following must be set.
// If none are specified, then no additional port level rules are applied.
type L7Rules struct {
// HTTP specific rules.
//
// +optional
HTTP []PortRuleHTTP `json:"http,omitempty"`
// Kafka-specific rules.
//
// +optional
Kafka []PortRuleKafka `json:"kafka,omitempty"`
}
The structure is implemented as a union, i.e. only one member field can be used
per port. If multiple toPorts
rules with identical PortProtocol
select
an overlapping list of endpoints, then the layer 7 rules are combined together
if they are of the same type. If the type differs, the policy is rejected.
Each member consists of a list of application protocol rules. A layer 7 request is permitted if at least one of the rules matches. If no rules are specified, then all traffic is permitted.
If a layer 4 rule is specified in the policy, and a similar layer 4 rule with layer 7 rules is also specified, then the layer 7 portions of the latter rule will have no effect.
Note
Unlike layer 3 and layer 4 policies, violation of layer 7 rules does not result in packet drops. Instead, if possible, an application protocol specific access denied message is crafted and returned, e.g. an HTTP 403 access denied is sent back for HTTP requests which violate the policy.
Note
There is currently a max limit of 40 ports with layer 7 policies per endpoint. This might change in the future when support for ranges is added.
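To illustrate the precedence described above, the following sketch (the labels and policy name are illustrative) combines a plain L4 rule for port 80 with an L4 rule carrying an HTTP section for the same port; because the plain L4 rule already allows all traffic on port 80, the HTTP restriction in the second rule has no practical effect:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "l4-shadows-l7"
spec:
  endpointSelector:
    matchLabels:
      app: myService
  ingress:
  # Plain L4 rule: allows all traffic on TCP/80.
  - toPorts:
    - ports:
      - port: "80"
        protocol: TCP
  # Similar L4 rule with an L7 section: the HTTP restriction is ineffective
  # because the rule above already allows everything on TCP/80.
  - toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/public"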
HTTP¶
The following fields can be matched on:
- Path
- Path is an extended POSIX regex matched against the path of a request.
Currently it can contain characters disallowed from the conventional “path”
part of a URL as defined by RFC 3986. Paths must begin with a
/
. If omitted or empty, all paths are allowed. - Method
- Method is an extended POSIX regex matched against the method of a request,
e.g.
GET
,POST
,PUT
,PATCH
,DELETE
, … If omitted or empty, all methods are allowed. - Host
- Host is an extended POSIX regex matched against the host header of a request,
e.g.
foo.com
. If omitted or empty, the value of the host header is ignored. - Headers
- Headers is a list of HTTP headers which must be present in the request. If omitted or empty, requests are allowed regardless of headers present.
Allow GET /public¶
The following example allows GET requests to the URL /public from endpoints with the label env=prod to endpoints with the label app=service; requests to any other URL, or using another method, will be rejected. Requests on ports other than port 80 will be dropped.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "Allow HTTP GET /public from env=prod to app=service"
metadata:
name: "rule1"
spec:
endpointSelector:
matchLabels:
app: service
ingress:
- fromEndpoints:
- matchLabels:
env: prod
toPorts:
- ports:
- port: "80"
protocol: TCP
rules:
http:
- method: "GET"
path: "/public"
[{
"labels": [{"key": "name", "value": "rule1"}],
"endpointSelector": {"matchLabels": {"app": "service"}},
"ingress": [{
"fromEndpoints": [
{"matchLabels": {"env": "prod"}}
],
"toPorts": [{
"ports": [
{"port": "80", "protocol": "TCP"}
],
"rules": {
"HTTP": [
{
"method": "GET",
"path": "/public"
}
]
}
}]
}]
}]
All GET /path1 and PUT /path2 when header set¶
The following example limits all endpoints which carry the labels
app=myService
to only be able to receive packets on port 80 using TCP.
While communicating on this port, the only API endpoints allowed will be GET
/path1
and PUT /path2
with the HTTP header X-My-Header
set to
true
:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "l7-rule"
spec:
endpointSelector:
matchLabels:
app: myService
ingress:
- toPorts:
- ports:
- port: '80'
protocol: TCP
rules:
HTTP:
- method: GET
path: "/path1$"
- method: PUT
path: "/path2$"
headers:
- 'X-My-Header: true'
[{
"labels": [{"key": "name", "value": "l7-rule"}],
"endpointSelector": {"matchLabels":{"app":"myService"}},
"ingress": [{
"toPorts": [{
"ports": [
{"port": "80", "protocol": "TCP"}
],
"rules": {
"HTTP": [
{
"method": "GET",
"path": "/path1$"
},{
"method": "PUT",
"path": "/path2$",
"headers": ["X-My-Header: true"]
}
]
}
}]
}]
}]
Kafka (Tech Preview)¶
Note
Kafka support is currently in tech preview phase. Tech preview is functionality that has recently been added and had limited user exposure so far.
PortRuleKafka is a list of Kafka protocol constraints. All fields are optional; if all fields are empty or missing, the rule will match all Kafka messages. There are two ways to specify the Kafka rules: we can either specify a high-level “produce” or “consume” role for a topic, or specify more low-level, Kafka-protocol-specific apiKeys. Writing rules based on Kafka roles is easier and covers most common use cases; however, if more granularity is needed, users can alternatively write rules using specific apiKeys.
The following fields can be matched on:
- Role
Role is a case-insensitive string which describes a group of API keys necessary to perform certain higher-level Kafka operations such as “produce” or “consume”. A Role automatically expands into all APIKeys required to perform the specified higher-level operation. The following roles are supported:
- “produce”: Allow producing to the topics specified in the rule.
- “consume”: Allow consuming from the topics specified in the rule.
This field is incompatible with the APIKey field, i.e. APIKey and Role cannot both be specified in the same rule. If omitted or empty, and if APIKey is not specified, then all keys are allowed.
- APIKey
- APIKey is a case-insensitive string matched against the key of a request, for example “produce”, “fetch”, “createtopic”, “deletetopic”. For a more extensive list, see the Kafka protocol reference. This field is incompatible with the Role field.
- APIVersion
- APIVersion is the version matched against the api version of the Kafka message. If set, it must be a string representing a positive integer. If omitted or empty, all versions are allowed.
- ClientID
ClientID is the client identifier as provided in the request.
From Kafka protocol documentation: This is a user supplied identifier for the client application. The user can use any identifier they like and it will be used when logging errors, monitoring aggregates, etc. For example, one might want to monitor not just the requests per second overall, but the number coming from each client application (each of which could reside on multiple servers). This id acts as a logical grouping across all requests from a particular client.
If omitted or empty, all client identifiers are allowed.
- Topic
Topic is the topic name contained in the message. If a Kafka request contains multiple topics, then all topics in the message must be allowed by the policy or the message will be rejected.
This constraint is ignored if the matched request message type does not contain any topic. The maximum length of the Topic is 249 characters, which must be either
a-z
,A-Z
,0-9
,-
,.
or_
.If omitted or empty, all topics are allowed.
Allow producing to topic empire-announce using Role¶
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "enable empire-hq to produce to empire-announce and deathstar-plans"
metadata:
name: "rule1"
spec:
endpointSelector:
matchLabels:
app: kafka
ingress:
- fromEndpoints:
- matchLabels:
app: empire-hq
toPorts:
- ports:
- port: "9092"
protocol: TCP
rules:
kafka:
- role: "produce"
topic: "deathstar-plans"
- role: "produce"
topic: "empire-announce"
[{
"labels": [{"key": "name", "value": "rule1"}],
"endpointSelector": {"matchLabels": {"app": "kafka"}},
"ingress": [{
"fromEndpoints": [
{"matchLabels": {"app": "empire-hq"}}
],
"toPorts": [{
"ports": [
{"port": "9092", "protocol": "TCP"}
],
"rules": {
"kafka": [
{"role": "produce","topic": "deathstar-plans"},
{"role": "produce", "topic": "empire-announce"}
]
}
}]
}]
}]
Allow producing to topic empire-announce using apiKeys¶
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "enable empire-hq to produce to empire-announce and deathstar-plans"
metadata:
name: "rule1"
spec:
endpointSelector:
matchLabels:
app: kafka
ingress:
- fromEndpoints:
- matchLabels:
app: empire-hq
toPorts:
- ports:
- port: "9092"
protocol: TCP
rules:
kafka:
- apiKey: "apiversions"
- apiKey: "metadata"
- apiKey: "produce"
topic: "deathstar-plans"
- apiKey: "produce"
topic: "empire-announce"
[{
"labels": [{"key": "name", "value": "rule1"}],
"endpointSelector": {"matchLabels": {"app": "kafka"}},
"ingress": [{
"fromEndpoints": [
{"matchLabels": {"app": "empire-hq"}}
],
"toPorts": [{
"ports": [
{"port": "9092", "protocol": "TCP"}
],
"rules": {
"kafka": [
{"apiKey": "apiversions"},
{"apiKey": "metadata"},
{"apiKey": "produce", "topic": "deathstar-plans"},
{"apiKey": "produce", "topic": "empire-announce"}
]
}
}]
}]
}]
Kubernetes¶
This section covers Kubernetes specific network policy aspects.
Namespaces¶
Namespaces are used to create virtual clusters within a Kubernetes cluster. All Kubernetes objects including NetworkPolicy and CiliumNetworkPolicy belong to a particular namespace. Depending on how a policy is defined and created, Kubernetes namespaces are automatically taken into account:
- Network policies created and imported as CiliumNetworkPolicy CRD and NetworkPolicy apply within the namespace, i.e. the policy only applies to pods within that namespace. It is however possible to grant access to and from pods in other namespaces as described below.
- Network policies imported directly via the API Reference apply to all namespaces unless a namespace selector is specified as described below.
Note
While specification of the namespace via the label k8s:io.kubernetes.pod.namespace in the fromEndpoints and toEndpoints fields is deliberately supported, specification of the namespace in the endpointSelector is prohibited, as it would violate the namespace isolation principle of Kubernetes. The endpointSelector always applies to pods of the namespace which is associated with the CiliumNetworkPolicy resource itself.
Example: Enforce namespace boundaries¶
This example demonstrates how to enforce Kubernetes namespace-based boundaries
for the namespaces ns1
and ns2
by enabling default-deny on all pods of
either namespace and then allowing communication from all pods within the same
namespace.
Note
The example locks down ingress of the pods in ns1
and ns2
.
This means that the pods can still communicate egress to anywhere
unless the destination is in either ns1
or ns2
in which case
both source and destination have to be in the same namespace. In
order to enforce namespace boundaries at egress, the same example can
be used by specifying the rules at egress in addition to ingress.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "isolate-ns1"
namespace: ns1
spec:
endpointSelector:
matchLabels:
{}
ingress:
- fromEndpoints:
- matchLabels:
{}
---
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "isolate-ns1"
namespace: ns2
spec:
endpointSelector:
matchLabels:
{}
ingress:
- fromEndpoints:
- matchLabels:
{}
[
{
"ingress" : [
{
"fromEndpoints" : [
{
"matchLabels" : {
"k8s:io.kubernetes.pod.namespace" : "ns1"
}
}
]
}
],
"endpointSelector" : {
"matchLabels" : {
"k8s:io.kubernetes.pod.namespace" : "ns1"
}
}
},
{
"endpointSelector" : {
"matchLabels" : {
"k8s:io.kubernetes.pod.namespace" : "ns2"
}
},
"ingress" : [
{
"fromEndpoints" : [
{
"matchLabels" : {
"k8s:io.kubernetes.pod.namespace" : "ns2"
}
}
]
}
]
}
]
Example: Expose pods across namespaces¶
The following example exposes all pods with the label name=leia
in the
namespace ns1
to all pods with the label name=luke
in the namespace
ns2
.
Refer to the example YAML files for a fully functional example including pods deployed to different namespaces.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "k8s-expose-across-namespace"
namespace: ns1
spec:
endpointSelector:
matchLabels:
name: leia
ingress:
- fromEndpoints:
- matchLabels:
k8s:io.kubernetes.pod.namespace: ns2
name: luke
[{
"labels": [{"key": "name", "value": "k8s-svc-account"}],
"endpointSelector": {
"matchLabels": {"name":"leia", "k8s:io.kubernetes.pod.namespace":"ns1"}
},
"ingress": [{
"fromEndpoints": [{
"matchLabels":{"name": "luke", "k8s:io.kubernetes.pod.namespace":"ns2"}
}]
}]
}]
Example: Allow egress to kube-dns in kube-system namespace¶
The following example allows all pods in the namespace in which the policy is
created to communicate with kube-dns on port 53/UDP in the kube-system
namespace.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "allow-to-kubedns"
spec:
endpointSelector:
{}
egress:
- toEndpoints:
- matchLabels:
k8s:io.kubernetes.pod.namespace: kube-system
k8s-app: kube-dns
toPorts:
- ports:
- port: '53'
protocol: UDP
[
{
"endpointSelector" : {
"matchLabels" : {}
},
"egress" : [
{
"toEndpoints" : [
{
"matchLabels" : {
"k8s:io.kubernetes.pod.namespace" : "kube-system",
"k8s-app" : "kube-dns"
}
}
],
"toPorts" : [
{
"ports" : [
{
"port" : "53",
"protocol" : "UDP"
}
]
}
]
}
]
}
]
ServiceAccounts¶
Kubernetes Service Accounts are used to associate an identity to a pod or process managed by Kubernetes and grant identities access to Kubernetes resources and secrets. Cilium supports the specification of network security policies based on the service account identity of a pod.
The service account of a pod is either defined via the service account admission controller or can be directly specified in the Pod, Deployment, ReplicationController resource like this:
apiVersion: v1
kind: Pod
metadata:
name: my-pod
spec:
serviceAccountName: leia
...
Example¶
The following example allows any pod running under the service account “luke” to issue an HTTP GET /public request on TCP port 80 to all pods running under the service account “leia”.
Refer to the example YAML files for a fully functional example including deployment and service account resources.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "k8s-svc-account"
spec:
endpointSelector:
matchLabels:
io.cilium.k8s.policy.serviceaccount: leia
ingress:
- fromEndpoints:
- matchLabels:
io.cilium.k8s.policy.serviceaccount: luke
toPorts:
- ports:
- port: '80'
protocol: TCP
rules:
HTTP:
- method: GET
path: "/public$"
[{
"labels": [{"key": "name", "value": "k8s-svc-account"}],
"endpointSelector": {"matchLabels": {"io.cilium.k8s.policy.serviceaccount":"leia"}},
"ingress": [{
"fromEndpoints": [
{"matchLabels":{"io.cilium.k8s.policy.serviceaccount":"luke"}}
],
"toPorts": [{
"ports": [
{"port": "80", "protocol": "TCP"}
],
"rules": {
"HTTP": [
{
"method": "GET",
"path": "/public$"
}
]
}
}]
}]
}]
Endpoint Lifecycle¶
This section specifies the lifecycle of Cilium endpoints.
Every endpoint in Cilium is in one of the following states:
- restoring: The endpoint was started before Cilium started, and Cilium is restoring its networking configuration.
- waiting-for-identity: Cilium is allocating a unique identity for the endpoint.
- waiting-to-regenerate: The endpoint received an identity and is waiting for its networking configuration to be (re)generated.
- regenerating: The endpoint’s networking configuration is being (re)generated. This includes programming BPF for that endpoint.
- ready: The endpoint’s networking configuration has been successfully (re)generated.
- disconnecting: The endpoint is being deleted.
- disconnected: The endpoint has been deleted.

The state of an endpoint can be queried using the cilium endpoint
list
and cilium endpoint get
CLI commands.
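For example, a quick way to check endpoint states from the CLI (the endpoint ID 568 is borrowed from the trace example later in this chapter; output details may differ between releases):
$ cilium endpoint list                      # the STATUS column shows each endpoint's state
$ cilium endpoint get 568 | grep '"state"'  # the JSON output contains the current state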
While an endpoint is running, it transitions between the
waiting-for-identity
, waiting-to-regenerate
, regenerating
,
and ready
states. A transition into the waiting-for-identity
state indicates that the endpoint changed its identity. A transition
into the waiting-to-regenerate
or regenerating
state indicates
that the policy to be enforced on the endpoint has changed because of
a change in identity, policy, or configuration.
An endpoint transitions into the disconnecting
state when it is
being deleted, regardless of its current state.
In some environments, notably Docker and Kubernetes, Cilium can’t
determine the labels of an endpoint immediately when the endpoint is
created, and therefore can’t allocate an identity for the endpoint at
that point. Until the endpoint’s labels are known, Cilium temporarily
associates a special single label reserved:init=
to the endpoint.
When the endpoint’s labels become known, Cilium then replaces that
special label with the endpoint’s labels and allocates a proper
identity to the endpoint.
To allow traffic to/from endpoints while they are initializing, you
can create policy rules that select the reserved:init
label,
and/or rules that allow traffic to/from the special init
entity.
For instance, writing a rule that allows all initializing endpoints to receive connections from the host and to perform DNS queries may be done as follows:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: init
specs:
- endpointSelector:
matchLabels:
"reserved:init": ""
ingress:
- fromEntities:
- host
egress:
- toEntities:
- all
toPorts:
- ports:
- port: "53"
protocol: UDP
[{
"labels": [{"key": "name", "value": "init"}],
"endpointSelector": {"matchLabels":{"reserved:init":""}},
"ingress": [{
"fromEntities": ["host"]
}],
"egress": [{
"toEntities": ["all"],
"toPorts": [
{"ports":[ {"port": "53", "protocol": "UDP"}]}
]
}]
}]
Likewise, writing a rule that allows an endpoint to receive DNS queries from initializing endpoints may be done as follows:
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
name: "from-init"
spec:
endpointSelector:
matchLabels:
app: myService
ingress:
- fromEntities:
- init
- toPorts:
- ports:
- port: "53"
protocol: UDP
[{
"labels": [{"key": "name", "value": "from-init"}],
"endpointSelector": {"matchLabels":{"app":"myService"}},
"ingress": [{
"fromEntities": ["init"],
"toPorts": [
{"ports":[ {"port": "53", "protocol": "UDP"}]}
]
}]
}]
If any ingress (resp. egress) policy rule selects the
reserved:init
label, all ingress (resp. egress) traffic to
(resp. from) initializing endpoints that is not explicitly allowed by
those rules will be dropped. Otherwise, if the policy enforcement
mode is never
or default
, all ingress (resp. egress) traffic
is allowed to (resp. from) initializing endpoints. Otherwise, all
ingress (resp. egress) traffic is dropped.
Troubleshooting¶
Policy Tracing¶
If Cilium is allowing / denying connections in a way that is not aligned with the intent of your Cilium Network Policy, there is an easy way to verify whether and which policy rules apply between two endpoints. We can use cilium policy trace to simulate a policy decision between the source and destination endpoints.
We will use the example from the Minikube Getting Started Guide to trace the policy. In this example, there is:
- deathstar: service identified by labels org=empire, class=deathstar. The service is backed by two pods.
- tiefighter: spaceship client pod with labels org=empire, class=tiefighter
- xwing: spaceship client pod with labels org=alliance, class=xwing
An L3/L4 policy is enforced on the deathstar
service to allow access to all spaceships with labels org=empire
. With this policy, the tiefighter
access is allowed but xwing
access will be denied. Let’s use the cilium policy trace
to simulate the policy decision. The command provides flexibility to run using pod names, labels or Cilium security identities.
Note
If the --dport
option is not specified, then L4 policy will not be
consulted in this policy trace command.
Currently, there is no support for tracing L7 policies via this tool.
# Policy trace using pod name and service labels
$ kubectl exec -ti cilium-88k78 -n kube-system -- cilium policy trace --src-k8s-pod default:xwing -d any:class=deathstar,k8s:org=empire,k8s:io.kubernetes.pod.namespace=default --dport 80
level=info msg="Waiting for k8s api-server to be ready..." subsys=k8s
level=info msg="Connected to k8s api-server" ipAddr="https://10.96.0.1:443" subsys=k8s
----------------------------------------------------------------
Tracing From: [k8s:class=xwing, k8s:io.cilium.k8s.policy.serviceaccount=default, k8s:io.kubernetes.pod.namespace=default, k8s:org=alliance] => To: [any:class=deathstar, k8s:org=empire, k8s:io.kubernetes.pod.namespace=default] Ports: [80/ANY]
* Rule {"matchLabels":{"any:class":"deathstar","any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}: selected
Allows from labels {"matchLabels":{"any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}
Labels [k8s:class=xwing k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=alliance] not found
1/1 rules selected
Found no allow rule
Label verdict: undecided
Resolving ingress port policy for [any:class=deathstar k8s:org=empire k8s:io.kubernetes.pod.namespace=default]
* Rule {"matchLabels":{"any:class":"deathstar","any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}: selected
Labels [k8s:class=xwing k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=alliance] not found
1/1 rules selected
Found no allow rule
L4 ingress verdict: undecided
Final verdict: DENIED
# Get the Cilium security id
$ kubectl exec -ti cilium-88k78 -n kube-system -- cilium endpoint list | egrep 'deathstar|xwing|tiefighter'
ENDPOINT POLICY (ingress) POLICY (egress) IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT ENFORCEMENT
568 Enabled Disabled 22133 k8s:class=deathstar f00d::a0f:0:0:238 10.15.65.193 ready
900 Enabled Disabled 22133 k8s:class=deathstar f00d::a0f:0:0:384 10.15.114.17 ready
33633 Disabled Disabled 53208 k8s:class=xwing f00d::a0f:0:0:8361 10.15.151.230 ready
38654 Disabled Disabled 22962 k8s:class=tiefighter f00d::a0f:0:0:96fe 10.15.88.156 ready
# Policy trace using Cilium security ids
$ kubectl exec -ti cilium-88k78 -n kube-system -- cilium policy trace --src-identity 53208 --dst-identity 22133 --dport 80
----------------------------------------------------------------
Tracing From: [k8s:class=xwing, k8s:io.cilium.k8s.policy.serviceaccount=default, k8s:io.kubernetes.pod.namespace=default, k8s:org=alliance] => To: [any:class=deathstar, k8s:org=empire, k8s:io.kubernetes.pod.namespace=default] Ports: [80/ANY]
* Rule {"matchLabels":{"any:class":"deathstar","any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}: selected
Allows from labels {"matchLabels":{"any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}
Labels [k8s:class=xwing k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=alliance] not found
1/1 rules selected
Found no allow rule
Label verdict: undecided
Resolving ingress port policy for [any:class=deathstar k8s:org=empire k8s:io.kubernetes.pod.namespace=default]
* Rule {"matchLabels":{"any:class":"deathstar","any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}: selected
Labels [k8s:class=xwing k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=alliance] not found
1/1 rules selected
Found no allow rule
L4 ingress verdict: undecided
Final verdict: DENIED
Policy Rule to Endpoint Mapping¶
To determine which policy rules are currently in effect for an endpoint the
data from cilium endpoint list
and cilium endpoint get
can be paired
with the data from cilium policy get
. cilium endpoint get
will list the
labels of each rule that applies to an endpoint. The list of labels can be
passed to cilium policy get
to show that exact source policy. Note that
rules that have no labels cannot be fetched alone (running cilium policy get without labels returns the complete policy on the node). Rules with the same labels will
be returned together.
In the above example, for one of the deathstar
pods the endpoint id is 568. We can print all policies applied to it with:
# Get a shell on the Cilium pod
$ kubectl exec -ti cilium-88k78 -n kube-system /bin/bash
# print out the Layer 4 ingress labels
# clean up the data
# fetch each policy via each set of labels
$ cilium endpoint get 568 -o jsonpath='{range ..status.policy.realized.l4.ingress[*].derived-from-rules}{@}{"\n"}{end}'|tr -d '][' | xargs -I{} bash -c 'echo "Labels: {}"; cilium policy get {}'
Labels: k8s:io.cilium.k8s.policy.name=rule1 k8s:io.cilium.k8s.policy.namespace=default
[
{
"endpointSelector": {
"matchLabels": {
"any:class": "deathstar",
"any:org": "empire",
"k8s:io.kubernetes.pod.namespace": "default"
}
},
"ingress": [
{
"fromEndpoints": [
{
"matchLabels": {
"any:org": "empire",
"k8s:io.kubernetes.pod.namespace": "default"
}
}
],
"toPorts": [
{
"ports": [
{
"port": "80",
"protocol": "TCP"
}
],
"rules": {
"http": [
{
"path": "/v1/request-landing",
"method": "POST"
}
]
}
}
]
}
],
"labels": [
{
"key": "io.cilium.k8s.policy.name",
"value": "rule1",
"source": "k8s"
},
{
"key": "io.cilium.k8s.policy.namespace",
"value": "default",
"source": "k8s"
}
]
}
]
Revision: 217
# repeat for L4 egress and L3
$ cilium endpoint get 568 -o jsonpath='{range ..status.policy.realized.l4.egress[*].derived-from-rules}{@}{"\n"}{end}' | tr -d '][' | xargs -I{} bash -c 'echo "Labels: {}"; cilium policy get {}'
$ cilium endpoint get 568 -o jsonpath='{range ..status.policy.realized.cidr-policy.ingress[*].derived-from-rules}{@}{"\n"}{end}' | tr -d '][' | xargs -I{} bash -c 'echo "Labels: {}"; cilium policy get {}'
$ cilium endpoint get 568 -o jsonpath='{range ..status.policy.realized.cidr-policy.egress[*].derived-from-rules}{@}{"\n"}{end}' | tr -d '][' | xargs -I{} bash -c 'echo "Labels: {}"; cilium policy get {}'
Monitoring & Metrics¶
cilium-agent can be configured to serve Prometheus metrics. Prometheus is a
pluggable metrics collection and storage system and can act as a data source
for Grafana, a metrics visualization frontend. Unlike push-based collectors
such as statsd, Prometheus pulls metrics from each source.
To expose any metrics, invoke cilium-agent with the --prometheus-serve-addr
option. This option takes an IP:Port pair; passing an empty IP (e.g. :9090)
will bind the server to all available interfaces (there is usually only one in
a container).
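As a minimal sketch, enabling metrics and then spot-checking them from the node could look like this; the agent's other options are elided and the default /metrics path is an assumption:
$ cilium-agent --prometheus-serve-addr=":9090" [existing agent options]
# Verify from the same host (assumes port 9090 and the default /metrics path):
$ curl -s http://127.0.0.1:9090/metrics | head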
Exported Metrics¶
All metrics are exported under the cilium
Prometheus namespace. When
running and collecting in Kubernetes they will be tagged with a pod name and
namespace.
Endpoint¶
- endpoint_count: Number of endpoints managed by this agent
- endpoint_regenerating: Number of endpoints currently regenerating. Deprecated. Use endpoint_state with proper labels instead
- endpoint_regenerations: Count of all endpoint regenerations that have completed, tagged by outcome
- endpoint_regeneration_seconds_total: Total sum of successful endpoint regeneration times
- endpoint_regeneration_square_seconds_total: Total sum of squares of successful endpoint regeneration times
- endpoint_state: Count of all endpoints, tagged by different endpoint states
Datapath¶
- datapath_errors_total: Total number of errors that occurred in datapath management, labeled by area, name and address family.
Drops/Forwards (L3/L4)¶
- drop_count_total: Total dropped packets, tagged by drop reason and ingress/egress direction
- forward_count_total: Total forwarded packets, tagged by ingress/egress direction
Policy Imports¶
- policy_count: Number of policies currently loaded
- policy_regeneration_total: Total number of policies regenerated successfully
- policy_regeneration_seconds_total: Total sum of successful policy regeneration times
- policy_regeneration_square_seconds_total: Total sum of squares of successful policy regeneration times
- policy_max_revision: Highest policy revision number in the agent
- policy_import_errors: Number of times a policy import has failed
Policy L7 (HTTP/Kafka)¶
- policy_l7_parse_errors_total: Number of total L7 parse errors
- policy_l7_forwarded_total: Number of total L7 forwarded requests/responses
- policy_l7_denied_total: Number of total L7 denied requests/responses due to policy
- policy_l7_received_total: Number of total L7 received requests/responses
Events external to Cilium¶
- event_ts: Last timestamp when we received an event. Further labeled by source: api, containerd, k8s.
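To spot-check one of these metric families from the node itself, you can scrape the endpoint directly; this sketch assumes the agent serves metrics on :9090 as configured above and that traffic has already produced drop/forward samples:
$ curl -s http://127.0.0.1:9090/metrics | grep -E '^cilium_(drop|forward)_count_total'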
Cilium as a Kubernetes pod¶
The Cilium Prometheus reference configuration configures jobs that automatically collect metrics from pods marked with the appropriate labels and annotations.
Your Cilium spec will need these annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
The reference Cilium Kubernetes DaemonSet Kubernetes spec
is an example of how to configure cilium-agent
and set the appropriate labels.
Note: the port can be configured per-pod to any value and the annotation set accordingly. Prometheus uses this annotation to discover the port.
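As a sketch, the same annotations could be added to an already-deployed Cilium DaemonSet with kubectl patch; the DaemonSet name and namespace below assume the reference spec, and patching the pod template will restart the Cilium pods:
$ kubectl -n kube-system patch daemonset cilium --type merge -p \
    '{"spec":{"template":{"metadata":{"annotations":{"prometheus.io/scrape":"true","prometheus.io/port":"9090"}}}}}'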
To configure automatic metric discovery and collection, Prometheus itself requires a kubernetes_sd_config configuration. The configured rules are used to filter pods and nodes by label and annotation, and tag the resulting metrics series. In the Kubernetes case Prometheus will contact the Kubernetes API server for these lists and must have permissions to do so.
An example Prometheus configuration can be found alongside the reference Cilium Kubernetes DaemonSet spec.
The critical discovery section is:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_k8s_app]
action: keep
regex: cilium
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: (.+):(?:\d+);(\d+)
replacement: ${1}:${2}
target_label: __address__
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
This job configures Prometheus to do a number of things for all pods returned by the Kubernetes API server:
- find and keep all pods that have the label k8s-app=cilium and the annotation prometheus.io/scrape=true
- extract the IP and port of the pod from __address__ and the prometheus.io/port annotation
- discover the metrics URL path from the annotation prometheus.io/path or use the default of /metrics when it isn't present
- populate metrics tags for the Kubernetes namespace and pod name derived from the pod labels
Cilium as a host-agent on a node¶
Prometheus can use a number of more common service discovery schemes, such as consul and DNS, or a cloud provider API, such as AWS, GCE or Azure. Prometheus documentation contains more information.
It is also possible to use static_configs sections that simply contain a hardcoded IP address and port:
- job_name: 'cilium-agent-nodes'
metrics_path: /metrics
static_configs:
- targets: ['192.168.33.11:9090']
labels:
node-id: i-0598c7d7d356eba47
node-az: a
Troubleshooting¶
This document describes how to troubleshoot Cilium in different deployment modes. It focuses on a full deployment of Cilium within a datacenter or public cloud. If you are just looking for a simple way to experiment, we highly recommend trying out the Getting Started Guides instead.
This guide assumes that you have read the Concepts which explains all the components and concepts.
We use GitHub issues to maintain a list of Cilium Frequently Asked Questions (FAQ). You can also check there to see if your question is already addressed.
Component & Cluster Health¶
Kubernetes¶
An initial overview of Cilium can be retrieved by listing all pods to verify
whether all pods have the status Running
:
$ kubectl -n kube-system get pods -l k8s-app=cilium
NAME READY STATUS RESTARTS AGE
cilium-2hq5z 1/1 Running 0 4d
cilium-6kbtz 1/1 Running 0 4d
cilium-klj4b 1/1 Running 0 4d
cilium-zmjj9 1/1 Running 0 4d
If Cilium encounters a problem that it cannot recover from, it will
automatically report the failure state via cilium status
which is regularly
queried by the Kubernetes liveness probe to automatically restart Cilium pods.
If a Cilium pod is in state CrashLoopBackoff
then this indicates a
permanent failure scenario.
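When a pod is in CrashLoopBackoff, a useful first step is to inspect the pod events and the logs from before the last restart; the pod name below is taken from the listing above:
$ kubectl -n kube-system describe pod cilium-2hq5z
$ kubectl -n kube-system logs --previous cilium-2hq5z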
Detailed Status¶
If a particular Cilium pod is not in running state, the status and health of
the agent on that node can be retrieved by running cilium status
in the
context of that pod:
$ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium status
KVStore: Ok etcd: 1/1 connected: http://demo-etcd-lab--a.etcd.tgraf.test1.lab.corp.covalent.link:2379 - 3.2.5 (Leader)
ContainerRuntime: Ok docker daemon: OK
Kubernetes: Ok OK
Kubernetes APIs: ["cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint", "core/v1::Node", "CustomResourceDefinition"]
Cilium: Ok OK
NodeMonitor: Disabled
Cilium health daemon: Ok
Controller Status: 14/14 healthy
Proxy Status: OK, ip 10.2.0.172, port-range 10000-20000
Cluster health: 4/4 reachable (2018-06-16T09:49:58Z)
Alternatively, the k8s-cilium-exec.sh
script can be used to run cilium
status
on all nodes. This will provide detailed status and health information
of all nodes in the cluster:
$ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-cilium-exec.sh
$ chmod +x ./k8s-cilium-exec.sh
… and run cilium status
on all nodes:
$ ./k8s-cilium-exec.sh cilium status
KVStore: Ok Etcd: http://127.0.0.1:2379 - (Leader) 3.1.10
ContainerRuntime: Ok
Kubernetes: Ok OK
Kubernetes APIs: ["extensions/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint"]
Cilium: Ok OK
NodeMonitor: Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
Controller Status: 7/7 healthy
Proxy Status: OK, ip 10.15.28.238, 0 redirects, port-range 10000-20000
Cluster health: 1/1 reachable (2018-02-27T00:24:34Z)
Logs¶
To retrieve log files of a cilium pod, run (replace cilium-1234
with a pod
name returned by kubectl -n kube-system get pods -l k8s-app=cilium
)
$ kubectl -n kube-system logs --timestamps cilium-1234
If the cilium pod was already restarted by the liveness probe after encountering an issue, it can be useful to retrieve the logs of the pod from before the last restart:
$ kubectl -n kube-system logs --timestamps -p cilium-1234
Generic¶
When logged into a host running Cilium, the cilium CLI can be invoked directly, e.g.:
$ cilium status
KVStore: Ok etcd: 1/1 connected: https://192.168.33.11:2379 - 3.2.7 (Leader)
ContainerRuntime: Ok
Kubernetes: Ok OK
Kubernetes APIs: ["core/v1::Endpoint", "extensions/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service"]
Cilium: Ok OK
NodeMonitor: Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPv4 address pool: 261/65535 allocated
IPv6 address pool: 4/4294967295 allocated
Controller Status: 20/20 healthy
Proxy Status: OK, ip 10.0.28.238, port-range 10000-20000
Cluster health: 2/2 reachable (2018-04-11T15:41:01Z)
Connectivity Problems¶
Checking cluster connectivity health¶
Cilium allows you to rule out network-fabric-related issues when troubleshooting connectivity by providing reliable health and latency probes between all cluster nodes and a simulated workload running on each node.
By default when Cilium is run, it launches instances of cilium-health
in
the background to determine overall connectivity status of the cluster. This
tool periodically runs bidirectional traffic across multiple paths through the
cluster and through each node using different protocols to determine the health
status of each path and protocol. At any point in time, cilium-health may be
queried for the connectivity status of the last probe.
$ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium-health status
Probe time: 2018-06-16T09:51:58Z
Nodes:
ip-172-0-52-116.us-west-2.compute.internal (localhost):
Host connectivity to 172.0.52.116:
ICMP: OK, RTT=315.254µs
HTTP via L3: OK, RTT=368.579µs
Endpoint connectivity to 10.2.0.183:
ICMP: OK, RTT=190.658µs
HTTP via L3: OK, RTT=536.665µs
ip-172-0-117-198.us-west-2.compute.internal:
Host connectivity to 172.0.117.198:
ICMP: OK, RTT=1.009679ms
HTTP via L3: OK, RTT=1.808628ms
Endpoint connectivity to 10.2.1.234:
ICMP: OK, RTT=1.016365ms
HTTP via L3: OK, RTT=2.29877ms
For each node, the connectivity will be displayed for each protocol and path, both to the node itself and to an endpoint on that node. The latency specified is a snapshot at the last time a probe was run, which is typically once per minute.
Monitoring Packet Drops¶
Sometimes you may experience broken connectivity, which may be due to a
number of different causes. A main cause can be unwanted packet drops on
the networking level. The tool
cilium monitor
allows you to quickly inspect and see if and where packet
drops happen. Following is an example output (use kubectl exec
as in previous
examples if running with Kubernetes):
$ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium monitor --type drop
Listening for events on 2 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
xx drop (Policy denied (L3)) to endpoint 25729, identity 261->264: fd02::c0a8:210b:0:bf00 -> fd02::c0a8:210b:0:6481 EchoRequest
xx drop (Policy denied (L3)) to endpoint 25729, identity 261->264: fd02::c0a8:210b:0:bf00 -> fd02::c0a8:210b:0:6481 EchoRequest
xx drop (Policy denied (L3)) to endpoint 25729, identity 261->264: 10.11.13.37 -> 10.11.101.61 EchoRequest
xx drop (Policy denied (L3)) to endpoint 25729, identity 261->264: 10.11.13.37 -> 10.11.101.61 EchoRequest
xx drop (Invalid destination mac) to endpoint 0, identity 0->0: fe80::5c25:ddff:fe8e:78d8 -> ff02::2 RouterSolicitation
The above indicates that a packet to endpoint ID 25729
has been dropped due
to violation of the Layer 3 policy.
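To map the endpoint ID from a drop message back to a workload, you can look it up in the endpoint list; a sketch run in the same Cilium pod, using the endpoint ID from the output above:
$ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium endpoint list | grep 25729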
Policy Troubleshooting¶
Ensure pod is managed by Cilium¶
A potential cause for policy enforcement not functioning as expected is that the networking of the pod selected by the policy is not being managed by Cilium. The following situations result in unmanaged pods:
- The pod is running in host networking and will use the host’s IP address directly. Such pods have full network connectivity but Cilium will not provide security policy enforcement for such pods.
- The pod was started before Cilium was deployed. Cilium only manages pods that have been deployed after Cilium itself was started. Cilium will not provide security policy enforcement for such pods.
If pod networking is not managed by Cilium, ingress and egress policy rules selecting the respective pods will not be applied. See the section Network Policy for more details.
You can run the following script to list the pods which are not managed by Cilium:
$ ./contrib/k8s/k8s-unmanaged.sh
kube-system/cilium-hqpk7
kube-system/kube-addon-manager-minikube
kube-system/kube-dns-54cccfbdf8-zmv2c
kube-system/kubernetes-dashboard-77d8b98585-g52k5
kube-system/storage-provisioner
See section Troubleshooting for details and examples on how to use the policy tracing feature.
Automatic Diagnosis¶
The cluster-diagnosis
tool can help identify the most commonly encountered
issues in Cilium deployments. The tool currently supports Kubernetes
and Minikube clusters only.
The tool performs various checks and provides hints to fix specific issues that it has identified.
The following is a list of prerequisites:
- Requires Python >= 2.7.*
- Requires kubectl. kubectl should be pointing to your cluster before running the tool.
You can download the latest version of the cluster-diagnosis.zip file using the following command:
curl -sLO releases.cilium.io/tools/cluster-diagnosis.zip
Command to run the cluster-diagnosis tool:
python cluster-diagnosis.zip
Command to collect the system dump using the cluster-diagnosis tool:
python cluster-diagnosis.zip sysdump
Symptom Library¶
Node to node traffic is being dropped¶
Symptom¶
Endpoint to endpoint communication on a single node succeeds but communication fails between endpoints across multiple nodes.
Troubleshooting steps:¶
- Run cilium-health status on the node of the source and destination endpoint. It should describe the connectivity from that node to other nodes in the cluster, and to a simulated endpoint on each other node. Identify points in the cluster that cannot talk to each other. If the command does not describe the status of the other node, there may be an issue with the KV-Store.
- Run cilium monitor on the node of the source and destination endpoint. Look for packet drops.
When running in Overlay Network Mode:
- Run cilium bpf tunnel list and verify that each Cilium node is aware of the other nodes in the cluster. If not, check the logfile for errors.
- If nodes are being populated correctly, run tcpdump -n -i cilium_vxlan on each node to verify whether cross-node traffic is being forwarded correctly between nodes.
- If packets are being dropped:
  - verify that the node IPs listed in cilium bpf tunnel list can reach each other.
  - verify that the firewall on each node allows UDP port 4789.
When running in Direct / Native Routing Mode:
- Run ip route or check your cloud provider router and verify that you have routes installed to route the endpoint prefix between all nodes (see the sketch after this list).
- Verify that the firewall on each node permits routing of the endpoint IPs.
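As an illustration only, a route covering a remote node's endpoint prefix might look like the following; the prefix and next hop shown are hypothetical:
$ ip route | grep 10.2.1.0
10.2.1.0/24 via 172.0.117.198 dev eth0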
Useful Scripts¶
Retrieve Cilium pod managing a particular pod¶
Identifies the Cilium pod that is managing a particular pod in a namespace:
k8s-get-cilium-pod.sh <pod> <namespace>
Example:
$ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-get-cilium-pod.sh
$ ./k8s-get-cilium-pod.sh luke-pod default
cilium-zmjj9
Execute a command in all Kubernetes Cilium pods¶
Run a command within all Cilium pods of a cluster
k8s-cilium-exec.sh <command>
Example:
$ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-cilium-exec.sh
$ ./k8s-cilium-exec.sh uptime
10:15:16 up 6 days, 7:37, 0 users, load average: 0.00, 0.02, 0.00
10:15:16 up 6 days, 7:32, 0 users, load average: 0.00, 0.03, 0.04
10:15:16 up 6 days, 7:30, 0 users, load average: 0.75, 0.27, 0.15
10:15:16 up 6 days, 7:28, 0 users, load average: 0.14, 0.04, 0.01
List unmanaged Kubernetes pods¶
Lists all Kubernetes pods in the cluster for which Cilium does not provide networking. This includes pods running in host-networking mode and pods that were started before Cilium was deployed.
k8s-unmanaged.sh
Example:
$ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-unmanaged.sh
$ ./k8s-unmanaged.sh
kube-system/cilium-hqpk7
kube-system/kube-addon-manager-minikube
kube-system/kube-dns-54cccfbdf8-zmv2c
kube-system/kubernetes-dashboard-77d8b98585-g52k5
kube-system/storage-provisioner
Reporting a problem¶
Automatic log & state collection¶
Before you report a problem, make sure to retrieve the necessary information from your cluster before the failure state is lost. Cilium provides a script to automatically grab logs and retrieve debug information from all Cilium pods in the cluster.
The script has the following list of prerequisites:
- Requires Python >= 2.7.*
- Requires kubectl. kubectl should be pointing to your cluster before running the tool.
You can download the latest version of the cluster-diagnosis.zip file using the following command:
$ curl -sLO releases.cilium.io/tools/cluster-diagnosis.zip
$ python cluster-diagnosis.zip sysdump
Single Node Bugtool¶
If you are not running Kubernetes, it is also possible to run the bug collection tool manually with the scope of a single node:
The cilium-bugtool
captures potentially useful information about your
environment for debugging. The tool is meant to be used for debugging a single
Cilium agent node. In the Kubernetes case, if you have multiple Cilium pods,
the tool can retrieve debugging information from all of them. The tool works by
archiving a collection of command output and files from several places. By default, it writes to the /tmp directory.
Note that the command needs to be run from inside the Cilium pod/container.
$ cilium-bugtool
When running it with no options as shown above, it will try to copy various files and execute some commands. If kubectl is detected, it will search for Cilium pods. The default label is k8s-app=cilium, but this and the namespace can be changed via the k8s-namespace and k8s-label options, respectively.
If you’d prefer to browse the dump, there is an HTTP flag.
$ cilium-bugtool --serve
If you want to capture the archive from a Kubernetes pod, then the process is a bit different
# First we need to get the Cilium pod
$ kubectl get pods --namespace kube-system
NAME READY STATUS RESTARTS AGE
cilium-kg8lv 1/1 Running 0 13m
kube-addon-manager-minikube 1/1 Running 0 1h
kube-dns-6fc954457d-sf2nk 3/3 Running 0 1h
kubernetes-dashboard-6xvc7 1/1 Running 0 1h
# Run the bugtool from this pod
$ kubectl -n kube-system exec cilium-kg8lv cilium-bugtool
[...]
# Copy the archive from the pod
$ kubectl cp kube-system/cilium-kg8lv:/tmp/cilium-bugtool-20180411-155146.166+0000-UTC-266836983.tar /tmp/cilium-bugtool-20180411-155146.166+0000-UTC-266836983.tar
[...]
Note
Please check the archive for sensitive information and strip it away before sharing it with us.
Below is an approximate list of the kind of information in the archive.
- Cilium status
- Cilium version
- Kernel configuration
- Resolve configuration
- Cilium endpoint state
- Cilium logs
- Docker logs
dmesg
ethtool
ip a
ip link
ip r
iptables-save
kubectl -n kube-system get pods
kubectl get pods,svc for all namespaces
uname
uptime
cilium bpf * list
cilium endpoint get for each endpoint
cilium endpoint list
hostname
cilium policy get
cilium service list
- …
Debugging information¶
If you are not running Kubernetes, you can use the cilium debuginfo
command
to retrieve useful debugging information. If you are running Kubernetes, this
command is automatically run as part of the system dump.
cilium debuginfo
can print useful output from the Cilium API. The output is in Markdown format so it can be used when reporting a bug on the issue tracker. Running without arguments will print to standard output, but you can also redirect to a file like:
$ cilium debuginfo -f debuginfo.md
Note
Please check the debuginfo file for sensitive information and strip it away before sharing it with us.
Slack Assistance¶
The Cilium Slack community is a helpful first point of assistance for troubleshooting a problem or discussing options on how to address it.
The Slack community is open to everyone. You can request an invite email by visiting Slack.
Report an issue via GitHub¶
If you believe you have found an issue in Cilium, please report a GitHub issue and make sure to attach a system dump as described above so that developers have the best chance of reproducing the issue.
Developer / Contributor Guide¶
We’re happy you’re interested in contributing to the Cilium project.
This guide will help you make sure you have an environment capable of testing changes to the Cilium source code, and that you understand the workflow of getting these changes reviewed and merged upstream.
Setting up the development environment¶
Requirements¶
You need to have the following tools available in order to effectively contribute to Cilium:
Dependency | Version / Commit ID | Download Command |
---|---|---|
git | latest | N/A (OS-specific) |
go | 1.10 | N/A (OS-specific) |
dep | >= v0.4.1 | curl https://raw.githubusercontent.com/golang/dep/master/install.sh | sh |
go-bindata | a0ff2567cfb | go get -u github.com/cilium/go-bindata/... |
ginkgo | >= 1.4.0 | go get -u github.com/onsi/ginkgo |
gomega | >= 1.2.0 | go get -u github.com/onsi/gomega |
protoc-gen-go | latest | go get -u github.com/golang/protobuf/protoc-gen-go |
protoc-gen-validate | latest | go get -u github.com/lyft/protoc-gen-validate |
Docker | OS-Dependent | N/A (OS-specific) |
Docker-Compose | OS-Dependent | N/A (OS-specific) |
Cmake | OS-Dependent | N/A (OS-specific) |
Bazel | 0.14.1 | N/A (OS-specific) |
Libtool | >= 1.4.2 | N/A (OS-specific) |
Automake | OS-Dependent | N/A (OS-specific) |
Kubecfg | >= 0.8.0 | go get github.com/ksonnet/kubecfg |
To run Cilium locally on VMs, you need:
Dependency | Version / Commit ID | Download Command |
---|---|---|
Vagrant | >= 2.0 | Vagrant Install Instructions |
VirtualBox (if not using libvirt) | >= 5.2 | N/A (OS-specific) |
Finally, in order to build the documentation, you should have Sphinx installed:
$ sudo pip install sphinx
You should start with the Getting Started Guides, which walks you through the set-up, such as installing Vagrant, getting the Cilium sources, and going through some Cilium basics.
Vagrant Setup¶
While the Getting Started Guides uses a Vagrantfile tuned for the basic walk through, the
setup for the Vagrantfile in the root of the Cilium tree depends on a number of
environment variables and network setup that are managed via
contrib/vagrant/start.sh
.
Using the provided Vagrantfile¶
To bring up a Vagrant VM with Cilium plus dependencies installed, run:
$ contrib/vagrant/start.sh
This will create and run a vagrant VM based on the base box
cilium/ubuntu
. The box is currently available for the
following providers:
- virtualbox
Options¶
The following environment variables can be set to customize the VMs brought up by vagrant:
- NWORKERS=n: Number of child nodes you want to start with the master, default 0.
- RELOAD=1: Issue a vagrant reload instead of vagrant up, useful to resume halted VMs.
- NFS=1: Use NFS for vagrant shared directories instead of rsync.
- K8S=1: Build & install kubernetes on the nodes. k8s1 is the master node, which contains both master components (etcd, kube-controller-manager, kube-scheduler, kube-apiserver) and node components (kubelet, kube-proxy, kubectl and Cilium). When used in combination with NWORKERS=1, a second node is created, where k8s2 will be a kubernetes node, which contains: kubelet, kube-proxy, kubectl and cilium.
- IPV4=1: Run Cilium with IPv4 enabled.
- RUNTIME=x: Sets up the container runtime to be used inside a kubernetes cluster. Valid options are: docker, containerd and crio. If not set, it defaults to docker.
- VAGRANT_DEFAULT_PROVIDER={virtualbox | libvirt | ...}
If you want to start the VM with Cilium enabled, containerd as the runtime, Kubernetes installed, plus a worker node, run:
$ RUNTIME=containerd K8S=1 NWORKERS=1 contrib/vagrant/start.sh
If you want to connect to the Kubernetes cluster running inside the developer VM via kubectl from your host machine, set the KUBECONFIG environment variable to include the new kubeconfig file:
$ export KUBECONFIG=$KUBECONFIG:$GOPATH/src/github.com/cilium/cilium/vagrant.kubeconfig
and add 127.0.0.1 k8s1
to your hosts file.
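For example, one way to add that entry (assumes you have sudo rights on your host):
$ echo "127.0.0.1 k8s1" | sudo tee -a /etc/hosts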
If you have any issue with the provided vagrant box
cilium/ubuntu
or need a different box format, you may
build the box yourself using the packer scripts.
Manual Installation¶
Alternatively you can import the vagrant box cilium/ubuntu
directly and manually install Cilium:
$ vagrant init cilium/ubuntu
$ vagrant up
$ vagrant ssh [...]
$ cd go/src/github.com/cilium/cilium/
$ make
$ sudo make install
$ sudo mkdir -p /etc/sysconfig/
$ sudo cp contrib/systemd/cilium.service /etc/systemd/system/
$ sudo cp contrib/systemd/cilium /etc/sysconfig/cilium
$ sudo usermod -a -G cilium vagrant
$ sudo systemctl enable cilium
$ sudo systemctl restart cilium
Notes¶
Your Cilium tree is mapped to the VM so that you do not need to keep manually
copying files between your host and the VM. Folders are by default synced
automatically using VirtualBox Shared Folders.
You can also use NFS to access your Cilium tree from the VM by
setting the environment variable NFS (mentioned above) before running the
startup script (export NFS=1). Note that your host firewall must have a variety
of ports open. The Vagrantfile will inform you of the configuration of these addresses
and ports to enable NFS.
Note
The OSX file system is by default case-insensitive, which can confuse git. At the time of this writing, the Cilium repo has no file names that would be considered the same file on a case-insensitive file system. Regardless, it may be useful to create a disk image with a case-sensitive file system for holding your git repos.
Note
VirtualBox for OSX currently (version 5.1.22) always reports host-only networks’ prefix length as 64. Cilium needs this prefix to be 16, and the startup script will check for this. This check always fails when using VirtualBox on OSX, but it is safe to let the startup script reset the prefix length to 16.
If, for some reason, the provisioning script fails, you should bring the VM down before trying again:
$ vagrant halt
Packer-CI-Build¶
As part of Cilium development, we use a custom base box with a bunch of pre-installed libraries and tools that we need to enhance our daily workflow. That base box is built with Packer and it is hosted in the packer-ci-build GitHub repository.
New versions of this box can be created via Jenkins Packer Build, where new builds of the image will be pushed to Vagrant Cloud . The version of the image corresponds to the BUILD_ID environment variable in the Jenkins job. That version ID will be used in Cilium Vagrantfiles.
Changes to this image are made via contributions to the packer-ci-build repository. Authorized GitHub users can trigger builds with a GitHub comment on the PR containing the trigger phrase build-me-please. If a new box needs to be based on a branch other than master, authorized developers can run the build with custom parameters: go to Build with parameters in the Jenkins job and set the base branch as needed.
This box needs to be updated whenever a developer needs a new dependency that is not installed in the current version of the box, or if a dependency that is cached within the box becomes stale.
Development process¶
Local Development in Vagrant Box¶
See Setting up the development environment for information on how to setup the development environment.
When the development VM is provisioned, it builds and installs Cilium. After
the initial build and install you can do further building and testing
incrementally inside the VM. vagrant ssh
takes you to the Cilium source
tree directory (/home/vagrant/go/src/github.com/cilium/cilium
) by default,
and the following commands assume that you are working within that directory.
Build Cilium¶
Assuming you have synced (rsync) the source tree after you have made changes, or the tree is automatically in sync via NFS or guest additions folder sharing, you can issue a build as follows:
$ make
Install to dev environment¶
After a successful build and test you can re-install Cilium by:
$ sudo -E make install
Restart Cilium service¶
To run the newly installed version of Cilium, restart the service:
$ sudo systemctl restart cilium
You can verify the service and cilium-agent status by the following commands, respectively:
$ sudo systemctl status cilium
$ cilium status
Making Changes¶
- Create a topic branch:
git checkout -b myBranch master
- Make the changes you want
- Separate the changes into logical commits.
- Describe the changes in the commit messages. Focus on answering the question why the change is required and document anything that might be unexpected.
- If any description is required to understand your code changes, then those instructions should be code comments instead of statements in the commit description.
- Make sure your changes meet the following criteria:
- New code is covered by Unit Testing.
- End to end integration / runtime tests have been extended or added. If not required, mention in the commit message what existing test covers the new code.
- Follow-up commits are squashed together nicely. Commits should separate logical chunks of code and not represent a chronological list of changes.
- Run git diff --check to catch obvious white space violations.
- Run make to build your changes. This will also run go fmt and error out on any golang formatting errors.
- See Unit Testing on how to run unit tests.
- See End-To-End Testing Framework for information on how to run the end-to-end integration tests.
Unit Testing¶
Cilium uses the standard go test framework in combination with gocheck for richer testing functionality.
Prerequisites¶
Some tests interact with the kvstore and depend on local kvstore instances of both etcd and consul. To start the local instances, run:
$ make start-kvstores
Running all tests¶
To run unit tests over the entire repository, run the following command in the project root directory:
$ make unit-tests
Warning
Running envoy unit tests can sometimes cause the developer VM to run out of
memory. This pressure can be alleviated by shutting down the bazel caching
daemon left by these tests. Run (cd envoy; bazel shutdown)
after a build to
do this.
Testing individual packages¶
It is possible to test individual packages by invoking go test
directly.
You can then cd
into the package subject to testing and invoke go test:
$ cd pkg/kvstore
$ go test
If you need more verbose output, you can pass in the -check.v
and
-check.vv
arguments:
$ cd pkg/kvstore
$ go test -check.v -check.vv
Running individual tests¶
Due to the use of gocheck, the standard go test -run will not work; instead, the -check.f argument has to be specified:
$ go test -check.f TestParallelAllocation
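The focus filter can be combined with the verbose gocheck flags shown earlier, for example:
$ go test -check.f TestParallelAllocation -check.v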
Automatically run unit tests on code changes¶
The script contrib/shell/test.sh
contains some helpful bash functions to
improve the feedback cycle between writing tests and seeing their results. If
you’re writing unit tests in a particular package, the watchtest
function
will watch for changes in a directory and run the unit tests for that package
any time the files change. For example, if writing unit tests in pkg/policy
,
run this in a terminal next to your editor:
$ . contrib/shell/test.sh
$ watchtest pkg/policy
This shell script depends on the inotify-tools
package on Linux.
Add/update a golang dependency¶
Once you have downloaded dep, make sure you have version >= 0.4.1:
$ dep version
dep:
version : v0.4.1
build date : 2018-01-24
git hash : 37d9ea0a
go version : go1.9.1
go compiler : gc
platform : linux/amd64
After that, you can edit the Gopkg.toml file and add the library that you want. Let's assume we want to add github.com/containernetworking/cni version v0.5.2:
[[constraint]]
name = "github.com/containernetworking/cni"
revision = "v0.5.2"
Once you have added the libraries that you need, save the file and run:
$ dep ensure -v
For a first run, it can take a while as it will download all dependencies to your local cache but the remaining runs will be faster.
Debugging¶
Datapath code¶
The tool cilium monitor
can also be used to retrieve debugging information
from the BPF based datapath. Debugging messages are sent if either the
cilium-agent
itself or the respective endpoint is in debug mode. The debug
mode of the agent can be enabled by starting cilium-agent
with the option
--debug
enabled or by running cilium config debug=true
for an already
running agent. Debugging of an individual endpoint can be enabled by running
cilium endpoint config ID debug=true
$ cilium endpoint config 3978 debug=true
Endpoint 3978 configuration updated successfully
$ cilium monitor -v --hex
Listening for events on 2 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
------------------------------------------------------------------------------
CPU 00: MARK 0x1c56d86c FROM 3978 DEBUG: 70 bytes Incoming packet from container ifindex 85
00000000 33 33 00 00 00 02 ae 45 75 73 11 04 86 dd 60 00 |33.....Eus....`.|
00000010 00 00 00 10 3a ff fe 80 00 00 00 00 00 00 ac 45 |....:..........E|
00000020 75 ff fe 73 11 04 ff 02 00 00 00 00 00 00 00 00 |u..s............|
00000030 00 00 00 00 00 02 85 00 15 b4 00 00 00 00 01 01 |................|
00000040 ae 45 75 73 11 04 00 00 00 00 00 00 |.Eus........|
CPU 00: MARK 0x1c56d86c FROM 3978 DEBUG: Handling ICMPv6 type=133
------------------------------------------------------------------------------
CPU 00: MARK 0x1c56d86c FROM 3978 Packet dropped 131 (Invalid destination mac) 70 bytes ifindex=0 284->0
00000000 33 33 00 00 00 02 ae 45 75 73 11 04 86 dd 60 00 |33.....Eus....`.|
00000010 00 00 00 10 3a ff fe 80 00 00 00 00 00 00 ac 45 |....:..........E|
00000020 75 ff fe 73 11 04 ff 02 00 00 00 00 00 00 00 00 |u..s............|
00000030 00 00 00 00 00 02 85 00 15 b4 00 00 00 00 01 01 |................|
00000040 00 00 00 00 |....|
------------------------------------------------------------------------------
CPU 00: MARK 0x7dc2b704 FROM 3978 DEBUG: 86 bytes Incoming packet from container ifindex 85
00000000 33 33 ff 00 8a d6 ae 45 75 73 11 04 86 dd 60 00 |33.....Eus....`.|
00000010 00 00 00 20 3a ff fe 80 00 00 00 00 00 00 ac 45 |... :..........E|
00000020 75 ff fe 73 11 04 ff 02 00 00 00 00 00 00 00 00 |u..s............|
00000030 00 01 ff 00 8a d6 87 00 20 40 00 00 00 00 fd 02 |........ @......|
00000040 00 00 00 00 00 00 c0 a8 21 0b 00 00 8a d6 01 01 |........!.......|
00000050 ae 45 75 73 11 04 00 00 00 00 00 00 |.Eus........|
CPU 00: MARK 0x7dc2b704 FROM 3978 DEBUG: Handling ICMPv6 type=135
CPU 00: MARK 0x7dc2b704 FROM 3978 DEBUG: ICMPv6 neighbour soliciation for address b21a8c0:d68a0000
One of the most common issues when developing datapath code is that the BPF code cannot be loaded into the kernel. This frequently manifests as the endpoints appearing in the “not-ready” state and never switching out of it:
$ cilium endpoint list
ENDPOINT POLICY IDENTITY LABELS (source:key[=value]) IPv6 IPv4 STATUS
ENFORCEMENT
48896 Disabled 266 container:id.server fd02::c0a8:210b:0:bf00 10.11.13.37 not-ready
60670 Disabled 267 container:id.client fd02::c0a8:210b:0:ecfe 10.11.167.158 not-ready
Running cilium endpoint get
for one of the endpoints will provide a
description of known state about it, which includes BPF verification logs.
The files under /var/run/cilium/state
provide context about how the BPF
datapath is managed and set up. The .log files will describe the BPF
requirements and features that Cilium detected and used to generate the BPF
programs. The .h files describe specific configurations used for BPF program
compilation. The numbered directories describe endpoint-specific state,
including header configuration files and BPF binaries.
# for log in /var/run/cilium/state/*.log; do echo "cat $log"; cat $log; done
cat /var/run/cilium/state/bpf_features.log
BPF/probes: CONFIG_CGROUP_BPF=y is not in kernel configuration
BPF/probes: CONFIG_LWTUNNEL_BPF=y is not in kernel configuration
HAVE_LPM_MAP_TYPE: Your kernel doesn't support LPM trie maps for BPF, thus disabling CIDR policies. Recommendation is to run 4.11+ kernels.
HAVE_LRU_MAP_TYPE: Your kernel doesn't support LRU maps for BPF, thus switching back to using hash table for the cilium connection tracker. Recommendation is to run 4.10+ kernels.
Current BPF map state for particular programs is held under /sys/fs/bpf/
,
and the bpf-map utility can be useful
for debugging what is going on inside them, for example:
# ls /sys/fs/bpf/tc/globals/
cilium_calls_15124 cilium_calls_48896 cilium_ct4_global cilium_lb4_rr_seq cilium_lb6_services cilium_policy_25729 cilium_policy_60670 cilium_proxy6
cilium_calls_25729 cilium_calls_60670 cilium_ct6_global cilium_lb4_services cilium_lxc cilium_policy_3978 cilium_policy_reserved_1 cilium_reserved_policy
cilium_calls_3978 cilium_calls_netdev_ns_1 cilium_events cilium_lb6_reverse_nat cilium_policy cilium_policy_4314 cilium_policy_reserved_2 cilium_tunnel_map
cilium_calls_4314 cilium_calls_overlay_2 cilium_lb4_reverse_nat cilium_lb6_rr_seq cilium_policy_15124 cilium_policy_48896 cilium_proxy4
# bpf-map info /sys/fs/bpf/tc/globals/cilium_policy_15124
Type: Hash
Key size: 8
Value size: 24
Max entries: 1024
Flags: 0x0
# bpf-map dump /sys/fs/bpf/tc/globals/cilium_policy_15124
Key:
00000000 6a 01 00 00 82 23 06 00 |j....#..|
Value:
00000000 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000010 00 00 00 00 00 00 00 00 |........|
End-To-End Testing Framework¶
Introduction¶
Cilium uses Ginkgo as a testing framework for
writing end-to-end tests which test Cilium all the way from the API level (e.g.
importing policies, CLI) to the datapath (i.e., whether policy that is imported
is enforced accordingly in the datapath). The tests in the test
directory
are built on top of Ginkgo. Ginkgo provides a rich framework for developing
tests alongside the benefits of Golang (compilation-time checks, types, etc.).
To get accustomed to the basics of Ginkgo, we recommend reading the Ginkgo
Getting-Started Guide , as
well as running example tests to get a feel for the
Ginkgo workflow.
These test scripts will invoke vagrant
to create virtual machine(s) to
run the tests. The tests make heavy use of the Ginkgo focus concept to
determine which VMs are necessary to run particular tests. All test names
must begin with one of the following prefixes:
- Runtime: Test cilium in a runtime environment running on a single node.
- K8s: Create a small multi-node kubernetes environment for testing features beyond a single host, and for testing kubernetes-specific features.
- Nightly: Sets up a multinode Kubernetes cluster to run scale, performance, and chaos testing for Cilium.
Running End-To-End Tests¶
Running All Tests¶
Running all of the Ginkgo tests may take an hour or longer. To run all the ginkgo tests, invoke the make command as follows from the root of the cilium repository:
$ sudo make -C test/
The first time that this is invoked, the testsuite will pull the testing VMs and provision Cilium into them. This may take several minutes, depending on your internet connection speed. Subsequent runs of the test will reuse the image.
Running Runtime Tests¶
To run all of the runtime tests, execute the following command from the test
directory:
ginkgo --focus="Runtime*" -noColor
Ginkgo searches for all tests in all subdirectories that are “named” beginning with the string “Runtime” and contain any characters after it. For instance, here is an example showing what tests will be run using Ginkgo’s dryRun option:
$ ginkgo --focus="Runtime*" -noColor -v -dryRun
Running Suite: runtime
======================
Random Seed: 1516125117
Will run 42 of 164 specs
................
RuntimePolicyEnforcement Policy Enforcement Always
Always to Never with policy
/Users/ianvernon/go/src/github.com/cilium/cilium/test/runtime/Policies.go:258
•
------------------------------
RuntimePolicyEnforcement Policy Enforcement Always
Always to Never without policy
/Users/ianvernon/go/src/github.com/cilium/cilium/test/runtime/Policies.go:293
•
------------------------------
RuntimePolicyEnforcement Policy Enforcement Never
Container creation
/Users/ianvernon/go/src/github.com/cilium/cilium/test/runtime/Policies.go:332
•
------------------------------
RuntimePolicyEnforcement Policy Enforcement Never
Never to default with policy
/Users/ianvernon/go/src/github.com/cilium/cilium/test/runtime/Policies.go:349
.................
Ran 42 of 164 Specs in 0.002 seconds
SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 122 Skipped PASS
Ginkgo ran 1 suite in 1.830262168s
Test Suite Passed
The output has been truncated. For more information about this functionality, consult the aforementioned Ginkgo documentation.
Running Kubernetes Tests¶
To run all of the Kubernetes tests, run the following command from the test
directory:
ginkgo --focus="K8s*" -noColor
Similar to the Runtime test suite, Ginkgo searches for all tests in all subdirectories that are “named” beginning with the string “K8s” and contain any characters after it.
The Kubernetes tests support the following Kubernetes versions:
- 1.7
- 1.8
- 1.9
- 1.10
- 1.11
By default, the Vagrant VMs are provisioned with Kubernetes 1.9. To run with any other supported version of Kubernetes, run the test suite with the following format:
K8S_VERSION=<version> ginkgo --focus="K8s*" -noColor
Running Nightly Tests¶
To run all of the Nightly tests, run the following command from the test
directory:
ginkgo --focus="Nightly*" -noColor
Similar to the other test suites, Ginkgo searches for all tests in all
subdirectories that are “named” beginning with the string “Nightly” and contain
any characters after it. The default Kubernetes version for the Nightly tests
is 1.8, but it can be changed using the environment variable K8S_VERSION.
Available CLI Options¶
For more advanced workflows, check the list of available custom options for the Cilium
framework in the test/
directory and interact with ginkgo directly:
$ cd test/
$ ginkgo . -- --help | grep -A 1 cilium
-cilium.SSHConfig string
Specify a custom command to fetch SSH configuration (eg: 'vagrant ssh-config')
-cilium.dsManifest string
Cilium daemon set manifest to use for running the test (only Kubernetes)
-cilium.holdEnvironment
On failure, hold the environment in its current state
-cilium.provision
Provision Vagrant boxes and Cilium before running test (default true)
-cilium.showCommands
Output which commands are ran to stdout
-cilium.testScope string
Specifies scope of test to be ran (k8s, Nightly, runtime)
$ ginkgo --focus "Policies*" -- --cilium.provision=false
For more information about other built-in options to Ginkgo, consult the Ginkgo documentation.
Running Specific Tests Within a Test Suite¶
If you want to run one specified test, there are a few options:
- By modifying code: add the prefix “FIt” on the test you want to run; this marks the test as focused. Ginkgo will skip other tests and will only run the “focused” test. For more information, consult the Focused Specs documentation from Ginkgo.
It("Example test", func(){
Expect(true).Should(BeTrue())
})
FIt("Example focused test", func(){
Expect(true).Should(BeTrue())
})
- From the command line: specify a more granular focus if you want to focus on, say, L7 tests:
ginkgo --focus "Run*" --focus "L7 "
This will focus on tests prefixed with “Run*”, and within that focus, run any test that starts with “L7”.
Test Reports¶
The Cilium Ginkgo framework formulates JUnit reports for each test. The following files currently are generated depending upon the test suite that is run:
- runtime.xml
- K8s.xml
Best Practices for Writing Tests¶
- Provide informative output to console during a test using the By construct. This helps with debugging and gives those who did not write the test a good idea of what is going on. The lower the barrier of entry is for understanding tests, the better our tests will be!
- Leave the testing environment in the same state that it was in when the test started by deleting resources, resetting configuration, etc.
- Gather logs in the case that a test fails. If a test fails while running on Jenkins, a postmortem needs to be done to analyze why. So, dumping logs to a location where Jenkins can pick them up is of the highest imperative. Use the following code in an
AfterFailed
method:
AfterFailed(func() {
vm.ReportFailed()
})
Ginkgo Extensions¶
In Cilium, some Ginkgo features are extended to cover some use cases that are useful for testing Cilium.
BeforeAll¶
This function will run before all BeforeEach within a
Describe or Context.
This method is equivalent to SetUp or initialize functions in common
unit test frameworks.
AfterAll¶
This method will run after all AfterEach functions
defined in a Describe or Context.
This method is used for tearing down objects created which are used by all
Its within the given Context or Describe. It is run after all Its
have run; this method is equivalent to tearDown or finalize methods in
common unit test frameworks.
A good use case for the AfterAll method is to remove containers or pods
that are needed for multiple Its in the given Context or Describe.
JustAfterEach¶
This method will run just after each test and before AfterFailed and
AfterEach. The main purpose of this method is to perform assertions
for a group of tests. A good example of using a global JustAfterEach
function is for deadlock detection, which checks the Cilium logs for deadlocks
that may have occurred in the duration of the tests.
AfterFailed¶
This method will run before all AfterEach and after JustAfterEach. This
function is only called when the test failed. This construct is used to gather
logs, the status of Cilium, etc., which provide data for analysis when tests
fail.
Example Test Layout¶
Here is an example layout of how a test may be written with the aforementioned constructs:
Test description diagram:
Describe
BeforeAll(A)
AfterAll(A)
AfterFailed(A)
AfterEach(A)
JustAfterEach(A)
TESTA1
TESTA2
TESTA3
Context
BeforeAll(B)
AfterAll(B)
AfterFailed(B)
AfterEach(B)
JustAfterEach(B)
TESTB1
TESTB2
TESTB3
Test execution flow:
Describe
BeforeAll
TESTA1; JustAfterEach(A), AfterFailed(A), AfterEach(A)
TESTA2; JustAfterEach(A), AfterFailed(A), AfterEach(A)
TESTA3; JustAfterEach(A), AfterFailed(A), AfterEach(A)
Context
BeforeAll(B)
TESTB1:
JustAfterEach(B); JustAfterEach(A)
AfterFailed(B); AfterFailed(A);
AfterEach(B) ; AfterEach(A);
TESTB2:
JustAfterEach(B); JustAfterEach(A)
AfterFailed(B); AfterFailed(A);
AfterEach(B) ; AfterEach(A);
TESTB3:
JustAfterEach(B); JustAfterEach(A)
AfterFailed(B); AfterFailed(A);
AfterEach(B) ; AfterEach(A);
AfterAll(B)
AfterAll(A)
Debugging:¶
Ginkgo provides different ways of debugging. If you want to see all the log
messages in the console, you can run the test in verbose mode using the
option -v:
ginkgo --focus "Runtime*" -v
If verbose mode is not enough, you can retrieve all run commands and their
output in the report directory (./test/test-results). Each test creates a new
folder, which contains a file called log where all information is saved; in
the case of a failing test, more exhaustive data will be added.
$ head test/test_results/RuntimeKafkaKafkaPolicyIngress/logs
level=info msg=Starting testName=RuntimeKafka
level=info msg="Vagrant: running command \"vagrant ssh-config runtime\""
cmd: "sudo cilium status" exitCode: 0
KVStore: Ok Consul: 172.17.0.3:8300
ContainerRuntime: Ok
Kubernetes: Disabled
Kubernetes APIs: [""]
Cilium: Ok OK
NodeMonitor: Disabled
Allocated IPv4 addresses:
Running with delve¶
Delve is a debugging tool for Go applications. If you want to run your test
with delve, you should add a new breakpoint using runtime.Breakpoint() in the
code, and run ginkgo using dlv.
Example of how to run ginkgo using dlv:
dlv test . -- --ginkgo.focus="Runtime" -ginkgo.v=true --cilium.provision=false
Running End-To-End Tests In Other Environments¶
If you want to run tests in a different VM, you can use --cilium.SSHConfig to
provide the SSH configuration of the endpoint on which tests will be run. The
tests presume the following on the remote instance:
- Cilium source code is located in the directory /home/vagrant/go/src/github.com/cilium/cilium/.
- Cilium is installed and running.
The SSH connection needs to be defined in an ssh-config file and needs to have the following targets:
- runtime: To run runtime tests
- k8s{1..2}-${K8S_VERSION}: to run Kubernetes tests. These instances must have Kubernetes installed and running as a prerequisite for running tests.
An example ssh-config
can be the following:
Host runtime
HostName 127.0.0.1
User vagrant
Port 2222
UserKnownHostsFile /dev/null
StrictHostKeyChecking no
PasswordAuthentication no
IdentityFile /home/eloy/.go/src/github.com/cilium/cilium/test/.vagrant/machines/runtime/virtualbox/private_key
IdentitiesOnly yes
LogLevel FATAL
To run this you can use the following command:
ginkgo -v -- --cilium.provision=false --cilium.SSHConfig="cat ssh-config"
VMs for Testing¶
The VMs used for testing are defined in test/Vagrantfile
. There are a variety of
configuration options that can be passed as environment variables:
ENV variable | Default Value | Options | Description |
---|---|---|---|
K8S_NODES | 2 | 0..100 | Number of Kubernetes nodes in the cluster |
NFS | 0 | 1 | If the Cilium folder needs to be shared using NFS |
IPv6 | 0 | 0-1 | If 1, the Kubernetes cluster will use IPv6 |
CONTAINER_RUNTIME | docker | containerd | To set the default container runtime in the Kubernetes cluster |
K8S_VERSION | 1.10 | 1.** | Kubernetes version to install |
SERVER_BOX | cilium/ubuntu-dev | | Vagrant Cloud base image |
CPU | 2 | 0..100 | Number of CPUs for each VM |
MEMORY | 4096 | \d+ | RAM size in Megabytes |
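As an illustrative sketch, several of these variables can be combined on a single test invocation from the test/ directory; the values below are examples only:
$ K8S_VERSION=1.10 K8S_NODES=2 NFS=1 MEMORY=8192 ginkgo --focus="K8s*" -noColor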
How to contribute¶
Getting Started¶
Make sure you have a GitHub account
Clone the cilium repository
go get -d github.com/cilium/cilium
cd $GOPATH/src/github.com/cilium/cilium
Set up your development environment (see Setting up the development environment)
Check the GitHub issues for good tasks to get started.
Submitting a pull request¶
Contributions must be submitted in the form of pull requests against the github repository at: https://github.com/cilium/cilium
- Fork the Cilium repository to your own personal GitHub space or request access to a Cilium developer account on Slack
- Push your changes to the topic branch in your fork of the repository.
- Submit a pull request on https://github.com/cilium/cilium.
Before hitting the submit button, please make sure that the following requirements have been met:
Each commit compiles and is functional on its own to allow for bisecting of commits.
All code is covered by unit and/or runtime tests where feasible.
All changes have been tested and checked for regressions by running the existing testsuite against your changes. See the End-To-End Testing Framework section for additional details.
All commits contain a well written commit description including a title, description and a Fixes: #XXX line if the commit addresses a particular GitHub issue. Note that the GitHub issue will be automatically closed when the commit is merged. For example:
apipanic: Log stack at debug level

Previously, it was difficult to debug issues when the API panicked
because only a single line like the following was printed:

level=warning msg="Cilium API handler panicked" client=@ method=GET
panic_message="write unix /var/run/cilium/cilium.sock->@: write: broken pipe"

This patch logs the stack at this point at debug level so that it can
at least be determined in developer environments.

Fixes: #4191

Signed-off-by: Joe Stringer <joe@covalent.io>
If any of the commits fixes a particular commit already in the tree, that commit is referenced in the commit message of the bugfix. This ensures that whoever performs a backport will pull in all required fixes:
daemon: use endpoint RLock in HandleEndpoint

Fixes: a804c7c7dd9a ("daemon: wait for endpoint to be in ready state if
specified via EndpointChangeRequest")

Signed-off-by: André Martins <andre@cilium.io>
All commits are signed off. See the section Developer’s Certificate of Origin.
Pick the appropriate milestone for which this PR is being targeted to, e.g. 1.1, 1.2. This is in particular important in the time frame between the feature freeze and final release date.
Pick the right release-note label:

Labels | When to set
---|---
release-note/bug | This is a non-trivial bugfix
release-note/major | This is a major feature addition, e.g. Add MongoDB support
release-note/minor | This is a minor feature addition, e.g. Refactor endpoint package

Verify the release note text. If not explicitly changed, the title of the PR will be used for the release notes. If you want to change this, you can add a special section to the description of the PR:

```release-note
This is a release note text
```
Note
If multiple lines are provided, then the first line serves as the high level bullet point item and any additional line will be added as a sub item to the first line.
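For illustration, a hypothetical multi-line release note block, where the second line becomes a sub item:

```release-note
Add support for the hypothetical example feature
Additional detail rendered as a sub item under the bullet above
```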
Pick the right labels for your PR:

Labels | When to set
---|---
kind/bug | This is a bugfix worth mentioning in the release notes
kind/enhancement | This is an enhancement/feature
priority/release-blocker | This PR should block the current release
area/* | Code area this PR covers
needs-backport/X.Y | PR needs to be backported to these stable releases
pending-review | PR is immediately ready for review
wip | PR is still work in progress, signals reviewers to hold
backport/X.Y | This is a backport PR, may only be set as part of the Backporting process
upgrade-impact | The code changes have a potential upgrade impact
Getting a pull request merged¶
- Submit the pull request as described in the section Submitting a pull request.
- One of the reviewers will start a CI run by replying with a comment test-me-please as described in Triggering Pull-Request Builds With Jenkins. If you are a core team member, you may trigger the CI run yourself.
  - Hound: basic golang/lint static code analyzer. You need to make the puppy happy.
  - CI / Jenkins: Will run a series of tests:
    - Unit tests
    - Single node runtime tests
    - Multi node Kubernetes tests
- As part of the submission, GitHub will have requested a review from the respective code owners according to the CODEOWNERS file in the repository.
- Address any feedback received from the reviewers.
- You can push individual commits to address feedback and then rebase your branch at the end before merging.
- Owners of the repository will automatically adjust the labels on the pull request to track its state and progress towards merging.
- Once the PR has been reviewed and the CI tests have passed, the PR will be merged by one of the repository owners. In case this does not happen, ping us on Slack.
Pull request review process¶
Note
These instructions assume that whoever is reviewing is a member of the Cilium GitHub organization or has the status of a contributor. This is required to obtain the privileges to modify GitHub labels on the pull request.
- Review overall correctness of the PR according to the rules specified in the section Submitting a pull request. Set the label accordingly.

  Labels | When to set
  ---|---
  dont-merge/needs-sign-off | Some commits are not signed off
  needs-rebase | PR is outdated and needs to be rebased

- As soon as a PR has the label pending-review, review the code and request changes as needed by using the GitHub Request Changes feature or by using Reviewable.
- Validate that bugfixes are marked with kind/bug and validate whether the assessment of backport requirements as requested by the submitter conforms to the Stable releases process.

  Labels | When to set
  ---|---
  needs-backport/X.Y | PR needs to be backported to these stable releases

- If the PR is subject to backport, validate that the PR does not mix bugfix and refactoring of code as it will heavily complicate the backport process. Demand for the PR to be split.
- Validate the release-note/* label and check the PR title for release note suitability. Put yourself into the perspective of a future release notes reader with lack of context and ensure the title is precise but brief.

  Labels | When to set
  ---|---
  dont-merge/needs-release-note | Do NOT merge PR, needs a release note
  release-note/bug | This is a non-trivial bugfix
  release-note/major | This is a major feature addition
  release-note/minor | This is a minor feature addition

- Check for upgrade compatibility impact and if in doubt, set the label upgrade-impact and discuss in the Slack channel.

  Labels | When to set
  ---|---
  upgrade-impact | The code changes have a potential upgrade impact

- When everything looks OK, approve the changes.
- When all review objectives for all CODEOWNERS are met and all CI tests have passed, you may set the ready-to-merge label to indicate that all criteria have been met.

  Labels | When to set
  ---|---
  ready-to-merge | PR is ready to be merged
Documentation¶
Building¶
The documentation has several dependencies which can be installed using pip:
$ pip install -r Documentation/requirements.txt
Whenever making changes to Cilium documentation you should check that you did not introduce any new warnings or errors, and also check that your changes look as you intended. To do this you can build the docs:
$ make -C Documentation html
After this you can browse the updated docs as HTML starting at Documentation/_build/html/index.html.
Alternatively you can use a Docker container to build the pages:
$ make render-docs
This builds the docs in a container and starts a web server serving your document changes.
Now the documentation page should be browsable on http://localhost:8080.
CI / Jenkins¶
The main CI infrastructure is maintained at https://jenkins.cilium.io/
Jobs Overview¶
Cilium-PR-Ginkgo-Tests-Validated¶
Runs the validated Ginkgo tests, which are confirmed to be stable. These tests must always pass.
The configuration for this job is contained within ginkgo.Jenkinsfile.
It first runs unit tests using docker-compose with the YAML file located at test/docker-compose.yaml.
The next steps happen in parallel:
- Runs the single-node e2e tests using the Docker runtime.
- Runs the multi-node Kubernetes e2e tests against the latest default version of Kubernetes specified above.
Cilium-PR-Ginkgo-Tests-k8s¶
Runs the Kubernetes e2e tests against all Kubernetes versions that are not currently tested as part of each pull-request, but which Cilium still supports, as well as the most-recently-released versions of Kubernetes that are not yet declared stable by Kubernetes upstream:
First stage (stable versions which Cilium still supports):
- 1.7
- 1.10
Second stage (unstable versions):
- 1.11 beta
- 1.12 alpha
Ginkgo-CI-Tests-Pipeline¶
Cilium-Nightly-Tests-PR¶
Runs long-lived tests which take extended time. Some of these tests have an expected failure rate.
Nightly tests run once per day in the Cilium-Nightly-Tests Job. The
configuration for this job is stored in Jenkinsfile.nightly
.
To see the results of these tests, you can view the JUnit Report for an individual job:
- Click on the build number you wish to get test results from on the left hand side of the Cilium-Nightly-Tests Job.
- Click on ‘Test Results’ on the left side of the page to view the results from the build. This will give you a report of which tests passed and failed. You can click on each test to view its corresponding output created from Ginkgo.
This first runs the Nightly tests with the following setup:
- 4 Kubernetes 1.8 nodes
- 4 GB of RAM per node.
- 4 vCPUs per node.
Then, it runs Kubernetes tests against versions of Kubernetes that are currently not tested as part of each pull-request, but that Cilium still supports.
It also runs a variety of tests against Envoy to ensure that proxy functionality is working correctly.
Triggering Pull-Request Builds With Jenkins¶
To ensure that build resources are used judiciously, builds on Jenkins are manually triggered via comments on each pull-request that contain “trigger-phrases”. Only members of the Cilium GitHub organization are allowed to trigger these jobs. Refer to the table below for information regarding which phrase triggers which build, which build is required for a pull-request to be merged, etc. Each linked job contains a description illustrating which subset of tests the job runs.
Jenkins Job | Trigger Phrase | Required To Merge? |
---|---|---|
Cilium-PR-Ginkgo-Tests-Validated | test-me-please | Yes |
Cilium-Pr-Ginkgo-Test-k8s | test-missed-k8s | No |
Cilium-Nightly-Tests-PR | test-nightly | No |
Cilium-PR-Doc-Tests | test-docs-please | No |
Cilium-PR-Kubernetes-Upstream | test-upstream-k8s | No |
There are some feature flags based on pull request labels; the list of labels is as follows:
- area/containerd: Enable containerd runtime on all Kubernetes tests.
Using Jenkins for testing¶
Typically when running Jenkins tests via one of the above trigger phrases, it will run all of the tests in that particular category. However, there may be cases where you just want to run a single test quickly on Jenkins and observe the test result. To do so, you need to update the relevant test to have a custom name, and to update the Jenkins file to focus that test. Below is an example patch that shows how this can be achieved.
diff --git a/ginkgo.Jenkinsfile b/ginkgo.Jenkinsfile
index ee17808748a6..637f99269a41 100644
--- a/ginkgo.Jenkinsfile
+++ b/ginkgo.Jenkinsfile
@@ -62,10 +62,10 @@ pipeline {
steps {
parallel(
"Runtime":{
- sh 'cd ${TESTDIR}; ginkgo --focus="RuntimeValidated*" -v -noColor'
+ sh 'cd ${TESTDIR}; ginkgo --focus="XFoooo*" -v -noColor'
},
"K8s-1.9":{
- sh 'cd ${TESTDIR}; K8S_VERSION=1.9 ginkgo --focus=" K8sValidated*" -v -noColor ${FAILFAST}'
+ sh 'cd ${TESTDIR}; K8S_VERSION=1.9 ginkgo --focus=" K8sFooooo*" -v -noColor ${FAILFAST}'
},
failFast: true
)
diff --git a/test/k8sT/Nightly.go b/test/k8sT/Nightly.go
index 62b324619797..3f955c73a818 100644
--- a/test/k8sT/Nightly.go
+++ b/test/k8sT/Nightly.go
@@ -466,7 +466,7 @@ var _ = Describe("NightlyExamples", func() {
})
- It("K8sValidated Updating Cilium stable to master", func() {
+ FIt("K8sFooooo K8sValidated Updating Cilium stable to master", func() {
podFilter := "k8s:zgroup=testapp"
//This test should run in each PR for now.
CI Failure Triage¶
This section describes the process to triage CI failures. We define 3 categories:
Keyword | Description |
---|---|
Flake | Failure due to a temporary situation such as loss of connectivity to external services or bug in system component, e.g. quay.io is down, VM race conditions, kube-dns bug, … |
CI-Bug | Bug in the test itself that renders the test unreliable, e.g. a timing issue when importing a policy and failing to block until the policy is enforced before connectivity is verified. |
Regression | Failure is due to a regression, all failures in the CI that are not caused by bugs in the test are considered regressions. |
Pipelines subject to triage¶
Build/test failures for the following Jenkins pipelines must be reported as GitHub issues using the process below:
Pipeline | Description |
---|---|
Ginkgo-Tests-Validated-master | Runs whenever a PR is merged into master |
Ginkgo-Tests-Validated-1.0 | Runs standard Ginkgo tests on merge into branch v1.0 |
Ginkgo-CI-Tests-Pipeline | Runs every two hours on the master branch |
Master-Nightly-Tests-All | Runs durability tests every night |
Vagrant-Master-Boxes-Packer-Build | Runs on merge into github.com/cilium/packer-ci-build. |
BETA-cilium-v1.1-standard | Runs standard Ginkgo tests on merge into branch v1.1 |
BETA-cilium-v1.1-K8s-all | Runs K8s tests on merge into branch v1.1 |
BETA-cilium-v1.1-K8s-Upstream | Runs K8s upstream tests on merge into branch v1.1 |
BETA-cilium-v1.1-Docs | Runs docs tests on merge into branch v1.1 |
BETA-cilium-v1.1-Nightly | Runs durability tests on branch v1.1 every night |
Note
BETA-cilium-v1.0-*
is currently not subject to the daily triage process
as the quality of the tests backported to that branch does not justify the
effort.
Triage process¶
Discover untriaged Jenkins failures via the jenkins-failures.sh script. It defaults to checking the previous 24 hours but this can be modified by setting the SINCE environment variable (it is a unix timestamp). The script checks the various test pipelines that need triage.
$ contrib/scripts/jenkins-failures.sh
Note
You can quickly assign SINCE with statements like
SINCE=`date -d -3days +%s`
Investigate the failure you are interested in and determine if it is a CI-Bug, Flake, or a Regression as defined in the table above.
Search GitHub issues to see if the bug is already filed. Make sure to also include closed issues in your search, as a CI issue can be considered solved and then re-appear. Good search terms are:
- The test name, e.g.
k8s-1.7.K8sValidatedKafkaPolicyTest Kafka Policy Tests KafkaPolicies (from (k8s-1.7.xml))
- The line on which the test failed, e.g.
github.com/cilium/cilium/test/k8sT/KafkaPolicies.go:202
- The error message, e.g.
Failed to produce from empire-hq on topic deathstar-plan
If a corresponding GitHub issue exists, update it with:
- A link to the failing Jenkins build (note that the build information is eventually deleted).
- Attach the zipfile downloaded from Jenkins with logs from the failing tests. A zipfile for all tests is also available.
- Check how much time has passed since the last reported occurrence of this failure and move this issue to the correct column in the CI flakes project board.
If no existing GitHub issue was found, file a new GitHub issue:
- Attach the zipfile downloaded from Jenkins with logs from the failing test.
- If the failure is a new regression or a real bug:
  - Title: <Short bug description>
  - Labels kind/bug and needs/triage.
- If the failure is a new CI-Bug, Flake, or if you are unsure:
  - Title: CI: <testname>: <cause>, e.g. CI: K8sValidatedPolicyTest Namespaces: cannot curl service
  - Labels kind/bug/CI and needs/triage
  - Include a link to the failing Jenkins build (note that the build information is eventually deleted).
  - Attach the zipfile downloaded from Jenkins with logs from the failing test.
  - Include the test name and the whole Stacktrace section to help others find this issue.
  - Add the issue to the CI flakes project.
Note
Be extra careful when you see a new flake on a PR, and want to open an issue. It’s much more difficult to debug these without context around the PR and the changes it introduced. When creating an issue for a PR flake, include a description of the code change, the PR, or the diff. If it isn’t related to the PR, then it should already happen in master, and a new issue isn’t needed.
Edit the description of the Jenkins build to mark it as triaged. This will exclude it from future jenkins-failures.sh output.
- Login -> Click on build -> Edit Build Information
- Add the failure type and GH issue number. Use the table describing the failure categories, at the beginning of this section, to help categorize them.
Note
This step can only be performed with an account on Jenkins. If you are interested in CI failure reviews and do not have an account yet, ping us on Slack.
Examples:
Flake, quay.io is down
Flake, DNS not ready, #3333
CI-Bug, K8sValidatedPolicyTest: Namespaces, pod not ready, #9939
Regression, k8s host policy, #1111
Infrastructure details¶
Logging into VM running tests¶
- If you have access to credentials for Jenkins, log into the Jenkins slave running the test workload
- Identify the vagrant box running the specific test
$ vagrant global-status
id name provider state directory
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
6e68c6c k8s1-build-PR-1588-6 virtualbox running /root/jenkins/workspace/cilium_cilium_PR-1588-CWL743UTZEF6CPEZCNXQVSZVEW32FR3CMGKGY6667CU7X43AAZ4Q/tests/k8s
ec5962a cilium-master-build-PR-1588-6 virtualbox running /root/jenkins/workspace/cilium_cilium_PR-1588-CWL743UTZEF6CPEZCNXQVSZVEW32FR3CMGKGY6667CU7X43AAZ4Q
bfaffaa k8s2-build-PR-1588-6 virtualbox running /root/jenkins/workspace/cilium_cilium_PR-1588-CWL743UTZEF6CPEZCNXQVSZVEW32FR3CMGKGY6667CU7X43AAZ4Q/tests/k8s
3fa346c k8s1-build-PR-1588-7 virtualbox running /root/jenkins/workspace/cilium_cilium_PR-1588-CWL743UTZEF6CPEZCNXQVSZVEW32FR3CMGKGY6667CU7X43AAZ4Q@2/tests/k8s
b7ded3c cilium-master-build-PR-1588-7 virtualbox running /root/jenkins/workspace/cilium_cilium_PR-1588-CWL743UTZEF6CPEZCNXQVSZVEW32FR3CMGKGY6667CU7X43AAZ4Q@2
- Log into the specific VM
$ JOB_BASE_NAME=PR-1588 BUILD_NUMBER=6 vagrant ssh 6e68c6c
Jenkinsfiles Extensions¶
Cilium uses a custom Jenkins helper library to gather metadata from PRs and simplify our Jenkinsfiles. The exported methods are:
- ispr(): returns true if the current build is a PR.
- setIfPr(string, string): returns the first argument in the case of a PR; otherwise returns the second one.
- BuildIfLabel(String label, String Job): triggers a new job if the PR has that specific label.
- Status(String status, String context): sets the pull request check status on the given context, e.g. Status("SUCCESS", "$JOB_BASE_NAME")
Release Management¶
This section describes the release cadence and all release related processes.
Release Cadence¶
Cilium schedules a minor release every 6 weeks. Each minor release is performed by incrementing the Y in the version format X.Y.Z. The group of committers can decide to increment X instead to mark major milestones, in which case Y is reset to 0.
Stable releases¶
The committers can nominate PRs merged into master as required for backport into the stable release branches. Upon necessity, stable releases are published with the version X.Y.Z+1. Stable releases are regularly released at high frequency or on demand to address major incidents.
In order to guarantee stable production usage while maintaining a high release cadence, the following stable releases will be maintained:
- Stable backports into the last two releases
- LTS release for extended long term backport coverage
Backport criteria for X.Y.Z+n¶
Criteria for inclusion into the latest stable release branch, i.e. what goes into v1.1.x before v1.2.0 has been released:
- All bugfixes
Backport criteria for X.Y-1.Z¶
Criteria for inclusion into the stable release branch of the previous release, i.e. what goes into v1.0.x before v1.2.0 has been released:
- Security relevant fixes
- Major bugfixes relevant to the correct operation of Cilium
LTS¶
The group of committers nominates a release to be a long term stable release. Such releases are guaranteed to receive backports for major and security relevant bugfixes. LTS releases will be declared end of life after 6 months. The group of committers will nominate and start supporting a new LTS release before the current LTS expires. If for some reason, no release can be declared LTS before the current LTS release expires, the current LTS lifetime will be extended.
Given the current 6-week release cadence, the development teams will aim at declaring every 3rd release to be an LTS to guarantee enough time overlap between LTS releases.
Current LTS releases¶
Release | Original Release Date | Scheduled End of Life |
---|---|---|
1.0 | 2018-04-24 | 2018-10-24 |
Generic Release Process¶
This process applies to all releases other than minor releases; this includes:
- Stable releases
- Release candidates
If you intend to release a new minor release, see the Minor Release Process section instead.
Note
The following commands have been validated when run in the VM used in the Cilium development process. See Setting up the development environment for detailed instructions about setting up said VM.
Ensure that the necessary backports have been completed and merged. See Backporting process.
Checkout the desired stable branch and pull it:
git checkout v1.0; git pull
Create a branch for the release pull request:
git checkout -b pr/prepare-v1.0.3
Update the VERSION file to represent X.Y.Z+1.
If this is the first release after creating a new release branch, adjust the image pull policy for all .sed files in examples/kubernetes from Always to IfNotPresent.
Update the image tag versions in the examples:
make -C examples/kubernetes clean all
Update the cilium_version and cilium_tag variables in examples/getting-started/Vagrantfile
Update the AUTHORS file:
make update-authors
Note
Check to see if the AUTHORS file has any formatting errors (for instance, indentation mismatches) as well as duplicate contributor names, and correct them accordingly.
Generate the NEWS.rst addition based off of the prior release tag (e.g., if you are generating the NEWS.rst for v1.0.3):
git shortlog v1.0.2.. > add-to-NEWS.rst
Add a new section to NEWS.rst:
v1.0.3
======
::
<<contents of add-to-NEWS.rst>>
[...]
<<end of add-to-NEWS.rst>>
Add all modified files using git add and create a pull request with the title Prepare for release v1.0.3. Add the backport label to the PR which corresponds to the branch for which the release is being performed, e.g. backport/1.0.
Note
Make sure to create the PR against the desired stable branch, in this case v1.0.
Follow standard procedures to get the aforementioned PR merged into the desired stable branch. See Submitting a pull request for more information about this process.
Check out the stable branch and pull your merged changes:
git checkout v1.0; git pull
Create release tags:
git tag -a v1.0.3 -m 'Release v1.0.3'
git tag -a 1.0.3 -m 'Release 1.0.3'
Note
There are two tags that correspond to the same release because GitHub recommends using vx.y.z for release version formatting, and ReadTheDocs, which hosts the Cilium documentation, requires the version to be in the format x.y.z. For more information about how ReadTheDocs does versioning, you can read their Versions Documentation.
Build the binaries and push them to the release bucket:
DOMAIN=releases.cilium.io ./contrib/release/uploadrev v1.0.3
This step will print a markdown snippet which you will need when crafting the GitHub release, so make sure to keep it handy.
Build the container images and push them:
DOCKER_IMAGE_TAG=v1.0.3 make docker-image
docker push cilium/cilium:v1.0.3
Push the git release tag
git push --tags
Create a GitHub release for the new tag:
- Choose the correct target branch, e.g. v1.0
- Choose the correct target tag, e.g. v1.0.3
- Title: 1.0.3
- Check the “This is a pre-release” box if you are releasing a release candidate.
- Fill in the release description:
Changes
-------
```
<< contents of NEWS.rst for this release >>
```
Release binaries
----------------
<< contents of snippet output by uploadrev >>
Preview the description and then publish the release
Announce the release in the #general channel on Slack.
Bump the version of Cilium used in the Cilium upgrade tests to use the new release. Please reach out on the #development channel on Slack for assistance with this task.
Minor Release Process¶
On Freeze date¶
Fork a new release branch from master:
git checkout master; git pull
git checkout -b v1.2
git push
Protect the branch using the GitHub UI to disallow direct push and require merging via PRs with proper reviews.
Replace the contents of the
CODEOWNERS
file with the following to reduce code reviews to essential approvals:
* @cilium/janitors
api/ @cilium/api
monitor/payload @cilium/api
pkg/apisocket/ @cilium/api
pkg/policy/api/ @cilium/api
pkg/proxy/accesslog @cilium/api
Commit changes, open a pull request against the new
v1.2
branch, and get the pull request merged:
git checkout -b pr/prepare-v1.2
git add [...]
git commit
git push
Follow the Generic Release Process to release v1.2.0-rc1.
Create the following GitHub labels:
- backport-pending/1.2
- backport-done/1.2
- backport/1.2
- needs-backport/1.2
Prepare the master branch for the next development cycle:
git checkout master; git pull
Update the VERSION file to contain v1.2.90
Add the VERSION file using git add and create & merge a PR titled Prepare for 1.3.0 development.
- Update the release branch on Jenkins to be tested on every change and Nightly.
For the final release¶
- Follow the Generic Release Process to create the final release, replacing X.Y.0-rcX with X.Y.0.
Backporting process¶
Cilium PRs that are marked with the label needs-backport/X.Y need to be backported to the stable branch X.Y. The following steps summarize the process for backporting these PRs.
- Make sure the GitHub labels are up-to-date, as this process will deal with all commits from PRs that have the needs-backport/X.Y label set (for a stable release version X.Y). If any PRs contain labels such as backport-pending/X.Y, ensure that the backport for that PR has been merged and, if so, change the label to backport-done/X.Y.
- The scripts referred to below need to be run on Linux; they do not work on OSX. You can use the cilium dev VM for this, but you need to configure git with the name and email address to be used in the commit messages:
$ git config --global user.name "John Doe"
$ git config --global user.email johndoe@example.com
- Make sure you have a GitHub developer access token available. For details, see contrib/backporting/README.md
- Fetch the repo, e.g.,
git fetch
- Check out a new branch for your backports based on the stable branch for that
version, e.g.,
git checkout -b pr/v1.0-backport-YY-MM-DD origin/v1.0
- Run the
check-stable
script, referring to your GitHub access token; this will list the commits that need backporting, from newest to oldest:
$ GITHUB_TOKEN=xxx contrib/backporting/check-stable 1.0
- Cherry-pick the commits using the master git SHAs listed, starting from the oldest (bottom), working your way up and fixing any merge conflicts as they appear. Note that for PRs that have multiple commits you will want to check that you are cherry-picking oldest commits first.
$ contrib/backporting/cherry-pick <oldest-commit-sha>
...
$ contrib/backporting/cherry-pick <newest-commit-sha>
Push your backports branch to the cilium repo, e.g.,
git push -u origin pr/v1.0-backports-YY-MM-DD
In GitHub, create a new PR from your branch towards the feature branch you are backporting to. Note that by default GitHub creates PRs against the master branch, so you will need to change it.
Label the new backport PR with the backport label for the stable branch, such as backport/X.Y, so that it is easy to find backport PRs later.
Mark all PRs you backported with the backport pending label backport-pending/X.Y and clear the needs-backport/vX.Y label. This can be done via the GitHub interface, or using the backport script contrib/backporting/set-labels.py, e.g.:
# Set PR 1234's v1.0 backporting labels to pending
$ contrib/backporting/set-labels.py 1234 pending 1.0
Note
contrib/backporting/set-labels.py
requires Python 3 and PyGithub installed.
After the backport PR is merged, mark all backported PRs with the backport-done/X.Y label and clear the backport-pending/X.Y label(s).
# Set PR 1234's v1.0 backporting labels to done
contrib/backporting/set-labels.py 1234 done 1.0
Update cilium-builder and cilium-runtime images¶
Log in to quay.io with your credentials for the repository that you want to update:
- cilium-builder - contains all envoy dependencies
- cilium-runtime - contains all cilium dependencies (excluding envoy dependencies)
- After login, select the tab “builds” on the left menu.
- Click on the wheel.
- Enable the trigger for that build trigger.
- Confirm that you want to enable the trigger.
- After enabling the trigger, click again on the wheel.
- Click on “Run Trigger Now”.
- A new pop-up will appear where you can select the branch that contains your changes.
- Select the branch that contains the new changes.
- After selecting your branch, click on “Start Build”.
- Once the build has started, you can disable the build trigger by clicking on the wheel.
- Click on “Disable Trigger”.
- Confirm that you want to disable the build trigger.
- Once the build is finished, click on Tags (on the left menu).
- Click on the wheel and add a new tag to the image that was built.
- Write the name of the tag that you want to give to the newly built image.
- Confirm the name is correct and click on “Create Tag”.
- After the new tag is created, you can delete the other tag, which is the name of your branch. Select the tag name.
- Click on Actions.
- Click on “Delete Tags”.
- Confirm that you want to delete the tag with your branch name.
You have created a new image build with a new tag. The next step is to update the repository root’s Dockerfile so that it points to the newly created cilium-builder or cilium-runtime image.
Nightly Docker image¶
After each successful Nightly build, a cilium/nightly image is pushed to dockerhub.
To use the latest nightly build, please use the cilium/nightly:latest tag.
Nightly images are stored on dockerhub tagged with the following format: YYYYMMDD-<job number>.
The job number is added to the tag for the unlikely event of two consecutive nightly builds being built on the same date.
Developer’s Certificate of Origin¶
To improve tracking of who did what, we’ve introduced a “sign-off” procedure.
The sign-off is a simple line at the end of the explanation for the commit, which certifies that you wrote it or otherwise have the right to pass it on as open-source work. The rules are pretty simple: if you can certify the below:
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
then you just add a line saying:
Signed-off-by: Random J Developer <random@developer.example.org>
Use your real name (sorry, no pseudonyms or anonymous contributions.)
Cilium Committer Grant/Revocation Policy¶
A Cilium committer is a participant in the project with the ability to commit code directly to the master repository. Commit access grants a broad ability to affect the progress of the project as presented by its most important artifact, the code and related resources that produce working binaries of Cilium. As such it represents a significant level of trust in an individual’s commitment to working with other committers and the community at large for the benefit of the project. It can not be granted lightly and, in the worst case, must be revocable if the trust placed in an individual was inappropriate.
This document suggests guidelines for granting and revoking commit access. It is intended to provide a framework for evaluation of such decisions without specifying deterministic rules that wouldn’t be sensitive to the nuance of specific situations. In the end the decision to grant or revoke committer privileges is a judgment call made by the existing set of committers.
Expectations for Developers with commit access¶
Pre-requisites¶
Be familiar with the Developer / Contributor Guide.
Review¶
Code (yours or others’) must be reviewed publicly (by you or others) before you push it to the repository. With one exception (see below), every change needs at least one review.
If one or more people know an area of code particularly well, code that affects that area should ordinarily get a review from one of them.
The riskier, more subtle, or more complicated the change, the more careful the review required. When a change needs careful review, use good judgment regarding the quality of reviews. If a change adds 1000 lines of new code, and a review posted 5 minutes later says just “Looks good,” then this is probably not a quality review.
(The size of a change is correlated with the amount of care needed in review, but it is not strictly tied to it. A search and replace across many files may not need much review, but one-line optimization changes can have widespread implications.)
Your own small changes to fix a recently broken build (“make”) or tests (“make check”), that you believe to be visible to a large number of developers, may be checked in without review. If you are not sure, ask for review.
Regularly review submitted code in areas where you have expertise. Consider reviewing other code as well.
Git conventions¶
If you apply a change (yours or another’s) then it is your responsibility to handle any resulting problems, especially broken builds and other regressions. If it is someone else’s change, then you can ask the original submitter to address it. Regardless, you need to ensure that the problem is fixed in a timely way. The definition of “timely” depends on the severity of the problem.
If a bug is present on master and other branches, fix it on master first, then backport the fix to other branches. Straightforward backports do not require additional review (beyond that for the fix on master).
Feature development should be done only on master. Occasionally it makes sense to add a feature to the most recent release branch, before the first actual release of that branch. These should be handled in the same way as bug fixes, that is, first implemented on master and then backported.
Keep the authorship of a commit clear by maintaining a correct list of “Signed-off-by:”s. If a confusing situation comes up, as it occasionally does, bring it up in the development forums. If you explain the use of “Signed-off-by:” to a new developer, explain not just how but why, since the intended meaning of “Signed-off-by:” is more important than the syntax.
Use Reported-by: and Tested-by: tags in commit messages to indicate the source of a bug report.
Keep the AUTHORS file up to date.
Granting Commit Access¶
Granting commit access should be considered when a candidate has demonstrated the following in their interaction with the project:
- Contribution of significant new features through the patch submission process where:
- Submissions are free of obvious critical defects
- Submissions do not typically require many iterations of improvement to be accepted
- Consistent participation in code review of others’ patches, including existing committers, with comments consistent with the overall project standards
- Assistance to those in the community who are less knowledgeable through active participation in project forums.
- Plans for sustained contribution to the project compatible with the project’s direction as viewed by current committers.
- Commitment to meet the expectations described in the “Expectations for Developers with commit access”
The process to grant commit access to a candidate is simple:
- An existing committer nominates the candidate by sending an email to all existing committers with information substantiating the contributions of the candidate in the areas described above.
- All existing committers discuss the pros and cons of granting commit access to the candidate in the email thread.
- When the discussion has converged or a reasonable time has elapsed without discussion developing (e.g. a few business days) the nominator calls for a final decision on the candidate with a followup email to the thread.
- Each committer may vote yes, no, or abstain by replying to the email thread. A failure to reply is an implicit abstention.
- After votes from all existing committers have been collected or a reasonable time has elapsed for them to be provided (e.g. a couple of business days) the votes are evaluated. To be granted commit access the candidate must receive yes votes from a majority of the existing committers and zero no votes. Since a no vote is effectively a veto of the candidate it should be accompanied by a reason for the vote.
- The nominator summarizes the result of the vote in an email to all existing committers.
- If the vote to grant commit access passed, the candidate is contacted with an invitation to become a committer to the project which asks them to agree to the committer expectations documented on the project web site.
- If the candidate agrees access is granted by setting up commit access to the repos.
Revoking Commit Access¶
There are two situations in which commit access might be revoked.
The straightforward situation is a committer who is no longer active in the project and has no plans to become active in the near future. The process in this case is:
- Any time after a committer has been inactive for more than 6 months any other committer to the project may identify that committer as a candidate for revocation of commit access due to inactivity.
- The plans of revocation should be sent in a private email to the candidate.
- If the candidate for removal states plans to continue participating no action is taken and this process terminates.
- If the candidate replies they no longer require commit access then commit access is removed and a notification is sent to the candidate and all existing committers.
- If the candidate can not be reached within 1 week of the first attempt to contact them, this process continues:
- A message proposing removal of commit access is sent to the candidate and all other committers.
- If the candidate for removal states plans to continue participating no action is taken.
- If the candidate replies they no longer require commit access then their access is removed.
- If the candidate can not be reached within 2 months of the second attempt to contact them, access is removed.
- In any case, where access is removed, this fact is published through an email to all existing committers (including the candidate for removal).
The more difficult situation is a committer who is behaving in a manner that is viewed as detrimental to the future of the project by other committers. This is a delicate situation with the potential for the creation of division within the greater community and should be handled with care. The process in this case is:
- Discuss the behavior of concern with the individual privately and explain why you believe it is detrimental to the project. Stick to the facts and keep the email professional. Avoid personal attacks and the temptation to hypothesize about unknowable information such as the other’s motivations. Make it clear that you would prefer not to discuss the behavior more widely but will have to raise it with other contributors if it does not change. Ideally the behavior is eliminated and no further action is required. If not,
- Start an email thread with all committers, including the source of the behavior, describing the behavior and the reason it is detrimental to the project. The message should have the same tone as the private discussion and should generally repeat the same points covered in that discussion. The person whose behavior is being questioned should not be surprised by anything presented in this discussion. Ideally the wider discussion provides more perspective to all participants and the issue is resolved. If not,
- Start an email thread with all committers except the source of the detrimental behavior requesting a vote on revocation of commit rights. Cite the discussion among all committers and describe all the reasons why it was not resolved satisfactorily. This email should be carefully written with the knowledge that the reasoning it contains may be published to the larger community to justify the decision.
- Each committer may vote yes, no, or abstain by replying to the email thread. A failure to reply is an implicit abstention.
- After all votes have been collected or a reasonable time has elapsed for them to be provided (e.g. a couple of business days) the votes are evaluated. For the request to revoke commit access for the candidate to pass it must receive yes votes from two thirds of the existing committers.
- anyone that votes no must provide their reasoning, and
- if the proposal passes then counter-arguments for the reasoning in no votes should also be documented along with the initial reasons the revocation was proposed. Ideally there should be no new counter-arguments supplied in a no vote as all concerns should have surfaced in the discussion before the vote.
- The original person to propose revocation summarizes the result of the vote in an email to all existing committers excepting the candidate for removal.
- If the vote to revoke commit access passes, access is removed and the candidate for revocation is informed of that fact and the reasons for it as documented in the email requesting the revocation vote.
- Ideally the revoked committer peacefully leaves the community and no further action is required. However, there is a distinct possibility that he/she will try to generate support for his/her point of view within the larger community. In this case the reasoning for removing commit access as described in the request for a vote will be published to the community.
Changing the Policy¶
The process for changing the policy is:
- Propose the changes to the policy in an email to all current committers and request discussion.
- After an appropriate period of discussion (a few days) update the proposal based on feedback if required and resend it to all current committers with a request for a formal vote.
- After all votes have been collected or a reasonable time has elapsed for them to be provided (e.g. a couple of business days) the votes are evaluated. For the request to modify the policy to pass it must receive yes votes from two thirds of the existing committers.
Template Emails¶
Nomination to Grant Commit Access¶
I would like to nominate *[candidate]* for commit access. I believe
*[he/she]* has met the conditions for commit access described in the
committer grant policy on the project web site in the following ways:
*[list of requirements & evidence]*
Please reply to all in this message thread with your comments and
questions. If that discussion concludes favorably I will request a formal
vote on the nomination in a few days.
Vote to Grant Commit Access¶
I nominated *[candidate]* for commit access on *[date]*. Having allowed
sufficient time for discussion it's now time to formally vote on the
proposal.
Please reply to all in this thread with your vote of: YES, NO, or ABSTAIN.
A failure to reply will be counted as an abstention. If you vote NO, by our
policy you must include the reasons for that vote in your reply. The
deadline for votes is *[date and time]*.
If a majority of committers vote YES and there are zero NO votes commit
access will be granted.
Vote Results for Grant of Commit Access¶
The voting period for granting to commit access to *[candidate]* initiated
at *[date and time]* is now closed with the following results:
YES: *[count of yes votes]* (*[% of voters]*)
NO: *[count of no votes]* (*[% of voters]*)
ABSTAIN: *[count of abstentions]* (*[% of voters]*)
Based on these results commit access *[is/is NOT]* granted.
Invitation to Accepted Committer¶
Due to your sustained contributions to the Cilium project we
would like to provide you with commit access to the project repository.
Developers with commit access must agree to fulfill specific
responsibilities described in the source repository:
/Documentation/commit-access.rst
Please let us know if you would like to accept commit access and if so that
you agree to fulfill these responsibilities. Once we receive your response
we'll set up access. We're looking forward to continuing to work together to
advance the Cilium project.
Proposal to Remove Commit Access for Inactivity¶
Committer *[candidate]* has been inactive for *[duration]*. I have
attempted to privately contact *[him/her]* and *[he/she]* could not be
reached.
Based on this I would like to formally propose removal of commit access.
If a response to this message documenting the reasons to retain commit
access is not received by *[date]* access will be removed.
Notification of Commit Removal for Inactivity¶
Committer *[candidate]* has been inactive for *[duration]*. *[He/she]*
*[stated no commit access is required/failed to respond]* to the formal
proposal to remove access on *[date]*. Commit access has now been removed.
Proposal to Revoke Commit Access for Detrimental Behavior¶
I regret that I feel compelled to propose revocation of commit access for
*[candidate]*. I have privately discussed with *[him/her]* the following
reasons I believe *[his/her]* actions are detrimental to the project and we
have failed to come to a mutual understanding:
*[List of reasons and supporting evidence]*
Please reply to all in this thread with your thoughts on this proposal. I
plan to formally propose a vote on the proposal on or after *[date and
time]*.
It is important to get all discussion points both for and against the
proposal on the table during the discussion period prior to the vote.
Please make it a high priority to respond to this proposal with your
thoughts.
Vote to Revoke Commit Access¶
I nominated *[candidate]* for revocation of commit access on *[date]*.
Having allowed sufficient time for discussion it's now time to formally
vote on the proposal.
Please reply to all in this thread with your vote of: YES, NO, or ABSTAIN.
A failure to reply will be counted as an abstention. If you vote NO, by our
policy you must include the reasons for that vote in your reply. The
deadline for votes is *[date and time]*.
If 2/3rds of committers vote YES commit access will be revoked.
The following reasons for revocation have been given in the original
proposal or during discussion:
*[list of reasons to remove access]*
The following reasons for retaining access were discussed:
*[list of reasons to retain access]*
The counter-argument for each reason for retaining access is:
*[list of counter-arguments for retaining access]*
Vote Results for Revocation of Commit Access¶
The voting period for revoking the commit access of *[candidate]* initiated
at *[date and time]* is now closed with the following results:
- YES: *[count of yes votes]* (*[% of voters]*)
- NO: *[count of no votes]* (*[% of voters]*)
- ABSTAIN: *[count of abstentions]* (*[% of voters]*)
Based on these results commit access *[is/is NOT]* revoked. The following
reasons for retaining commit access were proposed in NO votes:
*[list of reasons]*
The counter-arguments for each of these reasons are:
*[list of counter-arguments]*
Notification of Commit Revocation for Detrimental Behavior¶
After private discussion with you and careful consideration of the
situation, the other committers to the Cilium project have
concluded that it is in the best interest of the project that your commit
access to the project repositories be revoked and this has now occurred.
The reasons for this decision are:
*[list of reasons for removing access]*
While your goals and those of the project no longer appear to be aligned we
greatly appreciate all the work you have done for the project and wish you
continued success in your future work.
BPF and XDP Reference Guide¶
Note
This documentation section is targeted at developers and users who want to understand BPF and XDP in great technical depth. While reading this reference guide may help broaden your understanding of Cilium, it is not a requirement to use Cilium. Please refer to the Getting Started Guides and Concepts for a higher level introduction.
BPF is a highly flexible and efficient virtual machine-like construct in the Linux kernel that allows bytecode to be executed at various hook points in a safe manner. It is used in a number of Linux kernel subsystems, most prominently networking, tracing and security (e.g. sandboxing).
Although BPF has existed since 1992, this document covers the extended Berkeley Packet Filter (eBPF) version, which first appeared in kernel 3.18 and renders the original version, these days referred to as “classic” BPF (cBPF), mostly obsolete. cBPF is known to many as being the packet filter language used by tcpdump. Nowadays, the Linux kernel runs eBPF only and loaded cBPF bytecode is transparently translated into an eBPF representation in the kernel before program execution. This documentation will generally refer to the term BPF unless explicit differences between eBPF and cBPF are being pointed out.
Even though the name Berkeley Packet Filter hints at a packet filtering specific purpose, the instruction set is generic and flexible enough these days that there are many use cases for BPF apart from networking. See Further Reading for a list of projects which use BPF.
Cilium uses BPF heavily in its data path, see Concepts for further information. The goal of this chapter is to provide a BPF reference guide in order to gain understanding of BPF, its networking specific use including loading BPF programs with tc (traffic control) and XDP (eXpress Data Path), and to aid with developing Cilium’s BPF templates.
BPF Architecture¶
BPF does not define itself by only providing its instruction set, but also by offering further infrastructure around it such as maps which act as efficient key / value stores, helper functions to interact with and leverage kernel functionality, tail calls for calling into other BPF programs, security hardening primitives, a pseudo file system for pinning objects (maps, programs), and infrastructure for allowing BPF to be offloaded, for example, to a network card.
LLVM provides a BPF back end, so that tools like clang can be used to compile C into a BPF object file, which can then be loaded into the kernel. BPF is deeply tied to the Linux kernel and allows for full programmability without sacrificing native kernel performance.
Last but not least, the kernel subsystems making use of BPF are themselves part of BPF’s infrastructure. The two main subsystems discussed throughout this document are tc and XDP, to which BPF programs can be attached. XDP BPF programs are attached at the earliest networking driver stage and trigger a run of the BPF program upon packet reception. By definition, this achieves the best possible packet processing performance since packets cannot get processed at an even earlier point in software. However, since this processing occurs so early in the networking stack, the stack has not yet extracted metadata out of the packet. On the other hand, tc BPF programs are executed later in the kernel stack, so they have access to more metadata and core kernel functionality. Apart from tc and XDP programs, there are various other kernel subsystems which use BPF as well, such as tracing (kprobes, uprobes, tracepoints, etc).
The following subsections provide further details on individual aspects of the BPF architecture.
Instruction Set¶
BPF is a general purpose RISC instruction set and was originally designed for the purpose of writing programs in a subset of C which can be compiled into BPF instructions through a compiler back end (e.g. LLVM), so that the kernel can later on map them through an in-kernel JIT compiler into native opcodes for optimal execution performance inside the kernel.
The advantages for pushing these instructions into the kernel include:
- Making the kernel programmable without having to cross kernel / user space boundaries. For example, BPF programs related to networking, as in the case of Cilium, can implement flexible container policies, load balancing and other means without having to move packets to user space and back into the kernel. State between BPF programs and kernel / user space can still be shared through maps whenever needed.
- Given the flexibility of a programmable data path, programs can be heavily optimized for performance also by compiling out features that are not required for the use cases the program solves. For example, if a container does not require IPv4, then the BPF program can be built to only deal with IPv6 in order to save resources in the fast-path.
- In case of networking (e.g. tc and XDP), BPF programs can be updated atomically without having to restart the kernel, system services or containers, and without traffic interruptions. Furthermore, any program state can also be maintained throughout updates via BPF maps.
- BPF provides a stable ABI towards user space, and does not require any third party kernel modules. BPF is a core part of the Linux kernel that is shipped everywhere, and guarantees that existing BPF programs keep running with newer kernel versions. This guarantee is the same guarantee that the kernel provides for system calls with regard to user space applications. Moreover, BPF programs are portable across different architectures.
- BPF programs work in concert with the kernel, they make use of existing kernel infrastructure (e.g. drivers, netdevices, tunnels, protocol stack, sockets) and tooling (e.g. iproute2) as well as the safety guarantees which the kernel provides. Unlike kernel modules, BPF programs are verified through an in-kernel verifier in order to ensure that they cannot crash the kernel, always terminate, etc. XDP programs, for example, reuse the existing in-kernel drivers and operate on the provided DMA buffers containing the packet frames without exposing them or an entire driver to user space as in other models. Moreover, XDP programs reuse the existing stack instead of bypassing it. BPF can be considered a generic “glue code” to kernel facilities for crafting programs to solve specific use cases.
The execution of a BPF program inside the kernel is always event driven! For example, a networking device which has a BPF program attached on its ingress path will trigger the execution of the program once a packet is received. Similarly, a kernel address which has a kprobe with a BPF program attached will trap once the code at that address gets executed, invoke the kprobe’s callback function for instrumentation, and subsequently trigger the execution of the BPF program attached to it.
BPF consists of eleven 64 bit registers with 32 bit subregisters, a program counter and a 512 byte large BPF stack space. Registers are named r0 - r10. The operating mode is 64 bit by default, the 32 bit subregisters can only be accessed through special ALU (arithmetic logic unit) operations. The 32 bit lower subregisters zero-extend into 64 bit when they are being written to.
Register r10 is the only register which is read-only and contains the frame pointer address in order to access the BPF stack space. The remaining r0 - r9 registers are general purpose and of read/write nature.
A BPF program can call into a predefined helper function, which is defined by the core kernel (never by modules). The BPF calling convention is defined as follows:
- r0 contains the return value of a helper function call.
- r1 - r5 hold arguments from the BPF program to the kernel helper function.
- r6 - r9 are callee saved registers that will be preserved on helper function call.
The BPF calling convention is generic enough to map directly to x86_64
, arm64
and other ABIs, thus all BPF registers map one to one to HW CPU registers, so that a
JIT only needs to issue a call instruction, but no additional extra moves for placing
function arguments. This calling convention was modeled to cover common call
situations without having a performance penalty. Calls with 6 or more arguments
are currently not supported. The helper functions in the kernel which are dedicated
to BPF (BPF_CALL_0()
to BPF_CALL_5()
functions) are specifically designed
with this convention in mind.
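To make this more concrete, the sketch below shows roughly how a two-argument helper is declared on the kernel side with these macros. It is a simplified illustration modeled on kernel/bpf/helpers.c, not verbatim kernel source:
```c
/* Simplified sketch, not verbatim kernel code: BPF_CALL_2() declares a helper
 * taking two arguments. Per the calling convention, the calling BPF program
 * places the map pointer in r1 and the key pointer in r2 before the call,
 * and the helper's result is handed back to the program in r0. */
BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
{
        WARN_ON_ONCE(!rcu_read_lock_held());
        return (unsigned long) map->ops->map_lookup_elem(map, key);
}
```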
Register r0
is also the register containing the exit value for the BPF program.
The semantics of the exit value are defined by the type of program. Furthermore, when
handing execution back to the kernel, the exit value is passed as a 32 bit value.
Registers r1 - r5
are scratch registers, meaning the BPF program needs to
either spill them to the BPF stack or move them to callee saved registers if these
arguments are to be reused across multiple helper function calls. Spilling means
that the variable in the register is moved to the BPF stack. The reverse operation
of moving the variable from the BPF stack to the register is called filling. The
reason for spilling/filling is due to the limited number of registers.
Upon entering execution of a BPF program, register r1
initially contains the
context for the program. The context is the input argument for the program (similar
to argc/argv
pair for a typical C program). BPF is restricted to work on a single
context. The context is defined by the program type, for example, a networking
program can have a kernel representation of the network packet (skb
) as the
input argument.
The general operation of BPF is 64 bit to follow the natural model of 64 bit architectures in order to perform pointer arithmetics, pass pointers but also pass 64 bit values into helper functions, and to allow for 64 bit atomic operations.
The maximum instruction limit per program is restricted to 4096 BPF instructions, which, by design, means that any program will terminate quickly. Although the instruction set contains forward as well as backward jumps, the in-kernel BPF verifier will forbid loops so that termination is always guaranteed. Since BPF programs run inside the kernel, the verifier’s job is to make sure that these are safe to run, not affecting the system’s stability. This means that from an instruction set point of view, loops can be implemented, but the verifier will restrict that. However, there is also a concept of tail calls that allows for one BPF program to jump into another one. This, too, comes with an upper nesting limit of 32 calls, and is usually used to decouple parts of the program logic, for example, into stages.
The instruction format is modeled as two operand instructions, which helps mapping BPF instructions to native instructions during the JIT phase. The instruction set is of fixed size, meaning every instruction has a 64 bit encoding. Currently, 87 instructions have been implemented and the encoding also allows the set to be extended with further instructions when needed. The instruction encoding of a single 64 bit instruction on a big-endian machine is defined as a bit sequence from most significant bit (MSB) to least significant bit (LSB) of op:8, dst_reg:4, src_reg:4, off:16, imm:32. off and imm are of signed type. The encodings are part of the kernel headers and defined in the linux/bpf.h header, which also includes linux/bpf_common.h.
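In C, this bit layout corresponds to the struct bpf_insn definition exported through the linux/bpf.h UAPI header, which looks roughly as follows:
```c
struct bpf_insn {
        __u8    code;           /* opcode (op:8) */
        __u8    dst_reg:4;      /* destination register */
        __u8    src_reg:4;      /* source register */
        __s16   off;            /* signed offset */
        __s32   imm;            /* signed immediate constant */
};
```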
op defines the actual operation to be performed. Most of the encoding for op has been reused from cBPF. The operation can be based on register or immediate operands. The encoding of op itself provides information on which mode to use (BPF_X for denoting register-based operations, and BPF_K for immediate-based operations respectively). In the latter case, the destination operand is always a register. Both dst_reg and src_reg provide additional information about the register operands to be used (e.g. r0 - r9) for the operation. off is used in some instructions to provide a relative offset, for example, for addressing the stack or other buffers available to BPF (e.g. map values, packet data, etc), or jump targets in jump instructions. imm contains a constant / immediate value.
The available op instructions can be categorized into various instruction classes. These classes are also encoded inside the op field. The op field is divided into (from MSB to LSB) code:4, source:1 and class:3. class is the more generic instruction class, code denotes a specific operational code inside that class, and source tells whether the source operand is a register or an immediate value. Possible instruction classes include:
- BPF_LD, BPF_LDX: Both classes are for load operations. BPF_LD is used for loading a double word as a special instruction spanning two instructions due to the imm:32 split, and for byte / half-word / word loads of packet data. The latter was carried over from cBPF mainly in order to keep cBPF to BPF translations efficient, since they have optimized JIT code. For native BPF these packet load instructions are less relevant nowadays. The BPF_LDX class holds instructions for byte / half-word / word / double-word loads out of memory. Memory in this context is generic and could be stack memory, map value data, packet data, etc.
- BPF_ST, BPF_STX: Both classes are for store operations. Similar to BPF_LDX, BPF_STX is the store counterpart and is used to store the data from a register into memory, which, again, can be stack memory, map value, packet data, etc. BPF_STX also holds special instructions for performing word and double-word based atomic add operations, which can be used for counters, for example. The BPF_ST class is similar to BPF_STX by providing instructions for storing data into memory, only that the source operand is an immediate value.
- BPF_ALU, BPF_ALU64: Both classes contain ALU operations. Generally, BPF_ALU operations are in 32 bit mode and BPF_ALU64 in 64 bit mode. Both ALU classes have basic operations with a register-based source operand and an immediate-based counterpart. Supported by both are add (+), sub (-), and (&), or (|), left shift (<<), right shift (>>), xor (^), mul (*), div (/), mod (%) and neg (~) operations. Also mov (<X> := <Y>) was added as a special ALU operation for both classes in both operand modes. BPF_ALU64 also contains a signed right shift. BPF_ALU additionally contains endianness conversion instructions for half-word / word / double-word on a given source register.
- BPF_JMP: This class is dedicated to jump operations. Jumps can be unconditional and conditional. Unconditional jumps simply move the program counter forward, so that the next instruction to be executed relative to the current instruction is off + 1, where off is the constant offset encoded in the instruction. Since off is signed, the jump can also be performed backwards as long as it does not create a loop and is within program bounds. Conditional jumps operate on both register-based and immediate-based source operands. If the condition in the jump operation results in true, then a relative jump to off + 1 is performed, otherwise the next instruction (0 + 1) is performed. This fall-through jump logic differs compared to cBPF and allows for better branch prediction as it fits the CPU branch predictor logic more naturally. Available conditions are jeq (==), jne (!=), jgt (>), jge (>=), jsgt (signed >), jsge (signed >=), jlt (<), jle (<=), jslt (signed <), jsle (signed <=) and jset (jump if DST & SRC). Apart from that, there are three special jump operations within this class: the exit instruction which will leave the BPF program and return the current value in r0 as a return code, the call instruction, which will issue a function call into one of the available BPF helper functions, and a hidden tail call instruction, which will jump into a different BPF program.
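To make the encoding a bit more tangible, the following minimal sketch (an illustration written for this guide, not code from the kernel tree) spells out the two instruction program r0 = 1; exit, which also shows up in the toolchain examples later on, as raw struct bpf_insn entries from linux/bpf.h, combining the class, code and source parts into the op field as described above:

#include <stdio.h>
#include <linux/bpf.h>

/* Illustration only: "r0 = 1; exit" written as raw instruction encodings.
 * BPF_ALU64 | BPF_MOV | BPF_K is the op for a 64 bit mov with an immediate
 * source operand, BPF_JMP | BPF_EXIT encodes the exit instruction; off is
 * unused in both cases.
 */
static const struct bpf_insn prog[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0,
      .src_reg = 0, .off = 0, .imm = 1 },
    { .code = BPF_JMP | BPF_EXIT, .dst_reg = 0, .src_reg = 0,
      .off = 0, .imm = 0 },
};

int main(void)
{
    printf("%zu instructions of %zu bytes each\n",
           sizeof(prog) / sizeof(prog[0]), sizeof(prog[0]));
    return 0;
}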
The Linux kernel is shipped with a BPF interpreter which executes programs assembled in BPF instructions. Even cBPF programs are translated into eBPF programs transparently in the kernel, except for architectures that still ship with a cBPF JIT and have not yet migrated to an eBPF JIT.
Currently x86_64, arm64, ppc64, s390x, mips64, sparc64 and arm architectures come with an in-kernel eBPF JIT compiler.
All BPF handling such as loading of programs into the kernel or creation of BPF maps
is managed through a central bpf()
system call. It is also used for managing map
entries (lookup / update / delete), and making programs as well as maps persistent
in the BPF file system through pinning.
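As a rough illustration of this interface (a minimal sketch written for this guide rather than how Cilium or iproute2 wrap the system call, with error handling mostly omitted and sufficient privileges assumed), the bpf() system call can be driven directly through syscall(2) and union bpf_attr, here creating a small array map and updating and looking up one of its slots:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Thin wrapper around the bpf(2) system call; glibc does not provide one. */
static int bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
    union bpf_attr attr;
    uint32_t key = 0, value = 42, out = 0;
    int fd;

    /* BPF_MAP_CREATE: a 4 slot array map with 32 bit keys and values. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_ARRAY;
    attr.key_size    = sizeof(uint32_t);
    attr.value_size  = sizeof(uint32_t);
    attr.max_entries = 4;

    fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    if (fd < 0)
        return 1;

    /* BPF_MAP_UPDATE_ELEM / BPF_MAP_LOOKUP_ELEM on the new map. */
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key    = (uint64_t)(unsigned long)&key;
    attr.value  = (uint64_t)(unsigned long)&value;
    attr.flags  = BPF_ANY;
    bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

    attr.value = (uint64_t)(unsigned long)&out;
    bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));

    printf("slot %u holds %u\n", key, out);
    return 0;
}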
Helper Functions¶
Helper functions are a concept which enables BPF programs to consult a core kernel defined set of function calls in order to retrieve / push data from / to the kernel. Available helper functions may differ for each BPF program type, for example, BPF programs attached to sockets are only allowed to call into a subset of helpers compared to BPF programs attached to the tc layer. Encapsulation and decapsulation helpers for lightweight tunneling constitute an example of functions which are only available to lower tc layers, whereas event output helpers for pushing notifications to user space are available to tc and XDP programs.
Each helper function is implemented with a commonly shared function signature similar to system calls. The signature is defined as:
u64 fn(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
The calling convention as described in the previous section applies to all BPF helper functions.
The kernel abstracts helper functions into macros BPF_CALL_0()
to BPF_CALL_5()
which are similar to those of system calls. The following example is an extract
from a helper function which updates map elements by calling into the
corresponding map implementation callbacks:
BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
           void *, value, u64, flags)
{
    WARN_ON_ONCE(!rcu_read_lock_held());
    return map->ops->map_update_elem(map, key, value, flags);
}
const struct bpf_func_proto bpf_map_update_elem_proto = {
.func = bpf_map_update_elem,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_CONST_MAP_PTR,
.arg2_type = ARG_PTR_TO_MAP_KEY,
.arg3_type = ARG_PTR_TO_MAP_VALUE,
.arg4_type = ARG_ANYTHING,
};
This approach has various advantages: while cBPF overloaded its load instructions in order to fetch data at an impossible packet offset to invoke auxiliary helper functions, each cBPF JIT needed to implement support for such a cBPF extension. In the case of eBPF, each newly added helper function will be JIT compiled in a transparent and efficient way, meaning that the JIT compiler only needs to emit a call instruction, since the register mapping is made in such a way that BPF register assignments already match the underlying architecture’s calling convention. This allows for easily extending the core kernel with new helper functionality. All BPF helper functions are part of the core kernel and cannot be extended or added through kernel modules.
The aforementioned function signature also allows the verifier to perform type checks. The above struct bpf_func_proto is used to hand all the necessary information which needs to be known about the helper to the verifier, so that the verifier can make sure that the expected types from the helper match the current contents of the BPF program’s analyzed registers.
Argument types can range from passing in any kind of value up to restricted contents such as a pointer / size pair for the BPF stack buffer, which the helper should read from or write to. In the latter case, the verifier can also perform additional checks, for example, whether the buffer was previously initialized.
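As a sketch of the pointer / size pair case (following the helper declaration style of the iproute2 based examples later in this guide; the section name, buffer length and drop action are arbitrary choices for illustration), a tc program could read the first bytes of a packet into a stack buffer through the skb_load_bytes() helper, where the verifier checks that the passed buffer pointer and length stay within the program’s stack:

#include <stdint.h>
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

#ifndef BPF_FUNC
# define BPF_FUNC(NAME, ...)              \
   (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
#endif

static int BPF_FUNC(skb_load_bytes, const struct __sk_buff *skb, uint32_t off,
                    void *to, uint32_t len);

__section("cls")
int cls_main(struct __sk_buff *skb)
{
    uint8_t buf[14]; /* stack buffer handed to the helper as pointer + size */

    /* The verifier tracks that buf / sizeof(buf) form a valid stack region. */
    if (skb_load_bytes(skb, 0, buf, sizeof(buf)) < 0)
        return TC_ACT_SHOT;

    return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";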
The list of available BPF helper functions is rather long and constantly growing,
for example, at the time of this writing, tc BPF programs can choose from 38
different BPF helpers. The kernel’s struct bpf_verifier_ops
contains a
get_func_proto
callback function that provides the mapping of a specific
enum bpf_func_id
to one of the available helpers for a given BPF program
type.
Maps¶
Maps are efficient key / value stores that reside in kernel space. They can be accessed from a BPF program in order to keep state among multiple BPF program invocations. They can also be accessed through file descriptors from user space and can be arbitrarily shared with other BPF programs or user space applications.
BPF programs which share maps with each other are not required to be of the same program type, for example, tracing programs can share maps with networking programs. A single BPF program can currently access up to 64 different maps directly.
Map implementations are provided by the core kernel. There are generic maps with per-CPU and non-per-CPU flavor that can read / write arbitrary data, but there are also a few non-generic maps that are used along with helper functions.
Generic maps currently available are BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, BPF_MAP_TYPE_PERCPU_HASH, BPF_MAP_TYPE_PERCPU_ARRAY, BPF_MAP_TYPE_LRU_HASH, BPF_MAP_TYPE_LRU_PERCPU_HASH and BPF_MAP_TYPE_LPM_TRIE. They all use the same common set of BPF helper functions in order to perform lookup, update or delete operations while implementing a different backend with differing semantics and performance characteristics.
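As a small sketch of how this common helper set looks from BPF C code (reusing the BPF_FUNC declaration style and iproute2’s struct bpf_elf_map, both introduced in the toolchain section later in this guide; the map layout and section names are arbitrary choices for illustration), a tc program could count packets per packet length in a hash map through the lookup and update helpers:

#include <stdint.h>
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <iproute2/bpf_elf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

#ifndef BPF_FUNC
# define BPF_FUNC(NAME, ...)              \
   (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
#endif

static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);
static int BPF_FUNC(map_update_elem, void *map, const void *key,
                    const void *value, uint64_t flags);

/* Hash map keyed by packet length, counting occurrences per length. */
struct bpf_elf_map len_map __section("maps") = {
    .type       = BPF_MAP_TYPE_HASH,
    .size_key   = sizeof(uint32_t),
    .size_value = sizeof(uint32_t),
    .max_elem   = 1024,
};

__section("cls")
int cls_main(struct __sk_buff *skb)
{
    uint32_t key = skb->len, init = 1, *count;

    count = map_lookup_elem(&len_map, &key);
    if (count)
        __sync_fetch_and_add(count, 1);
    else
        map_update_elem(&len_map, &key, &init, BPF_ANY);

    return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";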
Non-generic maps that are currently in the kernel are BPF_MAP_TYPE_PROG_ARRAY, BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_MAP_TYPE_CGROUP_ARRAY, BPF_MAP_TYPE_STACK_TRACE, BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS. For example, BPF_MAP_TYPE_PROG_ARRAY is an array map which holds other BPF programs, while BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS both hold pointers to other maps such that entire BPF maps can be atomically replaced at runtime. These types of maps tackle specific issues which were unsuitable to be implemented solely through a BPF helper function, since additional (non-data) state is required to be held across BPF program invocations.
Object Pinning¶
BPF maps and programs act as kernel resources and can only be accessed through file descriptors, backed by anonymous inodes in the kernel. This brings a number of advantages, but also a number of disadvantages:

User space applications can make use of most file descriptor related APIs, and file descriptor passing for Unix domain sockets works transparently, etc. At the same time, however, file descriptors are limited to a process’s lifetime, which makes options like map sharing rather cumbersome to carry out.

This brings a number of complications for certain use cases such as iproute2, where tc or XDP sets up and loads the program into the kernel and eventually terminates itself. With that, access to maps from user space is also gone, where it could otherwise be useful, for example, when maps are shared between ingress and egress locations of the data path. Also, third party applications may wish to monitor or update map contents during BPF program runtime.
To overcome this limitation, a minimal kernel space BPF file system has been implemented, where BPF maps and programs can be pinned to, a process called object pinning. The BPF system call has therefore been extended with two new commands which can pin (BPF_OBJ_PIN) or retrieve (BPF_OBJ_GET) a previously pinned object.
For instance, tools such as tc make use of this infrastructure for sharing maps on ingress and egress. The BPF related file system is not a singleton; it supports multiple mount instances, hard and soft links, and so on.
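A rough user space sketch of these two commands (a minimal illustration with error handling left out; the pin path is just an example and requires a mounted BPF file system as well as sufficient privileges):

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Pin an existing map or program file descriptor at a path below the
 * mounted BPF file system (usually /sys/fs/bpf).
 */
int bpf_obj_pin(int fd, const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(unsigned long)path;
    attr.bpf_fd   = fd;

    return bpf(BPF_OBJ_PIN, &attr, sizeof(attr));
}

/* Retrieve a new file descriptor for a previously pinned object. */
int bpf_obj_get(const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(unsigned long)path;

    return bpf(BPF_OBJ_GET, &attr, sizeof(attr));
}

int main(void)
{
    /* Example path only: where tc pins global maps, see the tc examples
     * in the toolchain section below.
     */
    int fd = bpf_obj_get("/sys/fs/bpf/tc/globals/acc_map");

    return fd < 0 ? 1 : 0;
}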
Tail Calls¶
Another concept that can be used with BPF is called tail calls. Tail calls can be seen as a mechanism that allows one BPF program to call another, without returning to the old program. Such a call has minimal overhead as, unlike function calls, it is implemented as a long jump, reusing the same stack frame.

Such programs are verified independently of each other, thus for transferring state, either per-CPU maps as scratch buffers or, in the case of tc programs, skb fields such as the cb[] area must be used.
Only programs of the same type can be tail called, and they also need to match in terms of JIT compilation, thus either JIT compiled or only interpreted programs can be invoked, but not mixed together.
There are two components involved for carrying out tail calls: the first part needs to set up a specialized map called program array (BPF_MAP_TYPE_PROG_ARRAY) that can be populated by user space with key / values, where the values are the file descriptors of the tail called BPF programs; the second part is the bpf_tail_call() helper to which the context, a reference to the program array and the lookup key are passed. The kernel then inlines this helper call directly into a specialized BPF instruction. Such a program array is currently write-only from user space side.
The kernel looks up the related BPF program from the passed file descriptor and atomically replaces the program pointer at the given map slot. When no map entry has been found at the provided key, the kernel will just “fall through” and continue execution of the old program with the instructions following after the bpf_tail_call(). Tail calls are a powerful utility, for example, parsing network headers could be structured through tail calls. During runtime, functionality can be added or replaced atomically, thus altering the BPF program’s execution behavior.
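The user space side of populating such a program array can be sketched as follows (a minimal illustration driving the raw bpf(2) system call directly; jmp_map_fd and prog_fd are assumed to be file descriptors of an already created program array and an already loaded BPF program, and error handling is omitted):

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Install prog_fd at the given index of a BPF_MAP_TYPE_PROG_ARRAY map so
 * that a bpf_tail_call() on that index jumps into the program. Deleting the
 * entry again makes the tail call fall through at runtime.
 */
int prog_array_set(int jmp_map_fd, uint32_t index, int prog_fd)
{
    union bpf_attr attr;
    uint32_t value = prog_fd;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = jmp_map_fd;
    attr.key    = (uint64_t)(unsigned long)&index;
    attr.value  = (uint64_t)(unsigned long)&value;
    attr.flags  = BPF_ANY;

    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}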
BPF to BPF Calls¶
Aside from BPF helper calls and BPF tail calls, a more recent feature that has been added to the BPF core infrastructure is BPF to BPF calls. Before this feature was introduced into the kernel, a typical BPF C program had to declare any reusable code, for example, code residing in headers, as always_inline so that when LLVM compiled and generated the BPF object file, all these functions were inlined and therefore duplicated many times in the resulting object file, artificially inflating its code size:
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

#ifndef __inline
# define __inline                         \
   inline __attribute__((always_inline))
#endif

static __inline int foo(void)
{
    return XDP_DROP;
}

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return foo();
}

char __license[] __section("license") = "GPL";
The main reason why this was necessary was due to lack of function call support
in the BPF program loader as well as verifier, interpreter and JITs. Starting
with Linux kernel 4.16 and LLVM 6.0 this restriction got lifted and BPF programs
no longer need to use always_inline
everywhere. Thus, the prior shown BPF
example code can then be rewritten more naturally as:
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

static int foo(void)
{
    return XDP_DROP;
}

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return foo();
}

char __license[] __section("license") = "GPL";
Mainstream BPF JIT compilers like x86_64 and arm64 support BPF to BPF calls today, with others following in the near future. BPF to BPF calls are an important performance optimization since they heavily reduce the generated BPF code size and are therefore friendlier to a CPU’s instruction cache.
The calling convention known from BPF helper functions applies to BPF to BPF calls just as well, meaning r1 up to r5 are for passing arguments to the callee and the result is returned in r0. r1 to r5 are scratch registers whereas r6 to r9 are preserved across calls the usual way. The maximum number of nested calls, and therefore of allowed call frames, is 8.
A caller can pass pointers (e.g. to the caller’s stack frame) down to the
callee, but never vice versa.
BPF to BPF calls are currently incompatible with the use of BPF tail calls, since the latter requires reusing the current stack setup as-is, whereas the former adds additional stack frames and thus changes the expected layout for tail calls.
BPF JIT compilers emit separate images for each function body and later fix up the function call addresses in the image in a final JIT pass. This has proven to require minimal changes to the JITs in that they can treat BPF to BPF calls as conventional BPF helper calls.
JIT¶
The 64 bit x86_64, arm64, ppc64, s390x, mips64, sparc64 and 32 bit arm architectures are all shipped with an in-kernel eBPF JIT compiler. All of them are feature equivalent and the JIT can be enabled through:
# echo 1 > /proc/sys/net/core/bpf_jit_enable
The 32 bit mips, ppc and sparc architectures currently have a cBPF JIT compiler. These architectures, as well as all remaining architectures supported by the Linux kernel which have no BPF JIT compiler at all, need to run eBPF programs through the in-kernel interpreter.
In the kernel’s source tree, eBPF JIT support can easily be determined by issuing a grep for HAVE_EBPF_JIT:
# git grep HAVE_EBPF_JIT arch/
arch/arm/Kconfig: select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
arch/arm64/Kconfig: select HAVE_EBPF_JIT
arch/powerpc/Kconfig: select HAVE_EBPF_JIT if PPC64
arch/mips/Kconfig: select HAVE_EBPF_JIT if (64BIT && !CPU_MICROMIPS)
arch/s390/Kconfig: select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
arch/sparc/Kconfig: select HAVE_EBPF_JIT if SPARC64
arch/x86/Kconfig: select HAVE_EBPF_JIT if X86_64
JIT compilers speed up execution of the BPF program significantly since they
reduce the per instruction cost compared to the interpreter. Often instructions
can be mapped 1:1 with native instructions of the underlying architecture. This
also reduces the resulting executable image size and is therefore more
instruction cache friendly to the CPU. In particular in case of CISC instruction
sets such as x86
, the JITs are optimized for emitting the shortest possible
opcodes for a given instruction to shrink the total necessary size for the
program translation.
Hardening¶
BPF locks the entire BPF interpreter image (struct bpf_prog) as well as the JIT compiled image (struct bpf_binary_header) in the kernel as read-only during the program’s lifetime in order to protect the code against potential corruption. Any corruption happening at that point, for example, due to some kernel bug, will result in a general protection fault and thus crash the kernel instead of allowing the corruption to happen silently.
Architectures that support setting the image memory as read-only can be determined through:
$ git grep ARCH_HAS_SET_MEMORY | grep select
arch/arm/Kconfig: select ARCH_HAS_SET_MEMORY
arch/arm64/Kconfig: select ARCH_HAS_SET_MEMORY
arch/s390/Kconfig: select ARCH_HAS_SET_MEMORY
arch/x86/Kconfig: select ARCH_HAS_SET_MEMORY
The option CONFIG_ARCH_HAS_SET_MEMORY is not configurable, which means that this protection is always built-in on these architectures. Other architectures might follow in the future.
In the case of the x86_64 JIT compiler, the JITing of the indirect jump used for tail calls is realized through a retpoline when CONFIG_RETPOLINE has been set, which is the default in most modern Linux distributions at the time of writing.
When /proc/sys/net/core/bpf_jit_harden is set to 1, additional hardening steps for the JIT compilation take effect for unprivileged users. This effectively trades off a small amount of performance to decrease the (potential) attack surface in case of untrusted users operating on the system. The decrease in program execution speed still results in better performance compared to switching to the interpreter entirely.
Currently, enabling hardening will blind all user provided 32 bit and 64 bit constants from the BPF program when it gets JIT compiled in order to prevent JIT spraying attacks which inject native opcodes as immediate values. This is problematic as these immediate values reside in executable kernel memory, therefore a jump that could be triggered from some kernel bug would jump to the start of the immediate value and then execute these as native instructions.
JIT constant blinding prevents this by randomizing the actual instruction: the operation is transformed from an immediate based source operand into a register based one by rewriting the instruction and splitting the actual load of the value into two steps: 1) loading a blinded immediate value rnd ^ imm into a register, 2) xoring that register with rnd, such that the original imm immediate then resides in the register and can be used for the actual operation. The example given here is for a load operation, but really all generic operations are blinded.
Example of JITing a program with hardening disabled:
# echo 0 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f5e9 + <x>:
[...]
39: mov $0xa8909090,%eax
3e: mov $0xa8909090,%eax
43: mov $0xa8ff3148,%eax
48: mov $0xa89081b4,%eax
4d: mov $0xa8900bb0,%eax
52: mov $0xa810e0c1,%eax
57: mov $0xa8908eb4,%eax
5c: mov $0xa89020b0,%eax
[...]
The same program gets constant blinded when loaded through BPF as an unprivileged user with hardening enabled:
# echo 1 > /proc/sys/net/core/bpf_jit_harden
ffffffffa034f1e5 + <x>:
[...]
39: mov $0xe1192563,%r10d
3f: xor $0x4989b5f3,%r10d
46: mov %r10d,%eax
49: mov $0xb8296d93,%r10d
4f: xor $0x10b9fd03,%r10d
56: mov %r10d,%eax
59: mov $0x8c381146,%r10d
5f: xor $0x24c7200e,%r10d
66: mov %r10d,%eax
69: mov $0xeb2a830e,%r10d
6f: xor $0x43ba02ba,%r10d
76: mov %r10d,%eax
79: mov $0xd9730af,%r10d
7f: xor $0xa5073b1f,%r10d
86: mov %r10d,%eax
89: mov $0x9a45662b,%r10d
8f: xor $0x325586ea,%r10d
96: mov %r10d,%eax
[...]
Both programs are semantically the same, only that none of the original immediate values are visible anymore in the disassembly of the second program.
At the same time, hardening also disables any JIT kallsyms exposure for privileged users, so that JIT image addresses are no longer exposed to /proc/kallsyms.
Moreover, the Linux kernel provides the option CONFIG_BPF_JIT_ALWAYS_ON
which removes the entire BPF interpreter from the kernel and permanently
enables the JIT compiler. This has been developed as part of a mitigation
in the context of Spectre v2 such that when used in a VM-based setting,
the guest kernel is not going to reuse the host kernel’s BPF interpreter
when mounting an attack anymore. For container-based environments, the
CONFIG_BPF_JIT_ALWAYS_ON
configuration option is optional, but in
case JITs are enabled there anyway, the interpreter may as well be compiled
out to reduce the kernel’s complexity. Thus, it is also generally
recommended for widely used JITs on mainstream architectures such as x86_64 and arm64.
Last but not least, the kernel offers an option to disable the use of the bpf(2) system call for unprivileged users through the /proc/sys/kernel/unprivileged_bpf_disabled sysctl knob. This is intentionally a one-time kill switch, meaning once set to 1, there is no option to reset it back to 0 until the next reboot. When set, only CAP_SYS_ADMIN privileged processes out of the initial namespace are allowed to use the bpf(2) system call from that point onwards. Upon start, Cilium sets this knob to 1 as well.
# echo 1 > /proc/sys/kernel/unprivileged_bpf_disabled
Offloads¶
Networking BPF programs, in particular for tc and XDP, have an offload interface to hardware in the kernel in order to execute BPF code directly on the NIC.
Currently, the nfp
driver from Netronome has support for offloading
BPF through a JIT compiler which translates BPF instructions to an
instruction set implemented against the NIC. This includes offloading
of BPF maps to the NIC as well, thus the offloaded BPF program can
perform map lookups, updates and deletions.
Toolchain¶
Current user space tooling, introspection facilities and kernel control knobs around BPF are discussed in this section. Note, the tooling and infrastructure around BPF is still rapidly evolving and thus may not provide a complete picture of all available tools.
Development Environment¶
A step by step guide for setting up a development environment for BPF can be found below for both Fedora and Ubuntu. This will guide you through building, installing and testing a development kernel as well as building and installing iproute2.
The step of manually building iproute2 and Linux kernel is usually not necessary given that major distributions already ship recent enough kernels by default, but would be needed for testing bleeding edge versions or contributing BPF patches to iproute2 and to the Linux kernel, respectively. Similarly, for debugging and introspection purposes building bpftool is optional, but recommended.
Fedora¶
The following applies to Fedora 25 or later:
$ sudo dnf install -y git gcc ncurses-devel elfutils-libelf-devel bc \
openssl-devel libcap-devel clang llvm graphviz bison flex glibc-static
Note
If you are running some other Fedora derivative and dnf
is missing,
try using yum
instead.
Ubuntu¶
The following applies to Ubuntu 17.04 or later:
$ sudo apt-get install -y make gcc libssl-dev bc libelf-dev libcap-dev \
  clang gcc-multilib llvm libncurses5-dev git pkg-config libmnl-dev bison flex \
  graphviz
Compiling the Kernel¶
Development of new BPF features for the Linux kernel happens inside the net-next git tree, while the latest BPF fixes land in the net tree. The following command will obtain the kernel source for the net-next tree through git:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
If the git commit history is not of interest, then --depth 1
will clone the
tree much faster by truncating the git history only to the most recent commit.
In case the net
tree is of interest, it can be cloned from this url:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
There are dozens of tutorials on the Internet on how to build Linux kernels; one good resource is the Kernel Newbies website (https://kernelnewbies.org/KernelBuild) that can be followed with one of the two git trees mentioned above.
Make sure that the generated .config
file contains the following CONFIG_*
entries for running BPF. These entries are also needed for Cilium.
CONFIG_CGROUP_BPF=y
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_CLS_ACT=y
CONFIG_BPF_JIT=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_TEST_BPF=m
Some of the entries cannot be adjusted through make menuconfig. For example, CONFIG_HAVE_EBPF_JIT is selected automatically if a given architecture comes with an eBPF JIT. In this specific case, CONFIG_HAVE_EBPF_JIT is optional but highly recommended. An architecture not having an eBPF JIT compiler will need to fall back to the in-kernel interpreter with the cost of being less efficient at executing BPF instructions.
Verifying the Setup¶
After you have booted into the newly compiled kernel, navigate to the BPF selftest suite in order to test BPF functionality (current working directory points to the root of the cloned git tree):
$ cd tools/testing/selftests/bpf/
$ make
$ sudo ./test_verifier
The verifier tests print out all the current checks being performed. The summary at the end of running all tests will dump information of test successes and failures:
Summary: 847 PASSED, 0 SKIPPED, 0 FAILED
Note
For kernel releases 4.16+ the BPF selftest has a dependency on LLVM 6.0+ caused by the BPF function calls which do not need to be inlined anymore. See section BPF to BPF Calls or the cover letter mail from the kernel patch (https://lwn.net/Articles/741773/) for more information. Not every BPF program has a dependency on LLVM 6.0+ if it does not use this new feature. If your distribution does not provide LLVM 6.0+ you may compile it by following the instructions in the LLVM section.
In order to run through all BPF selftests, the following command is needed:
$ sudo make run_tests
If you see any failures, please contact us on Slack with the full test output.
Compiling iproute2¶
Similar to the net (fixes only) and net-next (new features) kernel trees, the iproute2 git tree has two branches, namely master and net-next. The master branch is based on the net tree and the net-next branch is based on the net-next kernel tree. This is necessary, so that changes in header files can be synchronized in the iproute2 tree.
In order to clone the iproute2 master
branch, the following command can
be used:
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/iproute2/iproute2.git
Similarly, to clone into mentioned net-next
branch of iproute2, run the
following:
$ git clone -b net-next git://git.kernel.org/pub/scm/linux/kernel/git/iproute2/iproute2.git
After that, proceed with the build and installation:
$ cd iproute2/
$ ./configure --prefix=/usr
TC schedulers
ATM no
libc has setns: yes
SELinux support: yes
ELF support: yes
libmnl support: no
Berkeley DB: no
docs: latex: no
WARNING: no docs can be built from LaTeX files
sgml2html: no
WARNING: no HTML docs can be built from SGML
$ make
[...]
$ sudo make install
Ensure that the configure
script shows ELF support: yes
, so that iproute2
can process ELF files from LLVM’s BPF back end. libelf was listed in the instructions
for installing the dependencies in case of Fedora and Ubuntu earlier.
Compiling bpftool¶
bpftool is an essential tool around debugging and introspection of BPF programs
and maps. It is part of the kernel tree and available under tools/bpf/bpftool/
.
Make sure to have cloned either the net
or net-next
kernel tree as described
earlier. In order to build and install bpftool, the following steps are required:
$ cd <kernel-tree>/tools/bpf/bpftool/
$ make
Auto-detecting system features:
... libbfd: [ on ]
... disassembler-four-args: [ OFF ]
CC xlated_dumper.o
CC prog.o
CC common.o
CC cgroup.o
CC main.o
CC json_writer.o
CC cfg.o
CC map.o
CC jit_disasm.o
CC disasm.o
make[1]: Entering directory '/home/foo/trees/net/tools/lib/bpf'
Auto-detecting system features:
... libelf: [ on ]
... bpf: [ on ]
CC libbpf.o
CC bpf.o
CC nlattr.o
LD libbpf-in.o
LINK libbpf.a
make[1]: Leaving directory '/home/foo/trees/bpf/tools/lib/bpf'
LINK bpftool
$ sudo make install
LLVM¶
LLVM is currently the only compiler suite providing a BPF back end. gcc does not support BPF at this point.
The BPF back end was merged into LLVM’s 3.7 release. Major distributions enable the BPF back end by default when they package LLVM, therefore installing clang and llvm is sufficient on most recent distributions to start compiling C into BPF object files.
The typical workflow is that BPF programs are written in C, compiled by LLVM into object / ELF files, which are parsed by user space BPF ELF loaders (such as iproute2 or others), and pushed into the kernel through the BPF system call. The kernel verifies the BPF instructions and JITs them, returning a new file descriptor for the program, which then can be attached to a subsystem (e.g. networking). If supported, the subsystem could then further offload the BPF program to hardware (e.g. NIC).
For LLVM, BPF target support can be checked, for example, through the following:
$ llc --version
LLVM (http://llvm.org/):
LLVM version 3.8.1
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake
Registered Targets:
[...]
bpf - BPF (host endian)
bpfeb - BPF (big endian)
bpfel - BPF (little endian)
[...]
By default, the bpf target uses the endianness of the CPU it compiles on, meaning that if the CPU’s endianness is little endian, the program is represented in little endian format as well, and if the CPU’s endianness is big endian, the program is represented in big endian. This also matches the runtime behavior of BPF, which is generic and uses the endianness of the CPU it runs on in order not to disadvantage architectures in either of the formats.
For cross-compilation, the two targets bpfeb and bpfel were introduced, so that BPF programs can be compiled on a node running in one endianness (e.g. little endian on x86) and run on a node with another endianness (e.g. big endian on arm). Note that the front end (clang) needs to run in the target endianness as well.
Using bpf
as a target is the preferred way in situations where no mixture of
endianness applies. For example, compilation on x86_64
results in the same
output for the targets bpf
and bpfel
due to being little endian, therefore
scripts triggering a compilation also do not have to be endian aware.
A minimal, stand-alone XDP drop program might look like the following example
(xdp-example.c
):
#include <linux/bpf.h>
#ifndef __section
# define __section(NAME) \
__attribute__((section(NAME), used))
#endif
__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
return XDP_DROP;
}
char __license[] __section("license") = "GPL";
It can then be compiled and loaded into the kernel as follows:
$ clang -O2 -Wall -target bpf -c xdp-example.c -o xdp-example.o
# ip link set dev em1 xdp obj xdp-example.o
Note
Attaching an XDP BPF program to a network device as above requires Linux 4.11 with a device that supports XDP, or Linux 4.12 or later.
For the generated object file LLVM (>= 3.9) uses the official BPF machine value,
that is, EM_BPF
(decimal: 247
/ hex: 0xf7
). In this example, the program
has been compiled with bpf
target under x86_64
, therefore LSB
(as opposed
to MSB
) is shown regarding endianness:
$ file xdp-example.o
xdp-example.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped
readelf -a xdp-example.o
will dump further information about the ELF file, which can
sometimes be useful for introspecting generated section headers, relocation entries
and the symbol table.
In the unlikely case where clang and LLVM need to be compiled from scratch, the following commands can be used:
$ git clone http://llvm.org/git/llvm.git
$ cd llvm/tools
$ git clone --depth 1 http://llvm.org/git/clang.git
$ cd ..; mkdir build; cd build
$ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_RUNTIME=OFF
$ make -j $(getconf _NPROCESSORS_ONLN)
$ ./bin/llc --version
LLVM (http://llvm.org/):
LLVM version x.y.zsvn
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake
Registered Targets:
bpf - BPF (host endian)
bpfeb - BPF (big endian)
bpfel - BPF (little endian)
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
$ export PATH=$PWD/bin:$PATH # add to ~/.bashrc
Make sure that --version mentions Optimized build., otherwise the compilation time for programs will increase significantly (e.g. by 10x or more) when LLVM was built in debugging mode.
For debugging, clang can generate the assembler output as follows:
$ clang -O2 -S -Wall -target bpf -c xdp-example.c -o xdp-example.S
$ cat xdp-example.S
.text
.section prog,"ax",@progbits
.globl xdp_drop
.p2align 3
xdp_drop: # @xdp_drop
# BB#0:
r0 = 1
exit
.section license,"aw",@progbits
.globl __license # @__license
__license:
.asciz "GPL"
Furthermore, more recent LLVM versions (>= 4.0) can also store debugging
information in dwarf format into the object file. This can be done through
the usual workflow by adding -g
for compilation.
$ clang -O2 -g -Wall -target bpf -c xdp-example.c -o xdp-example.o
$ llvm-objdump -S -no-show-raw-insn xdp-example.o
xdp-example.o: file format ELF64-BPF
Disassembly of section prog:
xdp_drop:
; {
0: r0 = 1
; return XDP_DROP;
1: exit
The llvm-objdump
tool can then annotate the assembler output with the
original C code used in the compilation. The trivial example in this case
does not contain much C code, however, the line numbers shown as 0:
and 1:
correspond directly to the kernel’s verifier log.
This means that in case BPF programs get rejected by the verifier, llvm-objdump
can help to correlate the instructions back to the original C code, which is
highly useful for analysis.
# ip link set dev em1 xdp obj xdp-example.o verb
Prog section 'prog' loaded (5)!
- Type: 6
- Instructions: 2 (0 over limit)
- License: GPL
Verifier analysis:
0: (b7) r0 = 1
1: (95) exit
processed 2 insns
As can be seen in the verifier analysis, the llvm-objdump output dumps the same BPF assembler code as the kernel.
Leaving out the -no-show-raw-insn
option will also dump the raw
struct bpf_insn
as hex in front of the assembly:
$ llvm-objdump -S xdp-example.o
xdp-example.o: file format ELF64-BPF
Disassembly of section prog:
xdp_drop:
; {
0: b7 00 00 00 01 00 00 00 r0 = 1
; return foo();
1: 95 00 00 00 00 00 00 00 exit
For LLVM IR debugging, the compilation process for BPF can be split into
two steps, generating a binary LLVM IR intermediate file xdp-example.bc
, which
can later on be passed to llc:
$ clang -O2 -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
$ llc xdp-example.bc -march=bpf -filetype=obj -o xdp-example.o
The generated LLVM IR can also be dumped in human readable format through:
$ clang -O2 -Wall -emit-llvm -S -c xdp-example.c -o -
LLVM is able to attach debug information such as the description of used data types in the program to the generated BPF object file. By default this is in DWARF format.
A heavily simplified version of DWARF used by BPF is called BTF (BPF Type Format). The resulting DWARF can be converted into BTF and is later loaded into the kernel through BPF object loaders. The kernel will then verify the BTF data for correctness and keep track of the data types the BTF data contains.
BPF maps can then be annotated with key and value types out of the BTF data such that a later dump of the map exports the map data along with the related type information. This allows for better introspection, debugging and value pretty printing. Note that BTF data is a generic debugging data format and as such any DWARF to BTF converted data can be loaded (e.g. the kernel’s vmlinux DWARF data could be converted to BTF and loaded). The latter is particularly useful for BPF tracing in the future.
In order to generate BTF from DWARF debugging information, elfutils (>= 0.173)
is needed. If that is not available, then adding the -mattr=dwarfris
option
to the llc
command is required during compilation:
$ llc -march=bpf -mattr=help |& grep dwarfris
dwarfris - Disable MCAsmInfo DwarfUsesRelocationsAcrossSections.
[...]
The reason for using -mattr=dwarfris is that the flag dwarfris (dwarf relocation in section) disables DWARF cross-section relocations between DWARF and the ELF’s symbol table since libdw does not have proper BPF relocation support, and therefore tools like pahole would otherwise not be able to properly dump structures from the object.
elfutils (>= 0.173) implements proper BPF relocation support and therefore
the same can be achieved without the -mattr=dwarfris
option. Dumping
the structures from the object file could be done from either DWARF or BTF
information. pahole
uses the LLVM emitted DWARF information at this
point, however, future pahole
versions could rely on BTF if available.
For converting DWARF into BTF, a recent pahole version (>= 1.12) is required. A recent pahole version can also be obtained from its official git repository if not available from one of the distribution packages:
$ git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git
pahole
comes with the option -J
to convert DWARF into BTF from an
object file. pahole
can be probed for BTF support as follows (note that
the llvm-objcopy
tool is required for pahole
as well, so check its
presence, too):
$ pahole --help | grep BTF
-J, --btf_encode Encode as BTF
Generating debugging information also requires the front end to generate
source level debug information by passing -g
to the clang
command
line. Note that -g
is needed independently of whether llc
’s
dwarfris
option is used. Full example for generating the object file:
$ clang -O2 -g -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
$ llc xdp-example.bc -march=bpf -mattr=dwarfris -filetype=obj -o xdp-example.o
Alternatively, clang only can be used to build a BPF program with debugging information (again, the dwarfris flag can be omitted when a proper elfutils version is available):
$ clang -target bpf -O2 -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o
After successful compilation pahole
can be used to properly dump structures
of the BPF program based on the DWARF information:
$ pahole xdp-example.o
struct xdp_md {
__u32 data; /* 0 4 */
__u32 data_end; /* 4 4 */
__u32 data_meta; /* 8 4 */
/* size: 12, cachelines: 1, members: 3 */
/* last cacheline: 12 bytes */
};
Through the option -J
pahole
can eventually generate the BTF from
DWARF. In the object file DWARF data will still be retained alongside the
newly added BTF data. Full clang
and pahole
example combined:
$ clang -target bpf -O2 -Wall -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o
$ pahole -J xdp-example.o
The presence of a .BTF section can be seen through the readelf tool:
$ readelf -a xdp-example.o
[...]
[18] .BTF PROGBITS 0000000000000000 00000671
[...]
BPF loaders such as iproute2 will detect and load the BTF section, so that BPF maps can be annotated with type information.
LLVM by default uses the BPF base instruction set for generating code in order to make sure that the generated object file can also be loaded with older kernels such as long-term stable kernels (e.g. 4.9+).
However, LLVM has a -mcpu
selector for the BPF back end in order to
select different versions of the BPF instruction set, namely instruction
set extensions on top of the BPF base instruction set in order to generate
more efficient and smaller code.
Available -mcpu
options can be queried through:
$ llc -march bpf -mcpu=help
Available CPUs for this target:
generic - Select the generic processor.
probe - Select the probe processor.
v1 - Select the v1 processor.
v2 - Select the v2 processor.
[...]
The generic
processor is the default processor, which is also the
base instruction set v1
of BPF. Options v1
and v2
are typically
useful in an environment where the BPF program is being cross compiled
and the target host where the program is loaded differs from the one
where it is compiled (and thus available BPF kernel features might differ
as well).
The recommended -mcpu
option which is also used by Cilium internally is
-mcpu=probe
! Here, the LLVM BPF back end queries the kernel for availability
of BPF instruction set extensions and when found available, LLVM will use
them for compiling the BPF program whenever appropriate.
A full command line example with llc’s -mcpu=probe
:
$ clang -O2 -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
$ llc xdp-example.bc -march=bpf -mcpu=probe -filetype=obj -o xdp-example.o
Generally, LLVM IR generation is architecture independent. There are
however a few differences when using clang -target bpf
versus
leaving -target bpf
out and thus using clang’s default target which,
depending on the underlying architecture, might be x86_64
, arm64
or others.
Quoting from the kernel’s Documentation/bpf/bpf_devel_QA.txt
:
- BPF programs may recursively include header file(s) with file scope inline assembly codes. The default target can handle this well, while bpf target may fail if bpf backend assembler does not understand these assembly codes, which is true in most cases.
- When compiled without -g, additional elf sections, e.g., .eh_frame and .rela.eh_frame, may be present in the object file with default target, but not with bpf target.
- The default target may turn a C switch statement into a switch table lookup and jump operation. Since the switch table is placed in the global read-only section, the bpf program will fail to load. The bpf target does not support switch table optimization. The clang option -fno-jump-tables can be used to disable switch table generation.
- For clang -target bpf, it is guaranteed that pointer or long / unsigned long types will always have a width of 64 bit, no matter whether underlying clang binary or default target (or kernel) is 32 bit. However, when native clang target is used, then it will compile these types based on the underlying architecture’s conventions, meaning in case of 32 bit architecture, pointer or long / unsigned long types e.g. in BPF context structure will have width of 32 bit while the BPF LLVM back end still operates in 64 bit.
The native target is mostly needed in tracing for the case of walking
the kernel’s struct pt_regs
that maps CPU registers, or other kernel
structures where CPU’s register width matters. In all other cases such
as networking, the use of clang -target bpf
is the preferred choice.
Note that LLVM’s BPF back end currently does not support generating code that makes use of BPF’s 32 bit subregisters. Inline assembly for BPF is currently unsupported, too.
Furthermore, compilation from BPF assembly (e.g. llvm-mc xdp-example.S -arch bpf -filetype=obj -o xdp-example.o) is currently not supported either, due to a missing BPF assembly parser.
When writing C programs for BPF, there are a couple of pitfalls to be aware of, compared to usual application development with C. The following items describe some of the differences for the BPF model:
- Everything needs to be inlined, there are no function calls (on older LLVM versions) or shared library calls available.

Shared libraries, etc cannot be used with BPF. However, common library code used in BPF programs can be placed into header files and included in the main programs. For example, Cilium makes heavy use of it (see bpf/lib/). However, this still allows for including header files, for example, from the kernel or other libraries and reusing their static inline functions or macros / definitions.

Unless a recent kernel (4.16+) and LLVM (6.0+) is used where BPF to BPF function calls are supported, LLVM needs to compile and inline the entire code into a flat sequence of BPF instructions for a given program section. In such a case, best practice is to use an annotation like __inline for every library function as shown below. The use of always_inline is recommended, since the compiler could still decide to uninline large functions that are only annotated as inline.

In case the latter happens, LLVM will generate a relocation entry into the ELF file, which BPF ELF loaders such as iproute2 cannot resolve and will thus produce an error since only BPF maps are valid relocation entries which loaders can process.
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

#ifndef __inline
# define __inline                         \
   inline __attribute__((always_inline))
#endif

static __inline int foo(void)
{
    return XDP_DROP;
}

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return foo();
}

char __license[] __section("license") = "GPL";
- Multiple programs can reside inside a single C file in different sections.
C programs for BPF make heavy use of section annotations. A C file is typically structured into 3 or more sections. BPF ELF loaders use these names to extract and prepare the relevant information in order to load the programs and maps through the bpf system call. For example, iproute2 uses maps and license as default section names to find the metadata needed for map creation and the license for the BPF program, respectively. On program creation time the latter is pushed into the kernel as well, and enables some of the helper functions which are exposed as GPL only in case the program also holds a GPL compatible license, for example bpf_ktime_get_ns(), bpf_probe_read() and others.

The remaining section names are specific for BPF program code, for example, the below code has been modified to contain two program sections, ingress and egress. The toy example code demonstrates that both can share a map and common static inline helpers such as the account_data() function.

The xdp-example.c example has been modified to a tc-example.c example that can be loaded with tc and attached to a netdevice’s ingress and egress hook. It accounts the transferred bytes into a map called acc_map, which has two map slots, one for traffic accounted on the ingress hook, one on the egress hook.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <stdint.h>
#include <iproute2/bpf_elf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

#ifndef __inline
# define __inline                         \
   inline __attribute__((always_inline))
#endif

#ifndef lock_xadd
# define lock_xadd(ptr, val)              \
   ((void)__sync_fetch_and_add(ptr, val))
#endif

#ifndef BPF_FUNC
# define BPF_FUNC(NAME, ...)              \
   (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
#endif

static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);

struct bpf_elf_map acc_map __section("maps") = {
    .type           = BPF_MAP_TYPE_ARRAY,
    .size_key       = sizeof(uint32_t),
    .size_value     = sizeof(uint32_t),
    .pinning        = PIN_GLOBAL_NS,
    .max_elem       = 2,
};

static __inline int account_data(struct __sk_buff *skb, uint32_t dir)
{
    uint32_t *bytes;

    bytes = map_lookup_elem(&acc_map, &dir);
    if (bytes)
        lock_xadd(bytes, skb->len);

    return TC_ACT_OK;
}

__section("ingress")
int tc_ingress(struct __sk_buff *skb)
{
    return account_data(skb, 0);
}

__section("egress")
int tc_egress(struct __sk_buff *skb)
{
    return account_data(skb, 1);
}

char __license[] __section("license") = "GPL";
The example also demonstrates a couple of other things which are useful to be aware of when developing programs. The code includes kernel headers, standard C headers and an iproute2 specific header containing the definition of struct bpf_elf_map. iproute2 has a common BPF ELF loader and as such the definition of struct bpf_elf_map is the very same for XDP and tc typed programs.

A struct bpf_elf_map entry defines a map in the program and contains all relevant information (such as key / value size, etc) needed to generate a map which is used from the two BPF programs. The structure must be placed into the maps section, so that the loader can find it. There can be multiple map declarations of this type with different variable names, but all must be annotated with __section("maps").

The struct bpf_elf_map is specific to iproute2. Different BPF ELF loaders can have different formats, for example, the libbpf in the kernel source tree, which is mainly used by perf, has a different specification. iproute2 guarantees backwards compatibility for struct bpf_elf_map. Cilium follows the iproute2 model.

The example also demonstrates how BPF helper functions are mapped into the C code and being used. Here, map_lookup_elem() is defined by mapping this function into the BPF_FUNC_map_lookup_elem enum value which is exposed as a helper in uapi/linux/bpf.h. When the program is later loaded into the kernel, the verifier checks whether the passed arguments are of the expected type and re-points the helper call into a real function call. Moreover, map_lookup_elem() also demonstrates how maps can be passed to BPF helper functions. Here, &acc_map from the maps section is passed as the first argument to map_lookup_elem().

Since the defined array map is global, the accounting needs to use an atomic operation, which is defined as lock_xadd(). LLVM maps __sync_fetch_and_add() as a built-in function to the BPF atomic add instruction, that is, BPF_STX | BPF_XADD | BPF_W for word sizes.

Last but not least, the struct bpf_elf_map tells that the map is to be pinned as PIN_GLOBAL_NS. This means that tc will pin the map into the BPF pseudo file system as a node. By default, it will be pinned to /sys/fs/bpf/tc/globals/acc_map for the given example. Due to the PIN_GLOBAL_NS, the map will be placed under /sys/fs/bpf/tc/globals/. globals acts as a global namespace that spans across object files. If the example used PIN_OBJECT_NS, then tc would create a directory that is local to the object file. For example, different C files with BPF code could have the same acc_map definition as above with a PIN_GLOBAL_NS pinning. In that case, the map will be shared among BPF programs originating from various object files. PIN_NONE would mean that the map is not placed into the BPF file system as a node, and as a result will not be accessible from user space after tc quits. It would also mean that tc creates two separate map instances for each program, since it cannot retrieve a previously pinned map under that name. The acc_map part from the mentioned path is the name of the map as specified in the source code.

Thus, upon loading of the ingress program, tc will find that no such map exists in the BPF file system and creates a new one. On success, the map will also be pinned, so that when the egress program is loaded through tc, it will find that such map already exists in the BPF file system and will reuse that for the egress program. The loader also makes sure in case maps exist with the same name that also their properties (key / value size, etc) match.

Just like tc can retrieve the same map, also third party applications can use the BPF_OBJ_GET command from the bpf system call in order to create a new file descriptor pointing to the same map instance, which can then be used to lookup / update / delete map elements.

The code can be compiled and loaded via iproute2 as follows:
$ clang -O2 -Wall -target bpf -c tc-example.c -o tc-example.o

# tc qdisc add dev em1 clsact
# tc filter add dev em1 ingress bpf da obj tc-example.o sec ingress
# tc filter add dev em1 egress bpf da obj tc-example.o sec egress

# tc filter show dev em1 ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[ingress] direct-action id 1 tag c5f7825e5dac396f

# tc filter show dev em1 egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[egress] direct-action id 2 tag b2fd5adc0f262714

# mount | grep bpf
sysfs on /sys/fs/bpf type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
bpf on /sys/fs/bpf type bpf (rw,relatime,mode=0700)

# tree /sys/fs/bpf/
/sys/fs/bpf/
+-- ip -> /sys/fs/bpf/tc/
+-- tc
|   +-- globals
|       +-- acc_map
+-- xdp -> /sys/fs/bpf/tc/

4 directories, 1 file

As soon as packets pass the em1 device, counters from the BPF map will be increased.
- There are no global variables allowed.
For the reasons already mentioned in point 1, BPF cannot have global variables as often used in normal C programs.
However, there is a work-around in that the program can simply use a BPF map of type BPF_MAP_TYPE_PERCPU_ARRAY with just a single slot of arbitrary value size. This works, because during execution, BPF programs are guaranteed to never get preempted by the kernel and therefore can use the single map entry as a scratch buffer for temporary data, for example, to extend beyond the stack limitation. This also functions across tail calls, since it has the same guarantees with regards to preemption.

Otherwise, for holding state across multiple BPF program runs, normal BPF maps can be used.
- There are no const strings or arrays allowed.
Defining const strings or other arrays in the BPF C program does not work for the same reasons as pointed out in sections 1 and 3, which is, that relocation entries will be generated in the ELF file which will be rejected by loaders due to not being part of the ABI towards loaders (loaders also cannot fix up such entries as it would require large rewrites of the already compiled BPF sequence).

In the future, LLVM might detect these occurrences and early throw an error to the user.

Helper functions such as trace_printk() can be worked around as follows:

static void BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...);

#ifndef printk
# define printk(fmt, ...)                                      \
    ({                                                         \
        char ____fmt[] = fmt;                                  \
        trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \
    })
#endif

The program can then use the macro naturally like printk("skb len:%u\n", skb->len);. The output will then be written to the trace pipe. tc exec bpf dbg can be used to retrieve the messages from there.

The use of the trace_printk() helper function has a couple of disadvantages and thus is not recommended for production usage. Constant strings like the "skb len:%u\n" need to be loaded into the BPF stack each time the helper function is called, but also BPF helper functions are limited to a maximum of 5 arguments. This leaves room for only 3 additional variables which can be passed for dumping.

Therefore, despite being helpful for quick debugging, it is recommended (for networking programs) to use the skb_event_output() or the xdp_event_output() helper, respectively. They allow for passing custom structs from the BPF program to the perf event ring buffer along with an optional packet sample. For example, Cilium's monitor makes use of these helpers in order to implement a debugging framework, notifications for network policy violations, etc. These helpers pass the data through a lockless memory mapped per-CPU perf ring buffer, and are thus significantly faster than trace_printk().
- Use of LLVM built-in functions for memset()/memcpy()/memmove()/memcmp().
Since BPF programs cannot perform any function calls other than those to BPF helpers, common library code needs to be implemented as inline functions. In addition, LLVM also provides some built-ins that the programs can use for constant sizes (here: n) which will then always get inlined:

#ifndef memset
# define memset(dest, chr, n)   __builtin_memset((dest), (chr), (n))
#endif

#ifndef memcpy
# define memcpy(dest, src, n)   __builtin_memcpy((dest), (src), (n))
#endif

#ifndef memmove
# define memmove(dest, src, n)  __builtin_memmove((dest), (src), (n))
#endif

The memcmp() built-in had some corner cases where inlining did not take place due to an LLVM issue in the back end, and is therefore not recommended to be used until the issue is fixed.
- There are no loops available (yet).
The BPF verifier in the kernel checks that a BPF program does not contain loops by performing a depth first search of all possible program paths besides other control flow graph validations. The purpose is to make sure that the program is always guaranteed to terminate.

A very limited form of looping is available for constant upper loop bounds by using the #pragma unroll directive. Example code that is compiled to BPF:

#pragma unroll
    for (i = 0; i < IPV6_MAX_HEADERS; i++) {
        switch (nh) {
        case NEXTHDR_NONE:
            return DROP_INVALID_EXTHDR;
        case NEXTHDR_FRAGMENT:
            return DROP_FRAG_NOSUPPORT;
        case NEXTHDR_HOP:
        case NEXTHDR_ROUTING:
        case NEXTHDR_AUTH:
        case NEXTHDR_DEST:
            if (skb_load_bytes(skb, l3_off + len, &opthdr, sizeof(opthdr)) < 0)
                return DROP_INVALID;

            nh = opthdr.nexthdr;
            if (nh == NEXTHDR_AUTH)
                len += ipv6_authlen(&opthdr);
            else
                len += ipv6_optlen(&opthdr);
            break;
        default:
            *nexthdr = nh;
            return len;
        }
    }

Another possibility is to use tail calls by calling into the same program again and using a BPF_MAP_TYPE_PERCPU_ARRAY map for having a local scratch space. While being dynamic, this form of looping however is limited to a maximum of 32 iterations.

In the future, BPF may have some native, but limited form of implementing loops.
- Partitioning programs with tail calls.
Tail calls provide the flexibility to atomically alter program behavior during runtime by jumping from one BPF program into another. In order to select the next program, tail calls make use of program array maps (
BPF_MAP_TYPE_PROG_ARRAY
), and pass the map as well as the index of the next program to jump to. There is no return to the old program after the jump has been performed, and in case there was no program present at the given map index, execution continues in the original program. For example, this can be used to implement various stages of a parser, where such stages could be updated with new parsing features during runtime.
Another use case is event notifications, for example, Cilium can opt in to packet drop notifications during runtime, where the
skb_event_output()
call is located inside the tail called program. Thus, during normal operations, the fall-through path will always be executed unless a program is added to the related map index, where the program then prepares the metadata and triggers the event notification to a user space daemon. Program array maps are quite flexible, also enabling individual actions to be implemented for programs located in each map index. For example, the root program attached to XDP or tc could perform an initial tail call to index 0 of the program array map, performing traffic sampling, then jumping to index 1 of the program array map, where firewalling policy is applied and the packet is either dropped or further processed in index 2 of the program array map, where it is mangled and sent out of an interface again. Jumps in the program array map can, of course, be arbitrary. The kernel will eventually execute the fall-through path when the maximum tail call limit has been reached.
Minimal example extract of using tail calls:
[...]

#ifndef __stringify
# define __stringify(X)   #X
#endif

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

#ifndef __section_tail
# define __section_tail(ID, KEY)          \
   __section(__stringify(ID) "/" __stringify(KEY))
#endif

#ifndef BPF_FUNC
# define BPF_FUNC(NAME, ...)              \
   (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
#endif

#define JMP_MAP_ID   1

static void BPF_FUNC(tail_call, struct __sk_buff *skb, void *map,
                     uint32_t index);

struct bpf_elf_map jmp_map __section("maps") = {
    .type           = BPF_MAP_TYPE_PROG_ARRAY,
    .id             = JMP_MAP_ID,
    .size_key       = sizeof(uint32_t),
    .size_value     = sizeof(uint32_t),
    .pinning        = PIN_GLOBAL_NS,
    .max_elem       = 1,
};

__section_tail(JMP_MAP_ID, 0)
int looper(struct __sk_buff *skb)
{
    printk("skb cb: %u\n", skb->cb[0]++);
    tail_call(skb, &jmp_map, 0);
    return TC_ACT_OK;
}

__section("prog")
int entry(struct __sk_buff *skb)
{
    skb->cb[0] = 0;
    tail_call(skb, &jmp_map, 0);
    return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";

When loading this toy program, tc will create the program array and pin it to the BPF file system in the global namespace under
jmp_map
. The BPF ELF loader in iproute2 will also recognize sections that are marked as __section_tail(). The provided id in struct bpf_elf_map will be matched against the id marker in the __section_tail(), that is, JMP_MAP_ID, and the program is therefore loaded at the user specified program array map index, which is 0 in this example. As a result, all provided tail call sections will be populated by the iproute2 loader into the corresponding maps. This mechanism is not specific to tc, but can be applied with any other BPF program type that iproute2 supports (such as XDP, lwt). The generated ELF contains section headers describing the map id and the entry within that map:
$ llvm-objdump -S --no-show-raw-insn prog_array.o | less
prog_array.o:   file format ELF64-BPF

Disassembly of section 1/0:
looper:
       0:       r6 = r1
       1:       r2 = *(u32 *)(r6 + 48)
       2:       r1 = r2
       3:       r1 += 1
       4:       *(u32 *)(r6 + 48) = r1
       5:       r1 = 0 ll
       7:       call -1
       8:       r1 = r6
       9:       r2 = 0 ll
      11:       r3 = 0
      12:       call 12
      13:       r0 = 0
      14:       exit

Disassembly of section prog:
entry:
       0:       r2 = 0
       1:       *(u32 *)(r1 + 48) = r2
       2:       r2 = 0 ll
       4:       r3 = 0
       5:       call 12
       6:       r0 = 0
       7:       exit

In this case, the
section 1/0
indicates that the looper() function resides in the map id 1 at position 0. The pinned map can be retrieved by a user space application (e.g. the Cilium daemon), but also by tc itself in order to update the map with new programs. Updates happen atomically; the initial entry programs that are triggered first from the various subsystems are also updated atomically.
Example for tc to perform tail call map updates:
# tc exec bpf graft m:globals/jmp_map key 0 obj new.o sec foo
To have iproute2 update the pinned program array, the
graft
command can be used. By pointing it to globals/jmp_map, tc will update the map at index / key 0
with a new program residing in the object filenew.o
under sectionfoo
.
- Limited stack space of maximum 512 bytes.
Stack space in BPF programs is limited to only 512 bytes, which needs to be taken into careful consideration when implementing BPF programs in C. However, as mentioned earlier in point 3, a BPF_MAP_TYPE_PERCPU_ARRAY map with a single entry can be used in order to enlarge scratch buffer space (see the sketch after this list).
- Use of BPF inline assembly possible.
LLVM also allows the use of inline assembly for BPF for the rare cases where it might be needed. The following (nonsense) toy example shows a 64 bit atomic add. Due to the lack of documentation, the LLVM source code in
lib/Target/BPF/BPFInstrInfo.td
as well astest/CodeGen/BPF/
might be helpful for providing some additional examples. Test code:#include <linux/bpf.h> #ifndef __section # define __section(NAME) \ __attribute__((section(NAME), used)) #endif __section("prog") int xdp_test(struct xdp_md *ctx) { __u64 a = 2, b = 3, *c = &a; /* just a toy xadd example to show the syntax */ asm volatile("lock *(u64 *)(%0+0) += %1" : "=r"(c) : "r"(b), "0"(c)); return a; } char __license[] __section("license") = "GPL";The above program is compiled into the following sequence of BPF instructions:
Verifier analysis: 0: (b7) r1 = 2 1: (7b) *(u64 *)(r10 -8) = r1 2: (b7) r1 = 3 3: (bf) r2 = r10 4: (07) r2 += -8 5: (db) lock *(u64 *)(r2 +0) += r1 6: (79) r0 = *(u64 *)(r10 -8) 7: (95) exit processed 8 insns (limit 131072), stack depth 8
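To illustrate the scratch space workaround from the stack space point above, the following is a minimal, untested sketch. It assumes iproute2's struct bpf_elf_map definition and declares the map lookup helper through the BPF_FUNC macro used throughout this section; the map and struct names are purely illustrative:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <iproute2/bpf_elf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif
#ifndef BPF_FUNC
# define BPF_FUNC(NAME, ...)              \
   (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
#endif

static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);

/* Hypothetical 2KB scratch area, one buffer per CPU, single map slot. */
struct scratch {
    __u8 buf[2048];
};

struct bpf_elf_map scratch_map __section("maps") = {
    .type       = BPF_MAP_TYPE_PERCPU_ARRAY,
    .size_key   = sizeof(__u32),
    .size_value = sizeof(struct scratch),
    .max_elem   = 1,
};

__section("prog")
int use_scratch(struct __sk_buff *skb)
{
    __u32 key = 0;
    struct scratch *s;

    s = map_lookup_elem(&scratch_map, &key);
    if (!s)
        return TC_ACT_OK;
    /* s->buf can now be used as working memory which does not count
     * against the 512 byte BPF stack limit. */
    s->buf[0] = 42;
    return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";

Since the map holds a single entry and is per-CPU, programs running on different CPUs do not interfere with each other's scratch contents.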
iproute2¶
There are various front ends for loading BPF programs into the kernel such as bcc,
perf, iproute2 and others. The Linux kernel source tree also provides a user space
library under tools/lib/bpf/
, which is mainly used and driven by perf for
loading BPF tracing programs into the kernel. However, the library itself is
generic and not limited to perf only. bcc is a toolkit providing many useful
BPF programs mainly for tracing that are loaded ad-hoc through a Python interface
embedding the BPF C code. Syntax and semantics for implementing BPF programs
slightly differ among front ends in general, though. Additionally, there are also
BPF samples in the kernel source tree (samples/bpf/
) which parse the generated
object files and load the code directly through the system call interface.
This and previous sections mainly focus on the iproute2 suite's BPF front end for loading networking programs of XDP, tc or lwt type, since Cilium's programs are implemented against this BPF loader. In the future, Cilium will be equipped with a native BPF loader, but programs will remain compatible with the iproute2 suite in order to facilitate development and debugging.
All BPF program types supported by iproute2 share the same BPF loader logic
due to having a common loader back end implemented as a library (lib/bpf.c
in iproute2 source tree).
The previous section on LLVM also covered some iproute2 parts related to writing BPF C programs, and later sections in this document are related to tc and XDP specific aspects when writing programs. Therefore, this section will rather focus on usage examples for loading object files with iproute2 as well as some of the generic mechanics of the loader. It does not try to provide a complete coverage of all details, but enough for getting started.
1. Loading of XDP BPF object files.
Given a BPF object file
prog.o
has been compiled for XDP, it can be loaded through ip to an XDP-supported netdevice called em1
with the following command:# ip link set dev em1 xdp obj prog.o
The above command assumes that the program code resides in the default section which is called
prog
in XDP case. Should this not be the case, and the section is named differently, for example,foobar
, then the program needs to be loaded as:# ip link set dev em1 xdp obj prog.o sec foobar
Note that it is also possible to load the program out of the
.text
section. Changing the minimal, stand-alone XDP drop program by removing the__section()
annotation from thexdp_drop
entry point would look like the following:#include <linux/bpf.h> #ifndef __section # define __section(NAME) \ __attribute__((section(NAME), used)) #endif int xdp_drop(struct xdp_md *ctx) { return XDP_DROP; } char __license[] __section("license") = "GPL";And can be loaded as follows:
# ip link set dev em1 xdp obj prog.o sec .text
By default,
ip
will throw an error in case a XDP program is already attached to the networking interface, to prevent it from being overridden by accident. In order to replace the currently running XDP program with a new one, the-force
option must be used:# ip -force link set dev em1 xdp obj prog.o
Most XDP-enabled drivers today support an atomic replacement of the existing program with a new one without traffic interruption. There is always only a single program attached to an XDP-enabled driver due to performance reasons, hence a chain of programs is not supported. However, as described in the previous section, partitioning of programs can be performed through tail calls to achieve a similar use case when necessary.
The
ip link
command will display anxdp
flag if the interface has an XDP program attached.ip link | grep xdp
can thus be used to find all interfaces that have XDP running. Further introspection facilities are provided through the detailed view withip -d link
andbpftool
can be used to retrieve information about the attached program based on the BPF program ID shown in theip link
dump.In order to remove the existing XDP program from the interface, the following command must be issued:
# ip link set dev em1 xdp off
In the case of switching a driver's operation mode from non-XDP to native XDP and vice versa, typically the driver needs to reconfigure its receive (and transmit) rings in order to ensure received packets are set up linearly within a single page for BPF to read and write into. However, once completed, most drivers only need to perform an atomic replacement of the program itself when a BPF program is requested to be swapped.
In total, XDP supports three operation modes which iproute2 implements as well:
xdpdrv
,xdpoffload
andxdpgeneric
.
xdpdrv
stands for native XDP, meaning the BPF program is run directly in the driver's receive path at the earliest possible point in software. This is the normal / conventional XDP mode and requires drivers to implement XDP support, which all major 10G/40G/+ networking drivers in the upstream Linux kernel already provide.
xdpgeneric
stands for generic XDP and is intended as an experimental test bed for drivers which do not yet support native XDP. Given the generic XDP hook in the ingress path comes at a much later point in time, when the packet has already entered the stack's main receive path as an skb
, the performance is significantly less than with processing inxdpdrv
mode.xdpgeneric
therefore is for the most part only interesting for experimenting, less for production environments.Last but not least, the
xdpoffload
mode is implemented by SmartNICs such as those supported by Netronome's nfp driver and allows for offloading the entire BPF/XDP program into hardware; the program is thus run directly on the card for each packet reception. This provides even higher performance than running in native XDP, although not all BPF map types or BPF helper functions are available for use compared to native XDP. The BPF verifier will reject the program in such a case and report to the user what is unsupported.
ip link set dev em1 xdp obj [...]
is used, then the kernel will attempt to load the program first as native XDP, and in case the driver does not support native XDP, it will automatically fall back to generic XDP. By explicitly using xdpdrv instead of xdp, the kernel will only attempt to load the program as native XDP and fail in case the driver does not support it, which provides a guarantee that generic XDP is avoided altogether. An example for enforcing a BPF/XDP program to be loaded in native XDP mode, dumping the link details and unloading the program again:
# ip -force link set dev em1 xdpdrv obj prog.o # ip link show [...] 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DORMANT group default qlen 1000 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff prog/xdp id 1 tag 57cd311f2e27366b [...] # ip link set dev em1 xdpdrv offSame example now for forcing generic XDP, even if the driver would support native XDP, and additionally dumping the BPF instructions of the attached dummy program through bpftool:
# ip -force link set dev em1 xdpgeneric obj prog.o # ip link show [...] 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP mode DORMANT group default qlen 1000 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff prog/xdp id 4 tag 57cd311f2e27366b <-- BPF program ID 4 [...] # bpftool prog dump xlated id 4 <-- Dump of instructions running on em1 0: (b7) r0 = 1 1: (95) exit # ip link set dev em1 xdpgeneric offAnd last but not least offloaded XDP, where we additionally dump program information via bpftool for retrieving general metadata:
# ip -force link set dev em1 xdpoffload obj prog.o # ip link show [...] 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff prog/xdp id 8 tag 57cd311f2e27366b [...] # bpftool prog show id 8 8: xdp tag 57cd311f2e27366b dev em1 <-- Also indicates a BPF program offloaded to em1 loaded_at Apr 11/20:38 uid 0 xlated 16B not jited memlock 4096B # ip link set dev em1 xdpoffload offNote that it is not possible to use
xdpdrv
andxdpgeneric
or other modes at the same time, meaning only one of the XDP operation modes can be picked. A switch between different XDP modes, e.g. from generic to native or vice versa, is not possible atomically. Only switching programs within a specific operation mode is:
# ip -force link set dev em1 xdpgeneric obj prog.o
# ip -force link set dev em1 xdpoffload obj prog.o
RTNETLINK answers: File exists
# ip -force link set dev em1 xdpdrv obj prog.o
RTNETLINK answers: File exists
# ip -force link set dev em1 xdpgeneric obj prog.o <-- Succeeds due to xdpgeneric
#

Switching between modes requires first leaving the current operation mode in order to then enter the new one:
# ip -force link set dev em1 xdpgeneric obj prog.o # ip -force link set dev em1 xdpgeneric off # ip -force link set dev em1 xdpoffload obj prog.o # ip l [...] 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff prog/xdp id 17 tag 57cd311f2e27366b [...] # ip -force link set dev em1 xdpoffload off
2. Loading of tc BPF object files.
Given a BPF object file
prog.o
has been compiled for tc, it can be loaded through the tc command to a netdevice. Unlike XDP, there is no driver dependency for supporting attaching BPF programs to the device. Here, the netdevice is called em1, and with the following command the program can be attached to the networking ingress path of em1:

# tc qdisc add dev em1 clsact
# tc filter add dev em1 ingress bpf da obj prog.o

The first step is to set up a
clsact
qdisc (Linux queueing discipline).clsact
is a dummy qdisc similar to theingress
qdisc, which can only hold classifiers and actions, but does not perform actual queueing. It is needed in order to attach the
classifier. Theclsact
qdisc provides two special hooks calledingress
andegress
, where the classifier can be attached to. Bothingress
andegress
hooks are located in central receive and transmit locations in the networking data path, where every packet on the device passes through. Theingress
hook is called from__netif_receive_skb_core() -> sch_handle_ingress()
in the kernel and theegress
hook from__dev_queue_xmit() -> sch_handle_egress()
.The equivalent for attaching the program to the
egress
hook looks as follows:# tc filter add dev em1 egress bpf da obj prog.o
The
clsact
qdisc is processed lockless fromingress
andegress
direction and can also be attached to virtual, queue-less devices such asveth
devices connecting containers.Next to the hook, the
tc filter
command selectsbpf
to be used inda
(direct-action) mode.da
mode is recommended and should always be specified. It basically means that thebpf
classifier does not need to call into external tc action modules, which are not necessary forbpf
anyway, since all packet mangling, forwarding or other kinds of actions can already be performed inside the single BPF program which is to be attached, and it is therefore significantly faster. At this point, the program has been attached and is executed once packets traverse the device. Like in XDP, should the default section name not be used, then it can be specified during load, for example, in case of section
foobar
:# tc filter add dev em1 egress bpf da obj prog.o sec foobar
iproute2’s BPF loader allows for using the same command line syntax across program types, hence the
obj prog.o sec foobar
is the same syntax as with XDP mentioned earlier.The attached programs can be listed through the following commands:
# tc filter show dev em1 ingress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 prog.o:[ingress] direct-action id 1 tag c5f7825e5dac396f # tc filter show dev em1 egress filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 prog.o:[egress] direct-action id 2 tag b2fd5adc0f262714The output of
prog.o:[ingress]
tells that program sectioningress
was loaded from the fileprog.o
, andbpf
operates indirect-action
mode. The program id and tag are appended for each case, where the latter denotes a hash over the instruction stream which can be correlated with the object file or perf
reports with stack traces, etc. Last but not least, theid
represents the system-wide unique BPF program identifier that can be used along withbpftool
to further inspect or dump the attached BPF program. tc can attach more than just a single BPF program; it provides various other classifiers which can be chained together. However, attaching a single BPF program is fully sufficient, since all packet operations can be contained in the program itself thanks to
da
(direct-action
) mode, meaning the BPF program itself will already return the tc action verdict such asTC_ACT_OK
,TC_ACT_SHOT
and others. For optimal performance and flexibility, this is the recommended usage.In the above
show
command, tc also displayspref 49152
andhandle 0x1
next to the BPF related output. Both are auto-generated in case they are not explicitly provided through the command line.pref
denotes a priority number, which means that in case multiple classifiers are attached, they will be executed based on ascending priority, andhandle
represents an identifier in case multiple instances of the same classifier have been loaded under the samepref
. Since in case of BPF, a single program is fully sufficient,pref
andhandle
can typically be ignored.Only in the case where it is planned to atomically replace the attached BPF programs, it would be recommended to explicitly specify
pref
andhandle
a priori on initial load, so that they do not have to be queried at a later point in time for thereplace
operation. Thus, creation becomes:# tc filter add dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar # tc filter show dev em1 ingress filter protocol all pref 1 bpf filter protocol all pref 1 bpf handle 0x1 prog.o:[foobar] direct-action id 1 tag c5f7825e5dac396fAnd for the atomic replacement, the following can be issued for updating the existing program at
ingress
hook with the new BPF program from the fileprog.o
in sectionfoobar
:# tc filter replace dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar
Last but not least, in order to remove all attached programs from the
ingress
and egress hooks respectively, the following can be used:

# tc filter del dev em1 ingress
# tc filter del dev em1 egress

For removing the entire
clsact
qdisc from the netdevice, which implicitly also removes all attached programs from theingress
andegress
hooks, the below command is provided:# tc qdisc del dev em1 clsact
tc BPF programs can also be offloaded if the NIC and driver have support for it, similarly to XDP BPF programs. NICs supported by Netronome's nfp driver offer both types of BPF offload.
# tc qdisc add dev em1 clsact # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o Error: TC offload is disabled on net device. We have an error talking to the kernelIf the above error is shown, then tc hardware offload first needs to be enabled for the device through ethtool’s
hw-tc-offload
setting:# ethtool -K em1 hw-tc-offload on # tc qdisc add dev em1 clsact # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o # tc filter show dev em1 ingress filter protocol all pref 1 bpf filter protocol all pref 1 bpf handle 0x1 prog.o:[classifier] direct-action skip_sw in_hw id 19 tag 57cd311f2e27366bThe
in_hw
flag confirms that the program has been offloaded to the NIC.Note that BPF offloads for both tc and XDP cannot be loaded at the same time, either the tc or XDP offload option must be selected.
3. Testing BPF offload interface via netdevsim driver.
The netdevsim driver which is part of the Linux kernel provides a dummy driver which implements offload interfaces for XDP BPF and tc BPF programs and facilitates testing kernel changes or low-level user space programs implementing a control plane directly against the kernel’s UAPI.
A netdevsim device can be created as follows:
# ip link add dev sim0 type netdevsim # ip link set dev sim0 up # ethtool -K sim0 hw-tc-offload on # ip l [...] 7: sim0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/ether a2:24:4c:1c:c2:b3 brd ff:ff:ff:ff:ff:ffAfter that step, XDP BPF or tc BPF programs can be test loaded as shown in the various examples earlier:
# ip -force link set dev sim0 xdpoffload obj prog.o # ip l [...] 7: sim0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 xdpoffload qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/ether a2:24:4c:1c:c2:b3 brd ff:ff:ff:ff:ff:ff prog/xdp id 20 tag 57cd311f2e27366b
These two workflows are the basic operations to load XDP BPF and tc BPF programs, respectively, with iproute2.
There are various other advanced options for the BPF loader that apply to both XDP and tc; some of them are listed here. In the examples, only XDP is presented for simplicity.
1. Verbose log output even on success.
The option
verb
can be appended for loading programs in order to dump the verifier log, even if no error occurred:# ip link set dev em1 xdp obj xdp-example.o verb Prog section 'prog' loaded (5)! - Type: 6 - Instructions: 2 (0 over limit) - License: GPL Verifier analysis: 0: (b7) r0 = 1 1: (95) exit processed 2 insns
2. Load program that is already pinned in BPF file system.
Instead of loading a program from an object file, iproute2 can also retrieve the program from the BPF file system in case some external entity pinned it there and attach it to the device:
# ip link set dev em1 xdp pinned /sys/fs/bpf/prog
iproute2 can also use the short form that is relative to the detected mount point of the BPF file system:
# ip link set dev em1 xdp pinned m:prog
When loading BPF programs, iproute2 will automatically detect the mounted
file system instance in order to perform pinning of nodes. In case no mounted
BPF file system instance was found, then tc will automatically mount it
to the default location under /sys/fs/bpf/
.
In case an instance has already been found, then it will be used and no additional mount will be performed:
# mkdir /var/run/bpf # mount --bind /var/run/bpf /var/run/bpf # mount -t bpf bpf /var/run/bpf # tc filter add dev em1 ingress bpf da obj tc-example.o sec prog # tree /var/run/bpf /var/run/bpf +-- ip -> /run/bpf/tc/ +-- tc | +-- globals | +-- jmp_map +-- xdp -> /run/bpf/tc/ 4 directories, 1 file
By default tc will create an initial directory structure as shown above,
where all subsystem users will point to the same location through symbolic
links for the globals
namespace, so that pinned BPF maps can be reused
among various BPF program types in iproute2. In case the file system instance
has already been mounted and an existing structure already exists, then tc will
not override it. This could be the case for separating lwt
, tc
and
xdp
maps in order to not share globals
among all.
As briefly covered in the previous LLVM section, iproute2 will install a header file upon installation which can be included through the standard include path by BPF programs:
#include <iproute2/bpf_elf.h>
The purpose of this header file is to provide an API for maps and default section names used by programs. It’s a stable contract between iproute2 and BPF programs.
The map definition for iproute2 is struct bpf_elf_map
. Its members have
been covered earlier in the LLVM section of this document.
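As a quick reference, a minimal sketch of how such a map definition could look in a program loaded through iproute2 follows; the map name, key / value sizes and element count are purely illustrative:

#include <linux/bpf.h>
#include <iproute2/bpf_elf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

/* Illustrative hash map pinned into the global namespace so that it could
 * be shared among tc, XDP and lwt programs loaded through iproute2. */
struct bpf_elf_map acl_map __section("maps") = {
    .type       = BPF_MAP_TYPE_HASH,
    .size_key   = sizeof(__u32),
    .size_value = sizeof(__u64),
    .pinning    = PIN_GLOBAL_NS,
    .max_elem   = 1024,
};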
When parsing the BPF object file, the iproute2 loader will walk through
all ELF sections. It initially fetches ancillary sections like maps
and
license
. For maps
, the struct bpf_elf_map
array will be checked
for validity and whenever needed, compatibility workarounds are performed.
Subsequently all maps are created with the user provided information, either
retrieved as a pinned object, or newly created and then pinned into the BPF
file system. Next the loader will handle all program sections that contain
ELF relocation entries for maps, meaning that BPF instructions loading
map file descriptors into registers are rewritten so that the corresponding
map file descriptors are encoded into the instruction's immediate value, in
order for the kernel to be able to convert them later on into map kernel
pointers. After that all the programs themselves are created through the BPF
system call, and tail call maps, if present, are updated with the programs' file
descriptors.
bpftool¶
bpftool is the main introspection and debugging tool around BPF and is developed
and shipped along with the Linux kernel tree under tools/bpf/bpftool/
.
The tool can dump all BPF programs and maps that are currently loaded in the system, or list and correlate all BPF maps used by a specific program. Furthermore, it allows dumping an entire map's key / value pairs, or looking up, updating and deleting individual ones, as well as retrieving a key's neighbor key in the map. Such operations can be performed based on BPF program or map IDs or by specifying the location of a BPF file system pinned program or map. The tool additionally offers an option to pin maps or programs into the BPF file system.
For a quick overview of all BPF programs currently loaded on the host invoke the following command:
# bpftool prog 398: sched_cls tag 56207908be8ad877 loaded_at Apr 09/16:24 uid 0 xlated 8800B jited 6184B memlock 12288B map_ids 18,5,17,14 399: sched_cls tag abc95fb4835a6ec9 loaded_at Apr 09/16:24 uid 0 xlated 344B jited 223B memlock 4096B map_ids 18 400: sched_cls tag afd2e542b30ff3ec loaded_at Apr 09/16:24 uid 0 xlated 1720B jited 1001B memlock 4096B map_ids 17 401: sched_cls tag 2dbbd74ee5d51cc8 loaded_at Apr 09/16:24 uid 0 xlated 3728B jited 2099B memlock 4096B map_ids 17 [...]
Similarly, to get an overview of all active maps:
# bpftool map 5: hash flags 0x0 key 20B value 112B max_entries 65535 memlock 13111296B 6: hash flags 0x0 key 20B value 20B max_entries 65536 memlock 7344128B 7: hash flags 0x0 key 10B value 16B max_entries 8192 memlock 790528B 8: hash flags 0x0 key 22B value 28B max_entries 8192 memlock 987136B 9: hash flags 0x0 key 20B value 8B max_entries 512000 memlock 49352704B [...]
Note that for each command, bpftool also supports json based output by
appending --json
at the end of the command line. An additional
--pretty
makes the output more human readable.
# bpftool prog --json --pretty
For dumping the post-verifier BPF instruction image of a specific BPF program, one starting point could be to inspect a specific program, e.g. attached to the tc ingress hook:
# tc filter show dev cilium_host egress filter protocol all pref 1 bpf chain 0 filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_host.o:[from-netdev] \ direct-action not_in_hw id 406 tag e0362f5bd9163a0a jited
The program from the object file bpf_host.o
, section from-netdev
has
a BPF program ID of 406
as denoted in id 406
. Based on this information
bpftool can provide some high-level metadata specific to the program:
# bpftool prog show id 406 406: sched_cls tag e0362f5bd9163a0a loaded_at Apr 09/16:24 uid 0 xlated 11144B jited 7721B memlock 12288B map_ids 18,20,8,5,6,14
The program of ID 406 is of type sched_cls
(BPF_PROG_TYPE_SCHED_CLS
),
has a tag
of e0362f5bd9163a0a
(sha sum over the instruction sequence),
it was loaded by root uid 0
on Apr 09/16:24
. The BPF instruction
sequence is 11,144 bytes
long and the JITed image 7,721 bytes
. The
program itself (excluding maps) consumes 12,288 bytes
that are accounted /
charged against user uid 0
. And the BPF program uses the BPF maps with
IDs 18
, 20
, 8
, 5
, 6
and 14
. The latter IDs can further
be used to get information about or dump the maps themselves.
Additionally, bpftool can issue a dump request of the BPF instructions the program runs:
# bpftool prog dump xlated id 406 0: (b7) r7 = 0 1: (63) *(u32 *)(r1 +60) = r7 2: (63) *(u32 *)(r1 +56) = r7 3: (63) *(u32 *)(r1 +52) = r7 [...] 47: (bf) r4 = r10 48: (07) r4 += -40 49: (79) r6 = *(u64 *)(r10 -104) 50: (bf) r1 = r6 51: (18) r2 = map[id:18] <-- BPF map id 18 53: (b7) r5 = 32 54: (85) call bpf_skb_event_output#5656112 <-- BPF helper call 55: (69) r1 = *(u16 *)(r6 +192) [...]
bpftool correlates BPF map IDs into the instruction stream as shown above as well as calls to BPF helpers or other BPF programs.
The instruction dump reuses the same ‘pretty-printer’ as the kernel’s BPF
verifier. Since the program was JITed and therefore the actual JIT image
that was generated out of above xlated
instructions is executed, it
can be dumped as well through bpftool:
# bpftool prog dump jited id 406 0: push %rbp 1: mov %rsp,%rbp 4: sub $0x228,%rsp b: sub $0x28,%rbp f: mov %rbx,0x0(%rbp) 13: mov %r13,0x8(%rbp) 17: mov %r14,0x10(%rbp) 1b: mov %r15,0x18(%rbp) 1f: xor %eax,%eax 21: mov %rax,0x20(%rbp) 25: mov 0x80(%rdi),%r9d [...]
Mainly for BPF JIT developers, the option also exists to interleave the disassembly with the actual native opcodes:
# bpftool prog dump jited id 406 opcodes 0: push %rbp 55 1: mov %rsp,%rbp 48 89 e5 4: sub $0x228,%rsp 48 81 ec 28 02 00 00 b: sub $0x28,%rbp 48 83 ed 28 f: mov %rbx,0x0(%rbp) 48 89 5d 00 13: mov %r13,0x8(%rbp) 4c 89 6d 08 17: mov %r14,0x10(%rbp) 4c 89 75 10 1b: mov %r15,0x18(%rbp) 4c 89 7d 18 [...]
The same interleaving can be done for the normal BPF instructions which can sometimes be useful for debugging in the kernel:
# bpftool prog dump xlated id 406 opcodes 0: (b7) r7 = 0 b7 07 00 00 00 00 00 00 1: (63) *(u32 *)(r1 +60) = r7 63 71 3c 00 00 00 00 00 2: (63) *(u32 *)(r1 +56) = r7 63 71 38 00 00 00 00 00 3: (63) *(u32 *)(r1 +52) = r7 63 71 34 00 00 00 00 00 4: (63) *(u32 *)(r1 +48) = r7 63 71 30 00 00 00 00 00 5: (63) *(u32 *)(r1 +64) = r7 63 71 40 00 00 00 00 00 [...]
The basic blocks of a program can also be visualized with the help of
graphviz
. For this purpose bpftool has a visual
dump mode that
generates a dot file instead of the plain BPF xlated
instruction
dump that can later be converted to a png file:
# bpftool prog dump xlated id 406 visual &> output.dot $ dot -Tpng output.dot -o output.png
Another option would be to pass the dot file to dotty as a viewer, that
is dotty output.dot
, where the result for the bpf_host.o
program
looks as follows (small extract):

[Figure: graphviz rendering of the bpf_host.o program's basic blocks (small extract)]
Note that the xlated
instruction dump provides the post-verifier BPF
instruction image which means that it dumps the instructions as if they
were to be run through the BPF interpreter. In the kernel, the verifier
performs various rewrites of the original instructions provided by the
BPF loader.
One example of rewrites is the inlining of helper functions in order to improve runtime performance, here in the case of a map lookup for hash tables:
# bpftool prog dump xlated id 3 0: (b7) r1 = 2 1: (63) *(u32 *)(r10 -4) = r1 2: (bf) r2 = r10 3: (07) r2 += -4 4: (18) r1 = map[id:2] <-- BPF map id 2 6: (85) call __htab_map_lookup_elem#77408 <-+ BPF helper inlined rewrite 7: (15) if r0 == 0x0 goto pc+2 | 8: (07) r0 += 56 | 9: (79) r0 = *(u64 *)(r0 +0) <-+ 10: (15) if r0 == 0x0 goto pc+24 11: (bf) r2 = r10 12: (07) r2 += -4 [...]
bpftool correlates calls to helper functions or BPF to BPF calls through
kallsyms. Therefore, make sure that JITed BPF programs are exposed to
kallsyms (bpf_jit_kallsyms
) and that kallsyms addresses are not
obfuscated (calls are otherwise shown as call bpf_unspec#0
):
# echo 0 > /proc/sys/kernel/kptr_restrict # echo 1 > /proc/sys/net/core/bpf_jit_kallsyms
BPF to BPF calls are correlated as well, for both the interpreter and the JIT case. In the latter, the tag of the subprogram is shown as the
call target. In each case, the pc+2
is the pc-relative offset of
the call target, which denotes the subprogram.
# bpftool prog dump xlated id 1 0: (85) call pc+2#__bpf_prog_run_args32 1: (b7) r0 = 1 2: (95) exit 3: (b7) r0 = 2 4: (95) exit
JITed variant of the dump:
# bpftool prog dump xlated id 1 0: (85) call pc+2#bpf_prog_3b185187f1855c4c_F 1: (b7) r0 = 1 2: (95) exit 3: (b7) r0 = 2 4: (95) exit
In the case of tail calls, the kernel maps them into a single instruction internally, bpftool will still correlate them as a helper call for ease of debugging:
# bpftool prog dump xlated id 2 [...] 10: (b7) r2 = 8 11: (85) call bpf_trace_printk#-41312 12: (bf) r1 = r6 13: (18) r2 = map[id:1] 15: (b7) r3 = 0 16: (85) call bpf_tail_call#12 17: (b7) r1 = 42 18: (6b) *(u16 *)(r6 +46) = r1 19: (b7) r0 = 0 20: (95) exit # bpftool map show id 1 1: prog_array flags 0x0 key 4B value 4B max_entries 1 memlock 4096B
Dumping an entire map is possible through the map dump
subcommand
which iterates through all present map elements and dumps the key /
value pairs.
If no BTF (BPF Type Format) data is available for a given map, then the key / value pairs are dumped as hex:
# bpftool map dump id 5 key: f0 0d 00 00 00 00 00 00 0a 66 00 00 00 00 8a d6 02 00 00 00 value: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 key: 0a 66 1c ee 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 value: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [...] Found 6 elements
However, with BTF, the map also holds debugging information about
the key and value structures. For example, BTF in combination with
BPF maps and the BPF_ANNOTATE_KV_PAIR() macro from iproute2 will
result in the following dump (test_xdp_noinline.o
from kernel
selftests):
# cat tools/testing/selftests/bpf/test_xdp_noinline.c [...] struct ctl_value { union { __u64 value; __u32 ifindex; __u8 mac[6]; }; }; struct bpf_map_def __attribute__ ((section("maps"), used)) ctl_array = { .type = BPF_MAP_TYPE_ARRAY, .key_size = sizeof(__u32), .value_size = sizeof(struct ctl_value), .max_entries = 16, .map_flags = 0, }; BPF_ANNOTATE_KV_PAIR(ctl_array, __u32, struct ctl_value); [...]
The BPF_ANNOTATE_KV_PAIR() macro forces a map-specific ELF section containing an empty key and value. This enables the iproute2 BPF loader to correlate BTF data with that section and thus allows choosing the corresponding types out of the BTF for loading the map.
Compiling through LLVM and generating BTF through debugging information
by pahole
:
# clang [...] -O2 -target bpf -g -emit-llvm -c test_xdp_noinline.c -o - | llc -march=bpf -mcpu=probe -mattr=dwarfris -filetype=obj -o test_xdp_noinline.o # pahole -J test_xdp_noinline.o
Now loading into kernel and dumping the map via bpftool:
# ip -force link set dev lo xdp obj test_xdp_noinline.o sec xdp-test # ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 xdpgeneric/id:227 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever [...] # bpftool prog show id 227 227: xdp tag a85e060c275c5616 gpl loaded_at 2018-07-17T14:41:29+0000 uid 0 xlated 8152B not jited memlock 12288B map_ids 381,385,386,382,384,383 # bpftool map dump id 386 [{ "key": 0, "value": { "": { "value": 0, "ifindex": 0, "mac": [] } } },{ "key": 1, "value": { "": { "value": 0, "ifindex": 0, "mac": [] } } },{ [...]
Lookup, update, delete, and ‘get next key’ operations on the map for specific keys can be performed through bpftool as well.
BPF sysctls¶
The Linux kernel provides a few sysctls that are BPF related; they are covered in this section.
/proc/sys/net/core/bpf_jit_enable: Enables or disables the BPF JIT compiler.

  Value   Description
  0       Disable the JIT and use only the interpreter (kernel's default value)
  1       Enable the JIT compiler
  2       Enable the JIT and emit debugging traces to the kernel log

As described in subsequent sections, the bpf_jit_disasm tool can be used to process debugging traces when the JIT compiler is set to debugging mode (option 2).

/proc/sys/net/core/bpf_jit_harden: Enables or disables BPF JIT hardening. Note that enabling hardening trades off performance, but can mitigate JIT spraying by blinding out the BPF program's immediate values. For programs processed through the interpreter, blinding of immediate values is not needed / performed.

  Value   Description
  0       Disable JIT hardening (kernel's default value)
  1       Enable JIT hardening for unprivileged users only
  2       Enable JIT hardening for all users

/proc/sys/net/core/bpf_jit_kallsyms: Enables or disables the export of JITed programs as kernel symbols to /proc/kallsyms so that they can be used together with perf tooling and so that the kernel is aware of these addresses for stack unwinding, for example, when dumping stack traces. The symbol names contain the BPF program tag (bpf_prog_<tag>). If bpf_jit_harden is enabled, then this feature is disabled.

  Value   Description
  0       Disable JIT kallsyms export (kernel's default value)
  1       Enable JIT kallsyms export for privileged users only

/proc/sys/kernel/unprivileged_bpf_disabled: Enables or disables unprivileged use of the bpf(2) system call. The Linux kernel has unprivileged use of bpf(2) enabled by default, but once the switch is flipped, unprivileged use will be permanently disabled until the next reboot. This sysctl knob is a one-time switch, meaning once set, neither an application nor an admin can reset the value anymore. This knob does not affect any cBPF programs such as seccomp or traditional socket filters that do not use the bpf(2) system call for loading the program into the kernel.

  Value   Description
  0       Unprivileged use of the bpf syscall enabled (kernel's default value)
  1       Unprivileged use of the bpf syscall disabled
Kernel Testing¶
The Linux kernel ships a BPF selftest suite, which can be found in the kernel
source tree under tools/testing/selftests/bpf/
.
$ cd tools/testing/selftests/bpf/
$ make
# make run_tests
The test suite contains test cases against the BPF verifier, program tags, and various tests against the BPF map interface and map types. It contains various runtime tests from C code for checking the LLVM back end, and eBPF as well as cBPF asm code that is run in the kernel for testing the interpreter and JITs.
JIT Debugging¶
For JIT developers performing audits or writing extensions, each compile run can output the generated JIT image into the kernel log through:
# echo 2 > /proc/sys/net/core/bpf_jit_enable
Whenever a new BPF program is loaded, the JIT compiler will dump the output,
which can then be inspected with dmesg
, for example:
[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f from=tcpdump pid=20583
[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
flen
is the length of the BPF program (here, 6 BPF instructions), and proglen
tells the number of bytes generated by the JIT for the opcode image (here, 70 bytes
in size). pass
means that the image was generated in 3 compiler passes, for
example, x86_64
can have various optimization passes to further reduce the image
size when possible. image
contains the address of the generated JIT image, from
and pid
the user space application name and PID respectively, which triggered the
compilation process. The dump output for eBPF and cBPF JITs uses the same format.
In the kernel tree under tools/bpf/
, there is a tool called bpf_jit_disasm
. It
reads out the latest dump and prints the disassembly for further inspection:
# ./bpf_jit_disasm
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
0: push %rbp
1: mov %rsp,%rbp
4: sub $0x60,%rsp
8: mov %rbx,-0x8(%rbp)
c: mov 0x68(%rdi),%r9d
10: sub 0x6c(%rdi),%r9d
14: mov 0xd8(%rdi),%r8
1b: mov $0xc,%esi
20: callq 0xffffffffe0ff9442
25: cmp $0x800,%eax
2a: jne 0x0000000000000042
2c: mov $0x17,%esi
31: callq 0xffffffffe0ff945e
36: cmp $0x1,%eax
39: jne 0x0000000000000042
3b: mov $0xffff,%eax
40: jmp 0x0000000000000044
42: xor %eax,%eax
44: leaveq
45: retq
Alternatively, the tool can also dump related opcodes along with the disassembly.
# ./bpf_jit_disasm -o
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
0: push %rbp
55
1: mov %rsp,%rbp
48 89 e5
4: sub $0x60,%rsp
48 83 ec 60
8: mov %rbx,-0x8(%rbp)
48 89 5d f8
c: mov 0x68(%rdi),%r9d
44 8b 4f 68
10: sub 0x6c(%rdi),%r9d
44 2b 4f 6c
14: mov 0xd8(%rdi),%r8
4c 8b 87 d8 00 00 00
1b: mov $0xc,%esi
be 0c 00 00 00
20: callq 0xffffffffe0ff9442
e8 1d 94 ff e0
25: cmp $0x800,%eax
3d 00 08 00 00
2a: jne 0x0000000000000042
75 16
2c: mov $0x17,%esi
be 17 00 00 00
31: callq 0xffffffffe0ff945e
e8 28 94 ff e0
36: cmp $0x1,%eax
83 f8 01
39: jne 0x0000000000000042
75 07
3b: mov $0xffff,%eax
b8 ff ff 00 00
40: jmp 0x0000000000000044
eb 02
42: xor %eax,%eax
31 c0
44: leaveq
c9
45: retq
c3
More recently, bpftool
adapted the same feature of dumping the BPF JIT
image based on a given BPF program ID already loaded in the system (see
bpftool section).
For performance analysis of JITed BPF programs, perf
can be used as
usual. As a prerequisite, JITed programs need to be exported through kallsyms
infrastructure.
# echo 1 > /proc/sys/net/core/bpf_jit_enable
# echo 1 > /proc/sys/net/core/bpf_jit_kallsyms
Enabling or disabling bpf_jit_kallsyms
does not require a reload of the
related BPF programs. Next, a small workflow example is provided for profiling
BPF programs. A crafted tc BPF program is used for demonstration purposes,
where perf records a failed allocation inside bpf_clone_redirect()
helper.
Due to the use of direct write, bpf_try_make_head_writable()
failed, which
would then release the cloned skb
again and return with an error message.
perf
thus records all kfree_skb
events.
# tc qdisc add dev em1 clsact
# tc filter add dev em1 ingress bpf da obj prog.o sec main
# tc filter show dev em1 ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 prog.o:[main] direct-action id 1 tag 8227addf251b7543
# cat /proc/kallsyms
[...]
ffffffffc00349e0 t fjes_hw_init_command_registers [fjes]
ffffffffc003e2e0 d __tracepoint_fjes_hw_stop_debug_err [fjes]
ffffffffc0036190 t fjes_hw_epbuf_tx_pkt_send [fjes]
ffffffffc004b000 t bpf_prog_8227addf251b7543
# perf record -a -g -e skb:kfree_skb sleep 60
# perf script --kallsyms=/proc/kallsyms
[...]
ksoftirqd/0 6 [000] 1004.578402: skb:kfree_skb: skbaddr=0xffff9d4161f20a00 protocol=2048 location=0xffffffffc004b52c
7fffb8745961 bpf_clone_redirect (/lib/modules/4.10.0+/build/vmlinux)
7fffc004e52c bpf_prog_8227addf251b7543 (/lib/modules/4.10.0+/build/vmlinux)
7fffc05b6283 cls_bpf_classify (/lib/modules/4.10.0+/build/vmlinux)
7fffb875957a tc_classify (/lib/modules/4.10.0+/build/vmlinux)
7fffb8729840 __netif_receive_skb_core (/lib/modules/4.10.0+/build/vmlinux)
7fffb8729e38 __netif_receive_skb (/lib/modules/4.10.0+/build/vmlinux)
7fffb872ae05 process_backlog (/lib/modules/4.10.0+/build/vmlinux)
7fffb872a43e net_rx_action (/lib/modules/4.10.0+/build/vmlinux)
7fffb886176c __do_softirq (/lib/modules/4.10.0+/build/vmlinux)
7fffb80ac5b9 run_ksoftirqd (/lib/modules/4.10.0+/build/vmlinux)
7fffb80ca7fa smpboot_thread_fn (/lib/modules/4.10.0+/build/vmlinux)
7fffb80c6831 kthread (/lib/modules/4.10.0+/build/vmlinux)
7fffb885e09c ret_from_fork (/lib/modules/4.10.0+/build/vmlinux)
The stack trace recorded by perf
will then show the bpf_prog_8227addf251b7543()
symbol as part of the call trace, meaning that the BPF program with the
tag 8227addf251b7543
was related to the kfree_skb
event, and
such program was attached to netdevice em1
on the ingress hook as
shown by tc.
Introspection¶
The Linux kernel provides various tracepoints around BPF and XDP which can be used for additional introspection, for example, to trace interactions of user space programs with the bpf system call.
Tracepoints for BPF:
# perf list | grep bpf:
bpf:bpf_map_create [Tracepoint event]
bpf:bpf_map_delete_elem [Tracepoint event]
bpf:bpf_map_lookup_elem [Tracepoint event]
bpf:bpf_map_next_key [Tracepoint event]
bpf:bpf_map_update_elem [Tracepoint event]
bpf:bpf_obj_get_map [Tracepoint event]
bpf:bpf_obj_get_prog [Tracepoint event]
bpf:bpf_obj_pin_map [Tracepoint event]
bpf:bpf_obj_pin_prog [Tracepoint event]
bpf:bpf_prog_get_type [Tracepoint event]
bpf:bpf_prog_load [Tracepoint event]
bpf:bpf_prog_put_rcu [Tracepoint event]
Example usage with perf
(alternatively to sleep
example used here,
a specific application like tc
could be used here instead, of course):
# perf record -a -e bpf:* sleep 10
# perf script
sock_example 6197 [005] 283.980322: bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
sock_example 6197 [005] 283.980721: bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
sock_example 6197 [005] 283.988423: bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
sock_example 6197 [005] 283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
[...]
sock_example 6197 [005] 288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
swapper 0 [005] 289.338243: bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
For the BPF programs, their individual program tag is displayed.
For debugging, XDP also has a tracepoint that is triggered when exceptions are raised:
# perf list | grep xdp:
xdp:xdp_exception [Tracepoint event]
Exceptions are triggered in the following scenarios:
- The BPF program returned an invalid / unknown XDP action code.
- The BPF program returned with
XDP_ABORTED
indicating a non-graceful exit. - The BPF program returned with
XDP_TX
, but there was an error on transmit, for example, due to the port not being up, due to the transmit ring being full, due to allocation failures, etc.
Both tracepoint classes can also be inspected with a BPF program itself
attached to one or more tracepoints, collecting further information
in a map or punting such events to a user space collector through the
bpf_perf_event_output()
helper, for example.
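As an illustration, below is a hedged sketch of a small BPF C program that could be attached to the xdp:xdp_exception tracepoint to count exceptions per XDP action code in a per-CPU array. It is written in the style of the kernel's samples/bpf collection; the tracepoint argument layout is kernel version dependent (see /sys/kernel/debug/tracing/events/xdp/xdp_exception/format), the map definition struct follows the legacy samples/bpf layout, and such a program would be loaded through perf or a libbpf based loader rather than iproute2:

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif
#ifndef BPF_FUNC
# define BPF_FUNC(NAME, ...)              \
   (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
#endif

static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);

/* Arguments of the xdp:xdp_exception tracepoint; the leading 8 bytes
 * cover the common tracepoint header. */
struct xdp_exception_ctx {
    __u64 __pad;
    __s32 prog_id;
    __u32 act;
    __s32 ifindex;
};

/* Legacy samples/bpf style map definition, loader dependent. */
struct bpf_map_def {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
    unsigned int map_flags;
};

/* Per-CPU counters indexed by the XDP action code that raised the exception. */
struct bpf_map_def exception_cnt __section("maps") = {
    .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(__u64),
    .max_entries = XDP_REDIRECT + 1,
};

__section("tracepoint/xdp/xdp_exception")
int trace_xdp_exception(struct xdp_exception_ctx *ctx)
{
    __u32 key = ctx->act;
    __u64 *cnt;

    if (key > XDP_REDIRECT)
        return 0;
    cnt = map_lookup_elem(&exception_cnt, &key);
    if (cnt)
        (*cnt)++;
    return 0;
}

char __license[] __section("license") = "GPL";

A user space collector could then periodically read the per-CPU counters from the map, or the program could instead punt full events through bpf_perf_event_output() as described above.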
Miscellaneous¶
BPF programs and maps are memory accounted against RLIMIT_MEMLOCK
similarly to perf. The currently available size in units of system pages which may be
locked into memory can be inspected through ulimit -l
. The setrlimit system
call man page provides further details.
The default limit is usually insufficient to load more complex programs or
larger BPF maps, so that the BPF system call will return with errno
of EPERM
. In such situations a workaround with ulimit -l unlimited
or
with a sufficiently large limit could be performed. The RLIMIT_MEMLOCK
is
mainly enforcing limits for unprivileged users. Depending on the setup,
setting a higher limit for privileged users is often acceptable.
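Where an application manages its own BPF objects, the limit is typically raised programmatically before the first bpf(2) call; a minimal user space sketch, equivalent to ulimit -l unlimited:

#include <stdio.h>
#include <sys/resource.h>

/* Raise RLIMIT_MEMLOCK so that creating BPF maps and loading programs
 * does not fail with EPERM due to the default locked memory limit. */
int bump_memlock_rlimit(void)
{
    struct rlimit r = {
        .rlim_cur = RLIM_INFINITY,
        .rlim_max = RLIM_INFINITY,
    };

    if (setrlimit(RLIMIT_MEMLOCK, &r)) {
        perror("setrlimit(RLIMIT_MEMLOCK)");
        return -1;
    }
    return 0;
}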
Program Types¶
At the time of this writing, there are eighteen different BPF program types available. Two of the main types for networking, namely XDP BPF programs and tc BPF programs, are further explained in the subsections below. Extensive usage examples for the two program types with LLVM, iproute2 or other tools are spread throughout the toolchain section and not covered here. Instead, this section focuses on their architecture, concepts and use cases.
XDP¶
XDP stands for eXpress Data Path and provides a framework for BPF that enables high-performance programmable packet processing in the Linux kernel. It runs the BPF program at the earliest possible point in software, namely at the moment the network driver receives the packet.
At this point in the fast-path the driver just picked up the packet from its
receive rings, without having done any expensive operations such as allocating
an skb
for pushing the packet further up the networking stack, without
having pushed the packet into the GRO engine, etc. Thus, the XDP BPF program
is executed at the earliest point when it becomes available to the CPU for
processing.
XDP works in concert with the Linux kernel and its infrastructure, meaning the kernel is not bypassed as in various networking frameworks that operate in user space only. Keeping the packet in kernel space has several major advantages:
- XDP is able to reuse all the upstream developed kernel networking drivers, user space tooling, and even other available in-kernel infrastructure such as routing tables, sockets, etc. through BPF helper calls.
- Residing in kernel space, XDP has the same security model as the rest of the kernel for accessing hardware.
- There is no need for crossing kernel / user space boundaries, since the processed packet already resides in the kernel and can therefore be flexibly forwarded into other in-kernel entities like namespaces used by containers or the kernel's networking stack itself. This is particularly relevant in times of Meltdown and Spectre.
- Punting packets from XDP to the kernel’s robust, widely used and efficient TCP/IP stack is trivially possible, allows for full reuse and does not require maintaining a separate TCP/IP stack as with user space frameworks.
- The use of BPF allows for full programmability, keeping a stable ABI with the same ‘never-break-user-space’ guarantees as with the kernel’s system call ABI and compared to modules it also provides safety measures thanks to the BPF verifier that ensures the stability of the kernel’s operation.
- XDP trivially allows for atomically swapping programs during runtime without any network traffic interruption or even kernel / system reboot.
- XDP allows for flexible structuring of workloads integrated into the kernel. For example, it can operate in “busy polling” or “interrupt driven” mode. Explicitly dedicating CPUs to XDP is not required. There are no special hardware requirements and it does not rely on hugepages.
- XDP does not require any third party kernel modules or licensing. It is a long-term architectural solution, a core part of the Linux kernel, and developed by the kernel community.
- XDP is already enabled and shipped everywhere with major distributions running a kernel equivalent to 4.8 or higher and supports most major 10G or higher networking drivers.
As a framework for running BPF in the driver, XDP additionally ensures that
packets are laid out linearly and fit into a single DMA’ed page which is
readable and writable by the BPF program. XDP also ensures that additional
headroom of 256 bytes is available to the program for implementing custom
encapsulation headers with the help of the bpf_xdp_adjust_head()
BPF helper
or adding custom metadata in front of the packet through bpf_xdp_adjust_meta()
.
The framework contains XDP action codes, further described in the section below, which a BPF program can return in order to instruct the driver how to proceed with the packet, and it makes it possible to atomically replace BPF programs running at the XDP layer. XDP is tailored for high performance by design. BPF allows accessing the packet data through ‘direct packet access’, which means that the program holds data pointers directly in registers, loads the content into registers, and writes from there into the packet.
The packet representation in XDP that is passed to the BPF program as the BPF context looks as follows:
struct xdp_buff {
void *data;
void *data_end;
void *data_meta;
void *data_hard_start;
struct xdp_rxq_info *rxq;
};
data
points to the start of the packet data in the page, and as the
name suggests, data_end
points to the end of the packet data. Since XDP
allows for a headroom, data_hard_start
points to the maximum possible
headroom start in the page, meaning, when the packet should be encapsulated,
then data
is moved closer towards data_hard_start
via bpf_xdp_adjust_head()
.
The same BPF helper function also allows for decapsulation in which case
data
is moved further away from data_hard_start
.
data_meta
initially points to the same location as data
but
bpf_xdp_adjust_meta()
is able to move the pointer towards data_hard_start
as well in order to provide room for custom metadata which is invisible to
the normal kernel networking stack but can be read by tc BPF programs since
it is transferred from XDP to the skb
. Vice versa, it can remove or reduce
the size of the custom metadata through the same BPF helper function by
moving data_meta
away from data_hard_start
again. data_meta
can
also be used solely for passing state between tail calls similarly to the
skb->cb[]
control block case that is accessible in tc BPF programs.
This gives the following relation, or invariant, for the struct xdp_buff
packet pointers: data_hard_start
<= data_meta
<= data
< data_end
.
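To make the pointer handling concrete, the following minimal, illustrative XDP program uses direct packet access on data and data_end; every packet access must be preceded by a bounds check against data_end, otherwise the verifier rejects the program. The section name and the (overly simple) parsing logic are only for illustration:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_parse(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip;

    /* Bounds check the Ethernet header before touching it. */
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;
    /* A real program would also check eth->h_proto here. */
    ip = (struct iphdr *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    /* At this point ip->saddr and friends may be read or written directly. */
    return XDP_PASS;
}

char __license[] __section("license") = "GPL";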
The rxq
field points to some additional per receive queue metadata which
is populated at ring setup time (not at XDP runtime):
struct xdp_rxq_info {
struct net_device *dev;
u32 queue_index;
u32 reg_state;
} ____cacheline_aligned;
The BPF program can retrieve queue_index
as well as additional data
from the netdevice itself such as ifindex
, etc.
BPF program return codes
After running the XDP BPF program, a verdict is returned from the program in
order to tell the driver how to process the packet next. In the linux/bpf.h
system header file all available return verdicts are enumerated:
enum xdp_action {
XDP_ABORTED = 0,
XDP_DROP,
XDP_PASS,
XDP_TX,
XDP_REDIRECT,
};
XDP_DROP
as the name suggests will drop the packet right at the driver
level without wasting any further resources. This is in particular useful
for BPF programs implementing DDoS mitigation mechanisms or firewalling in
general. The XDP_PASS
return code means that the packet is allowed to
be passed up to the kernel’s networking stack. Meaning, the current CPU
that was processing this packet now allocates a skb
, populates it, and
passes it onwards into the GRO engine. This would be equivalent to the
default packet handling behavior without XDP. With XDP_TX
the BPF program
has an efficient option to transmit the network packet out of the same NIC it
just arrived on again. This is typically useful when a few nodes are implementing,
for example, firewalling with subsequent load balancing in a cluster and
thus act as a hairpinned load balancer pushing the incoming packets back
into the switch after rewriting them in XDP BPF. XDP_REDIRECT
is similar
to XDP_TX
in that it is able to transmit the XDP packet, but through
another NIC. Another option for the XDP_REDIRECT
case is to redirect
into a BPF cpumap, meaning, the CPUs serving XDP on the NIC’s receive queues
can continue to do so and push the packet for processing the upper kernel
stack to a remote CPU. This is similar to XDP_PASS
, but with the ability
that the XDP BPF program can keep serving the incoming high load as opposed
to temporarily spend work on the current packet for pushing into upper
layers. Last but not least, XDP_ABORTED
which serves denoting an exception
like state from the program and has the same behavior as XDP_DROP
only
that XDP_ABORTED
passes the trace_xdp_exception
tracepoint which
can be additionally monitored to detect misbehavior.
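As a brief, hypothetical sketch of how these verdicts are commonly combined, the program below passes well-formed packets and flags anything shorter than an Ethernet header as an exception-like state, so the drop additionally shows up via trace_xdp_exception. The condition is only for illustration; the __section() macro follows the iproute2 loading style.
#include <linux/bpf.h>
#include <linux/if_ether.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_verdict_example(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* A truncated Ethernet header is treated as an exception-like state
     * here so the drop is also visible via trace_xdp_exception. */
    if (data + sizeof(struct ethhdr) > data_end)
        return XDP_ABORTED;

    /* Everything else continues into the kernel's networking stack. */
    return XDP_PASS;
}

char __license[] __section("license") = "GPL";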
Use cases for XDP
Some of the main use cases for XDP are presented in this subsection. The list is non-exhaustive, and given the programmability and efficiency that XDP and BPF enable, it can easily be adapted to solve very specific use cases.
DDoS mitigation, firewalling
One of the basic XDP BPF features is to tell the driver to drop a packet with XDP_DROP at this early stage, which allows for any kind of efficient network policy enforcement with an extremely low per-packet cost. This is ideal in situations when needing to cope with any sort of DDoS attacks, but more generally it also allows implementing any kind of firewalling policies with close to no overhead in BPF, e.g. either as a stand-alone appliance (for example, scrubbing ‘clean’ traffic through XDP_TX) or widely deployed on nodes protecting the end hosts themselves (via XDP_PASS or cpumap XDP_REDIRECT for good traffic). Offloaded XDP takes this even one step further by moving the already small per-packet cost entirely into the NIC with processing at line-rate.
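A rough sketch of such a policy, assuming an iproute2-style map declaration from bpf_elf.h and a user space agent that fills a hash map with blocklisted IPv4 source addresses; the names and sizes are made up for illustration:
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <asm/byteorder.h>

#include "bpf_elf.h" /* iproute2's struct bpf_elf_map definition */

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
    (void *) BPF_FUNC_map_lookup_elem;

/* Blocklist of IPv4 source addresses, maintained from user space. */
struct bpf_elf_map __section("maps") blocklist = {
    .type       = BPF_MAP_TYPE_HASH,
    .size_key   = sizeof(__u32),
    .size_value = sizeof(__u32),
    .pinning    = PIN_GLOBAL_NS,
    .max_elem   = 1024,
};

__section("prog")
int xdp_ddos_filter(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph;

    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != __constant_htons(ETH_P_IP))
        return XDP_PASS;

    iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;

    /* Drop right in the driver if the source address is blocklisted. */
    if (bpf_map_lookup_elem(&blocklist, &iph->saddr))
        return XDP_DROP;

    return XDP_PASS;
}

char __license[] __section("license") = "GPL";
A user space component would add and remove addresses in that map at runtime without reloading the program itself.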
Forwarding and load-balancing
Another major use case of XDP is packet forwarding and load-balancing through either the XDP_TX or XDP_REDIRECT actions. The packet can be arbitrarily mangled by the BPF program running in the XDP layer; BPF helper functions are even available for increasing or decreasing the packet’s headroom in order to arbitrarily encapsulate or decapsulate the packet before sending it out again. With XDP_TX, hairpinned load-balancers can be implemented that push the packet out of the same networking device it originally arrived on, while with the XDP_REDIRECT action it can be forwarded to another NIC for transmission. The latter return code can also be used in combination with BPF’s cpumap to load-balance packets for passing up the local stack, but on remote, non-XDP processing CPUs.
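A minimal, hypothetical hairpin sketch: the program swaps the Ethernet source and destination addresses and returns XDP_TX so the frame is bounced back out of the NIC it arrived on. A real load balancer would additionally rewrite L3/L4 headers before transmitting; the __section() macro again follows the iproute2 loading style.
#include <linux/bpf.h>
#include <linux/if_ether.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_hairpin_example(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    unsigned char tmp[ETH_ALEN];

    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    /* Swap source and destination MAC addresses. */
    __builtin_memcpy(tmp, eth->h_dest, ETH_ALEN);
    __builtin_memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
    __builtin_memcpy(eth->h_source, tmp, ETH_ALEN);

    /* Bounce the frame back out of the NIC it arrived on. */
    return XDP_TX;
}

char __license[] __section("license") = "GPL";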
Pre-stack filtering / processing
Besides policy enforcement, XDP can also be used for hardening the kernel’s networking stack with the help of XDP_DROP, meaning, it can drop packets irrelevant to the local node right at the earliest possible point before the networking stack sees them. For example, if it is known that a node only serves TCP traffic, any UDP, SCTP or other L4 traffic can be dropped right away. This has the advantage that packets do not need to traverse various entities like the GRO engine, the kernel’s flow dissector and others before it can be determined that they should be dropped, and thus the kernel’s attack surface is reduced. Thanks to XDP’s early processing stage, this effectively ‘pretends’ to the kernel’s networking stack that these packets have never been seen by the networking device. Additionally, if a potential bug in the stack’s receive path got uncovered and would cause a ‘ping of death’-like scenario, XDP can be utilized to drop such packets right away without having to reboot the kernel or restart any services. Due to the ability to atomically swap such programs to enforce a drop of bad packets, no network traffic is even interrupted on a host.
Another use case for pre-stack processing is that, given the kernel has not yet allocated an skb for the packet, the BPF program is free to modify the packet and, again, have it ‘pretend’ to the stack that it was received by the networking device this way. This allows for cases such as custom packet mangling and encapsulation protocols where the packet can be decapsulated prior to entering GRO aggregation, where GRO otherwise would not be able to perform any sort of aggregation due to not being aware of the custom protocol. XDP also allows pushing metadata (non-packet data) in front of the packet. This is ‘invisible’ to the normal kernel stack, can be GRO aggregated (for matching metadata) and later on processed in coordination with a tc ingress BPF program where it has the context of a skb available for e.g. setting various skb fields.
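The ‘TCP-only node’ case mentioned above could look roughly as follows. This is an illustrative sketch only; a production filter would also need to handle IPv6, fragments and similar corner cases.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <asm/byteorder.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_tcp_only(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph;

    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != __constant_htons(ETH_P_IP))
        return XDP_PASS;

    iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;

    /* The node only serves TCP: drop UDP, SCTP and any other L4 protocol
     * before the stack ever allocates a skb for the packet. */
    if (iph->protocol != IPPROTO_TCP)
        return XDP_DROP;

    return XDP_PASS;
}

char __license[] __section("license") = "GPL";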
Flow sampling, monitoring
XDP can also be used for cases such as packet monitoring, sampling or any other network analytics, for example, as part of an intermediate node in the path or on end hosts, also in combination with the previously mentioned use cases. For complex packet analysis, XDP provides a facility to efficiently push network packets (truncated or with full payload) and custom metadata into a fast lockless per-CPU memory mapped ring buffer provided by the Linux perf infrastructure to a user space application. This also allows for cases where only a flow’s initial data is analyzed and, once the traffic is determined to be good, monitoring is bypassed. Thanks to the flexibility brought by BPF, this allows for implementing any sort of custom monitoring or sampling.
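A rough sketch of pushing a custom sample record into the perf ring buffer from XDP, using the bpf_perf_event_output() helper together with a BPF_MAP_TYPE_PERF_EVENT_ARRAY map. The map declaration assumes iproute2’s bpf_elf.h, the record layout is invented for illustration, and a user space reader would mmap the per-CPU buffers via the perf API.
#include <linux/bpf.h>

#include "bpf_elf.h" /* iproute2's struct bpf_elf_map definition */

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

static int (*bpf_perf_event_output)(void *ctx, void *map, __u64 flags,
                                    void *data, __u64 size) =
    (void *) BPF_FUNC_perf_event_output;

/* Per-CPU perf ring buffer, mmap'ed and consumed by user space. */
struct bpf_elf_map __section("maps") perf_events = {
    .type       = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
    .size_key   = sizeof(__u32),
    .size_value = sizeof(__u32),
    .pinning    = PIN_GLOBAL_NS,
    .max_elem   = 64, /* placeholder: one slot per possible CPU */
};

/* Hypothetical sample record pushed up to user space. */
struct sample {
    __u32 pkt_len;
};

__section("prog")
int xdp_sample(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct sample s = {
        .pkt_len = (__u32)(data_end - data),
    };

    /* Emit the record into the current CPU's ring buffer slot. */
    bpf_perf_event_output(ctx, &perf_events, BPF_F_CURRENT_CPU,
                          &s, sizeof(s));
    return XDP_PASS;
}

char __license[] __section("license") = "GPL";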
One example of XDP BPF production usage is Facebook’s SHIV and Droplet infrastructure which implement their L4 load-balancing and DDoS countermeasures. Migrating their production infrastructure away from netfilter’s IPVS (IP Virtual Server) over to XDP BPF allowed for a 10x speedup compared to their previous IPVS setup. This was first presented at the netdev 2.1 conference:
- Slides: https://www.netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf
- Video: https://youtu.be/YEU2ClcGqts
Another example is the integration of XDP into Cloudflare’s DDoS mitigation
pipeline, which originally was using cBPF instead of eBPF for attack signature
matching through iptables’ xt_bpf
module. Due to the use of iptables this
caused severe performance problems under attack, where a user space bypass
solution was deemed necessary but came with drawbacks as well, such as needing
to busy poll the NIC and expensive packet re-injection into the kernel’s stack.
The migration over to eBPF and XDP combined the best of both worlds by having
high-performance programmable packet processing directly inside the kernel:
- Slides: https://www.netdevconf.org/2.1/slides/apr6/bertin_Netdev-XDP.pdf
- Video: https://youtu.be/7OuOukmuivg
XDP operation modes
XDP has three operation modes, of which ‘native’ XDP is the default. When XDP is talked about, this mode is typically the one implied.
Native XDP
This is the default mode where the XDP BPF program is run directly out of the networking driver’s early receive path. Most widely used NICs for 10G and higher already support native XDP.
Offloaded XDP
In the offloaded XDP mode the XDP BPF program is directly offloaded into the NIC instead of being executed on the host CPU. Thus, the already extremely low per-packet cost is pushed off the host CPU entirely and executed on the NIC, providing even higher performance than running in native XDP. This offload is typically implemented by SmartNICs containing multi-threaded, multicore flow processors where an in-kernel JIT compiler translates BPF into native instructions for the latter. Drivers supporting offloaded XDP usually also support native XDP for cases where some BPF helpers are not yet available, or are only available, for the native mode.
Generic XDP
For drivers not implementing native or offloaded XDP yet, the kernel provides an option for generic XDP, which does not require any driver changes since it runs at a much later point out of the networking stack. This setting is primarily targeted at developers who want to write and test programs against the kernel’s XDP API, and it will not operate at the performance rate of the native or offloaded modes. For XDP usage in a production environment either the native or offloaded mode is better suited and the recommended way to run XDP.
Driver support
Since BPF and XDP are evolving quickly in terms of feature and driver support, the following lists native and offloaded XDP drivers as of kernel 4.17.
Drivers supporting native XDP
- Broadcom
- bnxt
- Cavium
- thunderx
- Intel
- ixgbe
- ixgbevf
- i40e
- Mellanox
- mlx4
- mlx5
- Netronome
- nfp
- Others
- tun
- virtio_net
- Qlogic
- qede
- Solarflare
- sfc [1]
Drivers supporting offloaded XDP
- Netronome
- nfp [2]
Note that examples for writing and loading XDP programs are included in the toolchain section under the respective tools.
[1] XDP for sfc available via out of tree driver as of kernel 4.17, but will be upstreamed soon.
[2] Some BPF helper functions such as retrieving the current CPU number will not be available in an offloaded setting.
tc (traffic control)¶
Aside from other program types such as XDP, BPF can also be used out of the kernel’s tc (traffic control) layer in the networking data path. On a high-level there are three major differences when comparing XDP BPF programs to tc BPF ones:
The BPF input context is a sk_buff not a xdp_buff. When the kernel’s networking stack receives a packet, after the XDP layer, it allocates a buffer and parses the packet to store metadata about the packet. This representation is known as the sk_buff. This structure is then exposed in the BPF input context so that BPF programs from the tc ingress layer can use the metadata that the stack extracts from the packet. This can be useful, but comes with an associated cost of the stack performing this allocation and metadata extraction, and handling the packet until it hits the tc hook. By definition, the xdp_buff doesn’t have access to this metadata because the XDP hook is called before this work is done. This is a significant contributor to the performance difference between the XDP and tc hooks.
Therefore, BPF programs attached to the tc BPF hook can, for instance, read or write the skb’s mark, pkt_type, protocol, priority, queue_mapping, napi_id, cb[] array, hash, tc_classid or tc_index, vlan metadata, the XDP-transferred custom metadata and various other information. All members of the struct __sk_buff BPF context used in tc BPF are defined in the linux/bpf.h system header. (A small example illustrating this context follows after this list of differences.)
Generally, the sk_buff is of a completely different nature than the xdp_buff, where both come with advantages and disadvantages. For example, the sk_buff case has the advantage that it is rather straightforward to mangle its associated metadata; however, it also contains a lot of protocol-specific information (e.g. GSO related state) which makes it difficult to simply switch protocols by solely rewriting the packet data. This is due to the stack processing the packet based on the metadata rather than having the cost of accessing the packet contents each time. Thus, additional conversion is required from BPF helper functions taking care that sk_buff internals are properly converted as well. The xdp_buff case however does not face such issues since it comes at such an early stage where the kernel has not even allocated an sk_buff yet, thus packet rewrites of any kind can be realized trivially. However, the xdp_buff case has the disadvantage that sk_buff metadata is not available for mangling at this stage. The latter is overcome by passing custom metadata from XDP BPF to tc BPF, though. In this way, the limitations of each program type can be overcome by operating complementary programs of both types as the use case requires.
Compared to XDP, tc BPF programs can be triggered out of ingress and also egress points in the networking data path as opposed to ingress only in the case of XDP.
The two hook points sch_handle_ingress() and sch_handle_egress() in the kernel are triggered out of __netif_receive_skb_core() and __dev_queue_xmit(), respectively. The latter two are the main receive and transmit functions in the data path that, setting XDP aside, are triggered for every network packet going in or coming out of the node, allowing for full visibility for tc BPF programs at these hook points.
The tc BPF programs do not require any driver changes since they are run at hook points in generic layers in the networking stack. Therefore, they can be attached to any type of networking device.
While this provides flexibility, it also trades off performance compared to running at the native XDP layer. However, tc BPF programs still come at the earliest point in the generic kernel’s networking data path after GRO has been run but before any protocol processing, traditional iptables firewalling such as iptables PREROUTING or nftables ingress hooks or other packet processing takes place. Likewise on egress, tc BPF programs execute at the latest point before handing the packet to the driver itself for transmission, meaning after traditional iptables firewalling hooks like iptables POSTROUTING, but still before handing the packet to the kernel’s GSO engine.
One exception which does require driver changes however are offloaded tc BPF programs, typically provided by SmartNICs in a similar way as offloaded XDP just with differing set of features due to the differences in the BPF input context, helper functions and verdict codes.
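To illustrate the richer tc input context from the first point above, here is a small hypothetical tc BPF program operating on struct __sk_buff: it consumes an XDP-transferred metadata area (whose layout must match whatever the XDP program pushed; struct meta is invented here), reflects it into skb->mark, and reads stack-provided fields such as pkt_type. The section name and return codes follow the style used with cls_bpf in direct-action mode.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_packet.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

/* Hypothetical layout of the custom metadata pushed by an XDP program. */
struct meta {
    __u32 mark;
};

__section("ingress")
int tc_ingress_example(struct __sk_buff *skb)
{
    void *data      = (void *)(long)skb->data;
    void *data_meta = (void *)(long)skb->data_meta;
    struct meta *meta = data_meta;

    /* If XDP left custom metadata in front of the packet, consume it
     * and reflect it into skb->mark for later processing. */
    if ((void *)(meta + 1) <= data)
        skb->mark = meta->mark;

    /* Example of stack-provided metadata that XDP never sees. */
    if (skb->pkt_type == PACKET_OTHERHOST)
        return TC_ACT_SHOT;

    return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";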
BPF programs run in the tc layer are run from the cls_bpf
classifier.
While the tc terminology describes the BPF attachment point as a “classifier”,
this is a bit misleading since it under-represents what cls_bpf
is
capable of. That is to say, a fully programmable packet processor being able
not only to read the skb
metadata and packet data, but to also arbitrarily
mangle both, and terminate the tc processing with an action verdict. cls_bpf
can thus be regarded as a self-contained entity that manages and executes tc
BPF programs.
cls_bpf
can hold one or more tc BPF programs. In the case where Cilium
deploys cls_bpf
programs, it attaches only a single program for a given hook
in direct-action
mode. Typically, in the traditional tc scheme, there is a
split between classifier and action modules, where the classifier has one
or more actions attached to it that are triggered once the classifier has a
match. For modern use of tc in the software data path, this model
does not scale well for complex packet processing. Given tc BPF programs
attached to cls_bpf
are fully self-contained, they effectively fuse the
parsing and action process together into a single unit. Thanks to cls_bpf
’s
direct-action
mode, it will just return the tc action verdict and
terminate the processing pipeline immediately. This allows for implementing
scalable programmable packet processing in the networking data path by avoiding
linear iteration of actions. cls_bpf
is the only such “classifier” module
in the tc layer capable of such a fast-path.
Like XDP BPF programs, tc BPF programs can be atomically updated at runtime
via cls_bpf
without interrupting any network traffic or having to restart
services.
Both the tc ingress and the egress hook to which cls_bpf itself can be
attached are managed by a pseudo qdisc called sch_clsact. This is a
drop-in replacement and proper superset of the ingress qdisc since it
is able to manage both the ingress and egress tc hooks. For tc’s egress hook
in __dev_queue_xmit()
it is important to stress that it is not executed
under the kernel’s qdisc root lock. Thus, both tc ingress and egress hooks
are executed in a lockless manner in the fast-path. In either case, preemption
is disabled and execution happens under RCU read side.
Typically on egress there are qdiscs attached to netdevices such as sch_mq
,
sch_fq
, sch_fq_codel
or sch_htb
where some of them are classful
qdiscs that contain subclasses and thus require a packet classification
mechanism to determine a verdict where to demux the packet. This is handled
by a call to tcf_classify()
which calls into tc classifiers if present.
cls_bpf
can also be attached and used in such cases. Such operation usually
happens under the qdisc root lock and can be subject to lock contention. The
sch_clsact
qdisc’s egress hook comes at a much earlier point however which
does not fall under that and operates completely independent from conventional
egress qdiscs. Thus for cases like sch_htb
the sch_clsact
qdisc could
perform the heavy lifting packet classification through tc BPF outside of the
qdisc root lock, setting the skb->mark
or skb->priority
from there such
that sch_htb
only requires a flat mapping without expensive packet
classification under the root lock thus reducing contention.
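A hedged sketch of such a pre-classification program on sch_clsact’s egress hook: it sets skb->priority outside of the qdisc root lock so that a later classful qdisc such as sch_htb only needs a flat mapping on that field. The traffic-class mapping below is invented for illustration.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <asm/byteorder.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("egress")
int tc_egress_preclassify(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph;

    if ((void *)(eth + 1) > data_end)
        return TC_ACT_OK;
    if (eth->h_proto != __constant_htons(ETH_P_IP))
        return TC_ACT_OK;

    iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return TC_ACT_OK;

    /* Heavy classification happens here, outside the qdisc root lock:
     * steer ICMP into a low-priority band, everything else stays default. */
    skb->priority = iph->protocol == IPPROTO_ICMP ? 1 : 0;

    return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";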
Offloaded tc BPF programs are supported for the case of sch_clsact
in
combination with cls_bpf
where the prior loaded BPF program was JITed
from a SmartNIC driver to be run natively on the NIC. Only cls_bpf
programs operating in direct-action
mode are supported to be offloaded.
cls_bpf
only supports offloading a single program and cannot offload
multiple programs. Furthermore only the ingress hook supports offloading
BPF programs.
One cls_bpf
instance is able to hold multiple tc BPF programs internally.
If this is the case, then the TC_ACT_UNSPEC
program return code will
continue execution with the next tc BPF program in that list. However, this
has the drawback that several programs would need to parse the packet over
and over again resulting in degraded performance.
BPF program return codes
Both the tc ingress and egress hook share the same action return verdicts
that tc BPF programs can use. They are defined in the linux/pkt_cls.h
system header:
#define TC_ACT_UNSPEC (-1)
#define TC_ACT_OK 0
#define TC_ACT_SHOT 2
#define TC_ACT_STOLEN 4
#define TC_ACT_REDIRECT 7
There are a few more action TC_ACT_*
verdicts available in the system
header file which are also used in the two hooks. However, they share the
same semantics with the ones above. Meaning, from a tc BPF perspective,
TC_ACT_OK
and TC_ACT_RECLASSIFY
have the same semantics, as well as
the three TC_ACT_STOLEN
, TC_ACT_QUEUED
and TC_ACT_TRAP
opcodes.
Therefore, for these cases we only describe TC_ACT_OK
and the TC_ACT_STOLEN
opcode for the two groups.
Starting out with TC_ACT_UNSPEC
. It has the meaning of “unspecified action”
and is used in three cases, i) when an offloaded tc BPF program is attached
and the tc ingress hook is run where the cls_bpf
representation for the
offloaded program will return TC_ACT_UNSPEC
, ii) in order to continue
with the next tc BPF program in cls_bpf
for the multi-program case. The
latter also works in combination with offloaded tc BPF programs from point i)
where the TC_ACT_UNSPEC
from there continues with a next tc BPF program
solely running in non-offloaded case. Last but not least, iii) TC_ACT_UNSPEC
is also used for the single program case to simply tell the kernel to continue
with the skb
without additional side-effects. TC_ACT_UNSPEC
is very
similar to the TC_ACT_OK
action code in the sense that both pass the
skb
onwards either to upper layers of the stack on ingress or down to
the networking device driver for transmission on egress, respectively. The
only difference to TC_ACT_OK
is that TC_ACT_OK
sets skb->tc_index
based on the classid the tc BPF program set. The latter is set out of the
tc BPF program itself through skb->tc_classid
from the BPF context.
TC_ACT_SHOT
instructs the kernel to drop the packet, meaning, upper
layers of the networking stack will never see the skb
on ingress and
similarly the packet will never be submitted for transmission on egress.
TC_ACT_SHOT
and TC_ACT_STOLEN
are both similar in nature with a few
differences: TC_ACT_SHOT
will indicate to the kernel that the skb
was released through kfree_skb()
and return NET_XMIT_DROP
to the
callers for immediate feedback, whereas TC_ACT_STOLEN
will release
the skb
through consume_skb()
and pretend to upper layers that
the transmission was successful through NET_XMIT_SUCCESS
. The perf’s
drop monitor which records traces of kfree_skb()
will therefore
also not see any drop indications from TC_ACT_STOLEN
since its
semantics are such that the skb
has been “consumed” or queued but
certainly not “dropped”.
Last but not least, there is the TC_ACT_REDIRECT action, which is available for
tc BPF programs as well. It allows redirecting the skb to the ingress or egress
path of the same or another device, in combination with the bpf_redirect()
helper. Being able to inject the packet into another device’s ingress or
egress direction allows for full flexibility in packet forwarding with
BPF. There are no requirements on the target networking device other than
being a networking device itself, there is no need to run another instance
of cls_bpf
on the target device or other such restrictions.
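A small hypothetical tc BPF program tying these verdicts together: it drops traffic to an example TCP port with TC_ACT_SHOT and mirrors everything else to another interface via bpf_redirect(), whose return value already carries TC_ACT_REDIRECT on success. The interface index and port are placeholders, and IPv4 options are ignored for brevity.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>
#include <asm/byteorder.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

static int (*bpf_redirect)(int ifindex, __u32 flags) =
    (void *) BPF_FUNC_redirect;

#define TARGET_IFINDEX  4     /* placeholder target device */
#define BLOCKED_PORT    8080  /* placeholder TCP port */

__section("ingress")
int tc_verdict_example(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph;
    struct tcphdr *tcp;

    if ((void *)(eth + 1) > data_end)
        return TC_ACT_OK;
    if (eth->h_proto != __constant_htons(ETH_P_IP))
        return TC_ACT_OK;

    iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return TC_ACT_OK;
    if (iph->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    /* For brevity, assume no IPv4 options. */
    tcp = (void *)(iph + 1);
    if ((void *)(tcp + 1) > data_end)
        return TC_ACT_OK;

    /* Drop connections to the blocked port right here. */
    if (tcp->dest == __constant_htons(BLOCKED_PORT))
        return TC_ACT_SHOT;

    /* Forward everything else to another device; bpf_redirect()
     * returns TC_ACT_REDIRECT on success. */
    return bpf_redirect(TARGET_IFINDEX, 0);
}

char __license[] __section("license") = "GPL";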
tc BPF FAQ
This section contains a few miscellaneous question and answer pairs related to tc BPF programs that are asked from time to time.
- Question: What about act_bpf as a tc action module, is it still relevant?
- Answer: Not really. Although cls_bpf and act_bpf share the same functionality for tc BPF programs, cls_bpf is more flexible since it is a proper superset of act_bpf. The way tc works is that tc actions need to be attached to tc classifiers. In order to achieve the same flexibility as cls_bpf, act_bpf would need to be attached to the cls_matchall classifier. As the name says, this will match on every packet in order to pass them through for attached tc action processing. For act_bpf, this will result in less efficient packet processing than using cls_bpf in direct-action mode directly. If act_bpf is used in a setting with other classifiers than cls_bpf or cls_matchall then this will perform even worse due to the nature of operation of tc classifiers. Meaning, if classifier A has a mismatch, then the packet is passed to classifier B, reparsing the packet, etc, thus in the typical case there will be linear processing where the packet would need to traverse N classifiers in the worst case to find a match and execute act_bpf on that. Therefore, act_bpf has never been largely relevant. Additionally, act_bpf does not provide a tc offloading interface either, compared to cls_bpf.
- Question: Is it recommended to use cls_bpf not in direct-action mode?
- Answer: No. The answer is similar to the one above in that this is otherwise unable to scale for more complex processing. tc BPF can already do everything needed by itself in an efficient manner and thus there is no need for anything other than direct-action mode.
- Question: Is there any performance difference in offloaded
cls_bpf
and offloaded XDP? - Answer: No. Both are JITed through the same compiler in the kernel which handles the offloading to the SmartNIC and the loading mechanism for both is very similar as well. Thus, the BPF program gets translated into the same target instruction set in order to be able to run on the NIC natively. The two tc BPF and XDP BPF program types have a differing set of features, so depending on the use case one might be picked over the other due to availability of certain helper functions in the offload case, for example.
Use cases for tc BPF
Some of the main use cases for tc BPF programs are presented in this subsection. Also here, the list is non-exhaustive, and given the programmability and efficiency of tc BPF, it can easily be tailored and integrated into orchestration systems in order to solve very specific use cases. While some use cases may overlap with XDP, tc BPF and XDP BPF are mostly complementary to each other, and both can also be used at the same time or one over the other depending on which is most suitable for the problem at hand.
Policy enforcement for containers
One application which tc BPF programs are suitable for is to implement policy enforcement, custom firewalling or similar security measures for containers or pods, respectively. In the conventional case, container isolation is implemented through network namespaces with veth networking devices connecting the host’s initial namespace with the dedicated container’s namespace. Since one end of the veth pair has been moved into the container’s namespace whereas the other end remains in the initial namespace of the host, all network traffic from the container has to pass through the host-facing veth device allowing for attaching tc BPF programs on the tc ingress and egress hook of the veth. Network traffic going into the container will pass through the host-facing veth’s tc egress hook whereas network traffic coming from the container will pass through the host-facing veth’s tc ingress hook.
For virtual devices like veth devices, XDP is unsuitable in this case since the kernel operates solely on a skb here and generic XDP has a few limitations where it does not operate with cloned skb’s. The latter is heavily used from the TCP/IP stack in order to hold data segments for retransmission where the generic XDP hook would simply get bypassed instead. Moreover, generic XDP needs to linearize the entire skb resulting in heavily degraded performance. tc BPF on the other hand is more flexible as it specializes on the skb input context case and thus does not need to cope with the limitations from generic XDP.
Forwarding and load-balancing
The forwarding and load-balancing use case is quite similar to XDP, although slightly more targeted towards east-west container workloads rather than north-south traffic (though both technologies can be used in either case). Since XDP is only available on the ingress side, tc BPF programs allow for further use cases that apply in particular on the egress side; for example, container based traffic can already be NATed and load-balanced on the egress side through BPF out of the initial namespace such that this is done transparently to the container itself. Egress traffic is already based on the sk_buff structure due to the nature of the kernel’s networking stack, so packet rewrites and redirects are suitable out of tc BPF. By utilizing the bpf_redirect() helper function, BPF can take over the forwarding logic to push the packet either into the ingress or egress path of another networking device. Thus, any bridge-like devices become unnecessary to use as well by utilizing tc BPF as forwarding fabric.
Flow sampling, monitoring
Like in the XDP case, flow sampling and monitoring can be realized through a high-performance lockless per-CPU memory mapped perf ring buffer where the BPF program is able to push custom data, the full or truncated packet contents, or both up to a user space application. From the tc BPF program this is realized through the bpf_skb_event_output() BPF helper function, which has the same function signature and semantics as bpf_xdp_event_output(). Given tc BPF programs can be attached to ingress and egress as opposed to only ingress in the XDP BPF case, plus the two tc hooks are at the lowest layer in the (generic) networking stack, this allows for bidirectional monitoring of all network traffic from a particular node. This might be somewhat related to the cBPF case which tcpdump and Wireshark make use of, though without having to clone the skb and while being a lot more flexible in terms of programmability where, for example, BPF can already perform in-kernel aggregation rather than pushing everything up to user space, as well as add custom annotations for packets pushed into the ring buffer. The latter is also heavily used in Cilium where packet drops can be further annotated to correlate container labels and reasons for why a given packet had to be dropped (such as due to policy violation) in order to provide a richer context.
Packet scheduler pre-processing
The sch_clsact’s egress hook, which is called sch_handle_egress(), runs right before taking the kernel’s qdisc root lock, thus tc BPF programs can be utilized to perform all the heavy lifting packet classification and mangling before the packet is transmitted into a real full blown qdisc such as sch_htb. This type of interaction of sch_clsact with a real qdisc like sch_htb coming later in the transmission phase allows reducing the lock contention on transmission since sch_clsact’s egress hook is executed without taking locks.
One concrete example user of tc BPF but also XDP BPF programs is Cilium. Cilium is open source software for transparently securing the network connectivity between application services deployed using Linux container management platforms like Docker and Kubernetes and operates at Layer 3/4 as well as Layer 7. At the heart of Cilium operates BPF in order to implement the policy enforcement as well as load balancing and monitoring.
- Slides: https://www.slideshare.net/ThomasGraf5/dockercon-2017-cilium-network-and-application-security-with-bpf-and-xdp
- Video: https://youtu.be/ilKlmTDdFgk
- Github: https://github.com/cilium/cilium
Driver support
Since tc BPF programs are triggered from the kernel’s networking stack and not directly out of the driver, they do not require any extra driver modification and therefore can run on any networking device. The only exception listed below is for offloading tc BPF programs to the NIC.
Drivers supporting offloaded tc BPF
- Netronome
- nfp [2]
Note that also here examples for writing and loading tc BPF programs are included in the toolchain section under the respective tools.
Further Reading¶
The lists of docs, projects, talks, papers, and further reading material mentioned here are likely not complete. Thus, feel free to open pull requests to complete the list.
Kernel Developer FAQ¶
Under Documentation/bpf/
, the Linux kernel provides two FAQ files that
are mainly targeted for kernel developers involved in the BPF subsystem.
BPF Devel FAQ: this document provides mostly information around patch submission process as well as BPF kernel tree, stable tree and bug reporting workflows, questions around BPF’s extensibility and interaction with LLVM and more.
BPF Design FAQ: this document tries to answer frequently asked questions around BPF design decisions related to the instruction set, verifier, calling convention, JITs, etc.
Projects using BPF¶
The following list includes a selection of open source projects that make use of BPF or provide tooling for it. In this context, the eBPF instruction set is specifically meant, as opposed to projects utilizing the legacy cBPF:
Tracing
BCC
BCC stands for BPF Compiler Collection and its key feature is to provide a set of easy to use and efficient kernel tracing utilities, all based upon BPF programs hooking into kernel infrastructure via kprobes, kretprobes, tracepoints, uprobes, uretprobes as well as USDT probes. The collection provides close to a hundred tools targeting different layers across the stack, from applications and system libraries to the various kernel subsystems, in order to analyze a system’s performance characteristics or problems. Additionally, BCC provides an API in order to be used as a library by other projects.
bpftrace
bpftrace is a DTrace-style dynamic tracing tool for Linux and uses LLVM as a back end to compile scripts to BPF-bytecode and makes use of BCC for interacting with the kernel’s BPF tracing infrastructure. It provides a higher-level language for implementing tracing scripts compared to native BCC.
perf
The perf tool which is developed by the Linux kernel community as part of the kernel source tree provides a way to load tracing BPF programs through the conventional perf record subcommand where the aggregated data from BPF can be retrieved and post processed in perf.data for example through perf script and other means.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf
ply
ply is a tracing tool that follows the ‘Little Language’ approach of yore, and compiles ply scripts into Linux BPF programs that are attached to kprobes and tracepoints in the kernel. The scripts have a C-like syntax, heavily inspired by DTrace and by extension awk. ply keeps dependencies to a minimum and only requires flex and bison at build time and only libc at runtime.
systemtap
systemtap is a scripting language and tool for extracting, filtering and summarizing data in order to diagnose and analyze performance or functional problems. It comes with a BPF back end called stapbpf which translates the script directly into BPF without the need of an additional compiler and injects the probe into the kernel. Thus, unlike stap’s kernel modules, this neither has external dependencies nor requires loading kernel modules.
https://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=summary
PCP
Performance Co-Pilot (PCP) is a system performance and analysis framework which is able to collect metrics through a variety of agents as well as analyze collected systems’ performance metrics in real-time or by using historical data. With pmdabcc, PCP has a BCC based performance metrics domain agent which extracts data from the kernel via BPF and BCC.
Weave Scope
Weave Scope is a cloud monitoring tool collecting data about processes, networking connections or other system data by making use of BPF in combination with kprobes. Weave Scope works on top of the gobpf library in order to load BPF ELF files into the kernel, and comes with a tcptracer-bpf tool which monitors connect, accept and close calls in order to trace TCP events.
Networking
Cilium
Cilium provides and transparently secures network connectivity and load-balancing between application workloads such as application containers or processes. Cilium operates at Layer 3/4 to provide traditional networking and security services as well as Layer 7 to protect and secure use of modern application protocols such as HTTP, gRPC and Kafka. It is integrated into orchestration frameworks such as Kubernetes and Mesos, and BPF is the foundational part of Cilium that operates in the kernel’s networking data path.
Suricata
Suricata is a network IDS, IPS and NSM engine, and utilizes BPF as well as XDP in three different areas, that is, as BPF filter in order to process or bypass certain packets, as a BPF based load balancer in order to allow for programmable load balancing and for XDP to implement a bypass or dropping mechanism at high packet rates.
http://suricata.readthedocs.io/en/latest/capture-hardware/ebpf-xdp.html
systemd
systemd allows for IPv4/v6 accounting as well as implementing network access control for its systemd units based on BPF’s cgroup ingress and egress hooks. Accounting is based on packets / bytes, and ACLs can be specified as address prefixes for allow / deny rules. More information can be found at:
http://0pointer.net/blog/ip-accounting-and-access-lists-with-systemd.html
iproute2
iproute2 offers the ability to load BPF programs as LLVM generated ELF files into the kernel. iproute2 supports both XDP BPF programs and tc BPF programs through a common BPF loader backend. The tc and ip command line utilities enable loader and introspection functionality for the user.
https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/
p4c-xdp
p4c-xdp presents a P4 compiler backend targeting BPF and XDP. P4 is a domain specific language describing how packets are processed by the data plane of a programmable network element such as NICs, appliances or switches, and with the help of p4c-xdp P4 programs can be translated into BPF C programs which can be compiled by clang / LLVM and loaded as BPF programs into the kernel at XDP layer for high performance packet processing.
Others
LLVM
clang / LLVM provides the BPF back end in order to compile C BPF programs into BPF instructions contained in ELF files. The LLVM BPF back end is developed alongside with the BPF core infrastructure in the Linux kernel and maintained by the same community. clang / LLVM is a key part in the toolchain for developing BPF programs.
libbpf
libbpf is a generic BPF library which is developed by the Linux kernel community as part of the kernel source tree and allows for loading and attaching BPF programs from LLVM generated ELF files into the kernel. The library is used by other kernel projects such as perf and bpftool.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/bpf
bpftool
bpftool is the main tool for introspecting and debugging BPF programs and BPF maps, and like libbpf is developed by the Linux kernel community. It allows for dumping all active BPF programs and maps in the system, dumping and disassembling BPF or JITed BPF instructions from a program as well as dumping and manipulating BPF maps in the system. bpftool supports interaction with the BPF filesystem, loading various program types from an object file into the kernel and much more.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/bpftool
gobpf
gobpf provides go bindings for the bcc framework as well as low-level routines in order to load and use BPF programs from ELF files.
ebpf_asm
ebpf_asm provides an assembler for BPF programs written in an Intel-like assembly syntax, and therefore offers an alternative for writing BPF programs directly in assembly for cases where programs are rather small and simple without needing the clang / LLVM toolchain.
XDP Newbies¶
There are a couple of walk-through posts by David S. Miller to the xdp-newbies mailing list (http://vger.kernel.org/vger-lists.html#xdp-newbies), which explain various parts of XDP and BPF:
- May 2017,
- BPF Verifier Overview, David S. Miller, https://www.spinics.net/lists/xdp-newbies/msg00185.html
- May 2017,
- Contextually speaking…, David S. Miller, https://www.spinics.net/lists/xdp-newbies/msg00181.html
- May 2017,
- bpf.h and you…, David S. Miller, https://www.spinics.net/lists/xdp-newbies/msg00179.html
- Apr 2017,
- XDP example of the day, David S. Miller, https://www.spinics.net/lists/xdp-newbies/msg00009.html
BPF Newsletter¶
Alexander Alemayhu initiated a newsletter around BPF, published roughly once per week, covering the latest developments around BPF in Linux kernel land and its surrounding ecosystem in user space.
All BPF update newsletters (01 - 12) can be found here:
Podcasts¶
There have been a number of technical podcasts partially covering BPF. Incomplete list:
- Feb 2017,
- Linux Networking Update from Netdev Conference, Thomas Graf, Software Gone Wild, Show 71, http://blog.ipspace.net/2017/02/linux-networking-update-from-netdev.html http://media.blubrry.com/ipspace/stream.ipspace.net/nuggets/podcast/Show_71-NetDev_Update.mp3
- Jan 2017,
- The IO Visor Project, Brenden Blanco, OVS Orbit, Episode 23, https://ovsorbit.org/#e23 https://ovsorbit.org/episode-23.mp3
- Oct 2016,
- Fast Linux Packet Forwarding, Thomas Graf, Software Gone Wild, Show 64, http://blog.ipspace.net/2016/10/fast-linux-packet-forwarding-with.html http://media.blubrry.com/ipspace/stream.ipspace.net/nuggets/podcast/Show_64-Cilium_with_Thomas_Graf.mp3
- Aug 2016,
- P4 on the Edge, John Fastabend, OVS Orbit, Episode 11, https://ovsorbit.org/#e11 https://ovsorbit.org/episode-11.mp3
- May 2016,
- Cilium, Thomas Graf, OVS Orbit, Episode 4, https://ovsorbit.org/#e4 https://ovsorbit.benpfaff.org/episode-4.mp3
Blog posts¶
The following (incomplete) list includes blog posts around BPF, XDP and related projects:
- May 2017,
- An entertaining eBPF XDP adventure, Suchakra Sharma, https://suchakra.wordpress.com/2017/05/23/an-entertaining-ebpf-xdp-adventure/
- May 2017,
- eBPF, part 2: Syscall and Map Types, Ferris Ellis, https://ferrisellis.com/posts/ebpf_syscall_and_maps/
- May 2017,
- Monitoring the Control Plane, Gary Berger, http://firstclassfunc.com/2017/05/monitoring-the-control-plane/
- Apr 2017,
- USENIX/LISA 2016 Linux bcc/BPF Tools, Brendan Gregg, http://www.brendangregg.com/blog/2017-04-29/usenix-lisa-2016-bcc-bpf-tools.html
- Apr 2017,
- Liveblog: Cilium for Network and Application Security with BPF and XDP, Scott Lowe, http://blog.scottlowe.org//2017/04/18/black-belt-cilium/
- Apr 2017,
- eBPF, part 1: Past, Present, and Future, Ferris Ellis, https://ferrisellis.com/posts/ebpf_past_present_future/
- Mar 2017,
- Analyzing KVM Hypercalls with eBPF Tracing, Suchakra Sharma, https://suchakra.wordpress.com/2017/03/31/analyzing-kvm-hypercalls-with-ebpf-tracing/
- Jan 2017,
- Golang bcc/BPF Function Tracing, Brendan Gregg, http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
- Dec 2016,
- Give me 15 minutes and I’ll change your view of Linux tracing, Brendan Gregg, http://www.brendangregg.com/blog/2016-12-27/linux-tracing-in-15-minutes.html
- Nov 2016,
- Cilium: Networking and security for containers with BPF and XDP, Daniel Borkmann, https://opensource.googleblog.com/2016/11/cilium-networking-and-security.html
- Nov 2016,
- Linux bcc/BPF tcplife: TCP Lifespans, Brendan Gregg, http://www.brendangregg.com/blog/2016-11-30/linux-bcc-tcplife.html
- Oct 2016,
- DTrace for Linux 2016, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-27/dtrace-for-linux-2016.html
- Oct 2016,
- Linux 4.9’s Efficient BPF-based Profiler, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-21/linux-efficient-profiler.html
- Oct 2016,
- Linux bcc tcptop, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-15/linux-bcc-tcptop.html
- Oct 2016,
- Linux bcc/BPF Node.js USDT Tracing, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-12/linux-bcc-nodejs-usdt.html
- Oct 2016,
- Linux bcc/BPF Run Queue (Scheduler) Latency, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-08/linux-bcc-runqlat.html
- Oct 2016,
- Linux bcc ext4 Latency Tracing, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-06/linux-bcc-ext4dist-ext4slower.html
- Oct 2016,
- Linux MySQL Slow Query Tracing with bcc/BPF, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-04/linux-bcc-mysqld-qslower.html
- Oct 2016,
- Linux bcc Tracing Security Capabilities, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-01/linux-bcc-security-capabilities.html
- Sep 2016,
- Suricata bypass feature, Eric Leblond, https://www.stamus-networks.com/2016/09/28/suricata-bypass-feature/
- Aug 2016,
- Introducing the p0f BPF compiler, Gilberto Bertin, https://blog.cloudflare.com/introducing-the-p0f-bpf-compiler/
- Jun 2016,
- Ubuntu Xenial bcc/BPF, Brendan Gregg, http://www.brendangregg.com/blog/2016-06-14/ubuntu-xenial-bcc-bpf.html
- Mar 2016,
- Linux BPF/bcc Road Ahead, March 2016, Brendan Gregg, http://www.brendangregg.com/blog/2016-03-28/linux-bpf-bcc-road-ahead-2016.html
- Mar 2016,
- Linux BPF Superpowers, Brendan Gregg, http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html
- Feb 2016,
- Linux eBPF/bcc uprobes, Brendan Gregg, http://www.brendangregg.com/blog/2016-02-08/linux-ebpf-bcc-uprobes.html
- Feb 2016,
- Who is waking the waker? (Linux chain graph prototype), Brendan Gregg, http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html
- Feb 2016,
- Linux Wakeup and Off-Wake Profiling, Brendan Gregg, http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
- Jan 2016,
- Linux eBPF Off-CPU Flame Graph, Brendan Gregg, http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html
- Jan 2016,
- Linux eBPF Stack Trace Hack, Brendan Gregg, http://www.brendangregg.com/blog/2016-01-18/ebpf-stack-trace-hack.html
- Sep 2015,
- Linux Networking, Tracing and IO Visor, a New Systems Performance Tool for a Distributed World, Suchakra Sharma, https://thenewstack.io/comparing-dtrace-iovisor-new-systems-performance-platform-advance-linux-networking-virtualization/
- Aug 2015,
- BPF Internals - II, Suchakra Sharma, https://suchakra.wordpress.com/2015/08/12/bpf-internals-ii/
- May 2015,
- eBPF: One Small Step, Brendan Gregg, http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html
- May 2015,
- BPF Internals - I, Suchakra Sharma, https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/
- Jul 2014,
- Introducing the BPF Tools, Marek Majkowski, https://blog.cloudflare.com/introducing-the-bpf-tools/
- May 2014,
- BPF - the forgotten bytecode, Marek Majkowski, https://blog.cloudflare.com/bpf-the-forgotten-bytecode/
Talks¶
The following (incomplete) list includes talks and conference papers related to BPF and XDP:
- May 2017,
- PyCon 2017, Portland, Executing python functions in the linux kernel by transpiling to bpf, Alex Gartrell, https://www.youtube.com/watch?v=CpqMroMBGP4
- May 2017,
- gluecon 2017, Denver, Cilium + BPF: Least Privilege Security on API Call Level for Microservices, Dan Wendlandt, http://gluecon.com/#agenda
- May 2017,
- Lund Linux Con, Lund, XDP - eXpress Data Path, Jesper Dangaard Brouer, http://people.netfilter.org/hawk/presentations/LLC2017/XDP_DDoS_protecting_LLC2017.pdf
- May 2017,
- Polytechnique Montreal, Trace Aggregation and Collection with eBPF, Suchakra Sharma, http://step.polymtl.ca/~suchakra/eBPF-5May2017.pdf
- Apr 2017,
- DockerCon, Austin, Cilium - Network and Application Security with BPF and XDP, Thomas Graf, https://www.slideshare.net/ThomasGraf5/dockercon-2017-cilium-network-and-application-security-with-bpf-and-xdp
- Apr 2017,
- NetDev 2.1, Montreal, XDP Mythbusters, David S. Miller, https://www.netdevconf.org/2.1/slides/apr7/miller-XDP-MythBusters.pdf
- Apr 2017,
- NetDev 2.1, Montreal, Droplet: DDoS countermeasures powered by BPF + XDP, Huapeng Zhou, Doug Porter, Ryan Tierney, Nikita Shirokov, https://www.netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf
- Apr 2017,
- NetDev 2.1, Montreal, XDP in practice: integrating XDP in our DDoS mitigation pipeline, Gilberto Bertin, https://www.netdevconf.org/2.1/slides/apr6/bertin_Netdev-XDP.pdf
- Apr 2017,
- NetDev 2.1, Montreal, XDP for the Rest of Us, Andy Gospodarek, Jesper Dangaard Brouer, https://www.netdevconf.org/2.1/slides/apr7/gospodarek-Netdev2.1-XDP-for-the-Rest-of-Us_Final.pdf
- Mar 2017,
- SCALE15x, Pasadena, Linux 4.x Tracing: Performance Analysis with bcc/BPF, Brendan Gregg, https://www.slideshare.net/brendangregg/linux-4x-tracing-performance-analysis-with-bccbpf
- Mar 2017,
- XDP Inside and Out, David S. Miller, https://github.com/iovisor/bpf-docs/raw/master/XDP_Inside_and_Out.pdf
- Mar 2017,
- OpenSourceDays, Copenhagen, XDP - eXpress Data Path, Used for DDoS protection, Jesper Dangaard Brouer, https://github.com/iovisor/bpf-docs/raw/master/XDP_Inside_and_Out.pdf
- Mar 2017,
- source{d}, Infrastructure 2017, Madrid, High-performance Linux monitoring with eBPF, Alfonso Acosta, https://www.youtube.com/watch?v=k4jqTLtdrxQ
- Feb 2017,
- FOSDEM 2017, Brussels, Stateful packet processing with eBPF, an implementation of OpenState interface, Quentin Monnet, https://fosdem.org/2017/schedule/event/stateful_ebpf/
- Feb 2017,
- FOSDEM 2017, Brussels, eBPF and XDP walkthrough and recent updates, Daniel Borkmann, http://borkmann.ch/talks/2017_fosdem.pdf
- Feb 2017,
- FOSDEM 2017, Brussels, Cilium - BPF & XDP for containers, Thomas Graf, https://fosdem.org/2017/schedule/event/cilium/
- Jan 2017,
- linuxconf.au, Hobart, BPF: Tracing and more, Brendan Gregg, https://www.slideshare.net/brendangregg/bpf-tracing-and-more
- Dec 2016,
- USENIX LISA 2016, Boston, Linux 4.x Tracing Tools: Using BPF Superpowers, Brendan Gregg, https://www.slideshare.net/brendangregg/linux-4x-tracing-tools-using-bpf-superpowers
- Nov 2016,
- Linux Plumbers, Santa Fe, Cilium: Networking & Security for Containers with BPF & XDP, Thomas Graf, http://www.slideshare.net/ThomasGraf5/clium-container-networking-with-bpf-xdp
- Nov 2016,
- OVS Conference, Santa Clara, Offloading OVS Flow Processing using eBPF, William (Cheng-Chun) Tu, http://openvswitch.org/support/ovscon2016/7/1120-tu.pdf
- Oct 2016,
- One.com, Copenhagen, XDP - eXpress Data Path, Intro and future use-cases, Jesper Dangaard Brouer, http://people.netfilter.org/hawk/presentations/xdp2016/xdp_intro_and_use_cases_sep2016.pdf
- Oct 2016,
- Docker Distributed Systems Summit, Berlin, Cilium: Networking & Security for Containers with BPF & XDP, Thomas Graf, http://www.slideshare.net/Docker/cilium-bpf-xdp-for-containers-66969823
- Oct 2016,
- NetDev 1.2, Tokyo, Data center networking stack, Tom Herbert, http://netdevconf.org/1.2/session.html?tom-herbert
- Oct 2016,
- NetDev 1.2, Tokyo, Fast Programmable Networks & Encapsulated Protocols, David S. Miller, http://netdevconf.org/1.2/session.html?david-miller-keynote
- Oct 2016,
- NetDev 1.2, Tokyo, XDP workshop - Introduction, experience, and future development, Tom Herbert, http://netdevconf.org/1.2/session.html?herbert-xdp-workshop
- Oct 2016,
- NetDev1.2, Tokyo, The adventures of a Suricate in eBPF land, Eric Leblond, http://netdevconf.org/1.2/slides/oct6/10_suricata_ebpf.pdf
- Oct 2016,
- NetDev1.2, Tokyo, cls_bpf/eBPF updates since netdev 1.1, Daniel Borkmann, http://borkmann.ch/talks/2016_tcws.pdf
- Oct 2016,
- NetDev1.2, Tokyo, Advanced programmability and recent updates with tc’s cls_bpf, Daniel Borkmann, http://borkmann.ch/talks/2016_netdev2.pdf http://www.netdevconf.org/1.2/papers/borkmann.pdf
- Oct 2016,
- NetDev 1.2, Tokyo, eBPF/XDP hardware offload to SmartNICs, Jakub Kicinski, Nic Viljoen, http://netdevconf.org/1.2/papers/eBPF_HW_OFFLOAD.pdf
- Aug 2016,
- LinuxCon, Toronto, What Can BPF Do For You?, Brenden Blanco, https://events.linuxfoundation.org/sites/events/files/slides/iovisor-lc-bof-2016.pdf
- Aug 2016,
- LinuxCon, Toronto, Cilium - Fast IPv6 Container Networking with BPF and XDP, Thomas Graf, https://www.slideshare.net/ThomasGraf5/cilium-fast-ipv6-container-networking-with-bpf-and-xdp
- Aug 2016,
- P4, EBPF and Linux TC Offload, Dinan Gunawardena, Jakub Kicinski, https://de.slideshare.net/Open-NFP/p4-epbf-and-linux-tc-offload
- Jul 2016,
- Linux Meetup, Santa Clara, eXpress Data Path, Brenden Blanco, http://www.slideshare.net/IOVisor/express-data-path-linux-meetup-santa-clara-july-2016
- Jul 2016,
- Linux Meetup, Santa Clara, CETH for XDP, Yan Chan, Yunsong Lu, http://www.slideshare.net/IOVisor/ceth-for-xdp-linux-meetup-santa-clara-july-2016
- May 2016,
- P4 workshop, Stanford, P4 on the Edge, John Fastabend, https://schd.ws/hosted_files/2016p4workshop/1d/Intel%20Fastabend-P4%20on%20the%20Edge.pdf
- Mar 2016,
- Performance @Scale 2016, Menlo Park, Linux BPF Superpowers, Brendan Gregg, https://www.slideshare.net/brendangregg/linux-bpf-superpowers
- Mar 2016,
- eXpress Data Path, Tom Herbert, Alexei Starovoitov, https://github.com/iovisor/bpf-docs/raw/master/Express_Data_Path.pdf
- Feb 2016,
- NetDev1.1, Seville, On getting tc classifier fully programmable with cls_bpf, Daniel Borkmann, http://borkmann.ch/talks/2016_netdev.pdf http://www.netdevconf.org/1.1/proceedings/papers/On-getting-tc-classifier-fully-programmable-with-cls-bpf.pdf
- Jan 2016,
- FOSDEM 2016, Brussels, Linux tc and eBPF, Daniel Borkmann, http://borkmann.ch/talks/2016_fosdem.pdf
- Oct 2015,
- LinuxCon Europe, Dublin, eBPF on the Mainframe, Michael Holzheu, https://events.linuxfoundation.org/sites/events/files/slides/ebpf_on_the_mainframe_lcon_2015.pdf
- Aug 2015,
- Tracing Summit, Seattle, LLTng’s Trace Filtering and beyond (with some eBPF goodness, of course!), Suchakra Sharma, https://github.com/iovisor/bpf-docs/raw/master/ebpf_excerpt_20Aug2015.pdf
- Jun 2015,
- LinuxCon Japan, Tokyo, Exciting Developments in Linux Tracing, Elena Zannoni, https://events.linuxfoundation.org/sites/events/files/slides/tracing-linux-ezannoni-linuxcon-ja-2015_0.pdf
- Feb 2015,
- Collaboration Summit, Santa Rosa, BPF: In-kernel Virtual Machine, Alexei Starovoitov, https://events.linuxfoundation.org/sites/events/files/slides/bpf_collabsummit_2015feb20.pdf
- Feb 2015,
- NetDev 0.1, Ottawa, BPF: In-kernel Virtual Machine, Alexei Starovoitov, http://netdevconf.org/0.1/sessions/15.html
- Feb 2014,
- DevConf.cz, Brno, tc and cls_bpf: lightweight packet classifying with BPF, Daniel Borkmann, http://borkmann.ch/talks/2014_devconf.pdf
Further Documents¶
- Dive into BPF: a list of reading material, Quentin Monnet (https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/)
- XDP - eXpress Data Path, Jesper Dangaard Brouer (https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html)
API Reference¶
Introduction¶
The Cilium API is JSON based and provided by the cilium-agent
. The purpose
of the API is to provide visibility and control over an individual agent
instance. In general, all API calls affect only the resources managed by the
individual cilium-agent
serving the API. A few selected API calls such as
the security identity resolution provides cluster wide visibility. Such API
calls are marked specifically. Unless noted otherwise, API calls will only affect
local agent resources.
How to access the API¶
CLI Client¶
The easiest way to access the API is via the cilium
CLI client. cilium
will automatically locate the API of the agent running on the same node and
access it. However, using the -H
or --host
flag, the cilium
client
can be pointed to an arbitrary API address.
Example¶
$ cilium -H unix:///var/run/cilium/cilium.sock
[...]
Golang Package¶
The following Go packages can be used to access the API:
Package | Description |
pkg/client | Main client API abstraction |
api/v1/models | API resource data type models |
Example¶
The full example can be found in the cilium/client-example repository.
package main

import (
	"fmt"

	"github.com/cilium/cilium/pkg/client"
)

func main() {
	c, err := client.NewDefaultClient()
	if err != nil {
		...
	}

	endpoints, err := c.EndpointList()
	if err != nil {
		...
	}

	for _, ep := range endpoints {
		fmt.Printf("%8d %14s %16s %32s\n", ep.ID, ep.ContainerName, ep.Addressing.IPV4, ep.Addressing.IPV6)
	}
}
Compatibility Guarantees¶
The Cilium API is stable as of version 1.0; backward compatibility will be upheld for the whole lifecycle of Cilium 1.x.
API Reference¶
-
GET
/healthz
¶ Get health of Cilium daemon
Returns health and status information of the Cilium daemon and related components such as the local container runtime, connected datastore, Kubernetes integration.
Status Codes: - 200 OK – Success
-
GET
/config
¶ Get configuration of Cilium daemon
Returns the configuration of the Cilium daemon.
Status Codes: - 200 OK – Success
-
PATCH
/config
¶ Modify daemon configuration
Updates the daemon configuration by applying the provided ConfigurationMap and regenerates & recompiles all required datapath components.
Status Codes: - 200 OK – Success
- 400 Bad Request – Bad configuration parameters
- 500 Internal Server Error – Recompilation failed
-
GET
/endpoint/{id}
¶ Get endpoint by endpoint ID
Returns endpoint information
Parameters: - id (string) –
String describing an endpoint with the format
[prefix:]id
. If no prefix is specified, a prefix ofcilium-local:
is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.- Supported endpoint id prefixes:
- cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
- cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
- container-id: Container runtime ID, e.g. container-id:22222
- container-name: Container name, e.g. container-name:foobar
- pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
- docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid endpoint ID format for specified type
- 404 Not Found – Endpoint not found
- id (string) –
-
PUT
/endpoint/{id}
¶ Create endpoint
Creates a new endpoint
Parameters: - id (string) –
String describing an endpoint with the format
[prefix:]id
. If no prefix is specified, a prefix ofcilium-local:
is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.- Supported endpoint id prefixes:
- cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
- cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
- container-id: Container runtime ID, e.g. container-id:22222
- container-name: Container name, e.g. container-name:foobar
- pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
- docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes: - 201 Created – Created
- 400 Bad Request – Invalid endpoint in request
- 409 Conflict – Endpoint already exists
- 500 Internal Server Error – Endpoint creation failed
- id (string) –
-
PATCH
/endpoint/{id}
¶ Modify existing endpoint
Applies the endpoint change request to an existing endpoint
Parameters: - id (string) –
String describing an endpoint with the format
[prefix:]id
. If no prefix is specified, a prefix ofcilium-local:
is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.- Supported endpoint id prefixes:
- cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
- cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
- container-id: Container runtime ID, e.g. container-id:22222
- container-name: Container name, e.g. container-name:foobar
- pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
- docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid modify endpoint request
- 404 Not Found – Endpoint does not exist
- 500 Internal Server Error – Endpoint update failed
-
DELETE
/endpoint/{id}
¶ Delete endpoint
Deletes the endpoint specified by the ID. Deletion is immediate and atomic; if the deletion request is valid and the endpoint exists, deletion will occur even if errors are encountered in the process. If errors have been encountered, the code 206 will be returned, otherwise 200 on success.
All resources associated with the endpoint will be freed and the workload represented by the endpoint will be disconnected. It will no longer be able to initiate or receive communications of any sort.
Parameters: - id (string) – String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes, with the exception of the local Cilium UUID, which is assigned to all endpoints.
Supported endpoint id prefixes:
- cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
- cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
- container-id: Container runtime ID, e.g. container-id:22222
- container-name: Container name, e.g. container-name:foobar
- pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
- docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes: - 200 OK – Success
- 206 Partial Content – Deleted with a number of errors encountered
- 400 Bad Request – Invalid endpoint ID format for specified type. Details in error message
- 404 Not Found – Endpoint not found
-
GET
/endpoint
¶ Retrieves a list of endpoints that have metadata matching the provided parameters.
Retrieves a list of endpoints that have metadata matching the provided parameters, or all endpoints if no parameters are provided.
Status Codes: - 200 OK – Success
- 404 Not Found – Endpoints with provided parameters not found
-
GET
/endpoint/{id}/config
¶ Retrieve endpoint configuration
Retrieves the configuration of the specified endpoint.
Parameters: - id (string) – String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes, with the exception of the local Cilium UUID, which is assigned to all endpoints.
Supported endpoint id prefixes:
- cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
- cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
- container-id: Container runtime ID, e.g. container-id:22222
- container-name: Container name, e.g. container-name:foobar
- pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
- docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes: - 200 OK – Success
- 404 Not Found – Endpoint not found
-
PATCH
/endpoint/{id}/config
¶ Modify mutable endpoint configuration
Updates the configuration of an existing endpoint and regenerates & recompiles the corresponding programs automatically.
Parameters: - id (string) – String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes, with the exception of the local Cilium UUID, which is assigned to all endpoints.
Supported endpoint id prefixes:
- cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
- cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
- container-id: Container runtime ID, e.g. container-id:22222
- container-name: Container name, e.g. container-name:foobar
- pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
- docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid configuration request
- 404 Not Found – Endpoint not found
- 500 Internal Server Error – Update failed. Details in message.
-
GET
/endpoint/{id}/labels
¶ Retrieves the list of labels associated with an endpoint.
Parameters: - id (string) – String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes, with the exception of the local Cilium UUID, which is assigned to all endpoints.
Supported endpoint id prefixes:
- cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
- cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
- container-id: Container runtime ID, e.g. container-id:22222
- container-name: Container name, e.g. container-name:foobar
- pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
- docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes: - 200 OK – Success
- 404 Not Found – Endpoint not found
-
PATCH
/endpoint/{id}/labels
¶ Set label configuration of endpoint
Sets labels associated with an endpoint. These can be user provided or derived from the orchestration system.
Parameters: - id (string) – String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes, with the exception of the local Cilium UUID, which is assigned to all endpoints.
Supported endpoint id prefixes:
- cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
- cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
- container-id: Container runtime ID, e.g. container-id:22222
- container-name: Container name, e.g. container-name:foobar
- pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
- docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes: - 200 OK – Success
- 404 Not Found – Endpoint not found
- 500 Internal Server Error – Error while updating labels
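The cilium endpoint labels command described later in this document uses this API. As an illustrative sketch (the endpoint ID and label are hypothetical), a user-provided label could be added with:
$ cilium endpoint labels 5421 -a env=prod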
-
GET
/endpoint/{id}/log
¶ Retrieves the status logs associated with this endpoint.
Parameters: - id (string) – String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes, with the exception of the local Cilium UUID, which is assigned to all endpoints.
Supported endpoint id prefixes:
- cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
- cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
- container-id: Container runtime ID, e.g. container-id:22222
- container-name: Container name, e.g. container-name:foobar
- pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
- docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid identity provided
- 404 Not Found – Endpoint not found
-
GET
/endpoint/{id}/healthz
¶ Retrieves the health of this endpoint.
Parameters: - id (string) – String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes, with the exception of the local Cilium UUID, which is assigned to all endpoints.
Supported endpoint id prefixes:
- cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
- cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
- container-id: Container runtime ID, e.g. container-id:22222
- container-name: Container name, e.g. container-name:foobar
- pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
- docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid identity provided
- 404 Not Found – Endpoint not found
-
GET
/identity
¶ Retrieves a list of identities that have metadata matching the provided parameters.
Retrieves a list of identities that have metadata matching the provided parameters, or all identities if no parameters are provided.
Status Codes: - 200 OK – Success
- 404 Not Found – Identities with provided parameters not found
- 520 – Identity storage unreachable. Likely a network problem.
- 521 – Invalid identity format in storage
-
GET
/identity/{id}
¶ Retrieve identity
Parameters: - id (string) – Cluster wide unique identifier of a security identity.
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid identity provided
- 404 Not Found – Identity not found
- 520 – Identity storage unreachable. Likely a network problem.
- 521 – Invalid identity format in storage
-
POST
/ipam
¶ Allocate an IP address
Query Parameters: - family (string) –
Status Codes: - 201 Created – Success
- 502 Bad Gateway – Allocation failure
-
POST
/ipam/{ip}
¶ Allocate an IP address
Parameters: - ip (string) – IP address
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid IP address
- 409 Conflict – IP already allocated
- 500 Internal Server Error – IP allocation failure. Details in message.
- 501 Not Implemented – Allocation for address family disabled
-
DELETE
/ipam/{ip}
¶ Release an allocated IP address
Parameters: - ip (string) – IP address
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid IP address
- 404 Not Found – IP address not found
- 500 Internal Server Error – Address release failure
- 501 Not Implemented – Allocation for address family disabled
-
GET
/policy
¶ Retrieve entire policy tree
Returns the entire policy tree with all children.
Status Codes: - 200 OK – Success
- 404 Not Found – No policy rules found
-
PUT
/policy
¶ Create or update a policy (sub)tree
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid policy
- 460 – Invalid path
- 500 Internal Server Error – Policy import failed
-
DELETE
/policy
¶ Delete a policy (sub)tree
Status Codes: - 200 OK – Success
- 400 Bad Request – Invalid request
- 404 Not Found – Policy not found
- 500 Internal Server Error – Error while deleting policy
-
GET
/service/{id}
¶ Retrieve configuration of a service
Parameters: - id (integer) – ID of service
Status Codes: - 200 OK – Success
- 404 Not Found – Service not found
-
PUT
/service/{id}
¶ Create or update service
Parameters: - id (integer) – ID of service
Status Codes: - 200 OK – Updated
- 201 Created – Created
- 460 – Invalid frontend in service configuration
- 461 – Invalid backend in service configuration
- 500 Internal Server Error – Error while creating service
-
DELETE
/service/{id}
¶ Delete a service
Parameters: - id (integer) – ID of service
Status Codes: - 200 OK – Success
- 404 Not Found – Service not found
- 500 Internal Server Error – Service deletion failed
-
GET
/prefilter
¶ Retrieve list of CIDRs
Status Codes: - 200 OK – Success
- 500 Internal Server Error – Prefilter get failed
-
PATCH
/prefilter
¶ Update list of CIDRs
Status Codes: - 200 OK – Updated
- 461 – Invalid CIDR prefix
- 500 Internal Server Error – Prefilter update failed
-
GET
/debuginfo
¶ Retrieve information about the agent and evironment for debugging
Status Codes: - 200 OK – Success
- 500 Internal Server Error – DebugInfo get failed
-
GET
/map/{name}
¶ Retrieve contents of BPF map
Parameters: - name (string) – Name of map
Status Codes: - 200 OK – Success
- 404 Not Found – Map not found
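The map name in this API corresponds to the BPF map names reported by cilium map list; for example, the IP cache map can be retrieved through the CLI with:
$ cilium map get cilium_ipcache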
Command Cheatsheet¶
Cilium is controlled via a simple command-line interface. The CLI is a single application that takes subcommands, which you can find in the command reference guide.
$ cilium
CLI for interacting with the local Cilium Agent
Usage:
cilium [command]
Available Commands:
bpf Direct access to local BPF maps
cleanup Reset the agent state
completion Output shell completion code for bash
config Cilium configuration options
debuginfo Request available debugging information from agent
endpoint Manage endpoints
identity Manage security identities
kvstore Direct access to the kvstore
monitor Monitoring
policy Manage security policies
prefilter Manage XDP CIDR filters
service Manage services & loadbalancers
status Display status of daemon
version Print version information
Flags:
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
Use "cilium [command] --help" for more information about a command.
All commands and subcommands support the -h option, which provides information about the options and arguments that the subcommand accepts. If a command fails, the Cilium CLI returns a non-zero exit status.
Command utilities:¶
JSON Output¶
All list commands return a pretty-printed list with the information retrieved from the Cilium daemon. If you need something more detailed, you can use JSON output via the global option -o json:
$ cilium endpoint list -o json
Cilium also provides JSONPath support, so detailed information can be extracted. The JSONPath template reference can be found in the Kubernetes documentation.
$ cilium endpoint list -o jsonpath='{[*].id}'
29898 38939 56326
$ cilium endpoint list -o jsonpath='{range [*]}{@.id}{"="}{@.status.policy.spec.policy-enabled}{"\n"}{end}'
29898=none
38939=none
56326=none
Shell Tab-completion¶
If you use bash or zsh, the Cilium CLI can provide tab completion for subcommands. To enable tab completion, run the following command in your terminal:
$ source <(cilium completion)
To have Cilium completion loaded automatically in every shell, install it with the following:
$ echo "source <(cilium completion)" >> ~/.bashrc
Command examples:¶
Basics¶
Check the status of the agent
$ cilium status
KVStore: Ok Consul: 172.17.0.3:8300
ContainerRuntime: Ok
Kubernetes: Disabled
Cilium: Ok OK
NodeMonitor: Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
Controller Status: 6/6 healthy
Proxy Status: OK, ip 10.15.28.238, port-range 10000-20000
Cluster health: 1/1 reachable (2018-04-11T07:33:09Z)
$
Get a detailed status of the agent:
$ cilium status --all-controllers --all-health --all-redirects
KVStore: Ok Consul: 172.17.0.3:8300
ContainerRuntime: Ok
Kubernetes: Disabled
Cilium: Ok OK
NodeMonitor: Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
Controller Status: 6/6 healthy
Name Last success Last error Count Message
kvstore-lease-keepalive 2m52s ago never 0 no error
ipcache-bpf-garbage-collection 2m50s ago never 0 no error
resolve-identity-29898 2m50s ago never 0 no error
sync-identity-to-k8s-pod (29898) 50s ago never 0 no error
sync-IPv4-identity-mapping (29898) 2m49s ago never 0 no error
sync-IPv6-identity-mapping (29898) 2m49s ago never 0 no error
Proxy Status: OK, ip 10.15.28.238, port-range 10000-20000
Cluster health: 1/1 reachable (2018-04-11T07:32:09Z)
Name IP Reachable Endpoints reachable
runtime (localhost) 10.0.2.15 true false
$
Get the current agent configuration
cilium config
Policy management¶
Importing a Cilium Network Policy
cilium policy import my-policy.json
Get list of all imported policy rules
cilium policy get
Remove all policies
cilium policy delete --all
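As an illustration only, a minimal my-policy.json accepted by cilium policy import could contain a single rule allowing HTTP from one set of labels to another (the app=frontend / app=backend labels are hypothetical):
[{
    "endpointSelector": {"matchLabels": {"app": "backend"}},
    "ingress": [{
        "fromEndpoints": [{"matchLabels": {"app": "frontend"}}],
        "toPorts": [{"ports": [{"port": "80", "protocol": "TCP"}]}]
    }]
}]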
Tracing¶
Check policy enforcement between two labels on port 80:
cilium policy trace -s <app.from> -d <app.to> --dport 80
Check policy enforcement between two identities
cilium policy trace --src-identity <from-id> --dst-identity <to-id>
Check policy enforcement between two pods:
cilium policy trace --src-k8s-pod <namespace>:<pod.from> --dst-k8s-pod <namespace>:<pod.to>
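For example, filling in the label placeholders of the first trace command above (the app labels are illustrative):
cilium policy trace -s k8s:app=frontend -d k8s:app=backend --dport 80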
Monitoring¶
Monitor cilium datapath notifications
cilium monitor
Verbose output (including debug if enabled)
cilium monitor -v
Filter for only the events related to endpoint
cilium monitor --related-to=<id>
Filter for only events on layer 7
cilium monitor -t L7
Show notifications only for dropped packet events
cilium monitor --type drop
Don’t dissect packet payload, display payload in hex information
cilium monitor -v --hex
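These filters can be combined; for example, to show only drop events related to a single endpoint (the endpoint ID 29898 is taken from the earlier list output and is illustrative):
cilium monitor --type drop --related-to 29898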
Connectivity¶
Check cluster Connectivity
cilium-health status
There is also a blog post related to this tool.
Endpoints¶
Get list of all local endpoints
cilium endpoint list
Get detailed view of endpoint properties and state
cilium endpoint get <id>
Show recent endpoint specific log entries
cilium endpoint log <id>
Enable debugging output on the cilium monitor for this endpoint
cilium endpoint config <id> Debug=true
Loadbalancing¶
Get list of loadbalancer services
cilium service list
Or you can get the loadbalancer information directly from the BPF maps:
cilium bpf lb list
Add a new loadbalancer
cilium service update --frontend 127.0.0.1:80 \
--backends 127.0.0.2:90,127.0.0.3:90 \
--id 20 \
--rev 2
BPF¶
List node tunneling mapping information
cilium bpf tunnel list
Checking logs for verifier issue
journalctl -u cilium | grep -B20 -F10 Verifier
List connection tracking entries:
sudo cilium bpf ct list global
Flush connection tracking entries:
sudo cilium bpf ct flush
List proxy configuration:
sudo cilium bpf proxy list
Kubernetes examples:¶
If you are running Cilium on top of Kubernetes, you may also want a way to list all Cilium endpoints or policies with a single kubectl command. Cilium provides this information through Kubernetes Custom Resource Definitions (CRDs):
Policies¶
In Kubernetes you can use two kinds of policies: Kubernetes Network Policies or
Cilium Network Policies. Both can be retrieved with the kubectl
command:
kubectl get netpol
$ kubectl get cnp
NAME AGE
rule1 3m
$ kubectl get cnp rule1
NAME AGE
rule1 3m
$ kubectl get cnp rule1 -o json
Endpoints¶
To retrieve a list of all endpoints managed by Cilium, the Cilium Endpoint (cep)
resource can be used.
$ kubectl get cep
NAME AGE
34e299f0-b25c2fef 41s
34e299f0-dd86986c 42s
4d088f48-83e4f98d 2m
4d088f48-d04ab55f 2m
5c6211b5-9217a4d1 1m
5c6211b5-dccc3d24 1m
700e0976-6cb50b02 3m
700e0976-afd3a30c 3m
78092a35-4874ed16 1m
78092a35-4b08b92b 1m
9b74f61f-14571299 7s
9b74f61f-f9a96f4a 7s
$ kubectl get cep 700e0976-6cb50b02 -o json
$ kubectl get cep -o jsonpath='{range .items[*]}{@.status.id}{"="}{@.status.status.policy.spec.policy-enabled}{"\n"}{end}'
30391=ingress
5766=ingress
51796=none
40355=none
Microscope¶
Cilium also provides the ability to monitor all cilium-managed connections in the Kubernetes cluster via Microscope. It is a distributed monitor that connects to all Cilium instances and retrieves monitor information from each node.
Microscope can be installed and run as a pod; the basic usage is the following:
$ kubectl apply -f https://github.com/cilium/microscope/blob/master/docs/microscope.yaml
$ kubectl exec -n kube-system microscope -- microscope -h
More information about Cilium Microscope options can be found on the project homepage: cilium/microscope
Command Reference¶
cilium-agent¶
Run the cilium agent
Options¶
--access-log string Path to access log of supported L7 requests observed
--agent-labels stringSlice Additional labels to identify this agent
--allow-localhost string Policy when to allow local stack to reach local endpoints { auto | always | policy } (default "auto")
--auto-ipv6-node-routes Automatically adds IPv6 L3 routes to reach other nodes for non-overlay mode (--device) (BETA)
--bpf-root string Path to BPF filesystem
--cluster-id int Unique identifier of the cluster
--cluster-name string Name of the cluster (default "default")
--clustermesh-config string Path to the ClusterMesh configuration directory
--config string Configuration file (default "$HOME/ciliumd.yaml")
--conntrack-garbage-collector-interval uint Garbage collection interval for the connection tracking table (in seconds) (default 60)
--container-runtime stringSlice Sets the container runtime(s) used by Cilium { containerd | crio | docker | none | auto } ( "auto" uses the container runtime found in the order: "docker", "containerd", "crio" ) (default [auto])
--container-runtime-endpoint map Container runtime(s) endpoint(s). (default: --container-runtime-endpoint=containerd=/var/run/containerd/containerd.sock, --container-runtime-endpoint=crio=/var/run/crio.sock, --container-runtime-endpoint=docker=unix:///var/run/docker.sock) (default map[])
-D, --debug Enable debugging mode
--debug-verbose stringSlice List of enabled verbose debug groups
-d, --device string Device facing cluster/external network for direct L3 (non-overlay mode) (default "undefined")
--disable-conntrack Disable connection tracking
--disable-ipv4 Disable IPv4 mode
--disable-k8s-services Disable east-west K8s load balancing by cilium
-e, --docker string Path to docker runtime socket (DEPRECATED: use container-runtime-endpoint instead) (default "unix:///var/run/docker.sock")
--enable-policy string Enable policy enforcement (default "default")
--enable-tracing Enable tracing while determining policy (debugging)
--envoy-log string Path to a separate Envoy log file, if any
--fixed-identity-mapping map Key-value for the fixed identity mapping which allows to use reserved label for fixed identities (default map[])
--ipv4-cluster-cidr-mask-size int Mask size for the cluster wide CIDR (default 8)
--ipv4-node string IPv4 address of node (default "auto")
--ipv4-range string Per-node IPv4 endpoint prefix, e.g. 10.16.0.0/16 (default "auto")
--ipv4-service-range string Kubernetes IPv4 services CIDR if not inside cluster prefix (default "auto")
--ipv6-cluster-alloc-cidr string IPv6 /64 CIDR used to allocate per node endpoint /96 CIDR (default "f00d::/64")
--ipv6-node string IPv6 address of node (default "auto")
--ipv6-range string Per-node IPv6 endpoint prefix, must be /96, e.g. fd02:1:1::/96 (default "auto")
--ipv6-service-range string Kubernetes IPv6 services CIDR if not inside cluster prefix (default "auto")
--k8s-api-server string Kubernetes api address server (for https use --k8s-kubeconfig-path instead)
--k8s-kubeconfig-path string Absolute path of the kubernetes kubeconfig file
--k8s-require-ipv4-pod-cidr Require IPv4 PodCIDR to be specified in node resource
--k8s-require-ipv6-pod-cidr Require IPv6 PodCIDR to be specified in node resource
--keep-bpf-templates Do not restore BPF template files from binary
--keep-config When restoring state, keeps containers' configuration in place
--kvstore string Key-value store type
--kvstore-opt map Key-value store options (default map[])
--label-prefix-file string Valid label prefixes file path
--labels stringSlice List of label prefixes used to determine identity of an endpoint
--lb string Enables load balancer mode where load balancer bpf program is attached to the given interface
--lib-dir string Directory path to store runtime build environment (default "/var/lib/cilium")
--log-driver stringSlice Logging endpoints to use for example syslog, fluentd
--log-opt map Log driver options for cilium (default map[])
--logstash Enable logstash integration
--logstash-agent string Logstash agent address (default "127.0.0.1:8080")
--logstash-probe-timer uint32 Logstash probe timer (seconds) (default 10)
--masquerade Masquerade packets from endpoints leaving the host (default true)
--monitor-aggregation string Level of monitor aggregation for traces from the datapath (default "None")
--mtu int Overwrite auto-detected MTU of underlying network (default 1500)
--nat46-range string IPv6 prefix to map IPv4 addresses to (default "0:0:0:0:0:FFFF::/96")
--pprof Enable serving the pprof debugging API
--prefilter-device string Device facing external network for XDP prefiltering (default "undefined")
--prefilter-mode string Prefilter mode { native | generic } (default: native) (default "native")
--prometheus-serve-addr string IP:Port on which to serve prometheus metrics (pass ":Port" to bind on all interfaces, "" is off)
--restore Restores state, if possible, from previous daemon (default true)
--sidecar-istio-proxy-image string Regular expression matching compatible Istio sidecar istio-proxy container image names (default "cilium/istio_proxy")
--single-cluster-route Use a single cluster route instead of per node routes
--socket-path string Sets daemon's socket path to listen for connections (default "/var/run/cilium/cilium.sock")
--state-dir string Directory path to store runtime state (default "/var/run/cilium")
--trace-payloadlen int Length of payload to capture when tracing (default 128)
-t, --tunnel string Tunnel mode {vxlan, geneve, disabled} (default "vxlan")
--version Print version information
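As a hedged sketch of how these options combine (the interface name, consul address, and routing mode are illustrative and assume a local consul agent; whether this suits your environment depends on your setup), an agent could be started in direct-routing mode on a specific device with tunneling disabled:
cilium-agent --kvstore consul --kvstore-opt consul.address=127.0.0.1:8500 --device eth0 --tunnel disabled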
cilium¶
cilium¶
CLI
Synopsis¶
CLI for interacting with the local Cilium Agent
Options¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf - Direct access to local BPF maps
- cilium cleanup - Reset the agent state
- cilium completion - Output shell completion code for bash
- cilium config - Cilium configuration options
- cilium debuginfo - Request available debugging information from agent
- cilium endpoint - Manage endpoints
- cilium identity - Manage security identities
- cilium kvstore - Direct access to the kvstore
- cilium map - Access BPF maps
- cilium monitor - Display BPF program events
- cilium node - Manage cluster nodes
- cilium policy - Manage security policies
- cilium prefilter - Manage XDP CIDR filters
- cilium service - Manage services & loadbalancers
- cilium status - Display status of daemon
- cilium version - Print version information
cilium bpf¶
Direct access to local BPF maps
Synopsis¶
Direct access to local BPF maps
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium - CLI
- cilium bpf ct - Connection tracking tables
- cilium bpf endpoint - Local endpoint map
- cilium bpf ipcache - Manage the IPCache mappings for IP/CIDR <-> Identity
- cilium bpf lb - Load-balancing configuration
- cilium bpf metrics - BPF datapath traffic metrics
- cilium bpf policy - Manage policy related BPF maps
- cilium bpf proxy - Proxy configuration
- cilium bpf tunnel - Tunnel endpoint map
cilium bpf ct¶
Connection tracking tables
Synopsis¶
Connection tracking tables
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf - Direct access to local BPF maps
- cilium bpf ct flush - Flush all connection tracking entries
- cilium bpf ct list - List connection tracking entries
cilium bpf ct flush¶
Flush all connection tracking entries
Synopsis¶
Flush all connection tracking entries
cilium bpf ct flush ( <endpoint identifier> | global )
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf ct - Connection tracking tables
cilium bpf ct list¶
List connection tracking entries
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf ct - Connection tracking tables
cilium bpf endpoint¶
Local endpoint map
Synopsis¶
Local endpoint map
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf - Direct access to local BPF maps
- cilium bpf endpoint delete - Delete local endpoint entries
- cilium bpf endpoint list - List local endpoint entries
cilium bpf endpoint delete¶
Delete local endpoint entries
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf endpoint - Local endpoint map
cilium bpf endpoint list¶
List local endpoint entries
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf endpoint - Local endpoint map
cilium bpf ipcache¶
Manage the IPCache mappings for IP/CIDR <-> Identity
Synopsis¶
Manage the IPCache mappings for IP/CIDR <-> Identity
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf - Direct access to local BPF maps
- cilium bpf ipcache list - List endpoint IPs (local and remote) and their corresponding security identities
cilium bpf ipcache list¶
List endpoint IPs (local and remote) and their corresponding security identities
Synopsis¶
List endpoint IPs (local and remote) and their corresponding security identities.
Note that for Linux kernel versions between 4.11 and 4.15 inclusive, the native LPM map type used for implementing the IPCache does not provide the ability to walk / dump the entries, so on these kernel versions this tool will never return any entries, even if entries exist in the map.
cilium bpf ipcache list
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf ipcache - Manage the IPCache mappings for IP/CIDR <-> Identity
cilium bpf lb¶
Load-balancing configuration
Synopsis¶
Load-balancing configuration
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf - Direct access to local BPF maps
- cilium bpf lb list - List load-balancing configuration
cilium bpf lb list¶
List load-balancing configuration
Options¶
-o, --output string json| jsonpath='{}'
--revnat List reverse NAT entries
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf lb - Load-balancing configuration
cilium bpf metrics¶
BPF datapath traffic metrics
Synopsis¶
BPF datapath traffic metrics
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf - Direct access to local BPF maps
- cilium bpf metrics list - List BPF datapath traffic metrics
cilium bpf metrics list¶
List BPF datapath traffic metrics
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf metrics - BPF datapath traffic metrics
cilium bpf policy¶
Manage policy related BPF maps
Synopsis¶
Manage policy related BPF maps
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf - Direct access to local BPF maps
- cilium bpf policy add - Add/update policy entry
- cilium bpf policy delete - Delete a policy entry
- cilium bpf policy get - List contents of a policy BPF map
cilium bpf policy add¶
Add/update policy entry
Synopsis¶
Add/update policy entry
cilium bpf policy add <endpoint id> <traffic-direction> <identity> [port/proto]
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf policy - Manage policy related BPF maps
cilium bpf policy delete¶
Delete a policy entry
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf policy - Manage policy related BPF maps
cilium bpf policy get¶
List contents of a policy BPF map
Options¶
-n, --numeric Do not resolve IDs
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf policy - Manage policy related BPF maps
cilium bpf proxy¶
Proxy configuration
Synopsis¶
Proxy configuration
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf - Direct access to local BPF maps
- cilium bpf proxy flush - Flush all proxy entries
- cilium bpf proxy list - List proxy configuration
cilium bpf proxy flush¶
Flush all proxy entries
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf proxy - Proxy configuration
cilium bpf proxy list¶
List proxy configuration
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf proxy - Proxy configuration
cilium bpf tunnel¶
Tunnel endpoint map
Synopsis¶
Tunnel endpoint map
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf - Direct access to local BPF maps
- cilium bpf tunnel list - List tunnel endpoint entries
cilium bpf tunnel list¶
List tunnel endpoint entries
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium bpf tunnel - Tunnel endpoint map
cilium cleanup¶
Reset the agent state
Options¶
-f, --force Skip confirmation
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
cilium completion¶
Output shell completion code for bash
Examples¶
# Installing bash completion on macOS using homebrew
## If running Bash 3.2 included with macOS
brew install bash-completion
## or, if running Bash 4.1+
brew install bash-completion@2
## afterwards you only need to run
cilium completion bash > $(brew --prefix)/etc/bash_completion.d/cilium
# Installing bash completion on Linux
## Load the cilium completion code for bash into the current shell
source <(cilium completion bash)
## Write bash completion code to a file and source if from .bash_profile
cilium completion bash > ~/.cilium/completion.bash.inc
printf "
# Cilium shell completion
source '$HOME/.cilium/completion.bash.inc'
" >> $HOME/.bash_profile
source $HOME/.bash_profile
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
cilium config¶
Cilium configuration options
Options¶
--list-options List available options
-n, --num-pages int Number of pages for perf ring buffer. New values have to be > 0
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
cilium debuginfo¶
Request available debugging information from agent
Options¶
-f, --file string Redirect output to file
--file-per-command Generate a single file per command
--html-file string Convert default output to HTML file
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
cilium endpoint¶
Manage endpoints
Synopsis¶
Manage endpoints
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium - CLI
- cilium endpoint config - View & modify endpoint configuration
- cilium endpoint disconnect - Disconnect an endpoint from the network
- cilium endpoint get - Display endpoint information
- cilium endpoint health - View endpoint health
- cilium endpoint labels - Manage label configuration of endpoint
- cilium endpoint list - List all endpoints
- cilium endpoint log - View endpoint status log
- cilium endpoint regenerate - Force regeneration of endpoint program
cilium endpoint config¶
View & modify endpoint configuration
Synopsis¶
View & modify endpoint configuration
cilium endpoint config <endpoint id> [<option>=(enable|disable) ...]
Examples¶
endpoint config 5421 DropNotification=false TraceNotification=false
Options¶
--list-options List available options
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium endpoint - Manage endpoints
cilium endpoint disconnect¶
Disconnect an endpoint from the network
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium endpoint - Manage endpoints
cilium endpoint get¶
Display endpoint information
Synopsis¶
Display endpoint information
cilium endpoint get ( <endpoint identifier> | -l <endpoint labels> )
Examples¶
cilium endpoint get 4598
cilium endpoint get pod-name:default:foobar
cilium endpoint get -l id.baz
Options¶
-l, --labels stringSlice list of labels
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium endpoint - Manage endpoints
cilium endpoint health¶
View endpoint health
Examples¶
cilium endpoint health 5421
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium endpoint - Manage endpoints
cilium endpoint labels¶
Manage label configuration of endpoint
Options¶
-a, --add stringSlice Add/enable labels
-d, --delete stringSlice Delete/disable labels
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium endpoint - Manage endpoints
cilium endpoint list¶
List all endpoints
Options¶
--no-headers Do not print headers
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium endpoint - Manage endpoints
cilium endpoint log¶
View endpoint status log
Examples¶
cilium endpoint log 5421
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium endpoint - Manage endpoints
cilium endpoint regenerate¶
Force regeneration of endpoint program
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium endpoint - Manage endpoints
cilium identity¶
Manage security identities
Synopsis¶
Manage security identities
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium - CLI
- cilium identity get - Retrieve information about an identity
- cilium identity list - List identities
cilium identity get¶
Retrieve information about an identity
Options¶
--label stringSlice Label to lookup
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium identity - Manage security identities
cilium identity list¶
List identities
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium identity - Manage security identities
cilium kvstore¶
Direct access to the kvstore
Synopsis¶
Direct access to the kvstore
Options¶
--kvstore string kvstore type
--kvstore-opt map kvstore options (default map[])
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium - CLI
- cilium kvstore delete - Delete a key
- cilium kvstore get - Retrieve a key
- cilium kvstore set - Set a key and value
cilium kvstore delete¶
Delete a key
Examples¶
cilium kvstore delete --recursive foo
Options¶
--recursive Recursive lookup
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
--kvstore string kvstore type
--kvstore-opt map kvstore options (default map[])
SEE ALSO¶
- cilium kvstore - Direct access to the kvstore
cilium kvstore get¶
Retrieve a key
Examples¶
cilium kvstore get --recursive foo
Options¶
-o, --output string json| jsonpath='{}'
--recursive Recursive lookup
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
--kvstore string kvstore type
--kvstore-opt map kvstore options (default map[])
SEE ALSO¶
- cilium kvstore - Direct access to the kvstore
cilium kvstore set¶
Set a key and value
Examples¶
cilium kvstore set foo=bar
Options¶
--key string Key
--value string Value
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
--kvstore string kvstore type
--kvstore-opt map kvstore options (default map[])
SEE ALSO¶
- cilium kvstore - Direct access to the kvstore
cilium map¶
Access BPF maps
Synopsis¶
Access BPF maps
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium - CLI
- cilium map get - Display BPF map information
- cilium map list - List all open BPF maps
cilium map get¶
Display BPF map information
Examples¶
cilium map get cilium_ipcache
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium map - Access BPF maps
cilium map list¶
List all open BPF maps
Examples¶
cilium map list
Options¶
-o, --output string json| jsonpath='{}'
--verbose Print cache contents of all maps
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium map - Access BPF maps
cilium monitor¶
Display BPF program events
Synopsis¶
The monitor displays notifications and events emitted by the BPF programs attached to endpoints and devices. This includes:
- Dropped packet notifications
- Captured packet traces
- Debugging information
cilium monitor
Options¶
--from []uint16 Filter by source endpoint id
--hex Do not dissect, print payload in HEX
-j, --json Enable json output. Shadows -v flag
--related-to []uint16 Filter by either source or destination endpoint id
--to []uint16 Filter by destination endpoint id
-t, --type []string Filter by event types [agent capture debug drop l7 trace]
-v, --verbose Enable verbose output
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
cilium node¶
Manage cluster nodes
Synopsis¶
Manage cluster nodes
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium - CLI
- cilium node list - List nodes
cilium node list¶
List nodes
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium node - Manage cluster nodes
cilium policy¶
Manage security policies
Synopsis¶
Manage security policies
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium - CLI
- cilium policy delete - Delete policy rules
- cilium policy get - Display policy node information
- cilium policy import - Import security policy in JSON format
- cilium policy trace - Trace a policy decision
- cilium policy validate - Validate a policy
- cilium policy wait - Wait for all endpoints to have updated to a given policy revision
cilium policy delete¶
Delete policy rules
Options¶
--all Delete all policies
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium policy - Manage security policies
cilium policy get¶
Display policy node information
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium policy - Manage security policies
cilium policy import¶
Import security policy in JSON format
Examples¶
cilium policy import ~/policy.json
cilium policy import ./policies/app/
Options¶
-o, --output string json| jsonpath='{}'
--print Print policy after import
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium policy - Manage security policies
cilium policy trace¶
Trace a policy decision
Synopsis¶
Verifies if the source is allowed to consume the destination. Source / destination can be provided as an endpoint ID, security ID, Kubernetes pod, YAML file, or set of LABELs. A LABEL is represented as SOURCE:KEY[=VALUE]. dports can be, for example, 80/tcp, 53 or 23/udp. If multiple sources and / or destinations are provided, each source is tested for whether there is a policy allowing traffic between it and each destination.
cilium policy trace ( -s <label context> | --src-identity <security identity> | --src-endpoint <endpoint ID> | --src-k8s-pod <namespace:pod-name> | --src-k8s-yaml <path to YAML file> ) ( -d <label context> | --dst-identity <security identity> | --dst-endpoint <endpoint ID> | --dst-k8s-pod <namespace:pod-name> | --dst-k8s-yaml <path to YAML file> ) [--dport <port>[/<protocol>]]
Options¶
--dport stringSlice L4 destination port to search on outgoing traffic of the source label context and on incoming traffic of the destination label context
-d, --dst stringSlice Destination label context
--dst-endpoint string Destination endpoint
--dst-identity int Destination identity (default -1)
--dst-k8s-pod string Destination k8s pod ([namespace:]podname)
--dst-k8s-yaml string Path to YAML file for destination
-o, --output string json| jsonpath='{}'
-s, --src stringSlice Source label context
--src-endpoint string Source endpoint
--src-identity int Source identity (default -1)
--src-k8s-pod string Source k8s pod ([namespace:]podname)
--src-k8s-yaml string Path to YAML file for source
-v, --verbose Set tracing to TRACE_VERBOSE
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium policy - Manage security policies
cilium policy validate¶
Validate a policy
Options¶
--print Print policy after validation
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium policy - Manage security policies
cilium policy wait¶
Wait for all endpoints to have updated to a given policy revision
Synopsis¶
Wait for all endpoints to have updated to a given policy revision
cilium policy wait <revision>
Options¶
--fail-wait-time int Wait time after which command fails if endpoint regeneration fails (seconds) (default 60)
--max-wait-time int Wait time after which command fails (seconds) (default 360)
--sleep-time int Sleep interval between checks (seconds) (default 1)
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium policy - Manage security policies
cilium prefilter¶
Manage XDP CIDR filters
Synopsis¶
Manage XDP CIDR filters
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium - CLI
- cilium prefilter delete - Delete CIDR filters
- cilium prefilter list - List CIDR filters
- cilium prefilter update - Update CIDR filters
cilium prefilter delete¶
Delete CIDR filters
Options¶
--cidr stringSlice List of CIDR prefixes to delete
--revision uint Update revision
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium prefilter - Manage XDP CIDR filters
cilium prefilter list¶
List CIDR filters
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium prefilter - Manage XDP CIDR filters
cilium prefilter update¶
Update CIDR filters
Options¶
--cidr stringSlice List of CIDR prefixes to block
--revision uint Update revision
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium prefilter - Manage XDP CIDR filters
cilium service¶
Manage services & loadbalancers
Synopsis¶
Manage services & loadbalancers
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium - CLI
- cilium service delete - Delete a service
- cilium service get - Display service information
- cilium service list - List services
- cilium service update - Update a service
cilium service delete¶
Delete a service
Options¶
--all Delete all services
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium service - Manage services & loadbalancers
cilium service get¶
Display service information
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium service - Manage services & loadbalancers
cilium service list¶
List services
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium service - Manage services & loadbalancers
cilium service update¶
Update a service
Options¶
--backends stringSlice Backend address or addresses followed by optional weight (<IP:Port>[/weight])
--frontend string Frontend address
--id uint Identifier
--rev Add reverse translation (default true)
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
SEE ALSO¶
- cilium service - Manage services & loadbalancers
cilium status¶
Display status of daemon
Options¶
--all-addresses Show all allocated addresses, not just count
--all-controllers Show all controllers, not just failing
--all-health Show all health status, not just failing
--all-nodes Show all nodes, not just localhost
--all-redirects Show all redirects
--brief Only print a one-line status message
-o, --output string json| jsonpath='{}'
--verbose Equivalent to --all-addresses --all-controllers --all-nodes --all-health
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
cilium version¶
Print version information
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--config string config file (default is $HOME/.cilium.yaml)
-D, --debug Enable debug messages
-H, --host string URI to server-side API
cilium-bugtool¶
Collects agent & system information useful for bug reporting
Examples¶
# Collect information and create archive file
$ cilium-bugtool
[...]
# Collect information and serve via HTTP
$ cilium-bugtool --serve
[...]
# Collect and retrieve archive if Cilium is running in a Kubernetes pod
$ kubectl get pods --namespace kube-system
NAME READY STATUS RESTARTS AGE
cilium-kg8lv 1/1 Running 0 13m
[...]
$ kubectl -n kube-system exec cilium-kg8lv cilium-bugtool
$ kubectl cp kube-system/cilium-kg8lv:/tmp/cilium-bugtool-243785589.tar /tmp/cilium-bugtool-243785589.tar
Options¶
--archive Create archive; when false, skips deletion of the output directory (default true)
-o, --archiveType string Archive type: tar | gz (default "tar")
--config string Configuration to decide what should be run (default "./.cilium-bugtool.config")
--dry-run Create configuration file of all commands that would have been executed
--exec-timeout duration The default timeout for any cmd execution in seconds (default 30s)
-H, --host string URI to server-side API
--k8s-label string Kubernetes label for Cilium pod (default "k8s-app=cilium")
--k8s-mode Require Kubernetes pods to be found or fail
--k8s-namespace string Kubernetes namespace for Cilium pod (default "kube-system")
-p, --port int Port to use for the HTTP server (default 4444)
--serve Start HTTP server to serve static files
-t, --tmp string Path to store extracted files (default "/tmp")
cilium-health get¶
Display local cilium agent status
Options¶
-o, --output string json| jsonpath='{}'
Options inherited from parent commands¶
--admin string Expose resources over 'unix' socket, 'any' socket (default "unix")
-c, --cilium string URI to Cilium server API
-d, --daemon Run as a daemon
-D, --debug Enable debug messages
-H, --host string URI to cilium-health server API
-i, --interval uint Interval (in seconds) for periodic connectivity probes (default 60)
--log-driver stringSlice Logging endpoints to use for example syslog, fluentd
--log-opt map Log driver options for cilium-health (default map[])
-p, --passive Only respond to HTTP health checks
--pidfile string Write the PID to the specified file
SEE ALSO¶
- cilium-health - Cilium Health Agent
cilium-health ping¶
Check whether the cilium-health API is up
Options inherited from parent commands¶
--admin string Expose resources over 'unix' socket, 'any' socket (default "unix")
-c, --cilium string URI to Cilium server API
-d, --daemon Run as a daemon
-D, --debug Enable debug messages
-H, --host string URI to cilium-health server API
-i, --interval uint Interval (in seconds) for periodic connectivity probes (default 60)
--log-driver stringSlice Logging endpoints to use for example syslog, fluentd
--log-opt map Log driver options for cilium-health (default map[])
-p, --passive Only respond to HTTP health checks
--pidfile string Write the PID to the specified file
SEE ALSO¶
- cilium-health - Cilium Health Agent
cilium-health status¶
Display cilium connectivity to other nodes
Options¶
-o, --output string json | jsonpath='{}'
--probe Synchronously probe connectivity status
--succinct Print the result succinctly (one node per line)
--verbose Print more information in results
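A usage sketch (real output depends on the cluster): force a fresh connectivity probe and print one line per node:
$ cilium-health status --probe --succinct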
Options inherited from parent commands¶
--admin string Expose resources over 'unix' socket, 'any' socket (default "unix")
-c, --cilium string URI to Cilium server API
-d, --daemon Run as a daemon
-D, --debug Enable debug messages
-H, --host string URI to cilium-health server API
-i, --interval uint Interval (in seconds) for periodic connectivity probes (default 60)
--log-driver stringSlice Logging endpoints to use for example syslog, fluentd
--log-opt map Log driver options for cilium-health (default map[])
-p, --passive Only respond to HTTP health checks
--pidfile string Write the PID to the specified file
SEE ALSO¶
- cilium-health - Cilium Health Agent
cilium-health¶
Cilium Health Agent
Options¶
--admin string Expose resources over 'unix' socket, 'any' socket (default "unix")
-c, --cilium string URI to Cilium server API
-d, --daemon Run as a daemon
-D, --debug Enable debug messages
-H, --host string URI to cilium-health server API
-i, --interval uint Interval (in seconds) for periodic connectivity probes (default 60)
--log-driver stringSlice Logging endpoints to use for example syslog, fluentd
--log-opt map Log driver options for cilium-health (default map[])
-p, --passive Only respond to HTTP health checks
--pidfile string Write the PID to the specified file
SEE ALSO¶
- cilium-health get - Display local cilium agent status
- cilium-health ping - Check whether the cilium-health API is up
- cilium-health status - Display cilium connectivity to other nodes
Key-Value Store¶
Option | Description | Default |
---|---|---|
--kvstore TYPE | Key Value Store Type: (consul, etcd) | |
--kvstore-opt OPTS | Backend-specific options, see the consul and etcd sections below | |
consul¶
When using consul, the consul agent address needs to be provided with the consul.address option:
Option | Type | Description |
---|---|---|
consul.address | Address | Address of consul agent |
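A hedged example of passing this option to the agent; the consul address below is a placeholder and assumes the daemon is invoked as cilium-agent:
$ cilium-agent --kvstore consul --kvstore-opt consul.address=127.0.0.1:8500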
etcd¶
When using etcd, one of the following options needs to be provided to configure the etcd endpoints:
Option | Type | Description |
---|---|---|
etcd.address | Address | Address of etcd endpoint |
etcd.config | Path | Path to an etcd configuration file |
Example of the etcd configuration file:
---
endpoints:
- https://192.168.0.1:2379
- https://192.168.0.2:2379
ca-file: '/var/lib/cilium/etcd-ca.pem'
# In case you want client to server authentication
key-file: '/var/lib/cilium/etcd-client.key'
cert-file: '/var/lib/cilium/etcd-client.crt'
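A hedged example of pointing the agent at such a file; the path is a placeholder, and --kvstore-opt etcd.address=<addr> could be used instead when a single endpoint without TLS is sufficient:
$ cilium-agent --kvstore etcd --kvstore-opt etcd.config=/var/lib/cilium/etcd.config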
Key-Value Store¶
Cilium uses an external key-value store to exchange information across multiple Cilium instances.
Layout¶
All data is stored under a common key prefix:
Prefix | Description |
---|---|
cilium/ | All keys share this common prefix. |
cilium/state/ | State stored by agents; data is automatically recreated on removal or corruption. |
Cluster Nodes¶
Every agent will register itself as a node in the kvstore and make the following information available to other agents:
- Name
- IP addresses of the node
- Health checking IP addresses
- Allocation range of endpoints on the node
Key | Value |
---|---|
cilium/state/nodes/v1/<cluster name>/<node name> | node.Node |
All node keys are attached to a lease owned by the agent of the respective node.
Leases¶
With a few exceptions, all keys in the key-value store are owned by a particular agent running on a node. All such keys have a lease attached. The lease is renewed automatically. When the lease expires, the key is removed from the key-value store. This guarantees that keys are removed from the key-value store in the event that an agent dies on a particular node and never reappears.
The lease lifetime is set to 15 minutes. The exact expiration behavior depends on the kvstore implementation, but expiration typically occurs after double the lifetime.
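When etcd is the backing store, the leases described above can be inspected directly with etcdctl (a sketch assuming the etcd v3 API; the lease ID shown is a placeholder):
# List active leases, then show the remaining TTL and attached keys for one of them
$ etcdctl lease list
$ etcdctl lease timetolive 694d71ddacfda227 --keys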
Debugging¶
The contents stored in the kvstore can be queried and manipulated using the cilium kvstore command. For additional details, see the command reference.
Example:
$ cilium kvstore get --recursive cilium/state/nodes/
cilium/state/nodes/v1/default/runtime1 => {"Name":"runtime1","IPAddresses":[{"AddressType":"InternalIP","IP":"10.0.2.15"}],"IPv4AllocCIDR":{"IP":"10.11.0.0","Mask":"//8AAA=="},"IPv6AllocCIDR":{"IP":"f00d::a0f:0:0:0","Mask":"//////////////////8AAA=="},"IPv4HealthIP":"","IPv6HealthIP":""}
Further Reading¶
Presentations¶
- DockerCon, Austin TX, Apr 2017 - Cilium - Network and Application Security with BPF and XDP: Slides, Video
- CNCF/KubeCon Meetup, Berlin, Mar 2017 - Linux Native, HTTP Aware Network Security: Slides, Video
- Docker Distributed Systems Summit, Berlin, Oct 2016: Slides, Video
- NetDev1.2, Tokyo, Sep 2016 - cls_bpf/eBPF updates since netdev 1.1: Slides, Video
- NetDev1.2, Tokyo, Sep 2016 - Advanced programmability and recent updates with tc’s cls_bpf: Slides, Video
- ContainerCon NA, Toronto, Aug 2016 - Fast IPv6 container networking with BPF & XDP: Slides
Podcasts¶
Glossary¶
Cilium has some terms with special meanings. These should all be covered throughout the documentation but for convenience we have also listed some of them below with short descriptions. If you need more information, please ask us on Slack. Feel free to extend this document with words you expected to see here.
- CNI
- https://github.com/containernetworking/cni
- ConfigMap
- https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/
- CustomResourceDefinition
- https://kubernetes.io/docs/concepts/api-extension/custom-resources/#customresourcedefinitions
- DaemonSet
- https://kubernetes.io/docs/admin/daemons/
- Endpoint
- One or more application containers which can be addressed by an individual IP address; see the Endpoint section of this documentation.
- Geneve
- https://tools.ietf.org/html/draft-ietf-nvo3-geneve-04
- HeadlessServices
- https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
- iproute2
- https://www.kernel.org/pub/linux/utils/net/iproute2/
- Linux kernel
- https://www.kernel.org/
- llvm
- http://releases.llvm.org/
- NodeSelector
- https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
- Pod
- https://kubernetes.io/docs/concepts/workloads/pods/pod/
- Policy
- A Cilium policy consists of a list of rules. The security policy can be specified in the Kubernetes NetworkPolicy format or the Cilium policy language.
- RBAC
- https://kubernetes.io/docs/admin/authorization/rbac/
- Slack channel
- Public community Slack channel for everyone to ask questions: https://cilium.herokuapp.com
- ThirdPartyResource
- https://kubernetes.io/docs/tasks/access-kubernetes-api/migrate-third-party-resource/
- Volumes
- https://kubernetes.io/docs/tasks/configure-pod-container/configure-volume-storage/
- VXLAN
- https://tools.ietf.org/html/rfc7348