Welcome to Cilium’s documentation!

The documentation is divided into the following sections:

  • Getting Started Guides: Provides a simple tutorial for running a small Cilium setup on your laptop. Intended as an easy way to get your hands dirty applying Cilium security policies between containers.
  • Concepts: Describes the components of Cilium, and the different models for deploying Cilium. Provides the high-level understanding required to run a full Cilium deployment and understand its behavior.
  • Architecture: Describes the components of the Cilium architecture and how these components integrate with existing architectures, such as Kubernetes.
  • Installation: Detailed instructions for installing, configuring, and troubleshooting Cilium in different deployment modes.
  • Network Policy: Detailed walkthrough of the policy language structure and the supported formats.
  • Monitoring & Metrics: Instructions for configuring metrics collection from Cilium.
  • Troubleshooting: Describes how to troubleshoot Cilium in different deployment modes.
  • BPF and XDP Reference Guide: Provides a technical deep dive into BPF and XDP technology, aimed primarily at developers.
  • API Reference: Details the Cilium agent API for interacting with a local Cilium instance.
  • Development Guide: Gives background to those looking to develop and contribute modifications to the Cilium code or documentation.

Introduction to Cilium

What is Cilium?

Cilium is open source software for transparently securing the network connectivity between application services deployed using Linux container management platforms like Docker and Kubernetes.

At the foundation of Cilium is a new Linux kernel technology called BPF, which enables the dynamic insertion of powerful security visibility and control logic within Linux itself. Because BPF runs inside the Linux kernel, Cilium security policies can be applied and updated without any changes to the application code or container configuration.

Why Cilium?

The development of modern datacenter applications has shifted to a service-oriented architecture often referred to as microservices, wherein a large application is split into small independent services that communicate with each other via APIs using lightweight protocols like HTTP. Microservices applications tend to be highly dynamic, with individual containers getting started or destroyed as the application scales out / in to adapt to load changes and during rolling updates that are deployed as part of continuous delivery.

This shift toward highly dynamic microservices presents both a challenge and an opportunity in terms of securing connectivity between microservices. Traditional Linux network security approaches (e.g., iptables) filter on IP address and TCP/UDP ports, but IP addresses frequently churn in dynamic microservices environments. The highly volatile life cycle of containers causes these approaches to struggle to scale side by side with the application, as load-balancing tables and access control lists carrying hundreds of thousands of rules must be updated with ever-increasing frequency. Protocol ports (e.g. TCP port 80 for HTTP traffic) can no longer be used to differentiate between application traffic for security purposes, as the same port carries a wide range of messages across services.

An additional challenge is providing accurate visibility: traditional systems use IP addresses as their primary identification vehicle, and in microservices architectures those addresses may live for only a few seconds.

By leveraging Linux BPF, Cilium retains the ability to transparently insert security visibility and enforcement, but does so based on service / pod / container identity (in contrast to IP address identification in traditional systems) and can filter on the application layer (e.g. HTTP). As a result, Cilium not only makes it simple to apply security policies in a highly dynamic environment by decoupling security from addressing, but can also provide stronger security isolation by operating at the HTTP layer in addition to providing traditional Layer 3 and Layer 4 segmentation.

The use of BPF enables Cilium to achieve all of this in a way that is highly scalable even for large-scale environments.

Functionality Overview

Protect and secure APIs transparently

Ability to secure modern application protocols such as REST/HTTP, gRPC and Kafka. Traditional firewalls operate at Layers 3 and 4: a protocol running on a particular port is either completely trusted or blocked entirely. Cilium provides the ability to filter on individual application protocol requests such as:

  • Allow all HTTP requests with method GET and path /public/.*. Deny all other requests.
  • Allow service1 to produce on Kafka topic topic1 and service2 to consume on topic1. Reject all other Kafka messages.
  • Require the HTTP header X-Token: [0-9]+ to be present in all REST calls.

See the section Layer 7 Policy in our documentation for the latest list of supported protocols and examples on how to use it.
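
As an illustration, the first rule above could be written as a CiliumNetworkPolicy along the lines of the following sketch; the port and the app: myservice label are hypothetical placeholders for your own workload:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-public-get
spec:
  endpointSelector:
    matchLabels:
      app: myservice          # hypothetical label of the protected service
  ingress:
  - toPorts:
    - ports:
      - port: "80"            # example port the service listens on
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/public/.*"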

Secure service to service communication based on identities

Modern distributed applications rely on technologies such as application containers to facilitate agility in deployment and scale out on demand. This results in a large number of application containers being started in a short period of time. Typical container firewalls secure workloads by filtering on source IP addresses and destination ports. This concept requires the firewalls on all servers to be manipulated whenever a container is started anywhere in the cluster.

In order to avoid this situation which limits scale, Cilium assigns a security identity to groups of application containers which share identical security policies. The identity is then associated with all network packets emitted by the application containers, allowing the receiving node to validate the identity. Security identity management is performed using a key-value store.
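
As a minimal sketch of such a label-based rule, a policy allowing only containers with a hypothetical role=frontend label to reach role=backend containers could look like this:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  endpointSelector:
    matchLabels:
      role: backend           # hypothetical label identifying the destination group
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend        # hypothetical label identifying the allowed source group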

Secure access to and from external services

Label-based security is the tool of choice for cluster-internal access control. In order to secure access to and from external services, traditional CIDR-based security policies for both ingress and egress are supported. This makes it possible to limit access to and from application containers to particular IP ranges.
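
For illustration, a minimal egress sketch that limits a hypothetical role=backend group to a documentation IP range might look like:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-to-external-range
spec:
  endpointSelector:
    matchLabels:
      role: backend           # hypothetical label of the workloads being restricted
  egress:
  - toCIDR:
    - 192.0.2.0/24            # example range from the documentation address block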

Simple Networking

A simple flat Layer 3 network with the ability to span multiple clusters connects all application containers. IP allocation is kept simple by using host scope allocators. This means that each host can allocate IPs without any coordination between hosts.

The following multi node networking models are supported:

  • Overlay: Encapsulation-based virtual network spanning all hosts. Currently VXLAN and Geneve are baked in but all encapsulation formats supported by Linux can be enabled.

    When to use this mode: This mode has minimal infrastructure and integration requirements. It works on almost any network infrastructure as the only requirement is IP connectivity between hosts which is typically already given.

  • Native Routing: Use of the regular routing table of the Linux host. The network must be capable of routing the IP addresses of the application containers.

    When to use this mode: This mode is for advanced users and requires some awareness of the underlying networking infrastructure. This mode works well with:

    • Native IPv6 networks
    • In conjunction with cloud network routers
    • If you are already running routing daemons

Load balancing

Distributed load balancing for traffic between application containers and to external services. Load balancing is implemented in BPF using efficient hash tables, allowing for almost unlimited scale, and supports direct server return (DSR) if the load-balancing operation is not performed on the source host. Note: load balancing requires connection tracking to be enabled. This is the default.

Monitoring and Troubleshooting

The ability to gain visibility and to troubleshoot issues is fundamental to the operation of any distributed system. While we learned to love tools like tcpdump and ping and while they will always find a special place in our hearts, we strive to provide better tooling for troubleshooting. This includes tooling to provide:

  • Event monitoring with metadata: When a packet is dropped, the tool doesn’t just report the source and destination IP of the packet; it provides the full label information of both the sender and receiver, among a lot of other information.
  • Policy decision tracing: Why is a packet being dropped or a request rejected? The policy tracing framework makes it possible to trace the policy decision process both for running workloads and based on arbitrary label definitions.
  • Metrics export via Prometheus: Key metrics are exported via Prometheus for integration with your existing dashboards.
  • Hubble: An observability platform specifically written for Cilium. It provides service dependency maps, operational monitoring and alerting, and application and security visibility based on flow logs.

Integrations

Getting Started Guides

The following is a list of guides that help you get started with Cilium. The guides cover the installation and then dive into more detailed topics such as securing clusters, connecting multiple clusters, monitoring, and troubleshooting. If you are new to Cilium it is recommended to read the Introduction to Cilium section first to learn about the basic concepts and motivation.

Installation

Creating a Sandbox environment

Getting Started Using Minikube

This guide uses minikube to demonstrate deployment and operation of Cilium in a single-node Kubernetes cluster. The minikube VM requires approximately 5GB of RAM and supports hypervisors like VirtualBox that run on Linux, macOS, and Windows.

Install kubectl & minikube
  1. Install kubectl version >= v1.10.0 as described in the Kubernetes Docs.
  2. Install minikube >= v1.3.1 as per minikube documentation: Install Minikube.

Note

It is important to validate that you have minikube v1.3.1 installed. Older versions of minikube ship a kernel configuration that is not compatible with the TPROXY requirements of Cilium >= 1.6.0.

minikube version
minikube version: v1.3.1
commit: ca60a424ce69a4d79f502650199ca2b52f29e631
  3. Create a minikube cluster:
minikube start --network-plugin=cni --memory=4096
  4. Mount the BPF filesystem:
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf

Note

To install Cilium for a specific Kubernetes version, append the --kubernetes-version vx.y.z parameter to the minikube start command when bootstrapping the local cluster. By default, minikube will install the most recent version of Kubernetes.
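
For example, to bootstrap the cluster with a specific Kubernetes version (the version shown here is only an example):

minikube start --network-plugin=cni --memory=4096 --kubernetes-version=v1.16.3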

Install Cilium

Install Cilium as a DaemonSet into your new Kubernetes cluster. The DaemonSet will automatically install itself as the Kubernetes CNI plugin.

Note

If you are installing Cilium with CRI-O, please see the CRI-O instructions.

kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml
Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

This deploys a series of deployments that use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant, and the readiness and liveness gates indicate success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
Next steps

Now that you have a Kubernetes cluster with Cilium up and running, you can take a couple of next steps to explore various capabilities:

Getting Started Using MicroK8s

This guide uses microk8s to demonstrate deployment and operation of Cilium in a single-node Kubernetes cluster. To run Cilium inside microk8s, a GNU/Linux distribution with kernel 4.9 or later is required (per the System Requirements).

Install microk8s
  1. Install microk8s >= 1.15 as per microk8s documentation: MicroK8s User guide.

  2. Enable the microk8s Cilium service

    microk8s.enable cilium
    
  3. Cilium is now configured! The cilium CLI is provided as microk8s.cilium.

Next steps

Now that you have a Kubernetes cluster with Cilium up and running, you can take a couple of next steps to explore various capabilities:

Getting Started Using K3s

This guide walks you through installation of Cilium on K3s, a highly available, certified Kubernetes distribution designed for production workloads in unattended, resource-constrained, remote locations or inside IoT appliances.

This guide assumes installation on amd64 architecture. Cilium is presently supported on amd64 architecture with ARM support planned for a future release.

Install a Master Node

The first step is to install a K3s master node making sure to disable support for the default CNI plugin:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='--flannel-backend=none --no-flannel' sh -
Install Agent Nodes (Optional)

K3s can run in standalone mode or as a cluster making it a great choice for local testing with multi-node data paths. Agent nodes are joined to the master node using a node-token which can be found on the master node at /var/lib/rancher/k3s/server/node-token.

Install K3s on agent nodes and join them to the master node making sure to replace the variables with values from your environment:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC='--no-flannel' K3S_URL="https://${MASTER_IP}:6443" K3S_TOKEN="${NODE_TOKEN}" sh -

Should you encounter any issues during the installation, please refer to the Troubleshooting section and / or seek help on the Slack channel.

Please consult the Kubernetes Requirements for information on how you need to configure your Kubernetes cluster to operate with Cilium.

Mount the BPF Filesystem

On each node, run the following to mount the BPF Filesystem:

sudo mount bpffs -t bpf /sys/fs/bpf
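
If you want this mount to survive reboots, one common approach (an optional sketch, not part of the official steps) is to add an fstab entry on each node:

echo "bpffs /sys/fs/bpf bpf defaults 0 0" | sudo tee -a /etc/fstab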
Install Cilium
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml
Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

This deploys a series of deployments that use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant, and the readiness and liveness gates indicate success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
Next Steps

Now that you have a Kubernetes cluster with Cilium up and running, you can take a couple of next steps to explore various capabilities:

Getting Started Using Kind

This guide uses kind to demonstrate deployment and operation of Cilium in a multi-node Kubernetes cluster.

Kind requires docker to be installed and running.

Install Dependencies
  1. Install docker stable as described in: Install Docker Engine
  2. Install kubectl version >= v1.14.0 as described in the Kubernetes Docs
  3. Install helm >= v3.0.3 per Helm documentation: Installing Helm
  4. Install kind >= v0.7.0 per kind documentation: Installation and Usage
Kind Configuration

Kind doesn’t use flags for configuration. Instead it uses YAML configuration that is very similar to Kubernetes.

Create a kind-config.yaml file based on the following template. The template will create a cluster with one control-plane node and three worker nodes, running the default Kubernetes version bundled with the kind release you installed.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
networking:
  disableDefaultCNI: true

To change the version of Kubernetes being run, an image has to be defined for each node. See the Node Configuration documentation; a minimal sketch is shown below.
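
A minimal sketch of pinning the node image (the kindest/node tag shown is only an example; pick the tag that matches your desired Kubernetes version):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.17.0   # example tag, replace with the version you need
- role: worker
  image: kindest/node:v1.17.0
networking:
  disableDefaultCNI: true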

Start Kind

Pass the kind-config.yaml you created with the --config flag of kind.

kind create cluster --config=kind-config.yaml

This will add a kind-kind context to the file referenced by KUBECONFIG or, if unset, to ${HOME}/.kube/config:

kubectl cluster-info --context kind-kind
Install Cilium

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions can coexist in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Setup Helm repository:

helm repo add cilium https://helm.cilium.io/

(optional, but recommended) Pre-load Cilium images into the kind cluster so each worker doesn’t have to pull them.

docker pull cilium/cilium:v1.8.90
kind load docker-image cilium/cilium:v1.8.90

Install Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
   --namespace kube-system \
   --set global.nodeinit.enabled=true \
   --set global.kubeProxyReplacement=partial \
   --set global.hostServices.enabled=false \
   --set global.externalIPs.enabled=true \
   --set global.nodePort.enabled=true \
   --set global.hostPort.enabled=true \
   --set global.pullPolicy=IfNotPresent
Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

This deploys a series of deployments that use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant, and the readiness and liveness gates indicate success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
Next steps

Now that you have a Kubernetes cluster with Cilium up and running, you can take a couple of next steps to explore various capabilities:

Troubleshooting
Unable to contact k8s api-server

In the Cilium agent logs you will see:

level=info msg="Establishing connection to apiserver" host="https://10.96.0.1:443" subsys=k8s
level=error msg="Unable to contact k8s api-server" error="Get https://10.96.0.1:443/api/v1/namespaces/kube-system: dial tcp 10.96.0.1:443: connect: no route to host" ipAddr="https://10.96.0.1:443" subsys=k8s
level=fatal msg="Unable to initialize Kubernetes subsystem" error="unable to create k8s client: unable to create k8s client: Get https://10.96.0.1:443/api/v1/namespaces/kube-system: dial tcp 10.96.0.1:443: connect: no route to host" subsys=daemon

As Kind is running nodes as containers in Docker, they share your host machine’s kernel. If Host-Reachable Services wasn’t disabled, the eBPF programs attached by Cilium may be out of date and no longer routing api-server requests to the current kind-control-plane container.

Recreating the kind cluster and re-running the helm command from Install Cilium will detach the stale eBPF programs.
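
A minimal sequence for doing so, assuming the default cluster name and the kind-config.yaml from above, might be:

kind delete cluster
kind create cluster --config=kind-config.yaml

Then re-run the helm install command from the Install Cilium step.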

Cluster Mesh

With Kind we can simulate Cluster Mesh in a sandbox too.

Kind Configuration

This time we need to create two configuration files, one for each Kubernetes cluster. We will explicitly configure their pod-network-cidr and service-cidr so they do not overlap.

Example kind-cluster1.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
networking:
  disableDefaultCNI: true
  podSubnet: 10.0.0.0/16
  serviceSubnet: 10.1.0.0/16

Example kind-cluster2.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
networking:
  disableDefaultCNI: true
  podSubnet: 10.2.0.0/16
  serviceSubnet: 10.3.0.0/16
Create Kind Clusters

We can now create the respective clusters:

kind create cluster --name=cluster1 --config=kind-cluster1.yaml
kind create cluster --name=cluster2 --config=kind-cluster2.yaml
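
Since kind prefixes context names with kind-, you can verify that both contexts were created before proceeding:

kubectl config get-contexts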
Deploy Cilium

This is the same helm command as in Install Cilium. However, we are enabling managed etcd and setting both cluster-name and cluster-id for each cluster.

Make sure the context is set to the kind-cluster2 cluster.

kubectl config use-context kind-cluster2
helm install cilium cilium/cilium --version 1.8.90 \
   --namespace kube-system \
   --set global.nodeinit.enabled=true \
   --set global.kubeProxyReplacement=partial \
   --set global.hostServices.enabled=false \
   --set global.externalIPs.enabled=true \
   --set global.nodePort.enabled=true \
   --set global.hostPort.enabled=true \
   --set global.etcd.enabled=true \
   --set global.etcd.managed=true \
   --set global.identityAllocationMode=kvstore \
   --set global.cluster.name=cluster2 \
   --set global.cluster.id=2

Change the kubectl context to kind-cluster1 cluster:

kubectl config use-context kind-cluster1
helm install cilium cilium/cilium --version 1.8.90 \
   --namespace kube-system \
   --set global.nodeinit.enabled=true \
   --set global.kubeProxyReplacement=partial \
   --set global.hostServices.enabled=false \
   --set global.externalIPs.enabled=true \
   --set global.nodePort.enabled=true \
   --set global.hostPort.enabled=true \
   --set global.etcd.enabled=true \
   --set global.etcd.managed=true \
   --set global.identityAllocationMode=kvstore \
   --set global.cluster.name=cluster1 \
   --set global.cluster.id=1
Setting up Cluster Mesh

We can complete setup by following the Cluster Mesh guide with Expose the Cilium etcd to other clusters. For Kind, we’ll want to deploy the NodePort service into the kube-system namespace.

Self-Managed Kubernetes

The following guides are available for installation of self-managed Kubernetes clusters. This section provides guides for installing Cilium with and without use of a kvstore (etcd). Please refer to the section Installation with external etcd for details on when etcd is required.

Quick Installation

This guide takes you through the quick installation procedure. The default settings will store all required state using Kubernetes custom resource definitions (CRDs). This is the simplest installation method as it only depends on Kubernetes and does not require additional external dependencies. It is a good option for environments up to about 250 nodes. For larger environments, or for environments that want to leverage the Cluster Mesh functionality, a kvstore is required; it can be set up by following Installation with external etcd or Installation with managed etcd.

Should you encounter any issues during the installation, please refer to the Troubleshooting section and / or seek help on the Slack channel.

Please consult the Kubernetes Requirements for information on how you need to configure your Kubernetes cluster to operate with Cilium.

Install Cilium
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml
Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

This deploys a series of deployments that use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant, and the readiness and liveness gates indicate success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
Installation with managed etcd

The standard Quick Installation guide will set up Cilium to use Kubernetes CRDs to store and propagate state between agents. Use of CRDs can impose scale limitations depending on the size of your environment. Use of etcd optimizes the propagation of state between agents. This guide explains the steps required to set up Cilium with a managed etcd where etcd is managed by an operator which maintains an etcd cluster as part of the Kubernetes cluster.

Identity allocation remains CRD-based, which means that etcd remains an optional component used to improve scalability. A failure of etcd is not critical to the availability of Cilium but will reduce the efficiency of state propagation. This allows the managed etcd to recover while relying on Cilium itself to provide connectivity and security.

Should you encounter any issues during the installation, please refer to the Troubleshooting section and / or seek help on the Slack channel.

Requirements

Make sure your Kubernetes environment is meeting the requirements:

  • Kubernetes >= 1.9
  • Linux kernel >= 4.9
  • Kubernetes in CNI mode
  • BPF filesystem mounted on all worker nodes
  • Recommended: Enable PodCIDR allocation (--allocate-node-cidrs) in the kube-controller-manager

Refer to the section Requirements for detailed instruction on how to prepare your Kubernetes environment.

Deploy Cilium + cilium-etcd-operator

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions can coexist in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Setup Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
   --namespace kube-system \
   --set global.etcd.enabled=true \
   --set global.etcd.managed=true
Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-etcd-operator-6ffbd46df9-pn6cf   1/1     Running             0          7s
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for the etcd-operator to bring up the necessary number of etcd pods to achieve quorum. Once it reaches quorum, all components should be healthy and ready:

cilium-etcd-8d95ggpjmw                  1/1     Running   0          78s
cilium-etcd-operator-6ffbd46df9-pn6cf   1/1     Running   0          4m12s
cilium-etcd-t695lgxf4x                  1/1     Running   0          118s
cilium-etcd-zw285m6t9g                  1/1     Running   0          2m41s
cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
etcd-operator-5cf67779fd-hd9j7          1/1     Running   0          2m42s
Troubleshooting
  • Make sure that kube-dns or coredns is running and healthy in the kube-system namespace. A functioning Kubernetes DNS is strictly required in order for Cilium to resolve the ClusterIP of the etcd cluster. If either kube-dns or coredns were already running before Cilium was deployed, the pods may be managed by a former CNI plugin. cilium-operator will automatically restart the pods to ensure that they are being managed by the Cilium CNI plugin. You can manually restart the pods as well if required and validate that Cilium is managing kube-dns or coredns by running:

    kubectl -n kube-system get cep
    

    You should see kube-dns-xxx or coredns-xxx pods.

  • In order for the entire system to come up, the following components have to be running at the same time:

    • kube-dns or coredns
    • cilium-xxx
    • cilium-operator-xxx
    • cilium-etcd-operator
    • etcd-operator
    • cilium-etcd-xxx

    All timeouts are configured such that this will typically work out smoothly even if some of the pods restart once or twice. If any of the above pods get into a long CrashLoopBackOff, bootstrapping can be expedited by restarting the pods to reset the CrashLoopBackOff timer.

CoreDNS: Enable reverse lookups

In order for the TLS certificates between etcd peers to work correctly, a DNS reverse lookup on a pod IP must map back to the pod name. If you are using CoreDNS, check the CoreDNS ConfigMap and validate that in-addr.arpa and ip6.arpa are listed as wildcards for the kubernetes block like this:

kubectl -n kube-system edit cm coredns
[...]
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . /etc/resolv.conf
        cache 30
    }

The contents may look different from the above. The specific configuration that matters is that in-addr.arpa and ip6.arpa are listed as wildcards next to cluster.local.

You can validate this by looking up a pod IP with the host utility from any pod:

host 10.60.20.86
86.20.60.10.in-addr.arpa domain name pointer cilium-etcd-972nprv9dp.cilium-etcd.kube-system.svc.cluster.local.
What is the cilium-etcd-operator?

The cilium-etcd-operator uses and extends the etcd-operator to guarantee quorum, auto-create certificates, and manage compaction:

  • Automatic re-creation of the etcd cluster when the cluster loses quorum. In that situation, the standard etcd-operator would refuse to bring up new etcd nodes and the etcd cluster would become unusable.
  • Automatic creation of certificates and keys. This simplifies the installation of the operator and makes the certificates and keys required to access the etcd cluster available to Cilium using a well known Kubernetes secret name.
  • Compaction is automatically handled.
Limitations

Use of the cilium-etcd-operator offers a lot of advantages, including simplicity of installation, automatic management of the etcd cluster including compaction, restart on quorum loss, and automatic use of TLS. There are several disadvantages which can become relevant as you scale up your clusters:

  • etcd nodes operated by the etcd-operator will not use persistent storage. Once the etcd cluster loses quorum, the etcd cluster is automatically re-created by the cilium-etcd-operator. Cilium will automatically recover and re-create all state in etcd. This operation can take a couple of seconds and may cause minor disruptions as ongoing distributed locks are invalidated and security identities have to be re-allocated.
  • etcd is very sensitive to disk IO latency and requires fast disk access at a certain scale. The cilium-etcd-operator will not take any measures to provide fast disk access, so performance will depend on whatever storage is provided to the pods in your Kubernetes cluster. See etcd Hardware recommendations for more details.
Installation with external etcd

This guide walks you through the steps required to set up Cilium on Kubernetes using an external etcd. Use of an external etcd provides better performance and is suitable for larger environments. If you are looking for a simple installation method to get started, refer to the section Installation with managed etcd.

Should you encounter any issues during the installation, please refer to the Troubleshooting section and / or seek help on Slack.

When do I need to use a kvstore?

Unlike the section Quick Installation, this guide explains how to configure Cilium to use an external kvstore such as etcd. If you are unsure whether you need a kvstore at all, the following are the reasons to use one:

  • If you want to use the Cluster Mesh functionality.
  • If you are running in an environment with more than 250 nodes, 5k pods, or if you observe a high overhead in state propagation caused by Kubernetes events.
  • If you do not want Cilium to store state in Kubernetes custom resources (CRDs).
Requirements

Make sure your Kubernetes environment is meeting the requirements:

  • Kubernetes >= 1.9
  • Linux kernel >= 4.9
  • Kubernetes in CNI mode
  • BPF filesystem mounted on all worker nodes
  • Recommended: Enable PodCIDR allocation (--allocate-node-cidrs) in the kube-controller-manager

Refer to the section Requirements for detailed instruction on how to prepare your Kubernetes environment.

You will also need an external etcd of version 3.1.0 or higher.
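
If you want to verify that the etcd cluster is reachable before deploying Cilium, a quick sanity check with etcdctl (using the same placeholder endpoint as the commands below) could look like:

ETCDCTL_API=3 etcdctl --endpoints=http://etcd-endpoint1:2379 endpoint health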

Configure Cilium

When using an external kvstore, the address of the external kvstore needs to be configured in the ConfigMap. This is done via the Helm options shown below:

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions can coexist in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Setup Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.etcd.enabled=true \
  --set global.etcd.endpoints[0]=http://etcd-endpoint1:2379 \
  --set global.etcd.endpoints[1]=http://etcd-endpoint2:2379
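
To double-check that the endpoints were rendered into the Cilium configuration, you can inspect the resulting ConfigMap (assuming the chart's default ConfigMap name, cilium-config):

kubectl -n kube-system get configmap cilium-config -o yaml | grep -A 5 etcd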
Optional: Configure the SSL certificates

Create a Kubernetes secret with the root certificate authority, and client-side key and certificate of etcd:

kubectl create secret generic -n kube-system cilium-etcd-secrets \
     --from-file=etcd-client-ca.crt=ca.crt \
     --from-file=etcd-client.key=client.key \
     --from-file=etcd-client.crt=client.crt

Adjust the Helm options to enable SSL for etcd and use https instead of http for the etcd endpoint URLs:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.etcd.enabled=true \
  --set global.etcd.ssl=true \
  --set global.etcd.endpoints[0]=https://etcd-endpoint1:2379 \
  --set global.etcd.endpoints[1]=https://etcd-endpoint2:2379
Validate the Installation

Verify that Cilium pods were started on each of your worker nodes:

kubectl --namespace kube-system get ds cilium
NAME            DESIRED   CURRENT   READY     NODE-SELECTOR   AGE
cilium          4         4         4         <none>          2m

kubectl -n kube-system get deployments cilium-operator
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
cilium-operator   1/1     1            1           5m25s
Installation on Microsoft Azure Cloud
Create an Azure Kubernetes cluster

Set up a Kubernetes cluster on Azure. You can use any method available as long as your Kubernetes cluster has CNI enabled in the kubelet configuration. To keep this guide simple, we will set up a managed AKS cluster:

export CLUSTER_NAME=aks-test
export LOCATION=westeurope
export RESOURCE_GROUP=aks-test
az aks create -n $CLUSTER_NAME -g $RESOURCE_GROUP -l $LOCATION --network-plugin azure

Note

When setting up AKS, it is important to use the flag --network-plugin azure to ensure that CNI mode is enabled.

Create a service principal for cilium-operator

In order to allow cilium-operator to interact with the Azure API, a service principal is required. You can reuse an existing service principal if you want but it is recommended to create a dedicated service principal for cilium-operator:

az ad sp create-for-rbac -n cilium-operator

The output will look like this (store it for later use):

{
  "appId": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
  "displayName": "cilium-operator",
  "name": "http://cilium-operator",
  "password": "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb",
  "tenant": "cccccccc-cccc-cccc-cccc-cccccccccccc"
}

Extract the relevant credentials to access the Azure API:

AZURE_SUBSCRIPTION_ID=$(az account show --query id | tr -d \")
AZURE_CLIENT_ID=aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa
AZURE_CLIENT_SECRET=bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb
AZURE_TENANT_ID=cccccccc-cccc-cccc-cccc-cccccccccccc
AZURE_NODE_RESOURCE_GROUP=$(az aks show -n $CLUSTER_NAME -g $RESOURCE_GROUP | jq -r .nodeResourceGroup)

Note

AZURE_NODE_RESOURCE_GROUP must be set to the resource group of the node pool, not the resource group of the AKS cluster.

Retrieve Credentials to access cluster
az aks get-credentials --resource-group $RESOURCE_GROUP -n $CLUSTER_NAME
Deploy Cilium

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions can coexist in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Setup Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.azure.enabled=true \
  --set global.azure.resourceGroup=$AZURE_NODE_RESOURCE_GROUP \
  --set global.azure.subscriptionID=$AZURE_SUBSCRIPTION_ID \
  --set global.azure.tenantID=$AZURE_TENANT_ID \
  --set global.azure.clientID=$AZURE_CLIENT_ID \
  --set global.azure.clientSecret=$AZURE_CLIENT_SECRET \
  --set global.tunnel=disabled \
  --set config.ipam=azure \
  --set global.masquerade=false \
  --set global.nodeinit.enabled=true
Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

This deploys a series of deployments that use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant, and the readiness and liveness gates indicate success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
Limitations
  • All VMs and VM scale sets used in a cluster must belong to the same resource group.

Managed Kubernetes

The following guides cover the installation steps for managed Kubernetes environments as offered by cloud providers. If a particular offering is not covered, the guide Installation with managed etcd has a good chance of working out of the box as well.

Installation on AWS EKS
Create an EKS Cluster

The first step is to create an EKS cluster. This guide will use eksctl but you can also follow the Getting Started with Amazon EKS guide.

Prerequisites

Ensure your AWS credentials are located in ~/.aws/credentials or are stored as environment variables.

Next, install eksctl:

# Linux
curl --silent --location "https://github.com/weaveworks/eksctl/releases/download/latest_release/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# macOS (Homebrew)
brew install weaveworks/tap/eksctl

Ensure that aws-iam-authenticator is installed and in the executable path:

which aws-iam-authenticator

If not, install it based on the AWS IAM authenticator documentation.

Create the cluster

Create an EKS cluster with eksctl; see the eksctl documentation for details on how to set credentials, change the region, VPC, cluster size, etc.

eksctl create cluster -n test-cluster -N 0

You should see something like this:

[ℹ]  using region us-west-2
[ℹ]  setting availability zones to [us-west-2b us-west-2a us-west-2c]
[...]
[✔]  EKS cluster "test-cluster" in "us-west-2" region is ready
Delete VPC CNI (aws-node DaemonSet)

Cilium will manage ENIs instead of the VPC CNI, so the aws-node DaemonSet has to be deleted to prevent conflicting behavior.

Note

Once aws-node DaemonSet is deleted, EKS will not try to restore it.

kubectl -n kube-system delete daemonset aws-node
Deploy Cilium

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions can coexist in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Setup Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.eni=true \
  --set config.ipam=eni \
  --set global.egressMasqueradeInterfaces=eth0 \
  --set global.tunnel=disabled \
  --set global.nodeinit.enabled=true

Note

This helm command sets global.eni=true and global.tunnel=disabled, meaning that Cilium will allocate a fully-routable AWS ENI IP address for each pod, similar to the behavior of the Amazon VPC CNI plugin.

Cilium can alternatively run in EKS using an overlay mode that gives pods non-VPC-routable IPs. This allows running more pods per Kubernetes worker node than the ENI limit, but means that pod connectivity to resources outside the cluster (e.g., VMs in the VPC or AWS managed services) is masqueraded (i.e., SNAT) by Cilium to use the VPC IP address of the Kubernetes worker node. Excluding the lines for global.eni=true and global.tunnel=disabled from the helm command will configure Cilium to use overlay routing mode (which is the helm default).

Scale up the cluster
eksctl get nodegroup --cluster test-cluster
CLUSTER                     NODEGROUP       CREATED                 MIN SIZE        MAX SIZE        DESIRED CAPACITY        INSTANCE TYPE   IMAGE ID
test-cluster                ng-25560078     2019-07-23T06:05:35Z    0               2               0                       m5.large        ami-0923e4b35a30a5f53
eksctl scale nodegroup --cluster test-cluster -n ng-25560078 -N 2
[ℹ]  scaling nodegroup stack "eksctl-test-cluster-nodegroup-ng-25560078" in cluster eksctl-test-cluster-cluster
[ℹ]  scaling nodegroup, desired capacity from 0 to 2
Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

It will deploy a series of deployments which will use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant and the readiness and liveness gate indicates success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
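
When you are done with the test, the connectivity-check deployments can be removed again with the same manifest:

kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml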
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
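
After the chart is installed, the Hubble pods should appear alongside Cilium; a simple check (filtering by name is just a convenience and assumes the chart's default pod names):

kubectl -n kube-system get pods | grep hubble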
Installation on Google GKE
GKE Requirements
  1. Install the Google Cloud SDK (gcloud); see Installing Google Cloud SDK.
  2. Create a project or use an existing one:
export GKE_PROJECT=gke-clusters
gcloud projects create $GKE_PROJECT
gcloud config set project $GKE_PROJECT
  3. Enable the GKE API for the project if not already done:
gcloud services enable container.googleapis.com
Create a GKE Cluster

You can apply any method to create a GKE cluster. The example given here uses the Google Cloud SDK.

export CLUSTER_NAME=cluster1
gcloud container clusters create $CLUSTER_NAME --image-type COS --num-nodes 2 --machine-type n1-standard-4

Retrieve the credentials to access the cluster:

gcloud container clusters get-credentials $CLUSTER_NAME

When done, you should be able to access your cluster like this:

kubectl get nodes
NAME                                      STATUS   ROLES    AGE   VERSION
gke-cluster1-default-pool-a63a765c-flr2   Ready    <none>   6m    v1.11.7-gke.4
gke-cluster1-default-pool-a63a765c-z73c   Ready    <none>   6m    v1.11.7-gke.4
Deploy Cilium

Extract the Cluster CIDR to enable native-routing:

NATIVE_CIDR=$(gcloud container clusters describe $CLUSTER_NAME | grep -i clusterIpv4Cidr | awk '{print $2}')
echo $NATIVE_CIDR
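
Alternatively, gcloud can print just that field directly, which avoids the grep/awk parsing (a sketch using gcloud's built-in --format projection):

NATIVE_CIDR=$(gcloud container clusters describe $CLUSTER_NAME --format="value(clusterIpv4Cidr)")
echo $NATIVE_CIDR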

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

If you are ready to restart existing pods when initializing the node, you can also pass the --set nodeinit.restartPods flag to the helm command below. This will ensure all pods are managed by Cilium.

kubectl create namespace cilium
helm install cilium cilium/cilium --version 1.8.90 \
  --namespace cilium \
  --set global.nodeinit.enabled=true \
  --set nodeinit.reconfigureKubelet=true \
  --set nodeinit.removeCbrBridge=true \
  --set global.cni.binPath=/home/kubernetes/bin \
  --set global.gke.enabled=true \
  --set config.ipam=kubernetes \
  --set global.native-routing-cidr=$NATIVE_CIDR

The NodeInit DaemonSet is required to prepare the GKE nodes as nodes are added to the cluster. The NodeInit DaemonSet will perform the following actions:

  • Reconfigure kubelet to run in CNI mode
  • Mount the BPF filesystem
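
If you want to spot-check a node after it has been initialized, you can SSH in with gcloud and confirm that the BPF filesystem is mounted (the node name is a placeholder; you may also need to pass --zone):

# Replace <node-name> with a name from "kubectl get nodes"
gcloud compute ssh <node-name> --command 'mount | grep /sys/fs/bpf'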
Restart unmanaged Pods

If you did not use nodeinit.restartPods=true in the Helm options when deploying Cilium, then unmanaged pods need to be restarted manually. Restart all already-running pods which are not running in host-networking mode to ensure that Cilium starts managing them. This is required to ensure that all pods which were running before Cilium was deployed have network connectivity provided by Cilium and that NetworkPolicy applies to them:

kubectl get pods --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,HOSTNETWORK:.spec.hostNetwork --no-headers=true | grep '<none>' | awk '{print "-n "$1" "$2}' | xargs kubectl delete pod
pod "event-exporter-v0.2.3-f9c896d75-cbvcz" deleted
pod "fluentd-gcp-scaler-69d79984cb-nfwwk" deleted
pod "heapster-v1.6.0-beta.1-56d5d5d87f-qw8pv" deleted
pod "kube-dns-5f8689dbc9-2nzft" deleted
pod "kube-dns-5f8689dbc9-j7x5f" deleted
pod "kube-dns-autoscaler-76fcd5f658-22r72" deleted
pod "kube-state-metrics-7d9774bbd5-n6m5k" deleted
pod "l7-default-backend-6f8697844f-d2rq2" deleted
pod "metrics-server-v0.3.1-54699c9cc8-7l5w2" deleted
Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n cilium get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

It will deploy a series of deployments which will use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant and the readiness and liveness gate indicates success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
Installation on Azure AKS

This guide covers installing Cilium into an Azure AKS environment. This guide will work when setting up AKS in both Basic and Advanced networking mode.

This is achieved using Cilium in CNI chaining mode, with the Azure CNI plugin as the base CNI plugin and Cilium chaining on top to provide L3-L7 observability, network policy enforcement, Kubernetes services implementation, as well as other advanced features like transparent encryption and clustermesh.

Prerequisites

Ensure that you have the Azure Cloud CLI installed.

To verify, confirm that the following command displays the set of available Kubernetes versions.

az aks get-versions -l westus -o table
Create an AKS Cluster

You can use any method to create and deploy an AKS cluster with the exception of specifying the Network Policy option. Doing so will still work but will result in unwanted iptables rules being installed on all of your nodes.

If you want to use the CLI to create a dedicated set of Azure resources (resource groups, networks, etc.) specifically for this tutorial, the following commands (borrowed from the AKS documentation), run as a script or manually all in the same terminal, are sufficient.

It can take 10+ minutes for the final command to complete, indicating that the cluster is ready.

Note

Do NOT specify the ‘--network-policy’ flag when creating the cluster, as this will cause the Azure CNI plugin to push down unwanted iptables rules:

export RESOURCE_GROUP_NAME=group1
export CLUSTER_NAME=aks-test1
export LOCATION=westus

az group create --name $RESOURCE_GROUP_NAME --location $LOCATION
az aks create \
    --resource-group $RESOURCE_GROUP_NAME \
    --name $CLUSTER_NAME \
    --node-count 2 \
    --generate-ssh-keys \
    --network-plugin azure
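
Once provisioning finishes, you can confirm the cluster state with a read-only query:

az aks show --resource-group $RESOURCE_GROUP_NAME --name $CLUSTER_NAME --output table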
Configure kubectl to Point to Newly Created Cluster

Run the following commands to configure kubectl to connect to this AKS cluster:

az aks get-credentials --resource-group $RESOURCE_GROUP_NAME --name $CLUSTER_NAME

To verify, you should see AKS in the name of the nodes when you run:

kubectl get nodes
NAME                       STATUS   ROLES   AGE     VERSION
aks-nodepool1-12032939-0   Ready    agent   8m26s   v1.13.10
Create an AKS + Cilium CNI configuration

Create a chaining.yaml file based on the following template to specify the desired CNI chaining configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cni-configuration
  namespace: cilium
data:
  cni-config: |-
    {
      "cniVersion": "0.3.0",
      "name": "azure",
      "plugins": [
        {
          "type": "azure-vnet",
          "mode": "transparent",
          "bridge": "azure0",
          "ipam": {
             "type": "azure-vnet-ipam"
           }
        },
        {
          "type": "portmap",
          "capabilities": {"portMappings": true},
          "snat": true
        },
        {
           "name": "cilium",
           "type": "cilium-cni"
        }
      ]
    }

Create the cilium namespace:

kubectl create namespace cilium

Deploy the ConfigMap:

kubectl apply -f chaining.yaml
Deploy Cilium

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace cilium \
  --set global.cni.chainingMode=generic-veth \
  --set global.cni.customConf=true \
  --set global.nodeinit.enabled=true \
  --set nodeinit.azure=true \
  --set global.cni.configMap=cni-configuration \
  --set global.tunnel=disabled \
  --set global.masquerade=false

This will create both the main cilium DaemonSet and the cilium-node-init DaemonSet, which handles tasks like mounting the BPF filesystem and updating the existing Azure CNI plugin to run in ‘transparent’ mode.

Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

It will deploy a series of deployments which will use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant and the readiness and liveness gate indicates success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true

Installer Integrations

The following list includes the Kubernetes installer integrations we are aware of. If your preferred installer is not in the list, you can always fall back to the standard Installation with managed etcd guide, which works independently of any installer.

Installation using Kops

As of the kops 1.9 release, Cilium can be plugged into kops-deployed clusters as the CNI plugin. This guide provides steps to create a Kubernetes cluster on AWS using kops with Cilium as the CNI plugin. Note that the kops deployment will automate several deployment features in AWS by default, including AutoScaling, Volumes, VPCs, etc.

Prerequisites
  • aws cli
  • kubectl
  • aws account with the following permissions: AmazonEC2FullAccess, AmazonRoute53FullAccess, AmazonS3FullAccess, IAMFullAccess, AmazonVPCFullAccess
Installing kops
# Linux
curl -LO https://github.com/kubernetes/kops/releases/download/$(curl -s https://api.github.com/repos/kubernetes/kops/releases/latest | grep tag_name | cut -d '"' -f 4)/kops-linux-amd64
chmod +x kops-linux-amd64
sudo mv kops-linux-amd64 /usr/local/bin/kops

# macOS (Homebrew)
brew update && brew install kops
Setting up IAM Group and User

Assuming you have all the prerequisites, run the following commands to create the kops user and group:

# Create IAM group named kops and grant access
aws iam create-group --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonRoute53FullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/IAMFullAccess --group-name kops
aws iam attach-group-policy --policy-arn arn:aws:iam::aws:policy/AmazonVPCFullAccess --group-name kops
aws iam create-user --user-name kops
aws iam add-user-to-group --user-name kops --group-name kops
aws iam create-access-key --user-name kops
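
The last command prints an access key for the kops user; the aws CLI and kops will pick it up from the standard environment variables. The values below are placeholders that must be copied from the create-access-key output:

# Placeholders: copy the real values from the create-access-key output above
export AWS_ACCESS_KEY_ID=<AccessKeyId>
export AWS_SECRET_ACCESS_KEY=<SecretAccessKey>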

kops requires the creation of a dedicated S3 bucket in order to store the state and representation of the cluster. You will need to change the bucket name below and provide your own unique bucket name (for example, a reversed FQDN with a short description of the cluster appended). Also make sure to use the region where you will be deploying the cluster.

aws s3api create-bucket --bucket prefix-example-com-state-store --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2
export KOPS_STATE_STORE=s3://prefix-example-com-state-store

The above steps are sufficient for getting a working cluster installed. Please consult kops aws documentation for more detailed setup instructions.

Cilium Prerequisites
  • Ensure the System Requirements are met, particularly the Linux kernel and key-value store versions.

In this guide, we will use etcd version 3.1.11 and the latest CoreOS stable image, which satisfies the minimum kernel version requirement of Cilium. To get the latest CoreOS AMI, you can change the region value to your choice in the command below.

aws ec2 describe-images --region=us-west-2 --owner=595879546273 --filters "Name=virtualization-type,Values=hvm" "Name=name,Values=CoreOS-stable*" --query 'sort_by(Images,&CreationDate)[-1].{id:ImageLocation}'
{
        "id": "595879546273/CoreOS-stable-1745.5.0-hvm"
}
Creating a Cluster
  • Note that you will need to specify --master-zones and --zones for creating the master and worker nodes. The number of master zones should be odd (1, 3, …) for HA. For simplicity, you can just use one region.
  • The cluster NAME variable should end with k8s.local to use the gossip protocol. If creating multiple clusters using the same kops user, then make the cluster name unique by adding a prefix such as com-company-emailid-.
export NAME=com-company-emailid-cilium.k8s.local
export KOPS_FEATURE_FLAGS=SpecOverrideFlag
kops create cluster --state=${KOPS_STATE_STORE} --node-count 3 --node-size t2.medium --master-size t2.medium --topology private --master-zones us-west-2a,us-west-2b,us-west-2c --zones us-west-2a,us-west-2b,us-west-2c --image 595879546273/CoreOS-stable-1745.5.0-hvm --networking cilium --override "cluster.spec.etcdClusters[*].version=3.1.11" --kubernetes-version 1.10.3  --cloud-labels "Team=Dev,Owner=Admin" ${NAME}

You may be prompted to create an SSH public-private key pair.

ssh-keygen

(Please see Deleting a Cluster)
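
Note that kops create cluster only writes the cluster specification to the state store. A sketch of the usual follow-up to actually build the AWS resources and check that they come up:

kops update cluster ${NAME} --yes
kops validate cluster ${NAME}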

Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

It will deploy a series of deployments which will use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant and the readiness and liveness gate indicates success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Deleting a Cluster

To undo the dependencies and other deployment features in AWS from the kops cluster creation, use kops to destroy a cluster immediately with the parameter --yes:

kops delete cluster ${NAME} --yes
Appendix: Details of kops flags used in cluster creation

The following section explains all the flags used in create cluster command.

  • KOPS_FEATURE_FLAGS=SpecOverrideFlag : This flag is used to override the etcd version to be used from 2.x (the kops default) to 3.1.x (a requirement of Cilium)
  • --state=${KOPS_STATE_STORE} : kops uses an S3 bucket to store the state and representation of your cluster
  • --node-count 3 : Number of worker nodes in the Kubernetes cluster
  • --node-size t2.medium : The size of the AWS EC2 instance for worker nodes
  • --master-size t2.medium : The size of the AWS EC2 instance for master nodes
  • --topology private : The cluster will be created with a private topology, meaning all masters/nodes will be launched in a private subnet in the VPC
  • --master-zones us-west-2a,us-west-2b,us-west-2c : The 3 zones ensure the HA of master nodes, each belonging to a different Availability Zone
  • --zones us-west-2a,us-west-2b,us-west-2c : Zones where the worker nodes will be deployed
  • --image 595879546273/CoreOS-stable-1745.5.0-hvm : Image name to be deployed (Cilium requires kernel version 4.8 and above, so ensure you use the right OS for workers)
  • --networking cilium : Networking CNI plugin to be used - cilium
  • --override "cluster.spec.etcdClusters[*].version=3.1.11" : Overrides the etcd version to be used
  • --kubernetes-version 1.10.3 : Kubernetes version that is to be installed (note that kops 1.9 officially supports Kubernetes version 1.9)
  • --cloud-labels "Team=Dev,Owner=Admin" : Labels for your cluster
  • ${NAME} : Name of the cluster. Make sure the name ends with k8s.local for a gossip-based cluster
Installation using Kubespray

This guide uses Kubespray to create an AWS Kubernetes cluster running Cilium as the CNI plugin.

Please consult Kubespray Prerequisites and Cilium System Requirements.

Installing Kubespray
$ git clone --branch v2.6.0 https://github.com/kubernetes-sigs/kubespray

Install dependencies from requirements.txt

$ cd kubespray
$ sudo pip install -r requirements.txt
Infrastructure Provisioning

We will use Terraform for provisioning AWS infrastructure.

Configure AWS credentials

Export the variables for your AWS credentials:

export AWS_ACCESS_KEY_ID="www"
export AWS_SECRET_ACCESS_KEY="xxx"
export AWS_SSH_KEY_NAME="yyy"
export AWS_DEFAULT_REGION="zzz"
Configure Terraform Variables

We will start by specifying the infrastructure needed for the Kubernetes cluster.

$ cd contrib/terraform/aws
$ cp terraform.tfvars.example terraform.tfvars

Open the file and change any defaults, particularly the number of master, etcd, and worker nodes. You can change the master and etcd count to 1 for deployments that don’t need high availability. By default, this tutorial will create:

  • VPC with 2 public and private subnets
  • Bastion Hosts and NAT Gateways in the Public Subnet
  • Three of each (masters, etcd, and worker nodes) in the Private Subnet
  • AWS ELB in the Public Subnet for accessing the Kubernetes API from the internet
  • Terraform scripts using CoreOS as base image.

Example terraform.tfvars file:

#Global Vars
aws_cluster_name = "kubespray"

#VPC Vars
aws_vpc_cidr_block = "XXX.XXX.192.0/18"
aws_cidr_subnets_private = ["XXX.XXX.192.0/20","XXX.XXX.208.0/20"]
aws_cidr_subnets_public = ["XXX.XXX.224.0/20","XXX.XXX.240.0/20"]

#Bastion Host
aws_bastion_size = "t2.medium"


#Kubernetes Cluster

aws_kube_master_num = 3
aws_kube_master_size = "t2.medium"

aws_etcd_num = 3
aws_etcd_size = "t2.medium"

aws_kube_worker_num = 3
aws_kube_worker_size = "t2.medium"

#Settings AWS ELB

aws_elb_api_port = 6443
k8s_secure_api_port = 6443
kube_insecure_apiserver_address = "0.0.0.0"
Apply the configuration

Run terraform init to initialize the following modules:

  • module.aws-vpc
  • module.aws-elb
  • module.aws-iam
$ terraform init

Once initialized, execute:

$ terraform plan -out=aws_kubespray_plan

This will generate a file, aws_kubespray_plan, depicting an execution plan of the infrastructure that will be created on AWS. To apply, execute:

$ terraform init
$ terraform apply "aws_kubespray_plan"

Terraform automatically creates an Ansible Inventory file at inventory/hosts.

Installing Kubernetes cluster with Cilium as CNI

Kubespray uses Ansible as its substrate for provisioning and orchestration. Once the infrastructure is created, you can run the Ansible playbook to install Kubernetes and all the required dependencies. Execute the command below in the cloned kubespray repo, providing the correct path of the AWS EC2 SSH private key in ansible_ssh_private_key_file=<path to EC2 SSH private key file>.

We recommend using the latest released Cilium version by editing roles/download/defaults/main.yml. Open the file, search for cilium_version, and replace the version with the latest release. As an example, the updated version entry will look like: cilium_version: "v1.2.0".

$ ansible-playbook -i ./inventory/hosts ./cluster.yml -e ansible_user=core -e bootstrap_os=coreos -e kube_network_plugin=cilium -b --become-user=root --flush-cache  -e ansible_ssh_private_key_file=<path to EC2 SSH private key file>
Validate Cluster

To check if the cluster was created successfully, SSH into the bastion host with the user core.

# Get information about the bastion host
$ cat ssh-bastion.conf
$ ssh -i ~/path/to/ec2-key-file.pem core@public_ip_of_bastion_host

Execute the commands below from the bastion host. If kubectl isn’t installed on the bastion host, you can log in to the master node to test the commands below. You may need to copy the private key to the bastion host to access the master node.

$ kubectl get nodes
$ kubectl get pods -n kube-system

You should see that the nodes are in the Ready state and the Cilium pods are in the Running state.
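
For example, assuming the Cilium DaemonSet pods carry the default k8s-app=cilium label:

$ kubectl get pods -n kube-system -l k8s-app=cilium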

Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

It will deploy a series of deployments which will use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant and the readiness and liveness gate indicates success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Delete Cluster
$ cd contrib/terraform/aws
$ terraform destroy
Installation using kubeadm

Instructions about installing Cilium on Kubernetes cluster deployed by kubeadm are available in the official Kubernetes documentation.

CNI Chaining

CNI chaining allows Cilium to be used in combination with other CNI plugins.

With Cilium CNI chaining, the base network connectivity and IP address management is managed by the non-Cilium CNI plugin, but Cilium attaches BPF programs to the network devices created by the non-Cilium plugin to provide L3/L4/L7 network visibility & policy enforcement and other advanced features like transparent encryption.

AWS-CNI

This guide explains how to set up Cilium in combination with aws-cni. In this hybrid mode, the aws-cni plugin is responsible for setting up the virtual network devices as well as address allocation (IPAM) via ENI. After the initial networking is set up, the Cilium CNI plugin is called to attach BPF programs to the network devices set up by aws-cni to enforce network policies, perform load balancing, and provide encryption.

[Figure: Cilium chaining on top of AWS CNI (aws-cni-architecture.png)]
Setup Cluster on AWS

Follow the instructions in the Installation on AWS EKS guide to set up an EKS cluster or use any other method of your preference to set up a Kubernetes cluster.

Ensure that the aws-vpc-cni-k8s plugin is installed. If you have set up an EKS cluster, this is automatically done.
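
A quick way to confirm the plugin is present is to check for the aws-node DaemonSet it runs as:

kubectl -n kube-system get daemonset aws-node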

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.cni.chainingMode=aws-cni \
  --set global.masquerade=false \
  --set global.tunnel=disabled \
  --set global.nodeinit.enabled=true

This will enable chaining with the aws-cni plugin. It will also disable tunneling. Tunneling is not required as ENI IP addresses can be directly routed in your VPC. You can also disable masquerading for the same reason.

Restart existing pods

The new CNI chaining configuration will not apply to any pod that is already running in the cluster. Existing pods will be reachable and Cilium will load-balance to them but policy enforcement will not apply to them and load-balancing is not performed for traffic originating from existing pods. You must restart these pods in order to invoke the chaining configuration on them.

If you are unsure if a pod is managed by Cilium or not, run kubectl get cep in the respective namespace and see if the pod is listed.

Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

It will deploy a series of deployments which will use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant and the readiness and liveness gate indicates success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
Azure CNI

This guide explains how to set up Cilium in combination with Azure CNI. In this hybrid mode, the Azure CNI plugin is responsible for setting up the virtual network devices as well as address allocation (IPAM). After the initial networking is set up, the Cilium CNI plugin is called to attach BPF programs to the network devices set up by Azure CNI to enforce network policies, perform load balancing, and provide encryption.

Note

If you are looking to install Cilium on Azure AKS, see the guide Installation on Azure AKS for a complete guide also covering cluster setup.

Create an AKS + Cilium CNI configuration

Create a chaining.yaml file based on the following template to specify the desired CNI chaining configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cni-configuration
  namespace: cilium
data:
  cni-config: |-
    {
      "cniVersion": "0.3.0",
      "name": "azure",
      "plugins": [
        {
          "type": "azure-vnet",
          "mode": "transparent",
          "bridge": "azure0",
          "ipam": {
             "type": "azure-vnet-ipam"
           }
        },
        {
          "type": "portmap",
          "capabilities": {"portMappings": true},
          "snat": true
        },
        {
           "name": "cilium",
           "type": "cilium-cni"
        }
      ]
    }

Create the cilium namespace:

kubectl create namespace cilium

Deploy the ConfigMap:

kubectl apply -f chaining.yaml
Deploy Cilium

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace cilium \
  --set global.cni.chainingMode=generic-veth \
  --set global.cni.customConf=true \
  --set global.nodeinit.enabled=true \
  --set nodeinit.azure=true \
  --set global.cni.configMap=cni-configuration \
  --set global.tunnel=disabled \
  --set global.masquerade=false

This will create both the main cilium DaemonSet and the cilium-node-init DaemonSet, which handles tasks like mounting the BPF filesystem and updating the existing Azure CNI plugin to run in ‘transparent’ mode.

Restart existing pods

The new CNI chaining configuration will not apply to any pod that is already running in the cluster. Existing pods will be reachable and Cilium will load-balance to them but policy enforcement will not apply to them and load-balancing is not performed for traffic originating from existing pods. You must restart these pods in order to invoke the chaining configuration on them.

If you are unsure if a pod is managed by Cilium or not, run kubectl get cep in the respective namespace and see if the pod is listed.

Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n cilium get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

It will deploy a series of deployments which will use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant and the readiness and liveness gate indicates success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
Calico

This guide instructs how to install Cilium in chaining configuration on top of Calico.

Create a CNI configuration

Create a chaining.yaml file based on the following template to specify the desired CNI chaining configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cni-configuration
  namespace: kube-system
data:
  cni-config: |-
    {
      "name": "generic-veth",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "log_level": "info",
          "datastore_type": "kubernetes",
          "mtu": 1440,
          "ipam": {
              "type": "calico-ipam"
          },
          "policy": {
              "type": "k8s"
          },
          "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
          }
        },
        {
          "type": "portmap",
          "snat": true,
          "capabilities": {"portMappings": true}
        },
        {
          "type": "cilium-cni"
        }
      ]
    }

Deploy the ConfigMap:

kubectl apply -f chaining.yaml
Deploy Cilium with the portmap plugin enabled

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace=kube-system \
  --set global.cni.chainingMode=generic-veth \
  --set global.cni.customConf=true \
  --set global.cni.configMap=cni-configuration \
  --set global.tunnel=disabled \
  --set global.masquerade=false

Note

The new CNI chaining configuration will not apply to any pod that is already running in the cluster. Existing pods will be reachable and Cilium will load-balance to them but policy enforcement will not apply to them and load-balancing is not performed for traffic originating from existing pods.

You must restart these pods in order to invoke the chaining configuration on them.

Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

It will deploy a series of deployments which will use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations. The pod name indicates the connectivity variant and the readiness and liveness gate indicates success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
Generic Veth Chaining

The generic veth chaining plugin enables CNI chaining on top of any CNI plugin that is using a veth device model. The majority of CNI plugins use such a model.

Validate that the current CNI plugin is using veth
  1. Log into one of the worker nodes using SSH

  2. Run ip -d link to list all network devices on the node. You should be able to spot network devices representing the pods running on that node.

  3. A network device might look something like this:

    103: lxcb3901b7f9c02@if102: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
        link/ether 3a:39:92:17:75:6f brd ff:ff:ff:ff:ff:ff link-netnsid 18 promiscuity 0
        veth addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    
  4. The veth keyword on line 3 indicates that the network device type is virtual ethernet.

If the CNI plugin you are chaining with is currently not using veth, then the generic-veth plugin is not suitable. In that case, a full CNI chaining plugin is required which understands the device model of the underlying plugin. Writing such a plugin is trivial; contact us on Slack for more details.

Create a CNI configuration to define your chaining configuration

Create a chaining.yaml file based on the following template to specify the desired CNI chaining configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cni-configuration
  namespace: kube-system
data:
  cni-config: |-
    {
      "name": "generic-veth",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "XXX",
          [...]
        },
        {
          "type": "cilium-cni"
        }
      ]
    }

Deploy the ConfigMap:

kubectl apply -f chaining.yaml
Deploy Cilium with the portmap plugin enabled

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace=kube-system \
  --set global.cni.chainingMode=generic-veth \
  --set global.cni.customConf=true \
  --set global.cni.configMap=cni-configuration \
  --set global.tunnel=disabled \
  --set global.masquerade=false
Portmap (HostPort)

If you want to use the Kubernetes HostPort feature, you must enable CNI chaining with the portmap plugin, which implements HostPort. This guide documents how to do so. For more information about the Kubernetes HostPort feature, check out the upstream documentation: Kubernetes hostPort-CNI plugin documentation.

Note

Before using HostPort, read the Kubernetes Configuration Best Practices to understand the implications of this feature.

Deploy Cilium with the portmap plugin enabled

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace=kube-system \
  --set global.cni.chainingMode=portmap

Note

You can combine the global.cni.chainingMode=portmap option with any of the other installation guides.

As Cilium is deployed as a DaemonSet, it will write a new CNI configuration 05-cilium.conflist and remove the standard 05-cilium.conf. The new configuration now enables HostPort. Any new pod scheduled is now able to make use of the HostPort functionality.
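
To confirm this on a node, you can list the CNI configuration directory mentioned above (run on the node itself, e.g. over SSH):

ls /etc/cni/net.d/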

Restart existing pods

The new CNI chaining configuration will not apply to any pod that is already running in the cluster. Existing pods will be reachable and Cilium will load-balance to them but policy enforcement will not apply to them and load-balancing is not performed for traffic originating from existing pods. You must restart these pods in order to invoke the chaining configuration on them.

Weave Net

This guide instructs how to install Cilium in chaining configuration on top of Weave Net.

Create a CNI configuration

Create a chaining.yaml file based on the following template to specify the desired CNI chaining configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cni-configuration
  namespace: kube-system
data:
  cni-config: |-
    {
        "cniVersion": "0.3.1",
        "name": "weave",
        "plugins": [
            {
                "name": "weave",
                "type": "weave-net",
                "hairpinMode": true
            },
            {
                "type": "portmap",
                "capabilities": {"portMappings": true},
                "snat": true
            },
            {
                "type": "cilium-cni"
            }
        ]
    }

Deploy the ConfigMap:

kubectl apply -f chaining.yaml
Deploy Cilium with the portmap plugin enabled

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace=kube-system \
  --set global.cni.chainingMode=generic-veth \
  --set global.cni.customConf=true \
  --set global.cni.configMap=cni-configuration \
  --set global.tunnel=disabled \
  --set global.masquerade=false

Note

The new CNI chaining configuration will not apply to any pod that is already running in the cluster. Existing pods will be reachable and Cilium will load-balance to them but policy enforcement will not apply to them and load-balancing is not performed for traffic originating from existing pods.

You must restart these pods in order to invoke the chaining configuration on them.

Validate the Installation

You can monitor as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

It will deploy a series of deployments that use various connectivity paths to connect to each other. Connectivity paths include both with and without service load-balancing, as well as various network policy combinations. The pod name indicates the connectivity variant, and the readiness and liveness gates indicate success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page for more information.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
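
To verify that the Hubble components are coming up (pod names and labels depend on the chart version, so the grep below is just a convenient filter), you can run:

$ kubectl -n kube-system get pods | grep hubble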

Network Policy Security Tutorials

Identity-Aware and HTTP-Aware Policy Enforcement

If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.

The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

Setup Cilium

If you have not set up Cilium yet, pick any installation method as described in section Installation to set up Cilium for your Kubernetes environment. If in doubt, pick Getting Started Using Minikube as the simplest way to set up a Kubernetes cluster with Cilium:

minikube start --network-plugin=cni --memory=4096
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml
Deploy the Demo Application

Now that we have Cilium deployed and kube-dns operating correctly we can deploy our demo application.

In our Star Wars-inspired example, there are three microservices applications: deathstar, tiefighter, and xwing. The deathstar runs an HTTP webservice on port 80, which is exposed as a Kubernetes Service to load-balance requests to deathstar across two pod replicas. The deathstar service provides landing services to the empire’s spaceships so that they can request a landing port. The tiefighter pod represents a landing-request client service on a typical empire ship and xwing represents a similar service on an alliance ship. They exist so that we can test different security policies for access control to deathstar landing services.

Application Topology for Cilium and Kubernetes

_images/cilium_http_gsg.png

The file http-sw-app.yaml contains a Kubernetes Deployment for each of the three services. Each deployment is identified using the Kubernetes labels (org=empire, class=deathstar), (org=empire, class=tiefighter), and (org=alliance, class=xwing). It also includes a deathstar-service, which load-balances traffic to all pods with label (org=empire, class=deathstar).

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/minikube/http-sw-app.yaml
service/deathstar created
deployment.extensions/deathstar created
pod/tiefighter created
pod/xwing created

Kubernetes will deploy the pods and service in the background. Running kubectl get pods,svc will inform you about the progress of the operation. Each pod will go through several states until it reaches Running at which point the pod is ready.

$ kubectl get pods,svc
NAME                             READY   STATUS    RESTARTS   AGE
pod/deathstar-6fb5694d48-5hmds   1/1     Running   0          107s
pod/deathstar-6fb5694d48-fhf65   1/1     Running   0          107s
pod/tiefighter                   1/1     Running   0          107s
pod/xwing                        1/1     Running   0          107s

NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
service/deathstar    ClusterIP   10.96.110.8   <none>        80/TCP    107s
service/kubernetes   ClusterIP   10.96.0.1     <none>        443/TCP   3m53s
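
The policies in this guide match on the pod labels described above. If you want to confirm the labels assigned to each pod, you can, for example, run:

$ kubectl get pods --show-labels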

Each pod will be represented in Cilium as an Endpoint. We can invoke the cilium tool inside the Cilium pod to list them:

$ kubectl -n kube-system get pods -l k8s-app=cilium
NAME           READY   STATUS    RESTARTS   AGE
cilium-5ngzd   1/1     Running   0          3m19s

$ kubectl -n kube-system exec cilium-1c2cz -- cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                       IPv6   IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
108        Disabled           Disabled          104        k8s:io.cilium.k8s.policy.cluster=default                 10.15.233.139   ready
                                                           k8s:io.cilium.k8s.policy.serviceaccount=coredns
                                                           k8s:io.kubernetes.pod.namespace=kube-system
                                                           k8s:k8s-app=kube-dns
1011       Disabled           Disabled          104        k8s:io.cilium.k8s.policy.cluster=default                 10.15.96.117    ready
                                                           k8s:io.cilium.k8s.policy.serviceaccount=coredns
                                                           k8s:io.kubernetes.pod.namespace=kube-system
                                                           k8s:k8s-app=kube-dns
2407       Disabled           Disabled          22839      k8s:class=deathstar                                      10.15.129.95    ready
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
                                                           k8s:org=empire
2607       Disabled           Disabled          4          reserved:health                                          10.15.28.196    ready
3339       Disabled           Disabled          22839      k8s:class=deathstar                                      10.15.72.39     ready
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
                                                           k8s:org=empire
3738       Disabled           Disabled          47764      k8s:class=xwing                                          10.15.116.85    ready
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
                                                           k8s:org=alliance
3837       Disabled           Disabled          9164       k8s:class=tiefighter                                     10.15.22.126    ready
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
                                                           k8s:org=empire

Both ingress and egress policy enforcement are still disabled on all of these pods because no network policy has been imported yet that selects any of the pods.

Check Current Access

From the perspective of the deathstar service, only the ships with label org=empire are allowed to connect and request landing. Since we have no rules enforced, both xwing and tiefighter will be able to request landing. To test this, use the commands below.

$ kubectl exec xwing -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed
$ kubectl exec tiefighter -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed
Apply an L3/L4 Policy

When using Cilium, endpoint IP addresses are irrelevant when defining security policies. Instead, you can use the labels assigned to the pods to define security policies. The policies will be applied to the right pods based on the labels, irrespective of where or when they are running within the cluster.

We’ll start with the basic policy restricting deathstar landing requests to only the ships that have label (org=empire). This will not allow any ships that don’t have the org=empire label to even connect with the deathstar service. This is a simple policy that filters only on IP protocol (network layer 3) and TCP protocol (network layer 4), so it is often referred to as an L3/L4 network security policy.

Note: Cilium performs stateful connection tracking, meaning that if policy allows the frontend to reach backend, it will automatically allow all required reply packets that are part of backend replying to frontend within the context of the same TCP/UDP connection.

L4 Policy with Cilium and Kubernetes

_images/cilium_http_l3_l4_gsg.png

We can achieve that with the following CiliumNetworkPolicy:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "L3-L4 policy to restrict deathstar access to empire ships only"
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      org: empire
      class: deathstar
  ingress:
  - fromEndpoints:
    - matchLabels:
        org: empire
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP

CiliumNetworkPolicies match on pod labels using an “endpointSelector” to identify the sources and destinations to which the policy applies. The above policy whitelists traffic sent from any pods with label (org=empire) to deathstar pods with label (org=empire, class=deathstar) on TCP port 80.
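
As a purely illustrative variation (not part of this tutorial), multiple entries under fromEndpoints are OR-ed together, so allowing a hypothetical second group of clients, say org=alliance, would only require an additional entry in the ingress fragment:

  ingress:
  - fromEndpoints:
    - matchLabels:
        org: empire
    - matchLabels:
        org: alliance
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP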

To apply this L3/L4 policy, run:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/minikube/sw_l3_l4_policy.yaml
ciliumnetworkpolicy.cilium.io/rule1 created

Now if we run the landing requests again, only the tiefighter pod, which carries the label org=empire, will succeed. The xwing pod will be blocked!

$ kubectl exec tiefighter -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed

This works as expected. Now the same request run from an xwing pod will fail:

$ kubectl exec xwing -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing

This request will hang, so press Control-C to kill the curl request, or wait for it to time out.
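
If you prefer the blocked request to fail quickly instead of hanging, you can give curl a client-side timeout, for example:

$ kubectl exec xwing -- curl -s --max-time 10 -XPOST deathstar.default.svc.cluster.local/v1/request-landing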

Inspecting the Policy

If we run cilium endpoint list again we will see that the pods with the label org=empire and class=deathstar now have ingress policy enforcement enabled as per the policy above.

$ kubectl -n kube-system exec cilium-1c2cz -- cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                       IPv6   IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
108        Disabled           Disabled          104        k8s:io.cilium.k8s.policy.cluster=default                 10.15.233.139   ready
                                                           k8s:io.cilium.k8s.policy.serviceaccount=coredns
                                                           k8s:io.kubernetes.pod.namespace=kube-system
                                                           k8s:k8s-app=kube-dns
1011       Disabled           Disabled          104        k8s:io.cilium.k8s.policy.cluster=default                 10.15.96.117    ready
                                                           k8s:io.cilium.k8s.policy.serviceaccount=coredns
                                                           k8s:io.kubernetes.pod.namespace=kube-system
                                                           k8s:k8s-app=kube-dns
1518       Disabled           Disabled          4          reserved:health                                          10.15.28.196    ready
2407       Enabled            Disabled          22839      k8s:class=deathstar                                      10.15.129.95    ready
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
                                                           k8s:org=empire
3339       Enabled            Disabled          22839      k8s:class=deathstar                                      10.15.72.39     ready
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
                                                           k8s:org=empire
3738       Disabled           Disabled          47764      k8s:class=xwing                                          10.15.116.85    ready
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
                                                           k8s:org=alliance
3837       Disabled           Disabled          9164       k8s:class=tiefighter                                     10.15.22.126    ready
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
                                                           k8s:org=empire

You can also inspect the policy details via kubectl:

$ kubectl get cnp
NAME    AGE
rule1   2m

$ kubectl describe cnp rule1
Name:         rule1
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  cilium.io/v2
Description:  L3-L4 policy to restrict deathstar access to empire ships only
Kind:         CiliumNetworkPolicy
Metadata:
  Creation Timestamp:  2019-01-23T12:36:32Z
  Generation:          1
  Resource Version:    1115
  Self Link:           /apis/cilium.io/v2/namespaces/default/ciliumnetworkpolicies/rule1
  UID:                 837a2f1b-1f0b-11e9-9609-080027702f09
Spec:
  Endpoint Selector:
    Match Labels:
      Class:  deathstar
      Org:    empire
  Ingress:
    From Endpoints:
      Match Labels:
        Org:  empire
    To Ports:
      Ports:
        Port:      80
        Protocol:  TCP
Status:
  Nodes:
    Minikube:
      Enforcing:              true
      Last Updated:           2019-01-23T12:36:32.277839184Z
      Local Policy Revision:  5
      Ok:                     true
Events:                       <none>
Apply and Test HTTP-aware L7 Policy

In the simple scenario above, it was sufficient to either give tiefighter / xwing full access to deathstar’s API or no access at all. But to provide the strongest security (i.e., enforce least-privilege isolation) between microservices, each service that calls deathstar’s API should be limited to making only the set of HTTP requests it requires for legitimate operation.

For example, consider that the deathstar service exposes some maintenance APIs which should not be called by random empire ships. To see this, run:

$ kubectl exec tiefighter -- curl -s -XPUT deathstar.default.svc.cluster.local/v1/exhaust-port
Panic: deathstar exploded

goroutine 1 [running]:
main.HandleGarbage(0x2080c3f50, 0x2, 0x4, 0x425c0, 0x5, 0xa)
        /code/src/github.com/empire/deathstar/
        temp/main.go:9 +0x64
main.main()
        /code/src/github.com/empire/deathstar/
        temp/main.go:5 +0x85

While this is an illustrative example, unauthorized access such as above can have adverse security repercussions.

L7 Policy with Cilium and Kubernetes

_images/cilium_http_l3_l4_l7_gsg.png

Cilium is capable of enforcing HTTP-layer (i.e., L7) policies to limit what URLs the tiefighter is allowed to reach. Here is an example policy file that extends our original policy by limiting tiefighter to making only a POST /v1/request-landing API call, but disallowing all other calls (including PUT /v1/exhaust-port).

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "L7 policy to restrict access to specific HTTP call"
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      org: empire
      class: deathstar
  ingress:
  - fromEndpoints:
    - matchLabels:
        org: empire
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "POST"
          path: "/v1/request-landing"

Update the existing rule to apply the L7-aware policy to protect deathstar using:

$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/minikube/sw_l3_l4_l7_policy.yaml
ciliumnetworkpolicy.cilium.io/rule1 configured

We can now re-run the same test as above, but we will see a different outcome:

$ kubectl exec tiefighter -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed

and

$ kubectl exec tiefighter -- curl -s -XPUT deathstar.default.svc.cluster.local/v1/exhaust-port
Access denied

As you can see, with Cilium L7 security policies, we are able to permit tiefighter to access only the required API resources on deathstar, thereby implementing a “least privilege” security approach for communication between microservices.
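
As a hypothetical illustration (the GET /v1/status entry below is invented and not part of the demo application), additional calls could be whitelisted simply by appending entries under rules.http in the policy:

      rules:
        http:
        - method: "POST"
          path: "/v1/request-landing"
        - method: "GET"
          path: "/v1/status"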

You can observe the L7 policy via kubectl:

$ kubectl describe ciliumnetworkpolicies
Name:         rule1
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"cilium.io/v2","description":"L7 policy to restrict access to specific HTTP call","kind":"CiliumNetworkPolicy","metadata":{"...
API Version:  cilium.io/v2
Description:  L7 policy to restrict access to specific HTTP call
Kind:         CiliumNetworkPolicy
Metadata:
  Creation Timestamp:  2019-01-23T12:36:32Z
  Generation:          2
  Resource Version:    1484
  Self Link:           /apis/cilium.io/v2/namespaces/default/ciliumnetworkpolicies/rule1
  UID:                 837a2f1b-1f0b-11e9-9609-080027702f09
Spec:
  Endpoint Selector:
    Match Labels:
      Class:  deathstar
      Org:    empire
  Ingress:
    From Endpoints:
      Match Labels:
        Org:  empire
    To Ports:
      Ports:
        Port:      80
        Protocol:  TCP
      Rules:
        Http:
          Method:  POST
          Path:    /v1/request-landing
Status:
  Nodes:
    Minikube:
      Annotations:
        Kubectl . Kubernetes . Io / Last - Applied - Configuration:  {"apiVersion":"cilium.io/v2","description":"L7 policy to restrict access to specific HTTP call","kind":"CiliumNetworkPolicy","metadata":{"annotations":{},"name":"rule1","namespace":"default"},"spec":{"endpointSelector":{"matchLabels":{"class":"deathstar","org":"empire"}},"ingress":[{"fromEndpoints":[{"matchLabels":{"org":"empire"}}],"toPorts":[{"ports":[{"port":"80","protocol":"TCP"}],"rules":{"http":[{"method":"POST","path":"/v1/request-landing"}]}}]}]}}

      Enforcing:              true
      Last Updated:           2019-01-23T12:39:30.823729308Z
      Local Policy Revision:  7
      Ok:                     true
Events:                       <none>

and cilium CLI:

$ kubectl -n kube-system exec cilium-qh5l2 -- cilium policy get
[
  {
    "endpointSelector": {
      "matchLabels": {
        "any:class": "deathstar",
        "any:org": "empire",
        "k8s:io.kubernetes.pod.namespace": "default"
      }
    },
    "ingress": [
      {
        "fromEndpoints": [
          {
            "matchLabels": {
              "any:org": "empire",
              "k8s:io.kubernetes.pod.namespace": "default"
            }
          }
        ],
        "toPorts": [
          {
            "ports": [
              {
                "port": "80",
                "protocol": "TCP"
              }
            ],
            "rules": {
              "http": [
                {
                  "path": "/v1/request-landing",
                  "method": "POST"
                }
              ]
            }
          }
        ]
      }
    ],
    "labels": [
      {
        "key": "io.cilium.k8s.policy.name",
        "value": "rule1",
        "source": "k8s"
      },
      {
        "key": "io.cilium.k8s.policy.uid",
        "value": "837a2f1b-1f0b-11e9-9609-080027702f09",
        "source": "k8s"
      },
      {
        "key": "io.cilium.k8s.policy.namespace",
        "value": "default",
        "source": "k8s"
      },
      {
        "key": "io.cilium.k8s.policy.derived-from",
        "value": "CiliumNetworkPolicy",
        "source": "k8s"
      }
    ]
  }
]
Revision: 7

We hope you enjoyed the tutorial. Feel free to play more with the setup, read the rest of the documentation, and reach out to us on the Cilium Slack channel with any questions!

Clean-up
$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/minikube/http-sw-app.yaml
$ kubectl delete cnp rule1

Locking down external access with DNS-based policies

This document serves as an introduction for using Cilium to enforce DNS-based security policies for Kubernetes pods.

If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.

The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

Setup Cilium

If you have not set up Cilium yet, pick any installation method as described in section Installation to set up Cilium for your Kubernetes environment. If in doubt, pick Getting Started Using Minikube as the simplest way to set up a Kubernetes cluster with Cilium:

minikube start --network-plugin=cni --memory=4096
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml
Deploy the Demo Application

DNS-based policies are very useful for controlling access to services running outside the Kubernetes cluster. DNS acts as a persistent service identifier for both external services provided by AWS, Google, Twilio, Stripe, etc., and internal services such as database clusters running in private subnets outside Kubernetes. CIDR or IP-based policies are cumbersome and hard to maintain as the IPs associated with external services can change frequently. The Cilium DNS-based policies provide an easy mechanism to specify access control while Cilium manages the harder aspects of tracking DNS to IP mapping.

In this guide we will learn about:

  • Controlling egress access to services outside the cluster using DNS-based policies
  • Using patterns (or wildcards) to whitelist a subset of DNS domains
  • Combining DNS, port and L7 rules for restricting access to external service

In line with our Star Wars theme examples, we will use a simple scenario where the empire’s mediabot pods need access to Twitter for managing the empire’s tweets. The pods shouldn’t have access to any other external service.

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-dns/dns-sw-app.yaml
$ kubectl get po
NAME                             READY   STATUS    RESTARTS   AGE
pod/mediabot                     1/1     Running   0          14s
Apply DNS Egress Policy

The following Cilium network policy allows mediabot pods to only access api.twitter.com.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "fqdn"
spec:
  endpointSelector:
    matchLabels:
      org: empire
      class: mediabot
  egress:
  - toFQDNs:
    - matchName: "api.twitter.com"  
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"

Let’s take a closer look at the policy:

  • The first egress section uses the toFQDNs: matchName specification to allow egress to api.twitter.com. The destination DNS name must exactly match the name specified in the rule. The endpointSelector allows only pods with labels class: mediabot, org: empire to have this egress access.
  • The second egress section allows mediabot pods to access the kube-dns service. Note that rules: dns instructs Cilium to inspect and allow DNS lookups matching the specified patterns. In this case, inspect and allow all DNS queries.

Note that with this policy the mediabot doesn’t have access to any internal cluster service other than kube-dns. Refer to Network Policy to learn more about policies for controlling access to internal cluster services.

Let’s apply the policy:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-dns/dns-matchname.yaml

Testing the policy, we see that mediabot has access to api.twitter.com but doesn’t have access to any other external service, e.g., help.twitter.com.

$ kubectl exec -it mediabot -- curl -sL https://api.twitter.com
...
...

$ kubectl exec -it mediabot -- curl -sL https://help.twitter.com
^C
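
To inspect the DNS-to-IP mappings that Cilium has learned while enforcing this policy, you can run the cilium CLI inside the Cilium pod (replace the placeholder with the name of a cilium pod in your cluster), for example:

$ kubectl -n kube-system exec <cilium-pod> -- cilium fqdn cache list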
DNS Policies Using Patterns

The above policy controlled DNS access based on exact match of the DNS domain name. Often, it is required to allow access to a subset of domains. Let’s say, in the above example, mediabot pods need access to any Twitter sub-domain, e.g., the pattern *.twitter.com. We can achieve this easily by changing the toFQDN rule to use matchPattern instead of matchName.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "fqdn"
spec:
  endpointSelector:
    matchLabels:
      org: empire
      class: mediabot
  egress:
  - toFQDNs:
    - matchPattern: "*.twitter.com" 
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-dns/dns-pattern.yaml

Test that mediabot has access to multiple Twitter services for which the DNS matches the pattern *.twitter.com. It is important to note and test that this doesn’t allow access to twitter.com because the *. in the pattern requires one subdomain to be present in the DNS name. You can simply add more matchName and matchPattern clauses to extend the access. (See DNS based policies to learn more about specifying DNS rules using patterns and names.)
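
For example (an illustrative fragment, not a file used in this guide), combining a matchPattern clause with a matchName clause to also allow the apex domain twitter.com would look like:

  egress:
  - toFQDNs:
    - matchPattern: "*.twitter.com"
    - matchName: "twitter.com"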

$ kubectl exec -it mediabot -- curl -sL https://help.twitter.com
...

$ kubectl exec -it mediabot -- curl -sL https://about.twitter.com
...

$ kubectl exec -it mediabot -- curl -sL https://twitter.com
^C
Combining DNS, Port and L7 Rules

The DNS-based policies can be combined with port (L4) and API (L7) rules to further restrict access. In our example, we will restrict mediabot pods to accessing Twitter services only on port 443. The toPorts section in the policy below achieves the port-based restrictions along with the DNS-based policies.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "fqdn"
spec:
  endpointSelector:
    matchLabels:
      org: empire
      class: mediabot
  egress:
  - toFQDNs:
    - matchPattern: "*.twitter.com" 
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP 
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"    
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-dns/dns-port.yaml

When testing, access to https://help.twitter.com on port 443 will succeed, but access to http://help.twitter.com on port 80 will be denied.

$ kubectl exec -it mediabot -- curl https://help.twitter.com
...

$ kubectl exec -it mediabot -- curl http://help.twitter.com
^C

Refer to Layer 4 Examples and Layer 7 Examples to learn more about Cilium L4 and L7 network policies.

Clean-up
$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-dns/dns-sw-app.yaml
$ kubectl delete cnp fqdn

Inspecting TLS Encrypted Connections with Cilium

This document serves as an introduction to how network security teams can use Cilium to transparently inspect TLS-encrypted connections. This TLS-aware inspection allows Cilium API-aware visibility and policy to function even for connections where client to server communication is protected by TLS, such as when a client accesses the API service via HTTPS. This capability is similar to what is possible with traditional hardware firewalls, but it is implemented entirely in software on the Kubernetes worker node, and it is policy driven, allowing inspection to target only selected network connectivity.

This type of visibility is extremely valuable for monitoring how external API services are being used, for example, understanding which S3 buckets are being accessed by a given application.

If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.

The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

Setup Cilium

If you have not set up Cilium yet, pick any installation method as described in section Installation to set up Cilium for your Kubernetes environment. If in doubt, pick Getting Started Using Minikube as the simplest way to set up a Kubernetes cluster with Cilium:

minikube start --network-plugin=cni --memory=4096
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml

Edit the ClusterRole for Cilium to give it access to Kubernetes secrets

$ kubectl edit clusterrole cilium -n kube-system

Add the following section at the end of the file:

- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
Deploy the Demo Application

To demonstrate TLS-interception we will use the same mediabot application that we used for the DNS-aware policy example. This application will access the Star Wars API service using HTTPS, which would normally mean that network-layer mechanisms like Cilium would not be able to see the HTTP-layer details of the communication, since all application data is encrypted using TLS before that data is sent on the network.

In this guide we will learn about:

  • Creating an internal Certificate Authority (CA) and associated certificates signed by that CA to enable TLS interception.
  • Using Cilium network policy to select the traffic to intercept using DNS-based policy rules.
  • Inspecting the details of the HTTP request using cilium monitor (accessing this visibility data via Hubble, and applying Cilium network policies to filter/modify the HTTP request is also possible, but is beyond the scope of this simple Getting Started Guide)

First off, we will create a single pod mediabot application:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-dns/dns-sw-app.yaml
$ kubectl get po
NAME                             READY   STATUS    RESTARTS   AGE
pod/mediabot                     1/1     Running   0          14s
A Brief Overview of the TLS Certificate Model

TLS is a protocol that “wraps” other protocols like HTTP and ensures that communication between client and server has confidentiality (no one can read the data except the intended recipient), integrity (recipient can confirm that the data has not been modified in transit), and authentication (sender can confirm that it is talking with the intended destination, not an impostor). We will provide a highly simplified overview of TLS in this document, but for full details, please see https://en.wikipedia.org/wiki/Transport_Layer_Security .

From an authentication perspective, the TLS model relies on a “Certificate Authority” (CA), which is an entity that is trusted to create proof that a given network service (e.g., www.cilium.io) is who they say they are. The goal is to prevent a malicious party in the network between the client and the server from intercepting the traffic and pretending to be the destination server.

In the case of “friendly interception” for network security monitoring, Cilium uses a model similar to traditional firewalls with TLS inspection capabilities: the network security team creates their own “internal certificate authority” that can be used to create alternative certificates for external destinations. This model requires each client workload to also trust this new certificate, otherwise the client’s TLS library will reject the connection as invalid. In this model, the network firewall uses the certificate signed by the internal CA to act like the destination service and terminate the TLS connection. This allows the firewall to inspect and even modify the application layer data, and then initiate another TLS connection to the actual destination service.

The CA model within TLS is based on cryptographic keys and certificates. Realizing the above model requires five primary steps:

  1. Create an internal certificate authority by generating a CA private key and CA certificate.

  2. For any destination where TLS inspection is desired (e.g., artii.herokuapp.com in the example below), generate a private key and certificate signing request with a common name that matches the destination DNS name.

  3. Use the CA private key to create a signed certificate.

  4. Ensure that all clients where TLS inspection is desired have the CA certificate installed so that they will trust all certificates signed by that CA.

  5. Given that Cilium will be terminating the initial TLS connection from the client and creating a new TLS connection to the destination, Cilium must be told the set of CAs that it should trust when validating the new TLS connection to the destination service.

Note

In a non-demo environment it is EXTREMELY important that you keep the above private keys safe, as anyone with access to this private key will be able to inspect TLS-encrypted traffic (certificates on the other hand are public information, and are not at all sensitive). In the guide below, the CA private key does not need to be provided to Cilium at all (it is used only to create certificates, which can be done offline) and private keys for individual destination services are stored as Kubernetes secrets. These secrets should be stored in a namespace where they can be accessed by Cilium, but not general purpose workloads.

Generating and Installing TLS Keys and Certificates

Now that we have explained the high-level certificate model used by TLS, we will walk through the concrete steps to generate the appropriate keys and certificates using the openssl utility.

The following image describes the different files containing cryptographic data that are generated or copied, and what components in the system need access to those files:

_images/cilium_tls_visibility_gsg.png

You can use openssl on your local system if it is already installed, but if not, a simple shortcut is to use kubectl exec to execute /bin/bash within any of the cilium pods and then run the resulting openssl commands there. Use kubectl cp to copy the resulting files out of the cilium pod when it is time to use them to create Kubernetes secrets or copy them to the mediabot pod.
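
For example (a sketch only; the k8s-app=cilium label was used earlier in this guide to find Cilium pods, and the file path is an assumption you should adjust to wherever you created the files), you could work inside a Cilium pod and copy the results back out like this:

$ CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
$ kubectl -n kube-system exec -it $CILIUM_POD -- /bin/bash
# ... run the openssl commands from the sections below inside the pod, then exit ...
$ kubectl cp kube-system/$CILIUM_POD:myCA.crt myCA.crt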

Create an Internal Certificate Authority (CA)

Generate CA private key named ‘myCA.key’:

$ openssl genrsa -des3 -out myCA.key 2048

Enter any password, just remember it for some of the later steps.

Generate CA certificate from the private key:

$ openssl req -x509 -new -nodes -key myCA.key -sha256 -days 1825 -out myCA.crt

The values you enter for each prompt do not need to be any specific value, and do not need to be accurate.

Create Private Key and Certificate Signing Request for a Given DNS Name

Generate an internal private key and a certificate signing request with a common name that matches the DNS name of the destination service to be intercepted for inspection (in this example, use artii.herokuapp.com).

First create the private key:

$ openssl genrsa -out internal-artii.key 2048

Next, create a certificate signing request, specifying the DNS name of the destination service for the common name field when prompted. All other prompts can be filled with any value.

$ openssl req -new -key internal-artii.key -out internal-artii.csr

The only field that must be a specific value is ensuring that Common Name is the exact DNS destination artii.herokuapp.com that will be provided to the client.

This example workflow will work for any DNS name as long as the toFQDNs rule in the policy YAML (below) is also updated to match the DNS name in the certificate.
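
If you prefer to avoid the interactive prompts entirely, openssl also lets you set the subject directly on the command line; for example, for the DNS name used in this guide:

$ openssl req -new -key internal-artii.key -out internal-artii.csr -subj "/CN=artii.herokuapp.com"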

Use CA to Generate a Signed Certificate for the DNS Name

Use the internal CA private key to create a signed certificate for artii.herokuapp.com named internal-artii.crt.

$ openssl x509 -req -days 360 -in internal-artii.csr -CA myCA.crt -CAkey myCA.key -CAcreateserial -out internal-artii.crt -sha256

Next we create a Kubernetes secret that includes both the private key and signed certificates for the destination service:

$ kubectl create secret tls artii-tls-data -n kube-system --cert=internal-artii.crt --key=internal-artii.key
Add the Internal CA as a Trusted CA Inside the Client Pod

Once the CA certificate is inside the client pod, we still must make sure that the CA file is picked up by the TLS library used by your application. Most Linux applications automatically use a set of trusted CA certificates that are bundled along with the Linux distro. In this guide, we are using an Ubuntu container as the client, and so we will update it with Ubuntu-specific instructions. Other Linux distros will have different mechanisms. Also, individual applications may leverage their own certificate stores rather than use the OS certificate store; Java applications and the aws-cli are two common examples. Please refer to the application or application runtime documentation for more details.

For Ubuntu, we first copy the additional CA certificate to the client pod filesystem:

$ kubectl cp myCA.crt default/mediabot:/usr/local/share/ca-certificates/myCA.crt

Then run the Ubuntu-specific utility that adds this certificate to the global set of trusted certificate authorities in /etc/ssl/certs/ca-certificates.crt .

$ kubectl exec mediabot -- update-ca-certificates

This command will issue a WARNING, but this can be ignored.

Provide Cilium with List of Trusted CAs

Next, we will provide Cilium with the set of CAs that it should trust when originating the secondary TLS connections. This list should correspond to the standard set of global CAs that your organization trusts. A logical option for this is the standard CAs that are trusted by your operating system, since this is the set of CAs that were being used prior to introducing TLS inspection.

To keep things simple, in this example we will simply copy this list out of the Ubuntu filesystem of the mediabot pod, though it is important to understand that this list of trusted CAs is not specific to a particular TLS client or server, and so this step need only be performed once regardless of how many TLS clients or servers are involved in TLS inspection.

$ kubectl cp default/mediabot:/etc/ssl/certs/ca-certificates.crt ca-certificates.crt

We then will create a Kubernetes secret using this certificate bundle so that Cilium can read the certificate bundle and use it to validate outgoing TLS connections.

$ kubectl create secret generic tls-orig-data -n kube-system --from-file=./ca-certificates.crt
Apply DNS and TLS-aware Egress Policy

Up to this point, we have created keys and certificates to enable TLS inspection, but we have not told Cilium which traffic we want to intercept and inspect. This is done using the same Cilium Network Policy constructs that are used for other Cilium Network Policies.

The following Cilium network policy indicates that Cilium should perform HTTP-aware inspection of communication from the mediabot pod to artii.herokuapp.com.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "L7 policy with TLS"
metadata:
  name: "l7-visibility-tls"
spec:
  endpointSelector:
    matchLabels:
      org: empire
      class: mediabot
  egress:
  - toFQDNs:
    - matchName: "artii.herokuapp.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: "TCP"
      terminatingTLS:
        secret:
          namespace: "kube-system"
          name: "artii-tls-data"
      originatingTLS:
        secret:
          namespace: "kube-system"
          name: "tls-orig-data"
      rules:
        http:
        - {}
  - toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
          - matchPattern: "*"

Let’s take a closer look at the policy:

  • The endpointSelector means that this policy applies only to pods with labels class: mediabot, org: empire.
  • The first egress section uses the toFQDNs: matchName specification to allow TCP port 443 egress to artii.herokuapp.com.
  • The http section below the toFQDNs rule indicates that such connections should be parsed as HTTP, with a policy of {} which will allow all requests.
  • The terminatingTLS and originatingTLS sections indicate that TLS interception should be used to terminate the initial TLS connection from mediabot and initiate a new outbound TLS connection to artii.herokuapp.com.
  • The second egress section allows mediabot pods to access the kube-dns service. Note that rules: dns instructs Cilium to inspect and allow DNS lookups matching the specified patterns. In this case, inspect and allow all DNS queries.

Note that with this policy the mediabot doesn’t have access to any internal cluster service other than kube-dns and will have no access to any other external destinations either. Refer to Network Policy to learn more about policies for controlling access to internal cluster services.

Let’s apply the policy:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-tls-inspection/l7-visibility-tls.yaml
Demonstrating TLS Inspection

Recall that the policy we pushed will allow all HTTPS requests from mediabot to artii.herokuapp.com, but will parse all data at the HTTP-layer, meaning that cilium monitor will report each HTTP request and response.

To see this, open a new window and identify the name of the cilium pod (e.g., cilium-97s78) that is running on the same Kubernetes worker node as the mediabot pod.
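
One way to do this (the output will differ in your cluster) is to note the node that mediabot is scheduled on and then list the Cilium pods together with their nodes:

$ kubectl get pod mediabot -o wide
$ kubectl -n kube-system get pods -l k8s-app=cilium -o wide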

Then start running cilium monitor in “L7 mode” to monitor for HTTP requests being reported by Cilium:

kubectl exec -it -n kube-system cilium-d5x8v -- cilium monitor -t l7

Next in the original window, from the mediabot pod we can access artii.herokuapp.com via HTTPS:

$ kubectl exec -it mediabot -- curl -sL 'https://artii.herokuapp.com/fonts_list'
...
...

$ kubectl exec -it mediabot -- curl -sL 'https://artii.herokuapp.com/make?text=cilium&font=univers'
...
...

Looking back at the cilium monitor window, you will see each individual HTTP request and response. For example:

-> Request http from 1975 ([k8s:app=public-service k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=tenant-a]) to 0 ([reserved:world]), identity 56418->2, verdict Forwarded GET https://www.lyft.com/privacy => 0
-> Response http to 1975 ([k8s:app=public-service k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=tenant-a]) from 0 ([reserved:world]), identity 56418->2, verdict Forwarded GET https://www.lyft.com/privacy => 200

Refer to Layer 4 Examples and Layer 7 Examples to learn more about Cilium L4 and L7 network policies.

Clean-up
$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-dns/dns-sw-app.yaml
$ kubectl delete cnp l7-visibility-tls

Securing a Kafka cluster

This document serves as an introduction to using Cilium to enforce Kafka-aware security policies. It is a detailed walk-through of getting a single-node Cilium environment running on your machine. It is designed to take 15-30 minutes.

If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.

The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

Setup Cilium

If you have not set up Cilium yet, pick any installation method as described in section Installation to set up Cilium for your Kubernetes environment. If in doubt, pick Getting Started Using Minikube as the simplest way to set up a Kubernetes cluster with Cilium:

minikube start --network-plugin=cni --memory=4096
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml
Deploy the Demo Application

Now that we have Cilium deployed and kube-dns operating correctly we can deploy our demo Kafka application. Since our first demo of Cilium + HTTP-aware security policies was Star Wars-themed we decided to do the same for Kafka. While the HTTP-aware Cilium Star Wars demo showed how the Galactic Empire used HTTP-aware security policies to protect the Death Star from the Rebel Alliance, this Kafka demo shows how the lack of Kafka-aware security policies allowed the Rebels to steal the Death Star plans in the first place.

Kafka is a powerful platform for passing datastreams between different components of an application. A cluster of “Kafka brokers” connect nodes that “produce” data into a data stream, or “consume” data from a datastream. Kafka refers to each datastream as a “topic”. Because scalable and highly-available Kafka clusters are non-trivial to run, the same cluster of Kafka brokers often handles many different topics at once (read this Introduction to Kafka for more background).

In our simple example, the Empire uses a Kafka cluster to handle two different topics:

  • empire-announce : Used to broadcast announcements to sites spread across the galaxy
  • deathstar-plans : Used by a small group of sites coordinating on building the ultimate battlestation.

To keep the setup small, we will just launch a small number of pods to represent this setup:

  • kafka-broker : A single pod running Kafka and Zookeeper representing the Kafka cluster (label app=kafka).
  • empire-hq : A pod representing the Empire’s Headquarters, which is the only pod that should produce messages to empire-announce or deathstar-plans (label app=empire-hq).
  • empire-backup : A secure backup facility located in Scarif , which is allowed to “consume” from the secret deathstar-plans topic (label app=empire-backup).
  • empire-outpost-8888 : A random outpost in the empire. It needs to “consume” messages from the empire-announce topic (label app=empire-outpost).
  • empire-outpost-9999 : Another random outpost in the empire that “consumes” messages from the empire-announce topic (label app=empire-outpost).

All pods other than kafka-broker are Kafka clients, which need access to the kafka-broker container on TCP port 9092 in order to send Kafka protocol messages.

_images/cilium_kafka_gsg_topology.png

The file kafka-sw-app.yaml contains a Kubernetes Deployment for each of the pods described above, as well as a Kubernetes Service for both Kafka and Zookeeper.

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-kafka/kafka-sw-app.yaml
deployment "kafka-broker" created
deployment "zookeeper" created
service "zook" created
service "kafka-service" created
deployment "empire-hq" created
deployment "empire-outpost-8888" created
deployment "empire-outpost-9999" created
deployment "empire-backup" created

Kubernetes will deploy the pods and service in the background. Running kubectl get svc,pods will inform you about the progress of the operation. Each pod will go through several states until it reaches Running at which point the setup is ready.

$ kubectl get svc,pods
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kafka-service   ClusterIP   None            <none>        9092/TCP   2m
kubernetes      ClusterIP   10.96.0.1       <none>        443/TCP    10m
zook            ClusterIP   10.97.250.131   <none>        2181/TCP   2m

NAME                                   READY     STATUS    RESTARTS   AGE
empire-backup-6f4567d5fd-gcrvg         1/1       Running   0          2m
empire-hq-59475b4b64-mrdww             1/1       Running   0          2m
empire-outpost-8888-78dffd49fb-tnnhf   1/1       Running   0          2m
empire-outpost-9999-7dd9fc5f5b-xp6jw   1/1       Running   0          2m
kafka-broker-b874c78fd-jdwqf           1/1       Running   0          2m
zookeeper-85f64b8cd4-nprck             1/1       Running   0          2m
Setup Client Terminals

First we will open a set of windows to represent the different Kafka clients discussed above. For consistency, we recommend opening them in the pattern shown in the image below, but this is optional.

_images/cilium_kafka_gsg_terminal_layout.png

In each window, use copy-paste to have each terminal provide a shell inside each pod.

empire-hq terminal:

$ HQ_POD=$(kubectl get pods -l app=empire-hq -o jsonpath='{.items[0].metadata.name}') && kubectl exec -it $HQ_POD -- sh -c "PS1=\"empire-hq $\" /bin/bash"

empire-backup terminal:

$ BACKUP_POD=$(kubectl get pods -l app=empire-backup -o jsonpath='{.items[0].metadata.name}') && kubectl exec -it $BACKUP_POD -- sh -c "PS1=\"empire-backup $\" /bin/bash"

outpost-8888 terminal:

$ OUTPOST_8888_POD=$(kubectl get pods -l outpostid=8888 -o jsonpath='{.items[0].metadata.name}') && kubectl exec -it $OUTPOST_8888_POD -- sh -c "PS1=\"outpost-8888 $\" /bin/bash"

outpost-9999 terminal:

$ OUTPOST_9999_POD=$(kubectl get pods -l outpostid=9999 -o jsonpath='{.items[0].metadata.name}') && kubectl exec -it $OUTPOST_9999_POD -- sh -c "PS1=\"outpost-9999 $\" /bin/bash"
Test Basic Kafka Produce & Consume

First, let’s start the consumer clients listening to their respective Kafka topics. All of the consumer commands below will hang intentionally, waiting to print data they consume from the Kafka topic:

In the empire-backup window, start listening on the top-secret deathstar-plans topic:

$ ./kafka-consume.sh --topic deathstar-plans

In the outpost-8888 window, start listening to empire-announce:

$ ./kafka-consume.sh --topic empire-announce

Do the same in the outpost-9999 window:

$ ./kafka-consume.sh --topic empire-announce

Now from the empire-hq, first produce a message to the empire-announce topic:

$ echo "Happy 40th Birthday to General Tagge" | ./kafka-produce.sh --topic empire-announce

This message will be posted to the empire-announce topic and shows up in both the outpost-8888 and outpost-9999 windows, which consume that topic. It will not show up in empire-backup.

empire-hq can also post a version of the top-secret deathstar plans to the deathstar-plans topic:

$ echo "deathstar reactor design v3" | ./kafka-produce.sh --topic deathstar-plans

This message shows up in the empire-backup window, but not for the outposts.

Congratulations, Kafka is working as expected :)

The Danger of a Compromised Kafka Client

But what if a rebel spy gains access to any of the remote outposts that act as Kafka clients? Since every client has access to the Kafka broker on port 9092, it can do some bad stuff. For starters, the outpost container can actually switch roles from a consumer to a producer, sending “malicious” data to all other consumers on the topic.

To prove this, kill the existing kafka-consume.sh command in the outpost-9999 window by typing control-C and instead run:

$ echo "Vader Booed at Empire Karaoke Party" | ./kafka-produce.sh --topic empire-announce

Uh oh! Outpost-8888 and all of the other outposts in the empire have now received this fake announcement.

But even more nasty from a security perspective is that the outpost container can access any topic on the kafka-broker.

In the outpost-9999 container, run:

$ ./kafka-consume.sh --topic deathstar-plans
"deathstar reactor design v3"

We see that any outpost can actually access the secret deathstar plans. Now we know how the rebels got access to them!

Securing Access to Kafka with Cilium

Obviously, it would be much more secure to limit each pod’s access to the Kafka broker to be least privilege (i.e., only what is needed for the app to operate correctly and nothing more).

We can do that with the following Cilium security policy. As with Cilium HTTP policies, we can write policies that identify pods by labels, and then limit the traffic in/out of this pod. In this case, we’ll create a policy that identifies the exact traffic that should be allowed to reach the Kafka broker, and deny the rest.

As an example, a policy could limit containers with label app=empire-outpost to only be able to consume topic empire-announce, but would block any attempt by a compromised container (e.g., empire-outpost-9999) from producing to empire-announce or consuming from deathstar-plans.

_images/cilium_kafka_gsg_attack.png

Here is the CiliumNetworkPolicy rule that limits access of pods with label app=empire-outpost to only consume on topic empire-announce:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "enable outposts to consume empire-announce"
metadata:
  name: "rule2"
spec:
  endpointSelector:
    matchLabels:
      app: kafka
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: empire-outpost
    toPorts:
    - ports:
      - port: "9092"
        protocol: TCP
      rules:
        kafka:
        - role: "consume"
          topic: "empire-announce"

A CiliumNetworkPolicy contains a list of rules that define allowed requests, meaning that requests that do not match any rules are denied as invalid.

The above rule applies to inbound (i.e., “ingress”) connections to kafka-broker pods (as indicated by “app: kafka” in the “endpointSelector” section). The rule will apply to connections from pods with label “app: empire-outpost” as indicated by the “fromEndpoints” section. The rule explicitly matches Kafka connections destined to TCP port 9092, and allows only the listed Kafka actions; in this case, consuming from topic empire-announce.

The full policy adds two additional rules that permit the legitimate “produce” (topic empire-announce and topic deathstar-plans) from empire-hq and the legitimate consume (topic = “deathstar-plans”) from empire-backup. The full policy can be reviewed by opening the URL in the command below in a browser.
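
If you prefer to stay in the terminal, the same full policy file can also be fetched and reviewed with curl (this uses the same URL as the kubectl command below):

$ curl -s https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-kafka/kafka-sw-security-policy.yaml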

Apply this Kafka-aware network security policy using kubectl in the main window:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-kafka/kafka-sw-security-policy.yaml

If we then again try to produce a message from outpost-9999 to empire-announce, it is denied. Type control-C and then run:

$ echo "Vader Trips on His Own Cape" | ./kafka-produce.sh --topic empire-announce
>>[2018-04-10 23:50:34,638] ERROR Error when sending message to topic empire-announce with key: null, value: 27 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [empire-announce]

This is because the policy does not allow messages with role = “produce” for topic “empire-announce” from containers with label app = empire-outpost. It’s worth noting that the message is not simply dropped (which could easily be confused with a network error); instead, Cilium responds with the Kafka access denied error (similar to how HTTP would return a 403 error code).
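
If you want to watch Cilium make these allow/deny decisions in real time, you can follow the L7 flow log from inside the Cilium pod while re-running the produce command. This is a minimal sketch, assuming Cilium runs in the kube-system namespace as in the Minikube guide; the same technique is shown in the Cassandra guide later in this document:

$ CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
$ kubectl exec -it -n kube-system $CILIUM_POD -- cilium monitor -t l7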

Likewise, if the outpost container ever tries to consume from topic deathstar-plans, it is denied, as role = consume is only allowed for topic empire-announce.

To test, from the outpost-9999 terminal, run:

$ ./kafka-consume.sh --topic deathstar-plans
[2018-04-10 23:51:12,956] WARN Error while fetching metadata with correlation id 2 : {deathstar-plans=TOPIC_AUTHORIZATION_FAILED} (org.apache.kafka.clients.NetworkClient)

This is blocked as well, thanks to the Cilium network policy. Imagine how different things would have been if the empire had been using Cilium from the beginning!

Clean Up

You have now installed Cilium, deployed a demo app, and tested L7 Kafka-aware network security policies. To clean up, run:

$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-kafka/kafka-sw-app.yaml
$ kubectl delete cnp secure-empire-kafka

After this, you can re-run the tutorial from Step 1.

How to secure gRPC

This document serves as an introduction to using Cilium to enforce gRPC-aware security policies. It is a detailed walk-through of getting a single-node Cilium environment running on your machine. It is designed to take 15-30 minutes.

If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.

The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

Setup Cilium

If you have not set up Cilium yet, pick any installation method as described in section Installation to set up Cilium for your Kubernetes environment. If in doubt, pick Getting Started Using Minikube as the simplest way to set up a Kubernetes cluster with Cilium:

minikube start --network-plugin=cni --memory=4096
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml

It is important for this demo that kube-dns is working correctly. To know the status of kube-dns you can run the following command:

$ kubectl get deployment kube-dns -n kube-system
NAME       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-dns   1         1         1            1           13h

At least one kube-dns pod should be listed as available.
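
You can also verify that the Cilium agent pods themselves are ready (a quick check, assuming the default kube-system deployment created by the quick-install manifest):

$ kubectl get pods -n kube-system -l k8s-app=cilium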

Deploy the Demo Application

Now that we have Cilium deployed and kube-dns operating correctly we can deploy our demo gRPC application. Since our first demo of Cilium + HTTP-aware security policies was Star Wars-themed, we decided to do the same for gRPC. While the HTTP-aware Cilium Star Wars demo showed how the Galactic Empire used HTTP-aware security policies to protect the Death Star from the Rebel Alliance, this gRPC demo shows how the lack of gRPC-aware security policies allowed Leia, Chewbacca, Lando, C-3PO, and R2-D2 to escape from Cloud City, which had been overtaken by empire forces.

gRPC is a high-performance RPC framework built on top of the protobuf serialization/deserialization library popularized by Google. There are gRPC bindings for many programming languages, and the efficiency of protobuf parsing as well as the advantages of leveraging HTTP/2 as a transport make it a popular RPC framework for those building new microservices from scratch.

For those unfamiliar with the details of the movie, Leia and the other rebels are fleeing storm troopers and trying to reach the space port platform where the Millennium Falcon is parked, so they can fly out of Cloud City. However, the door to the platform is closed, and the access code has been changed. Fortunately, R2-D2 is able to access the Cloud City computer system via a public terminal, disable this security, and open the door, letting the Rebels reach the Millennium Falcon just in time to escape.

_images/cilium_grpc_gsg_r2d2_terminal.png

In our example, Cloud City’s internal computer system is built as a set of gRPC-based microservices (who knew that gRPC was actually invented a long time ago, in a galaxy far, far away?).

With gRPC, each service is defined using a language independent protocol buffer definition. Here is the definition for the system used to manage doors within Cloud City:

package cloudcity;

// The door manager service definition.
service DoorManager {

  // Get human readable name of door.
  rpc GetName(DoorRequest) returns (DoorNameReply) {}

  // Find the location of this door.
  rpc GetLocation (DoorRequest) returns (DoorLocationReply) {}

  // Find out whether door is open or closed
  rpc GetStatus(DoorRequest) returns (DoorStatusReply) {}

  // Request maintenance on the door
  rpc RequestMaintenance(DoorMaintRequest) returns (DoorActionReply) {}

  // Set Access Code to Open / Lock the door
  rpc SetAccessCode(DoorAccessCodeRequest) returns (DoorActionReply) {}

}

To keep the setup small, we will just launch two pods to represent this setup:

  • cc-door-mgr: A single pod running the gRPC door manager service with label app=cc-door-mgr.
  • terminal-87: One of the public network access terminals scattered across Cloud City. R2-D2 plugs into terminal-87 as the rebels are desperately trying to escape. This terminal (label app=public-terminal) uses the gRPC client code to communicate with the door manager service.
_images/cilium_grpc_gsg_topology.png

The file cc-door-app.yaml contains a Kubernetes Deployment for the door manager service, a Kubernetes Pod representing terminal-87, and a Kubernetes Service for the door manager services. To deploy this example app, run:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-grpc/cc-door-app.yaml
deployment "cc-door-mgr" created
service "cc-door-server" created
pod "terminal-87" created

Kubernetes will deploy the pods and service in the background. Running kubectl get svc,pods will inform you about the progress of the operation. Each pod will go through several states until it reaches Running at which point the setup is ready.

$ kubectl get pods,svc
NAME                                 READY     STATUS    RESTARTS   AGE
po/cc-door-mgr-3590146619-cv4jn      1/1       Running   0          1m
po/terminal-87                       1/1       Running   0          1m

NAME                 CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
svc/cc-door-server   10.0.0.72    <none>        50051/TCP   1m
svc/kubernetes       10.0.0.1     <none>        443/TCP     6m
Test Access Between gRPC Client and Server

First, let’s confirm that the public terminal can properly act as a client to the door service. We can test this by running a Python gRPC client for the door service that exists in the terminal-87 container.

We’ll invoke the ‘cc_door_client’ with the name of the gRPC method to call, and any parameters (in this case, the door-id):

$ kubectl exec terminal-87 -- python3 /cloudcity/cc_door_client.py GetName 1
Door name is: Spaceport Door #1

$ kubectl exec terminal-87 -- python3 /cloudcity/cc_door_client.py GetLocation 1
Door location is lat = 10.222200393676758 long = 68.87879943847656

Exposing this information to public terminals seems quite useful, as it helps travelers new to Cloud City identify and locate different doors. But recall that the door service also exposes several other methods, including SetAccessCode. If access to the door manager service is protected only using traditional IP and port-based firewalling, the TCP port of the service (50051 in this example) must be left wide open to allow legitimate calls like GetName and GetLocation, which leaves more sensitive calls like SetAccessCode exposed as well. It is this mismatch between the coarse granularity of traditional firewalls and the fine-grained nature of gRPC calls that R2-D2 exploited to override the security and help the rebels escape.

To see this, run:

$ kubectl exec terminal-87 -- python3 /cloudcity/cc_door_client.py SetAccessCode 1 999
Successfully set AccessCode to 999
Securing Access to a gRPC Service with Cilium

Once the legitimate owners of Cloud City recover the city from the empire, how can they use Cilium to plug this key security hole and block requests to SetAccessCode and GetStatus while still allowing GetName, GetLocation, and RequestMaintenance?

_images/cilium_grpc_gsg_policy.png

Since gRPC builds on top of HTTP, this can be achieved easily by understanding how a gRPC call is mapped to an HTTP URL, and then applying a Cilium HTTP-aware filter to allow public terminals to invoke only a subset of the gRPC methods available on the door service.

Each gRPC method is mapped to an HTTP POST call to a URL of the form /cloudcity.DoorManager/<method-name>.
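
Schematically (this is an illustration of the mapping, not a command to run), the GetName call used earlier is carried as an HTTP/2 request of the form:

:method = POST
:path = /cloudcity.DoorManager/GetName
content-type = application/grpc

The HTTP rules in the policy below match on exactly this method and path.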

As a result, the following CiliumNetworkPolicy rule limits access of pods with label app=public-terminal to only invoke GetName, GetLocation, and RequestMaintenance on the door service, identified by label app=cc-door-mgr:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "L7 policy to allow public terminals to call GetName, GetLocation, and RequestMaintenance, but not GetState, or SetAccessCode on the Door Manager Service"
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      app: cc-door-mgr 
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: public-terminal
    toPorts:
    - ports:
      - port: "50051"
        protocol: TCP
      rules:
        http:
        - method: "POST" 
          path: "/cloudcity.DoorManager/GetName"
        - method: "POST" 
          path: "/cloudcity.DoorManager/GetLocation"
        - method: "POST" 
          path: "/cloudcity.DoorManager/RequestMaintenance"

A CiliumNetworkPolicy contains a list of rules that define allowed requests, meaning that requests that do not match any rules (e.g., SetAccessCode) are denied as invalid.

The above rule applies to inbound (i.e., “ingress”) connections to cc-door-mgr pods (as indicated by app: cc-door-mgr in the “endpointSelector” section). The rule will apply to connections from pods with label app: public-terminal as indicated by the “fromEndpoints” section. The rule explicitly matches gRPC connections destined to TCP port 50051, and whitelists only the specifically permitted URLs.

Apply this gRPC-aware network security policy using kubectl in the main window:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-grpc/cc-door-ingress-security.yaml
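
To confirm that the policy was accepted, you can list it with kubectl (the policy name rule1 comes from the metadata in the YAML above):

$ kubectl get cnp rule1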

After this security policy is in place, access to the innocuous calls like GetLocation still works as intended:

$ kubectl exec terminal-87 -- python3 /cloudcity/cc_door_client.py GetLocation 1
Door location is lat = 10.222200393676758 long = 68.87879943847656

However, if we then again try to invoke SetAccessCode, it is denied:

$ kubectl exec terminal-87 -- python3 /cloudcity/cc_door_client.py SetAccessCode 1 999

Traceback (most recent call last):
  File "/cloudcity/cc_door_client.py", line 71, in <module>
    run()
  File "/cloudcity/cc_door_client.py", line 53, in run
    door_id=int(arg2), access_code=int(arg3)))
  File "/usr/local/lib/python3.4/dist-packages/grpc/_channel.py", line 492, in __call__
    return _end_unary_response_blocking(state, call, False, deadline)
  File "/usr/local/lib/python3.4/dist-packages/grpc/_channel.py", line 440, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.CANCELLED, Received http2 header with status: 403)>

This is now blocked, thanks to the Cilium network policy. Notice that unlike a traditional firewall, which would just drop packets in a way indistinguishable from a network failure, Cilium operates at the API layer and can explicitly reply with a custom HTTP 403 error, indicating that the request was intentionally denied for security reasons.

Thank goodness that the empire IT staff hadn’t had time to deploy Cilium on Cloud City’s internal network prior to the escape attempt, or things might have turned out quite differently for Leia and the other Rebels!

Clean-Up

You have now installed Cilium, deployed a demo app, and tested L7 gRPC-aware network security policies. To clean up, run:

$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-grpc/cc-door-app.yaml
$ kubectl delete cnp rule1

After this, you can re-run the tutorial from Step 1.

Getting Started Securing Elasticsearch

This document serves as an introduction for using Cilium to enforce Elasticsearch-aware security policies. It is a detailed walk-through of getting a single-node Cilium environment running on your machine. It is designed to take 15-30 minutes.

If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.

The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

Setup Cilium

If you have not set up Cilium yet, pick any installation method as described in section Installation to set up Cilium for your Kubernetes environment. If in doubt, pick Getting Started Using Minikube as the simplest way to set up a Kubernetes cluster with Cilium:

minikube start --network-plugin=cni --memory=4096
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml
Deploy the Demo Application

Following the Cilium tradition, we will use a Star Wars-inspired example. The Empire has a large scale Elasticsearch cluster which is used for storing a variety of data including:

  • index: troop_logs: Stormtrooper performance logs collected from every outpost, used to identify and eliminate weak performers!
  • index: spaceship_diagnostics: Spaceship diagnostics data collected from every spaceship, used for R&D and improvement of the spaceships.

Every outpost has an Elasticsearch client service to upload the Stormtrooper logs, and every spaceship has a service to upload diagnostics. Similarly, the Empire headquarters has a service to search and analyze the troop logs and spaceship diagnostics data. Before we look into the security concerns, let’s first create this application scenario in minikube.

Deploy the app using the command below, which will create:

  • An elasticsearch service with the selector label component: elasticsearch and a pod running Elasticsearch.
  • Three Elasticsearch clients, one each for empire-hq, outpost, and spaceship.

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-es/es-sw-app.yaml
serviceaccount "elasticsearch" created
service "elasticsearch" created
replicationcontroller "es" created
role "elasticsearch" created
rolebinding "elasticsearch" created
pod "outpost" created
pod "empire-hq" created
pod "spaceship" created
$ kubectl get svc,pods
NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                           AGE
svc/elasticsearch   NodePort    10.111.238.254   <none>        9200:30130/TCP,9300:31721/TCP     2d
svc/etcd-cilium     NodePort    10.98.67.60      <none>        32379:31079/TCP,32380:31080/TCP   9d
svc/kubernetes      ClusterIP   10.96.0.1        <none>        443/TCP                           9d

NAME               READY     STATUS    RESTARTS   AGE
po/empire-hq       1/1       Running   0          2d
po/es-g9qk2        1/1       Running   0          2d
po/etcd-cilium-0   1/1       Running   0          9d
po/outpost         1/1       Running   0          2d
po/spaceship       1/1       Running   0          2d
Security Risks for Elasticsearch Access

For Elasticsearch clusters, the least-privilege security challenge is to give clients access only to particular indices, and to limit the operations each client is allowed to perform on each index. In this example, the outpost Elasticsearch clients only need access to upload troop logs, and the empire-hq client only needs search access to both indices. From a security perspective, the outposts are weak spots susceptible to being captured by the rebels. Once compromised, their clients can be used to search and manipulate the critical data in Elasticsearch. We can simulate this attack, but first let’s run the commands that represent legitimate behavior for all the client services.

outpost client uploading troop logs

$ kubectl exec outpost -- python upload_logs.py
Uploading Stormtroopers Performance Logs
created :  {'_index': 'troop_logs', '_type': 'log', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, 'created': True}

spaceship uploading diagnostics

$ kubectl exec spaceship -- python upload_diagnostics.py
Uploading Spaceship Diagnostics
created :  {'_index': 'spaceship_diagnostics', '_type': 'stats', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, 'created': True}

empire-hq running search queries for logs and diagnostics

$ kubectl exec empire-hq -- python search.py
Searching for Spaceship Diagnostics
Got 1 Hits:
{'_index': 'spaceship_diagnostics', '_type': 'stats', '_id': '1', '_score': 1.0, \
 '_source': {'spaceshipid': '3459B78XNZTF', 'type': 'tiefighter', 'title': 'Engine Diagnostics', \
             'stats': '[CRITICAL] [ENGINE BURN @SPEED 5000 km/s] [CHANCE 80%]'}}
Searching for Stormtroopers Performance Logs
Got 1 Hits:
{'_index': 'troop_logs', '_type': 'log', '_id': '1', '_score': 1.0, \
 '_source': {'outpost': 'Endor', 'datetime': '33 ABY 4AM DST', 'title': 'Endor Corps 1: Morning Drill', \
             'notes': '5100 PRESENT; 15 ABSENT; 130 CODE-RED BELOW PAR PERFORMANCE'}}

Now imagine an outpost captured by the rebels. In the commands below, the rebels first search all the indices and then manipulate the diagnostics data from a compromised outpost.

$ kubectl exec outpost -- python search.py
Searching for Spaceship Diagnostics
Got 1 Hits:
{'_index': 'spaceship_diagnostics', '_type': 'stats', '_id': '1', '_score': 1.0, \
 '_source': {'spaceshipid': '3459B78XNZTF', 'type': 'tiefighter', 'title': 'Engine Diagnostics', \
             'stats': '[CRITICAL] [ENGINE BURN @SPEED 5000 km/s] [CHANCE 80%]'}}
Searching for Stormtroopers Performance Logs
Got 1 Hits:
{'_index': 'troop_logs', '_type': 'log', '_id': '1', '_score': 1.0, \
 '_source': {'outpost': 'Endor', 'datetime': '33 ABY 4AM DST', 'title': 'Endor Corps 1: Morning Drill', \
             'notes': '5100 PRESENT; 15 ABSENT; 130 CODE-RED BELOW PAR PERFORMANCE'}}

Rebels manipulate spaceship diagnostics data so that the spaceship defects are not known to the empire-hq! (Hint: Rebels have changed the stats for the tiefighter spaceship, a change hard to detect but with adverse impact!)

$ kubectl exec outpost -- python update.py
Uploading Spaceship Diagnostics
{'_index': 'spaceship_diagnostics', '_type': 'stats', '_id': '1', '_score': 1.0, \
 '_source': {'spaceshipid': '3459B78XNZTF', 'type': 'tiefighter', 'title': 'Engine Diagnostics', \
             'stats': '[OK] [ENGINE OK @SPEED 5000 km/s]'}}
Securing Elasticsearch Using Cilium
_images/cilium_es_gsg_topology.png

Following the least-privilege security principle, we want to allow the following legitimate actions and nothing more:

  • outpost service only has upload access to index: troop_logs
  • spaceship service only has upload access to index: spaceship_diagnostics
  • empire-hq service only has search access to both indices

Fortunately, the Empire DevOps team is using Cilium for their Kubernetes cluster. Cilium provides L7 visibility and security policies to control Elasticsearch API access. Cilium follows the white-list, least privilege model for security. That is to say, a CiliumNetworkPolicy contains a list of rules that define allowed requests and any request that does not match the rules is denied.

In this example, the policy rules are defined for inbound traffic (i.e., “ingress”) connections to the elasticsearch service. Note that the backend pods for the service are selected by labels, using the same label-selector concept Kubernetes uses to define a service. In this example, the label component: elasticsearch identifies the pods that are part of the elasticsearch service in Kubernetes.
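
For example, you can list the pods that this label selector will match; in this demo it is the single Elasticsearch pod created above (the pod name suffix will differ in your cluster):

$ kubectl get pods -l component=elasticsearch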

In the policy file below, you will see the following rules for controlling the indices access and actions performed:

  • For fromEndpoints with label app: spaceship, only HTTP PUT is allowed on paths matching the regex ^/spaceship_diagnostics/stats/.*$
  • For fromEndpoints with label app: outpost, only HTTP PUT is allowed on paths matching the regex ^/troop_logs/log/.*$
  • For fromEndpoints with label app: empire-hq, only HTTP GET is allowed on paths matching the regexes ^/spaceship_diagnostics/_search/??.*$ and ^/troop_logs/_search/??.*$

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: secure-empire-elasticsearch
  namespace: default
specs:
- endpointSelector:
    matchLabels:
      component: elasticsearch
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: spaceship
    toPorts:
    - ports:
      - port: "9200"
        protocol: TCP
      rules:
        http:
        - method: ^PUT$
          path: ^/spaceship_diagnostics/stats/.*$
  - fromEndpoints:
    - matchLabels:
        app: empire-hq
    toPorts:
    - ports:
      - port: "9200"
        protocol: TCP
      rules:
        http:
        - method: ^GET$
          path: ^/spaceship_diagnostics/_search/??.*$
        - method: ^GET$
          path: ^/troop_logs/_search/??.*$
  - fromEndpoints:
    - matchLabels:
        app: outpost
    toPorts:
    - ports:
      - port: "9200"
        protocol: TCP
      rules:
        http:
        - method: ^PUT$
          path: ^/troop_logs/log/.*$
- egress:
  - toEndpoints:
    - matchExpressions:
      - key: k8s:io.kubernetes.pod.namespace
        operator: Exists
  - toEntities:
    - cluster
    - host
  endpointSelector: {}
  ingress:
  - {}

Apply this Elasticsearch-aware network security policy using kubectl:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-es/es-sw-policy.yaml
ciliumnetworkpolicy "secure-empire-elasticsearch" created

Let’s test the security policies. First, search access is blocked for both the outpost and spaceship clients, so from a compromised outpost the rebels will not be able to search and obtain knowledge about troops or spaceship diagnostics. Second, the outpost clients don’t have access to create or update the index spaceship_diagnostics.

$ kubectl exec outpost -- python search.py
GET http://elasticsearch:9200/spaceship_diagnostics/_search [status:403 request:0.008s]
...
...
elasticsearch.exceptions.AuthorizationException: TransportError(403, 'Access denied\r\n')
command terminated with exit code 1
$ kubectl exec outpost -- python update.py
PUT http://elasticsearch:9200/spaceship_diagnostics/stats/1 [status:403 request:0.006s]
...
...
elasticsearch.exceptions.AuthorizationException: TransportError(403, 'Access denied\r\n')
command terminated with exit code 1

We can re-run any of the below commands to show that the security policy still allows all legitimate requests (i.e., no 403 errors are returned).

$ kubectl exec outpost -- python upload_logs.py
...
$ kubectl exec spaceship -- python upload_diagnostics.py
...
$ kubectl exec empire-hq -- python search.py
...
Clean Up

You have now installed Cilium, deployed a demo app, and deployed and tested Elasticsearch-aware network security policies. To clean up, run:

$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-es/es-sw-app.yaml
$ kubectl delete cnp secure-empire-elasticsearch

How to Secure a Cassandra Database

This document serves as an introduction to using Cilium to enforce Cassandra-aware security policies. It is a detailed walk-through of getting a single-node Cilium environment running on your machine. It is designed to take 15-30 minutes.

NOTE: Cassandra-aware policy support is still in beta phase. It is not yet ready for production use. Additionally, the Cassandra-specific policy language is highly likely to change in a future Cilium version.

If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.

The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

Setup Cilium

If you have not set up Cilium yet, pick any installation method as described in section Installation to set up Cilium for your Kubernetes environment. If in doubt, pick Getting Started Using Minikube as the simplest way to set up a Kubernetes cluster with Cilium:

minikube start --network-plugin=cni --memory=4096
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml
Deploy the Demo Application

Now that we have Cilium deployed and kube-dns operating correctly we can deploy our demo Cassandra application. Since our first HTTP-aware Cilium Star Wars demo showed how the Galactic Empire used HTTP-aware security policies to protect the Death Star from the Rebel Alliance, this Cassandra demo is Star Wars-themed as well.

Apache Cassandra is a popular NoSQL database focused on delivering high-performance transactions (especially on writes) without sacrificing availability or scale. Cassandra operates as a cluster of servers, and Cassandra clients query these servers via the native Cassandra protocol. Cilium understands the Cassandra protocol, and thus is able to provide deep visibility and control over which clients are able to access particular tables inside a Cassandra cluster, and which actions (e.g., “select”, “insert”, “update”, “delete”) can be performed on those tables.

With Cassandra, each table belongs to a “keyspace”, allowing multiple groups to use a single cluster without conflicting. Cassandra queries specify the full table name qualified by the keyspace using the syntax “<keyspace>.<table>”.

In our simple example, the Empire uses a Cassandra cluster to store two different types of information:

  • Employee Attendance Records : Used to store daily attendance data (attendance.daily_records).
  • Deathstar Scrum Reports : Daily scrum reports from the teams working on the Deathstar (deathstar.scrum_reports).

To keep the setup small, we will just launch a small number of pods to represent this setup:

  • cass-server : A single pod running the Cassandra service, representing a Cassandra cluster (label app=cass-server).
  • empire-hq : A pod representing the Empire’s Headquarters, which is the only pod that should be able to read all attendance data, or read/write the Deathstar scrum notes (label app=empire-hq).
  • empire-outpost : A random outpost in the empire. It should be able to insert employee attendance records, but not read records for other empire facilities. It also should not have any access to the deathstar keyspace (label app=empire-outpost).

All pods other than cass-server are Cassandra clients, which need access to the cass-server container on TCP port 9042 in order to send Cassandra protocol messages.

_images/cilium_cass_gsg_topology.png

The file cass-sw-app.yaml contains a Kubernetes Deployment for each of the pods described above, as well as a Kubernetes Service cassandra-svc for the Cassandra cluster.

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-cassandra/cass-sw-app.yaml
deployment.extensions/cass-server created
service/cassandra-svc created
deployment.extensions/empire-hq created
deployment.extensions/empire-outpost created

Kubernetes will deploy the pods and service in the background. Running kubectl get svc,pods will inform you about the progress of the operation. Each pod will go through several states until it reaches Running at which point the setup is ready.

$ kubectl get svc,pods
NAME                    TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/cassandra-svc   ClusterIP   None         <none>        9042/TCP   1m
service/kubernetes      ClusterIP   10.96.0.1    <none>        443/TCP    15h

NAME                                  READY     STATUS    RESTARTS   AGE
pod/cass-server-5674d5b946-x8v4j      1/1       Running   0          1m
pod/empire-hq-c494c664d-xmvdl         1/1       Running   0          1m
pod/empire-outpost-68bf76858d-flczn   1/1       Running   0          1m
Test Basic Cassandra Access

First, we’ll create the keyspaces and tables mentioned above, and populate them with some initial data:

$  curl -s https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-cassandra/cass-populate-tables.sh | bash

Next, create two environment variables that refer to the empire-hq and empire-outpost pods:

$ HQ_POD=$(kubectl get pods -l app=empire-hq -o jsonpath='{.items[0].metadata.name}')
$ OUTPOST_POD=$(kubectl get pods -l app=empire-outpost -o jsonpath='{.items[0].metadata.name}')

Now we will run the ‘cqlsh’ Cassandra client in the empire-outpost pod, telling it to access the Cassandra cluster identified by the ‘cassandra-svc’ DNS name:

$ kubectl exec -it $OUTPOST_POD cqlsh -- cassandra-svc
Connected to Test Cluster at cassandra-svc:9042.
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>

Next, using the cqlsh prompt, we’ll show that the outpost can add records to the “daily_records” table in the “attendance” keyspace:

cqlsh> INSERT INTO attendance.daily_records (creation, loc_id, present, empire_member_id) values (now(), 074AD3B9-A47D-4EBC-83D3-CAD75B1911CE, true, 6AD3139F-EBFC-4E0C-9F79-8F997BA01D90);

We have confirmed that outposts are able to report daily attendance records as intended. We’re off to a good start!

The Danger of a Compromised Cassandra Client

But what if a rebel spy gains access to any of the remote outposts that act as Cassandra clients? Since every client has access to the Cassandra API on port 9042, it can do some bad stuff. For starters, the outpost container can not only add entries to the attendance.daily_records table, but can also read all entries.

To see this, we can run the following command:

$ cqlsh> SELECT * FROM attendance.daily_records;

  loc_id                               | creation                             | empire_member_id                     | present
 --------------------------------------+--------------------------------------+--------------------------------------+---------
 a855e745-69d8-4159-b8b6-e2bafed8387a | c692ce90-bf57-11e8-98e6-f1a9f45fc4d8 | cee6d956-dbeb-4b09-ad21-1dd93290fa6c |    True
 5b9a7990-657e-442d-a3f7-94484f06696e | c8493120-bf57-11e8-98e6-f1a9f45fc4d8 | e74a0300-94f3-4b3d-aee4-fea85eca5af7 |    True
 53ed94d0-ddac-4b14-8c2f-ba6f83a8218c | c641a150-bf57-11e8-98e6-f1a9f45fc4d8 | 104ddbb6-f2f7-4cd0-8683-cc18cccc1326 |    True
 074ad3b9-a47d-4ebc-83d3-cad75b1911ce | 9674ed40-bf59-11e8-98e6-f1a9f45fc4d8 | 6ad3139f-ebfc-4e0c-9f79-8f997ba01d90 |    True
 fe72cc39-dffb-45dc-8e5f-86c674a58951 | c5e79a70-bf57-11e8-98e6-f1a9f45fc4d8 | 6782689c-0488-4ecb-b582-a2ccd282405e |    True
 461f4176-eb4c-4bcc-a08a-46787ca01af3 | c6fefde0-bf57-11e8-98e6-f1a9f45fc4d8 | 01009199-3d6b-4041-9c43-b1ca9aef021c |    True
 64dbf608-6947-4a23-98e9-63339c413136 | c8096900-bf57-11e8-98e6-f1a9f45fc4d8 | 6ffe024e-beff-4370-a1b5-dcf6330ec82b |    True
 13cefcac-5652-4c69-a3c2-1484671f2467 | c53f4c80-bf57-11e8-98e6-f1a9f45fc4d8 | 55218adc-2f3d-4f84-a693-87a2c238bb26 |    True
 eabf5185-376b-4d4a-a5b5-99f912d98279 | c593fc30-bf57-11e8-98e6-f1a9f45fc4d8 | 5e22159b-f3a9-4f8a-9944-97375df570e9 |    True
 3c0ae2d1-c836-4aa4-8fe2-5db6cc1f92fc | c7af1400-bf57-11e8-98e6-f1a9f45fc4d8 | 0ccb3df7-78d0-4434-8a7f-4bfa8d714275 |    True
 31a292e0-2e28-4a7d-8c84-8d4cf0c57483 | c4e0d8d0-bf57-11e8-98e6-f1a9f45fc4d8 | 8fe7625c-f482-4eb6-b33e-271440777403 |    True

(11 rows)

Uh oh! The rebels now have strategic information about empire troop strengths at each location in the galaxy.

But even more nasty from a security perspective is that the outpost container can also access information in any keyspace, including the deathstar keyspace. For example, run:

$ cqlsh> SELECT * FROM deathstar.scrum_notes;

 empire_member_id                     | content                                                                                                        | creation
--------------------------------------+----------------------------------------------------------------------------------------------------------------+--------------------------------------
34e564c2-781b-477e-acd0-b357d67f94f2 | Designed protective shield for deathstar.  Could be based on nearby moon.  Feature punted to v2.  Not blocked. | c3c8b210-bf57-11e8-98e6-f1a9f45fc4d8
dfa974ea-88cd-4e9b-85e3-542b9d00e2df |   I think the exhaust port could be vulnerable to a direct hit.  Hope no one finds out about it.  Not blocked. | c37f4d00-bf57-11e8-98e6-f1a9f45fc4d8
ee12306a-7b44-46a4-ad68-42e86f0f111e |        Trying to figure out if we should paint it medium grey, light grey, or medium-light grey.  Not blocked. | c32daa90-bf57-11e8-98e6-f1a9f45fc4d8

(3 rows)

We see that any outpost can actually access the deathstar scrum notes, which mentions a pretty serious issue with the exhaust port.

Securing Access to Cassandra with Cilium

Obviously, it would be much more secure to limit each pod’s access to the Cassandra server to be least privilege (i.e., only what is needed for the app to operate correctly and nothing more).

We can do that with the following Cilium security policy. As with Cilium HTTP policies, we can write policies that identify pods by labels, and then limit the traffic in/out of this pod. In this case, we’ll create a policy that identifies the tables that each client should be able to access, the actions that are allowed on those tables, and deny the rest.

As an example, a policy could limit containers with label app=empire-outpost to only be able to insert entries into the table “attendance.daily_records”, but would block any attempt by a compromised outpost to read all attendance information or access other keyspaces.

_images/cilium_cass_gsg_attack.png

Here is the CiliumNetworkPolicy rule that limits access of pods with label app=empire-outpost to only insert records into “attendance.daily_records”:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "Allow only permitted requests to empire Cassandra server"
metadata:
  name: "secure-empire-cassandra"
specs:
  - endpointSelector:
      matchLabels:
        app: cass-server
    ingress:
    - fromEndpoints:
      - matchLabels:
          app: empire-outpost
      toPorts:
      - ports:
        - port: "9042"
          protocol: TCP
        rules:
          l7proto: cassandra
          l7: 
          - query_action: "select"
            query_table: "system\\..*" 
          - query_action: "select"
            query_table: "system_schema\\..*" 
          - query_action: "insert"
            query_table: "attendance.daily_records"
    - fromEndpoints:
      - matchLabels:
          app: empire-hq
      toPorts:
      - ports:
        - port: "9042"
          protocol: TCP
        rules:
          l7proto: cassandra
          l7: 
          - {} 

A CiliumNetworkPolicy contains a list of rules that define allowed requests, meaning that requests that do not match any rules are denied as invalid.

The rule explicitly matches Cassandra connections destined to TCP port 9042 on cass-server pods, and allows query actions like select/insert/update/delete only on a specified set of tables. The rule applies to inbound (i.e., “ingress”) connections to cass-server pods (as indicated by “app: cass-server” in the “endpointSelector” section). Different L7 rules apply depending on whether the client pod has label “app: empire-outpost” or “app: empire-hq”, as indicated by the “fromEndpoints” sections.

The policy limits the empire-outpost pod to performing “select” queries on the “system” and “system_schema” keyspaces (required by cqlsh on startup) and “insert” queries to the “attendance.daily_records” table.

The full policy adds another rule that allows all queries from the empire-hq pod.

Apply this Cassandra-aware network security policy using kubectl in a new window:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-cassandra/cass-sw-security-policy.yaml

If we then again try to perform the attacks from the empire-outpost pod, we’ll see that they are denied:

$ cqlsh> SELECT * FROM attendance.daily_records;
Unauthorized: Error from server: code=2100 [Unauthorized] message="Request Unauthorized"

This is because the policy only permits pods with label app: empire-outpost to insert into attendance.daily_records; it does not permit select on that table, or any action on other tables (with the exception of the system.* and system_schema.* keyspaces). It’s worth noting that the message is not simply dropped (which could easily be confused with a network error); instead, Cilium responds with the Cassandra Unauthorized error message (similar to how HTTP would return a 403 error code).

Likewise, if the outpost pod ever tries to access a table in another keyspace, like deathstar, this request will also be denied:

$ cqlsh> SELECT * FROM deathstar.scrum_notes;
Unauthorized: Error from server: code=2100 [Unauthorized] message="Request Unauthorized"

This is blocked as well, thanks to the Cilium network policy.

Use another window to confirm that the empire-hq pod still has full access to the Cassandra cluster:

$ kubectl exec -it $HQ_POD cqlsh -- cassandra-svc
Connected to Test Cluster at cassandra-svc:9042.
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>

The power of Cilium’s identity-based security allows empire-hq to still have full access to both tables:

$ cqlsh> SELECT * FROM attendance.daily_records;
 loc_id                               | creation                             | empire_member_id                     | present
--------------------------------------+--------------------------------------+--------------------------------------+---------
a855e745-69d8-4159-b8b6-e2bafed8387a | c692ce90-bf57-11e8-98e6-f1a9f45fc4d8 | cee6d956-dbeb-4b09-ad21-1dd93290fa6c |    True

<snip>

(12 rows)

Similarly, empire-hq can still access the deathstar scrum notes:

$ cqlsh> SELECT * FROM deathstar.scrum_notes;

  <snip>

(3 rows)
Cassandra-Aware Visibility (Bonus)

As a bonus, you can re-run the above queries with policy enforced and view how Cilium provides Cassandra-aware visibility, including whether requests are forwarded or denied. First, use “kubectl exec” to access the cilium pod.

$ CILIUM_POD=$(kubectl get pods -n kube-system -l k8s-app=cilium -o jsonpath='{.items[0].metadata.name}')
$ kubectl exec -it -n kube-system $CILIUM_POD /bin/bash
root@minikube:~#

Next, start Cilium monitor, and limit the output to only “l7” type messages using the “-t” flag:

root@minikube:~# cilium monitor -t l7
Listening for events on 2 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit

In the other windows, re-run the above queries, and you will see that Cilium provides full visibility at the level of each Cassandra request, indicating:

  • The Kubernetes label-based identity of both the sending and receiving pod.
  • The details of the Cassandra request, including the ‘query_action’ (e.g., ‘select’, ‘insert’) and ‘query_table’ (e.g., ‘system.local’, ‘attendance.daily_records’)
  • The ‘verdict’ indicating whether the request was allowed by policy (‘Forwarded’ or ‘Denied’).

Example output is below. All requests are from empire-outpost to cass-server. The first two requests are allowed: a ‘select’ on ‘system.local’ and an ‘insert’ into ‘attendance.daily_records’. The last two requests are denied: a ‘select’ on ‘attendance.daily_records’ and a ‘select’ on ‘deathstar.scrum_notes’:

<- Request cassandra from 0 ([k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:app=empire-outpost]) to 64503 ([k8s:app=cass-server k8s:io.kubernetes.pod.namespace=default k8s:io.cilium.k8s.policy.serviceaccount=default]), identity 12443->16168, verdict Forwarded query_table:system.local query_action:select
<- Request cassandra from 0 ([k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:app=empire-outpost]) to 64503 ([k8s:app=cass-server k8s:io.kubernetes.pod.namespace=default k8s:io.cilium.k8s.policy.serviceaccount=default]), identity 12443->16168, verdict Forwarded query_action:insert query_table:attendance.daily_records
<- Request cassandra from 0 ([k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:app=empire-outpost]) to 64503 ([k8s:app=cass-server k8s:io.kubernetes.pod.namespace=default k8s:io.cilium.k8s.policy.serviceaccount=default]), identity 12443->16168, verdict Denied query_action:select query_table:attendance.daily_records
<- Request cassandra from 0 ([k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:app=empire-outpost]) to 64503 ([k8s:app=cass-server k8s:io.kubernetes.pod.namespace=default k8s:io.cilium.k8s.policy.serviceaccount=default]), identity 12443->16168, verdict Denied query_table:deathstar.scrum_notes query_action:select
Clean Up

You have now installed Cilium, deployed a demo app, and tested L7 Cassandra-aware network security policies. To clean up, run:

$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-cassandra/cass-sw-app.yaml
$ kubectl delete cnp secure-empire-cassandra

After this, you can re-run the tutorial from Step 1.

Getting Started Securing Memcached

This document serves as an introduction to using Cilium to enforce memcached-aware security policies. It walks through a single-node Cilium environment running on your machine. It is designed to take 15-30 minutes.

NOTE: memcached-aware policy support is still in beta. It is not yet ready for production use. Additionally, the memcached-specific policy language is highly likely to change in a future Cilium version.

Memcached is a high-performance, distributed memory object caching system. It’s simple yet powerful, and used by dynamic web applications to alleviate database load. Memcached is designed to work efficiently for a very large number of open connections, so clients are encouraged to cache their connections rather than incur the overhead of reopening TCP connections every time they need to store or retrieve data. Many clients can share the performance benefits of this distributed cache.

There are two kinds of data sent in the memcached protocol: text lines and unstructured (binary) data. We will demonstrate clients using both forms of the protocol to communicate with a memcached server.
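
As a purely illustrative example of the text protocol (not part of this demo; it assumes a reachable memcached server and the nc utility), storing and then reading back a key looks like this:

$ printf 'set greeting 0 60 5\r\nhello\r\nget greeting\r\nquit\r\n' | nc <memcached-host> 11211
STORED
VALUE greeting 0 5
hello
END

The binary protocol carries the same commands (get, set, and so on) in a packed binary framing, which is what the python bmemcached client used later in this guide speaks.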

If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.

The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

Setup Cilium

If you have not set up Cilium yet, pick any installation method as described in section Installation to set up Cilium for your Kubernetes environment. If in doubt, pick Getting Started Using Minikube as the simplest way to set up a Kubernetes cluster with Cilium:

minikube start --network-plugin=cni --memory=4096
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml
Step 2: Deploy the Demo Application

Now that we have Cilium deployed and kube-dns operating correctly we can deploy our demo memcached application. Since our first HTTP-aware Cilium demo was based on Star Wars, we continue with the theme for the memcached demo as well.

Ever wonder how the Alliance Fleet manages the changing positions of its ships? The Alliance Fleet uses memcached to store the coordinates of its ships, via the memcached-server Kubernetes service backed by a memcached server. Each ship in the fleet constantly updates its coordinates and can get the coordinates of other ships in the Alliance Fleet.

In this simple example, the Alliance Fleet uses a memcached server for its starfighters to store their own supergalactic coordinates and get those of other starfighters.

In order to avoid collisions and protect against compromised starfighters, memcached commands are limited to gets for any starfighter coordinates and sets only to a key specific to the starfighter. Thus the following operations are allowed:

  • A-wing: can set coordinates to key “awing-coord” and get the key coordinates.
  • X-wing: can set coordinates to key “xwing-coord” and get the key coordinates.
  • Alliance-Tracker: can get any coordinates but not set any.

To keep the setup small, we will launch a small number of pods to represent a larger environment:

  • memcached-server : A Kubernetes service represented by a single pod running a memcached server (label app=memcd-server).
  • a-wing memcached binary client : A pod representing an A-wing starfighter, which can update its coordinates and read it via the binary memcached protocol (label app=a-wing).
  • x-wing memcached text-based client : A pod representing an X-wing starfighter, which can update its coordinates and read it via the text-based memcached protocol (label app=x-wing).
  • alliance-tracker memcached binary client : A pod representing the Alliance Fleet Tracker, able to read the coordinates of all starfighters (label name=fleet-tracker).

Memcached clients access the memcached-server on TCP port 11211 and send memcached protocol messages to it.

_images/cilium_memcd_gsg_topology.png

The file memcd-sw-app.yaml contains a Kubernetes Deployment for each of the pods described above, as well as a Kubernetes Service memcached-server for the Memcached server.

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-memcached/memcd-sw-app.yaml
deployment.extensions/memcached-server created
service/memcached-server created
deployment.extensions/a-wing created
deployment.extensions/x-wing created
deployment.extensions/alliance-tracker created

Kubernetes will deploy the pods and service in the background. Running kubectl get svc,pods will inform you about the progress of the operation. Each pod will go through several states until it reaches Running at which point the setup is ready.

$ kubectl get svc,pods
NAME                       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE
service/kubernetes         ClusterIP   10.96.0.1    <none>        443/TCP     31m
service/memcached-server   ClusterIP   None         <none>        11211/TCP   14m

NAME                                    READY   STATUS    RESTARTS   AGE
pod/a-wing-67db8d5fcc-dpwl4             1/1     Running   0          14m
pod/alliance-tracker-6b6447bd69-sz5hz   1/1     Running   0          14m
pod/memcached-server-bdbfb87cd-8tdh7    1/1     Running   0          14m
pod/x-wing-fd5dfb9d9-wrtwn              1/1     Running   0          14m

We suggest having a main terminal window to execute kubectl commands and two additional terminal windows dedicated to accessing the A-Wing and Alliance-Tracker, which use a python library to communicate with the memcached server using the binary protocol.

In all three terminal windows, set some handy environment variables for the demo with the following script:
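
A minimal version of that script, consistent with the pod labels and the memcached text commands used later in this guide, might look like this (the exact variable values are assumptions inferred from the outputs below):

$ AWING_POD=$(kubectl get pods -l app=a-wing -o jsonpath='{.items[0].metadata.name}')
$ XWING_POD=$(kubectl get pods -l app=x-wing -o jsonpath='{.items[0].metadata.name}')
$ TRACKER_POD=$(kubectl get pods -l name=fleet-tracker -o jsonpath='{.items[0].metadata.name}')
$ SETXC="set xwing-coord 0 1200 16\r\n8893.34,234.3290\r\nquit\r\n"
$ GETXC="get xwing-coord\r\nquit\r\n"
$ GETAC="get awing-coord\r\nquit\r\n"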

In the terminal window dedicated for the A-wing pod, exec in, use python to import the binary memcached library and set the client connection to the memcached server:

$ kubectl exec -ti $AWING_POD sh
# python
Python 3.7.0 (default, Sep  5 2018, 03:25:31)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bmemcached
>>> client = bmemcached.Client(("memcached-server:11211", ))

In the terminal window dedicated for the Alliance-Tracker, exec in, use python to import the binary memcached library and set the client connection to the memcached server:

$ kubectl exec -ti $TRACKER_POD sh
# python
Python 3.7.0 (default, Sep  5 2018, 03:25:31)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bmemcached
>>> client = bmemcached.Client(("memcached-server:11211", ))
Step 3: Test Basic Memcached Access

Let’s show that each client is able to access the memcached server. Execute the following to have the A-wing and X-wing starfighters update the Alliance Fleet memcached-server with their respective supergalactic coordinates:

A-wing will access the memcached-server using the binary protocol. In your terminal window for A-Wing, set A-wing’s coordinates:

>>> client.set("awing-coord","4309.432,918.980",time=2400)
True
>>> client.get("awing-coord")
'4309.432,918.980'

In your main terminal window, have the X-wing starfighter set its coordinates in the memcached server using the text-based protocol:

$ kubectl exec $XWING_POD sh -- -c "echo -en \"$SETXC\" | nc memcached-server 11211"
STORED
$ kubectl exec $XWING_POD sh -- -c "echo -en \"$GETXC\" | nc memcached-server 11211"
VALUE xwing-coord 0 16
8893.34,234.3290
END

Check that the Alliance Fleet Tracker is able to get all starfighters’ coordinates in your terminal window for the Alliance-Tracker:

>>> client.get("awing-coord")
'4309.432,918.980'
>>> client.get("xwing-coord")
'8893.34,234.3290'
Step 4: The Danger of a Compromised Memcached Client

Imagine if a starfighter ship is captured. Should the starfighter be able to set the coordinates of other ships, or get the coordinates of all other ships? Or if the Alliance-Tracker is compromised, can it modify the coordinates of any starfighter ship? If every client has access to the Memcached API on port 11211, all have over-privileged access until further locked down.

With L4 port access to the memcached server, all starfighters could write to any key/ship and read all ship coordinates. In your main terminal, execute:

$ kubectl exec $XWING_POD sh -- -c "echo -en \"$GETAC\" | nc memcached-server 11211"
VALUE awing-coord 0 16
4309.432,918.980
END

In your A-Wing terminal window, confirm the over-privileged access:

>>> client.get("xwing-coord")
'8893.34,234.3290'
>>> client.set("xwing-coord","0.0,0.0",time=2400)
True
>>> client.get("xwing-coord")
'0.0,0.0'

From A-Wing, set the X-Wing coordinates back to their proper position:

>>> client.set("xwing-coord","8893.34,234.3290",time=2400)
True

Thus, the entire Alliance Fleet Tracking System can be undermined if a single starfighter is compromised.

Step 5: Securing Access to Memcached with Cilium

Cilium helps lock down memcached servers so that clients have only the access they need. Beyond simply allowing or denying access to port 11211, Cilium can enforce specific key access by understanding both the text-based and the unstructured (binary) memcached protocols.

We’ll create a policy that limits the scope of what a starfighter can access and write. Thus, only the intended memcached protocol calls to the memcached-server can be made.

In this example, we’ll only allow A-Wing to get and set the key “awing-coord”, only allow X-Wing to get and set key “xwing-coord”, and allow Alliance-Tracker to only get coordinates.

_images/cilium_memcd_gsg_attack.png

Here is the CiliumNetworkPolicy rule that limits the access of starfighters to their own key and allows Alliance Tracker to get any coordinate:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "secure-fleet"
spec:
  endpointSelector:
    matchLabels:
      app: memcd-server
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: a-wing
    toPorts:
    - ports:
      - port: '11211'
        protocol: TCP
      rules:
        l7proto: memcache
        l7:
        - command: get
          keyExact: awing-coord
        - command: set
          keyExact: awing-coord
  - fromEndpoints:
    - matchLabels:
        app: x-wing
    toPorts:
    - ports:
      - port: '11211'
        protocol: TCP
      rules:
        l7proto: memcache
        l7:
        - command: get
          keyExact: xwing-coord
        - command: set
          keyExact: xwing-coord
  - fromEndpoints:
    - matchLabels:
        name: fleet-tracker
    toPorts:
    - ports:
      - port: '11211'
        protocol: TCP
      rules:
        l7proto: memcache
        l7:
        - command: get
          keyExact: awing-coord
        - command: get
          keyExact: xwing-coord

A CiliumNetworkPolicy contains a list of rules that define allowed memcached commands, and requests that do not match any rules are denied. The rules explicitly match connections destined to the Memcached Service on TCP 11211.

The rules apply to inbound (i.e., “ingress”) connections bound for memcached-server pods (as indicated by app: memcd-server in the “endpointSelector” section). The rules apply differently depending on the client pod: app: a-wing, app: x-wing, or name: fleet-tracker, as indicated by the “fromEndpoints” sections.

With the policy in place, A-wing can only get and set the key “awing-coord”; similarly, the X-wing can only get and set “xwing-coord”. The Alliance Tracker can only get coordinates, not set them.

Apply this Memcached-aware network security policy using kubectl in your main terminal window:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-memcached/memcd-sw-security-policy.yaml

If we then try to perform the attacks from the X-wing pod from the main terminal window, we’ll see that they are denied:

$ kubectl exec $XWING_POD sh -- -c "echo -en \"$GETAC\" | nc memcached-server 11211"
CLIENT_ERROR access denied

From the A-Wing terminal window, we can confirm that A-wing is denied whenever it goes outside the bounds of its allowed calls. You may need to run the client.get command twice for the python call:

>>> client.get("awing-coord")
'4309.432,918.980'
>>> client.get("xwing-coord")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/bmemcached/client/replicating.py", line 42, in get
    value, cas = server.get(key)
  File "/usr/local/lib/python3.7/site-packages/bmemcached/protocol.py", line 440, in get
    raise MemcachedException('Code: %d Message: %s' % (status, extra_content), status)
bmemcached.exceptions.MemcachedException: ("Code: 8 Message: b'access denied'", 8)

Similarly, the Alliance-Tracker cannot set any coordinates, which you can attempt from the Alliance-Tracker terminal window:

>>> client.get("xwing-coord")
'8893.34,234.3290'
>>> client.set("awing-coord","0.0,0.0",time=1200)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/bmemcached/client/replicating.py", line 112, in set
    returns.append(server.set(key, value, time, compress_level=compress_level))
  File "/usr/local/lib/python3.7/site-packages/bmemcached/protocol.py", line 604, in set
    return self._set_add_replace('set', key, value, time, compress_level=compress_level)
  File "/usr/local/lib/python3.7/site-packages/bmemcached/protocol.py", line 583, in _set_add_replace
    raise MemcachedException('Code: %d Message: %s' % (status, extra_content), status)
bmemcached.exceptions.MemcachedException: ("Code: 8 Message: b'access denied'", 8)

The policy is working as expected.

With the CiliumNetworkPolicy in place, the permitted Memcached calls can still be made from the respective pods.

In the main terminal window, execute:

$ kubectl exec $XWING_POD sh -- -c "echo -en \"$GETXC\" | nc memcached-server 11211"
VALUE xwing-coord 0 16
8893.34,234.3290
END
$ SETXC="set xwing-coord 0 1200 16\r\n9854.34,926.9187\r\nquit\r\n"
$ kubectl exec $XWING_POD sh -- -c "echo -en \"$SETXC\" | nc memcached-server 11211"
STORED
$ kubectl exec $XWING_POD sh -- -c "echo -en \"$GETXC\" | nc memcached-server 11211"
VALUE xwing-coord 0 16
9854.34,926.9187
END

In the A-Wing terminal window, execute:

>>> client.set("awing-coord","9852.542,892.1318",time=1200)
True
>>> client.get("awing-coord")
'9852.542,892.1318'
>>> exit()
# exit

In the Alliance-Tracker terminal window, execute:

>>> client.get("awing-coord")
>>> client.get("xwing-coord")
>>> exit()
# exit
Step 6: Clean Up

You have now installed Cilium, deployed a demo app, and tested L7 memcached-aware network security policies. To clean up, in your main terminal window, run:

$ kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-memcached/memcd-sw-app.yaml
$ kubectl delete cnp secure-fleet

Locking down external access using AWS metadata

This document serves as an introduction to using Cilium to enforce policies based on AWS instance metadata. It is a detailed walk-through of getting a single-node Cilium environment running on your machine. It should take 15-30 minutes if you have some experience running Kubernetes.

Setup Cilium

This guide will work with any approach to installing Cilium, including minikube, as long as the cilium-operator pod in the deployment can reach the AWS API server. However, since the most common use of this mechanism is for Kubernetes clusters running in AWS, we recommend trying it out along with the guide Installation on AWS EKS.

Create AWS secrets

Before installing Cilium, a new Kubernetes Secret with the AWS Tokens needs to be added to your Kubernetes cluster. This Secret will allow Cilium to gather information from the AWS API which is needed to implement ToGroups policies.

AWS Access keys and IAM role

To create a new access token, the following guide can be used. These keys need to have the following permissions set:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ec2:Describe*",
            "Resource": "*"
        }
    ]
}

As soon as you have the access tokens, the following secret needs to be added, with each empty string replaced by the associated value as a base64-encoded string:

apiVersion: v1
kind: Secret
metadata:
  name: cilium-aws
  namespace: kube-system
type: Opaque
data:
  AWS_ACCESS_KEY_ID: ""
  AWS_SECRET_ACCESS_KEY: ""
  AWS_DEFAULT_REGION: ""

The base64 command line utility can be used to generate each value, for example:

$ echo -n "eu-west-1"  | base64
ZXUtd2VzdC0x

This secret stores the AWS credentials, which will be used to connect to the AWS API.

$ kubectl create -f cilium-secret.yaml
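
Alternatively, kubectl can perform the base64 encoding for you by creating the Secret directly from literal values. This is just a sketch that assumes you have the plain-text credentials at hand; the resulting Secret is equivalent to applying the YAML above:

$ kubectl -n kube-system create secret generic cilium-aws \
    --from-literal=AWS_ACCESS_KEY_ID=<your access key id> \
    --from-literal=AWS_SECRET_ACCESS_KEY=<your secret access key> \
    --from-literal=AWS_DEFAULT_REGION=eu-west-1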

To validate that the credentials are correct, the following pod can be created for debugging purposes:

apiVersion: v1
kind: Pod
metadata:
  name: testing-aws-pod
  namespace: kube-system
spec:
  containers:
  - name: aws-cli
    image: mesosphere/aws-cli
    command: ['sh', '-c', 'sleep 3600']
    env:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: cilium-aws
            key: AWS_ACCESS_KEY_ID
            optional: true
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: cilium-aws
            key: AWS_SECRET_ACCESS_KEY
            optional: true
      - name: AWS_DEFAULT_REGION
        valueFrom:
          secretKeyRef:
            name: cilium-aws
            key: AWS_DEFAULT_REGION
            optional: true

To list all of the available AWS instances, the following command can be used:

$ kubectl  -n kube-system exec -ti testing-aws-pod -- aws ec2 describe-instances
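
If you only want to see the instances belonging to a particular security group, the same call can be filtered. This is a hedged example that assumes the security group ID sg-0f2146100a88d03c3 used later in this guide:

$ kubectl -n kube-system exec -ti testing-aws-pod -- \
    aws ec2 describe-instances --filters "Name=instance.group-id,Values=sg-0f2146100a88d03c3"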

Once the secret has been created and validated, the cilium-operator pod must be restarted in order to pick up the credentials in the secret. To do this, identify and delete the existing cilium-operator pod, which will be recreated automatically:

$ kubectl get pods -l name=cilium-operator -n kube-system
NAME                              READY   STATUS    RESTARTS   AGE
cilium-operator-7c9d69f7c-97vqx   1/1     Running   0          36h

$ kubectl delete pod cilium-operator-7c9d69f7c-97vqx

It is important for this demo that coredns is working correctly. To check the status of coredns, run the following command:

$ kubectl get deployment coredns -n kube-system
NAME       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
coredns    2         2         2            2           13h

At least one coredns pod should be available.

Configure AWS Security Groups

Cilium’s AWS Metadata filtering capability enables explicit whitelisting of communication from a subset of pods (identified by Kubernetes labels) to a set of destination EC2 VMs (identified by membership in an AWS security group).

In this example, the destination EC2 VMs are members of a single AWS security group (‘sg-0f2146100a88d03c3’), and pods with the label class=xwing should only be able to make connections outside the cluster to the destination VMs in that security group.

To enable this, the VMs acting as Kubernetes worker nodes must be able to send traffic to the destination VMs that are being accessed by pods. One approach for achieving this is to put all Kubernetes worker VMs in a single ‘k8s-worker’ security group, and then ensure that any security group that is referenced in a Cilium toGroups policy has an allow-all ingress rule (all ports) for connections from the ‘k8s-worker’ security group. Cilium filtering will then ensure that only the pods allowed by policy can reach the destination VMs. An example of adding such a rule with the AWS CLI is sketched below.
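
As a concrete sketch of that last step, the AWS CLI can add such an allow-all ingress rule to the destination security group; <k8s-worker-sg-id> is a placeholder for your ‘k8s-worker’ security group ID and is not part of the original example:

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0f2146100a88d03c3 \
    --ip-permissions 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=<k8s-worker-sg-id>}]'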

Create a sample policy
Deploy a demo application:

We’re going to use a demo application that also appears in other guides. The manifests will create three microservices: deathstar, tiefighter, and xwing. In this guide, we are only going to use the xwing microservice to secure communications to existing AWS instances.

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/minikube/http-sw-app.yaml
service "deathstar" created
deployment "deathstar" created
deployment "tiefighter" created
deployment "xwing" created

Kubernetes will deploy the pods and service in the background. Running kubectl get pods,svc will inform you about the progress of the operation. Each pod will go through several states until it reaches Running at which point the pod is ready.

$ kubectl get pods,svc
NAME                             READY     STATUS    RESTARTS   AGE
po/deathstar-76995f4687-2mxb2    1/1       Running   0          1m
po/deathstar-76995f4687-xbgnl    1/1       Running   0          1m
po/tiefighter                    1/1       Running   0          1m
po/xwing                         1/1       Running   0          1m

NAME             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
svc/deathstar    ClusterIP   10.109.254.198   <none>        80/TCP    3h
svc/kubernetes   ClusterIP   10.96.0.1        <none>        443/TCP   3h
Policy Language:

ToGroups rules can be used to define policy in relation to cloud providers, like AWS.

---
kind: CiliumNetworkPolicy
apiVersion: cilium.io/v2
metadata:
  name: to-groups-sample
  namespace: default
spec:
  endpointSelector:
    matchLabels:
      org: alliance
      class: xwing
  egress:
  - toPorts:
    - ports:
      - port: '80'
        protocol: TCP
    toGroups:
    - aws:
        securityGroupsIds:
        - 'sg-0f2146100a88d03c3'

This policy allows traffic from pod xwing to any AWS instance that is in the security group with ID sg-0f2146100a88d03c3.

Validate that derived policy is in place

Every time a new policy with ToGroups rules is added, an equivalent policy (also called a “derivative policy”) is created. This policy will contain the set of CIDRs that correspond to the ToGroups specification, e.g., the IPs of all instances that are part of a specified security group. The list of IPs is updated periodically.

$ kubectl get cnp
NAME                                                             AGE
to-groups-sample                                                 11s
to-groups-sample-togroups-044ba7d1-f491-11e8-ad2e-080027d2d952   10s

Eventually, the derivative policy will contain IPs in the ToCIDR section:

$ kubectl get cnp to-groups-sample-togroups-044ba7d1-f491-11e8-ad2e-080027d2d952
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  creationTimestamp: 2018-11-30T11:13:52Z
  generation: 1
  labels:
    io.cilium.network.policy.kind: derivative
    io.cilium.network.policy.parent.uuid: 044ba7d1-f491-11e8-ad2e-080027d2d952
  name: to-groups-sample-togroups-044ba7d1-f491-11e8-ad2e-080027d2d952
  namespace: default
  ownerReferences:
  - apiVersion: cilium.io/v2
    blockOwnerDeletion: true
    kind: CiliumNetworkPolicy
    name: to-groups-sample
    uid: 044ba7d1-f491-11e8-ad2e-080027d2d952
  resourceVersion: "34853"
  selfLink: /apis/cilium.io/v2/namespaces/default/ciliumnetworkpolicies/to-groups-sample-togroups-044ba7d1-f491-11e8-ad2e-080027d2d952
  uid: 04b289ba-f491-11e8-ad2e-080027d2d952
specs:
- egress:
  - toCIDRSet:
    - cidr: 34.254.113.42/32
    - cidr: 172.31.44.160/32
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
  endpointSelector:
    matchLabels:
      any:class: xwing
      any:org: alliance
      k8s:io.kubernetes.pod.namespace: default
  labels:
  - key: io.cilium.k8s.policy.name
    source: k8s
    value: to-groups-sample
  - key: io.cilium.k8s.policy.uid
    source: k8s
    value: 044ba7d1-f491-11e8-ad2e-080027d2d952
  - key: io.cilium.k8s.policy.namespace
    source: k8s
    value: default
  - key: io.cilium.k8s.policy.derived-from
    source: k8s
    value: CiliumNetworkPolicy
status:
  nodes:
    k8s1:
      enforcing: true
      lastUpdated: 2018-11-30T11:28:03.907678888Z
      localPolicyRevision: 28
      ok: true

The derivative rule should contain the following information:

  • metadata.OwnerReferences: contains information about the parent ToGroups policy.
  • specs.Egress.ToCIDRSet: the list of private and public IPs of the instances that correspond to the spec of the parent policy.
  • status: whether or not the policy is enforced yet, and when the policy was last updated.

The Cilium Endpoint status for the xwing should have policy enforcement enabled only for egress connectivity:

$ kubectl get cep xwing
NAME    ENDPOINT ID   IDENTITY ID   POLICY ENFORCEMENT   ENDPOINT STATE   IPV4         IPV6
xwing   23453         63929         egress               ready            10.10.0.95   f00d::a0a:0:0:22cf

In this example, the xwing pod can only connect to 34.254.113.42/32 and 172.31.44.160/32; connectivity to any other IP will be denied.

Advanced Networking

Setting up Cilium in AWS ENI mode

Note

The AWS ENI integration is still subject to some limitations. See Limitations for details.

Create an AWS cluster

Set up a Kubernetes cluster on AWS. You can use any method you prefer, but for the sake of simplicity in this tutorial, we are going to use eksctl. For more details on how to set up an EKS cluster using eksctl, see the section Installation on AWS EKS.

eksctl create cluster -n eni-cluster -N 0
Disable VPC CNI (aws-node DaemonSet) (EKS only)

If you are running an EKS cluster, you should delete the aws-node DaemonSet.

Cilium will manage ENIs instead of VPC CNI, so the aws-node DaemonSet has to be deleted to prevent conflicting behavior.

Note

Once aws-node DaemonSet is deleted, EKS will not try to restore it.

kubectl -n kube-system delete daemonset aws-node
Deploy Cilium

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Setup Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.eni=true \
  --set global.egressMasqueradeInterfaces=eth0 \
  --set global.tunnel=disabled \
  --set global.nodeinit.enabled=true

Note

The above options assume that masquerading is desired and that the VM is connected to the VPC using eth0. All traffic that does not stay within the VPC will be routed via eth0 and masqueraded.

If you want to avoid masquerading, set global.masquerade=false. You must ensure that the security groups associated with the ENIs (eth1, eth2, …) allow for egress traffic to outside of the VPC. By default, the security groups for pod ENIs are derived from the primary ENI (eth0).
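
For reference, a variant of the Helm invocation above with masquerading disabled might look like the following sketch (the remaining values mirror the earlier command):

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.eni=true \
  --set global.tunnel=disabled \
  --set global.masquerade=false \
  --set global.nodeinit.enabled=true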

Scale up the cluster
eksctl get nodegroup --cluster eni-cluster
CLUSTER                     NODEGROUP       CREATED                 MIN SIZE        MAX SIZE        DESIRED CAPACITY        INSTANCE TYPE   IMAGE ID
test-cluster                ng-25560078     2019-07-23T06:05:35Z    0               2               0                       m5.large        ami-0923e4b35a30a5f53
eksctl scale nodegroup --cluster eni-cluster -n ng-25560078 -N 2
[ℹ]  scaling nodegroup stack "eksctl-test-cluster-nodegroup-ng-25560078" in cluster eksctl-test-cluster-cluster
[ℹ]  scaling nodegroup, desired capacity from 0 to 2
Validate the Installation

You can monitor the progress as Cilium and all required components are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-operator-cb4578bc5-q52qk         0/1     Pending             0          8s
cilium-s8w5m                            0/1     PodInitializing     0          7s
coredns-86c58d9df4-4g7dd                0/1     ContainerCreating   0          8m57s
coredns-86c58d9df4-4l6b2                0/1     ContainerCreating   0          8m57s

It may take a couple of minutes for all components to come up:

cilium-operator-cb4578bc5-q52qk         1/1     Running   0          4m13s
cilium-s8w5m                            1/1     Running   0          4m12s
coredns-86c58d9df4-4g7dd                1/1     Running   0          13m
coredns-86c58d9df4-4l6b2                1/1     Running   0          13m
Deploy the connectivity test

You can deploy the “connectivity-check” to test connectivity between pods.

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml

It will deploy a series of deployments which use various connectivity paths to connect to each other. The connectivity paths include both with and without service load-balancing, and various network policy combinations. The pod name indicates the connectivity variant, and the readiness and liveness gates indicate the success or failure of the test:

NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-5995597649-f5d5g                                 1/1     Running   0          4m51s
echo-b-54c9bb5f5c-p6lxf                                 1/1     Running   0          4m50s
echo-b-host-67446447f7-chvsp                            1/1     Running   0          4m50s
host-to-b-multi-node-clusterip-78f9869d75-l8cf8         1/1     Running   0          4m50s
host-to-b-multi-node-headless-798949bd5f-vvfff          1/1     Running   0          4m50s
pod-to-a-59b5fcb7f6-gq4hd                               1/1     Running   0          4m50s
pod-to-a-allowed-cnp-55f885bf8b-5lxzz                   1/1     Running   0          4m50s
pod-to-a-external-1111-7ff666fd8-v5kqb                  1/1     Running   0          4m48s
pod-to-a-l3-denied-cnp-64c6c75c5d-xmqhw                 1/1     Running   0          4m50s
pod-to-b-intra-node-845f955cdc-5nfrt                    1/1     Running   0          4m49s
pod-to-b-multi-node-clusterip-666594b445-bsn4j          1/1     Running   0          4m49s
pod-to-b-multi-node-headless-746f84dff5-prk4w           1/1     Running   0          4m49s
pod-to-b-multi-node-nodeport-7cb9c6cb8b-ksm4h           1/1     Running   0          4m49s
pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-tg9dh   1/1     Running   0          4m48s
Install Hubble

Hubble is a fully distributed networking and security observability platform for cloud native workloads. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. Visit the Hubble GitHub page for more information.

Deploy Hubble using Helm:

git clone https://github.com/cilium/hubble.git --branch v0.5
cd hubble/install/kubernetes

helm install hubble ./hubble \
    --namespace kube-system \
    --set metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}" \
    --set ui.enabled=true
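
Once the Hubble pods are running, you can typically reach the UI from your workstation with a port-forward. The service name hubble-ui and the 12000:80 port mapping below are assumptions based on the chart defaults and may differ in your deployment:

kubectl -n kube-system port-forward svc/hubble-ui 12000:80

Then open http://localhost:12000 in your browser.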
Limitations
  • The AWS ENI integration of Cilium is currently only enabled for IPv4.
  • When applying L7 policies at egress, the source identity context is lost as it is currently not carried in the packet. This means that traffic will look like it is coming from outside of the cluster to the receiving pod.
Troubleshooting
Make sure to disable DHCP on ENIs

Cilium will use both the primary and secondary IP addresses assigned to ENIs. Use of the primary IP address optimizes the number of IPs available to pods but can conflict with a DHCP agent running on the node and assigning the primary IP of the ENI to the interface of the node. A common scenario where this happens is if NetworkManager is running on the node and automatically performing DHCP on all network interfaces of the VM. Be sure to disable DHCP on any ENIs that get attached to the node or disable NetworkManager entirely.
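
One way to keep NetworkManager away from the additional ENIs, assuming they appear as eth1 and eth2 on your instance type (adjust the interface names and the file name as needed), is to mark them as unmanaged:

cat <<EOF > /etc/NetworkManager/conf.d/99-unmanaged-enis.conf
[keyfile]
unmanaged-devices=interface-name:eth1;interface-name:eth2
EOF
systemctl reload NetworkManager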

Using kube-router to run BGP

This guide explains how to configure Cilium and kube-router to cooperate, using kube-router for BGP peering and route propagation and Cilium for policy enforcement and load-balancing.

Note

This is a beta feature. Please provide feedback and file a GitHub issue if you experience any problems.

Deploy kube-router

Download the kube-router DaemonSet template:

curl -LO https://raw.githubusercontent.com/cloudnativelabs/kube-router/v0.4.0/daemonset/generic-kuberouter-only-advertise-routes.yaml

Open the file generic-kuberouter-only-advertise-routes.yaml and edit the args: section. The following arguments are required to be set to exactly these values:

- "--run-router=true"
- "--run-firewall=false"
- "--run-service-proxy=false"
- "--enable-cni=false"
- "--enable-pod-egress=false"

The following arguments are optional and may be set according to your needs. For the purpose of keeping this guide simple, the following values are used, which require the least preparation in your cluster. Please see the kube-router user guide for more information.

- "--enable-ibgp=true"
- "--enable-overlay=true"
- "--advertise-cluster-ip=true"
- "--advertise-external-ip=true"
- "--advertise-loadbalancer-ip=true"

The following arguments are optional and should be set if you want BGP peering with an external router. This is useful if you want externally routable Kubernetes Pod and Service IPs. Note the values used here should be changed to whatever IPs and ASNs are configured on your external router.

- "--cluster-asn=65001"
- "--peer-router-ips=10.0.0.1,10.0.2"
- "--peer-router-asns=65000,65000"

Apply the DaemonSet file to deploy kube-router and verify it has come up correctly:

$ kubectl apply -f generic-kuberouter-only-advertise-routes.yaml
$ kubectl -n kube-system get pods -l k8s-app=kube-router
NAME                READY     STATUS    RESTARTS   AGE
kube-router-n6fv8   1/1       Running   0          10m
kube-router-nj4vs   1/1       Running   0          10m
kube-router-xqqwc   1/1       Running   0          10m
kube-router-xsmd4   1/1       Running   0          10m
Deploy Cilium

In order for routing to be delegated to kube-router, tunneling/encapsulation must be disabled. This is done by setting tunnel=disabled in the ConfigMap cilium-config or by adjusting the DaemonSet to run the cilium-agent with the argument --tunnel=disabled:

# Encapsulation mode for communication between nodes
# Possible values:
#   - disabled
#   - vxlan (default)
#   - geneve
tunnel: "disabled"

You can then install Cilium according to the instructions in section Requirements.

Ensure that Cilium is up and running:

$ kubectl -n kube-system get pods -l k8s-app=cilium
NAME           READY     STATUS    RESTARTS   AGE
cilium-fhpk2   1/1       Running   0          45m
cilium-jh6kc   1/1       Running   0          44m
cilium-rlx6n   1/1       Running   0          44m
cilium-x5x9z   1/1       Running   0          45m
Verify Installation

Verify that kube-router has installed routes:

$ kubectl -n kube-system exec -ti cilium-fhpk2 -- ip route list scope global
default via 172.0.32.1 dev eth0 proto dhcp src 172.0.50.227 metric 1024
10.2.0.0/24 via 10.2.0.172 dev cilium_host src 10.2.0.172
10.2.1.0/24 via 172.0.51.175 dev eth0 proto 17
10.2.2.0/24 dev tun-172011760 proto 17 src 172.0.50.227
10.2.3.0/24 dev tun-1720186231 proto 17 src 172.0.50.227

In the above example, we see three categories of routes that have been installed:

  • Local PodCIDR: This route points to all pods running on the host and makes these pods reachable, e.g. 10.2.0.0/24 via 10.2.0.172 dev cilium_host src 10.2.0.172
  • BGP route: This type of route is installed if kube-router determines that the remote PodCIDR can be reached via a router known to the local host. It will instruct pod-to-pod traffic to be forwarded directly to that router without requiring any encapsulation, e.g. 10.2.1.0/24 via 172.0.51.175 dev eth0 proto 17
  • IPIP tunnel route: If no direct routing path exists, kube-router will fall back to using an overlay and establish an IPIP tunnel between the nodes, e.g. 10.2.2.0/24 dev tun-172011760 proto 17 src 172.0.50.227 and 10.2.3.0/24 dev tun-1720186231 proto 17 src 172.0.50.227

You can test connectivity by deploying the following connectivity checker pods:

$ kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/connectivity-check/connectivity-check.yaml
$ kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
echo-a-dd67f6b4b-s62jl                                  1/1     Running   0          2m15s
echo-b-55d8dbd74f-t8jwk                                 1/1     Running   0          2m15s
host-to-b-multi-node-clusterip-686f99995d-tn6kq         1/1     Running   0          2m15s
host-to-b-multi-node-headless-bdbc856d-9zv4x            1/1     Running   0          2m15s
pod-to-a-766584ffff-wh2s8                               1/1     Running   0          2m15s
pod-to-a-allowed-cnp-5899c44899-f9tdv                   1/1     Running   0          2m15s
pod-to-a-external-1111-55c488465-7sd55                  1/1     Running   0          2m14s
pod-to-a-l3-denied-cnp-856998c977-j9dhs                 1/1     Running   0          2m15s
pod-to-b-intra-node-7b6cbc6c56-hqz7r                    1/1     Running   0          2m15s
pod-to-b-multi-node-clusterip-77c8446b6d-qc8ch          1/1     Running   0          2m15s
pod-to-b-multi-node-headless-854b65674d-9zlp8           1/1     Running   0          2m15s
pod-to-external-fqdn-allow-google-cnp-bb9597947-bc85q   1/1     Running   0          2m14s

Using BIRD to run BGP

BIRD is an open-source implementation for routing Internet Protocol packets on Unix-like operating systems. If you are not familiar with it, it is best to have a glance at the User’s Guide first.

BIRD maintains two release families at present, 1.x and 2.x, and the configuration format varies a lot between them. Unless you have already deployed 1.x, we suggest using 2.x directly, as 2.x will be maintained longer. In the following, bird refers to the bird2 software.

This guide shows how to install and configure bird on CentOS 7.x to make it collaborate with Cilium. Installation and configuration on other platforms should be very similar.

Install bird
$ yum install -y bird2

$ systemctl enable bird
$ systemctl restart bird

Test the installation:

$ birdc show route
BIRD 2.0.6 ready.

$ birdc              # interactive shell
BIRD 2.0.6 ready.
bird> show bfd sessions
There is no BFD protocol running
bird>
bird> show protocols all
Name       Proto      Table      State  Since         Info
device1    Device     ---        up     10:53:40.147

direct1    Direct     ---        down   10:53:40.147
  Channel ipv4
    State:          DOWN
    Input filter:   ACCEPT
    Output filter:  REJECT
...
Basic configuration

It’s hard to discuss bird configurations without considering specific BGP schemes. However, BGP scheme design is beyond the scope of this guide. If you are interested in this topic, refer to BGP in the Data Center (O’Reilly, 2017) for a quick start.

In the following, we will restrict our BGP scenario as follows:

_images/bird_sample_topo.png
  • physical network: simple 3-tier hierarchical architecture
  • nodes connect to physical network via layer 2 switches
  • announcing each node’s PodCIDR to physical network via bird
  • for each node, do not import route announcements from physical network

In this design, the BGP connections look like this:

_images/bird_sample_bgp.png

This scheme is simple in that:

  • core routers learn PodCIDRs from bird, which makes the Pod IP addresses routable within the entire network.
  • bird doesn’t learn routes from core routers and other nodes, which keeps the kernel routing table of each node clean and small and avoids performance issues.

In this scheme, each node just sends pod egress traffic to the node’s default gateway (the core routers) and lets the latter do the routing.

Below is a reference configuration for fulfilling the above purposes:

$ cat /etc/bird.conf
log syslog all;

router id {{ NODE_IP }};

protocol device {
        scan time 10;           # Scan interfaces every 10 seconds
}

# Disable automatically generating direct routes to all network interfaces.
protocol direct {
        disabled;               # Disable by default
}

# Forbid synchronizing BIRD routing tables with the OS kernel.
protocol kernel {
        ipv4 {                    # Connect protocol to IPv4 table by channel
                import none;      # Import to table, default is import all
                export none;      # Export to protocol. default is export none
        };
}

# Static IPv4 routes.
protocol static {
      ipv4;
      route {{ POD_CIDR }} via "cilium_host";
}

# BGP peers
protocol bgp uplink0 {
      description "BGP uplink 0";
      local {{ NODE_IP }} as {{ NODE_ASN }};
      neighbor {{ NEIGHBOR_0_IP }} as {{ NEIGHBOR_0_ASN }};
      password {{ NEIGHBOR_PWD }};

      ipv4 {
              import filter {reject;};
              export filter {accept;};
      };
}

protocol bgp uplink1 {
      description "BGP uplink 1";
      local {{ NODE_IP }} as {{ NODE_ASN }};
      neighbor {{ NEIGHBOR_1_IP }} as {{ NEIGHBOR_1_ASN }};
      password {{ NEIGHBOR_PWD }};

      ipv4 {
              import filter {reject;};
              export filter {accept;};
      };
}

Save the above file as /etc/bird.conf, and replace the placeholders with your own:

sed -i 's/{{ NODE_IP }}/<your node ip>/g'                /etc/bird.conf
sed -i 's/{{ POD_CIDR }}/<your pod cidr>/g'              /etc/bird.conf
sed -i 's/{{ NODE_ASN }}/<your node asn>/g'              /etc/bird.conf
sed -i 's/{{ NEIGHBOR_0_IP }}/<your neighbor 0 ip>/g'    /etc/bird.conf
sed -i 's/{{ NEIGHBOR_1_IP }}/<your neighbor 1 ip>/g'    /etc/bird.conf
sed -i 's/{{ NEIGHBOR_0_ASN }}/<your neighbor 0 asn>/g'  /etc/bird.conf
sed -i 's/{{ NEIGHBOR_1_ASN }}/<your neighbor 1 asn>/g'  /etc/bird.conf
sed -i 's/{{ NEIGHBOR_PWD }}/<your neighbor password>/g' /etc/bird.conf
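
Before restarting, you can optionally check that the resulting file parses cleanly; bird’s -p flag parses the configuration and exits:

$ bird -p -c /etc/bird.conf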

Restart bird and check the logs:

$ systemctl restart bird

# check logs
$ journalctl -u bird
-- Logs begin at Sat 2020-02-22 16:11:44 CST, end at Mon 2020-02-24 18:58:35 CST. --
Feb 24 18:58:24 node systemd[1]: Started BIRD Internet Routing Daemon.
Feb 24 18:58:24 node systemd[1]: Starting BIRD Internet Routing Daemon...
Feb 24 18:58:24 node bird[137410]: Started

Verify the changes; you should see something like this:

$ birdc show route
BIRD 2.0.6 ready.
Table master4:
10.5.48.0/24         unicast [static1 20:14:51.478] * (200)
        dev cilium_host

This indicates that the PodCIDR 10.5.48.0/24 on this node has been successfully announced to the BGP peers.
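
You can also inspect the state of the BGP sessions themselves, assuming the protocol names uplink0 and uplink1 from the configuration above:

$ birdc show protocols all uplink0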

Monitoring

bird_exporter can collect bird daemon state and export Prometheus-style metrics.

It also provides a simple Grafana dashboard, but you can also create your own; for example, Trip.com’s looks like this:

_images/bird_dashboard.png
Advanced Configurations

You may need some advanced configurations to make your BGP scheme production-ready. This section lists some of these options, but we will not dive into the details; that is the BIRD User’s Guide’s responsibility.

BFD

Bidirectional Forwarding Detection (BFD) is a detection protocol designed to accelerate path failure detection.

This feature also relies on the peer side’s configuration.

protocol bfd {
      interface "{{ grains['node_mgnt_device'] }}" {
              min rx interval 100 ms;
              min tx interval 100 ms;
              idle tx interval 300 ms;
              multiplier 10;
              password {{ NEIGHBOR_PWD }};
      };

      neighbor {{ NEIGHBOR_0_IP }};
      neighbor {{ NEIGHBOR_1_IP }};
}

protocol bgp uplink0 {
            ...

        bfd on;
}

Verify; you should see something like this:

$ birdc show bfd sessions
BIRD 2.0.6 ready.
bfd1:
IP address                Interface  State      Since         Interval  Timeout
10.5.40.2                 bond0      Up         20:14:51.479    0.300    0.000
10.5.40.3                 bond0      Up         20:14:51.479    0.300    0.000
ECMP

For some special purposes (e.g. L4LB), you may configure the same CIDR on multiple nodes. In this case, you need to configure Equal-Cost Multi-Path (ECMP) routing.

This feature also relies on the peer side’s configuration.

protocol kernel {
        ipv4 {                    # Connect protocol to IPv4 table by channel
                import none;      # Import to table, default is import all
                export none;      # Export to protocol. default is export none
        };

        # Configure ECMP
        merge paths yes limit {{ N }} ;
}

See the user manual for more detailed information.

You need to check the ECMP correctness on the physical network (the core routers in the above scenario):

CORE01# show ip route 10.5.2.0
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

10.5.2.0/24, ubest/mbest: 2/0
    *via 10.4.1.7, [200/0], 13w6d, bgp-65418, internal, tag 65418
    *via 10.4.1.8, [200/0], 12w4d, bgp-65418, internal, tag 65418
Graceful restart

This feature also relies on the peer side’s configuration.

Add graceful restart to each bgp section:

protocol bgp uplink0 {
            ...

        graceful restart;
}

Setting up Cluster Mesh

This is a step-by-step guide on how to build a mesh of Kubernetes clusters by connecting them together, enabling pod-to-pod connectivity across all clusters, defining global services to load-balance between clusters, and enforcing security policies to restrict access.

Prerequisites
  • PodCIDR ranges in all clusters must be non-conflicting.
  • This guide and the referenced scripts assume that Cilium was installed using the Installation with managed etcd instructions, which results in etcd being managed by Cilium using etcd-operator. You can manage etcd in any other way, but you will have to adjust some of the scripts to account for different secret names and adjust the LoadBalancer to expose the etcd pods.
  • Nodes in all clusters must have IP connectivity between each other. This requirement is typically met by establishing peering or VPN tunnels between the networks of the nodes of each cluster.
  • All nodes must have a unique IP address assigned to them. Node IPs of clusters being connected together must not conflict with each other.
  • Cilium must be configured to use etcd as the kvstore. Consul is not supported by cluster mesh at this point.
  • It is highly recommended to use a TLS protected etcd cluster with Cilium. The server certificate of etcd must whitelist the host name *.mesh.cilium.io. If you are using the cilium-etcd-operator as set up in the Installation with managed etcd instructions then this is automatically taken care of.
  • The network between clusters must allow the inter-cluster communication. The exact ports are documented in the Firewall Rules section.
Prepare the clusters
Specify the cluster name and ID

Each cluster must be assigned a unique human-readable name. The name will be used to group nodes of a cluster together. The cluster name is specified with the --cluster-name=NAME argument or cluster-name ConfigMap option.

To ensure scalability of identity allocation and policy enforcement, each cluster continues to manage its own security identity allocation. In order to guarantee compatibility with identities across clusters, each cluster is configured with a unique cluster ID configured with the --cluster-id=ID argument or cluster-id ConfigMap option. The value must be between 1 and 255.

kubectl -n kube-system edit cm cilium-config
[ ... add/edit ... ]
cluster-name: cluster1
cluster-id: "1"

Repeat this step for each cluster.

Note

This can also be done by passing --set global.cluster.id=<id> and --set global.cluster.name=<name> to helm install when installing or updating Cilium.
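
For example, a sketch mirroring the Helm invocations used elsewhere in this guide (any other options required for your environment are omitted here):

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.cluster.name=cluster1 \
  --set global.cluster.id=1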

Expose the Cilium etcd to other clusters

The Cilium etcd must be exposed to other clusters. There are many ways to achieve this. The method documented in this guide will work with cloud providers that implement the Kubernetes LoadBalancer service type, as well as with services of type NodePort (assuming that nodes can reach each other using their internal IPs):

apiVersion: v1
kind: Service
metadata:
  name: cilium-etcd-external
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  ports:
  - port: 2379
  selector:
    app: etcd
    etcd_cluster: cilium-etcd
    io.cilium/app: etcd-operator
---
apiVersion: v1
kind: Service
metadata:
  name: cilium-etcd-external
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
spec:
  type: LoadBalancer
  ports:
  - port: 2379
  selector:
    app: etcd
    etcd_cluster: cilium-etcd
    io.cilium/app: etcd-operator
---
apiVersion: v1
kind: Service
metadata:
  name: cilium-etcd-external
spec:
  type: NodePort
  ports:
  - port: 2379
  selector:
    app: etcd
    etcd_cluster: cilium-etcd
    io.cilium/app: etcd-operator

The example used here exposes the etcd cluster managed by cilium-etcd-operator, as installed by the standard installation instructions, as an internal service. This means it is only exposed inside the VPC and is not publicly accessible outside of it. It is recommended to use a static IP for the service IP to avoid having to update the IP mapping in one of the later steps.

If you are running the cilium-etcd-operator you can simply apply one of the following services (the GKE, EKS, or NodePort variant, depending on your environment) to expose etcd:

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/clustermesh/cilium-etcd-external-service/cilium-etcd-external-gke.yaml
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/clustermesh/cilium-etcd-external-service/cilium-etcd-external-eks.yaml
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/clustermesh/cilium-etcd-external-service/cilium-etcd-external-nodeport.yaml

Note

Make sure that you create the service in the namespace in which cilium and/or etcd is running. Depending on which installation method you chose, this could be kube-system or cilium.
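
After creating the service, you can verify that it was assigned an address. This is a quick check; the kube-system namespace here is an assumption based on the standard installation:

kubectl -n kube-system get svc cilium-etcd-external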

Extract the TLS keys and generate the etcd configuration

The cluster mesh control plane performs TLS based authentication and encryption. For this purpose, the TLS keys and certificates of each etcd need to be made available to all clusters that wish to connect.

  1. Clone the cilium/clustermesh-tools repository. It contains scripts to extract the secrets and generate a Kubernetes secret in the form of a YAML file:

    git clone https://github.com/cilium/clustermesh-tools.git
    cd clustermesh-tools
    
  2. Ensure that the kubectl context is pointing to the cluster you want to extract the secret from.

  3. Extract the TLS certificate, key and root CA authority.

    ./extract-etcd-secrets.sh
    

    This will extract the keys that Cilium is using to connect to the etcd in the local cluster. The key files are written to config/<cluster-name>.*.{key|crt|-ca.crt}

  4. Repeat this step for all clusters you want to connect with each other.

  5. Generate a single Kubernetes secret from all the keys and certificates extracted. The secret will contain the etcd configuration with the service IP or host name of the etcd including the keys and certificates to access it.

    ./generate-secret-yaml.sh > clustermesh.yaml
    

Note

The key files in config/ and the secret represented as YAML are sensitive. Anyone gaining access to these files is able to connect to the etcd instances in the local cluster. Delete the files after you are done setting up the cluster mesh.

Ensure that the etcd service names can be resolved

For TLS authentication to work properly, agents will connect to etcd in remote clusters using a pre-defined naming schema {clustername}.mesh.cilium.io. In order for DNS resolution to work for these virtual host names, the names are statically mapped to the service IP via the /etc/hosts file.

  1. The following script will generate the required segment which has to be inserted into the cilium DaemonSet:

    ./generate-name-mapping.sh > ds.patch
    

    The ds.patch will look something like this:

    spec:
      template:
        spec:
          hostAliases:
          - ip: "10.138.0.18"
            hostnames:
            - cluster1.mesh.cilium.io
          - ip: "10.138.0.19"
            hostnames:
            - cluster2.mesh.cilium.io
    
  2. Apply the patch to all DaemonSets in all clusters:

    kubectl -n kube-system patch ds cilium -p "$(cat ds.patch)"
    
Establish connections between clusters

  1. Import the cilium-clustermesh secret that you generated in the previous section into all of your clusters:

    kubectl -n kube-system apply -f clustermesh.yaml

  2. Restart the cilium-agent in all clusters so it picks up the new cluster name, the cluster id, and mounts the cilium-clustermesh secret. Cilium will automatically establish connectivity between the clusters.

    kubectl -n kube-system delete pod -l k8s-app=cilium

  3. For global services to work (see below), also restart the cilium-operator:

    kubectl -n kube-system delete pod -l name=cilium-operator
Test pod connectivity between clusters

Run cilium node list to see the full list of nodes discovered. You can run this command inside any Cilium pod in any cluster:

$ kubectl -n kube-system exec -ti cilium-g6btl cilium node list
Name                                                   IPv4 Address    Endpoint CIDR   IPv6 Address   Endpoint CIDR
cluster5/ip-172-0-117-60.us-west-2.compute.internal    172.0.117.60    10.2.2.0/24     <nil>          f00d::a02:200:0:0/112
cluster5/ip-172-0-186-231.us-west-2.compute.internal   172.0.186.231   10.2.3.0/24     <nil>          f00d::a02:300:0:0/112
cluster5/ip-172-0-50-227.us-west-2.compute.internal    172.0.50.227    10.2.0.0/24     <nil>          f00d::a02:0:0:0/112
cluster5/ip-172-0-51-175.us-west-2.compute.internal    172.0.51.175    10.2.1.0/24     <nil>          f00d::a02:100:0:0/112
cluster7/ip-172-0-121-242.us-west-2.compute.internal   172.0.121.242   10.4.2.0/24     <nil>          f00d::a04:200:0:0/112
cluster7/ip-172-0-58-194.us-west-2.compute.internal    172.0.58.194    10.4.1.0/24     <nil>          f00d::a04:100:0:0/112
cluster7/ip-172-0-60-118.us-west-2.compute.internal    172.0.60.118    10.4.0.0/24     <nil>          f00d::a04:0:0:0/112
$ kubectl exec -ti pod-cluster5-xxx curl <pod-ip-cluster7>
[...]
Load-balancing with Global Services

Establishing load-balancing between clusters is achieved by defining a Kubernetes service with an identical name and namespace in each cluster and adding the annotation io.cilium/global-service: "true" to declare it global. Cilium will automatically perform load-balancing to pods in both clusters.

apiVersion: v1
kind: Service
metadata:
  name: rebel-base
  annotations:
    io.cilium/global-service: "true"
spec:
  type: ClusterIP
  ports:
  - port: 80
  selector:
    name: rebel-base
Deploying a simple example service
  1. In cluster 1, deploy:

    kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/clustermesh/global-service-example/cluster1.yaml
    
  2. In cluster 2, deploy:

    kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/clustermesh/global-service-example/cluster2.yaml
    
  3. From either cluster, access the global service:

    kubectl exec -ti xwing-xxx -- curl rebel-base
    

    You will see replies from pods in both clusters.

Security Policies

As addressing and network security are decoupled, network security enforcement automatically spans across clusters. Note that Kubernetes security policies are not automatically distributed across clusters; it is your responsibility to apply CiliumNetworkPolicy or NetworkPolicy in all clusters.

Allowing specific communication between clusters

The following policy illustrates how to allow particular pods to communicate between two clusters. The cluster name refers to the name given via the --cluster-name agent option or the cluster-name ConfigMap option.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-cross-cluster"
spec:
  description: "Allow x-wing in cluster1 to contact rebel-base in cluster2"
  endpointSelector:
    matchLabels:
      name: x-wing
      io.cilium.k8s.policy.cluster: cluster1
  egress:
  - toEndpoints:
    - matchLabels:
        name: rebel-base
        io.cilium.k8s.policy.cluster: cluster2
Troubleshooting

Use the following list of steps to troubleshoot issues with ClusterMesh:

Generic
  1. Validate that the cilium-xxx as well as the cilium-operator-xxx pods are healthy and ready. It is important that cilium-operator is healthy, as it is responsible for synchronizing state from the local cluster into the kvstore. If this fails, check the logs of these pods to track down the reason for failure.

  2. Validate that the ClusterMesh subsystem is initialized by looking for a cilium-agent log message like this:

    level=info msg="Initializing ClusterMesh routing" path=/var/lib/cilium/clustermesh/ subsys=daemon
    
Control Plane Connectivity
  1. Validate that the configuration for remote clusters is picked up correctly. For each remote cluster, an info log message New remote cluster configuration along with the remote cluster name must be logged in the cilium-agent logs.

    If the configuration is not found, check the following:

    • The Kubernetes secret clustermesh-secrets is imported correctly.
    • The secret contains a file for each remote cluster with the filename matching the name of the remote cluster.
    • The contents of the file in the secret is a valid etcd configuration consisting of the IP to reach the remote etcd as well as the required certificates to connect to that etcd.
    • Run a kubectl exec -ti [...] bash in one of the Cilium pods and check the contents of the directory /var/lib/cilium/clustermesh/. It must contain a configuration file for each remote cluster along with all the required SSL certificates and keys. The filenames must match the cluster names as provided by the --cluster-name argument or cluster-name ConfigMap option. If the directory is empty or incomplete, regenerate the secret and ensure that it is correctly mounted into the DaemonSet.
  2. Validate that the connection to the remote cluster could be established. You will see a log message like this in the cilium-agent logs for each remote cluster:

    level=info msg="Connection to remote cluster established"
    

    If the connection failed, you will see a warning like this:

    level=warning msg="Unable to establish etcd connection to remote cluster"
    

    If the connection fails, the cause can be one of the following:

    • Validate that the hostAliases section in the Cilium DaemonSet maps each remote cluster to the IP of the LoadBalancer that makes the remote control plane available.

    • Validate that a local node in the source cluster can reach the IP specified in the hostAliases section. The clustermesh-secrets secret contains a configuration file for each remote cluster; it will point to a logical name representing the remote cluster:

      endpoints:
      - https://cluster1.mesh.cilium.io:2379
      

      The name will NOT be resolvable via DNS outside of the cilium pod. The name is mapped to an IP using hostAliases. Run kubectl -n kube-system get ds cilium -o yaml and grep for the FQDN to retrieve the IP that is configured. Then use curl to validate that the port is reachable.

    • A firewall between the local cluster and the remote cluster may drop the control plane connection. Ensure that port 2379/TCP is allowed.

State Propagation
  1. Run cilium node list in one of the Cilium pods and validate that it lists both local nodes and nodes from remote clusters. If this discovery does not work, validate the following:

    • In each cluster, check that the kvstore contains information about local nodes by running:

      cilium kvstore get --recursive cilium/state/nodes/v1/
      

      Note

      The kvstore will only contain nodes of the local cluster. It will not contain nodes of remote clusters. The state in the kvstore is used for other clusters to discover all nodes so it is important that local nodes are listed.

  2. Validate the connectivity health matrix across clusters by running cilium-health status inside any Cilium pod. It will list the status of the connectivity health check to each remote node.

    If this fails:

    • Make sure that the network allows the health checking traffic as specified in the section Firewall Rules.
  3. Validate that identities are synchronized correctly by running cilium identity list in one of the Cilium pods. It must list identities from all clusters. You can determine what cluster an identity belongs to by looking at the label io.cilium.k8s.policy.cluster.

    If this fails:

    • Is the identity information available in the kvstore of each cluster? You can confirm this by running cilium kvstore get --recursive cilium/state/identities/v1/.

      Note

      The kvstore will only contain identities of the local cluster. It will not contain identities of remote clusters. The state in the kvstore is used for other clusters to discover all identities so it is important that local identities are listed.

  4. Validate that the IP cache is synchronized correctly by running cilium bpf ipcache list or cilium map get cilium_ipcache. The output must contain pod IPs from local and remote clusters.

    If this fails:

    • Is the IP cache information available in the kvstore of each cluster? You can confirm this by running cilium kvstore get --recursive cilium/state/ip/v1/.

      Note

      The kvstore will only contain IPs of the local cluster. It will not contain IPs of remote clusters. The state in the kvstore is used for other clusters to discover all pod IPs, so it is important that local IPs are listed.

  5. When using global services, ensure that global services are configured with endpoints from all clusters. Run cilium service list in any Cilium pod and validate that the backend IPs consist of pod IPs from all clusters running relevant backends. You can further validate the correct datapath plumbing by running cilium bpf lb list to inspect the state of the BPF maps.

    If this fails:

    • Are services available in the kvstore of each cluster? You can confirm this by running cilium kvstore get --recursive cilium/state/services/v1/.

    • Run cilium debuginfo and look for the section “k8s-service-cache”. In that section, you will find the contents of the service correlation cache. It will list the Kubernetes services and endpoints of the local cluster. It will also have a section externalEndpoints which must list all endpoints of remote clusters.

      #### k8s-service-cache
      
      (*k8s.ServiceCache)(0xc00000c500)({
      [...]
       services: (map[k8s.ServiceID]*k8s.Service) (len=2) {
         (k8s.ServiceID) default/kubernetes: (*k8s.Service)(0xc000cd11d0)(frontend:172.20.0.1/ports=[https]/selector=map[]),
         (k8s.ServiceID) kube-system/kube-dns: (*k8s.Service)(0xc000cd1220)(frontend:172.20.0.10/ports=[metrics dns dns-tcp]/selector=map[k8s-app:kube-dns])
       },
       endpoints: (map[k8s.ServiceID]*k8s.Endpoints) (len=2) {
         (k8s.ServiceID) kube-system/kube-dns: (*k8s.Endpoints)(0xc0000103c0)(10.16.127.105:53/TCP,10.16.127.105:53/UDP,10.16.127.105:9153/TCP),
         (k8s.ServiceID) default/kubernetes: (*k8s.Endpoints)(0xc0000103f8)(192.168.33.11:6443/TCP)
       },
       externalEndpoints: (map[k8s.ServiceID]k8s.externalEndpoints) {
       }
      })
      

      The sections services and endpoints represent the services of the local cluster, the section externalEndpoints lists all remote services and will be correlated with services matching the same ServiceID.

Limitations
  • L7 security policies currently only work across multiple clusters if worker nodes have routes installed that allow routing pod IPs of all clusters. This is the case when running in direct routing mode with a routing daemon or --auto-direct-node-routes, but it won’t work automatically when using tunnel/encapsulation mode.
  • The number of clusters that can be connected together is currently limited to 255. This limitation will be lifted in the future when running in direct routing mode or when running in encapsulation mode with encryption enabled.
Roadmap Ahead
  • Future versions will put an API server in front of etcd to provide better scalability, simplify the installation, and support any etcd setup.
  • Introduction of IPsec and use of ESP, or utilization of the traffic class field in the IPv6 header, will allow using more than 8 bits for the cluster-id and thus support more than 256 clusters.

Cilium integration with Flannel (beta)

This guide contains the necessary steps to run Cilium on top of your Flannel cluster.

If you have a cluster already set up with Flannel you will not need to install Flannel again.

This Cilium integration with Flannel was performed with Flannel 0.10.0 and Kubernetes >= 1.9. If you find any issues with previous Flannel versions, please feel free to reach out to us for help.

Note

This is a beta feature. Please provide feedback and file a GitHub issue if you experience any problems.

The feature lacks support of the following, which will be resolved in upcoming Cilium releases:

  • L7 policy enforcement
Flannel installation

NOTE: If kubeadm is used, then pass --pod-network-cidr=10.244.0.0/16 to kubeadm init to ensure that the podCIDR is set.

kubectl apply -f  https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/addons/flannel/flannel.yaml

Wait until all pods are in the Ready state before proceeding to the next step.

Cilium installation

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Setup Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.flannel.enabled=true

Set global.flannel.uninstallOnExit=true if you want Cilium to uninstall itself when the Cilium pod is stopped.

If the Flannel bridge has a different name than cni0, you must specify the name by setting global.flannel.masterDevice=....
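
Putting these options together, a full invocation might look like the following sketch; the bridge name cbr0 is purely illustrative, so substitute your actual Flannel bridge name:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.flannel.enabled=true \
  --set global.flannel.masterDevice=cbr0 \
  --set global.flannel.uninstallOnExit=true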

Cilium might not come up immediately on all nodes, since Flannel only sets up the bridge network interface that connects containers with the outside world when the first container is created on that node. In this case, Cilium will wait until that bridge is created before marking itself as Ready.

IPVLAN based Networking (beta)

This guide explains how to configure Cilium to set up an ipvlan-based datapath instead of the default veth-based one.

Note

This is a beta feature. Please provide feedback and file a GitHub issue if you experience any problems.

The feature lacks support of the following, which will be resolved in upcoming Cilium releases:

  • IPVLAN L2 mode
  • L7 policy enforcement
  • NAT64
  • IPVLAN with tunneling
  • BPF-based masquerading

Note

The ipvlan-based datapath in L3 mode requires a v4.12 or more recent Linux kernel, while L3S mode additionally requires a stable kernel with the fix mentioned in this document (see below).

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Setup Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.datapathMode=ipvlan \
  --set global.ipvlan.masterDevice=eth0 \
  --set global.tunnel=disabled

It is required to specify the master ipvlan device, which typically points to the networking device facing the external network. This is done by setting global.ipvlan.masterDevice to the name of the networking device, such as "eth0" or "bond0". Be aware that this option will be used by all nodes, so the device name must be consistent on all nodes where you are going to deploy Cilium.

The ipvlan datapath only supports direct routing mode right now; therefore, tunneling must be disabled by setting global.tunnel to "disabled".

To make ipvlan work between hosts, routes on each host have to be installed either manually or automatically by Cilium. The latter can be enabled through setting global.autoDirectNodeRoutes to "true".

The global.installIptablesRules parameter is optional; if set to "false", Cilium will not install any iptables rules (these are mainly for interaction with kube-proxy), and in addition this will trigger ipvlan setup in L3 mode. In the default case where the parameter is "true", ipvlan is operated in L3S mode so that netfilter in the host namespace is not bypassed. Optionally, the agent can also be set up to masquerade all traffic leaving the ipvlan master device if global.masquerade is set to "true". Note that in order for L3S mode to work correctly, a kernel with the following fix is required: d5256083f62e. This fix is included in stable kernels v4.9.155, 4.14.98, 4.19.20, 4.20.6 or higher. Without this kernel fix, ipvlan in L3S mode cannot connect to the kube-apiserver.

Masquerading with iptables in L3-only mode is not possible since netfilter hooks are bypassed in the kernel in this mode, hence L3S (symmetric) had to be introduced in the kernel at the cost of performance.

Example Helm configuration for ipvlan in pure L3 mode:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.datapathMode=ipvlan \
  --set global.ipvlan.masterDevice=bond0 \
  --set global.tunnel=disabled \
  --set global.installIptablesRules=false \
  --set global.l7Proxy.enabled=false \
  --set global.autoDirectNodeRoutes=true

Example Helm configuration for ipvlan in L3S mode with iptables masquerading all traffic leaving the node:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.datapathMode=ipvlan \
  --set global.ipvlan.masterDevice=bond0 \
  --set global.tunnel=disabled \
  --set global.masquerade=true \
  --set global.autoDirectNodeRoutes=true

Verify that it has come up correctly:

kubectl -n kube-system get pods -l k8s-app=cilium
NAME                READY     STATUS    RESTARTS   AGE
cilium-crf7f        1/1       Running   0          10m

For further information on Cilium’s ipvlan datapath mode, see Architecture.

Transparent Encryption (stable/beta)

This guide explains how to configure Cilium to use IPsec based transparent encryption using Kubernetes secrets to distribute the IPsec keys. After this configuration is complete all traffic between Cilium managed endpoints, as well as Cilium managed host traffic, will be encrypted using IPsec. This guide uses Kubernetes secrets to distribute keys. Alternatively, keys may be manually distributed but that is not shown here.

Note

The encryption feature is stable in combination with the direct-routing and ENI datapath mode. In combination with encapsulation/tunneling, the feature is still in beta phase.

Generate & import the PSK

First, create a Kubernetes secret for the IPsec keys to be stored. This will generate the necessary IPsec keys, which will be distributed as a Kubernetes secret called cilium-ipsec-keys. In this example we use GCM-128-AES, but any of the supported Linux algorithms may be used. To generate, use the following command:

$ kubectl create -n kube-system secret generic cilium-ipsec-keys \
    --from-literal=keys="3 rfc4106(gcm(aes)) $(echo $(dd if=/dev/urandom count=20 bs=1 2> /dev/null| xxd -p -c 64)) 128"

The secret can be displayed with kubectl -n kube-system get secret and will be listed as ‘cilium-ipsec-keys’.

$ kubectl -n kube-system get secrets cilium-ipsec-keys
NAME                TYPE     DATA   AGE
cilium-ipsec-keys   Opaque   1      176m
Enable Encryption in Cilium

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue as both versions are mutually compatible to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.encryption.enabled=true \
  --set global.encryption.nodeEncryption=false

At this point the Cilium managed nodes will be using IPsec for all traffic. For further information on Cilium’s transparent encryption, see Architecture.

Encryption interface

If direct routing is used, an additional argument can be used to identify the network-facing interface. If no interface is specified, the default route link is chosen by inspecting the routing tables. This works in many cases, but depending on routing rules, users may need to specify the encryption interface as follows:

--set global.encryption.interface=ethX
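
As an illustration, a complete invocation with an explicit interface might look like the following sketch; eth0 is a placeholder for your network-facing device:

# eth0 is a placeholder for the network-facing interface on your nodes
helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.encryption.enabled=true \
  --set global.encryption.nodeEncryption=false \
  --set global.encryption.interface=eth0
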
Node to node encryption

In order to enable node-to-node encryption, add:

[...]
--set global.encryption.enabled=true \
--set global.encryption.nodeEncryption=true
Validate the Setup

Run a bash shell in one of the Cilium pods with kubectl -n kube-system exec -ti cilium-7cpsm -- bash and execute the following commands:

  1. Install tcpdump
apt-get update
apt-get -y install tcpdump
  2. Check that traffic is encrypted:
tcpdump -n -i cilium_vxlan
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on cilium_vxlan, link-type EN10MB (Ethernet), capture size 262144 bytes
15:16:21.626416 IP 10.60.1.1 > 10.60.0.1: ESP(spi=0x00000001,seq=0x57e2), length 180
15:16:21.626473 IP 10.60.1.1 > 10.60.0.1: ESP(spi=0x00000001,seq=0x57e3), length 180
15:16:21.627167 IP 10.60.0.1 > 10.60.1.1: ESP(spi=0x00000001,seq=0x579d), length 100
15:16:21.627296 IP 10.60.0.1 > 10.60.1.1: ESP(spi=0x00000001,seq=0x579e), length 100
15:16:21.627523 IP 10.60.0.1 > 10.60.1.1: ESP(spi=0x00000001,seq=0x579f), length 180
15:16:21.627699 IP 10.60.1.1 > 10.60.0.1: ESP(spi=0x00000001,seq=0x57e4), length 100
15:16:21.628408 IP 10.60.1.1 > 10.60.0.1: ESP(spi=0x00000001,seq=0x57e5), length 100
Key Rotation

To replace the cilium-ipsec-keys secret with a new key:

KEYID=$(kubectl get secret -n kube-system cilium-ipsec-keys -o yaml|grep keys: | awk '{print $2}' | base64 -d | awk '{print $1}')
if [[ $KEYID -gt 15 ]]; then KEYID=0; fi
data=$(echo "{\"stringData\":{\"keys\":\"$((($KEYID+1))) "rfc4106\(gcm\(aes\)\)" $(echo $(dd if=/dev/urandom count=20 bs=1 2> /dev/null| xxd -p -c 64)) 128\"}}")
kubectl patch secret -n kube-system cilium-ipsec-keys -p="${data}" -v=1

Then restart the cilium agents to transition to the new key, for example as shown below. During the transition, the new and old keys will be in use. The cilium agent keeps per-endpoint data on which key is used by each endpoint and will use the correct key if either side has not yet been updated. In this way, encryption will work as new keys are rolled out.
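
For example, the agents can be restarted by deleting the Cilium pods so that they are recreated by the DaemonSet; this sketch reuses the k8s-app=cilium label shown earlier in this guide:

# Deleting the pods lets the DaemonSet recreate them with the new key
kubectl -n kube-system delete pods -l k8s-app=cilium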

The KEYID environment variable in the above example stores the current key ID used by Cilium. The key variable is a uint8 with a value between 0 and 16 and should be monotonically increased on every re-key, with a rollover from 16 to 0. The cilium agent will default to a KEYID of zero if it is not specified in the secret.
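
To inspect the key ID currently stored in the secret, a command along the following lines can be used (a sketch, reusing the cilium-ipsec-keys secret from above):

# Prints the first field of the decoded keys entry, i.e. the key ID
kubectl -n kube-system get secret cilium-ipsec-keys -o jsonpath='{.data.keys}' | base64 -d | awk '{print $1}'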

Troubleshooting
  • Make sure that the Cilium pods have kvstore connectivity:

    cilium status
    KVStore:                Ok   etcd: 1/1 connected: http://127.0.0.1:31079 - 3.3.2 (Leader)
    [...]
    
  • Check for level=warning and level=error messages in the Cilium log files

  • Run a bash shell in a Cilium pod and validate the following:

    • Routing rules matching on fwmark:

      ip rule list
      1:      from all fwmark 0xd00/0xf00 lookup 200
      1:      from all fwmark 0xe00/0xf00 lookup 200
      [...]
      
    • Content of routing table 200

      ip route list table 200
      local 10.60.0.0/24 dev cilium_vxlan proto 50 scope host
      10.60.1.0/24 via 10.60.0.1 dev cilium_host
      
    • XFRM policy:

      ip xfrm p
      src 10.60.1.1/24 dst 10.60.0.1/24
              dir fwd priority 0
              mark 0xd00/0xf00
              tmpl src 10.60.1.1 dst 10.60.0.1
                      proto esp spi 0x00000001 reqid 1 mode tunnel
      src 10.60.1.1/24 dst 10.60.0.1/24
              dir in priority 0
              mark 0xd00/0xf00
              tmpl src 10.60.1.1 dst 10.60.0.1
                      proto esp spi 0x00000001 reqid 1 mode tunnel
      src 10.60.0.1/24 dst 10.60.1.1/24
              dir out priority 0
              mark 0xe00/0xf00
              tmpl src 10.60.0.1 dst 10.60.1.1
                      proto esp spi 0x00000001 reqid 1 mode tunnel
      
    • XFRM state:

      ip xfrm s
      src 10.60.0.1 dst 10.60.1.1
              proto esp spi 0x00000001 reqid 1 mode tunnel
              replay-window 0
              auth-trunc hmac(sha256) 0x6162636465666768696a6b6c6d6e6f70717273747576777a797a414243444546 96
              enc cbc(aes) 0x6162636465666768696a6b6c6d6e6f70717273747576777a797a414243444546
              anti-replay context: seq 0x0, oseq 0xe0c0, bitmap 0x00000000
              sel src 0.0.0.0/0 dst 0.0.0.0/0
      src 10.60.1.1 dst 10.60.0.1
              proto esp spi 0x00000001 reqid 1 mode tunnel
              replay-window 0
              auth-trunc hmac(sha256) 0x6162636465666768696a6b6c6d6e6f70717273747576777a797a414243444546 96
              enc cbc(aes) 0x6162636465666768696a6b6c6d6e6f70717273747576777a797a414243444546
              anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
              sel src 0.0.0.0/0 dst 0.0.0.0/0
      
Disabling Encryption

To disable encryption, regenerate the YAML with the option global.encryption.enabled=false.
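
With the Helm-based setup from above, one possible sketch is to upgrade the release with the flag flipped (the use of --reuse-values assumes your other settings were applied through the same release):

# --reuse-values keeps the other values of the existing release
helm upgrade cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --reuse-values \
  --set global.encryption.enabled=false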

Host-Reachable Services

This guide explains how to configure Cilium to enable services to be reached from the host namespace in addition to pod namespaces.

Note

Host-reachable services for TCP and UDP require a v4.19.57, v5.1.16, v5.2.0 or more recent Linux kernel. Note that v5.0.y kernels do not have the fix required to run host-reachable services with UDP since at this point in time the v5.0.y stable kernel is end-of-life (EOL) and not maintained anymore. For enabling only TCP-based host-reachable services, a v4.17.0 or newer kernel is required. A v5.8 kernel provides the full feature set.

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue as both versions are mutually compatible to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.hostServices.enabled=true

If you can’t run 4.19.57 but have 4.17.0 available, you can restrict protocol support to TCP only:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.hostServices.enabled=true \
  --set global.hostServices.protocols=tcp

Host-reachable services are transparent to Cilium’s lower-layer datapath: upon a connect system call (TCP, connected UDP) or upon sendmsg as well as recvmsg (UDP), the destination IP is checked for an existing service IP and one of the service backends is selected as a target. This means that while the application assumes it is connected to the service address, the corresponding kernel socket is actually connected to the backend address, and therefore no additional lower-layer NAT is required.

Verify that it has come up correctly:

kubectl -n kube-system get pods -l k8s-app=cilium
NAME                READY     STATUS    RESTARTS   AGE
cilium-crf7f        1/1       Running   0          10m
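
Optionally, you can also confirm in the agent status that host-reachable services are active; a sketch using the pod name from the listing above:

# The pod name is taken from the listing above; adapt it to your cluster
kubectl -n kube-system exec -it cilium-crf7f -- cilium status | grep HostReachableServices
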
Limitations
  • The kernel BPF cgroup hooks operate at the connect(2), sendmsg(2) and recvmsg(2) system call layers to connect the application to one of the service backends. In the v5.8 Linux kernel, a getpeername(2) hook for BPF has been added in order to also reverse-translate the connected socket addresses for the application’s getpeername(2) calls in Cilium. For kernels older than v5.8, this reverse translation does not take place for that system call. For the vast majority of applications, not having this translation at getpeername(2) does not cause any issues. There is one known case for libceph, where its monitor might return an error since the expected peer address mismatches.

Kubernetes without kube-proxy

This guide explains how to provision a Kubernetes cluster without kube-proxy, and to use Cilium to fully replace it. For simplicity, we will use kubeadm to bootstrap the cluster.

For installing kubeadm and for more provisioning options please refer to the official kubeadm documentation.

Note

Cilium’s kube-proxy replacement depends on the Host-Reachable Services feature, therefore a v4.19.57, v5.1.16, v5.2.0 or more recent Linux kernel is required. We recommend a v5.3 or even more recent Linux kernel such as v5.7 as Cilium can perform additional optimizations in its kube-proxy replacement implementation.

Note that v5.0.y kernels do not have the fix required to run the kube-proxy replacement since at this point in time the v5.0.y stable kernel is end-of-life (EOL) and not maintained anymore on kernel.org. For individual distribution maintained kernels, the situation could differ. Therefore, please check with your distribution.

Quick-Start

Initialize the control-plane node via kubeadm init, set a pod network CIDR and skip the kube-proxy add-on:

kubeadm init --pod-network-cidr=10.217.0.0/16 --skip-phases=addon/kube-proxy

In K8s 1.15 and older it is not yet possible to disable kube-proxy via --skip-phases=addon/kube-proxy in kubeadm, therefore the below workaround for manually removing the kube-proxy DaemonSet and cleaning the corresponding iptables rules after kubeadm initialization is still necessary (kubeadm#1733).

Initialize control-plane as first step with a given pod network CIDR:

kubeadm init --pod-network-cidr=10.217.0.0/16

Then delete the kube-proxy DaemonSet and remove its iptables rules as follows:

kubectl -n kube-system delete ds kube-proxy
iptables-restore <(iptables-save | grep -v KUBE)

Afterwards, join worker nodes by specifying the control-plane node IP address and the token returned by kubeadm init:

kubeadm join <..>

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue as both versions are mutually compatible to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Next, generate the required YAML files and deploy them. Important: Replace API_SERVER_IP and API_SERVER_PORT below with the concrete control-plane node IP address and the kube-apiserver port number reported by kubeadm init (usually, it is port 6443).

Specifying this is necessary because kubeadm init is run explicitly without setting up kube-proxy: although it exports KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT with a ClusterIP of the kube-apiserver service to the environment, there is no kube-proxy in our setup provisioning that service. The Cilium agent therefore needs to be made aware of this information through the configuration below.

helm install cilium cilium/cilium --version 1.8.90 \
    --namespace kube-system \
    --set global.kubeProxyReplacement=strict \
    --set global.k8sServiceHost=API_SERVER_IP \
    --set global.k8sServicePort=API_SERVER_PORT

This will install Cilium as a CNI plugin with the BPF kube-proxy replacement to implement handling of Kubernetes services of type ClusterIP, NodePort, ExternalIPs and LoadBalancer. On top of that the BPF kube-proxy replacement also supports hostPort for containers such that using portmap is not necessary anymore.

Finally, as a last step, verify that Cilium has come up correctly on all nodes and is ready to operate:

kubectl -n kube-system get pods -l k8s-app=cilium
NAME                READY     STATUS    RESTARTS   AGE
cilium-fmh8d        1/1       Running   0          10m
cilium-mkcmb        1/1       Running   0          10m

Note that in the above helm configuration, kubeProxyReplacement has been set to strict mode. This means that the Cilium agent will bail out if the underlying Linux kernel support is missing.

Without explicitly specifying a kubeProxyReplacement option, helm uses kubeProxyReplacement=probe by default, which automatically disables a subset of the features implementing the kube-proxy replacement instead of bailing out if kernel support is missing. This assumes that Cilium’s BPF kube-proxy replacement co-exists with kube-proxy on the system to optimize Kubernetes services. Given we’ve used kubeadm to explicitly deploy a kube-proxy-free setup, strict mode has been used instead to ensure that we do not rely on a (non-existing) fallback.

Cilium’s BPF kube-proxy replacement is supported in direct routing as well as in tunneling mode.

Validate the Setup

After deploying Cilium with above Quick-Start guide, we can first validate that the Cilium agent is running in the desired mode:

kubectl exec -it -n kube-system cilium-fmh8d -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement:   Strict      (eth0)  [NodePort (SNAT, 30000-32767, XDP: NONE), HostPort, ExternalIPs, HostReachableServices (TCP, UDP)]

As an optional next step, we deploy nginx pods, create a new NodePort service, and validate that Cilium installed the service correctly.

The following yaml is used for the backend pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx
        ports:
        - containerPort: 80

Verify that the nginx pods are up and running:

kubectl get pods -l run=my-nginx -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP             NODE   NOMINATED NODE   READINESS GATES
my-nginx-756fb87568-gmp8c   1/1     Running   0          62m   10.217.0.149   apoc   <none>           <none>
my-nginx-756fb87568-n5scv   1/1     Running   0          62m   10.217.0.107   apoc   <none>           <none>

In the next step, we create a NodePort service for the two instances:

kubectl expose deployment my-nginx --type=NodePort --port=80
service/my-nginx exposed

Verify that the NodePort service has been created:

kubectl get svc my-nginx
NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
my-nginx   NodePort   10.104.239.135   <none>        80:31940/TCP   24m

With the help of the cilium service list command, we can validate that Cilium’s BPF kube-proxy replacement created the new NodePort service under port 31940:

kubectl exec -it -n kube-system cilium-fmh8d -- cilium service list
ID   Frontend               Service Type   Backend
[...]
4    10.104.239.135:80      ClusterIP      1 => 10.217.0.107:80
                                           2 => 10.217.0.149:80
5    0.0.0.0:31940          NodePort       1 => 10.217.0.107:80
                                           2 => 10.217.0.149:80
6    192.168.178.29:31940   NodePort       1 => 10.217.0.107:80
                                           2 => 10.217.0.149:80

At the same time, we can verify with iptables in the host namespace that no iptables rule for the service is present:

iptables-save | grep KUBE-SVC
[ empty line ]

Last but not least, a simple curl test shows connectivity for the exposed NodePort port 31940 as well as for the ClusterIP:

curl 127.0.0.1:31940
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
[....]
curl 10.104.239.135:80
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
[....]

As can be seen, Cilium’s BPF kube-proxy replacement is set up correctly.

Advanced Configuration

This section covers a few advanced configuration modes for the kube-proxy replacement that go beyond the above Quick-Start guide and are entirely optional.

Direct Server Return (DSR)

By default, Cilium’s BPF NodePort implementation operates in SNAT mode. That is, when node-external traffic arrives and the node determines that the backend for the NodePort or ExternalIPs service is on a remote node, the node redirects the request to the remote backend on its behalf by performing SNAT. This does not require any additional MTU changes, at the cost that replies from the backend need to make the extra hop back to that node in order to perform the reverse SNAT translation there before returning the packet directly to the external client.

This setting can be changed by setting the global.nodePort.mode helm option to dsr in order to let Cilium’s BPF NodePort implementation operate in DSR mode. In this mode, the backends reply directly to the external client without taking the extra hop, meaning backends reply by using the service IP/port as a source. DSR currently requires Cilium to be deployed in Direct / Native Routing Mode, i.e. it will not work in either tunneling mode.

Another advantage of DSR mode is that the client’s source IP is preserved, so policy can match on it at the backend node. In SNAT mode this is not possible. Given that a specific backend can be used by multiple services, the backends need to be made aware of the service IP/port which they need to reply with. Therefore, Cilium encodes this information in a Cilium-specific IPv4 option or IPv6 Destination Option extension header, at the cost of advertising a lower MTU. For TCP services, Cilium only encodes the service IP/port for the SYN packet, but not subsequent ones. The latter also allows operating Cilium in a hybrid mode as detailed in the next subsection, where DSR is used for TCP and SNAT for UDP in order to avoid an otherwise needed MTU reduction.

Note that usage of DSR mode might not work in some public cloud provider environments due to the Cilium-specific IP options that could be dropped by an underlying fabric. Therefore, in case of connectivity issues to services where backends are located on a remote node from the node that is processing the given NodePort request, it is advised to first check whether the NodePort request actually arrived on the node containing the backend. If this was not the case, then switching back to the default SNAT mode would be advised as a workaround.
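
One way to perform this check is to capture the NodePort traffic on the backend node, for example with a sketch like the following; eth0 and port 31940 are placeholders taken from the validation example earlier in this guide:

# eth0 and 31940 are placeholders for the node's external interface and the NodePort in question
tcpdump -n -i eth0 tcp dst port 31940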

The above helm example configuration in a kube-proxy-free environment with DSR-only mode enabled would look as follows:

helm install cilium cilium/cilium --version 1.8.90 \
    --namespace kube-system \
    --set global.tunnel=disabled \
    --set global.autoDirectNodeRoutes=true \
    --set global.kubeProxyReplacement=strict \
    --set global.nodePort.mode=dsr \
    --set global.k8sServiceHost=API_SERVER_IP \
    --set global.k8sServicePort=API_SERVER_PORT
Hybrid DSR and SNAT Mode

Cilium also supports a hybrid DSR and SNAT mode, that is, DSR is performed for TCP and SNAT for UDP connections. This has the advantage that it removes the need for manual MTU changes in the network while still benefiting from the latency improvements through the removed extra hop for replies, in particular, when TCP is the main transport for workloads.

The mode setting global.nodePort.mode allows controlling the behavior through the options dsr, snat and hybrid. By default, the snat mode is used in the agent.

A helm example configuration in a kube-proxy-free environment with DSR enabled in hybrid mode would look as follows:

helm install cilium cilium/cilium --version 1.8.90 \
    --namespace kube-system \
    --set global.tunnel=disabled \
    --set global.autoDirectNodeRoutes=true \
    --set global.kubeProxyReplacement=strict \
    --set global.nodePort.mode=hybrid \
    --set global.k8sServiceHost=API_SERVER_IP \
    --set global.k8sServicePort=API_SERVER_PORT
NodePort XDP Acceleration

Cilium has built-in support for accelerating NodePort, ExternalIPs and LoadBalancer services for the case where the arriving request needs to be pushed back out of the node when the backend is located on a remote node. This ability to act as a hairpin load balancer can be handled by Cilium at the XDP (eXpress Data Path) layer where BPF is operating directly in the networking driver instead of a higher layer.

The mode setting global.nodePort.acceleration allows enabling this acceleration through the option native. The option none is the default and disables the acceleration. The majority of drivers supporting 10G or higher rates also support native XDP on a recent kernel. For cloud-based deployments, most of these drivers have SR-IOV variants that support native XDP as well.

The global.nodePort.acceleration setting is supported for DSR, SNAT and hybrid modes and can be enabled as follows for nodePort.mode=hybrid in this example:

helm install cilium cilium/cilium --version 1.8.90 \
    --namespace kube-system \
    --set global.tunnel=disabled \
    --set global.autoDirectNodeRoutes=true \
    --set global.kubeProxyReplacement=strict \
    --set global.nodePort.acceleration=native \
    --set global.nodePort.mode=hybrid \
    --set global.k8sServiceHost=API_SERVER_IP \
    --set global.k8sServicePort=API_SERVER_PORT

The current Cilium kube-proxy XDP acceleration mode can also be introspected through the cilium status CLI command:

kubectl exec -it -n kube-system cilium-xxxxx -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement:   Strict   [NodePort (SNAT, 30000-32767, XDP: NATIVE), HostPort, ExternalIPs, HostReachableServices (TCP, UDP)]
NodePort Device, Port and Bind settings

When running Cilium’s BPF kube-proxy replacement, by default, a NodePort or ExternalIPs service will be accessible through the IP address of a native device which has the default route on the host. To change the device, set its name in the global.nodePort.device helm option.

In addition, thanks to the Host-Reachable Services feature, the NodePort service can be accessed by default from a host or a pod within a cluster via its public, any local (except for docker* prefixed names) or loopback address, e.g. 127.0.0.1:NODE_PORT.

If kube-apiserver was configured to use a non-default NodePort port range, then the same range must be passed to Cilium via the global.nodePort.range option, for example, as --set global.nodePort.range="10000\,32767" for a range of 10000-32767. The default Kubernetes NodePort range is 30000-32767.

If the NodePort port range overlaps with the ephemeral port range (net.ipv4.ip_local_port_range), Cilium will append the NodePort range to the reserved ports (net.ipv4.ip_local_reserved_ports). This is needed to prevent a NodePort service from hijacking traffic of a host-local application whose source port matches the service port. To disable the modification of the reserved ports, set global.nodePort.autoProtectPortRanges to false.

By default, the NodePort implementation prevents application bind(2) requests to NodePort service ports. In such a case, the application will typically see a bind: Operation not permitted error. For older kernels this happens globally; starting from v5.7 kernels it applies by default only to the host namespace and therefore no longer affects any application pod bind(2) requests. To opt out of this behavior in general, expert users can switch global.nodePort.bindProtection to false.
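
As an illustration, several of these settings can be combined in a single helm invocation; the following sketch uses placeholder values (em1 as device, 10000-32767 as port range) that must be adapted to your environment:

# em1 and the port range below are placeholders; adapt them to your cluster
helm install cilium cilium/cilium --version 1.8.90 \
    --namespace kube-system \
    --set global.kubeProxyReplacement=strict \
    --set global.nodePort.device=em1 \
    --set global.nodePort.range="10000\,32767" \
    --set global.k8sServiceHost=API_SERVER_IP \
    --set global.k8sServicePort=API_SERVER_PORT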

Container hostPort support

Although not part of kube-proxy, Cilium’s BPF kube-proxy replacement also natively supports hostPort service mapping without having to use the Helm CNI chaining option of global.cni.chainingMode=portmap.

By specifying global.kubeProxyReplacement=strict or global.kubeProxyReplacement=probe the native hostPort support is automatically enabled and therefore no further action is required. Otherwise global.hostPort.enabled=true can be used to enable the setting.

An example deployment in a kube-proxy-free environment therefore is the same as in the earlier getting started deployment:

helm install cilium cilium/cilium --version 1.8.90 \
    --namespace kube-system \
    --set global.kubeProxyReplacement=strict \
    --set global.k8sServiceHost=API_SERVER_IP \
    --set global.k8sServicePort=API_SERVER_PORT

Also, ensure that each node IP is known via INTERNAL-IP or EXTERNAL-IP, for example:

kubectl get nodes -o wide
NAME   STATUS   ROLES    AGE     VERSION   INTERNAL-IP      EXTERNAL-IP   [...]
apoc   Ready    master   6h15m   v1.17.3   192.168.178.29   <none>        [...]
tank   Ready    <none>   6h13m   v1.17.3   192.168.178.28   <none>        [...]

If this is not the case, then kubelet needs to be made aware of it by specifying --node-ip via KUBELET_EXTRA_ARGS. Assuming eth0 is the public facing interface, this can be achieved by:

echo KUBELET_EXTRA_ARGS="--node-ip=$(ip -4 -o a show eth0 | awk '{print $4}' | cut -d/ -f1)" | tee -a /etc/default/kubelet

After updating /etc/default/kubelet, kubelet needs to be restarted.

The following modified example yaml from the setup validation with an additional hostPort: 8080 parameter can be used to verify the mapping:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 1
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx
        ports:
        - containerPort: 80
          hostPort: 8080

After deployment, we can validate that Cilium’s BPF kube-proxy replacement exposed the container as HostPort under the specified port 8080:

kubectl exec -it -n kube-system cilium-fmh8d -- cilium service list
ID   Frontend               Service Type   Backend
[...]
5    192.168.178.29:8080    HostPort       1 => 10.29.207.199:80

Similarly, we can verify with iptables in the host namespace that no iptables rule for the HostPort service is present:

iptables-save | grep HOSTPORT
[ empty line ]

Last but not least, a simple curl test shows connectivity for the exposed HostPort container under the node’s IP:

curl 192.168.178.29:8080
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
[....]

Removing the deployment also removes the corresponding HostPort from the cilium service list dump:

kubectl delete deployment my-nginx
kube-proxy Hybrid Modes

Cilium’s BPF kube-proxy replacement can be configured in several modes, i.e. it can replace kube-proxy entirely or it can co-exist with kube-proxy on the system if the underlying Linux kernel requirements do not support a full kube-proxy replacement.

This section therefore elaborates on the various global.kubeProxyReplacement options:

  • global.kubeProxyReplacement=strict: This option expects a kube-proxy-free Kubernetes setup where Cilium is expected to fully replace all kube-proxy functionality. Once the Cilium agent is up and running, it takes care of handling Kubernetes services of type ClusterIP, NodePort, ExternalIPs and LoadBalancer as well as HostPort. If the underlying kernel version requirements are not met (see Kubernetes without kube-proxy note), then the Cilium agent will bail out on start-up with an error message.

  • global.kubeProxyReplacement=probe: This option is intended for a hybrid setup, that is, kube-proxy is running in the Kubernetes cluster where Cilium partially replaces and optimizes kube-proxy functionality. Once the Cilium agent is up and running, it probes the underlying kernel for the availability of needed BPF kernel features and, if not present, disables a subset of the functionality in BPF by relying on kube-proxy to complement the remaining Kubernetes service handling. The Cilium agent will emit an info message into its log in such case. For example, if the kernel does not support Host-Reachable Services, then the ClusterIP translation for the node’s host-namespace is done through kube-proxy’s iptables rules.

  • global.kubeProxyReplacement=partial: Similarly to probe, this option is intended for a hybrid setup, that is, kube-proxy is running in the Kubernetes cluster where Cilium partially replaces and optimizes kube-proxy functionality. As opposed to probe which checks the underlying kernel for available BPF features and automatically disables components responsible for the BPF kube-proxy replacement when kernel support is missing, the partial option requires the user to manually specify which components for the BPF kube-proxy replacement should be used. Similarly to strict mode, the Cilium agent will bail out on start-up with an error message if the underlying kernel requirements are not met. For fine-grained configuration, global.hostServices.enabled, global.nodePort.enabled, global.externalIPs.enabled and global.hostPort.enabled can be set to true. By default all four options are set to false. A few example configurations for the partial option are provided below.

    The following helm setup would be equivalent to global.kubeProxyReplacement=strict in a kube-proxy-free environment:

    helm install cilium cilium/cilium --version 1.8.90 \
        --namespace kube-system \
        --set global.kubeProxyReplacement=partial \
        --set global.hostServices.enabled=true \
        --set global.nodePort.enabled=true \
        --set global.externalIPs.enabled=true \
        --set global.hostPort.enabled=true \
        --set global.k8sServiceHost=API_SERVER_IP \
        --set global.k8sServicePort=API_SERVER_PORT
    

    The following helm setup would be equivalent to the default Cilium service handling in v1.6 or earlier in a kube-proxy environment, that is, serving ClusterIP for pods:

    helm install cilium cilium/cilium --version 1.8.90 \
        --namespace kube-system \
        --set global.kubeProxyReplacement=partial
    

    The following helm setup would optimize Cilium’s ClusterIP handling for TCP in a kube-proxy environment (global.hostServices.protocols default is tcp,udp):

    helm install cilium cilium/cilium --version 1.8.90 \
        --namespace kube-system \
        --set global.kubeProxyReplacement=partial \
        --set global.hostServices.enabled=true \
        --set global.hostServices.protocols=tcp
    

    The following helm setup would optimize Cilium’s NodePort and ExternalIPs handling for external traffic ingressing into the Cilium managed node in a kube-proxy environment:

    helm install cilium cilium/cilium --version 1.8.90 \
        --namespace kube-system \
        --set global.kubeProxyReplacement=partial \
        --set global.nodePort.enabled=true \
        --set global.externalIPs.enabled=true
    
  • global.kubeProxyReplacement=disabled: This option disables any Kubernetes service handling by fully relying on kube-proxy instead, except for ClusterIP services accessed from pods if cilium-agent’s flag --disable-k8s-services is set to false (pre-v1.6 behavior).

In Cilium’s helm chart, the default mode is global.kubeProxyReplacement=probe for new deployments.

For existing Cilium deployments in version v1.6 or prior, please consult the 1.7 Upgrade Notes.

The current Cilium kube-proxy replacement mode can also be introspected through the cilium status CLI command:

kubectl exec -it -n kube-system cilium-xxxxx -- cilium status | grep KubeProxyReplacement
KubeProxyReplacement:   Strict      (eth0)  [NodePort (SNAT, 30000-32767, XDP: NONE), HostPort, ExternalIPs, HostReachableServices (TCP, UDP)]
Limitations
  • Cilium’s BPF kube-proxy replacement currently cannot be used with Transparent Encryption (stable/beta).
  • Cilium’s BPF kube-proxy replacement relies upon the Host-Reachable Services feature which uses BPF cgroup hooks to implement the service translation. The getpeername(2) hook address translation in BPF is only available for v5.8 kernels. It is known to currently not work with libceph deployments.
  • Cilium’s DSR NodePort mode currently does not operate well in environments with TCP Fast Open (TFO) enabled. It is recommended to switch to snat mode in this situation.
Further Readings

The following presentations describe the inner workings of the kube-proxy replacement in BPF in great detail:

  • “Liberating Kubernetes from kube-proxy and iptables” (KubeCon North America 2019, slides, video)
  • “BPF as a revolutionary technology for the container landscape” (Fosdem 2020, slides, video)
  • “Kernel improvements for Cilium socket LB” (LSF/MM/BPF 2020, slides)

Kata with Cilium on Google GCE

Kata Containers is an open source project that provides a secure container runtime with lightweight virtual machines that feel and perform like containers, but provide stronger workload isolation using hardware virtualization technology as a second layer of defense. Similar to the OCI runtime runc provided by Docker, Cilium can be used with Kata Containers, providing a higher degree of security at the network layer and at the compute layer with Kata. This guide provides a walkthrough of installing Kata with Cilium on GCE. Kata Containers on Google Compute Engine (GCE) makes use of nested virtualization. At the time of this writing, nested virtualization support was not yet available on GKE.

GCE Requirements
  1. Install the Google Cloud SDK (gcloud); see Installing Google Cloud SDK. Verify your gcloud installation and configuration:
gcloud info || { echo "ERROR: no Google Cloud SDK"; exit 1; }
  2. Create a project or use an existing one
export GCE_PROJECT=kata-with-cilium
gcloud projects create $GCE_PROJECT
Create an image on GCE with Nested Virtualization support

As mentioned before, Kata Containers on Google Compute Engine (GCE) makes use of nested virtualization. As a prerequisite you need to create an image with nested virtualization enabled in your currently active GCE project.

  1. Choose a base image

Officially supported images are automatically discoverable with:

gcloud compute images list
NAME                                                  PROJECT            FAMILY                            DEPRECATED  STATUS
centos-6-v20190423                                    centos-cloud       centos-6                                      READY
centos-7-v20190423                                    centos-cloud       centos-7                                      READY
coreos-alpha-2121-0-0-v20190423                       coreos-cloud       coreos-alpha                                  READY
cos-69-10895-211-0                                    cos-cloud          cos-69-lts                                    READY
ubuntu-1604-xenial-v20180522                          ubuntu-os-cloud    ubuntu-1604-lts                               READY
ubuntu-1804-bionic-v20180522                          ubuntu-os-cloud    ubuntu-1804-lts                               READY

Select an image based on project and family rather than by name. This ensures any scripts or other automation always works with a non-deprecated image, including security updates, updates to GCE-specific scripts, etc.

  2. Create the image with nested virtualization support
SOURCE_IMAGE_PROJECT=ubuntu-os-cloud
SOURCE_IMAGE_FAMILY=ubuntu-1804-lts
IMAGE_NAME=${SOURCE_IMAGE_FAMILY}-nested

gcloud compute images create \
    --source-image-project $SOURCE_IMAGE_PROJECT \
    --source-image-family $SOURCE_IMAGE_FAMILY \
    --licenses=https://www.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx \
    $IMAGE_NAME

If successful, gcloud reports that the image was created.

  3. Verify VMX is enabled

Verify that a virtual machine created with the previous image has VMX enabled.

gcloud compute instances create \
  --image $IMAGE_NAME \
  --machine-type n1-standard-2 \
  --min-cpu-platform "Intel Broadwell" \
  kata-testing

gcloud compute ssh kata-testing
# While ssh'd into the VM:
$ [ -z "$(lscpu|grep GenuineIntel)" ] && { echo "ERROR: Need an Intel CPU"; exit 1; }
Setup Kubernetes with CRI

The Kata Containers runtime is an OCI-compatible runtime and cannot directly interact with the CRI API level. For this reason we rely on a CRI implementation to translate CRI into OCI. There are two supported options, CRI-O and CRI-containerd. It is up to you to choose the one that you want, but you have to pick one.

If you select CRI-O, follow the “CRI-O Tutorial” instructions here to properly install it. If you select containerd with cri plugin, follow the “Getting Started for Developers” instructions here to properly install it.

Setup your Kubernetes environment and make sure the following requirements are met:

  • Kubernetes >= 1.12
  • Linux kernel >= 4.9
  • Kubernetes in CNI mode
  • Running kube-dns/coredns (When using the etcd-operator installation method)
  • BPF filesystem mounted on all worker nodes
  • Enable PodCIDR allocation (--allocate-node-cidrs) in the kube-controller-manager (recommended)

Refer to the section Requirements for detailed instructions on how to prepare your Kubernetes environment.

Note

Kubernetes 1.12 is the minimum version required to use the RuntimeClass feature for the Kata Container runtime described below. It is possible to use Kubernetes <= 1.10 with Kata, but that requires a slightly different setup that has been deprecated.

Kubernetes talks with CRI implementations through a container-runtime-endpoint, also called CRI socket. This socket path is different depending on which CRI implementation you chose, and the kubelet service has to be updated accordingly.

Configure Kubernetes for CRI-O

Add /etc/systemd/system/kubelet.service.d/0-crio.conf

[Service]
Environment="KUBELET_EXTRA_ARGS=--container-runtime=remote --runtime-request-timeout=15m --container-runtime-endpoint=unix:///var/run/crio/crio.sock"
Configure Kubernetes for containerd

Add /etc/systemd/system/kubelet.service.d/0-cri-containerd.conf

[Service]
Environment="KUBELET_EXTRA_ARGS=--container-runtime=remote --runtime-request-timeout=15m --container-runtime-endpoint=unix:///run/containerd/containerd.sock"

After you update your kubelet service based on the CRI implementation you are using, reload and restart kubelet.
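
On a systemd-based node, reloading and restarting kubelet typically amounts to the following (a sketch, assuming kubelet runs as a systemd service):

# Reload systemd unit files and restart the kubelet service
systemctl daemon-reload
systemctl restart kubelet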

Deploy Cilium

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue as both versions are mutually compatible to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.containerRuntime.integration=crio

Note

If you are using containerd, set global.containerRuntime.integration=containerd.

Validate cilium

You can monitor Cilium and all required components as they are being installed:

kubectl -n kube-system get pods --watch
NAME                                    READY   STATUS              RESTARTS   AGE
cilium-cvp8q                            0/1     Init:0/1            0          53s
cilium-operator-788c55554-gkpbf         0/1     ContainerCreating   0          54s
cilium-tdzcx                            0/1     Init:0/1            0          53s
coredns-77b578f78d-km6r4                1/1     Running             0          11m
coredns-77b578f78d-qr6gq                1/1     Running             0          11m
kube-proxy-l47rx                        1/1     Running             0          6m28s
kube-proxy-zj6v5                        1/1     Running             0          6m28s

It may take a couple of minutes for the etcd-operator to bring up the necessary number of etcd pods to achieve quorum. Once it reaches quorum, all components should be healthy and ready:

kubectl -n=kube-system get pods
NAME                                    READY   STATUS    RESTARTS   AGE
cilium-cvp8q                            1/1     Running   0          42s
cilium-operator-788c55554-gkpbf         1/1     Running   2          43s
cilium-tdzcx                            1/1     Running   0          42s
coredns-77b578f78d-2khwp                1/1     Running   0          13s
coredns-77b578f78d-bs6rp                1/1     Running   0          13s
kube-proxy-l47rx                        1/1     Running   0          6m
kube-proxy-zj6v5                        1/1     Running   0          6m

For troubleshooting any issues, please refer to Installation with managed etcd

Install Kata on a running Kubernetes Cluster

Kubernetes configured with CRI runtimes uses the runc runtime by default for running a workload. You will need to configure Kubernetes to be able to use an alternate runtime.

RuntimeClass is a Kubernetes feature first introduced in Kubernetes 1.12 as alpha. It is used to select the container runtime configuration for running a pod’s containers. To use Kata Containers, ensure the RuntimeClass feature gate is enabled for k8s < 1.13, for example as sketched below. It is enabled by default on k8s 1.14. See Feature Gates for an explanation of enabling feature gates.
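
For k8s < 1.13, enabling the gate typically means passing the following flag to the kubelet and API server; how exactly component flags are set depends on your cluster setup, so treat this as a sketch:

--feature-gates=RuntimeClass=true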

To install Kata Containers and configure CRI to use Kata as a one step process, you will use kata-deploy tool as shown below.

  1. Install Kata on a running k8s cluster
kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/4bb97ef14a4ba8170b9d501b3e567037eb0f9a41/kata-deploy/kata-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/4bb97ef14a4ba8170b9d501b3e567037eb0f9a41/kata-deploy/kata-deploy.yaml

This will install all the required Kata binaries under /opt/kata and configure CRI implementation with the RuntimeClass handlers for the Kata runtime binaries. Kata Containers can leverage Qemu and Firecracker hypervisor for running the lightweight VM. kata-fc binary runs a Firecracker isolated Kata Container while kata-qemu runs a Qemu isolated Kata Container.

  2. Create the RuntimeClass resource for Kata-containers

To add a RuntimeClass for Qemu isolated Kata-Containers:

kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/4bb97ef14a4ba8170b9d501b3e567037eb0f9a41/kata-deploy/k8s-1.14/kata-qemu-runtimeClass.yaml
kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/4bb97ef14a4ba8170b9d501b3e567037eb0f9a41/kata-deploy/k8s-1.13/kata-qemu-runtimeClass.yaml

To add a RuntimeClass for Firecracker isolated Kata-Containers:

kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/4bb97ef14a4ba8170b9d501b3e567037eb0f9a41/kata-deploy/k8s-1.14/kata-fc-runtimeClass.yaml
kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/4bb97ef14a4ba8170b9d501b3e567037eb0f9a41/kata-deploy/k8s-1.13/kata-fc-runtimeClass.yaml
Run Kata Containers with Cilium CNI

Now that Kata is installed on the k8s cluster, you can run an untrusted workload with Kata Containers with Cilium as the CNI.

The following YAML snippet shows how to specify a workload should use Kata with QEMU:

spec:
  template:
    spec:
      runtimeClassName: kata-qemu

The following YAML snippet shows how to specify a workload should use Kata with Firecracker:

spec:
  template:
    spec:
      runtimeClassName: kata-fc

To run an example pod with kata-qemu:

kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/4bb97ef14a4ba8170b9d501b3e567037eb0f9a41/kata-deploy/examples/test-deploy-kata-qemu.yaml

To run an example with kata-fc:

kubectl apply -f https://raw.githubusercontent.com/kata-containers/packaging/4bb97ef14a4ba8170b9d501b3e567037eb0f9a41/kata-deploy/examples/test-deploy-kata-fc.yaml

Configuring IPAM modes

CRD-backed IPAM

This is a quick tutorial walking through how to enable CRD-backed IPAM. The purpose of this tutorial is to show how components are configured and resources interact with each other to enable users to automate or extend on their own.

For more details, see the section CRD-Backed

Enable CRD IPAM mode
  1. Setup Cilium for Kubernetes using any of the available guides.

  2. Run Cilium with the --ipam=crd option or set ipam: crd in the cilium-config ConfigMap (see the sketch after this list).

  3. Restart Cilium. Cilium will automatically register the CRD if it is not already available:

    msg="Waiting for initial IP to become available in 'k8s1' custom resource" subsys=ipam
    
  4. Validate that the CRD has been registered:

    kubectl get crds
    NAME                              CREATED AT
    [...]
    ciliumnodes.cilium.io             2019-06-08T12:26:41Z
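
A minimal sketch for the ConfigMap route mentioned in step 2, assuming the cilium-config ConfigMap and a DaemonSet named cilium as created by the standard installation:

# Set ipam to crd in the ConfigMap, then restart the agents to pick it up
kubectl -n kube-system patch configmap cilium-config --patch '{"data":{"ipam":"crd"}}'
kubectl -n kube-system rollout restart ds/cilium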
    
Create a CiliumNode CR
  1. Import the following custom resource to make the listed IP addresses available for allocation on node k8s1:

    apiVersion: "cilium.io/v2"
    kind: CiliumNode
    metadata:
      name: "k8s1"
    spec:
      ipam:
        192.168.1.1: {}
        192.168.1.2: {}
        192.168.1.3: {}
        192.168.1.4: {}
    
  2. Validate that Cilium has started up correctly

    cilium status --all-addresses
    KVStore:                Ok   etcd: 1/1 connected, has-quorum=true: https://192.168.33.11:2379 - 3.3.12 (Leader)
    [...]
    IPAM:                   IPv4: 2/4 allocated,
    Allocated addresses:
      192.168.1.1 (router)
      192.168.1.3 (health)
    
  3. Validate the status.IPAM.used section:

    kubectl get cn k8s1 -o yaml
    apiVersion: cilium.io/v2
    kind: CiliumNode
    metadata:
      name: k8s1
      [...]
    spec:
      ipam:
        192.168.1.1: {}
        192.168.1.2: {}
        192.168.1.3: {}
        192.168.1.4: {}
    status:
      ipam:
        used:
          192.168.1.1:
            owner: router
          192.168.1.3:
            owner: health
    
CRD-backed by Cilium cluster-pool IPAM

This is a quick tutorial walking through how to enable the CRD-backed Cilium cluster-pool IPAM mode. The purpose of this tutorial is to show how components are configured and resources interact with each other to enable users to automate or extend on their own.

For more details, see the section Cilium Cluster-pool IPAM

Enable Cluster-pool IPAM mode
  1. Set up Cilium for Kubernetes using helm with the option --set config.ipam=cluster-pool.
  2. Depending on whether you are using IPv4 and/or IPv6, you might want to adjust the podCIDR allocated for your cluster’s pods with the options (a combined example follows this list):
    • --set global.ipam.operator.clusterPoolIPv4PodCIDR=<IPv4CIDR>
    • --set global.ipam.operator.clusterPoolIPv6PodCIDR=<IPv6CIDR>
  3. To adjust the CIDR size that should be allocated for each node you can use the following options:
    • --set global.ipam.operator.clusterPoolIPv4MaskSize=<IPv4MaskSize>
    • --set global.ipam.operator.clusterPoolIPv6MaskSize=<IPv6MaskSize>
  4. Deploy Cilium and Cilium-Operator. Cilium will automatically wait until the podCIDR is allocated for its node by Cilium Operator.
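
A combined helm invocation with these options might look like the following sketch; the CIDR and mask size values are placeholders to be adapted to your cluster:

# The CIDR and mask size below are placeholder values
helm install cilium cilium/cilium --version 1.8.90 \
    --namespace kube-system \
    --set config.ipam=cluster-pool \
    --set global.ipam.operator.clusterPoolIPv4PodCIDR=10.0.0.0/8 \
    --set global.ipam.operator.clusterPoolIPv4MaskSize=24
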
Validate installation
  1. Validate that Cilium has started up correctly

    cilium status --all-addresses
    KVStore:                Ok   etcd: 1/1 connected, has-quorum=true: https://192.168.33.11:2379 - 3.3.12 (Leader)
    [...]
    IPAM:                   IPv4: 2/256 allocated,
    Allocated addresses:
      10.0.0.1 (router)
      10.0.0.3 (health)
    
  2. Validate the spec.ipam.podCIDRs section:

    kubectl get cn k8s1 -o yaml
    apiVersion: cilium.io/v2
    kind: CiliumNode
    metadata:
      name: k8s1
      [...]
    spec:
      ipam:
        podCIDRs:
          - 10.0.0.0/24
    

Operations

Running Prometheus & Grafana

Installation

This is an example that deploys Prometheus and Grafana together in a single deployment.

The default installation contains:

  • Grafana: A visualization dashboard with Cilium Dashboard pre-loaded.
  • Prometheus: a time series database and monitoring system.
$ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/addons/prometheus/monitoring-example.yaml
configmap/cilium-metrics-config created
namespace/cilium-monitoring created
configmap/prometheus created
deployment.extensions/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
serviceaccount/prometheus-k8s created
service/prometheus created
deployment.extensions/grafana created
service/grafana created
configmap/grafana-config created
Deploy Cilium with metrics enabled

Neither cilium-agent nor cilium-operator exposes metrics by default. Enabling metrics for these services will open ports 9090 and 6942 on all nodes of your cluster where these components are running.

To deploy Cilium with metrics enabled, set the global.prometheus.enabled=true Helm value:

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue as both versions are mutually compatible to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Set up the Helm repository:

helm repo add cilium https://helm.cilium.io/

Deploy Cilium release via Helm:

helm install cilium cilium/cilium --version 1.8.90 \
   --namespace kube-system \
   --set global.prometheus.enabled=true

Note

You can combine the global.prometheus.enabled=true option with any of the other installation guides.

How to access Grafana

Expose the port on your local machine

kubectl -n cilium-monitoring port-forward service/grafana 3000:3000

Access it via your browser: http://localhost:3000

How to access Prometheus

Expose the port on your local machine

kubectl -n cilium-monitoring port-forward service/prometheus 9090:9090

Access it via your browser: http://localhost:9090

Examples
Generic
_images/grafana_generic.png
Network
_images/grafana_network.png
Policy
_images/grafana_policy.png _images/grafana_policy2.png
Endpoints
_images/grafana_endpoints.png
Controllers
_images/grafana_controllers.png
Kubernetes
_images/grafana_k8s.png

Istio

Getting Started Using Istio

This document serves as an introduction to using Cilium to enforce security policies in Kubernetes micro-services managed with Istio. It is a detailed walk-through of getting a single-node Cilium + Istio environment running on your machine.

If you haven’t read the Introduction to Cilium yet, we’d encourage you to do that first.

The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

Setup Cilium

If you have not set up Cilium yet, pick any installation method as described in section Installation to set up Cilium for your Kubernetes environment. If in doubt, pick Getting Started Using Minikube as the simplest way to set up a Kubernetes cluster with Cilium:

minikube start --network-plugin=cni --memory=4096
minikube ssh -- sudo mount bpffs -t bpf /sys/fs/bpf
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/install/kubernetes/quick-install.yaml

Note

If running on minikube, you may need to increase the memory and CPUs available to the minikube VM beyond the defaults and/or the instructions provided here for the other GSGs. 5 GB and 4 CPUs should be enough for this GSG (--memory=5120 --cpus=4).

Step 2: Install cilium-istioctl

Note

Make sure that Cilium is running in your cluster before proceeding.

Download the cilium enhanced istioctl version 1.5.4:

Linux:

curl -L https://github.com/cilium/istio/releases/download/1.5.4/cilium-istioctl-1.5.4-linux.tar.gz | tar xz

macOS:

curl -L https://github.com/cilium/istio/releases/download/1.5.4/cilium-istioctl-1.5.4-osx.tar.gz | tar xz

Note

Cilium integration, as presented in this Getting Started Guide, has been tested with Kubernetes releases 1.14, 1.15, 1.16, 1.17, and 1.18. Note that this does not work with K8s 1.13.

Deploy the default Istio configuration profile onto Kubernetes:

./cilium-istioctl manifest apply -y

Add a namespace label to instruct Istio to automatically inject Envoy sidecar proxies when you deploy your application later:

kubectl label namespace default istio-injection=enabled
Step 3: Deploy the Bookinfo Application V1

Now that we have Cilium and Istio deployed, we can deploy version v1 of the services of the Istio Bookinfo sample application.

While the upstream Istio Bookinfo Application example for Kubernetes deploys multiple versions of the Bookinfo application at the same time, here we first deploy only version v1.

The BookInfo application is broken into four separate microservices:

  • productpage. The productpage microservice calls the details and reviews microservices to populate the page.
  • details. The details microservice contains book information.
  • reviews. The reviews microservice contains book reviews. It also calls the ratings microservice.
  • ratings. The ratings microservice contains book ranking information that accompanies a book review.

In this demo, each specific version of each microservice is deployed into Kubernetes using separate YAML files which define:

  • A Kubernetes Service.
  • A Kubernetes Deployment specifying the microservice’s pods, specific to each service version.
  • A Cilium Network Policy limiting the traffic to the microservice, specific to each service version.
_images/istio-bookinfo-v1.png

To deploy the application with manual sidecar injection, run:

for service in productpage-service productpage-v1 details-v1 reviews-v1; do \
      kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/bookinfo-${service}.yaml ; done

Check the progress of the deployment (every service should have an AVAILABLE count of 1):

watch "kubectl get deployments"
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
details-v1       1/1     1            1           12s
productpage-v1   1/1     1            1           13s
reviews-v1       1/1     1            1           12s
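
Because the default namespace was labeled for sidecar injection, each application pod should be running an Envoy proxy next to the application container. One way to spot-check this (a sketch, assuming the upstream Bookinfo labels such as app=productpage) is to list the containers of one of the pods:

kubectl get pods -l app=productpage -o jsonpath='{.items[0].spec.containers[*].name}'

The output should include istio-proxy in addition to the application container.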

Create an Istio ingress gateway for the productpage service:

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/bookinfo-gateway.yaml

To obtain the URL to the frontend productpage service, run:

export GATEWAY_URL=http://$(minikube ip):$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
export PRODUCTPAGE_URL=${GATEWAY_URL}/productpage
open ${PRODUCTPAGE_URL}

Open that URL in your web browser and check that the application has been successfully deployed. It may take several seconds before all services become accessible in the Istio service mesh, so you may have to reload the page.
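
If you prefer the command line, a rough equivalent (using the PRODUCTPAGE_URL exported above) is to check the HTTP status code returned by the product page:

curl -s -o /dev/null -w "%{http_code}\n" "${PRODUCTPAGE_URL}"

Once all services are up, this should print 200.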

Step 4: Canary and Deploy the Reviews Service V2

We will now deploy version v2 of the reviews service. In addition to providing reviews from readers, reviews v2 queries a new ratings service for book ratings, and displays each rating as 1 to 5 black stars.

As a precaution, we will use Istio’s service routing feature to canary the v2 deployment to prevent breaking the end-to-end application completely if it is faulty.

Before deploying v2, to prevent any traffic from being routed to it for now, we will create the following Istio route rules to route 100% of the reviews traffic to v1:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 100
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
_images/istio-bookinfo-reviews-v2-route-to-v1.png

Apply this route rule:

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/route-rule-reviews-v1.yaml
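To confirm that both objects were created, you can list them with kubectl (VirtualService and DestinationRule are standard Istio resource kinds):

kubectl get virtualservice reviews
kubectl get destinationrule reviews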

Deploy the ratings v1 and reviews v2 services:

for service in ratings-v1 reviews-v2; do \
      kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/bookinfo-${service}.yaml ; done

Check the progress of the deployment (every service should have an AVAILABLE count of 1):

watch "kubectl get deployments"
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
details-v1       1/1     1            1           17m
productpage-v1   1/1     1            1           17m
ratings-v1       1/1     1            1           69s
reviews-v1       1/1     1            1           17m
reviews-v2       1/1     1            1           68s

Check in your web browser that no stars are appearing in the Book Reviews, even after refreshing the page several times. This indicates that all reviews are retrieved from reviews v1 and none from reviews v2.

_images/istio-bookinfo-reviews-v1.png

The ratings-v1 CiliumNetworkPolicy explicitly whitelists access to the ratings API only from productpage v1 and reviews v2:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: ratings-v1
  namespace: default
specs:
  - endpointSelector:
      matchLabels:
        "k8s:app": ratings
        "k8s:version": v1
    ingress:
    - fromEndpoints:
      - matchLabels:
          "k8s:app": productpage
          "k8s:version": v1
      toPorts:
      - ports:
        - port: "9080"
          protocol: TCP
        rules:
          http:
          - method: GET
            path: "/ratings/[0-9]*"
    - fromEndpoints:
        - matchLabels:
            "k8s:app": reviews
            "k8s:version": v2
      toPorts:
      - ports:
        - port: "9080"
          protocol: TCP
        rules:
          http:
          - method: GET
            path: "/ratings/[0-9]*"

Check that reviews v1 cannot access the ratings service, even if it were compromised or buggy, by running curl from within the pod:

Note

All traffic from reviews v1 to ratings is blocked, so the connection attempt fails after the connection timeout.

export POD_REVIEWS_V1=`kubectl get pods -l app=reviews,version=v1 -o jsonpath='{.items[0].metadata.name}'`
kubectl exec ${POD_REVIEWS_V1} -c istio-proxy -ti -- curl --connect-timeout 5 --fail http://ratings:9080/ratings/0
curl: (28) Connection timed out after 5001 milliseconds
command terminated with exit code 28

Update the Istio route rule to send 50% of reviews traffic to v1 and 50% to v2:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 50
    - destination:
        host: reviews
        subset: v2
      weight: 50
_images/istio-bookinfo-reviews-v2-route-to-v1-and-v2.png

Apply this route rule:

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/route-rule-reviews-v1-v2.yaml

Check in your web browser that stars are appearing in the Book Reviews roughly 50% of the time. You may need to refresh the page several times to observe this. Queries to reviews v2 result in reviews containing ratings displayed as black stars:

_images/istio-bookinfo-reviews-v2.png

Finally, update the route rule to send 100% of reviews traffic to v2:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v2
      weight: 100
_images/istio-bookinfo-reviews-v2-route-to-v2.png

Apply this route rule:

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/route-rule-reviews-v2.yaml

Refresh the product page in your web browser several times to verify that stars are now appearing in the Book Reviews on every page refresh. All the reviews are now retrieved from reviews v2 and none from reviews v1.

Step 5: Deploy the Product Page Service V2

We will now deploy version v2 of the productpage service, which brings two changes:

  • It is deployed with a more restrictive CiliumNetworkPolicy, which restricts access to a subset of the HTTP URLs, at Layer-7.
  • It writes a new authentication audit log into Kafka.
_images/istio-bookinfo-productpage-v2-kafka.png

The policy for v1 currently allows read access to the full HTTP REST API, under the /api/v1 HTTP URI path:

  • /api/v1/products: Returns the list of books and their details.
  • /api/v1/products/<id>: Returns details about a specific book.
  • /api/v1/products/<id>/reviews: Returns reviews for a specific book.
  • /api/v1/products/<id>/ratings: Returns ratings for a specific book.

Check that the full REST API is currently accessible in v1 and returns valid JSON data:

for APIPATH in /api/v1/products /api/v1/products/0 /api/v1/products/0/reviews /api/v1/products/0/ratings; do echo ; curl -s -S "${GATEWAY_URL}${APIPATH}" ; echo ; done

The output will be similar to this:

[{"descriptionHtml": "<a href=\"https://en.wikipedia.org/wiki/The_Comedy_of_Errors\">Wikipedia Summary</a>: The Comedy of Errors is one of <b>William Shakespeare's</b> early plays. It is his shortest and one of his most farcical comedies, with a major part of the humour coming from slapstick and mistaken identity, in addition to puns and word play.", "id": 0, "title": "The Comedy of Errors"}]

{"publisher": "PublisherA", "language": "English", "author": "William Shakespeare", "id": 0, "ISBN-10": "1234567890", "ISBN-13": "123-1234567890", "year": 1595, "type": "paperback", "pages": 200}

{"reviews": [{"reviewer": "Reviewer1", "rating": {"color": "black", "stars": 5}, "text": "An extremely entertaining play by Shakespeare. The slapstick humour is refreshing!"}, {"reviewer": "Reviewer2", "rating": {"color": "black", "stars": 4}, "text": "Absolutely fun and entertaining. The play lacks thematic depth when compared to other plays by Shakespeare."}], "id": "0"}

{"ratings": {"Reviewer2": 4, "Reviewer1": 5}, "id": 0}

The REST API for retrieving book reviews and ratings was meant only for consumption by other internal services, so the updated Layer-7 CiliumNetworkPolicy in productpage v2 blocks external clients from accessing it, i.e. only the /api/v1/products and /api/v1/products/<id> HTTP URLs will be whitelisted:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: productpage-v2
  namespace: default
specs:
  - endpointSelector:
      matchLabels:
        "k8s:app": productpage
        "k8s:version": v2
    ingress:
    - toPorts:
      - ports:
        - port: "9080"
          protocol: TCP
        rules:
          http:
          - method: GET
            path: "/"
          - method: GET
            path: "/index.html"
          - method: POST
            path: "/login"
          - method: GET
            path: "/logout"
          - method: GET
            path: "/productpage"
          - method: GET
            path: "/api/v1/products"
          - method: GET
            path: "/api/v1/products/[0-9]*"
#          - method: GET
#            path: "/api/v1/products/[0-9]*/reviews"
#          - method: GET
#            path: "/api/v1/products/[0-9]*/ratings"

Because productpage v2 sends messages into Kafka, we must first deploy a Kafka broker:

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/kafka-v1-destrule.yaml
kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/kafka-v1.yaml

Wait until the kafka-v1-0 pod is ready, i.e. until it has a READY count of 1/1:

watch "kubectl get pods -l app=kafka"
NAME         READY     STATUS    RESTARTS   AGE
kafka-v1-0   1/1       Running   0          21m

Create the authaudit Kafka topic, which will be used by productpage v2:

kubectl exec kafka-v1-0 -c kafka -- bash -c '/opt/kafka_2.11-0.10.1.0/bin/kafka-topics.sh --zookeeper localhost:2181/kafka --create --topic authaudit --partitions 1 --replication-factor 1'
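To double-check that the topic was created, you can list the broker's topics with the same kafka-topics.sh script:

kubectl exec kafka-v1-0 -c kafka -- bash -c '/opt/kafka_2.11-0.10.1.0/bin/kafka-topics.sh --zookeeper localhost:2181/kafka --list'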

We are now ready to deploy productpage v2.

Create the productpage v2 service and its updated CiliumNetworkPolicy and delete productpage v1:

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/bookinfo-productpage-v2.yaml
kubectl delete -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/bookinfo-productpage-v1.yaml
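If you want to block until the new deployment has fully rolled out before continuing, kubectl can wait for it (the deployment is named productpage-v2, as shown in the output further below):

kubectl rollout status deployment/productpage-v2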

productpage v2 implements authorization audit logging. On every user login or logout, it produces a JSON-formatted message into the authaudit Kafka topic which contains the following information:

  • event: login or logout
  • username
  • client IP address
  • timestamp

To observe the Kafka messages sent by productpage, we will run an additional authaudit-logger service. This service fetches and prints out all messages from the authaudit Kafka topic. Start this service:

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes-istio/authaudit-logger-v1.yaml

Check the progress of the deployment (every service should have an AVAILABLE count of 1):

watch "kubectl get deployments"
NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
authaudit-logger-v1   1/1     1            1           41s
details-v1            1/1     1            1           37m
productpage-v2        1/1     1            1           4m47s
ratings-v1            1/1     1            1           20m
reviews-v1            1/1     1            1           37m
reviews-v2            1/1     1            1           20m

Check that the product REST API is still accessible, and that Cilium now denies at Layer-7 any access to the reviews and ratings REST API:

for APIPATH in /api/v1/products /api/v1/products/0 /api/v1/products/0/reviews /api/v1/products/0/ratings; do echo ; curl -s -S "${GATEWAY_URL}${APIPATH}" ; echo ; done

The output will be similar to this:

[{"descriptionHtml": "<a href=\"https://en.wikipedia.org/wiki/The_Comedy_of_Errors\">Wikipedia Summary</a>: The Comedy of Errors is one of <b>William Shakespeare's</b> early plays. It is his shortest and one of his most farcical comedies, with a major part of the humour coming from slapstick and mistaken identity, in addition to puns and word play.", "id": 0, "title": "The Comedy of Errors"}]

{"publisher": "PublisherA", "language": "English", "author": "William Shakespeare", "id": 0, "ISBN-10": "1234567890", "ISBN-13": "123-1234567890", "year": 1595, "type": "paperback", "pages": 200}

Access denied


Access denied

This demonstrates that requests to the /api/v1/products/<id>/reviews and /api/v1/products/<id>/ratings URIs now result in Cilium returning HTTP 403 Forbidden responses.

Every login and logout on the product page will result in a line in this service’s log. Note that you need to log in and out using the sign in / sign out elements on the Bookinfo web page. When you do, you can observe audit logs like the following:

export POD_LOGGER_V1=`kubectl get pods -l app=authaudit-logger,version=v1 -o jsonpath='{.items[0].metadata.name}'`
kubectl logs ${POD_LOGGER_V1} -c authaudit-logger
...
{"timestamp": "2017-12-04T09:34:24.341668", "remote_addr": "10.15.28.238", "event": "login", "user": "richard"}
{"timestamp": "2017-12-04T09:34:40.943772", "remote_addr": "10.15.28.238", "event": "logout", "user": "richard"}
{"timestamp": "2017-12-04T09:35:03.096497", "remote_addr": "10.15.28.238", "event": "login", "user": "gilfoyle"}
{"timestamp": "2017-12-04T09:35:08.777389", "remote_addr": "10.15.28.238", "event": "logout", "user": "gilfoyle"}

As you can see, the user-identifiable information sent by productpage in every Kafka message is sensitive, so access to this Kafka topic must be protected using Cilium. The CiliumNetworkPolicy configured on the Kafka broker enforces that:

  • only productpage v2 is allowed to produce messages into the authaudit topic;
  • only authaudit-logger can fetch messages from this topic;
  • no service can access any other topic.
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: kafka-authaudit
specs:
  - endpointSelector:
      matchLabels:
        "k8s:app": kafka
    ingress:
    - fromEndpoints:
      - matchLabels:
          "k8s:app": productpage
          "k8s:version": v2
      toPorts:
      - ports:
        - port: "9092"
          protocol: TCP
        rules:
          kafka:
          - apiKey: "produce"
            topic: "authaudit"
          - apiKey: "apiversions"
          - apiKey: "metadata"
          - apiKey: "heartbeat"
    - fromEndpoints:
      - matchLabels:
          app: kafka
    - fromEndpoints:
      - matchLabels:
          "k8s:app": authaudit-logger
      toPorts:
      - ports:
        - port: "9092"
          protocol: TCP
        rules:
          kafka:
          - apiKey: "fetch"
            topic: "authaudit"
          - apiKey: "apiversions"
          - apiKey: "metadata"
          - apiKey: "findcoordinator"
          - apiKey: "joingroup"
          - apiKey: "leavegroup"
          - apiKey: "syncgroup"
          - apiKey: "offsets"
          - apiKey: "offsetcommit"
          - apiKey: "offsetfetch"
          - apiKey: "heartbeat"

Check that Cilium prevents the authaudit-logger service from writing into the authaudit topic by running the Kafka console producer and entering a message followed by ENTER, e.g. test message:

Note

The error message may take a short time to appear.

Note

You can terminate the command with a single <CTRL>-d.

kubectl exec ${POD_LOGGER_V1} -c authaudit-logger -ti -- /opt/kafka_2.11-0.10.1.0/bin/kafka-console-producer.sh --broker-list=kafka:9092 --topic=authaudit
test message
[2017-12-07 02:13:47,020] ERROR Error when sending message to topic authaudit with key: null, value: 12 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [authaudit]

This demonstrates that Cilium responds with an authorization error to any Produce request from this service.

Create another topic named credit-card-payments, meant to transmit highly-sensitive credit card payment requests:

kubectl exec kafka-v1-0 -c kafka -- bash -c '/opt/kafka_2.11-0.10.1.0/bin/kafka-topics.sh --zookeeper localhost:2181/kafka --create --topic credit-card-payments --partitions 1 --replication-factor 1'

Check that Cilium prevents the authaudit-logger service from fetching messages from this topic:

kubectl exec ${POD_LOGGER_V1} -c authaudit-logger -ti -- /opt/kafka_2.11-0.10.1.0/bin/kafka-console-consumer.sh --bootstrap-server=kafka:9092 --topic=credit-card-payments
[2017-12-07 03:08:54,513] WARN Not authorized to read from topic credit-card-payments. (org.apache.kafka.clients.consumer.internals.Fetcher)
[2017-12-07 03:08:54,517] ERROR Error processing message, terminating consumer process:  (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [credit-card-payments]
Processed a total of 0 messages

This demonstrates that Cilium responds with an authorization error to any Fetch request from this service for any topic other than authaudit.

Note

At present, the above command may also result in an error message.

Step 6: Clean Up

You have now installed Cilium and Istio, deployed a demo app, and tested both Cilium’s L3-L7 network security policies and Istio’s service route rules. To clean up, run:

minikube delete

After this, you can re-run the tutorial from Step 1.

Other Orchestrators

Cilium with Docker & libnetwork

This tutorial leverages Vagrant and VirtualBox, and thus should run on any operating system supported by Vagrant, including Linux, macOS, and Windows.

Step 0: Install Vagrant

If you don’t already have Vagrant installed, refer to the Development Guide for links to installation instructions for Vagrant.

Step 1: Download the Cilium Source Code

Download the latest Cilium source code and unzip the files.

Alternatively, if you are a developer, feel free to clone the repository:

$ git clone https://github.com/cilium/cilium
Step 2: Starting the Docker + Cilium VM

Open a terminal and navigate into the top of the cilium source directory.

Then navigate into examples/getting-started and run vagrant up:

$ cd examples/getting-started
$ vagrant up

The script usually takes a few minutes depending on the speed of your internet connection. Vagrant will set up a VM, install the Docker container runtime and run Cilium with the help of Docker Compose. When the script completes successfully, it will print:

==> default: Creating cilium-kvstore
==> default: Creating cilium
==> default: Creating cilium-docker-plugin
$

If the script exits with an error message, do not attempt to proceed with the tutorial, as later steps will not work properly. Instead, contact us on the Cilium Slack channel.

Step 3: Accessing the VM

After the script has successfully completed, you can log into the VM using vagrant ssh:

$ vagrant ssh

All commands for the rest of the tutorial below should be run from inside this Vagrant VM. If you end up disconnecting from this VM, you can always reconnect in a new terminal window just by running vagrant ssh again from the Cilium directory.

Step 4: Confirm that Cilium is Running

The Cilium agent is now running as a system service and you can interact with it using the cilium CLI client. Check the status of the agent by running cilium status:

$ cilium status
KVStore:                Ok         Consul: 172.18.0.2:8300
Kubernetes:             Disabled
Cilium:                 Ok         OK
NodeMonitor:            Disabled
Cilium health daemon:   Ok
IPAM:                   IPv4: 2/65535 allocated from 10.15.0.0/16,
Controller Status:      14/14 healthy
Proxy Status:           OK, ip 10.15.225.211, 0 redirects active on ports 10000-20000
Hubble:                 Disabled
Cluster health:         1/1 reachable   (2020-04-17T10:55:03Z)

The status indicates that all components are operational, with the Kubernetes integration currently disabled.

Step 5: Create a Docker Network of Type Cilium

Cilium integrates with local container runtimes, which in the case of this demo means Docker. With Docker, native networking is handled via a component called libnetwork. In order to steer Docker to request networking of a container from Cilium, a container must be started with a network of driver type “cilium”.

With Cilium, all containers are connected to a single logical network, with isolation added not based on IP addresses but based on container labels (as we will do in the steps below). So with Docker, we simply create a single network named ‘cilium-net’ for all containers:

$ docker network create --driver cilium --ipam-driver cilium cilium-net
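
You can confirm that the network was created and is backed by the Cilium driver:

$ docker network ls --filter driver=cilium

The cilium-net network should appear in the output.
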
Step 6: Start an Example Service with Docker

In this tutorial, we’ll use a container running a simple HTTP server to represent a microservice application which we will refer to as app1. As a result, we will start this container with the label “id=app1”, so we can create Cilium security policies for that service.

Use the following command to start the app1 container connected to the Docker network managed by Cilium:

$ docker run -d --name app1 --net cilium-net -l "id=app1" cilium/demo-httpd
e5723edaa2a1307e7aa7e71b4087882de0250973331bc74a37f6f80667bc5856

This has launched a container running an HTTP server which Cilium is now managing as an Endpoint. A Cilium endpoint is one or more application containers which can be addressed by an individual IP address.
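
You can already see this endpoint, and the labels Cilium derived for it, from inside the VM:

$ cilium endpoint list

The output should include an endpoint carrying the container:id=app1 label.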

Step 7: Apply an L3/L4 Policy With Cilium

When using Cilium, endpoint IP addresses are irrelevant when defining security policies. Instead, you can use the labels assigned to the containers to define security policies, which are automatically applied to any container with that label, no matter where or when it is run within a container cluster.

We’ll start with an overly simple example where we create two additional apps, app2 and app3, and we want app2 containers to be able to reach app1 containers, but app3 containers should not be allowed to reach app1 containers. Additionally, we only want to allow app1 to be reachable on port 80, but no other ports. This is a simple policy that filters only on IP address (network layer 3) and TCP port (network layer 4), so it is often referred to as an L3/L4 network security policy.

Cilium performs stateful "connection tracking", meaning that if a policy allows app2 to contact app1, it will automatically allow return packets that are part of app1 replying to app2 within the context of the same TCP/UDP connection.

L4 Policy with Cilium and Docker

_images/cilium_dkr_demo_l3-l4-policy-170817.png

We can achieve that with the following Cilium policy:

[{
    "labels": [{"key": "name", "value": "l3-rule"}],
    "endpointSelector": {"matchLabels":{"id":"app1"}},
    "ingress": [{
        "fromEndpoints": [
            {"matchLabels":{"id":"app2"}}
        ],
        "toPorts": [{
                "ports": [{"port": "80", "protocol": "TCP"}]
        }]
    }]
}]

Save this JSON to a file named l3_l4_policy.json in your VM, and apply the policy by running:

$ cilium policy import l3_l4_policy.json
Revision: 1
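
To review what the agent is now enforcing, you can dump the currently imported policy:

$ cilium policy get
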
Step 8: Test L3/L4 Policy

You can now launch additional containers that represent other services attempting to access app1. Any new container with label “id=app2” will be allowed to access app1 on port 80, otherwise the network request will be dropped.

To test this out, we’ll make an HTTP request to app1 from a container with the label “id=app2” :

$ docker run --rm -ti --net cilium-net -l "id=app2" cilium/demo-client curl -m 20 http://app1
<html><body><h1>It works!</h1></body></html>

We can see that this request was successful, as we get a valid HTTP response.

Now let’s run the same HTTP request to app1 from a container that has label “id=app3”:

$ docker run --rm -ti --net cilium-net -l "id=app3" cilium/demo-client curl -m 10 http://app1

You will see no reply as all packets are dropped by the Cilium security policy. The request will time out after 10 seconds.
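
If you want to see these drops as they happen, you can watch the datapath events from a second shell inside the VM while repeating the request (cilium monitor streams events from the local agent; --type drop limits the output to dropped packets):

$ cilium monitor --type drop

Packets from the app3 container towards app1 should show up here.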

This shows Cilium’s ability to segment containers based purely on a container-level identity label. The end user can apply security policies without knowing anything about the IP address of the container and without any complex mechanism to ensure that containers of a particular service are assigned IP addresses in a particular range.

Step 9: Apply and Test an L7 Policy with Cilium

In the simple scenario above, it was sufficient to either give app2 / app3 full access to app1’s API or no access at all. But to provide the strongest security (i.e., enforce least-privilege isolation) between microservices, each service that calls app1’s API should be limited to making only the set of HTTP requests it requires for legitimate operation.

For example, consider a scenario where app1 has two API calls:
  • GET /public
  • GET /private

Continuing with the example from above, if app2 requires access only to the GET /public API call, the L3/L4 policy alone has no visibility into the HTTP requests, and therefore would allow any HTTP request from app2 (since all HTTP is over port 80).

To see this, run:

$ docker run --rm -ti --net cilium-net -l "id=app2" cilium/demo-client curl 'http://app1/public'
{ 'val': 'this is public' }

and

$ docker run --rm -ti --net cilium-net -l "id=app2" cilium/demo-client curl 'http://app1/private'
{ 'val': 'this is private' }

Cilium is capable of enforcing HTTP-layer (i.e., L7) policies to limit what URLs app2 is allowed to reach. Here is an example policy file that extends our original policy by limiting app2 to making only a GET /public API call, but disallowing all other calls (including GET /private).

L7 Policy with Cilium and Docker

_images/cilium_dkr_demo_l7-policy-230817.png

The following Cilium policy file achieves this goal:

[{
    "labels": [{"key": "name", "value": "l7-rule"}],
    "endpointSelector": {"matchLabels":{"id":"app1"}},
    "ingress": [{
        "fromEndpoints": [
            {"matchLabels":{"id":"app2"}}
        ],
        "toPorts": [{
            "ports": [{"port": "80", "protocol": "TCP"}],
            "rules": {
                "http": [{
                    "method": "GET",
                    "path": "/public"
                }]
            }
        }]
    }]
}]

Create a file with these contents and name it l7_aware_policy.json. Then import this policy into Cilium by running:

$ cilium policy delete --all
Revision: 2
$ cilium policy import l7_aware_policy.json
Revision: 3
$ docker run --rm -ti --net cilium-net -l "id=app2" cilium/demo-client curl -si 'http://app1/public'
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 28
Date: Tue, 31 Oct 2017 14:30:56 GMT
Etag: "1c-54bb868cec400"
Last-Modified: Mon, 27 Mar 2017 15:58:08 GMT
Server: Apache/2.4.25 (Unix)
Content-Type: text/plain; charset=utf-8

{ 'val': 'this is public' }

and

$ docker run --rm -ti --net cilium-net -l "id=app2" cilium/demo-client curl -si 'http://app1/private'
HTTP/1.1 403 Forbidden
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 31 Oct 2017 14:31:09 GMT
Content-Length: 14

Access denied

As you can see, with Cilium L7 security policies, we are able to permit app2 to access only the required API resources on app1, thereby implementing a “least privilege” security approach for communication between microservices.

We hope you enjoyed the tutorial. Feel free to play more with the setup, read the rest of the documentation, and reach out to us on the Cilium Slack channel with any questions!

Step 10: Clean-Up

Exit the vagrant VM by typing exit.

When you are done with the setup and want to tear down the Cilium + Docker VM and destroy all local state (e.g., the VM disk image), open a terminal in the cilium/examples/getting-started directory and type:

$ vagrant destroy

You can always re-create the VM using the steps described above.

If instead you just want to shut down the VM but may use it later, vagrant halt will work, and you can start it again later.

Cilium with Mesos/Marathon

This tutorial leverages Vagrant and VirtualBox to deploy Apache Mesos, Marathon and Cilium. You will run Cilium to apply a simple policy between a simulated web-service and clients. This tutorial can be run on any operating system supported by Vagrant including Linux, macOS, and Windows.

For more information on Apache Mesos and Marathon orchestration, check out the Mesos and Marathon GitHub pages, respectively.

Step 0: Install Vagrant

You need to run at least Vagrant version 1.8.3 or you will run into issues booting the Ubuntu 17.04 base image. You can verify by running vagrant --version.

If you don’t already have Vagrant installed, follow the Vagrant Install Instructions or see Download Vagrant for newer versions.

Step 1: Download the Cilium Source Code

Download the latest Cilium source code and unzip the files.

Alternatively, if you are a developer, feel free to clone the repository:

$ git clone https://github.com/cilium/cilium
Step 2: Starting a VM with Cilium

Open a terminal and navigate into the top of the cilium source directory.

Then navigate into examples/mesos and run vagrant up:

$ cd examples/mesos
$ vagrant up

The script usually takes a few minutes depending on the speed of your internet connection. Vagrant will set up a VM, install Mesos & Marathon, run Cilium with the help of Docker compose, and start up the Mesos master and slave services. When the script completes successfully, it will print:

==> default: Creating cilium-kvstore
Creating cilium-kvstore ... done
==> default: Creating cilium ...
==> default: Creating cilium
Creating cilium ... done
==> default: Installing loopback driver...
==> default: Installing cilium-cni to /host/opt/cni/bin/ ...
==> default: Installing new /host/etc/cni/net.d/00-cilium.conf ...
==> default: Deploying Vagrant VM + Cilium + Mesos...done
$

If the script exits with an error message, do not attempt to proceed with the tutorial, as later steps will not work properly. Instead, contact us on the Cilium Slack channel.

Step 3: Accessing the VM

After the script has successfully completed, you can log into the VM using vagrant ssh:

$ vagrant ssh

All commands for the rest of the tutorial below should be run from inside this Vagrant VM. If you end up disconnecting from this VM, you can always reconnect by going to the examples/mesos directory and then running the command vagrant ssh.

Step 4: Confirm that Cilium is Running

The Cilium agent is now running and you can interact with it using the cilium CLI client. Check the status of the agent by running cilium status:

$ cilium status
KVStore:                Ok         Consul: 172.18.0.2:8300
ContainerRuntime:       Ok         docker daemon: OK
Kubernetes:             Disabled
Cilium:                 Ok         OK
NodeMonitor:            Disabled
Cilium health daemon:   Ok
IPv4 address pool:      3/65535 allocated
IPv6 address pool:      2/65535 allocated
Controller Status:      10/10 healthy
Proxy Status:           OK, ip 10.15.0.1, port-range 10000-20000
Cluster health:   1/1 reachable   (2018-06-19T15:10:28Z)

The status indicates that all necessary components are operational.

Step 5: Run Script to Start Marathon

Start Marathon inside the Vagrant VM:

$ ./start_marathon.sh
Starting marathon...
...
...
...
...
Done
Step 6: Simulate a Web-Server and Clients

Use curl to submit a task to Marathon for scheduling, using the data in web-server.json to run the simulated web-server. The web-server simply responds to requests on a particular port.

$ curl -i -H 'Content-Type: application/json' -d @web-server.json 127.0.0.1:8080/v2/apps

You should see output similar to the following:

HTTP/1.1 201 Created
...
Marathon-Deployment-Id: [UUID]
...

Confirm that Cilium sees the new workload. The output should return the endpoint with label mesos:id=web-server and the assigned IP:

$ cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
20928      Disabled           Disabled          59281      mesos:id=web-server           f00d::a0f:0:0:51c0   10.15.137.206   ready
23520      Disabled           Disabled          4          reserved:health               f00d::a0f:0:0:5be0   10.15.162.64    ready

Test that the web-server replies with OK:

$ export WEB_IP=`cilium endpoint list | grep web-server | awk '{print $7}'`
$ curl $WEB_IP:8181/api
OK

Run a script to create two client tasks (“good client” and “bad client”) that will attempt to access the web-server. The output of these tasks will be used to validate the Cilium network policy enforcement later in the exercise. The script will generate goodclient.json and badclient.json files for the client tasks, respectively:

$ ./generate_client_file.sh goodclient
$ ./generate_client_file.sh badclient

Then submit the client tasks to Marathon, which will generate GET /public and GET /private requests:

$ curl -i -H 'Content-Type: application/json' -d @goodclient.json 127.0.0.1:8080/v2/apps
$ curl -i -H 'Content-Type: application/json' -d @badclient.json 127.0.0.1:8080/v2/apps
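
Marathon exposes the submitted applications over the same REST API, so you can verify that all three tasks were accepted (assuming the Marathon API is reachable on 127.0.0.1:8080 as above):

$ curl -s 127.0.0.1:8080/v2/apps

The JSON response should list the web-server, goodclient, and badclient apps.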

You can observe the newly created endpoints in Cilium, similar to the following output:

$ cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
20928      Disabled           Disabled          59281      mesos:id=web-server           f00d::a0f:0:0:51c0   10.15.137.206   ready
23520      Disabled           Disabled          4          reserved:health               f00d::a0f:0:0:5be0   10.15.162.64    ready
37835      Disabled           Disabled          15197      mesos:id=goodclient           f00d::a0f:0:0:93cb   10.15.152.208   ready
51053      Disabled           Disabled          5113       mesos:id=badclient            f00d::a0f:0:0:c76d   10.15.34.97     ready

Marathon runs the tasks as batch jobs with stdout logged to task-specific files located in /var/lib/mesos. To simplify the retrieval of the stdout log, use the tail_client.sh script to output each of the client logs. In a new terminal, go to examples/mesos, start a new ssh session to the Vagrant VM with vagrant ssh and tail the goodclient logs:

$ ./tail_client.sh goodclient

and in a separate terminal, do the same thing with vagrant ssh and observe the badclient logs:

$ ./tail_client.sh badclient

Make sure both tailed logs continuously print the result of the clients accessing the /public and /private APIs of the web-server:

...
---------- Test #X  ----------
   Request:   GET /public
   Reply:     OK

   Request:   GET /private
   Reply:     OK
-------------------------------
...

Note that both clients are able to access the web-server and retrieve both URLs because no Cilium policy has been applied yet.

Step 7: Apply L3/L4 Policy with Cilium

Apply an L3/L4 policy that only allows the goodclient to access the web-server. The L3/L4 JSON policy looks like this:

[{
    "labels": [{"key": "name", "value": "l3-l4-rule"}],
    "endpointSelector": {"matchLabels":{"id":"web-server"}},
    "ingress": [{
        "fromEndpoints": [
            {"matchLabels":{"id":"goodclient"}}
        ],
        "toPorts": [{
                "ports": [{"port": "8181", "protocol": "TCP"}]
        }]
    }]
}]

In your original terminal session, use cilium CLI to apply the L3/L4 policy above, saved in the l3-l4-policy.json file on the VM:

$ cilium policy import l3-l4-policy.json
Revision: 1

L3/L4 Policy with Cilium and Mesos

_images/cilium_mesos_demo_l3-l4-policy-170817.png

You can observe that the policy is applied via cilium CLI as the POLICY ENFORCEMENT column changed from Disabled to Enabled:

$ cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
20928      Enabled            Disabled          59281      mesos:id=web-server           f00d::a0f:0:0:51c0   10.15.137.206   ready
23520      Disabled           Disabled          4          reserved:health               f00d::a0f:0:0:5be0   10.15.162.64    ready
37835      Disabled           Disabled          15197      mesos:id=goodclient           f00d::a0f:0:0:93cb   10.15.152.208   ready
51053      Disabled           Disabled          5113       mesos:id=badclient            f00d::a0f:0:0:c76d   10.15.34.97     ready

You should also observe that the goodclient logs continue to output the web-server responses, whereas the badclient requests no longer reach the web-server because of policy enforcement, producing log output similar to the following:

...
---------- Test #X  ----------
   Request:   GET /public
   Reply:     Timeout!

   Request:   GET /private
   Reply:     Timeout!
-------------------------------
...

Remove the L3/L4 policy in order to give badclient access to the web-server again.

$ cilium policy delete --all
Revision: 2

The badclient logs should resume outputting the web-server’s responses, and Cilium no longer enforces policy:

$ cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
29898      Disabled           Disabled          37948      reserved:health               f00d::a0f:0:0:74ca   10.15.242.54   ready
33115      Disabled           Disabled          38072      mesos:id=web-server           f00d::a0f:0:0:815b   10.15.220.6    ready
38061      Disabled           Disabled          46430      mesos:id=badclient            f00d::a0f:0:0:94ad   10.15.0.173    ready
64189      Disabled           Disabled          31645      mesos:id=goodclient           f00d::a0f:0:0:fabd   10.15.152.27   ready
Step 8: Apply L7 Policy with Cilium

Now, apply an L7 Policy that only allows access for the goodclient to the /public API, included in the l7-policy.json file:

[{
    "labels": [{"key": "name", "value": "l7-rule"}],
    "endpointSelector": {"matchLabels":{"id":"web-server"}},
    "ingress": [{
        "fromEndpoints": [
            {"matchLabels":{"id":"goodclient"}}
        ],
        "toPorts": [{
            "ports": [{"port": "8181", "protocol": "TCP"}],
            "rules": {
                "http": [{
                    "method": "GET",
                    "path": "/public"
                }]
            }
        }]
    }]
}]

Apply using cilium CLI:

$ cilium policy import l7-policy.json
Revision: 3

L7 Policy with Cilium and Mesos

_images/cilium_mesos_demo_l7-policy-230817.png

In the terminal sessions tailing the goodclient and badclient logs, check the goodclient’s log to see that /private is no longer accessible, and that the badclient’s requests still fail just as with the policy enforced in the previous step.

...
---------- Test #X  ----------
   Request:   GET /public
   Reply:     OK

   Request:   GET /private
   Reply:     Access Denied
-------------------------------
...

(optional) Remove the policy and notice that the access to /private is unrestricted again:

$ cilium policy delete --all
Revision: 4
Step 9: Clean-Up

Exit the Vagrant VM by typing exit in the original terminal session. When you want to tear down the Cilium + Mesos VM and destroy all local state (e.g., the VM disk image), ensure you are in the cilium/examples/mesos directory and type:

$ vagrant destroy

You can always re-create the VM using the steps described above.

If instead you just want to shut down the VM but may use it later, vagrant halt default will work, and you can start it again later.

Troubleshooting

For assistance on any of the Getting Started Guides, the best way to get help is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

Concepts

The goal of this document is to describe the components of the Cilium architecture, and the different models for deploying Cilium within your datacenter or cloud environment. It focuses on the higher-level understanding required to run a full Cilium deployment.

Component Overview

_images/cilium-arch.png

A deployment of Cilium consists of the following components running on each Linux container node in the container cluster:

  • Cilium Agent (Daemon): Userspace daemon that interacts with orchestration systems such as Kubernetes via plugins to set up networking and security for containers running on the local server. Provides an API for configuring network security policies, extracting network visibility data, etc.
  • Cilium CLI Client: Simple CLI client for communicating with the local Cilium Agent, for example, to configure network security or visibility policies.
  • Linux Kernel BPF: Integrated capability of the Linux kernel to accept compiled bytecode that is run at various hook / trace points within the kernel. Cilium compiles BPF programs and has the kernel run them at key points in the network stack to have visibility and control over all network traffic in / out of all containers.
  • Container Platform Network Plugin: Each container platform (e.g., Docker, Kubernetes) has its own plugin model for how external networking platforms integrate. In the case of Docker, each Linux node runs a process (cilium-docker) that handles each Docker libnetwork call and passes data / requests on to the main Cilium Agent.

In addition to these components, Cilium also depends on the following components running in the cluster:

  • Key-Value Store: Cilium shares data between Cilium Agents on different nodes via a kvstore. The currently supported key-value stores are etcd or consul.
  • Cilium Operator: Daemon for handling cluster management duties which can be handled once per cluster, rather than once per node.

Cilium Agent

The Cilium agent (cilium-agent) runs on each Linux container host. At a high-level, the agent accepts configuration that describes service-level network security and visibility policies. It then listens to events in the orchestration systems to learn when containers are started or stopped, and it creates custom BPF programs which the Linux kernel uses to control all network access in / out of those containers. In more detail, the agent:

  • Exposes APIs to allow operations / security teams to configure security policies (see below) that control all communication between containers in the cluster. These APIs also expose monitoring capabilities to gain additional visibility into network forwarding and filtering behavior.
  • Gathers metadata about each new container that is created. In particular, it queries identity metadata like container / pod labels, which are used to identify Endpoints in Cilium security policies.
  • Interacts with the container platform’s network plugin to perform IP address management (IPAM), which controls which IPv4 and IPv6 addresses are assigned to each container. IPAM is managed by the agent in a pool shared between all plugins, which means that the Docker and CNI network plugins can run side by side, allocating from a single address pool.
  • Combines its knowledge about container identity and addresses with the already configured security and visibility policies to generate highly efficient BPF programs that are tailored to the network forwarding and security behavior appropriate for each container.
  • Compiles the BPF programs to bytecode using clang/LLVM and passes them to the Linux kernel to run for all packets in / out of the container’s virtual ethernet device(s).

Cilium CLI Client

The Cilium CLI Client (cilium) is a command-line tool that is installed along with the Cilium Agent. It gives a command-line interface to interact with all aspects of the Cilium Agent API. This includes inspecting Cilium’s state about each network endpoint (i.e., container), configuring and viewing security policies, and configuring network monitoring behavior.

Linux Kernel BPF

Berkeley Packet Filter (BPF) is a Linux kernel bytecode interpreter originally introduced to filter network packets, e.g. tcpdump and socket filters. It has since been extended with additional data structures such as hashtable and arrays as well as additional actions to support packet mangling, forwarding, encapsulation, etc. An in-kernel verifier ensures that BPF programs are safe to run and a JIT compiler converts the bytecode to CPU architecture specific instructions for native execution efficiency. BPF programs can be run at various hooking points in the kernel such as for incoming packets, outgoing packets, system calls, kprobes, etc.

BPF continues to evolve and gain additional capabilities with each new Linux release. Cilium leverages BPF to perform core datapath filtering, mangling, monitoring and redirection, and requires BPF capabilities that are present in any Linux kernel version 4.8.0 or newer. Because 4.8.x has already been declared end of life and 4.9.x has been nominated as a stable release, we recommend running at least kernel 4.9.17 (the latest stable Linux kernel as of this writing is 4.10.x).

Cilium is capable of probing the Linux kernel for available features and will automatically make use of more recent features as they are detected.

Linux distros that focus on being a container runtime (e.g., CoreOS, Fedora Atomic) typically already ship kernels that are newer than 4.8, but even recent versions of general purpose operating systems such as Ubuntu 16.10 ship fairly recent kernels. Some Linux distributions still ship older kernels but many of them allow installing recent kernels from separate kernel package repositories.

For more detail on kernel versions, see: Linux Kernel.

Key-Value Store

The Key-Value (KV) Store is used for the following state:

  • Policy Identities: list of labels <=> policy identity identifier
  • Global Services: global service id to VIP association (optional)
  • Encapsulation VTEP mapping (optional)

To simplify things in a larger deployment, the key-value store can be the same one used by the container orchestrator (e.g., Kubernetes using etcd).

Cilium Operator

The Cilium Operator is responsible for managing duties in the cluster which should logically be handled once for the entire cluster, rather than once for each node in the cluster. Its design helps with scale limitations in large kubernetes clusters (>1000 nodes). The responsibilities of Cilium operator include:

  • Synchronizing kubernetes services with etcd for Cluster Mesh
  • Synchronizing node resources with etcd
  • Ensuring that DNS pods are managed by Cilium
  • Garbage-collection of Cilium Endpoints resources, unused security identities from the key-value store, and status of deleted nodes from CiliumNetworkPolicy
  • Translation of toGroups policy
  • Interaction with the AWS API for managing AWS ENI
  • Sending CiliumNetworkPolicyNodeStatus updates from the whole cluster for each CiliumNetworkPolicy to kube-apiserver

Terminology

Labels

Labels are a generic, flexible and highly scalable way of addressing a large set of resources as they allow for arbitrary grouping and creation of sets. Whenever something needs to be described, addressed or selected, it is done based on labels:

  • Endpoints are assigned labels as derived from the container runtime, orchestration system, or other sources.
  • Network Policies select pairs of Endpoints which are allowed to communicate based on labels. The policies themselves are identified by labels as well.
What is a Label?

A label is a pair of strings consisting of a key and value. A label can be formatted as a single string with the format key=value. The key portion is mandatory and must be unique. This is typically achieved by using the reverse domain name notation, e.g. io.cilium.mykey=myvalue. The value portion is optional and can be omitted, e.g. io.cilium.mykey.

Key names should typically consist of the character set [a-z0-9-.].

When using labels to select resources, both the key and the value must match, e.g. when a policy should be applied to all endpoints with the label my.corp.foo then the label my.corp.foo=bar will not match the selector.

Label Source

A label can be derived from various sources. For example, an endpoint will derive the labels associated with the container by the local container runtime as well as the labels associated with the pod as provided by Kubernetes. As these two label namespaces are not aware of each other, this may result in conflicting label keys.

To resolve this potential conflict, Cilium prefixes all label keys with source: to indicate the source of the label when importing labels, e.g. k8s:role=frontend, container:user=joe, k8s:role=backend. This means that when you run a Docker container using docker run [...] -l foo=bar, the label container:foo=bar will appear on the Cilium endpoint representing the container. Similarly, a Kubernetes pod started with the label foo: bar will be represented with a Cilium endpoint associated with the label k8s:foo=bar. A unique name is allocated for each potential source. The following label sources are currently supported:

  • container: for labels derived from the local container runtime
  • k8s: for labels derived from Kubernetes
  • mesos: for labels derived from Mesos
  • reserved: for special reserved labels, see Special Identities.
  • unspec: for labels with unspecified source

When using labels to identify other resources, the source can be included to limit matching of labels to a particular type. If no source is provided, the label source defaults to any: which will match all labels regardless of their source. If a source is provided, the source of the selecting and matching labels need to match.

Endpoint

Cilium makes application containers available on the network by assigning them IP addresses. Multiple application containers can share the same IP address; a typical example for this model is a Kubernetes Pod. All application containers which share a common address are grouped together in what Cilium refers to as an endpoint.

Allocating individual IP addresses enables the use of the entire Layer 4 port range by each endpoint. This essentially allows multiple application containers running on the same cluster node to all bind to well known ports such as 80 without causing any conflicts.

The default behavior of Cilium is to assign both an IPv6 and IPv4 address to every endpoint. However, this behavior can be configured to only allocate an IPv6 address with the --enable-ipv4=false option. If both an IPv6 and IPv4 address are assigned, either address can be used to reach the endpoint. The same behavior will apply with regard to policy rules, load-balancing, etc. See Address Management for more details.

Identification

For identification purposes, Cilium assigns an internal endpoint id to all endpoints on a cluster node. The endpoint id is unique within the context of an individual cluster node.

Endpoint Metadata

An endpoint automatically derives metadata from the application containers associated with the endpoint. The metadata can then be used to identify the endpoint for security/policy, load-balancing and routing purposes.

The source of the metadata will depend on the orchestration system and container runtime in use. The following metadata retrieval mechanisms are currently supported:

  • Kubernetes: Pod labels (via k8s API)
  • Mesos: Labels (via CNI)
  • containerd (Docker): Container labels (via Docker API)

Metadata is attached to endpoints in the form of Labels.

The following example launches a container with the label app=benchmark which is then associated with the endpoint. The label is prefixed with container: to indicate that the label was derived from the container runtime.

$ docker run --net cilium -d -l app=benchmark tgraf/netperf
aaff7190f47d071325e7af06577f672beff64ccc91d2b53c42262635c063cf1c
$  cilium endpoint list
ENDPOINT   POLICY        IDENTITY   LABELS (source:key[=value])   IPv6                   IPv4            STATUS
           ENFORCEMENT
62006      Disabled      257        container:app=benchmark       f00d::a00:20f:0:f236   10.15.116.202   ready

An endpoint can have metadata associated from multiple sources. A typical example is a Kubernetes cluster which uses containerd as the container runtime. Endpoints will derive Kubernetes pod labels (prefixed with the k8s: source prefix) and containerd labels (prefixed with container: source prefix).

Identity

All Endpoints are assigned an identity. The identity is what is used to enforce basic connectivity between endpoints. In traditional networking terminology, this would be equivalent to Layer 3 enforcement.

An identity is identified by Labels and is given a cluster-wide unique identifier. The endpoint is assigned the identity which matches the endpoint’s Security Relevant Labels, i.e. all endpoints which share the same set of Security Relevant Labels will share the same identity. This concept allows policy enforcement to scale to a massive number of endpoints, as many individual endpoints will typically share the same set of security Labels as applications are scaled.

What is an Identity?

The identity of an endpoint is derived from the Labels associated with the pod or container, which are inherited by the endpoint. When a pod or container is started, Cilium will create an endpoint based on the event received from the container runtime to represent the pod or container on the network. As a next step, Cilium will resolve the identity of the created endpoint. Whenever the Labels of the pod or container change, the identity is reconfirmed and automatically modified as required.

Security Relevant Labels

Not all Labels associated with a container or pod are meaningful when deriving the Identity. Labels may be used to store metadata such as the timestamp when a container was launched. Cilium needs to know which labels are meaningful and should be considered when deriving the identity. For this purpose, the user is required to specify a list of string prefixes of meaningful labels. The standard behavior is to include all labels which start with the prefix id., e.g. id.service1, id.service2, id.groupA.service44. The list of meaningful label prefixes can be specified when starting the agent.

Special Identities

All endpoints which are managed by Cilium will be assigned an identity. In order to allow communication to network endpoints which are not managed by Cilium, special identities exist to represent those. Special reserved identities are prefixed with the string reserved:.

  • reserved:unknown: The identity could not be derived.
  • reserved:host: The local host. Any traffic that originates from or is destined to one of the local host IPs.
  • reserved:remote-node: The collection of all remote cluster hosts. Any traffic that originates from or is destined to one of the IPs of any host in any connected cluster other than the local node.
  • reserved:world: Any network endpoint outside of the cluster.
  • reserved:health: Health checking traffic generated by Cilium agents.
  • reserved:init: An endpoint for which the identity has not yet been resolved is assigned the init identity. This represents the phase of an endpoint in which some of the metadata required to derive the security identity is still missing. This is typically the case in the bootstrapping phase. The init identity is only allocated if the labels of the endpoint are not known at creation time, which can be the case for the Docker plugin.
  • reserved:unmanaged: An endpoint that is not managed by Cilium, e.g. a Kubernetes pod that was launched before Cilium was installed.

Note

Cilium used to include both the local and all remote hosts in the reserved:host identity. This is still the default option unless a recent default ConfigMap is used. The remote-node identity can be enabled via the option enable-remote-node-identity.

Well-known Identities

The following is a list of well-known identities for which Cilium will automatically hand out a security identity without needing to contact any external dependencies such as the kvstore. The purpose of this is to allow bootstrapping Cilium and enabling network connectivity with policy enforcement in the cluster for essential services without depending on any external components.

Deployment Namespace ServiceAccount Cluster Name Numeric ID Labels
cilium-etcd-operator <cilium-namespace> cilium-etcd-operator <cilium-cluster> 107 name=cilium-etcd-operator, io.cilium/app=etcd-operator
etcd-operator <cilium-namespace> cilium-etcd-sa <cilium-cluster> 100 io.cilium/app=etcd-operator
cilium-etcd <cilium-namespace> default <cilium-cluster> 101 app=etcd, etcd_cluster=cilium-etcd, io.cilium/app=etcd-operator
kube-dns kube-system kube-dns <cilium-cluster> 102 k8s-app=kube-dns
kube-dns (EKS) kube-system kube-dns <cilium-cluster> 103 k8s-app=kube-dns, eks.amazonaws.com/component=kube-dns
core-dns kube-system coredns <cilium-cluster> 104 k8s-app=kube-dns
core-dns (EKS) kube-system coredns <cilium-cluster> 106 k8s-app=kube-dns, eks.amazonaws.com/component=coredns
cilium-operator <cilium-namespace> cilium-operator <cilium-cluster> 105 name=cilium-operator, io.cilium/app=operator

Note: if cilium-cluster is not defined with the cluster-name option, the default value will be set to “default”.

Identity Management in the Cluster

Identities are valid in the entire cluster which means that if several pods or containers are started on several cluster nodes, all of them will resolve and share a single identity if they share the identity relevant labels. This requires coordination between cluster nodes.

[Figure: identity_store.png]

The operation to resolve an endpoint identity is performed with the help of the distributed key-value store, which allows atomic operations of the form "generate a new unique identifier if the given value has not been seen before". This allows each cluster node to create the identity relevant subset of labels and then query the key-value store to derive the identity. Depending on whether the set of labels has been queried before, either a new identity will be created, or the identity of the initial query will be returned.

Node

Cilium refers to a node as an individual member of a cluster. Each node must be running the cilium-agent and will operate in a mostly autonomous manner. Synchronization of state between Cilium agents running on different nodes is kept to a minimum for simplicity and scale. It occurs exclusively via the key-value store or with packet metadata.

Node Address

Cilium will automatically detect the node’s IPv4 and IPv6 address. The detected node address is printed out when the cilium-agent starts:

Local node-name: worker0
Node-IPv6: f00d::ac10:14:0:1
External-Node IPv4: 172.16.0.20
Internal-Node IPv4: 10.200.28.238

Address Management

Before we look into the details of address management in Cilium, let us look at an overview of the Cilium container networking control flow.

Cilium Container Networking Control Flow

The control flow picture below gives an overview of how containers obtain their IP address from IPAM for the different address management modes that Cilium supports.

[Figure: cilium_container_networking_control_flow.png]

Cilium supports multiple different address management modes:

Address Management Modes

Cilium Cluster-pool IPAM

Cilium Cluster-pool IPAM is based on the Kubernetes host-scope IPAM; for more info see Kubernetes Host Scope. The functionality is the same, but the PodCIDRs are allocated and managed entirely by the Cilium Operator.

In this mode, the Cilium agent will wait on startup until the PodCIDR ranges are made available for all enabled address families via the following resource field in the v2.CiliumNode object:

Field Description
Spec.IPAM.PodCIDRs IPv4 and/or IPv6 PodCIDR range

If the Cilium Operator cannot allocate PodCIDRs for that node, it will report a status message in Status.Operator.Error.
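
For illustration, a CiliumNode resource populated in this mode might look roughly as follows (a sketch only; the node name and CIDR values are placeholders and the exact YAML layout may differ between Cilium versions):

apiVersion: cilium.io/v2
kind: CiliumNode
metadata:
  name: worker0
spec:
  ipam:
    # PodCIDR ranges handed out by the Cilium Operator for this node
    podCIDRs:
    - 10.10.1.0/24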

Kubernetes Host Scope

The Kubernetes host-scope IPAM mode is enabled with ipam: kubernetes and delegates the address allocation to each individual node in the cluster. IPs are allocated out of the PodCIDR range associated with each node by Kubernetes.

[Figure: k8s_hostscope.png]

In this mode, the Cilium agent will wait on startup until the PodCIDR range is made available via the Kubernetes v1.Node object for all enabled address families via one of the following methods:

via v1.Node resource field

Field Description
spec.podCIDRs IPv4 and/or IPv6 PodCIDR range
spec.podCIDR IPv4 or IPv6 PodCIDR range

Note

It is important to run the kube-controller-manager with the --allocate-node-cidrs flag to indicate to Kubernetes that PodCIDR ranges should be allocated.

via v1.Node annotation

Annotation Description
io.cilium.network.ipv4-pod-cidr IPv4 PodCIDR range
io.cilium.network.ipv6-pod-cidr IPv6 PodCIDR range

Note

The annotation-based mechanism is primarily useful in combination with older Kubernetes versions which do not yet support spec.podCIDRs but where support for both IPv4 and IPv6 is enabled.

Configuration

The following ConfigMap options exist to configure Kubernetes hostscope:

  • ipam: kubernetes: Enables Kubernetes IPAM mode. Enabling this option will automatically enable k8s-require-ipv4-pod-cidr if enable-ipv4 is true and k8s-require-ipv6-pod-cidr if enable-ipv6 is true.
  • k8s-require-ipv4-pod-cidr: true: instructs the Cilium agent to wait until an IPv4 PodCIDR is made available via the Kubernetes node resource.
  • k8s-require-ipv6-pod-cidr: true: instructs the Cilium agent to wait until an IPv6 PodCIDR is made available via the Kubernetes node resource.
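
A corresponding cilium-config ConfigMap fragment could look like this (a minimal sketch assuming an IPv4-only cluster and Cilium installed in the kube-system namespace; only the options discussed above are shown):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # Delegate address allocation to the Kubernetes PodCIDR of each node
  ipam: kubernetes
  enable-ipv4: "true"
  # Wait until an IPv4 PodCIDR is available on the node resource
  k8s-require-ipv4-pod-cidr: "true"
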
Host Scope (default)

The host-scope IPAM mode delegates the address allocation to each individual node in the cluster. Each cluster node is assigned an allocation CIDR out of which the node can allocate IPs without further coordination with any other nodes. For details on running the hostscope IPAM mode in the context of Kubernetes, please see Kubernetes Host Scope.

This means that no state needs to be synchronized between cluster nodes to allocate IP addresses and to determine whether an IP address belongs to an endpoint of the cluster and whether that endpoint resides on the local cluster node.

Default Values

The following values are used by default if the cluster prefix is left unspecified. These are meant for testing and need to be adjusted according to the needs of your environment.

Note

Relying on the default values via automatically generated per-node PodCIDRs is discouraged in any production environment. It can result in IPAM conflicts and undesired renumbering if the IPAM state on a node is lost for some reason.

Type Cluster Node Prefix
IPv4 10.0.0.0/8 10.X.0.0/16 where X is derived using the last 8 bits of the first IPv4 address in the list of global scope addresses on the cluster node.
IPv6 f00d::/48 f00d:0:0:0:<ipv4-address>::/96 where the IPv4 address is the first address in the list of global scope addresses on the cluster node.

Note: Only 16 bits out of the /96 node prefix are currently used when allocating container addresses. This allows the remaining 16 bits to be used to store arbitrary connection state when sending packets between nodes. A typical use case for the state is direct server return.

The size of the IPv4 cluster prefix can be changed with the --ipv4-cluster-cidr-mask-size option. The size of the IPv6 cluster prefix is currently fixed at /48. The node allocation prefixes can be specified manually with the options --ipv4-range and --ipv6-range, respectively.

CRD-Backed

The CRD-backed IPAM mode provides an extendable interface to control IP address management via a Kubernetes Custom Resource Definition (CRD). This allows IPAM to be delegated to external operators or made user-configurable per node.

Architecture
[Figure: crd_arch.png]

When this mode is enabled, each Cilium agent will start watching for a Kubernetes custom resource ciliumnodes.cilium.io with a name matching the Kubernetes node on which the agent is running.

Whenever the custom resource is updated, the per node allocation pool is updated with all addresses listed in the spec.ipam.available field. When an IP is removed that is currently allocated, the IP will continue to be used but will not be available for re-allocation after release.

Upon allocation of an IP in the allocation pool, the IP is added to the status.ipam.inuse field.

Note

The node status update is limited to run at most once every 15 seconds. Therefore, if several pods are scheduled at the same time, the update of the status section can lag behind.

Configuration

The CRD-backed IPAM mode is enabled by setting ipam: crd in the cilium-config ConfigMap or by specifying the option --ipam=crd. When enabled, the agent will wait for a CiliumNode custom resource matching the Kubernetes node name to become available with at least one IP address listed as available. When connectivity health-checking is enabled, at least two IP addresses must be available.

While waiting, the agent will print the following log message:

Waiting for initial IP to become available in '<node-name>' custom resource

For a practical tutorial on how to enable CRD IPAM mode with Cilium, see the section CRD-backed IPAM.
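
As an illustration, a manually populated CiliumNode resource for this mode might look roughly like this (a sketch; the node name, IPs, and owner are placeholders, and the field names follow the pool/used JSON tags shown in the CRD definition below):

apiVersion: cilium.io/v2
kind: CiliumNode
metadata:
  name: worker0
spec:
  ipam:
    # IPs available to the node for allocation
    pool:
      10.10.1.10: {}
      10.10.1.11: {}
      10.10.1.12: {}
status:
  ipam:
    # IPs currently allocated and in use on the node
    used:
      10.10.1.10:
        owner: default/nginx-7f6d8b8c9-abcde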

Privileges

In order for the custom resource to be functional, the following additional privileges are required. These privileges are automatically granted when using the standard Cilium deployment artifacts:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium
rules:
- apiGroups:
  - cilium.io
  resources:
  - ciliumnodes
  - ciliumnodes/status
  verbs:
  - '*'
CRD Definition

The CiliumNode custom resource is modeled after a standard Kubernetes resource and is split into a spec and status section:

type CiliumNode struct {
        [...]

        // Spec is the specification of the node
        Spec NodeSpec `json:"spec"`

        // Status is the status of the node
        Status NodeStatus `json:"status"`
}
IPAM Specification

The spec section embeds an IPAM-specific field which defines the list of all IPs which are available to the node for allocation:

// AllocationMap is a map of allocated IPs indexed by IP
type AllocationMap map[string]AllocationIP

// NodeSpec is the configuration specific to a node
type NodeSpec struct {
        // [...]

        // IPAM is the address management specification. This section can be
        // populated by a user or it can be automatically populated by an IPAM
        // operator
        //
        // +optional
        IPAM IPAMSpec `json:"ipam,omitempty"`
}

// IPAMSpec is the IPAM specification of the node
type IPAMSpec struct {
        // Pool is the list of IPs available to the node for allocation. When
        // an IP is used, the IP will remain on this list but will be added to
        // Status.IPAM.InUse
        //
        // +optional
        Pool AllocationMap `json:"pool,omitempty"`
}

// AllocationIP is an IP available for allocation or already allocated
type AllocationIP struct {
        // Owner is the owner of the IP, this field is set if the IP has been
        // allocated. It will be set to the pod name or another identifier
        // representing the usage of the IP
        //
        // The owner field is left blank for an entry in Spec.IPAM.Pool
        // and filled out as the IP is used and also added to
        // Status.IPAM.InUse.
        //
        // +optional
        Owner string `json:"owner,omitempty"`

        // Resource is set for both available and allocated IPs, it represents
        // what resource the IP is associated with, e.g. in combination with
        // AWS ENI, this will refer to the ID of the ENI
        //
        // +optional
        Resource string `json:"resource,omitempty"`
}
IPAM Status

The status section contains an IPAM specific field. The IPAM status reports all used addresses on that node:

// NodeStatus is the status of a node
type NodeStatus struct {
        // [...]

        // IPAM is the IPAM status of the node
        //
        // +optional
        IPAM IPAMStatus `json:"ipam,omitempty"`
}

// IPAMStatus is the IPAM status of a node
type IPAMStatus struct {
        // InUse lists all IPs out of Spec.IPAM.Pool which have been
        // allocated and are in use.
        //
        // +optional
        InUse AllocationMap `json:"used,omitempty"`
}
Azure IPAM

The Azure IPAM allocator is specific to Cilium deployments running in the Azure cloud and performs IP allocation based on Azure Private IP addresses.

The architecture ensures that only a single operator communicates with the Azure API to avoid rate-limiting issues in large clusters. A pre-allocation watermark allows a number of IP addresses to be kept available for use on nodes at all times without requiring the Azure API to be contacted when a new pod is scheduled in the cluster.

Architecture
[Figure: azure_arch.png]

The Azure IPAM allocator builds on top of the CRD-backed allocator. Each node creates a ciliumnodes.cilium.io custom resource matching the node name when Cilium starts up for the first time on that node. The Cilium agent running on each node will retrieve the Kubernetes v1.Node resource and extract the .Spec.ProviderID field in order to derive the Azure instance ID. Azure allocation parameters are provided as agent configuration options and are passed into the custom resource as well.

The Cilium operator listens for new ciliumnodes.cilium.io custom resources and starts managing the IPAM aspect automatically. It scans the Azure instances for existing interfaces with associated IPs and makes them available via the spec.ipam.available field. It will then constantly monitor the used IP addresses in the status.ipam.used field and allocate more IPs as needed to meet the IP pre-allocation watermark. This ensures that there are always IPs available.

Configuration
  • The Cilium agent and operator must be run with the option --ipam=azure or the option ipam: azure must be set in the ConfigMap. This will enable Azure IPAM allocation in both the node agent and operator.
  • In most scenarios, it makes sense to automatically create the ciliumnodes.cilium.io custom resource when the agent starts up on a node for the first time. To enable this, specify the option --auto-create-cilium-node-resource or set auto-create-cilium-node-resource: "true" in the ConfigMap.
  • If IPs are limited, run the Operator with option --aws-release-excess-ips=true. When enabled, operator checks the number of IPs regularly and attempts to release excess free IPs from ENI.
  • It is generally a good idea to enable metrics in the Operator as well with the option --enable-metrics. See the section Running Prometheus & Grafana for additional information on how to install and run Prometheus including the Grafana dashboard.
Azure Allocation Parameters

The following parameters are available to control the IP allocation:

spec.ipam.min-allocate

The minimum number of IPs that must be allocated when the node is first bootstrapped. It defines the minimum base of addresses that must be available. After reaching this watermark, the PreAllocate and MaxAboveWatermark logic takes over to continue allocating IPs.

If unspecified, no minimum number of IPs is required.

spec.ipam.pre-allocate

The number of IP addresses that must be available for allocation at all times. It defines the buffer of addresses available immediately without requiring the operator to get involved.

If unspecified, this value defaults to 8.

spec.ipam.max-above-watermark

The maximum number of addresses to allocate beyond the addresses needed to reach the PreAllocate watermark. Going above the watermark can help reduce the number of API calls to allocate IPs.

If left unspecified, the value defaults to 0.
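
Putting these parameters together, the IPAM section of a CiliumNode spec could look roughly like this (a sketch; the values are arbitrary examples):

spec:
  ipam:
    # Allocate at least 10 IPs when the node is bootstrapped
    min-allocate: 10
    # Keep a buffer of 8 IPs available at all times
    pre-allocate: 8
    # Allow up to 2 extra IPs above the watermark per allocation
    max-above-watermark: 2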

Operational Details
Cache of Interfaces, Subnets, and VirtualNetworks

The operator maintains a list of all Azure ScaleSets, Instances, Interfaces, VirtualNetworks, and Subnets associated with the Azure subscription in a cache.

The cache is updated once per minute or after an IP allocation has been performed. When triggered based on an allocation, the operation is performed at most once per second.

Publication of available IPs

Following the update of the cache, all CiliumNode custom resources representing nodes are updated to publish any new IPs that have become available.

In this process, all interfaces are scanned for all available IPs. All IPs found are added to spec.ipam.available. Each interface is also added to status.azure.interfaces.

If this update caused the custom resource to change, the custom resource is updated using the Kubernetes API methods Update() and/or UpdateStatus() if available.

Determination of IP deficits or excess

The operator constantly monitors all nodes and detects deficits in available IP addresses. The check to recognize a deficit is performed on two occasions:

  • When a CiliumNode custom resource is updated
  • All nodes are scanned in a regular interval (once per minute)

If --aws-release-excess-ips is enabled, the check to recognize IP excess is performed at the interval based scan.

When determining whether a node has a deficit in IP addresses, the following calculation is performed:

spec.ipam.pre-allocate - (len(spec.ipam.available) - len(status.ipam.used))

For excess IP calculation:

(len(spec.ipam.available) - len(status.ipam.used)) - (spec.ipam.pre-allocate + spec.ipam.max-above-watermark)

Upon detection of a deficit, the node is added to the list of nodes which require IP address allocation. When a deficit is detected using the interval based scan, the allocation order of nodes is determined based on the severity of the deficit, i.e. the node with the biggest deficit will be at the front of the allocation queue. Nodes that need to release IPs are behind nodes that need allocation.

The allocation queue is handled on demand but at most once per second.

IP Allocation

When performing IP allocation for a node with an address deficit, the operator first looks at the interfaces already attached to the instance represented by the CiliumNode resource.

The operator will then pick the first interface which meets the following criteria:

  • The interface has associated addresses which are not yet used, or the number of addresses associated with the interface is less than the maximum number of addresses that can be associated with an interface.
  • The subnet associated with the interface has IPs available for allocation

The following formula is used to determine how many IPs are allocated on the interface:

min(AvailableOnSubnet, min(AvailableOnInterface, NeededAddresses + spec.ipam.max-above-watermark))

This means that the number of IPs allocated in a single allocation cycle can be less than what is required to fulfill spec.ipam.pre-allocate.

IP Release

When performing IP release for a node with IP excess, the operator scans the interface attached to the node. The following formula is used to determine how many IPs are available for release on the interface:

min(FreeOnInterface, (TotalFreeIPs - spec.ipam.pre-allocate - spec.ipam.max-above-watermark))
Node Termination

When a node or instance terminates, the Kubernetes apiserver will send a node deletion event. This event will be picked up by the operator and the operator will delete the corresponding ciliumnodes.cilium.io custom resource.

Required Privileges

The following Azure API calls are being performed by the Cilium operator. The service principal provided must have privileges to perform these:

Metrics

The metrics are documented in the section IPAM.

AWS ENI

The AWS ENI allocator is specific to Cilium deployments running in the AWS cloud and performs IP allocation based on IPs of AWS Elastic Network Interfaces (ENI) by communicating with the AWS EC2 API.

The architecture ensures that only a single operator communicates with the EC2 service API to avoid rate-limiting issues in large clusters. A pre-allocation watermark is used to keep a number of IP addresses available for use on nodes at all times without needing to contact the EC2 API when a new pod is scheduled in the cluster.

Architecture
[Figure: eni_arch.png]

The AWS ENI allocator builds on top of the CRD-backed allocator. Each node creates a ciliumnodes.cilium.io custom resource matching the node name when Cilium starts up for the first time on that node. It contacts the EC2 metadata API to retrieve the instance ID, instance type, and VPC information, then it populates the custom resource with this information. ENI allocation parameters are provided as agent configuration options and are passed into the custom resource as well.

The Cilium operator listens for new ciliumnodes.cilium.io custom resources and starts managing the IPAM aspect automatically. It scans the EC2 instances for existing ENIs with associated IPs and makes them available via the spec.ipam.available field. It will then constantly monitor the used IP addresses in the status.ipam.used field and automatically create ENIs and allocate more IPs as needed to meet the IP pre-allocation watermark. This ensures that there are always IPs available.

The selection of subnets to use for allocation as well as attachment of security groups to new ENIs can be controlled separately for each node. This makes it possible to hand out pod IPs with differing security groups on individual nodes.

The corresponding datapath is described in section AWS ENI.

Configuration
  • The Cilium agent and operator must be run with the option --ipam=eni or the option ipam: eni must be set in the ConfigMap. This will enable ENI allocation in both the node agent and operator.
  • In most scenarios, it makes sense to automatically create the ciliumnodes.cilium.io custom resource when the agent starts up on a node for the first time. To enable this, specify the option --auto-create-cilium-node-resource or set auto-create-cilium-node-resource: "true" in the ConfigMap.
  • If IPs are limited, run the Operator with the option --aws-release-excess-ips=true. When enabled, the operator checks the number of IPs regularly and attempts to release excess free IPs from ENIs.
  • It is generally a good idea to enable metrics in the Operator as well with the option --enable-metrics. See the section Running Prometheus & Grafana for additional information on how to install and run Prometheus including the Grafana dashboard.
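
For example, the corresponding cilium-config ConfigMap entries could look as follows (a minimal sketch; only the ConfigMap options mentioned above are shown, the operator-only flags are passed separately on its command line):

data:
  # Enable the AWS ENI IPAM backend
  ipam: eni
  # Let the agent create the CiliumNode custom resource automatically
  auto-create-cilium-node-resource: "true"
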
ENI Allocation Parameters

The following parameters are available to control the ENI creation and IP allocation:

InstanceType

The AWS EC2 instance type

This field is automatically populated when using --auto-create-cilium-node-resource.

spec.eni.vpc-id

The VPC identifier used to create ENIs and select AWS subnets for IP allocation.

This field is automatically populated when using --auto-create-cilium-node-resource.

spec.eni.availability-zone

The availability zone used to create ENIs and select AWS subnets for IP allocation.

This field is automatically populated when using --auto-create-cilium-node-resource.

spec.ipam.min-allocate

The minimum number of IPs that must be allocated when the node is first bootstrapped. It defines the minimum base of addresses that must be available. After reaching this watermark, the PreAllocate and MaxAboveWatermark logic takes over to continue allocating IPs.

If unspecified, no minimum number of IPs is required.

spec.ipam.max-allocate

The maximum number of IPs that can be allocated to the node. As the number of allocated IPs approaches this value, the effective PreAllocate value is reduced towards 0 in order not to allocate more addresses than defined.

If unspecified, no maximum number of IPs will be enforced.

spec.ipam.pre-allocate

The number of IP addresses that must be available for allocation at all times. It defines the buffer of addresses available immediately without requiring the operator to get involved.

If unspecified, this value defaults to 8.

spec.ipam.max-above-watermark

The maximum number of addresses to allocate beyond the addresses needed to reach the PreAllocate watermark. Going above the watermark can help reduce the number of API calls to allocate IPs, e.g. when a new ENI is allocated, as many secondary IPs as possible are allocated. Limiting the amount can help reduce waste of IPs.

If left unspecified, the value defaults to 0.

spec.eni.first-interface-index

The index of the first ENI to use for IP allocation. For example, if the node has eth0, eth1, and eth2 and FirstInterfaceIndex is set to 1, then only eth1 and eth2 will be used for IP allocation; eth0 will be ignored for pod IP allocation.

If unspecified, this value defaults to 1 which means that eth0 will not be used for pod IPs.

spec.eni.security-group-tags

The list of tags which will be used to filter the security groups to attach to any ENI that is created and attached to the instance.

If unspecified, the security group ids passed in spec.eni.security-groups field will be used.

spec.eni.security-groups

The list of security group ids to attach to any ENI that is created and attached to the instance.

If unspecified, the security group ids of eth0 will be used.

spec.eni.subnet-tags

The tags used to select the AWS subnets for IP allocation. This is an additional requirement on top of requiring to match the availability zone and VPC of the instance.

If unspecified, no tags are required.

spec.eni.delete-on-termination

Remove the ENI when the instance is terminated

If unspecified, this option is enabled.
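
As an illustration, a manually provided CiliumNode resource using these parameters might look roughly like this (a sketch only; the VPC, availability zone, security group, subnet tags, and node name are placeholders, and not all fields are shown):

apiVersion: cilium.io/v2
kind: CiliumNode
metadata:
  name: ip-10-0-1-20.ec2.internal
spec:
  eni:
    vpc-id: vpc-0123456789abcdef0
    availability-zone: us-west-2a
    # Skip eth0 for pod IP allocation
    first-interface-index: 1
    security-groups:
    - sg-0123456789abcdef0
    subnet-tags:
      cilium: eni
    delete-on-termination: true
  ipam:
    pre-allocate: 8
    max-above-watermark: 2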

Operational Details
Cache of ENIs, Subnets, and VPCs

The operator maintains a list of all EC2 ENIs, VPCs and subnets associated with the AWS account in a cache. For this purpose, the operator performs the following three EC2 API operations:

  • DescribeNetworkInterfaces
  • DescribeSubnets
  • DescribeVpcs

The cache is updated once per minute or after an IP allocation or ENI creation has been performed. When triggered based on an allocation or creation, the operation is performed at most once per second.

Publication of available ENI IPs

Following the update of the cache, all CiliumNode custom resources representing nodes are updated to publish any new IPs that have become available.

In this process, all ENIs with an interface index greater than spec.eni.first-interface-index are scanned for all available IPs. All IPs found are added to spec.ipam.available. Each ENI meeting this criterion is also added to status.eni.enis.

If this update caused the custom resource to change, the custom resource is updated using the Kubernetes API methods Update() and/or UpdateStatus() if available.

Determination of ENI IP deficits or excess

The operator constantly monitors all nodes and detects deficits in available ENI IP addresses. The check to recognize a deficit is performed on two occasions:

  • When a CiliumNode custom resource is updated
  • All nodes are scanned in a regular interval (once per minute)

If --aws-release-excess-ips is enabled, the check to recognize IP excess is performed at the interval based scan.

When determining whether a node has a deficit in IP addresses, the following calculation is performed:

spec.ipam.pre-allocate - (len(spec.ipam.available) - len(status.ipam.used))

For excess IP calculation:

(len(spec.ipam.available) - len(status.ipam.used)) - (spec.ipam.pre-allocate + spec.ipam.max-above-watermark)
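
For example, with spec.ipam.pre-allocate set to 8, 14 IPs listed in spec.ipam.available and 10 of them in status.ipam.used, the deficit calculation yields 8 - (14 - 10) = 4, i.e. four additional addresses need to be allocated to restore the watermark.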

Upon detection of a deficit, the node is added to the list of nodes which require IP address allocation. When a deficit is detected using the interval based scan, the allocation order of nodes is determined based on the severity of the deficit, i.e. the node with the biggest deficit will be at the front of the allocation queue. Nodes that need to release IPs are behind nodes that need allocation.

The allocation queue is handled on demand but at most once per second.

IP Allocation

When performing IP allocation for a node with an address deficit, the operator first looks at the ENIs which are already attached to the instance represented by the CiliumNode resource. All ENIs with an interface index greater than spec.eni.first-interface-index are considered for use.

Note

In order to not use eth0 for IP allocation, set spec.eni.first-interface-index to 1 to skip the first interface in line.

The operator will then pick the first already allocated ENI which meets the following criteria:

  • The ENI has associated addresses which are not yet used, or the number of addresses associated with the ENI is less than the instance type specific limit.
  • The subnet associated with the ENI has IPs available for allocation

The following formula is used to determine how many IPs are allocated on the ENI:

min(AvailableOnSubnet, min(AvailableOnENI, NeededAddresses + spec.ipam.max-above-watermark))

This means that the number of IPs allocated in a single allocation cycle can be less than what is required to fulfill spec.ipam.pre-allocate.

In order to allocate the IPs, the method AssignPrivateIpAddresses of the EC2 service API is called. When no more ENIs are available meeting the above criteria, a new ENI is created.

IP Release

When performing IP release for a node with IP excess, the operator scans ENIs attached to the node with an interface index greater than spec.eni.first-interface-index and selects an ENI with the most free IPs available for release. The following formula is used to determine how many IPs are available for release on the ENI:

min(FreeOnENI, (FreeIPs - spec.ipam.pre-allocate - spec.ipam.max-above-watermark))

The operator releases IPs from the selected ENI. If excess free IPs remain unreleased, the operator will attempt to release them in the next release cycle.

In order to release the IPs, the method UnassignPrivateIpAddresses of the EC2 service API is called. There is no limit on ENIs per subnet, so ENIs remain attached to the node.

ENI Creation

As long as an instance type is capable of allocating additional ENIs, ENIs are allocated automatically based on demand.

When allocating an ENI, the first operation performed is to identify the best subnet. This is done by searching through all subnets and finding a subnet that matches the following criteria:

  • The VPC ID of the subnet matches spec.eni.vpc-id
  • The Availability Zone of the subnet matches spec.eni.availability-zone
  • The subnet contains all tags as specified by spec.eni.subnet-tags

If multiple subnets match, the subnet with the most available addresses is selected.

After selecting the subnet, the interface index is determined. For this purpose, all existing ENIs are scanned and the first unused index greater than spec.eni.first-interface-index is selected.

After determining the subnet and interface index, the ENI is created and attached to the EC2 instance using the methods CreateNetworkInterface and AttachNetworkInterface of the EC2 API.

The security group ids attached to the ENI are computed in the following order:

  1. The field spec.eni.security-groups is consulted first. If this is set then these will be the security group ids attached to the newly created ENI.
  2. The field spec.eni.security-group-tags is consulted. If this is set, then the operator will list all security groups in the account and attach to the ENI the ones that match the list of tags passed.
  3. Finally if none of the above fields are set then the newly created ENI will inherit the security group ids of eth0 of the machine.

The description will be in the following format:

"Cilium-CNI (<EC2 instance ID>)"

If the ENI tagging feature is enabled then the ENI will be tagged with the provided information.

ENI Deletion Policy

ENIs can be marked for deletion when the EC2 instance to which the ENI is attached is terminated. In order to enable this, the option spec.eni.delete-on-termination can be enabled. If enabled, the ENI is modified after creation using ModifyNetworkInterface to specify this deletion policy.

Node Termination

When a node or instance terminates, the Kubernetes apiserver will send a node deletion event. This event will be picked up by the operator and the operator will delete the corresponding ciliumnodes.cilium.io custom resource.

Required Privileges

The following EC2 privileges are required by the Cilium operator in order to perform ENI creation and IP allocation:

  • DescribeNetworkInterfaces
  • DescribeSubnets
  • DescribeVpcs
  • DescribeSecurityGroups
  • CreateNetworkInterface
  • AttachNetworkInterface
  • ModifyNetworkInterface
  • AssignPrivateIpAddresses

Additionally if the ENI tagging feature is enabled it will require the following EC2 API operation as well:

  • CreateTags

If releasing excess IPs is enabled:

  • UnassignPrivateIpAddresses
EC2 instance types ENI limits

Currently the EC2 instance ENI limits (adapters per instance + IPv4/IPv6 IPs per adapter) are hardcoded in the Cilium codebase for easy out-of-the-box deployment and usage.

The limits can be modified via the --aws-instance-limit-mapping CLI flag on the cilium-operator. This allows the user to supply a custom limit.

Additionally, the limits can be updated via the EC2 API by passing the --update-ec2-adapter-limit-via-api CLI flag. This will require an additional EC2 IAM permission:

  • DescribeInstanceTypes
Metrics

The IPAM metrics are documented in the section IPAM.

Google Kubernetes Engine

When running Cilium on Google GKE following the Installation on Google GKE guide, the native networking layer of Google Cloud will be utilized for address management and IP forwarding.

Architecture
[Figure: gke_ipam_arch.png]

Cilium running in a GKE configuration mode utilizes the Kubernetes hostscope IPAM mode. It will configure the Cilium agent to wait until the Kubernetes node resource is populated with a spec.podCIDR or spec.podCIDRs as required by the enabled address families (IPv4/IPv6). See Kubernetes Host Scope for additional details of this IPAM mode.

The corresponding datapath is described in section Google Kubernetes Engine.

See the getting started guide Installation on Google GKE to install Cilium on Google Kubernetes Engine (GKE).

Configuration

The GKE IPAM mode can be enabled by setting the Helm option config.ipam=kubernetes or by setting the ConfigMap option ipam: kubernetes.

Troubleshooting
Validate the exposed PodCIDR field

Check if the Kubernetes nodes contain a value in the podCIDR field:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
gke-cluster4-default-pool-b195a3f3-k431     10.4.0.0/24
gke-cluster4-default-pool-b195a3f3-zv3p     10.4.1.0/24
Check the Cilium status

Run cilium status on the node in question and validate that the CIDR used for IPAM matches the PodCIDR announced in the Kubernetes node:

kubectl -n cilium get pods -o wide | grep gke-cluster4-default-pool-b195a3f3-k431
cilium-lv4xd                       1/1     Running   0          3h8m   10.164.0.112   gke-cluster4-default-pool-b195a3f3-k431   <none>           <none>

kubectl -n cilium exec -ti cilium-lv4xd -- cilium status
KVStore:                Ok   Disabled
Kubernetes:             Ok   1.14+ (v1.14.10-gke.27) [linux/amd64]
Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   Probe   []
Cilium:                 Ok      OK
NodeMonitor:            Disabled
Cilium health daemon:   Ok
IPAM:                   IPv4: 7/255 allocated from 10.4.0.0/24,
Controller Status:      36/36 healthy
Proxy Status:           OK, ip 10.4.0.190, 0 redirects active on ports 10000-20000
Hubble:                 Disabled
Cluster health:         2/2 reachable   (2020-04-23T13:46:36Z)

Multi Host Networking

Cilium is in full control over both ends of the connection for connections inside the cluster. It can thus transmit state and security context information between two container hosts by embedding the information in encapsulation headers or even unused bits of the IPv6 packet header. This allows Cilium to transmit the security context of where the packet originates, which allows tracing back which container labels are assigned to the origin container.

Note

As the packet headers contain security sensitive information, it is highly recommended to either encrypt all traffic or run Cilium in a trusted network environment.

Cilium keeps the networking concept as simple as possible. There are two networking models to choose from.

Regardless of the option chosen, the container itself has no awareness of the underlying network it runs on; it only contains a default route which points to the IP address of the cluster node. Given the removal of the routing cache in the Linux kernel, this reduces the amount of state to keep in the per connection flow cache (TCP metrics), which allows millions of connections to be terminated in each container.

Overlay Network Mode

When no configuration is provided, Cilium automatically runs in this mode.

In this mode, all cluster nodes form a mesh of tunnels using the UDP based encapsulation protocols VXLAN or Geneve. All container-to-container network traffic is routed through these tunnels.

The overlay used by the mesh of tunnels is managed by the encapsulation protocol of your choice (VXLAN by default). It is up to the overlay implementation and the kernel's routing table to route traffic to the other nodes' overlays. In practice, the encapsulated traffic will typically follow the host's default layer 3 path.

If you would like all traffic in Cilium tunnels to go through a specific network interface (e.g. a private network for security purposes), you should configure the host's default endpoint accordingly (e.g. the IP of the node in a Kubernetes setup). You can check the endpoints used by Cilium to set up the tunneling with the command cilium bpf tunnel list (https://docs.cilium.io/en/stable/cmdref/cilium_bpf_tunnel_list/).

This mode has several major advantages:

  • Simplicity: The network which connects the cluster nodes does not need to be made aware of the cluster prefix. Cluster nodes can spawn multiple routing or link-layer domains. The topology of the underlying network is irrelevant as long as cluster nodes can reach each other using IP/UDP.
  • Auto-configuration: When running together with an orchestration system such as Kubernetes, the list of all nodes in the cluster including their associated allocation prefix is made available to each agent automatically. This means that if Kubernetes is being run with the --allocate-node-cidrs option, Cilium can form an overlay network automatically without any configuration by the user. New nodes joining the cluster will automatically be incorporated into the mesh.
  • Identity transfer: Encapsulation protocols allow for the carrying of arbitrary metadata along with the network packet. Cilium makes use of this ability to transfer metadata such as the source security identity and load balancing state to perform direct-server-return.

Direct / Native Routing Mode

Note

This is an advanced networking mode which requires the underlying network to be made aware of container IPs. You can enable this mode by running Cilium with the option --tunnel disabled.

In direct routing mode, Cilium will hand all packets which are not addressed for another local endpoint to the routing subsystem of the Linux kernel. This means that the packet will be routed as if a local process would have emitted the packet. As a result, the network connecting the cluster nodes must be aware that each of the node IP prefixes are reachable by using the node’s primary IP address as an L3 next hop address.

Cilium automatically enables IP forwarding in Linux when direct mode is configured, but it is up to the container cluster administrator to ensure that each routing element in the underlying network has a route that describes each node IP as the IP next hop for the corresponding node prefix.

This is typically achieved using two methods:

  • Operation of a routing protocol such as OSPF or BGP via a routing daemon such as zebra, bird, or bgpd. The routing protocols will announce the node allocation prefix via the node's IP to all other nodes.
  • Use of the cloud provider's routing functionality. Refer to the documentation of your cloud provider for additional details (e.g., AWS VPC Route Tables or GCE Routes). These APIs can be used to associate each node prefix with the appropriate next hop IP each time a container node is added to the cluster. If you are running Kubernetes with the --cloud-provider option in combination with the --allocate-node-cidrs option, then this is configured automatically for IPv4 prefixes.

Note

Use of direct routing mode with advanced policy use cases such as L7 policies is currently beta. Please provide feedback and file a GitHub issue if you experience any problems.

The two networking modes described above are the two possible approaches to performing network forwarding for container-to-container traffic.

Cluster Mesh

Cluster mesh extends the networking datapath across multiple clusters. It allows endpoints in all connected clusters to communicate while providing full policy enforcement. Load-balancing is available via Kubernetes annotations.

See Setting up Cluster Mesh for instructions on how to set up cluster mesh.

Container Communication with External Hosts

Container communication with the outside world has two primary modes:

  • Containers exposing API services for consumption by hosts outside of the container cluster.
  • Containers making outgoing connections. Examples include connecting to 3rd-party API services like Twilio or Stripe as well as accessing private APIs that are hosted elsewhere in your enterprise datacenter or cloud deployment.

In the Direct / Native Routing Mode described before, if container IP addresses are routable outside of the container cluster, communication with external hosts requires little more than enabling L3 forwarding on each of the Linux nodes.

External Network Connectivity

If the destination of a packet lies outside of the cluster, Cilium will delegate routing to the routing subsystem of the cluster node to use the default route which is installed on the node of the cluster.

As the IP addresses used for the cluster prefix are typically allocated from RFC1918 private address blocks, they are not publicly routable. Cilium will therefore automatically masquerade the source IP address of all traffic that is leaving the cluster. This behavior can be disabled by running cilium-agent with the option --masquerade=false.

Public Endpoint Exposure

In direct routing mode, endpoint IPs can be publicly routable IPs and no additional action needs to be taken.

In overlay mode, endpoints that are accepting inbound connections from cluster external clients likely want to be exposed via some kind of load-balancing layer. Such a load-balancer will have a public external address that is not part of the Cilium network. This can be achieved by having a load-balancer container that has both a public IP on an externally reachable network and a private IP on a Cilium network. However, many container orchestration frameworks, like Kubernetes, have built-in abstractions to handle this "ingress" load-balancing capability, with the net effect that Cilium handles forwarding and security only for "internal" traffic between different services.

Security

Cilium provides security on multiple levels. Each can be used individually or combined together.

  • Identity based Connectivity Access Control: Connectivity policies between endpoints (Layer 3), e.g. any endpoint with label role=frontend can connect to any endpoint with label role=backend.
  • Restriction of accessible ports (Layer 4) for both incoming and outgoing connections, e.g. endpoint with label role=frontend can only make outgoing connections on port 443 (https) and endpoint role=backend can only accept connections on port 443 (https).
  • Fine grained access control on application protocol level to secure HTTP and remote procedure call (RPC) protocols, e.g. the endpoint with label role=frontend can only perform the REST API call GET /userdata/[0-9]+; all other API interactions with role=backend are restricted. A policy sketch combining these levels is shown after this list.

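The three levels above can be combined in a single policy. The following CiliumNetworkPolicy is a rough sketch of such a rule (assuming the label names from the examples above; the policy name, port, and path are illustrative only):

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: frontend-to-backend
spec:
  # Apply the rule to all endpoints with label role=backend
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  # Only endpoints with label role=frontend may connect (L3)
  - fromEndpoints:
    - matchLabels:
        role: frontend
    # Only on port 443 (L4)
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
      # Only the GET /userdata/[0-9]+ API call is allowed (L7)
      rules:
        http:
        - method: GET
          path: "/userdata/[0-9]+"
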
Currently on the roadmap, to be added soon:

  • Authentication: Any endpoint which wants to initiate a connection to an endpoint with the label role=backend must have a particular security certificate to authenticate itself before being able to initiate any connections. See GH issue 502 for additional details.
  • Encryption: Communication between any endpoint with the label role=frontend to any endpoint with the label role=backend is automatically encrypted with a key that is automatically rotated. See GH issue 504 to track progress on this feature.

Identity based Connectivity Access Control

Container management systems such as Kubernetes deploy a networking model which assigns an individual IP address to each pod (group of containers). This ensures simplicity in architecture, avoids unnecessary network address translation (NAT) and provides each individual container with a full range of port numbers to use. The logical consequence of this model is that depending on the size of the cluster and total number of pods, the networking layer has to manage a large number of IP addresses.

Traditionally, security enforcement architectures have been based on IP address filters. Let's walk through a simple example: if all pods with the label role=frontend should be allowed to initiate connections to all pods with the label role=backend, then each cluster node which runs at least one pod with the label role=backend must have a corresponding filter installed which allows all IP addresses of all role=frontend pods to initiate a connection to the IP addresses of all local role=backend pods. All other connection requests should be denied. Such a rule could read: if the destination address is 10.1.1.2, then allow the connection only if the source address is one of the following [10.1.2.2, 10.1.2.3, 20.4.9.1].

Every time a new pod with the label role=frontend or role=backend is either started or stopped, the rules on every cluster node which run any such pods must be updated by either adding or removing the corresponding IP address from the list of allowed IP addresses. In large distributed applications, this could imply updating thousands of cluster nodes multiple times per second depending on the churn rate of deployed pods. Worse, the starting of new role=frontend pods must be delayed until all servers running role=backend pods have been updated with the new security rules as otherwise connection attempts from the new pod could be mistakenly dropped. This makes it difficult to scale efficiently.

In order to avoid these complications which can limit scalability and flexibility, Cilium entirely separates security from network addressing. Instead, security is based on the identity of a pod, which is derived through labels. This identity can be shared between pods. This means that when the first role=frontend pod is started, Cilium assigns an identity to that pod which is then allowed to initiate connections to the identity of the role=backend pod. The subsequent start of additional role=frontend pods only requires resolving this identity via a key-value store; no action has to be performed on any of the cluster nodes hosting role=backend pods. The starting of a new pod must only be delayed until the identity of the pod has been resolved, which is a much simpler operation than updating the security rules on all other cluster nodes.

[Figure: identity.png]

Policy Enforcement

All security policies are described assuming stateful policy enforcement for session based protocols. This means that the intent of the policy is to describe allowed direction of connection establishment. If the policy allows A => B then reply packets from B to A are automatically allowed as well. However, B is not automatically allowed to initiate connections to A. If that outcome is desired, then both directions must be explicitly allowed.

Security policies may be enforced at ingress or egress. For ingress, this means that each cluster node verifies all incoming packets and determines whether the packet is allowed to be transmitted to the intended endpoint. Correspondingly, for egress each cluster node verifies outgoing packets and determines whether the packet is allowed to be transmitted to its intended destination.

In order to enforce identity based security in a multi host cluster, the identity of the transmitting endpoint is embedded into every network packet that is transmitted in between cluster nodes. The receiving cluster node can then extract the identity and verify whether a particular identity is allowed to communicate with any of the local endpoints.

Default Security Policy

If no policy is loaded, the default behavior is to allow all communication unless policy enforcement has been explicitly enabled. As soon as the first policy rule is loaded, policy enforcement is enabled automatically and any communication must then be explicitly whitelisted or the relevant packets will be dropped.

Similarly, if an endpoint is not subject to an L4 policy, communication from and to all ports is permitted. Associating at least one L4 policy to an endpoint will block all connectivity to ports unless explicitly allowed.

Orchestration System Specifics

Kubernetes

Cilium regards each deployed Pod as an endpoint with regard to networking and security policy enforcement. Labels associated with pods can be used to define the identity of the endpoint.

When two pods communicate via a service construct, then the labels of the origin pod apply to determine the identity.

Datapath

Native-Routing

The native routing datapath is enabled with tunnel: disabled and enables the native packet forwarding mode. The native packet forwarding mode leverages the routing capabilities of the network Cilium runs on instead of performing encapsulation.

[Figure: native_routing.png]
Requirements on the network
  • In order to run the native routing mode, the network connecting the hosts on which Cilium is running must be capable of forwarding IP traffic using addresses given to pods or other workloads.
  • The Linux kernel on the node must be aware of how to forward packets of pods or other workloads of all nodes running Cilium. This can be achieved in two ways:
    1. The node itself does not know how to route all pod IPs but a router exists on the network that knows how to reach all other pods. In this scenario, the Linux node is configured to contain a default route to point to such a router. This model is used for cloud provider network integration. See Google Kubernetes Engine, AWS ENI, and Azure IPAM for more details.
    2. Each individual node is made aware of all pod IPs of all other nodes and routes are inserted into the Linux kernel routing table to represent this. If all nodes share a single L2 network, then this can be taken care of by enabling the option auto-direct-node-routes: true. Otherwise, an additional system component such as a BGP daemon must be run to distribute the routes. See the guide Using kube-router to run BGP on how to achieve this using the kube-router project.
Masquerading

Native routing is typically enabled in the context of a virtual network with private IP addresses. For any destination outside of the virtual network, traffic must typically be masqueraded. This is done by setting masquerade: true (default). In order to exclude the entire CIDR of the virtual network, the datapath must be told the CIDR within which native routing is supported. This is done with the option native-routing-cidr: x.x.x.x/y.

Configuration

The following configuration options must be set to run the datapath in native routing mode:

  • tunnel: disabled: Enable native routing mode
  • enable-endpoint-routes: true: Enable per-endpoint routing on the node
  • native-routing-cidr: x.x.x.x/y: Set the CIDR in which native routing can be performed.
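
A corresponding cilium-config fragment could look roughly like this (a sketch; the CIDR is a placeholder, ConfigMap values are strings, and masquerade is shown with its default value):

data:
  # Disable encapsulation and use native routing
  tunnel: disabled
  enable-endpoint-routes: "true"
  # CIDR within which native routing is possible, no masquerading applied
  native-routing-cidr: 10.0.0.0/8
  masquerade: "true"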

AWS ENI

The AWS ENI datapath is enabled when Cilium is run with the option --ipam=eni. It is a special purpose datapath that is useful when running Cilium in an AWS environment.

Advantages of the model
  • Pods are assigned ENI IPs which are directly routable in the AWS VPC. This simplifies communication of pod traffic within VPCs and avoids the need for SNAT.
  • Pod IPs are assigned a security group. The security groups for pods are configured per node, which allows node pools to be created and different security group assignments to be given to pods on different nodes. See section AWS ENI for more details.
Disadvantages of this model
  • The number of ENI IPs is limited per instance. The limit depends on the EC2 instance type. This can become a problem when attempting to run a larger number of pods on very small instance types.
  • Allocation of ENIs and ENI IPs requires interaction with the EC2 API which is subject to rate limiting. This is primarily mitigated via the operator design, see section AWS ENI for more details.
Architecture
Ingress
  1. Traffic is received on one of the ENIs attached to the instance which is represented on the node as interface ethN.

  2. An IP routing rule ensures that traffic to all local pod IPs is done using the main routing table:

    20:      from all to 192.168.105.44 lookup main
    
  3. The main routing table contains an exact match route to steer traffic into a veth pair which is hooked into the pod:

    192.168.105.44 dev lxc5a4def8d96c5
    
  4. All traffic passing lxc5a4def8d96c5 on the way into the pod is subject to Cilium’s BPF program to enforce network policies, provide service reverse load-balancing, and visibility.

Egress
  1. The pod’s network namespace contains a default route which points to the node’s router IP via the veth pair which is named eth0 inside of the pod and lxcXXXXXX in the host namespace. The router IP is allocated from the ENI space, allowing for sending of ICMP errors from the router IP for Path MTU purposes.

  2. After passing through the veth pair and before reaching the Linux routing layer, all traffic is subject to Cilium’s BPF program to enforce network policies, implement load-balancing and provide networking features.

  3. An IP routing rule ensures that traffic from individual endpoints are using a routing table specific to the ENI from which the endpoint IP was allocated:

    30:      from 192.168.105.44 to 192.168.0.0/16 lookup 92
    
  4. The ENI specific routing table contains a default route which redirects to the router of the VPC via the ENI interface:

    default via 192.168.0.1 dev eth2
    192.168.0.1 dev eth2
    
Configuration

The AWS ENI datapath is enabled by setting the following option:

  • ipam: eni Enables the ENI specific IPAM backend and indicates to the datapath that ENI IPs will be used.
  • blacklist-conflicting-routes: "false" disables blacklisting of local routes. This is required as routes will exist covering ENI IPs pointing to interfaces that are not owned by Cilium. If blacklisting is not disabled, all ENI IPs would be considered used by another networking component.
  • enable-endpoint-routes: "true" enables direct routing to the ENI veth pairs without requiring to route via the cilium_host interface.
  • auto-create-cilium-node-resource: "true" enables the automatic creation of the CiliumNode custom resource with all required ENI parameters. It is possible to disable this and provide the custom resource manually.
  • egress-masquerade-interfaces: eth+ is the interface selector of all interfaces which are subject to masquerading. Masquerading can be disabled entirely with masquerade: "false".

See the section AWS ENI for details on how to configure ENI IPAM specific parameters.
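
Taken together, a cilium-config fragment enabling the AWS ENI datapath could look like this (a sketch repeating only the options listed above):

data:
  ipam: eni
  blacklist-conflicting-routes: "false"
  enable-endpoint-routes: "true"
  auto-create-cilium-node-resource: "true"
  egress-masquerade-interfaces: eth+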

Google Kubernetes Engine

Running Cilium on Google Kubernetes Engine will utilize Google Cloud's networking layer with Cilium running in a Native-Routing configuration. This provides native networking performance while benefiting from many additional Cilium features such as policy enforcement, load-balancing with DSR, efficient NodePort/ExternalIP/HostPort implementation, extensive visibility features, and so on.

[Figure: gke_datapath.png]
Addressing
Cilium will assign IPs to pods out of the PodCIDR assigned to the specific Kubernetes node. By using Alias IP ranges, these IPs are natively routable on Google Cloud’s network without additional encapsulation or route distribution.
Masquerading
All traffic not staying within the native-routing-cidr (which defaults to the Cluster CIDR) will be masqueraded to the node's IP address to become publicly routable.
Load-balancing
ClusterIP load-balancing will be performed using BPF for all versions of GKE. Starting with GKE >= 1.15 or when running a Linux kernel >= 4.19, NodePort/ExternalIP/HostPort will be performed using a BPF implementation as well.
Policy enforcement & visibility
All NetworkPolicy enforcement and visibility is provided using BPF.
Configuration

The following configuration options must be set to run the datapath on GKE:

  • gke.enabled: true: Enables the Google Kubernetes Engine (GKE) datapath. Setting this to true will enable the following options:
    • ipam: kubernetes: Enable Kubernetes Host Scope IPAM
    • tunnel: disabled: Enable native routing mode
    • enable-endpoint-routes: true: Enable per-endpoint routing on the node
    • blacklist-conflicting-routes: false: Disable blacklisting of IPs which collide with a local route
    • enable-local-node-route: false: Disable installation of the local node route
  • native-routing-cidr: x.x.x.x/y: Set the CIDR in which native routing is supported.

See the getting started guide Installation on Google GKE to install Cilium on Google Kubernetes Engine (GKE).
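
For reference, the ConfigMap entries implied by gke.enabled: true correspond to the list above; a hedged sketch of the resulting cilium-config fragment is shown below (the native-routing-cidr value is a placeholder for your cluster’s CIDR):

data:
  ipam: kubernetes
  tunnel: disabled
  enable-endpoint-routes: "true"
  blacklist-conflicting-routes: "false"
  enable-local-node-route: "false"
  # Placeholder CIDR; set this to the CIDR in which native routing is supported
  native-routing-cidr: 10.0.0.0/8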

Failure Behavior

If Cilium loses connectivity with the KV-Store, it guarantees that:

  • Normal networking operations will continue;
  • If policy enforcement is enabled, existing endpoints will still have their policy enforced, but you will lose the ability to add additional containers that belong to security identities which are unknown on the node;
  • If services are enabled, you will lose the ability to add additional services / loadbalancers;
  • When the connectivity is restored to the KV-Store, Cilium can take up to 5 minutes to re-sync the out-of-sync state with the KV-Store.

Cilium will keep running even if it is out-of-sync with the KV-Store.

If Cilium crashes or the DaemonSet is accidentally deleted, the following are guaranteed:

  • When running Cilium as a DaemonSet / container, with the specification files provided in the documentation Installation with external etcd, the endpoints / containers which are already running will not lose any connectivity, and they will keep running with the policy loaded before Cilium stopped unexpectedly.
  • When running Cilium in a different way, just make sure the BPF filesystem is mounted; see Mounted BPF filesystem.

Architecture

This document describes the Cilium architecture. It focuses on the BPF datapath hooks used to implement the Cilium datapath, how the Cilium datapath integrates with the container orchestration layer, and the objects shared between the layers, e.g. between the BPF datapath and the Cilium agent.

Datapath

The Linux kernel supports a set of BPF hooks in the networking stack that can be used to run BPF programs. The Cilium datapath uses these hooks to load BPF programs that when used together create higher level networking constructs.

The following is a list of the hooks used by Cilium and a brief description of each. For more thorough documentation on the specifics of each hook, see the BPF and XDP Reference Guide.

  • XDP: The XDP BPF hook is at the earliest point possible in the networking driver and triggers a run of the BPF program upon packet reception. This achieves the best possible packet processing performance since the program runs directly on the packet data before any other processing can happen. This hook is ideal for running filtering programs that drop malicious or unexpected traffic, and other common DDOS protection mechanisms.

  • Traffic Control Ingress/Egress: BPF programs attached to the traffic control (tc) ingress hook are attached to a networking interface, same as XDP, but will run after the networking stack has done initial processing of the packet. The hook runs before the L3 layer of the stack but has access to most of the metadata associated with a packet. This is ideal for doing local node processing, such as applying L3/L4 endpoint policy and redirecting traffic to endpoints. For network-facing devices the tc ingress hook can be coupled with the XDP hook above. When this is done it is reasonable to assume that the majority of the traffic at this point is legitimate and destined for the host.

    Containers typically use a virtual device called a veth pair, which acts as a virtual wire connecting the container to the host. By attaching to the tc ingress hook of the host side of this veth pair, Cilium can monitor and enforce policy on all traffic exiting a container. By attaching a BPF program to the veth pair associated with each container, and routing all network traffic to the host-side virtual devices with another BPF program attached to the tc ingress hook as well, Cilium can monitor and enforce policy on all traffic entering or exiting the node.

    Depending on the use case, containers may also be connected through ipvlan devices instead of a veth pair. In this mode, the physical device in the host is the ipvlan master, and virtual ipvlan devices in slave mode are set up inside the container. One of the benefits of ipvlan over a veth pair is that the stack requires fewer resources to push the packet into the ipvlan slave device of the other network namespace and therefore may achieve better latency results. This option can be used for unprivileged containers. The BPF programs for tc are then attached to the tc egress hook on the ipvlan slave device inside the container’s network namespace in order to have Cilium apply L3/L4 endpoint policy, for example, combined with another BPF program running on the tc ingress hook of the ipvlan master, so that incoming traffic on the node can also be enforced.

  • Socket operations: The socket operations hook is attached to a specific cgroup and runs on TCP events. Cilium attaches a BPF socket operations program to the root cgroup and uses this to monitor for TCP state transitions, specifically for ESTABLISHED state transitions. When a socket transitions into ESTABLISHED state, if the TCP socket has a node-local peer (possibly a local proxy), a socket send/recv program is attached.

  • Socket send/recv: The socket send/recv hook runs on every send operation performed by a TCP socket. At this point the hook can inspect the message and either drop the message, send the message to the TCP layer, or redirect the message to another socket. Cilium uses this to accelerate the datapath redirects as described below.

Combining the above hooks with virtual interfaces (cilium_host, cilium_net), an optional overlay interface (cilium_vxlan), Linux kernel crypto support and a userspace proxy (Envoy), Cilium creates the following networking objects.

  • Prefilter: The prefilter object runs an XDP program and provides a set of prefilter rules used to filter traffic from the network for best performance. Specifically, a set of CIDR maps supplied by the Cilium agent are used to do a lookup and the packet is either dropped, for example when the destination is not a valid endpoint, or allowed to be processed by the stack. This can be easily extended as needed to build in new prefilter criteria/capabilities.

  • Endpoint Policy: The endpoint policy object implements the Cilium endpoint enforcement. Using a map to look up a packet’s associated identity and policy, this layer scales well to lots of endpoints. Depending on the policy, this layer may drop the packet, forward to a local endpoint, forward to the service object, or forward to the L7 Policy object for further L7 rules. This is the primary object in the Cilium datapath responsible for mapping packets to identities and enforcing L3 and L4 policies.

  • Service: The Service object performs a map lookup on the destination IP and optionally destination port for every packet received by the object. If a matching entry is found, the packet will be forwarded to one of the configured L3/L4 endpoints. The Service block can be used to implement a standalone load balancer on any interface using the TC ingress hook or may be integrated in the endpoint policy object.

  • L3 Encryption: On ingress the L3 Encryption object marks packets for decryption, passes the packets to the Linux xfrm (transform) layer for decryption, and after the packet is decrypted the object receives the packet then passes it up the stack for further processing by other objects. Depending on the mode, direct routing or overlay, this may be a BPF tail call or the Linux routing stack that passes the packet to the next object. The key required for decryption is encoded in the IPsec header so on ingress we do not need to do a map lookup to find the decryption key.

    On egress a map lookup is first performed using the destination IP to determine if a packet should be encrypted and if so what keys are available on the destination node. The most recent key available on both nodes is chosen and the packet is marked for encryption. The packet is then passed to the Linux xfrm layer where it is encrypted. Upon receiving the now encrypted packet it is passed to the next layer either by sending it to the Linux stack for routing or doing a direct tail call if an overlay is in use.

  • Socket Layer Enforcement: Socket layer enforcement uses two hooks, the socket operations hook and the socket send/recv hook, to monitor and attach to all TCP sockets associated with Cilium managed endpoints, including any L7 proxies. The socket operations hook will identify candidate sockets for accelerating. These include all local node connections (endpoint to endpoint) and any connection to a Cilium proxy. These identified connections will then have all messages handled by the socket send/recv hook and will be accelerated using sockmap fast redirects. The fast redirect ensures that all policies implemented in Cilium are valid for the associated socket/endpoint mapping and, assuming they are, sends the message directly to the peer socket. This is possible because the sockmap send/recv hooks ensure the message will not need to be processed by any of the objects above.

  • L7 Policy: The L7 Policy object redirects proxy traffic to a Cilium userspace proxy instance. Cilium uses an Envoy instance as its userspace proxy. Envoy will then either forward the traffic or generate appropriate reject messages based on the configured L7 policy.

These components are connected to create the flexible and efficient datapath used by Cilium. Below we show the following possible flows connecting endpoints on a single node, ingress to an endpoint, and endpoint to egress networking device. In each case there is an additional diagram showing the TCP accelerated path available when socket layer enforcement is enabled.

Endpoint to Endpoint

First we show the local endpoint to endpoint flow with optional L7 Policy on egress and ingress, followed by the same endpoint to endpoint flow with socket layer enforcement enabled. With socket layer enforcement enabled for TCP traffic, the handshake initiating the connection will traverse the endpoint policy object until the TCP state is ESTABLISHED. After the connection is ESTABLISHED, only the L7 Policy object is still required.

_images/cilium_bpf_endpoint.svg

Egress from Endpoint

Next we show local endpoint to egress with an optional overlay network. In the optional overlay network case, traffic is forwarded out the Linux network interface corresponding to the overlay. In the default case the overlay interface is named cilium_vxlan. Similar to above, when socket layer enforcement is enabled and an L7 proxy is in use, we can avoid running the endpoint policy block between the endpoint and the L7 Policy for TCP traffic. An optional L3 encryption block will encrypt the packet if enabled.

_images/cilium_bpf_egress.svg

Ingress to Endpoint

Finally we show ingress to a local endpoint, also with an optional overlay network. Similar to above, socket layer enforcement can be used to avoid a set of policy traversals between the proxy and the endpoint socket. If the packet is encrypted upon receipt, it is first decrypted and then handled through the normal flow.

_images/cilium_bpf_ingress.svg

veth-based versus ipvlan-based datapath

Note

The ipvlan-based datapath is currently in technology preview and should be used for experimentation purposes only. This restriction will be lifted in future Cilium releases.

By default, Cilium CNI operates in veth-based datapath mode, which allows for more flexibility: all BPF programs are managed by Cilium out of the host network namespace, so containers can be granted privileges for their namespaces such as CAP_NET_ADMIN without affecting security, since the BPF enforcement points in the host are unreachable from the container. Because BPF programs are attached from the host’s network namespace, BPF is also able to take over and efficiently manage most of the forwarding logic between local containers and the host, since there is always a networking device reachable. However, this also comes at a latency cost: in veth-based mode the network stack internally needs to be re-traversed when handing the packet from one veth device to its peer device in the other network namespace. This egress-to-ingress switch needs to be done twice when communicating between local Cilium endpoints, and once for packets that are arriving at or sent out of the host.

For a more latency-optimized datapath, Cilium CNI also supports ipvlan L3/L3S mode, with a number of restrictions. In order to support older kernels without ipvlan’s hairpin mode, Cilium attaches BPF programs at the ipvlan slave device inside the container’s network namespace on the tc egress layer, which means that this datapath mode can only be used for containers which are not running with CAP_NET_ADMIN and CAP_NET_RAW privileges! ipvlan uses an internal forwarding logic for direct slave-to-slave or slave-to-master redirection, and therefore forwarding to devices is not performed from the BPF program itself. The network namespace switching is more efficient in ipvlan mode since the stack does not need to be re-traversed as in the veth-based datapath case for external packets. The host-to-container network namespace switch happens directly at the L3 layer, without having to queue and reschedule the packet for later ingress processing. In the case of communication among local endpoints, the egress-to-ingress switch is performed once instead of twice.

For Cilium in ipvlan mode there are a number of additional restrictions in the current implementation which are to be addressed in upcoming work: NAT64 cannot be enabled at this point, nor can L7 policy enforcement via proxy. Service load-balancing to local endpoints is currently not enabled, nor is container to host-local communication. If one of these features is needed, then the default veth-based datapath mode is recommended instead.

The ipvlan mode in Cilium’s CNI can be enabled by running the Cilium daemon with e.g. --datapath-mode=ipvlan --ipvlan-master-device=bond0 where the latter typically specifies the physical networking device which then also acts as the ipvlan master device. Note that in case ipvlan datapath mode is deployed in L3S mode with Kubernetes, make sure to have a stable kernel running with the following ipvlan fix included: d5256083f62e.

This completes the datapath overview. More BPF specifics can be found in the BPF and XDP Reference Guide. Additional details on how to extend the L7 Policy exist in the Envoy section.

Scale

BPF Map Limitations

All BPF maps are created with upper capacity limits. Insertion beyond a limit will fail, thus limiting the scalability of the datapath. The following table shows the default values of the maps. Each limit can be bumped in the source code. Configuration options will be added on request if demand arises.

Map Name | Scope | Default Limit | Scale Implications
Connection Tracking | node or endpoint | 1M TCP/256k UDP | Max 1M concurrent TCP connections, max 256k expected UDP answers
NAT | node | 512k | Max 512k NAT entries
Endpoints | node | 64k | Max 64k local endpoints + host IPs per node
IP cache | node | 512k | Max 256k endpoints (IPv4+IPv6), max 512k endpoints (IPv4 or IPv6) across all clusters
Load Balancer | node | 64k | Max 64k cumulative backends across all services across all clusters
Policy | endpoint | 16k | Max 16k allowed identity + port + protocol pairs for specific endpoint
Proxy Map | node | 512k | Max 512k concurrent redirected TCP connections to proxy
Tunnel | node | 64k | Max 32k nodes (IPv4+IPv6) or 64k nodes (IPv4 or IPv6) across all clusters
IPv4 Fragmentation | node | 8k | Max 8k fragmented datagrams in flight simultaneously on the node

For some BPF maps, the upper capacity limit can be overridden using command line options for cilium-agent. A given capacity can be set using --bpf-ct-global-tcp-max, --bpf-ct-global-any-max, --bpf-nat-global-max, --bpf-policy-map-max, and --bpf-fragments-map-max.

Using --bpf-map-dynamic-size-ratio, the upper capacity limits of the connection tracking, NAT, and policy maps are determined at agent startup based on the given ratio of the total system memory. For example, a ratio of 0.03 leads to 3% of the total system memory being used for these maps.
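
When configuring the agent through the ConfigMap, these limits are typically expressed as keys of the same name (without the leading dashes). The fragment below is a sketch with illustrative values, not recommendations; you would normally either set explicit per-map limits or use the dynamic ratio:

data:
  # Explicit upper limits for selected maps (illustrative values)
  bpf-ct-global-tcp-max: "524288"
  bpf-nat-global-max: "524288"
  # Or size the connection tracking, NAT and policy maps from total system memory
  bpf-map-dynamic-size-ratio: "0.0025"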

Kubernetes Integration

The following diagram shows the integration of iptables rules as installed by kube-proxy and the iptables rules as installed by Cilium.

_images/kubernetes_iptables.svg

Getting Help

Cilium is a project with a growing community. There are numerous ways to get help with Cilium if needed:

FAQ

Cilium Frequently Asked Questions (FAQ): Cilium uses GitHub tags to maintain a list of questions asked by users. We suggest checking to see if your question is already answered.

Slack

Chat: The best way to get immediate help if you get stuck is to ask in one of the Cilium Slack channels.

GitHub

Bug Tracker: All the issues are addressed in the GitHub issue tracker. If you want to report a bug or a new feature please file the issue according to the GitHub template.

Contributing: If you want to contribute, reading the Development Guide should help you.

Security Bugs

Security: We strongly encourage you to report security vulnerabilities to our private security mailing list security@cilium.io first, before disclosing them in any public forums.

This is a private mailing list to which only members of the Cilium internal security team are subscribed, and reports sent to it are treated as top priority.

Kubernetes

Cilium provides seamless integration into Kubernetes. The following sections describe the Kubernetes integration in detail:

Introduction

What does Cilium provide in your Kubernetes Cluster?

The following functionality is provided when you run Cilium in your Kubernetes cluster:

  • CNI plugin support to provide pod connectivity with Multi Host Networking.
  • Identity based implementation of the NetworkPolicy resource to isolate pod to pod connectivity on Layer 3 and 4.
  • An extension to NetworkPolicy in the form of a CustomResourceDefinition which extends policy control to add:
    • Layer 7 policy enforcement on ingress and egress for the following application protocols:
      • HTTP
      • Kafka
    • Egress support for CIDRs to secure access to external services
    • Enforcement of policies for external headless services, automatically restricted to the set of Kubernetes endpoints configured for the service.
  • ClusterIP implementation to provide distributed load-balancing for pod to pod traffic.
  • Fully compatible with existing kube-proxy model

Pod-to-Pod Connectivity

In Kubernetes, containers are deployed within units referred to as Pods, which include one or more containers reachable via a single IP address. With Cilium, each Pod gets an IP address from the node prefix of the Linux node running the Pod. See Address Management for additional details. In the absence of any network security policies, all Pods can reach each other.

Pod IP addresses are typically local to the Kubernetes cluster. If pods need to reach services outside the cluster as a client, the network traffic is automatically masqueraded as it leaves the node. You can find additional information in the section External Network Connectivity.

Service Load-balancing

Kubernetes has developed the Services abstraction which provides the user the ability to load balance network traffic to different pods. This abstraction allows pods to reach other pods via a single virtual IP address, without knowing all of the pods that are running that particular service.

Without Cilium, kube-proxy is installed on every node and watches for endpoint and service addition and removal on the kube-master, which allows it to apply the necessary enforcement with iptables. Thus, traffic received and sent from and to the pods is properly routed to the node and port serving that service. For more information you can check out the Kubernetes user guide for Services.

When implementing ClusterIP, Cilium acts on the same principles as kube-proxy: it watches for service addition or removal, but instead of doing the enforcement with iptables, it updates BPF map entries on each node. For more information, see the Pull Request.

Further Reading

The Kubernetes documentation contains more background on the Kubernetes Networking Model and Kubernetes Network Plugins .

Concepts

Deployment

The configuration of a standard Cilium Kubernetes deployment consists of several Kubernetes resources:

  • A DaemonSet resource: describes the Cilium pod that is deployed to each Kubernetes node. This pod runs the cilium-agent and associated daemons. The configuration of this DaemonSet includes the image tag indicating the exact version of the Cilium docker container (e.g., v1.0.0) and command-line options passed to the cilium-agent.
  • A ConfigMap resource: describes common configuration values that are passed to the cilium-agent, such as the kvstore endpoint and credentials, enabling/disabling debug mode, etc.
  • ServiceAccount, ClusterRole, and ClusterRoleBindings resources: the identity and permissions used by cilium-agent to access the Kubernetes API server when Kubernetes RBAC is enabled.
  • A Secret resource: describes the credentials used to access the etcd kvstore, if required.

Networking For Existing Pods

In case pods were already running before the Cilium DaemonSet was deployed, these pods will still be connected using the previous networking plugin according to the CNI configuration. A typical example for this is the kube-dns service which runs in the kube-system namespace by default.

A simple way to change networking for such existing pods is to rely on the fact that Kubernetes automatically restarts pods in a Deployment if they are deleted: we can simply delete the original kube-dns pod, and the replacement pod started immediately after will have its networking managed by Cilium. In a production deployment, this step could be performed as a rolling update of kube-dns pods to avoid downtime of the DNS service.

$ kubectl --namespace kube-system delete pods -l k8s-app=kube-dns
pod "kube-dns-268032401-t57r2" deleted

Running kubectl get pods will show you that Kubernetes started a new set of kube-dns pods while at the same time terminating the old pods:

$ kubectl --namespace kube-system get pods
NAME                          READY     STATUS        RESTARTS   AGE
cilium-5074s                  1/1       Running       0          58m
kube-addon-manager-minikube   1/1       Running       0          59m
kube-dns-268032401-j0vml      3/3       Running       0          9s
kube-dns-268032401-t57r2      3/3       Terminating   0          57m

Default Ingress Allow from Local Host

Kubernetes has functionality to indicate to users the current health of their applications via Liveness Probes and Readiness Probes. In order for kubelet to run these health checks for each pod, by default, Cilium will always allow all ingress traffic from the local host to each pod.

Requirements

Kubernetes Version

The following Kubernetes versions have been tested in the continuous integration system for this version of Cilium:

  • 1.11
  • 1.12
  • 1.13
  • 1.14
  • 1.15
  • 1.16
  • 1.17
  • 1.18

System Requirements

Cilium requires a Linux kernel >= 4.9. See System Requirements for full details on all system requirements.

Enable CNI in Kubernetes

CNI - Container Network Interface is the plugin layer used by Kubernetes to delegate networking configuration. CNI must be enabled in your Kubernetes cluster in order to install Cilium. This is done by passing --network-plugin=cni to kubelet on all nodes. For more information, see the Kubernetes CNI network-plugins documentation.

Mounted BPF filesystem

Note

Some distributions mount the bpf filesystem automatically. Check if the bpf filesystem is mounted by running the following command:

mount | grep /sys/fs/bpf
# if present should output, e.g. "none on /sys/fs/bpf type bpf"...

This step is required for production environments but optional for testing and development. It allows the cilium-agent to pin BPF resources to a persistent filesystem and make them persistent across restarts of the agent. If the BPF filesystem is not mounted in the host filesystem, Cilium will automatically mount the filesystem but it will be unmounted and re-mounted when the Cilium pod is restarted. This in turn will cause BPF resources to be re-created which will cause network connectivity to be disrupted. Mounting the BPF filesystem in the host mount namespace will ensure that the agent can be restarted without affecting connectivity of any pods.

In order to mount the BPF filesystem, the following command must be run in the host mount namespace. The command must only be run once during the boot process of the machine.

mount bpffs /sys/fs/bpf -t bpf

A portable way to achieve this with persistence is to add the following line to /etc/fstab and then run mount /sys/fs/bpf. This will cause the filesystem to be automatically mounted when the node boots.

bpffs                      /sys/fs/bpf             bpf     defaults 0 0

If you are using systemd to manage the kubelet, see the section Mounting BPFFS with systemd.

kube-dns

The Installation with managed etcd relies on the etcd-operator to manage an etcd cluster. In order for the etcd cluster to be available, the Cilium pod is being run with dnsPolicy: ClusterFirstWithHostNet in order for Cilium to be able to look up Kubernetes service names via DNS. This creates a dependency on kube-dns. If you would like to avoid running kube-dns, choose a different installation method and remove the dnsPolicy field from the DaemonSet.

Configuration

ConfigMap Options

In the ConfigMap there are several options that can be configured according to your preferences:

  • debug - Set to true to run Cilium in full debug mode, which enables verbose logging and configures BPF programs to emit more visibility events into the output of cilium monitor.
  • enable-ipv4 - Enable IPv4 addressing support
  • enable-ipv6 - Enable IPv6 addressing support
  • clean-cilium-bpf-state - Removes all BPF state from the filesystem on startup. Endpoints will be restored with the same IP addresses, but ongoing connections may be briefly disrupted and loadbalancing decisions will be lost, so active connections via the loadbalancer will break. All BPF state will be reconstructed from their original sources (for example, from kubernetes or the kvstore). This may be used to mitigate serious issues regarding BPF maps. This option should be turned off again after restarting the daemon.
  • clean-cilium-state - Removes all Cilium state, including unrecoverable information such as all endpoint state, as well as recoverable state such as BPF state pinned to the filesystem, CNI configuration files, library code, links, routes, and other information. This operation is irreversible. Existing endpoints currently managed by Cilium may continue to operate as before, but Cilium will no longer manage them and they may stop working without warning. After using this operation, endpoints must be deleted and reconnected to allow the new instance of Cilium to manage them.
  • monitor-aggregation - This option enables coalescing of tracing events in cilium monitor to only include periodic updates from active flows, or any packets that involve an L4 connection state change. Valid options are none, low, medium, maximum.
  • preallocate-bpf-maps - Pre-allocation of map entries allows per-packet latency to be reduced, at the expense of up-front memory allocation for the entries in the maps. Set to true to optimize for latency. If this value is modified, then during the next Cilium startup connectivity may be temporarily disrupted for endpoints with active connections.

Any changes that you perform in the Cilium ConfigMap and in the cilium-etcd-secrets Secret will require you to restart any existing Cilium pods in order for them to pick up the latest configuration.

The following ConfigMap is an example where the etcd cluster is running on 2 nodes, node-1 and node-2, with TLS and client-to-server authentication enabled.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # The kvstore configuration is used to enable use of a kvstore for state
  # storage. This can either be provided with an external kvstore or with the
  # help of cilium-etcd-operator which operates an etcd cluster automatically.
  kvstore: etcd
  kvstore-opt: '{"etcd.config": "/var/lib/etcd-config/etcd.config"}'

  # This etcd-config contains the etcd endpoints of your cluster. If you use
  # TLS please make sure you follow the tutorial in https://cilium.link/etcd-config
  etcd-config: |-
    ---
    endpoints:
      - https://node-1:31079
      - https://node-2:31079
    #
    # In case you want to use TLS in etcd, uncomment the 'trusted-ca-file' line
    # and create a kubernetes secret by following the tutorial in
    # https://cilium.link/etcd-config
    trusted-ca-file: '/var/lib/etcd-secrets/etcd-client-ca.crt'
    #
    # In case you want client to server authentication, uncomment the following
    # lines and create a kubernetes secret by following the tutorial in
    # https://cilium.link/etcd-config
    key-file: '/var/lib/etcd-secrets/etcd-client.key'
    cert-file: '/var/lib/etcd-secrets/etcd-client.crt'

  # If you want to run cilium in debug mode change this value to true
  debug: "false"
  enable-ipv4: "true"
  # If you want to clean cilium state; change this value to true
  clean-cilium-state: "false"

CNI

CNI - Container Network Interface is the plugin layer used by Kubernetes to delegate networking configuration. You can find additional information on the CNI project website.

Note

Kubernetes >= 1.3.5 requires the loopback CNI plugin to be installed on all worker nodes. The binary is typically provided by most Kubernetes distributions. See section Manually installing CNI for instructions on how to install CNI in case the loopback binary is not already installed on your worker nodes.

CNI configuration is taken care of automatically when deploying Cilium via the provided DaemonSet. The script cni-install.sh is automatically run via the postStart mechanism when the cilium pod is started.

Note

In order for the cni-install.sh script to work properly, the kubelet task must either be running on the host filesystem of the worker node, or the /etc/cni/net.d and /opt/cni/bin directories must be mounted into the container where kubelet is running. This can be achieved with Volume mounts.

The CNI auto installation is performed as follows:

  1. The /etc/cni/net.d and /opt/cni/bin directories are mounted from the host filesystem into the pod where Cilium is running.
  2. The file /etc/cni/net.d/05-cilium.conf is written in case it does not exist yet.
  3. The binary cilium-cni is installed to /opt/cni/bin. Any existing binary with the name cilium-cni is overwritten.

Manually installing CNI

This step is typically already included in all Kubernetes distributions or Kubernetes installers but can be performed manually:

sudo mkdir -p /opt/cni
wget https://storage.googleapis.com/kubernetes-release/network-plugins/cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz
sudo tar -xvf cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz -C /opt/cni
rm cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz

Adjusting CNI configuration

The CNI configuration file is automatically written and maintained by the scripts cni-install.sh and cni-uninstall.sh which are running as postStart and preStop hooks of the Cilium pod.

If you want to provide your own custom CNI configuration file, set the CILIUM_CUSTOM_CNI_CONF environment variable to avoid overwriting the configuration file by adding the following to the env: section of the cilium DaemonSet:

- name: CILIUM_CUSTOM_CNI_CONF
  value: "true"

The CNI installation can be configured with environment variables. These environment variables can be specified in the DaemonSet file like this:

env:
  - name: "CNI_CONF_NAME"
    value: "05-cilium.conf"

The following variables are supported:

Option | Description | Default
HOST_PREFIX | Path prefix of all host mounts | /host
CNI_DIR | Path to mounted CNI directory | ${HOST_PREFIX}/opt/cni
CNI_CONF_NAME | Name of configuration file | 05-cilium.conf

If you want to further adjust the CNI configuration you may do so by creating the CNI configuration /etc/cni/net.d/05-cilium.conf manually:

sudo mkdir -p /etc/cni/net.d
cat <<EOF | sudo tee /etc/cni/net.d/05-cilium.conf
{
    "name": "cilium",
    "type": "cilium-cni"
}
EOF

Cilium will use any existing /etc/cni/net.d/05-cilium.conf file if it already exists on a worker node and only creates it if it does not exist yet.

CRD Validation

Custom Resource Validation was introduced in Kubernetes version 1.8.0. It is considered an alpha feature in Kubernetes 1.8.0 and beta in Kubernetes 1.9.0.

Since Cilium v1.0.0-rc3, Cilium will create, or update if it already exists, the Cilium Network Policy (CNP) Resource Definition with the embedded validation schema. This allows the validation of a CiliumNetworkPolicy to be performed by the kube-apiserver when the policy is imported, providing direct feedback to the user when importing the resource.

To enable this feature, the flag --feature-gates=CustomResourceValidation=true must be set when starting kube-apiserver. Cilium itself will automatically make use of this feature and no additional flag is required.

Note

In case there is an invalid CNP before updating to Cilium v1.0.0-rc3, which contains the validator, the kube-apiserver validator will prevent Cilium from updating that invalid CNP with Cilium node status. By checking Cilium logs for unable to update CNP, retrying..., it is possible to determine which Cilium Network Policies are considered invalid after updating to Cilium v1.0.0-rc3.

To verify that the CNP resource definition contains the validation schema, run the following command:

kubectl get crd ciliumnetworkpolicies.cilium.io -o json

kubectl get crd ciliumnetworkpolicies.cilium.io -o json | grep -A 12 openAPIV3Schema
    "openAPIV3Schema": {
        "oneOf": [
            {
                "required": [
                    "spec"
                ]
            },
            {
                "required": [
                    "specs"
                ]
            }
        ],

In case the user writes a policy that does not conform to the schema, Kubernetes will return an error, e.g.:

cat <<EOF > ./bad-cnp.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "Policy to test multiple rules in a single file"
metadata:
  name: my-new-cilium-object
spec:
  endpointSelector:
    matchLabels:
      app: details
      track: stable
      version: v1
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: reviews
        track: stable
        version: v1
    toPorts:
    - ports:
      - port: '65536'
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/health"
EOF

kubectl create -f ./bad-cnp.yaml
...
spec.ingress.toPorts.ports.port in body should match '^(6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[0-9]{1,4})$'

In this case, the policy specifies a port outside the 0-65535 range.
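
For comparison, a corrected sketch of the same policy with the port changed to a value inside the allowed range (80 here) satisfies the pattern shown in the error message:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: my-new-cilium-object
spec:
  endpointSelector:
    matchLabels:
      app: details
      track: stable
      version: v1
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: reviews
        track: stable
        version: v1
    toPorts:
    - ports:
      - port: '80'
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/health"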

Mounting BPFFS with systemd

Due to how systemd mounts filesystems, the mount point path must be reflected in the unit filename.

cat <<EOF | sudo tee /etc/systemd/system/sys-fs-bpf.mount
[Unit]
Description=Cilium BPF mounts
Documentation=https://docs.cilium.io/
DefaultDependencies=no
Before=local-fs.target umount.target
After=swap.target

[Mount]
What=bpffs
Where=/sys/fs/bpf
Type=bpf
Options=rw,nosuid,nodev,noexec,relatime,mode=700

[Install]
WantedBy=multi-user.target
EOF

Container Runtimes

CRIO

If you want to use CRI-O, install Cilium using Helm:

Note

First, make sure you have Helm 3 installed.

If you have (or are planning to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Setup Helm repository:

helm repo add cilium https://helm.cilium.io/

Note

The Helm option --set global.containerRuntime.integration=crio might not be required for your setup. For more info see Common CRIO issues.

helm install cilium cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set global.containerRuntime.integration=crio
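
If you prefer to keep chart settings in a values file, the --set flag above maps to the following values.yaml fragment (a sketch; pass it to helm install with -f values.yaml instead of the --set option):

global:
  containerRuntime:
    integration: crio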

Since CRI-O does not automatically detect that a new CNI plugin has been installed, you will need to restart the CRI-O daemon for it to pick up the Cilium CNI configuration.

First, make sure Cilium is running:

kubectl get pods -n kube-system -o wide
NAME               READY     STATUS    RESTARTS   AGE       IP          NODE
cilium-mqtdz       1/1       Running   0          3m       10.0.2.15   minikube

After that you can restart CRI-O:

minikube ssh -- sudo systemctl restart crio

Common CRIO issues

Some CRI-O environments automatically mount the bpf filesystem in the pods, which is something that Cilium avoids doing when --set global.containerRuntime.integration=crio is set. However, some CRI-O environments do not mount the bpf filesystem automatically, which causes Cilium to print the following message:

level=warning msg="BPF system config check: NOT OK." error="CONFIG_BPF kernel parameter is required" subsys=linux-datapath
level=warning msg="================================= WARNING ==========================================" subsys=bpf
level=warning msg="BPF filesystem is not mounted. This will lead to network disruption when Cilium pods" subsys=bpf
level=warning msg="are restarted. Ensure that the BPF filesystem is mounted in the host." subsys=bpf
level=warning msg="https://docs.cilium.io/en/stable/kubernetes/requirements/#mounted-bpf-filesystem" subsys=bpf
level=warning msg="====================================================================================" subsys=bpf
level=info msg="Mounting BPF filesystem at /sys/fs/bpf" subsys=bpf

If you see this warning in the Cilium pod logs with your CRI-O environment, please remove the flag --set global.containerRuntime.integration=crio from your helm setup and redeploy Cilium.

Network Policy

If you are running Cilium on Kubernetes, you can benefit from Kubernetes distributing policies for you. In this mode, Kubernetes is responsible for distributing the policies across all nodes and Cilium will automatically apply the policies. Three formats are available to configure network policies natively with Kubernetes; they are described in the sections below.

Cilium supports running multiple of these policy types at the same time. However, caution should be applied when doing so, as it can be confusing to understand the complete set of allowed traffic across multiple policy types. If close attention is not paid, this may lead to unintended policy allow behavior.

NetworkPolicy

For more information, see the official NetworkPolicy documentation.

Known missing features for Kubernetes Network Policy:

Feature | Tracking Issue
Ingress CIDR-based L4 policy | https://github.com/cilium/cilium/issues/4129

CiliumNetworkPolicy

The CiliumNetworkPolicy is very similar to the standard NetworkPolicy. Its purpose is to provide the functionality which is not yet supported in NetworkPolicy. Ideally, all of this functionality will be merged into the standard resource format and this CRD will no longer be required.

The raw specification of the resource in Go looks like this:

type CiliumNetworkPolicy struct {
        metav1.TypeMeta `json:",inline"`
        // +optional
        Metadata metav1.ObjectMeta `json:"metadata"`

        // Spec is the desired Cilium specific rule specification.
        Spec *api.Rule `json:"spec,omitempty"`

        // Specs is a list of desired Cilium specific rule specification.
        Specs api.Rules `json:"specs,omitempty"`

        // Status is the status of the Cilium policy rule
        // +optional
        Status CiliumNetworkPolicyStatus `json:"status"`
}
Metadata

Describes the policy. This includes:

  • Name of the policy, unique within a namespace
  • Namespace of where the policy has been injected into
  • Set of labels to identify resource in Kubernetes
Spec
Field which contains a Rule Basics
Specs
Field which contains a list of Rule Basics. This field is useful if multiple rules must be removed or added atomically.
Status
Provides visibility into whether the policy has been successfully applied

Examples

See Layer 3 Examples for a detailed list of example policies.
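
As a brief illustration of spec versus specs, the following sketch uses specs to carry two rules in a single CiliumNetworkPolicy; the selectors are hypothetical:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: multiple-rules
  namespace: default
specs:
- endpointSelector:
    matchLabels:
      app: service-a
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend
- endpointSelector:
    matchLabels:
      app: service-b
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend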

CiliumClusterwideNetworkPolicy

CiliumClusterwideNetworkPolicy is the same as CiliumNetworkPolicy, with the only difference being the scope of the policy. Policies defined by CiliumClusterwideNetworkPolicy are non-namespaced and cluster-scoped. Internally, the policy is composed of a CiliumNetworkPolicy, and thus the effects of the policy specification are the same.

The raw specification of the resource in Go looks like this:

type CiliumClusterwideNetworkPolicy struct {
        *CiliumNetworkPolicy

        // Status is the status of the Cilium policy rule
        // +optional
        // The reason this field exists in this structure is due a bug in the k8s code-generator
        // that doesn't create a `UpdateStatus` method because the field does not exist in
        // the structure.
        Status CiliumNetworkPolicyStatus `json:"status"`
}
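
Because only the scope differs, a cluster-wide policy is written like a CiliumNetworkPolicy but without a namespace. A hedged sketch with hypothetical selectors:

apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: frontend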

Endpoint CRD

When managing pods in Kubernetes, Cilium will create a Custom Resource Definition (CRD) of Kind CiliumEndpoint. One CiliumEndpoint is created for each pod managed by Cilium, with the same name and in the same namespace. The CiliumEndpoint objects contain the same information as the JSON output of cilium endpoint get under the .status field, but can be fetched for all pods in the cluster. Adding -o json to the command will export more information about each endpoint. This includes the endpoint’s labels, security identity and the policy in effect on it.

For example:

kubectl get ciliumendpoints --all-namespaces
NAMESPACE     NAME                     AGE
default       app1-55d7944bdd-l7c8j    1h
default       app1-55d7944bdd-sn9xj    1h
default       app2                     1h
default       app3                     1h
kube-system   cilium-health-minikube   1h
kube-system   microscope               1h

Note

Each cilium-agent pod will create a CiliumEndpoint to represent its own inter-agent health-check endpoint. These are not pods in Kubernetes and are placed in the kube-system namespace. They are named cilium-health-<node-name>.


Kubernetes Compatibility

Cilium is compatible with multiple Kubernetes API Groups. Some are deprecated or beta, and may only be available in specific versions of Kubernetes.

All Kubernetes versions listed are compatible with Cilium:

k8s Version | k8s NetworkPolicy API | CiliumNetworkPolicy
1.12, 1.13, 1.14, 1.15, 1.16, 1.17, 1.18 | networking.k8s.io/v1 | cilium.io/v2 has a CustomResourceDefinition

Troubleshooting

Verifying the installation

Check the status of the DaemonSet and verify that all desired instances are in “ready” state:

$ kubectl --namespace kube-system get ds
NAME      DESIRED   CURRENT   READY     NODE-SELECTOR   AGE
cilium    1         1         0         <none>          3s

In this example, we see a desired state of 1 with 0 being ready. This indicates a problem. The next step is to list all cilium pods by matching on the label k8s-app=cilium and also sort the list by the restart count of each pod to easily identify the failing pods:

$ kubectl --namespace kube-system get pods --selector k8s-app=cilium \
          --sort-by='.status.containerStatuses[0].restartCount'
NAME           READY     STATUS             RESTARTS   AGE
cilium-813gf   0/1       CrashLoopBackOff   2          44s

Pod cilium-813gf is failing and has already been restarted 2 times. Let’s print the logfile of that pod to investigate the cause:

$ kubectl --namespace kube-system logs cilium-813gf
INFO      _ _ _
INFO  ___|_| |_|_ _ _____
INFO |  _| | | | | |     |
INFO |___|_|_|_|___|_|_|_|
INFO Cilium 0.8.90 f022e2f Thu, 27 Apr 2017 23:17:56 -0700 go version go1.7.5 linux/amd64
CRIT kernel version: NOT OK: minimal supported kernel version is >= 4.8

In this example, the cause for the failure is a Linux kernel running on the worker node which is not meeting System Requirements.

If the cause for the problem is not apparent based on these simple steps, please come and seek help on our Slack channel.

Apiserver outside of cluster

If you are running Kubernetes Apiserver outside of your cluster for some reason (like keeping master nodes behind a firewall), make sure that you run Cilium on master nodes too. Otherwise Kubernetes pod proxies created by Apiserver will not be able to route to pod IPs and you may encounter errors when trying to proxy traffic to pods.

You may run Cilium as a static pod or set tolerations for the Cilium DaemonSet to ensure that Cilium pods will be scheduled on your master nodes, for example as shown below. The exact way to do this depends on your setup.
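
For example, a toleration along the lines of the following sketch in the Cilium DaemonSet pod template allows scheduling onto master nodes carrying the common kubeadm master taint (the taint key may differ in your environment):

spec:
  template:
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule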

Istio

Cilium can be deployed alongside Istio to provide L3-L7 network filtering in complement to Istio’s microservice mesh features. The following quick guide walks you through the process step by step:

For more information on Istio, check out the Istio website.

Docker

Cilium can be integrated with Docker in two ways:

  • via the CNI interface. This method is used by Kubernetes and Mesos.
  • via Docker’s libnetwork plugin interface, if networking is to be managed by the Docker runtime. This method is used, for example, by Docker Compose.

To run Cilium with Docker’s libnetwork, it needs a single logical Docker network of type cilium with an IPAM-driver of type cilium. The IPAM-driver delegates control over IPv4 and IPv6 address management and network connectivity to Cilium for all containers attached to this network. Each Docker container is allocated an IP address from the node prefix of the node running that container.

When deployed with Docker, each Linux node must also run a cilium-docker agent that receives libnetwork calls from Docker and then communicates with the Cilium Agent to control container networking.

Security policies controlling connectivity between the Docker containers can be written in terms of the Docker container labels passed to Docker when creating the container. These policies can be created and updated via the Cilium agent API or by using the Cilium CLI client.

Follow this guide for a step-by-step introduction to using Cilium with Docker Compose:

Mesos

Cilium can be integrated with Apache Mesos and Marathon using the CNI plugin. The following quick guide walks you through the process step by step:

For more information on Apache Mesos and Marathon orchestration, check out the Mesos and Marathon GitHub pages, respectively.

Envoy

Envoy Go Extensions

Note

This feature is currently in beta phase.

This is a guide for developers who are interested in writing a Go extension to the Envoy proxy as part of Cilium.

_images/proxylib_logical_flow.png

As depicted above, this framework allows a developer to write a small amount of Go code (green box) focused on parsing a new API protocol, and this Go code is able to take full advantage of Cilium features including high-performance redirection to/from Envoy, a rich L7-aware policy language and access logging, and visibility into encrypted traffic via kTLS (coming soon!). In sum, you as the developer need only worry about the logic of parsing the protocol, and Cilium + Envoy + BPF do the heavy lifting.

This guide uses simple examples based on a hypothetical “r2d2” protocol (see proxylib/r2d2/r2d2parser.go) that might be used to talk to a simple protocol droid a long time ago in a galaxy far, far away. But it also points to other real protocols like Memcached and Cassandra that already exist in the cilium/proxylib directory.

Step 1: Decide on a Basic Policy Model

To get started, take some time to think about what it means to provide protocol-aware security in the context of your chosen protocol. Most protocols follow a common pattern of a client who performs an 'operation' on a 'resource'. For example:

  • A standard RESTful HTTP request has GET/POST/PUT/DELETE methods (operation) and URLs (resource).
  • A database protocol like MySQL has SELECT/INSERT/UPDATE/DELETE actions (operation) on a combined database + table name (resource).
  • A queueing protocol like Kafka has produce/consume (operation) on a particular queue (resource).

A common policy model is to allow the user to whitelist certain operations on one or more resources. In some cases, the resources need to support regexes to avoid explicit matching on variable content like ids (e.g., /users/<uuid> would match /users/.*)

In our example r2d2 protocol, we'll use a basic set of operations (READ/WRITE/HALT/RESET). The READ and WRITE commands also support a 'filename' resource, while HALT and RESET have no resource.

Step 2: Understand Protocol, Encoding, Framing and Types

Next, get your head wrapped around how the protocol looks in terms of the raw data, as this is what you'll be parsing.

Try looking for official definitions of the protocol or API. Official docs will not only help you quickly learn how the protocol works, but will also help you by documenting tricky corner cases that wouldn’t be obvious just from regular use of the protocol. For example, here are example specs for Redis Protocol , Cassandra Protocol, and AWS SQS .

These specs help you understand protocol aspects like:

  • encoding / framing : how to recognize the beginning/end of individual requests/replies within a TCP stream. This typically involves reading a header that encodes the overall request length, though some simple protocols use a delimiter like '\r\n' to separate messages.
  • request/reply fields : for most protocols, you will need to parse out fields at various offsets into the request data in order to extract security-relevant values for visibility + filtering. In some cases, access control requires filtering requests from clients to servers, but in some cases, parsing replies will also be required if reply data is required to understand future requests (e.g., prepared-statements in database protocols).
  • message flow : specs often describe various dependencies between different requests. Basic protocols tend to follow a simple serial request/reply model, but more advanced protocols will support pipelining (i.e., sending multiple requests before any replies have been received).
  • protocol errors : when a Cilium proxy denies a request based on policy, it should return a protocol-specific error to the client (e.g., in HTTP, a proxy should return a '403 Access Denied' error). Looking at the protocol spec will typically indicate how you should return an equivalent 'Access Denied' error.

Sometimes, the protocol spec does not give you a full sense of the set of commands that can be sent over the protocol. In that case, looking at higher-level user documentation can fill in some of these knowledge gaps. Here are examples for Redis Commands and Cassandra CQL Commands .

Another great trick is to use Wireshark to capture raw packet data between a client and server. For many protocols, the Wireshark Sample Captures has already saved captures for us. Otherwise, you can easily use tcpdump to capture a file. For example, for MySQL traffic on port 3306, you could run the following in a container running the MySQL client or server: “tcpdump -s 0 port 3306 -w mysql.pcap”. More Info

In our example r2d2 protocol, we'll keep the spec as simple as possible. It is a text-only protocol, with each request being a line terminated by '\r\n'. A request starts with a case-insensitive string command ("READ", "WRITE", "HALT", "RESET"). If the command is "READ" or "WRITE", the command must be followed by a space and a non-empty filename that contains only non-whitespace ASCII characters.

Step 3: Search for Existing Parser Code / Libraries

Look for open source Go libraries or code that can help. Is there existing open source Go code that parses your protocol that you can leverage, either directly as a library or as a motivating example? For example, the tidwall/redcon library parses Redis in Go, and Vitess parses MySQL in Go. Wireshark dissectors also have a wealth of protocol parsers written in C that can serve as useful guidance. Note: finding client-only protocol parsing code is typically less helpful than finding a proxy implementation, or a full parser library. This is because the set of requests a client parses is typically the inverse of the set of requests a Cilium proxy needs to parse, since the proxy mimics the server rather than the client. Still, viewing a Go client can give you a general idea of how to parse the general serialization format of the protocol.

Step 4: Follow the Cilium Developer Guide

It is easiest to start Cilium development by following the Development Guide

After cloning Cilium:

$ cd cilium
$ contrib/vagrant/start.sh
$ cd proxylib

While this dev VM is running, you can open additional terminals to the Cilium dev VM by running 'vagrant ssh' from within the cilium source directory.

Step 5: Create New Proxy Skeleton

From inside the proxylib directory, copy the r2d2 directory and rename the files, replacing 'newproto' with your protocol name:

$ mkdir newproto
$ cd newproto
$ cp ../r2d2/r2d2parser.go newproto.go
$ cp ../r2d2/r2d2parser_test.go newproto_test.go

Within both newproto.go and newproto_test.go, update references to r2d2 with your protocol name. Search for both 'r2d2' and 'R2D2'.

Also, edit proxylib.go and add the following import line:

_ "github.com/cilium/cilium/proxylib/newproto"

Step 6: Update OnData Method

Implementing a parser requires you as the developer to implement three primary functions, shown as blue in the diagram below. We will cover OnData() in this section, and the other functions in section Step 9: Add Policy Loading and Matching.

_images/proxylib_key_functions.png

The beating heart of your parsing is implementing the OnData function. You can think of any proxy as having two data streams, one in the request direction (i.e., client to server) and one in the reply direction (i.e., server to client). OnData is called when there is data to process, and the value of the boolean 'reply' parameter indicates the direction of the stream for a given call to OnData. The data passed to OnData is a slice of byte slices (i.e., an array of byte arrays).

The return values of the OnData function tell the Go framework how the data in the stream should be processed, with four primary outcomes:

  • PASS x : The next x bytes in the data stream passed to OnData represent a request/reply that should be passed on to the server/client. The common case here is that this is a request that should be allowed by policy, or that no policy is applied. Note: x bytes may be less than the total amount of data passed to OnData, in which case the remaining bytes will still be in the data stream when onData is invoked next. x bytes may also be more than the data that has been passed to OnData. For example, in the case of a protocol where the parser filters only on values in a protocol header, it is often possible to make a filtering decision, and then pass (or drop) the size of the full request/reply without having the entire request passed to Go.
  • MORE x : The buffers passed to OnData do not represent all of the data required to frame and filter the request/reply. Instead, the parser needs to see at least x additional bytes beyond the current data to make a decision. In some cases, the full request must be read to understand framing and filtering, but in others a decision can be made simply by reading a protocol header. When parsing data, be defensive, and recognize that it is technically possible that data arrives one byte at a time. Two common scenarios exist here:
    • Text-based Protocols : For text-based protocols that use a delimiter like '\r\n', it is common to simply check if the delimiter exists, and return MORE 1 if it does not, as technically one more character could result in the delimiter being present. See the sample r2d2 parser as a basic example of this.
    • Binary-based protocols : Many binary protocols have a fixed header length, which contains a field that then indicates the remaining length of the request. In the binary case, first check to make sure a full header is received. Typically the header will indicate both the full request length (i.e., framing), as well as the request type, which indicates how much of the full request must be read in order to perform filtering (in many cases, this is less than the full request). A binary parser will typically return MORE if the data passed to OnData is less than the header length. After reading a full header, the simple approach is for the parser to return MORE to wait for the full request to be received and parsed (see the existing CassandraParser as an example). However, as an optimization, the parser can attempt to only request the minimum number of bytes required beyond the header to make a policy decision, and then PASS or DROP the remaining bytes without requiring them to be passed to the Go parser.
  • DROP x : Remove the first x bytes from the data stream passed to OnData, as they represent a request/reply that should not be forwarded to the client or server based on policy. Don’t worry about making OnData return a drop right away, as we’ll return to DROP in a later step below.
  • ERROR y : The connection contains data that does not match the protocol spec, and prevents you from further parsing the data stream. The framework will terminate the connection. An example would be a request length that falls outside the min/max specified by the protocol spec, or values for a field that fall outside the values indicated by the spec (e.g., wrong versions, unknown commands). If you are still able to properly frame the requests, you can also choose to simply drop the request and return a protocol error (e.g., similar to an ‘’HTTP 400 Bad Request’’ error). But in all cases, you should write your parser defensively, such that you never forward a request that you do not understand, as such a request could become an avenue for subverting the intended security visibility and filtering policies. See proxylib/types.h for the set of valid error codes.

See proxylib/proxylib/parserfactory.go for the official OnData interface definition.

Keep it simple, and work iteratively. Start out just getting the framing right. Can you write a parser that just prints out the length and contents of a request, and then PASS each request with no policy enforcement?
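To make the framing idea concrete, here is a small self-contained sketch of that first iteration for a hypothetical “\r\n”-delimited text protocol. The function name, the string verdicts, and the byte-slice signature are simplifications for illustration only and are not the proxylib API; the real OnData signature and return types are defined in proxylib/proxylib/parserfactory.go and proxylib/types.h.

package main

import (
	"bytes"
	"fmt"
)

// frameRequest mimics the MORE/PASS decision an OnData implementation makes
// for a hypothetical "\r\n"-delimited text protocol. It returns a verdict
// string and a byte count instead of the real proxylib operation types.
func frameRequest(data []byte) (verdict string, n int) {
	idx := bytes.Index(data, []byte("\r\n"))
	if idx < 0 {
		// No delimiter yet: technically a single additional byte could
		// complete the request, so ask for at least one more byte.
		return "MORE", 1
	}
	// A full request is present: print it and pass it on, delimiter included.
	fmt.Printf("request of %d bytes: %q\n", idx+2, data[:idx+2])
	return "PASS", idx + 2
}

func main() {
	fmt.Println(frameRequest([]byte("READ file1")))     // MORE 1
	fmt.Println(frameRequest([]byte("READ file1\r\n"))) // PASS 12
}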

One simple trick is to comment out the r2d2 parsing logic in OnData, but leave it in the file as a reference, as your protocol will likely require similar code as we add more functionality below.

Step 7: Use Unit Testing To Drive Development

Use unit tests to drive your development. It’s tempting to want to first test your parser by firing up a client and server and developing on the fly. But in our experience you’ll iterate faster by using the great unit test framework created along with the Go proxy framework. This framework lets you pass in an example set of requests as byte arrays to a CheckOnDataOK method, which are passed to the parser’s OnData method. CheckOnDataOK takes a set of expected return values, and compares them to the actual return values from OnData processing the byte arrays.

Take some time to look at the unit tests for the r2d2 parser, and then for more complex parsers like Cassandra and Memcached. For simple text-based protocols, you can simply write ASCII strings to represent protocol messages, convert them to []byte, and pass them to CheckOnDataOK. For binary protocols, one can either create byte arrays directly, or convert a hex string to a []byte using a helper function like hexData in cassandra/cassandraparser_test.go.

A great way to get the exact data to pass in is to copy the data from the Wireshark captures mentioned above in Step #2. You can see the full application layer data streams in Wireshark by right-clicking on a packet and selecting “Follow As… TCP Stream”. If the protocol is text-based, you can copy the data as ASCII (see r2d2/r2d2parser_test.go as an example of this). For binary data, it can be easier to instead select “raw” in the drop-down, and use a basic utility to convert the ASCII hex strings to raw binary data (see cassandra/cassandraparser_test.go for an example of this).
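For binary protocols, a tiny helper along the following lines (a sketch, not the actual hexData helper from cassandra/cassandraparser_test.go) is enough to turn a hex dump copied from Wireshark’s “raw” view into a []byte for the unit tests:

package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// hexToBytes converts a whitespace-separated hex dump, as copied from a
// Wireshark "raw" view, into a byte slice suitable for CheckOnDataOK.
func hexToBytes(s string) []byte {
	b, err := hex.DecodeString(strings.Join(strings.Fields(s), ""))
	if err != nil {
		panic(err) // fine in a test helper: fail loudly on bad input
	}
	return b
}

func main() {
	fmt.Printf("% x\n", hexToBytes("04 00 00 00 0b"))
}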

To run the unit tests, go to proxylib/newproto and run:

$ go test

This will build the latest version of your parser and unit test files and run the unit tests.

Step 8: Add More Advanced Parsing

Thinking back to step #1, what are the critical fields to parse out of the request in order to understand the “operation” and “resource” of each request? Can you print those out for each request?

Use the unit test framework to pass in increasingly complex requests, and confirm that the parser prints out the right values, and that the unit tests are properly slicing the datastream into requests and parsing out the required fields.

A couple scenarios to make sure your parser handles properly via unit tests:

  • data chunks that are less than a full request (return MORE)
  • requests that are spread across multiple data chunks (return MORE, then PASS)
  • multiple requests that are bundled into a single data chunk (return PASS, then another PASS)
  • rejection of malformed requests (return ERROR).

For certain advanced cases, it is required for a parser to store state across requests. In this case, data can be stored using data structures that are included as part of the main parser struct. See CassandraParser in cassandra/cassandraparser.go as an example of how the parser uses a string to store the current ‘keyspace’ in use, and uses Go maps to keep state required for handling prepared queries.

Step 9: Add Policy Loading and Matching

Once you have the parsing of most protocol messages ironed out, it’s time to start enforcing policy.

First, create a Go object that will represent a single rule in the policy language. For example, this is the rule for the r2d2 protocol, which performs exact match on the command string, and a regex on the filename:

type R2d2Rule struct {
   cmdExact   string
   fileRegexCompiled *regexp.Regexp
}

There are two key methods to update:

  • Matches : This function implements the basic logic of comparing data from a single request against a single policy rule, and returns true if that rule matches (i.e., allows) that request.
  • <NewProto>RuleParser : Reads key value pairs from policy, validates those entries, and stores them as a <NewProto>Rule object.

See r2d2/r2d2parser.go for examples of both functions for the r2d2 protocol.
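For illustration, the essence of the Matches logic for the R2d2Rule shown above boils down to something like the sketch below. The requestData type here is a made-up stand-in for the parser’s own parsed-request struct; the real code in r2d2/r2d2parser.go differs in its details (for example, it receives the request data through an interface):

package main

import (
	"fmt"
	"regexp"
)

// R2d2Rule mirrors the rule struct shown above.
type R2d2Rule struct {
	cmdExact          string
	fileRegexCompiled *regexp.Regexp
}

// requestData is a hypothetical stand-in for the parsed request fields.
type requestData struct {
	cmd  string
	file string
}

// Matches returns true if the rule allows the given request. Empty/nil rule
// fields act as wildcards.
func (rule *R2d2Rule) Matches(req requestData) bool {
	if rule.cmdExact != "" && rule.cmdExact != req.cmd {
		return false
	}
	if rule.fileRegexCompiled != nil && !rule.fileRegexCompiled.MatchString(req.file) {
		return false
	}
	return true
}

func main() {
	rule := &R2d2Rule{cmdExact: "READ", fileRegexCompiled: regexp.MustCompile(".*")}
	fmt.Println(rule.Matches(requestData{cmd: "READ", file: "file1"}))  // true
	fmt.Println(rule.Matches(requestData{cmd: "WRITE", file: "file1"})) // false
}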

You’ll also need to update OnData to call p.connection.Matches(), and if this function returns false, return DROP for a request. Note: despite the similar names between the Matches() function you create in your newprotoparser.go and p.connection.Matches(), do not confuse the two. Your OnData function should always call p.connection.Matches() rather than invoking your own Matches() directly, as p.connection.Matches() calls the parser’s Matches() function only on the subset of L7 rules that apply for the given Cilium source identity for this particular connection.

Once you add the logic to call Matches() and return DROP in OnData, you will need to update unit tests to have policies that allow the traffic you expect to be passed. The following is an example of how r2d2/r2d2parser_test.go adds an allow-all policy for a given test:

s.ins.CheckInsertPolicyText(c, "1", []string{`
    name: "cp1"
    policy: 2
    ingress_per_port_policies: <
      port: 80
      rules: <
        l7_proto: "r2d2"
      >
    >
    `})

The following is an example of a policy that would allow READ commands with a file regex of “.*”:

s.ins.CheckInsertPolicyText(c, "1", []string{`
    name: "cp2"
    policy: 2
    ingress_per_port_policies: <
      port: 80
      rules: <
        l7_proto: "r2d2"
        l7_rules: <
        rule: <
          key: "cmd"
          value: "READ"
        >
        rule: <
          key: "file"
          value: ".*"
        >
          >
        >
      >
    >
    `})

Step 10: Inject Error Response

Simply dropping the request from the request data stream prevents the request from reaching the server, but it would leave the client hanging, waiting for a response that would never come since the server did not see the request.

Instead, the proxy should return an application-layer reply indicating that access was denied, similar to how an HTTP proxy would return a ‘‘403 Access Denied’’ error. Look back at the protocol spec discussed in Step 2 to understand what an access denied message looks like for this protocol, and use the p.connection.Inject() method to send this error reply back to the client. See r2d2/r2d2parser.go for an example.

p.connection.Inject(true, []byte("ERROR\r\n"))

Note: p.connection.Inject() will inject the data it is passed into the reply datastream. In order for the client to parse this data correctly, it must be injected at a proper framing boundary (i.e., in between other reply messages that may be in the reply data stream). If the client is following a basic serial request/reply model per connection, this is essentially guaranteed as at the time of a request that is denied, there are no other replies potentially in the reply datastream. But if the protocol supports pipelining (i.e., multiple requests in flight) replies must be properly framed and PASSed on a per request basis, and the timing of the call to p.connection.Inject() must be controlled such that the client will properly match the Error response with the correct request. See the Memcached parser as an example of how to accomplish this.

Step 11: Add Access Logging

Cilium also has the notion of an ‘’Access Log’‘, which records each request handled by the proxy and indicates whether the request was allowed or denied.

A call to ‘’p.connection.Log()’’ implements access logging. See the OnData function in r2d2/r2d2parser.go as an example:

p.connection.Log(access_log_entry_type,
  &cilium.LogEntry_GenericL7{
      &cilium.L7LogEntry{
          Proto: "r2d2",
          Fields: map[string]string{
              "cmd":  reqData.cmd,
              "file": reqData.file,
          },
      },
})

Step 12: Manual Testing

Find the standard docker container for running the protocol server. Often the same image also includes a CLI that you can use as a client.

Start both a server and client container running in the cilium dev VM, and attach them to the already created “cilium-net”. For example, with Cassandra, we run:

docker run --name cass-server -l id=cass-server -d --net cilium-net cassandra

docker run --name cass-client -l id=cass-client -d --net cilium-net cassandra sh -c 'sleep 3000'

Note that we run both containers with labels that make it easy to refer to these containers in a Cilium network policy, and that the client container runs the sleep command, as we will use ‘docker exec’ to access the client CLI.

Use ‘’cilium endpoint list’’ to identify the IP address of the protocol server.

$ cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
2987       Disabled           Disabled          31423      container:id=cass-server      f00d::a0b:0:0:bab    10.11.51.247    ready
27333      Disabled           Disabled          4          reserved:health               f00d::a0b:0:0:6ac5   10.11.92.46     ready
50923      Disabled           Disabled          18253      container:id=cass-client      f00d::a0b:0:0:c6eb   10.11.175.191   ready

One can then invoke the client CLI using that server IP address (10.11.51.247 in the above example):

docker exec -it cass-client sh -c 'cqlsh 10.11.51.247 -e "select * from system.local"'

Note that in the above example, ingress policy is not enforced for the Cassandra server endpoint, so no data will flow through the Cassandra parser. A simple ‘’allow all’’ L7 Cassandra policy can be used to send all data to the Cassandra server through the Go Cassandra parser. This policy has a single empty rule, which matches all requests. An allow all policy looks like:

[{
  "endpointSelector": {"matchLabels": {"id": "cass-server"}},
  "ingress": [{
    "toPorts": [{
      "ports": [{"port": "9042", "protocol": "TCP"}],
      "rules": {
        "l7proto": "cassandra",
        "l7": [{}]
      }
    }]
  }]
}]

A policy can be imported into cilium using ‘’cilium policy import’‘, after which another call to ‘’cilium endpoint list’’ confirms that ingress policy is now in place on the server. If the above policy was saved to a file cass-allow-all.json, one would run:

$ cilium policy import cass-allow-all.json
Revision: 1
$ cilium endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])   IPv6                 IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
2987       Enabled            Disabled          31423      container:id=cass-server      f00d::a0b:0:0:bab    10.11.51.247    ready
27333      Disabled           Disabled          4          reserved:health               f00d::a0b:0:0:6ac5   10.11.92.46     ready
50923      Disabled           Disabled          18253      container:id=cass-client      f00d::a0b:0:0:c6eb   10.11.175.191   ready

Note that policy is now showing as ‘’Enabled’’ for the Cassandra server on ingress.

To remove this or any other policy, run:

$ cilium policy delete --all

To install a new policy, first delete, and then run ‘’cilium policy import’’ again. For example, the following policy would allow select statements on a specific set of tables to this Cassandra server, but deny all other queries.

[{
  "endpointSelector": {"matchLabels": {"id": "cass-server"}},
  "ingress": [{
    "toPorts": [{
      "ports": [{"port": "9042", "protocol": "TCP"}],
      "rules": {
        "l7proto": "cassandra",
        "l7": [
          {"query_action": "select", "query_table": "^system.*"},
          {"query_action": "select", "query_table": "^posts_db.posts$"}
        ]
      }
    }]
  }]
}]

When performing manual testing, remember that each time you change your Go proxy code, you must re-run ‘’make’’ and ‘’sudo make install’’ and then restart the cilium-agent process. If the only changes you have made since last compiling cilium are in your cilium/proxylib directory, you can safely just run ‘’make’’ and ‘’sudo make install’’ in that directory, which saves time. For example:

$ cd proxylib  // only safe if this is the only directory that has changed
$ make
  <snip>
$ sudo make install
  <snip>

If you rebase or if other files change, you need to run both commands from the top-level directory.

The Cilium agent defaults to running as a service in the development VM. However, the default options do not include the ‘’--debug-verbose=flow’’ flag, which is critical for getting visibility when troubleshooting the Go proxy framework. So it is easiest to stop the cilium service and run cilium-agent directly in a terminal window, adding the ‘’--debug-verbose=flow’’ flag.

$ sudo service cilium stop

$ sudo /usr/bin/cilium-agent --debug --auto-direct-node-routes --ipv4-range 10.11.0.0/16 --kvstore-opt consul.address=192.168.33.11:8500 --kvstore consul -t vxlan --fixed-identity-mapping=128=kv-store --fixed-identity-mapping=129=kube-dns --debug-verbose=flow

Step 13: Add Runtime Tests

Before submitting this change to the Cilium community, it is recommended that you add runtime tests that will run as part of Cilium’s continuous integration testing. Usually these runtime tests can be based on the same container images and test commands you used for manual testing.

The best approach for adding runtime tests is typically to start out by copying an existing L7 protocol runtime test and then updating it to run the container images and CLI commands specific to the new protocol. See cilium/test/runtime/cassandra.go as an example that matches the use of Cassandra described above in the manual testing section. Note that the json policy files used by the runtime tests are stored in cilium/test/runtime/manifests, and the Cassandra example policies in those directories are easy to use as a basis for similar policies you may create for your new protocol.

Step 14: Review Spec for Corner Cases

Many protocols have advanced features or corner cases that will not manifest themselves as part of basic testing. Once you have written a first rev of the parser, it is a good idea to go back and review the protocol’s spec or list of commands to see what, if any, aspects fall outside the scope of your initial parser. For example, corner cases like the handling of empty or nil lists may not show up in your testing, but may cause your parser to fail. Add more unit tests to cover these corner cases. It is OK for the first rev of your parser not to handle all types of requests, or to have a simplified policy structure in terms of which fields can be matched. However, it is important to know what aspects of the protocol you are not parsing, and to ensure that it does not lead to any security concerns. For example, failing to parse prepared statements in a database protocol and instead just passing PREPARE and EXECUTE commands through would leave a gaping security hole that would render your other filtering meaningless in the face of a sophisticated attacker.

Step 15: Write Docs or Getting Started Guide (optional)

At a minimum, the policy examples included as part of the runtime tests serve as basic documentation of the policy and its expected behavior. But we also encourage adding more user-friendly examples and documentation, for example, Getting Started Guides. cilium/Documentation/gettingstarted/cassandra.rst is a good example to follow. Also be sure to update Documentation/gettingstarted/index.rst with a link to this new getting started guide.

With that, you are ready to post this change for feedback from the Cilium community. Congrats!

System Requirements

Before installing Cilium, please ensure that your system meets the minimum requirements below. Most modern Linux distributions already do.

Summary

When running Cilium using the container image cilium/cilium, the host system must meet these requirements:

When running Cilium as a native process on your host (i.e. not running the cilium/cilium container image) these additional requirements must be met:

When running Cilium without Kubernetes these additional requirements must be met:

Requirement Minimum Version In cilium container
Linux kernel >= 4.9.17 no
Key-Value store (etcd) >= 3.1.0 no
Key-Value store (consul) >= 0.6.4 no
clang+LLVM >= 10.0 yes
iproute2 >= 5.0.0 [1] yes
[1] Requires support for BPF templating as documented below.

Linux Distribution Compatibility Matrix

The following table lists Linux distributions that are known to work well with Cilium.

Distribution Minimum Version
Amazon Linux 2 all
Container-Optimized OS all
CentOS >= 7.0 [2]
CoreOS stable (>= 1298.5.0)
Debian >= 9 Stretch
Fedora Atomic/Core >= 25
LinuxKit all
RedHat Enterprise Linux >= 8.0
Ubuntu >= 16.04.2, >= 16.10
openSUSE Tumbleweed, >= Leap 15.0
RancherOS >= 1.5.5
[2] CentOS 7 requires a third-party kernel provided by ElRepo whereas CentOS 8 ships with a supported kernel.

Note

The above list is based on feedback by users. If you find an unlisted Linux distribution that works well, please let us know by opening a GitHub issue or by creating a pull request that updates this guide.

Linux Kernel

Cilium leverages and builds on the kernel BPF functionality as well as various subsystems which integrate with BPF. Therefore, host systems are required to run Linux kernel version 4.9.17 or later to run a Cilium agent. More recent kernels may provide additional BPF functionality that Cilium will automatically detect and use on agent start.

In order for the BPF feature to be enabled properly, the following kernel configuration options must be enabled. This is typically the case with distribution kernels. When an option can be built as a module or statically linked, either choice is valid.

CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_CLS_BPF=y
CONFIG_BPF_JIT=y
CONFIG_NET_CLS_ACT=y
CONFIG_NET_SCH_INGRESS=y
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_USER_API_HASH=y
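To verify these options against the running kernel, the kernel configuration can usually be inspected directly; the location varies by distribution (some kernels expose /proc/config.gz instead of a file under /boot):

$ grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=|CONFIG_NET_CLS_BPF=|CONFIG_BPF_JIT=' /boot/config-$(uname -r)
$ zgrep CONFIG_BPF /proc/config.gz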

Note

Users running Linux 4.10 or earlier with Cilium CIDR policies may face Restrictions on unique prefix lengths for CIDR policy rules.

L7 proxy redirection currently uses TPROXY iptables actions as well as socket matches. For L7 redirection to work as intended kernel configuration must include the following modules:

CONFIG_NETFILTER_XT_TARGET_TPROXY=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_SOCKET=m
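Whether these modules are available in the kernel configuration, and loaded once Cilium is running, can be checked in the same way, for example:

$ grep -E 'CONFIG_NETFILTER_XT_(TARGET_TPROXY|MATCH_MARK|MATCH_SOCKET)' /boot/config-$(uname -r)
$ lsmod | grep -E 'xt_TPROXY|xt_mark|xt_socket'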

When the xt_socket kernel module is missing, the forwarding of redirected L7 traffic does not work in non-tunneled datapath modes. Since some notable kernels (e.g., COS) ship without the xt_socket module, Cilium implements a fallback compatibility mode to allow L7 policies and visibility to be used with those kernels. Currently this fallback disables the ip_early_demux kernel feature in non-tunneled datapath modes, which may decrease system networking performance. This guarantees that HTTP and Kafka redirection works as intended. However, if HTTP or Kafka enforcement policies or visibility annotations are never used, this behavior can be turned off by adding the following to the helm configuration command line:

helm install cilium cilium/cilium --version 1.8.90 \
  ...
  --set global.enableXTSocketFallback=false

Advanced Features and Required Kernel Version

Cilium requires Linux kernel 4.9.17 or higher, however development on additional kernel features and functionality continues to progress in the Linux community. Some Cilium features and functionality are dependent on newer kernel versions. These additional Cilium features and functionality are enabled by upgrading to a later kernel version as detailed below:

Cilium Feature Minimum Kernel Version
IPv4 fragment tracking >= 4.10
Restrictions on unique prefix lengths for CIDR policy rules >= 4.11
Host-Reachable Services >= 4.19.57, >= 5.1.16, >= 5.2
Kubernetes without kube-proxy >= 4.19.57, >= 5.1.16, >= 5.2

Key-Value store

Cilium optionally uses a distributed Key-Value store to manage, synchronize and distribute security identities across all cluster nodes. The following Key-Value stores are currently supported:

  • etcd >= 3.1.0
  • consul >= 0.6.4

Cilium can be used without a Key-Value store when CRD-based state management is used with Kubernetes. This is the default for new Cilium installations. Larger clusters will perform better with Key-Value store backed identity management instead; see Quick Installation for more details.

See Key-Value Store for details on how to configure the cilium-agent to use a Key-Value store.

clang+LLVM

Note

This requirement is only needed if you run cilium-agent natively. If you are using the Cilium container image cilium/cilium, clang+LLVM is included in the container image.

LLVM is the compiler suite that Cilium uses to generate BPF bytecode programs to be loaded into the Linux kernel. The minimum supported version of LLVM available to cilium-agent should be >=5.0. The version of clang installed must be compiled with the BPF backend enabled.

See https://releases.llvm.org/ for information on how to download and install LLVM.

iproute2

Note

iproute2 is only needed if you run cilium-agent directly on the host machine. iproute2 is included in the cilium/cilium container image.

iproute2 is a low level tool used to configure various networking related subsystems of the Linux kernel. Cilium uses iproute2 to configure networking and tc, which is part of iproute2, to load BPF programs into the kernel.

The version of iproute2 must include the BPF templating patches. See the links in the table below for documentation on how to install the correct version of iproute2 for your distribution.

Distribution Link
Binary (OpenSUSE) Open Build Service
Source Cilium iproute2 source

Firewall Rules

If you are running Cilium in an environment that requires firewall rules to enable connectivity, you will have to add the following rules to ensure Cilium works properly.

It is recommended, though optional, that all nodes running Cilium in a given cluster be able to ping each other so that cilium-health can report and monitor connectivity among nodes. This requires ICMP Type 0/8, Code 0 to be open among all nodes. TCP 4240 should also be open among all nodes for cilium-health monitoring. Note that it is also an option to only use one of these two methods to enable health monitoring. If the firewall does not permit either of these methods, Cilium will still operate fine but will not be able to provide health information.

If you are using VXLAN overlay network mode, Cilium uses Linux’s default VXLAN port 8472 over UDP, unless Linux has been configured otherwise. In this case, UDP 8472 must be open among all nodes to enable VXLAN overlay mode. The same applies to Geneve overlay network mode, except the port is UDP 6081.

If you are running in direct routing mode, your network must allow routing of pod IPs.

As an example, if you are running on AWS with VXLAN overlay networking, here is a minimum set of AWS Security Group (SG) rules. It assumes a separation between the SG on the master nodes, master-sg, and the worker nodes, worker-sg. It also assumes etcd is running on the master nodes.

Master Nodes (master-sg) Rules:

Port Range / Protocol Ingress/Egress Source/Destination Description
2379-2380/tcp ingress worker-sg etcd access
8472/udp ingress master-sg (self) VXLAN overlay
8472/udp ingress worker-sg VXLAN overlay
4240/tcp ingress master-sg (self) health checks
4240/tcp ingress worker-sg health checks
ICMP 8/0 ingress master-sg (self) health checks
ICMP 8/0 ingress worker-sg health checks
8472/udp egress master-sg (self) VXLAN overlay
8472/udp egress worker-sg VXLAN overlay
4240/tcp egress master-sg (self) health checks
4240/tcp egress worker-sg health checks
ICMP 8/0 egress master-sg (self) health checks
ICMP 8/0 egress worker-sg health checks

Worker Nodes (worker-sg):

Port Range / Protocol Ingress/Egress Source/Destination Description
8472/udp ingress master-sg VXLAN overlay
8472/udp ingress worker-sg (self) VXLAN overlay
4240/tcp ingress master-sg health checks
4240/tcp ingress worker-sg (self) health checks
ICMP 8/0 ingress master-sg health checks
ICMP 8/0 ingress worker-sg (self) health checks
8472/udp egress master-sg VXLAN overlay
8472/udp egress worker-sg (self) VXLAN overlay
4240/tcp egress master-sg health checks
4240/tcp egress worker-sg (self) health checks
ICMP 8/0 egress master-sg health checks
ICMP 8/0 egress worker-sg (self) health checks
2379-2380/tcp egress master-sg etcd access

Note

If you use a shared SG for the masters and workers, you can condense these rules into ingress/egress to self. If you are using Direct Routing mode, you can condense all rules into ingress/egress ANY port/protocol to/from self.

The following ports should also be available on each node:

Port Range / Protocol Description
4240/tcp cluster health checks (cilium-health)
4244/tcp hubble server
4245/tcp hubble relay
6942/tcp operator Prometheus metrics
9090/tcp cilium-agent Prometheus metrics
9876/tcp cilium-agent health status API

Privileges

The following privileges are required to run Cilium. When running the standard Kubernetes DaemonSet, the privileges are automatically granted to Cilium.

  • Cilium interacts with the Linux kernel to install BPF programs which will then perform networking tasks and implement security rules. In order to install BPF programs system-wide, CAP_SYS_ADMIN privileges are required. These privileges must be granted to cilium-agent.

    The quickest way to meet the requirement is to run cilium-agent as root and/or as a privileged container (see the sketch after this list).

  • Cilium requires access to the host networking namespace. For this purpose, the Cilium pod is scheduled to run in the host networking namespace directly.
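For illustration, the relevant parts of a pod spec granting these privileges to the agent container look roughly like the following minimal sketch (the DaemonSet generated by Helm contains considerably more):

spec:
  hostNetwork: true            # run in the host networking namespace
  containers:
  - name: cilium-agent
    securityContext:
      privileged: true         # grants CAP_SYS_ADMIN, required to load BPF programs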

Scalability report

This report is intended for users planning to run Cilium on clusters with more than 200 nodes in CRD mode (without a kvstore available). In our development cycle we have deployed Cilium on large clusters and these were the options that were suitable for our testing:

Setup

helm template cilium \
    --namespace kube-system \
    --set global.endpointHealthChecking.enabled=false \
    --set global.healthChecking.enabled=false \
    --set global.ipam.mode=kubernetes \
    --set global.k8sServiceHost=<KUBE-APISERVER-LB-IP-ADDRESS> \
    --set global.k8sServicePort=<KUBE-APISERVER-LB-PORT-NUMBER> \
    --set global.prometheus.enabled=true \
    --set global.operatorPrometheus.enabled=true \
  > cilium.yaml
  • --set global.endpointHealthChecking.enabled=false and --set global.healthChecking.enabled=false disable endpoint health checking entirely. However, it is recommended that those features be enabled initially on a smaller cluster (3-10 nodes), where they can be used to detect potential packet loss due to firewall rules or hypervisor settings.
  • --set global.ipam.mode=kubernetes is set to "kubernetes" since our cloud provider has pod CIDR allocation enabled in kube-controller-manager.
  • --set global.k8sServiceHost and --set global.k8sServicePort were set with the IP address of the loadbalancer that was in front of kube-apiserver. This allows Cilium to not depend on kube-proxy to connect to kube-apiserver.
  • --set global.prometheus.enabled=true and --set global.operatorPrometheus.enabled=true were just set because we had a Prometheus server probing for metrics in the entire cluster.

Our testing cluster consisted of 3 controller nodes and 1000 worker nodes. We have followed the recommended settings from the official Kubernetes documentation and have provisioned our machines with the following settings:

  • Cloud provider: Google Cloud
  • Controllers: 3x n1-standard-32 (32vCPU, 120GB memory and 50GB SSD, kernel 5.4.0-1009-gcp)
  • Workers: 1 pool of 1000x custom-2-4096 (2vCPU, 4GB memory and 10GB HDD, kernel 5.4.0-1009-gcp)
  • Metrics: 1x n1-standard-32 (32vCPU, 120GB memory and 10GB HDD + 500GB HDD) this is a dedicated node for prometheus and grafana pods.

Note

All 3 controller nodes were behind a GCE load balancer.

Each controller contained etcd, kube-apiserver, kube-controller-manager and kube-scheduler instances.

The CPU, memory and disk size set for the workers might be different for your use case. You might have pods that require more memory or CPU available so you should design your workers based on your requirements.

During our testing we had to set the etcd option quota-backend-bytes=17179869184 because etcd failed once it reached around 2GiB of allocated space.
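For reference, this corresponds to starting etcd with the flag below (roughly 16 GiB); size the quota for your own cluster rather than copying this value:

$ etcd --quota-backend-bytes=17179869184 ...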

We provisioned our worker nodes without kube-proxy since Cilium is capable of performing all functionalities provided by kube-proxy. We created a load balancer in front of kube-apiserver to allow Cilium to access kube-apiserver without kube-proxy, and configured Cilium with the options --set global.k8sServiceHost=<KUBE-APISERVER-LB-IP-ADDRESS> and --set global.k8sServicePort=<KUBE-APISERVER-LB-PORT-NUMBER>.

Our DaemonSet updateStrategy had the maxUnavailable set to 250 pods instead of 2, but this value highly depends on your requirements when you are performing a rolling update of Cilium.
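The corresponding fragment of the Cilium DaemonSet spec looks roughly like this (the value of 2 mentioned above is the default; 250 is what we used for this test):

  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 250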

Steps

For each step we took, we provide more details below, with our findings and expected behaviors.

1. Install Kubernetes v1.18.3 with EndpointSlice feature enabled

To test the most up-to-date functionalities from Kubernetes and Cilium, we have performed our testing with Kubernetes v1.18.3 and the EndpointSlice feature enabled to improve scalability.

Since Kubernetes requires an etcd cluster, we have deployed v3.4.9.

2. Deploy Prometheus, Grafana and Cilium

We have used Prometheus v2.18.1 and Grafana v7.0.1 to retrieve and analyze etcd, kube-apiserver, cilium and cilium-operator metrics.

3. Provision 2 worker nodes

This helped us to understand if our testing cluster was correctly provisioned and all metrics were being gathered.

4. Deploy 5 namespaces with 25 deployments on each namespace

  • Each deployment had 1 replica (125 pods in total).
  • To measure only the resources consumed by Cilium, all deployments used the same base image k8s.gcr.io/pause:3.2. This image does not have any CPU or memory overhead.
  • We provision a small number of pods in a small cluster to understand the CPU usage of Cilium:
_images/image_4_01.png

The mark shows when the creation of 125 pods started. As expected, we can see a slight increase of the CPU usage on both Cilium agents running and in the Cilium operator. The agents peaked at 6.8% CPU usage on a 2vCPU machine.

_images/image_4_02.png

For the memory usage, we have not seen a significant memory growth in the Cilium agent. On the BPF memory side, we do see it increasing due to the initialization of some BPF maps for the new pods.

5. Provision 998 additional nodes (total 1000 nodes)

_images/image_5_01.png

The first mark represents the action of creating nodes, the second mark when 1000 Cilium pods were in ready state. The CPU usage increase is expected since each Cilium agent receives events from Kubernetes whenever a new node is provisioned in the cluster. Once all nodes were deployed the CPU usage was 0.15% on average on a 2vCPU node.

_images/image_5_02.png

As we have increased the number of nodes in the cluster to 1000, it is expected to see a small growth of the memory usage in all metrics. However, it is relevant to point out that an increase in the number of nodes does not cause any significant increase in Cilium’s memory consumption in either the control plane or the dataplane.

6. Deploy 25 more deployments on each namespace

This now brings us to a total of 5 namespaces * (25 old deployments + 25 new deployments) = 250 deployments in the entire cluster. We did not install 250 deployments from the start since we only had 2 nodes, and that would have created 125 pods on each worker node. According to the Kubernetes documentation, the maximum recommended number of pods per node is 100.

7. Scale each deployment to 200 replicas (50000 pods in total)

Having 5 namespaces with 50 deployments means that we have 250 different unique security identities. Having a low cardinality in the labels selected by Cilium helps scale the cluster. By default, Cilium has a limit of 16k security identities, but it can be increased with bpf-policy-map-max in the Cilium ConfigMap.
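For example, raising that limit is a matter of setting the option in the cilium-config ConfigMap (illustrative value; agents typically need to be restarted to pick up the change):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  bpf-policy-map-max: "65536"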

_images/image_7_01.png

The first mark represents the action of scaling up the deployments, the second mark when 50000 pods were in ready state.

  • It is expected to see the CPU usage of Cilium increase since, on each node, Cilium agents receive events from Kubernetes when a new pod is scheduled and started.
  • The average CPU consumption of all Cilium agents was 3.38% on a 2vCPU machine. At one point, roughly around minute 15:23, one of those Cilium agents peaked at 27.94% CPU usage.
  • Cilium Operator had a stable 5% CPU consumption while the pods were being created.
_images/image_7_02.png

Similar to the behavior seen while increasing the number of worker nodes, adding new pods also increases Cilium memory consumption.

  • As we increased the number of pods from 250 to 50000, we saw a maximum memory usage of 573MiB for one of the Cilium agents while the average was 438 MiB.
  • For the BPF memory usage we saw a max usage of 462.7MiB.
  • This means that each Cilium agent’s memory increased by 10.5KiB per new pod in the cluster.

8. Deploy 250 policies for 1 namespace

Here we have created 125 L4 network policies and 125 L7 policies. Each policy selected all pods on this namespace and was allowed to send traffic to another pod on this namespace. Each of the 250 policies allows access to a disjoint set of ports. In the end we will have 250 different policies selecting 10000 pods.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "l4-rule-#"
  namespace: "namespace-1"
spec:
  endpointSelector:
    matchLabels:
      my-label: testing
  fromEndpoints:
    matchLabels:
      my-label: testing
  egress:
    - toPorts:
      - ports:
        - port: "[0-125]+80" // from 80 to 12580
          protocol: TCP
---
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "l7-rule-#"
  namespace: "namespace-1"
spec:
  endpointSelector:
    matchLabels:
      my-label: testing
  fromEndpoints:
    matchLabels:
      my-label: testing
  ingress:
  - toPorts:
    - ports:
      - port: '[126-250]+80' # from 12680 to 25080
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/path1$"
        - method: PUT
          path: "/path2$"
          headers:
          - 'X-My-Header: true'
_images/image_8_01.png

In this case we saw one of the Cilium agents jumping to 100% CPU usage for 15 seconds while the average peak was 40% during a period of 90 seconds.

_images/image_8_02.png

As expected, increasing the number of policies does not have a significant impact on the memory usage of Cilium since the BPF policy maps have a constant size once a pod is initialized.

_images/image_8_03.png
_images/image_8_04.png

The first mark represents the point in time when we ran kubectl create to create the CiliumNetworkPolicies. Since we created the 250 policies sequentially, we cannot properly compute the convergence time. To do that, we could use a single CNP with multiple policy rules defined under the specs field (instead of the spec field).

Nevertheless, we can see that the last Cilium agent incremented its Policy Revision (which is incremented individually on each Cilium agent every time a CiliumNetworkPolicy (CNP) is received) between 15:45:44 and 15:45:46, and we can see when the last Endpoint was regenerated by checking the 99th percentile of the “Endpoint regeneration time”. From this, it took less than 5s for the policies to be enforced across all endpoints. We can also verify that the maximum time for an individual endpoint to have the policy enforced was less than 600ms.

9. Deploy 250 CiliumClusterwideNetworkPolicies (CCNP)

The difference between these policies and the previous ones installed is that these select all pods in all namespaces. To recap, this means that we will now have 250 different network policies selecting 10000 pods and 250 different network policies selecting 50000 pods on a cluster with 1000 nodes. Similarly to the previous step we will deploy 125 L4 policies and another 125 L7 policies.

_images/image_9_01.png
_images/image_9_02.png

Similar to the creation of the previous 250 CNPs, there was also an increase in CPU usage during the creation of the CCNPs. The CPU usage was similar even though the policies were effectively selecting more pods.

_images/image_9_03.png

As all pods running in a node are selected by all 250 CCNPs created, we see an increase of the Endpoint regeneration time which peaked a little above 3s.

10. “Accidentally” delete 10000 pods

In this step we have “accidentally” deleted 10000 random pods. Kubernetes will then recreate 10000 new pods, which helps us understand the convergence time for all the deployed network policies.

_images/image_10_01.png
_images/image_10_02.png
  • The first mark represents the point in time when pods were “deleted” and the second mark represents the point in time when Kubernetes finished recreating 10k pods.
  • Besides the CPU usage slightly increasing while pods are being scheduled in the cluster, we did see some interesting data points in the BPF memory usage. As each endpoint can have one or more dedicated BPF maps, the BPF memory usage is directly proportional to the number of pods running in a node. If the number of pods per node decreases so does the BPF memory usage.
_images/image_10_03.png

We inferred the time it took for all the endpoints to get regenerated by looking at the number of Cilium endpoints with the policy enforced over time. Luckily enough we had another metric that was showing how many Cilium endpoints had policy being enforced:

_images/image_10_04.png

11. Control plane metrics over the test run

The focus of this test was to study the Cilium agent resource consumption at scale. However, we also monitored some metrics of the control plane nodes such as etcd metrics and CPU usage of the k8s-controllers and we present them in the next figures.

_images/image_11_01.png

Memory consumption of the 3 etcd instances during the entire scalability testing.

_images/image_11_02.png

CPU usage for the 3 controller nodes, average latency per request type in the etcd cluster as well as the number of operations per second made to etcd.

_images/image_11_03.png

All etcd metrics, from left to right, from top to bottom: database size, disk sync duration, client traffic in, client traffic out, peer traffic in, peer traffic out.

Final Remarks

These experiments helped us develop a better understanding of Cilium running in a large cluster entirely in CRD mode and without depending on etcd. There is still some work to be done to optimize the memory footprint of BPF maps even further, as well as reducing the memory footprint of the Cilium agent. We will address those in the next Cilium version.

We can also determine that it is scalable to run Cilium in CRD mode on a cluster with more than 200 nodes. However, it is worth pointing out that we need to run more tests to verify Cilium’s behavior when it loses connectivity with kube-apiserver, as can happen during a control plane upgrade, for example. This will also be our focus in the next Cilium version.

Upgrade Guide

This upgrade guide is intended for Cilium running on Kubernetes. Helm commands in this guide use helm3 syntax. If you have questions, feel free to ping us on the Slack channel.

Warning

Do not upgrade to 1.9.0 before reading the section IMPORTANT: Changes required before upgrading to 1.8.0 and completing the required steps. Skipping this step may lead to a non-functional upgrade.

Running pre-flight check (Required)

When rolling out an upgrade with Kubernetes, Kubernetes will first terminate the pod, then pull the new image version, and finally spin up the new image. In order to reduce the downtime of the agent, the new image version can be pre-pulled. This also verifies that the new image version can be pulled and avoids ErrImagePull errors during the rollout. If you are running Kubernetes without kube-proxy, you also need to pass the Kubernetes API server IP and/or port when generating the cilium-preflight.yaml file.

helm template cilium/cilium --version 1.8.90 \
  --namespace=kube-system \
  --set preflight.enabled=true \
  --set agent.enabled=false \
  --set config.enabled=false \
  --set operator.enabled=false \
  > cilium-preflight.yaml
kubectl create -f cilium-preflight.yaml
helm install cilium-preflight cilium/cilium --version 1.8.90 \
  --namespace=kube-system \
  --set preflight.enabled=true \
  --set agent.enabled=false \
  --set config.enabled=false \
  --set operator.enabled=false
helm template cilium/cilium --version 1.8.90 \
  --namespace=kube-system \
  --set preflight.enabled=true \
  --set agent.enabled=false \
  --set config.enabled=false \
  --set operator.enabled=false \
  --set global.k8sServiceHost=API_SERVER_IP \
  --set global.k8sServicePort=API_SERVER_PORT \
  > cilium-preflight.yaml
kubectl create -f cilium-preflight.yaml
helm install cilium-preflight cilium/cilium --version 1.8.90 \
  --namespace=kube-system \
  --set preflight.enabled=true \
  --set agent.enabled=false \
  --set config.enabled=false \
  --set operator.enabled=false \
  --set global.k8sServiceHost=API_SERVER_IP \
  --set global.k8sServicePort=API_SERVER_PORT

After deploying cilium-preflight.yaml, make sure the number of READY pods is the same as the number of Cilium pods running.

kubectl get daemonset -n kube-system | grep cilium
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
cilium                    2         2         2       2            2           <none>          1h20m
cilium-pre-flight-check   2         2         2       2            2           <none>          7m15s

Once the number of READY pods are the same, make sure the Cilium PreFlight deployment is also marked as READY 1/1. In case it shows READY 0/1 please see CNP Validation.

kubectl get deployment -n kube-system cilium-pre-flight-check -w
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
cilium-pre-flight-check   1/1     1            0           12s

Clean up pre-flight check

Once the number of READY pods for the preflight DaemonSet is the same as the number of Cilium pods running and the preflight Deployment is marked as READY 1/1, you can delete cilium-preflight and proceed with the upgrade.

kubectl delete -f cilium-preflight.yaml
helm delete cilium-preflight --namespace=kube-system

Upgrading Cilium

Warning

Do not upgrade to 1.9.0 before reading the section IMPORTANT: Changes required before upgrading to 1.8.0 and completing the required steps. Skipping this step may lead to a non-functional upgrade.

Step 2: Option B: Preserve ConfigMap

Alternatively, you can use Helm to regenerate all Kubernetes resources except for the ConfigMap. The configuration of Cilium is stored in a ConfigMap called cilium-config. The format is compatible between minor releases so configuration parameters are automatically preserved across upgrades. However, new minor releases may introduce new functionality that require opt-in via the ConfigMap. Refer to the Version Specific Notes for a list of new configuration options for each minor version.

Note

First, make sure you have Helm 3 installed.

If you have (or plan to have) Helm 2 charts (and Tiller) in the same cluster, there should be no issue, as both versions are mutually compatible in order to support gradual migration. The Cilium chart targets Helm 3 (v3.0.3 and above).

Setup Helm repository:

helm repo add cilium https://helm.cilium.io/

Generate the required YAML file and deploy it:

helm template cilium/cilium --version 1.8.90 \
  --namespace kube-system \
  --set config.enabled=false \
  > cilium.yaml
kubectl apply -f cilium.yaml

Deploy Cilium release via Helm:

helm upgrade cilium cilium/cilium --version 1.8.90 \
  --namespace=kube-system \
  --set config.enabled=false

Note

The above variant cannot be used in combination with --set or a provided values.yaml because all options are fed into the DaemonSets and Deployments using the ConfigMap, which is not generated if config.enabled=false is set. The above command only generates the DaemonSet, Deployment and RBAC definitions.

Step 3: Rolling Back

Occasionally, it may be necessary to undo the rollout because a step was missed or something went wrong during upgrade. To undo the rollout run:

kubectl rollout undo daemonset/cilium -n kube-system
helm history cilium --namespace=kube-system
helm rollback cilium [REVISION] --namespace=kube-system

This will revert the latest changes to the Cilium DaemonSet and return Cilium to the state it was in prior to the upgrade.

Note

When rolling back after new features of the new minor version have already been consumed, consult an eventual existing downgrade section in the Version Specific Notes to check and prepare for incompatible feature use before downgrading/rolling back. This step is only required after new functionality introduced in the new minor version has already been explicitly used by importing policy or by opting into new features via the ConfigMap.

Version Specific Notes

This section documents the specific steps required for upgrading from one version of Cilium to another. The Cilium developers suggest particular version transitions to avoid known issues during upgrade; the sections that follow cover specific upgrade transitions, ordered by version.

The table below lists suggested upgrade transitions, from a specified current version running in a cluster to a specified target version. If a specific combination is not listed in the table below, then it may not be safe. In that case, consider staging the upgrade, for example upgrading from 1.1.x to the latest 1.1.y release before subsequently upgrading to 1.2.z.

Current version Target version Full YAML update L3 impact L7 impact
1.0.x 1.1.y Required N/A Clients must reconnect[1]
1.1.x 1.2.y Required Temporary disruption[2] Clients must reconnect[1]
1.2.x 1.3.y Required Minimal to None Clients must reconnect[1]
>=1.2.5 1.5.y Required Minimal to None Clients must reconnect[1]
1.5.x 1.6.y Required Minimal to None Clients must reconnect[1]
1.6.x 1.6.6 Not required Minimal to None Clients must reconnect[1]
1.6.x 1.6.7 Required Minimal to None Clients must reconnect[1]
1.6.x 1.7.y Required Minimal to None Clients must reconnect[1]
1.7.0 1.7.1 Required Minimal to None Clients must reconnect[1]
>=1.7.1 1.7.y Not required Minimal to None Clients must reconnect[1]

Annotations:

  1. Clients must reconnect: Any traffic flowing via a proxy (for example, because an L7 policy is in place) will be disrupted during upgrade. Endpoints communicating via the proxy must reconnect to re-establish connections.
  2. Temporary disruption: All traffic may be temporarily disrupted during upgrade. Connections should successfully re-establish without requiring clients to reconnect.

1.8 Upgrade Notes

IMPORTANT: Changes required before upgrading to 1.8.0

Warning

Do not upgrade to 1.8.0 before reading the following section and completing the required steps.

  • While operating in direct-routing mode (--tunnel=disabled), traffic with a destination address matching a particular CIDR is automatically excluded from being masqueraded. So far, this CIDR consisted of <alloc-cidr>/<size> where the size could be set with the option --ipv4-cluster-cidr-mask-size. This was limiting and not always desirable, therefore Cilium 1.6 had already introduced the option --native-routing-cidr, which allows explicitly specifying the CIDR for native routing. With Cilium 1.8, the option --ipv4-cluster-cidr-mask-size is being deprecated and all users must use the option --native-routing-cidr instead.

    Note

    The ENI IPAM mode automatically derives the native routing CIDR so no action is required.

    The CiliumNetworkPolicy Status includes information which allows deriving when all nodes in a cluster are enforcing a particular CiliumNetworkPolicy. For large clusters running CRD mode, this visibility is costly as it requires all nodes to participate. In order to ensure scalability, CiliumNetworkPolicy status visibility has been disabled for all new deployments. If you want to enable it, set the ConfigMap option disable-cnp-status-updates to false by using Helm --set global.cnpStatusUpdates.enabled=true or by editing the ConfigMap directly.

Upgrading from >=1.7.0 to 1.8.y
  • Since Cilium 1.5, the TCP connection tracking table size parameter bpf-ct-global-tcp-max in the daemon was set to the default value 1000000 to retain backwards compatibility with previous versions. In Cilium 1.8 the default value is 512K in order to reduce the agent memory consumption.

    If Cilium was deployed using Helm, the new default value of 512K was already effective in Cilium 1.6 or later, unless it was manually configured to a different value.

    If the table size was configured to a value different from 512K in the previous installation, ongoing connections will be disrupted during the upgrade. To avoid connection breakage, bpf-ct-global-tcp-max needs to be manually adjusted.

    To check whether any action is required the following command can be used to check the currently configured maximum number of TCP conntrack entries:

    sudo grep -R CT_MAP_SIZE_TCP /var/run/cilium/state/templates/
    

    If the maximum number is 524288, no action is required. If the number is different, bpf-ct-global-tcp-max needs to be adjusted in the ConfigMap to the value shown by the command above (100000 in the example below):

helm template cilium \
--namespace=kube-system \
...
--set global.bpf.ctTcpMax=100000
...
> cilium.yaml
kubectl apply -f cilium.yaml
helm upgrade cilium --namespace=kube-system \
--set global.bpf.ctTcpMax=100000
  • The default value for the NAT table size parameter bpf-nat-global-max in the daemon is derived from the default value of the conntrack table size parameter bpf-ct-global-tcp-max. Since the latter was changed (see above), the default NAT table size decreased from ~820K to 512K.

    The NAT table is only used if either BPF NodePort (enable-node-port parameter) or masquerading (masquerade parameter) are enabled. No action is required if neither of the parameters is enabled.

    If either of the parameters is enabled, ongoing connections will be disrupted during the upgrade. In order to avoid connection breakage, bpf-nat-global-max needs to be manually adjusted.

    To check whether any adjustment is required the following command can be used to check the currently configured maximum number of NAT table entries:

    sudo grep -R SNAT_MAPPING_IPV[46]_SIZE /var/run/cilium/state/globals/
    

    If the command does not return any value or if the returned maximum number is 524288, no action is required. If the number is different, bpf-nat-global-max needs to be adjusted in the ConfigMap to the value shown by the command above (841429 in the example below):

helm template cilium \
--namespace=kube-system \
...
--set global.bpf.natMax=841429
...
> cilium.yaml
kubectl apply -f cilium.yaml
helm upgrade cilium --namespace=kube-system \
--set global.bpf.natMax=841429
New ConfigMap Options
  • mapDynamicSizeRatio has been added to allow sizing of the TCP CT, non-TCP CT, NAT and policy BPF maps based on the total system memory. This option allows specifying a ratio (0.0-1.0) of total system memory to use for these maps. On new installations, this ratio is set to 0.03 by default, meaning 3% of the total system memory is allocated for these maps. On a node with 4 GiB of total system memory this ratio corresponds approximately to the default BPF map sizes. A value of 0.0 will disable sizing of the BPF maps based on system memory. Any BPF map sizes configured manually using the ctTcpMax, ctAnyMax, natMax, or policyMapMax options will take precedence over the dynamically determined value.

    On upgrades of existing installations, this option is disabled by default, i.e. it is set to 0.0. Users wanting to use this feature need to enable it explicitly in their ConfigMap, see section Rebasing a ConfigMap.
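For example, assuming the ConfigMap key follows the cilium-agent flag name bpf-map-dynamic-size-ratio, enabling the feature on an upgraded installation with the default ratio used for new installs would be a fragment like this in the cilium-config ConfigMap:

data:
  bpf-map-dynamic-size-ratio: "0.03"   # 0.0 keeps the feature disabled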

Deprecated options
  • keep-bpf-templates: This option no longer has any effect due to the BPF assets not being compiled into the cilium-agent binary anymore. The option is deprecated and will be removed in Cilium 1.9.
  • access-log: L7 access logs have been available via Hubble since Cilium 1.6. The access-log option to log to a file has been removed.
  • --disable-k8s-services option from cilium-agent has been deprecated and will be removed in Cilium 1.9.
  • --tofqdns-enable-poller: This option has been deprecated and will be removed in Cilium 1.9
  • --tofqdns-enable-poller-events: This option has been deprecated and will be removed in Cilium 1.9
New Metrics

The following metrics have been added:

  • bpf_maps_virtual_memory_max_bytes: Max memory used by BPF maps installed in the system
  • bpf_progs_virtual_memory_max_bytes: Max memory used by BPF programs installed in the system

Both bpf_maps_virtual_memory_max_bytes and bpf_progs_virtual_memory_max_bytes currently report the system-wide memory usage of BPF, including BPF that is not directly managed by Cilium. This might change in the future to report only the BPF memory usage directly managed by Cilium.

Renamed Metrics

The following metrics have been renamed:

  • cilium_operator_eni_ips to cilium_operator_ipam_ips
  • cilium_operator_eni_allocation_ops to cilium_operator_ipam_allocation_ops
  • cilium_operator_eni_interface_creation_ops to cilium_operator_ipam_interface_creation_ops
  • cilium_operator_eni_available to cilium_operator_ipam_available
  • cilium_operator_eni_nodes_at_capacity to cilium_operator_ipam_nodes_at_capacity
  • cilium_operator_eni_resync_total to cilium_operator_ipam_resync_total
  • cilium_operator_eni_aws_api_duration_seconds to cilium_operator_ipam_api_duration_seconds
  • cilium_operator_eni_ec2_rate_limit_duration_seconds to cilium_operator_ipam_api_rate_limit_duration_seconds
Deprecated cilium-operator options
  • metrics-address: This option is deprecated and replaced by the new operator-prometheus-serve-addr option. The old option will be removed in Cilium 1.9.
  • ccnp-node-status-gc: This option is deprecated. Disabling CCNP node status GC can be done with cnp-node-status-gc-interval=0. (Note that this is not a typo, it is meant to be cnp-node-status-gc-interval.) The old option will be removed in Cilium 1.9.
  • cnp-node-status-gc: This option is deprecated. Disabling CNP node status GC can be done with cnp-node-status-gc-interval=0. The old option will be removed in Cilium 1.9.
  • cilium-endpoint-gc: This option is deprecated. Disabling Cilium endpoint GC can be done with cilium-endpoint-gc-interval=0. The old option will be removed in Cilium 1.9.
  • api-server-port: This option is deprecated. The API server address and port can be configured with operator-api-serve-addr=127.0.0.1:9234, or operator-api-serve-addr=[::1]:9234 for IPv6-only clusters. The old option will be removed in Cilium 1.9.
  • eni-parallel-workers: This option in the Operator has been renamed to parallel-alloc-workers. The obsolete option name eni-parallel-workers has been deprecated and will be removed in v1.9.
  • aws-client-burst: This option in the Operator has been renamed to limit-ipam-api-burst. The obsolete option name aws-client-burst has been deprecated and will be removed in v1.9.
  • aws-client-qps: This option in the Operator has been renamed to limit-ipam-api-qps. The obsolete option name aws-client-qps has been deprecated and will be removed in v1.9.
Removed options
  • enable-legacy-services: This option was deprecated in Cilium 1.6 and is now removed.
  • The options container-runtime and container-runtime-endpoint were deprecated in Cilium 1.7 and are now removed.
  • The conntrack-garbage-collector-interval option deprecated in Cilium 1.6 is now removed. Please use conntrack-gc-interval instead.
Removed helm options
  • operator.synchronizeK8sNodes: was removed and replaced with global.synchronizeK8sNodes
Removed resource fields
  • The fields CiliumEndpoint.Status.Status, CiliumEndpoint.Status.Spec, and EndpointIdentity.LabelsSHA256, deprecated in 1.4, have been removed.

1.7 Upgrade Notes

IMPORTANT: Changes required before upgrading to 1.7.x

Warning

Do not upgrade to 1.7.x before reading the following section and completing the required steps.

In particular, if you are using network policy and upgrading from 1.6.x or earlier to 1.7.x or later, you MUST read the 1.7 New ConfigMap Options section about the enable-remote-node-identity flag to avoid potential disruption to connectivity between host networking pods and Cilium-managed pods.

  • Cilium has bumped the minimum supported Kubernetes version to v1.11.0.

  • The kubernetes.io/cluster-service label has been removed from the Cilium DaemonSet selector. Existing users must either keep this label in the DaemonSet specification to upgrade safely, or re-create the Cilium DaemonSet without the deprecated label. It is advisable to keep the label when upgrading from v1.6.x to v1.7.x in case a downgrade becomes necessary. The label should only be removed after a successful upgrade.

    The helm option agent.keepDeprecatedLabels=true will keep the kubernetes.io/cluster-service label in the new DaemonSet:

helm template cilium \
  --namespace=kube-system \
  ...
  --set agent.keepDeprecatedLabels=true \
  ...
  > cilium.yaml
kubectl apply -f cilium.yaml
helm upgrade cilium --namespace=kube-system \
  --set agent.keepDeprecatedLabels=true

Trying to upgrade Cilium without this option might result in the following error: The DaemonSet "cilium" is invalid: spec.selector: Invalid value: ...: field is immutable

  • If the kvstore is set up with etcd and TLS is enabled, the ca-file field name is deprecated and will be removed in Cilium v1.8.0. The new field name, trusted-ca-file, has been available since Cilium v1.1.0.

    Required action:

    This field name should be changed from ca-file to trusted-ca-file.

    Example of an old etcd configuration, with the ca-file field name:

    ---
    endpoints:
    - https://192.168.0.1:2379
    - https://192.168.0.2:2379
    ca-file: '/var/lib/cilium/etcd-ca.pem'
    # In case you want client to server authentication
    key-file: '/var/lib/cilium/etcd-client.key'
    cert-file: '/var/lib/cilium/etcd-client.crt'
    

    Example of new etcd configuration, with the trusted-ca-file field name:

    ---
    endpoints:
    - https://192.168.0.1:2379
    - https://192.168.0.2:2379
    trusted-ca-file: '/var/lib/cilium/etcd-ca.pem'
    # In case you want client to server authentication
    key-file: '/var/lib/cilium/etcd-client.key'
    cert-file: '/var/lib/cilium/etcd-client.crt'
    
  • Due to the removal of the external libraries used to connect to container runtimes, Cilium no longer supports the option flannel-manage-existing-containers. Cilium still supports integration with Flannel for newly provisioned containers, but not for containers already running under Flannel. The options container-runtime and container-runtime-endpoint no longer have any effect, and their removal is scheduled for v1.8.0.

  • The default --tofqdns-min-ttl value has been reduced to 1 hour. Specific IPs in DNS entries are no longer expired when in-use by existing connections that are allowed by policy. Prior deployments that used the default value may now experience denied new connections if endpoints reuse DNS data more than 1 hour after the initial lookup without making new lookups. Long lived connections that previously outlived DNS entries are now better supported, and will not be disconnected when the corresponding DNS entry expires.

New ConfigMap Options
  • enable-remote-node-identity has been added to enable a new identity for remote cluster nodes and to associate all IPs of a node with that new identity. This allows for network policies that distinguish between connections from host networking pods or other processes on the local Kubernetes worker node from those on remote worker nodes.

    After enabling this option, all communication to and from non-local Kubernetes nodes must be whitelisted with a toEntity or fromEntity rule listing the entity remote-node. The existing entity cluster continues to work and now includes the entity remote-node. Existing policy rules whitelisting host will only affect the local node going forward. Existing CIDR-based rules to whitelist node IPs other than the Cilium internal IP (the IP assigned to the cilium_host interface) will no longer take effect.

    This is important because Kubernetes Network Policy dictates that network connectivity from the local host must always be allowed, even for pods that have a default deny rule for ingress connectivity. This is so that network liveness and readiness probes from kubelet will not be dropped by network policy. Prior to 1.7.x, Cilium achieved this by always allowing ingress host network connectivity from any host in the cluster. With 1.7 and enable-remote-node-identity=true, Cilium will only automatically allow connectivity from the local node, thereby providing a better default security posture.

    The option is enabled by default for new deployments when generated via Helm, in order to gain the benefits of improved security. The Helm option is --set global.remoteNodeIdentity. This option can be disabled in order to maintain full compatibility with Cilium 1.6.x policy enforcement. Be aware that upgrading a cluster to 1.7.x by using helm to generate a new Cilium config that leaves enable-remote-node-identity set as the default value of true can break network connectivity.

    The reason for this is that with Cilium 1.6.x, the source identity of ANY connection from a host-networking pod or from other processes on a Kubernetes worker node would be the host identity. Thus, a Cilium 1.6.x or earlier environment with network policy enforced may be implicitly relying on the allow everything from host identity behavior to whitelist traffic from host networking to other Cilium-managed pods. With the shift to 1.7.x, if enable-remote-node-identity=true these connections will be denied by policy if they are coming from a host-networking pod or process on another Kubernetes worker node, since the source will be given the remote-node identity (which is not automatically allowed) rather than the host identity (which is automatically allowed).

    An indicator that this is happening would be drops visible in Hubble or cilium monitor with a source identity equal to 6 (the numeric value for the new remote-node identity). For example:

    xx drop (Policy denied) flow 0x6d7b6dd0 to endpoint 1657, identity 6->51566: 172.16.9.243:47278 -> 172.16.8.21:9093 tcp SYN
    

    There are two ways to address this. One can set enable-remote-node-identity=false in the ConfigMap to retain the Cilium 1.6.x behavior. However, this is not ideal, as it means there is no way to prevent communication between host-networking pods and Cilium-managed pods, since all such connectivity is allowed automatically because it is from the host identity.

    The other option is to keep enable-remote-node-identity=true and create policy rules that explicitly whitelist connections between the remote-node identity and pods that should be reachable from host-networking pods or other processes that may be running on a remote Kubernetes worker node. An example of such a rule is:

    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "allow-from-remote-nodes"
    spec:
      endpointSelector:
        matchLabels:
          app: myapp
      ingress:
      - fromEntities:
        - remote-node
    

    See Access to/from all nodes in the cluster for more examples of remote-node policies.

  • enable-well-known-identities has been added to control the initialization of the well-known identities. Well-known identities were initially added to support the managed etcd concept, allowing the etcd operator to bootstrap etcd while Cilium still waited on etcd to become available. Cilium now uses CRDs by default, which limits the use of well-known identities to the managed etcd mode. With the addition of this option, well-known identities are disabled by default in all new deployments and only enabled if the Helm option etcd.managed=true is set. Consider disabling this option if you are not using the etcd operator (managed etcd mode) in order to reduce the number of policy identities whitelisted for each endpoint.

  • kube-proxy-replacement has been added to control which features should be enabled for the kube-proxy BPF replacement. The option is set to probe by default for new deployments when generated via Helm. This makes cilium-agent probe the kernel for support of each feature and enable only the supported ones. When the option is not set via Helm, cilium-agent defaults to partial, which enables only those features that the user has explicitly enabled in their ConfigMap. See Kubernetes without kube-proxy for more option values.

    For users who previously were running with nodePort.enabled=true it is recommended to set the option to strict before upgrading (see the ConfigMap sketch after this list).

  • enable-auto-protect-node-port-range has been added to enable auto-appending of a NodePort port range to net.ipv4.ip_local_reserved_ports if it overlaps with an ephemeral port range from net.ipv4.ip_local_port_range. The option is enabled by default. See Kubernetes without kube-proxy for the explanation why the overlap can be harmful.
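
To make the ConfigMap form of these options concrete, the following is a minimal, hypothetical cilium-config excerpt. The key names are the ones used throughout this section; the values are examples only and depend on the trade-offs discussed above.

# Hypothetical cilium-config excerpt; adjust values to your cluster
enable-remote-node-identity: "true"           # set to "false" only to retain the 1.6.x behavior
kube-proxy-replacement: "strict"              # recommended if you previously ran with nodePort.enabled=true
enable-auto-protect-node-port-range: "true"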

Removed options
  • lb: The --lb feature has been removed. If you need load-balancing on a particular device, consider using Kubernetes without kube-proxy.
  • docker and e: These flags have been removed as Cilium no longer requires container runtime integrations to manage containers’ networks.
  • All code associated with monitor v1.0 socket handling has been removed.

1.6 Upgrade Notes

IMPORTANT: Changes required before upgrading to 1.6.0

Warning

Do not upgrade to 1.6.0 before reading the following section and completing the required steps.

  • The kvstore and kvstore-opt options have been moved from the DaemonSet into the ConfigMap. For many users, the DaemonSet definition was not considered to be under user control, as the upgrade guide requests to apply the latest definition. Doing so for 1.6.0 without adding these options to the ConfigMap, which is under user control, would result in those settings reverting to their default values.

    Required action:

    Add the following two lines to the cilium-config ConfigMap:

    kvstore: etcd
    kvstore-opt: '{"etcd.config": "/var/lib/etcd-config/etcd.config"}'
    

    This will preserve the existing behavior of the DaemonSet. Adding the options to the ConfigMap will not impact the ability to roll back. Cilium 1.5.y and earlier are compatible with the options, although their values will be ignored, as both options are defined in the DaemonSet definitions for these versions, which take precedence over the ConfigMap.

  • Downgrade warning: Be aware that if you want to change the identity-allocation-mode from kvstore to crd in order to no longer depend on the kvstore for identity allocation, then a rollback/downgrade requires you to revert that option and it will result in brief disruptions of all connections as identities are re-created in the kvstore.

Upgrading from >=1.5.0 to 1.6.y
  1. Follow the standard procedures to perform the upgrade as described in Upgrading Cilium. Users running older versions should first upgrade to the latest v1.5.x point release to minimize disruption of service connections during upgrade.
Changes that may require action
  • The CNI configuration file auto-generated by Cilium (/etc/cni/net.d/05-cilium.conf) is now always automatically overwritten unless the environment variable CILIUM_CUSTOM_CNI_CONF is set, in which case any already existing configuration file is left untouched.

  • The new default value for the option monitor-aggregation is now medium instead of none. This will cause the BPF datapath to perform more aggressive aggregation on packet forwarding related events to reduce CPU consumption while running cilium monitor. The automatic change only applies to the default ConfigMap; existing deployments will need to change the setting in the ConfigMap explicitly.

  • Any new Cilium deployment on Kubernetes using the default ConfigMap will no longer fetch the container runtime specific labels when an endpoint is created and will solely rely on the pod, namespace and ServiceAccount labels. Previously, Cilium also scraped labels from the container runtime which were also pod labels and prefixed those with container:. We have seen less and less use of container runtime specific labels by users, so it is no longer justified for every deployment to pay the cost of interacting with the container runtime by default. Any new deployment wishing to apply policy based on container runtime labels must change the ConfigMap option container-runtime to auto or specify the container runtime to use.

    Existing deployments will continue to interact with the container runtime to fetch labels which are known to the runtime but not known to Kubernetes as pod labels. If you are not using container runtime labels, consider disabling this behavior to reduce resource consumption on each node by setting the option container-runtime to none in the ConfigMap (see the ConfigMap sketch after this list).
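
As a sketch of the ConfigMap changes described in the two items above (key names and values taken from this section; adjust to your deployment):

monitor-aggregation: "medium"   # new default for fresh installs; set explicitly on upgraded clusters
container-runtime: "none"       # or "auto" if you rely on container runtime labels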

New ConfigMap Options
  • cni-chaining-mode has been added to automatically generate CNI chaining configurations with various other plugins. See the section CNI Chaining for a list of supported CNI chaining plugins.
  • identity-allocation-mode has been added to allow selecting the identity allocation method. The default for new deployments is crd as per default ConfigMap. Existing deployments will continue to use kvstore unless opted into new behavior via the ConfigMap.
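
    For an existing deployment that wants to opt into the new behavior, this amounts to a single ConfigMap key (a sketch; note the downgrade warning earlier in this section before switching a live cluster):

    identity-allocation-mode: crd
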
Deprecated options
  • enable-legacy-services: This option was introduced to ease the transition between Cilium 1.4.x and 1.5.x releases, allowing smooth upgrade and downgrade. As of 1.6.0, it is deprecated. Subsequently downgrading from 1.6.x or later to 1.4.x may result in disruption of connections that connect via services.
  • lb: The --lb feature has been deprecated. It has not been in use and has not been well tested. If you need load-balancing on a particular device, ping the development team on Slack to discuss options to get the feature fully supported.
Deprecated metrics
  • policy_l7_parse_errors_total: Use policy_l7_total instead.
  • policy_l7_forwarded_total: Use policy_l7_total instead.
  • policy_l7_denied_total: Use policy_l7_total instead.
  • policy_l7_received_total: Use policy_l7_total instead.

1.5 Upgrade Notes

Upgrading from >=1.4.0 to 1.5.y
  1. In v1.4, the TCP conntrack table size ct-global-max-entries-tcp ConfigMap parameter was ineffective due to a bug, so the default value (1000000) was used instead. To prevent breaking established TCP connections, bpf-ct-global-tcp-max must be set to 1000000 in the ConfigMap before upgrading. Refer to the section Rebasing a ConfigMap on how to upgrade the ConfigMap.

  2. If you previously upgraded to v1.5, downgraded to <v1.5, and now want to upgrade to v1.5 again, then you must run the following DaemonSet before doing the upgrade:

    $ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.5/examples/kubernetes/1.10/cilium-pre-flight-with-rm-svc-v2.yaml
    
    $ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.5/examples/kubernetes/1.11/cilium-pre-flight-with-rm-svc-v2.yaml
    
    $ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.5/examples/kubernetes/1.12/cilium-pre-flight-with-rm-svc-v2.yaml
    
    $ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.5/examples/kubernetes/1.13/cilium-pre-flight-with-rm-svc-v2.yaml
    
    $ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.5/examples/kubernetes/1.14/cilium-pre-flight-with-rm-svc-v2.yaml
    
    $ kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.5/examples/kubernetes/1.15/cilium-pre-flight-with-rm-svc-v2.yaml
    

    See Running pre-flight check (Required) for instructions how to run, validate and remove a pre-flight DaemonSet.

  3. Follow the standard procedures to perform the upgrade as described in Upgrading Cilium.

New Default Values
  • The connection-tracking garbage collector interval is now dynamic. It will automatically adjust based on the percentage of the connection tracking table that has been cleared in the last run. The interval will vary between 10 seconds and 30 minutes, or 12 hours for LRU based maps. This should automatically optimize CPU consumption as much as possible while keeping the connection tracking table utilization below 25%. If needed, the interval can be set to a static value with the option --conntrack-gc-interval. If connectivity fails and cilium monitor --type drop shows xx drop (CT: Map insertion failed), then it is likely that the connection tracking table is filling up and the automatic adjustment of the garbage collector interval is insufficient. Set --conntrack-gc-interval to an interval lower than the default. Alternatively, the values for bpf-ct-global-any-max and bpf-ct-global-tcp-max can be increased. These options are a trade-off: a lower --conntrack-gc-interval costs CPU, while larger bpf-ct-global-any-max and bpf-ct-global-tcp-max values consume more memory.
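
    A minimal sketch of pinning these values in the cilium-config ConfigMap, assuming the conntrack-gc-interval key mirrors the agent flag name and using purely illustrative table sizes:

    # Hypothetical cilium-config excerpt; the GC interval key is assumed, sizes are illustrative
    conntrack-gc-interval: "2m"
    bpf-ct-global-tcp-max: "1000000"
    bpf-ct-global-any-max: "262144"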

Advanced

Upgrade Impact

Upgrades are designed to have minimal impact on your running deployment. Network connectivity, policy enforcement and load balancing will remain functional in general. The following is a list of operations that will not be available during the upgrade:

  • API aware policy rules are enforced in user space proxies and are currently running as part of the Cilium pod unless Cilium is configured to run in Istio mode. Upgrading Cilium will cause the proxy to restart, which will result in a connectivity outage and connections being reset.
  • Existing policy will remain effective but implementation of new policy rules will be postponed to after the upgrade has been completed on a particular node.
  • Monitoring components such as cilium monitor will experience a brief outage while the Cilium pod is restarting. Events are queued up and read after the upgrade. If the number of events exceeds the event buffer size, events will be lost.

Rebasing a ConfigMap

This section describes the procedure to rebase an existing ConfigMap to the template of another version.

Export the current ConfigMap
$ kubectl get configmap -n kube-system cilium-config -o yaml --export > cilium-cm-old.yaml
$ cat ./cilium-cm-old.yaml
apiVersion: v1
data:
  clean-cilium-state: "false"
  debug: "true"
  disable-ipv4: "false"
  etcd-config: |-
    ---
    endpoints:
    - https://192.168.33.11:2379
    #
    # In case you want to use TLS in etcd, uncomment the 'trusted-ca-file' line
    # and create a kubernetes secret by following the tutorial in
    # https://cilium.link/etcd-config
    trusted-ca-file: '/var/lib/etcd-secrets/etcd-client-ca.crt'
    #
    # In case you want client to server authentication, uncomment the following
    # lines and add the certificate and key in cilium-etcd-secrets below
    key-file: '/var/lib/etcd-secrets/etcd-client.key'
    cert-file: '/var/lib/etcd-secrets/etcd-client.crt'
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: cilium-config
  selfLink: /api/v1/namespaces/kube-system/configmaps/cilium-config

In the ConfigMap above, we can verify that Cilium has debug set to true, has an etcd endpoint running with TLS, and that etcd is set up for client to server authentication.

Generate the latest ConfigMap
helm template cilium \
  --namespace=kube-system \
  --set agent.enabled=false \
  --set config.enabled=true \
  --set operator.enabled=false \
  > cilium-configmap.yaml
Add new options

Add the new options manually to your old ConfigMap, and make the necessary changes.

In this example, the debug option is meant to be kept as true, the etcd-config is kept unchanged, and monitor-aggregation is a new option, but after reading the Version Specific Notes its value was kept at the default.

After making the necessary changes, the old ConfigMap was migrated with the new options while keeping the configuration that we wanted:

$ cat ./cilium-cm-old.yaml
apiVersion: v1
data:
  debug: "true"
  disable-ipv4: "false"
  # If you want to clean cilium state; change this value to true
  clean-cilium-state: "false"
  monitor-aggregation: "medium"
  etcd-config: |-
    ---
    endpoints:
    - https://192.168.33.11:2379
    #
    # In case you want to use TLS in etcd, uncomment the 'trusted-ca-file' line
    # and create a kubernetes secret by following the tutorial in
    # https://cilium.link/etcd-config
    trusted-ca-file: '/var/lib/etcd-secrets/etcd-client-ca.crt'
    #
    # In case you want client to server authentication, uncomment the following
    # lines and add the certificate and key in cilium-etcd-secrets below
    key-file: '/var/lib/etcd-secrets/etcd-client.key'
    cert-file: '/var/lib/etcd-secrets/etcd-client.crt'
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: cilium-config
  selfLink: /api/v1/namespaces/kube-system/configmaps/cilium-config
Apply new ConfigMap

After adding the options, manually save the file with your changes and install the ConfigMap in the kube-system namespace of your cluster.

$ kubectl apply -n kube-system -f ./cilium-cm-old.yaml

With the ConfigMap successfully upgraded, we can start upgrading the Cilium DaemonSet and RBAC, which will pick up the latest configuration from the ConfigMap.

Restrictions on unique prefix lengths for CIDR policy rules

The Linux kernel applies limitations on the complexity of BPF code that is loaded into the kernel so that the code may be verified as safe to execute on packets. Over time, Linux releases become more intelligent about the verification of programs which allows more complex programs to be loaded. However, the complexity limitations affect some features in Cilium depending on the kernel version that is used with Cilium.

One such limitation affects Cilium’s configuration of CIDR policies. On Linux kernels 4.10 and earlier, this manifests as a restriction on the number of unique prefix lengths supported in CIDR policy rules.

Unique prefix lengths are counted by looking at the prefix portion of CIDR rules and considering which prefix lengths are unique. For example, in the following policy example, the toCIDR section specifies a /32, and the toCIDRSet section specifies a /8 with a /12 removed from it. In addition, three prefix lengths are always counted: the host prefix length for the protocol (IPv4: /32, IPv6: /128), the default prefix length (/0), and the cluster prefix length (default IPv4: /8, IPv6: /64). All in all, the following example counts as seven unique prefix lengths in IPv4:

  • /32 (from toCIDR, also from host prefix)
  • /12 (from toCIDRSet)
  • /11 (from toCIDRSet)
  • /10 (from toCIDRSet)
  • /9 (from toCIDRSet)
  • /8 (from cluster prefix)
  • /0 (from default prefix)
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "cidr-rule"
spec:
  endpointSelector:
    matchLabels:
      app: myService
  egress:
  - toCIDR:
    - 20.1.1.1/32
  - toCIDRSet:
    - cidr: 10.0.0.0/8
      except:
      - 10.96.0.0/12
[{
    "labels": [{"key": "name", "value": "cidr-rule"}],
    "endpointSelector": {"matchLabels":{"app":"myService"}},
    "egress": [{
        "toCIDR": [
            "20.1.1.1/32"
        ]
    }, {
        "toCIDRSet": [{
            "cidr": "10.0.0.0/8",
            "except": [
                "10.96.0.0/12"
            ]}
        ]
    }]
}]
Affected versions
  • Any version of Cilium running on Linux 4.10 or earlier

When a CIDR policy with too many unique prefix lengths is imported, Cilium will reject the policy with a message like the following:

$ cilium policy import too_many_cidrs.json
Error: Cannot import policy: [PUT /policy][500] putPolicyFailure  Adding
specified prefixes would result in too many prefix lengths (current: 3,
result: 32, max: 18)

The supported count of unique prefix lengths may differ between Cilium minor releases, for example Cilium 1.1 supported 20 unique prefix lengths on Linux 4.10 or older, while Cilium 1.2 only supported 18 (for IPv4) or 4 (for IPv6).

Mitigation

Users may construct CIDR policies that use fewer unique prefix lengths. This can be achieved by composing or decomposing adjacent prefixes.
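
For example (addresses invented for illustration), the /32 host prefix length is always counted, so a /31 rule can be decomposed into its two /32 addresses, removing /31 from the set of unique prefix lengths without changing what is allowed:

# Before: adds /31 to the set of unique prefix lengths
- toCIDR:
  - 192.0.2.0/31
# After: /32 is already always counted, so no additional prefix length is introduced
- toCIDR:
  - 192.0.2.0/32
  - 192.0.2.1/32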

Solution

Upgrade the host Linux version to 4.11 or later. This step is beyond the scope of the Cilium guide.

Migrating from kvstore-backed identities to kubernetes CRD-backed identities

Beginning with cilium 1.6, kubernetes CRD-backed security identities can be used for smaller clusters. Along with other changes in 1.6 this allows kvstore-free operation if desired. It is possible to migrate identities from an existing kvstore deployment to CRD-backed identities. This minimizes disruptions to traffic as the update rolls out through the cluster.

Affected versions
  • Cilium 1.6 deployments using kvstore-backend identities
Mitigation

When identities change, existing connections can be disrupted while cilium initializes and synchronizes with the shared identity store. The disruption occurs when some instances use new numeric identities for existing pods while other instances still use the old ones. When converting to CRD-backed identities, it is possible to pre-allocate CRD identities so that the numeric identities match those in the kvstore. This allows new and old cilium instances in the rollout to agree.

The steps below show an example of such a migration. It is safe to re-run the command if desired. It will identify already allocated identities or ones that cannot be migrated. Note that in the output below, identity 34815 is migrated, 17003 is already migrated, and 11730 has a conflict, so a new ID is allocated for those labels.

The steps below assume a stable cluster with no new identities created during the rollout. Once a cilium instance using CRD-backed identities is running, it may begin allocating identities in a way that conflicts with older ones in the kvstore.

The cilium preflight manifest requires etcd support and can be built with:

helm template cilium \
  --namespace=kube-system \
  --set preflight.enabled=true \
  --set agent.enabled=false \
  --set config.enabled=false \
  --set operator.enabled=false \
  --set global.etcd.enabled=true \
  --set global.etcd.ssl=true \
  > cilium-preflight.yaml
kubectl create -f cilium-preflight.yaml
Example migration
$ kubectl exec -n kube-system cilium-preflight-1234 -- cilium preflight migrate-identity
INFO[0000] Setting up kvstore client
INFO[0000] Connecting to etcd server...                  config=/var/lib/cilium/etcd-config.yml endpoints="[https://192.168.33.11:2379]" subsys=kvstore
INFO[0000] Setting up kubernetes client
INFO[0000] Establishing connection to apiserver          host="https://192.168.33.11:6443" subsys=k8s
INFO[0000] Connected to apiserver                        subsys=k8s
INFO[0000] Got lease ID 29c66c67db8870c8                 subsys=kvstore
INFO[0000] Got lock lease ID 29c66c67db8870ca            subsys=kvstore
INFO[0000] Successfully verified version of etcd endpoint  config=/var/lib/cilium/etcd-config.yml endpoints="[https://192.168.33.11:2379]" etcdEndpoint="https://192.168.33.11:2379" subsys=kvstore version=3.3.13
INFO[0000] CRD (CustomResourceDefinition) is installed and up-to-date  name=CiliumNetworkPolicy/v2 subsys=k8s
INFO[0000] Updating CRD (CustomResourceDefinition)...    name=v2.CiliumEndpoint subsys=k8s
INFO[0001] CRD (CustomResourceDefinition) is installed and up-to-date  name=v2.CiliumEndpoint subsys=k8s
INFO[0001] Updating CRD (CustomResourceDefinition)...    name=v2.CiliumNode subsys=k8s
INFO[0002] CRD (CustomResourceDefinition) is installed and up-to-date  name=v2.CiliumNode subsys=k8s
INFO[0002] Updating CRD (CustomResourceDefinition)...    name=v2.CiliumIdentity subsys=k8s
INFO[0003] CRD (CustomResourceDefinition) is installed and up-to-date  name=v2.CiliumIdentity subsys=k8s
INFO[0003] Listing identities in kvstore
INFO[0003] Migrating identities to CRD
INFO[0003] Skipped non-kubernetes labels when labelling ciliumidentity. All labels will still be used in identity determination  labels="map[]" subsys=crd-allocator
INFO[0003] Skipped non-kubernetes labels when labelling ciliumidentity. All labels will still be used in identity determination  labels="map[]" subsys=crd-allocator
INFO[0003] Skipped non-kubernetes labels when labelling ciliumidentity. All labels will still be used in identity determination  labels="map[]" subsys=crd-allocator
INFO[0003] Migrated identity                             identity=34815 identityLabels="k8s:class=tiefighter;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:org=empire;"
WARN[0003] ID is allocated to a different key in CRD. A new ID will be allocated for the this key  identityLabels="k8s:class=deathstar;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:org=empire;" oldIdentity=11730
INFO[0003] Reusing existing global key                   key="k8s:class=deathstar;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:org=empire;" subsys=allocator
INFO[0003] New ID allocated for key in CRD               identity=17281 identityLabels="k8s:class=deathstar;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:org=empire;" oldIdentity=11730
INFO[0003] ID was already allocated to this key. It is already migrated  identity=17003 identityLabels="k8s:class=xwing;k8s:io.cilium.k8s.policy.cluster=default;k8s:io.cilium.k8s.policy.serviceaccount=default;k8s:io.kubernetes.pod.namespace=default;k8s:org=alliance;"

Note

It is also possible to use the --k8s-kubeconfig-path and --kvstore-opt cilium CLI options with the preflight command. The default is to derive the configuration as cilium-agent does.
cilium preflight migrate-identity --k8s-kubeconfig-path /var/lib/cilium/cilium.kubeconfig --kvstore etcd --kvstore-opt etcd.config=/var/lib/cilium/etcd-config.yml
Clearing CRD identities

If a migration has gone wrong, it is possible to start with a clean slate. Ensure that no cilium instances are running with identity-allocation-mode crd and execute:

$ kubectl delete ciliumid --all

CNP Validation

Running the CNP Validator will make sure the policies deployed in the cluster are valid. It is important to run this validation before an upgrade to make sure Cilium behaves correctly after the upgrade. Skipping this validation might prevent Cilium from updating its NodeStatus in invalid Network Policies and, in the worst case, give users a false sense of security if a badly formatted policy is not being enforced due to a schema validation failure. This CNP Validator is automatically executed as part of the pre-flight check, see Running pre-flight check (Required).

Start by deploying the cilium-pre-flight-check and verify that the Deployment shows READY 1/1; if it does not, check the pod logs.

$ kubectl get deployment -n kube-system cilium-pre-flight-check -w
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
cilium-pre-flight-check   0/1     1            0           12s

$ kubectl logs -n kube-system deployment/cilium-pre-flight-check -c cnp-validator --previous
level=info msg="Setting up kubernetes client"
level=info msg="Establishing connection to apiserver" host="https://172.20.0.1:443" subsys=k8s
level=info msg="Connected to apiserver" subsys=k8s
level=info msg="Validating CiliumNetworkPolicy 'default/cidr-rule': OK!
level=error msg="Validating CiliumNetworkPolicy 'default/cnp-update': unexpected validation error: spec.labels: Invalid value: \"string\": spec.labels in body must be of type object: \"string\""
level=error msg="Found invalid CiliumNetworkPolicy"

In this example, we can see that the CiliumNetworkPolicy in the default namespace with the name cnp-update is not valid for the Cilium version we are trying to upgrade to. In order to fix this policy we need to edit it; we can do this by saving the policy locally and modifying it. In this example, .spec.labels is set to an array of strings, which is not correct according to the official schema.

$ kubectl get cnp -n default cnp-update -o yaml > cnp-bad.yaml
$ cat cnp-bad.yaml
  apiVersion: cilium.io/v2
  kind: CiliumNetworkPolicy
  [...]
  spec:
    endpointSelector:
      matchLabels:
        id: app1
    ingress:
    - fromEndpoints:
      - matchLabels:
          id: app2
      toPorts:
      - ports:
        - port: "80"
          protocol: TCP
    labels:
    - custom=true
  [...]

To fix this policy we need to set .spec.labels with the right format and apply these changes to Kubernetes.

$ cat cnp-bad.yaml
  apiVersion: cilium.io/v2
  kind: CiliumNetworkPolicy
  [...]
  spec:
    endpointSelector:
      matchLabels:
        id: app1
    ingress:
    - fromEndpoints:
      - matchLabels:
          id: app2
      toPorts:
      - ports:
        - port: "80"
          protocol: TCP
    labels:
    - key: "custom"
      value: "true"
  [...]
$
$ kubectl apply -f ./cnp-bad.yaml

After applying the fixed policy, we can delete the pod that was validating the policies so that Kubernetes immediately creates a new pod to verify whether the fixed policies are now valid.

$ kubectl delete pod -n kube-system -l k8s-app=cilium-pre-flight-check-deployment
pod "cilium-pre-flight-check-86dfb69668-ngbql" deleted
$ kubectl get deployment -n kube-system cilium-pre-flight-check
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
cilium-pre-flight-check   1/1     1            1           55m
$ kubectl logs -n kube-system deployment/cilium-pre-flight-check -c cnp-validator
level=info msg="Setting up kubernetes client"
level=info msg="Establishing connection to apiserver" host="https://172.20.0.1:443" subsys=k8s
level=info msg="Connected to apiserver" subsys=k8s
level=info msg="Validating CiliumNetworkPolicy 'default/cidr-rule': OK!
level=info msg="Validating CiliumNetworkPolicy 'default/cnp-update': OK!
level=info msg="All CCNPs and CNPs valid!"

Once they are valid you can continue with the upgrade process and clean up the pre-flight check afterwards (see Clean up pre-flight check).

Network Policy

This chapter documents the policy language used to configure network policies in Cilium. Security policies can be specified and imported via the following mechanisms:

  • Using Kubernetes NetworkPolicy, CiliumNetworkPolicy and CiliumClusterwideNetworkPolicy resources. See the section Network Policy for more details. In this mode, Kubernetes will automatically distribute the policies to all agents.
  • Directly imported into the agent via its CLI or API (see API Reference). This method does not automatically distribute policies to all agents. It is the responsibility of the user to import the policy into all required agents.

Policy Enforcement Modes

The configuration of the Cilium agent and the Cilium Network Policy determines whether an endpoint accepts traffic from a source or not. The agent can be put into the following three policy enforcement modes:

default

This is the default behavior for policy enforcement when Cilium is launched without any specified value for the policy enforcement configuration. The following rules apply:

  • If any rule selects an Endpoint and the rule has an ingress section, the endpoint goes into default deny at ingress.
  • If any rule selects an Endpoint and the rule has an egress section, the endpoint goes into default deny at egress.

This means that endpoints start without any restrictions; as soon as a rule restricts their ability to receive traffic on ingress or to transmit traffic on egress, the endpoint goes into whitelisting mode and all traffic must be explicitly allowed.

always
With always mode, policy enforcement is enabled on all endpoints even if no rules select specific endpoints.
never
With never mode, policy enforcement is disabled on all endpoints, even if rules do select specific endpoints. In other words, all traffic is allowed from any source (on ingress) or destination (on egress).

To configure the policy enforcement mode at runtime for all endpoints managed by a Cilium agent, use:

$ cilium config PolicyEnforcement={default,always,never}

If you want to configure the policy enforcement mode at start-time for a particular agent, provide the following flag when launching the Cilium daemon:

$ cilium-agent --enable-policy={default,always,never} [...]

Similarly, you can set the policy enforcement mode across a Kubernetes cluster by including the corresponding environment variable in the Cilium DaemonSet:

- name: CILIUM_ENABLE_POLICY
  value: always

Rule Basics

All policy rules are based upon a whitelist model, that is, each rule in the policy allows traffic that matches the rule. If two rules exist, and one would match a broader set of traffic, then all traffic matching the broader rule will be allowed. If there is an intersection between two or more rules, then traffic matching the union of those rules will be allowed. Finally, if traffic does not match any of the rules, it will be dropped pursuant to the Policy Enforcement Modes.

Policy rules share a common base type which specifies which endpoints the rule applies to and common metadata to identify the rule. Each rule is split into an ingress section and an egress section. The ingress section contains the rules which must be applied to traffic entering the endpoint, and the egress section contains rules applied to traffic coming from the endpoint matching the endpoint selector. Either ingress, egress, or both can be provided. If both ingress and egress are omitted, the rule has no effect.

type Rule struct {
        // EndpointSelector selects all endpoints which should be subject to
        // this rule. Cannot be empty.
        EndpointSelector EndpointSelector `json:"endpointSelector"`

        // Ingress is a list of IngressRule which are enforced at ingress.
        // If omitted or empty, this rule does not apply at ingress.
        //
        // +optional
        Ingress []IngressRule `json:"ingress,omitempty"`

        // Egress is a list of EgressRule which are enforced at egress.
        // If omitted or empty, this rule does not apply at egress.
        //
        // +optional
        Egress []EgressRule `json:"egress,omitempty"`

        // Labels is a list of optional strings which can be used to
        // re-identify the rule or to store metadata. It is possible to lookup
        // or delete strings based on labels. Labels are not required to be
        // unique, multiple rules can have overlapping or identical labels.
        //
        // +optional
        Labels labels.LabelArray `json:"labels,omitempty"`

        // Description is a free form string, it can be used by the creator of
        // the rule to store human readable explanation of the purpose of this
        // rule. Rules cannot be identified by comment.
        //
        // +optional
        Description string `json:"description,omitempty"`
}

endpointSelector
Selects the endpoints which the policy rules apply to. The policy rules will be applied to all endpoints which match the labels specified in the Endpoint Selector. See the Endpoint Selector section for additional details.
ingress
List of rules which must apply at ingress of the endpoint, i.e. to all network packets which are entering the endpoint.
egress
List of rules which must apply at egress of the endpoint, i.e. to all network packets which are leaving the endpoint.
labels
Labels are used to identify the rule. Rules can be listed and deleted by labels. Policy rules which are imported via kubernetes automatically get the label io.cilium.k8s.policy.name=NAME assigned where NAME corresponds to the name specified in the NetworkPolicy or CiliumNetworkPolicy resource.
description
Description is a string which is not interpreted by Cilium. It can be used to describe the intent and scope of the rule in a human readable form.
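
To illustrate how these fields fit together, here is a minimal sketch of a rule carrying both labels and a description. The names and labels are invented for the example, and the placement of description under spec mirrors the Rule structure shown above rather than a verbatim upstream example.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "annotated-rule"
spec:
  description: "Allow ingress from role=frontend to role=backend"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend
  labels:
  - key: "owner"
    value: "team-a"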

Endpoint Selector

The Endpoint Selector is based on the Kubernetes LabelSelector. It is called Endpoint Selector because it only applies to labels associated with an Endpoint.
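
As an illustrative sketch (labels invented for the example), an Endpoint Selector can use both the matchLabels and matchExpressions forms of the Kubernetes LabelSelector:

endpointSelector:
  matchLabels:
    env: prod
  matchExpressions:
  - key: role
    operator: In
    values:
    - frontend
    - backend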

Layer 3 Examples

The layer 3 policy establishes the base connectivity rules regarding which endpoints can talk to each other. Layer 3 policies can be specified using the following methods:

  • Labels Based: This is used to describe the relationship if both endpoints are managed by Cilium and are thus assigned labels. The big advantage of this method is that IP addresses are not encoded into the policies and the policy is completely decoupled from the addressing.
  • Services based: This is an intermediate form between Labels and CIDR and makes use of the services concept of the orchestration system. A good example of this is the Kubernetes concept of Service endpoints which are automatically maintained to contain all backend IP addresses of a service. This avoids hardcoding IP addresses into the policy even if the destination endpoint is not controlled by Cilium.
  • Entities Based: Entities are used to describe remote peers which can be categorized without knowing their IP addresses. This includes connectivity to the local host serving the endpoints or all connectivity to outside of the cluster.
  • IP/CIDR based: This is used to describe the relationship to or from external services if the remote peer is not an endpoint. This requires hardcoding either IP addresses or subnets into the policies. This construct should be used as a last resort as it requires stable IP or subnet assignments.
  • DNS based: Selects remote, non-cluster peers using DNS names converted to IPs via DNS lookups. It shares all limitations of the IP/CIDR based rules above. DNS information is acquired by routing DNS traffic via a proxy, or polling for listed DNS targets. DNS TTLs are respected.

Labels Based

Label-based L3 policy is used to establish policy between endpoints inside the cluster managed by Cilium. Label-based L3 policies are defined by using an Endpoint Selector inside a rule to choose what kind of traffic can be received (on ingress) or sent (on egress). An empty Endpoint Selector allows all traffic. The examples below demonstrate this in further detail.

Note

Kubernetes: See section Namespaces for details on how the Endpoint Selector applies in a Kubernetes environment with regard to namespaces.

Ingress

An endpoint is allowed to receive traffic from another endpoint if at least one ingress rule exists which selects the destination endpoint with the Endpoint Selector in the endpointSelector field. To restrict traffic upon ingress to the selected endpoint, the rule selects the source endpoint with the Endpoint Selector in the fromEndpoints field.

Simple Ingress Allow

The following example illustrates how to use a simple ingress rule to allow communication from endpoints with the label role=frontend to endpoints with the label role=backend.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "l3-rule"
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend
[{
    "labels": [{"key": "name", "value": "l3-rule"}],
    "endpointSelector": {"matchLabels": {"role":"backend"}},
    "ingress": [{
        "fromEndpoints": [
          {"matchLabels":{"role":"frontend"}}
        ]
    }]
}]
Ingress Allow All

An empty Endpoint Selector will select all endpoints, thus writing a rule that will allow all ingress traffic to an endpoint may be done as follows:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-all-to-victim"
spec:
  endpointSelector:
    matchLabels:
      role: victim
  ingress:
  - fromEndpoints:
    - {}
[{
    "labels": [{"key": "name", "value": "allow-all-to-victim"}],
    "endpointSelector": {"matchLabels": {"role":"victim"}},
    "ingress": [{
        "fromEndpoints": [
          {"matchLabels":{}}
        ]
    }]
}]

Note that while the above examples allow all ingress traffic to an endpoint, this does not mean that all endpoints are allowed to send traffic to this endpoint per their policies. In other words, policy must be configured on both sides (sender and receiver).

Egress

An endpoint is allowed to send traffic to another endpoint if at least one egress rule exists which selects the source endpoint with the Endpoint Selector in the endpointSelector field. To restrict traffic upon egress from the selected endpoint, the rule selects the destination endpoint with the Endpoint Selector in the toEndpoints field.

Simple Egress Allow

The following example illustrates how to use a simple egress rule to allow communication to endpoints with the label role=backend from endpoints with the label role=frontend.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "l3-egress-rule"
spec:
  endpointSelector:
    matchLabels:
      role: frontend
  egress:
  - toEndpoints:
    - matchLabels:
        role: backend
[{
    "labels": [{"key": "name", "value": "l3-egress-rule"}],
    "endpointSelector": {"matchLabels": {"role":"frontend"}},
    "egress": [{
        "toEndpoints": [
          {"matchLabels":{"role":"backend"}}
        ]
    }]
}]
Egress Allow All

An empty Endpoint Selector will select all endpoints, thus writing a rule that will allow all egress traffic from an endpoint may be done as follows:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-all-from-frontend"
spec:
  endpointSelector:
    matchLabels:
      role: frontend
  egress:
  - toEndpoints:
    - {}
[{
    "labels": [{"key": "name", "value": "allow-all-from-frontend"}],
    "endpointSelector": {"matchLabels": {"role":"frontend"}},
    "egress": [{
        "toEndpoints": [
          {"matchLabels":{}}
        ]
    }]
}]

Note that while the above examples allow all egress traffic from an endpoint, the receivers of the egress traffic may have ingress rules that deny the traffic. In other words, policy must be configured on both sides (sender and receiver).

Ingress/Egress Default Deny

An endpoint can be put into the default deny mode at ingress or egress if a rule selects the endpoint and contains the respective rule section ingress or egress.

Note

Any rule selecting the endpoint will have this effect. This example illustrates how to put an endpoint into default deny mode without whitelisting other peers at the same time.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "deny-all-egress"
spec:
  endpointSelector:
    matchLabels:
      role: restricted
  egress:
  - {}
[{
    "labels": [{"key": "name", "value": "deny-all-egress"}],
    "endpointSelector": {"matchLabels": {"role":"restricted"}},
    "egress": [{}]
}]
Additional Label Requirements

It is often required to apply the principle of separation of concerns when defining policies. For this reason, an additional construct exists which allows establishing base requirements for any connectivity to happen.

For this purpose, the fromRequires field can be used to establish label requirements which serve as a foundation for any fromEndpoints relationship. fromRequires is a list of additional constraints which must be met in order for the selected endpoints to be reachable. These additional constraints do not grant access privileges by themselves, so to allow traffic there must also be rules which match fromEndpoints (a combined sketch follows the example below). The same applies for egress policies, with toRequires and toEndpoints.

The purpose of this rule is to allow establishing base requirements such as: any endpoint in env=prod can only be accessed if the source endpoint also carries the label env=prod.

This example shows how to require every endpoint with the label env=prod to be only accessible if the source endpoint also has the label env=prod.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "For endpoints with env=prod, only allow if source also has label env=prod"
metadata:
  name: "requires-rule"
specs:
  - endpointSelector:
      matchLabels:
        env: prod
    ingress:
    - fromRequires:
      - matchLabels:
          env: prod
[{
    "labels": [{"key": "name", "value": "requires-rule"}],
    "endpointSelector": {"matchLabels": {"env":"prod"}},
    "ingress": [{
        "fromRequires": [
          {"matchLabels":{"env":"prod"}}
        ]
    }]
}]
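
Since fromRequires alone grants nothing, a rule matching fromEndpoints is still needed. A minimal sketch combining both in a single policy (labels invented for the example):

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "requires-and-allow"
spec:
  endpointSelector:
    matchLabels:
      env: prod
  ingress:
  - fromRequires:
    - matchLabels:
        env: prod
  - fromEndpoints:
    - matchLabels:
        role: frontend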

Services based

Services running in your cluster can be whitelisted in Egress rules. Currently Kubernetes Services without a Selector are supported when defined by their name and namespace or label selector. Future versions of Cilium will support specifying non-Kubernetes services and Kubernetes services which are backed by pods.

This example shows how to allow all endpoints with the label id=app2 to talk to all endpoints of kubernetes service myservice in kubernetes namespace default.

Note

These rules will only take effect on Kubernetes services without a selector.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "service-rule"
spec:
  endpointSelector:
    matchLabels:
      id: app2
  egress:
  - toServices:
    - k8sService:
        serviceName: myservice
        namespace: default
[{
      "labels": [{"key": "name", "value": "service-rule"}],
      "endpointSelector": {
        "matchLabels": {
          "id": "app2"
        }
      },
      "egress": [
        {
          "toServices": [
            {
              "k8sService": {
                "serviceName": "myservice",
                "namespace": "default"
              }
            }
          ]
        }
      ]
}]

This example shows how to allow all endpoints with the label id=app2 to talk to all endpoints of all kubernetes headless services which have head:none set as the label.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "service-labels-rule"
spec:
  endpointSelector:
    matchLabels:
      id: app2
  egress:
  - toServices:
    - k8sServiceSelector:
        selector:
          matchLabels:
            head: none
[{
    "labels": [{"key": "name", "value": "service-labels-rule"}],
    "endpointSelector": {
        "matchLabels": {
            "id": "app2"
        }
    },
    "egress": [
        {
            "toServices": [
                {
                    "k8sServiceSelector": {
                        "selector": {
                            "matchLabels": {
                                "head": "none"
                            }
                        }
                    }
                }
            ]
        }
    ]
}
]

Entities Based

fromEntities is used to describe the entities that can access the selected endpoints. toEntities is used to describe the entities that can be accessed by the selected endpoints.

The following entities are defined:

host
The host entity includes the local host. This also includes all containers running in host networking mode on the local host.
remote-node
Any node in any of the connected clusters other than the local host. This also includes all containers running in host-networking mode on remote nodes. (Requires the option enable-remote-node-identity to be enabled)
cluster
Cluster is the logical group of all network endpoints inside of the local cluster. This includes all Cilium-managed endpoints of the local cluster, unmanaged endpoints in the local cluster, as well as the host, remote-node, and init identities.
init
The init entity contains all endpoints in bootstrap phase for which the security identity has not been resolved yet. This is typically only observed in non-Kubernetes environments. See section Endpoint Lifecycle for details.
world
The world entity corresponds to all endpoints outside of the cluster. Allowing to world is identical to allowing to CIDR 0/0. An alternative to allowing from and to world is to define fine grained DNS or CIDR based policies.
all
The all entity represents the combination of all known clusters as well as world and whitelists all communication.

New in version future: Allowing users to define custom identities is on the roadmap but has not been implemented yet.

Access to/from local host

Allow all endpoints with the label env=dev to access the host that is serving the particular endpoint.

Note

Kubernetes will automatically allow all communication from the local host of all local endpoints. You can run the agent with the option --allow-localhost=policy to disable this behavior which will give you control over this via policy.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "dev-to-host"
spec:
  endpointSelector:
    matchLabels:
      env: dev
  egress:
    - toEntities:
      - host
[{
    "labels": [{"key": "name", "value": "dev-to-host"}],
    "endpointSelector": {"matchLabels": {"env":"dev"}},
    "egress": [{
        "toEntities": ["host"]
    }]
}]
Access to/from all nodes in the cluster

Allow all endpoints with the label env=dev to receive traffic from any host in the cluster that Cilium is running on.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "to-dev-from-nodes-in-cluster"
spec:
  endpointSelector:
    matchLabels:
      env: dev
  ingress:
    - fromEntities:
      - host
      - remote-node
[{
    "labels": [{"key": "name", "value": "to-dev-from-nodes-in-cluster"}],
    "endpointSelector": {"matchLabels": {"env":"dev"}},
    "ingress": [{
        "fromEntities": [
            "host",
            "remote-node"
        ]
    }]
}]
Access to/from outside cluster

This example shows how to enable access from outside of the cluster to all endpoints that have the label role=public.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "from-world-to-role-public"
spec:
  endpointSelector:
    matchLabels:
      role: public
  ingress:
    - fromEntities:
      - world
[{
    "labels": [{"key": "name", "value":"from-world-to-role-public"}],
    "endpointSelector": {"matchLabels": {"role":"public"}},
    "ingress": [{
        "fromEntities": ["world"]
    }]
}]

IP/CIDR based

CIDR policies are used to define policies to and from endpoints which are not managed by Cilium and thus do not have labels associated with them. These are typically external services, VMs or metal machines running in particular subnets. CIDR policy can also be used to limit access to external services, for example to limit external access to a particular IP range. CIDR policies can be applied at ingress or egress.

CIDR rules apply if Cilium cannot map the source or destination to an identity derived from endpoint labels, i.e. the Special Identities. For example, CIDR rules will apply to traffic where one side of the connection is:

  • A network endpoint outside the cluster
  • The host network namespace where the pod is running.
  • An IP within the cluster prefix whose networking is not provided by Cilium.

Conversely, CIDR rules do not apply to traffic where both sides of the connection are either managed by Cilium or use an IP belonging to a node in the cluster (including host networking pods). This traffic may be allowed using labels, services or entities-based policies as described above.

Note

When running Cilium on Linux 4.10 or earlier, there are Restrictions on unique prefix lengths for CIDR policy rules.

Ingress
fromCIDR
List of source prefixes/CIDRs that are allowed to talk to all endpoints selected by the endpointSelector.
fromCIDRSet
List of source prefixes/CIDRs that are allowed to talk to all endpoints selected by the endpointSelector, along with an optional list of prefixes/CIDRs per source prefix/CIDR that are subnets of the source prefix/CIDR from which communication is not allowed.
Egress
toCIDR
List of destination prefixes/CIDRs that endpoints selected by endpointSelector are allowed to talk to. Note that endpoints which are selected by a fromEndpoints are automatically allowed to reply back to the respective destination endpoints.
toCIDRSet
List of destination prefixes/CIDRs that endpoints selected by the endpointSelector are allowed to talk to, along with an optional list of prefixes/CIDRs per destination prefix/CIDR that are subnets of the destination prefix/CIDR to which communication is not allowed.
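
The example below demonstrates the egress direction using toCIDR and toCIDRSet. For ingress, a minimal sketch (hypothetical policy name; it assumes fromCIDRSet accepts the same cidr/except structure shown for toCIDRSet below) could look like:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "from-cidr-rule"
spec:
  endpointSelector:
    matchLabels:
      app: myService
  ingress:
  - fromCIDRSet:
    - cidr: 10.0.0.0/8
      except:
      - 10.96.0.0/12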
Allow to external CIDR block

This example shows how to allow all endpoints with the label app=myService to talk to the external IP 20.1.1.1, as well as the CIDR prefix 10.0.0.0/8, but not to the CIDR prefix 10.96.0.0/12.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "cidr-rule"
spec:
  endpointSelector:
    matchLabels:
      app: myService
  egress:
  - toCIDR:
    - 20.1.1.1/32
  - toCIDRSet:
    - cidr: 10.0.0.0/8
      except:
      - 10.96.0.0/12
[{
    "labels": [{"key": "name", "value": "cidr-rule"}],
    "endpointSelector": {"matchLabels":{"app":"myService"}},
    "egress": [{
        "toCIDR": [
            "20.1.1.1/32"
        ]
    }, {
        "toCIDRSet": [{
            "cidr": "10.0.0.0/8",
            "except": [
                "10.96.0.0/12"
            ]}
        ]
    }]
}]

DNS based

DNS policies are used to define Layer 3 policies to endpoints that are not managed by Cilium, but have DNS-queryable domain names. The IP addresses provided in DNS responses are allowed by Cilium in a similar manner to IPs in CIDR based policies. They are an alternative when the remote IPs may change or are not known a priori, or when DNS is more convenient. To enforce policy on DNS requests themselves, see Layer 7 Examples.

IP information is captured from DNS responses per-Endpoint via a DNS Proxy. An L3 CIDR based rule is generated for every toFQDNs rule and applies to the same endpoints. The IP information is selected for insertion by matchName or matchPattern rules, and is collected from all DNS responses seen by Cilium on the node. Multiple selectors may be included in a single egress rule. See Obtaining DNS Data for use by toFQDNs for information on collecting this IP data.

toFQDNs egress rules cannot contain any other L3 rules, such as toEndpoints (under Labels Based) and toCIDRs (under CIDR Based). They may contain L4/L7 rules, such as toPorts (see Layer 4 Examples) with, optionally, HTTP and Kafka sections (see Layer 7 Examples).

Note

DNS based rules are intended for external connections and behave similarly to CIDR based rules. See Services based and Labels based for cluster-internal traffic.

IPs to be allowed are selected via:

toFQDNs.matchName
Inserts IPs of domains that match matchName exactly. Multiple distinct names may be included in separate matchName entries and IPs for domains that match any matchName will be inserted.
toFQDNs.matchPattern

Inserts IPs of domains that match the pattern in matchPattern, accounting for wildcards. Patterns are composed of literal characters that are allowed in domain names: a-z, 0-9, . and -.

* is allowed as a wildcard with a number of convenience behaviors:

  • * within a domain allows 0 or more valid DNS characters, except for the . separator. *.cilium.io will match sub.cilium.io but not cilium.io. part*ial.com will match partial.com and part-extra-ial.com.
  • * alone matches all names, and inserts all cached DNS IPs into this rule.
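
For illustration, a minimal toFQDNs sketch (hypothetical policy name) that inserts IPs for any direct subdomain of cilium.io, but not cilium.io itself, could look like the following; note that a DNS visibility rule such as the one in the example below is still required so that Cilium observes the DNS responses:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "to-fqdn-pattern"
spec:
  endpointSelector:
    matchLabels:
      app: test-app
  egress:
    - toFQDNs:
        - matchPattern: "*.cilium.io"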
Example
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "to-fqdn"
spec:
  endpointSelector:
    matchLabels:
      app: test-app
  egress:
    - toEndpoints:
      - matchLabels:
          "k8s:io.kubernetes.pod.namespace": kube-system
          "k8s:k8s-app": kube-dns
      toPorts:
        - ports:
           - port: "53"
             protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    - toFQDNs:
        - matchName: "my-remote-service.com"
[
  {
    "endpointSelector": {
      "matchLabels": {
        "app": "test-app"
      }
    },
    "egress": [
      {
        "toEndpoints": [
          {
            "matchLabels": {
              "app-type": "dns"
            }
          }
        ],
        "toPorts": [
          {
            "ports": [
              {
                "port": "53",
                "protocol": "ANY"
              }
            ],
            "rules": {
              "dns": [
                { "matchPattern": "*" }
              ]
            }
          }
        ]
      },
      {
        "toFQDNs": [
          {
            "matchName": "my-remote-service.com"
          }
        ]
      }
    ]
  }
]
Managing Long-Lived Connections & Minimum DNS Cache Times

Often, an application may keep a connection open for longer than the DNS TTL. Without further DNS queries the remote IP used in the long-lived connection may expire out of the DNS cache. When this occurs, existing connections established before the TTL expires will continue to be allowed until they terminate. Unused IPs will no longer be allowed, however, even when from the same DNS lookup as an in-use IP. This tracking is per-endpoint per-IP, and DNS entries in this state will have source: connection with a single IP listed within the cilium fqdn cache list output.

A minimum TTL is used to ensure a lower time bound to DNS data expiration, and IPs allowed by a toFQDNs rule will be allowed for at least this long. It can be configured with the --tofqdns-min-ttl CLI option. The value is in integer seconds and must be 1 or more; the default is 1 hour.

Some care needs to be taken when setting --tofqdns-min-ttl with DNS data that returns many distinct IPs over time. A long TTL will keep each IP cached long after the related connections have terminated. Large numbers of IPs each have corresponding Security Identities and too many may slow down Cilium policy regeneration.

Managing Short-Lived Connections & Maximum IPs per FQDN/endpoint

The minimum TTL for DNS entries in the cache is deliberately long with 1 hour as the default. This is done to accommodate long-lived persistent connections. On the other end of the spectrum are workloads that perform short-lived connections in repetition to FQDNs that are backed by a large number of IP addresses (e.g. AWS S3).

Many short-lived connections can grow the number of IPs mapping to an FQDN quickly. In order to limit the number of IP addresses that map to a particular FQDN, each FQDN has a per-endpoint max capacity of IPs that will be retained (default: 50). Once this limit is exceeded, the oldest IP entries are automatically expired from the cache. This capacity can be changed using the --tofqdns-max-ip-per-hostname option.

As with long-lived connections above, live connections are not expired until they terminate. It is safe to mix long- and short-lived connections from the same Pod. IPs above the limit described above will only be removed if unused by a connection.
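
How these two limits are configured depends on how Cilium is deployed. A minimal sketch, assuming the agent reads its options from the cilium-config ConfigMap with keys mirroring the CLI flag names (an assumption about your installation method), could look like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # Assumed keys corresponding to the CLI flags discussed above.
  tofqdns-min-ttl: "3600"              # --tofqdns-min-ttl: minimum DNS cache TTL in seconds
  tofqdns-max-ip-per-hostname: "50"    # --tofqdns-max-ip-per-hostname: per-FQDN, per-endpoint IP cap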

Layer 4 Examples

Limit ingress/egress ports

Layer 4 policy can be specified in addition to layer 3 policies or independently. It restricts the ability of an endpoint to emit and/or receive packets on a particular port using a particular protocol. If no layer 4 policy is specified for an endpoint, the endpoint is allowed to send and receive on all layer 4 ports and protocols including ICMP. If any layer 4 policy is specified, then ICMP will be blocked unless it’s related to a connection that is otherwise allowed by the policy. Layer 4 policies apply to ports after service port mapping has been applied.

Layer 4 policy can be specified at both ingress and egress using the toPorts field. The toPorts field takes a PortProtocol structure which is defined as follows:

// PortProtocol specifies an L4 port with an optional transport protocol
type PortProtocol struct {
        // Port is an L4 port number. For now the string will be strictly
        // parsed as a single uint16. In the future, this field may support
        // ranges in the form "1024-2048"
        Port string `json:"port"`

        // Protocol is the L4 protocol. If omitted or empty, any protocol
        // matches. Accepted values: "TCP", "UDP", ""/"ANY"
        //
        // Matching on ICMP is not supported.
        //
        // +optional
        Protocol string `json:"protocol,omitempty"`
}
Example (L4)

The following rule limits all endpoints with the label app=myService to only be able to emit packets using TCP on port 80, to any layer 3 destination:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "l4-rule"
spec:
  endpointSelector:
    matchLabels:
      app: myService
  egress:
    - toPorts:
      - ports:
        - port: "80"
          protocol: TCP
[{
    "labels": [{"key": "name", "value": "l4-rule"}],
    "endpointSelector": {"matchLabels":{"app":"myService"}},
    "egress": [{
        "toPorts": [
            {"ports":[ {"port": "80", "protocol": "TCP"}]}
        ]
    }]
}]
Labels-dependent Layer 4 rule

This example enables all endpoints with the label role=frontend to communicate with all endpoints with the label role=backend, but they must communicate using TCP on port 80. Endpoints with other labels will not be able to communicate with the endpoints with the label role=backend, and endpoints with the label role=frontend will not be able to communicate with role=backend on ports other than 80.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "l4-rule"
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
[{
    "labels": [{"key": "name", "value": "l4-rule"}],
    "endpointSelector": {"matchLabels":{"role":"backend"}},
    "ingress": [{
        "fromEndpoints": [
          {"matchLabels":{"role":"frontend"}}
        ],
        "toPorts": [
            {"ports":[ {"port": "80", "protocol": "TCP"}]}
        ]
    }]
}]
CIDR-dependent Layer 4 Rule

This example enables all endpoints with the label role=crawler to communicate with all remote destinations inside the CIDR 192.0.2.0/24, but they must communicate using TCP on port 80. The policy does not allow Endpoints without the label role=crawler to communicate with destinations in the CIDR 192.0.2.0/24. Furthermore, endpoints with the label role=crawler will not be able to communicate with destinations in the CIDR 192.0.2.0/24 on ports other than port 80.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "cidr-l4-rule"
spec:
  endpointSelector:
    matchLabels:
      role: crawler
  egress:
  - toCIDR:
    - 192.0.2.0/24
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
[{
    "labels": [{"key": "name", "value": "cidr-l4-rule"}],
    "endpointSelector": {"matchLabels":{"role":"crawler"}},
    "egress": [{
        "toCIDR": [
            "192.0.2.0/24"
        ],
        "toPorts": [
            {"ports":[ {"port": "80", "protocol": "TCP"}]}
        ]
    }]
}]

Layer 7 Examples

Layer 7 policy rules are embedded into Layer 4 rules (see Layer 4 Examples) and can be specified for ingress and egress. The L7Rules structure is a base type containing an enumeration of protocol-specific fields.

// L7Rules is a union of port level rule types. Mixing of different port
// level rule types is disallowed, so exactly one of the following must be set.
// If none are specified, then no additional port level rules are applied.
type L7Rules struct {
        // HTTP specific rules.
        //
        // +optional
        HTTP []PortRuleHTTP `json:"http,omitempty"`

        // Kafka-specific rules.
        //
        // +optional
        Kafka []PortRuleKafka `json:"kafka,omitempty"`

        // DNS-specific rules.
        //
        // +optional
        DNS []PortRuleDNS `json:"dns,omitempty"`
}

The structure is implemented as a union, i.e. only one member field can be used per port. If multiple toPorts rules with identical PortProtocol select an overlapping list of endpoints, then the layer 7 rules are combined together if they are of the same type. If the type differs, the policy is rejected.

Each member consists of a list of application protocol rules. A layer 7 request is permitted if at least one of the rules matches. If no rules are specified, then all traffic is permitted.

If a layer 4 rule is specified in the policy, and a similar layer 4 rule with layer 7 rules is also specified, then the layer 7 portions of the latter rule will have no effect.

Note

Unlike layer 3 and layer 4 policies, violation of layer 7 rules does not result in packet drops. Instead, if possible, an application protocol specific access denied message is crafted and returned, e.g. an HTTP 403 access denied is sent back for HTTP requests which violate the policy, or a DNS REFUSED response for DNS requests.

Note

There is currently a max limit of 40 ports with layer 7 policies per endpoint. This might change in the future when support for ranges is added.

HTTP

The following fields can be matched on:

Path
Path is an extended POSIX regex matched against the path of a request. Currently it can contain characters disallowed from the conventional “path” part of a URL as defined by RFC 3986. Paths must begin with a /. If omitted or empty, all paths are allowed.
Method
Method is an extended POSIX regex matched against the method of a request, e.g. GET, POST, PUT, PATCH, DELETE, … If omitted or empty, all methods are allowed.
Host
Host is an extended POSIX regex matched against the host header of a request, e.g. foo.com. If omitted or empty, the value of the host header is ignored.
Headers
Headers is a list of HTTP headers which must be present in the request. If omitted or empty, requests are allowed regardless of headers present.
Allow GET /public

The following example allows GET requests to the URL /public from endpoints with the label env=prod to endpoints with the label app=service; requests to any other URL, or using another method, will be rejected. Requests on ports other than port 80 will be dropped.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "Allow HTTP GET /public from env=prod to app=service"
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      app: service
  ingress:
  - fromEndpoints:
    - matchLabels:
        env: prod
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/public"
[{
  "labels": [{"key": "name", "value": "rule1"}],
  "endpointSelector": {"matchLabels": {"app": "service"}},
  "ingress": [{
    "fromEndpoints": [
      {"matchLabels": {"env": "prod"}}
    ],
    "toPorts": [{
      "ports": [
        {"port": "80", "protocol": "TCP"}
      ],
      "rules": {
        "http": [
          {
            "method": "GET",
            "path": "/public"
          }
        ]
      }
    }]
  }]
}]
All GET /path1 and PUT /path2 when header set

The following example limits all endpoints which carry the label app=myService to only be able to receive packets on port 80 using TCP. While communicating on this port, the only API endpoints allowed will be GET /path1 and PUT /path2 with the HTTP header X-My-Header set to true:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "l7-rule"
spec:
  endpointSelector:
    matchLabels:
      app: myService
  ingress:
  - toPorts:
    - ports:
      - port: '80'
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/path1$"
        - method: PUT
          path: "/path2$"
          headers:
          - 'X-My-Header: true'
[{
    "labels": [{"key": "name", "value": "l7-rule"}],
    "endpointSelector": {"matchLabels":{"app":"myService"}},
    "ingress": [{
        "toPorts": [{
            "ports": [
                {"port": "80", "protocol": "TCP"}
            ],
            "rules": {
                "http": [
                    {
                        "method": "GET",
                        "path": "/path1$"
                    },{
                        "method": "PUT",
                        "path": "/path2$",
                        "headers": ["X-My-Header: true"]
                    }
                ]
            }
        }]
    }]
}]

Kafka (beta)

Note

Kafka support is currently in beta phase.

PortRuleKafka is a list of Kafka protocol constraints. All fields are optional; if all fields are empty or missing, the rule will match all Kafka messages. There are two ways to specify Kafka rules: either specify a high-level “produce” or “consume” role for a topic, or specify the lower-level Kafka protocol apiKeys directly. Writing rules based on Kafka roles is easier and covers most common use cases; however, if more granularity is needed, rules can alternatively be written using specific apiKeys.

The following fields can be matched on:

Role

Role is a case-insensitive string which describes a group of API keys necessary to perform certain higher-level Kafka operations such as “produce” or “consume”. A Role automatically expands into all APIKeys required to perform the specified higher-level operation. The following roles are supported:

  • “produce”: Allow producing to the topics specified in the rule.
  • “consume”: Allow consuming from the topics specified in the rule.

This field is incompatible with the APIKey field, i.e. APIKey and Role cannot both be specified in the same rule. If omitted or empty, and if APIKey is not specified, then all keys are allowed.

APIKey
APIKey is a case-insensitive string matched against the key of a request, for example “produce”, “fetch”, “createtopic”, “deletetopic”. For a more extensive list, see the Kafka protocol reference. This field is incompatible with the Role field.
APIVersion
APIVersion is the version matched against the api version of the Kafka message. If set, it must be a string representing a positive integer. If omitted or empty, all versions are allowed.
ClientID

ClientID is the client identifier as provided in the request.

From Kafka protocol documentation: This is a user supplied identifier for the client application. The user can use any identifier they like and it will be used when logging errors, monitoring aggregates, etc. For example, one might want to monitor not just the requests per second overall, but the number coming from each client application (each of which could reside on multiple servers). This id acts as a logical grouping across all requests from a particular client.

If omitted or empty, all client identifiers are allowed.

Topic

Topic is the topic name contained in the message. If a Kafka request contains multiple topics, then all topics in the message must be allowed by the policy or the message will be rejected.

This constraint is ignored if the matched request message type does not contain any topic. The maximum length of the Topic is 249 characters, which must be either a-z, A-Z, 0-9, -, . or _.

If omitted or empty, all topics are allowed.

Allow producing to topic empire-announce using Role
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "enable empire-hq to produce to empire-announce and deathstar-plans"
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      app: kafka
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: empire-hq
    toPorts:
    - ports:
      - port: "9092"
        protocol: TCP
      rules:
        kafka:
        - role: "produce"
          topic: "deathstar-plans"
        - role: "produce"
          topic: "empire-announce"
[{
  "labels": [{"key": "name", "value": "rule1"}],
  "endpointSelector": {"matchLabels": {"app": "kafka"}},
  "ingress": [{
    "fromEndpoints": [
      {"matchLabels": {"app": "empire-hq"}}
    ],
    "toPorts": [{
      "ports": [
        {"port": "9092", "protocol": "TCP"}
      ],
      "rules": {
        "kafka": [
            {"role": "produce","topic": "deathstar-plans"},
            {"role": "produce", "topic": "empire-announce"}
        ]
      }
    }]
  }]
}]
Allow producing to topic empire-announce using apiKeys
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "enable empire-hq to produce to empire-announce and deathstar-plans"
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      app: kafka
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: empire-hq
    toPorts:
    - ports:
      - port: "9092"
        protocol: TCP
      rules:
        kafka:
        - apiKey: "apiversions"
        - apiKey: "metadata"
        - apiKey: "produce"
          topic: "deathstar-plans"
        - apiKey: "produce"
          topic: "empire-announce"
[{
  "labels": [{"key": "name", "value": "rule1"}],
  "endpointSelector": {"matchLabels": {"app": "kafka"}},
  "ingress": [{
    "fromEndpoints": [
      {"matchLabels": {"app": "empire-hq"}}
    ],
    "toPorts": [{
      "ports": [
        {"port": "9092", "protocol": "TCP"}
      ],
      "rules": {
        "kafka": [
            {"apiKey": "apiversions"},
            {"apiKey": "metadata"},
            {"apiKey": "produce", "topic": "deathstar-plans"},
            {"apiKey": "produce", "topic": "empire-announce"}
        ]
      }
    }]
  }]
}]

DNS Policy and IP Discovery

Policy may be applied to DNS traffic, allowing or disallowing specific DNS query names or patterns of names (other DNS fields, such as query type, are not considered). This policy is effected via a DNS proxy, which is also used to collect IPs used to populate L3 DNS based toFQDNs rules.

Note

While Layer 7 DNS policy can be applied without any other Layer 3 rules, the presence of a Layer 7 rule (with its Layer 3 and 4 components) will block other traffic.

DNS policy may be applied via:

matchName
Allows queries for domains that match matchName exactly. Multiple distinct names may be included in separate matchName entries and queries for domains that match any matchName will be allowed.
matchPattern

Allows queries for domains that match the pattern in matchPattern, accounting for wildcards. Patterns are composed of literal characters that are allowed in domain names: a-z, 0-9, . and -.

* is allowed as a wildcard with a number of convenience behaviors:

  • * within a domain allows 0 or more valid DNS characters, except for the . separator. *.cilium.io will match sub.cilium.io but not cilium.io. part*ial.com will match partial.com and part-extra-ial.com.
  • * alone matches all names, and inserts all IPs in DNS responses into the cilium-agent DNS cache.

In this example, L7 DNS policy allows queries for cilium.io and any subdomains of cilium.io and api.cilium.io. No other DNS queries will be allowed.

The separate L3 toFQDNs egress rule allows connections to any IPs returned in DNS queries for cilium.io, sub.cilium.io, service1.api.cilium.io and any matches of special*service.api.cilium.io, such as special-region1-service.api.cilium.io but not region1-service.api.cilium.io. DNS queries to anothersub.cilium.io are allowed but connections to the returned IPs are not, as there is no L3 toFQDNs rule selecting them. L4 and L7 policy may also be applied (see DNS based), restricting connections to TCP port 80 in this case.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: "tofqdn-dns-visibility"
spec:
  endpointSelector:
    matchLabels:
      any:org: alliance
  egress:
  - toEndpoints:
    - matchLabels:
       "k8s:io.kubernetes.pod.namespace": kube-system
       "k8s:k8s-app": kube-dns
    toPorts:
      - ports:
         - port: "53"
           protocol: ANY
        rules:
          dns:
            - matchName: "cilium.io"
            - matchPattern: "*.cilium.io"
            - matchPattern: "*.api.cilium.io"

  - toFQDNs:
      - matchName: "cilium.io"
      - matchName: "sub.cilium.io"
      - matchName: "service1.api.cilium.io"
      - matchPattern: "special*service.api.cilium.io"
    toPorts:
      - ports:
         - port: "80"
           protocol: TCP
[
  {
    "endpointSelector": {
      "matchLabels": {
        "app": "test-app"
      }
    },
    "egress": [
      {
        "toEndpoints": [
          {
            "matchLabels": {
              "app-type": "dns"
            }
          }
        ],
        "toPorts": [
          {
            "ports": [
              {
                "port": "53",
                "protocol": "ANY"
              }
            ],
            "rules": {
              "dns": [
                { "matchName": "cilium.io" },
                { "matchPattern": "*.cilium.io" }, 
                { "matchPattern": "*.api.cilium.io" }
              ]
            }
          }
        ]
      },
      {
        "toFQDNs": [
          { "matchName": "cilium.io" },
          { "matchName": "sub.cilium.io" },
          { "matchName": "service1.api.cilium.io" },
          { "matchPattern": "special*service.api.cilium.io" }
       ]
      }
    ]
  }
]

Note

When applying DNS policy in kubernetes, queries for service.namespace.svc.cluster.local. must be explicitly allowed with matchPattern: *.*.svc.cluster.local..

Similarly, queries that rely on the DNS search list to complete the FQDN must be allowed in their entirety. e.g. A query for servicename that succeeds with servicename.namespace.svc.cluster.local. must have the latter allowed with matchName or matchPattern. See Alpine/musl deployments and DNS Refused.

Obtaining DNS Data for use by toFQDNs

IPs are obtained by intercepting DNS requests with a proxy or via DNS polling, and matching names are inserted irrespective of how the data is obtained. These IPs can be selected with toFQDNs rules. DNS responses are cached within the Cilium agent, respecting the TTL.

DNS Proxy

A DNS Proxy intercepts egress DNS traffic and records IPs seen in the responses. This interception is, itself, a separate policy rule governing the DNS requests, and must be specified separately. For details on how to enforce policy on DNS requests and configuring the DNS proxy, see Layer 7 Examples.

Only IPs in intercepted DNS responses to an application will be allowed in the cilium policy rules. For a given domain name, IPs from responses to all pods managed by a Cilium instance are allowed by policy (respecting TTLs). This ensures that allowed IPs are consistent with those returned to applications. The DNS Proxy is the only method to allow IPs from responses allowed by wildcard L7 DNS matchPattern rules for use in toFQDNs rules.

The following example obtains DNS data by interception without blocking any DNS requests. It allows L3 connections to cilium.io, sub.cilium.io and any subdomains of sub.cilium.io.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: "tofqdn-dns-visibility"
spec:
  endpointSelector:
    matchLabels:
      any:org: alliance
  egress:
  - toEndpoints:
    - matchLabels:
       "k8s:io.kubernetes.pod.namespace": kube-system
       "k8s:k8s-app": kube-dns
    toPorts:
      - ports:
         - port: "53"
           protocol: ANY
        rules:
          dns:
            - matchPattern: "*"
  - toFQDNs:
      - matchName: "cilium.io"
      - matchName: "sub.cilium.io"
      - matchPattern: "*.sub.cilium.io"
[
  {
    "endpointSelector": {
      "matchLabels": {
        "app": "test-app"
      }
    },
    "egress": [
      {
        "toEndpoints": [
          {
            "matchLabels": {
              "app-type": "dns"
            }
          }
        ],
        "toPorts": [
          {
            "ports": [
              {
                "port": "53",
                "protocol": "ANY"
              }
            ],
            "rules": {
              "dns": [
                { "matchPattern": "*" }
              ]
            }
          }
        ]
      },
      {
        "toFQDNs": [
          { "matchName": "cilium.io" },
          { "matchName": "sub.cilium.io" },
          { "matchPattern": "*.sub.cilium.io" }
        ]
      }
    ]
  }
]
Alpine/musl deployments and DNS Refused

Some common container images treat the Refused response returned when the DNS Proxy rejects a query as a more general failure. This stops traversal of the search list defined in /etc/resolv.conf. It is common for pods to search by appending .svc.cluster.local. to DNS queries. When this occurs, a lookup for cilium.io may first be attempted as cilium.io.namespace.svc.cluster.local. and rejected by the proxy. Instead of continuing and eventually attempting cilium.io. alone, the Pod treats the DNS lookup as failed.

This can be mitigated with the --tofqdns-dns-reject-response-code option. The default is refused, but nameError can be selected, causing the proxy to return an NXDomain response to refused queries.

A more pod-specific solution is to configure ndots appropriately for each Pod, via dnsConfig, so that the search list is not used for DNS lookups that do not need it. See the Kubernetes documentation for instructions.
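
For example, a sketch of a Pod (hypothetical name and image) that sets ndots to 1 via the standard Kubernetes dnsConfig field, so that queries containing at least one dot bypass the search list:

apiVersion: v1
kind: Pod
metadata:
  name: mediabot            # hypothetical pod name
spec:
  containers:
  - name: app
    image: docker.io/library/alpine:3.12   # hypothetical image
    command: ["sleep", "3600"]
  dnsConfig:
    options:
    - name: ndots
      value: "1"            # names with at least one dot are tried as absolute first, skipping the search list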

Kubernetes

This section covers Kubernetes specific network policy aspects.

Namespaces

Namespaces are used to create virtual clusters within a Kubernetes cluster. All Kubernetes objects, including NetworkPolicy and CiliumNetworkPolicy, belong to a particular namespace. Depending on how a policy is defined and created, Kubernetes namespaces are automatically taken into account:

  • Network policies created and imported as CiliumNetworkPolicy CRD and NetworkPolicy apply within the namespace, i.e. the policy only applies to pods within that namespace. It is however possible to grant access to and from pods in other namespaces as described below.
  • Network policies imported directly via the API Reference apply to all namespaces unless a namespace selector is specified as described below.

Note

While specification of the namespace via the label k8s:io.kubernetes.pod.namespace in the fromEndpoints and toEndpoints fields is deliberately supported, specification of the namespace in the endpointSelector is prohibited as it would violate the namespace isolation principle of Kubernetes. The endpointSelector always applies to pods of the namespace which is associated with the CiliumNetworkPolicy resource itself.

Example: Enforce namespace boundaries

This example demonstrates how to enforce Kubernetes namespace-based boundaries for the namespaces ns1 and ns2 by enabling default-deny on all pods of either namespace and then allowing communication from all pods within the same namespace.

Note

The example locks down ingress of the pods in ns1 and ns2. This means that the pods can still communicate egress to anywhere unless the destination is in either ns1 or ns2 in which case both source and destination have to be in the same namespace. In order to enforce namespace boundaries at egress, the same example can be used by specifying the rules at egress in addition to ingress.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "isolate-ns1"
  namespace: ns1
spec:
  endpointSelector:
    matchLabels:
      {}
  ingress:
  - fromEndpoints:
    - matchLabels:
        {}
---
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "isolate-ns1"
  namespace: ns2
spec:
  endpointSelector:
    matchLabels:
      {}
  ingress:
  - fromEndpoints:
    - matchLabels:
        {}
[
   {
      "ingress" : [
         {
            "fromEndpoints" : [
               {
                  "matchLabels" : {
                     "k8s:io.kubernetes.pod.namespace" : "ns1"
                  }
               }
            ]
         }
      ],
      "endpointSelector" : {
         "matchLabels" : {
            "k8s:io.kubernetes.pod.namespace" : "ns1"
         }
      }
   },
   {
      "endpointSelector" : {
         "matchLabels" : {
            "k8s:io.kubernetes.pod.namespace" : "ns2"
         }
      },
      "ingress" : [
         {
            "fromEndpoints" : [
               {
                  "matchLabels" : {
                     "k8s:io.kubernetes.pod.namespace" : "ns2"
                  }
               }
            ]
         }
      ]
   }
]
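
As noted above, the same construct can be extended to egress. A sketch for ns1 (assuming that an empty toEndpoints selector in a namespaced policy selects only endpoints of that namespace, mirroring the ingress behavior shown above):

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "isolate-ns1-egress"
  namespace: ns1
spec:
  endpointSelector:
    matchLabels:
      {}
  ingress:
  - fromEndpoints:
    - matchLabels:
        {}
  egress:
  - toEndpoints:
    - matchLabels:
        {}

Keep in mind that this also puts egress of ns1 pods into default-deny for any other destination, so additional egress rules (for example the kube-dns rule shown later in this section) may be required.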
Example: Expose pods across namespaces

The following example exposes all pods with the label name=leia in the namespace ns1 to all pods with the label name=luke in the namespace ns2.

Refer to the example YAML files for a fully functional example including pods deployed to different namespaces.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "k8s-expose-across-namespace"
  namespace: ns1
spec:
  endpointSelector:
    matchLabels:
      name: leia
  ingress:
  - fromEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: ns2
        name: luke
[{
    "labels": [{"key": "name", "value": "k8s-svc-account"}],
    "endpointSelector": {
        "matchLabels": {"name":"leia", "k8s:io.kubernetes.pod.namespace":"ns1"}
    },
    "ingress": [{
        "fromEndpoints": [{
	    "matchLabels":{"name": "luke", "k8s:io.kubernetes.pod.namespace":"ns2"}
        }]
    }]
}]
Example: Allow egress to kube-dns in kube-system namespace

The following example allows all pods in the namespace in which the policy is created to communicate with kube-dns on port 53/UDP in the kube-system namespace.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-to-kubedns"
spec:
  endpointSelector:
    {}
  egress:
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: '53'
        protocol: UDP
[
   {
      "endpointSelector" : {
         "matchLabels" : {}
      },
      "egress" : [
         {
            "toEndpoints" : [
               {
                  "matchLabels" : {
                     "k8s:io.kubernetes.pod.namespace" : "kube-system",
                     "k8s-app" : "kube-dns"
                  }
               }
            ],
            "toPorts" : [
               {
                  "ports" : [
                     {
                        "port" : "53",
                        "protocol" : "UDP"
                     }
                  ]
               }
            ]
         }
      ]
   }
]

ServiceAccounts

Kubernetes Service Accounts are used to associate an identity to a pod or process managed by Kubernetes and grant identities access to Kubernetes resources and secrets. Cilium supports the specification of network security policies based on the service account identity of a pod.

The service account of a pod is either defined via the service account admission controller or can be directly specified in the Pod, Deployment, ReplicationController resource like this:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  serviceAccountName: leia
  ...
Example

The following example allows any pod running under the service account “luke” to issue an HTTP GET /public request on TCP port 80 to all pods running under the service account “leia”.

Refer to the example YAML files for a fully functional example including deployment and service account resources.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "k8s-svc-account"
spec:
  endpointSelector:
    matchLabels:
      io.cilium.k8s.policy.serviceaccount: leia
  ingress:
  - fromEndpoints:
    - matchLabels:
        io.cilium.k8s.policy.serviceaccount: luke
    toPorts:
    - ports:
      - port: '80'
        protocol: TCP
      rules:
        http:
        - method: GET
          path: "/public$"
[{
    "labels": [{"key": "name", "value": "k8s-svc-account"}],
    "endpointSelector": {"matchLabels": {"io.cilium.k8s.policy.serviceaccount":"leia"}},
    "ingress": [{
        "fromEndpoints": [
          {"matchLabels":{"io.cilium.k8s.policy.serviceaccount":"luke"}}
        ],
        "toPorts": [{
            "ports": [
                {"port": "80", "protocol": "TCP"}
            ],
            "rules": {
                "http": [
                    {
                        "method": "GET",
                        "path": "/public$"
                    }
                ]
            }
        }]
    }]
}]

Multi-Cluster

When operating multiple clusters connected via cluster mesh, the cluster name is exposed via the label io.cilium.k8s.policy.cluster and can be used to restrict policies to a particular cluster.

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "allow-cross-cluster"
  description: "Allow x-wing in cluster1 to contact rebel-base in cluster2"
spec:
  endpointSelector:
    matchLabels:
      name: x-wing
      io.cilium.k8s.policy.cluster: cluster1
  egress:
  - toEndpoints:
    - matchLabels:
        name: rebel-base
        io.cilium.k8s.policy.cluster: cluster2

Clusterwide Policies

CiliumNetworkPolicy only allows binding a policy to a particular namespace. There can be situations where a policy should take effect cluster-wide, which can be done using Cilium’s CiliumClusterwideNetworkPolicy Kubernetes custom resource. The specification of the policy is the same as that of CiliumNetworkPolicy except that it is not namespaced.

In the cluster, this policy will allow ingress traffic from pods matching the label name=luke from any namespace to pods matching the labels name=leia in any namespace.

apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
description: "Policy for selective ingress allow to a pod from only a pod with given label"
metadata:
  name: "clusterwide-policy-example"
spec:
  endpointSelector:
    matchLabels:
      name: leia
  ingress:
  - fromEndpoints:
    - matchLabels:
        name: luke

Endpoint Lifecycle

This section specifies the lifecycle of Cilium endpoints.

Every endpoint in Cilium is in one of the following states:

  • restoring: The endpoint was started before Cilium started, and Cilium is restoring its networking configuration.
  • waiting-for-identity: Cilium is allocating a unique identity for the endpoint.
  • waiting-to-regenerate: The endpoint received an identity and is waiting for its networking configuration to be (re)generated.
  • regenerating: The endpoint’s networking configuration is being (re)generated. This includes programming BPF for that endpoint.
  • ready: The endpoint’s networking configuration has been successfully (re)generated.
  • disconnecting: The endpoint is being deleted.
  • disconnected: The endpoint has been deleted.
_images/cilium-endpoint-lifecycle.png

The state of an endpoint can be queried using the cilium endpoint list and cilium endpoint get CLI commands.

While an endpoint is running, it transitions between the waiting-for-identity, waiting-to-regenerate, regenerating, and ready states. A transition into the waiting-for-identity state indicates that the endpoint changed its identity. A transition into the waiting-to-regenerate or regenerating state indicates that the policy to be enforced on the endpoint has changed because of a change in identity, policy, or configuration.

An endpoint transitions into the disconnecting state when it is being deleted, regardless of its current state.

Init Identity

In some situations, Cilium can’t determine the labels of an endpoint immediately when the endpoint is created, and therefore can’t allocate an identity for the endpoint at that point. Until the endpoint’s labels are known, Cilium temporarily associates the single special label reserved:init with the endpoint. When the endpoint’s labels become known, Cilium replaces that special label with the endpoint’s labels and allocates a proper identity to the endpoint.

This may occur during endpoint creation in the following cases:

  • Running Cilium with docker via libnetwork
  • With Kubernetes, when the Kubernetes API server is not available
  • In etcd or consul mode, when the corresponding kvstore is not available

To allow traffic to/from endpoints while they are initializing, you can create policy rules that select the reserved:init label, and/or rules that allow traffic to/from the special init entity.

For instance, writing a rule that allows all initializing endpoints to receive connections from the host and to perform DNS queries may be done as follows:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: init
specs:
  - endpointSelector:
      matchLabels:
        "reserved:init": ""
    ingress:
    - fromEntities:
      - host
    egress:
    - toEntities:
      - all
      toPorts:
      - ports:
        - port: "53"
          protocol: UDP
[{
  "labels": [{"key": "name", "value": "init"}],
  "endpointSelector": {"matchLabels":{"reserved:init":""}},
  "ingress": [{
    "fromEntities": ["host"]
  }],
  "egress": [{
    "toEntities": ["all"],
    "toPorts": [
      {"ports":[ {"port": "53", "protocol": "UDP"}]}
    ]
  }]
}]

Likewise, writing a rule that allows an endpoint to receive DNS queries from initializing endpoints may be done as follows:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "from-init"
spec:
  endpointSelector:
    matchLabels:
      app: myService
  ingress:
    - fromEntities:
      - init
      toPorts:
      - ports:
        - port: "53"
          protocol: UDP
[{
  "labels": [{"key": "name", "value": "from-init"}],
  "endpointSelector": {"matchLabels":{"app":"myService"}},
  "ingress": [{
    "fromEntities": ["init"],
    "toPorts": [
      {"ports":[ {"port": "53", "protocol": "UDP"}]}
    ]
  }]
}]

If any ingress (resp. egress) policy rule selects the reserved:init label, all ingress (resp. egress) traffic to (resp. from) initializing endpoints that is not explicitly allowed by those rules will be dropped. Otherwise, if the policy enforcement mode is never or default, all ingress (resp. egress) traffic is allowed to (resp. from) initializing endpoints. Otherwise, all ingress (resp. egress) traffic is dropped.

Troubleshooting

Policy Tracing

If Cilium is allowing or denying connections in a way that does not align with the intent of your Cilium network policy, there is an easy way to verify whether and which policy rules apply between two endpoints: the cilium policy trace command simulates a policy decision between the source and destination endpoints.

We will use the example from the Minikube Getting Started Guide to trace the policy. In this example, there is:

  • deathstar service identified by labels: org=empire, class=deathstar. The service is backed by two pods.
  • tiefighter spaceship client pod with labels: org=empire, class=tiefighter
  • xwing spaceship client pod with labels: org=alliance, class=xwing

An L3/L4 policy is enforced on the deathstar service to allow access to all spaceships with the label org=empire. With this policy, tiefighter access is allowed but xwing access is denied. Let’s use cilium policy trace to simulate the policy decision. The command can be run using pod names, labels or Cilium security identities.

Note

If the --dport option is not specified, then L4 policy will not be consulted in this policy trace command.

Currently, there is no support for tracing L7 policies via this tool.

# Policy trace using pod name and service labels

$ kubectl exec -ti cilium-88k78 -n kube-system -- cilium policy trace --src-k8s-pod default:xwing -d any:class=deathstar,k8s:org=empire,k8s:io.kubernetes.pod.namespace=default --dport 80
level=info msg="Waiting for k8s api-server to be ready..." subsys=k8s
level=info msg="Connected to k8s api-server" ipAddr="https://10.96.0.1:443" subsys=k8s
----------------------------------------------------------------
Tracing From: [k8s:class=xwing, k8s:io.cilium.k8s.policy.serviceaccount=default, k8s:io.kubernetes.pod.namespace=default, k8s:org=alliance] => To: [any:class=deathstar, k8s:org=empire, k8s:io.kubernetes.pod.namespace=default] Ports: [80/ANY]

Resolving ingress policy for [any:class=deathstar k8s:org=empire k8s:io.kubernetes.pod.namespace=default]
* Rule {"matchLabels":{"any:class":"deathstar","any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}: selected
    Allows from labels {"matchLabels":{"any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}
      Labels [k8s:class=xwing k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=alliance] not found
1/1 rules selected
Found no allow rule
Ingress verdict: denied

Final verdict: DENIED
# Get the Cilium security id

$ kubectl exec -ti cilium-88k78 -n kube-system -- cilium endpoint list | egrep  'deathstar|xwing|tiefighter'
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                              IPv6                 IPv4            STATUS
           ENFORCEMENT        ENFORCEMENT
568        Enabled            Disabled          22133      k8s:class=deathstar                                      f00d::a0f:0:0:238    10.15.65.193    ready
900        Enabled            Disabled          22133      k8s:class=deathstar                                      f00d::a0f:0:0:384    10.15.114.17    ready
33633      Disabled           Disabled          53208      k8s:class=xwing                                          f00d::a0f:0:0:8361   10.15.151.230   ready
38654      Disabled           Disabled          22962      k8s:class=tiefighter                                     f00d::a0f:0:0:96fe   10.15.88.156    ready

# Policy trace using Cilium security ids

$ kubectl exec -ti cilium-88k78 -n kube-system -- cilium policy trace --src-identity 53208 --dst-identity 22133  --dport 80
----------------------------------------------------------------
Tracing From: [k8s:class=xwing, k8s:io.cilium.k8s.policy.serviceaccount=default, k8s:io.kubernetes.pod.namespace=default, k8s:org=alliance] => To: [any:class=deathstar, k8s:org=empire, k8s:io.kubernetes.pod.namespace=default] Ports: [80/ANY]

Resolving ingress policy for [any:class=deathstar k8s:org=empire k8s:io.kubernetes.pod.namespace=default]
* Rule {"matchLabels":{"any:class":"deathstar","any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}: selected
    Allows from labels {"matchLabels":{"any:org":"empire","k8s:io.kubernetes.pod.namespace":"default"}}
      Labels [k8s:class=xwing k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=alliance] not found
1/1 rules selected
Found no allow rule
Ingress verdict: denied

Final verdict: DENIED

Policy Rule to Endpoint Mapping

To determine which policy rules are currently in effect for an endpoint, the data from cilium endpoint list and cilium endpoint get can be paired with the data from cilium policy get. cilium endpoint get will list the labels of each rule that applies to an endpoint. The list of labels can be passed to cilium policy get to show that exact source policy. Note that rules that have no labels cannot be fetched alone (running cilium policy get with no labels returns the complete policy on the node). Rules with the same labels will be returned together.

In the above example, for one of the deathstar pods the endpoint id is 568. We can print all policies applied to it with:

# Get a shell on the Cilium pod

$ kubectl exec -ti cilium-88k78 -n kube-system /bin/bash

# print out the ingress labels
# clean up the data
# fetch each policy via each set of labels
# (Note that while the structure is "...l4.ingress...", it reflects all L3, L4 and L7 policy.)

$ cilium endpoint get 568 -o jsonpath='{range ..status.policy.realized.l4.ingress[*].derived-from-rules}{@}{"\n"}{end}'|tr -d '][' | xargs -I{} bash -c 'echo "Labels: {}"; cilium policy get {}'
Labels: k8s:io.cilium.k8s.policy.name=rule1 k8s:io.cilium.k8s.policy.namespace=default
[
  {
    "endpointSelector": {
      "matchLabels": {
        "any:class": "deathstar",
        "any:org": "empire",
        "k8s:io.kubernetes.pod.namespace": "default"
      }
    },
    "ingress": [
      {
        "fromEndpoints": [
          {
            "matchLabels": {
              "any:org": "empire",
              "k8s:io.kubernetes.pod.namespace": "default"
            }
          }
        ],
        "toPorts": [
          {
            "ports": [
              {
                "port": "80",
                "protocol": "TCP"
              }
            ],
            "rules": {
              "http": [
                {
                  "path": "/v1/request-landing",
                  "method": "POST"
                }
              ]
            }
          }
        ]
      }
    ],
    "labels": [
      {
        "key": "io.cilium.k8s.policy.name",
        "value": "rule1",
        "source": "k8s"
      },
      {
        "key": "io.cilium.k8s.policy.namespace",
        "value": "default",
        "source": "k8s"
      }
    ]
  }
]
Revision: 217


# repeat for egress
$ cilium endpoint get 568 -o jsonpath='{range ..status.policy.realized.l4.egress[*].derived-from-rules}{@}{"\n"}{end}' | tr -d '][' | xargs -I{} bash -c 'echo "Labels: {}"; cilium policy get {}'

Troubleshooting toFQDNs rules

The effect of toFQDNs may change long after a policy is applied, as DNS data changes. This can make it difficult to debug unexpectedly blocked connections, or transient failures. Cilium provides CLI tools to introspect the state of applying FQDN policy in multiple layers of the daemon:

  1. cilium policy get should show the FQDN policy that was imported:

    {
      "endpointSelector": {
        "matchLabels": {
          "any:class": "mediabot",
          "any:org": "empire",
          "k8s:io.kubernetes.pod.namespace": "default"
        }
      },
      "egress": [
        {
          "toFQDNs": [
            {
              "matchName": "api.twitter.com"
            }
          ]
        },
        {
          "toEndpoints": [
            {
              "matchLabels": {
                "k8s:io.kubernetes.pod.namespace": "kube-system",
                "k8s:k8s-app": "kube-dns"
              }
            }
          ],
          "toPorts": [
            {
              "ports": [
                {
                  "port": "53",
                  "protocol": "ANY"
                }
              ],
              "rules": {
                "dns": [
                  {
                    "matchPattern": "*"
                  }
                ]
              }
            }
          ]
        }
      ],
      "labels": [
        {
          "key": "io.cilium.k8s.policy.derived-from",
          "value": "CiliumNetworkPolicy",
          "source": "k8s"
        },
        {
          "key": "io.cilium.k8s.policy.name",
          "value": "fqdn",
          "source": "k8s"
        },
        {
          "key": "io.cilium.k8s.policy.namespace",
          "value": "default",
          "source": "k8s"
        },
        {
          "key": "io.cilium.k8s.policy.uid",
          "value": "fc9d6022-2ffa-4f72-b59e-b9067c3cfecf",
          "source": "k8s"
        }
      ]
    }
    
  2. After making a DNS request, the FQDN to IP mapping should be available via cilium fqdn cache list:

    # cilium fqdn cache list
    Endpoint   FQDN                TTL      ExpirationTime             IPs
    2761       help.twitter.com.   604800   2019-07-16T17:57:38.179Z   104.244.42.67,104.244.42.195,104.244.42.3,104.244.42.131
    2761       api.twitter.com.    604800   2019-07-16T18:11:38.627Z   104.244.42.194,104.244.42.130,104.244.42.66,104.244.42.2
    
  3. If the traffic is allowed, then these IPs should have corresponding local identities via cilium identity list | grep <IP>:

    # cilium identity list | grep -A 1 104.244.42.194
    16777220   cidr:104.244.42.194/32
               reserved:world
    
  4. Given the identity of the traffic that should be allowed, the regular Policy Tracing steps can be used to validate that the policy is calculated correctly.

L7 Protocol Visibility

While Monitoring Datapath State provides introspection into datapath state, by default it will only provide visibility into L3/L4 packet events. If Layer 7 policy is configured (see Layer 7 Examples), one can get visibility into L7 protocols, but this requires the full policy for each selected endpoint to be written. To get more visibility into the application without configuring a full policy, Cilium provides a means of prescribing visibility via annotations when running in tandem with Kubernetes.

Visibility information is represented by a comma-separated list of tuples in the annotation:

<{Traffic Direction}/{L4 Port}/{L4 Protocol}/{L7 Protocol}>

For example:

<Egress/53/UDP/DNS>,<Egress/80/TCP/HTTP>

To do this, you can provide the annotation in your Kubernetes YAMLs, or via the command line, e.g.:

kubectl annotate pod foo -n bar io.cilium.proxy-visibility="<Egress/53/UDP/DNS>,<Egress/80/TCP/HTTP>"
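
The same annotation can instead be declared in the pod manifest; a sketch using the pod name and namespace from the command above, with a hypothetical container image:

apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: bar
  annotations:
    # Same visibility tuples as in the kubectl example above.
    io.cilium.proxy-visibility: "<Egress/53/UDP/DNS>,<Egress/80/TCP/HTTP>"
spec:
  containers:
  - name: app
    image: docker.io/library/nginx:1.19   # hypothetical image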

Cilium will pick up that pods have received these annotations and will transparently redirect traffic to the proxy; the output of cilium monitor then shows the redirected traffic, e.g.:

-> Request http from 1474 ([k8s:id=app2 k8s:io.kubernetes.pod.namespace=default k8s:appSecond=true k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=app2-account k8s:zgroup=testapp]) to 244 ([k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=app1-account k8s:io.kubernetes.pod.namespace=default k8s:zgroup=testapp k8s:id=app1]), identity 30162->42462, verdict Forwarded GET http://app1-service/ => 0
-> Response http to 1474 ([k8s:zgroup=testapp k8s:id=app2 k8s:io.kubernetes.pod.namespace=default k8s:appSecond=true k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=app2-account]) from 244 ([k8s:io.cilium.k8s.policy.serviceaccount=app1-account k8s:io.kubernetes.pod.namespace=default k8s:zgroup=testapp k8s:id=app1 k8s:io.cilium.k8s.policy.cluster=default]), identity 30162->42462, verdict Forwarded GET http://app1-service/ => 200

You can check the status of the visibility policy by checking the Cilium endpoint of that pod, for example:

$ kubectl get cep -n kube-system
NAME                       ENDPOINT ID   IDENTITY ID   INGRESS ENFORCEMENT   EGRESS ENFORCEMENT   VISIBILITY POLICY   ENDPOINT STATE   IPV4           IPV6
coredns-7d7f5b7685-wvzwb   1959          104           false                 false                                    ready            10.16.75.193   f00d::a10:0:0:2c77
$
$ kubectl annotate pod -n kube-system coredns-7d7f5b7685-wvzwb io.cilium.proxy-visibility="<Egress/53/UDP/DNS>,<Egress/80/TCP/HTTP>" --overwrite
pod/coredns-7d7f5b7685-wvzwb annotated
$
$ kubectl get cep -n kube-system
NAME                       ENDPOINT ID   IDENTITY ID   INGRESS ENFORCEMENT   EGRESS ENFORCEMENT   VISIBILITY POLICY   ENDPOINT STATE   IPV4           IPV6
coredns-7d7f5b7685-wvzwb   1959          104           false                 false                OK                  ready            10.16.75.193   f00d::a10:0:0:2c7

Troubleshooting

If L7 visibility is not appearing in cilium monitor or Hubble components, it is worth double-checking that:

  • No enforcement policy is applied in the direction specified in the annotation
  • The “Visibility Policy” column in the CiliumEndpoint shows OK. If it is blank, then no annotation is configured; if it shows an error then there is a problem with the visibility annotation.

The following example deliberately misconfigures the annotation to demonstrate that the CiliumEndpoint for the pod presents an error when the visibility annotation cannot be implemented:

$ kubectl annotate pod -n kube-system coredns-7d7f5b7685-wvzwb io.cilium.proxy-visibility="<Ingress/53/UDP/DNS>,<Egress/80/TCP/HTTP>"
pod/coredns-7d7f5b7685-wvzwb annotated
$
$ kubectl get cep -n kube-system
NAME                       ENDPOINT ID   IDENTITY ID   INGRESS ENFORCEMENT   EGRESS ENFORCEMENT   VISIBILITY POLICY                        ENDPOINT STATE   IPV4           IPV6
coredns-7d7f5b7685-wvzwb   1959          104           false                 false                dns not allowed with direction Ingress   ready            10.16.75.193   f00d::a10:0:0:2c77

Limitations

  • Visibility annotations do not apply if imported rules select the annotated pod.
  • Proxylib parsers are not supported.

Monitoring & Metrics

cilium-agent and cilium-operator can be configured to serve Prometheus metrics. Prometheus is a pluggable metrics collection and storage system and can act as a data source for Grafana, a metrics visualization frontend. Unlike some metrics collectors such as statsd, Prometheus pulls metrics from each source rather than having the sources push them to a central collector.

To run Cilium with Prometheus metrics enabled, deploy it with the global.prometheus.enabled=true Helm value set.
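
For example, with Helm (a sketch; the cilium/cilium chart reference, release name, and kube-system namespace are assumptions that may differ in your environment):

$ helm install cilium cilium/cilium --namespace kube-system \
    --set global.prometheus.enabled=true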

All metrics are exported under the cilium Prometheus namespace. When running and collecting in Kubernetes they will be tagged with a pod name and namespace.

Installation

When deployed with the Helm value global.prometheus.enabled=true, all Cilium components will have the annotations to signal Prometheus whether to scrape metrics:

prometheus.io/scrape: "true"
prometheus.io/port: "9090"
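
If you run your own Prometheus, a scrape configuration that honors these annotations looks roughly like the following. This is a sketch of the commonly used kubernetes_sd_configs relabeling pattern, not a Cilium-specific configuration; adjust it to your environment:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use the port from the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__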

Example Prometheus & Grafana Deployment

If you don’t have an existing Prometheus and Grafana stack running, you can deploy a stack with:

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/review-docs-redesign/examples/kubernetes/addons/prometheus/monitoring-example.yaml

It will run Prometheus and Grafana in the cilium-monitoring namespace. You can then expose Grafana to access it via your browser.

kubectl -n cilium-monitoring port-forward service/grafana 3000:3000

Open your browser and access http://localhost:3000/

cilium-agent

To expose any metrics, invoke cilium-agent with the --prometheus-serve-addr option. This option takes an IP:Port pair, but passing an empty IP (e.g. :9090) will bind the server to all available interfaces (there is usually only one in a container).

An example is provided in examples/kubernetes/addons/prometheus/monitoring-example.yaml.
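
To verify that the agent is serving metrics, you can query the endpoint from inside a Cilium pod. This is a sketch: the pod name is taken from the earlier examples and curl is assumed to be available in the image:

$ kubectl -n kube-system exec cilium-2hq5z -- curl -s http://localhost:9090/metrics | head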

Exported Metrics

Endpoint
Name | Labels | Description
endpoint_count | (none) | Number of endpoints managed by this agent
endpoint_regenerations | outcome | Count of all endpoint regenerations that have completed
endpoint_regeneration_time_stats_seconds | scope | Endpoint regeneration time stats
endpoint_state | state | Count of all endpoints

Services
Name | Labels | Description
services_events_total | (none) | Number of services events labeled by action type

Datapath
Name | Labels | Description
datapath_errors_total | area, name, family | Total number of errors occurred in datapath management
datapath_conntrack_gc_runs_total | status | Number of times that the conntrack garbage collector process was run
datapath_conntrack_gc_key_fallbacks_total | (none) | The number of alive and deleted conntrack entries at the end of a garbage collector run labeled by datapath family
datapath_conntrack_gc_entries | family | The number of alive and deleted conntrack entries at the end of a garbage collector run
datapath_conntrack_gc_duration_seconds | status | Duration in seconds of the garbage collector process

BPF
Name | Labels | Description
bpf_syscall_duration_seconds | operation, outcome | Duration of BPF system call performed
bpf_map_ops_total | mapName, operation, outcome | Number of BPF map operations performed
bpf_maps_virtual_memory_max_bytes | (none) | Max memory used by BPF maps installed in the system
bpf_progs_virtual_memory_max_bytes | (none) | Max memory used by BPF programs installed in the system

Both bpf_maps_virtual_memory_max_bytes and bpf_progs_virtual_memory_max_bytes currently report the system-wide memory usage of BPF, including BPF that is not directly managed by Cilium. This might change in the future to only report the BPF memory usage directly managed by Cilium.

Drops/Forwards (L3/L4)
Name | Labels | Description
drop_count_total | reason, direction | Total dropped packets
drop_bytes_total | reason, direction | Total dropped bytes
forward_count_total | direction | Total forwarded packets
forward_bytes_total | direction | Total forwarded bytes

Policy
Name | Labels | Description
policy_count | (none) | Number of policies currently loaded
policy_regeneration_total | (none) | Total number of policies regenerated successfully
policy_regeneration_time_stats_seconds | scope | Policy regeneration time stats labeled by the scope
policy_max_revision | (none) | Highest policy revision number in the agent
policy_import_errors | (none) | Number of times a policy import has failed
policy_endpoint_enforcement_status | (none) | Number of endpoints labeled by policy enforcement status

Policy L7 (HTTP/Kafka)
Name | Labels | Description
proxy_redirects | protocol | Number of redirects installed for endpoints
proxy_upstream_reply_seconds | (none) | Seconds waited for upstream server to reply to a request
policy_l7_total | type | Number of total L7 requests/responses

Identity
Name | Labels | Description
identity_count | (none) | Number of identities currently allocated

Events external to Cilium
Name | Labels | Description
event_ts | source | Last timestamp when we received an event

Controllers
Name | Labels | Description
controllers_runs_total | status | Number of times that a controller process was run
controllers_runs_duration_seconds | status | Duration in seconds of the controller process

SubProcess
Name | Labels | Description
subprocess_start_total | subsystem | Number of times that Cilium has started a subprocess

Kubernetes
Name | Labels | Description
kubernetes_events_received_total | scope, action, validity, equality | Number of Kubernetes events received
kubernetes_events_total | scope, action, outcome | Number of Kubernetes events processed
k8s_cnp_status_completion_seconds | attempts, outcome | Duration in seconds of how long it took to complete a CNP status update

IPAM
Name | Labels | Description
ipam_events_total | (none) | Number of IPAM events received labeled by action and datapath family type

KVstore
Name | Labels | Description
kvstore_operations_duration_seconds | action, kind, outcome, scope | Duration of kvstore operation
kvstore_events_queue_seconds | action, scope | Duration in seconds that a received event was blocked before it could be queued

Agent
Name | Labels | Description
agent_bootstrap_seconds | scope, outcome | Duration of various bootstrap phases
api_process_time_seconds | (none) | Processing time of all the API calls made to the cilium-agent, labeled by API method, API path and returned HTTP code

FQDN
Name | Labels | Description
fqdn_gc_deletions_total | (none) | Number of FQDNs that have been cleaned on FQDN garbage collector job
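
As an example of consuming these metrics, the following query sums the packet drop rate by reason. This is a sketch: it assumes the Prometheus service from the example deployment above is named prometheus in the cilium-monitoring namespace and listens on port 9090:

$ kubectl -n cilium-monitoring port-forward service/prometheus 9090:9090
# in a second terminal
$ curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=sum(rate(cilium_drop_count_total[5m])) by (reason)'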

cilium-operator

cilium-operator can be configured to serve metrics by running with the option --enable-metrics. By default, the operator exposes metrics on port 6942; the port can be changed with the option --metrics-address.
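
To quickly confirm that the operator is exposing metrics, you can port-forward to it. This is a sketch; the deployment name cilium-operator and the kube-system namespace are assumptions that match a default installation:

$ kubectl -n kube-system port-forward deployment/cilium-operator 6942:6942
# in a second terminal
$ curl -s http://localhost:6942/metrics | grep cilium_operator_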

Exported Metrics

All metrics are exported under the cilium_operator_ Prometheus namespace.

IPAM
Name | Labels | Description
ipam_ips | type | Number of IPs allocated
ipam_allocation_ops | subnetId | Number of IP allocation operations
ipam_interface_creation_ops | subnetId, status | Number of interfaces creation operations
ipam_available | (none) | Number of interfaces with addresses available
ipam_nodes_at_capacity | (none) | Number of nodes unable to allocate more addresses
ipam_resync_total | (none) | Number of synchronization operations with external IPAM API
ipam_api_duration_seconds | operation, responseCode | Duration of interactions with external IPAM API
ipam_api_rate_limit_duration_seconds | operation | Duration of rate limiting while accessing external IPAM API

Troubleshooting

This document describes how to troubleshoot Cilium in different deployment modes. It focuses on a full deployment of Cilium within a datacenter or public cloud. If you are just looking for a simple way to experiment, we highly recommend trying out the Getting Started Guides instead.

This guide assumes that you have read the Concepts which explains all the components and concepts.

We use GitHub issues to maintain a list of Cilium Frequently Asked Questions (FAQ). You can also check there to see if your question has already been addressed.

Component & Cluster Health

Kubernetes

An initial overview of Cilium can be retrieved by listing all pods to verify whether all pods have the status Running:

$ kubectl -n kube-system get pods -l k8s-app=cilium
NAME           READY     STATUS    RESTARTS   AGE
cilium-2hq5z   1/1       Running   0          4d
cilium-6kbtz   1/1       Running   0          4d
cilium-klj4b   1/1       Running   0          4d
cilium-zmjj9   1/1       Running   0          4d

If Cilium encounters a problem that it cannot recover from, it will automatically report the failure state via cilium status, which is regularly queried by the Kubernetes liveness probe to automatically restart Cilium pods. If a Cilium pod is in state CrashLoopBackOff, this indicates a permanent failure scenario.

Detailed Status

If a particular Cilium pod is not in running state, the status and health of the agent on that node can be retrieved by running cilium status in the context of that pod:

$ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium status
KVStore:                Ok   etcd: 1/1 connected: http://demo-etcd-lab--a.etcd.tgraf.test1.lab.corp.isovalent.link:2379 - 3.2.5 (Leader)
ContainerRuntime:       Ok   docker daemon: OK
Kubernetes:             Ok   OK
Kubernetes APIs:        ["cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint", "core/v1::Node", "CustomResourceDefinition"]
Cilium:                 Ok   OK
NodeMonitor:            Disabled
Cilium health daemon:   Ok
Controller Status:      14/14 healthy
Proxy Status:           OK, ip 10.2.0.172, port-range 10000-20000
Cluster health:   4/4 reachable   (2018-06-16T09:49:58Z)

Alternatively, the k8s-cilium-exec.sh script can be used to run cilium status on all nodes. This will provide detailed status and health information of all nodes in the cluster:

$ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-cilium-exec.sh
$ chmod +x ./k8s-cilium-exec.sh

… and run cilium status on all nodes:

$ ./k8s-cilium-exec.sh cilium status
KVStore:                Ok   Etcd: http://127.0.0.1:2379 - (Leader) 3.1.10
ContainerRuntime:       Ok
Kubernetes:             Ok   OK
Kubernetes APIs:        ["extensions/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint"]
Cilium:                 Ok   OK
NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
Controller Status:      7/7 healthy
Proxy Status:           OK, ip 10.15.28.238, 0 redirects, port-range 10000-20000
Cluster health:   1/1 reachable   (2018-02-27T00:24:34Z)

Detailed information about the status of Cilium can be inspected with the cilium status --verbose command. Verbose output includes detailed IPAM state (allocated addresses), Cilium controller status, and details of the Proxy status.
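
For example, using the Cilium pod from the earlier examples:

$ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium status --verbose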

Logs

To retrieve log files of a cilium pod, run (replace cilium-1234 with a pod name returned by kubectl -n kube-system get pods -l k8s-app=cilium)

$ kubectl -n kube-system logs --timestamps cilium-1234

If the cilium pod was already restarted by the liveness probe after encountering an issue, it can be useful to retrieve the logs of the pod before the last restart:

$ kubectl -n kube-system logs --timestamps -p cilium-1234

Generic

When logged in to a host running Cilium, the cilium CLI can be invoked directly, e.g.:

$ cilium status
KVStore:                Ok   etcd: 1/1 connected: https://192.168.33.11:2379 - 3.2.7 (Leader)
ContainerRuntime:       Ok
Kubernetes:             Ok   OK
Kubernetes APIs:        ["core/v1::Endpoint", "extensions/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service"]
Cilium:                 Ok   OK
NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
IPv4 address pool:      261/65535 allocated
IPv6 address pool:      4/4294967295 allocated
Controller Status:      20/20 healthy
Proxy Status:           OK, ip 10.0.28.238, port-range 10000-20000
Cluster health:   2/2 reachable   (2018-04-11T15:41:01Z)

Connectivity Problems

Cilium connectivity tests

The Cilium connectivity test deploys a series of services, deployments, and CiliumNetworkPolicies which use various connectivity paths to connect to each other. Connectivity paths include with and without service load-balancing and various network policy combinations.

Note

The connectivity tests will only work in a namespace with no other pods or network policies applied. If a Cilium Clusterwide Network Policy is enabled, it may also interfere with this connectivity check.

To run the connectivity tests, create an isolated test namespace called cilium-test and deploy the tests into it:

$ kubectl create ns cilium-test
$ kubectl apply --namespace=cilium-test -f |SCM_WEB|/examples/kubernetes/connectivity-check/connectivity-check.yaml

The tests cover various functionality of the system. Each test type is called out below; if a test passes, it suggests that the referenced subsystem is functional.

  • Pod-to-pod (intra-host): BPF routing is functional
  • Pod-to-pod (inter-host): Data plane, routing, network
  • Pod-to-service (intra-host): BPF service map lookup
  • Pod-to-service (inter-host): VXLAN overlay port if used
  • Pod-to-external resource: Egress, CiliumNetworkPolicy, masquerade

The pod name indicates the connectivity variant and the readiness and liveness gate indicates success or failure of the test:

$ kubectl get pods
NAME                                                     READY   STATUS             RESTARTS   AGE
echo-a-9b85dd869-292s2                                   1/1     Running            0          8m37s
echo-b-c7d9f4686-gdwcs                                   1/1     Running            0          8m37s
host-to-b-multi-node-clusterip-6d496f7cf9-956jb          1/1     Running            0          8m37s
host-to-b-multi-node-headless-bd589bbcf-jwbh2            1/1     Running            0          8m37s
pod-to-a-7cc4b6c5b8-9jfjb                                1/1     Running            0          8m36s
pod-to-a-allowed-cnp-6cc776bb4d-2cszk                    1/1     Running            0          8m36s
pod-to-a-external-1111-5c75bd66db-sxfck                  1/1     Running            0          8m35s
pod-to-a-l3-denied-cnp-7fdd9975dd-2pp96                  1/1     Running            0          8m36s
pod-to-b-intra-node-9d9d4d6f9-qccfs                      1/1     Running            0          8m35s
pod-to-b-multi-node-clusterip-5956c84b7c-hwzfg           1/1     Running            0          8m35s
pod-to-b-multi-node-headless-6698899447-xlhfw            1/1     Running            0          8m35s
pod-to-external-fqdn-allow-google-cnp-667649bbf6-v6rf8   1/1     Running            0

Information about test failures can be determined by describing a failed test pod:

$ kubectl describe pod pod-to-b-intra-node-hostport
  Warning  Unhealthy  6s (x6 over 56s)   kubelet, agent1    Readiness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 40000: Connection refused
  Warning  Unhealthy  2s (x3 over 52s)   kubelet, agent1    Liveness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 40000: Connection refused

Checking cluster connectivity health

Cilium can rule out network fabric related issues when troubleshooting connectivity issues by providing reliable health and latency probes between all cluster nodes and a simulated workload running on each node.

By default when Cilium is run, it launches instances of cilium-health in the background to determine the overall connectivity status of the cluster. This tool periodically runs bidirectional traffic across multiple paths through the cluster and through each node using different protocols to determine the health status of each path and protocol. At any point in time, cilium-health may be queried for the connectivity status of the last probe.

$ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium-health status
Probe time:   2018-06-16T09:51:58Z
Nodes:
  ip-172-0-52-116.us-west-2.compute.internal (localhost):
    Host connectivity to 172.0.52.116:
      ICMP to stack: OK, RTT=315.254µs
      HTTP to agent: OK, RTT=368.579µs
    Endpoint connectivity to 10.2.0.183:
      ICMP to stack: OK, RTT=190.658µs
      HTTP to agent: OK, RTT=536.665µs
  ip-172-0-117-198.us-west-2.compute.internal:
    Host connectivity to 172.0.117.198:
      ICMP to stack: OK, RTT=1.009679ms
      HTTP to agent: OK, RTT=1.808628ms
    Endpoint connectivity to 10.2.1.234:
      ICMP to stack: OK, RTT=1.016365ms
      HTTP to agent: OK, RTT=2.29877ms

For each node, the connectivity will be displayed for each protocol and path, both to the node itself and to an endpoint on that node. The latency specified is a snapshot at the last time a probe was run, which is typically once per minute. The ICMP connectivity row represents Layer 3 connectivity to the networking stack, while the HTTP connectivity row represents connection to an instance of the cilium-health agent running on the host or as an endpoint.

Monitoring Datapath State

Sometimes you may experience broken connectivity, which can have a number of different causes. A common cause is unwanted packet drops at the networking level. The tool cilium monitor allows you to quickly inspect and see if and where packet drops happen. The following is an example output (use kubectl exec as in previous examples if running with Kubernetes):

$ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium monitor --type drop
Listening for events on 2 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
xx drop (Policy denied) to endpoint 25729, identity 261->264: fd02::c0a8:210b:0:bf00 -> fd02::c0a8:210b:0:6481 EchoRequest
xx drop (Policy denied) to endpoint 25729, identity 261->264: fd02::c0a8:210b:0:bf00 -> fd02::c0a8:210b:0:6481 EchoRequest
xx drop (Policy denied) to endpoint 25729, identity 261->264: 10.11.13.37 -> 10.11.101.61 EchoRequest
xx drop (Policy denied) to endpoint 25729, identity 261->264: 10.11.13.37 -> 10.11.101.61 EchoRequest
xx drop (Invalid destination mac) to endpoint 0, identity 0->0: fe80::5c25:ddff:fe8e:78d8 -> ff02::2 RouterSolicitation

The above indicates that a packet to endpoint ID 25729 has been dropped due to violation of the Layer 3 policy.

Handling drop (CT: Map insertion failed)

If connectivity fails and cilium monitor --type drop shows xx drop (CT: Map insertion failed), then it is likely that the connection tracking table is filling up and the automatic adjustment of the garbage collector interval is insufficient. Set --conntrack-gc-interval to an interval lower than the default. Alternatively, the values for bpf-ct-global-any-max and bpf-ct-global-tcp-max can be increased. Lowering conntrack-gc-interval trades additional CPU for a smaller table, while raising bpf-ct-global-any-max and bpf-ct-global-tcp-max trades additional memory for a larger table (see the sketch below).
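
For example, the options can be passed to cilium-agent as flags. This is a sketch; the values shown are purely illustrative and appropriate sizes depend on your workload:

$ cilium-agent --conntrack-gc-interval=2m \
               --bpf-ct-global-tcp-max=1000000 \
               --bpf-ct-global-any-max=500000 \
               [...]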

Policy Troubleshooting

Ensure pod is managed by Cilium

A potential cause for policy enforcement not functioning as expected is that the networking of the pod selected by the policy is not being managed by Cilium. The following situations result in unmanaged pods:

  • The pod is running in host networking and will use the host’s IP address directly. Such pods have full network connectivity but Cilium will not provide security policy enforcement for such pods.
  • The pod was started before Cilium was deployed. Cilium only manages pods that have been deployed after Cilium itself was started. Cilium will not provide security policy enforcement for such pods.

If pod networking is not managed by Cilium, ingress and egress policy rules selecting the respective pods will not be applied. See the section Network Policy for more details.

You can run the following script to list the pods which are not managed by Cilium:

$ ./contrib/k8s/k8s-unmanaged.sh
kube-system/cilium-hqpk7
kube-system/kube-addon-manager-minikube
kube-system/kube-dns-54cccfbdf8-zmv2c
kube-system/kubernetes-dashboard-77d8b98585-g52k5
kube-system/storage-provisioner

See section Policy Tracing for details and examples on how to use the policy tracing feature.

Understand the rendering of your policy

There are always multiple ways to approach a problem. Cilium can provide the rendering of the aggregate policy provided to it, leaving you to simply compare with what you expect the policy to actually be rather than search (and potentially overlook) every policy. At the expense of reading a very large dump of an endpoint, this is often a faster path to discovering errant policy requests in the Kubernetes API.

Start by finding the endpoint you are debugging from the following list. There are several cross references for you to use in this list, including the IP address and pod labels:

kubectl -n kube-system exec -ti cilium-q8wvt -- cilium endpoint list

When you find the correct endpoint, the first column of every row is the endpoint ID. Use that to dump the full endpoint information:

kubectl -n kube-system exec -ti cilium-q8wvt -- cilium endpoint get 59084
(screenshot: example output of cilium endpoint get)

Importing this dump into a JSON-friendly editor can help browse and navigate the information here. At the top level of the dump, there are two nodes of note:

  • spec: The desired state of the endpoint
  • status: The current state of the endpoint

This is the standard Kubernetes control loop pattern. Cilium is the controller here, and it is iteratively working to bring the status in line with the spec.

Opening the status, we can drill down through policy.realized.l4. Do your ingress and egress rules match what you expect? If not, the reference to the errant rules can be found in the derived-from-rules node.
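
For example, to pull out just the realized L4 policy from the dump, you can filter the JSON output locally. This is a sketch: it assumes jq is installed on your workstation, uses the endpoint ID from the previous command, and assumes the default JSON output of cilium endpoint get (an array of endpoint objects):

$ kubectl -n kube-system exec cilium-q8wvt -- cilium endpoint get 59084 \
    | jq '.[0].status.policy.realized.l4'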

Symptom Library

Node to node traffic is being dropped

Symptom

Endpoint to endpoint communication on a single node succeeds but communication fails between endpoints across multiple nodes.

Troubleshooting steps:
  1. Run cilium-health status on the node of the source and destination endpoint. It should describe the connectivity from that node to other nodes in the cluster, and to a simulated endpoint on each other node. Identify points in the cluster that cannot talk to each other. If the command does not describe the status of the other node, there may be an issue with the KV-Store.
  2. Run cilium monitor on the node of the source and destination endpoint. Look for packet drops.

When running in Overlay Network Mode:

  1. Run cilium bpf tunnel list and verify that each Cilium node is aware of the other nodes in the cluster. If not, check the logfile for errors.

  2. If nodes are being populated correctly, run tcpdump -n -i cilium_vxlan on each node to verify whether cross-node traffic is being forwarded correctly between nodes (see the example commands after this list).

    If packets are being dropped,

    • verify that the node IPs listed in cilium bpf tunnel list can reach each other.
    • verify that the firewall on each node allows UDP port 8472.
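
Example invocations of the commands above (the Cilium pod name is taken from the earlier examples; tcpdump must be run directly on the node):

$ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium bpf tunnel list
$ tcpdump -n -i cilium_vxlan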

When running in Direct / Native Routing Mode:

  1. Run ip route or check your cloud provider router and verify that you have routes installed to route the endpoint prefix between all nodes.
  2. Verify that the firewall on each node permits routing of the endpoint IPs.

Useful Scripts

Retrieve Cilium pod managing a particular pod

Identifies the Cilium pod that is managing a particular pod in a namespace:

k8s-get-cilium-pod.sh <pod> <namespace>

Example:

$ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-get-cilium-pod.sh
$ ./k8s-get-cilium-pod.sh luke-pod default
cilium-zmjj9

Execute a command in all Kubernetes Cilium pods

Run a command within all Cilium pods of a cluster

k8s-cilium-exec.sh <command>

Example:

$ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-cilium-exec.sh
$ ./k8s-cilium-exec.sh uptime
 10:15:16 up 6 days,  7:37,  0 users,  load average: 0.00, 0.02, 0.00
 10:15:16 up 6 days,  7:32,  0 users,  load average: 0.00, 0.03, 0.04
 10:15:16 up 6 days,  7:30,  0 users,  load average: 0.75, 0.27, 0.15
 10:15:16 up 6 days,  7:28,  0 users,  load average: 0.14, 0.04, 0.01

List unmanaged Kubernetes pods

Lists all Kubernetes pods in the cluster for which Cilium does not provide networking. This includes pods running in host-networking mode and pods that were started before Cilium was deployed.

k8s-unmanaged.sh

Example:

$ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-unmanaged.sh
$ ./k8s-unmanaged.sh
kube-system/cilium-hqpk7
kube-system/kube-addon-manager-minikube
kube-system/kube-dns-54cccfbdf8-zmv2c
kube-system/kubernetes-dashboard-77d8b98585-g52k5
kube-system/storage-provisioner

Reporting a problem

Automatic log & state collection

Before you report a problem, make sure to retrieve the necessary information from your cluster before the failure state is lost. Cilium provides a script to automatically grab logs and retrieve debug information from all Cilium pods in the cluster.

The script has the following list of prerequisites:

  • Requires Python >= 2.7.*
  • Requires kubectl.
  • kubectl should be pointing to your cluster before running the tool.

You can download the latest version of the cilium-sysdump tool using the following command:

curl -sLO https://github.com/cilium/cilium-sysdump/releases/latest/download/cilium-sysdump.zip
python cilium-sysdump.zip

You can specify from which nodes to collect the system dumps by passing node IP addresses via the --nodes argument:

python cilium-sysdump.zip --nodes=$NODE1_IP,$NODE2_IP

Use --help to see more options:

python cilium-sysdump.zip --help

Single Node Bugtool

If you are not running Kubernetes, it is also possible to run the bug collection tool manually with the scope of a single node:

The cilium-bugtool captures potentially useful information about your environment for debugging. The tool is meant to be used for debugging a single Cilium agent node. In the Kubernetes case, if you have multiple Cilium pods, the tool can retrieve debugging information from all of them. The tool works by archiving a collection of command output and files from several places. By default, it writes to the tmp directory.

Note that the command needs to be run from inside the Cilium pod/container.

$ cilium-bugtool

When running it with no options as shown above, it will try to copy various files and execute some commands. If kubectl is detected, it will search for Cilium pods. The default label is k8s-app=cilium, but the label and namespace can be changed via the k8s-label and k8s-namespace options respectively.

If you’d prefer to browse the dump, there is an HTTP flag.

$ cilium-bugtool --serve

If you want to capture the archive from a Kubernetes pod, then the process is a bit different:

# First we need to get the Cilium pod
$ kubectl get pods --namespace kube-system
  NAME                          READY     STATUS    RESTARTS   AGE
  cilium-kg8lv                  1/1       Running   0          13m
  kube-addon-manager-minikube   1/1       Running   0          1h
  kube-dns-6fc954457d-sf2nk     3/3       Running   0          1h
  kubernetes-dashboard-6xvc7    1/1       Running   0          1h

# Run the bugtool from this pod
$ kubectl -n kube-system exec cilium-kg8lv cilium-bugtool
  [...]

# Copy the archive from the pod
$ kubectl cp kube-system/cilium-kg8lv:/tmp/cilium-bugtool-20180411-155146.166+0000-UTC-266836983.tar /tmp/cilium-bugtool-20180411-155146.166+0000-UTC-266836983.tar
  [...]

Note

Please check the archive for sensitive information and strip it away before sharing it with us.

Below is an approximate list of the kind of information in the archive.

  • Cilium status
  • Cilium version
  • Kernel configuration
  • Resolve configuration
  • Cilium endpoint state
  • Cilium logs
  • Docker logs
  • dmesg
  • ethtool
  • ip a
  • ip link
  • ip r
  • iptables-save
  • kubectl -n kube-system get pods
  • kubectl get pods,svc for all namespaces
  • uname
  • uptime
  • cilium bpf * list
  • cilium endpoint get for each endpoint
  • cilium endpoint list
  • hostname
  • cilium policy get
  • cilium service list

Debugging information

If you are not running Kubernetes, you can use the cilium debuginfo command to retrieve useful debugging information. If you are running Kubernetes, this command is automatically run as part of the system dump.

cilium debuginfo can print useful output from the Cilium API. The output is in Markdown format, so it can be used directly when reporting a bug on the issue tracker. Running without arguments will print to standard output, but you can also redirect the output to a file:

$ cilium debuginfo -f debuginfo.md

Note

Please check the debuginfo file for sensitive information and strip it away before sharing it with us.

Slack Assistance

The Cilium Slack community is a helpful first point of assistance to get help troubleshooting a problem or to discuss options on how to address a problem.

The slack community is open to everyone. You can request an invite email by visiting Slack.

Report an issue via GitHub

If you believe you have found an issue in Cilium, please open a GitHub issue and make sure to attach a system dump as described above so that developers have the best chance to reproduce the issue.

Special Interest Groups

All SIGs

The following is a list of special interest groups (SIGs) that meet on a regular interval. See the respective Slack channel for the exact meeting cadence and meeting links.

SIG | Meeting | Slack | Description
Datapath | Wednesdays, 08:00 PT | #sig-datapath | Owner of all BPF and Linux kernel related datapath code.
Documentation | None | #sig-docs | All documentation related discussions
Envoy | Biweekly on Thursdays, 09:00 PT | #sig-envoy | Envoy, Istio and maintenance of all L7 protocol parsers.
Policy | None | #sig-policy | All topics related to policy. The SIG is responsible for all security relevant APIs and the enforcement logic.
Release Management | None | #launchpad | Responsible for the release management and backport process.

How to create a SIG

  1. Open a new GitHub issue
  2. Specify the title “SIG-Request: <Name>”
  3. Provide a description
  4. Find two Cilium committers to support the SIG.
  5. Ask on #development to get the Slack channel and Zoom meeting created
  6. Submit a PR to update the documentation to get your new SIG listed

Slack

The Cilium community is maintaining an active Slack channel. Click here to request an invite.

Slack channels

Name Purpose
#bpf BPF specific questions
#development Development discussions
#general General user discussions & questions
#git GitHub notifications
#kubernetes Kubernetes specific questions
#sig-* SIG specific discussions
#testing CI and testing related discussions

Development Guide

We’re happy you’re interested in contributing to the Cilium project.

This section of the Cilium documentation will help you make sure you have an environment capable of testing changes to the Cilium source code, and that you understand the workflow of getting these changes reviewed and merged upstream.

How To Contribute

Clone and Provision Environment

  1. Make sure you have a GitHub account

  2. Clone the cilium repository into your GOPATH.

    mkdir -p $GOPATH/src/github.com/cilium
    cd $GOPATH/src/github.com/cilium
    git clone https://github.com/cilium/cilium.git
    cd cilium
    
  3. Set up your Development Setup

  4. Check the GitHub issues for good tasks to get started.

Submitting a pull request

Contributions must be submitted in the form of pull requests against the GitHub repository at: https://github.com/cilium/cilium.

  1. Fork the Cilium repository to your own personal GitHub space or request access to a Cilium developer account on Slack
  2. Push your changes to the topic branch in your fork of the repository.
  3. Submit a pull request on https://github.com/cilium/cilium.

Before hitting the submit button, please make sure that the following requirements have been met:

  1. Each commit compiles and is functional on its own to allow for bisecting of commits.

  2. All code is covered by unit and/or runtime tests where feasible.

  3. All changes have been tested and checked for regressions by running the existing testsuite against your changes. See the End-To-End Testing Framework section for additional details.

  4. All commits contain a well written commit description including a title, description and a Fixes: #XXX line if the commit addresses a particular GitHub issue. Note that the GitHub issue will be automatically closed when the commit is merged.

    apipanic: Log stack at debug level
    
    Previously, it was difficult to debug issues when the API panicked
    because only a single line like the following was printed:
    
    level=warning msg="Cilium API handler panicked" client=@ method=GET
    panic_message="write unix /var/run/cilium/cilium.sock->@: write: broken
    pipe"
    
    This patch logs the stack at this point at debug level so that it can at
    least be determined in developer environments.
    
    Fixes: #4191
    
    Signed-off-by: Joe Stringer <joe@cilium.io>
    
  5. If any of the commits fixes a particular commit already in the tree, that commit is referenced in the commit message of the bugfix. This ensures that whoever performs a backport will pull in all required fixes:

    daemon: use endpoint RLock in HandleEndpoint
    
    Fixes: a804c7c7dd9a ("daemon: wait for endpoint to be in ready state if specified via EndpointChangeRequest")
    
    Signed-off-by: André Martins <andre@cilium.io>
    
  6. All commits are signed off. See the section Developer’s Certificate of Origin.

  7. (optional) Pick the appropriate milestone for which this PR is being targeted, e.g. 1.6, 1.7. This is particularly important in the time frame between the feature freeze and the final release date.

  8. If you have permissions to do so, pick the right release-note label. These labels will be used to generate the release notes which will primarily be read by users.

    Labels When to set
    release-note/bug This is a non-trivial bugfix and is a user-facing bug
    release-note/major This is a major feature addition, e.g. Add MongoDB support
    release-note/minor This is a minor feature addition, e.g. Add support for a Kubernetes version
    release-note/misc This is not a user-facing change, e.g. Refactor endpoint package, or a bug fix of a non-released feature
    release-note/ci This is a CI feature or bug fix.
  9. Verify the release note text. If not explicitly changed, the title of the PR will be used for the release notes. If you want to change this, you can add a special section to the description of the PR. These release notes are primarily going to be read by users, so it is important that release notes for bugs and for major and minor features do not contain internal details of Cilium functionality, which are often irrelevant for users.

    Example of a bad release note

    ```release-note
    Fix concurrent access in k8s watchers structures
    ```
    

    Example of a good release note

    ```release-note
    Fix panic when Cilium received an invalid Cilium Network Policy from Kubernetes
    ```
    

    Note

    If multiple lines are provided, then the first line serves as the high level bullet point item and any additional line will be added as a sub item to the first line.

  10. If you have permissions, pick the right labels for your PR:

    Labels When to set
    kind/bug This is a bugfix worth mentioning in the release notes
    kind/enhancement This enhances existing functionality in Cilium
    kind/feature This is a feature
    priority/release-blocker This PR should block the current release
    needs-backport/X.Y PR needs to be backported to these stable releases
    backport/X.Y This is a backport PR; may only be set as part of the Backporting process
    upgrade-impact The code changes have a potential upgrade impact
    area/* (Optional) Code area this PR covers
  11. Open a draft pull request. GitHub provides the ability to create a Pull Request in “draft” mode. On the “New Pull Request” page, below the pull request description box there is a button for creating the pull request. Click the arrow and choose “Create draft pull request”. If your PR is still a work in progress, please select this mode. You will still be able to run the CI against it. Once the PR is ready for review, you can click the “Ready for review” button at the bottom of the page and reviewers will start reviewing. When you are actively changing your PR, set it back to draft mode to signal that reviewers do not need to spend time reviewing the PR right now. When it is ready for review again, mark it as such.


Getting a pull request merged

  1. Submit the pull request as described in the section Submitting a pull request. One of the reviewers will start a CI run by replying with a comment test-me-please as described in Triggering Pull-Request Builds With Jenkins. If you are a core team member, you may trigger the CI run yourself.

    1. Hound: basic golang/lint static code analyzer. You need to make the puppy happy.
    2. CI / Jenkins: Will run a series of tests:
      1. Unit tests
      2. Single node runtime tests
      3. Multi node Kubernetes tests
  2. As part of the submission, GitHub will have requested a review from the respective code owners according to the CODEOWNERS file in the repository.

    1. Address any feedback received from the reviewers
    2. You can push individual commits to address feedback and then rebase your branch at the end before merging.
  3. Owners of the repository will automatically adjust the labels on the pull request to track its state and progress towards merging.

  4. Once the PR has been reviewed and the CI tests have passed, the PR will be merged by one of the repository owners. In case this does not happen, ping us on Slack.

  5. If reviewers have requested changes and those changes have been addressed, re-request a review from the reviewers that requested the changes. Otherwise, those reviewers will not be notified and your PR will not receive any reviews. If the PR is considerably large (e.g. more than 200 lines changed and/or more than 6 commits), create a new commit for each review. This will make the review process smoother, as GitHub has limitations that prevent reviewers from only seeing the new changes added since the last time they reviewed the PR. Once all reviews are addressed, those commits should be squashed into the commits that introduced the changes. This can easily be accomplished with git rebase -i origin/master: in that window, move the new commits below the commits that introduced the changes and replace the word pick with fixup. In the following example, commit d2cb02265 will be melded into 9c62e62d8 and commit 146829b59 will be melded into 9400fed20.

    pick 9c62e62d8 docs: updating contribution guide process
    fixup d2cb02265 joe + paul + chris changes
    pick 9400fed20 docs: fixing typo
    fixup 146829b59 Quetin and Maciej reviews
    

    Once this is done, you can force-push to your branch and request that your PR be merged.
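
    For example (a sketch; my-feature-branch is a placeholder for your topic branch, and --force-with-lease is used here as a safer alternative to a plain force push):

    git push --force-with-lease origin my-feature-branch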

Pull requests review process for committers

  1. Every committer in the committers team belongs to one or more other teams in the Cilium organization. If you would like to be added or removed from any team, please contact any of the maintainers.

  2. Once a PR is open, GitHub will automatically pick which teams should review the PR using the CODEOWNERS file. Each committer can see the PRs they need to review by filtering by reviews requested. A good filter is provided in this link so make sure to bookmark it.

  3. Belonging to a team does not mean that a committer should know every single line of code the team is maintaining. For this reason it is recommended that once you have reviewed a PR, if you feel that another pair of eyes is needed, you should re-request a review from the appropriate team. In the example below, the committer belonging to the CI team is re-requesting a review for other team members to review the PR. This allows other team members belonging to the CI team to see the PR as part of the PRs that require review in the filter provided above

    (screenshot: re-requesting a review from another team)
  4. When all review objectives for all CODEOWNERS are met, all required CI tests have passed and a proper release label has been set, you may set the ready-to-merge label to indicate that all criteria have been met. Maintainer’s little helper might set this label automatically if the previous requirements were met.

    Labels When to set
    ready-to-merge PR is ready to be merged

Weekly duties

Some members of the committers team will have rotational duties that change every week. The following steps describe how to perform those duties. Please submit changes to these steps if you have found a better way to perform each duty.

Pull request review process for Janitors team

Note

These instructions assume that whoever is reviewing is a member of the Cilium GitHub organization or has the status of a contributor. This is required to obtain the privileges to modify GitHub labels on the pull request.

Dedicated expectation time for each member of the Janitors team: follow the next steps 1 to 2 times per day. Works best if done first thing in the working day.

  1. Review all PRs that are requesting a review from you;

  2. If the PR was opened by a non-committer (e.g. an external contributor), assign yourself to that PR and make sure to keep track of it until it gets reviewed and merged. This may extend beyond your assigned week of Janitor duty.

  3. Review overall correctness of the PR according to the rules specified in the section Submitting a pull request.

    Set the labels accordingly; a bot called maintainer’s little helper might automatically help you with this.

    Labels When to set
    dont-merge/needs-sign-off Some commits are not signed off
    needs-rebase PR is outdated and needs to be rebased
  4. Validate that bugfixes are marked with kind/bug and validate whether the assessment of backport requirements as requested by the submitter conforms to the Backport Criteria.

    Labels When to set
    needs-backport/X.Y PR needs to be backported to these stable releases
  5. If the PR is subject to backport, validate that the PR does not mix bugfixes and refactoring of code, as that will heavily complicate the backport process. If it does, ask for the PR to be split.

  6. Validate the release-note/* label and check the PR title for release note suitability. Put yourself into the perspective of a future release notes reader with lack of context and ensure the title is precise but brief.

    Labels When to set
    dont-merge/needs-release-note Do NOT merge PR, needs a release note
    release-note/bug This is a non-trivial bugfix and is a user-facing bug
    release-note/major This is a major feature addition, e.g. Add MongoDB support
    release-note/minor This is a minor feature addition, e.g. Add support for a Kubernetes version
    release-note/misc This is not a user-facing change, e.g. Refactor endpoint package, or a bug fix of a non-released feature
    release-note/ci This is a CI feature or bug fix.
  7. Check for upgrade compatibility impact and if in doubt, set the label upgrade-impact and discuss in the Slack channel or in the weekly meeting.

    Labels When to set
    upgrade-impact The code changes have a potential upgrade impact
  8. When all review objectives for all CODEOWNERS are met, all CI tests have passed, and all reviewers have approved the requested changes, merge the PR by clicking the “Rebase and merge” button.

  9. Merge PRs that have the ready-to-merge label set.

Triage issues for Triage team

Dedicated expectation time for each member of the Triage team: 15 to 30 minutes per day. Works best if done first thing in the working day.

  1. Committers belonging to the Triage team should make sure that:

    1. Issues opened by community users are tracked down:

      1. Add the label kind/community-report;
      2. If feasible, try to reproduce the issue described;
      3. Assign a member that is responsible for that code section to that GitHub issue;
      4. If it is a relevant bug to the rest of the committers, bring the issue up in the weekly meeting.
    2. Issues recently commented are not left out unanswered:

      1. If there is someone already assigned to that GitHub issue and that committer hasn’t provided an answer to that user for a while, ping that committer directly on Slack;
      2. If the issue cannot be solved, bring the issue up in the weekly meeting.

Backporting PRs for Backport team

Dedicated expectation time for each member of Backport team: 60 minutes per week depending on releases that need to be performed at the moment.

Even if the next release is not imminently planned, it is still important to perform backports to keep the process smooth and to catch potential regressions in stable branches as soon as possible. Delayed backports can also delay releases, which is important to avoid, especially if there are security-sensitive bug fixes that require an immediate release.

In addition, when a backport PR is open, the person opening it is responsible for driving it to completion, even if that stretches beyond their assigned week of backporting duty. If this is not feasible (e.g. PTO), you are responsible for initiating a handover of the PR to the next week’s backporters.

If you can’t backport a PR due to technical constraints, feel free to contact the original author of that PR directly so they can backport the PR themselves.

Follow the Backporting process guide to know how to perform this task.

Developer’s Certificate of Origin

To improve tracking of who did what, we’ve introduced a “sign-off” procedure.

The sign-off is a simple line at the end of the explanation for the commit, which certifies that you wrote it or otherwise have the right to pass it on as open-source work. The rules are pretty simple: if you can certify the below:

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

then you just add a line saying:

Signed-off-by: Random J Developer <random@developer.example.org>

Use your real name (sorry, no pseudonyms or anonymous contributions).
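
If your name and email address are configured in Git, the sign-off line can be added automatically when committing:

$ git commit -s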

Cilium Committer Grant/Revocation Policy

A Cilium committer is a participant in the project with the ability to commit code directly to the master repository. Commit access grants a broad ability to affect the progress of the project as presented by its most important artifact, the code and related resources that produce working binaries of Cilium. As such it represents a significant level of trust in an individual’s commitment to working with other committers and the community at large for the benefit of the project. It can not be granted lightly and, in the worst case, must be revocable if the trust placed in an individual was inappropriate.

This document suggests guidelines for granting and revoking commit access. It is intended to provide a framework for evaluation of such decisions without specifying deterministic rules that wouldn’t be sensitive to the nuance of specific situations. In the end the decision to grant or revoke committer privileges is a judgment call made by the existing set of committers.

Expectations for Developers with commit access
Pre-requisites

Be familiar with the Development Guide.

Review

Code (yours or others’) must be reviewed publicly (by you or others) before you push it to the repository. With one exception (see below), every change needs at least one review.

If one or more people know an area of code particularly well, code that affects that area should ordinarily get a review from one of them.

The riskier, more subtle, or more complicated the change, the more careful the review required. When a change needs careful review, use good judgment regarding the quality of reviews. If a change adds 1000 lines of new code, and a review posted 5 minutes later says just “Looks good,” then this is probably not a quality review.

(The size of a change is correlated with the amount of care needed in review, but it is not strictly tied to it. A search and replace across many files may not need much review, but one-line optimization changes can have widespread implications.)

Your own small changes to fix a recently broken build (“make”) or tests (“make check”), that you believe to be visible to a large number of developers, may be checked in without review. If you are not sure, ask for review.

Regularly review submitted code in areas where you have expertise. Consider reviewing other code as well.

Git conventions

If you apply a change (yours or another’s) then it is your responsibility to handle any resulting problems, especially broken builds and other regressions. If it is someone else’s change, then you can ask the original submitter to address it. Regardless, you need to ensure that the problem is fixed in a timely way. The definition of “timely” depends on the severity of the problem.

If a bug is present on master and other branches, fix it on master first, then backport the fix to other branches. Straightforward backports do not require additional review (beyond that for the fix on master).

Feature development should be done only on master. Occasionally it makes sense to add a feature to the most recent release branch, before the first actual release of that branch. These should be handled in the same way as bug fixes, that is, first implemented on master and then backported.

Keep the authorship of a commit clear by maintaining a correct list of “Signed-off-by:”s. If a confusing situation comes up, as it occasionally does, bring it up in the development forums. If you explain the use of “Signed-off-by:” to a new developer, explain not just how but why, since the intended meaning of “Signed-off-by:” is more important than the syntax.

Use Reported-by: and Tested-by: tags in commit messages to indicate the source of a bug report.

Keep the AUTHORS file up to date.

Granting Commit Access

Granting commit access should be considered when a candidate has demonstrated the following in their interaction with the project:

  • Contribution of significant new features through the patch submission process where:
    • Submissions are free of obvious critical defects
    • Submissions do not typically require many iterations of improvement to be accepted
  • Consistent participation in code review of other’s patches, including existing committers, with comments consistent with the overall project standards
  • Assistance to those in the community who are less knowledgeable through active participation in project forums.
  • Plans for sustained contribution to the project compatible with the project’s direction as viewed by current committers.
  • Commitment to meet the expectations described in “Expectations for Developers with commit access”

The process to grant commit access to a candidate is simple:

  • An existing committer nominates the candidate by sending an email to all existing committers with information substantiating the contributions of the candidate in the areas described above.
  • All existing committers discuss the pros and cons of granting commit access to the candidate in the email thread.
  • When the discussion has converged or a reasonable time has elapsed without discussion developing (e.g. a few business days) the nominator calls for a final decision on the candidate with a followup email to the thread.
  • Each committer may vote yes, no, or abstain by replying to the email thread. A failure to reply is an implicit abstention.
  • After votes from all existing committers have been collected or a reasonable time has elapsed for them to be provided (e.g. a couple of business days) the votes are evaluated. To be granted commit access the candidate must receive yes votes from a majority of the existing committers and zero no votes. Since a no vote is effectively a veto of the candidate it should be accompanied by a reason for the vote.
  • The nominator summarizes the result of the vote in an email to all existing committers.
  • If the vote to grant commit access passed, the candidate is contacted with an invitation to become a committer to the project which asks them to agree to the committer expectations documented on the project web site.
  • If the candidate agrees, access is granted by setting up commit access to the repos.
Revoking Commit Access

There are two situations in which commit access might be revoked.

The straightforward situation is a committer who is no longer active in the project and has no plans to become active in the near future. The process in this case is:

  • Any time after a committer has been inactive for more than 6 months any other committer to the project may identify that committer as a candidate for revocation of commit access due to inactivity.
  • The plans of revocation should be sent in a private email to the candidate.
  • If the candidate for removal states plans to continue participating no action is taken and this process terminates.
  • If the candidate replies they no longer require commit access then commit access is removed and a notification is sent to the candidate and all existing committers.
  • If the candidate cannot be reached within 1 week of the first attempt to contact them, this process continues.
  • A message proposing removal of commit access is sent to the candidate and all other committers.
  • If the candidate for removal states plans to continue participating no action is taken.
  • If the candidate replies they no longer require commit access then their access is removed.
  • If the candidate cannot be reached within 2 months of the second attempt to contact them, access is removed.
  • In any case, where access is removed, this fact is published through an email to all existing committers (including the candidate for removal).

The more difficult situation is a committer who is behaving in a manner that is viewed as detrimental to the future of the project by other committers. This is a delicate situation with the potential for the creation of division within the greater community and should be handled with care. The process in this case is:

  • Discuss the behavior of concern with the individual privately and explain why you believe it is detrimental to the project. Stick to the facts and keep the email professional. Avoid personal attacks and the temptation to hypothesize about unknowable information such as the other’s motivations. Make it clear that you would prefer not to discuss the behavior more widely but will have to raise it with other contributors if it does not change. Ideally the behavior is eliminated and no further action is required. If not,
  • Start an email thread with all committers, including the source of the behavior, describing the behavior and the reason it is detrimental to the project. The message should have the same tone as the private discussion and should generally repeat the same points covered in that discussion. The person whose behavior is being questioned should not be surprised by anything presented in this discussion. Ideally the wider discussion provides more perspective to all participants and the issue is resolved. If not,
  • Start an email thread with all committers except the source of the detrimental behavior requesting a vote on revocation of commit rights. Cite the discussion among all committers and describe all the reasons why it was not resolved satisfactorily. This email should be carefully written with the knowledge that the reasoning it contains may be published to the larger community to justify the decision.
  • Each committer may vote yes, no, or abstain by replying to the email thread. A failure to reply is an implicit abstention.
  • After all votes have been collected or a reasonable time has elapsed for them to be provided (e.g. a couple of business days) the votes are evaluated. For the request to revoke commit access for the candidate to pass it must receive yes votes from two thirds of the existing committers.
  • Anyone who votes no must provide their reasoning.
  • If the proposal passes, then counter-arguments for the reasoning in no votes should also be documented along with the initial reasons the revocation was proposed. Ideally there should be no new counter-arguments supplied in a no vote, as all concerns should have surfaced in the discussion before the vote.
  • The original person to propose revocation summarizes the result of the vote in an email to all existing committers excepting the candidate for removal.
  • If the vote to revoke commit access passes, access is removed and the candidate for revocation is informed of that fact and the reasons for it as documented in the email requesting the revocation vote.
  • Ideally the revoked committer peacefully leaves the community and no further action is required. However, there is a distinct possibility that he/she will try to generate support for his/her point of view within the larger community. In this case the reasoning for removing commit access as described in the request for a vote will be published to the community.
Changing the Policy

The process for changing the policy is:

  • Propose the changes to the policy in an email to all current committers and request discussion.
  • After an appropriate period of discussion (a few days) update the proposal based on feedback if required and resend it to all current committers with a request for a formal vote.
  • After all votes have been collected or a reasonable time has elapsed for them to be provided (e.g. a couple of business days) the votes are evaluated. For the request to modify the policy to pass it must receive yes votes from two thirds of the existing committers.
Template Emails
Nomination to Grant Commit Access
I would like to nominate *[candidate]* for commit access. I believe
*[he/she]* has met the conditions for commit access described in the
committer grant policy on the project web site in the following ways:

*[list of requirements & evidence]*

Please reply to all in this message thread with your comments and
questions. If that discussion concludes favorably I will request a formal
vote on the nomination in a few days.
Vote to Grant Commit Access
I nominated *[candidate]* for commit access on *[date]*. Having allowed
sufficient time for discussion it's now time to formally vote on the
proposal.

Please reply to all in this thread with your vote of: YES, NO, or ABSTAIN.
A failure to reply will be counted as an abstention. If you vote NO, by our
policy you must include the reasons for that vote in your reply. The
deadline for votes is *[date and time]*.

If a majority of committers vote YES and there are zero NO votes commit
access will be granted.
Vote Results for Grant of Commit Access
The voting period for granting commit access to *[candidate]*, initiated
at *[date and time]*, is now closed with the following results:

YES: *[count of yes votes]* (*[% of voters]*)

NO: *[count of no votes]* (*[% of voters]*)

ABSTAIN: *[count of abstentions]* (*[% of voters]*)

Based on these results commit access *[is/is NOT]* granted.
Invitation to Accepted Committer
Due to your sustained contributions to the Cilium project we
would like to provide you with commit access to the project repository.
Developers with commit access must agree to fulfill specific
responsibilities described in the source repository:

    /Documentation/commit-access.rst

Please let us know if you would like to accept commit access and if so that
you agree to fulfill these responsibilities. Once we receive your response
we'll set up access. We're looking forward to continuing to work together to
advance the Cilium project.
Proposal to Remove Commit Access for Inactivity
Committer *[candidate]* has been inactive for *[duration]*. I have
attempted to privately contact *[him/her]* and *[he/she]* could not be
reached.

Based on this I would like to formally propose removal of commit access.
If a response to this message documenting the reasons to retain commit
access is not received by *[date]* access will be removed.
Notification of Commit Removal for Inactivity
Committer *[candidate]* has been inactive for *[duration]*. *[He/she]*
*[stated no commit access is required/failed to respond]* to the formal
proposal to remove access on *[date]*. Commit access has now been removed.
Proposal to Revoke Commit Access for Detrimental Behavior
I regret that I feel compelled to propose revocation of commit access for
*[candidate]*. I have privately discussed with *[him/her]* the following
reasons I believe *[his/her]* actions are detrimental to the project and we
have failed to come to a mutual understanding:

*[List of reasons and supporting evidence]*

Please reply to all in this thread with your thoughts on this proposal.  I
plan to formally propose a vote on the proposal on or after *[date and
time]*.

It is important to get all discussion points both for and against the
proposal on the table during the discussion period prior to the vote.
Please make it a high priority to respond to this proposal with your
thoughts.
Vote to Revoke Commit Access
I nominated *[candidate]* for revocation of commit access on *[date]*.
Having allowed sufficient time for discussion it's now time to formally
vote on the proposal.

Please reply to all in this thread with your vote of: YES, NO, or ABSTAIN.
A failure to reply will be counted as an abstention. If you vote NO, by our
policy you must include the reasons for that vote in your reply. The
deadline for votes is *[date and time]*.

If 2/3rds of committers vote YES commit access will be revoked.

The following reasons for revocation have been given in the original
proposal or during discussion:

*[list of reasons to remove access]*

The following reasons for retaining access were discussed:

*[list of reasons to retain access]*

The counter-argument for each reason for retaining access is:

*[list of counter-arguments for retaining access]*
Vote Results for Revocation of Commit Access
The voting period for revoking the commit access of *[candidate]* initiated
at *[date and time]* is now closed with the following results:

-  YES: *[count of yes votes]* (*[% of voters]*)

-  NO: *[count of no votes]* (*[% of voters]*)

-  ABSTAIN: *[count of abstentions]* (*[% of voters]*)

Based on these results commit access *[is/is NOT]* revoked. The following
reasons for retaining commit access were proposed in NO votes:

*[list of reasons]*

The counter-arguments for each of these reasons are:

*[list of counter-arguments]*
Notification of Commit Revocation for Detrimental Behavior
After private discussion with you and careful consideration of the
situation, the other committers to the Cilium project have
concluded that it is in the best interest of the project that your commit
access to the project repositories be revoked and this has now occurred.

The reasons for this decision are:

*[list of reasons for removing access]*

While your goals and those of the project no longer appear to be aligned we
greatly appreciate all the work you have done for the project and wish you
continued success in your future work.

Development Setup

Requirements

You need to have the following tools available in order to effectively contribute to Cilium:

Dependency       Version / Commit ID             Download Command
git              latest                          N/A (OS-specific)
clang            >= 10.0 (latest recommended)    N/A (OS-specific)
llvm             >= 10.0 (latest recommended)    N/A (OS-specific)
libelf-devel     latest                          N/A (OS-specific)
go               1.14.3                          N/A (OS-specific)
ginkgo           >= 1.4.0                        go get -u github.com/onsi/ginkgo/ginkgo
gomega           >= 1.2.0                        go get -u github.com/onsi/gomega
ineffassign      >= 1003c8b                      go get -u github.com/gordonklaus/ineffassign
Docker           OS-Dependent                    N/A (OS-specific)
Docker-Compose   OS-Dependent                    N/A (OS-specific)
python3-pip      latest                          N/A (OS-specific)

For Unit Testing, you will need to run docker without privileges. You can usually achieve this by adding your current user to the docker group.
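
For example, on most Linux distributions this can be done with the following command (log out and back in afterwards so the group change takes effect):

$ sudo usermod -aG docker $USER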

Finally, in order to run Cilium locally on VMs, you need:

Dependency                          Version / Commit ID   Download Command
Vagrant                             >= 2.0                Vagrant Install Instructions
VirtualBox (if not using libvirt)   >= 5.2                N/A (OS-specific)

You should start with the Getting Started Guides, which walk you through the setup, such as installing Vagrant, getting the Cilium sources, and going through some Cilium basics.

Vagrant Setup

While the Getting Started Guides use a Vagrantfile tuned for the basic walkthrough, the Vagrantfile in the root of the Cilium tree depends on a number of environment variables and network settings that are managed via contrib/vagrant/start.sh.

Using the provided Vagrantfile

To bring up a Vagrant VM with Cilium plus dependencies installed, run:

$ contrib/vagrant/start.sh

This will create and run a vagrant VM based on the base box cilium/ubuntu. The box is currently available for the following providers:

  • virtualbox
Options

The following environment variables can be set to customize the VMs brought up by vagrant:

  • NWORKERS=n: Number of child nodes you want to start with the master, default 0.
  • RELOAD=1: Issue a vagrant reload instead of vagrant up, useful to resume halted VMs.
  • NO_PROVISION=1: Avoid provisioning Cilium inside the VM. Supports quick restart without recompiling all of Cilium.
  • NFS=1: Use NFS for vagrant shared directories instead of rsync.
  • K8S=1: Build & install kubernetes on the nodes. k8s1 is the master node, which contains both master components: etcd, kube-controller-manager, kube-scheduler, kube-apiserver, and node components: kubelet, kube-proxy, kubectl and Cilium. When used in combination with NWORKERS=1 a second node is created, where k8s2 will be a kubernetes node, which contains: kubelet, kube-proxy, kubectl and cilium.
  • NETNEXT=1: Run with net-next kernel.
  • IPV4=1: Run Cilium with IPv4 enabled.
  • RUNTIME=x: Sets up the container runtime to be used inside a kubernetes cluster. Valid options are: docker, containerd and crio. If not set, it defaults to docker.
  • VAGRANT_DEFAULT_PROVIDER={virtualbox | libvirt | ...}
  • VM_SET_PROXY=https://127.0.0.1:80/: Sets up the VM’s https_proxy.
  • INSTALL=1: Restarts the installation of Cilium, Kubernetes, etc. Only useful when the installation was interrupted.
  • MAKECLEAN=1: Execute make clean before building cilium in the VM.

If you want to start the VM with Cilium using containerd as the runtime, with Kubernetes installed, plus one worker, run:

$ RUNTIME=containerd K8S=1 NWORKERS=1 contrib/vagrant/start.sh

If you want to get VM status, run:

$ RUNTIME=containerd K8S=1 NWORKERS=1 vagrant status

If you want to connect to the Kubernetes cluster running inside the developer VM via kubectl from your host machine, set the KUBECONFIG environment variable to include the new kubeconfig file:

$ export KUBECONFIG=$KUBECONFIG:$GOPATH/src/github.com/cilium/cilium/vagrant.kubeconfig

and add 127.0.0.1 k8s1 to your hosts file.
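
For example (a minimal sketch; editing /etc/hosts requires root):

$ echo "127.0.0.1 k8s1" | sudo tee -a /etc/hosts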

If you have any issue with the provided vagrant box cilium/ubuntu, or need a different box format, you may build the box yourself using the packer scripts.

Manual Installation

Alternatively you can import the vagrant box cilium/ubuntu directly and manually install Cilium:

$ vagrant init cilium/ubuntu
$ vagrant up
$ vagrant ssh [...]
$ cd go/src/github.com/cilium/cilium/
$ make
$ sudo make install
$ sudo mkdir -p /etc/sysconfig/
$ sudo cp contrib/systemd/cilium.service /etc/systemd/system/
$ sudo cp contrib/systemd/cilium  /etc/sysconfig/cilium
$ sudo usermod -a -G cilium vagrant
$ sudo systemctl enable cilium
$ sudo systemctl restart cilium
Notes

Your Cilium tree is mapped to the VM so that you do not need to keep manually copying files between your host and the VM. Folders are synced automatically by default using VirtualBox Shared Folders. You can also use NFS to access your Cilium tree from the VM by setting the environment variable NFS (mentioned above) before running the startup script (export NFS=1). Note that your host firewall must have a variety of ports open; the Vagrantfile will inform you of the addresses and ports that need to be opened to enable NFS.

Note

The OSX file system is case insensitive by default, which can confuse git. At the time of writing, the Cilium repo has no file names that would be considered to refer to the same file on a case-insensitive file system. Regardless, it may be useful to create a disk image with a case-sensitive file system for holding your git repos.

Note

VirtualBox for OSX currently (version 5.1.22) always reports host-only networks’ prefix length as 64. Cilium needs this prefix to be 16, and the startup script will check for this. This check always fails when using VirtualBox on OSX, but it is safe to let the startup script reset the prefix length to 16.

Note

Make sure your host NFS configuration is set up to use tcp:

# cat /etc/nfs.conf
...
[nfsd]
# grace-time=90
tcp=y
# vers2=n
# vers3=y
...

If the provisioning script fails for some reason, you should bring the VM down before trying again:

$ vagrant halt

Local Development in Vagrant Box

See Development Setup for information on how to setup the development environment.

When the development VM is provisioned, it builds and installs Cilium. After the initial build and install you can do further building and testing incrementally inside the VM. vagrant ssh takes you to the Cilium source tree directory (/home/vagrant/go/src/github.com/cilium/cilium) by default, and the following commands assume that you are working within that directory.

Build Cilium

Assuming you have synced (rsync) the source tree after you have made changes, or the tree is automatically in sync via NFS or guest additions folder sharing, you can issue a build as follows:

$ make
Install to dev environment

After a successful build and test you can re-install Cilium by:

$ sudo -E make install
Restart Cilium service

To run the newly installed version of Cilium, restart the service:

$ sudo systemctl restart cilium

You can verify the service and cilium-agent status by the following commands, respectively:

$ sudo systemctl status cilium
$ cilium status

Making Changes

  1. Create a topic branch: git checkout -b myBranch master
  2. Make the changes you want
  3. Separate the changes into logical commits.
    1. Describe the changes in the commit messages. Focus on answering why the change is required, and document anything that might be unexpected.
    2. If any description is required to understand your code changes, then those instructions should be code comments instead of statements in the commit description.
  4. Make sure your changes meet the following criteria:
    1. New code is covered by Unit Testing.
    2. End to end integration / runtime tests have been extended or added. If not required, mention in the commit message what existing test covers the new code.
    3. Follow-up commits are squashed together nicely. Commits should separate logical chunks of code and not represent a chronological list of changes.
  5. Run git diff --check to catch obvious whitespace violations.
  6. Run make to build your changes. This will also run go fmt and error out on any golang formatting errors.
  7. See Unit Testing on how to run unit tests.
  8. See End-To-End Testing Framework for information on how to run the end-to-end integration tests.
  9. If you are making documentation changes, you can generate documentation files and serve them locally on http://localhost:9081 by running make render-docs. This make target works both inside and outside the Vagrant VM, assuming that docker is running in the environment.
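
As a minimal sketch combining steps 5, 6 and 9 above, you might run the following from the top of the tree before opening a pull request:

$ git diff --check
$ make
$ make render-docs    # serves the generated documentation on http://localhost:9081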

Add/update a golang dependency

Let's assume we want to add github.com/containernetworking/cni version v0.5.2:

$ go get github.com/containernetworking/cni@v0.5.2
$ go mod tidy
$ go mod vendor
$ git add go.mod go.sum vendor/

The first run can take a while as it downloads all dependencies to your local cache, but subsequent runs will be faster.

Updating k8s is a special case; for that, one needs to do:

$ # get the tag we are updating (for example ``v0.17.3`` corresponds to k8s ``v1.17.3``)
$ # open go.mod and search and replace all ``v0.17.3`` with the version
$ # that we are trying to upgrade with, for example: ``v0.17.4``.
$ # Close the file and run:
$ go mod tidy
$ go mod vendor
$ make generate-k8s-api
$ git add go.mod go.sum vendor/

Optional: Docker and IPv6

Note that these instructions are only relevant if you want your Docker containers to have IPv6 addresses.

To enable IPv6 addresses, follow these steps:

  1. Edit /etc/docker/daemon.json and set the ipv6 key to true.
{
  "ipv6": true
}

If that alone doesn’t work, try also assigning a fixed range; many people have reported trouble with IPv6 and Docker.

{
  "ipv6": true,
  "fixed-cidr-v6": "2001:db8:1::/64"
}

And then:

ip -6 route add 2001:db8:1::/64 dev docker0
sysctl net.ipv6.conf.default.forwarding=1
sysctl net.ipv6.conf.all.forwarding=1
  2. Restart the docker daemon to pick up the new configuration.
  3. The new command for creating a network managed by Cilium:
$ docker network create --ipv6 --driver cilium --ipam-driver cilium cilium-net

Now new containers will have an IPv6 address assigned to them.
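
To verify, you can start a container attached to this network and inspect its IPv6 address; the container name and image below are only illustrative:

$ docker run -d --name ipv6-test --net cilium-net nginx
$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.GlobalIPv6Address}}{{end}}' ipv6-test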

Debugging

Datapath code

The tool cilium monitor can also be used to retrieve debugging information from the BPF based datapath. Debugging messages are sent if either the cilium-agent itself or the respective endpoint is in debug mode. The debug mode of the agent can be enabled by starting cilium-agent with the option --debug enabled or by running cilium config debug=true for an already running agent. Debugging of an individual endpoint can be enabled by running cilium endpoint config ID debug=true. Running cilium monitor -v will print the normal form of monitor output along with debug messages:

$ cilium endpoint config 731 debug=true
Endpoint 731 configuration updated successfully
$ cilium monitor -v
Press Ctrl-C to quit
level=info msg="Initializing dissection cache..." subsys=monitor
<- endpoint 745 flow 0x6851276 identity 4->0 state new ifindex 0 orig-ip 0.0.0.0: 8e:3c:a3:67:cc:1e -> 16:f9:cd:dc:87:e5 ARP
-> lxc_health: 16:f9:cd:dc:87:e5 -> 8e:3c:a3:67:cc:1e ARP
CPU 00: MARK 0xbbe3d555 FROM 0 DEBUG: Inheriting identity=1 from stack
<- host flow 0xbbe3d555 identity 1->0 state new ifindex 0 orig-ip 0.0.0.0: 10.11.251.76:57896 -> 10.11.166.21:4240 tcp ACK
CPU 00: MARK 0xbbe3d555 FROM 0 DEBUG: Successfully mapped addr=10.11.251.76 to identity=1
CPU 00: MARK 0xbbe3d555 FROM 0 DEBUG: Attempting local delivery for container id 745 from seclabel 1
CPU 00: MARK 0xbbe3d555 FROM 745 DEBUG: Conntrack lookup 1/2: src=10.11.251.76:57896 dst=10.11.166.21:4240
CPU 00: MARK 0xbbe3d555 FROM 745 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=0
CPU 00: MARK 0xbbe3d555 FROM 745 DEBUG: CT entry found lifetime=21925, revnat=0
CPU 00: MARK 0xbbe3d555 FROM 745 DEBUG: CT verdict: Established, revnat=0
-> endpoint 745 flow 0xbbe3d555 identity 1->4 state established ifindex lxc_health orig-ip 10.11.251.76: 10.11.251.76:57896 -> 10.11.166.21:4240 tcp ACK

Passing -v -v supports deeper detail, for example:

$ cilium endpoint config 3978 debug=true
Endpoint 3978 configuration updated successfully
$ cilium monitor -v -v --hex
Listening for events on 2 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
------------------------------------------------------------------------------
CPU 00: MARK 0x1c56d86c FROM 3978 DEBUG: 70 bytes Incoming packet from container ifindex 85
00000000  33 33 00 00 00 02 ae 45  75 73 11 04 86 dd 60 00  |33.....Eus....`.|
00000010  00 00 00 10 3a ff fe 80  00 00 00 00 00 00 ac 45  |....:..........E|
00000020  75 ff fe 73 11 04 ff 02  00 00 00 00 00 00 00 00  |u..s............|
00000030  00 00 00 00 00 02 85 00  15 b4 00 00 00 00 01 01  |................|
00000040  ae 45 75 73 11 04 00 00  00 00 00 00              |.Eus........|
CPU 00: MARK 0x1c56d86c FROM 3978 DEBUG: Handling ICMPv6 type=133
------------------------------------------------------------------------------
CPU 00: MARK 0x1c56d86c FROM 3978 Packet dropped 131 (Invalid destination mac) 70 bytes ifindex=0 284->0
00000000  33 33 00 00 00 02 ae 45  75 73 11 04 86 dd 60 00  |33.....Eus....`.|
00000010  00 00 00 10 3a ff fe 80  00 00 00 00 00 00 ac 45  |....:..........E|
00000020  75 ff fe 73 11 04 ff 02  00 00 00 00 00 00 00 00  |u..s............|
00000030  00 00 00 00 00 02 85 00  15 b4 00 00 00 00 01 01  |................|
00000040  00 00 00 00                                       |....|
------------------------------------------------------------------------------
CPU 00: MARK 0x7dc2b704 FROM 3978 DEBUG: 86 bytes Incoming packet from container ifindex 85
00000000  33 33 ff 00 8a d6 ae 45  75 73 11 04 86 dd 60 00  |33.....Eus....`.|
00000010  00 00 00 20 3a ff fe 80  00 00 00 00 00 00 ac 45  |... :..........E|
00000020  75 ff fe 73 11 04 ff 02  00 00 00 00 00 00 00 00  |u..s............|
00000030  00 01 ff 00 8a d6 87 00  20 40 00 00 00 00 fd 02  |........ @......|
00000040  00 00 00 00 00 00 c0 a8  21 0b 00 00 8a d6 01 01  |........!.......|
00000050  ae 45 75 73 11 04 00 00  00 00 00 00              |.Eus........|
CPU 00: MARK 0x7dc2b704 FROM 3978 DEBUG: Handling ICMPv6 type=135
CPU 00: MARK 0x7dc2b704 FROM 3978 DEBUG: ICMPv6 neighbour soliciation for address b21a8c0:d68a0000

One of the most common issues when developing datapath code is that the BPF code cannot be loaded into the kernel. This frequently manifests as the endpoints appearing in the “not-ready” state and never switching out of it:

$ cilium endpoint list
ENDPOINT   POLICY        IDENTITY   LABELS (source:key[=value])   IPv6                     IPv4            STATUS
           ENFORCEMENT
48896      Disabled      266        container:id.server           fd02::c0a8:210b:0:bf00   10.11.13.37     not-ready
60670      Disabled      267        container:id.client           fd02::c0a8:210b:0:ecfe   10.11.167.158   not-ready

Running cilium endpoint get for one of the endpoints will provide a description of known state about it, which includes BPF verification logs.
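
For example, using one of the endpoint IDs from the listing above (output omitted here):

$ cilium endpoint get 48896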

The files under /var/run/cilium/state provide context about how the BPF datapath is managed and set up. The .h files describe specific configurations used for BPF program compilation. The numbered directories describe endpoint-specific state, including header configuration files and BPF binaries.
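
For example, to inspect this state on the node running cilium-agent (the exact contents vary per deployment):

# ls /var/run/cilium/state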

Current BPF map state for particular programs is held under /sys/fs/bpf/, and the bpf-map utility can be useful for debugging what is going on inside them, for example:

# ls /sys/fs/bpf/tc/globals/
cilium_calls_15124  cilium_calls_48896        cilium_ct4_global       cilium_lb4_rr_seq       cilium_lb6_services  cilium_policy_25729  cilium_policy_60670       cilium_proxy6
cilium_calls_25729  cilium_calls_60670        cilium_ct6_global       cilium_lb4_services     cilium_lxc           cilium_policy_3978   cilium_policy_reserved_1  cilium_reserved_policy
cilium_calls_3978   cilium_calls_netdev_ns_1  cilium_events           cilium_lb6_reverse_nat  cilium_policy        cilium_policy_4314   cilium_policy_reserved_2  cilium_tunnel_map
cilium_calls_4314   cilium_calls_overlay_2    cilium_lb4_reverse_nat  cilium_lb6_rr_seq       cilium_policy_15124  cilium_policy_48896  cilium_proxy4
# bpf-map info /sys/fs/bpf/tc/globals/cilium_policy_15124
Type:           Hash
Key size:       8
Value size:     24
Max entries:    1024
Flags:          0x0
# bpf-map dump /sys/fs/bpf/tc/globals/cilium_policy_15124
Key:
00000000  6a 01 00 00 82 23 06 00                           |j....#..|
Value:
00000000  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000010  00 00 00 00 00 00 00 00                           |........|

Building Container Images

Two make targets exist to build container images automatically based on the locally checked out branch:

Developer images

Run make dev-docker-image to build a Cilium Docker image that contains your local changes.

DOCKER_DEV_ACCOUNT=quay.io/myaccount DOCKER_IMAGE_TAG=jane-developer-my-fix make dev-docker-image

The command above assumes that your username for quay.io is myaccount. You can then push the image tag to your own registry for development builds:

docker push quay.io/myaccount/cilium-dev:jane-developer-my-fix-amd64

Official release images

Anyone can build official release images using the make target below, but pushing to the official registries is restricted to Cilium maintainers. Ask in the #launchpad Slack channel for the exact details.

DOCKER_IMAGE_TAG=v1.4.0 make docker-image

You can then push the image tag to the registry:

docker push cilium/cilium:v1.4.0

Update cilium-builder and cilium-runtime images

Log in to quay.io with your credentials for the repository that you want to update:

cilium-builder - contains Cilium build-time dependencies
cilium-runtime - contains Cilium run-time dependencies

  1. After login, select the tab “builds” on the left menu.
_images/cilium-quayio-tag-0.png
  2. Click on the wheel.
  3. Enable the trigger for that build trigger.
_images/cilium-quayio-tag-1.png
  4. Confirm that you want to enable the trigger.
_images/cilium-quayio-tag-2.png
  5. After enabling the trigger, click on the wheel again.
  6. Click on “Run Trigger Now”.
_images/cilium-quayio-tag-3.png
  7. A new pop-up will appear where you can select the branch that contains your changes.
  8. Select the branch that contains the new changes.
_images/cilium-quayio-tag-4.png
  9. After selecting your branch, click on “Start Build”.
_images/cilium-quayio-tag-5.png
  10. Once the build has started, you can disable the build trigger by clicking on the wheel.
  11. Click on “Disable Trigger”.
_images/cilium-quayio-tag-6.png
  12. Confirm that you want to disable the build trigger.
_images/cilium-quayio-tag-7.png
  13. Once the build is finished, click on Tags (on the left menu).
  14. Click on the wheel.
  15. Add a new tag to the image that was built.
_images/cilium-quayio-tag-8.png
  16. Write the name of the tag that you want to give to the newly built image.
  17. Confirm the name is correct and click on “Create Tag”.
_images/cilium-quayio-tag-9.png
  18. After the new tag is created, you can delete the other tag, which is the name of your branch. Select the tag name.
  19. Click on Actions.
  20. Click on “Delete Tags”.
_images/cilium-quayio-tag-10.png
  21. Confirm that you want to delete the tag with your branch name.
_images/cilium-quayio-tag-11.png

You have now created a new image build with a new tag. The next step is to update the repository root’s Dockerfile so that it points to the newly created cilium-builder or cilium-runtime image.

  1. Update the versions of the images that are pulled into the CI VMs.
  • Open a PR against the Packer-CI-Build with an update to said image versions. Once your PR is merged, a new version of the VM will be ready for consumption in the CI.
  • Update the SERVER_VERSION field in test/Vagrantfile to contain the new version, which is the build number from the Jenkins Job for the VMs. For example, build 119 from the pipeline would be the value to set for SERVER_VERSION.
  • Open a pull request with this version change in the cilium repository.

Nightly Docker image

After each successful Nightly build, a cilium/nightly image is pushed to dockerhub.

To use the latest nightly build, use the cilium/nightly:latest tag. Nightly images are stored on Docker Hub tagged with the following format: YYYYMMDD-<job number>. The job number is added to the tag for the unlikely event of two nightly builds being built on the same date.
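
For example, to fetch the most recent nightly image:

$ docker pull cilium/nightly:latest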

Code Overview

This section provides an overview of the Cilium & Hubble source code directory structure. It is useful for getting an initial overview of where to find what.

High-level

Top-level directories github.com/cilium/cilium:

api
The Cilium & Hubble API definition.
bpf
The BPF datapath code
bugtool
CLI for collecting agent & system information for bug reporting
cilium
Cilium CLI client
contrib, tools
Additional tooling and resources used for development
daemon
The cilium-agent running on each node
examples
Various example resources and manifests. These typically need to be modified before they can be used.
hubble-relay
Hubble relay server
install
Helm deployment manifests for all components
pkg
Common Go packages shared between all components
operator
Operator responsible for centralized tasks which do not need to be performed on each node.
plugins
Plugins to integrate with Kubernetes and Docker
test
End-to-end integration tests run in the End-To-End Testing Framework.

Cilium

api/v1/openapi.yaml
API specification of the Cilium API. Used for code generation.
api/v1/models/
Go code generated from openapi.yaml representing all API resources
bpf
The BPF datapath code
cilium
Cilium CLI client
cilium-health
Cilium cluster connectivity CLI client
daemon
cilium-agent specific code
plugins/cilium-cni
The CNI plugin to integrate with Kubernetes
plugins/cilium-docker
The Docker integration plugin

Hubble

The server-side code of Hubble is integrated into the Cilium repository. The Hubble CLI can be found in the separate repository github.com/cilium/hubble. The Hubble UI can be found in the separate repository github.com/cilium/hubble-ui.

api/v1/external, api/v1/flow, api/v1/observer, api/v1/peer, api/v1/relay
API specifications of the Hubble APIs.
hubble-relay
Hubble relay agent
pkg/hubble
All Hubble specific code
pkg/hubble/container
Ring buffer implementation
pkg/hubble/filters
Flow filtering capabilities
pkg/hubble/metrics
Metrics plugins providing Prometheus metrics based on Hubble’s visibility
pkg/hubble/observe
Layer running on top of the Cilium datapath monitoring, feeding the metrics and ring buffer.
pkg/hubble/parser
Network flow parsers
pkg/hubble/peer
Peer service implementation
pkg/hubble/relay
Relay service implementation
pkg/hubble/server
The server providing the API for the Hubble client and UI

Important common packages

pkg/allocator
Security identity allocation
pkg/bpf
Abstraction layer to interact with the BPF runtime
pkg/client
Go client to access Cilium API
pkg/clustermesh
Multi-cluster implementation including control plane and global services
pkg/controller
Base controller implementation for any background operation that requires retries or interval-based invocation.
pkg/datapath
Abstraction layer for datapath interaction
pkg/default
All default values
pkg/elf
ELF abstraction library for the BPF loader
pkg/endpoint
Abstraction of a Cilium endpoint, representing all workloads.
pkg/endpointmanager
Manager of all endpoints
pkg/envoy
Envoy proxy interactions
pkg/fqdn
FQDN proxy and FQDN policy implementation
pkg/health
Network connectivity health checking
pkg/identity
Representation of a security identity for workloads
pkg/ipam
IP address management
pkg/ipcache
Global cache mapping IPs to endpoints and security identities
pkg/k8s
All interactions with Kubernetes
pkg/kafka
Kafka protocol proxy and policy implementation
pkg/kvstore
Key-value store abstraction layer with backends for etcd and consul
pkg/labels
Base metadata type to describe all label/metadata requirements for workload identity specification and policy matching.
pkg/loadbalancer
Control plane for load-balancing functionality
pkg/maps
BPF map representations
pkg/metrics
Prometheus metrics implementation
pkg/monitor
BPF datapath monitoring abstraction
pkg/node
Representation of a network node
pkg/option
All available configuration options
pkg/policy
Policy enforcement specification & implementation
pkg/proxy
Layer 7 proxy abstraction
pkg/service
Representation of a load-balancing service
pkg/trigger
Implementation of trigger functionality used to build event-driven behaviour

Debugging

toFQDNs and DNS Debugging

The interactions of L3 toFQDNs and L7 DNS rules can be difficult to debug around. Unlike many other policy rules, these are resolved at runtime with unknown data. Pods may create large numbers of IPs in the cache, or the IPs returned may not be compatible with our datapath implementation. Sometimes we also just have bugs.

Isolating the source of toFQDNs issues

While there is no common culprit when debugging, the DNS Proxy shares the least code with other systems and so is likely the least audited component in this chain. The cascading caching scheme is also complex in its behaviour. Determining whether an issue is caused by the DNS components, the policy layer, or the datapath is often the first step when debugging toFQDNs-related issues. Generally, working top-down is easiest, as the information needed to verify low-level correctness can be collected in the initial debug invocations.

REFUSED vs NXDOMAIN responses

The proxy uses REFUSED DNS responses to indicate a denied request. Some libc implementations, notably musl which is common in Alpine Linux images, terminate the whole DNS search in these cases. This often manifests as a connect error in applications, as the libc lookup returns no data. To work around this, denied responses can be configured to be NXDOMAIN by setting the --tofqdns-dns-reject-response-code command line argument.
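
For example, when starting cilium-agent (the nameError value shown here is an assumption; check cilium-agent --help for the accepted values):

$ cilium-agent --tofqdns-dns-reject-response-code=nameError [...]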

Monitor Events

The DNS Proxy emits multiple L7 DNS monitor events: one for the request and one for the response (if allowed). Often the L7 DNS rules are paired with L3 toFQDNs rules, and events relating to those rules are also relevant.

Note

Be sure to run cilium monitor on the same node as the pod being debugged!

$ kubectl exec pod/cilium-sbp8v -n cilium -- cilium monitor --related-to 3459
Listening for events on 4 CPUs with 64x4096 of shared memory
Press Ctrl-C to quit
level=info msg="Initializing dissection cache..." subsys=monitor

-> Request dns from 3459 ([k8s:org=alliance k8s:io.kubernetes.pod.namespace=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.cilium.k8s.policy.cluster=default k8s:class=xwing]) to 0 ([k8s:io.cilium.k8s.policy.serviceaccount=kube-dns k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns k8s:io.cilium.k8s.policy.cluster=default]), identity 323->15194, verdict Forwarded DNS Query: cilium.io. A
-> endpoint 3459 flow 0xe6866e21 identity 15194->323 state reply ifindex lxc84b58cbdabfe orig-ip 10.60.1.115: 10.63.240.10:53 -> 10.60.0.182:42132 udp
-> Response dns to 3459 ([k8s:org=alliance k8s:io.kubernetes.pod.namespace=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.cilium.k8s.policy.cluster=default k8s:class=xwing]) from 0 ([k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=kube-dns k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns]), identity 323->15194, verdict Forwarded DNS Query: cilium.io. A TTL: 486 Answer: '104.198.14.52'
-> endpoint 3459 flow 0xe6866e21 identity 15194->323 state reply ifindex lxc84b58cbdabfe orig-ip 10.60.1.115: 10.63.240.10:53 -> 10.60.0.182:42132 udp
Policy verdict log: flow 0x614e9723 local EP ID 3459, remote ID 16777217, dst port 80, proto 6, ingress false, action allow, match L3-Only, 10.60.0.182:41510 -> 104.198.14.52:80 tcp SYN

-> stack flow 0x614e9723 identity 323->16777217 state new ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp SYN
-> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp SYN
-> endpoint 3459 flow 0x7388921 identity 16777217->323 state reply ifindex lxc84b58cbdabfe orig-ip 104.198.14.52: 104.198.14.52:80 -> 10.60.0.182:41510 tcp SYN, ACK
-> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
-> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
-> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
-> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
-> endpoint 3459 flow 0x7388921 identity 16777217->323 state reply ifindex lxc84b58cbdabfe orig-ip 104.198.14.52: 104.198.14.52:80 -> 10.60.0.182:41510 tcp ACK
-> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
-> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK, FIN
-> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK, FIN
-> endpoint 3459 flow 0x7388921 identity 16777217->323 state reply ifindex lxc84b58cbdabfe orig-ip 104.198.14.52: 104.198.14.52:80 -> 10.60.0.182:41510 tcp ACK, FIN
-> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK

The above is for a simple curl cilium.io in a pod. The L7 DNS request is the first set of messages and the subsequent L3 connection is the HTTP component. AAAA DNS lookups commonly happen but were removed to simplify the example.

  • If no L7 DNS requests appear, the proxy redirect is not in place. This may mean that the policy does not select this endpoint or there is an issue with the proxy redirection. Whether any redirects exist can be checked with cilium status --all-redirects. In the past, a bug occurred with more permissive L3 rules overriding the proxy redirect, causing the proxy to never see the requests.
  • If the L7 DNS request is blocked, with an explicit denied message, then the requests are not allowed by the proxy. This may be due to a typo in the network policy, or the matchPattern rule not allowing this domain. It may also be due to a bug in policy propagation to the DNS Proxy.
  • If the DNS request is allowed, with an explicit message, and it should not be, this may be because a more general policy is in place that allows the request. matchPattern: "*" visibility policies are commonly in place and would supersede all other, more restrictive, policies. If no other policies are in place, incorrect allows may indicate a bug when passing policy information to the proxy. There is no way to dump the rules in the proxy, but a debug log is printed when a rule is added. Look for DNS Proxy updating matchNames in allowed list during UpdateRules. The pkg/proxy/dns.go file contains the DNS proxy implementation.

If L7 DNS behaviour seems correct, see the sections below to further isolate the issue. The DNS cache can be verified with cilium fqdn cache list: the IPs in the response should appear in the cache for the appropriate endpoint. The lookup time is included in the JSON output of the command.

$ kubectl exec pod/cilium-sbp8v -n cilium -- cilium fqdn cache list
Endpoint   Source   FQDN         TTL    ExpirationTime             IPs
3459       lookup   cilium.io.   3600   2020-04-21T15:04:27.146Z   104.198.14.52
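
To see the lookup time, the same command can be run with JSON output (the -o json flag is an assumption here, as supported by most cilium list commands):

$ kubectl exec pod/cilium-sbp8v -n cilium -- cilium fqdn cache list -o json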
DNS Proxy Errors

REFUSED responses are returned when the proxy encounters an error during processing. This can be confusing to debug, as that is also the response when a DNS request is denied. An error log is always printed in these cases (a log search sketch follows the list below); some of these errors originate in callbacks provided by other packages via the daemon in cilium-agent.

  • Rejecting DNS query from endpoint due to error: This is the “normal” policy-reject message. It is a debug log.
  • cannot extract endpoint IP from DNS request: The proxy cannot read the socket information to read the source endpoint IP. This could mean an issue with the datapath routing and information passing.
  • cannot extract endpoint ID from DNS request: The proxy cannot use the source endpoint IP to get the cilium-internal ID for that endpoint. This is different from the Security Identity. This could mean that cilium is not managing this endpoint and that something has gone awry. It could also mean a routing problem where a packet has arrived at the proxy incorrectly.
  • cannot extract destination IP:port from DNS request: The proxy cannot read the socket information of the original request to obtain the intended target IP:Port. This could mean an issue with the datapath routing and information passing.
  • cannot find server ip in ipcache: The proxy cannot resolve a Security Identity for the target IP of the DNS request. This should always succeed, as world catches all IPs not set by more specific entries. This can mean a broken ipcache BPF table.
  • Rejecting DNS query from endpoint due to error: While checking if the DNS request was allowed (based on Endpoint ID, destination IP:Port and the DNS query) an error occurred. These errors would come from the internal rule lookup in the proxy, the allowed field.
  • Timeout waiting for response to forwarded proxied DNS lookup: The proxy forwards requests 1:1 and does not cache. It applies a 5s timeout on responses to those requests, as the client will retry within this period (usually). Bursts of these errors can happen if the DNS target server misbehaves and many pods see DNS timeouts. This isn’t an actual problem with cilium or the proxy although it can be caused by policy blocking the DNS target server if it is in-cluster.
  • Timed out waiting for datapath updates of FQDN IP information; returning response: When the proxy updates the DNS caches with response data, it needs to allow some time for that information to get into the datapath. Otherwise, pods would attempt to make the outbound connection (the thing that caused the DNS lookup) before the datapath is ready. Many stacks retry the SYN in such cases but some return an error and some apps further crash as a response. This delay is configurable by setting the --tofqdns-proxy-response-max-delay command line argument but defaults to 100ms. It can be exceeded if the system is under load.
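
A quick way to search for these messages is to grep the cilium-agent log on the affected node, as sketched below (the pod name is taken from the examples above; adjust it to your deployment):

$ kubectl logs -n cilium pod/cilium-sbp8v | grep -i "dns"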
Identities and Policy

Once a DNS response has been passed back through the proxy and is placed in the DNS cache toFQDNs rules can begin using the IPs in the cache. There are multiple layers of cache:

  • A per-Endpoint DNSCache stores the lookups for this endpoint. It is restored on cilium startup with the endpoint. Limits are applied here for --tofqdns-endpoint-max-ip-per-hostname and TTLs are tracked. The --tofqdns-min-ttl is not used here.
  • A per-Endpoint DNSZombieMapping list of IPs that have expired from the per-Endpoint cache but are waiting for the Connection Tracking GC to mark them in-use or not. This can take up to 12 hours to occur. This list is size-limited by --tofqdns-max-deferred-connection-deletes.
  • A global DNSCache where all endpoint and poller DNS data is collected. It does apply the --tofqdns-min-ttl value but not the --tofqdns-endpoint-max-ip-per-hostname value.

If an IP exists in the FQDN cache (check with cilium fqdn cache list) then toFQDNs rules that select a domain name, either explicitly via matchName or via matchPattern, should cause IPs for that domain to have allocated Security Identities. These can be listed with:

$ kubectl exec pod/cilium-sbp8v -n cilium -- cilium identity list
ID         LABELS
1          reserved:host
2          reserved:world
3          reserved:unmanaged
4          reserved:health
5          reserved:init
6          reserved:remote-node
323        k8s:class=xwing
           k8s:io.cilium.k8s.policy.cluster=default
           k8s:io.cilium.k8s.policy.serviceaccount=default
           k8s:io.kubernetes.pod.namespace=default
           k8s:org=alliance
...
16777217   cidr:104.198.14.52/32
           reserved:world

Note that CIDR identities are allocated locally on the node and have a high bit set, so they are often in the 16-million range. This is the identity seen in the monitor output for the HTTP connection above.

In cases where there is no matching identity for an IP in the fqdn cache, it may simply be because no policy selects an associated domain. The policy system represents each toFQDNs: rule with a FQDNSelector instance. These receive updates from a global NameManager in the daemon. They can be listed along with other selectors (roughly corresponding to any L3 rule):

$ kubectl exec pod/cilium-sbp8v -n cilium -- cilium policy selectors
SELECTOR                                                                                                         USERS   IDENTITIES
MatchName: , MatchPattern: *                                                                                     1       16777217
&LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{},}                   2       1
                                                                                                                         2
                                                                                                                         3
                                                                                                                         4
                                                                                                                         5
                                                                                                                         6
                                                                                                                         323
                                                                                                                         6188
                                                                                                                         15194
                                                                                                                         18892
                                                                                                                         25379
                                                                                                                         29200
                                                                                                                         32255
                                                                                                                         33831
                                                                                                                         16777217
&LabelSelector{MatchLabels:map[string]string{reserved.none: ,},MatchExpressions:[]LabelSelectorRequirement{},}   1

In this example 16777217 is used by two selectors, one with matchPattern: "*" and another empty one. This is because of the policy in use:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: "tofqdn-dns-visibility"
spec:
  endpointSelector:
    matchLabels:
      any:org: alliance
  egress:
  - toPorts:
      - ports:
         - port: "53"
           protocol: ANY
        rules:
          dns:
            - matchPattern: "*"
  - toFQDNs:
      - matchPattern: "*"

The L7 DNS rule has an implicit L3 allow-all because it defines only L4 and L7 sections. This is the second selector in the list, and it includes all possible L3 identities known in the system. In contrast, the first selector, which corresponds to the toFQDNs: matchPattern: "*" rule, would list all identities for IPs that came from the DNS Proxy; other CIDR identities would not be included.

Datapath Plumbing

For a policy to be fully realized the datapath for an Endpoint must be updated. In the case of a new DNS-source IP, the CIDR identity associated with it must propagate from the selectors to the Endpoint specific policy. Unless a new policy is being added, this often only involves updating the Policy Map of the Endpoint with the new CIDR Identity of the IP. This can be verified:

$ kubectl exec pod/cilium-sbp8v -n cilium -- cilium bpf policy get 3459
DIRECTION   LABELS (source:key[=value])   PORT/PROTO   PROXY PORT   BYTES   PACKETS
Ingress     reserved:unknown              ANY          NONE         1367    7
Ingress     reserved:host                 ANY          NONE         0       0
Egress      reserved:unknown              53/TCP       36447        0       0
Egress      reserved:unknown              53/UDP       36447        138     2
Egress      cidr:104.198.14.52/32         ANY          NONE         477     6
            reserved:world

Note that the labels for identities are resolved here. Label resolution can be skipped (as with the -n flag below), or there may be cases where it doesn’t occur:

$ kubectl exec pod/cilium-sbp8v -n cilium -- cilium bpf policy get -n 3459
DIRECTION   IDENTITY   PORT/PROTO   PROXY PORT   BYTES   PACKETS
Ingress     0          ANY          NONE         1367    7
Ingress     1          ANY          NONE         0       0
Egress      0          53/TCP       36447        0       0
Egress      0          53/UDP       36447        138     2
Egress      16777217   ANY          NONE         477     6

L3 toFQDNs rules are egress-only, so we would expect to see an Egress entry with Security Identity 16777217. The L7 rule, used to redirect to the DNS Proxy, is also present with a populated PROXY PORT. It has a 0 IDENTITY as it is an L3 wildcard, i.e. the policy allows any peer on the specified port.

An identity missing here can be an error in various places:

  • Policy doesn’t actually allow this Endpoint to connect. A sanity check is to use cilium endpoint list to see if cilium thinks it should have policy enforcement.
  • Endpoint regeneration is slow and the Policy Map has not been updated yet. This can occur in cases where we have leaked IPs from the DNS cache (i.e. they were never deleted correctly) or when there are legitimately many IPs. It can also simply mean an overloaded node or even a deadlock within cilium.
  • A more permissive policy has removed the need to include this identity. This is likely a bug, however, as the IP would still have an identity allocated and it would be included in the Policy Map. In the past, a similar bug occurred with the L7 redirect and that would stop this whole process at the beginning.

Mutexes / Locks and Data Races

Note

This section only applies to Golang code.

There are a few options available to debug Cilium data races and deadlocks.

To debug data races, Golang allows -race to be passed to the compiler to compile Cilium with race detection. Additionally, the flag can be provided to go test to detect data races in a testing context.
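
For example, to check a single package for races in a testing context (the package path is only illustrative):

$ go test -race ./pkg/endpoint/...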

To compile a Cilium binary with race detection, you can do:

$ make RACE=1

To run unit tests with race detection, you can do:

$ make RACE=1 unit-tests
Deadlock detection

Cilium can be compiled with the lockdebug build tag, which provides a seamless wrapper over the standard Golang mutex types via the sasha-s/go-deadlock library. No action is required besides building the binary with this tag.

For example:

$ make LOCKDEBUG=1
$ # Deadlock detection during unit tests:
$ make LOCKDEBUG=1 unit-tests

Release Management

This section describes the release processes for tracking, preparing, and creating new Cilium releases. This includes information around the release cycles and guides for developers responsible for backporting fixes or preparing upcoming stable releases.

Organization

Release tracking

Feature work for upcoming releases is tracked through GitHub Projects. You can view the projects related to the 1.9 release here:

Release Cadence

New versions of Cilium are released based on completion of feature work that has been scheduled for that release. Minor releases are typically designated by incrementing the Y in the version format X.Y.Z.

Three stable branches are maintained at a time: One for the most recent minor release, and two for the prior two minor releases. For each minor release that is currently maintained, the stable branch vX.Y on github contains the code for the next stable release. New micro releases for an existing stable version X.Y.Z are published incrementing the Z in the version format.

New micro releases for stable branches are made periodically to provide security and bug fixes, based upon community demand and bugfix severity. Potential fixes for an upcoming release are first merged into the master branch, then backported to the relevant stable branches using the following criteria.

Backporting process

Backport Criteria

Committers may nominate PRs that have been merged into master as candidates for backport into stable releases if they affect the stable production usage of community users.

Backport criteria for current minor release

Criteria for inclusion into the next stable release of the current latest minor version of Cilium, for example in a v1.2.z release prior to the release of version v1.3.0:

  • All bugfixes
Backport criteria for X.Y-1.Z and X.Y-2.Z

Criteria for the inclusion into the next stable release of the prior two minor versions of Cilium, for example in a v1.0.z or v1.1.z release prior to the release of version v1.3.0:

  • Security relevant fixes
  • Major bugfixes relevant to the correct operation of Cilium

Backporting guide

Cilium PRs that are marked with the label needs-backport/X.Y need to be backported to the stable branch X.Y. The following steps summarize the process for backporting these PRs:

  • One-time setup
  • Preparing PRs for backport
  • Cherry-picking commits into a backport branch
  • Posting the PR and updating GitHub labels
One-time setup
  1. The scripts referred to below need to be run on Linux; they do not work on macOS. You can use the Cilium dev VM for this, but you need to configure git with your name and email address to be used in the commit messages:

    $ git config --global user.name "John Doe"
    $ git config --global user.email johndoe@example.com
    
  2. Make sure you have a GitHub developer access token with the public_repo scope available. For details, see contrib/backporting/README.md

  3. This guide makes use of several tools to automate the backporting process. The basics require bash and git, but to automate interactions with github, further tools are required.

    Dependency       Required?   Download Command
    bash             Yes         N/A (OS-specific)
    git              Yes         N/A (OS-specific)
    jq               Yes         N/A (OS-specific)
    python3          No          Python Downloads
    PyGithub         No          pip3 install PyGithub
    Github hub CLI   No          N/A (OS-specific)
Preparation

Pull requests that are candidates for backports to the X.Y stable release are tracked through the following links:

Make sure that the Github labels are up-to-date, as this process will deal with all commits from PRs that have the needs-backport/X.Y label set (for a stable release version X.Y). If any PRs contain labels such as backport-pending/X.Y, ensure that the backport for that PR has been merged and, if so, change the label to backport-done/X.Y.

Creating the backports branch
  1. Run contrib/backporting/start-backport for the release version that you intend to backport PRs for. This will pull the latest repository commits from the Cilium repository (assumed to be the git remote origin), create a new branch, and run the contrib/backporting/check-stable script to fetch the full set of PRs to backport.

    $ GITHUB_TOKEN=xxx contrib/backporting/start-backport 1.0
    

    Note

    This command will leave behind a file in the current directory with a name based upon the release version and the current date in the form vRELEASE-backport-YYYY-MM-DD.txt which contains a prepared backport pull-request description so you don’t need to write one yourself.

  2. Cherry-pick the commits using the master git SHAs listed, starting from the oldest (top), working your way down and fixing any merge conflicts as they appear. Note that for PRs that have multiple commits you will want to check that you are cherry-picking oldest commits first. The cherry-pick script accepts multiple arguments, in which case it will attempt to apply each commit in the order specified on the command line until one cherry pick fails or every commit is cherry-picked.

    $ contrib/backporting/cherry-pick <oldest-commit-sha>
    ...
    $ contrib/backporting/cherry-pick <newest-commit-sha>
    
  3. (Optional) If there are any commits or pull requests that are tricky or time-consuming to backport, consider reaching out for help on Slack. If the commit does not cherry-pick cleanly, please mention the necessary changes in the pull request description in the next section.

  4. Push your backports branch to the cilium repo.

    $ git push -u origin HEAD
    
Creating the backport pull request

The backport pull-request may be created via CLI tools, or alternatively you can use the GitHub web interface to achieve these steps.

Via command-line tools

These steps require all of the tools described in the One-time setup section above. It pushes the git tree, creates the pull request and updates the labels for the PRs that are backported, based on the vRELEASE-backport-YYYY-MM-DD.txt file in the current directory.

# contrib/backporting/submit-backport
Via GitHub web interface
  1. Create a new PR from your branch towards the feature branch you are backporting to. Note that by default GitHub creates PRs against the master branch, so you will need to change it. The title and description for the pull request should be based upon the vRELEASE-backport-YYYY-MM-DD.txt file that was generated by the scripts above.
  2. Label the new backport PR with the backport label for the stable branch such as backport/X.Y as well as kind/backports so that it is easy to find backport PRs later.
  3. Mark all PRs you backported with the backport pending label backport-pending/X.Y and clear the needs-backport/vX.Y label. This can be done with the command printed out at the bottom of the output from the start-backport script above (GITHUB_TOKEN needs to be set for this to work).
Running the CI against the pull request

To validate a cross-section of various tests against the PRs, backport PRs should be validated in the CI by running all CI targets. This can be triggered by adding a comment to the PR with exactly the text never-tell-me-the-odds. The comment must not contain any other characters.

After the backports are merged

After the backport PR is merged, mark all backported PRs with backport-done/X.Y label and clear the backport-pending/X.Y label(s). If the backport pull request description was generated using the scripts above, then the full command is listed in the pull request description.

# Set PR 1234's v1.0 backporting labels to done
contrib/backporting/set-labels.py 1234 done 1.0

Generic Release Process

This process applies to all releases other than feature releases. This includes:

  • Stable releases

If you intend to release a new feature release, see the Feature Release Process section instead.

Note

The following commands have been validated when run in the VM used in the Cilium development process. See Development Setup for detailed instructions about setting up said VM.

  1. Ensure that the necessary backports have been completed and merged. See Backporting process.

    1. Update GitHub project and create vX.Y.Z+1 project if applicable.
    2. Update PRs / issues that were added to the vX.Y.Z project, but didn’t make it into this release into the vX.Y.Z+1 project.
  2. Checkout the desired stable branch and pull it:

    git checkout v1.0; git pull
    
  3. Create a branch for the release pull request:

    git checkout -b pr/prepare-v1.0.3
    
  4. Update the VERSION file to represent X.Y.Z+1

  5. If this is the first release after creating a new release branch, adjust the image pull policy for all .sed files in install/kubernetes/cilium/values.yaml from Always to IfNotPresent.

  6. Update Helm chart documentation

    1. Update version and appVersion in install/kubernetes/cilium/Chart.yaml
    2. Update version tag in install/kubernetes/cilium/values.yaml
  7. Update the image tag versions in the examples:

    make -C install/kubernetes clean all
    
  8. Update the cilium_version and cilium_tag variables in examples/getting-started/Vagrantfile

  9. Update the AUTHORS file

    make update-authors
    

    Note

    Check to see if the AUTHORS file has any formatting errors (for instance, indentation mismatches) as well as duplicate contributor names, and correct them accordingly.

  10. Generate the release notes by running the instructions provided in github.com/cilium/release

  11. Add the generated release notes in the CHANGELOG.md file

  12. Create a new project named “X.Y.Z+1” to automatically track the backports for that particular release. Direct Link:

  13. Update the project URL for the respective release in file .github/cilium-actions.yml

  14. Add all modified files using git add and create a pull request with the title Prepare for release v1.0.3. Add the backport label to the PR which corresponds to the branch for which the release is being performed, e.g. backport/1.0.

    Note

    Make sure to create the PR against the desired stable branch, in this case v1.0.

  15. Follow standard procedures to get the aforementioned PR merged into the desired stable branch. See Submitting a pull request for more information about this process.

  16. Check out the stable branch and pull your merged changes:

    git checkout v1.0; git pull
    
  17. Build the container images and push them

    DOCKER_IMAGE_TAG=v1.0.3 make docker-image
    docker push cilium/cilium:v1.0.3
    
  18. Create release tags:

    git tag -a v1.0.3 -m 'Release v1.0.3'
    git tag -a 1.0.3 -m 'Release 1.0.3'
    

    Note

    There are two tags that correspond to the same release because GitHub recommends using vx.y.z for release version formatting, and ReadTheDocs, which hosts the Cilium documentation, requires the version to be in the format x.y.z. For more information about how ReadTheDocs does versioning, you can read their Versions Documentation.

  19. Push the git release tag

    git push --tags
    
  20. Build the binaries and push them to the release bucket:

    DOMAIN=releases.cilium.io ./contrib/release/uploadrev v1.0.3
    

    This step will print a markdown snippet which you will need when crafting the GitHub release so make sure to keep it handy.

  21. Create a GitHub release:

    1. Choose the correct target branch, e.g. v1.0
    2. Choose the correct target tag, e.g. v1.0.3
    3. Title: 1.0.3
    4. Check the This is a pre-release box if you are releasing a release candidate.
    5. Fill in the release description with the output generated by github.com/cilium/release
    6. Preview the description and then publish the release
  22. Announce the release in the #general channel on Slack

  23. Update the README.rst#stable-releases section from the Cilium master branch

  24. Update the .github/cilium-actions.yml with the project created for the upcoming release.

  25. Bump the version of Cilium used in the Cilium upgrade tests to use the new release

    Please reach out on the #development channel on Slack for assistance with this task.

  26. Update the stable tags for cilium, cilium-operator, cilium-docker-plugin and hubble-relay on DockerHub.

  27. Update the external tools and guides to point to the released Cilium version:

Release Candidate Process

This process applies to all release candidates:

If you intend to release a new feature release, see the Feature Release Process section instead.

Note

The following commands have been validated when run in the VM used in the Cilium development process. See Development Setup for detailed instructions about setting up said VM.

  1. Ensure that the necessary features and fixes have been completed and merged into the branch for which the release candidate will happen.

    1. Update GitHub project and create vX.Y.Z-rcW+1 project if applicable.
    2. Update PRs / issues that were added to the vX.Y.Z-rcW project, but didn’t make it into this release into the vX.Y.Z-rcW+1 project.
  2. Checkout the desired stable branch (can be master branch if stable branch was not created) and pull it:

    git checkout v1.0; git pull
    
  3. Create a branch for the release pull request:

    git checkout -b pr/prepare-v1.0.3
    
  4. Update the AUTHORS file

    make update-authors
    

    Note

    Check to see if the AUTHORS file has any formatting errors (for instance, indentation mismatches) as well as duplicate contributor names, and correct them accordingly.

  5. Add all modified files using git add and create a pull request with the title Prepare for release vX.Y.Z-rcW+1. Add the backport label to the PR which corresponds to the branch for which the release is being performed, e.g. backport/1.0.

    Note

    Make sure to create the PR against the desired stable branch, in this case v1.0.

  6. Follow standard procedures to get the aforementioned PR merged into the desired stable branch. See Submitting a pull request for more information about this process.

  7. Check out the stable branch and pull your merged changes:

    git checkout v1.0; git pull
    
  8. Check https://hub.docker.com and create a build for the new tag. This build will automatically be triggered when the tag is pushed to github.com

  9. Create release tags:

    git tag -a v1.0.3 -m 'Release v1.0.3'
    git tag -a 1.0.3 -m 'Release 1.0.3'
    

    Note

    There are two tags that correspond to the same release because GitHub recommends using vx.y.z for release version formatting, and ReadTheDocs, which hosts the Cilium documentation, requires the version to be in the format x.y.z. For more information about how ReadTheDocs does versioning, you can read their Versions Documentation.

  10. Push the git release tag

    git push --tags
    
  11. Create a GitHub release:

    1. Choose the correct target branch, e.g. v1.0

    2. Choose the correct target tag, e.g. v1.0.3

    3. Title: 1.0.3

    4. Check the This is a pre-release box if you are releasing a release candidate.

    5. Fill in the release description:

      Summary of Changes
      ------------------
      
      **Important Bug Fixes**
      
      * Fix dropped packets upon agent bootstrap when iptables rules are installed (@ianvernon)
      
      **Enhancements**
      
      **Documentation**
      
      Changes
      -------
      
      ```
      << contents of NEWS.rst for this release >>
      ```
      
      Release binaries
      ----------------
      
      << contents of snippet output by uploadrev >>
      
    6. Preview the description and then publish the release

  12. Announce the release in the #general channel on Slack

Feature Release Process

This document describes the process for creating a major or minor feature release of Cilium.

On Freeze date

  1. Fork a new release branch from master:

    git checkout master; git pull origin master
    git checkout -b v1.2
    git push
    
  2. Protect the branch using the GitHub UI to disallow direct push and require merging via PRs with proper reviews. Direct link

  3. Replace the contents of the CODEOWNERS file with the following to reduce code reviews to essential approvals:

    * @cilium/janitors
    api/ @cilium/api
    pkg/apisocket/ @cilium/api
    pkg/monitor/payload @cilium/api
    pkg/policy/api/ @cilium/api
    pkg/proxy/accesslog @cilium/api
    
  4. Create a new project named “X.Y.Z” to automatically track the backports for that particular release. Direct Link:

  5. Copy the .github/cilium-actions.yml from the previous release vX.Y-1, change the contents to be relevant for the release vX.Y, and set the project: to the generated link created by the previous step. The link should be something like: https://github.com/cilium/cilium/projects/NNN

  6. Commit changes, open a pull request against the new v1.2 branch, and get the pull request merged

    git checkout -b pr/prepare-v1.2
    git add [...]
    git commit
    git push
    
  7. Create the following GitHub labels:

    1. backport-pending/1.2
    2. backport-done/1.2
    3. backport/1.2
    4. needs-backport/1.2
  8. Check out master and update the .github/cilium-actions.yml to have all the necessary configurations for the backport of the new vX.Y branch.

    git checkout -b pr/master-cilium-actions-update origin/master
    # modify .github/cilium-actions.yml
    git add .github/cilium-actions.yml
    git commit
    git push
    
  9. Continue with the next step only after the previous steps are merged into master.

  10. Mark all open PRs with needs-backport/x.y that have the milestone x.y

  11. Change the VERSION file to contain the next rc version. For example, if we are branching v1.2 and still in the RC phase, we need to change the VERSION file to contain v1.2.0-rcX

  12. Set the branch as “Active” and the “Privacy Level” to “Private” in the readthedocs Admin page. (Replace v1.2 with the right branch) https://readthedocs.org/dashboard/cilium/version/v1.2/

  13. Since this is the first release being made from a new branch, please follow the Generic Release Process to release v1.2.0-rc1.

  14. Alert in the testing channel that a new Jenkins job needs to be created for this new branch.

  15. Prepare the master branch for the next development cycle:

    git checkout master; git pull
    
  16. Update the VERSION file to contain v1.2.90

  17. Add the VERSION file using git add and create & merge a PR titled Prepare for 1.3.0 development.

  18. Update the release branch on Jenkins to be tested on every change and Nightly.

  19. (Only 1.0 minor releases) Tag the newest 1.0.x Docker image as v1.0-stable and push it to Docker Hub. This will ensure that Kops uses the latest 1.0 release by default.

For the final release

  1. Follow the Generic Release Process to create the final release and replace X.Y.0-rcX with X.Y.0.

Testing

There are multiple ways to test Cilium functionality, including unit-testing and integration testing. In order to improve developer throughput, we provide ways to run both the unit and integration tests in your own workspace as opposed to being fully reliant on the Cilium CI infrastructure. We encourage all PRs to add unit tests and if necessary, integration tests. Consult the following pages to see how to run the variety of tests that have been written for Cilium, and information about Cilium’s CI infrastructure.

CI / Jenkins

The main CI infrastructure is maintained at https://jenkins.cilium.io/

Jobs Overview

Cilium-PR-Ginkgo-Tests-Validated

Runs validated Ginkgo tests which are confirmed to be stable and have been verified. These tests must always pass.

The configuration for this job is contained within ginkgo.Jenkinsfile.

It first runs unit tests using docker-compose with a YAML file located at test/docker-compose.yaml.

The next steps happen in parallel:

  • Runs the single-node e2e tests using the Docker runtime.
  • Runs the multi-node Kubernetes e2e tests against the latest default version of Kubernetes specified above.

This job can be used to run tests on custom branches. To do so, log into Jenkins and go to https://jenkins.cilium.io/job/cilium-ginkgo/configure . Then add your branch name to the GitHub Organization -> cilium -> Filter by name (with wildcards) -> Include field and save the changes. Once you no longer need to run tests on your branch, please remove it from this field.

Note

It is also possible to run specific tests from this suite via test-focus and test-gke. These take the trailing words as a regex. If you want to run only one It block, you need to prepend it with its test suite to create a regex, e.g. test-focus K8sDatapathConfig.*Check connectivity with automatic direct nodes routes

test-focus K8s Runs all kubernetes tests
test-focus K8sConformance Runs all k8s conformance tests
test-focus K8sChaos Runs all k8s chaos tests
test-focus K8sDatapathConfig Runs all k8s datapath configuration tests
test-focus K8sDemos Runs all k8s demo tests
test-focus K8sKubeProxyFreeMatrix Runs all k8s kube-proxy free matrix tests
test-focus K8sFQDNTest Runs all k8s fqdn tests
test-focus K8sHealthTest Runs all k8s health tests
test-focus K8sHubbleTest Runs all k8s Hubble tests
test-focus K8sIdentity Runs all k8s identity tests
test-focus K8sIstioTest Runs all k8s Istio tests
test-focus K8sKafkaPolicyTest Runs all k8s Kafka tests
test-focus K8sPolicyTest Runs all k8s policy tests
test-focus K8sServicesTest Runs all k8s services tests
test-focus K8sUpdates Runs k8s update tests
test-focus Runtime Runs all runtime tests
Cilium-PR-Ginkgo-Tests-Kernel

Runs the Kubernetes e2e tests with a 4.19 kernel. The configuration for this job is contained within ginkgo-kernel.Jenkinsfile.

Cilium-PR-Ginkgo-Tests-k8s

Runs the Kubernetes e2e tests against all Kubernetes versions that are not currently tested as part of each pull request but which Cilium still supports, as well as the most recently released versions of Kubernetes that might not yet be declared stable by Kubernetes upstream. Check the contents of ginkgo-kubernetes-all.Jenkinsfile in the branch of Cilium for which you are running tests to see which Kubernetes versions will be tested against.

Ginkgo-CI-Tests-Pipeline


Cilium-Nightly-Tests-PR

Runs long-lived tests which take extended time. Some of these tests have an expected failure rate.

Nightly tests run once per day in the Cilium-Nightly-Tests-PR job. The configuration for this job is stored in Jenkinsfile.nightly.

To see the results of these tests, you can view the JUnit Report for an individual job:

  1. Click on the build number you wish to get test results from on the left hand side of the Cilium-Nightly-Tests-PR job.
  2. Click on ‘Test Results’ on the left side of the page to view the results from the build. This will give you a report of which tests passed and failed. You can click on each test to view its corresponding output created from Ginkgo.

This first runs the Nightly tests with the following setup:

  • 4 Kubernetes 1.8 nodes
  • 4 GB of RAM per node.
  • 4 vCPUs per node.

Then, it runs Kubernetes tests against versions of Kubernetes that are currently not tested as part of each pull-request, but that Cilium still supports.

It also runs a variety of tests against Envoy to ensure that proxy functionality is working correctly.

Packer-CI-Build

As part of Cilium development, we use a custom base box with a bunch of pre-installed libraries and tools that we need to enhance our daily workflow. That base box is built with Packer and it is hosted in the packer-ci-build GitHub repository.

New versions of this box can be created via Jenkins Packer Build, where new builds of the image will be pushed to Vagrant Cloud . The version of the image corresponds to the BUILD_ID environment variable in the Jenkins job. That version ID will be used in Cilium Vagrantfiles.

Changes to this image are made via contributions to the packer-ci-build repository. Authorized GitHub users can trigger builds with a GitHub comment on the PR containing the trigger phrase build-me-please. If a new box needs to be based on a branch other than master, authorized developers can run the build with custom parameters: go to Build with parameters in the Jenkins job and set the desired Cilium base branch.

This box will need to be updated when a new developer needs a new dependency that is not installed in the current version of the box, or if a dependency that is cached within the box becomes stale.

Make sure that you update the vagrant box versions in the test Vagrantfile and the root Vagrantfile after a new box is built and tested.

Once you change the image versions locally, create a branch named pr/update-packer-ci-build and open a PR against github.com/cilium/cilium. It is important that you use that branch name so the VM images are cached in packet.net before the branch is merged.

Testing matrix

We are currently testing the following kernel / k8s version pairs in our CI:

Kubernetes version   Kernel version
Vagrant k8s clusters per PR:
  1.11               5.x.x (net-next)
  1.17               4.19.57
  1.18               4.9
Vagrant k8s clusters per backport (in addition to PR):
  1.{12-17}          4.9
GKE clusters:
  1.14.10            4.14.138+

Triggering Pull-Request Builds With Jenkins

To ensure that build resources are used judiciously, builds on Jenkins are manually triggered via comments on each pull-request that contain “trigger-phrases”. Only members of the Cilium GitHub organization are allowed to trigger these jobs. Refer to the table below for information regarding which phrase triggers which build, which build is required for a pull-request to be merged, etc. Each linked job contains a description illustrating which subset of tests the job runs.

Jenkins Job                     Trigger Phrases                   Required To Merge?
K8s-1.18-kernel-4.9             test-me-please, retest-4.9        Yes
K8s-1.17-Kernel-4.19            test-me-please, retest-4.19       Yes
K8s-1.11-Kernel-netnext         test-me-please, retest-net-next   Yes
Runtime-4.9                     test-me-please, retest-runtime    Yes
Cilium-Ginkgo-Tests-Focus       test-focus                        No
Cilium-PR-Ginkgo-Tests-k8s      test-missed-k8s                   No
Cilium-Nightly-Tests-PR         test-nightly                      No
Cilium-PR-Kubernetes-Upstream   test-upstream-k8s                 No
Cilium-PR-Flannel               test-flannel                      No
Cilium-PR-K8s-GKE               test-me-please, test-gke          Yes

For Backport PRs, the phrase never-tell-me-the-odds should be used to trigger all of the above jobs which are marked as required to validate changes to existing releases.

There are some feature flags based on Pull Request labels; the list of labels is the following:

  • area/containerd: Enable containerd runtime on all Kubernetes tests.
  • ci/net-next: Run tests on net-next kernel. This causes the test-me-please target to only run on the net-next kernel. It is purely for testing on a different kernel, to merge a PR it must pass the CI without this flag.

Using Jenkins for testing

Typically when running Jenkins tests via one of the above trigger phrases, it will run all of the tests in that particular category. However, there may be cases where you just want to run a single test quickly on Jenkins and observe the result. To do so, you need to update the relevant test to have a custom name and update the Jenkinsfile to focus on that test. Below is an example patch that shows how this can be achieved.

diff --git a/ginkgo.Jenkinsfile b/ginkgo.Jenkinsfile
index ee17808748a6..637f99269a41 100644
--- a/ginkgo.Jenkinsfile
+++ b/ginkgo.Jenkinsfile
@@ -62,10 +62,10 @@ pipeline {
             steps {
                 parallel(
                     "Runtime":{
-                        sh 'cd ${TESTDIR}; ginkgo --focus="RuntimeValidated*" -v -noColor'
+                        sh 'cd ${TESTDIR}; ginkgo --focus="XFoooo*" -v -noColor'
                     },
                     "K8s-1.9":{
-                        sh 'cd ${TESTDIR}; K8S_VERSION=1.9 ginkgo --focus=" K8sValidated*" -v -noColor ${FAILFAST}'
+                        sh 'cd ${TESTDIR}; K8S_VERSION=1.9 ginkgo --focus=" K8sFooooo*" -v -noColor ${FAILFAST}'
                     },
                     failFast: true
                 )
diff --git a/test/k8sT/Nightly.go b/test/k8sT/Nightly.go
index 62b324619797..3f955c73a818 100644
--- a/test/k8sT/Nightly.go
+++ b/test/k8sT/Nightly.go
@@ -466,7 +466,7 @@ var _ = Describe("NightlyExamples", func() {

                })

-               It("K8sValidated Updating Cilium stable to master", func() {
+               FIt("K8sFooooo K8sValidated Updating Cilium stable to master", func() {
                        podFilter := "k8s:zgroup=testapp"

                        //This test should run in each PR for now.

CI Failure Triage

This section describes the process to triage CI failures. We define 3 categories:

Keyword Description
Flake Failure due to a temporary situation such as loss of connectivity to external services or bug in system component, e.g. quay.io is down, VM race conditions, kube-dns bug, …
CI-Bug Bug in the test itself that renders the test unreliable, e.g. a timing issue where a policy is imported but the test fails to block until the policy is enforced before connectivity is verified.
Regression Failure is due to a regression, all failures in the CI that are not caused by bugs in the test are considered regressions.
Pipelines subject to triage

Build/test failures for the following Jenkins pipelines must be reported as GitHub issues using the process below:

Pipeline Description
Ginkgo-Tests-Validated-master Runs whenever a PR is merged into master
Ginkgo-CI-Tests-Pipeline Runs every two hours on the master branch
Master-Nightly Runs durability tests every night
Vagrant-Master-Boxes-Packer-Build Runs on merge into packer-ci-build repository.
Release-branch Runs various Ginkgo tests on merge into branch “review-docs-redesign”
Triage process
  1. Discover untriaged Jenkins failures via the jenkins-failures.sh script. It defaults to checking the previous 24 hours but this can be modified by setting the SINCE environment variable (it is a unix timestamp). The script checks the various test pipelines that need triage.

    $ contrib/scripts/jenkins-failures.sh
    

    Note

    You can quickly assign SINCE with statements like SINCE=`date -d -3days +%s`

  2. Investigate the failure you are interested in and determine if it is a CI-Bug, Flake, or a Regression as defined in the table above.

    1. Search GitHub issues to see if the bug is already filed. Make sure to also include closed issues in your search, as a CI issue can be considered solved and then re-appear. Good search terms are:

      • The test name, e.g.

        k8s-1.7.K8sValidatedKafkaPolicyTest Kafka Policy Tests KafkaPolicies (from (k8s-1.7.xml))
        
      • The line on which the test failed, e.g.

        github.com/cilium/cilium/test/k8sT/KafkaPolicies.go:202
        
      • The error message, e.g.

        Failed to produce from empire-hq on topic deathstar-plan
        
  3. If a corresponding GitHub issue exists, update it with:

    1. A link to the failing Jenkins build (note that the build information is eventually deleted).
    2. Attach the zipfile downloaded from Jenkins with logs from the failing tests. A zipfile for all tests is also available.
    3. Check how much time has passed since the last reported occurrence of this failure and move this issue to the correct column in the CI flakes project board.
  4. If no existing GitHub issue was found, file a new GitHub issue:

    1. Attach zipfile downloaded from Jenkins with logs from failing test
    2. If the failure is a new regression or a real bug:
      1. Title: <Short bug description>
      2. Labels kind/bug and needs/triage.
    3. If failure is a new CI-Bug, Flake or if you are unsure:
      1. Title CI: <testname>: <cause>, e.g. CI: K8sValidatedPolicyTest Namespaces: cannot curl service
      2. Labels kind/bug/CI and needs/triage
      3. Include a link to the failing Jenkins build (note that the build information is eventually deleted).
      4. Attach zipfile downloaded from Jenkins with logs from failing test
      5. Include the test name and whole Stacktrace section to help others find this issue.
      6. Add issue to CI flakes project.

    Note

    Be extra careful when you see a new flake on a PR, and want to open an issue. It’s much more difficult to debug these without context around the PR and the changes it introduced. When creating an issue for a PR flake, include a description of the code change, the PR, or the diff. If it isn’t related to the PR, then it should already happen in master, and a new issue isn’t needed.

  5. Edit the description of the Jenkins build to mark it as triaged. This will exclude it from future jenkins-failures.sh output.

    1. Login -> Click on build -> Edit Build Information
    2. Add the failure type and GH issue number. Use the table describing the failure categories, at the beginning of this section, to help categorize them.

    Note

    This step can only be performed with an account on Jenkins. If you are interested in CI failure reviews and do not have an account yet, ping us on Slack.

Examples:

  • Flake, quay.io is down
  • Flake, DNS not ready, #3333
  • CI-Bug, K8sValidatedPolicyTest: Namespaces, pod not ready, #9939
  • Regression, k8s host policy, #1111
Bisect process

If you are unable to triage the issue, you may try to use the bisect job to find when things went awry in Jenkins.

  1. Log in to Jenkins
  2. Go to https://jenkins.cilium.io/job/bisect-cilium/configure .
  3. Under Git Bisect build step fill in Good start revision and Bad end revision.
  4. Write description of what you are looking for under Search Identifier.
  5. Adjust Retry number and Min Successful Runs to account for current CI flakiness.
  6. Save the configuration.
  7. Click “Build Now” in https://jenkins.cilium.io/job/bisect-cilium/ .
  8. This may take over a day depending on how many underlying builds will be created. The result will be in bisect-cilium console output, actual builds will be happening in https://jenkins.cilium.io/job/cilium-revision/ job.

Infrastructure details

Logging into VM running tests
  1. If you have access to credentials for Jenkins, log into the Jenkins slave running the test workload
  2. Identify the vagrant box running the specific test
$ vagrant global-status
id       name                          provider   state   directory
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
6e68c6c  k8s1-build-PR-1588-6          virtualbox running /root/jenkins/workspace/cilium_cilium_PR-1588-CWL743UTZEF6CPEZCNXQVSZVEW32FR3CMGKGY6667CU7X43AAZ4Q/tests/k8s
ec5962a  cilium-master-build-PR-1588-6 virtualbox running /root/jenkins/workspace/cilium_cilium_PR-1588-CWL743UTZEF6CPEZCNXQVSZVEW32FR3CMGKGY6667CU7X43AAZ4Q
bfaffaa  k8s2-build-PR-1588-6          virtualbox running /root/jenkins/workspace/cilium_cilium_PR-1588-CWL743UTZEF6CPEZCNXQVSZVEW32FR3CMGKGY6667CU7X43AAZ4Q/tests/k8s
3fa346c  k8s1-build-PR-1588-7          virtualbox running /root/jenkins/workspace/cilium_cilium_PR-1588-CWL743UTZEF6CPEZCNXQVSZVEW32FR3CMGKGY6667CU7X43AAZ4Q@2/tests/k8s
b7ded3c  cilium-master-build-PR-1588-7 virtualbox running /root/jenkins/workspace/cilium_cilium_PR-1588-CWL743UTZEF6CPEZCNXQVSZVEW32FR3CMGKGY6667CU7X43AAZ4Q@2
  3. Log into the specific VM
$ JOB_BASE_NAME=PR-1588 BUILD_NUMBER=6 vagrant ssh 6e68c6c
Jenkinsfiles Extensions

Cilium uses a custom Jenkins helper library to gather metadata from PRs and simplify our Jenkinsfiles. The exported methods are:

  • ispr(): return true if the current build is a PR.
  • setIfPr(string, string): return the first argument in case of a PR, if not a PR return the second one.
  • BuildIfLabel(String label, String Job): trigger a new Job if the PR has that specific Label.
  • Status(String status, String context): set pull request check status on the given context, example Status("SUCCESS", "$JOB_BASE_NAME")

Documentation

First, start a local document server that automatically refreshes when you save files for real-time preview. After installing pipenv, run:

$ make render-docs-live-preview

and preview the documentation at http://localhost:8000/ as you make changes. After making changes to Cilium documentation you should check that you did not introduce any new warnings or errors, and also check that your changes look as you intended one last time before opening a pull request. To do this you can build the docs:

$ make render-docs

This generates documentation files and starts a web server using a Docker container. You can view the updated documentation by opening either Documentation/_build/html/index.html or http://localhost:9081 in a browser.

End-To-End Testing Framework

Introduction

Cilium uses Ginkgo as a testing framework for writing end-to-end tests which test Cilium all the way from the API level (e.g. importing policies, CLI) to the datapath (i.e., whether the policy that is imported is enforced accordingly in the datapath). The tests in the test directory are built on top of Ginkgo. Ginkgo provides a rich framework for developing tests alongside the benefits of Golang (compilation-time checks, types, etc.). To get accustomed to the basics of Ginkgo, we recommend reading the Ginkgo Getting-Started Guide , as well as running example tests to get a feel for the Ginkgo workflow.

These test scripts will invoke vagrant to create virtual machine(s) to run the tests. The tests make heavy use of the Ginkgo focus concept to determine which VMs are necessary to run particular tests. All test names must begin with one of the following prefixes:

  • Runtime: Test cilium in a runtime environment running on a single node.
  • K8s: Create a small multi-node kubernetes environment for testing features beyond a single host, and for testing kubernetes-specific features.
  • Nightly: sets up a multinode Kubernetes cluster to run scale, performance, and chaos testing for Cilium.
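
For illustration, a Describe block whose text starts with one of these prefixes is what the corresponding focus matches. A minimal sketch (the suite and test names below are made up, not actual Cilium tests):

package example

import . "github.com/onsi/ginkgo"

// Matched by --focus="Runtime*" because the Describe text begins with "Runtime".
var _ = Describe("RuntimeExamplePolicyEnforcement", func() {
	It("runs on a single node", func() {
		// test body
	})
})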

Running End-To-End Tests

Running All Ginkgo Tests

Running all of the Ginkgo tests may take an hour or longer. To run all the ginkgo tests, invoke the make command as follows from the root of the cilium repository:

$ sudo make -C test/ test

The first time that this is invoked, the testsuite will pull the testing VMs and provision Cilium into them. This may take several minutes, depending on your internet connection speed. Subsequent runs of the test will reuse the image.

Running Runtime Tests

To run all of the runtime tests, execute the following command from the test directory:

ginkgo --focus="Runtime*" -noColor

Ginkgo searches for all tests in all subdirectories that are “named” beginning with the string “Runtime” and contain any characters after it. For instance, here is an example showing what tests will be run using Ginkgo’s dryRun option:

$ ginkgo --focus="Runtime*" -noColor -v -dryRun
Running Suite: runtime
======================
Random Seed: 1516125117
Will run 42 of 164 specs
................
RuntimePolicyEnforcement Policy Enforcement Always
  Always to Never with policy
  /Users/ianvernon/go/src/github.com/cilium/cilium/test/runtime/Policies.go:258
•
------------------------------
RuntimePolicyEnforcement Policy Enforcement Always
  Always to Never without policy
  /Users/ianvernon/go/src/github.com/cilium/cilium/test/runtime/Policies.go:293
•
------------------------------
RuntimePolicyEnforcement Policy Enforcement Never
  Container creation
  /Users/ianvernon/go/src/github.com/cilium/cilium/test/runtime/Policies.go:332
•
------------------------------
RuntimePolicyEnforcement Policy Enforcement Never
  Never to default with policy
  /Users/ianvernon/go/src/github.com/cilium/cilium/test/runtime/Policies.go:349
.................
Ran 42 of 164 Specs in 0.002 seconds
SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 122 Skipped PASS

Ginkgo ran 1 suite in 1.830262168s
Test Suite Passed

The output has been truncated. For more information about this functionality, consult the aforementioned Ginkgo documentation.

Running Kubernetes Tests

To run all of the Kubernetes tests, run the following command from the test directory:

ginkgo --focus="K8s*" -noColor

Similar to the Runtime test suite, Ginkgo searches for all tests in all subdirectories that are “named” beginning with the string “K8s” and contain any characters after it.

The Kubernetes tests support the following Kubernetes versions:

  • 1.8
  • 1.9
  • 1.10
  • 1.11
  • 1.12
  • 1.13
  • 1.14
  • 1.15
  • 1.16
  • 1.17
  • 1.18

By default, the Vagrant VMs are provisioned with Kubernetes 1.13. To run with any other supported version of Kubernetes, run the test suite with the following format:

K8S_VERSION=<version> ginkgo --focus="K8s*" -noColor

Note

When provisioning VMs with the net-next kernel (NETNEXT=1) on a VirtualBox version that does not match the version of the VirtualBox Guest Additions in the VM image, Vagrant will install a new version of the Additions with mount.vboxsf. The latter is not compatible with the vboxsf.ko shipped within the VM image, and thus syncing of shared folders will not work.

To avoid this, one can prevent Vagrant from installing the Additions by putting the following into $HOME/.vagrant.d/Vagrantfile:

Vagrant.configure('2') do |config|
  if Vagrant.has_plugin?("vagrant-vbguest") then
    config.vbguest.auto_update = false
  end

  config.vm.provider :virtualbox do |vbox|
    vbox.check_guest_additions = false
  end
end
Running Nightly Tests

To run all of the Nightly tests, run the following command from the test directory:

ginkgo --focus="Nightly*"  -noColor

Similar to the other test suites, Ginkgo searches for all tests in all subdirectories that are “named” beginning with the string “Nightly” and contain any characters after it. The default Kubernetes version for the Nightly tests is 1.8, but it can be changed using the environment variable K8S_VERSION.

Available CLI Options

For more advanced workflows, check the list of available custom options for the Cilium framework in the test/ directory and interact with ginkgo directly:

$ cd test/
$ ginkgo . -- --help | grep -A 1 cilium
  -cilium.SSHConfig string
        Specify a custom command to fetch SSH configuration (eg: 'vagrant ssh-config')
  -cilium.benchmarks
        Specifies benchmark tests should be run which may increase test time
  -cilium.holdEnvironment
        On failure, hold the environment in its current state
  -cilium.hubble-relay-image string
        Specifies which image of hubble-relay to use during tests
  -cilium.image string
        Specifies which image of cilium to use during tests
  -cilium.kubeconfig string
        Kubeconfig to be used for k8s tests
  -cilium.multinode
        Enable tests across multiple nodes. If disabled, such tests may silently pass (default true)
  -cilium.operator-image string
        Specifies which image of cilium-operator to use during tests
  -cilium.passCLIEnvironment
        Pass the environment invoking ginkgo, including PATH, to subcommands
  -cilium.provision
        Provision Vagrant boxes and Cilium before running test (default true)
  -cilium.provision-k8s
        Specifies whether Kubernetes should be deployed and installed via kubeadm or not (default true)
  -cilium.registry string
        docker registry hostname for Cilium image
  -cilium.showCommands
        Output which commands are ran to stdout
  -cilium.skipLogs
        skip gathering logs if a test fails
  -cilium.testScope string
        Specifies scope of test to be ran (k8s, Nightly, runtime)
  -cilium.timeout duration
        Specifies timeout for test run (default 24h0m0s)

For more information about other built-in options to Ginkgo, consult the Ginkgo documentation.

Running Specific Tests Within a Test Suite

If you want to run one specified test, there are a few options:

  • By modifying code: add the prefix “FIt” on the test you want to run; this marks the test as focused. Ginkgo will skip other tests and will only run the “focused” test. For more information, consult the Focused Specs documentation from Ginkgo.
It("Example test", func(){
    Expect(true).Should(BeTrue())
})

FIt("Example focused test", func(){
    Expect(true).Should(BeTrue())
})
  • From the command line: specify a more granular focus if you want to focus on, say, L7 tests:
ginkgo --focus "Run*" --focus "L7 "

This will focus on tests prefixed with “Run*”, and within that focus, run any test that starts with “L7”.

Compiling the tests without running them

To validate that the Go code you’ve written for testing is correct without needing to run the full test, you can build the test directory:

make -C test/ build

Test Reports

The Cilium Ginkgo framework formulates JUnit reports for each test. The following files are currently generated depending upon the test suite that is run:

  • runtime.xml
  • K8s.xml

Best Practices for Writing Tests

  • Provide informative output to console during a test using the By construct. This helps with debugging and gives those who did not write the test a good idea of what is going on. The lower the barrier of entry is for understanding tests, the better our tests will be!
  • Leave the testing environment in the same state that it was in when the test started by deleting resources, resetting configuration, etc.
  • Gather logs in the case that a test fails. If a test fails while running on Jenkins, a postmortem needs to be done to analyze why. So, dumping logs to a location where Jenkins can pick them up is of the utmost importance. Use the following code in an AfterFailed method:
AfterFailed(func() {
        vm.ReportFailed()
})

Ginkgo Extensions

In Cilium, some Ginkgo features are extended to cover some use cases that are useful for testing Cilium.

BeforeAll

This function will run before all BeforeEach within a Describe or Context. This method is an equivalent to SetUp or initialize functions in common unit test frameworks.

AfterAll

This method will run after all AfterEach functions defined in a Describe or Context. This method is used for tearing down objects that were created for and used by all Its within the given Context or Describe. It runs after all Its have run; this method is an equivalent to tearDown or finalize methods in common unit test frameworks.

A good use case for using AfterAll method is to remove containers or pods that are needed for multiple Its in the given Context or Describe.

JustAfterEach

This method will run just after each test and before AfterFailed and AfterEach. The main purpose of this method is to perform assertions for a group of tests. A good example of using a global JustAfterEach function is deadlock detection, which checks the Cilium logs for deadlocks that may have occurred in the duration of the tests.

AfterFailed

This method will run before all AfterEach and after JustAfterEach. This function is only called when the test failed. This construct is used to gather logs, the status of Cilium, etc., which provide data for analysis when tests fail.
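
To make the ordering of these constructs concrete, below is a minimal sketch of a test file combining them. It assumes the extended constructs (BeforeAll, AfterAll, JustAfterEach, AfterFailed) are dot-imported from Cilium's Ginkgo extensions in the test directory (the exact import path is an assumption), and the suite name and bodies are purely illustrative.

package example

import (
	// Assumed import: Cilium's Ginkgo extensions providing BeforeAll,
	// AfterAll, JustAfterEach and AfterFailed and re-exporting the
	// standard Ginkgo constructs; the exact path may differ.
	. "github.com/cilium/cilium/test/ginkgo-ext"

	. "github.com/onsi/gomega"
)

var _ = Describe("K8sExamplePolicyTest", func() {

	BeforeAll(func() {
		// Runs once, before any BeforeEach/It in this Describe:
		// deploy the pods shared by all tests below.
	})

	AfterAll(func() {
		// Runs once, after all Its have finished: remove the shared pods.
	})

	JustAfterEach(func() {
		// Runs after each It, before AfterFailed and AfterEach: a good
		// place for cross-cutting assertions such as deadlock detection.
	})

	AfterFailed(func() {
		// Runs only when a test has failed: gather logs and Cilium status,
		// e.g. via vm.ReportFailed() as in the Best Practices section above.
	})

	It("enforces the imported policy", func() {
		By("Importing the policy")
		By("Verifying that disallowed connectivity is denied")
		Expect(true).To(BeTrue())
	})
})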

Example Test Layout

Here is an example layout of how a test may be written with the aforementioned constructs:

Test description diagram:

Describe
    BeforeAll(A)
    AfterAll(A)
    AfterFailed(A)
    AfterEach(A)
    JustAfterEach(A)
    TESTA1
    TESTA2
    TESTA3
    Context
        BeforeAll(B)
        AfterAll(B)
        AfterFailed(B)
        AfterEach(B)
        JustAfterEach(B)
        TESTB1
        TESTB2
        TESTB3

Test execution flow:

Describe
    BeforeAll
    TESTA1; JustAfterEach(A), AfterFailed(A), AfterEach(A)
    TESTA2; JustAfterEach(A), AfterFailed(A), AfterEach(A)
    TESTA3; JustAfterEach(A), AfterFailed(A), AfterEach(A)
    Context
        BeforeAll(B)
        TESTB1:
           JustAfterEach(B); JustAfterEach(A)
           AfterFailed(B); AfterFailed(A);
           AfterEach(B) ; AfterEach(A);
        TESTB2:
           JustAfterEach(B); JustAfterEach(A)
           AfterFailed(B); AfterFailed(A);
           AfterEach(B) ; AfterEach(A);
        TESTB3:
           JustAfterEach(B); JustAfterEach(A)
           AfterFailed(B); AfterFailed(A);
           AfterEach(B) ; AfterEach(A);
        AfterAll(B)
    AfterAll(A)

Debugging:

Ginkgo provides different ways of debugging. If you want to see all the log messages in the console, you can run the test in verbose mode using the option -v:

ginkgo --focus "Runtime*" -v

If verbose mode is not enough, you can retrieve all run commands and their output in the report directory (./test/test_results). Each test creates a new folder, which contains a file called log where all information is saved; in the case of a failing test, exhaustive data will be added.

$ head test/test_results/RuntimeKafkaKafkaPolicyIngress/logs
level=info msg=Starting testName=RuntimeKafka
level=info msg="Vagrant: running command \"vagrant ssh-config runtime\""
cmd: "sudo cilium status" exitCode: 0
 KVStore:            Ok         Consul: 172.17.0.3:8300
ContainerRuntime:   Ok
Kubernetes:         Disabled
Kubernetes APIs:    [""]
Cilium:             Ok   OK
NodeMonitor:        Disabled
Allocated IPv4 addresses:
Running with delve

Delve is a debugging tool for Go applications. If you want to run your test with delve, you should add a new breakpoint using runtime.Breakpoint() in the code, and run ginkgo using dlv.

Example how to run ginkgo using dlv:

dlv test . -- --ginkgo.focus="Runtime" -ginkgo.v=true --cilium.provision=false
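
As a minimal sketch, the breakpoint is placed directly in the test body you want to inspect (the test name here is illustrative):

package example

import (
	"runtime"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

var _ = Describe("RuntimeExampleDebug", func() {
	It("pauses under delve", func() {
		// Execution stops here when the suite is run under dlv; without a
		// debugger attached this raises SIGTRAP, so only add it temporarily.
		runtime.Breakpoint()
		Expect(true).Should(BeTrue())
	})
})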

Running End-To-End Tests In Other Environments via kubeconfig

The end-to-end tests can be run with an arbitrary kubeconfig file. Normally the CI will use the Kubernetes cluster created via Vagrant, but this can be overridden with --cilium.kubeconfig. When used, ginkgo will not start a VM nor compile cilium. It will also skip some setup tasks like labeling nodes for testing.

This mode expects:

  • The current directory is cilium/test

  • A test focus with --focus. --focus="K8s*" selects all kubernetes tests.

  • Cilium images as full URLs specified with the --cilium.image and --cilium.operator-image options.

  • A working kubeconfig with the --cilium.kubeconfig option

  • A populated K8S_VERSION environment variable set to the version of the cluster

  • If appropriate, set the CNI_INTEGRATION environment variable to one of flannel, gke, eks, microk8s or minikube. This selects matching configuration overrides for cilium. Leaving this unset for non-matching integrations is also correct.

    For k8s environments that invoke an authentication agent, such as EKS and aws-iam-authenticator, set --cilium.passCLIEnvironment=true

An example invocation is

CNI_INTEGRATION=eks K8S_VERSION=1.13 ginkgo --focus="K8s*" -noColor -- -cilium.provision=false -cilium.kubeconfig=`echo ~/.kube/config` -cilium.image="docker.io/cilium/cilium:latest" -cilium.operator-image="docker.io/cilium/operator:latest" -cilium.passCLIEnvironment=true
GKE (experimental)

Not all tests can succeed on GKE. Many do, however, and may be useful.

1- Setup a cluster as in Installation on Google GKE or utilize an existing cluster.

Note

You do not need to deploy Cilium in this step, as the End-To-End Testing Framework handles the deployment of Cilium.

Note

The tests require machines larger than n1-standard-4. This can be set with --machine-type n1-standard-4 on cluster creation.

2- Invoke the tests from cilium/test with options set as explained in Running End-To-End Tests In Other Environments via kubeconfig

CNI_INTEGRATION=gke K8S_VERSION=1.13 ginkgo -v --focus="K8sDemo*" -noColor -- -cilium.provision=false -cilium.kubeconfig=`echo ~/.kube/config` -cilium.image="docker.io/cilium/cilium:latest" -cilium.operator-image="docker.io/cilium/operator:latest" -cilium.passCLIEnvironment=true

Note

The kubernetes version defaults to 1.13 but can be configured with versions between 1.13 and 1.15. Check with kubectl version

AWS EKS (experimental)

Not all tests can succeed on EKS. Many do, however, and may be useful.

1- Setup a cluster as in Installation on AWS EKS or utilize an existing cluster.

2- Invoke the tests from cilium/test with options set as explained in Running End-To-End Tests In Other Environments via kubeconfig

CNI_INTEGRATION=eks K8S_VERSION=1.14 ginkgo -v --focus="K8sDemo*" -noColor -- -cilium.provision=false -cilium.kubeconfig=`echo ~/.kube/config` -cilium.image="docker.io/cilium/cilium:latest" -cilium.operator-image="docker.io/cilium/operator:latest" -cilium.passCLIEnvironment=true

Be sure to pass --cilium.passCLIEnvironment=true to allow kubectl to invoke aws-iam-authenticator

Note

The kubernetes version varies between AWS regions. Be sure to check with kubectl version

Adding new Managed Kubernetes providers

All Managed Kubernetes test support relies on using a pre-configured kubeconfig file. This isn’t always adequate, however, and adding defaults specific to each provider is possible. The commit adding GKE support is a good reference.

1- Add a map of helm settings to act as an override for this provider in test/helpers/kubectl.go. These should be the helm settings used when generating cilium specs for this provider.

2- Add a unique CI Integration constant. This value is passed in when invoking ginkgo via the CNI_INTEGRATION environment variable.

3- Update the helm overrides mapping with the constant and the helm settings.

4- For cases where a test should be skipped, use SkipIfIntegration. To skip whole contexts, use SkipContextIf. More complex logic can be expressed with functions like IsIntegration. These functions are all part of the test/helpers package.
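
As a rough sketch of how these helpers might be used (the integration constant and the exact signatures below are assumptions; consult test/helpers for the real API):

package example

import (
	. "github.com/onsi/ginkgo"

	"github.com/cilium/cilium/test/helpers"
)

var _ = Describe("K8sExampleFeatureTest", func() {

	// Skip an entire context when running against GKE
	// (constant name and signature assumed).
	helpers.SkipContextIf(
		func() bool { return helpers.IsIntegration(helpers.CIIntegrationGKE) },
		"behaviour not available on GKE", func() {

			It("exercises the provider-specific behaviour", func() {
				// Or skip just this single test for a given integration.
				helpers.SkipIfIntegration(helpers.CIIntegrationGKE)
			})
		})
})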

Running End-To-End Tests In Other Environments via SSH

If you want to run tests in an arbitrary environment with SSH access, you can use --cilium.SSHConfig to provide the SSH configuration of the endpoint on which tests will be run. The tests presume the following on the remote instance:

  • Cilium source code is located in the directory /home/vagrant/go/src/github.com/cilium/cilium/.
  • Cilium is installed and running.

The ssh connection needs to be defined as an ssh-config file and needs to have the following targets:

  • runtime: To run runtime tests
  • k8s{1..2}-${K8S_VERSION}: to run Kubernetes tests. These instances must have Kubernetes installed and running as a prerequisite for running tests.

An example ssh-config can be the following:

Host runtime
  HostName 127.0.0.1
  User vagrant
  Port 2222
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  PasswordAuthentication no
  IdentityFile /home/eloy/.go/src/github.com/cilium/cilium/test/.vagrant/machines/runtime/virtualbox/private_key
  IdentitiesOnly yes
  LogLevel FATAL

To run this you can use the following command:

ginkgo  -v -- --cilium.provision=false --cilium.SSHConfig="cat ssh-config"

VMs for Testing

The VMs used for testing are defined in test/Vagrantfile. There are a variety of configuration options that can be passed as environment variables:

ENV variable        Default Value      Options      Description
K8S_NODES           2                  0..100       Number of Kubernetes nodes in the cluster
NFS                 0                  1            If the Cilium folder needs to be shared using NFS
IPv6                0                  0-1          If 1, the Kubernetes cluster will use IPv6
CONTAINER_RUNTIME   docker             containerd   To set the default container runtime in the Kubernetes cluster
K8S_VERSION         1.13               1.**         Kubernetes version to install
SERVER_BOX          cilium/ubuntu-dev               Vagrant Cloud base image
VM_CPUS             2                  0..100       Number of CPUs for the VM
VM_MEMORY           4096               \d+          RAM size in Megabytes

Further Assistance

Have a question about how the tests work or want to chat more about improving the testing infrastructure for Cilium? Hop on over to the testing channel on Slack.

Unit Testing

Cilium uses the standard go test framework in combination with gocheck for richer testing functionality.

Prerequisites

Some tests interact with the kvstore and depend on local kvstore instances of both etcd and consul. To start the local instances, run:

$ make start-kvstores

Running all tests

To run unit tests over the entire repository, run the following command in the project root directory:

$ make unit-tests

Testing individual packages

It is possible to test individual packages by invoking go test directly. You can then cd into the package subject to testing and invoke go test:

$ cd pkg/kvstore
$ go test

If you need more verbose output, you can pass in the -check.v and -check.vv arguments:

$ cd pkg/kvstore
$ go test -check.v -check.vv

If the unit tests have prerequisites (see Prerequisites above), you can use the following command to automatically set up the prerequisites, run the unit tests and tear down the prerequisites:

$ make unit-tests TESTPKGS=github.com/cilium/cilium/pkg/kvstore

Some packages have privileged tests. They are not run by default when you run the unit tests for the respective package. The privileged test files have an entry at the top of the test file as shown.

// +build privileged_tests

There are two ways that you can run the ‘privileged’ tests.

  1. To run all the ‘privileged’ tests for cilium follow the instructions below.
$ sudo -E make tests-privileged
  2. To run a specific package’s ‘privileged’ tests, follow the instructions below. Here, for example, we run the tests for the ‘routing’ package.
$ TESTPKGS="pkg/aws/eni/routing" sudo -E make tests-privileged

Running individual tests

Due to the use of gocheck, the standard go test -run will not work; instead, the -check.f argument has to be specified:

$ go test -check.f TestParallelAllocation
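
For context, gocheck tests are methods on a suite type that is registered with the framework, which is why selection uses -check.f rather than -run. A minimal sketch (the suite and test names below are illustrative, not the actual Cilium test):

package kvstore

import (
	"testing"

	. "gopkg.in/check.v1"
)

// Hook gocheck into the standard "go test" runner.
func Test(t *testing.T) { TestingT(t) }

type AllocationSuite struct{}

var _ = Suite(&AllocationSuite{})

// Selected with: go test -check.f TestParallelAllocation
func (s *AllocationSuite) TestParallelAllocation(c *C) {
	c.Assert(1+1, Equals, 2)
}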

Automatically run unit tests on code changes

The script contrib/shell/test.sh contains some helpful bash functions to improve the feedback cycle between writing tests and seeing their results. If you’re writing unit tests in a particular package, the watchtest function will watch for changes in a directory and run the unit tests for that package any time the files change. For example, if writing unit tests in pkg/policy, run this in a terminal next to your editor:

$ . contrib/shell/test.sh
$ watchtest pkg/policy

This shell script depends on the inotify-tools package on Linux.

The best way to get help if you get stuck is to ask a question on the Cilium Slack channel. With Cilium contributors across the globe, there is almost always someone available to help.

BPF and XDP Reference Guide

Note

This documentation section is targeted at developers and users who want to understand BPF and XDP in great technical depth. While reading this reference guide may help broaden your understanding of Cilium, it is not a requirement to use Cilium. Please refer to the Getting Started Guides and Architecture for a higher level introduction.

BPF is a highly flexible and efficient virtual machine-like construct in the Linux kernel, allowing bytecode to be executed at various hook points in a safe manner. It is used in a number of Linux kernel subsystems, most prominently networking, tracing and security (e.g. sandboxing).

Although BPF has existed since 1992, this document covers the extended Berkeley Packet Filter (eBPF) version which first appeared in kernel 3.18 and renders the original version, these days referred to as “classic” BPF (cBPF), mostly obsolete. cBPF is known to many as being the packet filter language used by tcpdump. Nowadays, the Linux kernel runs eBPF only and loaded cBPF bytecode is transparently translated into an eBPF representation in the kernel before program execution. This documentation will generally refer to the term BPF unless explicit differences between eBPF and cBPF are being pointed out.

Even though the name Berkeley Packet Filter hints at a packet filtering specific purpose, the instruction set is generic and flexible enough these days that there are many use cases for BPF apart from networking. See Further Reading for a list of projects which use BPF.

Cilium uses BPF heavily in its data path, see Architecture for further information. The goal of this chapter is to provide a BPF reference guide in order to gain understanding of BPF, its networking specific use including loading BPF programs with tc (traffic control) and XDP (eXpress Data Path), and to aid with developing Cilium’s BPF templates.

BPF Architecture

BPF does not define itself by only providing its instruction set, but also by offering further infrastructure around it such as maps which act as efficient key / value stores, helper functions to interact with and leverage kernel functionality, tail calls for calling into other BPF programs, security hardening primitives, a pseudo file system for pinning objects (maps, programs), and infrastructure for allowing BPF to be offloaded, for example, to a network card.

LLVM provides a BPF back end, so that tools like clang can be used to compile C into a BPF object file, which can then be loaded into the kernel. BPF is deeply tied to the Linux kernel and allows for full programmability without sacrificing native kernel performance.

Last but not least, the kernel subsystems making use of BPF are also part of BPF’s infrastructure. The two main subsystems discussed throughout this document are tc and XDP, to which BPF programs can be attached. XDP BPF programs are attached at the earliest networking driver stage and trigger a run of the BPF program upon packet reception. By definition, this achieves the best possible packet processing performance since packets cannot get processed at an even earlier point in software. However, since this processing occurs so early in the networking stack, the stack has not yet extracted metadata out of the packet. On the other hand, tc BPF programs are executed later in the kernel stack, so they have access to more metadata and core kernel functionality. Apart from tc and XDP programs, there are various other kernel subsystems as well which use BPF such as tracing (kprobes, uprobes, tracepoints, etc).

The following subsections provide further details on individual aspects of the BPF architecture.

Instruction Set

BPF is a general purpose RISC instruction set and was originally designed for the purpose of writing programs in a subset of C which can be compiled into BPF instructions through a compiler back end (e.g. LLVM), so that the kernel can later on map them through an in-kernel JIT compiler into native opcodes for optimal execution performance inside the kernel.

The advantages for pushing these instructions into the kernel include:

  • Making the kernel programmable without having to cross kernel / user space boundaries. For example, BPF programs related to networking, as in the case of Cilium, can implement flexible container policies, load balancing and other means without having to move packets to user space and back into the kernel. State between BPF programs and kernel / user space can still be shared through maps whenever needed.
  • Given the flexibility of a programmable data path, programs can be heavily optimized for performance also by compiling out features that are not required for the use cases the program solves. For example, if a container does not require IPv4, then the BPF program can be built to only deal with IPv6 in order to save resources in the fast-path.
  • In case of networking (e.g. tc and XDP), BPF programs can be updated atomically without having to restart the kernel, system services or containers, and without traffic interruptions. Furthermore, any program state can also be maintained throughout updates via BPF maps.
  • BPF provides a stable ABI towards user space, and does not require any third party kernel modules. BPF is a core part of the Linux kernel that is shipped everywhere, and guarantees that existing BPF programs keep running with newer kernel versions. This guarantee is the same guarantee that the kernel provides for system calls with regard to user space applications. Moreover, BPF programs are portable across different architectures.
  • BPF programs work in concert with the kernel; they make use of existing kernel infrastructure (e.g. drivers, netdevices, tunnels, protocol stack, sockets) and tooling (e.g. iproute2) as well as the safety guarantees which the kernel provides. Unlike kernel modules, BPF programs are verified through an in-kernel verifier in order to ensure that they cannot crash the kernel, always terminate, etc. XDP programs, for example, reuse the existing in-kernel drivers and operate on the provided DMA buffers containing the packet frames without exposing them or an entire driver to user space as in other models. Moreover, XDP programs reuse the existing stack instead of bypassing it. BPF can be considered a generic “glue code” to kernel facilities for crafting programs to solve specific use cases.

The execution of a BPF program inside the kernel is always event-driven! Examples:

  • A networking device which has a BPF program attached on its ingress path will trigger the execution of the program once a packet is received.
  • A kernel address which has a kprobe with a BPF program attached will trap once the code at that address gets executed, which will then invoke the kprobe’s callback function for instrumentation, subsequently triggering the execution of the attached BPF program.

BPF consists of eleven 64 bit registers with 32 bit subregisters, a program counter and 512 bytes of BPF stack space. Registers are named r0 - r10. The operating mode is 64 bit by default; the 32 bit subregisters can only be accessed through special ALU (arithmetic logic unit) operations. The 32 bit lower subregisters zero-extend into 64 bit when they are being written to.

Register r10 is the only register which is read-only and contains the frame pointer address in order to access the BPF stack space. The remaining r0 - r9 registers are general purpose and of read/write nature.

A BPF program can call into a predefined helper function, which is defined by the core kernel (never by modules). The BPF calling convention is defined as follows:

  • r0 contains the return value of a helper function call.
  • r1 - r5 hold arguments from the BPF program to the kernel helper function.
  • r6 - r9 are callee saved registers that will be preserved on helper function call.

The BPF calling convention is generic enough to map directly to x86_64, arm64 and other ABIs, thus all BPF registers map one to one to HW CPU registers, so that a JIT only needs to issue a call instruction, but no additional extra moves for placing function arguments. This calling convention was modeled to cover common call situations without having a performance penalty. Calls with 6 or more arguments are currently not supported. The helper functions in the kernel which are dedicated to BPF (BPF_CALL_0() to BPF_CALL_5() functions) are specifically designed with this convention in mind.

Register r0 is also the register containing the exit value for the BPF program. The semantics of the exit value are defined by the type of program. Furthermore, when handing execution back to the kernel, the exit value is passed as a 32 bit value.

Registers r1 - r5 are scratch registers, meaning the BPF program needs to either spill them to the BPF stack or move them to callee saved registers if these arguments are to be reused across multiple helper function calls. Spilling means that the variable in the register is moved to the BPF stack. The reverse operation of moving the variable from the BPF stack to the register is called filling. The reason for spilling/filling is due to the limited number of registers.

Upon entering execution of a BPF program, register r1 initially contains the context for the program. The context is the input argument for the program (similar to argc/argv pair for a typical C program). BPF is restricted to work on a single context. The context is defined by the program type, for example, a networking program can have a kernel representation of the network packet (skb) as the input argument.

The general operation of BPF is 64 bit to follow the natural model of 64 bit architectures in order to perform pointer arithmetic, pass pointers as well as 64 bit values into helper functions, and to allow for 64 bit atomic operations.

The maximum instruction limit per program is restricted to 4096 BPF instructions, which, by design, means that any program will terminate quickly. For kernels newer than 5.1 this limit was lifted to 1 million BPF instructions. Although the instruction set contains forward as well as backward jumps, the in-kernel BPF verifier will forbid loops so that termination is always guaranteed. Since BPF programs run inside the kernel, the verifier’s job is to make sure that these are safe to run, not affecting the system’s stability. This means that from an instruction set point of view, loops can be implemented, but the verifier will restrict that. However, there is also a concept of tail calls that allows for one BPF program to jump into another one. This, too, comes with an upper nesting limit of 32 calls, and is usually used to decouple parts of the program logic, for example, into stages.

The instruction format is modeled as two operand instructions, which helps mapping BPF instructions to native instructions during the JIT phase. The instruction set is of fixed size, meaning every instruction has a 64 bit encoding. Currently, 87 instructions have been implemented and the encoding also allows the set to be extended with further instructions when needed. The instruction encoding of a single 64 bit instruction on a big-endian machine is defined as a bit sequence from most significant bit (MSB) to least significant bit (LSB) of op:8, dst_reg:4, src_reg:4, off:16, imm:32. off and imm are of signed type. The encodings are part of the kernel headers and defined in the linux/bpf.h header, which also includes linux/bpf_common.h.
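
This layout corresponds to the struct bpf_insn definition from the kernel’s linux/bpf.h header, sketched below; refer to the header of the kernel version in use for the authoritative definition:

struct bpf_insn {
    __u8    code;           /* opcode */
    __u8    dst_reg:4;      /* destination register */
    __u8    src_reg:4;      /* source register */
    __s16   off;            /* signed offset */
    __s32   imm;            /* signed immediate constant */
};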

op defines the actual operation to be performed. Most of the encoding for op has been reused from cBPF. The operation can be based on register or immediate operands. The encoding of op itself provides information on which mode to use (BPF_X for denoting register-based operations, and BPF_K for immediate-based operations respectively). In the latter case, the destination operand is always a register. Both dst_reg and src_reg provide additional information about the register operands to be used (e.g. r0 - r9) for the operation. off is used in some instructions to provide a relative offset, for example, for addressing the stack or other buffers available to BPF (e.g. map values, packet data, etc), or jump targets in jump instructions. imm contains a constant / immediate value.

The available op instructions can be categorized into various instruction classes. These classes are also encoded inside the op field. The op field is divided into (from MSB to LSB) code:4, source:1 and class:3. class is the more generic instruction class, code denotes a specific operational code inside that class, and source tells whether the source operand is a register or an immediate value. Possible instruction classes include:

  • BPF_LD, BPF_LDX: Both classes are for load operations. BPF_LD is used for loading a double word as a special instruction spanning two instructions due to the imm:32 split, and for byte / half-word / word loads of packet data. The latter was carried over from cBPF mainly in order to keep cBPF to BPF translations efficient, since they have optimized JIT code. For native BPF these packet load instructions are less relevant nowadays. BPF_LDX class holds instructions for byte / half-word / word / double-word loads out of memory. Memory in this context is generic and could be stack memory, map value data, packet data, etc.
  • BPF_ST, BPF_STX: Both classes are for store operations. Similar to BPF_LDX the BPF_STX is the store counterpart and is used to store the data from a register into memory, which, again, can be stack memory, map value, packet data, etc. BPF_STX also holds special instructions for performing word and double-word based atomic add operations, which can be used for counters, for example. The BPF_ST class is similar to BPF_STX in providing instructions for storing data into memory, except that the source operand is an immediate value.
  • BPF_ALU, BPF_ALU64: Both classes contain ALU operations. Generally, BPF_ALU operations are in 32 bit mode and BPF_ALU64 in 64 bit mode. Both ALU classes have basic operations with source operand which is register-based and an immediate-based counterpart. Supported by both are add (+), sub (-), and (&), or (|), left shift (<<), right shift (>>), xor (^), mul (*), div (/), mod (%), neg (~) operations. Also mov (<X> := <Y>) was added as a special ALU operation for both classes in both operand modes. BPF_ALU64 also contains a signed right shift. BPF_ALU additionally contains endianness conversion instructions for half-word / word / double-word on a given source register.
  • BPF_JMP: This class is dedicated to jump operations. Jumps can be unconditional and conditional. Unconditional jumps simply move the program counter forward, so that the next instruction to be executed relative to the current instruction is off + 1, where off is the constant offset encoded in the instruction. Since off is signed, the jump can also be performed backwards as long as it does not create a loop and is within program bounds. Conditional jumps operate on both, register-based and immediate-based source operands. If the condition in the jump operations results in true, then a relative jump to off + 1 is performed, otherwise the next instruction (0 + 1) is performed. This fall-through jump logic differs compared to cBPF and allows for better branch prediction as it fits the CPU branch predictor logic more naturally. Available conditions are jeq (==), jne (!=), jgt (>), jge (>=), jsgt (signed >), jsge (signed >=), jlt (<), jle (<=), jslt (signed <), jsle (signed <=) and jset (jump if DST & SRC). Apart from that, there are three special jump operations within this class: the exit instruction which will leave the BPF program and return the current value in r0 as a return code, the call instruction, which will issue a function call into one of the available BPF helper functions, and a hidden tail call instruction, which will jump into a different BPF program.

The Linux kernel is shipped with a BPF interpreter which executes programs assembled in BPF instructions. Even cBPF programs are translated into eBPF programs transparently in the kernel, except for architectures that still ship with a cBPF JIT and have not yet migrated to an eBPF JIT.

Currently x86_64, arm64, ppc64, s390x, mips64, sparc64 and arm architectures come with an in-kernel eBPF JIT compiler.

All BPF handling such as loading of programs into the kernel or creation of BPF maps is managed through a central bpf() system call. It is also used for managing map entries (lookup / update / delete), and making programs as well as maps persistent in the BPF file system through pinning.
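
To illustrate, a minimal user space sketch creating a hash map directly through the bpf(2) system call might look as follows (loaders such as iproute2 or libbpf wrap this kind of call internally; the function name, key / value sizes and entry count are illustrative assumptions):

#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

/* Sketch: create a BPF hash map through the central bpf(2) system call.
 * On success, a new file descriptor referring to the map is returned,
 * on error -1 with errno set. */
static int create_hash_map(void)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(__u32);
    attr.value_size  = sizeof(__u64);
    attr.max_entries = 256;

    return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}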

Helper Functions

Helper functions are a concept which enables BPF programs to consult a core kernel defined set of function calls in order to retrieve / push data from / to the kernel. Available helper functions may differ for each BPF program type, for example, BPF programs attached to sockets are only allowed to call into a subset of helpers compared to BPF programs attached to the tc layer. Encapsulation and decapsulation helpers for lightweight tunneling constitute an example of functions which are only available to lower tc layers, whereas event output helpers for pushing notifications to user space are available to tc and XDP programs.

Each helper function is implemented with a commonly shared function signature similar to system calls. The signature is defined as:

u64 fn(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)

The calling convention as described in the previous section applies to all BPF helper functions.

The kernel abstracts helper functions into macros BPF_CALL_0() to BPF_CALL_5() which are similar to those of system calls. The following example is an extract from a helper function which updates map elements by calling into the corresponding map implementation callbacks:

BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
           void *, value, u64, flags)
{
    WARN_ON_ONCE(!rcu_read_lock_held());
    return map->ops->map_update_elem(map, key, value, flags);
}

const struct bpf_func_proto bpf_map_update_elem_proto = {
    .func           = bpf_map_update_elem,
    .gpl_only       = false,
    .ret_type       = RET_INTEGER,
    .arg1_type      = ARG_CONST_MAP_PTR,
    .arg2_type      = ARG_PTR_TO_MAP_KEY,
    .arg3_type      = ARG_PTR_TO_MAP_VALUE,
    .arg4_type      = ARG_ANYTHING,
};

This approach has various advantages: whereas cBPF overloaded its load instructions in order to fetch data at an impossible packet offset as a way to invoke auxiliary helper functions, and each cBPF JIT had to implement support for each such cBPF extension, in eBPF every newly added helper function is JIT compiled in a transparent and efficient way: the JIT compiler only needs to emit a call instruction, since the register mapping is made in such a way that BPF register assignments already match the underlying architecture’s calling convention. This allows the core kernel to be easily extended with new helper functionality. All BPF helper functions are part of the core kernel and cannot be extended or added through kernel modules.

The aforementioned function signature also allows the verifier to perform type checks. The above struct bpf_func_proto is used to hand all the necessary information which needs to be known about the helper to the verifier, so that the verifier can make sure that the expected types from the helper match the current contents of the BPF program’s analyzed registers.

Argument types can range from passing in any kind of value up to restricted contents such as a pointer / size pair for the BPF stack buffer, which the helper should read from or write to. In the latter case, the verifier can also perform additional checks, for example, whether the buffer was previously initialized.

The list of available BPF helper functions is rather long and constantly growing, for example, at the time of this writing, tc BPF programs can choose from 38 different BPF helpers. The kernel’s struct bpf_verifier_ops contains a get_func_proto callback function that provides the mapping of a specific enum bpf_func_id to one of the available helpers for a given BPF program type.
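
For illustration, a BPF C program commonly declares a helper as a function pointer set to the helper’s id. The sketch below follows the iproute2-style convention also used later in this document; the declaration itself is a loader-side convention, not a kernel-provided API:

#include <linux/bpf.h>

/* Sketch: declare the map lookup helper by its id. At load time the
 * verifier uses the helper’s bpf_func_proto, obtained via the program
 * type’s get_func_proto callback, to check that the call is allowed
 * and that the argument types match. */
static void *(*map_lookup_elem)(void *map, const void *key) =
        (void *) BPF_FUNC_map_lookup_elem;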

Maps

_images/bpf_map.png

Maps are efficient key / value stores that reside in kernel space. They can be accessed from a BPF program in order to keep state among multiple BPF program invocations. They can also be accessed through file descriptors from user space and can be arbitrarily shared with other BPF programs or user space applications.

BPF programs which share maps with each other are not required to be of the same program type, for example, tracing programs can share maps with networking programs. A single BPF program can currently access up to 64 different maps directly.

Map implementations are provided by the core kernel. There are generic maps with per-CPU and non-per-CPU flavor that can read / write arbitrary data, but there are also a few non-generic maps that are used along with helper functions.

Generic maps currently available are BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, BPF_MAP_TYPE_PERCPU_HASH, BPF_MAP_TYPE_PERCPU_ARRAY, BPF_MAP_TYPE_LRU_HASH, BPF_MAP_TYPE_LRU_PERCPU_HASH and BPF_MAP_TYPE_LPM_TRIE. They all use the same common set of BPF helper functions in order to perform lookup, update or delete operations while implementing a different backend with differing semantics and performance characteristics.

Non-generic maps that are currently in the kernel are BPF_MAP_TYPE_PROG_ARRAY, BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_MAP_TYPE_CGROUP_ARRAY, BPF_MAP_TYPE_STACK_TRACE, BPF_MAP_TYPE_ARRAY_OF_MAPS, BPF_MAP_TYPE_HASH_OF_MAPS. For example, BPF_MAP_TYPE_PROG_ARRAY is an array map which holds other BPF programs, BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS both hold pointers to other maps such that entire BPF maps can be atomically replaced at runtime. These types of maps tackle a specific issue which was unsuitable to be implemented solely through a BPF helper function since additional (non-data) state is required to be held across BPF program invocations.
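
As a sketch of user space access through a map’s file descriptor, the following example updates a single element via the bpf(2) system call (map_fd is assumed to come from map creation or from retrieving a pinned map as described in the next section):

#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

/* Sketch: create or update one key / value pair in the map referenced
 * by map_fd from user space. */
static int update_elem(int map_fd, const void *key, const void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (uint64_t)(unsigned long)key;
    attr.value  = (uint64_t)(unsigned long)value;
    attr.flags  = BPF_ANY;   /* create the element or update an existing one */

    return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}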

Object Pinning

_images/bpf_fs.png

BPF maps and programs act as a kernel resource and can only be accessed through file descriptors, backed by anonymous inodes in the kernel. This comes with a number of advantages, but also some disadvantages:

User space applications can make use of most file descriptor related APIs, file descriptor passing for Unix domain sockets works transparently, etc, but at the same time, file descriptors are limited to a process’s lifetime, which makes options like map sharing rather cumbersome to carry out.

Thus, it brings a number of complications for certain use cases such as iproute2, where tc or XDP sets up and loads the program into the kernel and eventually terminates. With that, access to maps from user space is also lost, where it could otherwise be useful, for example, when maps are shared between ingress and egress locations of the data path. Also, third party applications may wish to monitor or update map contents during BPF program runtime.

To overcome this limitation, a minimal kernel space BPF file system has been implemented, where BPF maps and programs can be pinned, a process called object pinning. The BPF system call has therefore been extended with two new commands which can pin (BPF_OBJ_PIN) or retrieve (BPF_OBJ_GET) a previously pinned object.

For instance, tools such as tc make use of this infrastructure for sharing maps on ingress and egress. The BPF related file system is not a singleton; it supports multiple mount instances, hard and soft links, and so on.
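
As a sketch of how the two commands are used from user space, the helpers below pin an object’s file descriptor under a path in the BPF file system (assumed to be mounted at /sys/fs/bpf) and retrieve it again, possibly from a different process:

#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

/* Sketch: pin the object behind fd (map or program) at the given path. */
static int obj_pin(int fd, const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(unsigned long)path;
    attr.bpf_fd   = fd;

    return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
}

/* Sketch: retrieve a previously pinned object; returns a new fd. */
static int obj_get(const char *path)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(unsigned long)path;

    return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}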

Tail Calls

_images/bpf_tailcall.png

Another concept that can be used with BPF is called tail calls. Tail calls can be seen as a mechanism that allows one BPF program to call another, without returning back to the old program. Such a call has minimal overhead as unlike function calls, it is implemented as a long jump, reusing the same stack frame.

Such programs are verified independently of each other, thus for transferring state, either per-CPU maps as scratch buffers or in case of tc programs, skb fields such as the cb[] area must be used.

Only programs of the same type can be tail called, and they also need to match in terms of JIT compilation, thus either JIT compiled or only interpreted programs can be invoked, but not mixed together.

There are two components involved for carrying out tail calls: the first part needs to set up a specialized map called a program array (BPF_MAP_TYPE_PROG_ARRAY) that can be populated by user space with key / values, where the values are the file descriptors of the tail called BPF programs; the second part is the bpf_tail_call() helper, to which the context, a reference to the program array and the lookup key are passed. The kernel then inlines this helper call directly into a specialized BPF instruction. Such a program array is currently write-only from the user space side.

The kernel looks up the related BPF program from the passed file descriptor and atomically replaces program pointers at the given map slot. When no map entry has been found at the provided key, the kernel will just “fall through” and continue execution of the old program with the instructions following after the bpf_tail_call(). Tail calls are a powerful utility, for example, parsing network headers could be structured through tail calls. During runtime, functionality can be added or replaced atomically, and thus altering the BPF program’s execution behavior.
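
To make this concrete, a minimal tc BPF sketch in the iproute2 style used later in this document could look as follows (the map name, section names and slot index are illustrative assumptions; user space still has to populate slot 0 with the file descriptor of another loaded program):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <stdint.h>
#include <iproute2/bpf_elf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

#ifndef BPF_FUNC
# define BPF_FUNC(NAME, ...)              \
   (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
#endif

static void BPF_FUNC(tail_call, void *ctx, void *map, uint32_t index);

/* Program array with a single slot which user space can populate with
 * the file descriptor of another already loaded tc BPF program. */
struct bpf_elf_map jmp_map __section("maps") = {
    .type           = BPF_MAP_TYPE_PROG_ARRAY,
    .size_key       = sizeof(uint32_t),
    .size_value     = sizeof(uint32_t),
    .max_elem       = 1,
};

__section("ingress")
int tc_ingress(struct __sk_buff *skb)
{
    tail_call(skb, &jmp_map, 0);

    /* Fall-through path: only executed when slot 0 is empty. */
    return TC_ACT_OK;
}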

BPF to BPF Calls

_images/bpf_call.png

Aside from BPF helper calls and BPF tail calls, a more recent feature that has been added to the BPF core infrastructure is BPF to BPF calls. Before this feature was introduced into the kernel, a typical BPF C program had to declare any reusable code that, for example, resides in headers as always_inline such that when LLVM compiles and generates the BPF object file all these functions were inlined and therefore duplicated many times in the resulting object file, artificially inflating its code size:

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

#ifndef __inline
# define __inline                         \
   inline __attribute__((always_inline))
#endif

static __inline int foo(void)
{
    return XDP_DROP;
}

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return foo();
}

char __license[] __section("license") = "GPL";

The main reason why this was necessary was due to lack of function call support in the BPF program loader as well as verifier, interpreter and JITs. Starting with Linux kernel 4.16 and LLVM 6.0 this restriction got lifted and BPF programs no longer need to use always_inline everywhere. Thus, the prior shown BPF example code can then be rewritten more naturally as:

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

static int foo(void)
{
    return XDP_DROP;
}

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return foo();
}

char __license[] __section("license") = "GPL";

Mainstream BPF JIT compilers like x86_64 and arm64 support BPF to BPF calls today with others following in the near future. BPF to BPF calls are an important performance optimization since they heavily reduce the generated BPF code size and are therefore friendlier to a CPU’s instruction cache.

The calling convention known from BPF helper functions applies to BPF to BPF calls just as well, meaning r1 up to r5 are for passing arguments to the callee and the result is returned in r0. r1 to r5 are scratch registers whereas r6 to r9 are preserved across calls in the usual way. The maximum number of nested calls, that is, allowed call frames, is 8. A caller can pass pointers (e.g. to the caller’s stack frame) down to the callee, but never vice versa.

BPF to BPF calls are currently incompatible with the use of BPF tail calls, since the latter requires to reuse the current stack setup as-is, whereas the former adds additional stack frames and thus changes the expected layout for tail calls.

BPF JIT compilers emit separate images for each function body and later fix up the function call addresses in the image in a final JIT pass. This has proven to require minimal changes to the JITs in that they can treat BPF to BPF calls as conventional BPF helper calls.

JIT

_images/bpf_jit.png

The 64 bit x86_64, arm64, ppc64, s390x, mips64, sparc64 and 32 bit arm, x86_32 architectures all ship with an in-kernel eBPF JIT compiler. All of them are feature equivalent, and the JIT can be enabled through:

# echo 1 > /proc/sys/net/core/bpf_jit_enable

The 32 bit mips, ppc and sparc architectures currently have a cBPF JIT compiler. These architectures, as well as all remaining architectures supported by the Linux kernel which have no BPF JIT compiler at all, need to run eBPF programs through the in-kernel interpreter.

In the kernel’s source tree, eBPF JIT support can be easily determined through issuing a grep for HAVE_EBPF_JIT:

# git grep HAVE_EBPF_JIT arch/
arch/arm/Kconfig:       select HAVE_EBPF_JIT   if !CPU_ENDIAN_BE32
arch/arm64/Kconfig:     select HAVE_EBPF_JIT
arch/powerpc/Kconfig:   select HAVE_EBPF_JIT   if PPC64
arch/mips/Kconfig:      select HAVE_EBPF_JIT   if (64BIT && !CPU_MICROMIPS)
arch/s390/Kconfig:      select HAVE_EBPF_JIT   if PACK_STACK && HAVE_MARCH_Z196_FEATURES
arch/sparc/Kconfig:     select HAVE_EBPF_JIT   if SPARC64
arch/x86/Kconfig:       select HAVE_EBPF_JIT   if X86_64

JIT compilers speed up execution of the BPF program significantly since they reduce the per instruction cost compared to the interpreter. Often instructions can be mapped 1:1 with native instructions of the underlying architecture. This also reduces the resulting executable image size and is therefore more instruction cache friendly to the CPU. In particular in case of CISC instruction sets such as x86, the JITs are optimized for emitting the shortest possible opcodes for a given instruction to shrink the total necessary size for the program translation.

Hardening

BPF locks the entire BPF interpreter image (struct bpf_prog) as well as the JIT compiled image (struct bpf_binary_header) in the kernel as read-only during the program’s lifetime in order to prevent the code from potential corruptions. Any corruption happening at that point, for example, due to some kernel bugs will result in a general protection fault and thus crash the kernel instead of allowing the corruption to happen silently.

Architectures that support setting the image memory as read-only can be determined through:

$ git grep ARCH_HAS_SET_MEMORY | grep select
arch/arm/Kconfig:    select ARCH_HAS_SET_MEMORY
arch/arm64/Kconfig:  select ARCH_HAS_SET_MEMORY
arch/s390/Kconfig:   select ARCH_HAS_SET_MEMORY
arch/x86/Kconfig:    select ARCH_HAS_SET_MEMORY

The option CONFIG_ARCH_HAS_SET_MEMORY is not user configurable, which means this protection is always built-in on the architectures listed above. Other architectures might follow in the future.

In case of the x86_64 JIT compiler, the JITing of the indirect jump from the use of tail calls is realized through a retpoline in case CONFIG_RETPOLINE has been set, which is the default in most modern Linux distributions at the time of writing.

When /proc/sys/net/core/bpf_jit_harden is set to 1, additional hardening steps for the JIT compilation take effect for unprivileged users. This effectively trades off a bit of performance for a reduced (potential) attack surface in case of untrusted users operating on the system. Even with that decrease, program execution still performs better than switching to the interpreter entirely.

Currently, enabling hardening will blind all user provided 32 bit and 64 bit constants from the BPF program when it gets JIT compiled in order to prevent JIT spraying attacks which inject native opcodes as immediate values. This is problematic as these immediate values reside in executable kernel memory, therefore a jump that could be triggered from some kernel bug would jump to the start of the immediate value and then execute these as native instructions.

JIT constant blinding prevents this by randomizing the actual instruction: the operation is transformed from an immediate-based source operand to a register-based one by rewriting the instruction and splitting the load of the value into two steps: 1) load of a blinded immediate value rnd ^ imm into a register, 2) xoring that register with rnd such that the original imm immediate then resides in the register and can be used for the actual operation. The example describes a load operation, but in fact all generic operations are blinded.

Example of JITing a program with hardening disabled:

# echo 0 > /proc/sys/net/core/bpf_jit_harden

  ffffffffa034f5e9 + <x>:
  [...]
  39:   mov    $0xa8909090,%eax
  3e:   mov    $0xa8909090,%eax
  43:   mov    $0xa8ff3148,%eax
  48:   mov    $0xa89081b4,%eax
  4d:   mov    $0xa8900bb0,%eax
  52:   mov    $0xa810e0c1,%eax
  57:   mov    $0xa8908eb4,%eax
  5c:   mov    $0xa89020b0,%eax
  [...]

The same program gets constant blinded when loaded through BPF as an unprivileged user in the case hardening is enabled:

# echo 1 > /proc/sys/net/core/bpf_jit_harden

  ffffffffa034f1e5 + <x>:
  [...]
  39:   mov    $0xe1192563,%r10d
  3f:   xor    $0x4989b5f3,%r10d
  46:   mov    %r10d,%eax
  49:   mov    $0xb8296d93,%r10d
  4f:   xor    $0x10b9fd03,%r10d
  56:   mov    %r10d,%eax
  59:   mov    $0x8c381146,%r10d
  5f:   xor    $0x24c7200e,%r10d
  66:   mov    %r10d,%eax
  69:   mov    $0xeb2a830e,%r10d
  6f:   xor    $0x43ba02ba,%r10d
  76:   mov    %r10d,%eax
  79:   mov    $0xd9730af,%r10d
  7f:   xor    $0xa5073b1f,%r10d
  86:   mov    %r10d,%eax
  89:   mov    $0x9a45662b,%r10d
  8f:   xor    $0x325586ea,%r10d
  96:   mov    %r10d,%eax
  [...]

Both programs are semantically the same, only that none of the original immediate values are visible anymore in the disassembly of the second program.

At the same time, hardening also disables any JIT kallsyms exposure for privileged users, so JIT image addresses are no longer exposed via /proc/kallsyms.

Moreover, the Linux kernel provides the option CONFIG_BPF_JIT_ALWAYS_ON which removes the entire BPF interpreter from the kernel and permanently enables the JIT compiler. This has been developed as part of a mitigation in the context of Spectre v2 such that when used in a VM-based setting, the guest kernel is not going to reuse the host kernel’s BPF interpreter when mounting an attack anymore. For container-based environments, the CONFIG_BPF_JIT_ALWAYS_ON configuration option is optional, but in case JITs are enabled there anyway, the interpreter may as well be compiled out to reduce the kernel’s complexity. Thus, it is also generally recommended for widely used JITs in case of mainstream architectures such as x86_64 and arm64.

Last but not least, the kernel offers an option to disable the use of the bpf(2) system call for unprivileged users through the /proc/sys/kernel/unprivileged_bpf_disabled sysctl knob. This is deliberately a one-time kill switch, meaning that once set to 1, there is no option to reset it back to 0 until the next reboot. When set, only CAP_SYS_ADMIN privileged processes out of the initial namespace are allowed to use the bpf(2) system call from that point onwards. Upon start, Cilium sets this knob to 1 as well.

# echo 1 > /proc/sys/kernel/unprivileged_bpf_disabled

Offloads

_images/bpf_offload.png

Networking BPF programs, in particular for tc and XDP, have an offload interface to hardware in the kernel in order to execute BPF code directly on the NIC.

Currently, the nfp driver from Netronome has support for offloading BPF through a JIT compiler which translates BPF instructions to an instruction set implemented against the NIC. This includes offloading of BPF maps to the NIC as well, thus the offloaded BPF program can perform map lookups, updates and deletions.

Toolchain

Current user space tooling, introspection facilities and kernel control knobs around BPF are discussed in this section. Note, the tooling and infrastructure around BPF is still rapidly evolving and thus may not provide a complete picture of all available tools.

Development Environment

A step by step guide for setting up a development environment for BPF can be found below for both Fedora and Ubuntu. This will guide you through building, installing and testing a development kernel as well as building and installing iproute2.

The step of manually building iproute2 and Linux kernel is usually not necessary given that major distributions already ship recent enough kernels by default, but would be needed for testing bleeding edge versions or contributing BPF patches to iproute2 and to the Linux kernel, respectively. Similarly, for debugging and introspection purposes building bpftool is optional, but recommended.

Fedora

The following applies to Fedora 25 or later:

$ sudo dnf install -y git gcc ncurses-devel elfutils-libelf-devel bc \
  openssl-devel libcap-devel clang llvm graphviz bison flex glibc-static

Note

If you are running some other Fedora derivative and dnf is missing, try using yum instead.

Ubuntu

The following applies to Ubuntu 17.04 or later:

$ sudo apt-get install -y make gcc libssl-dev bc libelf-dev libcap-dev \
  clang gcc-multilib llvm libncurses5-dev git pkg-config libmnl-dev bison flex \
  graphviz

openSUSE Tumbleweed

The following applies to openSUSE Tumbleweed and openSUSE Leap 15.0 or later:

$ sudo zypper install -y git gcc ncurses-devel libelf-devel bc libopenssl-devel \
libcap-devel clang llvm graphviz bison flex glibc-devel-static

Compiling the Kernel

Development of new BPF features for the Linux kernel happens inside the net-next git tree, while the latest BPF fixes land in the net tree. The following command will obtain the kernel source for the net-next tree through git:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git

If the git commit history is not of interest, then --depth 1 will clone the tree much faster by truncating the git history only to the most recent commit.

In case the net tree is of interest, it can be cloned from this url:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git

There are dozens of tutorials on the Internet on how to build Linux kernels; one good resource is the Kernel Newbies website (https://kernelnewbies.org/KernelBuild), which can be followed with one of the two git trees mentioned above.

Make sure that the generated .config file contains the following CONFIG_* entries for running BPF. These entries are also needed for Cilium.

CONFIG_CGROUP_BPF=y
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_CLS_ACT=y
CONFIG_BPF_JIT=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_TEST_BPF=m

Some of the entries cannot be adjusted through make menuconfig. For example, CONFIG_HAVE_EBPF_JIT is selected automatically if a given architecture does come with an eBPF JIT. In this specific case, CONFIG_HAVE_EBPF_JIT is optional but highly recommended. An architecture not having an eBPF JIT compiler will need to fall back to the in-kernel interpreter with the cost of being less efficient executing BPF instructions.

Verifying the Setup

After you have booted into the newly compiled kernel, navigate to the BPF selftest suite in order to test BPF functionality (current working directory points to the root of the cloned git tree):

$ cd tools/testing/selftests/bpf/
$ make
$ sudo ./test_verifier

The verifier tests print out all the current checks being performed. The summary at the end of running all tests will dump information of test successes and failures:

Summary: 847 PASSED, 0 SKIPPED, 0 FAILED

Note

For kernel releases 4.16+ the BPF selftest has a dependency on LLVM 6.0+ caused by the BPF function calls which do not need to be inlined anymore. See section BPF to BPF Calls or the cover letter mail from the kernel patch (https://lwn.net/Articles/741773/) for more information. Not every BPF program has a dependency on LLVM 6.0+ if it does not use this new feature. If your distribution does not provide LLVM 6.0+ you may compile it by following the instructions in the LLVM section.

In order to run through all BPF selftests, the following command is needed:

$ sudo make run_tests

If you see any failures, please contact us on Slack with the full test output.

Compiling iproute2

Similar to the net (fixes only) and net-next (new features) kernel trees, the iproute2 git tree has two branches, namely master and net-next. The master branch is based on the net tree and the net-next branch is based against the net-next kernel tree. This is necessary, so that changes in header files can be synchronized in the iproute2 tree.

In order to clone the iproute2 master branch, the following command can be used:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/iproute2/iproute2.git

Similarly, to clone into mentioned net-next branch of iproute2, run the following:

$ git clone -b net-next git://git.kernel.org/pub/scm/linux/kernel/git/iproute2/iproute2.git

After that, proceed with the build and installation:

$ cd iproute2/
$ ./configure --prefix=/usr
TC schedulers
 ATM    no

libc has setns: yes
SELinux support: yes
ELF support: yes
libmnl support: no
Berkeley DB: no

docs: latex: no
 WARNING: no docs can be built from LaTeX files
 sgml2html: no
 WARNING: no HTML docs can be built from SGML
$ make
[...]
$ sudo make install

Ensure that the configure script shows ELF support: yes, so that iproute2 can process ELF files from LLVM’s BPF back end. libelf was listed in the instructions for installing the dependencies in case of Fedora and Ubuntu earlier.

Compiling bpftool

bpftool is an essential tool around debugging and introspection of BPF programs and maps. It is part of the kernel tree and available under tools/bpf/bpftool/.

Make sure to have cloned either the net or net-next kernel tree as described earlier. In order to build and install bpftool, the following steps are required:

$ cd <kernel-tree>/tools/bpf/bpftool/
$ make
Auto-detecting system features:
...                        libbfd: [ on  ]
...        disassembler-four-args: [ OFF ]

  CC       xlated_dumper.o
  CC       prog.o
  CC       common.o
  CC       cgroup.o
  CC       main.o
  CC       json_writer.o
  CC       cfg.o
  CC       map.o
  CC       jit_disasm.o
  CC       disasm.o
make[1]: Entering directory '/home/foo/trees/net/tools/lib/bpf'

Auto-detecting system features:
...                        libelf: [ on  ]
...                           bpf: [ on  ]

  CC       libbpf.o
  CC       bpf.o
  CC       nlattr.o
  LD       libbpf-in.o
  LINK     libbpf.a
make[1]: Leaving directory '/home/foo/trees/bpf/tools/lib/bpf'
  LINK     bpftool
$ sudo make install

LLVM

LLVM is currently the only compiler suite providing a BPF back end. gcc does not support BPF at this point.

The BPF back end was merged into LLVM’s 3.7 release. Major distributions enable the BPF back end by default when they package LLVM, therefore installing clang and llvm is sufficient on most recent distributions to start compiling C into BPF object files.

The typical workflow is that BPF programs are written in C, compiled by LLVM into object / ELF files, which are parsed by user space BPF ELF loaders (such as iproute2 or others), and pushed into the kernel through the BPF system call. The kernel verifies the BPF instructions and JITs them, returning a new file descriptor for the program, which then can be attached to a subsystem (e.g. networking). If supported, the subsystem could then further offload the BPF program to hardware (e.g. NIC).

For LLVM, BPF target support can be checked, for example, through the following:

$ llc --version
LLVM (http://llvm.org/):
LLVM version 3.8.1
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake

Registered Targets:
  [...]
  bpf        - BPF (host endian)
  bpfeb      - BPF (big endian)
  bpfel      - BPF (little endian)
  [...]

By default, the bpf target uses the endianness of the CPU it compiles on, meaning that if the CPU’s endianness is little endian, the program is represented in little endian format as well, and if the CPU’s endianness is big endian, the program is represented in big endian. This also matches the runtime behavior of BPF, which is generic and uses the endianness of the CPU it runs on in order not to disadvantage architectures in either of the formats.

For cross-compilation, the two targets bpfeb and bpfel were introduced, thanks to that BPF programs can be compiled on a node running in one endianness (e.g. little endian on x86) and run on a node in another endianness format (e.g. big endian on arm). Note that the front end (clang) needs to run in the target endianness as well.

Using bpf as a target is the preferred way in situations where no mixture of endianness applies. For example, compilation on x86_64 results in the same output for the targets bpf and bpfel due to being little endian, therefore scripts triggering a compilation also do not have to be endian aware.

A minimal, stand-alone XDP drop program might look like the following example (xdp-example.c):

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_drop(struct xdp_md *ctx)
{
    return XDP_DROP;
}

char __license[] __section("license") = "GPL";

It can then be compiled and loaded into the kernel as follows:

$ clang -O2 -Wall -target bpf -c xdp-example.c -o xdp-example.o
# ip link set dev em1 xdp obj xdp-example.o

Note

Attaching an XDP BPF program to a network device as above requires Linux 4.11 with a device that supports XDP, or Linux 4.12 or later.

For the generated object file LLVM (>= 3.9) uses the official BPF machine value, that is, EM_BPF (decimal: 247 / hex: 0xf7). In this example, the program has been compiled with bpf target under x86_64, therefore LSB (as opposed to MSB) is shown regarding endianness:

$ file xdp-example.o
xdp-example.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped

readelf -a xdp-example.o will dump further information about the ELF file, which can sometimes be useful for introspecting generated section headers, relocation entries and the symbol table.

In the unlikely case where clang and LLVM need to be compiled from scratch, the following commands can be used:

$ git clone https://git.llvm.org/git/llvm.git
$ cd llvm/tools
$ git clone --depth 1 https://git.llvm.org/git/clang.git
$ cd ..; mkdir build; cd build
$ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_RUNTIME=OFF
$ make -j $(getconf _NPROCESSORS_ONLN)

$ ./bin/llc --version
LLVM (http://llvm.org/):
LLVM version x.y.zsvn
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake

Registered Targets:
  bpf    - BPF (host endian)
  bpfeb  - BPF (big endian)
  bpfel  - BPF (little endian)
  x86    - 32-bit X86: Pentium-Pro and above
  x86-64 - 64-bit X86: EM64T and AMD64

$ export PATH=$PWD/bin:$PATH   # add to ~/.bashrc

Make sure that --version mentions Optimized build., otherwise the compilation time for programs when having LLVM in debugging mode will significantly increase (e.g. by 10x or more).

For debugging, clang can generate the assembler output as follows:

$ clang -O2 -S -Wall -target bpf -c xdp-example.c -o xdp-example.S
$ cat xdp-example.S
    .text
    .section    prog,"ax",@progbits
    .globl      xdp_drop
    .p2align    3
xdp_drop:                             # @xdp_drop
# BB#0:
    r0 = 1
    exit

    .section    license,"aw",@progbits
    .globl    __license               # @__license
__license:
    .asciz    "GPL"

Starting from LLVM’s release 6.0, there is also assembler parser support. You can program using BPF assembler directly, then use llvm-mc to assemble it into an object file. For example, you can assemble the xdp-example.S listed above back into object file using:

$ llvm-mc -triple bpf -filetype=obj -o xdp-example.o xdp-example.S

Furthermore, more recent LLVM versions (>= 4.0) can also store debugging information in dwarf format into the object file. This can be done through the usual workflow by adding -g for compilation.

$ clang -O2 -g -Wall -target bpf -c xdp-example.c -o xdp-example.o
$ llvm-objdump -S -no-show-raw-insn xdp-example.o

xdp-example.o:        file format ELF64-BPF

Disassembly of section prog:
xdp_drop:
; {
    0:        r0 = 1
; return XDP_DROP;
    1:        exit

The llvm-objdump tool can then annotate the assembler output with the original C code used in the compilation. The trivial example in this case does not contain much C code, however, the line numbers shown as 0: and 1: correspond directly to the kernel’s verifier log.

This means that in case BPF programs get rejected by the verifier, llvm-objdump can help to correlate the instructions back to the original C code, which is highly useful for analysis.

# ip link set dev em1 xdp obj xdp-example.o verb

Prog section 'prog' loaded (5)!
 - Type:         6
 - Instructions: 2 (0 over limit)
 - License:      GPL

Verifier analysis:

0: (b7) r0 = 1
1: (95) exit
processed 2 insns

As can be seen in the verifier analysis, llvm-objdump dumps the same BPF assembler code as the kernel.

Leaving out the -no-show-raw-insn option will also dump the raw struct bpf_insn as hex in front of the assembly:

$ llvm-objdump -S xdp-example.o

xdp-example.o:        file format ELF64-BPF

Disassembly of section prog:
xdp_drop:
; {
   0:       b7 00 00 00 01 00 00 00     r0 = 1
; return foo();
   1:       95 00 00 00 00 00 00 00     exit

For LLVM IR debugging, the compilation process for BPF can be split into two steps, generating a binary LLVM IR intermediate file xdp-example.bc, which can later on be passed to llc:

$ clang -O2 -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
$ llc xdp-example.bc -march=bpf -filetype=obj -o xdp-example.o

The generated LLVM IR can also be dumped in human readable format through:

$ clang -O2 -Wall -emit-llvm -S -c xdp-example.c -o -

LLVM is able to attach debug information such as the description of used data types in the program to the generated BPF object file. By default this is in DWARF format.

A heavily simplified version used by BPF is called BTF (BPF Type Format). The resulting DWARF can be converted into BTF and is later on loaded into the kernel through BPF object loaders. The kernel will then verify the BTF data for correctness and keep track of the data types the BTF data contains.

BPF maps can then be annotated with key and value types out of the BTF data such that a later dump of the map exports the map data along with the related type information. This allows for better introspection, debugging and value pretty printing. Note that BTF data is a generic debugging data format and as such any DWARF to BTF converted data can be loaded (e.g. the kernel’s vmlinux DWARF data could be converted to BTF and loaded). The latter is particularly useful for BPF tracing in the future.

In order to generate BTF from DWARF debugging information, elfutils (>= 0.173) is needed. If that is not available, then adding the -mattr=dwarfris option to the llc command is required during compilation:

$ llc -march=bpf -mattr=help |& grep dwarfris
  dwarfris - Disable MCAsmInfo DwarfUsesRelocationsAcrossSections.
  [...]

The reason for using -mattr=dwarfris is that the dwarfris flag (dwarf relocation in section) disables DWARF cross-section relocations between DWARF and the ELF’s symbol table, since libdw does not have proper BPF relocation support; without it, tools like pahole would not be able to properly dump structures from the object.

elfutils (>= 0.173) implements proper BPF relocation support and therefore the same can be achieved without the -mattr=dwarfris option. Dumping the structures from the object file could be done from either DWARF or BTF information. pahole uses the LLVM emitted DWARF information at this point, however, future pahole versions could rely on BTF if available.

For converting DWARF into BTF, a recent pahole version (>= 1.12) is required. A recent pahole version can also be obtained from its official git repository if not available from one of the distribution packages:

$ git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git

pahole comes with the option -J to convert DWARF into BTF from an object file. pahole can be probed for BTF support as follows (note that the llvm-objcopy tool is required for pahole as well, so check its presence, too):

$ pahole --help | grep BTF
-J, --btf_encode           Encode as BTF

Generating debugging information also requires the front end to generate source level debug information by passing -g to the clang command line. Note that -g is needed independently of whether llc’s dwarfris option is used. Full example for generating the object file:

$ clang -O2 -g -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
$ llc xdp-example.bc -march=bpf -mattr=dwarfris -filetype=obj -o xdp-example.o

Alternatively, by using clang only to build a BPF program with debugging information (again, the dwarfris flag can be omitted when having proper elfutils version):

$ clang -target bpf -O2 -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o

After successful compilation pahole can be used to properly dump structures of the BPF program based on the DWARF information:

$ pahole xdp-example.o
struct xdp_md {
        __u32                      data;                 /*     0     4 */
        __u32                      data_end;             /*     4     4 */
        __u32                      data_meta;            /*     8     4 */

        /* size: 12, cachelines: 1, members: 3 */
        /* last cacheline: 12 bytes */
};

Through the option -J pahole can eventually generate the BTF from DWARF. In the object file DWARF data will still be retained alongside the newly added BTF data. Full clang and pahole example combined:

$ clang -target bpf -O2 -Wall -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o
$ pahole -J xdp-example.o

The presence of a .BTF section can be seen through readelf tool:

$ readelf -a xdp-example.o
[...]
  [18] .BTF              PROGBITS         0000000000000000  00000671
[...]

BPF loaders such as iproute2 will detect and load the BTF section, so that BPF maps can be annotated with type information.

LLVM by default uses the BPF base instruction set for generating code in order to make sure that the generated object file can also be loaded with older kernels such as long-term stable kernels (e.g. 4.9+).

However, LLVM has a -mcpu selector for the BPF back end in order to select different versions of the BPF instruction set, namely instruction set extensions on top of the BPF base instruction set in order to generate more efficient and smaller code.

Available -mcpu options can be queried through:

$ llc -march bpf -mcpu=help
Available CPUs for this target:

  generic - Select the generic processor.
  probe   - Select the probe processor.
  v1      - Select the v1 processor.
  v2      - Select the v2 processor.
[...]

The generic processor is the default processor, which is also the base instruction set v1 of BPF. Options v1 and v2 are typically useful in an environment where the BPF program is being cross compiled and the target host where the program is loaded differs from the one where it is compiled (and thus available BPF kernel features might differ as well).

The recommended -mcpu option which is also used by Cilium internally is -mcpu=probe! Here, the LLVM BPF back end queries the kernel for availability of BPF instruction set extensions and when found available, LLVM will use them for compiling the BPF program whenever appropriate.

A full command line example with llc’s -mcpu=probe:

$ clang -O2 -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
$ llc xdp-example.bc -march=bpf -mcpu=probe -filetype=obj -o xdp-example.o

Generally, LLVM IR generation is architecture independent. There are however a few differences when using clang -target bpf versus leaving -target bpf out and thus using clang’s default target which, depending on the underlying architecture, might be x86_64, arm64 or others.

Quoting from the kernel’s Documentation/bpf/bpf_devel_QA.txt:

  • BPF programs may recursively include header file(s) with file scope inline assembly codes. The default target can handle this well, while bpf target may fail if bpf backend assembler does not understand these assembly codes, which is true in most cases.
  • When compiled without -g, additional elf sections, e.g., .eh_frame and .rela.eh_frame, may be present in the object file with default target, but not with bpf target.
  • The default target may turn a C switch statement into a switch table lookup and jump operation. Since the switch table is placed in the global read-only section, the bpf program will fail to load. The bpf target does not support switch table optimization. The clang option -fno-jump-tables can be used to disable switch table generation.
  • For clang -target bpf, it is guaranteed that pointer or long / unsigned long types will always have a width of 64 bit, no matter whether underlying clang binary or default target (or kernel) is 32 bit. However, when native clang target is used, then it will compile these types based on the underlying architecture’s conventions, meaning in case of 32 bit architecture, pointer or long / unsigned long types e.g. in BPF context structure will have width of 32 bit while the BPF LLVM back end still operates in 64 bit.

The native target is mostly needed in tracing for the case of walking the kernel’s struct pt_regs that maps CPU registers, or other kernel structures where CPU’s register width matters. In all other cases such as networking, the use of clang -target bpf is the preferred choice.

Also, LLVM started to support 32-bit subregisters and BPF ALU32 instructions since LLVM’s release 7.0. A new code generation attribute alu32 is added. When it is enabled, LLVM will try to use 32-bit subregisters whenever possible, typically when there are operations on 32-bit types. The associated ALU instructions with 32-bit subregisters will become ALU32 instructions. For example, for the following sample code:

$ cat 32-bit-example.c
    void cal(unsigned int *a, unsigned int *b, unsigned int *c)
    {
      unsigned int sum = *a + *b;
      *c = sum;
    }

With default code generation, the assembly will look like:

$ clang -target bpf -emit-llvm -S 32-bit-example.c
$ llc -march=bpf 32-bit-example.ll
$ cat 32-bit-example.s
    cal:
      r1 = *(u32 *)(r1 + 0)
      r2 = *(u32 *)(r2 + 0)
      r2 += r1
      *(u32 *)(r3 + 0) = r2
      exit

64-bit registers are used, hence the addition means 64-bit addition. Now, if you enable the new 32-bit subregisters support by specifying -mattr=+alu32, then the assembly will look like:

$ llc -march=bpf -mattr=+alu32 32-bit-example.ll
$ cat 32-bit-example.s
    cal:
      w1 = *(u32 *)(r1 + 0)
      w2 = *(u32 *)(r2 + 0)
      w2 += w1
      *(u32 *)(r3 + 0) = w2
      exit

The w registers, meaning 32-bit subregisters, are used instead of the 64-bit r registers.

Enabling 32-bit subregisters might help reduce type extension instruction sequences. It could also help the kernel eBPF JIT compilers for 32-bit architectures, which use register pairs to model the 64-bit eBPF registers and need extra instructions to manipulate the high 32 bits. Given that a read from a 32-bit subregister is guaranteed to read from the low 32 bits only, even though a write still needs to clear the high 32 bits, the JIT compiler can eliminate the instructions for setting the high 32 bits of the destination whenever it knows that the definition of a register only has subregister reads.

When writing C programs for BPF, there are a couple of pitfalls to be aware of, compared to usual application development with C. The following items describe some of the differences for the BPF model:

  1. Everything needs to be inlined; there are no function calls (on older LLVM versions) or shared library calls available.

    Shared libraries, etc. cannot be used with BPF. However, common library code used in BPF programs can be placed into header files and included in the main programs; for example, Cilium makes heavy use of this (see bpf/lib/). It is also still possible to include header files, for example, from the kernel or other libraries and reuse their static inline functions or macros / definitions.

    Unless a recent kernel (4.16+) and LLVM (6.0+) are used, where BPF to BPF function calls are supported, LLVM needs to compile and inline the entire code into a flat sequence of BPF instructions for a given program section. In that case, best practice is to use an annotation like __inline for every library function as shown below. The use of always_inline is recommended, since the compiler could otherwise still decide to uninline large functions that are only annotated as inline.

    In case the latter happens, LLVM will generate a relocation entry into the ELF file, which BPF ELF loaders such as iproute2 cannot resolve and will thus produce an error since only BPF maps are valid relocation entries which loaders can process.

    #include <linux/bpf.h>
    
    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif
    
    #ifndef __inline
    # define __inline                         \
       inline __attribute__((always_inline))
    #endif
    
    static __inline int foo(void)
    {
        return XDP_DROP;
    }
    
    __section("prog")
    int xdp_drop(struct xdp_md *ctx)
    {
        return foo();
    }
    
    char __license[] __section("license") = "GPL";
    
  2. Multiple programs can reside inside a single C file in different sections.

    C programs for BPF make heavy use of section annotations. A C file is typically structured into 3 or more sections. BPF ELF loaders use these names to extract and prepare the relevant information in order to load the programs and maps through the bpf system call. For example, iproute2 uses maps and license as default section names to find the metadata needed for map creation and the license for the BPF program, respectively. At program creation time, the latter is pushed into the kernel as well, and enables some of the helper functions which are exposed GPL-only in case the program also holds a GPL-compatible license, for example bpf_ktime_get_ns(), bpf_probe_read() and others.

    The remaining section names are specific to the BPF program code; for example, the code below has been modified to contain two program sections, ingress and egress. The toy example code demonstrates that both can share a map and common static inline helpers such as the account_data() function.

    The xdp-example.c example has been modified to a tc-example.c example that can be loaded with tc and attached to a netdevice’s ingress and egress hook. It accounts the transferred bytes into a map called acc_map, which has two map slots, one for traffic accounted on the ingress hook, one on the egress hook.

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <stdint.h>
    #include <iproute2/bpf_elf.h>
    
    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif
    
    #ifndef __inline
    # define __inline                         \
       inline __attribute__((always_inline))
    #endif
    
    #ifndef lock_xadd
    # define lock_xadd(ptr, val)              \
       ((void)__sync_fetch_and_add(ptr, val))
    #endif
    
    #ifndef BPF_FUNC
    # define BPF_FUNC(NAME, ...)              \
       (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
    #endif
    
    static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);
    
    struct bpf_elf_map acc_map __section("maps") = {
        .type           = BPF_MAP_TYPE_ARRAY,
        .size_key       = sizeof(uint32_t),
        .size_value     = sizeof(uint32_t),
        .pinning        = PIN_GLOBAL_NS,
        .max_elem       = 2,
    };
    
    static __inline int account_data(struct __sk_buff *skb, uint32_t dir)
    {
        uint32_t *bytes;
    
        bytes = map_lookup_elem(&acc_map, &dir);
        if (bytes)
                lock_xadd(bytes, skb->len);
    
        return TC_ACT_OK;
    }
    
    __section("ingress")
    int tc_ingress(struct __sk_buff *skb)
    {
        return account_data(skb, 0);
    }
    
    __section("egress")
    int tc_egress(struct __sk_buff *skb)
    {
        return account_data(skb, 1);
    }
    
    char __license[] __section("license") = "GPL";
    

The example also demonstrates a couple of other things which are useful to be aware of when developing programs. The code includes kernel headers, standard C headers and an iproute2 specific header containing the definition of struct bpf_elf_map. iproute2 has a common BPF ELF loader and as such the definition of struct bpf_elf_map is the very same for XDP and tc typed programs.

A struct bpf_elf_map entry defines a map in the program and contains all relevant information (such as key / value size, etc) needed to generate a map which is used from the two BPF programs. The structure must be placed into the maps section, so that the loader can find it. There can be multiple map declarations of this type with different variable names, but all must be annotated with __section("maps").

The struct bpf_elf_map is specific to iproute2. Different BPF ELF loaders can have different formats, for example, the libbpf in the kernel source tree, which is mainly used by perf, has a different specification. iproute2 guarantees backwards compatibility for struct bpf_elf_map. Cilium follows the iproute2 model.
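
For comparison, a sketch of the map definition style used by the kernel samples and selftests is shown below. This is only illustrative: the exact struct bpf_map_def layout differs between kernel trees, so only the commonly used fields are listed, and it is not interchangeable with iproute2’s struct bpf_elf_map (there are, for example, no id or pinning members).

#include <linux/bpf.h>
#include <stdint.h>

/* Illustrative common subset of the bpf_map_def layout used by the
 * kernel samples / selftests loaders; real definitions may carry
 * additional fields depending on the tree.
 */
struct bpf_map_def {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
    unsigned int map_flags;
};

/* Same accounting map as before, expressed in the bpf_map_def style. */
struct bpf_map_def acc_map __attribute__((section("maps"), used)) = {
    .type        = BPF_MAP_TYPE_ARRAY,
    .key_size    = sizeof(uint32_t),
    .value_size  = sizeof(uint32_t),
    .max_entries = 2,
};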

The example also demonstrates how BPF helper functions are mapped into the C code and being used. Here, map_lookup_elem() is defined by mapping this function into the BPF_FUNC_map_lookup_elem enum value which is exposed as a helper in uapi/linux/bpf.h. When the program is later loaded into the kernel, the verifier checks whether the passed arguments are of the expected type and re-points the helper call into a real function call. Moreover, map_lookup_elem() also demonstrates how maps can be passed to BPF helper functions. Here, &acc_map from the maps section is passed as the first argument to map_lookup_elem().

Since the defined array map is global, the accounting needs to use an atomic operation, which is defined as lock_xadd(). LLVM maps __sync_fetch_and_add() as a built-in function to the BPF atomic add instruction, that is, BPF_STX | BPF_XADD | BPF_W for word sizes.

Last but not least, the struct bpf_elf_map tells that the map is to be pinned as PIN_GLOBAL_NS. This means that tc will pin the map into the BPF pseudo file system as a node. By default, it will be pinned to /sys/fs/bpf/tc/globals/acc_map for the given example. Due to the PIN_GLOBAL_NS, the map will be placed under /sys/fs/bpf/tc/globals/. globals acts as a global namespace that spans across object files. If the example used PIN_OBJECT_NS, then tc would create a directory that is local to the object file. For example, different C files with BPF code could have the same acc_map definition as above with a PIN_GLOBAL_NS pinning. In that case, the map will be shared among BPF programs originating from various object files. PIN_NONE would mean that the map is not placed into the BPF file system as a node, and as a result will not be accessible from user space after tc quits. It would also mean that tc creates two separate map instances for each program, since it cannot retrieve a previously pinned map under that name. The acc_map part from the mentioned path is the name of the map as specified in the source code.

Thus, upon loading the ingress program, tc will find that no such map exists in the BPF file system and will create a new one. On success, the map will also be pinned, so that when the egress program is loaded through tc, it will find that such a map already exists in the BPF file system and will reuse it for the egress program. The loader also makes sure that, in case maps exist with the same name, their properties (key / value size, etc) match.

Just like tc can retrieve the same map, third party applications can also use the BPF_OBJ_GET command of the bpf system call in order to create a new file descriptor pointing to the same map instance, which can then be used to look up, update or delete map elements (a sketch of such a reader is shown after the loading example below).

The code can be compiled and loaded via iproute2 as follows:

$ clang -O2 -Wall -target bpf -c tc-example.c -o tc-example.o

# tc qdisc add dev em1 clsact
# tc filter add dev em1 ingress bpf da obj tc-example.o sec ingress
# tc filter add dev em1 egress bpf da obj tc-example.o sec egress

# tc filter show dev em1 ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[ingress] direct-action id 1 tag c5f7825e5dac396f

# tc filter show dev em1 egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[egress] direct-action id 2 tag b2fd5adc0f262714

# mount | grep bpf
sysfs on /sys/fs/bpf type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
bpf on /sys/fs/bpf type bpf (rw,relatime,mode=0700)

# tree /sys/fs/bpf/
/sys/fs/bpf/
+-- ip -> /sys/fs/bpf/tc/
+-- tc
|   +-- globals
|       +-- acc_map
+-- xdp -> /sys/fs/bpf/tc/

4 directories, 1 file

As soon as packets pass the em1 device, counters from the BPF map will be increased.
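
As a sketch of how a third party application could read these counters through the BPF_OBJ_GET approach mentioned above, the following minimal user space program (purely illustrative, not part of iproute2 or Cilium, and with error handling kept to a minimum) opens the pinned map and dumps both slots:

#include <linux/bpf.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
    const char *path = "/sys/fs/bpf/tc/globals/acc_map";
    union bpf_attr attr;
    uint32_t key, value;
    int fd;

    /* BPF_OBJ_GET returns a new fd referencing the pinned map. */
    memset(&attr, 0, sizeof(attr));
    attr.pathname = (uint64_t)(unsigned long)path;
    fd = sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
    if (fd < 0)
        return 1;

    /* Slot 0: ingress bytes, slot 1: egress bytes. */
    for (key = 0; key < 2; key++) {
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key    = (uint64_t)(unsigned long)&key;
        attr.value  = (uint64_t)(unsigned long)&value;
        if (sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)) == 0)
            printf("slot %u: %u bytes\n", key, value);
    }
    return 0;
}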

  3. There are no global variables allowed.

For the reasons already mentioned in point 1, BPF cannot have global variables as often used in normal C programs.

However, there is a work-around in that the program can simply use a BPF map of type BPF_MAP_TYPE_PERCPU_ARRAY with just a single slot of arbitrary value size. This works, because during execution, BPF programs are guaranteed to never get preempted by the kernel and therefore can use the single map entry as a scratch buffer for temporary data, for example, to extend beyond the stack limitation. This also functions across tail calls, since it has the same guarantees with regards to preemption.

Otherwise, for holding state across multiple BPF program runs, normal BPF maps can be used.
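
A minimal sketch of this work-around (illustrative only; the program and map names are made up) could look as follows, where a single per-CPU slot serves as a scratch buffer larger than what the BPF stack could hold:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <stdint.h>
#include <iproute2/bpf_elf.h>

#ifndef __section
# define __section(NAME)  __attribute__((section(NAME), used))
#endif

#ifndef BPF_FUNC
# define BPF_FUNC(NAME, ...)  (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
#endif

static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);

/* Scratch area larger than the 512 byte BPF stack limit. */
struct scratch {
    uint8_t buf[1024];
};

/* Single-slot per-CPU array used instead of a global variable. Safe
 * because BPF programs are not preempted while running.
 */
struct bpf_elf_map scratch_map __section("maps") = {
    .type           = BPF_MAP_TYPE_PERCPU_ARRAY,
    .size_key       = sizeof(uint32_t),
    .size_value     = sizeof(struct scratch),
    .pinning        = PIN_GLOBAL_NS,
    .max_elem       = 1,
};

__section("ingress")
int tc_scratch(struct __sk_buff *skb)
{
    uint32_t key = 0;
    struct scratch *s = map_lookup_elem(&scratch_map, &key);

    if (!s)
        return TC_ACT_OK;
    /* ... use s->buf as temporary storage for this program run ... */
    return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";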

  4. There are no const strings or arrays allowed.

Defining const strings or other arrays in the BPF C program does not work for the same reasons as pointed out in points 1 and 3: relocation entries would be generated in the ELF file which loaders reject since they are not part of the ABI towards loaders (loaders also cannot fix up such entries as it would require large rewrites of the already compiled BPF sequence).

In the future, LLVM might detect these occurrences and throw an error to the user early.

Helper functions such as trace_printk() can be worked around as follows:

static void BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...);

#ifndef printk
# define printk(fmt, ...)                                      \
    ({                                                         \
        char ____fmt[] = fmt;                                  \
        trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \
    })
#endif

The program can then use the macro naturally like printk("skb len:%u\n", skb->len);. The output will then be written to the trace pipe. tc exec bpf dbg can be used to retrieve the messages from there.

The use of the trace_printk() helper function has a couple of disadvantages and is thus not recommended for production usage. Constant strings like "skb len:%u\n" need to be loaded into the BPF stack each time the helper function is called, and BPF helper functions are also limited to a maximum of 5 arguments. This leaves room for only 3 additional variables which can be passed for dumping.

Therefore, despite being helpful for quick debugging, it is recommended (for networking programs) to use the skb_event_output() or the xdp_event_output() helper, respectively. They allow for passing custom structs from the BPF program to the perf event ring buffer along with an optional packet sample. For example, Cilium’s monitor makes use of these helpers in order to implement a debugging framework, notifications for network policy violations, etc. These helpers pass the data through a lockless memory mapped per-CPU perf ring buffer, and are thus significantly faster than trace_printk().
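
The following small sketch illustrates the idea for a tc program. It is not taken from Cilium; the skb_event_output() and xdp_event_output() helpers are exposed through the generic BPF_FUNC_perf_event_output uapi helper ID, which is what the declaration below uses, and the map and struct names are made up for the example:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <stdint.h>
#include <iproute2/bpf_elf.h>

#ifndef __section
# define __section(NAME)  __attribute__((section(NAME), used))
#endif

#ifndef BPF_FUNC
# define BPF_FUNC(NAME, ...)  (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
#endif

static int BPF_FUNC(perf_event_output, struct __sk_buff *skb, void *map,
                    uint64_t flags, void *data, uint64_t size);

/* Perf event ring buffer, one slot per possible CPU. */
struct bpf_elf_map event_map __section("maps") = {
    .type           = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
    .size_key       = sizeof(uint32_t),
    .size_value     = sizeof(uint32_t),
    .pinning        = PIN_GLOBAL_NS,
    .max_elem       = 64,   /* should cover the number of possible CPUs */
};

/* Custom notification passed to the user space consumer. */
struct event {
    uint32_t len;
    uint32_t mark;
};

__section("ingress")
int tc_notify(struct __sk_buff *skb)
{
    struct event ev = {
        .len  = skb->len,
        .mark = skb->mark,
    };

    perf_event_output(skb, &event_map, BPF_F_CURRENT_CPU, &ev, sizeof(ev));
    return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";

A user space daemon would then map the per-CPU ring buffers of event_map and consume the pushed records, similar in spirit to what Cilium’s monitor does with its own event structures.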

  5. Use of LLVM built-in functions for memset()/memcpy()/memmove()/memcmp().

Since BPF programs cannot perform any function calls other than those to BPF helpers, common library code needs to be implemented as inline functions. In addition, LLVM also provides some built-ins that the programs can use for constant sizes (here: n) which will then always get inlined:

#ifndef memset
# define memset(dest, chr, n)   __builtin_memset((dest), (chr), (n))
#endif

#ifndef memcpy
# define memcpy(dest, src, n)   __builtin_memcpy((dest), (src), (n))
#endif

#ifndef memmove
# define memmove(dest, src, n)  __builtin_memmove((dest), (src), (n))
#endif

The memcmp() built-in had some corner cases where inlining did not take place due to an LLVM issue in the back end, and is therefore not recommended to be used until the issue is fixed.
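
As a small usage sketch for the remaining built-ins, a constant-size copy such as the hypothetical mac_copy() helper below compiles into an inlined __builtin_memcpy() without emitting any function call:

#include <linux/types.h>

#ifndef __inline
# define __inline                         \
   inline __attribute__((always_inline))
#endif

#ifndef memcpy
# define memcpy(dest, src, n)   __builtin_memcpy((dest), (src), (n))
#endif

/* Constant size (6): LLVM inlines the copy, no call instruction is emitted. */
static __inline void mac_copy(__u8 *dst, const __u8 *src)
{
    memcpy(dst, src, 6);
}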

  6. There are no loops available (yet).

The BPF verifier in the kernel checks that a BPF program does not contain loops by performing a depth first search of all possible program paths besides other control flow graph validations. The purpose is to make sure that the program is always guaranteed to terminate.

A very limited form of looping is available for constant upper loop bounds by using the #pragma unroll directive. Example code that is compiled to BPF:

#pragma unroll
    for (i = 0; i < IPV6_MAX_HEADERS; i++) {
        switch (nh) {
        case NEXTHDR_NONE:
            return DROP_INVALID_EXTHDR;
        case NEXTHDR_FRAGMENT:
            return DROP_FRAG_NOSUPPORT;
        case NEXTHDR_HOP:
        case NEXTHDR_ROUTING:
        case NEXTHDR_AUTH:
        case NEXTHDR_DEST:
            if (skb_load_bytes(skb, l3_off + len, &opthdr, sizeof(opthdr)) < 0)
                return DROP_INVALID;

            nh = opthdr.nexthdr;
            if (nh == NEXTHDR_AUTH)
                len += ipv6_authlen(&opthdr);
            else
                len += ipv6_optlen(&opthdr);
            break;
        default:
            *nexthdr = nh;
            return len;
        }
    }

Another possibility is to use tail calls by calling into the same program again and using a BPF_MAP_TYPE_PERCPU_ARRAY map to provide local scratch space. While being dynamic, this form of looping is however limited to a maximum of 32 iterations.

In the future, BPF may have some native, but limited form of implementing loops.

  7. Partitioning programs with tail calls.

Tail calls provide the flexibility to atomically alter program behavior during runtime by jumping from one BPF program into another. In order to select the next program, tail calls make use of program array maps (BPF_MAP_TYPE_PROG_ARRAY), and pass the map as well as the index to the next program to jump to. There is no return to the old program after the jump has been performed, and in case there was no program present at the given map index, then execution continues on the original program.

For example, this can be used to implement various stages of a parser, where such stages could be updated with new parsing features during runtime.

Another use case is event notifications; for example, Cilium can opt in to packet drop notifications during runtime, where the skb_event_output() call is located inside the tail called program. Thus, during normal operations, the fall-through path will always be executed unless a program is added to the related map index, where the program then prepares the metadata and triggers the event notification to a user space daemon.

Program array maps are quite flexible, enabling also individual actions to be implemented for programs located in each map index. For example, the root program attached to XDP or tc could perform an initial tail call to index 0 of the program array map, performing traffic sampling, then jumping to index 1 of the program array map, where firewalling policy is applied and the packet either dropped or further processed in index 2 of the program array map, where it is mangled and sent out of an interface again. Jumps in the program array map can, of course, be arbitrary. The kernel will eventually execute the fall-through path when the maximum tail call limit has been reached.

Minimal example extract of using tail calls:

[...]

#ifndef __stringify
# define __stringify(X)   #X
#endif

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

#ifndef __section_tail
# define __section_tail(ID, KEY)          \
   __section(__stringify(ID) "/" __stringify(KEY))
#endif

#ifndef BPF_FUNC
# define BPF_FUNC(NAME, ...)              \
   (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
#endif

#define BPF_JMP_MAP_ID   1

static void BPF_FUNC(tail_call, struct __sk_buff *skb, void *map,
                     uint32_t index);

struct bpf_elf_map jmp_map __section("maps") = {
    .type           = BPF_MAP_TYPE_PROG_ARRAY,
    .id             = BPF_JMP_MAP_ID,
    .size_key       = sizeof(uint32_t),
    .size_value     = sizeof(uint32_t),
    .pinning        = PIN_GLOBAL_NS,
    .max_elem       = 1,
};

__section_tail(BPF_JMP_MAP_ID, 0)
int looper(struct __sk_buff *skb)
{
    printk("skb cb: %u\n", skb->cb[0]++);
    tail_call(skb, &jmp_map, 0);
    return TC_ACT_OK;
}

__section("prog")
int entry(struct __sk_buff *skb)
{
    skb->cb[0] = 0;
    tail_call(skb, &jmp_map, 0);
    return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";

When loading this toy program, tc will create the program array and pin it to the BPF file system in the global namespace under jmp_map. The BPF ELF loader in iproute2 will also recognize sections that are marked as __section_tail(). The provided id in struct bpf_elf_map will be matched against the id marker in the __section_tail(), that is, BPF_JMP_MAP_ID, and the program is therefore loaded at the user specified program array map index, which is 0 in this example. As a result, all provided tail call sections will be populated by the iproute2 loader into the corresponding maps. This mechanism is not specific to tc, but can be applied with any other BPF program type that iproute2 supports (such as XDP, lwt).

The generated ELF contains section headers describing the map id and the entry within that map:

$ llvm-objdump -S --no-show-raw-insn prog_array.o | less
prog_array.o:   file format ELF64-BPF

Disassembly of section 1/0:
looper:
       0:       r6 = r1
       1:       r2 = *(u32 *)(r6 + 48)
       2:       r1 = r2
       3:       r1 += 1
       4:       *(u32 *)(r6 + 48) = r1
       5:       r1 = 0 ll
       7:       call -1
       8:       r1 = r6
       9:       r2 = 0 ll
      11:       r3 = 0
      12:       call 12
      13:       r0 = 0
      14:       exit
Disassembly of section prog:
entry:
       0:       r2 = 0
       1:       *(u32 *)(r1 + 48) = r2
       2:       r2 = 0 ll
       4:       r3 = 0
       5:       call 12
       6:       r0 = 0
       7:       exit

In this case, the section 1/0 indicates that the looper() function resides in the map id 1 at position 0.

The pinned map can be retrieved by user space applications (e.g. the Cilium daemon), but also by tc itself in order to update the map with new programs. Updates happen atomically; the initial entry programs that are triggered first from the various subsystems are also updated atomically.

Example for tc to perform tail call map updates:

# tc exec bpf graft m:globals/jmp_map key 0 obj new.o sec foo

In case iproute2 needs to update the pinned program array, the graft command can be used. By pointing it to globals/jmp_map, tc will update the map at index / key 0 with the new program residing in the object file new.o under section foo.

  8. Limited stack space of maximum 512 bytes.

Stack space in BPF programs is limited to only 512 bytes, which needs to be taken into careful consideration when implementing BPF programs in C. However, as mentioned earlier in point 3, a BPF_MAP_TYPE_PERCPU_ARRAY map with a single entry can be used in order to enlarge scratch buffer space.

  9. Use of BPF inline assembly possible.

LLVM 6.0 or later allows use of inline assembly for BPF for the rare cases where it might be needed. The following (nonsense) toy example shows a 64-bit atomic add. Due to lack of documentation, the LLVM source code in lib/Target/BPF/BPFInstrInfo.td as well as test/CodeGen/BPF/ might be helpful for some additional examples. Test code:

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_test(struct xdp_md *ctx)
{
    __u64 a = 2, b = 3, *c = &a;
    /* just a toy xadd example to show the syntax */
    asm volatile("lock *(u64 *)(%0+0) += %1" : "=r"(c) : "r"(b), "0"(c));
    return a;
}

char __license[] __section("license") = "GPL";

The above program is compiled into the following sequence of BPF instructions:

Verifier analysis:

0: (b7) r1 = 2
1: (7b) *(u64 *)(r10 -8) = r1
2: (b7) r1 = 3
3: (bf) r2 = r10
4: (07) r2 += -8
5: (db) lock *(u64 *)(r2 +0) += r1
6: (79) r0 = *(u64 *)(r10 -8)
7: (95) exit
processed 8 insns (limit 131072), stack depth 8

  10. Remove struct padding by aligning members using #pragma pack.

In modern compilers, data structures are aligned by default to access memory efficiently. Structure members are aligned to memory addresses that are multiples of their size, and padding is added for proper alignment. Because of this, the size of a struct may often grow larger than expected.

struct called_info {
    u64 start;  // 8-byte
    u64 end;    // 8-byte
    u32 sector; // 4-byte
}; // size of 20-byte ?

printf("size of %d-byte\n", sizeof(struct called_info)); // size of 24-byte

// Actual compiled composition of struct called_info
// 0x0(0)                   0x8(8)
//  ↓________________________↓
//  |        start (8)       |
//  |________________________|
//  |         end  (8)       |
//  |________________________|
//  |  sector(4) |  PADDING  | <= address aligned to 8
//  |____________|___________|     with 4-byte PADDING.

The BPF verifier in the kernel checks the stack boundary to ensure that a BPF program does not access memory outside of it or an uninitialized stack area. Using a struct with such padding as a map value will cause an invalid indirect read from stack failure on bpf_prog_load().

Example code:

struct called_info {
    u64 start;
    u64 end;
    u32 sector;
};

struct bpf_map_def SEC("maps") called_info_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(long),
    .value_size = sizeof(struct called_info),
    .max_entries = 4096,
};

SEC("kprobe/submit_bio")
int submit_bio_entry(struct pt_regs *ctx)
{
    char fmt[] = "submit_bio(bio=0x%lx) called: %llu\n";
    u64 start_time = bpf_ktime_get_ns();
    long bio_ptr = PT_REGS_PARM1(ctx);
    struct called_info called_info = {
            .start = start_time,
            .end = 0,
            .sector = 0
    };

    bpf_map_update_elem(&called_info_map, &bio_ptr, &called_info, BPF_ANY);
    bpf_trace_printk(fmt, sizeof(fmt), bio_ptr, start_time);
    return 0;
}

// On bpf_load_program
bpf_load_program() err=13
0: (bf) r6 = r1
...
19: (b7) r1 = 0
20: (7b) *(u64 *)(r10 -72) = r1
21: (7b) *(u64 *)(r10 -80) = r7
22: (63) *(u32 *)(r10 -64) = r1
...
30: (85) call bpf_map_update_elem#2
invalid indirect read from stack off -80+20 size 24

At bpf_prog_load(), the eBPF verifier bpf_check() is called, and it checks the stack boundary by calling check_func_arg() -> check_stack_boundary(). As the above error shows, struct called_info is compiled to a 24-byte size, and the message says that reading data from offset +20 is an invalid indirect read. And as discussed earlier, the address 0x14(20) is the place where the PADDING is.

// Actual compiled composition of struct called_info
// 0x10(16)    0x14(20)    0x18(24)
//  ↓____________↓___________↓
//  |  sector(4) |  PADDING  | <= address aligned to 8
//  |____________|___________|     with 4-byte PADDING.

check_stack_boundary() internally loops through every byte of access_size (24) from the start pointer to make sure that it is within the stack boundary and that all elements of the stack are initialized. Since the padding isn’t supposed to be used, it gets the ‘invalid indirect read from stack’ failure. To avoid this kind of failure, removing the padding from the struct is necessary.

Removing the padding by using the #pragma pack(n) directive:

#pragma pack(4)
struct called_info {
    u64 start;  // 8-byte
    u64 end;    // 8-byte
    u32 sector; // 4-byte
}; // size of 20-byte ?

printf("size of %d-byte\n", sizeof(struct called_info)); // size of 20-byte

// Actual compiled composition of packed struct called_info
// 0x0(0)                   0x8(8)
//  ↓________________________↓
//  |        start (8)       |
//  |________________________|
//  |         end  (8)       |
//  |________________________|
//  |  sector(4) |             <= address aligned to 4
//  |____________|                 with no PADDING.

By placing #pragma pack(4) before struct called_info, the compiler will align members of the struct to the lesser of 4 bytes and their natural alignment. As you can see, the size of struct called_info has shrunk to 20 bytes and the padding no longer exists.

However, removing the padding has downsides as well. For example, the compiler will generate less optimized code. Since the padding has been removed, processors will perform unaligned accesses to the structure, which might lead to performance degradation. Also, unaligned access might get rejected by the verifier on some architectures.

However, there is a way to avoid the downsides of a packed structure. Simply adding an explicit u32 pad member at the end resolves the same problem without packing the structure.

struct called_info {
    u64 start;  // 8-byte
    u64 end;    // 8-byte
    u32 sector; // 4-byte
    u32 pad;    // 4-byte
}; // size of 24-byte ?

printf("size of %d-byte\n", sizeof(struct called_info)); // size of 24-byte

// Actual compiled composition of struct called_info with explicit padding
// 0x0(0)                   0x8(8)
//  ↓________________________↓
//  |        start (8)       |
//  |________________________|
//  |         end  (8)       |
//  |________________________|
//  |  sector(4) |  pad (4)  | <= address aligned to 8
//  |____________|___________|     with explicit PADDING.

  11. Accessing packet data via invalidated references.

Some networking BPF helper functions such as bpf_skb_store_bytes might change the size of the packet data. As the verifier is not able to track such changes, any a priori reference to the data will be invalidated by the verifier. Therefore, the reference needs to be updated before accessing the data, otherwise the verifier will reject the program.

To illustrate this, consider the following snippet:

struct iphdr *ip4 = (struct iphdr *)(skb->data + ETH_HLEN);

skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0);

if (ip4->protocol == IPPROTO_TCP) {
    // do something
}

The verifier will reject the snippet due to the dereference of the invalidated ip4->protocol:

R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=34,r=34,imm=0) R3=inv0
R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
R8=inv4294967162 R9=pkt(id=0,off=0,r=34,imm=0) R10=fp0,call_-1
...
18: (85) call bpf_skb_store_bytes#9
19: (7b) *(u64 *)(r10 -56) = r7
R0=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=2,var_off=(0x0; 0x3))
R8=inv4294967162 R9=inv(id=0) R10=fp0,call_-1 fp-48=mmmm???? fp-56=mmmmmmmm
21: (61) r1 = *(u32 *)(r9 +23)
R9 invalid mem access 'inv'

To fix this, the reference to ip4 has to be updated:

struct iphdr *ip4 = (struct iphdr *)(skb->data + ETH_HLEN);

skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0);

ip4 = (struct iphdr *)(skb->data + ETH_HLEN);

if (ip4->protocol == IPPROTO_TCP) {
    // do something
}

iproute2

There are various front ends for loading BPF programs into the kernel such as bcc, perf, iproute2 and others. The Linux kernel source tree also provides a user space library under tools/lib/bpf/, which is mainly used and driven by perf for loading BPF tracing programs into the kernel. However, the library itself is generic and not limited to perf only. bcc is a toolkit providing many useful BPF programs mainly for tracing that are loaded ad-hoc through a Python interface embedding the BPF C code. Syntax and semantics for implementing BPF programs slightly differ among front ends in general, though. Additionally, there are also BPF samples in the kernel source tree (samples/bpf/) which parse the generated object files and load the code directly through the system call interface.

This and previous sections mainly focus on the iproute2 suite’s BPF front end for loading networking programs of XDP, tc or lwt type, since Cilium’s programs are implemented against this BPF loader. In the future, Cilium will be equipped with a native BPF loader, but programs will remain compatible with being loaded through the iproute2 suite in order to facilitate development and debugging.

All BPF program types supported by iproute2 share the same BPF loader logic due to having a common loader back end implemented as a library (lib/bpf.c in iproute2 source tree).

The previous section on LLVM also covered some iproute2 parts related to writing BPF C programs, and later sections in this document are related to tc and XDP specific aspects when writing programs. Therefore, this section will rather focus on usage examples for loading object files with iproute2 as well as some of the generic mechanics of the loader. It does not try to provide a complete coverage of all details, but enough for getting started.

1. Loading of XDP BPF object files.

Given a BPF object file prog.o has been compiled for XDP, it can be loaded through ip to a XDP-supported netdevice called em1 with the following command:

# ip link set dev em1 xdp obj prog.o

The above command assumes that the program code resides in the default section, which is called prog in the XDP case. Should this not be the case, and the section is named differently, for example, foobar, then the program needs to be loaded as:

# ip link set dev em1 xdp obj prog.o sec foobar

Note that it is also possible to load the program out of the .text section. Changing the minimal, stand-alone XDP drop program by removing the __section() annotation from the xdp_drop entry point would look like the following:

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

int xdp_drop(struct xdp_md *ctx)
{
    return XDP_DROP;
}

char __license[] __section("license") = "GPL";

And can be loaded as follows:

# ip link set dev em1 xdp obj prog.o sec .text

By default, ip will throw an error in case an XDP program is already attached to the networking interface, to prevent it from being overridden by accident. In order to replace the currently running XDP program with a new one, the -force option must be used:

# ip -force link set dev em1 xdp obj prog.o

Most XDP-enabled drivers today support an atomic replacement of the existing program with a new one without traffic interruption. There is always only a single program attached to an XDP-enabled driver due to performance reasons, hence a chain of programs is not supported. However, as described in the previous section, partitioning of programs can be performed through tail calls to achieve a similar use case when necessary.

The ip link command will display an xdp flag if the interface has an XDP program attached. ip link | grep xdp can thus be used to find all interfaces that have XDP running. Further introspection facilities are provided through the detailed view with ip -d link and bpftool can be used to retrieve information about the attached program based on the BPF program ID shown in the ip link dump.

In order to remove the existing XDP program from the interface, the following command must be issued:

# ip link set dev em1 xdp off

In the case of switching a driver’s operation mode from non-XDP to native XDP and vice versa, typically the driver needs to reconfigure its receive (and transmit) rings in order to ensure received packets are set up linearly within a single page for BPF to read and write into. However, once this is completed, most drivers only need to perform an atomic replacement of the program itself when a BPF program is requested to be swapped.

In total, XDP supports three operation modes which iproute2 implements as well: xdpdrv, xdpoffload and xdpgeneric.

xdpdrv stands for native XDP, meaning the BPF program is run directly in the driver’s receive path at the earliest possible point in software. This is the normal / conventional XDP mode and requires drivers to implement XDP support, which all major 10G/40G/+ networking drivers in the upstream Linux kernel already provide.

xdpgeneric stands for generic XDP and is intended as an experimental test bed for drivers which do not yet support native XDP. Given the generic XDP hook in the ingress path comes at a much later point in time when the packet already enters the stack’s main receive path as a skb, the performance is significantly less than with processing in xdpdrv mode. xdpgeneric therefore is for the most part only interesting for experimenting, less for production environments.

Last but not least, the xdpoffload mode is implemented by SmartNICs such as those supported by Netronome’s nfp driver and allows for offloading the entire BPF/XDP program into hardware, so that the program is run on each packet reception directly on the card. This provides even higher performance than running in native XDP, although not all BPF map types or BPF helper functions are available for use compared to native XDP. The BPF verifier will reject the program in such a case and report to the user what is unsupported. Other than staying in the realm of supported BPF features and helper functions, no special precautions have to be taken when writing BPF C programs.

When a command like ip link set dev em1 xdp obj [...] is used, the kernel will first attempt to load the program as native XDP, and in case the driver does not support native XDP, it will automatically fall back to generic XDP. Thus, when for example xdpdrv is specified explicitly instead of xdp, the kernel will only attempt to load the program as native XDP and fail in case the driver does not support it, which provides a guarantee that generic XDP is avoided altogether.

Example for enforcing a BPF/XDP program to be loaded in native XDP mode, dumping the link details and unloading the program again:

# ip -force link set dev em1 xdpdrv obj prog.o
# ip link show
[...]
6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DORMANT group default qlen 1000
    link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 1 tag 57cd311f2e27366b
[...]
# ip link set dev em1 xdpdrv off

Same example now for forcing generic XDP, even if the driver would support native XDP, and additionally dumping the BPF instructions of the attached dummy program through bpftool:

# ip -force link set dev em1 xdpgeneric obj prog.o
# ip link show
[...]
6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP mode DORMANT group default qlen 1000
    link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 4 tag 57cd311f2e27366b                <-- BPF program ID 4
[...]
# bpftool prog dump xlated id 4                       <-- Dump of instructions running on em1
0: (b7) r0 = 1
1: (95) exit
# ip link set dev em1 xdpgeneric off

And last but not least offloaded XDP, where we additionally dump program information via bpftool for retrieving general metadata:

# ip -force link set dev em1 xdpoffload obj prog.o
# ip link show
[...]
6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000
    link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 8 tag 57cd311f2e27366b
[...]
# bpftool prog show id 8
8: xdp  tag 57cd311f2e27366b dev em1                  <-- Also indicates a BPF program offloaded to em1
    loaded_at Apr 11/20:38  uid 0
    xlated 16B  not jited  memlock 4096B
# ip link set dev em1 xdpoffload off

Note that it is not possible to use xdpdrv and xdpgeneric or other modes at the same time, meaning only one of the XDP operation modes can be in use at a time.

A switch between different XDP modes, e.g. from generic to native or vice versa, cannot be performed atomically. Only switching programs within a specific operation mode can be done atomically:

# ip -force link set dev em1 xdpgeneric obj prog.o
# ip -force link set dev em1 xdpoffload obj prog.o
RTNETLINK answers: File exists
# ip -force link set dev em1 xdpdrv obj prog.o
RTNETLINK answers: File exists
# ip -force link set dev em1 xdpgeneric obj prog.o    <-- Succeeds due to xdpgeneric
#

Switching between modes requires first leaving the current operation mode in order to then enter the new one:

# ip -force link set dev em1 xdpgeneric obj prog.o
# ip -force link set dev em1 xdpgeneric off
# ip -force link set dev em1 xdpoffload obj prog.o
# ip l
[...]
6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000
    link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 17 tag 57cd311f2e27366b
[...]
# ip -force link set dev em1 xdpoffload off

2. Loading of tc BPF object files.

Given a BPF object file prog.o has been compiled for tc, it can be loaded through the tc command to a netdevice. Unlike XDP, there is no driver dependency for attaching BPF programs to the device. Here, the netdevice is called em1, and with the following command the program can be attached to the networking ingress path of em1:

# tc qdisc add dev em1 clsact
# tc filter add dev em1 ingress bpf da obj prog.o

The first step is to set up a clsact qdisc (Linux queueing discipline). clsact is a dummy qdisc similar to the ingress qdisc, which can only hold classifiers and actions, but does not perform actual queueing. It is needed in order to attach the bpf classifier. The clsact qdisc provides two special hooks called ingress and egress, to which the classifier can be attached. Both ingress and egress hooks are located in central receive and transmit locations in the networking data path, where every packet on the device passes through. The ingress hook is called from __netif_receive_skb_core() -> sch_handle_ingress() in the kernel and the egress hook from __dev_queue_xmit() -> sch_handle_egress().

The equivalent for attaching the program to the egress hook looks as follows:

# tc filter add dev em1 egress bpf da obj prog.o

The clsact qdisc is processed locklessly from the ingress and egress direction and can also be attached to virtual, queue-less devices such as veth devices connecting containers.

Next to the hook, the tc filter command selects bpf to be used in da (direct-action) mode. da mode is recommended and should always be specified. It basically means that the bpf classifier does not need to call into external tc action modules, which are not necessary for bpf anyway, since all packet mangling, forwarding or other kinds of actions can already be performed inside the single BPF program which is to be attached, and it is therefore significantly faster.
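
As a minimal illustration of da mode (a hypothetical toy program, not one of the earlier examples), the verdict is simply returned from the BPF program itself, here TC_ACT_SHOT to drop every packet on the hook the section is attached to:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#ifndef __section
# define __section(NAME)                  \
   __attribute__((section(NAME), used))
#endif

__section("drop-all")
int tc_drop_all(struct __sk_buff *skb)
{
    /* In direct-action mode this return value is the final tc verdict. */
    return TC_ACT_SHOT;
}

char __license[] __section("license") = "GPL";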

At this point, the program has been attached and is executed once packets traverse the device. Like in XDP, should the default section name not be used, then it can be specified during load, for example, in case of section foobar:

# tc filter add dev em1 egress bpf da obj prog.o sec foobar

iproute2’s BPF loader allows for using the same command line syntax across program types, hence the obj prog.o sec foobar is the same syntax as with XDP mentioned earlier.

The attached programs can be listed through the following commands:

# tc filter show dev em1 ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 prog.o:[ingress] direct-action id 1 tag c5f7825e5dac396f

# tc filter show dev em1 egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 prog.o:[egress] direct-action id 2 tag b2fd5adc0f262714

The output of prog.o:[ingress] tells us that program section ingress was loaded from the file prog.o, and that bpf operates in direct-action mode. The program id and tag are appended for each case, where the latter denotes a hash over the instruction stream which can be correlated with the object file or perf reports with stack traces, etc. Last but not least, the id represents the system-wide unique BPF program identifier that can be used along with bpftool to further inspect or dump the attached BPF program.

tc can attach more than just a single BPF program; it also provides various other classifiers which can be chained together. However, attaching a single BPF program is fully sufficient since all packet operations can be contained in the program itself thanks to da (direct-action) mode, meaning the BPF program itself will already return the tc action verdict such as TC_ACT_OK, TC_ACT_SHOT and others. For optimal performance and flexibility, this is the recommended usage.

In the above show command, tc also displays pref 49152 and handle 0x1 next to the BPF related output. Both are auto-generated in case they are not explicitly provided through the command line. pref denotes a priority number, which means that in case multiple classifiers are attached, they will be executed based on ascending priority, and handle represents an identifier in case multiple instances of the same classifier have been loaded under the same pref. Since in case of BPF, a single program is fully sufficient, pref and handle can typically be ignored.

Only when it is planned to atomically replace the attached BPF programs would it be recommended to explicitly specify pref and handle a priori on initial load, so that they do not have to be queried at a later point in time for the replace operation. Thus, creation becomes:

# tc filter add dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar

# tc filter show dev em1 ingress
filter protocol all pref 1 bpf
filter protocol all pref 1 bpf handle 0x1 prog.o:[foobar] direct-action id 1 tag c5f7825e5dac396f

And for the atomic replacement, the following can be issued for updating the existing program at ingress hook with the new BPF program from the file prog.o in section foobar:

# tc filter replace dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar

Last but not least, in order to remove all attached programs from the ingress and egress hooks respectively, the following can be used:

# tc filter del dev em1 ingress
# tc filter del dev em1 egress

For removing the entire clsact qdisc from the netdevice, which implicitly also removes all attached programs from the ingress and egress hooks, the below command is provided:

# tc qdisc del dev em1 clsact

tc BPF programs can also be offloaded if the NIC and driver have support for it, similar to XDP BPF programs. NICs supported by Netronome’s nfp driver offer both types of BPF offload.

# tc qdisc add dev em1 clsact
# tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o
Error: TC offload is disabled on net device.
We have an error talking to the kernel

If the above error is shown, then tc hardware offload first needs to be enabled for the device through ethtool’s hw-tc-offload setting:

# ethtool -K em1 hw-tc-offload on
# tc qdisc add dev em1 clsact
# tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o
# tc filter show dev em1 ingress
filter protocol all pref 1 bpf
filter protocol all pref 1 bpf handle 0x1 prog.o:[classifier] direct-action skip_sw in_hw id 19 tag 57cd311f2e27366b

The in_hw flag confirms that the program has been offloaded to the NIC.

Note that BPF offloads for both tc and XDP cannot be loaded at the same time, either the tc or XDP offload option must be selected.

3. Testing BPF offload interface via netdevsim driver.

The netdevsim driver, which is part of the Linux kernel, provides a dummy driver which implements offload interfaces for XDP BPF and tc BPF programs and facilitates testing of kernel changes or of low-level user space programs implementing a control plane directly against the kernel’s UAPI.

A netdevsim device can be created as follows:

# modprobe netdevsim
// [ID] [PORT_COUNT]
# echo "1 1" > /sys/bus/netdevsim/new_device
# devlink dev
netdevsim/netdevsim1
# devlink port
netdevsim/netdevsim1/0: type eth netdev eth0 flavour physical
# ip l
[...]
4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff

After that step, XDP BPF or tc BPF programs can be test loaded as shown in the various examples earlier:

# ip -force link set dev eth0 xdpoffload obj prog.o
# ip l
[...]
4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 xdpoffload qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff
    prog/xdp id 16 tag a04f5eef06a7f555

These two workflows cover the basic operations to load XDP BPF and tc BPF programs, respectively, with iproute2.

There are various other advanced options for the BPF loader that apply to both XDP and tc; some of them are listed here. In the examples, only XDP is presented for simplicity.

1. Verbose log output even on success.

The option verb can be appended for loading programs in order to dump the verifier log, even if no error occurred:

# ip link set dev em1 xdp obj xdp-example.o verb

Prog section 'prog' loaded (5)!
 - Type:         6
 - Instructions: 2 (0 over limit)
 - License:      GPL

Verifier analysis:

0: (b7) r0 = 1
1: (95) exit
processed 2 insns

2. Load program that is already pinned in BPF file system.

Instead of loading a program from an object file, iproute2 can also retrieve the program from the BPF file system in case some external entity pinned it there and attach it to the device:

# ip link set dev em1 xdp pinned /sys/fs/bpf/prog

iproute2 can also use the short form that is relative to the detected mount point of the BPF file system:

# ip link set dev em1 xdp pinned m:prog

When loading BPF programs, iproute2 will automatically detect the mounted file system instance in order to perform pinning of nodes. In case no mounted BPF file system instance is found, tc will automatically mount it to the default location under /sys/fs/bpf/.

In case an instance has already been mounted, it will be used and no additional mount will be performed:

# mkdir /var/run/bpf
# mount --bind /var/run/bpf /var/run/bpf
# mount -t bpf bpf /var/run/bpf
# tc filter add dev em1 ingress bpf da obj tc-example.o sec prog
# tree /var/run/bpf
/var/run/bpf
+-- ip -> /run/bpf/tc/
+-- tc
|   +-- globals
|       +-- jmp_map
+-- xdp -> /run/bpf/tc/

4 directories, 1 file

By default tc will create an initial directory structure as shown above, where all subsystem users will point to the same location through symbolic links for the globals namespace, so that pinned BPF maps can be reused among various BPF program types in iproute2. In case the file system instance has already been mounted and an existing structure exists, then tc will not override it. This could be the case when separating lwt, tc and xdp maps in order to not share globals among all of them.

As briefly covered in the previous LLVM section, iproute2 installs a header file which BPF programs can include through the standard include path:

#include <iproute2/bpf_elf.h>

The purpose of this header file is to provide an API for maps and default section names used by programs. It’s a stable contract between iproute2 and BPF programs.

The map definition for iproute2 is struct bpf_elf_map. Its members have been covered earlier in the LLVM section of this document.

When parsing the BPF object file, the iproute2 loader will walk through all ELF sections. It initially fetches ancillary sections like maps and license. For maps, the struct bpf_elf_map array will be checked for validity and, whenever needed, compatibility workarounds are performed. Subsequently all maps are created with the user provided information, either retrieved as a pinned object, or newly created and then pinned into the BPF file system. Next, the loader will handle all program sections that contain ELF relocation entries for maps, meaning that BPF instructions loading map file descriptors into registers are rewritten so that the corresponding map file descriptors are encoded into the instructions’ immediate value, in order for the kernel to be able to convert them later on into map kernel pointers. After that, all the programs themselves are created through the BPF system call, and tail call maps, if present, are updated with the programs’ file descriptors.

bpftool

bpftool is the main introspection and debugging tool around BPF and is developed and shipped along with the Linux kernel tree under tools/bpf/bpftool/.

The tool can dump all BPF programs and maps that are currently loaded in the system, or list and correlate all BPF maps used by a specific program. Furthermore, it allows dumping an entire map’s key / value pairs, or looking up, updating and deleting individual ones, as well as retrieving a key’s neighbor key in the map. Such operations can be performed based on BPF program or map IDs or by specifying the location of a BPF file system pinned program or map. The tool additionally offers an option to pin maps or programs into the BPF file system.

For a quick overview of all BPF programs currently loaded on the host, invoke the following command:

# bpftool prog
398: sched_cls  tag 56207908be8ad877
   loaded_at Apr 09/16:24  uid 0
   xlated 8800B  jited 6184B  memlock 12288B  map_ids 18,5,17,14
399: sched_cls  tag abc95fb4835a6ec9
   loaded_at Apr 09/16:24  uid 0
   xlated 344B  jited 223B  memlock 4096B  map_ids 18
400: sched_cls  tag afd2e542b30ff3ec
   loaded_at Apr 09/16:24  uid 0
   xlated 1720B  jited 1001B  memlock 4096B  map_ids 17
401: sched_cls  tag 2dbbd74ee5d51cc8
   loaded_at Apr 09/16:24  uid 0
   xlated 3728B  jited 2099B  memlock 4096B  map_ids 17
[...]

Similarly, to get an overview of all active maps:

# bpftool map
5: hash  flags 0x0
    key 20B  value 112B  max_entries 65535  memlock 13111296B
6: hash  flags 0x0
    key 20B  value 20B  max_entries 65536  memlock 7344128B
7: hash  flags 0x0
    key 10B  value 16B  max_entries 8192  memlock 790528B
8: hash  flags 0x0
    key 22B  value 28B  max_entries 8192  memlock 987136B
9: hash  flags 0x0
    key 20B  value 8B  max_entries 512000  memlock 49352704B
[...]

Note that for each command, bpftool also supports JSON-based output by appending --json at the end of the command line. An additional --pretty makes the output more human readable.

# bpftool prog --json --pretty

For dumping the post-verifier BPF instruction image of a specific BPF program, one starting point could be to inspect a specific program, e.g. attached to the tc ingress hook:

# tc filter show dev cilium_host egress
filter protocol all pref 1 bpf chain 0
filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_host.o:[from-netdev] \
                    direct-action not_in_hw id 406 tag e0362f5bd9163a0a jited

The program from the object file bpf_host.o, section from-netdev has a BPF program ID of 406 as denoted in id 406. Based on this information bpftool can provide some high-level metadata specific to the program:

# bpftool prog show id 406
406: sched_cls  tag e0362f5bd9163a0a
     loaded_at Apr 09/16:24  uid 0
     xlated 11144B  jited 7721B  memlock 12288B  map_ids 18,20,8,5,6,14

The program of ID 406 is of type sched_cls (BPF_PROG_TYPE_SCHED_CLS), has a tag of e0362f5bd9163a0a (SHA sum over the instruction sequence), and was loaded by root (uid 0) on Apr 09/16:24. The BPF instruction sequence is 11,144 bytes long and the JITed image 7,721 bytes. The program itself (excluding maps) consumes 12,288 bytes that are accounted / charged against user uid 0. The BPF program uses the BPF maps with IDs 18, 20, 8, 5, 6 and 14. The latter IDs can further be used to get information on or dump the maps themselves.

Additionally, bpftool can issue a dump request of the BPF instructions the program runs:

# bpftool prog dump xlated id 406
 0: (b7) r7 = 0
 1: (63) *(u32 *)(r1 +60) = r7
 2: (63) *(u32 *)(r1 +56) = r7
 3: (63) *(u32 *)(r1 +52) = r7
[...]
47: (bf) r4 = r10
48: (07) r4 += -40
49: (79) r6 = *(u64 *)(r10 -104)
50: (bf) r1 = r6
51: (18) r2 = map[id:18]                    <-- BPF map id 18
53: (b7) r5 = 32
54: (85) call bpf_skb_event_output#5656112  <-- BPF helper call
55: (69) r1 = *(u16 *)(r6 +192)
[...]

bpftool correlates BPF map IDs in the instruction stream, as shown above, as well as calls to BPF helpers or other BPF programs.

The instruction dump reuses the same ‘pretty-printer’ as the kernel’s BPF verifier. Since the program was JITed, and therefore the actual JIT image generated out of the above xlated instructions is what gets executed, it can be dumped as well through bpftool:

# bpftool prog dump jited id 406
 0:        push   %rbp
 1:        mov    %rsp,%rbp
 4:        sub    $0x228,%rsp
 b:        sub    $0x28,%rbp
 f:        mov    %rbx,0x0(%rbp)
13:        mov    %r13,0x8(%rbp)
17:        mov    %r14,0x10(%rbp)
1b:        mov    %r15,0x18(%rbp)
1f:        xor    %eax,%eax
21:        mov    %rax,0x20(%rbp)
25:        mov    0x80(%rdi),%r9d
[...]

Mainly for BPF JIT developers, the option also exists to interleave the disassembly with the actual native opcodes:

# bpftool prog dump jited id 406 opcodes
 0:        push   %rbp
           55
 1:        mov    %rsp,%rbp
           48 89 e5
 4:        sub    $0x228,%rsp
           48 81 ec 28 02 00 00
 b:        sub    $0x28,%rbp
           48 83 ed 28
 f:        mov    %rbx,0x0(%rbp)
           48 89 5d 00
13:        mov    %r13,0x8(%rbp)
           4c 89 6d 08
17:        mov    %r14,0x10(%rbp)
           4c 89 75 10
1b:        mov    %r15,0x18(%rbp)
           4c 89 7d 18
[...]

The same interleaving can be done for the normal BPF instructions which can sometimes be useful for debugging in the kernel:

# bpftool prog dump xlated id 406 opcodes
 0: (b7) r7 = 0
    b7 07 00 00 00 00 00 00
 1: (63) *(u32 *)(r1 +60) = r7
    63 71 3c 00 00 00 00 00
 2: (63) *(u32 *)(r1 +56) = r7
    63 71 38 00 00 00 00 00
 3: (63) *(u32 *)(r1 +52) = r7
    63 71 34 00 00 00 00 00
 4: (63) *(u32 *)(r1 +48) = r7
    63 71 30 00 00 00 00 00
 5: (63) *(u32 *)(r1 +64) = r7
    63 71 40 00 00 00 00 00
 [...]

The basic blocks of a program can also be visualized with the help of graphviz. For this purpose, bpftool has a visual dump mode that generates a dot file instead of the plain BPF xlated instruction dump, which can later be converted to a png file:

# bpftool prog dump xlated id 406 visual &> output.dot
$ dot -Tpng output.dot -o output.png

Another option would be to pass the dot file to dotty as a viewer, that is dotty output.dot, where the result for the bpf_host.o program looks as follows (small extract):

[Figure: graphviz visualization of the bpf_host.o program’s basic blocks (bpf_dot.png)]

Note that the xlated instruction dump provides the post-verifier BPF instruction image which means that it dumps the instructions as if they were to be run through the BPF interpreter. In the kernel, the verifier performs various rewrites of the original instructions provided by the BPF loader.

One example of rewrites is the inlining of helper functions in order to improve runtime performance, here in the case of a map lookup for hash tables:

# bpftool prog dump xlated id 3
 0: (b7) r1 = 2
 1: (63) *(u32 *)(r10 -4) = r1
 2: (bf) r2 = r10
 3: (07) r2 += -4
 4: (18) r1 = map[id:2]                      <-- BPF map id 2
 6: (85) call __htab_map_lookup_elem#77408   <-+ BPF helper inlined rewrite
 7: (15) if r0 == 0x0 goto pc+2                |
 8: (07) r0 += 56                              |
 9: (79) r0 = *(u64 *)(r0 +0)                <-+
10: (15) if r0 == 0x0 goto pc+24
11: (bf) r2 = r10
12: (07) r2 += -4
[...]

bpftool correlates calls to helper functions or BPF to BPF calls through kallsyms. Therefore, make sure that JITed BPF programs are exposed to kallsyms (bpf_jit_kallsyms) and that kallsyms addresses are not obfuscated (calls are otherwise shown as call bpf_unspec#0):

# echo 0 > /proc/sys/kernel/kptr_restrict
# echo 1 > /proc/sys/net/core/bpf_jit_kallsyms

BPF to BPF calls are correlated as well for both the interpreter and the JIT case. In the latter, the tag of the subprogram is shown as the call target. In each case, the pc+2 is the pc-relative offset of the call target, which denotes the subprogram.

# bpftool prog dump xlated id 1
0: (85) call pc+2#__bpf_prog_run_args32
1: (b7) r0 = 1
2: (95) exit
3: (b7) r0 = 2
4: (95) exit

JITed variant of the dump:

# bpftool prog dump xlated id 1
0: (85) call pc+2#bpf_prog_3b185187f1855c4c_F
1: (b7) r0 = 1
2: (95) exit
3: (b7) r0 = 2
4: (95) exit

In the case of tail calls, the kernel maps them into a single instruction internally; bpftool will still correlate them as a helper call for ease of debugging:

# bpftool prog dump xlated id 2
[...]
10: (b7) r2 = 8
11: (85) call bpf_trace_printk#-41312
12: (bf) r1 = r6
13: (18) r2 = map[id:1]
15: (b7) r3 = 0
16: (85) call bpf_tail_call#12
17: (b7) r1 = 42
18: (6b) *(u16 *)(r6 +46) = r1
19: (b7) r0 = 0
20: (95) exit

# bpftool map show id 1
1: prog_array  flags 0x0
      key 4B  value 4B  max_entries 1  memlock 4096B

Dumping an entire map is possible through the map dump subcommand which iterates through all present map elements and dumps the key / value pairs.

If no BTF (BPF Type Format) data is available for a given map, then the key / value pairs are dumped as hex:

# bpftool map dump id 5
key:
f0 0d 00 00 00 00 00 00  0a 66 00 00 00 00 8a d6
02 00 00 00
value:
00 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
key:
0a 66 1c ee 00 00 00 00  00 00 00 00 00 00 00 00
01 00 00 00
value:
00 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
[...]
Found 6 elements

However, with BTF, the map also holds debugging information about the key and value structures. For example, BTF in combination with BPF maps and the BPF_ANNOTATE_KV_PAIR() macro from iproute2 will result in the following dump (test_xdp_noinline.o from kernel selftests):

# cat tools/testing/selftests/bpf/test_xdp_noinline.c
  [...]
   struct ctl_value {
         union {
                 __u64 value;
                 __u32 ifindex;
                 __u8 mac[6];
         };
   };

   struct bpf_map_def __attribute__ ((section("maps"), used)) ctl_array = {
          .type            = BPF_MAP_TYPE_ARRAY,
          .key_size        = sizeof(__u32),
          .value_size      = sizeof(struct ctl_value),
          .max_entries     = 16,
          .map_flags       = 0,
   };
   BPF_ANNOTATE_KV_PAIR(ctl_array, __u32, struct ctl_value);

   [...]

The BPF_ANNOTATE_KV_PAIR() macro forces a map-specific ELF section containing an empty key and value. This enables the iproute2 BPF loader to correlate BTF data with that section and thus to pick the corresponding types out of the BTF when loading the map.

Compiling through LLVM and generating BTF from the debugging information with pahole:

# clang [...] -O2 -target bpf -g -emit-llvm -c test_xdp_noinline.c -o - |
  llc -march=bpf -mcpu=probe -mattr=dwarfris -filetype=obj -o test_xdp_noinline.o
# pahole -J test_xdp_noinline.o

Now loading into kernel and dumping the map via bpftool:

# ip -force link set dev lo xdp obj test_xdp_noinline.o sec xdp-test
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 xdpgeneric/id:227 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
[...]
# bpftool prog show id 227
227: xdp  tag a85e060c275c5616  gpl
    loaded_at 2018-07-17T14:41:29+0000  uid 0
    xlated 8152B  not jited  memlock 12288B  map_ids 381,385,386,382,384,383
# bpftool map dump id 386
 [{
      "key": 0,
      "value": {
          "": {
              "value": 0,
              "ifindex": 0,
              "mac": []
          }
      }
  },{
      "key": 1,
      "value": {
          "": {
              "value": 0,
              "ifindex": 0,
              "mac": []
          }
      }
  },{
[...]

Lookup, update, delete, and ‘get next key’ operations on the map for specific keys can be performed through bpftool as well.
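
For example, assuming the array map with id 386 from the dump above (4 byte keys, 8 byte values), such operations could look as follows; the concrete id and byte values are purely illustrative:

# bpftool map lookup id 386 key 0 0 0 0
# bpftool map update id 386 key 0 0 0 0 value 1 0 0 0 0 0 0 0
# bpftool map getnext id 386 key 0 0 0 0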

If the BPF program has been successfully loaded with BTF debugging information, the BTF ID will be shown in the prog show command output, denoted as btf_id.

# bpftool prog show id 72
72: xdp  name balancer_ingres  tag acf44cabb48385ed  gpl
   loaded_at 2020-04-13T23:12:08+0900  uid 0
   xlated 19104B  jited 10732B  memlock 20480B  map_ids 126,130,131,127,129,128
   btf_id 60

This can also be confirmed with the btf show command, which dumps all BTF objects loaded on a system.

# bpftool btf show
60: size 12243B  prog_ids 72  map_ids 126,130,131,127,129,128

The btf dump subcommand can be used to check which debugging information is included in the BTF. The dump can be formatted either as ‘raw’ or as ‘c’, the latter using C-style syntax.

# bpftool btf dump id 60 format c
  [...]
   struct ctl_value {
         union {
                 __u64 value;
                 __u32 ifindex;
                 __u8 mac[6];
         };
   };

   typedef unsigned int u32;
   [...]

BPF sysctls

The Linux kernel provides a few BPF related sysctls that are covered in this section.

  • /proc/sys/net/core/bpf_jit_enable: Enables or disables the BPF JIT compiler.

    Value  Description
    0      Disable the JIT and use only the interpreter (kernel’s default value)
    1      Enable the JIT compiler
    2      Enable the JIT and emit debugging traces to the kernel log

    As described in subsequent sections, the bpf_jit_disasm tool can be used to process debugging traces when the JIT compiler is set to debugging mode (option 2).

  • /proc/sys/net/core/bpf_jit_harden: Enables or disables BPF JIT hardening. Note that enabling hardening trades off performance, but can mitigate JIT spraying by blinding out the BPF program’s immediate values. For programs processed through the interpreter, blinding of immediate values is not needed / performed.

    Value  Description
    0      Disable JIT hardening (kernel’s default value)
    1      Enable JIT hardening for unprivileged users only
    2      Enable JIT hardening for all users
  • /proc/sys/net/core/bpf_jit_kallsyms: Enables or disables the export of JITed programs as kernel symbols to /proc/kallsyms so that they can be used together with perf tooling, and so that these addresses are known to the kernel for stack unwinding, for example, when dumping stack traces. The symbol names contain the BPF program tag (bpf_prog_<tag>). If bpf_jit_harden is enabled, then this feature is disabled.

    Value  Description
    0      Disable JIT kallsyms export (kernel’s default value)
    1      Enable JIT kallsyms export for privileged users only
  • /proc/sys/kernel/unprivileged_bpf_disabled: Enables or disables unprivileged use of the bpf(2) system call. The Linux kernel has unprivileged use of bpf(2) enabled by default, but once the switch is flipped, unprivileged use will be permanently disabled until the next reboot. This sysctl knob is a one-time switch, meaning that once it is set, neither an application nor an admin can reset the value anymore. This knob does not affect any cBPF programs such as seccomp or traditional socket filters that do not use the bpf(2) system call for loading the program into the kernel.

    Value  Description
    0      Unprivileged use of bpf syscall enabled (kernel’s default value)
    1      Unprivileged use of bpf syscall disabled

Kernel Testing

The Linux kernel ships a BPF selftest suite, which can be found in the kernel source tree under tools/testing/selftests/bpf/.

$ cd tools/testing/selftests/bpf/
$ make
# make run_tests

The test suite contains test cases against the BPF verifier, program tags, and various tests against the BPF map interface and map types. It also contains various runtime tests written in C code for checking the LLVM back end, as well as eBPF and cBPF asm code that is run in the kernel for testing the interpreter and JITs.

JIT Debugging

For JIT developers performing audits or writing extensions, each compile run can output the generated JIT image into the kernel log through:

# echo 2 > /proc/sys/net/core/bpf_jit_enable

Whenever a new BPF program is loaded, the JIT compiler will dump the output, which can then be inspected with dmesg, for example:

[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f from=tcpdump pid=20583
[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3

flen is the length of the BPF program (here, 6 BPF instructions), and proglen tells the number of bytes generated by the JIT for the opcode image (here, 70 bytes in size). pass means that the image was generated in 3 compiler passes; for example, x86_64 can have various optimization passes to further reduce the image size when possible. image contains the address of the generated JIT image, while from and pid denote the user space application name and PID, respectively, which triggered the compilation process. The dump output for the eBPF and cBPF JITs is in the same format.

In the kernel tree under tools/bpf/, there is a tool called bpf_jit_disasm. It reads out the latest dump and prints the disassembly for further inspection:

# ./bpf_jit_disasm
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
   0:       push   %rbp
   1:       mov    %rsp,%rbp
   4:       sub    $0x60,%rsp
   8:       mov    %rbx,-0x8(%rbp)
   c:       mov    0x68(%rdi),%r9d
  10:       sub    0x6c(%rdi),%r9d
  14:       mov    0xd8(%rdi),%r8
  1b:       mov    $0xc,%esi
  20:       callq  0xffffffffe0ff9442
  25:       cmp    $0x800,%eax
  2a:       jne    0x0000000000000042
  2c:       mov    $0x17,%esi
  31:       callq  0xffffffffe0ff945e
  36:       cmp    $0x1,%eax
  39:       jne    0x0000000000000042
  3b:       mov    $0xffff,%eax
  40:       jmp    0x0000000000000044
  42:       xor    %eax,%eax
  44:       leaveq
  45:       retq

Alternatively, the tool can also dump related opcodes along with the disassembly.

# ./bpf_jit_disasm -o
70 bytes emitted from JIT compiler (pass:3, flen:6)
ffffffffa0069c8f + <x>:
   0:       push   %rbp
    55
   1:       mov    %rsp,%rbp
    48 89 e5
   4:       sub    $0x60,%rsp
    48 83 ec 60
   8:       mov    %rbx,-0x8(%rbp)
    48 89 5d f8
   c:       mov    0x68(%rdi),%r9d
    44 8b 4f 68
  10:       sub    0x6c(%rdi),%r9d
    44 2b 4f 6c
  14:       mov    0xd8(%rdi),%r8
    4c 8b 87 d8 00 00 00
  1b:       mov    $0xc,%esi
    be 0c 00 00 00
  20:       callq  0xffffffffe0ff9442
    e8 1d 94 ff e0
  25:       cmp    $0x800,%eax
    3d 00 08 00 00
  2a:       jne    0x0000000000000042
    75 16
  2c:       mov    $0x17,%esi
    be 17 00 00 00
  31:       callq  0xffffffffe0ff945e
    e8 28 94 ff e0
  36:       cmp    $0x1,%eax
    83 f8 01
  39:       jne    0x0000000000000042
    75 07
  3b:       mov    $0xffff,%eax
    b8 ff ff 00 00
  40:       jmp    0x0000000000000044
    eb 02
  42:       xor    %eax,%eax
    31 c0
  44:       leaveq
    c9
  45:       retq
    c3

More recently, bpftool adopted the same feature of dumping the BPF JIT image based on a given BPF program ID already loaded in the system (see the bpftool section).

For performance analysis of JITed BPF programs, perf can be used as usual. As a prerequisite, JITed programs need to be exported through kallsyms infrastructure.

# echo 1 > /proc/sys/net/core/bpf_jit_enable
# echo 1 > /proc/sys/net/core/bpf_jit_kallsyms

Enabling or disabling bpf_jit_kallsyms does not require a reload of the related BPF programs. Next, a small workflow example is provided for profiling BPF programs. A crafted tc BPF program is used for demonstration purposes, where perf records a failed allocation inside the bpf_clone_redirect() helper. Due to the use of direct write, bpf_try_make_head_writable() failed, which would then release the cloned skb again and return with an error message. perf thus records all kfree_skb events.

# tc qdisc add dev em1 clsact
# tc filter add dev em1 ingress bpf da obj prog.o sec main
# tc filter show dev em1 ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 prog.o:[main] direct-action id 1 tag 8227addf251b7543

# cat /proc/kallsyms
[...]
ffffffffc00349e0 t fjes_hw_init_command_registers    [fjes]
ffffffffc003e2e0 d __tracepoint_fjes_hw_stop_debug_err    [fjes]
ffffffffc0036190 t fjes_hw_epbuf_tx_pkt_send    [fjes]
ffffffffc004b000 t bpf_prog_8227addf251b7543

# perf record -a -g -e skb:kfree_skb sleep 60
# perf script --kallsyms=/proc/kallsyms
[...]
ksoftirqd/0     6 [000]  1004.578402:    skb:kfree_skb: skbaddr=0xffff9d4161f20a00 protocol=2048 location=0xffffffffc004b52c
   7fffb8745961 bpf_clone_redirect (/lib/modules/4.10.0+/build/vmlinux)
   7fffc004e52c bpf_prog_8227addf251b7543 (/lib/modules/4.10.0+/build/vmlinux)
   7fffc05b6283 cls_bpf_classify (/lib/modules/4.10.0+/build/vmlinux)
   7fffb875957a tc_classify (/lib/modules/4.10.0+/build/vmlinux)
   7fffb8729840 __netif_receive_skb_core (/lib/modules/4.10.0+/build/vmlinux)
   7fffb8729e38 __netif_receive_skb (/lib/modules/4.10.0+/build/vmlinux)
   7fffb872ae05 process_backlog (/lib/modules/4.10.0+/build/vmlinux)
   7fffb872a43e net_rx_action (/lib/modules/4.10.0+/build/vmlinux)
   7fffb886176c __do_softirq (/lib/modules/4.10.0+/build/vmlinux)
   7fffb80ac5b9 run_ksoftirqd (/lib/modules/4.10.0+/build/vmlinux)
   7fffb80ca7fa smpboot_thread_fn (/lib/modules/4.10.0+/build/vmlinux)
   7fffb80c6831 kthread (/lib/modules/4.10.0+/build/vmlinux)
   7fffb885e09c ret_from_fork (/lib/modules/4.10.0+/build/vmlinux)

The stack trace recorded by perf will then show the bpf_prog_8227addf251b7543() symbol as part of the call trace, meaning that the BPF program with the tag 8227addf251b7543 was related to the kfree_skb event, and that this program was attached to netdevice em1 on the ingress hook as shown by tc.

Introspection

The Linux kernel provides various tracepoints around BPF and XDP which can be used for additional introspection, for example, to trace interactions of user space programs with the bpf system call.

Tracepoints for BPF:

# perf list | grep bpf:
bpf:bpf_map_create                                 [Tracepoint event]
bpf:bpf_map_delete_elem                            [Tracepoint event]
bpf:bpf_map_lookup_elem                            [Tracepoint event]
bpf:bpf_map_next_key                               [Tracepoint event]
bpf:bpf_map_update_elem                            [Tracepoint event]
bpf:bpf_obj_get_map                                [Tracepoint event]
bpf:bpf_obj_get_prog                               [Tracepoint event]
bpf:bpf_obj_pin_map                                [Tracepoint event]
bpf:bpf_obj_pin_prog                               [Tracepoint event]
bpf:bpf_prog_get_type                              [Tracepoint event]
bpf:bpf_prog_load                                  [Tracepoint event]
bpf:bpf_prog_put_rcu                               [Tracepoint event]

Example usage with perf (instead of the sleep example used here, a specific application like tc could of course be traced instead):

# perf record -a -e bpf:* sleep 10
# perf script
sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
[...]
sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
     swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER

For the BPF programs, their individual program tag is displayed.

For debugging, XDP also has a tracepoint that is triggered when exceptions are raised:

# perf list | grep xdp:
xdp:xdp_exception                                  [Tracepoint event]

Exceptions are triggered in the following scenarios:

  • The BPF program returned an invalid / unknown XDP action code.
  • The BPF program returned with XDP_ABORTED indicating a non-graceful exit.
  • The BPF program returned with XDP_TX, but there was an error on transmit, for example, due to the port not being up, due to the transmit ring being full, due to allocation failures, etc.

Both tracepoint classes can also be inspected with a BPF program itself attached to one or more tracepoints, collecting further information in a map or punting such events to a user space collector through the bpf_perf_event_output() helper, for example.

Tracing pipe

When a BPF program makes a call to bpf_trace_printk(), the output is sent to the kernel tracing pipe. Users may read from this file to consume events that are traced to this buffer:

# tail -f /sys/kernel/debug/tracing/trace_pipe
...
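
For completeness, a minimal sketch of a BPF program that writes to the tracing pipe could look as follows. The section name, message, and program type are placeholders, and the helper is declared in the iproute2 loader style used elsewhere in this guide:

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

/* Helper declaration in iproute2 loader style. */
static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
    (void *) BPF_FUNC_trace_printk;

__section("prog")
int xdp_trace(struct xdp_md *ctx)
{
    char msg[] = "xdp: saw a packet\n";

    /* The output of bpf_trace_printk() lands in the tracing pipe. */
    bpf_trace_printk(msg, sizeof(msg));
    return XDP_PASS;
}

char __license[] __section("license") = "GPL";

Keep in mind that bpf_trace_printk() is meant for debugging only, since it is comparatively slow and shares the single trace buffer with all other users.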

Miscellaneous

BPF programs and maps are memory accounted against RLIMIT_MEMLOCK, similar to perf. The currently available size, in units of system pages, which may be locked into memory can be inspected through ulimit -l. The setrlimit system call man page provides further details.

The default limit is usually insufficient to load more complex programs or larger BPF maps, causing the BPF system call to return with errno set to EPERM. In such situations a workaround with ulimit -l unlimited or with a sufficiently large limit can be applied. RLIMIT_MEMLOCK mainly enforces limits for unprivileged users. Depending on the setup, setting a higher limit for privileged users is often acceptable.
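
As a sketch of such a workaround from user space, a loader application could raise the limit programmatically before performing any bpf(2) calls; this is the setrlimit(2) equivalent of ulimit -l unlimited:

#include <sys/resource.h>

/* Raise RLIMIT_MEMLOCK for the current process so that subsequent
 * bpf(2) calls creating maps or loading programs are not rejected
 * with EPERM due to the default memlock limit. */
static int bump_memlock_rlimit(void)
{
    struct rlimit rlim = {
        .rlim_cur = RLIM_INFINITY,
        .rlim_max = RLIM_INFINITY,
    };

    return setrlimit(RLIMIT_MEMLOCK, &rlim);
}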

Program Types

At the time of this writing, there are eighteen different BPF program types available. Two of the main types for networking, namely XDP BPF programs and tc BPF programs, are further explained in the subsections below. Extensive usage examples for the two program types with LLVM, iproute2 or other tools are spread throughout the toolchain section and not covered here. Instead, this section focuses on their architecture, concepts and use cases.

XDP

XDP stands for eXpress Data Path and provides a framework for BPF that enables high-performance programmable packet processing in the Linux kernel. It runs the BPF program at the earliest possible point in software, namely at the moment the network driver receives the packet.

At this point in the fast-path the driver just picked up the packet from its receive rings, without having done any expensive operations such as allocating an skb for pushing the packet further up the networking stack, without having pushed the packet into the GRO engine, etc. Thus, the XDP BPF program is executed at the earliest point when it becomes available to the CPU for processing.

XDP works in concert with the Linux kernel and its infrastructure, meaning the kernel is not bypassed as in various networking frameworks that operate in user space only. Keeping the packet in kernel space has several major advantages:

  • XDP is able to reuse all the upstream developed kernel networking drivers, user space tooling, and even other available in-kernel infrastructure such as routing tables, sockets, etc. from within BPF helper calls.
  • Residing in kernel space, XDP has the same security model as the rest of the kernel for accessing hardware.
  • There is no need for crossing kernel / user space boundaries since the processed packet already resides in the kernel and can therefore be flexibly forwarded to other in-kernel entities like namespaces used by containers or the kernel’s networking stack itself. This is particularly relevant in times of Meltdown and Spectre.
  • Punting packets from XDP to the kernel’s robust, widely used and efficient TCP/IP stack is trivially possible, allows for full reuse and does not require maintaining a separate TCP/IP stack as with user space frameworks.
  • The use of BPF allows for full programmability while keeping a stable ABI with the same ‘never-break-user-space’ guarantees as the kernel’s system call ABI. Compared to kernel modules, it also provides safety measures thanks to the BPF verifier, which ensures the stability of the kernel’s operation.
  • XDP trivially allows for atomically swapping programs during runtime without any network traffic interruption or even kernel / system reboot.
  • XDP allows for flexible structuring of workloads integrated into the kernel. For example, it can operate in “busy polling” or “interrupt driven” mode. Explicitly dedicating CPUs to XDP is not required. There are no special hardware requirements and it does not rely on hugepages.
  • XDP does not require any third party kernel modules or licensing. It is a long-term architectural solution, a core part of the Linux kernel, and developed by the kernel community.
  • XDP is already enabled and shipped everywhere with major distributions running a kernel equivalent to 4.8 or higher and supports most major 10G or higher networking drivers.

As a framework for running BPF in the driver, XDP additionally ensures that packets are laid out linearly and fit into a single DMA’ed page which is readable and writable by the BPF program. XDP also ensures that additional headroom of 256 bytes is available to the program for implementing custom encapsulation headers with the help of the bpf_xdp_adjust_head() BPF helper or adding custom metadata in front of the packet through bpf_xdp_adjust_meta().

The framework contains XDP action codes, further described in the section below, which a BPF program can return in order to instruct the driver how to proceed with the packet, and it enables the possibility to atomically replace BPF programs running at the XDP layer. XDP is tailored for high performance by design. BPF allows accessing the packet data through ‘direct packet access’, which means that the program holds data pointers directly in registers, loads the content from there into registers, and respectively writes from registers back into the packet.

The packet representation in XDP that is passed to the BPF program as the BPF context looks as follows:

struct xdp_buff {
    void *data;
    void *data_end;
    void *data_meta;
    void *data_hard_start;
    struct xdp_rxq_info *rxq;
};

data points to the start of the packet data in the page, and as the name suggests, data_end points to the end of the packet data. Since XDP allows for a headroom, data_hard_start points to the maximum possible headroom start in the page, meaning that when the packet should be encapsulated, data is moved closer towards data_hard_start via bpf_xdp_adjust_head(). The same BPF helper function also allows for decapsulation, in which case data is moved further away from data_hard_start.

data_meta initially points to the same location as data but bpf_xdp_adjust_meta() is able to move the pointer towards data_hard_start as well in order to provide room for custom metadata which is invisible to the normal kernel networking stack but can be read by tc BPF programs since it is transferred from XDP to the skb. Vice versa, it can remove or reduce the size of the custom metadata through the same BPF helper function by moving data_meta away from data_hard_start again. data_meta can also be used solely for passing state between tail calls similarly to the skb->cb[] control block case that is accessible in tc BPF programs.

This gives the following invariant for the struct xdp_buff packet pointers: data_hard_start <= data_meta <= data < data_end.
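
To illustrate the pointer handling, the following sketch reserves headroom for a custom encapsulation header through bpf_xdp_adjust_head(); the header size is an arbitrary assumption and the helper is declared in iproute2 loader style:

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

static int (*bpf_xdp_adjust_head)(struct xdp_md *ctx, int delta) =
    (void *) BPF_FUNC_xdp_adjust_head;

/* Size of a hypothetical custom encapsulation header. */
#define ENCAP_LEN 8

__section("prog")
int xdp_encap(struct xdp_md *ctx)
{
    void *data, *data_end;

    /* Move data closer towards data_hard_start in order to gain
     * ENCAP_LEN bytes of headroom in front of the packet. */
    if (bpf_xdp_adjust_head(ctx, -ENCAP_LEN) < 0)
        return XDP_DROP;

    /* Pointers must be reloaded and revalidated after the adjustment
     * before the new headroom may be written. */
    data     = (void *)(long)ctx->data;
    data_end = (void *)(long)ctx->data_end;
    if (data + ENCAP_LEN > data_end)
        return XDP_DROP;

    /* ... populate the custom encapsulation header at data ... */
    return XDP_PASS;
}

char __license[] __section("license") = "GPL";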

The rxq field points to some additional per receive queue metadata which is populated at ring setup time (not at XDP runtime):

struct xdp_rxq_info {
    struct net_device *dev;
    u32 queue_index;
    u32 reg_state;
} ____cacheline_aligned;

The BPF program can retrieve queue_index as well as additional data from the netdevice itself such as ifindex, etc.
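
As a small, contrived sketch, both the queue index and the ingress ifindex can be read directly from the xdp_md BPF context, which mirrors this information (the concrete ifindex and queue values are assumptions for illustration; field availability depends on the kernel version):

#include <linux/bpf.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_rxq_demo(struct xdp_md *ctx)
{
    /* Hypothetical policy: only accept packets received on queue 0
     * of the device with ifindex 2, drop everything else. */
    if (ctx->ingress_ifindex != 2 || ctx->rx_queue_index != 0)
        return XDP_DROP;

    return XDP_PASS;
}

char __license[] __section("license") = "GPL";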

BPF program return codes

After running the XDP BPF program, a verdict is returned from the program in order to tell the driver how to process the packet next. In the linux/bpf.h system header file all available return verdicts are enumerated:

enum xdp_action {
    XDP_ABORTED = 0,
    XDP_DROP,
    XDP_PASS,
    XDP_TX,
    XDP_REDIRECT,
};

XDP_DROP, as the name suggests, will drop the packet right at the driver level without wasting any further resources. This is in particular useful for BPF programs implementing DDoS mitigation mechanisms or firewalling in general. The XDP_PASS return code means that the packet is allowed to be passed up to the kernel’s networking stack, meaning the current CPU that was processing this packet now allocates a skb, populates it, and passes it onwards into the GRO engine. This is equivalent to the default packet handling behavior without XDP. With XDP_TX the BPF program has an efficient option to transmit the network packet back out of the same NIC it just arrived on. This is typically useful when a few nodes implement, for example, firewalling with subsequent load balancing in a cluster and thus act as a hairpinned load balancer pushing the incoming packets back into the switch after rewriting them in XDP BPF. XDP_REDIRECT is similar to XDP_TX in that it is able to transmit the XDP packet, but through another NIC. Another option for the XDP_REDIRECT case is to redirect into a BPF cpumap, meaning the CPUs serving XDP on the NIC’s receive queues can continue to do so and push the packets for processing by the upper kernel stack to a remote CPU. This is similar to XDP_PASS, with the difference that the XDP BPF program can keep serving the incoming high load instead of temporarily spending work on the current packet for pushing it into upper layers. Last but not least, XDP_ABORTED denotes an exception-like state from the program and has the same behavior as XDP_DROP, except that XDP_ABORTED additionally passes the trace_xdp_exception tracepoint, which can be monitored to detect misbehavior.
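
As a minimal sketch tying the verdicts to direct packet access, the following program drops UDP packets and passes everything else up the stack; it assumes plain IPv4 without VLAN or IPv6 handling:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <asm/byteorder.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph;

    /* Bounds checks are mandatory for direct packet access,
     * otherwise the verifier rejects the program. */
    if (data + sizeof(*eth) + sizeof(*iph) > data_end)
        return XDP_PASS;
    if (eth->h_proto != __constant_htons(ETH_P_IP))
        return XDP_PASS;

    iph = data + sizeof(*eth);
    if (iph->protocol == IPPROTO_UDP)
        return XDP_DROP;

    return XDP_PASS;
}

char __license[] __section("license") = "GPL";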

Use cases for XDP

Some of the main use cases for XDP are presented in this subsection. The list is non-exhaustive and given the programmability and efficiency XDP and BPF enables, it can easily be adapted to solve very specific use cases.

  • DDoS mitigation, firewalling

    One of the basic XDP BPF features is to tell the driver to drop a packet with XDP_DROP at this early stage, which allows for efficient network policy enforcement at an extremely low per-packet cost. This is ideal in situations when needing to cope with any sort of DDoS attack, but it more generally also allows implementing any kind of firewalling policy with close to no overhead in BPF, e.g. either as a stand-alone appliance (e.g. scrubbing ‘clean’ traffic through XDP_TX) or widely deployed on nodes protecting end hosts themselves (via XDP_PASS or cpumap XDP_REDIRECT for good traffic). Offloaded XDP takes this even one step further by moving the already small per-packet cost entirely into the NIC with processing at line-rate.

  • Forwarding and load-balancing

    Another major use case of XDP is packet forwarding and load-balancing through either XDP_TX or XDP_REDIRECT actions. The packet can be arbitrarily mangled by the BPF program running in the XDP layer; even BPF helper functions are available for increasing or decreasing the packet’s headroom in order to arbitrarily encapsulate or decapsulate the packet before sending it out again. With XDP_TX hairpinned load-balancers can be implemented that push the packet out of the same networking device it originally arrived on, or with the XDP_REDIRECT action it can be forwarded to another NIC for transmission. The latter return code can also be used in combination with BPF’s cpumap to load-balance packets for passing up the local stack, but on remote, non-XDP processing CPUs.

  • Pre-stack filtering / processing

    Besides policy enforcement, XDP can also be used for hardening the kernel’s networking stack with the help of the XDP_DROP action, meaning it can drop packets irrelevant for the local node right at the earliest possible point before the networking stack sees them, e.g. given we know that a node only serves TCP traffic, any UDP, SCTP or other L4 traffic can be dropped right away. This has the advantage that packets do not need to traverse various entities like the GRO engine, the kernel’s flow dissector and others before it can be determined that they should be dropped, which reduces the kernel’s attack surface. Thanks to XDP’s early processing stage, this effectively ‘pretends’ to the kernel’s networking stack that these packets have never been seen by the networking device. Additionally, if a potential bug in the stack’s receive path got uncovered and would cause a ‘ping of death’ like scenario, XDP can be utilized to drop such packets right away without having to reboot the kernel or restart any services. Due to the ability to atomically swap such programs to enforce a drop of bad packets, no network traffic is even interrupted on a host.

    Another use case for pre-stack processing is that given the kernel has not yet allocated an skb for the packet, the BPF program is free to modify the packet and, again, have it ‘pretend’ to the stack that it was received by the networking device this way. This allows for cases such as having custom packet mangling and encapsulation protocols where the packet can be decapsulated prior to entering GRO aggregation in which GRO otherwise would not be able to perform any sort of aggregation due to not being aware of the custom protocol. XDP also allows to push metadata (non-packet data) in front of the packet. This is ‘invisible’ to the normal kernel stack, can be GRO aggregated (for matching metadata) and later on processed in coordination with a tc ingress BPF program where it has the context of a skb available for e.g. setting various skb fields.

  • Flow sampling, monitoring

    XDP can also be used for cases such as packet monitoring, sampling or any other network analytics, for example, as part of an intermediate node in the path or on end hosts in combination with the previously mentioned use cases. For complex packet analysis, XDP provides a facility to efficiently push network packets (truncated or with full payload) and custom metadata into a fast lockless per CPU memory mapped ring buffer provided by the Linux perf infrastructure to a user space application. This also allows for cases where only a flow’s initial data is analyzed and, once determined to be good traffic, the monitoring is bypassed for the rest of the flow. Thanks to the flexibility brought by BPF, this allows for implementing any sort of custom monitoring or sampling.

One example of XDP BPF production usage is Facebook’s SHIV and Droplet infrastructure which implement their L4 load-balancing and DDoS countermeasures. Migrating their production infrastructure away from netfilter’s IPVS (IP Virtual Server) over to XDP BPF allowed for a 10x speedup compared to their previous IPVS setup. This was first presented at the netdev 2.1 conference.

Another example is the integration of XDP into Cloudflare’s DDoS mitigation pipeline, which originally used cBPF instead of eBPF for attack signature matching through iptables’ xt_bpf module. The use of iptables caused severe performance problems under attack, and a user space bypass solution was deemed necessary but came with drawbacks as well, such as needing to busy poll the NIC and expensive packet re-injection into the kernel’s stack. The migration over to eBPF and XDP combined the best of both worlds by having high-performance programmable packet processing directly inside the kernel.

XDP operation modes

XDP has three operation modes where ‘native’ XDP is the default mode. Whenever XDP is mentioned, this mode is typically implied.

  • Native XDP

    This is the default mode where the XDP BPF program is run directly out of the networking driver’s early receive path. Most widely used NICs for 10G and higher already support native XDP.

  • Offloaded XDP

    In the offloaded XDP mode the XDP BPF program is directly offloaded into the NIC instead of being executed on the host CPU. Thus, the already extremely low per-packet cost is pushed off the host CPU entirely and executed on the NIC, providing even higher performance than running in native XDP. This offload is typically implemented by SmartNICs containing multi-threaded, multicore flow processors where an in-kernel JIT compiler translates BPF into native instructions for the latter. Drivers supporting offloaded XDP usually also support native XDP for cases where some BPF helpers may not yet be available, or may only be available, for the native mode.

  • Generic XDP

    For drivers not implementing native or offloaded XDP yet, the kernel provides an option for generic XDP, which does not require any driver changes since it runs at a much later point out of the networking stack. This setting is primarily targeted at developers who want to write and test programs against the kernel’s XDP API, and it will not operate at the performance rate of the native or offloaded modes. For XDP usage in a production environment either the native or offloaded mode is better suited and the recommended way to run XDP.

Driver support

Since BPF and XDP are evolving quickly in terms of feature and driver support, the following lists native and offloaded XDP drivers as of kernel 4.17.

Drivers supporting native XDP

  • Broadcom
    • bnxt
  • Cavium
    • thunderx
  • Intel
    • ixgbe
    • ixgbevf
    • i40e
  • Mellanox
    • mlx4
    • mlx5
  • Netronome
    • nfp
  • Others
    • tun
    • virtio_net
  • Qlogic
    • qede
  • Solarflare

Drivers supporting offloaded XDP

  • Netronome

Note that examples for writing and loading XDP programs are included in the toolchain section under the respective tools.

[1] XDP for sfc is available via an out of tree driver as of kernel 4.17, but will be upstreamed soon.
[2] Some BPF helper functions such as retrieving the current CPU number will not be available in an offloaded setting.

tc (traffic control)

Aside from other program types such as XDP, BPF can also be used out of the kernel’s tc (traffic control) layer in the networking data path. On a high-level there are three major differences when comparing XDP BPF programs to tc BPF ones:

  • The BPF input context is a sk_buff not a xdp_buff. When the kernel’s networking stack receives a packet, after the XDP layer, it allocates a buffer and parses the packet to store metadata about the packet. This representation is known as the sk_buff. This structure is then exposed in the BPF input context so that BPF programs from the tc ingress layer can use the metadata that the stack extracts from the packet. This can be useful, but comes with an associated cost of the stack performing this allocation and metadata extraction, and handling the packet until it hits the tc hook. By definition, the xdp_buff doesn’t have access to this metadata because the XDP hook is called before this work is done. This is a significant contributor to the performance difference between the XDP and tc hooks.

    Therefore, BPF programs attached to the tc BPF hook can, for instance, read or write the skb’s mark, pkt_type, protocol, priority, queue_mapping, napi_id, cb[] array, hash, tc_classid or tc_index, vlan metadata, the XDP transferred custom metadata and various other information. All members of the struct __sk_buff BPF context used in tc BPF are defined in the linux/bpf.h system header.

    Generally, the sk_buff is of a completely different nature than xdp_buff and both come with advantages and disadvantages. For example, the sk_buff case has the advantage that it is rather straightforward to mangle its associated metadata, however, it also contains a lot of protocol specific information (e.g. GSO related state) which makes it difficult to simply switch protocols by solely rewriting the packet data. This is due to the stack processing the packet based on the metadata rather than having the cost of accessing the packet contents each time. Thus, additional conversion through BPF helper functions is required, taking care that sk_buff internals are properly converted as well. The xdp_buff case however does not face such issues since it comes at such an early stage where the kernel has not even allocated an sk_buff yet, thus packet rewrites of any kind can be realized trivially. However, the xdp_buff case has the disadvantage that sk_buff metadata is not available for mangling at this stage. The latter is overcome by passing custom metadata from XDP BPF to tc BPF, though. In this way, the limitations of each program type can be overcome by operating complementary programs of both types as the use case requires.

  • Compared to XDP, tc BPF programs can be triggered out of ingress and also egress points in the networking data path as opposed to ingress only in the case of XDP.

    The two hook points sch_handle_ingress() and sch_handle_egress() in the kernel are triggered out of __netif_receive_skb_core() and __dev_queue_xmit(), respectively. The latter two are the main receive and transmit functions in the data path that, setting XDP aside, are triggered for every network packet going in or coming out of the node allowing for full visibility for tc BPF programs at these hook points.

  • The tc BPF programs do not require any driver changes since they are run at hook points in generic layers in the networking stack. Therefore, they can be attached to any type of networking device.

    While this provides flexibility, it also trades off performance compared to running at the native XDP layer. However, tc BPF programs still come at the earliest point in the generic kernel’s networking data path after GRO has been run but before any protocol processing, traditional iptables firewalling such as iptables PREROUTING or nftables ingress hooks or other packet processing takes place. Likewise on egress, tc BPF programs execute at the latest point before handing the packet to the driver itself for transmission, meaning after traditional iptables firewalling hooks like iptables POSTROUTING, but still before handing the packet to the kernel’s GSO engine.

    One exception which does require driver changes however are offloaded tc BPF programs, typically provided by SmartNICs in a similar way as offloaded XDP just with differing set of features due to the differences in the BPF input context, helper functions and verdict codes.

BPF programs run in the tc layer are run from the cls_bpf classifier. While the tc terminology describes the BPF attachment point as a “classifier”, this is a bit misleading since it under-represents what cls_bpf is capable of. That is to say, a fully programmable packet processor being able not only to read the skb metadata and packet data, but to also arbitrarily mangle both, and terminate the tc processing with an action verdict. cls_bpf can thus be regarded as a self-contained entity that manages and executes tc BPF programs.

cls_bpf can hold one or more tc BPF programs. In the case where Cilium deploys cls_bpf programs, it attaches only a single program for a given hook in direct-action mode. Typically, in the traditional tc scheme, there is a split between classifier and action modules, where the classifier has one or more actions attached to it that are triggered once the classifier has a match. In the modern world for using tc in the software data path this model does not scale well for complex packet processing. Given tc BPF programs attached to cls_bpf are fully self-contained, they effectively fuse the parsing and action process together into a single unit. Thanks to cls_bpf’s direct-action mode, it will just return the tc action verdict and terminate the processing pipeline immediately. This allows for implementing scalable programmable packet processing in the networking data path by avoiding linear iteration of actions. cls_bpf is the only such “classifier” module in the tc layer capable of such a fast-path.

Like XDP BPF programs, tc BPF programs can be atomically updated at runtime via cls_bpf without interrupting any network traffic or having to restart services.

Both the tc ingress and the egress hook that cls_bpf itself can be attached to are managed by a pseudo qdisc called sch_clsact. It is a drop-in replacement and proper superset of the ingress qdisc since it is able to manage both the ingress and egress tc hooks. For tc’s egress hook in __dev_queue_xmit() it is important to stress that it is not executed under the kernel’s qdisc root lock. Thus, both tc ingress and egress hooks are executed in a lockless manner in the fast-path. In either case, preemption is disabled and execution happens under RCU read side.
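
For reference, attaching a tc BPF program in direct-action mode to both clsact hooks of a device could look as follows with iproute2; prog.o and the section names are placeholders:

# tc qdisc add dev em1 clsact
# tc filter add dev em1 ingress bpf da obj prog.o sec ingress
# tc filter add dev em1 egress bpf da obj prog.o sec egress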

Typically on egress there are qdiscs attached to netdevices such as sch_mq, sch_fq, sch_fq_codel or sch_htb, where some of them are classful qdiscs that contain subclasses and thus require a packet classification mechanism to determine a verdict on where to demux the packet. This is handled by a call to tcf_classify() which calls into tc classifiers if present. cls_bpf can also be attached and used in such cases. Such operation usually happens under the qdisc root lock and can be subject to lock contention. The sch_clsact qdisc’s egress hook, however, comes at a much earlier point which does not fall under that lock and operates completely independently from conventional egress qdiscs. Thus, for cases like sch_htb the sch_clsact qdisc could perform the heavy lifting packet classification through tc BPF outside of the qdisc root lock, setting the skb->mark or skb->priority from there, such that sch_htb only requires a flat mapping without expensive packet classification under the root lock, thus reducing contention.

Offloaded tc BPF programs are supported for the case of sch_clsact in combination with cls_bpf where the prior loaded BPF program was JITed from a SmartNIC driver to be run natively on the NIC. Only cls_bpf programs operating in direct-action mode are supported to be offloaded. cls_bpf only supports offloading a single program and cannot offload multiple programs. Furthermore only the ingress hook supports offloading BPF programs.

One cls_bpf instance is able to hold multiple tc BPF programs internally. If this is the case, then the TC_ACT_UNSPEC program return code will continue execution with the next tc BPF program in that list. However, this has the drawback that several programs would need to parse the packet over and over again resulting in degraded performance.

BPF program return codes

Both the tc ingress and egress hook share the same action return verdicts that tc BPF programs can use. They are defined in the linux/pkt_cls.h system header:

#define TC_ACT_UNSPEC         (-1)
#define TC_ACT_OK               0
#define TC_ACT_SHOT             2
#define TC_ACT_STOLEN           4
#define TC_ACT_REDIRECT         7

There are a few more action TC_ACT_* verdicts available in the system header file which are also used in the two hooks. However, they share the same semantics with the ones above. Meaning, from a tc BPF perspective, TC_ACT_OK and TC_ACT_RECLASSIFY have the same semantics, as well as the three TC_ACT_STOLEN, TC_ACT_QUEUED and TC_ACT_TRAP opcodes. Therefore, for these cases we only describe TC_ACT_OK and the TC_ACT_STOLEN opcode for the two groups.

Starting out with TC_ACT_UNSPEC. It has the meaning of “unspecified action” and is used in three cases, i) when an offloaded tc BPF program is attached and the tc ingress hook is run where the cls_bpf representation for the offloaded program will return TC_ACT_UNSPEC, ii) in order to continue with the next tc BPF program in cls_bpf for the multi-program case. The latter also works in combination with offloaded tc BPF programs from point i) where the TC_ACT_UNSPEC from there continues with a next tc BPF program solely running in non-offloaded case. Last but not least, iii) TC_ACT_UNSPEC is also used for the single program case to simply tell the kernel to continue with the skb without additional side-effects. TC_ACT_UNSPEC is very similar to the TC_ACT_OK action code in the sense that both pass the skb onwards either to upper layers of the stack on ingress or down to the networking device driver for transmission on egress, respectively. The only difference to TC_ACT_OK is that TC_ACT_OK sets skb->tc_index based on the classid the tc BPF program set. The latter is set out of the tc BPF program itself through skb->tc_classid from the BPF context.

TC_ACT_SHOT instructs the kernel to drop the packet, meaning, upper layers of the networking stack will never see the skb on ingress and similarly the packet will never be submitted for transmission on egress. TC_ACT_SHOT and TC_ACT_STOLEN are both similar in nature with few differences: TC_ACT_SHOT will indicate to the kernel that the skb was released through kfree_skb() and return NET_XMIT_DROP to the callers for immediate feedback, whereas TC_ACT_STOLEN will release the skb through consume_skb() and pretend to upper layers that the transmission was successful through NET_XMIT_SUCCESS. The perf’s drop monitor which records traces of kfree_skb() will therefore also not see any drop indications from TC_ACT_STOLEN since its semantics are such that the skb has been “consumed” or queued but certainly not “dropped”.

Last but not least there is the TC_ACT_REDIRECT action which is available for tc BPF programs as well. This allows redirecting the skb to the ingress or egress path of the same or another device, together with the bpf_redirect() helper. Being able to inject the packet into another device’s ingress or egress direction allows for full flexibility in packet forwarding with BPF. There are no requirements on the target networking device other than being a networking device itself; there is no need to run another instance of cls_bpf on the target device or other such restrictions.
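
To make the verdicts concrete, a small sketch of a tc BPF program suited for direct-action mode could look like the following; the mark values and the redirect ifindex are arbitrary assumptions:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

/* iproute2-style helper declaration. */
static int (*bpf_redirect)(int ifindex, __u32 flags) =
    (void *) BPF_FUNC_redirect;

/* Hypothetical target device for the redirect case. */
#define REDIR_IFINDEX 4

__section("ingress")
int tc_verdicts(struct __sk_buff *skb)
{
    /* Hypothetical policy: drop packets carrying mark 42, push packets
     * carrying mark 43 into the egress path of REDIR_IFINDEX (flags 0),
     * and let everything else continue up the stack. */
    if (skb->mark == 42)
        return TC_ACT_SHOT;
    if (skb->mark == 43)
        return bpf_redirect(REDIR_IFINDEX, 0);

    return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";

Such a program would be attached in direct-action mode through the clsact qdisc as shown in the attachment example earlier in this section.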

tc BPF FAQ

This section contains a few miscellaneous question and answer pairs related to tc BPF programs that are asked from time to time.

  • Question: What about act_bpf as a tc action module, is it still relevant?
  • Answer: Not really. Although cls_bpf and act_bpf share the same functionality for tc BPF programs, cls_bpf is more flexible since it is a proper superset of act_bpf. The way tc works is that tc actions need to be attached to tc classifiers. In order to achieve the same flexibility as cls_bpf, act_bpf would need to be attached to the cls_matchall classifier. As the name says, this will match on every packet in order to pass them through for attached tc action processing. For act_bpf, this will result in less efficient packet processing than using cls_bpf in direct-action mode directly. If act_bpf is used in a setting with other classifiers than cls_bpf or cls_matchall then this will perform even worse due to the nature of operation of tc classifiers. Meaning, if classifier A has a mismatch, then the packet is passed to classifier B, reparsing the packet, etc, thus in the typical case there will be linear processing where the packet would need to traverse N classifiers in the worst case to find a match and execute act_bpf on that. Therefore, act_bpf has never been largely relevant. Additionally, act_bpf does not provide a tc offloading interface either, compared to cls_bpf.
  • Question: Is it recommended to use cls_bpf not in direct-action mode?
  • Answer: No. The answer is similar to the one above in that this is otherwise unable to scale for more complex processing. tc BPF can already do everything needed by itself in an efficient manner and thus there is no need for anything other than direct-action mode.
  • Question: Is there any performance difference in offloaded cls_bpf and offloaded XDP?
  • Answer: No. Both are JITed through the same compiler in the kernel which handles the offloading to the SmartNIC and the loading mechanism for both is very similar as well. Thus, the BPF program gets translated into the same target instruction set in order to be able to run on the NIC natively. The two tc BPF and XDP BPF program types have a differing set of features, so depending on the use case one might be picked over the other due to availability of certain helper functions in the offload case, for example.

Use cases for tc BPF

Some of the main use cases for tc BPF programs are presented in this subsection. Also here, the list is non-exhaustive and given the programmability and efficiency of tc BPF, it can easily be tailored and integrated into orchestration systems in order to solve very specific use cases. While some use cases with XDP may overlap, tc BPF and XDP BPF are mostly complementary to each other and both can also be used at the same time or one over the other depending which is most suitable for a given problem to solve.

  • Policy enforcement for containers

    One application which tc BPF programs are suitable for is to implement policy enforcement, custom firewalling or similar security measures for containers or pods, respectively. In the conventional case, container isolation is implemented through network namespaces with veth networking devices connecting the host’s initial namespace with the dedicated container’s namespace. Since one end of the veth pair has been moved into the container’s namespace whereas the other end remains in the initial namespace of the host, all network traffic from the container has to pass through the host-facing veth device allowing for attaching tc BPF programs on the tc ingress and egress hook of the veth. Network traffic going into the container will pass through the host-facing veth’s tc egress hook whereas network traffic coming from the container will pass through the host-facing veth’s tc ingress hook.

    For virtual devices like veth devices XDP is unsuitable in this case since the kernel operates solely on a skb here and generic XDP has a few limitations where it does not operate with cloned skb’s. The latter is heavily used from the TCP/IP stack in order to hold data segments for retransmission where the generic XDP hook would simply get bypassed instead. Moreover, generic XDP needs to linearize the entire skb resulting in heavily degraded performance. tc BPF on the other hand is more flexible as it specializes on the skb input context case and thus does not need to cope with the limitations from generic XDP.

  • Forwarding and load-balancing

    The forwarding and load-balancing use case is quite similar to XDP, although slightly more targeted towards east-west container workloads rather than north-south traffic (though both technologies can be used in either case). Since XDP is only available on ingress side, tc BPF programs allow for further use cases that apply in particular on egress, for example, container based traffic can already be NATed and load-balanced on the egress side through BPF out of the initial namespace such that this is done transparent to the container itself. Egress traffic is already based on the sk_buff structure due to the nature of the kernel’s networking stack, so packet rewrites and redirects are suitable out of tc BPF. By utilizing the bpf_redirect() helper function, BPF can take over the forwarding logic to push the packet either into the ingress or egress path of another networking device. Thus, any bridge-like devices become unnecessary to use as well by utilizing tc BPF as forwarding fabric.

  • Flow sampling, monitoring

    Like in the XDP case, flow sampling and monitoring can be realized through a high-performance lockless per-CPU memory mapped perf ring buffer where the BPF program is able to push custom data, the full or truncated packet contents, or both up to a user space application. From the tc BPF program this is realized through the bpf_skb_event_output() BPF helper function which has the same function signature and semantics as bpf_xdp_event_output(). Given tc BPF programs can be attached to ingress and egress as opposed to only ingress in the XDP BPF case, plus the two tc hooks are at the lowest layer in the (generic) networking stack, this allows for bidirectional monitoring of all network traffic from a particular node. This might be somewhat related to the cBPF case which tcpdump and Wireshark make use of, though without having to clone the skb and being a lot more flexible in terms of programmability where, for example, BPF can already perform in-kernel aggregation rather than pushing everything up to user space as well as add custom annotations for packets pushed into the ring buffer. The latter is also heavily used in Cilium where packet drops can be further annotated to correlate container labels and reasons for why a given packet had to be dropped (such as due to policy violation) in order to provide a richer context.

  • Packet scheduler pre-processing

    The sch_clsact’s egress hook which is called sch_handle_egress() runs right before taking the kernel’s qdisc root lock, thus tc BPF programs can be utilized to perform all the heavy lifting packet classification and mangling before the packet is transmitted into a real full blown qdisc such as sch_htb. This type of interaction of sch_clsact with a real qdisc like sch_htb coming later in the transmission phase allows to reduce the lock contention on transmission since sch_clsact’s egress hook is executed without taking locks.

One concrete example user of tc BPF but also XDP BPF programs is Cilium. Cilium is open source software for transparently securing the network connectivity between application services deployed using Linux container management platforms like Docker and Kubernetes and operates at Layer 3/4 as well as Layer 7. At the heart of Cilium operates BPF in order to implement the policy enforcement as well as load balancing and monitoring.

Driver support

Since tc BPF programs are triggered from the kernel’s networking stack and not directly out of the driver, they do not require any extra driver modification and therefore can run on any networking device. The only exception listed below is for offloading tc BPF programs to the NIC.

Drivers supporting offloaded tc BPF

  • Netronome

Note that also here examples for writing and loading tc BPF programs are included in the toolchain section under the respective tools.

Further Reading

The lists of docs, projects, talks, papers, and further reading material mentioned here are likely not complete. Thus, feel free to open pull requests to complete the list.

Kernel Developer FAQ

Under Documentation/bpf/, the Linux kernel provides two FAQ files that are mainly targeted for kernel developers involved in the BPF subsystem.

Projects using BPF

The following list includes a selection of open source projects that make use of BPF or provide tooling for BPF. In this context the eBPF instruction set is specifically meant, as opposed to projects utilizing the legacy cBPF:

Tracing

  • BCC

    BCC stands for BPF Compiler Collection and its key feature is to provide a set of easy to use and efficient kernel tracing utilities all based upon BPF programs hooking into kernel infrastructure based upon kprobes, kretprobes, tracepoints, uprobes, uretprobes as well as USDT probes. The collection provides close to a hundred tools targeting different layers across the stack from applications, system libraries, to the various different kernel subsystems in order to analyze a system’s performance characteristics or problems. Additionally, BCC provides an API in order to be used as a library for other projects.

    https://github.com/iovisor/bcc

  • bpftrace

    bpftrace is a DTrace-style dynamic tracing tool for Linux and uses LLVM as a back end to compile scripts to BPF-bytecode and makes use of BCC for interacting with the kernel’s BPF tracing infrastructure. It provides a higher-level language for implementing tracing scripts compared to native BCC.

    https://github.com/ajor/bpftrace

  • perf

    The perf tool, which is developed by the Linux kernel community as part of the kernel source tree, provides a way to load tracing BPF programs through the conventional perf record subcommand where the aggregated data from BPF can be retrieved from perf.data and post-processed, for example, through perf script and other means.

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf

  • ply

    ply is a tracing tool that follows the ‘Little Language’ approach of yore, and compiles ply scripts into Linux BPF programs that are attached to kprobes and tracepoints in the kernel. The scripts have a C-like syntax, heavily inspired by DTrace and by extension awk. ply keeps dependencies to a minimum and only requires flex and bison at build time, and only libc at runtime.

    https://github.com/wkz/ply

  • systemtap

    systemtap is a scripting language and tool for extracting, filtering and summarizing data in order to diagnose and analyze performance or functional problems. It comes with a BPF back end called stapbpf which translates the script directly into BPF without the need of an additional compiler and injects the probe into the kernel. Thus, unlike stap’s kernel modules, this neither has external dependencies nor requires loading kernel modules.

    https://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=summary

  • PCP

    Performance Co-Pilot (PCP) is a system performance and analysis framework which is able to collect metrics through a variety of agents and analyze the collected performance metrics in real time or using historical data. With pmdabcc, PCP has a BCC-based performance metrics domain agent which extracts data from the kernel via BPF and BCC.

    https://github.com/performancecopilot/pcp

  • Weave Scope

    Weave Scope is a cloud monitoring tool that collects data about processes, networking connections and other system state by making use of BPF in combination with kprobes. Weave Scope works on top of the gobpf library to load BPF ELF files into the kernel, and comes with a tcptracer-bpf tool which monitors connect, accept and close calls in order to trace TCP events.

    https://github.com/weaveworks/scope

Networking

  • Cilium

    Cilium provides and transparently secures network connectivity and load-balancing between application workloads such as application containers or processes. Cilium operates at Layer 3/4 to provide traditional networking and security services as well as Layer 7 to protect and secure use of modern application protocols such as HTTP, gRPC and Kafka. It is integrated into orchestration frameworks such as Kubernetes and Mesos, and BPF is the foundational part of Cilium that operates in the kernel’s networking data path.

    https://github.com/cilium/cilium

  • iproute2

    iproute2 offers the ability to load BPF programs as LLVM-generated ELF files into the kernel. iproute2 supports both XDP BPF programs and tc BPF programs through a common BPF loader back end. The tc and ip command line utilities expose the loading and introspection functionality to the user.

    https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/

  • p4c-xdp

    p4c-xdp presents a P4 compiler back end targeting BPF and XDP. P4 is a domain-specific language describing how packets are processed by the data plane of a programmable network element such as NICs, appliances or switches. With the help of p4c-xdp, P4 programs can be translated into BPF C programs, compiled by clang / LLVM, and loaded as BPF programs into the kernel at the XDP layer for high-performance packet processing.

    https://github.com/vmware/p4c-xdp

Others

  • LLVM

    clang / LLVM provides the BPF back end in order to compile C BPF programs into BPF instructions contained in ELF files. The LLVM BPF back end is developed alongside the BPF core infrastructure in the Linux kernel and maintained by the same community. clang / LLVM is a key part of the toolchain for developing BPF programs.

    https://llvm.org/

  • bpftool

    bpftool is the main tool for introspecting and debugging BPF programs and BPF maps, and like libbpf is developed by the Linux kernel community. It allows for dumping all active BPF programs and maps in the system, dumping and disassembling the BPF or JITed BPF instructions of a program, and dumping and manipulating BPF maps. bpftool supports interaction with the BPF filesystem, loading various program types from an object file into the kernel and much more.

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/bpftool

  • gobpf

    gobpf provides Go bindings for the BCC framework as well as low-level routines to load and use BPF programs from ELF files.

    https://github.com/iovisor/gobpf

  • ebpf_asm

    ebpf_asm provides an assembler for BPF programs written in an Intel-like assembly syntax, and therefore offers an alternative for writing small and simple BPF programs directly in assembly without needing the clang / LLVM toolchain.

    https://github.com/Xilinx-CNS/ebpf_asm

XDP Newbies

There are a couple of walk-through posts by David S. Miller to the xdp-newbies mailing list (http://vger.kernel.org/vger-lists.html#xdp-newbies), which explain various parts of XDP and BPF:

  1. May 2017,
    BPF Verifier Overview, David S. Miller, https://www.spinics.net/lists/xdp-newbies/msg00185.html
  1. May 2017,
    Contextually speaking…, David S. Miller, https://www.spinics.net/lists/xdp-newbies/msg00181.html
  1. May 2017,
    bpf.h and you…, David S. Miller, https://www.spinics.net/lists/xdp-newbies/msg00179.html
  1. Apr 2017,
    XDP example of the day, David S. Miller, https://www.spinics.net/lists/xdp-newbies/msg00009.html

BPF Newsletter

Alexander Alemayhu initiated a newsletter around BPF, published roughly once per week, covering the latest developments in BPF in Linux kernel land and its surrounding user-space ecosystem.

All BPF update newsletters (01 - 12) can be found here:

Podcasts

There have been a number of technical podcasts partially covering BPF. Incomplete list:

  1. Feb 2017,
    Linux Networking Update from Netdev Conference, Thomas Graf, Software Gone Wild, Show 71, https://blog.ipspace.net/2017/02/linux-networking-update-from-netdev.html https://www.ipspace.net/nuggets/podcast/Show_71-NetDev_Update.mp3
  1. Jan 2017,
    The IO Visor Project, Brenden Blanco, OVS Orbit, Episode 23, https://ovsorbit.org/#e23 https://ovsorbit.org/episode-23.mp3
  1. Oct 2016,
    Fast Linux Packet Forwarding, Thomas Graf, Software Gone Wild, Show 64, https://blog.ipspace.net/2016/10/fast-linux-packet-forwarding-with.html https://www.ipspace.net/nuggets/podcast/Show_64-Cilium_with_Thomas_Graf.mp3
  1. Aug 2016,
    P4 on the Edge, John Fastabend, OVS Orbit, Episode 11, https://ovsorbit.org/#e11 https://ovsorbit.org/episode-11.mp3
  1. May 2016,
    Cilium, Thomas Graf, OVS Orbit, Episode 4, https://ovsorbit.org/#e4 https://ovsorbit.org/episode-4.mp3

Blog posts

The following (incomplete) list includes blog posts around BPF, XDP and related projects:

  1. May 2017,
    An entertaining eBPF XDP adventure, Suchakra Sharma, https://suchakra.wordpress.com/2017/05/23/an-entertaining-ebpf-xdp-adventure/
  1. May 2017,
    eBPF, part 2: Syscall and Map Types, Ferris Ellis, https://ferrisellis.com/posts/ebpf_syscall_and_maps/
  1. May 2017,
    Monitoring the Control Plane, Gary Berger, https://www.firstclassfunc.com/2018/07/monitoring-the-control-plane/
  1. Apr 2017,
    USENIX/LISA 2016 Linux bcc/BPF Tools, Brendan Gregg, http://www.brendangregg.com/blog/2017-04-29/usenix-lisa-2016-bcc-bpf-tools.html
  1. Apr 2017,
    Liveblog: Cilium for Network and Application Security with BPF and XDP, Scott Lowe, https://blog.scottlowe.org/2017/04/18/black-belt-cilium/
  1. Apr 2017,
    eBPF, part 1: Past, Present, and Future, Ferris Ellis, https://ferrisellis.com/posts/ebpf_past_present_future/
  1. Mar 2017,
    Analyzing KVM Hypercalls with eBPF Tracing, Suchakra Sharma, https://suchakra.wordpress.com/2017/03/31/analyzing-kvm-hypercalls-with-ebpf-tracing/
  1. Jan 2017,
    Golang bcc/BPF Function Tracing, Brendan Gregg, http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
  1. Dec 2016,
    Give me 15 minutes and I’ll change your view of Linux tracing, Brendan Gregg, http://www.brendangregg.com/blog/2016-12-27/linux-tracing-in-15-minutes.html
  1. Nov 2016,
    Cilium: Networking and security for containers with BPF and XDP, Daniel Borkmann, https://opensource.googleblog.com/2016/11/cilium-networking-and-security.html
  1. Nov 2016,
    Linux bcc/BPF tcplife: TCP Lifespans, Brendan Gregg, http://www.brendangregg.com/blog/2016-11-30/linux-bcc-tcplife.html
  1. Oct 2016,
    DTrace for Linux 2016, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-27/dtrace-for-linux-2016.html
  1. Oct 2016,
    Linux 4.9’s Efficient BPF-based Profiler, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-21/linux-efficient-profiler.html
  1. Oct 2016,
    Linux bcc tcptop, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-15/linux-bcc-tcptop.html
  1. Oct 2016,
    Linux bcc/BPF Node.js USDT Tracing, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-12/linux-bcc-nodejs-usdt.html
  1. Oct 2016,
    Linux bcc/BPF Run Queue (Scheduler) Latency, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-08/linux-bcc-runqlat.html
  1. Oct 2016,
    Linux bcc ext4 Latency Tracing, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-06/linux-bcc-ext4dist-ext4slower.html
  1. Oct 2016,
    Linux MySQL Slow Query Tracing with bcc/BPF, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-04/linux-bcc-mysqld-qslower.html
  1. Oct 2016,
    Linux bcc Tracing Security Capabilities, Brendan Gregg, http://www.brendangregg.com/blog/2016-10-01/linux-bcc-security-capabilities.html
  1. Sep 2016,
    Suricata bypass feature, Eric Leblond, https://www.stamus-networks.com/blog/2016/09/28/suricata-bypass-feature
  1. Aug 2016,
    Introducing the p0f BPF compiler, Gilberto Bertin, https://blog.cloudflare.com/introducing-the-p0f-bpf-compiler/
  1. Jun 2016,
    Ubuntu Xenial bcc/BPF, Brendan Gregg, http://www.brendangregg.com/blog/2016-06-14/ubuntu-xenial-bcc-bpf.html
  1. Mar 2016,
    Linux BPF/bcc Road Ahead, March 2016, Brendan Gregg, http://www.brendangregg.com/blog/2016-03-28/linux-bpf-bcc-road-ahead-2016.html
  1. Mar 2016,
    Linux BPF Superpowers, Brendan Gregg, http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html
  1. Feb 2016,
    Linux eBPF/bcc uprobes, Brendan Gregg, http://www.brendangregg.com/blog/2016-02-08/linux-ebpf-bcc-uprobes.html
  1. Feb 2016,
    Who is waking the waker? (Linux chain graph prototype), Brendan Gregg, http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html
  1. Feb 2016,
    Linux Wakeup and Off-Wake Profiling, Brendan Gregg, http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
  1. Jan 2016,
    Linux eBPF Off-CPU Flame Graph, Brendan Gregg, http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html
  1. Jan 2016,
    Linux eBPF Stack Trace Hack, Brendan Gregg, http://www.brendangregg.com/blog/2016-01-18/ebpf-stack-trace-hack.html
  1. Sep 2015,
    Linux Networking, Tracing and IO Visor, a New Systems Performance Tool for a Distributed World, Suchakra Sharma, https://thenewstack.io/comparing-dtrace-iovisor-new-systems-performance-platform-advance-linux-networking-virtualization/
  1. Aug 2015,
    BPF Internals - II, Suchakra Sharma, https://suchakra.wordpress.com/2015/08/12/bpf-internals-ii/
  1. May 2015,
    eBPF: One Small Step, Brendan Gregg, http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html
  1. May 2015,
    BPF Internals - I, Suchakra Sharma, https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/
  1. Jul 2014,
    Introducing the BPF Tools, Marek Majkowski, https://blog.cloudflare.com/introducing-the-bpf-tools/
  1. May 2014,
    BPF - the forgotten bytecode, Marek Majkowski, https://blog.cloudflare.com/bpf-the-forgotten-bytecode/

Books

BPF Performance Tools (Gregg, Addison Wesley, 2019)

Talks

The following (incomplete) list includes talks and conference papers related to BPF and XDP:

  1. May 2017,
    PyCon 2017, Portland, Executing python functions in the linux kernel by transpiling to bpf, Alex Gartrell, https://www.youtube.com/watch?v=CpqMroMBGP4
  1. May 2017,
    gluecon 2017, Denver, Cilium + BPF: Least Privilege Security on API Call Level for Microservices, Dan Wendlandt, http://gluecon.com/#agenda
  1. May 2017,
    Lund Linux Con, Lund, XDP - eXpress Data Path, Jesper Dangaard Brouer, http://people.netfilter.org/hawk/presentations/LLC2017/XDP_DDoS_protecting_LLC2017.pdf
  1. May 2017,
    Polytechnique Montreal, Trace Aggregation and Collection with eBPF, Suchakra Sharma, https://nova.polymtl.ca/~suchakra/eBPF-5May2017.pdf
  1. Apr 2017,
    DockerCon, Austin, Cilium - Network and Application Security with BPF and XDP, Thomas Graf, https://www.slideshare.net/ThomasGraf5/dockercon-2017-cilium-network-and-application-security-with-bpf-and-xdp
  1. Apr 2017,
    NetDev 2.1, Montreal, XDP Mythbusters, David S. Miller, https://netdevconf.info/2.1/slides/apr7/miller-XDP-MythBusters.pdf
  1. Apr 2017,
    NetDev 2.1, Montreal, Droplet: DDoS countermeasures powered by BPF + XDP, Huapeng Zhou, Doug Porter, Ryan Tierney, Nikita Shirokov, https://netdevconf.info/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf
  1. Apr 2017,
    NetDev 2.1, Montreal, XDP in practice: integrating XDP in our DDoS mitigation pipeline, Gilberto Bertin, https://netdevconf.info/2.1/slides/apr6/bertin_Netdev-XDP.pdf
  1. Apr 2017,
    NetDev 2.1, Montreal, XDP for the Rest of Us, Andy Gospodarek, Jesper Dangaard Brouer, https://netdevconf.info/2.1/slides/apr7/gospodarek-Netdev2.1-XDP-for-the-Rest-of-Us_Final.pdf
  1. Mar 2017,
    SCALE15x, Pasadena, Linux 4.x Tracing: Performance Analysis with bcc/BPF, Brendan Gregg, https://www.slideshare.net/brendangregg/linux-4x-tracing-performance-analysis-with-bccbpf
  1. Mar 2017,
    XDP Inside and Out, David S. Miller, https://raw.githubusercontent.com/iovisor/bpf-docs/master/XDP_Inside_and_Out.pdf
  1. Mar 2017,
    OpenSourceDays, Copenhagen, XDP - eXpress Data Path, Used for DDoS protection, Jesper Dangaard Brouer, http://people.netfilter.org/hawk/presentations/OpenSourceDays2017/XDP_DDoS_protecting_osd2017.pdf
  1. Mar 2017,
    source{d}, Infrastructure 2017, Madrid, High-performance Linux monitoring with eBPF, Alfonso Acosta, https://www.youtube.com/watch?v=k4jqTLtdrxQ
  1. Feb 2017,
    FOSDEM 2017, Brussels, Stateful packet processing with eBPF, an implementation of OpenState interface, Quentin Monnet, https://archive.fosdem.org/2017/schedule/event/stateful_ebpf/
  1. Feb 2017,
    FOSDEM 2017, Brussels, eBPF and XDP walkthrough and recent updates, Daniel Borkmann, http://borkmann.ch/talks/2017_fosdem.pdf
  1. Feb 2017,
    FOSDEM 2017, Brussels, Cilium - BPF & XDP for containers, Thomas Graf, https://archive.fosdem.org/2017/schedule/event/cilium/
  1. Jan 2017,
    linuxconf.au, Hobart, BPF: Tracing and more, Brendan Gregg, https://www.slideshare.net/brendangregg/bpf-tracing-and-more
  1. Dec 2016,
    USENIX LISA 2016, Boston, Linux 4.x Tracing Tools: Using BPF Superpowers, Brendan Gregg, https://www.slideshare.net/brendangregg/linux-4x-tracing-tools-using-bpf-superpowers
  1. Nov 2016,
    Linux Plumbers, Santa Fe, Cilium: Networking & Security for Containers with BPF & XDP, Thomas Graf, https://www.slideshare.net/ThomasGraf5/clium-container-networking-with-bpf-xdp
  1. Nov 2016,
    OVS Conference, Santa Clara, Offloading OVS Flow Processing using eBPF, William (Cheng-Chun) Tu, http://www.openvswitch.org/support/ovscon2016/7/1120-tu.pdf
  1. Oct 2016,
    One.com, Copenhagen, XDP - eXpress Data Path, Intro and future use-cases, Jesper Dangaard Brouer, http://people.netfilter.org/hawk/presentations/xdp2016/xdp_intro_and_use_cases_sep2016.pdf
  1. Oct 2016,
    Docker Distributed Systems Summit, Berlin, Cilium: Networking & Security for Containers with BPF & XDP, Thomas Graf, https://www.slideshare.net/Docker/cilium-bpf-xdp-for-containers-66969823
  1. Oct 2016,
    NetDev 1.2, Tokyo, Data center networking stack, Tom Herbert, https://netdevconf.info/1.2/session.html?tom-herbert
  1. Oct 2016,
    NetDev 1.2, Tokyo, Fast Programmable Networks & Encapsulated Protocols, David S. Miller, https://netdevconf.info/1.2/session.html?david-miller-keynote
  1. Oct 2016,
    NetDev 1.2, Tokyo, XDP workshop - Introduction, experience, and future development, Tom Herbert, https://netdevconf.info/1.2/session.html?herbert-xdp-workshop
  1. Oct 2016,
    NetDev1.2, Tokyo, The adventures of a Suricate in eBPF land, Eric Leblond, https://netdevconf.info/1.2/slides/oct6/10_suricata_ebpf.pdf
  1. Oct 2016,
    NetDev1.2, Tokyo, cls_bpf/eBPF updates since netdev 1.1, Daniel Borkmann, http://borkmann.ch/talks/2016_tcws.pdf
  1. Oct 2016,
    NetDev1.2, Tokyo, Advanced programmability and recent updates with tc’s cls_bpf, Daniel Borkmann, http://borkmann.ch/talks/2016_netdev2.pdf https://netdevconf.info/1.2/papers/borkmann.pdf
  1. Oct 2016,
    NetDev 1.2, Tokyo, eBPF/XDP hardware offload to SmartNICs, Jakub Kicinski, Nic Viljoen, https://netdevconf.info/1.2/papers/eBPF_HW_OFFLOAD.pdf
  1. Aug 2016,
    LinuxCon, Toronto, What Can BPF Do For You?, Brenden Blanco, https://events.static.linuxfound.org/sites/events/files/slides/iovisor-lc-bof-2016.pdf
  1. Aug 2016,
    LinuxCon, Toronto, Cilium - Fast IPv6 Container Networking with BPF and XDP, Thomas Graf, https://www.slideshare.net/ThomasGraf5/cilium-fast-ipv6-container-networking-with-bpf-and-xdp
  1. Aug 2016,
    P4, EBPF and Linux TC Offload, Dinan Gunawardena, Jakub Kicinski, https://de.slideshare.net/Open-NFP/p4-epbf-and-linux-tc-offload
  1. Jul 2016,
    Linux Meetup, Santa Clara, eXpress Data Path, Brenden Blanco, https://www.slideshare.net/IOVisor/express-data-path-linux-meetup-santa-clara-july-2016
  1. Jul 2016,
    Linux Meetup, Santa Clara, CETH for XDP, Yan Chan, Yunsong Lu, https://www.slideshare.net/IOVisor/ceth-for-xdp-linux-meetup-santa-clara-july-2016
  1. May 2016,
    P4 workshop, Stanford, P4 on the Edge, John Fastabend, https://schd.ws/hosted_files/2016p4workshop/1d/Intel%20Fastabend-P4%20on%20the%20Edge.pdf
  1. Mar 2016,
    Performance @Scale 2016, Menlo Park, Linux BPF Superpowers, Brendan Gregg, https://www.slideshare.net/brendangregg/linux-bpf-superpowers
  1. Mar 2016,
    eXpress Data Path, Tom Herbert, Alexei Starovoitov, https://raw.githubusercontent.com/iovisor/bpf-docs/master/Express_Data_Path.pdf
  1. Feb 2016,
    NetDev1.1, Seville, On getting tc classifier fully programmable with cls_bpf, Daniel Borkmann, http://borkmann.ch/talks/2016_netdev.pdf https://netdevconf.info/1.1/proceedings/papers/On-getting-tc-classifier-fully-programmable-with-cls-bpf.pdf
  1. Jan 2016,
    FOSDEM 2016, Brussels, Linux tc and eBPF, Daniel Borkmann, http://borkmann.ch/talks/2016_fosdem.pdf
  1. Oct 2015,
    LinuxCon Europe, Dublin, eBPF on the Mainframe, Michael Holzheu, https://events.static.linuxfound.org/sites/events/files/slides/ebpf_on_the_mainframe_lcon_2015.pdf
  1. Aug 2015,
    Tracing Summit, Seattle, LLTng’s Trace Filtering and beyond (with some eBPF goodness, of course!), Suchakra Sharma, https://raw.githubusercontent.com/iovisor/bpf-docs/master/ebpf_excerpt_20Aug2015.pdf
  1. Jun 2015,
    LinuxCon Japan, Tokyo, Exciting Developments in Linux Tracing, Elena Zannoni, https://events.static.linuxfound.org/sites/events/files/slides/tracing-linux-ezannoni-linuxcon-ja-2015_0.pdf
  1. Feb 2015,
    Collaboration Summit, Santa Rosa, BPF: In-kernel Virtual Machine, Alexei Starovoitov, https://events.static.linuxfound.org/sites/events/files/slides/bpf_collabsummit_2015feb20.pdf
  1. Feb 2015,
    NetDev 0.1, Ottawa, BPF: In-kernel Virtual Machine, Alexei Starovoitov, https://netdevconf.info/0.1/sessions/15.html
  1. Feb 2014,
    DevConf.cz, Brno, tc and cls_bpf: lightweight packet classifying with BPF, Daniel Borkmann, http://borkmann.ch/talks/2014_devconf.pdf

Further Documents

API Reference

Introduction

The Cilium API is JSON based and provided by the cilium-agent. The purpose of the API is to provide visibility and control over an individual agent instance. In general, all API calls affect only the resources managed by the individual cilium-agent serving the API. A few selected API calls, such as security identity resolution, provide cluster-wide visibility. Such API calls are marked specifically. Unless noted otherwise, API calls will only affect local agent resources.

How to access the API

CLI Client

The easiest way to access the API is via the cilium CLI client. cilium will automatically locate the API of the agent running on the same node and access it. However, using the -H or --host flag, the cilium client can be pointed to an arbitrary API address.

Example
$ cilium -H unix:///var/run/cilium/cilium.sock
[...]

Golang Package

The following Go packages can be used to access the API:

Package         Description
pkg/client      Main client API abstraction
api/v1/models   API resource data type models

Example

The full example can be found in the cilium/client-example repository.

import (
        "fmt"

        "github.com/cilium/cilium/pkg/client"
)

func main() {
        c, err := client.NewDefaultClient()
        if err != nil {
                ...
        }

        endpoints, err := c.EndpointList()
        if err != nil {
                ...
        }

        for _, ep := range endpoints {
                fmt.Printf("%8d %14s %16s %32s\n", ep.ID, ep.ContainerName, ep.Addressing.IPV4, ep.Addressing.IPV6)
        }
}

Compatibility Guarantees

The Cilium API is stable as of version 1.0; backward compatibility will be upheld for the whole lifecycle of Cilium 1.x.

API Reference

GET /cluster/nodes

Get nodes information stored in the cilium-agent

Status Codes:
Request Headers:
 
  • client-id – Client UUID should be used when the client wants to request a diff of nodes added and / or removed since the last time that client has made a request.
GET /healthz

Get health of Cilium daemon

Returns health and status information of the Cilium daemon and related components such as the local container runtime, connected datastore, Kubernetes integration and Hubble.

Status Codes:
Request Headers:
 
  • brief – Brief will return a brief representation of the Cilium status.
GET /config

Get configuration of Cilium daemon

Returns the configuration of the Cilium daemon.

Status Codes:
PATCH /config

Modify daemon configuration

Updates the daemon configuration by applying the provided ConfigurationMap and regenerates & recompiles all required datapath components.

Status Codes:
GET /endpoint/{id}

Get endpoint by endpoint ID

Returns endpoint information

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes:
PUT /endpoint/{id}

Create endpoint

Creates a new endpoint

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes:
PATCH /endpoint/{id}

Modify existing endpoint

Applies the endpoint change request to an existing endpoint

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes:
DELETE /endpoint/{id}

Delete endpoint

Deletes the endpoint specified by the ID. Deletion is imminent and atomic; if the deletion request is valid and the endpoint exists, deletion will occur even if errors are encountered in the process. If errors have been encountered, code 202 will be returned, otherwise 200 on success.

All resources associated with the endpoint will be freed and the workload represented by the endpoint will be disconnected. It will no longer be able to initiate or receive communications of any sort.

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes:
GET /endpoint

Retrieves a list of endpoints that have metadata matching the provided parameters.

Retrieves a list of endpoints that have metadata matching the provided parameters, or all endpoints if no parameters provided.

Status Codes:
GET /endpoint/{id}/config

Retrieve endpoint configuration

Retrieves the configuration of the specified endpoint.

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes:
PATCH /endpoint/{id}/config

Modify mutable endpoint configuration

Update the configuration of an existing endpoint and regenerates & recompiles the corresponding programs automatically.

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes:
GET /endpoint/{id}/labels

Retrieves the list of labels associated with an endpoint.

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes:
PATCH /endpoint/{id}/labels

Set label configuration of endpoint

Sets labels associated with an endpoint. These can be user provided or derived from the orchestration system.

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes:
GET /endpoint/{id}/log

Retrieves the status logs associated with this endpoint.

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes:
GET /endpoint/{id}/healthz

Retrieves the health of this endpoint.

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Status Codes:
GET /identity

Retrieves a list of identities that have metadata matching the provided parameters.

Retrieves a list of identities that have metadata matching the provided parameters, or all identities if no parameters are provided.

Status Codes:
  • 200 OK – Success
  • 404 Not Found – Identities with provided parameters not found
  • 520 – Identity storage unreachable. Likely a network problem.
  • 521 – Invalid identity format in storage
GET /identity/{id}

Retrieve identity

Parameters:
  • id (string) – Cluster wide unique identifier of a security identity.
Status Codes:
  • 200 OK – Success
  • 400 Bad Request – Invalid identity provided
  • 404 Not Found – Identity not found
  • 520 – Identity storage unreachable. Likely a network problem.
  • 521 – Invalid identity format in storage
GET /identity/endpoints

Retrieve identities which are being used by local endpoints

Status Codes:
  • 200 OK – Success
  • 404 Not Found – Set of identities which are being used by local endpoints could not be found.
POST /ipam

Allocate an IP address

Query Parameters:
 
  • family (string) –
  • owner (string) –
Status Codes:
Request Headers:
 
  • expiration
POST /ipam/{ip}

Allocate an IP address

Parameters:
  • ip (string) – IP address
Query Parameters:
 
  • owner (string) –
Status Codes:
DELETE /ipam/{ip}

Release an allocated IP address

Parameters:
  • ip (string) – IP address or owner name
Status Codes:
GET /policy

Retrieve entire policy tree

Returns the entire policy tree with all children.

Status Codes:
PUT /policy

Create or update a policy (sub)tree

Status Codes:
DELETE /policy

Delete a policy (sub)tree

Status Codes:
GET /policy/resolve

Resolve policy for an identity context

Status Codes:
GET /policy/selectors

See what selectors match which identities

Status Codes:
GET /service

Retrieve list of all services

Status Codes:
GET /service/{id}

Retrieve configuration of a service

Parameters:
  • id (integer) – ID of service
Status Codes:
PUT /service/{id}

Create or update service

Parameters:
  • id (integer) – ID of service
Status Codes:
DELETE /service/{id}

Delete a service

Parameters:
  • id (integer) – ID of service
Status Codes:
GET /prefilter

Retrieve list of CIDRs

Status Codes:
PATCH /prefilter

Update list of CIDRs

Status Codes:
DELETE /prefilter

Delete list of CIDRs

Status Codes:
GET /debuginfo

Retrieve information about the agent and environment for debugging

Status Codes:
GET /map

List all open maps

Status Codes:
GET /map/{name}

Retrieve contents of BPF map

Parameters:
  • name (string) – Name of map
Status Codes:
GET /metrics/

Retrieve cilium metrics

Status Codes:
GET /fqdn/cache

Retrieves the list of DNS lookups intercepted from all endpoints.

Retrieves the list of DNS lookups intercepted from endpoints, optionally filtered by endpoint id, DNS name, or CIDR IP range.

Query Parameters:
 
  • matchpattern (string) – A toFQDNs compatible matchPattern expression
  • cidr (string) – A CIDR range of IPs
Status Codes:
DELETE /fqdn/cache

Deletes matching DNS lookups from the policy-generation cache.

Deletes matching DNS lookups from the cache, optionally restricted by DNS name. The removed IP data will no longer be used in generated policies.

Query Parameters:
 
  • matchpattern (string) – A toFQDNs compatible matchPattern expression
Status Codes:
GET /fqdn/cache/{id}

Retrieves the list of DNS lookups intercepted from an endpoint.

Retrieves the list of DNS lookups intercepted from endpoints, optionally filtered by endpoint id, DNS name, or CIDR IP range.

Parameters:
  • id (string) –

    String describing an endpoint with the format [prefix:]id. If no prefix is specified, a prefix of cilium-local: is assumed. Not all endpoints will be addressable by all endpoint ID prefixes with the exception of the local Cilium UUID which is assigned to all endpoints.

    Supported endpoint id prefixes:
    • cilium-local: Local Cilium endpoint UUID, e.g. cilium-local:3389595
    • cilium-global: Global Cilium endpoint UUID, e.g. cilium-global:cluster1:nodeX:452343
    • container-id: Container runtime ID, e.g. container-id:22222
    • container-name: Container name, e.g. container-name:foobar
    • pod-name: pod name for this container if K8s is enabled, e.g. pod-name:default:foobar
    • docker-endpoint: Docker libnetwork endpoint ID, e.g. docker-endpoint:4444
Query Parameters:
 
  • matchpattern (string) – A toFQDNs compatible matchPattern expression
  • cidr (string) – A CIDR range of IPs
Status Codes:
GET /fqdn/names

List internal DNS selector representations

Retrieves the list of DNS-related fields (names to poll, selectors and their corresponding regexes).

Status Codes:
GET /ip

Lists information about known IP addresses

Retrieves a list of IPs with known associated information such as their identities, host addresses, Kubernetes pod names, etc. The list can optionally be filtered by a CIDR IP range.

Query Parameters:
 
  • cidr (string) – A CIDR range of IPs
Status Codes:

Hubble internals

Note

This documentation section is targeted at developers who are interested in contributing to Hubble. For this purpose, it describes Hubble internals.

Note

This documentation covers the Hubble server (sometimes referred to as “Hubble embedded”) and Hubble-relay components but does not cover the Hubble UI and CLI.

Hubble builds on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner. One of the design goals of Hubble is to achieve all of this at large scale.

Hubble’s server component is embedded into the Cilium agent in order to achieve high performance with low overhead. The gRPC services offered by the Hubble server may be consumed locally via a Unix domain socket or, more typically, through Hubble-relay. Hubble-relay is a standalone component which is aware of all running Hubble instances and offers full cluster visibility by connecting to their respective gRPC APIs. This capability is usually referred to as multi-node. Hubble-relay’s main goal is to offer a rich API that can be safely exposed and consumed by the Hubble UI and CLI.

Note

This guide does not cover Hubble in standalone mode, which is deprecated with the release of Cilium v1.8.

Hubble Architecture

Hubble exposes gRPC services from the Cilium process that allow clients to receive flows and other types of data.

Hubble server

The Hubble server component implements two gRPC services: the Observer service, which may optionally be exposed via a TCP socket in addition to a local Unix domain socket, and the Peer service, which is only served on a local Unix domain socket.

The Observer service

The Observer service is the principal service. It makes two methods available: GetFlows and ServerStatus. While the ServerStatus method is pretty straightforward (it provides metrics related to the running server), GetFlows is far more sophisticated and by far the more important of the two.

Using GetFlows, callers can get a stream of payloads. Request parameters let callers specify filters in the form of blacklists and whitelists for fine-grained filtering of the data.
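
As a rough sketch of how a client might consume this service, the following Go program dials a Hubble endpoint and requests the last 20 flows. It is illustrative only: the socket address and the insecure connection are assumptions, and the generated observer package (github.com/cilium/cilium/api/v1/observer) may differ between Cilium versions.

package main

import (
        "context"
        "fmt"
        "io"
        "log"

        observerpb "github.com/cilium/cilium/api/v1/observer"
        "google.golang.org/grpc"
)

func main() {
        // Assumption: Hubble serves its gRPC API on the agent's local Unix
        // domain socket; a Hubble-relay TCP address would work the same way.
        conn, err := grpc.Dial("unix:///var/run/cilium/hubble.sock", grpc.WithInsecure())
        if err != nil {
                log.Fatal(err)
        }
        defer conn.Close()

        client := observerpb.NewObserverClient(conn)

        // Request the last 20 flows; whitelist/blacklist filters can be added
        // to the request for fine-grained filtering.
        stream, err := client.GetFlows(context.Background(), &observerpb.GetFlowsRequest{Number: 20})
        if err != nil {
                log.Fatal(err)
        }
        for {
                resp, err := stream.Recv()
                if err == io.EOF {
                        break
                }
                if err != nil {
                        log.Fatal(err)
                }
                fmt.Println(resp.GetFlow())
        }
}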

In order to answer GetFlows requests, Hubble stores monitoring events from Cilium’s event monitor into a ring buffer structure. Monitoring events are obtained by registering a new listener to Cilium’s monitor. The ring buffer is capable of storing a configurable amount of events in memory. Events are continuously consumed, overwriting older ones once the ring buffer is full.

[Figure: GetFlows request flow (hubble_getflows.png)]

For efficiency, the internal buffer length is a bit mask of ones plus one: the most significant bit of this bit mask is at the same position as the most significant bit of the requested capacity n. In other words, the internal buffer size is always a power of two. As the ring buffer is a hot code path, it has been designed not to employ any locking mechanisms and uses atomic operations instead. While this approach has performance benefits, it also has downsides: the ring buffer is a complex component, and reading the very last event written to the buffer is not possible, as it cannot be guaranteed that it has been fully written.
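
The following minimal Go sketch (illustrative only, not the actual Hubble implementation; the function name is made up) applies this sizing rule and reproduces the behavior documented for --hubble-flow-buffer-size, where a requested size of 4095 becomes an internal buffer of 4096:

package main

import (
        "fmt"
        "math/bits"
)

// bufferSize builds a bit mask of ones whose most significant bit sits at the
// same position as the most significant bit of the requested capacity n, then
// adds one. The result is always a power of two.
func bufferSize(n uint64) uint64 {
        if n == 0 {
                return 1 // degenerate case, chosen here for illustration only
        }
        mask := (uint64(1) << bits.Len64(n)) - 1 // e.g. n=4095 -> mask=0x0fff
        return mask + 1                          // e.g. 4095 -> 4096, 4096 -> 8192
}

func main() {
        fmt.Println(bufferSize(4095)) // 4096
        fmt.Println(bufferSize(4096)) // 8192
}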

Due to its complex nature, the ring buffer is typically accessed via a ring reader that abstracts the complexity of this data structure for reading. The ring reader allows reading one event at a time with ‘previous’ and ‘next’ methods, but also implements a follow mode where events are continuously read as they are written to the ring buffer.

The Peer service

The Peer service sends information about Hubble peers in the cluster in a stream. When the Notify method is called, it reports information about all the peers in the cluster and subsequently sends information about peers that are updated, added, or removed from the cluster. Thus, it allows the caller to keep track of all Hubble instances and query their respective gRPC services.

This service is typically only exposed on a local Unix domain socket and is primarily used by Hubble-relay in order to have a cluster-wide view of all Hubble instances.

The Peer service obtains peer change notifications by subscribing to Cilium’s node manager. To this end, it internally defines a handler that implements Cilium’s datapath node handler interface.
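
In the same spirit, a consumer such as Hubble-relay could watch peer change notifications roughly as in the sketch below; the peer package path (github.com/cilium/cilium/api/v1/peer), socket address, and field accessors are assumptions and may vary between versions.

package main

import (
        "context"
        "fmt"
        "log"

        peerpb "github.com/cilium/cilium/api/v1/peer"
        "google.golang.org/grpc"
)

func main() {
        // Assumption: the Peer service is reachable via the local Hubble socket.
        conn, err := grpc.Dial("unix:///var/run/cilium/hubble.sock", grpc.WithInsecure())
        if err != nil {
                log.Fatal(err)
        }
        defer conn.Close()

        client := peerpb.NewPeerClient(conn)

        // Notify first reports all known peers, then streams updates as peers
        // are added, updated or removed from the cluster.
        stream, err := client.Notify(context.Background(), &peerpb.NotifyRequest{})
        if err != nil {
                log.Fatal(err)
        }
        for {
                change, err := stream.Recv()
                if err != nil {
                        log.Fatal(err)
                }
                fmt.Printf("%s %s (%s)\n", change.GetType(), change.GetName(), change.GetAddress())
        }
}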

Hubble-relay

Note

At the time of this writing, the hubble-relay component is still a work in progress and may undergo major changes. For this reason, internal documentation about Hubble-relay is limited.

Hubble-relay is a component that was introduced in the context of multi-node support. It leverages the Peer service to obtain information about Hubble instances and consumes their gRPC APIs in order to provide a richer API that covers events from across the entire cluster.

Command Cheatsheet

Cilium is controlled via a simple command-line interface. The CLI is a single application that takes subcommands, which you can find in the command reference guide.

$ cilium
CLI for interacting with the local Cilium Agent

Usage:
  cilium [command]

Available Commands:
  bpf                      Direct access to local BPF maps
  cleanup                  Reset the agent state
  completion               Output shell completion code for bash
  config                   Cilium configuration options
  debuginfo                Request available debugging information from agent
  endpoint                 Manage endpoints
  identity                 Manage security identities
  kvstore                  Direct access to the kvstore
  monitor                  Monitoring
  policy                   Manage security policies
  prefilter                Manage XDP CIDR filters
  service                  Manage services & loadbalancers
  status                   Display status of daemon
  version                  Print version information

Flags:
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API

Use "cilium [command] --help" for more information about a command.

All commands and subcommands support the -h option, which prints information about the options and arguments the subcommand accepts. If a command fails, the Cilium CLI returns a non-zero exit status.

Command utilities:

JSON Output

All list commands return a pretty-printed list with the information retrieved from the Cilium daemon. If you need something more detailed, use JSON output via the global option -o json:

$ cilium endpoint list -o json

Moreover, Cilium also provides JSONPath support, so detailed information can be extracted. The JSONPath template reference can be found in the Kubernetes documentation:

$ cilium endpoint list -o jsonpath='{[*].id}'
29898 38939 56326
$ cilium endpoint list -o jsonpath='{range [*]}{@.id}{"="}{@.status.policy.spec.policy-enabled}{"\n"}{end}'
29898=none
38939=none
56326=none

Shell Tab-completion

If you use bash or zsh, the Cilium CLI can provide tab completion for subcommands. To install tab completion, run the following command in your terminal:

$ source <(cilium completion)

If you want Cilium completion to be loaded automatically, add it to your .bashrc with the following:

$ echo "source <(cilium completion)" >> ~/.bashrc

Command examples:

Basics

Check the status of the agent

$ cilium status
KVStore:                Ok         Consul: 172.17.0.3:8300
ContainerRuntime:       Ok
Kubernetes:             Disabled
Cilium:                 Ok         OK
NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
Controller Status:      6/6 healthy
Proxy Status:           OK, ip 10.15.28.238, port-range 10000-20000
Cluster health:   1/1 reachable   (2018-04-11T07:33:09Z)
$

Get a detailed status of the agent:

$ cilium status --all-controllers --all-health --all-redirects
KVStore:                Ok         Consul: 172.17.0.3:8300
ContainerRuntime:       Ok
Kubernetes:             Disabled
Cilium:                 Ok         OK
NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon:   Ok
Controller Status:      6/6 healthy
  Name                                 Last success   Last error   Count   Message
  kvstore-lease-keepalive              2m52s ago      never        0       no error
  ipcache-bpf-garbage-collection       2m50s ago      never        0       no error
  resolve-identity-29898               2m50s ago      never        0       no error
  sync-identity-to-k8s-pod (29898)     50s ago        never        0       no error
  sync-IPv4-identity-mapping (29898)   2m49s ago      never        0       no error
  sync-IPv6-identity-mapping (29898)   2m49s ago      never        0       no error
Proxy Status:   OK, ip 10.15.28.238, port-range 10000-20000
Cluster health:         1/1 reachable   (2018-04-11T07:32:09Z)
  Name                  IP              Reachable   Endpoints reachable
  runtime (localhost)   10.0.2.15       true        false
$

Get the current agent configuration

cilium config

Policy management

Importing a Cilium Network Policy

cilium policy import my-policy.json

Get list of all imported policy rules

cilium policy get

Remove all policies

cilium policy delete --all
Tracing

Check policy enforcement between two labels on port 80:

cilium policy trace -s <app.from> -d <app.to> --dport 80

Check policy enforcement between two identities

cilium policy trace --src-identity <from-id> --dst-identity <to-id>

Check policy enforcement between two pods:

cilium policy trace --src-k8s-pod <namespace>:<pod.from> --dst-k8s-pod <namespace>:<pod.to>
Monitoring

Monitor cilium datapath notifications

cilium monitor

Verbose output (including debug if enabled)

cilium monitor -v

Extra verbose output (including packet dissection)

cilium monitor -v -v

Filter for only the events related to endpoint

cilium monitor --related-to=<id>

Filter for only events on layer 7

cilium monitor -t L7

Show notifications only for dropped packet events

cilium monitor --type drop

Don’t dissect the packet payload; display the payload in hex

cilium monitor -v -v --hex

Connectivity

Check cluster connectivity

cilium-health status

There is also a blog post related to this tool.

Endpoints

Get list of all local endpoints

cilium endpoint list

Get detailed view of endpoint properties and state

cilium endpoint get <id>

Show recent endpoint specific log entries

cilium endpoint log <id>

Enable debugging output on the cilium monitor for this endpoint

cilium endpoint config <id> Debug=true

Loadbalancing

Get list of loadbalancer services

cilium service list

Or you can get the loadbalancer information using bpf list:

cilium bpf lb list

Add a new loadbalancer

cilium service update --frontend 127.0.0.1:80 \
    --backends 127.0.0.2:90,127.0.0.3:90 \
    --id 20

BPF

List node tunneling mapping information

cilium bpf tunnel list

Check logs for verifier issues

journalctl -u cilium | grep -B20 -F10 Verifier

List connection tracking entries:

sudo cilium bpf ct list global

Flush connection tracking entries:

sudo cilium bpf ct flush

List proxy configuration:

sudo cilium bpf proxy list

Kubernetes examples:

If you are running Cilium on top of Kubernetes, you may also want a way to list all Cilium endpoints or policies with a single kubectl command. Cilium makes this information available through Kubernetes Custom Resource Definitions (CRDs):

Policies

In Kubernetes you can use two kinds of policies: Kubernetes Network Policies and Cilium Network Policies. Both can be retrieved with the kubectl command:

Kubernetes Network Policies
 kubectl get netpol
Kubernetes Cilium Policies
 $ kubectl get cnp
 NAME      AGE
 rule1     3m
 $ kubectl get cnp rule1
 NAME      AGE
 rule1     3m
 $ kubectl get cnp rule1 -o json

Endpoints

To retrieve a list of all endpoints managed by Cilium, the Cilium Endpoint resource (cep) can be used.

$ kubectl get cep
NAME                AGE
34e299f0-b25c2fef   41s
34e299f0-dd86986c   42s
4d088f48-83e4f98d   2m
4d088f48-d04ab55f   2m
5c6211b5-9217a4d1   1m
5c6211b5-dccc3d24   1m
700e0976-6cb50b02   3m
700e0976-afd3a30c   3m
78092a35-4874ed16   1m
78092a35-4b08b92b   1m
9b74f61f-14571299   7s
9b74f61f-f9a96f4a   7s

$ kubectl get cep 700e0976-6cb50b02 -o json

$ kubectl get cep -o jsonpath='{range .items[*]}{@.status.id}{"="}{@.status.status.policy.spec.policy-enabled}{"\n"}{end}'
30391=ingress
5766=ingress
51796=none
40355=none

Command Reference

cilium-agent

Run the cilium agent

Synopsis

Run the cilium agent

cilium-agent [flags]

Options

      --agent-health-port int                         TCP port for agent health status API (default 9876)
      --agent-labels strings                          Additional labels to identify this agent
      --allow-icmp-frag-needed                        Allow ICMP Fragmentation Needed type packets for purposes like TCP Path MTU. (default true)
      --allow-localhost string                        Policy when to allow local stack to reach local endpoints { auto | always | policy } (default "auto")
      --annotate-k8s-node                             Annotate Kubernetes node (default true)
      --auto-create-cilium-node-resource              Automatically create CiliumNode resource for own node on startup (default true)
      --auto-direct-node-routes                       Enable automatic L2 routing between nodes
      --blacklist-conflicting-routes                  Don't blacklist IP allocations conflicting with local non-cilium routes (default true)
      --bpf-compile-debug                             Enable debugging of the BPF compilation process
      --bpf-ct-global-any-max int                     Maximum number of entries in non-TCP CT table (default 262144)
      --bpf-ct-global-tcp-max int                     Maximum number of entries in TCP CT table (default 524288)
      --bpf-ct-timeout-regular-any duration           Timeout for entries in non-TCP CT table (default 1m0s)
      --bpf-ct-timeout-regular-tcp duration           Timeout for established entries in TCP CT table (default 6h0m0s)
      --bpf-ct-timeout-regular-tcp-fin duration       Teardown timeout for entries in TCP CT table (default 10s)
      --bpf-ct-timeout-regular-tcp-syn duration       Establishment timeout for entries in TCP CT table (default 1m0s)
      --bpf-ct-timeout-service-any duration           Timeout for service entries in non-TCP CT table (default 1m0s)
      --bpf-ct-timeout-service-tcp duration           Timeout for established service entries in TCP CT table (default 6h0m0s)
      --bpf-fragments-map-max int                     Maximum number of entries in fragments tracking map (default 8192)
      --bpf-map-dynamic-size-ratio float              Ratio (0.0-1.0) of total system memory to use for dynamic sizing of CT, NAT and policy BPF maps. Set to 0.0 to disable dynamic BPF map sizing (default: 0.0)
      --bpf-nat-global-max int                        Maximum number of entries for the global BPF NAT table (default 524288)
      --bpf-neigh-global-max int                      Maximum number of entries for the global BPF neighbor table (default 524288)
      --bpf-policy-map-max int                        Maximum number of entries in endpoint policy map (per endpoint) (default 16384)
      --bpf-root string                               Path to BPF filesystem
      --certificates-directory string                 Root directory to find certificates specified in L7 TLS policy enforcement (default "/var/run/cilium/certs")
      --cgroup-root string                            Path to Cgroup2 filesystem
      --cluster-id int                                Unique identifier of the cluster
      --cluster-name string                           Name of the cluster (default "default")
      --clustermesh-config string                     Path to the ClusterMesh configuration directory
      --config string                                 Configuration file (default "$HOME/ciliumd.yaml")
      --config-dir string                             Configuration directory that contains a file for each option
      --conntrack-gc-interval duration                Overwrite the connection-tracking garbage collection interval
      --datapath-mode string                          Datapath mode name (default "veth")
  -D, --debug                                         Enable debugging mode
      --debug-verbose strings                         List of enabled verbose debug groups
  -d, --device strings                                List of devices facing cluster/external network for attaching bpf_netdev (first device should be one used for direct routing if tunneling is disabled)
      --disable-cnp-status-updates                    Do not send CNP NodeStatus updates to the Kubernetes api-server (recommended to run with "cnp-node-status-gc=false" in cilium-operator)
      --disable-conntrack                             Disable connection tracking
      --disable-endpoint-crd                          Disable use of CiliumEndpoint CRD
      --disable-iptables-feeder-rules strings         Chains to ignore when installing feeder rules.
      --egress-masquerade-interfaces string           Limit egress masquerading to interface selector
      --enable-auto-protect-node-port-range           Append NodePort range to net.ipv4.ip_local_reserved_ports if it overlaps with ephemeral port range (net.ipv4.ip_local_port_range) (default true)
      --enable-bpf-clock-probe                        Enable BPF clock source probing for more efficient tick retrieval
      --enable-bpf-masquerade                         Masquerade packets from endpoints leaving the host with BPF instead of iptables
      --enable-endpoint-health-checking               Enable connectivity health checking between virtual endpoints (default true)
      --enable-endpoint-routes                        Use per endpoint routes instead of routing via cilium_host
      --enable-external-ips                           Enable k8s service externalIPs feature (requires enabling enable-node-port) (default true)
      --enable-health-checking                        Enable connectivity health checking (default true)
      --enable-host-firewall                          Enable host network policies
      --enable-host-port                              Enable k8s hostPort mapping feature (requires enabling enable-node-port) (default true)
      --enable-host-reachable-services                Enable reachability of services for host applications (beta)
      --enable-hubble                                 Enable hubble server
      --enable-ip-masq-agent                          Enable BPF ip-masq-agent
      --enable-ipsec                                  Enable IPSec support
      --enable-ipv4                                   Enable IPv4 support (default true)
      --enable-ipv4-fragment-tracking                 Enable IPv4 fragments tracking for L4-based lookups (default true)
      --enable-ipv6                                   Enable IPv6 support (default true)
      --enable-k8s-api-discovery                      Enable discovery of Kubernetes API groups and resources with the discovery API
      --enable-k8s-endpoint-slice                     Enables k8s EndpointSlice feature in Cilium if the k8s cluster supports it (default true)
      --enable-k8s-event-handover                     Enable k8s event handover to kvstore for improved scalability
      --enable-l7-proxy                               Enable L7 proxy for L7 policy enforcement (default true)
      --enable-local-node-route                       Enable installation of the route which points the allocation prefix of the local node (default true)
      --enable-node-port                              Enable NodePort type services by Cilium (beta)
      --enable-policy string                          Enable policy enforcement (default "default")
      --enable-remote-node-identity                   Enable use of remote node identity
      --enable-session-affinity                       Enable support for service session affinity
      --enable-tracing                                Enable tracing while determining policy (debugging)
      --enable-well-known-identities                  Enable well-known identities for known Kubernetes components (default true)
      --enable-xt-socket-fallback                     Enable fallback for missing xt_socket module (default true)
      --encrypt-interface string                      Transparent encryption interface
      --encrypt-node                                  Enables encrypting traffic from non-Cilium pods and host networking
      --endpoint-interface-name-prefix string         Prefix of interface name shared by all endpoints (default "lxc+")
      --endpoint-queue-size int                       size of EventQueue per-endpoint (default 25)
      --endpoint-status strings                       Enable additional CiliumEndpoint status features (controllers,health,log,policy,state)
      --envoy-log string                              Path to a separate Envoy log file, if any
      --exclude-local-address strings                 Exclude CIDR from being recognized as local address
      --fixed-identity-mapping map                    Key-value for the fixed identity mapping which allows using reserved labels for fixed identities (default map[])
      --flannel-master-device string                  Installs a BPF program to allow for policy enforcement on the given network interface. Allows running Cilium on top of other CNI plugins that provide networking, e.g. flannel, where this value should be set to 'cni0'. [EXPERIMENTAL]
      --flannel-uninstall-on-exit                     When used along with the flannel-master-device flag, cleans up all installed BPF programs when the Cilium agent is terminated.
      --force-local-policy-eval-at-source             Force policy evaluation of all local communication at the source endpoint (default true)
  -h, --help                                          help for cilium-agent
      --host-reachable-services-protos strings        Only enable reachability of services for host applications for specific protocols (default [tcp,udp])
      --http-idle-timeout uint                        Time after which a non-gRPC HTTP stream is considered failed unless traffic in the stream has been processed (in seconds); defaults to 0 (unlimited)
      --http-max-grpc-timeout uint                    Time after which a forwarded gRPC request is considered failed unless completed (in seconds). A "grpc-timeout" header may override this with a shorter value; defaults to 0 (unlimited)
      --http-request-timeout uint                     Time after which a forwarded HTTP request is considered failed unless completed (in seconds); Use 0 for unlimited (default 3600)
      --http-retry-count uint                         Number of retries performed after a forwarded request attempt fails (default 3)
      --http-retry-timeout uint                       Time after which a forwarded but uncompleted request is retried (connection failures are retried immediately); defaults to 0 (never)
      --hubble-event-queue-size int                   Buffer size of the channel to receive monitor events.
      --hubble-flow-buffer-size int                   Maximum number of flows in Hubble's buffer. The actual buffer size gets rounded up to the next power of 2, e.g. 4095 => 4096 (default 4095)
      --hubble-listen-address string                  An additional address for Hubble server to listen to, e.g. ":4244"
      --hubble-metrics strings                        List of Hubble metrics to enable.
      --hubble-metrics-server string                  Address to serve Hubble metrics on.
      --hubble-socket-path string                     Set hubble's socket path to listen for connections (default "/var/run/cilium/hubble.sock")
      --identity-allocation-mode string               Method to use for identity allocation (default "kvstore")
      --identity-change-grace-period duration         Time to wait before using new identity on endpoint identity change (default 5s)
      --install-iptables-rules                        Install base iptables rules for cilium to mainly interact with kube-proxy (and masquerading) (default true)
      --ip-allocation-timeout duration                Time after which an incomplete CIDR allocation is considered failed (default 2m0s)
      --ip-masq-agent-config-path string              ip-masq-agent configuration file path (default "/etc/config/ip-masq-agent")
      --ipam string                                   Backend to use for IPAM (default "hostscope-legacy")
      --ipsec-key-file string                         Path to IPSec key file
      --iptables-lock-timeout duration                Time to pass to each iptables invocation to wait for xtables lock acquisition (default 5s)
      --ipv4-node string                              IPv4 address of node (default "auto")
      --ipv4-pod-subnets strings                      List of IPv4 pod subnets to preconfigure for encryption
      --ipv4-range string                             Per-node IPv4 endpoint prefix, e.g. 10.16.0.0/16 (default "auto")
      --ipv4-service-loopback-address string          IPv4 address for service loopback SNAT (default "169.254.42.1")
      --ipv4-service-range string                     Kubernetes IPv4 services CIDR if not inside cluster prefix (default "auto")
      --ipv6-cluster-alloc-cidr string                IPv6 /64 CIDR used to allocate per node endpoint /96 CIDR (default "f00d::/64")
      --ipv6-node string                              IPv6 address of node (default "auto")
      --ipv6-pod-subnets strings                      List of IPv6 pod subnets to preconfigure for encryption
      --ipv6-range string                             Per-node IPv6 endpoint prefix, e.g. fd02:1:1::/96 (default "auto")
      --ipv6-service-range string                     Kubernetes IPv6 services CIDR if not inside cluster prefix (default "auto")
      --ipvlan-master-device string                   Device facing external network acting as ipvlan master (default "undefined")
      --k8s-api-server string                         Kubernetes API server URL
      --k8s-heartbeat-timeout duration                Configures the timeout for api-server heartbeat, set to 0 to disable (default 30s)
      --k8s-kubeconfig-path string                    Absolute path of the kubernetes kubeconfig file
      --k8s-namespace string                          Name of the Kubernetes namespace in which Cilium is deployed
      --k8s-require-ipv4-pod-cidr                     Require IPv4 PodCIDR to be specified in node resource
      --k8s-require-ipv6-pod-cidr                     Require IPv6 PodCIDR to be specified in node resource
      --k8s-watcher-endpoint-selector string          K8s endpoint watcher will watch for these k8s endpoints (default "metadata.name!=kube-scheduler,metadata.name!=kube-controller-manager,metadata.name!=etcd-operator,metadata.name!=gcp-controller-manager")
      --k8s-watcher-queue-size uint                   Queue size used to serialize each k8s event type (default 1024)
      --keep-config                                   When restoring state, keeps containers' configuration in place
      --kube-proxy-replacement string                 Auto-enable available kube-proxy replacement features ("probe"), enable only selected features and panic if any selected feature cannot be enabled ("partial"), enable all features and panic if any feature cannot be enabled ("strict"), or disable kube-proxy replacement entirely, ignoring any selected feature ("disabled") (default "partial")
      --kvstore string                                Key-value store type
      --kvstore-connectivity-timeout duration         Time after which an incomplete kvstore operation  is considered failed (default 2m0s)
      --kvstore-opt map                               Key-value store options (default map[])
      --kvstore-periodic-sync duration                Periodic KVstore synchronization interval (default 5m0s)
      --label-prefix-file string                      Valid label prefixes file path
      --labels strings                                List of label prefixes used to determine identity of an endpoint
      --lib-dir string                                Directory path to store runtime build environment (default "/var/lib/cilium")
      --log-driver strings                            Logging endpoints to use, for example syslog
      --log-opt map                                   Log driver options for cilium (default map[])
      --log-system-load                               Enable periodic logging of system load
      --masquerade                                    Masquerade packets from endpoints leaving the host (default true)
      --metrics strings                               Metrics that should be enabled or disabled from the default metric list. (+metric_foo to enable metric_foo , -metric_bar to disable metric_bar)
      --monitor-aggregation string                    Level of monitor aggregation for traces from the datapath (default "None")
      --monitor-aggregation-flags strings             TCP flags that trigger monitor reports when monitor aggregation is enabled (default [syn,fin,rst])
      --monitor-aggregation-interval duration         Monitor report interval when monitor aggregation is enabled (default 5s)
      --monitor-queue-size int                        Size of the event queue when reading monitor events
      --mtu int                                       Overwrite auto-detected MTU of underlying network
      --nat46-range string                            IPv6 prefix to map IPv4 addresses to (default "0:0:0:0:0:FFFF::/96")
      --node-port-acceleration string                 BPF NodePort acceleration via XDP ("native", "none") (default "none")
      --node-port-bind-protection                     Reject application bind(2) requests to service ports in the NodePort range (default true)
      --node-port-mode string                         BPF NodePort mode ("snat", "dsr", "hybrid") (default "snat")
      --node-port-range strings                       Set the min/max NodePort port range (default [30000,32767])
      --policy-audit-mode                             Enable policy audit (non-drop) mode
      --policy-queue-size int                         size of queues for policy-related events (default 100)
      --pprof                                         Enable serving the pprof debugging API
      --preallocate-bpf-maps                          Enable BPF map pre-allocation (default true)
      --prefilter-device string                       Device facing external network for XDP prefiltering (default "undefined")
      --prefilter-mode string                         Prefilter mode via XDP ("native", "generic") (default "native")
      --prepend-iptables-chains                       Prepend custom iptables chains instead of appending (default true)
      --prometheus-serve-addr string                  IP:Port on which to serve prometheus metrics (pass ":Port" to bind on all interfaces, "" is off)
      --proxy-connect-timeout uint                    Time after which a TCP connect attempt is considered failed unless completed (in seconds) (default 1)
      --read-cni-conf string                          Read the CNI configuration at the specified path to extract per-node configuration
      --restore                                       Restores state, if possible, from previous daemon (default true)
      --sidecar-istio-proxy-image string              Regular expression matching compatible Istio sidecar istio-proxy container image names (default "cilium/istio_proxy")
      --single-cluster-route                          Use a single cluster route instead of per node routes
      --skip-crd-creation                             Skip Kubernetes Custom Resource Definitions creations
      --socket-path string                            Sets daemon's socket path to listen for connections (default "/var/run/cilium/cilium.sock")
      --sockops-enable                                Enable sockops when kernel supported
      --state-dir string                              Directory path to store runtime state (default "/var/run/cilium")
      --tofqdns-dns-reject-response-code string       DNS response code for rejecting DNS requests, available options are '[nameError refused]' (default "refused")
      --tofqdns-enable-dns-compression                Allow the DNS proxy to compress responses to endpoints that are larger than 512 Bytes or the EDNS0 option, if present (default true)
      --tofqdns-endpoint-max-ip-per-hostname int      Maximum number of IPs to maintain per FQDN name for each endpoint (default 50)
      --tofqdns-max-deferred-connection-deletes int   Maximum number of IPs to retain for expired DNS lookups with still-active connections (default 10000)
      --tofqdns-min-ttl int                           The minimum time, in seconds, to use DNS data for toFQDNs policies (default 3600)
      --tofqdns-pre-cache string                      DNS cache data at this path is preloaded on agent startup
      --tofqdns-proxy-port int                        Global port on which the in-agent DNS proxy should listen. The default 0 means an OS-assigned port.
      --tofqdns-proxy-response-max-delay duration     The maximum time the DNS proxy holds an allowed DNS response before sending it along. Responses are sent as soon as the datapath is updated with the new IP information. (default 100ms)
      --trace-payloadlen int                          Length of payload to capture when tracing (default 128)
  -t, --tunnel string                                 Tunnel mode {vxlan, geneve, disabled} (default "vxlan" for the "veth" datapath mode)
      --version                                       Print version information
      --write-cni-conf-when-ready string              Write the CNI configuration as specified via --read-cni-conf to path when agent is ready
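
A minimal illustrative invocation, using only flags documented above. The etcd configuration path and the IPv4 range are placeholders, and etcd.config is assumed to be a valid kvstore option; this is a sketch, not a recommended configuration:

    # Run the agent with etcd as the kvstore, VXLAN tunneling, and an explicit per-node IPv4 prefix
    cilium-agent --kvstore etcd \
        --kvstore-opt etcd.config=/var/lib/etcd-config/etcd.config \
        --tunnel vxlan --ipv4-range 10.16.0.0/16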

cilium

CLI

Synopsis

CLI for interacting with the local Cilium Agent

Options
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -h, --help            help for cilium
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf

Direct access to local BPF maps

Synopsis

Direct access to local BPF maps

Options
  -h, --help   help for bpf
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf ct

Connection tracking tables

Synopsis

Connection tracking tables

Options
  -h, --help   help for ct
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf ct flush

Flush all connection tracking entries

Synopsis

Flush all connection tracking entries

cilium bpf ct flush ( <endpoint identifier> | global ) [flags]
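Examples
    # Illustrative: flush the global connection tracking table, or the table of a single endpoint (1234 is a placeholder ID)
    cilium bpf ct flush global
    cilium bpf ct flush 1234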
Options
  -h, --help   help for flush
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf ct list

List connection tracking entries

Synopsis

List connection tracking entries

cilium bpf ct list ( <endpoint identifier> | global ) [flags]
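Examples
    # Illustrative: list the global connection tracking table as JSON, or a single endpoint's table (1234 is a placeholder ID)
    cilium bpf ct list global -o json
    cilium bpf ct list 1234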
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf endpoint

Local endpoint map

Synopsis

Local endpoint map

Options
  -h, --help   help for endpoint
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf endpoint delete

Delete local endpoint entries

Synopsis

Delete local endpoint entries

cilium bpf endpoint delete [flags]
Options
  -h, --help   help for delete
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf endpoint list

List local endpoint entries

Synopsis

List local endpoint entries

cilium bpf endpoint list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf ipcache

Manage the IPCache mappings for IP/CIDR <-> Identity

Synopsis

Manage the IPCache mappings for IP/CIDR <-> Identity

Options
  -h, --help   help for ipcache
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf ipcache get

Retrieve identity for an ip

Synopsis

Retrieve identity for an ip

cilium bpf ipcache get [flags]
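Examples
    # Illustrative lookup; the IP address is a placeholder
    cilium bpf ipcache get 10.16.0.15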
Options
  -h, --help   help for get
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf ipcache list

List endpoint IPs (local and remote) and their corresponding security identities

Synopsis

List endpoint IPs (local and remote) and their corresponding security identities.

Note that for Linux kernel versions between 4.11 and 4.15 inclusive, the native LPM map type used for implementing the IPCache does not provide the ability to walk / dump the entries, so on these kernel versions this tool will never return any entries, even if entries exist in the map. You may instead run: cilium map get cilium_ipcache

cilium bpf ipcache list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf ipmasq

ip-masq-agent CIDRs

Synopsis

ip-masq-agent CIDRs

Options
  -h, --help   help for ipmasq
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf ipmasq list

List ip-masq-agent CIDRs

Synopsis

List ip-masq-agent CIDRs. Packets sent from pods to IPs from these CIDRs avoid masquerading.

cilium bpf ipmasq list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf lb

Load-balancing configuration

Synopsis

Load-balancing configuration

Options
  -h, --help   help for lb
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf lb list

List load-balancing configuration

Synopsis

List load-balancing configuration

cilium bpf lb list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
      --revnat          List reverse NAT entries
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf metrics

BPF datapath traffic metrics

Synopsis

BPF datapath traffic metrics

Options
  -h, --help   help for metrics
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf metrics list

List BPF datapath traffic metrics

Synopsis

List BPF datapath traffic metrics

cilium bpf metrics list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf nat

NAT mapping tables

Synopsis

NAT mapping tables

Options
  -h, --help   help for nat
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf nat flush

Flush all NAT mapping entries

Synopsis

Flush all NAT mapping entries

cilium bpf nat flush [flags]
Options
  -h, --help   help for flush
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf nat list

List all NAT mapping entries

Synopsis

List all NAT mapping entries

cilium bpf nat list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf policy

Manage policy related BPF maps

Synopsis

Manage policy related BPF maps

Options
  -h, --help   help for policy
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf policy add

Add/update policy entry

Synopsis

Add/update policy entry

cilium bpf policy add <endpoint id> <traffic-direction> <identity> [port/proto] [flags]
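Examples
    # Illustrative sketch: endpoint ID 1234 and identity 5678 are placeholders,
    # and "ingress" is assumed to be an accepted <traffic-direction> value
    cilium bpf policy add 1234 ingress 5678 80/tcp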
Options
  -h, --help   help for add
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf policy delete

Delete a policy entry

Synopsis

Delete a policy entry

cilium bpf policy delete <endpoint id> <identity> [port/proto] [flags]
Options
  -h, --help   help for delete
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf policy get

List contents of a policy BPF map

Synopsis

List contents of a policy BPF map

cilium bpf policy get [flags]
Options
      --all             Dump all policy maps
  -h, --help            help for get
  -n, --numeric         Do not resolve IDs
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf proxy

Proxy configuration

Synopsis

Proxy configuration

Options
  -h, --help   help for proxy
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf proxy flush

Flush all proxy entries (deprecated)

Synopsis

Flush all proxy entries (deprecated)

cilium bpf proxy flush [flags]
Options
  -h, --help   help for flush
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf proxy list

List proxy configuration (deprecated)

Synopsis

List proxy configuration (deprecated)

cilium bpf proxy list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf sha

Manage compiled BPF template objects

Synopsis

Manage compiled BPF template objects

Options
  -h, --help   help for sha
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf sha get

Get datapath SHA header

Synopsis

Get datapath SHA header

cilium bpf sha get <sha> [flags]
Options
  -h, --help            help for get
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf sha list

List BPF template objects.

Synopsis

List BPF template objects.

cilium bpf sha list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf tunnel

Tunnel endpoint map

Synopsis

Tunnel endpoint map

Options
  -h, --help   help for tunnel
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium bpf tunnel list

List tunnel endpoint entries

Synopsis

List tunnel endpoint entries

cilium bpf tunnel list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium cleanup

Reset the agent state

Synopsis

Reset the agent state

cilium cleanup [flags]
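Examples
    # Illustrative: remove only the BPF state without prompting for confirmation
    cilium cleanup -f --bpf-state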
Options
      --all-state   Remove all cilium state
      --bpf-state   Remove BPF state
  -f, --force       Skip confirmation
  -h, --help        help for cleanup
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium completion

Output shell completion code

Synopsis

Output shell completion code

cilium completion [shell] [flags]
Examples

# Installing bash completion on macOS using homebrew
## If running Bash 3.2 included with macOS
    brew install bash-completion
## or, if running Bash 4.1+
    brew install bash-completion@2
## afterwards you only need to run
    cilium completion bash > $(brew --prefix)/etc/bash_completion.d/cilium


# Installing bash completion on Linux
## Load the cilium completion code for bash into the current shell
    source <(cilium completion bash)
## Write bash completion code to a file and source it from .bash_profile
    cilium completion bash > ~/.cilium/completion.bash.inc
    printf "
      # Cilium shell completion
      source '$HOME/.cilium/completion.bash.inc'
      " >> $HOME/.bash_profile
    source $HOME/.bash_profile


# Installing zsh completion on Linux/macOS
## Load the cilium completion code for zsh into the current shell
    source <(cilium completion zsh)
## Write zsh completion code to a file and source it from .zshrc
    cilium completion zsh > ~/.cilium/completion.zsh.inc
    printf "
      # Cilium shell completion
      source '$HOME/.cilium/completion.zsh.inc'
      " >> $HOME/.zshrc
    source $HOME/.zshrc

# Installing fish completion on Linux/macOS
## Write fish completion code to fish specific location
    cilium completion fish > ~/.config/fish/completions/cilium.fish
Options
  -h, --help   help for completion
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium config

Cilium configuration options

Synopsis

Cilium configuration options

cilium config [<option>=(enable|disable) ...] [flags]
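Examples
    # List the available options, then toggle one of them
    # (Debug is assumed here to be among the listed options)
    cilium config --list-options
    cilium config Debug=enable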
Options
  -h, --help            help for config
      --list-options    List available options
  -n, --num-pages int   Number of pages for perf ring buffer. New values have to be > 0
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium debuginfo

Request available debugging information from agent

Synopsis

Request available debugging information from agent

cilium debuginfo [flags]
Options
  -f, --file                      Redirect output to file(s)
      --file-per-command          Generate a single file per command
  -h, --help                      help for debuginfo
      --output strings            markdown| html| json| jsonpath='{}'
      --output-directory string   directory for files (if specified will use directory in which this command was ran)
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium endpoint

Manage endpoints

Synopsis

Manage endpoints

Options
  -h, --help   help for endpoint
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium endpoint config

View & modify endpoint configuration

Synopsis

View & modify endpoint configuration

cilium endpoint config <endpoint id> [<option>=(enable|disable) ...] [flags]
Examples
endpoint config 5421 DropNotification=false TraceNotification=false PolicyVerdictNotification=true
Options
  -h, --help            help for config
      --list-options    List available options
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium endpoint disconnect

Disconnect an endpoint from the network

Synopsis

Disconnect an endpoint from the network

cilium endpoint disconnect <endpoint-id> [flags]
Options
  -h, --help   help for disconnect
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium endpoint get

Display endpoint information

Synopsis

Display endpoint information

cilium endpoint get ( <endpoint identifier> | -l <endpoint labels> )  [flags]
Examples
cilium endpoint get 4598, cilium endpoint get pod-name:default:foobar, cilium endpoint get -l id.baz
Options
  -h, --help             help for get
  -l, --labels strings   list of labels
  -o, --output string    json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium endpoint health

View endpoint health

Synopsis

View endpoint health

cilium endpoint health <endpoint id> [flags]
Examples
cilium endpoint health 5421
Options
  -h, --help            help for health
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium endpoint labels

Manage label configuration of endpoint

Synopsis

Manage label configuration of endpoint

cilium endpoint labels [flags]
Options
  -a, --add strings      Add/enable labels
  -d, --delete strings   Delete/disable labels
  -h, --help             help for labels
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium endpoint list

List all endpoints

Synopsis

List all endpoints

cilium endpoint list [flags]
Options
  -h, --help            help for list
      --no-headers      Do not print headers
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium endpoint log

View endpoint status log

Synopsis

View endpoint status log

cilium endpoint log <endpoint id> [flags]
Examples
cilium endpoint log 5421
Options
  -h, --help            help for log
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium endpoint regenerate

Force regeneration of endpoint program

Synopsis

Force regeneration of endpoint program

cilium endpoint regenerate <endpoint-id> [flags]
Options
  -h, --help   help for regenerate
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium fqdn

Manage fqdn proxy

Synopsis

Manage fqdn proxy

cilium fqdn [flags]
Options
  -h, --help   help for fqdn
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium fqdn cache

Manage fqdn proxy cache

Synopsis

Manage fqdn proxy cache

cilium fqdn cache [flags]
Options
  -h, --help   help for cache
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium fqdn cache clean

Clean fqdn cache

Synopsis

Clean fqdn cache

cilium fqdn cache clean [flags]
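Examples
    # Illustrative: delete cached entries matching a pattern without prompting (the pattern is a placeholder)
    cilium fqdn cache clean -f -p "*.example.com"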
Options
  -f, --force                 Skip confirmation
  -h, --help                  help for clean
  -p, --matchpattern string   Delete cache entries with FQDNs that match matchpattern
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium fqdn cache list

List fqdn cache contents

Synopsis

List fqdn cache contents

cilium fqdn cache list [flags]
Options
  -e, --endpoint string       List cache entries for a specific endpoint id
  -h, --help                  help for list
  -p, --matchpattern string   List cache entries with FQDN that match matchpattern
  -o, --output string         json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium fqdn names

Show internal state Cilium has for DNS names / regexes

Synopsis

Show internal state Cilium has for DNS names / regexes

cilium fqdn names [flags]
Options
  -h, --help   help for names
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium identity

Manage security identities

Synopsis

Manage security identities

Options
  -h, --help   help for identity
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium identity get

Retrieve information about an identity

Synopsis

Retrieve information about an identity

cilium identity get [flags]
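Examples
    # Illustrative lookup by label; the label itself is a placeholder
    cilium identity get --label k8s:app=web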
Options
  -h, --help            help for get
      --label strings   Label to lookup
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium identity list

List identities

Synopsis

List identities

cilium identity list [LABELS] [flags]
Options
      --endpoints       list identities of locally managed endpoints
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium ip

Manage IP addresses and associated information

Synopsis

Manage IP addresses and associated information

Options
  -h, --help   help for ip
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium ip list

List IP addresses in the userspace IPcache

Synopsis

List IP addresses in the userspace IPcache

cilium ip list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
  -v, --verbose         Print all fields of ipcache
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO
  • cilium ip - Manage IP addresses and associated information

cilium kvstore

Direct access to the kvstore

Synopsis

Direct access to the kvstore

Options
  -h, --help              help for kvstore
      --kvstore string    kvstore type
      --kvstore-opt map   kvstore options (default map[])
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium kvstore delete

Delete a key

Synopsis

Delete a key

cilium kvstore delete [options] <key> [flags]
Examples
cilium kvstore delete --recursive foo
Options
  -h, --help        help for delete
      --recursive   Recursive lookup
Options inherited from parent commands
      --config string     config file (default is $HOME/.cilium.yaml)
  -D, --debug             Enable debug messages
  -H, --host string       URI to server-side API
      --kvstore string    kvstore type
      --kvstore-opt map   kvstore options (default map[])
SEE ALSO

cilium kvstore get

Retrieve a key

Synopsis

Retrieve a key

cilium kvstore get [options] <key> [flags]
Examples
cilium kvstore get --recursive foo
Options
  -h, --help            help for get
  -o, --output string   json| jsonpath='{}'
      --recursive       Recursive lookup
Options inherited from parent commands
      --config string     config file (default is $HOME/.cilium.yaml)
  -D, --debug             Enable debug messages
  -H, --host string       URI to server-side API
      --kvstore string    kvstore type
      --kvstore-opt map   kvstore options (default map[])
SEE ALSO

cilium kvstore set

Set a key and value

Synopsis

Set a key and value

cilium kvstore set [options] <key> [flags]
Examples
cilium kvstore set foo=bar
Options
  -h, --help           help for set
      --key string     Key
      --value string   Value
Options inherited from parent commands
      --config string     config file (default is $HOME/.cilium.yaml)
  -D, --debug             Enable debug messages
  -H, --host string       URI to server-side API
      --kvstore string    kvstore type
      --kvstore-opt map   kvstore options (default map[])
SEE ALSO

cilium map

Access userspace cached content of BPF maps

Synopsis

Access userspace cached content of BPF maps

Options
  -h, --help   help for map
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium map get

Display cached content of given BPF map

Synopsis

Display cached content of given BPF map

cilium map get <name> [flags]
Examples
cilium map get cilium_ipcache
Options
  -h, --help            help for get
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO
  • cilium map - Access userspace cached content of BPF maps

cilium map list

List all open BPF maps

Synopsis

List all open BPF maps

cilium map list [flags]
Examples
cilium map list
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
      --verbose         Print cache contents of all maps
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO
  • cilium map - Access userspace cached content of BPF maps

cilium metrics

Access metric status

Synopsis

Access metric status

Options
  -h, --help   help for metrics
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium metrics list

List all metrics

Synopsis

List all metrics

cilium metrics list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium monitor

Display BPF program events

Synopsis

The monitor displays notifications and events emitted by the BPF programs attached to endpoints and devices. This includes:

  • Dropped packet notifications
  • Captured packet traces
  • Policy verdict notifications
  • Debugging information
cilium monitor [flags]
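Examples
    # Illustrative: show only drop notifications, or all events related to one endpoint
    # (endpoint ID 5421 is a placeholder)
    cilium monitor --type drop
    cilium monitor --related-to 5421 -v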
Options
      --from []uint16           Filter by source endpoint id
  -h, --help                    help for monitor
      --hex                     Do not dissect, print payload in HEX
  -j, --json                    Enable json output. Shadows -v flag
      --monitor-socket string   Configure monitor socket path
      --related-to []uint16     Filter by either source or destination endpoint id
      --to []uint16             Filter by destination endpoint id
  -t, --type []string           Filter by event types [agent capture debug drop l7 policy-verdict trace]
  -v, --verbose bools[=false]   Enable verbose output (-v, -vv) (default [])
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium node

Manage cluster nodes

Synopsis

Manage cluster nodes

Options
  -h, --help   help for node
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium node list

List nodes

Synopsis

List nodes

cilium node list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium policy

Manage security policies

Synopsis

Manage security policies

Options
  -h, --help   help for policy
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium policy delete

Delete policy rules

Synopsis

Delete policy rules

cilium policy delete [<labels>] [flags]
Options
      --all             Delete all policies
  -h, --help            help for delete
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium policy get

Display policy node information

Synopsis

Display policy node information

cilium policy get [<labels>] [flags]
Options
  -h, --help            help for get
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium policy import

Import security policy in JSON format

Synopsis

Import security policy in JSON format

cilium policy import <path> [flags]
Examples
  cilium policy import ~/policy.json
  cilium policy import ./policies/app/
Options
  -h, --help            help for import
  -o, --output string   json| jsonpath='{}'
      --print           Print policy after import
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium policy selectors

Display cached information about selectors

Synopsis

Display cached information about selectors

cilium policy selectors [flags]
Options
  -h, --help            help for selectors
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium policy trace

Trace a policy decision

Synopsis

Verifies whether the source is allowed to consume the destination. Source / destination can be provided as an endpoint ID, security ID, Kubernetes Pod, YAML file, or set of LABELs. A LABEL is represented as SOURCE:KEY[=VALUE]. dports can be, for example: 80/tcp, 53 or 23/udp. If multiple sources and / or destinations are provided, each source is tested for whether there is a policy allowing traffic between it and each destination. --src-k8s-pod and --dst-k8s-pod require cilium-agent to be running with the disable-endpoint-crd option set to "false".

cilium policy trace ( -s <label context> | --src-identity <security identity> | --src-endpoint <endpoint ID> | --src-k8s-pod <namespace:pod-name> | --src-k8s-yaml <path to YAML file> ) ( -d <label context> | --dst-identity <security identity> | --dst-endpoint <endpoint ID> | --dst-k8s-pod <namespace:pod-name> | --dst-k8s-yaml <path to YAML file>) [--dport <port>[/<protocol>] [flags]
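Examples
    # Illustrative trace between two label contexts on a given port; the labels are placeholders
    cilium policy trace -s k8s:app=frontend -d k8s:app=backend --dport 80/tcp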
Options
      --dport strings         L4 destination port to search on outgoing traffic of the source label context and on incoming traffic of the destination label context
  -d, --dst strings           Destination label context
      --dst-endpoint string   Destination endpoint
      --dst-identity int      Destination identity (default -1)
      --dst-k8s-pod string    Destination k8s pod ([namespace:]podname)
      --dst-k8s-yaml string   Path to YAML file for destination
  -h, --help                  help for trace
  -o, --output string         json| jsonpath='{}'
  -s, --src strings           Source label context
      --src-endpoint string   Source endpoint
      --src-identity int      Source identity (default -1)
      --src-k8s-pod string    Source k8s pod ([namespace:]podname)
      --src-k8s-yaml string   Path to YAML file for source
  -v, --verbose               Set tracing to TRACE_VERBOSE
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium policy validate

Validate a policy

Synopsis

Validate a policy

cilium policy validate <path> [flags]
Options
  -h, --help      help for validate
      --print     Print policy after validation
  -v, --verbose   Enable verbose output (default true)
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium policy wait

Wait for all endpoints to have updated to a given policy revision

Synopsis

Wait for all endpoints to have updated to a given policy revision

cilium policy wait <revision> [flags]
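Examples
    # Illustrative: wait up to 10 minutes for all endpoints to reach policy revision 42 (a placeholder revision)
    cilium policy wait 42 --max-wait-time 600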
Options
      --fail-wait-time int   Wait time after which the command fails if endpoint regeneration fails (seconds) (default 60)
  -h, --help                 help for wait
      --max-wait-time int    Wait time after which command fails (seconds) (default 360)
      --sleep-time int       Sleep interval between checks (seconds) (default 1)
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium prefilter

Manage XDP CIDR filters

Synopsis

Manage XDP CIDR filters

Options
  -h, --help   help for prefilter
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium prefilter delete

Delete CIDR filters

Synopsis

Delete CIDR filters

cilium prefilter delete [flags]
Options
      --cidr strings    List of CIDR prefixes to delete
  -h, --help            help for delete
      --revision uint   Update revision
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium prefilter list

List CIDR filters

Synopsis

List CIDR filters

cilium prefilter list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium prefilter update

Update CIDR filters

Synopsis

Update CIDR filters

cilium prefilter update [flags]
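Examples
    # Illustrative: block two documentation CIDR prefixes (placeholders) under update revision 1
    cilium prefilter update --revision 1 --cidr 192.0.2.0/24,198.51.100.0/24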
Options
      --cidr strings    List of CIDR prefixes to block
  -h, --help            help for update
      --revision uint   Update revision
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium preflight

cilium upgrade helper

Synopsis

CLI to help upgrade cilium

Options
  -h, --help   help for preflight
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium preflight fqdn-poller

Prepare for DNS Polling upgrades to cilium 1.4

Synopsis

Prepare for DNS Polling upgrades to cilium 1.4 by creating a placeholder --tofqdns-pre-cache file that can be used to pre-seed the DNS cache used in toFQDNs rules. This is useful when upgrading cilium with DNS Polling policies where an interruption in allowed IPs is undesirable. It may also be used when switching from DNS Polling based DNS discovery to DNS Proxy based discovery, where an endpoint may not make a DNS request soon enough to be used by toFQDNs policy rules.

cilium preflight fqdn-poller [flags]
Options
  -h, --help                        help for fqdn-poller
      --tofqdns-pre-cache string    The path to write serialized ToFQDNs pre-cache information. stdout is the default
      --tofqdns-pre-cache-ttl int   TTL, in seconds, to set on generated ToFQDNs pre-cache information (default 604800)
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium preflight migrate-identity

Migrate KVStore-backed identities to Kubernetes CRD-backed identities

Synopsis

migrate-identity allows migrating to CRD-backed identities while minimizing connection interruptions. It will allocate a CRD-backed identity, with the same numeric security identity, for each cilium security identity defined in the kvstore. When cilium-agents are restarted with identity-allocation-mode set to CRD the numeric identities will then be equivalent between new instances and not-upgraded ones. In cases where the numeric identity is already in-use by a different set of labels, a new numeric identity is created.

cilium preflight migrate-identity [flags]
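Examples
    # Illustrative sketch: the etcd configuration and kubeconfig paths are placeholders,
    # and etcd.config is assumed to be a valid kvstore option
    cilium preflight migrate-identity --kvstore etcd \
        --kvstore-opt etcd.config=/var/lib/etcd-config/etcd.config \
        --k8s-kubeconfig-path ~/.kube/config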
Options
  -h, --help                         help for migrate-identity
      --k8s-api-server string        Kubernetes api address server (for https use --k8s-kubeconfig-path instead)
      --k8s-kubeconfig-path string   Absolute path of the kubernetes kubeconfig file
      --kvstore string               Key-value store type
      --kvstore-opt map              Key-value store options (default map[])
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium preflight validate-cnp

Validate Cilium Network Policies deployed in the cluster

Synopsis

Before upgrading Cilium it is recommended to run this validation checker to make sure the deployed policies are valid. The validator verifies whether all policies deployed in the cluster are valid; if they are not, an error is printed and an exit code of -1 is returned.

cilium preflight validate-cnp [flags]
Options
  -h, --help                         help for validate-cnp
      --k8s-api-server string        Kubernetes api address server (for https use --k8s-kubeconfig-path instead)
      --k8s-kubeconfig-path string   Absolute path of the kubernetes kubeconfig file
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium service

Manage services & loadbalancers

Synopsis

Manage services & loadbalancers

Options
  -h, --help   help for service
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium service delete

Delete a service

Synopsis

Delete a service

cilium service delete { <service id> | --all } [flags]
Options
      --all    Delete all services
  -h, --help   help for delete
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium service get

Display service information

Synopsis

Display service information

cilium service get <service id> [flags]
Options
  -h, --help            help for get
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium service list

List services

Synopsis

List services

cilium service list [flags]
Options
  -h, --help            help for list
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium service update

Update a service

Synopsis

Update a service

cilium service update [flags]
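Examples

A minimal sketch, with a placeholder service ID and placeholder addresses, configuring a frontend that is balanced across two backends:

    $ cilium service update --id 1 \
          --frontend 10.0.0.1:80 \
          --backends 10.1.0.1:8080,10.1.0.2:8080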
Options
      --backends strings            Backend address or addresses (<IP:Port>)
      --frontend string             Frontend address
  -h, --help                        help for update
      --id uint                     Identifier
      --k8s-external                Set service as a k8s ExternalIPs
      --k8s-host-port               Set service as a k8s HostPort
      --k8s-node-port               Set service as a k8s NodePort
      --k8s-traffic-policy string   Set service with k8s externalTrafficPolicy as {Local,Cluster} (default "Cluster")
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium status

Display status of daemon

Synopsis

Display status of daemon

cilium status [flags]
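Examples

Illustrative invocations combining the flags listed below:

    # Print a one-line summary of the agent status
    $ cilium status --brief

    # Verbose output (equivalent to --all-addresses --all-controllers --all-nodes --all-health)
    $ cilium status --verbose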
Options
      --all-addresses      Show all allocated addresses, not just count
      --all-clusters       Show all clusters
      --all-controllers    Show all controllers, not just failing
      --all-health         Show all health status, not just failing
      --all-nodes          Show all nodes, not just localhost
      --all-redirects      Show all redirects
      --brief              Only print a one-line status message
  -h, --help               help for status
  -o, --output string      json| jsonpath='{}'
      --timeout duration   Sets the timeout to use when querying for health (default 30s)
      --verbose            Equivalent to --all-addresses --all-controllers --all-nodes --all-health
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium version

Print version information

Synopsis

Print version information

cilium version [flags]
Options
  -h, --help            help for version
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
      --config string   config file (default is $HOME/.cilium.yaml)
  -D, --debug           Enable debug messages
  -H, --host string     URI to server-side API
SEE ALSO

cilium-bugtool

Collects agent & system information useful for bug reporting

Synopsis

Collects agent & system information useful for bug reporting

cilium-bugtool [OPTIONS] [flags]
Examples
    # Collect information and create archive file
    $ cilium-bugtool
    [...]

    # Collect and retrieve archive if Cilium is running in a Kubernetes pod
    $ kubectl get pods --namespace kube-system
    NAME                          READY     STATUS    RESTARTS   AGE
    cilium-kg8lv                  1/1       Running   0          13m
    [...]
    $ kubectl -n kube-system exec cilium-kg8lv cilium-bugtool
    $ kubectl cp kube-system/cilium-kg8lv:/tmp/cilium-bugtool-243785589.tar /tmp/cilium-bugtool-243785589.tar
Options
      --archive                   Create archive when false skips deletion of the output directory (default true)
      --archive-prefix string     String to prefix to name of archive if created (e.g., with cilium pod-name)
  -o, --archiveType string        Archive type: tar | gz (default "tar")
      --config string             Configuration to decide what should be run (default "./.cilium-bugtool.config")
      --dry-run                   Create configuration file of all commands that would have been executed
      --enable-markdown           Dump output of commands in markdown format
      --exec-timeout duration     The default timeout for any cmd execution in seconds (default 30s)
      --get-pprof                 When set, only gets the pprof traces from the cilium-agent binary
  -h, --help                      help for cilium-bugtool
  -H, --host string               URI to server-side API
      --k8s-label string          Kubernetes label for Cilium pod (default "k8s-app=cilium")
      --k8s-mode                  Require Kubernetes pods to be found or fail
      --k8s-namespace string      Kubernetes namespace for Cilium pod (default "kube-system")
      --pprof-port int            Port on which pprof server is exposed (default 6060)
      --pprof-trace-seconds int   Amount of seconds used for pprof CPU traces (default 180)
  -t, --tmp string                Path to store extracted files (default "/tmp")

cilium-health get

Display local cilium agent status

Synopsis

Display local cilium agent status

cilium-health get [flags]
Options
  -h, --help            help for get
  -o, --output string   json| jsonpath='{}'
Options inherited from parent commands
  -D, --debug                Enable debug messages
  -H, --host string          URI to cilium-health server API
      --log-driver strings   Logging endpoints to use for example syslog
      --log-opt map          Log driver options for cilium-health (default map[])
SEE ALSO

cilium-health ping

Check whether the cilium-health API is up

Synopsis

Check whether the cilium-health API is up

cilium-health ping [flags]
Options
  -h, --help   help for ping
Options inherited from parent commands
  -D, --debug                Enable debug messages
  -H, --host string          URI to cilium-health server API
      --log-driver strings   Logging endpoints to use for example syslog
      --log-opt map          Log driver options for cilium-health (default map[])
SEE ALSO

cilium-health status

Display cilium connectivity to other nodes

Synopsis

Display cilium connectivity to other nodes

cilium-health status [flags]
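Examples

An illustrative combination of the flags listed below, forcing a fresh probe and printing one node per line:

    $ cilium-health status --probe --succinct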
Options
  -h, --help            help for status
  -o, --output string   json| jsonpath='{}'
      --probe           Synchronously probe connectivity status
      --succinct        Print the result succinctly (one node per line)
      --verbose         Print more information in results
Options inherited from parent commands
  -D, --debug                Enable debug messages
  -H, --host string          URI to cilium-health server API
      --log-driver strings   Logging endpoints to use for example syslog
      --log-opt map          Log driver options for cilium-health (default map[])
SEE ALSO

cilium-health

Cilium Health Client

Synopsis

Client for querying the Cilium health status API

cilium-health [flags]

Options

  -D, --debug                Enable debug messages
  -h, --help                 help for cilium-health
  -H, --host string          URI to cilium-health server API
      --log-driver strings   Logging endpoints to use for example syslog
      --log-opt map          Log driver options for cilium-health (default map[])

SEE ALSO

cilium-operator

Run cilium-operator

Synopsis

Run cilium-operator

cilium-operator [flags]

Options

      --aws-instance-limit-mapping map          Add or overwrite mappings of AWS instance limit in the form of {"AWS instance type": "Maximum Network Interfaces","IPv4 Addresses per Interface","IPv6 Addresses per Interface"}. cli example: --aws-instance-limit-mapping=a1.medium=2,4,4 --aws-instance-limit-mapping=a2.somecustomflavor=4,5,6 configmap example: {"a1.medium": "2,4,4", "a2.somecustomflavor": "4,5,6"} (default map[])
      --aws-release-excess-ips                  Enable releasing excess free IP addresses from AWS ENI.
      --azure-resource-group string             Resource group to use for Azure IPAM
      --azure-subscription-id string            Subscription ID to access Azure API
      --cilium-endpoint-gc-interval duration    GC interval for cilium endpoints (default 5m0s)
      --cluster-id int                          Unique identifier of the cluster
      --cluster-name string                     Name of the cluster (default "default")
      --cluster-pool-ipv4-cidr string           IPv4 CIDR Range for Pods in cluster. Requires 'ipam=cluster-pool' and 'enable-ipv4=true'
      --cluster-pool-ipv4-mask-size int         Mask size for each IPv4 podCIDR per node. Requires 'ipam=cluster-pool' and 'enable-ipv4=true' (default 24)
      --cluster-pool-ipv6-cidr string           IPv6 CIDR Range for Pods in cluster. Requires 'ipam=cluster-pool' and 'enable-ipv6=true'
      --cluster-pool-ipv6-mask-size int         Mask size for each IPv6 podCIDR per node. Requires 'ipam=cluster-pool' and 'enable-ipv6=true' (default 112)
      --cnp-node-status-gc-interval duration    GC interval for nodes which have been removed from the cluster in CiliumNetworkPolicy Status (default 2m0s)
      --cnp-status-update-interval duration     Interval between CNP status updates sent to the k8s-apiserver per-CNP (default 1s)
      --config string                           Configuration file (default "$HOME/ciliumd.yaml")
      --config-dir string                       Configuration directory that contains a file for each option
  -D, --debug                                   Enable debugging mode
      --enable-ipv4                             Enable IPv4 support (default true)
      --enable-ipv6                             Enable IPv6 support (default true)
      --enable-k8s-api-discovery                Enable discovery of Kubernetes API groups and resources with the discovery API
      --enable-k8s-endpoint-slice               Enables k8s EndpointSlice feature into Cilium-Operator if the k8s cluster supports it (default true)
      --enable-metrics                          Enable Prometheus metrics
      --eni-tags map                            ENI tags in the form of k1=v1 (multiple k/v pairs can be passed by repeating the CLI flag) (default map[])
  -h, --help                                    help for cilium-operator
      --identity-allocation-mode string         Method to use for identity allocation (default "kvstore")
      --identity-gc-interval duration           GC interval for security identities (default 15m0s)
      --identity-heartbeat-timeout duration     Timeout after which identity expires on lack of heartbeat (default 30m0s)
      --ipam string                             Backend to use for IPAM (default "hostscope-legacy")
      --k8s-api-server string                   Kubernetes API server URL
      --k8s-client-burst int                    Burst value allowed for the K8s client
      --k8s-client-qps float32                  Queries per second limit for the K8s client
      --k8s-heartbeat-timeout duration          Configures the timeout for api-server heartbeat, set to 0 to disable (default 30s)
      --k8s-kubeconfig-path string              Absolute path of the kubernetes kubeconfig file
      --k8s-namespace string                    Name of the Kubernetes namespace in which Cilium Operator is deployed in
      --kvstore string                          Key-value store type
      --kvstore-opt map                         Key-value store options (default map[])
      --limit-ipam-api-burst int                Upper burst limit when accessing external APIs (default 4)
      --limit-ipam-api-qps float                Queries per second limit when accessing external IPAM APIs (default 20)
      --nodes-gc-interval duration              GC interval for nodes store in the kvstore (default 2m0s)
      --operator-api-serve-addr string          Address to serve API requests (default "localhost:9234")
      --operator-prometheus-serve-addr string   Address to serve Prometheus metrics (default ":6942")
      --parallel-alloc-workers int              Maximum number of parallel IPAM workers (default 50)
      --subnet-ids-filter strings               Subnets IDs (separated by commas)
      --subnet-tags-filter stringToString       Subnets tags in the form of k1=v1,k2=v2 (multiple k/v pairs can also be passed by repeating the CLI flag (default [])
      --synchronize-k8s-nodes                   Synchronize Kubernetes nodes to kvstore and perform CNP GC (default true)
      --synchronize-k8s-services                Synchronize Kubernetes services to kvstore (default true)
      --unmanaged-pod-watcher-interval int      Interval to check for unmanaged kube-dns pods (0 to disable) (default 15)
      --update-ec2-apdater-limit-via-api        Use the EC2 API to update the instance type to adapter limits
      --version                                 Print version information

cilium-operator-aws

Run cilium-operator-aws

Synopsis

Run cilium-operator-aws

cilium-operator-aws [flags]

Options

      --aws-instance-limit-mapping map          Add or overwrite mappings of AWS instance limit in the form of {"AWS instance type": "Maximum Network Interfaces","IPv4 Addresses per Interface","IPv6 Addresses per Interface"}. cli example: --aws-instance-limit-mapping=a1.medium=2,4,4 --aws-instance-limit-mapping=a2.somecustomflavor=4,5,6 configmap example: {"a1.medium": "2,4,4", "a2.somecustomflavor": "4,5,6"} (default map[])
      --aws-release-excess-ips                  Enable releasing excess free IP addresses from AWS ENI.
      --cilium-endpoint-gc-interval duration    GC interval for cilium endpoints (default 5m0s)
      --cluster-id int                          Unique identifier of the cluster
      --cluster-name string                     Name of the cluster (default "default")
      --cluster-pool-ipv4-cidr string           IPv4 CIDR Range for Pods in cluster. Requires 'ipam=cluster-pool' and 'enable-ipv4=true'
      --cluster-pool-ipv4-mask-size int         Mask size for each IPv4 podCIDR per node. Requires 'ipam=cluster-pool' and 'enable-ipv4=true' (default 24)
      --cluster-pool-ipv6-cidr string           IPv6 CIDR Range for Pods in cluster. Requires 'ipam=cluster-pool' and 'enable-ipv6=true'
      --cluster-pool-ipv6-mask-size int         Mask size for each IPv6 podCIDR per node. Requires 'ipam=cluster-pool' and 'enable-ipv6=true' (default 112)
      --cnp-node-status-gc-interval duration    GC interval for nodes which have been removed from the cluster in CiliumNetworkPolicy Status (default 2m0s)
      --cnp-status-update-interval duration     Interval between CNP status updates sent to the k8s-apiserver per-CNP (default 1s)
      --config string                           Configuration file (default "$HOME/ciliumd.yaml")
      --config-dir string                       Configuration directory that contains a file for each option
  -D, --debug                                   Enable debugging mode
      --enable-ipv4                             Enable IPv4 support (default true)
      --enable-ipv6                             Enable IPv6 support (default true)
      --enable-k8s-api-discovery                Enable discovery of Kubernetes API groups and resources with the discovery API
      --enable-k8s-endpoint-slice               Enables k8s EndpointSlice feature into Cilium-Operator if the k8s cluster supports it (default true)
      --enable-metrics                          Enable Prometheus metrics
      --eni-tags map                            ENI tags in the form of k1=v1 (multiple k/v pairs can be passed by repeating the CLI flag) (default map[])
  -h, --help                                    help for cilium-operator-aws
      --identity-allocation-mode string         Method to use for identity allocation (default "kvstore")
      --identity-gc-interval duration           GC interval for security identities (default 15m0s)
      --identity-heartbeat-timeout duration     Timeout after which identity expires on lack of heartbeat (default 30m0s)
      --ipam string                             Backend to use for IPAM (default "eni")
      --k8s-api-server string                   Kubernetes API server URL
      --k8s-client-burst int                    Burst value allowed for the K8s client
      --k8s-client-qps float32                  Queries per second limit for the K8s client
      --k8s-heartbeat-timeout duration          Configures the timeout for api-server heartbeat, set to 0 to disable (default 30s)
      --k8s-kubeconfig-path string              Absolute path of the kubernetes kubeconfig file
      --k8s-namespace string                    Name of the Kubernetes namespace in which Cilium Operator is deployed in
      --kvstore string                          Key-value store type
      --kvstore-opt map                         Key-value store options (default map[])
      --limit-ipam-api-burst int                Upper burst limit when accessing external APIs (default 4)
      --limit-ipam-api-qps float                Queries per second limit when accessing external IPAM APIs (default 20)
      --nodes-gc-interval duration              GC interval for nodes store in the kvstore (default 2m0s)
      --operator-api-serve-addr string          Address to serve API requests (default "localhost:9234")
      --operator-prometheus-serve-addr string   Address to serve Prometheus metrics (default ":6942")
      --parallel-alloc-workers int              Maximum number of parallel IPAM workers (default 50)
      --subnet-ids-filter strings               Subnets IDs (separated by commas)
      --subnet-tags-filter stringToString       Subnets tags in the form of k1=v1,k2=v2 (multiple k/v pairs can also be passed by repeating the CLI flag (default [])
      --synchronize-k8s-nodes                   Synchronize Kubernetes nodes to kvstore and perform CNP GC (default true)
      --synchronize-k8s-services                Synchronize Kubernetes services to kvstore (default true)
      --unmanaged-pod-watcher-interval int      Interval to check for unmanaged kube-dns pods (0 to disable) (default 15)
      --update-ec2-apdater-limit-via-api        Use the EC2 API to update the instance type to adapter limits
      --version                                 Print version information

cilium-operator-azure

Run cilium-operator-azure

Synopsis

Run cilium-operator-azure

cilium-operator-azure [flags]

Options

      --azure-resource-group string             Resource group to use for Azure IPAM
      --azure-subscription-id string            Subscription ID to access Azure API
      --cilium-endpoint-gc-interval duration    GC interval for cilium endpoints (default 5m0s)
      --cluster-id int                          Unique identifier of the cluster
      --cluster-name string                     Name of the cluster (default "default")
      --cluster-pool-ipv4-cidr string           IPv4 CIDR Range for Pods in cluster. Requires 'ipam=cluster-pool' and 'enable-ipv4=true'
      --cluster-pool-ipv4-mask-size int         Mask size for each IPv4 podCIDR per node. Requires 'ipam=cluster-pool' and 'enable-ipv4=true' (default 24)
      --cluster-pool-ipv6-cidr string           IPv6 CIDR Range for Pods in cluster. Requires 'ipam=cluster-pool' and 'enable-ipv6=true'
      --cluster-pool-ipv6-mask-size int         Mask size for each IPv6 podCIDR per node. Requires 'ipam=cluster-pool' and 'enable-ipv6=true' (default 112)
      --cnp-node-status-gc-interval duration    GC interval for nodes which have been removed from the cluster in CiliumNetworkPolicy Status (default 2m0s)
      --cnp-status-update-interval duration     Interval between CNP status updates sent to the k8s-apiserver per-CNP (default 1s)
      --config string                           Configuration file (default "$HOME/ciliumd.yaml")
      --config-dir string                       Configuration directory that contains a file for each option
  -D, --debug                                   Enable debugging mode
      --enable-ipv4                             Enable IPv4 support (default true)
      --enable-ipv6                             Enable IPv6 support (default true)
      --enable-k8s-api-discovery                Enable discovery of Kubernetes API groups and resources with the discovery API
      --enable-k8s-endpoint-slice               Enables k8s EndpointSlice feature into Cilium-Operator if the k8s cluster supports it (default true)
      --enable-metrics                          Enable Prometheus metrics
  -h, --help                                    help for cilium-operator-azure
      --identity-allocation-mode string         Method to use for identity allocation (default "kvstore")
      --identity-gc-interval duration           GC interval for security identities (default 15m0s)
      --identity-heartbeat-timeout duration     Timeout after which identity expires on lack of heartbeat (default 30m0s)
      --ipam string                             Backend to use for IPAM (default "azure")
      --k8s-api-server string                   Kubernetes API server URL
      --k8s-client-burst int                    Burst value allowed for the K8s client
      --k8s-client-qps float32                  Queries per second limit for the K8s client
      --k8s-heartbeat-timeout duration          Configures the timeout for api-server heartbeat, set to 0 to disable (default 30s)
      --k8s-kubeconfig-path string              Absolute path of the kubernetes kubeconfig file
      --k8s-namespace string                    Name of the Kubernetes namespace in which Cilium Operator is deployed in
      --kvstore string                          Key-value store type
      --kvstore-opt map                         Key-value store options (default map[])
      --limit-ipam-api-burst int                Upper burst limit when accessing external APIs (default 4)
      --limit-ipam-api-qps float                Queries per second limit when accessing external IPAM APIs (default 20)
      --nodes-gc-interval duration              GC interval for nodes store in the kvstore (default 2m0s)
      --operator-api-serve-addr string          Address to serve API requests (default "localhost:9234")
      --operator-prometheus-serve-addr string   Address to serve Prometheus metrics (default ":6942")
      --parallel-alloc-workers int              Maximum number of parallel IPAM workers (default 50)
      --subnet-ids-filter strings               Subnets IDs (separated by commas)
      --subnet-tags-filter stringToString       Subnets tags in the form of k1=v1,k2=v2 (multiple k/v pairs can also be passed by repeating the CLI flag (default [])
      --synchronize-k8s-nodes                   Synchronize Kubernetes nodes to kvstore and perform CNP GC (default true)
      --synchronize-k8s-services                Synchronize Kubernetes services to kvstore (default true)
      --unmanaged-pod-watcher-interval int      Interval to check for unmanaged kube-dns pods (0 to disable) (default 15)
      --update-ec2-apdater-limit-via-api        Use the EC2 API to update the instance type to adapter limits
      --version                                 Print version information

cilium-operator-generic

Run cilium-operator-generic

Synopsis

Run cilium-operator-generic

cilium-operator-generic [flags]

Options

      --cilium-endpoint-gc-interval duration    GC interval for cilium endpoints (default 5m0s)
      --cluster-id int                          Unique identifier of the cluster
      --cluster-name string                     Name of the cluster (default "default")
      --cluster-pool-ipv4-cidr string           IPv4 CIDR Range for Pods in cluster. Requires 'ipam=cluster-pool' and 'enable-ipv4=true'
      --cluster-pool-ipv4-mask-size int         Mask size for each IPv4 podCIDR per node. Requires 'ipam=cluster-pool' and 'enable-ipv4=true' (default 24)
      --cluster-pool-ipv6-cidr string           IPv6 CIDR Range for Pods in cluster. Requires 'ipam=cluster-pool' and 'enable-ipv6=true'
      --cluster-pool-ipv6-mask-size int         Mask size for each IPv6 podCIDR per node. Requires 'ipam=cluster-pool' and 'enable-ipv6=true' (default 112)
      --cnp-node-status-gc-interval duration    GC interval for nodes which have been removed from the cluster in CiliumNetworkPolicy Status (default 2m0s)
      --cnp-status-update-interval duration     Interval between CNP status updates sent to the k8s-apiserver per-CNP (default 1s)
      --config string                           Configuration file (default "$HOME/ciliumd.yaml")
      --config-dir string                       Configuration directory that contains a file for each option
  -D, --debug                                   Enable debugging mode
      --enable-ipv4                             Enable IPv4 support (default true)
      --enable-ipv6                             Enable IPv6 support (default true)
      --enable-k8s-api-discovery                Enable discovery of Kubernetes API groups and resources with the discovery API
      --enable-k8s-endpoint-slice               Enables k8s EndpointSlice feature into Cilium-Operator if the k8s cluster supports it (default true)
      --enable-metrics                          Enable Prometheus metrics
  -h, --help                                    help for cilium-operator-generic
      --identity-allocation-mode string         Method to use for identity allocation (default "kvstore")
      --identity-gc-interval duration           GC interval for security identities (default 15m0s)
      --identity-heartbeat-timeout duration     Timeout after which identity expires on lack of heartbeat (default 30m0s)
      --ipam string                             Backend to use for IPAM (default "cluster-pool")
      --k8s-api-server string                   Kubernetes API server URL
      --k8s-client-burst int                    Burst value allowed for the K8s client
      --k8s-client-qps float32                  Queries per second limit for the K8s client
      --k8s-heartbeat-timeout duration          Configures the timeout for api-server heartbeat, set to 0 to disable (default 30s)
      --k8s-kubeconfig-path string              Absolute path of the kubernetes kubeconfig file
      --k8s-namespace string                    Name of the Kubernetes namespace in which Cilium Operator is deployed in
      --kvstore string                          Key-value store type
      --kvstore-opt map                         Key-value store options (default map[])
      --limit-ipam-api-burst int                Upper burst limit when accessing external APIs (default 4)
      --limit-ipam-api-qps float                Queries per second limit when accessing external IPAM APIs (default 20)
      --nodes-gc-interval duration              GC interval for nodes store in the kvstore (default 2m0s)
      --operator-api-serve-addr string          Address to serve API requests (default "localhost:9234")
      --operator-prometheus-serve-addr string   Address to serve Prometheus metrics (default ":6942")
      --parallel-alloc-workers int              Maximum number of parallel IPAM workers (default 50)
      --subnet-ids-filter strings               Subnets IDs (separated by commas)
      --subnet-tags-filter stringToString       Subnets tags in the form of k1=v1,k2=v2 (multiple k/v pairs can also be passed by repeating the CLI flag (default [])
      --synchronize-k8s-nodes                   Synchronize Kubernetes nodes to kvstore and perform CNP GC (default true)
      --synchronize-k8s-services                Synchronize Kubernetes services to kvstore (default true)
      --unmanaged-pod-watcher-interval int      Interval to check for unmanaged kube-dns pods (0 to disable) (default 15)
      --update-ec2-apdater-limit-via-api        Use the EC2 API to update the instance type to adapter limits
      --version                                 Print version information
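
Examples

The following is a sketch only; in most deployments these flags are set through the operator's configuration rather than on the command line, and the CIDR and kubeconfig path shown are placeholders:

    $ cilium-operator-generic \
          --ipam cluster-pool \
          --cluster-pool-ipv4-cidr 10.0.0.0/8 \
          --cluster-pool-ipv4-mask-size 24 \
          --k8s-kubeconfig-path /root/.kube/config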

Key-Value Store

Option               Description                             Default
--kvstore TYPE       Key Value Store Type: (consul, etcd)
--kvstore-opt OPTS   Key-value store options

consul

When using consul, the address of the consul agent needs to be provided with the consul.address option. The consul.tlsconfig option is optional and is only required for TLS authentication:

Option Type Description
consul.address Address Address of consul agent
consul.tlsconfig Path Path to a consul configuration file for client-server authentication

Example of the consul configuration file:

---
cafile: '/var/lib/cilium/consul-ca.pem'
keyfile: '/var/lib/cilium/client-key.pem'
certfile: '/var/lib/cilium/client.pem'
#insecureskipverify: true
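
Given such a file, the agent can be pointed at consul with the options above. The following is a sketch, assuming the agent binary is started directly as cilium-agent; the consul address and the path of the configuration file are placeholders:

    $ cilium-agent --kvstore consul \
          --kvstore-opt consul.address=127.0.0.1:8500 \
          --kvstore-opt consul.tlsconfig=/var/lib/cilium/consul-config.yaml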

etcd

When using etcd, one of the following options needs to be provided to configure the etcd endpoints:

Option Type Description
etcd.address Address Address of etcd endpoint
etcd.operator Boolean When set to true, Cilium will resolve the domain name of the etcd server from the associated Kubernetes service.
etcd.config Path Path to an etcd configuration file.

Example of the etcd configuration file:

---
endpoints:
- https://192.168.0.1:2379
- https://192.168.0.2:2379
trusted-ca-file: '/var/lib/cilium/etcd-ca.pem'
# In case you want client to server authentication
key-file: '/var/lib/cilium/etcd-client.key'
cert-file: '/var/lib/cilium/etcd-client.crt'
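
A corresponding sketch for the agent side, again assuming a direct cilium-agent invocation; the configuration file path is a placeholder for wherever the file above is stored:

    $ cilium-agent --kvstore etcd \
          --kvstore-opt etcd.config=/var/lib/cilium/etcd-config.yaml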

Key-Value Store

Cilium uses an external key-value store to exchange information across multiple Cilium instances:

Layout

All data is stored under a common key prefix:

Prefix Description
cilium/ All keys share this common prefix.
cilium/state/ State stored by agents, data is automatically recreated on removal or corruption.

Cluster Nodes

Every agent will register itself as a node in the kvstore and make the following information available to other agents:

  • Name
  • IP addresses of the node
  • Health checking IP addresses
  • Allocation range of endpoints on the node
Key Value
cilium/state/nodes/v1/<cluster>/<node> node.Node

All node keys are attached to a lease owned by the agent of the respective node.

Services

All Kubernetes services are mirrored into the kvstore by the Cilium operator. This is required to implement multi-cluster service discovery.

Key Value
cilium/state/services/v1/<cluster>/<namespace>/<service> serviceStore.ClusterService

Identities

Any time a new endpoint is started on a Cilium node, it will determine whether the labels for the endpoint are unique and allocate an identity for that set of labels. These identities are only meaningful within the local cluster.

Key Value
cilium/state/identities/v1/id/<identity> labels.LabelArray
cilium/state/identities/v1/value/<labels>/<node> identity.NumericIdentity

Endpoints

All endpoint IPs and corresponding identities are mirrored to the kvstore by the agent on the node where the endpoint is launched, to allow peer nodes to configure egress policies to endpoints backed by these IPs.

Key Value
cilium/state/ip/v1/<cluster>/<ip> identity.IPIdentityPair
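
For inspection, these IP-to-identity mappings can be dumped with the kvstore helper described in the Debugging section below; the exact output depends on the endpoints present in the cluster:

    $ cilium kvstore get --recursive cilium/state/ip/v1/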

CiliumNetworkPolicyNodeStatus

If handover to Kubernetes is enabled, then each cilium-agent will propagate the state of whether it has realized a given CNP to the key-value store instead of directly writing to kube-apiserver. cilium-operator will listen for updates to this prefix from the key-value store, and will be the sole updater of statuses for CNPs in the cluster.

Key Value
cilium/state/cnpstatuses/v2/<UID>/<namespace>/<name>/<node> k8s.CNPNSWithMeta

Leases

With a few exceptions, all keys in the key-value store are owned by a particular agent running on a node. All such keys have a lease attached. The lease is renewed automatically. When the lease expires, the key is removed from the key-value store. This guarantees that keys are removed from the key-value store in the event that the agent on a particular node dies and never reappears.

The lease lifetime is set to 15 minutes. The exact expiration behavior depends on the kvstore implementation, but expiration typically occurs after double the lifetime.

Debugging

The contents stored in the kvstore can be queried and manipulated using the cilium kvstore command. For additional details, see the command reference.

Example:

$ cilium kvstore get --recursive cilium/state/nodes/
cilium/state/nodes/v1/default/runtime1 => {"Name":"runtime1","IPAddresses":[{"AddressType":"InternalIP","IP":"10.0.2.15"}],"IPv4AllocCIDR":{"IP":"10.11.0.0","Mask":"//8AAA=="},"IPv6AllocCIDR":{"IP":"f00d::a0f:0:0:0","Mask":"//////////////////8AAA=="},"IPv4HealthIP":"","IPv6HealthIP":""}

Further Reading

Presentations

  • Fosdem, Brussels, 2020 - BPF as a revolutionary technology for the container landscape: Slides, Video
  • KubeCon, North America 2019 - Liberating Kubernetes from kube-proxy and iptables: Slides, Video
  • KubeCon, Europe 2019 - Using eBPF to Bring Kubernetes-Aware Security to the Linux Kernel: Video
  • KubeCon, Europe 2019 - Transparent Chaos Testing with Envoy , Cilium and BPF: Slides, Video
  • All Systems Go!, Berlin, Sept 2018 - Cilium - Bringing the BPF Revolution to Kubernetes Networking and Security Slides, Video
  • QCon, San Francisco 2018 - How to Make Linux Microservice-Aware with Cilium and eBPF: Slides, Video
  • KubeCon, North America 2018 - Connecting Kubernetes Clusters Across Cloud Providers: Slides, Video
  • KubeCon, North America 2018 - Implementing Least Privilege Security and Networking with BPF on Kubernetes: Slides, Video
  • KubeCon, Europe 2018 - Accelerating Envoy with the Linux Kernel: Video
  • Open Source Summit, North America - Cilium: Networking and security for containers with BPF and XDP: Video
  • DockerCon, Austin TX, Apr 2017 - Cilium - Network and Application Security with BPF and XDP: Slides, Video
  • CNCF/KubeCon Meetup, Berlin, Mar 2017 - Linux Native, HTTP Aware Network Security: Slides, Video
  • Docker Distributed Systems Summit, Berlin, Oct 2016: Slides, Video
  • NetDev1.2, Tokyo, Sep 2016 - cls_bpf/eBPF updates since netdev 1.1: Slides, Video
  • NetDev1.2, Tokyo, Sep 2016 - Advanced programmability and recent updates with tc’s cls_bpf: Slides, Video
  • ContainerCon NA, Toronto, Aug 2016 - Fast IPv6 container networking with BPF & XDP: Slides

Podcasts

  • Software Gone Wild by Ivan Pepelnjak, Oct 2016: Blog, MP3
  • OVS Orbit by Ben Pfaff, May 2016: Blog, MP3

Glossary

Cilium has some terms with special meanings. These should all be covered throughout the documentation but for convenience we have also listed some of them below with short descriptions. If you need more information, please ask us on Slack. Feel free to extend this document with words you expected to see here.

CNI
https://github.com/containernetworking/cni
ConfigMap
https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/
CustomResourceDefinition
https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#customresourcedefinitions
DaemonSet
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
Endpoint
See the Endpoint section in the Concepts documentation.
Geneve
https://tools.ietf.org/html/draft-ietf-nvo3-geneve-04
HeadlessServices
https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
Helm
https://helm.sh/
iproute2
https://www.kernel.org/pub/linux/utils/net/iproute2/
Linux kernel
https://www.kernel.org/
llvm
https://releases.llvm.org/
NodeSelector
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
Pod
https://kubernetes.io/docs/concepts/workloads/pods/pod/
Policy
A Cilium policy consists of a list of rules. The security policy can be specified in The Kubernetes NetworkPolicy format or The Cilium policy language.
RBAC
https://kubernetes.io/docs/reference/access-authn-authz/rbac/
Service
https://kubernetes.io/docs/concepts/services-networking/service/
Slack channel
Public community slack channel for everyone to ask questions https://cilium.herokuapp.com
Volumes
https://kubernetes.io/docs/tasks/configure-pod-container/configure-volume-storage/
VXLAN
https://tools.ietf.org/html/rfc7348