seantywork

container-kube-net

This write-up outlines how to install Kubernetes with the Cilium CNI on Google Cloud (the same steps also work on
local VMs) and explores how communication between pods on different nodes works.

Let’s suppose we created two VMs that can fully communicate with each other. We’re going to make one of them
a control node and the other a worker node.

The full script to turn a VM into a control node is available as node-ctl.sh.

At the start of the script, we’ll see the basic setup information defined as below.

HOME="/root" 
IP="10.168.0.2"
VERSION="1.33"
CILIUM_VERSION="1.17.4"

Pay attention to the version: if we’re going to use 1.33, keep in mind that
it’s already approaching EOL. The IP address there refers to the VM’s internal IP
assigned by the cloud provider (in this case Google Cloud).

If those are correctly configured, and the Kubernetes fellas didn’t mess up
the releases once again, the node-ctl.sh script should work and we’d
be able to see something like below.


root@node-0:~# kubectl get nodes
NAME     STATUS   ROLES           AGE   VERSION
node-0   Ready    control-plane   73s   v1.33.1

On the control node, we can create a token for other nodes to join the cluster.


root@node-0:~# kubeadm token create --print-join-command 
kubeadm join 10.168.0.2:6443 --token r61w3k.oom2m7zqt6m8p0fc --discovery-token-ca-cert-hash sha256:9dcf53ebff2089c12cf3af75e4540e58674ccd44282ba0285420852e4ebc5114 

On the other node, we can run node-wrk.sh to turn it into a worker node.
As with the control node, the variables at the start of the script should be
configured correctly. Once that is done, we can use the token we created on the control
node to join the worker node.


root@node-1:~# kubeadm join 10.168.0.2:6443 --token r61w3k.oom2m7zqt6m8p0fc --discovery-token-ca-cert-hash sha256:9dcf53ebff2089c12cf3af75e4540e58674ccd44282ba0285420852e4ebc5114 
[preflight] Running pre-flight checks
[preflight] Reading configuration from the "kubeadm-config" ConfigMap in namespace "kube-system"...
[preflight] Use 'kubeadm init phase upload-config --config your-config-file' to re-upload it.
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 1.002236822s
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Now, if we run kubectl get nodes again, we’ll see that the cluster is up and running.


root@node-0:~# kubectl get nodes
NAME                                             STATUS   ROLES           AGE     VERSION
node-0                                           Ready    control-plane   8m21s   v1.33.1
node-1.us-east4-b.c.vpn-server-422904.internal   Ready    <none>          47s     v1.33.1

Using the same procedure, we’re going to create one more worker node and join it
to the cluster as well.

root@node-2:~# kubeadm join 10.168.0.2:6443 --token r61w3k.oom2m7zqt6m8p0fc --discovery-token-ca-cert-hash sha256:9dcf53ebff2089c12cf3af75e4540e58674ccd44282ba0285420852e4ebc5114
[preflight] Running pre-flight checks
[preflight] Reading configuration from the "kubeadm-config" ConfigMap in namespace "kube-system"...
[preflight] Use 'kubeadm init phase upload-config --config your-config-file' to re-upload it.
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 1.003909562s
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Now we have a total of three nodes making up the cluster!


root@node-0:~# kubectl get nodes
NAME                                             STATUS   ROLES           AGE    VERSION
node-0                                           Ready    control-plane   12m    v1.33.1
node-1.us-east4-b.c.vpn-server-422904.internal   Ready    <none>          5m3s   v1.33.1
node-2.us-east4-b.c.vpn-server-422904.internal   Ready    <none>          51s    v1.33.1

For the purpose of this tutorial, where we want to track how a packet flows
between two nodes, we need a way to pin one container to node-1
and the other to node-2.

To do so, we can use node labelling provided by Kubernetes.


root@node-0:~# kubectl label node node-1.us-east4-b.c.vpn-server-422904.internal nodelabel=node-wrk-1 
node/node-1.us-east4-b.c.vpn-server-422904.internal labeled
root@node-0:~# kubectl label node node-2.us-east4-b.c.vpn-server-422904.internal nodelabel=node-wrk-2
node/node-2.us-east4-b.c.vpn-server-422904.internal labeled

Also, to add another layer of separation (though not needed for the purpose
of this tutorial), we’re going to create a namespace for each as well.


root@node-0:~# kubectl create namespace wrk-1
namespace/wrk-1 created
root@node-0:~# kubectl create namespace wrk-2
namespace/wrk-2 created
root@node-0:~# vim 1.yaml

Look at the YAML file below (also available in the directory), which we’re
going to apply. This is the YAML for creating a pod on node-1. The other YAML
file looks similar except for the names used.

Essentially, it opens port 9999 for both TCP and UDP so that other pods
can talk to this pod through the Service.

apiVersion: v1
kind: Service
metadata:
  name: node-wrk-1-ubuntu24
  labels:
    app: node-wrk-1-ubuntu24
spec:
  type: ClusterIP
  ports:
  - name: tcp-9999
    port: 9999
    targetPort: 9999
    protocol: TCP
  - name: udp-9999
    port: 9999
    targetPort: 9999
    protocol: UDP
  selector:
    app: node-wrk-1-ubuntu24
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-wrk-1-ubuntu24
spec:
  selector:
    matchLabels:
      app: node-wrk-1-ubuntu24
  replicas: 1
  template:
    metadata:
      labels:
        app: node-wrk-1-ubuntu24
    spec:
      containers:
        - name: node-wrk-1-ubuntu24
          image: docker.io/seantywork/ubuntu24
          imagePullPolicy: Always
          ports:
          - containerPort: 9999
            protocol: TCP
          - containerPort: 9999
            protocol: UDP
      nodeSelector:
        nodelabel: node-wrk-1

Now, let’s look at what exactly is going on inside the pod.

FROM ubuntu:24.04

ARG DEBIAN_FRONTEND=noninteractive

WORKDIR /workspace

RUN apt-get update 

RUN apt-get install -y ncat tshark

CMD ["tail", "-f","/dev/null"]

Well, nothing at all. That’s because what we want to do is capture the network packets
as they fly around, not actually run a service.

Let’s create the first pod on the worker node 1.

root@node-0:~# kubectl -n wrk-1 apply -f ./1.yaml 
service/node-wrk-1-ubuntu24 created
deployment.apps/node-wrk-1-ubuntu24 created

If successful, we should see the status below.


root@node-0:~# kubectl -n wrk-1 get pods 
NAME                                   READY   STATUS    RESTARTS   AGE
node-wrk-1-ubuntu24-684f7d8fd6-2zncq   1/1     Running   0          112s

Do the same thing for the second pod on the worker node 2.


root@node-0:~# kubectl -n wrk-2 apply -f ./2.yaml 
service/node-wrk-2-ubuntu24 created
deployment.apps/node-wrk-2-ubuntu24 created

# a few seconds later...

root@node-0:~# kubectl -n wrk-2 get pods 
NAME                                   READY   STATUS    RESTARTS   AGE
node-wrk-2-ubuntu24-85748464f7-mwmrt   1/1     Running   0          3m5s

To inspect packets on the host machines (not inside the pod), let’s install
tshark on each node.


# on node 1
root@node-1:~# apt update && apt install -y tshark
# on node 2
root@node-2:~# apt update && apt install -y tshark

When I set up a brand-new cluster, I prefer to restart coredns just in case.


root@node-0:~# kubectl -n kube-system rollout restart deployment coredns 

Now, we’re going to keep two terminals open, each with a persistent session inside one of the pods.

# terminal 1
root@node-0:~# kubectl -n wrk-1 get pods
NAME                                   READY   STATUS    RESTARTS   AGE
node-wrk-1-ubuntu24-684f7d8fd6-2zncq   1/1     Running   0          13m
root@node-0:~# kubectl -n wrk-1 exec -it node-wrk-1-ubuntu24-684f7d8fd6-2zncq -- /bin/bash
root@node-wrk-1-ubuntu24-684f7d8fd6-2zncq:/workspace# 

# terminal 2
root@node-0:~# kubectl -n wrk-2 get pods
NAME                                   READY   STATUS    RESTARTS   AGE
node-wrk-2-ubuntu24-85748464f7-mwmrt   1/1     Running   0          11m
root@node-0:~# kubectl -n wrk-2 exec -it node-wrk-2-ubuntu24-85748464f7-mwmrt -- /bin/bash
root@node-wrk-2-ubuntu24-85748464f7-mwmrt:/workspace# 

In this case, I’m going to run a simple TCP server inside the second pod and a client from
the first pod.

# run the server inside the pod 2, with port 9999
root@node-wrk-2-ubuntu24-85748464f7-mwmrt:/workspace# nc -l 0.0.0.0 9999

# connect to the server, with ${SERVICE_NAME}.${NAMESPACE_NAME}, then send whatever payload
# from the pod 1
root@node-wrk-1-ubuntu24-684f7d8fd6-2zncq:/workspace# nc node-wrk-2-ubuntu24.wrk-2 9999
asdfasdfasdfasd

# ...got the data on the pod 2!
root@node-wrk-2-ubuntu24-85748464f7-mwmrt:/workspace# nc -l 0.0.0.0 9999
asdfasdfasdfasd

To find out where exactly to put our mighty tshark to work, we need to figure out
which network interface our pods are using to communicate with each other. To do so,
we’re going to start a long-running process inside the pod, and then look for it in
each of the network namespaces (because pods are essentially glorified Linux network namespaces).

root@node-wrk-1-ubuntu24-684f7d8fd6-2zncq:/workspace# sleep 3000

Let’s see what namespaces we have on worker 1.


root@node-1:~# ip netns list
36322294-9a3a-47fd-8be4-6530f0123581 (id: 2)
5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18 (id: 1)
02aa7ed5-2d8b-44ff-9865-d1b2ef17665c
dbe865d4-e332-40cc-8d95-459445ff6574
a300e0fa-b79e-48b5-aabf-bd8bbcebc428

The entry 5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18 carries id: 1. Since there is no other pod we’ve created on worker 1,
my blind guess was that this might be the namespace we’re looking for, so I ran grep on the ps output from inside it.
The sleep process shows up, which is consistent with the guess, and the following steps confirm that this is indeed
the namespace we’re looking for.


root@node-1:~# ip netns exec 5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18 ps aux | grep sleep
root       85586  0.0  0.0   2696  1380 pts/1    S+   00:41   0:00 sleep 3000
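One caveat about the check above: ip netns exec switches the network namespace (plus a private mount namespace), not the PID namespace, so ps inside it still reads the host’s /proc and can list processes from other namespaces too. A more direct check is to ask the kernel which network namespace a given PID belongs to. The sketch below does that through /proc; the helper names netns_id_of_pid and same_netns are made up for illustration, and the commented ip netns identify / ip netns pids commands are the iproute2 equivalents to run as root on worker 1.

```shell
# Print the network-namespace identity of a PID as the kernel reports it,
# e.g. "net:[4026532567]". Two PIDs share a netns if these strings match.
netns_id_of_pid() {
  readlink "/proc/$1/ns/net"
}

# Hypothetical helper: do two PIDs live in the same network namespace?
same_netns() {
  [ "$(netns_id_of_pid "$1")" = "$(netns_id_of_pid "$2")" ]
}

# Demo that runs unprivileged on Linux: this shell compared with itself.
netns_id_of_pid "$$"
same_netns "$$" "$$" && echo "same namespace"

# On worker 1 (as root), with the PID of the sleep process from the ps output:
#   ip netns identify 85586
#   ip netns pids 5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18
```

If ip netns identify prints the namespace name we guessed, the guess is confirmed without relying on ps.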

To find out the interface IP address inside the pod, run the ip command.
It shows that when we communicate with services outside the pod, the source IP address will be 10.0.1.15.

root@node-1:~# ip netns exec 5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
8: eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default qlen 1000
    link/ether 4e:e0:85:d5:68:f9 brd ff:ff:ff:ff:ff:ff link-netns 02aa7ed5-2d8b-44ff-9865-d1b2ef17665c
    inet 10.0.1.15/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::4ce0:85ff:fed5:68f9/64 scope link 
       valid_lft forever preferred_lft forever

To find out which interface on the host is connected to our pod 1, we can look at the detailed link info. The
link-netns field shows that the namespace 5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18 is connected to the host
interface lxc7dc050ebabd6.

root@node-1:~# ip -d link show 
...
9: lxc7dc050ebabd6@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether a6:8a:15:36:dc:cb brd ff:ff:ff:ff:ff:ff link-netns 5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18 promiscuity 0  allmulti 0 minmtu 68 maxmtu 65535 
    veth addrgenmode eui64 numtxqueues 2 numrxqueues 2 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536 
11: lxc1e0b7c2c5527@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0e:9c:2c:8c:64:77 brd ff:ff:ff:ff:ff:ff link-netns 36322294-9a3a-47fd-8be4-6530f0123581 promiscuity 0  allmulti 0 minmtu 68 maxmtu 65535 
    veth addrgenmode eui64 numtxqueues 2 numrxqueues 2 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536

root@node-1:~# ip netns list
36322294-9a3a-47fd-8be4-6530f0123581 (id: 2)
5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18 (id: 1)
....
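Reading the link-netns field by eye gets tedious once a node hosts more pods, so the lookup can be scripted. The following is a small sketch that parses ip -d link show output; the helper name host_if_for_netns is hypothetical, and the sample input is trimmed from the output above.

```shell
# Hypothetical helper: read `ip -d link show` output on stdin and print the
# host interface whose link-netns field equals the given namespace name.
host_if_for_netns() {
  awk -v ns="$1" '
    # Interface header line, e.g. "9: lxc7dc050ebabd6@if8: <...>"
    /^[0-9]+: / { name = $2; sub(/@.*/, "", name); sub(/:$/, "", name) }
    # Any line carrying "link-netns <name>"
    {
      for (i = 1; i < NF; i++)
        if ($i == "link-netns" && $(i + 1) == ns) print name
    }
  '
}

# Demo on lines trimmed from the worker 1 output.
host_if_for_netns 5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18 <<'EOF'
9: lxc7dc050ebabd6@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether a6:8a:15:36:dc:cb brd ff:ff:ff:ff:ff:ff link-netns 5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18 promiscuity 0
11: lxc1e0b7c2c5527@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0e:9c:2c:8c:64:77 brd ff:ff:ff:ff:ff:ff link-netns 36322294-9a3a-47fd-8be4-6530f0123581 promiscuity 0
EOF
# prints: lxc7dc050ebabd6

# On the node itself:
#   ip -d link show | host_if_for_netns 5b58d4f1-eb3e-4cb7-b823-bbb08ee37b18
```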

If we run the same nc command again, but this time with tshark attached to the interface lxc7dc050ebabd6 on worker 1,
we can see the DNS queries followed by the actual TCP communication (and, indeed, our pod 1’s source IP 10.0.1.15).


root@node-1:~# tshark -i lxc7dc050ebabd6
Running as user "root" and group "root". This could be dangerous.
Capturing on 'lxc7dc050ebabd6'
    1 0.000000000    10.0.1.15 → 10.96.0.10   DNS 109 Standard query 0x262f A node-wrk-2-ubuntu24.wrk-2.wrk-1.svc.cluster.local
    2 0.000122975    10.0.1.15 → 10.96.0.10   DNS 109 Standard query 0xf132 AAAA node-wrk-2-ubuntu24.wrk-2.wrk-1.svc.cluster.local
    3 0.001248872   10.96.0.10 → 10.0.1.15    DNS 202 Standard query response 0xf132 No such name AAAA node-wrk-2-ubuntu24.wrk-2.wrk-1.svc.cluster.local SOA ns.dns.cluster.local
    4 0.001283013   10.96.0.10 → 10.0.1.15    DNS 202 Standard query response 0x262f No such name A node-wrk-2-ubuntu24.wrk-2.wrk-1.svc.cluster.local SOA ns.dns.cluster.local
    5 0.001410506    10.0.1.15 → 10.96.0.10   DNS 103 Standard query 0x76e7 A node-wrk-2-ubuntu24.wrk-2.svc.cluster.local
    6 0.001456907    10.0.1.15 → 10.96.0.10   DNS 103 Standard query 0x67e4 AAAA node-wrk-2-ubuntu24.wrk-2.svc.cluster.local
    7 0.001863086   10.96.0.10 → 10.0.1.15    DNS 196 Standard query response 0x67e4 AAAA node-wrk-2-ubuntu24.wrk-2.svc.cluster.local SOA ns.dns.cluster.local
    8 0.001984626   10.96.0.10 → 10.0.1.15    DNS 162 Standard query response 0x76e7 A node-wrk-2-ubuntu24.wrk-2.svc.cluster.local A 10.105.134.33
    9 0.091833823    10.0.1.15 → 10.105.134.33 TCP 74 38430 → 9999 [SYN] Seq=0 Win=64390 Len=0 MSS=1370 SACK_PERM TSval=167971729 TSecr=0 WS=128
   10 0.092468056 10.105.134.33 → 10.0.1.15    TCP 74 9999 → 38430 [SYN, ACK] Seq=0 Ack=1 Win=65184 Len=0 MSS=1370 SACK_PERM TSval=2085648661 TSecr=167971729 WS=128
   11 0.092505544    10.0.1.15 → 10.105.134.33 TCP 66 38430 → 9999 [ACK] Seq=1 Ack=1 Win=64512 Len=0 TSval=167971730 TSecr=2085648661
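The capture shows the name being tried first with wrk-1.svc.cluster.local appended (No such name) and only then with svc.cluster.local. That matches how a resolver expands a short name through its search list. The sketch below imitates that ordering in a simplified form; the function name dns_candidates is hypothetical, and the search list and ndots value are assumptions based on typical Kubernetes defaults (the pod’s /etc/resolv.conf is the authority, and on a cloud VM it may contain additional domains).

```shell
# Simplified sketch of resolver search-list expansion: a name with fewer dots
# than ndots is tried with each search domain first, and only then as-is.
dns_candidates() {
  name="$1"; ndots="$2"; shift 2
  # Count the dots in the name.
  dots=$(( $(printf '%s' "$name" | tr -cd '.' | wc -c) ))
  if [ "$dots" -ge "$ndots" ]; then echo "$name"; fi
  for domain in "$@"; do
    echo "${name}.${domain}"
  done
  if [ "$dots" -lt "$ndots" ]; then echo "$name"; fi
  return 0
}

# Assumed resolver settings for a pod in namespace wrk-1 (ndots:5).
dns_candidates node-wrk-2-ubuntu24.wrk-2 5 \
  wrk-1.svc.cluster.local svc.cluster.local cluster.local
```

A real resolver stops at the first candidate that resolves, which is why the capture ends after the second name. Using the fully qualified form with a trailing dot (node-wrk-2-ubuntu24.wrk-2.svc.cluster.local.) would normally skip the search list altogether.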

As we can see from the packet capture, CoreDNS returns 10.105.134.33 as the destination IP address.
That is the ClusterIP of the Service, not the “true” IP address of pod 2 where the nc server is running. The part of
Kubernetes that traditionally handles this NAT is kube-proxy, which typically programs iptables rules; with Cilium
installed, eBPF can take over that job, and both can coexist on a node.

If we look at the iptables rules on the worker 1 host, we can see that the IP address 10.105.134.33 does appear in
NAT rules whose backend is pod 2.


root@node-1:~# iptables -t nat -L -v | grep "10.105.134.33"
    0     0 KUBE-SVC-IWPXKGE4TAJJE4GD  tcp  --  any    any     anywhere             10.105.134.33        /* wrk-2/node-wrk-2-ubuntu24:tcp-9999 cluster IP */ tcp dpt:9999
    0     0 KUBE-SVC-HX23KANCFUYJINGR  udp  --  any    any     anywhere             10.105.134.33        /* wrk-2/node-wrk-2-ubuntu24:udp-9999 cluster IP */ udp dpt:9999
    0     0 KUBE-MARK-MASQ  udp  --  any    any    !10.10.0.0/16         10.105.134.33        /* wrk-2/node-wrk-2-ubuntu24:udp-9999 cluster IP */ udp dpt:9999
    0     0 KUBE-MARK-MASQ  tcp  --  any    any    !10.10.0.0/16         10.105.134.33        /* wrk-2/node-wrk-2-ubuntu24:tcp-9999 cluster IP */ tcp dpt:9999

Chain KUBE-SVC-IWPXKGE4TAJJE4GD (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 KUBE-MARK-MASQ  tcp  --  any    any    !10.10.0.0/16         10.105.134.33        /* wrk-2/node-wrk-2-ubuntu24:tcp-9999 cluster IP */ tcp dpt:9999
    0     0 KUBE-SEP-FXHI2MOU7V5XIHJD  all  --  any    any     anywhere             anywhere             /* wrk-2/node-wrk-2-ubuntu24:tcp-9999 -> 10.0.2.215:9999 */
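The comments kube-proxy attaches to its rules already spell out which backend a Service port maps to, so they can be pulled out of the listing directly. This is a rough sketch that relies on the "name -> ip:port" comment format visible in the output above; the helper name svc_backends is made up, and the comment format is an observation from this output rather than a stable interface.

```shell
# Hypothetical helper: from `iptables -t nat -L -v` output on stdin, print
# "service-port backend" pairs found in comments of the form
# "/* ns/svc:port-name -> ip:port */".
svc_backends() {
  sed -n 's|.*/\* \([^ ]*\) -> \([^ ]*\) \*/.*|\1 \2|p' | sort -u
}

# Demo on two lines taken from the output above.
svc_backends <<'EOF'
    0     0 KUBE-MARK-MASQ  tcp  --  any    any    !10.10.0.0/16         10.105.134.33        /* wrk-2/node-wrk-2-ubuntu24:tcp-9999 cluster IP */ tcp dpt:9999
    0     0 KUBE-SEP-FXHI2MOU7V5XIHJD  all  --  any    any     anywhere             anywhere             /* wrk-2/node-wrk-2-ubuntu24:tcp-9999 -> 10.0.2.215:9999 */
EOF
# prints: wrk-2/node-wrk-2-ubuntu24:tcp-9999 10.0.2.215:9999

# On the node itself:
#   iptables -t nat -L -v | svc_backends
```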

However, we have to be aware that Cilium, which we’ve installed along with Kubernetes to handle networking between pods,
uses eBPF to implement what these iptables rules would otherwise do. That is why the packet and byte counters in the
iptables output above are all zero.

To check out what eBPF programs are at work, we can use bpftool.

root@node-1:~/bpftool/src# bpftool link
2: tcx  prog 572  
        ifindex cilium_vxlan(5)  attach_type tcx_ingress  
3: tcx  prog 571  
        ifindex cilium_vxlan(5)  attach_type tcx_egress  
4: tcx  prog 657  
        ifindex cilium_host(4)  attach_type tcx_ingress  
5: tcx  prog 652  
        ifindex cilium_host(4)  attach_type tcx_egress  
6: tcx  prog 664  
        ifindex cilium_net(3)  attach_type tcx_ingress  
7: tcx  prog 674  
        ifindex ens4(2)  attach_type tcx_ingress  
8: tcx  prog 600  
        ifindex lxc_health(7)  attach_type tcx_ingress  
9: tcx  prog 681  
        ifindex lxc7dc050ebabd6(9)  attach_type tcx_ingress  
10: tcx  prog 694  
        ifindex lxc1e0b7c2c5527(11)  attach_type tcx_ingress 
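To go from a pod’s host interface to the eBPF program attached to it, the two-line records printed by bpftool link can be joined with a little awk. The sketch below is only an illustration and assumes the output layout shown above; the helper name prog_for_if is hypothetical.

```shell
# Hypothetical helper: read `bpftool link` output on stdin and print
# "<prog id> <attach type>" for every link attached to the given interface.
prog_for_if() {
  awk -v ifname="$1" '
    # Record header, e.g. "9: tcx  prog 681"
    /^[0-9]+:/ { for (i = 1; i < NF; i++) if ($i == "prog") prog = $(i + 1) }
    # Detail line, e.g. "ifindex lxc7dc050ebabd6(9)  attach_type tcx_ingress"
    $1 == "ifindex" {
      name = $2; sub(/\(.*/, "", name)
      if (name == ifname) print prog, $4
    }
  '
}

# Demo on two records taken from the output above.
prog_for_if lxc7dc050ebabd6 <<'EOF'
9: tcx  prog 681  
        ifindex lxc7dc050ebabd6(9)  attach_type tcx_ingress  
10: tcx  prog 694  
        ifindex lxc1e0b7c2c5527(11)  attach_type tcx_ingress 
EOF
# prints: 681 tcx_ingress

# The program itself can then be inspected on the node with:
#   bpftool prog show id 681
```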

Here’s the source code of Cilium’s eBPF program.


# https://github.com/cilium/cilium/blob/main/bpf/bpf_lxc.c

Cilium also makes node-to-node pod communication possible using VXLAN. If we attach tshark to
the vxlan interface, we can see the packet’s final destination IP right before it gets
tunneled inside VXLAN.


root@node-1:~# tshark -i cilium_vxlan 
Running as user "root" and group "root". This could be dangerous.
Capturing on 'cilium_vxlan'
...
    3 0.153379717    10.0.1.15 → 10.0.2.17    DNS 109 Standard query 0x69a9 A node-wrk-2-ubuntu24.wrk-2.wrk-1.svc.cluster.local
    4 0.153453149    10.0.1.15 → 10.0.2.17    DNS 109 Standard query 0x71b4 AAAA node-wrk-2-ubuntu24.wrk-2.wrk-1.svc.cluster.local
    5 0.154519494    10.0.2.17 → 10.0.1.15    DNS 202 Standard query response 0x69a9 No such name A node-wrk-2-ubuntu24.wrk-2.wrk-1.svc.cluster.local SOA ns.dns.cluster.local
    6 0.156446844    10.0.2.17 → 10.0.1.15    DNS 202 Standard query response 0x71b4 No such name AAAA node-wrk-2-ubuntu24.wrk-2.wrk-1.svc.cluster.local SOA ns.dns.cluster.local
    7 0.156651889    10.0.1.15 → 10.0.2.17    DNS 103 Standard query 0x0fd2 A node-wrk-2-ubuntu24.wrk-2.svc.cluster.local
    8 0.156713018    10.0.1.15 → 10.0.2.17    DNS 103 Standard query 0xf8cf AAAA node-wrk-2-ubuntu24.wrk-2.svc.cluster.local
    9 0.157366940    10.0.2.17 → 10.0.1.15    DNS 196 Standard query response 0xf8cf AAAA node-wrk-2-ubuntu24.wrk-2.svc.cluster.local SOA ns.dns.cluster.local
   10 0.157367119    10.0.2.17 → 10.0.1.15    DNS 162 Standard query response 0x0fd2 A node-wrk-2-ubuntu24.wrk-2.svc.cluster.local A 10.105.134.33
   11 0.247844051    10.0.1.15 → 10.0.2.215   TCP 74 41208 → 9999 [SYN] Seq=0 Win=64390 Len=0 MSS=1370 SACK_PERM TSval=168433304 TSecr=0 WS=128
   12 0.248302570   10.0.2.215 → 10.0.1.15    TCP 74 9999 → 41208 [SYN, ACK] Seq=0 Ack=1 Win=65184 Len=0 MSS=1370 SACK_PERM TSval=2086110236 TSecr=168433304 WS=128
   13 0.248397099    10.0.1.15 → 10.0.2.215   TCP 66 41208 → 9999 [ACK] Seq=1 Ack=1 Win=64512 Len=0 TSval=168433305 TSecr=2086110236
...

10.0.2.215, it is.

root@node-1:~# iptables -t nat -L -v | grep 10.0.2.215
    0     0 KUBE-MARK-MASQ  all  --  any    any     10.0.2.215           anywhere             /* wrk-2/node-wrk-2-ubuntu24:tcp-9999 */
    0     0 DNAT       tcp  --  any    any     anywhere             anywhere             /* wrk-2/node-wrk-2-ubuntu24:tcp-9999 */ tcp to:10.0.2.215:9999
    0     0 KUBE-MARK-MASQ  all  --  any    any     10.0.2.215           anywhere             /* wrk-2/node-wrk-2-ubuntu24:udp-9999 */
    0     0 DNAT       udp  --  any    any     anywhere             anywhere             /* wrk-2/node-wrk-2-ubuntu24:udp-9999 */ udp to:10.0.2.215:9999
    0     0 KUBE-SEP-VWQM2HSDJBARAX5I  all  --  any    any     anywhere             anywhere             /* wrk-2/node-wrk-2-ubuntu24:udp-9999 -> 10.0.2.215:9999 */
    0     0 KUBE-SEP-FXHI2MOU7V5XIHJD  all  --  any    any     anywhere             anywhere             /* wrk-2/node-wrk-2-ubuntu24:tcp-9999 -> 10.0.2.215:9999 */

When we attach tshark to the actual interface that is connected to the switch (or whatever it is, as we’re using Google Cloud), we can see the tunneled vxlan packets flowing between nodes.


root@node-1:~# tshark -i ens4 -f udp
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ens4'
    1 0.000000000   10.168.0.4 → 10.168.0.5   UDP 159 45304 → 8472 Len=117
    2 0.000026094   10.168.0.4 → 10.168.0.5   UDP 159 45304 → 8472 Len=117
    3 0.000787520   10.168.0.5 → 10.168.0.4   UDP 252 45292 → 8472 Len=210
    4 0.000879542   10.168.0.5 → 10.168.0.4   UDP 252 45292 → 8472 Len=210
    5 0.001084107   10.168.0.4 → 10.168.0.5   UDP 153 59946 → 8472 Len=111
    6 0.001121632   10.168.0.4 → 10.168.0.5   UDP 153 59946 → 8472 Len=111
    7 0.001991176   10.168.0.5 → 10.168.0.4   UDP 246 56913 → 8472 Len=204
    8 0.003240963   10.168.0.5 → 10.168.0.4   UDP 212 56913 → 8472 Len=170
    9 0.091070101   10.168.0.4 → 10.168.0.5   UDP 124 32918 → 8472 Len=82
   10 0.091373667   10.168.0.5 → 10.168.0.4   UDP 124 43769 → 8472 Len=82
   11 0.091482484   10.168.0.4 → 10.168.0.5   UDP 116 32918 → 8472 Len=74
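The packet sizes in this capture line up with the ones seen on cilium_vxlan once the encapsulation overhead is added. As a worked check, the sketch below adds the standard VXLAN-over-IPv4 headers (outer Ethernet 14, IPv4 20, UDP 8, VXLAN 8 bytes) to the inner frame lengths; the function names are made up for illustration, and the header sizes assume no IP options and no VLAN tag.

```shell
# VXLAN-over-IPv4 encapsulation overhead, in bytes.
OUTER_ETH=14; OUTER_IP=20; OUTER_UDP=8; VXLAN_HDR=8

# Frame length on ens4 for a given inner Ethernet frame length.
vxlan_outer_len() { echo $(( $1 + OUTER_ETH + OUTER_IP + OUTER_UDP + VXLAN_HDR )); }
# UDP payload length (the "Len=" column in tshark) for a given inner frame length.
vxlan_udp_len()   { echo $(( $1 + VXLAN_HDR )); }

vxlan_outer_len 74    # TCP SYN, 74 bytes on cilium_vxlan  -> 124 on ens4
vxlan_udp_len 74      # -> 82
vxlan_outer_len 66    # TCP ACK, 66 bytes on cilium_vxlan  -> 116 on ens4
vxlan_udp_len 66      # -> 74
vxlan_outer_len 109   # DNS query, 109 bytes               -> 159 on ens4

# To have tshark dissect the inner packets on ens4, decoding UDP port 8472
# as VXLAN should work:
#   tshark -i ens4 -f "udp port 8472" -d udp.port==8472,vxlan
```

The MSS=1370 advertised in the SYN is also consistent with this overhead: 1460 (the interface MTU) minus 50 bytes of tunnel headers minus 40 bytes of IPv4 and TCP headers gives 1370.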

Now, we’re going to move on to worker 2. Attach tshark to the interface connected to the switch
to observe the vxlan packets arriving from worker 1.


root@node-2:~# tshark -i ens4 -f udp
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ens4'
    1 0.000000000   10.168.0.4 → 10.168.0.5   UDP 159 54392 → 8472 Len=117
    2 0.000000394   10.168.0.4 → 10.168.0.5   UDP 159 54392 → 8472 Len=117
    3 0.000924839   10.168.0.5 → 10.168.0.4   UDP 252 33949 → 8472 Len=210
    4 0.001091915   10.168.0.5 → 10.168.0.4   UDP 252 33949 → 8472 Len=210
    5 0.090827819   10.168.0.4 → 10.168.0.5   UDP 124 46625 → 8472 Len=82
    6 0.091005902   10.168.0.5 → 10.168.0.4   UDP 124 39053 → 8472 Len=82
    7 0.091299352   10.168.0.4 → 10.168.0.5   UDP 116 46625 → 8472 Len=74

Doing the steps on worker 1 in reverse, we’re going to look at worker 2’s vxlan interface.


root@node-2:~# tshark -i cilium_vxlan -f "tcp port 9999"
Running as user "root" and group "root". This could be dangerous.
Capturing on 'cilium_vxlan'
    1 0.000000000    10.0.1.15 → 10.0.2.215   TCP 74 40686 → 9999 [SYN] Seq=0 Win=64390 Len=0 MSS=1370 SACK_PERM TSval=166156308 TSecr=0 WS=128
    2 0.000242183   10.0.2.215 → 10.0.1.15    TCP 74 9999 → 40686 [SYN, ACK] Seq=0 Ack=1 Win=65184 Len=0 MSS=1370 SACK_PERM TSval=2083833240 TSecr=166156308 WS=128
    3 0.000623879    10.0.1.15 → 10.0.2.215   TCP 66 40686 → 9999 [ACK] Seq=1 Ack=1 Win=64512 Len=0 TSval=166156309 TSecr=2083833240

We can see that the decapsulated vxlan packet shows our pod 1’s source IP and the destination IP for pod 2, 10.0.2.215.

Applying the same logic we used to find which namespace corresponds to pod 1 on worker 1, we can find that
pod 2 is ceb7eaea-1923-4233-9253-9b7d25a9fb93 on worker 2.

# to see all namespaces on worker 2
root@node-2:~# ip netns list
47f92595-eb21-46c7-b0ac-5efbf1cd4d59 (id: 2)
ceb7eaea-1923-4233-9253-9b7d25a9fb93 (id: 1)
fe1dc96d-ef17-4afd-a1f7-4b65bdd64bd0
c686294b-4802-4767-9057-69c83626a5ee
477ebbff-90d2-43f3-9e8f-15dbac6501f2

# ...there it is!
root@node-2:~# ip netns exec ceb7eaea-1923-4233-9253-9b7d25a9fb93 ip a
...
8: eth0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default qlen 1000
    link/ether 92:42:dd:7b:6f:be brd ff:ff:ff:ff:ff:ff link-netns fe1dc96d-ef17-4afd-a1f7-4b65bdd64bd0
    inet 10.0.2.215/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::9042:ddff:fe7b:6fbe/64 scope link 
       valid_lft forever preferred_lft forever

We can make sure the nc server is running in that namespace.


root@node-2:~# ip netns exec ceb7eaea-1923-4233-9253-9b7d25a9fb93 ps aux | grep nc
...
root       39372  0.0  0.1  14912  5548 pts/0    S+   May21   0:00 nc -l 0.0.0.0 9999

As we did on worker 1, we can check out which interface is connected to the namespace ceb7eaea-1923-4233-9253-9b7d25a9fb93.


root@node-2:~# ip -d link show
...
9: lxc3afb1f126f2c@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 52:17:b0:95:31:94 brd ff:ff:ff:ff:ff:ff link-netns ceb7eaea-1923-4233-9253-9b7d25a9fb93 promiscuity 0  allmulti 0 minmtu 68 maxmtu 65535 
    veth addrgenmode eui64 numtxqueues 2 numrxqueues 2 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536 
11: lxc3fe8b5095c99@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 72:15:1d:3a:ad:d0 brd ff:ff:ff:ff:ff:ff link-netns 47f92595-eb21-46c7-b0ac-5efbf1cd4d59 promiscuity 0  allmulti 0 minmtu 68 maxmtu 65535 
    veth addrgenmode eui64 numtxqueues 2 numrxqueues 2 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536 
...

If we attach tshark to lxc3afb1f126f2c, it’s clear that the packet that set sail from a namespace on worker 1 ends up in another namespace on worker 2.

root@node-2:~# tshark -i lxc3afb1f126f2c
Running as user "root" and group "root". This could be dangerous.
Capturing on 'lxc3afb1f126f2c'
    1 0.000000000    10.0.1.15 → 10.0.2.215   TCP 74 50210 → 9999 [SYN] Seq=0 Win=64390 Len=0 MSS=1370 SACK_PERM TSval=173851431 TSecr=0 WS=128
    2 0.000035143   10.0.2.215 → 10.0.1.15    TCP 74 9999 → 50210 [SYN, ACK] Seq=0 Ack=1 Win=65184 Len=0 MSS=1370 SACK_PERM TSval=2091528363 TSecr=173851431 WS=128
    3 0.000359301    10.0.1.15 → 10.0.2.215   TCP 66 50210 → 9999 [ACK] Seq=1 Ack=1 Win=64512 Len=0 TSval=173851431 TSecr=2091528363