Kubernetes

Understanding Networking of Microservices Applications

Damian Igbe, PhD
March 24, 2022, 7:07 p.m.


This is part 6 of the series on Managing Microservices with Kubernetes. You can read part 1 here, part 2 here, part 3 here,  part 4 here and part 5 here.

In part 1, we saw how Kubernetes is used to deploy a microservice. In that blog, I mentioned that several Kubernetes objects are used to deploy the voting application – Namespaces, Labels and Selectors, Pods, ReplicaSets, Deployments, and Service objects. In part 2 we explored some of the Kubernetes objects used in building the microservice application, and we also explored scaling a microservices application using the ReplicaSet controller object. In part 3 we explored using the Deployment object to scale the microservice application. In part 4 we explored the Kubernetes Service object, and in part 5 we explored CI/CD of microservices.

In this blog, I will explore communication in microservices, expanding on the discussion of the Service object in part 4. After this blog, Kubernetes networking should make more sense to you.

Kubernetes Network Topology

The diagram below shows a Kubernetes cluster comprising two worker nodes (the master node is not shown). Both hosts are connected to the same L2 network (192.168.0.0/24): node 1 has IP address 192.168.0.10 while node 2 has IP address 192.168.0.11. Pods are scheduled on the worker nodes, and each worker node here is running 2 pods. Each worker node has a virtual switch created by Docker called Docker0. Each Docker0 switch is assigned a network CIDR that is different from the other node's; the Docker0 switches must have different CIDRs so that each pod in the cluster can have a unique IP address. All the pods on a node are connected to the same Docker0 virtual switch so they can reach one another.

To ensure that pods are fully isolated from one another, each pod resides in its own namespace. There is also a default namespace in which the Linux OS on the worker node resides; this separates the host OS namespace from the pod namespaces. Linux namespaces are a kernel feature that Docker (or any other container runtime) uses to provide a separate, secure, and isolated environment for each container/pod. Within its namespace, a pod sees only itself; it can reach another pod only through virtual networking. In the diagram, each node has 3 namespaces: 1 for the host OS and 2 for the 2 pods on the node.

Each pod is connected to the Docker0 virtual switch by a virtual cable called a virtual Ethernet (veth) pair. Each pod has a virtual NIC that appears as eth0 inside the pod; this is one end of the veth pair. The other end sits in the default namespace of the host OS and appears as a port on the Docker0 virtual switch.
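If you want to see these veth pairs for yourself, here is a minimal sketch (not from the original post). It assumes you have shell access to a worker node with iproute2 installed and that the pod image also ships the ip utility; <pod-name> is a placeholder.

ip link show type veth                         # host-side ends of the veth pairs, attached to the bridge
kubectl exec <pod-name> -- ip addr show eth0   # the pod-side end, visible as eth0 inside the pod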

 

The Kubernetes networking architecture

The Kubernetes Networking Model

Here is a summary of how networking works in a Kubernetes cluster:

  1. Every pod has its own IP address.
  2. Pods on a node can communicate with all pods on all nodes without NAT.
  3. Agents on a node (e.g., system daemons, the kubelet) can communicate with all pods on that node.
  4. All pods can communicate with all nodes and vice versa.

This network model makes it easy to migrate on-premises solutions to a Kubernetes cluster, as a pod is treated much like a VM or a bare-metal machine. The network model is implemented in different ways by different vendors using the Container Network Interface (CNI) plugin specification located at this GitHub location. Two of the most popular implementations are Calico and Weave Net.
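Rules 1 and 2 are easy to verify on a running cluster. The sketch below is not from the original post; <pod-a> and <pod-b-ip> are placeholders, and it assumes the container image ships with ping.

kubectl get pods -o wide                        # note each pod's IP address and node
kubectl exec <pod-a> -- ping -c 3 <pod-b-ip>    # a pod on one node reaching a pod on another node
# The source address seen by the destination pod is <pod-a>'s own pod IP – no NAT in between.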

Kubernetes Communication

There are different types of communication that can happen in a Kubernetes cluster and hence between microservices.

  • Container to container inside a pod
  • Pod to pod communication inside the same node
  • Pod to pod communication on different nodes
  • Pod to service Communication
  • External to pod communication (This was discussed under service objects in part 4 of this series)

Container to container communication inside a pod

Containers in a pod share the pod's IP address, so containers in the same pod must listen on different ports. Containers can reach one another through the pod IP or the loopback address (127.0.0.1), along with the port. For example, one container can ‘curl’ another container on localhost:80 or 10.0.0.5:80, where 10.0.0.5 is the IP of the pod.
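As an illustration, here is a minimal two-container pod manifest (a hypothetical example, not part of the voting application) showing one container reaching the other over the loopback address:

apiVersion: v1
kind: Pod
metadata:
  name: shared-net-demo
spec:
  containers:
    - name: web
      image: nginx                 # listens on port 80
      ports:
        - containerPort: 80
    - name: sidecar
      image: curlimages/curl
      # the sidecar reaches the web container over localhost because both share the pod's network namespace
      command: ["sh", "-c", "while true; do curl -s http://127.0.0.1:80 > /dev/null; sleep 5; done"]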

Pod to pod communication inside the same node

Pods on the same node are connected to the same bridge, which acts as an L2 device. A bridge is created by the CNI plugin on every node and given a subnet CIDR, and every pod that connects to that bridge gets an IP address allocated from that CIDR. The output below is taken from a cluster that has 3 nodes (cl2node1, cl2node2, cl2node3). As you can see, all the pods on the same node have IP addresses from the same CIDR, which is different from the CIDR used by pods on the other nodes. For example, pods on node cl2node1 have IP addresses 172.16.111.1, 172.16.111.2, and 172.16.111.3; pods on node cl2node2 have IP addresses 172.16.227.129 and 172.16.227.130; while pods on cl2node3 have IP addresses 172.16.228.65 and 172.16.228.66.

cloudexperts@cl2master:~$ kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
nginx-f89759699-4txhr     1/1     Running   0          2d20h   172.16.227.129   cl2node2   <none>           <none>
nginx-f89759699-crhvg     1/1     Running   0          2d20h   172.16.227.130   cl2node2   <none>           <none>
nginx-f89759699-s5288     1/1     Running   0          2d20h   172.16.111.2     cl2node1   <none>           <none>
nginx-f89759699-vffx5     1/1     Running   0          2d20h   172.16.111.1     cl2node1   <none>           <none>
nginx1-56db585f94-9swp5   1/1     Running   0          2d15h   172.16.228.66    cl2node3   <none>           <none>
nginx1-56db585f94-t6tqv   1/1     Running   0          2d15h   172.16.111.3     cl2node1   <none>           <none>
nginx1-56db585f94-xgtqd   1/1     Running   0          2d15h   172.16.228.65    cl2node3   <none>           <none>

Since pods on the same node are on the same virtual layer 2 network, they communicate using standard L2 principles: the source pod first obtains the MAC address of the destination pod using ARP, after which frames are delivered directly to that MAC address at layer 2.
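To observe this on a node, you can inspect the bridge and the neighbour (ARP) table. This is a sketch, not from the original post; it assumes shell access to a worker node with iproute2 installed, and the bridge name depends on the CNI in use (e.g. docker0, cni0, or weave).

ip link show type bridge   # list the bridges on the node
bridge link show           # the veth ports attached to each bridge
ip neigh show              # the neighbour/ARP table: pod IP to MAC address mappings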

Pod to Pod communication on different nodes

Here we have to realize that:

The network of the pods is different from the network connecting the nodes. As you can see from the listing below, the nodes are on the 192.168.0.0/24 network:

cloudexperts@cl2master:~$ kubectl get nodes -o wide
NAME        STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
cl2master   Ready    master   11d     v1.18.3   192.168.0.14   <none>        Ubuntu 18.04.4 LTS   4.15.0-101-generic   docker://19.3.9
cl2node1    Ready    <none>   11d     v1.18.3   192.168.0.16   <none>        Ubuntu 18.04.4 LTS   4.15.0-101-generic   docker://19.3.9
cl2node2    Ready    <none>   11d     v1.18.3   192.168.0.18   <none>        Ubuntu 18.04.4 LTS   4.15.0-106-generic   docker://19.3.9
cl2node3    Ready    <none>   2d20h   v1.18.3   192.168.0.19   <none>        Ubuntu 18.04.4 LTS   4.15.0-106-generic   docker://19.3.9

And the pods are on their respective node network as explained earlier. You can see the listing of pods below.

cloudexperts@cl2master:~$ kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
nginx-f89759699-4txhr     1/1     Running   0          2d20h   172.16.227.129   cl2node2   <none>           <none>
nginx-f89759699-crhvg     1/1     Running   0          2d20h   172.16.227.130   cl2node2   <none>           <none>
nginx-f89759699-s5288     1/1     Running   0          2d20h   172.16.111.2     cl2node1   <none>           <none>
nginx-f89759699-vffx5     1/1     Running   0          2d20h   172.16.111.1     cl2node1   <none>           <none>
nginx1-56db585f94-9swp5   1/1     Running   0          2d15h   172.16.228.66    cl2node3   <none>           <none>
nginx1-56db585f94-t6tqv   1/1     Running   0          2d15h   172.16.111.3     cl2node1   <none>           <none>
nginx1-56db585f94-xgtqd   1/1     Running   0          2d15h   172.16.228.65    cl2node3   <none>           <none>

For one pod on one node to communicate with a pod on a different node, therefore, we need a way to prevent the node network from dropping the pod's packets. This is the job of the CNI plugin. CNI plugins mostly use either overlay technologies such as VXLAN or IP-in-IP, or plain L3 routing. Here I will discuss how 2 major CNIs implement this.
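One quick way to see which approach a cluster uses is to look at a worker node's routing table (a sketch, assuming shell access to a node with iproute2 installed):

ip route
# With an L3 CNI such as Calico in BGP mode, expect one route per remote pod CIDR pointing at
# the owning node's IP. With an overlay CNI such as Weave, pod traffic is instead handed to the
# overlay device (e.g. the weave bridge or a vxlan/tunl interface) and encapsulated there.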

Overlay CNI With Weave

You can read more about Weave Networking on this link.

  • Weave Net is a full-mesh overlay network.
  • It operates in 2 modes: fast datapath mode, which uses the Linux kernel to route traffic, and sleeve mode, which uses the Weave routers in userspace to route traffic between the peer Weave routers. Fast datapath is the faster of the two and is enabled by default; Weave Net automatically decides when to use datapath or sleeve mode.
  • It uses encapsulation – before the packet is put on the wire, an overlay bridge modifies the packet header, adding the address of the underlay host where the destination pod resides.
  • The packet then gets pushed onto the underlay network wire and transported to the destination node.
  • Packet de-encapsulation happens at the destination node when the packet is received, and the packet gets forwarded to the destination pod.

On a cluster deployed with the Weave plugin, which uses a VXLAN overlay, here is a traceroute showing the hops from the master node to pod 10.47.0.8 residing on one of the cluster nodes. This will always show a direct connection, as traffic travels through the established tunnels between the nodes.

cloudexperts@master1:~$ traceroute 10.47.0.8
traceroute to 10.47.0.8 (10.47.0.8), 30 hops max, 60 byte packets
1 10.47.0.8 (10.47.0.8) 25.239 ms 24.822 ms 24.800 ms

L3 CNI With Calico

You can read more about Calico Networking on this link.

  • Calico supports both L3 routing and an overlay network using encapsulation.
  • It builds a layer 3 network using the BGP routing protocol to route packets between hosts. This is the default and preferred method.
  • Overlay mode is recommended when the underlying network cannot easily be made aware of workload IPs, for example, when using multiple VPCs/subnets on AWS.
  • It can use 2 encapsulation protocols – IP-in-IP and VXLAN. IP-in-IP is the default and recommended one.
  • When using an overlay, the process is the same as in Weave Net:
    • Encapsulation is used – before the packet is put on the wire, an overlay bridge modifies the packet header, adding the address of the underlay host where the destination pod resides.
    • The packet then gets pushed onto the underlay network wire and transported to the destination node.
    • Packet de-encapsulation happens at the destination node when the packet is received, and the packet gets forwarded to the destination pod.
  • When using L3 routing, Calico nodes exchange routing information over BGP, which enables Calico-networked workloads to communicate without the need for encapsulation.

Here is a traceroute showing the hops from the master node to pod 172.16.111.25 residing on one of the cluster nodes. In this listing we can see that a hop/router is traversed before reaching the destination pod, unlike with the Weave Net CNI, where the traceroute shows a direct connection to the pod.

cloudexperts@cl2master:~$ traceroute 172.16.111.25
traceroute to 172.16.111.25 (172.16.111.25), 30 hops max, 60 byte packets
1 172.16.111.0 (172.16.111.0) 1.386 ms 1.257 ms 1.098 ms
2 172.16.111.25 (172.16.111.25) 1.463 ms 1.327 ms 7.887 ms

Pod to Service communication

In this section, I will break down the various steps involved in pod-to-service communication.

Service Discovery

How do pods find the IP addresses of the service objects to connect to? Pods in Kubernetes can discover other services using either:

  • environment variables
    • When a pod is created, it is provided with environment variables pointing to all the services that were running when the pod started.
    • Hence, the service must be created before the pod; otherwise, the pod cannot find a service that was created after it.
    • Example environment variables from inside the vote microservice pod are displayed below, showing the service IPs of the other services.
VOTE_SERVICE_HOST=10.103.71.163
KUBERNETES_PORT=tcp://10.96.0.1:443
REDIS_SERVICE_PORT=6379
KUBERNETES_SERVICE_PORT=443
REDIS_PORT=tcp://10.106.221.98:6379
VOTE_PORT_5000_TCP=tcp://10.103.71.163:5000
REDIS_SERVICE_PORT_REDIS_SERVICE=6379
REDIS_PORT_6379_TCP_ADDR=10.106.221.98
DB_PORT=tcp://10.103.146.13:5432
DB_SERVICE_PORT=5432
HOSTNAME=vote-59654d4c9-xdgvp
REDIS_PORT_6379_TCP_PORT=6379
REDIS_PORT_6379_TCP_PROTO=tcp
DB_PORT_5432_TCP=tcp://10.103.146.13:5432
VOTE_SERVICE_PORT=5000
VOTE_PORT=tcp://10.103.71.163:5000
RESULT_PORT_5001_TCP_ADDR=10.102.81.197
RESULT_PORT_5001_TCP_PORT=5001
RESULT_PORT_5001_TCP_PROTO=tcp
REDIS_PORT_6379_TCP=tcp://10.106.221.98:6379
RESULT_SERVICE_PORT_RESULT_SERVICE=5001
DB_PORT_5432_TCP_ADDR=10.103.146.13
DB_SERVICE_HOST=10.103.146.13
DB_PORT_5432_TCP_PORT=5432
DB_SERVICE_PORT_DB_SERVICE=5432
DB_PORT_5432_TCP_PROTO=tcp
  • or a DNS service.
    • This is the preferred way of connecting to other services.
    • DNS-based discovery uses a cluster DNS add-on called CoreDNS, which must be enabled before it can be used.
    • Whenever a pod is created, the IP address of the cluster DNS service is inserted into the container(s) at the usual location, /etc/resolv.conf.
    • The containers can then resolve the IP address of any service object (a resolution example follows the listings below).
    • Here we can see the IP address of the DNS server configured in the pod's /etc/resolv.conf:
    • cloudexperts@master1:~$ kubectl exec -it vote-59654d4c9-xdgvp /bin/sh -n vote
      kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
      
      /app # cat /etc/resolv.conf 
      nameserver 10.96.0.10
      search vote.svc.cluster.local svc.cluster.local cluster.local tx.rr.com
      

      The IP address 10.96.0.10 is the IP address of the DNS service as can be seen below

      cloudexperts@master1:~$ kubectl get svc -n kube-system
      NAME                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                        AGE
      kube-dns                             ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP         101d
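As a quick check of DNS-based discovery (a sketch, not from the original post, assuming nslookup is available inside the container image), the vote pod can resolve the redis service by its short name:

kubectl exec -n vote vote-59654d4c9-xdgvp -- nslookup redis
# Thanks to the search domains in /etc/resolv.conf, "redis" expands to
# redis.vote.svc.cluster.local and resolves to the ClusterIP of the redis service.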
      
      

Service Endpoints

The kube-proxy, which runs on every node of the Kubernetes cluster, coordinates the forwarding of traffic to the service endpoints.
Below, when you run kubectl get endpoints, you see the pods that each service (including the vote frontend) forwards traffic to.

First we check the services:

cloudexperts@master1:~$ kubectl get svc -n vote -o wide
NAME     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE     SELECTOR
db       ClusterIP   10.97.59.155    <none>        5432/TCP         2d23h   app=db
redis    ClusterIP   10.98.165.243   <none>        6379/TCP         2d23h   app=redis
result   NodePort    10.101.97.208   <none>        5001:31001/TCP   2d23h   app=result
vote     NodePort    10.102.56.81    <none>        5000:31000/TCP   2d23h   app=vote

Then we see the endpoints

cloudexperts@master1:~/example-voting-app/k8s-specifications$ kubectl get endpoint -n vote
NAME     ENDPOINTS        AGE
db       10.47.0.3:5432   3m50s
redis    10.36.0.1:6379   3m49s
result   10.36.0.2:80     3m48s
vote     10.47.0.4:80     3m47s

Below, I will scale the vote deployment and we will see how the endpoints change. But before doing that, let's observe the rules set up by kube-proxy.

Dynamic iptables Rules Created by kube-proxy

Let us look at the iptables rules that were created to allow traffic to be forwarded to the vote service on port 5000 (NodePort 31000).

For the vote service of type NodePort with cluster IP 10.102.56.81:

root@node01:~# iptables -t nat -L -n | grep 10.102.56.81
KUBE-SVC-DUGGATBIC525RBNS  tcp  --  0.0.0.0/0   10.102.56.81  /* vote/vote:vote-service cluster IP */ tcp dpt:5000

root@node01:~# iptables-save | grep 10.102.56.81
-A KUBE-SERVICES -d 10.102.56.81/32 -p tcp -m comment --comment "vote/vote:vote-service cluster IP" -m tcp --dport 5000 -j KUBE-SVC-DUGGATBIC525RBNS

For the endpoint IP of 10.47.0.4

root@node01:~# iptables -t nat -L -n | grep 10.47.0.4
KUBE-MARK-MASQ  all  --  10.47.0.4  0.0.0.0/0     /* vote/vote:vote-service */
DNAT       tcp  --  0.0.0.0/0       0.0.0.0/0     /* vote/vote:vote-service */ tcp to:10.47.0.4:80


root@node01:~# iptables-save | grep 10.47.0.4
-A KUBE-SEP-QACHRRHBJBZ6KHV4 -s 10.47.0.4/32 -m comment --comment "vote/vote:vote-service" -j KUBE-MARK-MASQ
-A KUBE-SEP-QACHRRHBJBZ6KHV4 -p tcp -m comment --comment "vote/vote:vote-service" -m tcp -j DNAT --to-destination 10.47.0.4:80

Now let us scale the deployment and check the endpoints. We will notice that the endpoints/pods for the vote service have increased to reflect the newly scaled pods.

cloudexperts@master1:~$ kubectl scale deployment vote --replicas=4 -n vote
deployment.apps/vote scaled

cloudexperts@master1:~$ kubectl get pods -n vote -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
db-6789fcc76c-7hkfl       1/1     Running   0          3d    10.47.0.3   node04   <none>           <none>
redis-554668f9bf-qfd4q    1/1     Running   0          3d    10.36.0.1   node02   <none>           <none>
result-79bf6bc748-rz848   1/1     Running   246        3d    10.36.0.2   node02   <none>           <none>
vote-7478984bfb-76jp6     1/1     Running   0          30s   10.36.0.7   node02   <none>           <none>
vote-7478984bfb-g64rq     1/1     Running   0          30s   10.44.0.1   node01   <none>           <none>
vote-7478984bfb-kxzgj     1/1     Running   0          3d    10.47.0.4   node04   <none>           <none>
vote-7478984bfb-m94fn     1/1     Running   0          30s   10.39.0.7   node03   <none>           <none>
worker-dd46d7584-2pbg8    1/1     Running   35         3d    10.47.0.5   node04   <none>           <none>

cloudexperts@master1:~$ kubectl get endpoints -n vote -o wide
NAME     ENDPOINTS                                            AGE
db       10.47.0.3:5432                                       3d
redis    10.36.0.1:6379                                       3d
result   10.36.0.2:80                                         3d
vote     10.36.0.7:80,10.39.0.7:80,10.44.0.1:80 + 1 more...   3d

Checking the iptables rules after scaling, we notice that the service-level rule has not changed: it still matches traffic from any source (0.0.0.0/0) destined for the vote service cluster IP 10.102.56.81. What changes is that kube-proxy adds a new KUBE-SEP chain for each new endpoint.

Let us confirm that the new endpoint/pod (IP 10.36.0.7) now has entries in iptables. We can see that the rules have been updated to reflect the change:

cloudexperts@master1:~$ sudo iptables-save | grep 10.36.0.7
-A KUBE-SEP-UDZ74PXMXU63SJ4J -s 10.36.0.7/32 -m comment --comment "vote/vote:vote-service" -j KUBE-MARK-MASQ
-A KUBE-SEP-UDZ74PXMXU63SJ4J -p tcp -m comment --comment "vote/vote:vote-service" -m tcp -j DNAT --to-destination 10.36.0.7:80

cloudexperts@master1:~$ sudo iptables -t nat -L -n | grep 10.36.0.7
KUBE-MARK-MASQ  all  --  10.36.0.7            0.0.0.0/0            /* vote/vote:vote-service */
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            /* vote/vote:vote-service */ tcp to:10.36.0.7:80

Note that the iptables rules are the same on every node in the cluster.

Traffic Flow from Service to Pods

The question here is: how does the service forward traffic to one of the pods that will serve the request? To see how this works, let us use the vote NodePort service as an example. Note that with a NodePort service, you can access the service using the IP of any cluster node together with the node port (in the range 30000–32767).
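For example (a sketch, not from the original post), with the vote service exposed on NodePort 31000 as shown earlier, and <node-ip> standing in for the address of any cluster node:

curl http://<node-ip>:31000
# Any node's IP works, even if no vote pod is running on that node;
# kube-proxy forwards the request to one of the service endpoints.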

Step 1: Traffic hits a Kubernetes Node on a Service Nodeport

Step 2: The iptables rules forward the traffic to the proxy port (we explored the rules above). There are iptables rules for traffic originating from the node (DNAT rules) and for traffic passing through the node (REDIRECT rules). All nodes have the same iptables rules. The kube-proxy on each node watches the kube-apiserver and creates the iptables rules whenever a service object is created.

Step 3: The node port, controlled by kube-proxy, has a load-balancing Endpoints object behind it.

Step 4: The Endpoints object points to a number of pods selected by the label selectors. Some of the pods may be on the same node, others on different nodes. See the output of kubectl get endpoints -n vote above.

Step 5: How a pod is selected by the service endpoint depends on the Kube-proxy mode. There are 3 modes:

  • Userspace – round robin
  • iptables – random
  • IPVS – round robin, least connections, destination hashing, etc.

Userspace has the lowest performance. The iptables mode works quite well for small clusters, but for large clusters (say, more than 5,000 nodes), IPVS is recommended.
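If you want to confirm which mode kube-proxy is running in, here is a sketch (assuming a kubeadm-style cluster where kube-proxy is configured through the kube-proxy ConfigMap, and shell access to a node for the second command):

kubectl -n kube-system get configmap kube-proxy -o yaml | grep -A1 "mode:"
# or, from a node, ask kube-proxy directly on its metrics port:
curl -s http://localhost:10249/proxyMode    # prints e.g. "iptables" or "ipvs"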

Step 6: A pod is selected by the Endpoints object and traffic is forwarded to the selected pod. Each pod has iptables rules forwarding traffic to it. If the pod is on another node, the CNI takes over and forwards the traffic to the node where the pod resides. Kube-proxy can maintain session affinity with the selected pod; it uses the kernel's connection-tracking (conntrack) feature to keep sending a client's traffic to the same pod.
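Session affinity is configured on the Service object itself. Below is a minimal sketch (based on the vote service values shown earlier, but not taken from the series' manifests) of what enabling ClientIP affinity could look like:

apiVersion: v1
kind: Service
metadata:
  name: vote
  namespace: vote
spec:
  type: NodePort
  selector:
    app: vote
  sessionAffinity: ClientIP        # default is None; ClientIP pins a client to one pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800        # how long the stickiness lasts (3 hours here)
  ports:
    - name: vote-service
      port: 5000                   # ClusterIP port
      targetPort: 80               # container port, as seen in the endpoints listing
      nodePort: 31000              # the NodePort used throughout this post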

 

Conclusion

Here I have explored how Kubernetes networking is set up to enable communication between the microservices running on it. I explained how the networking topology is laid out and then proceeded to explain the different kinds of communication that take place between microservices. Communication between pods on the same node and on different nodes was explored. I then looked into service discovery, explaining how one service finds another. Finally, the low-level details of how kube-proxy forwards traffic from a service object to the microservice endpoints were explored.
