This is part 6 of the series on Managing Microservices with Kubernetes. You can read part 1 here, part 2 here, part 3 here, part 4 here and part 5 here.
In part 1, we saw how Kubernetes is used to deploy a microservice. In that blog, I mentioned that a number of Kubernetes objects are used to deploy the voting application – Namespaces, Labels and Selectors, Pods, ReplicaSets, Deployments, and Service objects. In part 2 we explored some of the Kubernetes objects used in building the microservice application, and we also explored scaling a microservices application using the ReplicaSet controller object. In part 3 we explored using the Deployment object to scale the microservice application. In part 4 we explored the Kubernetes Service object, and in part 5 we explored CI/CD of microservices.
In this blog, I will explore communication in microservices, expanding on the discussion of the Service object in part 4. After this blog, Kubernetes networking should make more sense to you.
The diagram below shows a Kubernetes cluster comprising 2 worker nodes (the master node is not shown). Both hosts are connected to the same L2 network (192.168.0.0/24): node 1 has IP address 192.168.0.10 while node 2 has IP address 192.168.0.11. Pods are scheduled on the worker nodes, and each worker node here is running 2 pods. Each worker node has a virtual switch created by Docker called Docker0, and each Docker0 switch has a network CIDR different from that of the other node. The Docker0 switches must have different network CIDRs so that every pod in the cluster can have a unique IP address. All the pods on a given node are connected to the same Docker0 virtual switch, so they can reach each other.
To ensure that pods are fully isolated from one another, each pod resides in its own namespace. There is also a default namespace in which the Linux OS on the worker node resides; this separates the host OS namespace from the pod namespaces. A Linux namespace is a kernel feature that Docker (or any other container runtime) uses to provide a separate, secure, and isolated environment for each container/pod. Within its namespace, a pod sees only itself; it can reach other pods only through virtual networking. In the diagram, each node has 3 namespaces: 1 for the host OS and 2 for the 2 pods on the node.
Each pod is connected to the Docker0 virtual switch by a virtual cable called a virtual Ethernet (veth) pair. Each pod has a virtual NIC, seen as eth0 inside the pod, and this is one end of the veth pair. The other end of the veth pair lives in the default namespace of the host OS and appears as a port on the Docker0 virtual switch.
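If you would like to see these building blocks on a live worker node, the commands below are a minimal sketch, assuming a node set up as described above (pods attached to a Docker0 bridge); the exact interface names will vary with the CNI in use.
# List the network namespaces on the node: one for the host OS plus one per pod.
sudo lsns -t net
# Show the Docker0 bridge and the pod CIDR it serves on this node.
ip addr show docker0
# List the host-side ends of the veth pairs; each one is plugged into the bridge
# and its peer appears as eth0 inside a pod.
ip -brief link show type veth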
Here is a summary of how networking should work in a Kubernetes cluster (the Kubernetes network model):
- Every pod gets its own unique IP address.
- Pods can communicate with all other pods in the cluster, across nodes, without NAT.
- Node agents (such as the kubelet) can communicate with all pods on their node.
This network model makes it easy to migrate on-premises applications to a Kubernetes cluster, as a pod is treated much like a VM or a bare-metal machine. The network model is implemented in different ways by different vendors using the Container Network Interface (CNI) plugin specification located at this GitHub location. Two of the most popular implementations are Calico and Weave Net.
There are different types of communication that can happen in a Kubernetes cluster and hence between microservices.
Containers in a pod share the IP of the pod, so containers that exist in the same pod must listen on different ports. In this way, containers can reach each other through the pod IP or the loopback address (127.0.0.1). A container can communicate with another container in the same pod just by using either the pod IP or the loopback address, along with the port. For example, one container can ‘curl’ another container on localhost:80 or 10.0.0.5:80, where 10.0.0.5 is the IP of the pod.
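As an illustration, here is a minimal sketch of a two-container pod (the pod name shared-ip-demo and the images are my own choice, not part of the voting app): an nginx container listens on port 80 and a busybox sidecar reaches it over the loopback address, because both containers share the pod's network namespace.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: shared-ip-demo
spec:
  containers:
  - name: web
    image: nginx          # serves on port 80
    ports:
    - containerPort: 80
  - name: sidecar
    image: busybox
    command: ["sh", "-c", "sleep 3600"]   # kept alive so we can exec into it
EOF
# From the sidecar container, the nginx container is reachable on localhost.
kubectl exec shared-ip-demo -c sidecar -- wget -qO- http://127.0.0.1:80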
Pods on the same node are connected to the same bridge, which acts as an L2 device. A bridge is created by the CNI plugin on every node and given a subnet CIDR, and every pod that connects to that bridge gets an IP address allocated from the CIDR. The output below is taken from a cluster that has 3 nodes (cl2node1, cl2node2, cl2node3). As you can see, all the pods on the same node have IP addresses from the same CIDR, which is different from the CIDR used on the other nodes. For example, pods on node cl2node1 have IP addresses 172.16.111.1, 172.16.111.2, and 172.16.111.3, pods on node cl2node2 have IP addresses 172.16.227.129 and 172.16.227.130, while pods on cl2node3 have IP addresses 172.16.228.65 and 172.16.228.66.
cloudexperts@cl2master:~$ kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
nginx-f89759699-4txhr     1/1     Running   0          2d20h   172.16.227.129   cl2node2   <none>           <none>
nginx-f89759699-crhvg     1/1     Running   0          2d20h   172.16.227.130   cl2node2   <none>           <none>
nginx-f89759699-s5288     1/1     Running   0          2d20h   172.16.111.2     cl2node1   <none>           <none>
nginx-f89759699-vffx5     1/1     Running   0          2d20h   172.16.111.1     cl2node1   <none>           <none>
nginx1-56db585f94-9swp5   1/1     Running   0          2d15h   172.16.228.66    cl2node3   <none>           <none>
nginx1-56db585f94-t6tqv   1/1     Running   0          2d15h   172.16.111.3     cl2node1   <none>           <none>
nginx1-56db585f94-xgtqd   1/1     Running   0          2d15h   172.16.228.65    cl2node3   <none>           <none>
Since pods on the same node are on the same virtual layer 2 network, they communicate using L2 principles: a pod first obtains the MAC address of the destination pod (using ARP), after which the two pods communicate directly at layer 2.
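You can sometimes observe this directly by looking at a pod's ARP (neighbour) table. The command below is only a sketch: it assumes the container image ships the iproute2 tools (many slim images do not), and it reuses one of the pod names from the listing above.
# Show the neighbour cache of a pod on cl2node1; the bridge gateway and any
# same-node pods it has talked to appear with their resolved MAC addresses.
kubectl exec nginx-f89759699-s5288 -- ip neigh show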
Here we have to realize that the network of the pods is different from the network connecting the nodes together. As you can see from the listing below, the nodes are on the network 192.168.0.0/24:
cloudexperts@cl2master:~$ kubectl get nodes -o wide
NAME        STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
cl2master   Ready    master   11d     v1.18.3   192.168.0.14   <none>        Ubuntu 18.04.4 LTS   4.15.0-101-generic   docker://19.3.9
cl2node1    Ready    <none>   11d     v1.18.3   192.168.0.16   <none>        Ubuntu 18.04.4 LTS   4.15.0-101-generic   docker://19.3.9
cl2node2    Ready    <none>   11d     v1.18.3   192.168.0.18   <none>        Ubuntu 18.04.4 LTS   4.15.0-106-generic   docker://19.3.9
cl2node3    Ready    <none>   2d20h   v1.18.3   192.168.0.19   <none>        Ubuntu 18.04.4 LTS   4.15.0-106-generic   docker://19.3.9
And the pods are on their respective node network as explained earlier. You can see the listing of pods below.
cloudexperts@cl2master:~$ kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
nginx-f89759699-4txhr     1/1     Running   0          2d20h   172.16.227.129   cl2node2   <none>           <none>
nginx-f89759699-crhvg     1/1     Running   0          2d20h   172.16.227.130   cl2node2   <none>           <none>
nginx-f89759699-s5288     1/1     Running   0          2d20h   172.16.111.2     cl2node1   <none>           <none>
nginx-f89759699-vffx5     1/1     Running   0          2d20h   172.16.111.1     cl2node1   <none>           <none>
nginx1-56db585f94-9swp5   1/1     Running   0          2d15h   172.16.228.66    cl2node3   <none>           <none>
nginx1-56db585f94-t6tqv   1/1     Running   0          2d15h   172.16.111.3     cl2node1   <none>           <none>
nginx1-56db585f94-xgtqd   1/1     Running   0          2d15h   172.16.228.65    cl2node3   <none>           <none>
For a pod on one node to communicate with a pod on another node, therefore, something must stop the node network from dropping the pod packets. This is the job of the CNI. CNIs mostly use either overlay technologies such as VXLAN or IP-in-IP, or plain L3 routing. Here I will discuss how 2 major CNIs implement this.
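One way to see how a CNI achieves this is to look at a node's routing table. The sketch below assumes you are logged on to cl2node1 of the Calico-based cluster shown earlier; the exact routes (direct next hops, a tunl0 IPIP device, or a VXLAN device) depend on the CNI and the mode it is configured in.
# Routes for the other nodes' pod CIDRs (172.16.227.x, 172.16.228.x) should point
# at the respective node IPs or at an overlay/tunnel interface.
ip route | grep 172.16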
You can read more about Weave Networking on this link.
On a cluster deployed with the Weave Net plugin, which uses a VXLAN overlay, here is a traceroute showing the traffic hops from the master node to the pod 10.47.0.8 residing on one of the cluster nodes. This will always show a direct connection, as traffic travels through the tunnels established between the nodes.
cloudexperts@master1:~$ traceroute 10.47.0.8
traceroute to 10.47.0.8 (10.47.0.8), 30 hops max, 60 byte packets
 1  10.47.0.8 (10.47.0.8)  25.239 ms  24.822 ms  24.800 ms
You can read more about Calico Networking on this link.
Here is the traceroute showing the traffic hops from the master node to the pod 172.16.111.25 residing on one of the cluster nodes. In this listing we can see that a hop/router is traversed before reaching the destination pod, unlike with the Weave Net CNI, which presents a direct connection between the pods.
cloudexperts@cl2master:~$ traceroute 172.16.111.25
traceroute to 172.16.111.25 (172.16.111.25), 30 hops max, 60 byte packets
 1  172.16.111.0 (172.16.111.0)  1.386 ms  1.257 ms  1.098 ms
 2  172.16.111.25 (172.16.111.25)  1.463 ms  1.327 ms  7.887 ms
In this section, I will break down the various steps involved in pod-to-service communication.
How do pods find the IP addresses of the Service objects they need to connect to? Pods in Kubernetes can discover services using either the environment variables that the kubelet injects into every container at creation time, or the cluster DNS service. The listing below shows the service-related environment variables inside one of the vote pods:
VOTE_SERVICE_HOST=10.103.71.163
KUBERNETES_PORT=tcp://10.96.0.1:443
REDIS_SERVICE_PORT=6379
KUBERNETES_SERVICE_PORT=443
REDIS_PORT=tcp://10.106.221.98:6379
VOTE_PORT_5000_TCP=tcp://10.103.71.163:5000
REDIS_SERVICE_PORT_REDIS_SERVICE=6379
REDIS_PORT_6379_TCP_ADDR=10.106.221.98
DB_PORT=tcp://10.103.146.13:5432
DB_SERVICE_PORT=5432
HOSTNAME=vote-59654d4c9-xdgvp
REDIS_PORT_6379_TCP_PORT=6379
REDIS_PORT_6379_TCP_PROTO=tcp
DB_PORT_5432_TCP=tcp://10.103.146.13:5432
VOTE_SERVICE_PORT=5000
VOTE_PORT=tcp://10.103.71.163:5000
RESULT_PORT_5001_TCP_ADDR=10.102.81.197
RESULT_PORT_5001_TCP_PORT=5001
RESULT_PORT_5001_TCP_PROTO=tcp
REDIS_PORT_6379_TCP=tcp://10.106.221.98:6379
RESULT_SERVICE_PORT_RESULT_SERVICE=5001
DB_PORT_5432_TCP_ADDR=10.103.146.13
DB_SERVICE_HOST=10.103.146.13
DB_PORT_5432_TCP_PORT=5432
DB_SERVICE_PORT_DB_SERVICE=5432
DB_PORT_5432_TCP_PROTO=tcp
For DNS-based discovery, each pod's /etc/resolv.conf points at the cluster DNS service:
cloudexperts@master1:~$ kubectl exec -it vote-59654d4c9-xdgvp /bin/sh -n vote
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/app # cat /etc/resolv.conf
nameserver 10.96.0.10
search vote.svc.cluster.local svc.cluster.local cluster.local tx.rr.com
The IP address 10.96.0.10 is the IP address of the DNS service as can be seen below
cloudexperts@master1:~$ kubectl get svc -n kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   101d
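To confirm that service names resolve through this DNS service, you can run a throwaway pod and look the names up. This is only a sketch (busybox's nslookup can be flaky in some image versions, so you may want to pin an older tag such as busybox:1.28):
# Short names resolve from within the same namespace ...
kubectl run dns-test --rm -it --restart=Never --image=busybox -n vote -- nslookup vote
# ... and the fully qualified name <service>.<namespace>.svc.cluster.local resolves from anywhere.
kubectl run dns-test --rm -it --restart=Never --image=busybox -- nslookup vote.vote.svc.cluster.local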
The kube-proxy, which runs on every node of the Kubernetes cluster, coordinates the forwarding of traffic to the service endpoints.
Below, when you do kubectl get endpoints, you see all the pods that the vote service frontend forwards traffic to:
First we check the services:
cloudexperts@master1:~$ kubectl get svc -n vote -o wide
NAME     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE     SELECTOR
db       ClusterIP   10.97.59.155    <none>        5432/TCP         2d23h   app=db
redis    ClusterIP   10.98.165.243   <none>        6379/TCP         2d23h   app=redis
result   NodePort    10.101.97.208   <none>        5001:31001/TCP   2d23h   app=result
vote     NodePort    10.102.56.81    <none>        5000:31000/TCP   2d23h   app=vote
Then we see the endpoints
cloudexperts@master1:~/example-voting-app/k8s-specifications$ kubectl get endpoint -n vote
NAME     ENDPOINTS        AGE
db       10.47.0.3:5432   3m50s
redis    10.36.0.1:6379   3m49s
result   10.36.0.2:80     3m48s
vote     10.47.0.4:80     3m47s
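These endpoints are nothing more than the ready pods that match each service's label selector. For example, listing the pods with the vote service's selector (app=vote, from the SELECTOR column above) should return the same IP seen in the vote endpoint:
# Pods matching the vote service's selector; their IPs are the service's endpoints.
kubectl get pods -n vote -l app=vote -o wide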
Below, I will scale the vote deployment and we will see how the endpoints change. But before doing that, let's observe the rules set up by the kube-proxy.
Let us look at the iptables rules that were created to allow traffic to be forwarded to the vote service on port 5000 (NodePort 31000).
For the vote service of type NodePort with cluster IP 10.102.56.81:
root@node01:~# iptables -t nat -L -n | grep 10.102.56.81
KUBE-SVC-DUGGATBIC525RBNS  tcp  --  0.0.0.0/0  10.102.56.81  /* vote/vote:vote-service cluster IP */ tcp dpt:5000
root@node01:~# iptables-save | grep 10.102.56.81
-A KUBE-SERVICES -d 10.102.56.81/32 -p tcp -m comment --comment "vote/vote:vote-service cluster IP" -m tcp --dport 5000 -j KUBE-SVC-DUGGATBIC525RBNS
For the endpoint IP of 10.47.0.4
root@node01:~# iptables -t nat -L -n | grep 10.47.0.4
KUBE-MARK-MASQ  all  --  10.47.0.4  0.0.0.0/0  /* vote/vote:vote-service */
DNAT  tcp  --  0.0.0.0/0  0.0.0.0/0  /* vote/vote:vote-service */ tcp to:10.47.0.4:80
root@node01:~# iptables-save | grep 10.47.0.4
-A KUBE-SEP-QACHRRHBJBZ6KHV4 -s 10.47.0.4/32 -m comment --comment "vote/vote:vote-service" -j KUBE-MARK-MASQ
-A KUBE-SEP-QACHRRHBJBZ6KHV4 -p tcp -m comment --comment "vote/vote:vote-service" -m tcp -j DNAT --to-destination 10.47.0.4:80
Now let us scale the deployment and check the endpoints. We will notice that the endpoints/pods for the vote service increase to reflect the newly scheduled pods.
cloudexperts@master1:~$ kubectl scale deployment vote --replicas=4 -n vote
deployment.apps/vote scaled
cloudexperts@master1:~$ kubectl get pods -n vote -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP          NODE     NOMINATED NODE   READINESS GATES
db-6789fcc76c-7hkfl       1/1     Running   0          3d    10.47.0.3   node04   <none>           <none>
redis-554668f9bf-qfd4q    1/1     Running   0          3d    10.36.0.1   node02   <none>           <none>
result-79bf6bc748-rz848   1/1     Running   246        3d    10.36.0.2   node02   <none>           <none>
vote-7478984bfb-76jp6     1/1     Running   0          30s   10.36.0.7   node02   <none>           <none>
vote-7478984bfb-g64rq     1/1     Running   0          30s   10.44.0.1   node01   <none>           <none>
vote-7478984bfb-kxzgj     1/1     Running   0          3d    10.47.0.4   node04   <none>           <none>
vote-7478984bfb-m94fn     1/1     Running   0          30s   10.39.0.7   node03   <none>           <none>
worker-dd46d7584-2pbg8    1/1     Running   35         3d    10.47.0.5   node04   <none>           <none>
cloudexperts@master1:~$ kubectl get endpoints -n vote -o wide
NAME     ENDPOINTS                                             AGE
db       10.47.0.3:5432                                        3d
redis    10.36.0.1:6379                                        3d
result   10.36.0.2:80                                          3d
vote     10.36.0.7:80,10.39.0.7:80,10.44.0.1:80 + 1 more...    3d
Checking the iptables rules after scaling, we see that the service-level rule for the vote cluster IP (10.102.56.81) has not changed, since it matches traffic from any source (0.0.0.0/0) and simply jumps to the service chain. What kube-proxy adds is a pair of per-endpoint (KUBE-SEP) rules for each new pod; below we can see that the newly created endpoint 10.36.0.7 now has its own entries:
cloudexperts@master1:~$ sudo iptables-save | grep 10.36.0.7
-A KUBE-SEP-UDZ74PXMXU63SJ4J -s 10.36.0.7/32 -m comment --comment "vote/vote:vote-service" -j KUBE-MARK-MASQ
-A KUBE-SEP-UDZ74PXMXU63SJ4J -p tcp -m comment --comment "vote/vote:vote-service" -m tcp -j DNAT --to-destination 10.36.0.7:80
cloudexperts@master1:~$ sudo iptables -t nat -L -n | grep 10.36.0.7
KUBE-MARK-MASQ  all  --  10.36.0.7  0.0.0.0/0  /* vote/vote:vote-service */
DNAT  tcp  --  0.0.0.0/0  0.0.0.0/0  /* vote/vote:vote-service */ tcp to:10.36.0.7:80
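What does change after scaling is the service's own chain: for each endpoint, kube-proxy adds a jump to a KUBE-SEP-* chain, guarded by the iptables statistic module so that traffic is spread randomly across the endpoints. A sketch of how to inspect this (the KUBE-SVC-... chain name is cluster-specific; it is the one that appeared in the service rule earlier):
# With 4 vote endpoints you should see 4 KUBE-SEP-* jumps, each with
# "-m statistic --mode random --probability ..." deciding which endpoint gets the packet.
sudo iptables -t nat -L KUBE-SVC-DUGGATBIC525RBNS -n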
Note that the iptables rules are the same on every node in the cluster.
The question here is: how does the service forward traffic to one of the pods that will serve the request? To see how this works, let us use the vote NodePort service as an example. Note that with a NodePort service, you can access the service using the IP of any cluster node along with the allocated node port (from the range 30000-32767).
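For example, with the vote service exposing NodePort 31000 (see the service listing above), a quick check could look like the sketch below, where <node-ip> is a placeholder for the IP address of any node in that cluster:
# The vote frontend answers on port 31000 of every node, even nodes not running a vote pod.
curl http://<node-ip>:31000/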
Step 1: Traffic hits a Kubernetes node on a service NodePort.
Step 2: The iptables rules forward the traffic to the proxy port (we explored the rules above). There are iptables rules for traffic originating from the node (DNAT rules) and for traffic passing through the node (REDIRECT rules). All nodes have the same iptables rules. The kube-proxy on each node watches the kube-apiserver and creates the iptables rules whenever a Service object is created.
Step 3: The node port, controlled by the kube-proxy, has a load-balancing Endpoints object associated with it.
Step 4: The Endpoints object points to a number of pods selected by the service's label selector. Some of the pods may be on the same node, others on different nodes. See the output of kubectl get endpoints -n vote above.
Step 5: How a pod is selected by the service endpoint depends on the kube-proxy mode. There are 3 modes: userspace, iptables, and IPVS.
Userspace mode has the lowest performance. The iptables mode works quite well for small clusters, but for large clusters (say, more than 5000 nodes) IPVS is recommended.
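If you want to confirm which mode your cluster is using, the sketch below assumes a kubeadm-style cluster like the ones in this series, where kube-proxy runs as a DaemonSet and reads its mode from a ConfigMap (an empty mode means the default, iptables):
# The configured proxy mode (empty = iptables by default).
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"
# kube-proxy also logs the proxier it chose at startup, e.g. "Using iptables Proxier".
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=100 | grep -i proxier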
Step 6: A pod is selected by the Endpoints object and traffic is forwarded to the selected pod. Each endpoint pod has iptables rules that DNAT traffic to it. If the pod is on another node, the CNI takes over and forwards the traffic to the node where the pod resides. The kube-proxy can also maintain session affinity with the selected pod; it uses the kernel's connection-tracking (conntrack) feature to keep forwarding a client's connection to the same pod.
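Session affinity can be turned on per service. The sketch below patches the vote service so that requests from the same client IP keep landing on the same pod (Kubernetes supports ClientIP affinity on Service objects; the default is None):
# Pin each client IP to a single backend pod (timeout defaults to 3 hours).
kubectl patch svc vote -n vote -p '{"spec":{"sessionAffinity":"ClientIP"}}'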
In this blog, I have explored how Kubernetes networking is set up to enable communication between the microservices running on it. I explained how the networking topology is laid out and then described the different kinds of communication that take place between microservices, covering pods on the same node and on different nodes. I then looked into service discovery, explaining how one service finds another, and finally walked through the low-level details of how the kube-proxy forwards traffic from a Service object to the microservice endpoints.