Intro to K8s
What is K8s?
An open-source system for automating the deployment, scaling and management of containerised applications. It supports on-prem, public cloud and hybrid cloud environments.
Key Features:
Service discovery and load balancing
Self-healing
Horizontal scaling
IPv4/IPv6 dual stack
Automated rollouts and rollbacks
Secret and configuration management
Storage orchestration
A Kubernetes cluster is made up of masters and nodes. These can be Linux hosts of almost any kind: VMs, bare-metal servers, cloud instances, etc.
Control Plane (MASTER)
The master is a collection of services that make up the control plane for the cluster. It is the brain of the cluster, where all the control and scheduling decisions are made, and it runs a number of specialised control loops and services.
Best way to set up the master
Simple setups like labs and test environments can run all the master services on a single host
The best practice for all other setups is a multi-master high-availability (HA) configuration
HA masters are the default in most managed cloud offerings like Azure Kubernetes Service (AKS), AWS Elastic Kubernetes Service (EKS) and Google Kubernetes Engine (GKE)
Running three or five master replicas in an HA configuration is common practice (an odd number makes it easier to maintain quorum)
Do not run user applications on the master since you want the master to concentrate fully on managing the cluster
The control plane provides:
API server:
The front end / gateway to the control plane; all instructions and communications go through it.
This is the central component of the Kubernetes cluster. All internal and external components communicate via the API server.
It exposes a RESTful API that you POST YAML configuration files to over HTTPS. These YAML files are called manifests. Manifests contain the desired state of the application (which image is used for which container, which ports are exposed, how many pod replicas we want); a minimal sketch follows below.
All the requests the API server receives are authenticated and authorised. Once that is done, the configuration is validated, persisted in the cluster store and deployed.
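A hedged example of such a manifest (all names hypothetical): a minimal single-container Pod declaring which image to run and which port it listens on:

apiVersion: v1
kind: Pod
metadata:
  name: hello-pod          # hypothetical name
  labels:
    app: hello
spec:
  containers:
  - name: web
    image: nginx:1.25      # which image to use for the container
    ports:
    - containerPort: 80    # which port the container listens on

A Deployment manifest (sketched later in this post) adds the replica count on top of a pod template like this.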
etcd (cluster store):
The only stateful part of the control plane (state being information retained for future use)
Persistently stores the entire configuration and state of the cluster
Based on etcd (a popular distributed key-value store)
A good practice is to run 3-5 etcd replicas for HA; this provides adequate redundancy to recover when things go wrong
Prefers consistency over availability (uses the Raft consensus algorithm)
Kube controller manager:
Implements all the background control loops that monitor the cluster
It's a controller of controllers: it spawns all the independent control loops and monitors them
Some of these control loops include the node controller, the endpoints controller and the ReplicaSet controller
These control loops run in the background and continuously watch the API server for changes. The aim is to ensure that the current state of the cluster matches the desired state
The logic of all these loops is essentially the following (a concrete example follows this list):
obtain the desired state
observe the current state
determine the difference if any
reconcile the differences
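For instance, the ReplicaSet controller's desired state is just a number in a manifest. In this hedged sketch (names hypothetical), the loop's whole job is to keep the observed pod count equal to spec.replicas:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: hello-rs             # hypothetical name
spec:
  replicas: 3                # desired state: three pods at all times
  selector:
    matchLabels:
      app: hello
  template:                  # template for the pods it creates
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: web
        image: nginx:1.25

If a pod dies, the observed count drops to two, the loop notices the difference and starts a replacement.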
Scheduler:
Watches the API server for any new tasks and assigns them to the appropriate nodes
Runs complex logic that filters out nodes incapable of running the task and ranks the nodes that are capable; the node with the highest ranking is selected to run the task
How is the ranking done?
Is the node tainted? (ref : https://www.densify.com/kubernetes-autoscaling/kubernetes-taints/)
Are there any affinity or anti-affinity rules? (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity)
Is the required network port available on this node?
Does it have enough resources?
Does it already have the required image?
How much free capacity does it have?
How many tasks is it already running?
Each of these questions has a weighting, and the task is assigned to the node that scores the highest number of points. A pod can influence several of these checks from its own spec, as in the sketch below.
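A hedged sketch (the labels, taint key and values are all hypothetical) combining resource requests, a node-affinity rule and a toleration for a taint:

apiVersion: v1
kind: Pod
metadata:
  name: picky-pod                      # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:                        # "does the node have enough resources?"
        cpu: "500m"
        memory: "256Mi"
  affinity:
    nodeAffinity:                      # "are there any affinity rules?"
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype              # hypothetical node label
            operator: In
            values: ["ssd"]
  tolerations:                         # "is the node tainted?" - this pod tolerates it
  - key: "dedicated"
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"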
Cloud controller manager:
If your cluster is running on a public cloud platform like AWS, Azure or GCP, the control plane will also run a cloud controller manager
It manages the integrations with the underlying cloud technologies and services (load balancers, storage etc.)
Worker (Nodes)
Nodes are the workers of a Kubernetes cluster. They do the following:
Watch the API server for new work assignments
Execute new assignments
Report back to the control plane via the API server
They are simpler than the masters.
Kube Proxy:
A network proxy that enables communication both inside and outside the cluster (pod-to-pod communication)
Runs on every node in the cluster and is responsible for local cluster networking
Implements local IPTABLES or IPVS rules to handle routing and load balancing of traffic on the Pod network; traffic addressed to a Service's IP (see the sketch below) is the classic case it handles
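When a Service like this hypothetical one is created, kube-proxy programs rules on every node so that traffic sent to the Service's IP is load-balanced across the pods matching its selector:

apiVersion: v1
kind: Service
metadata:
  name: hello-svc          # hypothetical name
spec:
  selector:
    app: hello             # pods backing this Service
  ports:
  - port: 80               # port the Service exposes
    targetPort: 80         # container port on the backing pods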
CRI (container runtime interface):
The container runtime is the software that actually runs containers and images; the CRI is the interface Kubernetes uses to talk to it
It handles pulling images and starting and stopping containers
Examples: Docker, cri-containerd (pronounced "container-dee")
Kubelet :
Main component of the worker node
Runs on every node in the cluster
The terms kubelet and node are often used interchangeably
Watches the API server for new work assignments
Maintains a reporting channel to the control plane
If it cannot execute a task, it lets the control plane know
Monitors and maintains information about the node it runs on and reports it to the API server (e.g. a pod got killed or is not running). The scheduler can then decide where the workload should be redeployed.
Kubernetes DNS
Every Kubernetes cluster has an internal DNS service that is vital to its operations
The cluster’s DNS service has a static IP address that is hard-coded into every pod on the cluster (so every pod knows how to find it)
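Pods pick this up through their DNS policy. A hedged sketch using the dnsPolicy and dnsConfig fields (the pod name and search domain are made up):

apiVersion: v1
kind: Pod
metadata:
  name: dns-demo                         # hypothetical name
spec:
  dnsPolicy: ClusterFirst                # the default: resolve via the cluster DNS service
  dnsConfig:
    searches:
    - my-namespace.svc.cluster.local     # illustrative extra search domain
  containers:
  - name: app
    image: nginx:1.25

With this policy, the pod's /etc/resolv.conf points at the cluster DNS, so Services can be reached by name.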
Data Plane
Pods
A group of whales is called a “pod of whales”. The Docker logo is a whale, so a group of containers is called a pod.
There are two ways to run a pod
Single container per pod
Multiple containers per pod (multi-container pods)
Either way, a Kubernetes Pod is a construct for running one or more containers.
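A hedged sketch of the multi-container case (names hypothetical): an app container plus a helper sidecar in one Pod. Because they share the Pod's network namespace, the sidecar can reach the app over localhost:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar      # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25         # main application container
    ports:
    - containerPort: 80
  - name: sidecar             # helper container in the same Pod
    image: busybox:1.36
    command: ["sh", "-c", "while true; do wget -qO- http://localhost:80 >/dev/null; sleep 10; done"]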
Pod Anatomy
A Pod is an environment to run containers in. Pods themselves do not run anything; a Pod is just a sandbox for hosting containers. Keeping it high level: you ring-fence an area of the host OS, build a network stack, create a bunch of kernel namespaces, and run one or more containers in it. That's a Pod.
Minimum unit of scaling
Pods are the minimum unit of scaling in K8s. If you are scaling up or down, you need to add or remove pods respectively. You should not scale by adding more containers in the same pod.
Atomic Operations
Deploying a pod is an atomic operation. A pod is considered ready only when all its containers are up and running.
Pod Lifecycle
Pods are mortal: they can die. If a pod dies unexpectedly, K8s brings up a new one in its place. This new pod has a new IP and ID. Kubernetes makes sure that all pods are on the same flat, routable Pod network, so containers in different pods can still communicate. The pod network plugin assigns these IP addresses.
Should all my containers be in the same pod?
Do all the services need to be co-located and co-scheduled?
Example: a database service and a front-end service
Do these two have to be in the same pod?
You can scale the front end to 10-12 instances, but you cannot do that with the database
So these two services should not be co-scheduled in the same pod
How does K8s ensure replication of pods?
Deployment: a YAML object that defines the pods and the number of pod instances, called replicas
The ReplicaSet controller is a service running on the control plane
The scheduler decides which nodes should run which pods, and how many
Kubernetes Networking
Pods communicate with other pods without NAT (network address translation) - they communicate directly using IP addresses
Nodes communicate with pods without NAT
Every node in the cluster is assigned a CIDR block - a range of IP addresses for the pods running on it
Network namespaces
A network namespace provides a logical networking stack with its own routes, firewall rules and network devices
By default, the node's "root" namespace is used (unless specified otherwise). When a pod comes up, it is given its own namespace
Namespaces are completely isolated from each other - communication between namespaces happens via virtual Ethernet (veth) devices that behave like patch cables
Example:
If you have 3 nodes (VMs, PCs, etc.) and you want them to behave like a cluster, they need to be able to communicate among themselves.
Each pod gets a unique IP address. By default, Kubernetes allocates a separate range of IPs to each node to avoid conflicting IPs.
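On clusters that allocate per-node ranges you can see this on the Node object itself; a trimmed, hypothetical example:

apiVersion: v1
kind: Node
metadata:
  name: worker-1            # hypothetical node name
spec:
  podCIDR: 10.244.1.0/24    # pod IPs on this node come out of this range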
Within a pod, all containers are in the same network namespace - they can communicate with each other using localhost
What about communication from one pod to another?
Deployments
For any application to run on a K8s cluster, it needs to be:
Packaged in a container
Wrapped in a pod (a pod is a wrapper that allows a container to run on a Kubernetes cluster)
Deployed via a manifest file
Pods are deployed via a higher-level controller. The most common controller is the Deployment. It offers:
Scalability
Self-healing
Rolling updates
Deployments are defined in YAML manifest files that specify things like which image to use and how many replicas to deploy. The Deployment YAML is POSTed to the API server as the desired state of the application, and K8s implements it.
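A hedged sketch of such a manifest (all names hypothetical): a Deployment asking for three replicas of a single-container pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-deploy          # hypothetical name
spec:
  replicas: 3                 # how many pod replicas to run
  selector:
    matchLabels:
      app: hello
  template:                   # the pod being replicated
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: web
        image: nginx:1.25     # which image to use
        ports:
        - containerPort: 80   # which port to listen on

You would POST this with something like kubectl apply -f deploy.yml; from then on the Deployment controller works to keep three replicas running.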
The Declarative model
Declare the desired state of an application in a manifest file
Manifest files are written in YAML and they tell K8s how the application should look (desired state)
It mentions which image to use, how many replicas to run, which network ports to listen on, how to perform updates
POST it to the API server
The most common way of doing this is using the kubectl command-line utility
This sends the manifest to the control plane as an HTTP POST, usually on port 443
K8s stores it in the cluster store as the application's desired state
Once the request is authenticated and authorized, K8s inspects the manifest and identifies which controller to send it to
It records the config in the cluster store as part of the cluster’s overall desired state
K8s implements the desired state
Work gets scheduled on the cluster
Images are pulled, containers are started, networks are built, and the application processes are started
K8s implements watch loops to make sure the current state of the application does not vary from the desired state
K8s uses background reconciliation loops that constantly monitor the state of the cluster
If the current state of the cluster varies from the desired state, K8s will perform whatever tasks are necessary to reconcile them
Declarative vs Imperative Model
The imperative model is where you provide a long list of platform-specific commands to build things. The declarative model is the opposite of this.
The declarative model:
Is simpler than the imperative model
Enables self-healing and scaling, and lends itself to version control and self-documentation
Tells the cluster how things should look (the desired state)
If things stop looking like this, the cluster notices the discrepancy and reconciles back to the desired state
That’s all for this post! In the next post I will document more on pods, networking and installation and usage of K8s.