15-minute read
Debugging Kubernetes Deployments

Written by Serkan Özal
Founder and CTO of Thundra

Kubernetes is a useful tool for deploying and managing containerized applications, yet even seasoned Kubernetes enthusiasts agree that it’s hard to debug Kubernetes deployments and failing pods. This is because of the distributed nature of Kubernetes, which makes it hard to reproduce the exact issue and determine the root cause.
This article will cover some of the general guidelines you can use to debug your Kubernetes deployments and some of the common errors and issues you can expect to encounter.
Tools to Use to Debug the Kubernetes Cluster
It’s important to ensure that our process remains the same whether we’re debugging our application in Kubernetes or on a physical machine. The tools we use will be the same, but with Kubernetes we’re going to probe the state and outputs of the system. We can always start our debugging process using kubectl, or we can use some of the common Kubernetes debugging tools.
Kubectl is a command-line tool that lets you control your Kubernetes cluster and comes with a wide range of commands, such as kubectl get [-o wide] nodes, pods, and svc.
One of the critical commands is get, which lists one or more resources. For example, kubectl get pods will help you check your pod status, and you can check the status of worker nodes using kubectl get nodes. You can also pass -o with an output format to get additional information. For example, when you pass -o wide with kubectl get pods, the node is included in the output (kubectl get pods -o wide).
Note: One of the important flags that come with the kubectl command is -w (watch flag), which lets you start watching the updates to a particular object (for example, kubectl get pods -w).
kubectl describe (node, pod, svc) <name> -o yaml
The kubectl describe command displays detailed information about resources. For example, kubectl describe nodes <node name> displays detailed information about the node named <node name>. Similarly, to get detailed information about a pod, use kubectl describe pod <pod name>. If you pass the flag -o yaml at the end of the kubectl describe command, it displays the output in YAML format (kubectl describe pod <pod name> -o yaml).
kubectl logs
Kubectl logs is another useful debugging command used to print the logs for a container in a pod. For example, kubectl logs <pod name> prints the logs of the pod named <pod name>.
Note: To stream the logs, use the -f flag along with the kubectl logs command. The -f flag works similarly to the tail -f command in Linux. An example is kubectl logs -f <pod name>.
kubectl -v (verbosity)
For kubectl, verbosity is controlled with the -v or --v flag, followed by an integer representing the log level. The integer ranges from 0 to 9, where 0 is least verbose and 9 is most verbose. For example, kubectl -v=9 get nodes displays HTTP request contents without truncation.
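For instance, if the full payload dump of level 9 is too noisy, a mid-range level that logs the HTTP requests kubectl makes is often enough (a quick sketch; exact output varies by kubectl version):
$ kubectl get nodes -v=6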
kubectl exec
Kubectl exec is a useful command to debug a running container. You can run commands like kubectl exec <pod name> -- cat /var/log/messages to look at the logs from a given pod, or kubectl exec -it <pod name> -- sh to open a shell in the given pod.
kubectl get events
Kubectl events give you high-level information about what is happening inside the cluster. To list all the events, you can use a command like kubectl get events, or if you are looking for a specific type of event such as Warning, you can use kubectl get events --field-selector type=Warning.
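Since events are not guaranteed to come back in order, it can also help to sort them by creation time (a handy variation on the command above):
$ kubectl get events --sort-by=.metadata.creationTimestamp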
Debugging Pods
There are two common reasons why pods fail in Kubernetes:
- Startup failure: A container inside the pod doesn’t start.
- Runtime failure: The application code fails after container startup.
Debugging Common Pod Errors with Step-by-Step and Real-World Examples
CrashLoopBackOff
A CrashLoopBackOff error means that when your pod starts, it crashes; it then tries to start again, but crashes again. Here are some of the common causes of CrashLoopBackOff:
- An error in the application inside the container.
- Misconfiguration inside the container (such as a missing ENTRYPOINT or CMD).
- A liveness probe that failed too many times (a sample probe spec follows this list).
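For reference, a liveness probe that triggers restarts after repeated failures is configured roughly like this inside a container spec (a minimal sketch; the /healthz path, port, and threshold values are assumptions for illustration):
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
If the endpoint fails failureThreshold times in a row, the kubelet restarts the container, which is exactly the loop you observe as CrashLoopBackOff.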
Commands to Run to Identify the Issue
The first step in debugging this issue is to check pod status by running the kubectl get pods command.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox 0/1 CrashLoopBackOff 1 27s
Here we see our pod status under the STATUS column and verify that it’s in a CrashLoopBackOff state.
As a next troubleshooting step, we are going to run the kubectl logs command to print the logs for a container in a pod.
$ kubectl logs busybox
sleep: invalid number '-z'
Here we see that the pod is broken because an unknown option, -z, was passed to the sleep command. Verify your pod definition file.
$ cat podbroken.yml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  labels:
    env: Prod
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - "sleep"
    - "-z"
To fix this issue, specify a valid argument for the sleep command in your pod definition file.
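A corrected args section might look like this (a sketch; the "3600" duration, one hour, is an assumed value):
args:
- "sleep"
- "3600"
Then recreate the pod: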
$ kubectl create -f podbroken.yml
pod/busybox created
Verify the status of your pod.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox 1/1 Running 1 63s
ErrImagePull/ImagePullBackOff
An ErrImagePull/ImagePullBackOff error occurs when you are not able to pull the desired Docker image. These are some of the common causes of this error:
- The image you have provided has an invalid name.
- The image tag doesn’t exist.
- The specified image is in a private registry.
Commands to Run to Identify the Issue
As with the CrashLoopBackOff error, our first troubleshooting step starts with getting the pod status using the kubectl get pods command.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
redis 0/1 ErrImagePull 0 117s
Next we will run the kubectl describe command to get detailed information about the pod.
$ kubectl describe pod redis
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 46s default-scheduler Successfully assigned default/redis to node01
Normal BackOff 16s (x2 over 43s) kubelet, node01 Back-off pulling image "redis123"
Warning Failed 16s (x2 over 43s) kubelet, node01 Error: ImagePullBackOff
Normal Pulling 2s (x3 over 45s) kubelet, node01 Pulling image "redis123"
Warning Failed 1s (x3 over 44s) kubelet, node01 Failed to pull image "redis123": rpc error: code = Unknown desc = Error response from daemon: pull access denied for redis123, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
Warning Failed 1s (x3 over 44s) kubelet, node01 Error: ErrImagePull
As you can see in the output of the kubectl describe command, it was unable to pull the image named redis123.
To find the correct image name, you can either go to Docker Hub or run the docker search command specifying the image name on the command line.
$ docker search redis
NAME DESCRIPTION STARS OFFICIAL AUTOMATED
redis Redis is an open source key-value store that… 8985 [OK]
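If you want to be doubly sure that the image and tag can be pulled, you can also try pulling it directly on a machine with Docker installed (an optional sanity check):
$ docker pull redis:latest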
To fix this issue, we can follow either of the following approaches:
- Start with the kubectl edit command, which allows you to directly edit any API resource you have retrieved via the command-line tool. Go under the spec section and change the image name from redis123 to redis.
- If you already have the yaml file, you can edit it directly, and if you don’t, you can use the --dry-run option to generate it.
kubectl edit pod redis
...
spec:
  containers:
  - image: redis
...
$ kubectl run redis --image redis --dry-run=client -o yaml > redispod.yml
$ cat redispod.yml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
Since the generated file already uses the correct image name (redis) under the containers spec, you can recreate the pod from it directly.
$ kubectl create -f redispod.yml
pod/redis created
CreateContainerConfigError
This error usually occurs when you’re missing a ConfigMap or Secret. Both are ways to inject data into the container when it starts up.
Commands to Run to Identify the Issue
We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
configmap-pod 0/1 CreateContainerConfigError 0 58s
As the output of kubectl get pods shows, the pod is in a CreateContainerConfigError state. Our next command is kubectl describe, which gets detailed information about the pod.
$ kubectl describe pod configmap-pod
Warning Failed 44s (x8 over 2m10s) kubelet Error: configmap "my-config" not found
Normal Pulled 44s kubelet Successfully pulled image "gcr.io/google_containers/busybox" in 1.002510873s
Normal Pulling 31s (x9 over 2m14s) kubelet Pulling image "gcr.io/google_containers/busybox"
To retrieve information about the ConfigMap, use this command:
$ kubectl get configmap
As the above command returns nothing, the next step is to verify the pod definition file and create the ConfigMap.
apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
  - name: mytest-container
    image: busybox
    command: [ "/bin/sh", "-c", "env" ]
    env:
    - name: MY_KEY
      valueFrom:
        configMapKeyRef:
          name: my-config
          key: name
To fix the error, create a ConfigMap file whose content looks like this:
$ cat my-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
data:
  name: test
  value: user
$ kubectl apply -f my-config.yaml
configmap/my-config created
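Alternatively, you can create the same ConfigMap imperatively and let kubectl build the object for you (a sketch using the same two keys):
$ kubectl create configmap my-config --from-literal=name=test --from-literal=value=user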
Now run kubectl get configmap to retrieve information about the ConfigMap, and this time, you’ll see the newly created ConfigMap.
$ kubectl get configmap
NAME DATA AGE
my-config 2 11s
Verify the status of the pod, and you will see that the pod is in a Running state now.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
configmap-pod 1/1 Running 4 7m29s
ContainerCreating Pod
The main cause of the ContainerCreating error is that your container is looking for a Secret that is not present. Secrets in Kubernetes let you store sensitive information such as tokens, passwords, and SSH keys.
Commands to Run to Identify the Issue
We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
secret-pod 0/1 ContainerCreating 0 7s
As the output of kubectl get pods shows, the pod is in a ContainerCreating state. Our next command is the kubectl describe command, which gets detailed information about the pod.
$ kubectl describe pod secret-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25s default-scheduler Successfully assigned default/secret-pod to 17e9458eef1c.mylabserver.com
Normal Pulled 18s kubelet Successfully pulled image "nginx" in 666.826881ms
Normal Pulled 17s kubelet Successfully pulled image "nginx" in 472.277634ms
Normal Pulling 5s (x3 over 18s) kubelet Pulling image "nginx"
Warning Failed 4s (x3 over 17s) kubelet Error: secret "myothersecret" not found
Normal Pulled 4s kubelet Successfully pulled image "nginx" in 476.69613ms
To retrieve information about the secret, use this command:
$ kubectl get secret
As the above command returns nothing, the next step is to verify the pod definition file and create the secret.
$ cat secret.yml
apiVersion: v1
kind: Pod
metadata:
  name: secret-pod
spec:
  containers:
  - name: test-container
    image: nginx
    envFrom:
    - secretRef:
        name: myothersecret
To fix the error, create a secret file whose content looks like this:
$ cat secret-data.yml
apiVersion: v1
kind: Secret
metadata:
  name: myothersecret
data:
  USER_NAME: YWRtaW4=
  PASSWORD: MWYyZDFlMmU2N2Rm
Secrets are similar to ConfigMaps, except the values are stored base64-encoded.
For example, you can use the base64 command to encode any text data.
$ echo -n "username" | base64
dXNlcm5hbWU=
Use the following commands to decode the text and print the original text:
$ echo -n "dXNlcm5hbWU=" | base64 --decode
username
$ kubectl create -f secret-data.yml
secret/myothersecret created
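Alternatively, kubectl can create the secret and handle the base64 encoding for you (a sketch; the literal values here are simply the decoded forms of the data shown above):
$ kubectl create secret generic myothersecret --from-literal=USER_NAME=admin --from-literal=PASSWORD=1f2d1e2e67df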
Now run kubectl get secret to retrieve information about the secret, and this time you will see the newly created secret.
$ kubectl get secret
NAME TYPE DATA AGE
myothersecret Opaque 2 20m
Verify the status of the pod, and you will see that the pod is in a Running state now.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
secret-pod 1/1 Running 0 2m36s
Debugging Worker Nodes
A worker node in the Kubernetes cluster is responsible for running your containerized application. To debug worker node failure, we need to follow the same systematic approach and tools as we used while debugging pod failures. To reinforce the concept, we will look at three different scenarios where your worker node is in NotReady state.
NotReady State
There may be multiple reasons why your worker node goes into a NotReady state. Here are some of them:
- The virtual machine hosting the worker node has shut down.
- There is a network issue between the worker and master nodes.
- There is a crash within the Kubernetes software.
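Before digging into the kubelet, kubectl describe node is a quick way to see which node condition is failing (a general-purpose first probe):
$ kubectl describe node node01
Check the Conditions section of the output for the reason the node is reported NotReady.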
As you know, the kubelet is the binary present on the worker node, and it is responsible for running containers on the node, so we’ll start our debugging with the kubelet.
Scenario 1: Worker Node is in NotReady State (kubelet is in inactive [dead] state)
Commands to Run to Identify the Issue
Before we start debugging the kubelet binary, let’s first check the status of the worker nodes using kubectl get nodes.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready master 21m v1.19.0
node01 NotReady <none> 20m v1.19.0
Now, as we have confirmed that our worker node (node01) is in NotReady state, the next command we will run is ps to check the status of the kubelet.
$ ps -ef | grep -i kubelet
root 21200 16471 0 04:46 pts/1 00:00:00 grep -i kubelet
As we see, the kubelet process is not running. We can run the systemctl command to verify it further.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: inactive (dead) since Tue 2020-10-20 04:34:33 UTC; 1min 53s ago
Docs: https://kubernetes.io/docs/home/
Process: 1455 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
Main PID: 1455 (code=exited, status=0/SUCCESS)
Oct 20 04:34:33 node01 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Oct 20 04:34:33 node01 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Using the systemctl command, we confirmed that the kubelet is not running (Active: inactive (dead)). Before we debug it further, we can try to start the kubelet service and see if that helps. To start the kubelet, run the systemctl start kubelet command.
# systemctl start kubelet
Verify the status of kubelet again.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2020-10-20 04:37:14 UTC; 1s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 13240 (kubelet)
Tasks: 8 (limit: 4678)
CGroup: /system.slice/kubelet.service
└─13240 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kube
Based on the systemctl status kubelet output, we can say that starting the kubelet helped, and it’s now in a running state. Let’s verify it from the Kubernetes master node.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready master 23m v1.19.0
node01 Ready <none> 23m v1.19.0
As you can see, the worker node is now in Ready state.
Scenario 2: Worker Node is in NotReady state (kubelet is in activating [auto-restart] state)
In scenario 2, your worker node is again in NotReady state, and you’ve followed all the steps to start the kubelet (systemctl start kubelet), but that doesn’t help. This time, the kubelet service status is activating (auto-restart), which means it’s not able to start.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Tue 2020-10-20 04:41:58 UTC; 1s ago
Here we’ll start our debugging with a command called journalctl. Using journalctl, you can view the logs collected by systemd. In this case, we will pass the -u option to see the logs of a particular service, the kubelet.
# journalctl -u kubelet
You will see the following error message:
Oct 20 04:54:47 node01 kubelet[28692]: F1020 04:54:47.021580 28692 server.go:253] unable to load client CA file /etc/kubernetes/pki/my-test-file.crt: open /etc/kubernetes/pki/my-test-file.crt
Oct 20 04:54:47 node01 kubelet[28692]: goroutine 1 [running]:
Note: To jump to the end of the journalctl output, press SHIFT+G.
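If the journal is long, you can also jump straight to the most recent entries or follow new ones as they arrive (standard journalctl options):
# journalctl -u kubelet --no-pager | tail -20
# journalctl -u kubelet -f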
Our next step is to identify the configuration file. Since this cluster was set up by kubeadm, you will see a custom drop-in configuration file.
# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
--kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
This file also references the actual configuration file used by the kubelet. Open that file and check the line starting with clientCAFile:
# cat /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/my-test-file.crt
You will see the incorrect CA path for clients (/etc/kubernetes/pki/my-test-file.crt). Let’s check the valid path under /etc/kubernetes/pki.
# ls -l /etc/kubernetes/pki/
total 4
-rw-r--r-- 1 root root 1066 Oct 20 04:14 ca.crt
Replace it with the correct file path and restart the daemon. Change:
clientCAFile: /etc/kubernetes/pki/my-test-file.crt
to:
clientCAFile: /etc/kubernetes/pki/ca.crt
# systemctl restart kubelet
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2020-10-20 05:05:38 UTC; 7s ago
Again, verify it from the Kubernetes master node.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready master 23m v1.19.0
node01 Ready <none> 23m v1.19.0
As you can see, the worker node is now in Ready state.
Scenario 3: Worker Node is in NotReady state (kubelet is in active [running] state)
In scenario 3, your worker node is again in NotReady state, and although you’ve followed all the steps for starting the kubelet (systemctl start kubelet), that doesn’t help. This time, the kubelet service is active (running), so the service itself is healthy, yet the node still reports NotReady. We will start our debugging with the standard process by running the systemctl status command.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2020-10-20 05:06:17 UTC; 6min ago
Docs: https://kubernetes.io/docs/home/
Main PID: 6904 (kubelet)
Tasks: 13 (limit: 4678)
CGroup: /system.slice/kubelet.service
└─6904 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-pl
Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.104242 6904 kubelet.go:2183] node "node01" not found
Oct 20 05:12:56 node01 kubelet[6904]: I1020 05:12:56.121262 6904 kubelet_node_status.go:70] Attempting to register node node01
Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.122256 6904 kubelet_node_status.go:92] Unable to register node "node01" with API server: Post "https://172.17.0.39:6553/api/v1/nodes"
If we look at the last line of the systemctl status command output, we’ll see that the kubelet cannot communicate with the API server. To get information about the API server, go to the Kubernetes master node and run kubectl cluster-info.
$ kubectl cluster-info
Kubernetes master is running at https://172.17.0.39:6443
KubeDNS is running at https://172.17.0.39:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
As you can see, there is a mismatch between the port on which the API server is running (6443) and the one to which the kubelet is trying to connect (6553). Here we’ll follow the same step as in scenario 2, that is, identify the configuration file.
# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
This time, though, you should check /etc/kubernetes/kubelet.conf, where the details about the API server are stored. If you open this file, you’ll see that the port number defined for the API server is wrong; it should be 6443, as we saw in the output of kubectl cluster-info.
# cat /etc/kubernetes/kubelet.conf
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:
    server: https://172.17.0.39:6553
  name: default-cluster
Change port 6553 to 6443 and restart the kubelet daemon.
# systemctl daemon-reload
# systemctl restart kubelet
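As in the earlier scenarios, verify from the master node that node01 has returned to a Ready state:
$ kubectl get nodes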
Debugging Control Plane
The Kubernetes control plane is the brains behind Kubernetes and is responsible for managing the Kubernetes cluster. To debug control plane failure, we need to follow the same systematic approach and tools we used while debugging worker node failures. In this case, we will try to replicate the scenario where we are trying to deploy an application and it is failing.
Pod in a Pending State
What are some of the reasons that a pod goes into a Pending state?
- The cluster doesn’t have enough resources.
- The current namespace has a resource quota.
- The pod is bound to a persistent volume claim that cannot be satisfied.
- You’re trying to deploy an application, and it’s failing.
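On a busy cluster, you can narrow the view to just the pods stuck in Pending before digging further (a useful filter, though not part of the original walkthrough):
$ kubectl get pods --field-selector=status.phase=Pending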
Commands to Run to Identify the Issue
We will start our debugging with the kubectl get all command. So far, we have used kubectl get pods and kubectl get nodes, but to list all the resources, we pass all on the command line. The command looks like this:
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/app-586bddbc54-xd8hs 0/1 Pending 0 16s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 83s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/app 0/1 1 0 16s
NAME DESIRED CURRENT READY AGE
replicaset.apps/app-586bddbc54 1 1 0 16s
As you can see, the pod is in a Pending state. The scheduler is the component responsible for scheduling the pod on a node. Let’s check the scheduler status in the kube-system namespace.
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-f9fd979d6-48wdf 1/1 Running 0 5m54s
coredns-f9fd979d6-rl55d 1/1 Running 0 5m54s
etcd-controlplane 1/1 Running 0 6m2s
kube-apiserver-controlplane 1/1 Running 0 6m2s
kube-controller-manager-controlplane 1/1 Running 0 6m2s
kube-flannel-ds-amd64-qbn7x 1/1 Running 0 5m44s
kube-flannel-ds-amd64-wzcmn 1/1 Running 0 5m53s
kube-proxy-b645c 1/1 Running 0 5m53s
kube-proxy-m4lnk 1/1 Running 0 5m44s
kube-scheduler-controlplane 0/1 CrashLoopBackOff 5 5m2s
As you can see in the output, the kube-scheduler-controlplane pod is in a CrashLoopBackOff state, and we already know that when the pod is in this state, it will try to start but will crash.
The next command we’ll run is the kubectl describe command to get detailed information about the pod.
$ kubectl describe pod kube-scheduler-controlplane -n kube-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 5m16s (x5 over 6m44s) kubelet, controlplane Container image "k8s.gcr.io/kube-scheduler:v1.19.0" already present on machine
Normal Created 5m16s (x5 over 6m44s) kubelet, controlplane Created container kube-scheduler
Warning Failed 5m16s (x5 over 6m44s) kubelet, controlplane Error: failed to start container "kube-scheduler": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "kube-schedulerror": executable file not found in $PATH": unknown
Warning BackOff 103s (x27 over 6m42s) kubelet, controlplane Back-off restarting failed container
As you can see, the name of the scheduler binary is incorrect. To fix this, go to the Kubernetes manifests directory and open kube-scheduler.yaml.
# cd /etc/kubernetes/manifests/
$ cat kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-schedulerror
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
Under the command section, the name of the scheduler binary is wrong. Change it from kube-schedulerror to kube-scheduler. Once you fix it, the static pod will be recreated, and the application pod will be scheduled on a worker node.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
app-586bddbc54-xd8hs 1/1 Running 0 13m
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/app-586bddbc54-xd8hs 1/1 Running 0 14m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 15m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/app 1/1 1 1 14m
NAME DESIRED CURRENT READY AGE
replicaset.apps/app-586bddbc54 1 1 1 14m
Note: Here is another useful command you can use to debug the issue:
$ kubectl logs kube-scheduler-controlplane -n kube-system --tail=10 # (to get the last 10 lines)
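When a container is crash-looping, the logs of the previous, crashed instance often hold the actual error; the --previous flag of kubectl logs retrieves them:
$ kubectl logs kube-scheduler-controlplane -n kube-system --previous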
Summary
In this article, we have covered some of the ways to debug the Kubernetes cluster. Because Kubernetes is dynamic, it’s hard to cover all use cases, but some of the techniques we’ve discussed will help you get started on your Kubernetes debugging journey. If you want to go beyond debugging Kubernetes and debug the application itself, Thundra provides a rich feature set for production debugging with tracepoints and distributed tracing.
Kubernetes is a useful tool for deploying and managing containerized applications, yet even seasoned Kubernetes enthusiasts agree that it’s hard to debug Kubernetes deployments and failing pods. This is because of the distributed nature of Kubernetes, which makes it hard to reproduce the exact issue and determine the root cause.
This article will cover some of the general guidelines you can use to debug your Kubernetes deployments and some of the common errors and issues you can expect to encounter.
Tools to Use to Debug the Kubernetes Cluster
It’s important to ensure that our process remains the same whether we’re debugging our application in Kubernetes or a physical machine. The tools we use will be the same, but with Kubernetes we’re going to probe the state and outputs of the system. We can always start our debugging process using kubectl, or we can use some of the common Kubernetes debugging tools.
Kubectl is a command-line tool that lets you control your Kubernetes cluster and comes with a wide range of commands, such as kubectl get [-o wide]
nodes
, pods
, and svc
.
One of the critical commands is get, which lists one or more resources. For example, kubectl get pods will help you check your pod status, and you can check the status of worker nodes using kubectl get nodes. You can also pass -o with an output format to get additional information. For example, when you pass -o wide with kubectl get pods, the node is included in the output (kubectl get pods -o wide
).
Note: One of the important flags that come with the kubectl command is -w(watch flag), which lets you start watching the updates to a particular object (for example, kubectl get pods -w
):
kubectl describe (node, pod, svc) <name> -o yaml
The kubectl describe nodes display detailed information about resources. For example, the kubectl describe nodes <node name> displays detailed information about the node with the name <node name>. Similarly, to get detailed information about pod use, you should use kubectl describe pod <pod name>. If you pass the flag -o yaml at the end of the kubectl describe command, it displays the output in yaml format (kubectl describe pod <pod name> -o yaml
).
kubectl logs
Kubectl logs is another useful debugging command used to print the logs for a container in a pod. For example, kubectl logs <pod name>
prints the pod’s logs with a name pod name.
Note: To stream the logs, use the -f flag along with the kubectl logs command. The -f flag works similarly to the tail -f command in Linux. An example is kubectl logs -f <pod name>
.
kubectl -v(verbosity)
For kubectl, verbosity is controlled with -v or –v flags. After -v, we will pass an integer to represent the log level. The integer ranges from 0 to 9, where 0 is least verbose and 9 is most verbose. An example is kubectl -v=9 get nodes display HTTP request contents without truncation of contents.
kubectl exec
Kubectl exec is a useful command to debug the running container. You can run commands like kubectl exec <pod name> -- cat /var/log/messages
to look at the logs from a given pod, or kubectl exec -it <pod name> --sh
to log in to the given pod.
kubectl get events
Kubectl events give you high-level information about what is happening inside the cluster. To list all the events, you can use a command like kubectl get events
, or if you are looking for a specific type of event such as Warning, you can use kubectl get events --field-selector type=Warning
.
Debugging Pod
There are two common reasons why pods fail in Kubernetes:
- Startup failure: A container inside the pod doesn’t start.
- Runtime failure: The application code fails after container startup.
Debugging Common Pod Errors with Step-by-Step and Real-World Examples
CrashLoopBackoff
A CrashLoopBackoff error means that when your pod starts, it crashes; it then tries to start again, but crashes again. Here are some of the common causes of CrashLoopBackoff:
- An error in the application inside the container.
- Misconfiguration inside the container (such as a missing ENTRYPOINT or CMD).
- A liveness probe that failed too many times.
Commands to Run to Identify the Issue
The first step in debugging this issue is to check pod status by running the kubectl get pods command.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox 0/1 CrashLoopBackOff 1 27s
Here we see our pod status under the STATUS column and verify that it’s in a CrashLoopBackOff state.
As a next troubleshooting step, we are going to run the kubectl logs
command to print the logs for a container in a pod.
$ kubectl logs busybox
sleep: invalid number '-z'
Here we see that the reason for this broken pod is specification of an unknown option -z in the sleep command. Verify your pod definition file.
$ cat podbroken.yml
apiVersion: v1
kind: Pod
metadata:
name: busybox
labels:
env: Prod
spec:
containers:
- name: busybox
image: busybox
args:
- "sleep"
- "-z"
To fix this issue, you’ll need to specify a valid option under the sleep
command in your pod definition file and then run the following command
$ kubectl create -f broken.yml
pod/busybox created
Verify the status of your pod.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox 1/1 Running 1 63s
ErrImagePull/ImagePullBackOff
An ErrImagePull/ImagePullBackOff error occurs when you are not able to pull the desired docker image. These are some of the common causes of this error:
- The image you have provided has an invalid name.
- The image tag doesn’t exist.
- The specified image is in the private registry.
Commands to Run to Identify the Issue
As with the CrashLoopBackoff error, our first troubleshooting step starts with getting the pod status using the kubectl get pods
command.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
redis 0/1 ErrImagePull 0 117s
Next we will run the kubectl describe
command to get detailed information about the pod.
$ kubectl describe pod redis
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 46s default-scheduler Successfully assigned default/redis to node01
Normal BackOff 16s (x2 over 43s) kubelet, node01 Back-off pulling image "redis123"
Warning Failed 16s (x2 over 43s) kubelet, node01 Error: ImagePullBackOff
Normal Pulling 2s (x3 over 45s) kubelet, node01 Pulling image "redis123"
Warning Failed 1s (x3 over 44s) kubelet, node01 Failed to pull image "redis123": rpc error: code = Unknown desc = Error response from daemon: pull access denied for redis123, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
Warning Failed 1s (x3 over 44s) kubelet, node01 Error: ErrImagePull
As you can see in the output of the kubectl describe
command, it was unable to pull the image name redis123.
To find the correct image name, you can either go to Docker Hub or run the docker search command specifying the image name on the command line.
$ docker search redis
NAME DESCRIPTION STARS OFFICIAL AUTOMATED
redis Redis is an open source key-value store that… 8985 [OK]
To fix this issue, we can follow either one of the following approaches:
- Start with the kubectl edit command, which allows you to directly edit any API resource you have retrieved via the command-line tool. Go under the spec section and change the image name from redis123 to redis.
- If you already have the yaml file, you can edit it directly, and if you don’t, you can use the
--dry-run
option to generate it.
kubectl edit pod redis
...
spec:
containers:
- image: redis
...
$ kubectl run redis --image redis --dry-run=client -o yaml > redispod.yml
$ cat redispod.yml
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: redis
name: redis
spec:
containers:
- image: redis
name: redis123
resources: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
status: {}
Similarly, under image spec, change the image name to redis and recreate the pod.
$ kubectl create -f redispod.yml
pod/redis created
CreateContainerConfigError
This error usually occurs when you’re missing a ConfigMap or Secret. Both are ways to inject data into the container when it starts up.
Commands to Run to Identify the Issue
We will start our debugging process with the standard Kubernetes debugging command kubectl get pods
.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
configmap-pod 0/1 CreateContainerConfigError 0 58s
As the output of kubectl get pods
shows, pod is in a CreateContainerConfigError state. Our next command is kubectl describe
, which gets the detailed information about the pod.
$ kubectl describe pod configmap-pod
Warning Failed 44s (x8 over 2m10s) kubelet Error: configmap "my-config" not found
Normal Pulled 44s kubelet Successfully pulled image "gcr.io/google_containers/busybox" in 1.002510873s
Normal Pulling 31s (x9 over 2m14s) kubelet Pulling image "gcr.io/google_containers/busybox"
To retrieve information about the ConfigMap, use this command:
$ kubectl get configmap
As the output of the above command is null, the next step is to verify the pod definition file and create the ConfigMap.
apiVersion: v1
kind: Pod
metadata:
name: configmap-pod
spec:
containers:
- name: mytest-container
image: busybox
command: [ "/bin/sh", "-c", "env" ]
env:
- name: MY_KEY
valueFrom:
configMapKeyRef:
name: my-config
key: name
To fix the error, create a ConfigMap file, whose content will look like this:
$ cat my-config.yaml
apiVersion: v1
data:
name: test
value: user
kind: ConfigMap
metadata:
name: my-config
$ kubectl apply -f configmap-special-config.yaml
configmap/special-config created
Now run kubectl get configmap
to retrieve information about the ConfigMap, and this time, you’ll see the newly created ConfigMap.
$ kubectl get configmap
NAME DATA AGE
my-config 2 11s
Verify the status of the pod, and you will see the pod will be running state now.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
configmap-pod 1/1 Running 4 7m29s
ContainerCreating Pod
The main cause of the ContainerCreating error is that your container is looking for a secret that is not present. Secrets in Kubernetes let you store sensitive information such as tokens, passwords, and ssh keys.
Commands to Run to Identify the Issue
We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
secret-pod 0/1 ContainerCreating 0 7s
As the output of kubectl get pods
shows, the pod is in a ContainerCreating state. Our next command is the kubectl describe
command, which gets the detailed information about the pod.
$ kubectl describe pod secret-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25s default-scheduler Successfully assigned default/secret-pod to 17e9458eef1c.mylabserver.com
Normal Pulled 18s kubelet Successfully pulled image "nginx" in 666.826881ms
Normal Pulled 17s kubelet Successfully pulled image "nginx" in 472.277634ms
Normal Pulling 5s (x3 over 18s) kubelet Pulling image "nginx"
Warning Failed 4s (x3 over 17s) kubelet Error: secret "myothersecret" not found
Normal Pulled 4s kubelet Successfully pulled image "nginx" in 476.69613ms
To retrieve information about the secret, use this command:
$ kubectl get secret
As the above command’s output is null, the next step is to verify the pod definition file and create the secret.
$ cat secret.yml
apiVersion: v1
kind: Pod
metadata:
name: secret-pod
spec:
containers:
- name: test-contaner
image: nginx
envFrom:
- secretRef:
name: myothersecret
To fix the error, create a secret file whose content looks like this.
$ cat secret-data.yml
apiVersion: v1
kind: Secret
metadata:
name: myothersecret
data:
USER_NAME: YWRtaW4=
PASSWORD: MWYyZDFlMmU2N2Rm
Secrets are similar to ConfigMap, except they are stored in a base64 encoded or hashed format.
For example, you can use the base64 command to encode any text data.
$ echo -n "username" | base64
dXNlcm5hbWU=
Use the following commands to decode the text and print the original text:
$ echo -n "dXNlcm5hbWU=" | base64 --decode
username
$ kubectl create -f secret-data.yml
secret/myothersecret created
Now run kubectl get secret
to retrieve information about a secret, and this time you will see the newly created secret.
$ kubectl get secret
NAME TYPE DATA AGE
myothersecret Opaque 2 20m
Verify the status of the pod, and you will see that the pod is in running state now.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
secret-pod 1/1 Running 0 2m36s
Debugging Worker Nodes
A worker node in the Kubernetes cluster is responsible for running your containerized application. To debug worker node failure, we need to follow the same systematic approach and tools as we used while debugging pod failures.To reinforce the concept, we will look at three different scenarios where you see your worker node is in NotReady state.
NotReady State
There may be multiple reasons why your worker node goes into NotReady state. Here are some of them:
- A virtual machine where the worker node is running shut down.
- There is a network issue between worker and master node.
- There is a crash within the Kubernetes software.
But as you know, the kubelet is the binary present on the worker node, and it is responsible for running containers on the node. We’ll start our debugging with a kubelet.
Scenario 1: Worker Node is in NotReady State (kubelet is in inactive [dead] state)
Commands to Run to Identify the Issue
Before we start debugging at the kubelet binary, let’s first check the status of worker nodes using kubectl get nodes.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready master 21m v1.19.0
node01 NotReady <none> 20m v1.19.0
Now, as we have confirmed that our worker node (node01) is in NotReady state, the next command we will run is ps to check the status of the kubelet.
$ ps -ef | grep -i kubelet
root 21200 16471 0 04:46 pts/1 00:00:00 grep -i kubelet
As we see, the kubelet process is not running. We can run the systemctl
command to verify it further.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: inactive (dead) since Tue 2020-10-20 04:34:33 UTC; 1min 53s ago
Docs: https://kubernetes.io/docs/home/
Process: 1455 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
Main PID: 1455 (code=exited, status=0/SUCCESS)
Oct 20 04:34:33 node01 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Oct 20 04:34:33 node01 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Using the systemctl
command, we confirmed that the kubelet is not running (Active: inactive (dead)). Before we debug it further, we can try to start the kubelet service and see if that helps. To start the kubelet, run the systemctl start kubelet
command.
# systemctl start kubelet
Verify the status of kubelet again.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2020-10-20 04:37:14 UTC; 1s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 13240 (kubelet)
Tasks: 8 (limit: 4678)
CGroup: /system.slice/kubelet.service
└─13240 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kube
Based on the systemctl status kubelet output, we can say that starting kubelet helped, and now it’s in running state. Let’s verify it from the Kubernetes master node.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready master 23m v1.19.0
node01 Ready <none> 23m v1.19.0
As you can see, the worker node is now in Ready state.
Scenario 2: Worker Node is in NotReady state (kubelet is in activating [auto-restart] state)
In scenario 2, your worker node is again in NotReady state, and you’ve followed all the steps to start the kubelet (systemctl start kubelet
), but that doesn’t help. This time, the kubelet service status is activating (auto-restart
), which means it’s not able to start.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Tue 2020-10-20 04:41:58 UTC; 1s ago
Here we’ll start our debugging with a command called journalctl
. Using journalctl
, you can view the logs collected by systemd. In this case, we will pass the -u option to see the logs of a particular service, kubelet.
# journalctl -u kubelet
You will see the following error message:
Oct 20 04:54:47 node01 kubelet[28692]: F1020 04:54:47.021580 28692 server.go:253] unable to load client CA file /etc/kubernetes/pki/my-test-file.crt: open /etc/kubernetes/pki/my-test-file.crt
Oct 20 04:54:47 node01 kubelet[28692]: goroutine 1 [running]:
Note: To reach the bottom of the command, press SHIFT+G.
Our next step is to identify the configuration file. Since this scenario is created by kubeadm, you will see the custom configuration file.
# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
--kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
This file also has the reference to the actual configuration used by kubelet. Open this file and check the line starting with clientCAFile:
# cat /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
anonymous:
enabled: false
webhook:
cacheTTL: 0s
enabled: true
x509:
clientCAFile: /etc/kubernetes/pki/my-test-file.crt
You will see the incorrect CA path for clients (/etc/kubernetes/pki/my-test-file.crt. Let’s check the valid path under /etc/kubernetes/pki.
# ls -l /etc/kubernetes/pki/
total 4
-rw-r--r-- 1 root root 1066 Oct 20 04:14 ca.crt
Replace it with the correct file path and restart the daemon.
clientCAFile: /etc/kubernetes/pki/my-test-file.crt
TO
clientCAFile: /etc/kubernetes/pki/ca.crt
# systemctl restart kubelet
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2020-10-20 05:05:38 UTC; 7s ago
Again, verify it from the Kubernetes master node.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready master 23m v1.19.0
node01 Ready <none> 23m v1.19.0
As you can see, the worker node is now in Ready state.
Scenario 3 Worker Node is in NotReady state (kubelet is in active [running] state)
In scenario 3, your worker node is again in NotReady state, and although you’ve followed all the steps for starting the kubelet (systemctl start kubelet
), that doesn’t help. This time, the kubelet service status is loaded, which means that systemctl reads the unit from disk into memory, but it’s not able to start. We will start our debugging with the standard process by running the systemctl status
command.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2020-10-20 05:06:17 UTC; 6min ago
Docs: https://kubernetes.io/docs/home/
Main PID: 6904 (kubelet)
Tasks: 13 (limit: 4678)
CGroup: /system.slice/kubelet.service
└─6904 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-pl
Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.104242 6904 kubelet.go:2183] node "node01" not found
Oct 20 05:12:56 node01 kubelet[6904]: I1020 05:12:56.121262 6904 kubelet_node_status.go:70] Attempting to register node node01
Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.122256 6904 kubelet_node_status.go:92] Unable to register node "node01" with API server: Post "https://172.17.0.39:6553/api/v1/nodes"
If we look at the last line of the systemctl status
command, we’ll see that the kubelet cannot communicate with the API server.
To get information about the API server, go to the Kubernetes master node and run kubectl cluster-info
.
$ kubectl cluster-info
Kubernetes master is running at https://172.17.0.39:6443
KubeDNS is running at https://172.17.0.39:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
As you can see, there is a mismatch between the port in which the API server is running (6443) and the one to which kubectl is trying to connect (6553). Here we’ll follow the same step as in scenario 2, that is, identify the configuration file.
# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
This time, though, you should check /etc/kubernetes/kubelet.conf, where the details about ApiServer are stored.
If you open this file, you’ll see that the port number defined for ApiServer is wrong; it should be 6443, as we saw in the output of kubectl cluster-info
.
# cat /etc/kubernetes/kubelet.conf
apiVersion: v1
clusters:
- cluster:
certificate-authority-data:
server: https://172.17.0.39:6553
name: default-cluster
Change port 6553 to 6443 and restart the kubelet daemon.
# systemctl daemon-reload
# systemctl restart kubelet
Debugging Control Plane
The Kubernetes control plane is the brains behind Kubernetes and is responsible for managing the Kubernetes cluster. To debug control plane failure, we need to follow the same systematic approach and tools we used while debugging worker node failures. In this case, we will try to replicate the scenario where we are trying to deploy an application and it is failing.
Pod in a Pending State
What are some of the reasons that a pod goes into a Pending state?
- The cluster doesn’t have enough resources.
- The current namespace has a resource quota.
- The pod is bound to a persistent volume claim.
- You’re trying to deploy an application, and it’s failing.
Commands to Run to Identify the Issue
We will start our debugging with kubectl get all
commands. So far, we have used kubectl get pods
and nodes
, but to list all the resources, we need to pass all command line parameters. The command looks like this:
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/app-586bddbc54-xd8hs 0/1 Pending 0 16s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 83s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/app 0/1 1 0 16s
NAME DESIRED CURRENT READY AGE
replicaset.apps/app-586bddbc54 1 1 0 16s
As you can see, the pod is in a Pending state. The scheduler is the component responsible for scheduling the pod on the container. Let’s check the scheduler status in the kube-system
namespace.
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-f9fd979d6-48wdf 1/1 Running 0 5m54s
coredns-f9fd979d6-rl55d 1/1 Running 0 5m54s
etcd-controlplane 1/1 Running 0 6m2s
kube-apiserver-controlplane 1/1 Running 0 6m2s
kube-controller-manager-controlplane 1/1 Running 0 6m2s
kube-flannel-ds-amd64-qbn7x 1/1 Running 0 5m44s
kube-flannel-ds-amd64-wzcmn 1/1 Running 0 5m53s
kube-proxy-b645c 1/1 Running 0 5m53s
kube-proxy-m4lnk 1/1 Running 0 5m44s
kube-scheduler-controlplane 0/1 CrashLoopBackOff 5 5m2s
As you can see in the output, the kube-scheduler-controlplane
pod is in CrashLoopBackOff state, and we already know that when the pod is in this state, it will try to start but will crash.
The next command we’ll run is the kubectl describe
command to get detailed information about the pod.
$ kubectl describe pod kube-scheduler-controlplane -n kube-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 5m16s (x5 over 6m44s) kubelet, controlplane Container image "k8s.gcr.io/kube-scheduler:v1.19.0" already present on machine
Normal Created 5m16s (x5 over 6m44s) kubelet, controlplane Created container kube-scheduler
Warning Failed 5m16s (x5 over 6m44s) kubelet, controlplane Error: failed to start container "kube-scheduler": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "kube-schedulerror": executable file not found in $PATH": unknown
Warning BackOff 103s (x27 over 6m42s) kubelet, controlplane Back-off restarting failed container
As you can see, the name of the scheduler is incorrect. To fix this, go to the Kubernetes manifests directory and open kube-scheduler.yaml.
# cd /etc/kubernetes/manifests/
$ cat kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
component: kube-scheduler
tier: control-plane
name: kube-scheduler
namespace: kube-system
spec:
containers:
- command:
- kube-schedulerror
- --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
- --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
- --bind-address=127.0.0.1
- --kubeconfig=/etc/kubernetes/scheduler.conf
- --leader-elect=true
- --port=0
Under the command section, the name of the scheduler is wrong. Change it from kube-schedulerror to kube-schedule.
Once you fix it, you’ll see that it will recreate the pod, and the application pod will schedule a worker node.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
app-586bddbc54-xd8hs 1/1 Running 0 13m
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/app-586bddbc54-xd8hs 1/1 Running 0 14m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 15m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/app 1/1 1 1 14m
NAME DESIRED CURRENT READY AGE
replicaset.apps/app-586bddbc54 1 1 1 14m
Note: This is the other useful command you can use to debug the issue:
$ kubectl logs kube-scheduler-controlplane -n kube-system --tail=10 # (to get the last 10 lines)
Summary
In this article, we have covered some of the ways to debug the Kubernetes cluster. Because Kubernetes is dynamic, it’s hard to cover all use cases, but some of the techniques we’ve discussed will help you get started on your Kubernetes debugging journey. If you want to go beyond debugging Kubernetes and debug the application itself, Thundra provides rich feature set with production debugging with tracepoints and with distributed tracing.
Debugging Pods
There are two common reasons why pods fail in Kubernetes:
- Startup failure: A container inside the pod doesn’t start.
- Runtime failure: The application code fails after container startup.
Debugging Common Pod Errors: Step-by-Step, Real-World Examples
CrashLoopBackOff
A CrashLoopBackOff error means that your pod starts, crashes, is restarted, and crashes again, with Kubernetes waiting longer between each restart attempt. Here are some of the common causes of CrashLoopBackOff:
- An error in the application inside the container.
- Misconfiguration inside the container (such as a missing ENTRYPOINT or CMD).
- A liveness probe that fails too many times (see the probe sketch after this list).
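The probe case is easy to trip over. The snippet below is a hypothetical liveness probe (the path, port, and timings are illustrative, not taken from the example that follows) that restarts a slow-starting application before it ever becomes healthy:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 1  # too short for an app that needs longer to boot
  failureThreshold: 1     # a single failed check triggers a restart

Giving the probe a realistic initialDelaySeconds, or using a separate startup probe, avoids this failure mode.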
Commands to Run to Identify the Issue
The first step in debugging this issue is to check pod status by running the kubectl get pods command.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox 0/1 CrashLoopBackOff 1 27s
Here we see our pod status under the STATUS column and verify that it’s in a CrashLoopBackOff state.
As a next troubleshooting step, we are going to run the kubectl logs command to print the logs of the container in the pod.
$ kubectl logs busybox
sleep: invalid number '-z'
Here we see that the pod is broken because an invalid option, -z, was passed to the sleep command. Next, verify your pod definition file:
$ cat podbroken.yml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  labels:
    env: Prod
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - "sleep"
    - "-z"
To fix this issue, you’ll need to give the sleep command a valid argument in your pod definition file.
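For example, a duration in seconds keeps the container alive (the 3600 below is purely an illustrative value, not from the original example):

    args:
    - "sleep"
    - "3600"

Then delete the broken pod (kubectl delete pod busybox) and recreate it from the corrected file: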
$ kubectl create -f podbroken.yml
pod/busybox created
Verify the status of your pod.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox 1/1 Running 1 63s
ErrImagePull/ImagePullBackOff
An ErrImagePull/ImagePullBackOff error occurs when Kubernetes cannot pull the desired container image. These are some of the common causes of this error:
- The image name you have provided is invalid.
- The image tag doesn’t exist.
- The image lives in a private registry, and the pod has no credentials for it (see the sketch after this list).
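For the private-registry case, you would typically create a registry credential and reference it from the pod spec. The server, username, and password below are placeholders for your own registry details:

$ kubectl create secret docker-registry regcred --docker-server=<registry> --docker-username=<user> --docker-password=<password>

Then reference the credential in the pod spec:

spec:
  imagePullSecrets:
  - name: regcred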
Commands to Run to Identify the Issue
As with the CrashLoopBackOff error, our first troubleshooting step is to get the pod status using the kubectl get pods command.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
redis 0/1 ErrImagePull 0 117s
Next we will run the kubectl describe command to get detailed information about the pod.
$ kubectl describe pod redis
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 46s default-scheduler Successfully assigned default/redis to node01
Normal BackOff 16s (x2 over 43s) kubelet, node01 Back-off pulling image "redis123"
Warning Failed 16s (x2 over 43s) kubelet, node01 Error: ImagePullBackOff
Normal Pulling 2s (x3 over 45s) kubelet, node01 Pulling image "redis123"
Warning Failed 1s (x3 over 44s) kubelet, node01 Failed to pull image "redis123": rpc error: code = Unknown desc = Error response from daemon: pull access denied for redis123, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
Warning Failed 1s (x3 over 44s) kubelet, node01 Error: ErrImagePull
As you can see in the output of the kubectl describe command, the kubelet was unable to pull the image named redis123.
To find the correct image name, you can either go to Docker Hub or run the docker search command specifying the image name on the command line.
$ docker search redis
NAME DESCRIPTION STARS OFFICIAL AUTOMATED
redis Redis is an open source key-value store that… 8985 [OK]
To fix this issue, we can take either of the following approaches:
- Use the kubectl edit command, which lets you directly edit any API resource from the command line. Go to the spec section and change the image name from redis123 to redis.
- If you already have the yaml file, edit it directly; if you don’t, use the --dry-run option to generate it.
kubectl edit pod redis
...
spec:
  containers:
  - image: redis
...
$ kubectl run redis --image redis --dry-run=client -o yaml > redispod.yml
$ cat redispod.yml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
Since the file was generated with --image redis, the image name is already correct. Recreate the pod from it:
$ kubectl create -f redispod.yml
pod/redis created
CreateContainerConfigError
This error usually occurs when the pod references a ConfigMap or Secret that is missing. Both are ways to inject configuration data into a container when it starts up.
Commands to Run to Identify the Issue
We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
configmap-pod 0/1 CreateContainerConfigError 0 58s
As the output of kubectl get pods shows, the pod is in a CreateContainerConfigError state. Our next command is kubectl describe, which gets detailed information about the pod.
$ kubectl describe pod configmap-pod
Warning Failed 44s (x8 over 2m10s) kubelet Error: configmap "my-config" not found
Normal Pulled 44s kubelet Successfully pulled image "gcr.io/google_containers/busybox" in 1.002510873s
Normal Pulling 31s (x9 over 2m14s) kubelet Pulling image "gcr.io/google_containers/busybox"
To retrieve information about the ConfigMap, use this command:
$ kubectl get configmap
As the output of the above command shows no ConfigMaps, the next step is to verify the pod definition file and create the missing ConfigMap.
apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
  - name: mytest-container
    image: busybox
    command: [ "/bin/sh", "-c", "env" ]
    env:
    - name: MY_KEY
      valueFrom:
        configMapKeyRef:
          name: my-config
          key: name
To fix the error, create a ConfigMap file, whose content will look like this:
$ cat my-config.yaml
apiVersion: v1
data:
  name: test
  value: user
kind: ConfigMap
metadata:
  name: my-config
$ kubectl apply -f my-config.yaml
configmap/my-config created
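Alternatively, you can create the same ConfigMap imperatively and let kubectl build the object for you; the literals here mirror the manifest above:

$ kubectl create configmap my-config --from-literal=name=test --from-literal=value=user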
Now run kubectl get configmap again, and this time you’ll see the newly created ConfigMap.
$ kubectl get configmap
NAME DATA AGE
my-config 2 11s
Verify the status of the pod, and you will see that it is now in a Running state.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
configmap-pod 1/1 Running 4 7m29s
ContainerCreating
A pod stuck in the ContainerCreating state usually means the container references a secret that is not present. Secrets in Kubernetes let you store sensitive information such as tokens, passwords, and SSH keys.
Commands to Run to Identify the Issue
We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
secret-pod 0/1 ContainerCreating 0 7s
As the output of kubectl get pods shows, the pod is in a ContainerCreating state. Our next command is kubectl describe, which gets detailed information about the pod.
$ kubectl describe pod secret-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25s default-scheduler Successfully assigned default/secret-pod to 17e9458eef1c.mylabserver.com
Normal Pulled 18s kubelet Successfully pulled image "nginx" in 666.826881ms
Normal Pulled 17s kubelet Successfully pulled image "nginx" in 472.277634ms
Normal Pulling 5s (x3 over 18s) kubelet Pulling image "nginx"
Warning Failed 4s (x3 over 17s) kubelet Error: secret "myothersecret" not found
Normal Pulled 4s kubelet Successfully pulled image "nginx" in 476.69613ms
To retrieve information about the secret, use this command:
$ kubectl get secret
As the above command returns no secrets, the next step is to verify the pod definition file and create the missing secret.
$ cat secret.yml
apiVersion: v1
kind: Pod
metadata:
  name: secret-pod
spec:
  containers:
  - name: test-container
    image: nginx
    envFrom:
    - secretRef:
        name: myothersecret
To fix the error, create a secret file whose content looks like this:
$ cat secret-data.yml
apiVersion: v1
kind: Secret
metadata:
  name: myothersecret
data:
  USER_NAME: YWRtaW4=
  PASSWORD: MWYyZDFlMmU2N2Rm
Secrets are similar to ConfigMaps, except that their values are stored base64-encoded. Keep in mind that base64 is an encoding, not encryption or hashing.
For example, you can use the base64 command to encode any text data.
$ echo -n "username" | base64
dXNlcm5hbWU=
Use the following command to decode the text and print the original value:
$ echo -n "dXNlcm5hbWU=" | base64 --decode
username
$ kubectl create -f secret-data.yml
secret/myothersecret created
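Alternatively, kubectl create secret generic does the base64 encoding for you. The following is equivalent to the manifest above (the decoded values are admin and 1f2d1e2e67df):

$ kubectl create secret generic myothersecret --from-literal=USER_NAME=admin --from-literal=PASSWORD=1f2d1e2e67df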
Now run kubectl get secret again, and this time you will see the newly created secret.
$ kubectl get secret
NAME TYPE DATA AGE
myothersecret Opaque 2 20m
Verify the status of the pod, and you will see that it is now in a Running state.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
secret-pod 1/1 Running 0 2m36s
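To confirm that the secret was actually injected, you can list the container’s environment variables; with the secret above, you should see the decoded values:

$ kubectl exec secret-pod -- env | grep -E 'USER_NAME|PASSWORD'
USER_NAME=admin
PASSWORD=1f2d1e2e67df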
Debugging Worker Nodes
A worker node in the Kubernetes cluster is responsible for running your containerized applications. To debug worker node failure, we need to follow the same systematic approach and tools we used while debugging pod failures. To reinforce the concept, we will look at three different scenarios in which your worker node is in a NotReady state.
NotReady State
There may be multiple reasons why your worker node goes into a NotReady state. Here are some of them:
- The virtual machine on which the worker node runs was shut down.
- There is a network issue between the worker and master nodes.
- There is a crash within the Kubernetes software itself.
As you know, the kubelet is the binary on the worker node responsible for running containers there, so we’ll start our debugging with the kubelet.
Scenario 1: Worker Node is in NotReady State (kubelet is in inactive [dead] state)
Commands to Run to Identify the Issue
Before we start digging into the kubelet binary itself, let’s first check the status of the worker nodes using kubectl get nodes.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready master 21m v1.19.0
node01 NotReady <none> 20m v1.19.0
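Note: It’s also worth describing the node at this point; the Conditions section usually hints at the cause, and with a dead kubelet the Ready condition typically reports that the kubelet has stopped posting node status:

$ kubectl describe node node01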
Now that we have confirmed our worker node (node01) is in a NotReady state, the next command we will run is ps, to check whether the kubelet process is running.
$ ps -ef | grep -i kubelet
root 21200 16471 0 04:46 pts/1 00:00:00 grep -i kubelet
As we see, the kubelet process is not running. We can run the systemctl command to verify it further.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: inactive (dead) since Tue 2020-10-20 04:34:33 UTC; 1min 53s ago
Docs: https://kubernetes.io/docs/home/
Process: 1455 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
Main PID: 1455 (code=exited, status=0/SUCCESS)
Oct 20 04:34:33 node01 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Oct 20 04:34:33 node01 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Using the systemctl command, we confirmed that the kubelet is not running (Active: inactive (dead)). Before we debug it further, we can try to start the kubelet service and see if that helps. To start the kubelet, run the systemctl start kubelet command.
# systemctl start kubelet
Verify the status of kubelet again.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2020-10-20 04:37:14 UTC; 1s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 13240 (kubelet)
Tasks: 8 (limit: 4678)
CGroup: /system.slice/kubelet.service
└─13240 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kube
Based on the systemctl status kubelet output, we can say that starting the kubelet helped; it’s now in a running state. Let’s verify this from the Kubernetes master node.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready master 23m v1.19.0
node01 Ready <none> 23m v1.19.0
As you can see, the worker node is now in Ready state.
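Note: If the kubelet was dead because the node was rebooted, it’s also worth making sure the service is enabled so it starts automatically at boot:

# systemctl enable kubelet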
Scenario 2: Worker Node is in NotReady State (kubelet is in activating [auto-restart] state)
In scenario 2, your worker node is again in a NotReady state, and you’ve followed all the steps to start the kubelet (systemctl start kubelet), but that doesn’t help. This time, the kubelet service status is activating (auto-restart), which means the service keeps trying to start and failing.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Tue 2020-10-20 04:41:58 UTC; 1s ago
Here we’ll start our debugging with a command called journalctl, which lets you view the logs collected by systemd. In this case, we pass the -u option to see the logs of a particular unit, the kubelet.
# journalctl -u kubelet
You will see the following error message:
Oct 20 04:54:47 node01 kubelet[28692]: F1020 04:54:47.021580 28692 server.go:253] unable to load client CA file /etc/kubernetes/pki/my-test-file.crt: open /etc/kubernetes/pki/my-test-file.crt
Oct 20 04:54:47 node01 kubelet[28692]: goroutine 1 [running]:
Note: To jump to the end of the journalctl output, press SHIFT+G.
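You can also follow the kubelet log live, or limit it to recent entries, which keeps the output manageable on a busy node:

# journalctl -u kubelet -f
# journalctl -u kubelet --since "10 minutes ago"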
Our next step is to identify the kubelet’s configuration. Since this cluster was set up with kubeadm, you will see a custom drop-in configuration file:
# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
--kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
This drop-in also references the actual configuration file used by the kubelet (/var/lib/kubelet/config.yaml). Open that file and check the line starting with clientCAFile:
# cat /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/my-test-file.crt
You will see an incorrect client CA path (/etc/kubernetes/pki/my-test-file.crt). Let’s check the valid path under /etc/kubernetes/pki:
# ls -l /etc/kubernetes/pki/
total 4
-rw-r--r-- 1 root root 1066 Oct 20 04:14 ca.crt
Replace it with the correct file path and restart the daemon.
clientCAFile: /etc/kubernetes/pki/my-test-file.crt
TO
clientCAFile: /etc/kubernetes/pki/ca.crt
# systemctl restart kubelet
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2020-10-20 05:05:38 UTC; 7s ago
Again, verify it from the Kubernetes master node.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controlplane Ready master 23m v1.19.0
node01 Ready <none> 23m v1.19.0
As you can see, the worker node is now in Ready state.
Scenario 3: Worker Node is in NotReady State (kubelet is in active [running] state)
In scenario 3, your worker node is again in a NotReady state, and although you’ve followed all the steps for starting the kubelet (systemctl start kubelet), that doesn’t help. This time, the kubelet service is active (running), yet the node still shows as NotReady, so the problem lies elsewhere. We will start our debugging with the standard process by running the systemctl status command.
# systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2020-10-20 05:06:17 UTC; 6min ago
Docs: https://kubernetes.io/docs/home/
Main PID: 6904 (kubelet)
Tasks: 13 (limit: 4678)
CGroup: /system.slice/kubelet.service
└─6904 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-pl
Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.104242 6904 kubelet.go:2183] node "node01" not found
Oct 20 05:12:56 node01 kubelet[6904]: I1020 05:12:56.121262 6904 kubelet_node_status.go:70] Attempting to register node node01
Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.122256 6904 kubelet_node_status.go:92] Unable to register node "node01" with API server: Post "https://172.17.0.39:6553/api/v1/nodes"
If we look at the last lines of the systemctl status output, we’ll see that the kubelet cannot communicate with the API server. To get information about the API server, go to the Kubernetes master node and run kubectl cluster-info.
$ kubectl cluster-info
Kubernetes master is running at https://172.17.0.39:6443
KubeDNS is running at https://172.17.0.39:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
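Optionally, you can sanity-check from the worker node that the API server answers on port 6443; curl is just one way to do this, and -k skips certificate verification for a quick check:

$ curl -k https://172.17.0.39:6443/version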
As you can see, there is a mismatch between the port on which the API server is running (6443) and the one the kubelet is trying to connect to (6553). Here we’ll follow the same steps as in scenario 2, that is, identify the configuration file.
# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
This time, though, you should check /etc/kubernetes/kubelet.conf, where the details about the API server are stored. If you open this file, you’ll see that the port number defined for the API server is wrong; it should be 6443, as we saw in the output of kubectl cluster-info.
# cat /etc/kubernetes/kubelet.conf
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:
    server: https://172.17.0.39:6553
  name: default-cluster
Change port 6553 to 6443 and restart the kubelet daemon.
# systemctl daemon-reload
# systemctl restart kubelet
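After the restart, the kubelet should register the node with the API server again; verify from the master node that node01 returns to a Ready state:

$ kubectl get nodes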
Debugging Control Plane
The Kubernetes control plane is the brains behind Kubernetes, responsible for managing the whole cluster. To debug a control plane failure, we follow the same systematic approach and tools we used while debugging worker node failures. In this case, we’ll replicate a scenario in which we try to deploy an application and it fails.
Pod in a Pending State
What are some of the reasons that a pod goes into a Pending state?
- The cluster doesn’t have enough resources.
- The current namespace has a resource quota, and the pod would exceed it.
- The pod is bound to a persistent volume claim that isn’t available.
- A control plane component, such as the scheduler, isn’t working (the case we’ll reproduce below; see also the checks after this list).
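For the first two causes, the following checks are a good starting point (the namespace here is a placeholder):

$ kubectl describe nodes | grep -A 5 "Allocated resources"
$ kubectl describe quota -n <namespace>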
Commands to Run to Identify the Issue
We will start our debugging with the kubectl get all command. So far, we have used kubectl get pods and kubectl get nodes, but to list all the resources in the namespace at once, we pass all. The command looks like this:
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/app-586bddbc54-xd8hs 0/1 Pending 0 16s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 83s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/app 0/1 1 0 16s
NAME DESIRED CURRENT READY AGE
replicaset.apps/app-586bddbc54 1 1 0 16s
As you can see, the pod is in a Pending state. The scheduler is the component responsible for placing pods onto nodes. Let’s check the scheduler’s status in the kube-system namespace.
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-f9fd979d6-48wdf 1/1 Running 0 5m54s
coredns-f9fd979d6-rl55d 1/1 Running 0 5m54s
etcd-controlplane 1/1 Running 0 6m2s
kube-apiserver-controlplane 1/1 Running 0 6m2s
kube-controller-manager-controlplane 1/1 Running 0 6m2s
kube-flannel-ds-amd64-qbn7x 1/1 Running 0 5m44s
kube-flannel-ds-amd64-wzcmn 1/1 Running 0 5m53s
kube-proxy-b645c 1/1 Running 0 5m53s
kube-proxy-m4lnk 1/1 Running 0 5m44s
kube-scheduler-controlplane 0/1 CrashLoopBackOff 5 5m2s
As you can see in the output, the kube-scheduler-controlplane pod is in a CrashLoopBackOff state, and we already know that a pod in this state repeatedly tries to start and then crashes.
The next command we’ll run is kubectl describe, to get detailed information about the pod.
$ kubectl describe pod kube-scheduler-controlplane -n kube-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 5m16s (x5 over 6m44s) kubelet, controlplane Container image "k8s.gcr.io/kube-scheduler:v1.19.0" already present on machine
Normal Created 5m16s (x5 over 6m44s) kubelet, controlplane Created container kube-scheduler
Warning Failed 5m16s (x5 over 6m44s) kubelet, controlplane Error: failed to start container "kube-scheduler": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "kube-schedulerror": executable file not found in $PATH": unknown
Warning BackOff 103s (x27 over 6m42s) kubelet, controlplane Back-off restarting failed container
As the events show, the container command is wrong: the executable kube-schedulerror is not found in $PATH. To fix this, go to the Kubernetes manifests directory and open kube-scheduler.yaml.
# cd /etc/kubernetes/manifests/
$ cat kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-schedulerror
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0
Under the command section, the name of the scheduler binary is wrong. Change it from kube-schedulerror to kube-scheduler.
Because kube-scheduler runs as a static pod, the kubelet watches /etc/kubernetes/manifests and recreates the pod as soon as you save the file. Once the scheduler is healthy again, the pending application pod will be scheduled onto a worker node.
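You can watch the scheduler pod come back, and then the application pod get scheduled:

$ kubectl get pods -n kube-system -w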
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
app-586bddbc54-xd8hs 1/1 Running 0 13m
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/app-586bddbc54-xd8hs 1/1 Running 0 14m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 15m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/app 1/1 1 1 14m
NAME DESIRED CURRENT READY AGE
replicaset.apps/app-586bddbc54 1 1 1 14m
Note: Another useful command for debugging this issue is kubectl logs with the --tail flag:
$ kubectl logs kube-scheduler-controlplane -n kube-system --tail=10 # (to get the last 10 lines)
Summary
In this article, we have covered some of the ways to debug a Kubernetes cluster. Because Kubernetes is dynamic, it’s hard to cover every use case, but the techniques we’ve discussed will help you get started on your Kubernetes debugging journey. If you want to go beyond debugging Kubernetes and debug the application itself, Thundra provides a rich feature set for production debugging, with tracepoints and distributed tracing.