Debugging Kubernetes Deployments

Written by Serkan Özal, Founder and CTO of Thundra

Kubernetes is a useful tool for deploying and managing containerized applications, yet even seasoned Kubernetes enthusiasts agree that it’s hard to debug Kubernetes deployments and failing pods. This is because of the distributed nature of Kubernetes, which makes it hard to reproduce the exact issue and determine the root cause.

This article will cover some of the general guidelines you can use to debug your Kubernetes deployments and some of the common errors and issues you can expect to encounter.

Tools to Use to Debug the Kubernetes Cluster

It’s important that our process remains the same whether we’re debugging an application in Kubernetes or on a physical machine. The tools will be the same; with Kubernetes, we’re simply probing the state and outputs of the system. We can always start our debugging process with kubectl, or we can use some of the common Kubernetes debugging tools.

Kubectl is the command-line tool for controlling your Kubernetes cluster, and it comes with a wide range of commands, such as kubectl get nodes, kubectl get pods, and kubectl get svc.

One of the critical commands is get, which lists one or more resources. For example, kubectl get pods will help you check your pod status, and you can check the status of worker nodes using kubectl get nodes. You can also pass -o with an output format to get additional information. For example, when you pass -o wide with kubectl get pods, the node is included in the output (kubectl get pods -o wide).
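For instance, here is a sketch of what the wide output adds (the pod name, IP, and node shown are illustrative):

$ kubectl get pods -o wide
NAME      READY   STATUS    RESTARTS   AGE   IP           NODE     NOMINATED NODE   READINESS GATES
busybox   1/1     Running   0          2m    10.244.1.4   node01   <none>           <none>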

Note: One of the important flags that comes with kubectl is -w (watch), which lets you watch updates to a particular object as they happen (for example, kubectl get pods -w).

kubectl describe

The kubectl describe command displays detailed information about resources. For example, kubectl describe node <node name> displays detailed information about the node named <node name>. Similarly, to get detailed information about a pod, use kubectl describe pod <pod name>. Note that kubectl describe does not accept the -o flag; if you want the full object in YAML format, use kubectl get instead (kubectl get pod <pod name> -o yaml).

kubectl logs

Kubectl logs is another useful debugging command; it prints the logs of a container in a pod. For example, kubectl logs <pod name> prints the logs of the pod named <pod name>.

Note: To stream the logs, use the -f flag along with the kubectl logs command. The -f flag works similarly to the tail -f command in Linux. An example is kubectl logs -f <pod name>.
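Note: If the container has already crashed and restarted, the current logs may be empty. In that case, the -p (--previous) flag, which prints the logs of the last terminated instance, is usually more helpful:

$ kubectl logs --previous <pod name>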

kubectl -v (verbosity)

For kubectl, verbosity is controlled with the -v or --v flag followed by an integer representing the log level. The integer ranges from 0 to 9, where 0 is least verbose and 9 is most verbose. For example, kubectl -v=9 get nodes displays the HTTP request contents without truncation.

kubectl exec

Kubectl exec is a useful command for debugging a running container. You can run commands like kubectl exec <pod name> -- cat /var/log/messages to look at the logs of a given pod, or kubectl exec -it <pod name> -- sh to open a shell inside it.

kubectl get events

The kubectl get events command gives you high-level information about what is happening inside the cluster. To list all the events, run kubectl get events; if you are looking for a specific type of event, such as Warning, you can use kubectl get events --field-selector type=Warning.
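Note that events are not sorted chronologically by default. A common way to sort them by creation time is:

$ kubectl get events --sort-by=.metadata.creationTimestamp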

Debugging Pods

There are two common reasons why pods fail in Kubernetes:

  • Startup failure: A container inside the pod doesn’t start.
  • Runtime failure: The application code fails after container startup.

Debugging Common Pod Errors with Step-by-Step and Real-World Examples

CrashLoopBackOff

A CrashLoopBackOff error means that when your pod starts, it crashes; Kubernetes then restarts it, and it crashes again. Here are some of the common causes of CrashLoopBackOff:

  • An error in the application inside the container.
  • Misconfiguration inside the container (such as a missing ENTRYPOINT or CMD).
  • A liveness probe that failed too many times.
Commands to Run to Identify the Issue

The first step in debugging this issue is to check pod status by running the kubectl get pods command.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
busybox     0/1       CrashLoopBackOff      1           27s

Here we see our pod status under the STATUS column and verify that it’s in a CrashLoopBackOff state.

As a next troubleshooting step, we are going to run the kubectl logs command to print the logs for a container in a pod.

$ kubectl logs busybox
sleep: invalid number '-z'

Here we see that the pod is broken because an unknown option, -z, was passed to the sleep command. Verify your pod definition file:

$ cat podbroken.yml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  labels:
    env: Prod
spec:
  containers:
  - name: busybox
    image: busybox
    args:
      - "sleep"
      - "-z"

To fix this issue, you’ll need to specify a valid option for the sleep command in your pod definition file.
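For example, replacing the bogus -z with a plain duration (the value 1000 here is just an illustration) makes the args valid:

    args:
      - "sleep"
      - "1000"

Then delete the failing pod and recreate it from the corrected file: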

$ kubectl delete pod busybox
pod "busybox" deleted
$ kubectl create -f podbroken.yml
pod/busybox created

Verify the status of your pod.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
busybox     1/1       Running               1           63s

ErrImagePull/ImagePullBackOff

An ErrImagePull/ImagePullBackOff error occurs when Kubernetes is not able to pull the desired Docker image. These are some of the common causes of this error:

  • The image you have provided has an invalid name.
  • The image tag doesn’t exist.
  • The specified image is in a private registry (a fix for this case is sketched at the end of this section).
Commands to Run to Identify the Issue

As with the CrashLoopBackOff error, our first troubleshooting step starts with getting the pod status using the kubectl get pods command.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
redis       0/1       ErrImagePull            0         117s

Next we will run the kubectl describe command to get detailed information about the pod.

$ kubectl describe pod redis
Events:
  Type      Reason      Age                 From                Message
  ----      ------      ----                ----                -------
  Normal    Scheduled   46s                 default-scheduler   Successfully assigned default/redis to node01
  Normal    BackOff     16s (x2 over 43s)   kubelet, node01     Back-off pulling image "redis123"
  Warning   Failed      16s (x2 over 43s)   kubelet, node01     Error: ImagePullBackOff
  Normal    Pulling     2s (x3 over 45s)    kubelet, node01     Pulling image "redis123"
  Warning   Failed      1s (x3 over 44s)    kubelet, node01     Failed to pull image "redis123": rpc error: code = Unknown desc = Error response from daemon: pull access denied for redis123, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  Warning   Failed      1s (x3 over 44s)    kubelet, node01     Error: ErrImagePull

As you can see in the output of the kubectl describe command, the kubelet was unable to pull the image named redis123.

To find the correct image name, you can either go to Docker Hub or run the docker search command specifying the image name on the command line.

$ docker search redis
NAME                             DESCRIPTION                                     STARS     OFFICIAL   AUTOMATED
redis                            Redis is an open source key-value store that…   8985      [OK]

To fix this issue, we can follow either of the following approaches:

  1. Start with the kubectl edit command, which allows you to directly edit any API resource you have retrieved via the command-line tool. Go under the spec section and change the image name from redis123 to redis.
  2. If you already have the yaml file, you can edit it directly, and if you don’t, you can use the --dry-run option to generate it.
kubectl edit pod redis
...
spec:
  containers:
  - image: redis
...
$ kubectl run redis --image redis --dry-run=client -o yaml > redispod.yml

$ cat redispod.yml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

Similarly, make sure the image field under the containers spec says redis (correct it if it still reads redis123), then recreate the pod.

$ kubectl create -f redispod.yml
pod/redis created
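The third cause listed earlier, an image in a private registry, needs a different fix: the pod must be given pull credentials. A minimal sketch (the registry URL and credentials are placeholders) is to create a docker-registry secret and reference it from the pod spec:

$ kubectl create secret docker-registry regcred --docker-server=registry.example.com --docker-username=<user> --docker-password=<password>

spec:
  containers:
  - name: app
    image: registry.example.com/myteam/app:1.0
  imagePullSecrets:
  - name: regcred

Both the docker-registry secret type and the imagePullSecrets field are standard Kubernetes mechanisms for this case.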

CreateContainerConfigError

This error usually occurs when you’re missing a ConfigMap or Secret. Both are ways to inject data into the container when it starts up.

Commands to Run to Identify the Issue

We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.

$ kubectl get pods
NAME              READY     STATUS                        RESTARTS    AGE
configmap-pod     0/1       CreateContainerConfigError    0           58s

As the output of kubectl get pods shows, the pod is in a CreateContainerConfigError state. Our next command is kubectl describe, which gets detailed information about the pod.

$ kubectl describe pod configmap-pod
Warning  Failed       44s (x8 over 2m10s)     kubelet       Error: configmap "my-config" not found
Normal   Pulled       44s                     kubelet       Successfully pulled image "gcr.io/google_containers/busybox" in 1.002510873s
Normal   Pulling      31s (x9 over 2m14s)     kubelet       Pulling image "gcr.io/google_containers/busybox"

To retrieve information about the ConfigMap, use this command:

$ kubectl get configmap

As the above command returns nothing, the next step is to verify the pod definition file and create the missing ConfigMap.

apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
    - name: mytest-container
      image: busybox
      command: [ "/bin/sh", "-c", "env" ]
      env:
        - name: MY_KEY
          valueFrom:
            configMapKeyRef:
              name: my-config
              key: name

To fix the error, create a ConfigMap file whose content looks like this:

$ cat my-config.yaml
apiVersion: v1
data:
  name: test
  value: user
kind: ConfigMap
metadata:
  name: my-config

$ kubectl apply -f my-config.yaml
configmap/my-config created
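Equivalently, you could skip the file and create the same ConfigMap imperatively with the same two keys:

$ kubectl create configmap my-config --from-literal=name=test --from-literal=value=user
configmap/my-config created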

Now run kubectl get configmap to retrieve information about the ConfigMap, and this time, you’ll see the newly created ConfigMap.

$ kubectl get configmap
NAME              DATA    AGE
my-config         2       11s

Verify the status of the pod, and you will see that it is in the Running state now.

$ kubectl get pods
NAME              READY     STATUS        RESTARTS    AGE
configmap-pod     1/1       Running       4           7m29s

ContainerCreating Pod

A pod stuck in the ContainerCreating state is often waiting for a secret that is not present. Secrets in Kubernetes let you store sensitive information such as tokens, passwords, and SSH keys.

Commands to Run to Identify the Issue

We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.

$ kubectl get pods
NAME              READY     STATUS                RESTARTS    AGE
secret-pod        0/1       ContainerCreating     0           7s

As the output of kubectl get pods shows, the pod is in a ContainerCreating state. Our next command is the kubectl describe command, which gets the detailed information about the pod.

$ kubectl describe pod secret-pod
Events:
  Type      Reason      Age                 From                Message
  ----      ------      ----                ----                -------
  Normal    Scheduled   25s                 default-scheduler   Successfully assigned default/secret-pod to 17e9458eef1c.mylabserver.com
  Normal    Pulled      18s                 kubelet             Successfully pulled image "nginx" in 666.826881ms
  Normal    Pulled      17s                 kubelet             Successfully pulled image "nginx" in 472.277634ms
  Normal    Pulling     5s (x3 over 18s)    kubelet             Pulling image "nginx"
  Warning   Failed      4s (x3 over 17s)    kubelet             Error: secret "myothersecret" not found
  Normal    Pulled      4s                  kubelet             Successfully pulled image "nginx" in 476.69613ms

To retrieve information about the secret, use this command:

$ kubectl get secret

As the above command returns nothing, the next step is to verify the pod definition file and create the secret.

$ cat secret.yml
apiVersion: v1
kind: Pod
metadata:
  name: secret-pod
spec:
  containers:
    - name: test-container
      image: nginx
      envFrom:
      - secretRef:
          name: myothersecret

To fix the error, create a secret file whose content looks like this:

$ cat secret-data.yml
apiVersion: v1
kind: Secret
metadata:
  name: myothersecret
data:
  USER_NAME: YWRtaW4=
  PASSWORD: MWYyZDFlMmU2N2Rm

Secrets are similar to ConfigMaps, except their values are stored base64-encoded. Note that base64 is an encoding, not encryption.

For example, you can use the base64 command to encode any text data.

$ echo -n "username" | base64
dXNlcm5hbWU=

Use the following commands to decode the text and print the original text:

$ echo -n "dXNlcm5hbWU=" | base64 --decode
username
$ kubectl create -f secret-data.yml
secret/myothersecret created
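As with ConfigMaps, the secret could also have been created imperatively; kubectl handles the base64 encoding for you. (YWRtaW4= and MWYyZDFlMmU2N2Rm above decode to admin and 1f2d1e2e67df.)

$ kubectl create secret generic myothersecret --from-literal=USER_NAME=admin --from-literal=PASSWORD=1f2d1e2e67df
secret/myothersecret created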

Now run kubectl get secret to retrieve information about a secret, and this time you will see the newly created secret.

$ kubectl get secret
NAME              TYPE      DATA    AGE
myothersecret     Opaque    2       20m

Verify the status of the pod, and you will see that the pod is in running state now.

$ kubectl get pods
NAME          READY     STATUS      RESTARTS    AGE
secret-pod    1/1       Running     0           2m36s

Debugging Worker Nodes

A worker node in the Kubernetes cluster is responsible for running your containerized application. To debug worker node failure, we need to follow the same systematic approach and tools as we used while debugging pod failures. To reinforce the concept, we will look at three different scenarios in which your worker node is in the NotReady state.

NotReady State

There may be multiple reasons why your worker node goes into NotReady state. Here are some of them:

  • A virtual machine where the worker node is running shut down.
  • There is a network issue between the worker and master nodes.
  • There is a crash within the Kubernetes software.

The kubelet is the agent that runs on every worker node and is responsible for running containers on that node, so we’ll start our debugging with the kubelet.

Scenario 1: Worker Node is in NotReady State (kubelet is in inactive [dead] state)

Commands to Run to Identify the Issue

Before we start debugging at the kubelet binary, let’s first check the status of worker nodes using kubectl get nodes.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE     VERSION
controlplane    Ready       master    21m     v1.19.0
node01          NotReady    <none>    20m     v1.19.0

Now that we have confirmed our worker node (node01) is in the NotReady state, the next command we will run is ps to check whether the kubelet process is running.

$ ps -ef | grep -i kubelet
root    21200 16471  0 04:46 pts/1    00:00:00 grep -i kubelet

As we see, the kubelet process is not running. We can run the systemctl command to verify it further.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: inactive (dead) since Tue 2020-10-20 04:34:33 UTC; 1min 53s ago
      Docs: https://kubernetes.io/docs/home/
   Process: 1455 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
  Main PID: 1455 (code=exited, status=0/SUCCESS)

Oct 20 04:34:33 node01 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Oct 20 04:34:33 node01 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Using the systemctl command, we confirmed that the kubelet is not running (Active: inactive (dead)). Before we debug it further, we can try to start the kubelet service and see if that helps. To start the kubelet, run the systemctl start kubelet command.

# systemctl start kubelet
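Note: If the service is also disabled at boot, a node reboot would reintroduce the problem, so it may be worth enabling it as well:

# systemctl enable kubelet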

Verify the status of kubelet again.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 04:37:14 UTC; 1s ago
      Docs: https://kubernetes.io/docs/home/
  Main PID: 13240 (kubelet)
     Tasks: 8 (limit: 4678)
    CGroup: /system.slice/kubelet.service
            └─13240 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kube

Based on the systemctl status kubelet output, we can say that starting kubelet helped, and now it’s in running state. Let’s verify it from the Kubernetes master node.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE   VERSION
controlplane    Ready       master    23m   v1.19.0
node01          Ready       <none>    23m   v1.19.0

As you can see, the worker node is now in Ready state.

Scenario 2: Worker Node is in NotReady state (kubelet is in activating [auto-restart] state)

In scenario 2, your worker node is again in the NotReady state, and you’ve followed all the steps to start the kubelet (systemctl start kubelet), but that doesn’t help. This time, the kubelet service status is activating (auto-restart), which means the kubelet keeps exiting and systemd keeps trying to restart it.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since Tue 2020-10-20 04:41:58 UTC; 1s ago

Here we’ll start our debugging with a command called journalctl, which lets you view the logs collected by systemd. We’ll pass the -u option to see the logs of one particular unit, kubelet.

# journalctl -u kubelet

You will see the following error message:

Oct 20 04:54:47 node01 kubelet[28692]: F1020 04:54:47.021580   28692 server.go:253] unable to load client CA file /etc/kubernetes/pki/my-test-file.crt: open /etc/kubernetes/pki/my-test-file.crt
Oct 20 04:54:47 node01 kubelet[28692]: goroutine 1 [running]:

Note: journalctl opens the log in a pager; press SHIFT+G to jump to the end of the output.
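You can also narrow the output instead of paging through it; both of the following are standard journalctl options (-f follows the log live, like tail -f):

# journalctl -u kubelet --since "10 min ago"
# journalctl -u kubelet -f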

Our next step is to identify the configuration file. Since this cluster was set up with kubeadm, the kubelet runs with a custom drop-in configuration:

# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
--kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"

This file references the actual configuration file used by the kubelet (--config=/var/lib/kubelet/config.yaml). Open that file and check the line starting with clientCAFile:

# cat /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/my-test-file.crt

You will see the incorrect client CA path (/etc/kubernetes/pki/my-test-file.crt). Let’s check what actually exists under /etc/kubernetes/pki.

# ls -l /etc/kubernetes/pki/
total 4
-rw-r--r-- 1 root root 1066 Oct 20 04:14 ca.crt

Replace it with the correct file path and restart the daemon.

clientCAFile: /etc/kubernetes/pki/my-test-file.crt

to

clientCAFile: /etc/kubernetes/pki/ca.crt

# systemctl restart kubelet
# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 05:05:38 UTC; 7s ago

Again, verify it from the Kubernetes master node.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE     VERSION
controlplane    Ready       master    23m     v1.19.0
node01          Ready       <none>    23m     v1.19.0

As you can see, the worker node is now in Ready state.

Scenario 3 Worker Node is in NotReady state (kubelet is in active [running] state)

In scenario 3, your worker node is again in the NotReady state, and although you’ve followed all the steps for starting the kubelet (systemctl start kubelet), that doesn’t help. This time, the kubelet service is active (running), so the service itself looks healthy, yet the node still reports NotReady. We will start our debugging with the standard process by running the systemctl status command.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 05:06:17 UTC; 6min ago
      Docs: https://kubernetes.io/docs/home/
  Main PID: 6904 (kubelet)
     Tasks: 13 (limit: 4678)
    CGroup: /system.slice/kubelet.service
            └─6904 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-pl

Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.104242	6904 kubelet.go:2183] node "node01" not found
Oct 20 05:12:56 node01 kubelet[6904]: I1020 05:12:56.121262	6904 kubelet_node_status.go:70] Attempting to register node node01
Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.122256	6904 kubelet_node_status.go:92] Unable to register node "node01" with API server: Post "https://172.17.0.39:6553/api/v1/nodes"

If we look at the last lines of the systemctl status output, we’ll see that the kubelet cannot communicate with the API server: it is failing to register the node via https://172.17.0.39:6553.

To get information about the API server, go to the Kubernetes master node and run kubectl cluster-info.

$ kubectl cluster-info
Kubernetes master is running at https://172.17.0.39:6443
KubeDNS is running at https://172.17.0.39:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

As you can see, there is a mismatch between the port on which the API server is listening (6443) and the one the kubelet is trying to connect to (6553). Here we’ll follow the same step as in scenario 2, that is, identify the configuration file.
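Before editing anything, you can confirm the diagnosis from the worker node itself. On typical kubeadm clusters, the API server’s /healthz endpoint responds even to unauthenticated requests (-k skips TLS verification), while a request against the wrong port (6553) would be refused or time out:

$ curl -k https://172.17.0.39:6443/healthz
ok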

# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf

# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"

This time, though, you should check /etc/kubernetes/kubelet.conf, where the API server endpoint is stored.

If you open this file, you’ll see that the port defined for the API server is wrong; it should be 6443, as we saw in the output of kubectl cluster-info.

# cat /etc/kubernetes/kubelet.conf
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:
    server: https://172.17.0.39:6553
  name: default-cluster

Change port 6553 to 6443 and restart the kubelet daemon.

# systemctl daemon-reload
# systemctl restart kubelet
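As in the earlier scenarios, confirm from the Kubernetes master node that the worker has come back (the ages shown are illustrative):

$ kubectl get nodes
NAME            STATUS      ROLES     AGE     VERSION
controlplane    Ready       master    30m     v1.19.0
node01          Ready       <none>    30m     v1.19.0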

Debugging Control Plane

The Kubernetes control plane is the brains behind Kubernetes and is responsible for managing the Kubernetes cluster. To debug control plane failure, we need to follow the same systematic approach and tools we used while debugging worker node failures. In this case, we will try to replicate the scenario where we are trying to deploy an application and it is failing.

Pod in a Pending State

What are some of the reasons that a pod goes into a Pending state?

  • The cluster doesn’t have enough free CPU or memory (see the quick checks below).
  • The current namespace has a resource quota that the pod would exceed.
  • The pod references a PersistentVolumeClaim that can’t be bound.
  • The scheduler that should place the pod is itself failing, which is the scenario we replicate below.
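For the first two causes, a couple of standard commands make the problem visible (output omitted here, as it varies by cluster):

$ kubectl describe nodes | grep -A 5 "Allocated resources"
$ kubectl get resourcequota

kubectl describe pod <pod name> will also show a FailedScheduling event explaining why no node fit. In this walkthrough, though, the culprit is the scheduler itself.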
Commands to Run to Identify the Issue

We will start our debugging with the kubectl get all command. So far, we have used kubectl get pods and kubectl get nodes, but to list all the common resources at once, we pass all as the resource type. The command looks like this:

$ kubectl get all
NAME                        READY     STATUS      RESTARTS    AGE
pod/app-586bddbc54-xd8hs    0/1       Pending     0           16s

NAME                  TYPE          CLUSTER-IP    EXTERNAL-IP     PORT(S)     AGE
service/kubernetes    ClusterIP     10.96.0.1     <none>          443/TCP     83s

NAME                    READY     UP-TO-DATE    AVAILABLE     AGE
deployment.apps/app     0/1       1             0             16s

NAME                              DESIRED     CURRENT     READY     AGE
replicaset.apps/app-586bddbc54    1           1           0         16s

As you can see, the pod is in a Pending state. The scheduler is the component responsible for placing pods onto nodes. Let’s check the scheduler’s status in the kube-system namespace.

$ kubectl get pods -n kube-system
NAME                                    READY     STATUS              RESTARTS    AGE
coredns-f9fd979d6-48wdf                 1/1       Running             0           5m54s
coredns-f9fd979d6-rl55d                 1/1       Running             0           5m54s
etcd-controlplane                       1/1       Running             0           6m2s
kube-apiserver-controlplane             1/1       Running             0           6m2s
kube-controller-manager-controlplane    1/1       Running             0           6m2s
kube-flannel-ds-amd64-qbn7x             1/1       Running             0           5m44s
kube-flannel-ds-amd64-wzcmn             1/1       Running             0           5m53s
kube-proxy-b645c                        1/1       Running             0           5m53s
kube-proxy-m4lnk                        1/1       Running             0           5m44s
kube-scheduler-controlplane             0/1       CrashLoopBackOff    5           5m2s

As you can see in the output, the kube-scheduler-controlplane pod is in CrashLoopBackOff state, and we already know that when the pod is in this state, it will try to start but will crash.

The next command we’ll run is the kubectl describe command to get detailed information about the pod.

$ kubectl describe pod kube-scheduler-controlplane -n kube-system
Events:
  Type      Reason      Age                     From                    Message
  ----      ------      ----                    ----                    -------
  Normal    Pulled      5m16s (x5 over 6m44s)   kubelet, controlplane   Container image "k8s.gcr.io/kube-scheduler:v1.19.0" already present on machine
  Normal    Created     5m16s (x5 over 6m44s)   kubelet, controlplane   Created container kube-scheduler
  Warning   Failed      5m16s (x5 over 6m44s)   kubelet, controlplane   Error: failed to start container "kube-scheduler": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "kube-schedulerror": executable file not found in $PATH": unknown
  Warning   BackOff     103s (x27 over 6m42s)   kubelet, controlplane   Back-off restarting failed container

As you can see, the scheduler binary name in the manifest is incorrect (kube-schedulerror). To fix this, go to the Kubernetes manifests directory and open kube-scheduler.yaml.

# cd /etc/kubernetes/manifests/
$ cat kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-schedulerror
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0

Under the command section, the name of the scheduler binary is wrong. Change it from kube-schedulerror to kube-scheduler.

Since kube-scheduler is a static pod, the kubelet watches the manifests directory and recreates the pod as soon as you save the file. Once the scheduler is healthy again, the application pod gets scheduled onto a worker node.

$ kubectl get pods
NAME                    READY   STATUS                RESTARTS    AGE
app-586bddbc54-xd8hs    1/1     Running               0           13m

$ kubectl get all
NAME                        READY     STATUS      RESTARTS    AGE
pod/app-586bddbc54-xd8hs    1/1       Running     0           14m

NAME                  TYPE          CLUSTER-IP      EXTERNAL-IP     PORT(S)     AGE
service/kubernetes    ClusterIP     10.96.0.1       <none>          443/TCP     15m

NAME                    READY     UP-TO-DATE    AVAILABLE     AGE
deployment.apps/app     1/1       1             1             14m

NAME                              DESIRED     CURRENT     READY     AGE
replicaset.apps/app-586bddbc54    1           1           1         14m

Note: Another useful command for debugging this issue is kubectl logs with the --tail flag:

$ kubectl logs kube-scheduler-controlplane -n kube-system --tail=10 # (to get the last 10 lines)

Summary

In this article, we have covered some of the ways to debug a Kubernetes cluster. Because Kubernetes is dynamic, it’s hard to cover every use case, but the techniques we’ve discussed will help you get started on your Kubernetes debugging journey. If you want to go beyond debugging Kubernetes and debug the application itself, Thundra provides a rich feature set for production debugging, with tracepoints and distributed tracing.


×

SUBSCRIBE TO OUR BLOG

Get our new blogs delivered straight to your inbox.

 

THANKS FOR SIGNING UP!

We’ll make sure to share the best materials crafted for you!

15 minutes read


POSTED Jan, 2021

dot

IN
Debugging

Debugging Kubernetes Deployments

Serkan Özal

Written by Serkan Özal

Founder and CTO of Thundra


 X

Kubernetes is a useful tool for deploying and managing containerized applications, yet even seasoned Kubernetes enthusiasts agree that it’s hard to debug Kubernetes deployments and failing pods. This is because of the distributed nature of Kubernetes, which makes it hard to reproduce the exact issue and determine the root cause.

This article will cover some of the general guidelines you can use to debug your Kubernetes deployments and some of the common errors and issues you can expect to encounter.

Tools to Use to Debug the Kubernetes Cluster

It’s important to ensure that our process remains the same whether we’re debugging our application in Kubernetes or a physical machine. The tools we use will be the same, but with Kubernetes we’re going to probe the state and outputs of the system. We can always start our debugging process using kubectl, or we can use some of the common Kubernetes debugging tools.

Kubectl is a command-line tool that lets you control your Kubernetes cluster and comes with a wide range of commands, such as kubectl get [-o wide] nodes, pods, and svc.

One of the critical commands is get, which lists one or more resources. For example, kubectl get pods will help you check your pod status, and you can check the status of worker nodes using kubectl get nodes. You can also pass -o with an output format to get additional information. For example, when you pass -o wide with kubectl get pods, the node is included in the output (kubectl get pods -o wide).

Note: One of the important flags that come with the kubectl command is -w(watch flag), which lets you start watching the updates to a particular object (for example, kubectl get pods -w):

kubectl describe (node, pod, svc) <name> -o yaml

The kubectl describe nodes display detailed information about resources. For example, the kubectl describe nodes <node name> displays detailed information about the node with the name <node name>. Similarly, to get detailed information about pod use, you should use kubectl describe pod <pod name>. If you pass the flag -o yaml at the end of the kubectl describe command, it displays the output in yaml format (kubectl describe pod <pod name> -o yaml).

kubectl logs

Kubectl logs is another useful debugging command used to print the logs for a container in a pod. For example, kubectl logs <pod name> prints the pod’s logs with a name pod name.

Note: To stream the logs, use the -f flag along with the kubectl logs command. The -f flag works similarly to the tail -f command in Linux. An example is kubectl logs -f <pod name>.

kubectl -v(verbosity)

For kubectl, verbosity is controlled with -v or –v flags. After -v, we will pass an integer to represent the log level. The integer ranges from 0 to 9, where 0 is least verbose and 9 is most verbose. An example is kubectl -v=9 get nodes display HTTP request contents without truncation of contents.

kubectl exec

Kubectl exec is a useful command to debug the running container. You can run commands like kubectl exec <pod name> -- cat /var/log/messages to look at the logs from a given pod, or kubectl exec -it <pod name> --sh to log in to the given pod.

kubectl get events

Kubectl events give you high-level information about what is happening inside the cluster. To list all the events, you can use a command like kubectl get events, or if you are looking for a specific type of event such as Warning, you can use kubectl get events --field-selector type=Warning.

Debugging Pod

There are two common reasons why pods fail in Kubernetes:

  • Startup failure: A container inside the pod doesn’t start.
  • Runtime failure: The application code fails after container startup.

Debugging Common Pod Errors with Step-by-Step and Real-World Examples

CrashLoopBackoff

A CrashLoopBackoff error means that when your pod starts, it crashes; it then tries to start again, but crashes again. Here are some of the common causes of CrashLoopBackoff:

  • An error in the application inside the container.
  • Misconfiguration inside the container (such as a missing ENTRYPOINT or CMD).
  • A liveness probe that failed too many times.
Commands to Run to Identify the Issue

The first step in debugging this issue is to check pod status by running the kubectl get pods command.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
busybox     0/1       CrashLoopBackOff      1           27s

Here we see our pod status under the STATUS column and verify that it’s in a CrashLoopBackOff state.

As a next troubleshooting step, we are going to run the kubectl logs command to print the logs for a container in a pod.

$ kubectl logs busybox
sleep: invalid number '-z'

Here we see that the reason for this broken pod is specification of an unknown option -z in the sleep command. Verify your pod definition file.

$ cat podbroken.yml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  labels:
    env: Prod
spec:
  containers:
  - name: busybox
    image: busybox
    args:
      - "sleep"
      - "-z"

To fix this issue, you’ll need to specify a valid option under the sleep command in your pod definition file and then run the following command

$ kubectl create -f broken.yml
pod/busybox created

Verify the status of your pod.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
busybox     1/1       Running               1           63s

ErrImagePull/ImagePullBackOff

An ErrImagePull/ImagePullBackOff error occurs when you are not able to pull the desired docker image. These are some of the common causes of this error:

  • The image you have provided has an invalid name.
  • The image tag doesn’t exist.
  • The specified image is in the private registry.
Commands to Run to Identify the Issue

As with the CrashLoopBackoff error, our first troubleshooting step starts with getting the pod status using the kubectl get pods command.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
redis       0/1       ErrImagePull            0         117s

Next we will run the kubectl describe command to get detailed information about the pod.

$ kubectl describe pod redis
Events:
  Type      Reason      Age                 From                Message
  ----      ------      ----                ----                -------
  Normal    Scheduled   46s                 default-scheduler   Successfully assigned default/redis to node01
  Normal    BackOff     16s (x2 over 43s)   kubelet, node01     Back-off pulling image "redis123"
  Warning   Failed      16s (x2 over 43s)   kubelet, node01     Error: ImagePullBackOff
  Normal    Pulling     2s (x3 over 45s)    kubelet, node01     Pulling image "redis123"
  Warning   Failed      1s (x3 over 44s)    kubelet, node01     Failed to pull image "redis123": rpc error: code = Unknown desc = Error response from daemon: pull access denied for redis123, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  Warning   Failed      1s (x3 over 44s)    kubelet, node01     Error: ErrImagePull

As you can see in the output of the kubectl describe command, it was unable to pull the image name redis123.

To find the correct image name, you can either go to Docker Hub or run the docker search command specifying the image name on the command line.

$ docker search redis
NAME                             DESCRIPTION                                     STARS     OFFICIAL   AUTOMATED
redis                            Redis is an open source key-value store that…   8985      [OK]

To fix this issue, we can follow either one of the following approaches:

  1. Start with the kubectl edit command, which allows you to directly edit any API resource you have retrieved via the command-line tool. Go under the spec section and change the image name from redis123 to redis.
  2. If you already have the yaml file, you can edit it directly, and if you don’t, you can use the --dry-run option to generate it.
kubectl edit pod redis
...
spec:
  containers:
  - image: redis
...
$ kubectl  run redis --image redis --dry-run=client -o yaml > redispod.yml

$ cat redispod.yml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: redis
    name: redis
spec:
  containers:
  - image: redis
    name: redis123
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

Similarly, under image spec, change the image name to redis and recreate the pod.

$ kubectl create -f redispod.yml
pod/redis created

CreateContainerConfigError

This error usually occurs when you’re missing a ConfigMap or Secret. Both are ways to inject data into the container when it starts up.

Commands to Run to Identify the Issue

We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.

$ kubectl get pods
NAME              READY     STATUS                        RESTARTS    AGE
configmap-pod     0/1       CreateContainerConfigError    0           58s

As the output of kubectl get pods shows, pod is in a CreateContainerConfigError state. Our next command is kubectl describe, which gets the detailed information about the pod.

$ kubectl describe pod configmap-pod
Warning  Failed       44s (x8 over 2m10s)     kubelet       Error: configmap "my-config" not found
Normal   Pulled       44s                     kubelet       Successfully pulled image "gcr.io/google_containers/busybox" in 1.002510873s
Normal   Pulling      31s (x9 over 2m14s)     kubelet       Pulling image "gcr.io/google_containers/busybox"

To retrieve information about the ConfigMap, use this command:

$ kubectl get configmap

 

As the output of the above command is null, the next step is to verify the pod definition file and create the ConfigMap.

apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
    - name: mytest-container
      image: busybox
      command: [ "/bin/sh", "-c", "env" ]
      env:
        - name: MY_KEY
          valueFrom:
            configMapKeyRef:
              name: my-config
              key: name

To fix the error, create a ConfigMap file, whose content will look like this:

$ cat my-config.yaml
apiVersion: v1
data:
  name: test
  value: user
kind: ConfigMap
metadata:
  name: my-config

$ kubectl apply -f configmap-special-config.yaml
configmap/special-config created

Now run kubectl get configmap to retrieve information about the ConfigMap, and this time, you’ll see the newly created ConfigMap.

$ kubectl get configmap
NAME              DATA    AGE
my-config         2       11s

Verify the status of the pod, and you will see the pod will be running state now.

$ kubectl get pods
NAME              READY     STATUS        RESTARTS    AGE
configmap-pod     1/1       Running       4           7m29s

ContainerCreating Pod

The main cause of the ContainerCreating error is that your container is looking for a secret that is not present. Secrets in Kubernetes let you store sensitive information such as tokens, passwords, and ssh keys.

Commands to Run to Identify the Issue

We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.

$ kubectl get pods
NAME              READY     STATUS                RESTARTS    AGE
secret-pod        0/1       ContainerCreating     0           7s

As the output of kubectl get pods shows, the pod is in a ContainerCreating state. Our next command is the kubectl describe command, which gets the detailed information about the pod.

$ kubectl describe pod secret-pod
Events:
  Type      Reason      Age                 From                Message
  ----      ------      ----                ----                -------
  Normal    Scheduled   25s                 default-scheduler   Successfully assigned default/secret-pod to 17e9458eef1c.mylabserver.com
  Normal    Pulled      18s                 kubelet             Successfully pulled image "nginx" in 666.826881ms
  Normal    Pulled      17s                 kubelet             Successfully pulled image "nginx" in 472.277634ms
  Normal    Pulling     5s (x3 over 18s)    kubelet             Pulling image "nginx"
  Warning   Failed      4s (x3 over 17s)    kubelet             Error: secret "myothersecret" not found
  Normal    Pulled      4s                  kubelet             Successfully pulled image "nginx" in 476.69613ms

To retrieve information about the secret, use this command:

$ kubectl get secret

As the above command’s output is null, the next step is to verify the pod definition file and create the secret.

$ cat secret.yml
apiVersion: v1
kind: Pod
metadata:
  name: secret-pod
spec:
  containers:
    - name: test-contaner
      image: nginx
      envFrom:
      - secretRef:
          name: myothersecret

To fix the error, create a secret file whose content looks like this.

$ cat secret-data.yml
apiVersion: v1
kind: Secret
metadata:
  name: myothersecret
data:
  USER_NAME: YWRtaW4=
  PASSWORD: MWYyZDFlMmU2N2Rm

Secrets are similar to ConfigMap, except they are stored in a base64 encoded or hashed format.

For example, you can use the base64 command to encode any text data.

$ echo -n "username" | base64
dXNlcm5hbWU=

Use the following commands to decode the text and print the original text:

$ echo -n "dXNlcm5hbWU=" | base64 --decode
username
$ kubectl create -f secret-data.yml
secret/myothersecret created

Now run kubectl get secret to retrieve information about a secret, and this time you will see the newly created secret.

$ kubectl get secret
NAME              TYPE      DATA    AGE
myothersecret     Opaque    2       20m

Verify the status of the pod, and you will see that the pod is in running state now.

$ kubectl get pods
NAME          READY     STATUS      RESTARTS    AGE
secret-pod    1/1       Running     0           2m36s

Debugging Worker Nodes

A worker node in the Kubernetes cluster is responsible for running your containerized application. To debug worker node failure, we need to follow the same systematic approach and tools as we used while debugging pod failures.To reinforce the concept, we will look at three different scenarios where you see your worker node is in NotReady state.

NotReady State

There may be multiple reasons why your worker node goes into NotReady state. Here are some of them:

  • A virtual machine where the worker node is running shut down.
  • There is a network issue between worker and master node.
  • There is a crash within the Kubernetes software.

But as you know, the kubelet is the binary present on the worker node, and it is responsible for running containers on the node. We’ll start our debugging with a kubelet.

Scenario 1: Worker Node is in NotReady State (kubelet is in inactive [dead] state)

Commands to Run to Identify the Issue

Before we start debugging at the kubelet binary, let’s first check the status of worker nodes using kubectl get nodes.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE     VERSION
controlplane    Ready       master    21m     v1.19.0
node01          NotReady    <none>    20m     v1.19.0

Now, as we have confirmed that our worker node (node01) is in NotReady state, the next command we will run is ps to check the status of the kubelet.

$ ps -ef | grep -i kubelet
root    21200 16471  0 04:46 pts/1    00:00:00 grep -i kubelet

As we see, the kubelet process is not running. We can run the systemctl command to verify it further.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: inactive (dead) since Tue 2020-10-20 04:34:33 UTC; 1min 53s ago
      Docs: https://kubernetes.io/docs/home/
   Process: 1455 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
  Main PID: 1455 (code=exited, status=0/SUCCESS)

Oct 20 04:34:33 node01 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Oct 20 04:34:33 node01 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Using the systemctl command, we confirmed that the kubelet is not running (Active: inactive (dead)). Before we debug it further, we can try to start the kubelet service and see if that helps. To start the kubelet, run the systemctl start kubelet command.

# systemctl start kubelet

Verify the status of kubelet again.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 04:37:14 UTC; 1s ago
      Docs: https://kubernetes.io/docs/home/
  Main PID: 13240 (kubelet)
     Tasks: 8 (limit: 4678)
    CGroup: /system.slice/kubelet.service
            └─13240 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kube

Based on the systemctl status kubelet output, we can say that starting kubelet helped, and now it’s in running state. Let’s verify it from the Kubernetes master node.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE   VERSION
controlplane    Ready       master    23m   v1.19.0
node01          Ready       <none>    23m   v1.19.0

As you can see, the worker node is now in Ready state.

Scenario 2: Worker Node is in NotReady state (kubelet is in activating [auto-restart] state)

In scenario 2, your worker node is again in NotReady state, and you’ve followed all the steps to start the kubelet (systemctl start kubelet), but that doesn’t help. This time, the kubelet service status is activating (auto-restart), which means it’s not able to start.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since Tue 2020-10-20 04:41:58 UTC; 1s ago

Here we’ll start our debugging with a command called journalctl. Using journalctl, you can view the logs collected by systemd. In this case, we will pass the -u option to see the logs of a particular service, kubelet.

# journalctl -u kubelet

You will see the following error message:

Oct 20 04:54:47 node01 kubelet[28692]: F1020 04:54:47.021580   28692 server.go:253] unable to load client CA file /etc/kubernetes/pki/my-test-file.crt: open /etc/kubernetes/pki/my-test-file.crt
Oct 20 04:54:47 node01 kubelet[28692]: goroutine 1 [running]:

Note: To reach the bottom of the command, press SHIFT+G.

Our next step is to identify the configuration file. Since this scenario is created by kubeadm, you will see the custom configuration file.

# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
--kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"

This file also has the reference to the actual configuration used by kubelet. Open this file and check the line starting with clientCAFile:

# cat /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/my-test-file.crt

You will see the incorrect CA path for clients (/etc/kubernetes/pki/my-test-file.crt. Let’s check the valid path under /etc/kubernetes/pki.

# ls -l /etc/kubernetes/pki/
total 4
-rw-r--r-- 1 root root 1066 Oct 20 04:14 ca.crt

Replace it with the correct file path and restart the daemon.

clientCAFile: /etc/kubernetes/pki/my-test-file.crt

TO

clientCAFile: /etc/kubernetes/pki/ca.crt

# systemctl restart kubelet
# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 05:05:38 UTC; 7s ago

Again, verify it from the Kubernetes master node.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE     VERSION
controlplane    Ready       master    23m     v1.19.0
node01          Ready       <none>    23m     v1.19.0

As you can see, the worker node is now in Ready state.

Scenario 3 Worker Node is in NotReady state (kubelet is in active [running] state)

In scenario 3, your worker node is again in NotReady state, and although you’ve followed all the steps for starting the kubelet (systemctl start kubelet), that doesn’t help. This time, the kubelet service status is loaded, which means that systemctl reads the unit from disk into memory, but it’s not able to start. We will start our debugging with the standard process by running the systemctl status command.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 05:06:17 UTC; 6min ago
      Docs: https://kubernetes.io/docs/home/
  Main PID: 6904 (kubelet)
     Tasks: 13 (limit: 4678)
    CGroup: /system.slice/kubelet.service
            └─6904 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-pl

Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.104242	6904 kubelet.go:2183] node "node01" not found
Oct 20 05:12:56 node01 kubelet[6904]: I1020 05:12:56.121262	6904 kubelet_node_status.go:70] Attempting to register node node01
Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.122256	6904 kubelet_node_status.go:92] Unable to register node "node01" with API server: Post "https://172.17.0.39:6553/api/v1/nodes"

If we look at the last line of the systemctl status command, we’ll see that the kubelet cannot communicate with the API server.

To get information about the API server, go to the Kubernetes master node and run kubectl cluster-info.

$ kubectl cluster-info
Kubernetes master is running at https://172.17.0.39:6443
KubeDNS is running at https://172.17.0.39:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

As you can see, there is a mismatch between the port in which the API server is running (6443) and the one to which kubectl is trying to connect (6553). Here we’ll follow the same step as in scenario 2, that is, identify the configuration file.

# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf

# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"

This time, though, you should check /etc/kubernetes/kubelet.conf, where the details about ApiServer are stored.

If you open this file, you’ll see that the port number defined for ApiServer is wrong; it should be 6443, as we saw in the output of kubectl cluster-info.

# cat /etc/kubernetes/kubelet.conf
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:
    server: https://172.17.0.39:6553
  name: default-cluster

Change port 6553 to 6443 and restart the kubelet daemon.

# systemctl daemon-reload
# systemctl restart kubelet

Debugging Control Plane

The Kubernetes control plane is the brains behind Kubernetes and is responsible for managing the Kubernetes cluster. To debug control plane failure, we need to follow the same systematic approach and tools we used while debugging worker node failures. In this case, we will try to replicate the scenario where we are trying to deploy an application and it is failing.

Pod in a Pending State

What are some of the reasons that a pod goes into a Pending state?

  • The cluster doesn’t have enough resources.
  • The current namespace has a resource quota.
  • The pod is bound to a persistent volume claim.
  • You’re trying to deploy an application, and it’s failing.
Commands to Run to Identify the Issue

We will start our debugging with kubectl get all commands. So far, we have used kubectl get pods and nodes, but to list all the resources, we need to pass all command line parameters. The command looks like this:

$ kubectl get all
NAME                        READY     STATUS      RESTARTS    AGE
pod/app-586bddbc54-xd8hs    0/1       Pending     0           16s

NAME                  TYPE          CLUSTER-IP    EXTERNAL-IP     PORT(S)     AGE
service/kubernetes    ClusterIP     10.96.0.1     <none>          443/TCP     83s

NAME                    READY     UP-TO-DATE    AVAILABLE     AGE
deployment.apps/app     0/1       1             0             16s

NAME                              DESIRED     CURRENT     READY     AGE
replicaset.apps/app-586bddbc54    1           1           0         16s

As you can see, the pod is in a Pending state. The scheduler is the component responsible for scheduling the pod on the container. Let’s check the scheduler status in the kube-system namespace.

$ kubectl get pods -n kube-system
NAME                                    READY     STATUS              RESTARTS    AGE
coredns-f9fd979d6-48wdf                 1/1       Running             0           5m54s
coredns-f9fd979d6-rl55d                 1/1       Running             0           5m54s
etcd-controlplane                       1/1       Running             0           6m2s
kube-apiserver-controlplane             1/1       Running             0           6m2s
kube-controller-manager-controlplane    1/1       Running             0           6m2s
kube-flannel-ds-amd64-qbn7x             1/1       Running             0           5m44s
kube-flannel-ds-amd64-wzcmn             1/1       Running             0           5m53s
kube-proxy-b645c                        1/1       Running             0           5m53s
kube-proxy-m4lnk                        1/1       Running             0           5m44s
kube-scheduler-controlplane             0/1       CrashLoopBackOff    5           5m2s

As you can see in the output, the kube-scheduler-controlplane pod is in a CrashLoopBackOff state, and we already know that a pod in this state repeatedly starts, crashes, and is restarted with an increasing back-off delay.

The next command we’ll run is the kubectl describe command to get detailed information about the pod.

$ kubectl describe pod kube-scheduler-controlplane -n kube-system
Events:
  Type      Reason      Age                     From                    Message
  ----      ------      ----                    ----                    -------
  Normal    Pulled      5m16s (x5 over 6m44s)   kubelet, controlplane   Container image "k8s.gcr.io/kube-scheduler:v1.19.0" already present on machine
  Normal    Created     5m16s (x5 over 6m44s)   kubelet, controlplane   Created container kube-scheduler
  Warning   Failed      5m16s (x5 over 6m44s)   kubelet, controlplane   Error: failed to start container "kube-scheduler": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "kube-schedulerror": executable file not found in $PATH": unknown
  Warning   BackOff     103s (x27 over 6m42s)   kubelet, controlplane   Back-off restarting failed container

As you can see from the Failed event, the name of the scheduler binary is incorrect: the container tries to exec kube-schedulerror, which is not in $PATH. To fix this, go to the Kubernetes manifests directory and open kube-scheduler.yaml.

# cd /etc/kubernetes/manifests/
# cat kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-schedulerror
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0

Under the command section, the name of the scheduler binary is wrong. Change it from kube-schedulerror to kube-scheduler.
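
kube-scheduler runs as a static pod, so the kubelet watches /etc/kubernetes/manifests/ and picks up edits to the file automatically; no kubectl apply is needed. If you prefer a one-line, non-interactive fix, a sed sketch like this does the rename in place:

# sed -i 's/kube-schedulerror/kube-scheduler/' kube-scheduler.yaml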

Once you save the fix, you'll see that the kubelet recreates the scheduler pod, and the application pod gets scheduled onto a worker node.

$ kubectl get pods
NAME                    READY   STATUS                RESTARTS    AGE
app-586bddbc54-xd8hs    1/1     Running               0           13m

$ kubectl get all
NAME                        READY     STATUS      RESTARTS    AGE
pod/app-586bddbc54-xd8hs    1/1       Running     0           14m

NAME                  TYPE          CLUSTER-IP      EXTERNAL-IP     PORT(S)     AGE
service/kubernetes    ClusterIP     10.96.0.1       <none>          443/TCP     15m

NAME                    READY     UP-TO-DATE    AVAILABLE     AGE
deployment.apps/app     1/1       1             1             14m

NAME                              DESIRED     CURRENT     READY     AGE
replicaset.apps/app-586bddbc54    1           1           1         14m

Note: Here is another useful command you can use to debug the issue:

$ kubectl logs kube-scheduler-controlplane -n kube-system --tail=10 # (to get the last 10 lines)
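
For a pod stuck in CrashLoopBackOff, the current container may not have produced logs yet; the --previous flag prints the logs of the last terminated instance instead:

$ kubectl logs kube-scheduler-controlplane -n kube-system --previous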

Summary

In this article, we have covered some of the ways to debug a Kubernetes cluster. Because Kubernetes is dynamic, it's hard to cover all use cases, but the techniques we've discussed will help you get started on your Kubernetes debugging journey. If you want to go beyond debugging Kubernetes and debug the application itself, Thundra provides a rich feature set for production debugging, with tracepoints and distributed tracing.

Kubernetes is a useful tool for deploying and managing containerized applications, yet even seasoned Kubernetes enthusiasts agree that it’s hard to debug Kubernetes deployments and failing pods. This is because of the distributed nature of Kubernetes, which makes it hard to reproduce the exact issue and determine the root cause.

This article will cover some of the general guidelines you can use to debug your Kubernetes deployments and some of the common errors and issues you can expect to encounter.

Tools to Use to Debug the Kubernetes Cluster

It’s important to ensure that our process remains the same whether we’re debugging our application in Kubernetes or a physical machine. The tools we use will be the same, but with Kubernetes we’re going to probe the state and outputs of the system. We can always start our debugging process using kubectl, or we can use some of the common Kubernetes debugging tools.

Kubectl is a command-line tool that lets you control your Kubernetes cluster and comes with a wide range of commands, such as kubectl get [-o wide] nodes, pods, and svc.

One of the critical commands is get, which lists one or more resources. For example, kubectl get pods will help you check your pod status, and you can check the status of worker nodes using kubectl get nodes. You can also pass -o with an output format to get additional information. For example, when you pass -o wide with kubectl get pods, the node is included in the output (kubectl get pods -o wide).

Note: One of the important flags that come with the kubectl command is -w(watch flag), which lets you start watching the updates to a particular object (for example, kubectl get pods -w):

kubectl describe (node, pod, svc) <name> -o yaml

The kubectl describe nodes display detailed information about resources. For example, the kubectl describe nodes <node name> displays detailed information about the node with the name <node name>. Similarly, to get detailed information about pod use, you should use kubectl describe pod <pod name>. If you pass the flag -o yaml at the end of the kubectl describe command, it displays the output in yaml format (kubectl describe pod <pod name> -o yaml).

kubectl logs

Kubectl logs is another useful debugging command used to print the logs for a container in a pod. For example, kubectl logs <pod name> prints the pod’s logs with a name pod name.

Note: To stream the logs, use the -f flag along with the kubectl logs command. The -f flag works similarly to the tail -f command in Linux. An example is kubectl logs -f <pod name>.

kubectl -v(verbosity)

For kubectl, verbosity is controlled with -v or –v flags. After -v, we will pass an integer to represent the log level. The integer ranges from 0 to 9, where 0 is least verbose and 9 is most verbose. An example is kubectl -v=9 get nodes display HTTP request contents without truncation of contents.

kubectl exec

Kubectl exec is a useful command to debug the running container. You can run commands like kubectl exec <pod name> -- cat /var/log/messages to look at the logs from a given pod, or kubectl exec -it <pod name> --sh to log in to the given pod.

kubectl get events

Kubectl events give you high-level information about what is happening inside the cluster. To list all the events, you can use a command like kubectl get events, or if you are looking for a specific type of event such as Warning, you can use kubectl get events --field-selector type=Warning.

Debugging Pod

There are two common reasons why pods fail in Kubernetes:

  • Startup failure: A container inside the pod doesn’t start.
  • Runtime failure: The application code fails after container startup.

Debugging Common Pod Errors with Step-by-Step and Real-World Examples

CrashLoopBackoff

A CrashLoopBackoff error means that when your pod starts, it crashes; it then tries to start again, but crashes again. Here are some of the common causes of CrashLoopBackoff:

  • An error in the application inside the container.
  • Misconfiguration inside the container (such as a missing ENTRYPOINT or CMD).
  • A liveness probe that failed too many times.
Commands to Run to Identify the Issue

The first step in debugging this issue is to check pod status by running the kubectl get pods command.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
busybox     0/1       CrashLoopBackOff      1           27s

Here we see our pod status under the STATUS column and verify that it’s in a CrashLoopBackOff state.

As a next troubleshooting step, we are going to run the kubectl logs command to print the logs for a container in a pod.

$ kubectl logs busybox
sleep: invalid number '-z'

Here we see that the reason for this broken pod is specification of an unknown option -z in the sleep command. Verify your pod definition file.

$ cat podbroken.yml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  labels:
    env: Prod
spec:
  containers:
  - name: busybox
    image: busybox
    args:
      - "sleep"
      - "-z"

To fix this issue, you’ll need to specify a valid option under the sleep command in your pod definition file and then run the following command

$ kubectl create -f broken.yml
pod/busybox created

Verify the status of your pod.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
busybox     1/1       Running               1           63s

ErrImagePull/ImagePullBackOff

An ErrImagePull/ImagePullBackOff error occurs when you are not able to pull the desired docker image. These are some of the common causes of this error:

  • The image you have provided has an invalid name.
  • The image tag doesn’t exist.
  • The specified image is in the private registry.
Commands to Run to Identify the Issue

As with the CrashLoopBackoff error, our first troubleshooting step starts with getting the pod status using the kubectl get pods command.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
redis       0/1       ErrImagePull            0         117s

Next we will run the kubectl describe command to get detailed information about the pod.

$ kubectl describe pod redis
Events:
  Type      Reason      Age                 From                Message
  ----      ------      ----                ----                -------
  Normal    Scheduled   46s                 default-scheduler   Successfully assigned default/redis to node01
  Normal    BackOff     16s (x2 over 43s)   kubelet, node01     Back-off pulling image "redis123"
  Warning   Failed      16s (x2 over 43s)   kubelet, node01     Error: ImagePullBackOff
  Normal    Pulling     2s (x3 over 45s)    kubelet, node01     Pulling image "redis123"
  Warning   Failed      1s (x3 over 44s)    kubelet, node01     Failed to pull image "redis123": rpc error: code = Unknown desc = Error response from daemon: pull access denied for redis123, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  Warning   Failed      1s (x3 over 44s)    kubelet, node01     Error: ErrImagePull

As you can see in the output of the kubectl describe command, it was unable to pull the image name redis123.

To find the correct image name, you can either go to Docker Hub or run the docker search command specifying the image name on the command line.

$ docker search redis
NAME                             DESCRIPTION                                     STARS     OFFICIAL   AUTOMATED
redis                            Redis is an open source key-value store that…   8985      [OK]

To fix this issue, we can follow either one of the following approaches:

  1. Start with the kubectl edit command, which allows you to directly edit any API resource you have retrieved via the command-line tool. Go under the spec section and change the image name from redis123 to redis.
  2. If you already have the yaml file, you can edit it directly, and if you don’t, you can use the --dry-run option to generate it.
kubectl edit pod redis
...
spec:
  containers:
  - image: redis
...
$ kubectl  run redis --image redis --dry-run=client -o yaml > redispod.yml

$ cat redispod.yml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: redis
    name: redis
spec:
  containers:
  - image: redis
    name: redis123
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

Similarly, under image spec, change the image name to redis and recreate the pod.

$ kubectl create -f redispod.yml
pod/redis created

CreateContainerConfigError

This error usually occurs when you’re missing a ConfigMap or Secret. Both are ways to inject data into the container when it starts up.

Commands to Run to Identify the Issue

We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.

$ kubectl get pods
NAME              READY     STATUS                        RESTARTS    AGE
configmap-pod     0/1       CreateContainerConfigError    0           58s

As the output of kubectl get pods shows, pod is in a CreateContainerConfigError state. Our next command is kubectl describe, which gets the detailed information about the pod.

$ kubectl describe pod configmap-pod
Warning  Failed       44s (x8 over 2m10s)     kubelet       Error: configmap "my-config" not found
Normal   Pulled       44s                     kubelet       Successfully pulled image "gcr.io/google_containers/busybox" in 1.002510873s
Normal   Pulling      31s (x9 over 2m14s)     kubelet       Pulling image "gcr.io/google_containers/busybox"

To retrieve information about the ConfigMap, use this command:

$ kubectl get configmap

 

As the output of the above command is null, the next step is to verify the pod definition file and create the ConfigMap.

apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
    - name: mytest-container
      image: busybox
      command: [ "/bin/sh", "-c", "env" ]
      env:
        - name: MY_KEY
          valueFrom:
            configMapKeyRef:
              name: my-config
              key: name

To fix the error, create a ConfigMap file, whose content will look like this:

$ cat my-config.yaml
apiVersion: v1
data:
  name: test
  value: user
kind: ConfigMap
metadata:
  name: my-config

$ kubectl apply -f configmap-special-config.yaml
configmap/special-config created

Now run kubectl get configmap to retrieve information about the ConfigMap, and this time, you’ll see the newly created ConfigMap.

$ kubectl get configmap
NAME              DATA    AGE
my-config         2       11s

Verify the status of the pod, and you will see the pod will be running state now.

$ kubectl get pods
NAME              READY     STATUS        RESTARTS    AGE
configmap-pod     1/1       Running       4           7m29s

ContainerCreating Pod

The main cause of the ContainerCreating error is that your container is looking for a secret that is not present. Secrets in Kubernetes let you store sensitive information such as tokens, passwords, and ssh keys.

Commands to Run to Identify the Issue

We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.

$ kubectl get pods
NAME              READY     STATUS                RESTARTS    AGE
secret-pod        0/1       ContainerCreating     0           7s

As the output of kubectl get pods shows, the pod is in a ContainerCreating state. Our next command is the kubectl describe command, which gets the detailed information about the pod.

$ kubectl describe pod secret-pod
Events:
  Type      Reason      Age                 From                Message
  ----      ------      ----                ----                -------
  Normal    Scheduled   25s                 default-scheduler   Successfully assigned default/secret-pod to 17e9458eef1c.mylabserver.com
  Normal    Pulled      18s                 kubelet             Successfully pulled image "nginx" in 666.826881ms
  Normal    Pulled      17s                 kubelet             Successfully pulled image "nginx" in 472.277634ms
  Normal    Pulling     5s (x3 over 18s)    kubelet             Pulling image "nginx"
  Warning   Failed      4s (x3 over 17s)    kubelet             Error: secret "myothersecret" not found
  Normal    Pulled      4s                  kubelet             Successfully pulled image "nginx" in 476.69613ms

To retrieve information about the secret, use this command:

$ kubectl get secret

As the above command’s output is null, the next step is to verify the pod definition file and create the secret.

$ cat secret.yml
apiVersion: v1
kind: Pod
metadata:
  name: secret-pod
spec:
  containers:
    - name: test-contaner
      image: nginx
      envFrom:
      - secretRef:
          name: myothersecret

To fix the error, create a secret file whose content looks like this.

$ cat secret-data.yml
apiVersion: v1
kind: Secret
metadata:
  name: myothersecret
data:
  USER_NAME: YWRtaW4=
  PASSWORD: MWYyZDFlMmU2N2Rm

Secrets are similar to ConfigMap, except they are stored in a base64 encoded or hashed format.

For example, you can use the base64 command to encode any text data.

$ echo -n "username" | base64
dXNlcm5hbWU=

Use the following commands to decode the text and print the original text:

$ echo -n "dXNlcm5hbWU=" | base64 --decode
username
$ kubectl create -f secret-data.yml
secret/myothersecret created

Now run kubectl get secret to retrieve information about a secret, and this time you will see the newly created secret.

$ kubectl get secret
NAME              TYPE      DATA    AGE
myothersecret     Opaque    2       20m

Verify the status of the pod, and you will see that the pod is in running state now.

$ kubectl get pods
NAME          READY     STATUS      RESTARTS    AGE
secret-pod    1/1       Running     0           2m36s

Debugging Worker Nodes

A worker node in the Kubernetes cluster is responsible for running your containerized application. To debug worker node failure, we need to follow the same systematic approach and tools as we used while debugging pod failures.To reinforce the concept, we will look at three different scenarios where you see your worker node is in NotReady state.

NotReady State

There may be multiple reasons why your worker node goes into NotReady state. Here are some of them:

  • A virtual machine where the worker node is running shut down.
  • There is a network issue between worker and master node.
  • There is a crash within the Kubernetes software.

But as you know, the kubelet is the binary present on the worker node, and it is responsible for running containers on the node. We’ll start our debugging with a kubelet.

Scenario 1: Worker Node is in NotReady State (kubelet is in inactive [dead] state)

Commands to Run to Identify the Issue

Before we start debugging at the kubelet binary, let’s first check the status of worker nodes using kubectl get nodes.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE     VERSION
controlplane    Ready       master    21m     v1.19.0
node01          NotReady    <none>    20m     v1.19.0

Now, as we have confirmed that our worker node (node01) is in NotReady state, the next command we will run is ps to check the status of the kubelet.

$ ps -ef | grep -i kubelet
root    21200 16471  0 04:46 pts/1    00:00:00 grep -i kubelet

As we see, the kubelet process is not running. We can run the systemctl command to verify it further.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: inactive (dead) since Tue 2020-10-20 04:34:33 UTC; 1min 53s ago
      Docs: https://kubernetes.io/docs/home/
   Process: 1455 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
  Main PID: 1455 (code=exited, status=0/SUCCESS)

Oct 20 04:34:33 node01 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Oct 20 04:34:33 node01 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Using the systemctl command, we confirmed that the kubelet is not running (Active: inactive (dead)). Before we debug it further, we can try to start the kubelet service and see if that helps. To start the kubelet, run the systemctl start kubelet command.

# systemctl start kubelet

Verify the status of kubelet again.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 04:37:14 UTC; 1s ago
      Docs: https://kubernetes.io/docs/home/
  Main PID: 13240 (kubelet)
     Tasks: 8 (limit: 4678)
    CGroup: /system.slice/kubelet.service
            └─13240 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kube

Based on the systemctl status kubelet output, we can say that starting kubelet helped, and now it’s in running state. Let’s verify it from the Kubernetes master node.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE   VERSION
controlplane    Ready       master    23m   v1.19.0
node01          Ready       <none>    23m   v1.19.0

As you can see, the worker node is now in Ready state.

Scenario 2: Worker Node is in NotReady state (kubelet is in activating [auto-restart] state)

In scenario 2, your worker node is again in NotReady state, and you’ve followed all the steps to start the kubelet (systemctl start kubelet), but that doesn’t help. This time, the kubelet service status is activating (auto-restart), which means it’s not able to start.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since Tue 2020-10-20 04:41:58 UTC; 1s ago

Here we’ll start our debugging with a command called journalctl. Using journalctl, you can view the logs collected by systemd. In this case, we will pass the -u option to see the logs of a particular service, kubelet.

# journalctl -u kubelet

You will see the following error message:

Oct 20 04:54:47 node01 kubelet[28692]: F1020 04:54:47.021580   28692 server.go:253] unable to load client CA file /etc/kubernetes/pki/my-test-file.crt: open /etc/kubernetes/pki/my-test-file.crt
Oct 20 04:54:47 node01 kubelet[28692]: goroutine 1 [running]:

Note: To reach the bottom of the command, press SHIFT+G.

Our next step is to identify the configuration file. Since this scenario is created by kubeadm, you will see the custom configuration file.

# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
--kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"

This file also has the reference to the actual configuration used by kubelet. Open this file and check the line starting with clientCAFile:

# cat /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/my-test-file.crt

You will see the incorrect CA path for clients (/etc/kubernetes/pki/my-test-file.crt. Let’s check the valid path under /etc/kubernetes/pki.

# ls -l /etc/kubernetes/pki/
total 4
-rw-r--r-- 1 root root 1066 Oct 20 04:14 ca.crt

Replace it with the correct file path and restart the daemon.

clientCAFile: /etc/kubernetes/pki/my-test-file.crt

TO

clientCAFile: /etc/kubernetes/pki/ca.crt

# systemctl restart kubelet
# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 05:05:38 UTC; 7s ago

Again, verify it from the Kubernetes master node.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE     VERSION
controlplane    Ready       master    23m     v1.19.0
node01          Ready       <none>    23m     v1.19.0

As you can see, the worker node is now in Ready state.

Scenario 3 Worker Node is in NotReady state (kubelet is in active [running] state)

In scenario 3, your worker node is again in NotReady state, and although you’ve followed all the steps for starting the kubelet (systemctl start kubelet), that doesn’t help. This time, the kubelet service status is loaded, which means that systemctl reads the unit from disk into memory, but it’s not able to start. We will start our debugging with the standard process by running the systemctl status command.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 05:06:17 UTC; 6min ago
      Docs: https://kubernetes.io/docs/home/
  Main PID: 6904 (kubelet)
     Tasks: 13 (limit: 4678)
    CGroup: /system.slice/kubelet.service
            └─6904 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-pl

Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.104242	6904 kubelet.go:2183] node "node01" not found
Oct 20 05:12:56 node01 kubelet[6904]: I1020 05:12:56.121262	6904 kubelet_node_status.go:70] Attempting to register node node01
Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.122256	6904 kubelet_node_status.go:92] Unable to register node "node01" with API server: Post "https://172.17.0.39:6553/api/v1/nodes"

If we look at the last line of the systemctl status command, we’ll see that the kubelet cannot communicate with the API server.

To get information about the API server, go to the Kubernetes master node and run kubectl cluster-info.

$ kubectl cluster-info
Kubernetes master is running at https://172.17.0.39:6443
KubeDNS is running at https://172.17.0.39:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

As you can see, there is a mismatch between the port in which the API server is running (6443) and the one to which kubectl is trying to connect (6553). Here we’ll follow the same step as in scenario 2, that is, identify the configuration file.

# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf

# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"

This time, though, you should check /etc/kubernetes/kubelet.conf, where the details about ApiServer are stored.

If you open this file, you’ll see that the port number defined for ApiServer is wrong; it should be 6443, as we saw in the output of kubectl cluster-info.

# cat /etc/kubernetes/kubelet.conf
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:
    server: https://172.17.0.39:6553
  name: default-cluster

Change port 6553 to 6443 and restart the kubelet daemon.

# systemctl daemon-reload
# systemctl restart kubelet

Debugging Control Plane

The Kubernetes control plane is the brains behind Kubernetes and is responsible for managing the Kubernetes cluster. To debug control plane failure, we need to follow the same systematic approach and tools we used while debugging worker node failures. In this case, we will try to replicate the scenario where we are trying to deploy an application and it is failing.

Pod in a Pending State

What are some of the reasons that a pod goes into a Pending state?

  • The cluster doesn’t have enough resources.
  • The current namespace has a resource quota.
  • The pod is bound to a persistent volume claim.
  • You’re trying to deploy an application, and it’s failing.
Commands to Run to Identify the Issue

We will start our debugging with kubectl get all commands. So far, we have used kubectl get pods and nodes, but to list all the resources, we need to pass all command line parameters. The command looks like this:

$ kubectl get all
NAME                        READY     STATUS      RESTARTS    AGE
pod/app-586bddbc54-xd8hs    0/1       Pending     0           16s

NAME                  TYPE          CLUSTER-IP    EXTERNAL-IP     PORT(S)     AGE
service/kubernetes    ClusterIP     10.96.0.1     <none>          443/TCP     83s

NAME                    READY     UP-TO-DATE    AVAILABLE     AGE
deployment.apps/app     0/1       1             0             16s

NAME                              DESIRED     CURRENT     READY     AGE
replicaset.apps/app-586bddbc54    1           1           0         16s

As you can see, the pod is in a Pending state. The scheduler is the component responsible for scheduling the pod on the container. Let’s check the scheduler status in the kube-system namespace.

$ kubectl get pods -n kube-system
NAME                                    READY     STATUS              RESTARTS    AGE
coredns-f9fd979d6-48wdf                 1/1       Running             0           5m54s
coredns-f9fd979d6-rl55d                 1/1       Running             0           5m54s
etcd-controlplane                       1/1       Running             0           6m2s
kube-apiserver-controlplane             1/1       Running             0           6m2s
kube-controller-manager-controlplane    1/1       Running             0           6m2s
kube-flannel-ds-amd64-qbn7x             1/1       Running             0           5m44s
kube-flannel-ds-amd64-wzcmn             1/1       Running             0           5m53s
kube-proxy-b645c                        1/1       Running             0           5m53s
kube-proxy-m4lnk                        1/1       Running             0           5m44s
kube-scheduler-controlplane             0/1       CrashLoopBackOff    5           5m2s

As you can see in the output, the kube-scheduler-controlplane pod is in CrashLoopBackOff state, and we already know that when the pod is in this state, it will try to start but will crash.

The next command we’ll run is the kubectl describe command to get detailed information about the pod.

$ kubectl describe pod kube-scheduler-controlplane -n kube-system
Events:
  Type      Reason      Age                     From                    Message
  ----      ------      ----                    ----                    -------
  Normal    Pulled      5m16s (x5 over 6m44s)   kubelet, controlplane   Container image "k8s.gcr.io/kube-scheduler:v1.19.0" already present on machine
  Normal    Created     5m16s (x5 over 6m44s)   kubelet, controlplane   Created container kube-scheduler
  Warning   Failed      5m16s (x5 over 6m44s)   kubelet, controlplane   Error: failed to start container "kube-scheduler": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "kube-schedulerror": executable file not found in $PATH": unknown
  Warning   BackOff     103s (x27 over 6m42s)   kubelet, controlplane   Back-off restarting failed container

As you can see, the name of the scheduler is incorrect. To fix this, go to the Kubernetes manifests directory and open kube-scheduler.yaml.

# cd /etc/kubernetes/manifests/
$ cat kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-schedulerror
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0

Under the command section, the name of the scheduler is wrong. Change it from kube-schedulerror to kube-schedule.

Once you fix it, you’ll see that it will recreate the pod, and the application pod will schedule a worker node.

$ kubectl get pods
NAME                    READY   STATUS                RESTARTS    AGE
app-586bddbc54-xd8hs    1/1     Running               0           13m

$ kubectl get all
NAME                        READY     STATUS      RESTARTS    AGE
pod/app-586bddbc54-xd8hs    1/1       Running     0           14m

NAME                  TYPE          CLUSTER-IP      EXTERNAL-IP     PORT(S)     AGE
service/kubernetes    ClusterIP     10.96.0.1       <none>          443/TCP     15m

NAME                    READY     UP-TO-DATE    AVAILABLE     AGE
deployment.apps/app     1/1       1             1             14m

NAME                              DESIRED     CURRENT     READY     AGE
replicaset.apps/app-586bddbc54    1           1           1         14m

Note: This is the other useful command you can use to debug the issue:

$ kubectl logs kube-scheduler-controlplane -n kube-system --tail=10 # (to get the last 10 lines)

Summary

In this article, we have covered some of the ways to debug the Kubernetes cluster. Because Kubernetes is dynamic, it’s hard to cover all use cases, but some of the techniques we’ve discussed will help you get started on your Kubernetes debugging journey. If you want to go beyond debugging Kubernetes and debug the application itself, Thundra provides rich feature set with production debugging with tracepoints and with distributed tracing.

Kubernetes is a useful tool for deploying and managing containerized applications, yet even seasoned Kubernetes enthusiasts agree that it’s hard to debug Kubernetes deployments and failing pods. This is because of the distributed nature of Kubernetes, which makes it hard to reproduce the exact issue and determine the root cause.

This article will cover some of the general guidelines you can use to debug your Kubernetes deployments and some of the common errors and issues you can expect to encounter.

Tools to Use to Debug the Kubernetes Cluster

It’s important to ensure that our process remains the same whether we’re debugging our application in Kubernetes or a physical machine. The tools we use will be the same, but with Kubernetes we’re going to probe the state and outputs of the system. We can always start our debugging process using kubectl, or we can use some of the common Kubernetes debugging tools.

Kubectl is a command-line tool that lets you control your Kubernetes cluster and comes with a wide range of commands, such as kubectl get [-o wide] nodes, pods, and svc.

One of the critical commands is get, which lists one or more resources. For example, kubectl get pods will help you check your pod status, and you can check the status of worker nodes using kubectl get nodes. You can also pass -o with an output format to get additional information. For example, when you pass -o wide with kubectl get pods, the node is included in the output (kubectl get pods -o wide).

Note: One of the important flags that come with the kubectl command is -w(watch flag), which lets you start watching the updates to a particular object (for example, kubectl get pods -w):

kubectl describe (node, pod, svc) <name> -o yaml

The kubectl describe nodes display detailed information about resources. For example, the kubectl describe nodes <node name> displays detailed information about the node with the name <node name>. Similarly, to get detailed information about pod use, you should use kubectl describe pod <pod name>. If you pass the flag -o yaml at the end of the kubectl describe command, it displays the output in yaml format (kubectl describe pod <pod name> -o yaml).

kubectl logs

Kubectl logs is another useful debugging command used to print the logs for a container in a pod. For example, kubectl logs <pod name> prints the pod’s logs with a name pod name.

Note: To stream the logs, use the -f flag along with the kubectl logs command. The -f flag works similarly to the tail -f command in Linux. An example is kubectl logs -f <pod name>.

kubectl -v(verbosity)

For kubectl, verbosity is controlled with -v or –v flags. After -v, we will pass an integer to represent the log level. The integer ranges from 0 to 9, where 0 is least verbose and 9 is most verbose. An example is kubectl -v=9 get nodes display HTTP request contents without truncation of contents.

kubectl exec

Kubectl exec is a useful command to debug the running container. You can run commands like kubectl exec <pod name> -- cat /var/log/messages to look at the logs from a given pod, or kubectl exec -it <pod name> --sh to log in to the given pod.

kubectl get events

Kubectl events give you high-level information about what is happening inside the cluster. To list all the events, you can use a command like kubectl get events, or if you are looking for a specific type of event such as Warning, you can use kubectl get events --field-selector type=Warning.

Debugging Pod

There are two common reasons why pods fail in Kubernetes:

  • Startup failure: A container inside the pod doesn’t start.
  • Runtime failure: The application code fails after container startup.

Debugging Common Pod Errors with Step-by-Step and Real-World Examples

CrashLoopBackoff

A CrashLoopBackoff error means that when your pod starts, it crashes; it then tries to start again, but crashes again. Here are some of the common causes of CrashLoopBackoff:

  • An error in the application inside the container.
  • Misconfiguration inside the container (such as a missing ENTRYPOINT or CMD).
  • A liveness probe that failed too many times.
Commands to Run to Identify the Issue

The first step in debugging this issue is to check pod status by running the kubectl get pods command.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
busybox     0/1       CrashLoopBackOff      1           27s

Here we see our pod status under the STATUS column and verify that it’s in a CrashLoopBackOff state.

As a next troubleshooting step, we are going to run the kubectl logs command to print the logs for a container in a pod.

$ kubectl logs busybox
sleep: invalid number '-z'

Here we see that the reason for this broken pod is specification of an unknown option -z in the sleep command. Verify your pod definition file.

$ cat podbroken.yml
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  labels:
    env: Prod
spec:
  containers:
  - name: busybox
    image: busybox
    args:
      - "sleep"
      - "-z"

To fix this issue, you’ll need to specify a valid option under the sleep command in your pod definition file and then run the following command

$ kubectl create -f broken.yml
pod/busybox created

Verify the status of your pod.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
busybox     1/1       Running               1           63s

ErrImagePull/ImagePullBackOff

An ErrImagePull/ImagePullBackOff error occurs when you are not able to pull the desired docker image. These are some of the common causes of this error:

  • The image you have provided has an invalid name.
  • The image tag doesn’t exist.
  • The specified image is in the private registry.
Commands to Run to Identify the Issue

As with the CrashLoopBackoff error, our first troubleshooting step starts with getting the pod status using the kubectl get pods command.

$ kubectl get pods
NAME        READY     STATUS              RESTARTS    AGE
redis       0/1       ErrImagePull            0         117s

Next we will run the kubectl describe command to get detailed information about the pod.

$ kubectl describe pod redis
Events:
  Type      Reason      Age                 From                Message
  ----      ------      ----                ----                -------
  Normal    Scheduled   46s                 default-scheduler   Successfully assigned default/redis to node01
  Normal    BackOff     16s (x2 over 43s)   kubelet, node01     Back-off pulling image "redis123"
  Warning   Failed      16s (x2 over 43s)   kubelet, node01     Error: ImagePullBackOff
  Normal    Pulling     2s (x3 over 45s)    kubelet, node01     Pulling image "redis123"
  Warning   Failed      1s (x3 over 44s)    kubelet, node01     Failed to pull image "redis123": rpc error: code = Unknown desc = Error response from daemon: pull access denied for redis123, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  Warning   Failed      1s (x3 over 44s)    kubelet, node01     Error: ErrImagePull

As you can see in the output of the kubectl describe command, it was unable to pull the image name redis123.

To find the correct image name, you can either go to Docker Hub or run the docker search command specifying the image name on the command line.

$ docker search redis
NAME                             DESCRIPTION                                     STARS     OFFICIAL   AUTOMATED
redis                            Redis is an open source key-value store that…   8985      [OK]

To fix this issue, we can follow either one of the following approaches:

  1. Start with the kubectl edit command, which allows you to directly edit any API resource you have retrieved via the command-line tool. Go under the spec section and change the image name from redis123 to redis.
  2. If you already have the yaml file, you can edit it directly, and if you don’t, you can use the --dry-run option to generate it.
kubectl edit pod redis
...
spec:
  containers:
  - image: redis
...
$ kubectl  run redis --image redis --dry-run=client -o yaml > redispod.yml

$ cat redispod.yml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: redis
    name: redis
spec:
  containers:
  - image: redis
    name: redis123
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

Similarly, under image spec, change the image name to redis and recreate the pod.

$ kubectl create -f redispod.yml
pod/redis created

CreateContainerConfigError

This error usually occurs when you’re missing a ConfigMap or Secret. Both are ways to inject data into the container when it starts up.

Commands to Run to Identify the Issue

We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.

$ kubectl get pods
NAME              READY     STATUS                        RESTARTS    AGE
configmap-pod     0/1       CreateContainerConfigError    0           58s

As the output of kubectl get pods shows, pod is in a CreateContainerConfigError state. Our next command is kubectl describe, which gets the detailed information about the pod.

$ kubectl describe pod configmap-pod
Warning  Failed       44s (x8 over 2m10s)     kubelet       Error: configmap "my-config" not found
Normal   Pulled       44s                     kubelet       Successfully pulled image "gcr.io/google_containers/busybox" in 1.002510873s
Normal   Pulling      31s (x9 over 2m14s)     kubelet       Pulling image "gcr.io/google_containers/busybox"

To retrieve information about the ConfigMap, use this command:

$ kubectl get configmap

 

As the output of the above command is null, the next step is to verify the pod definition file and create the ConfigMap.

apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
    - name: mytest-container
      image: busybox
      command: [ "/bin/sh", "-c", "env" ]
      env:
        - name: MY_KEY
          valueFrom:
            configMapKeyRef:
              name: my-config
              key: name

To fix the error, create a ConfigMap file, whose content will look like this:

$ cat my-config.yaml
apiVersion: v1
data:
  name: test
  value: user
kind: ConfigMap
metadata:
  name: my-config

$ kubectl apply -f configmap-special-config.yaml
configmap/special-config created

Now run kubectl get configmap to retrieve information about the ConfigMap, and this time, you’ll see the newly created ConfigMap.

$ kubectl get configmap
NAME              DATA    AGE
my-config         2       11s

Verify the status of the pod, and you will see the pod will be running state now.

$ kubectl get pods
NAME              READY     STATUS        RESTARTS    AGE
configmap-pod     1/1       Running       4           7m29s

ContainerCreating Pod

The main cause of the ContainerCreating error is that your container is looking for a secret that is not present. Secrets in Kubernetes let you store sensitive information such as tokens, passwords, and ssh keys.

Commands to Run to Identify the Issue

We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.

$ kubectl get pods
NAME              READY     STATUS                RESTARTS    AGE
secret-pod        0/1       ContainerCreating     0           7s

As the output of kubectl get pods shows, the pod is in a ContainerCreating state. Our next command is the kubectl describe command, which gets the detailed information about the pod.

$ kubectl describe pod secret-pod
Events:
  Type      Reason      Age                 From                Message
  ----      ------      ----                ----                -------
  Normal    Scheduled   25s                 default-scheduler   Successfully assigned default/secret-pod to 17e9458eef1c.mylabserver.com
  Normal    Pulled      18s                 kubelet             Successfully pulled image "nginx" in 666.826881ms
  Normal    Pulled      17s                 kubelet             Successfully pulled image "nginx" in 472.277634ms
  Normal    Pulling     5s (x3 over 18s)    kubelet             Pulling image "nginx"
  Warning   Failed      4s (x3 over 17s)    kubelet             Error: secret "myothersecret" not found
  Normal    Pulled      4s                  kubelet             Successfully pulled image "nginx" in 476.69613ms

To retrieve information about the secret, use this command:

$ kubectl get secret

As the above command’s output is null, the next step is to verify the pod definition file and create the secret.

$ cat secret.yml
apiVersion: v1
kind: Pod
metadata:
  name: secret-pod
spec:
  containers:
    - name: test-contaner
      image: nginx
      envFrom:
      - secretRef:
          name: myothersecret

To fix the error, create a secret file whose content looks like this.

$ cat secret-data.yml
apiVersion: v1
kind: Secret
metadata:
  name: myothersecret
data:
  USER_NAME: YWRtaW4=
  PASSWORD: MWYyZDFlMmU2N2Rm

Secrets are similar to ConfigMap, except they are stored in a base64 encoded or hashed format.

For example, you can use the base64 command to encode any text data.

$ echo -n "username" | base64
dXNlcm5hbWU=

Use the following commands to decode the text and print the original text:

$ echo -n "dXNlcm5hbWU=" | base64 --decode
username
$ kubectl create -f secret-data.yml
secret/myothersecret created

Now run kubectl get secret to retrieve information about a secret, and this time you will see the newly created secret.

$ kubectl get secret
NAME              TYPE      DATA    AGE
myothersecret     Opaque    2       20m

Verify the status of the pod, and you will see that the pod is in running state now.

$ kubectl get pods
NAME          READY     STATUS      RESTARTS    AGE
secret-pod    1/1       Running     0           2m36s

Debugging Worker Nodes

A worker node in the Kubernetes cluster is responsible for running your containerized application. To debug worker node failure, we need to follow the same systematic approach and tools as we used while debugging pod failures.To reinforce the concept, we will look at three different scenarios where you see your worker node is in NotReady state.

NotReady State

There may be multiple reasons why your worker node goes into NotReady state. Here are some of them:

  • A virtual machine where the worker node is running shut down.
  • There is a network issue between worker and master node.
  • There is a crash within the Kubernetes software.

But as you know, the kubelet is the binary present on the worker node, and it is responsible for running containers on the node. We’ll start our debugging with a kubelet.

Scenario 1: Worker Node is in NotReady State (kubelet is in inactive [dead] state)

Commands to Run to Identify the Issue

Before we start debugging at the kubelet binary, let’s first check the status of worker nodes using kubectl get nodes.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE     VERSION
controlplane    Ready       master    21m     v1.19.0
node01          NotReady    <none>    20m     v1.19.0

Now, as we have confirmed that our worker node (node01) is in NotReady state, the next command we will run is ps to check the status of the kubelet.

$ ps -ef | grep -i kubelet
root    21200 16471  0 04:46 pts/1    00:00:00 grep -i kubelet

As we see, the kubelet process is not running. We can run the systemctl command to verify it further.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: inactive (dead) since Tue 2020-10-20 04:34:33 UTC; 1min 53s ago
      Docs: https://kubernetes.io/docs/home/
   Process: 1455 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
  Main PID: 1455 (code=exited, status=0/SUCCESS)

Oct 20 04:34:33 node01 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Oct 20 04:34:33 node01 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Using the systemctl command, we confirmed that the kubelet is not running (Active: inactive (dead)). Before we debug it further, we can try to start the kubelet service and see if that helps. To start the kubelet, run the systemctl start kubelet command.

# systemctl start kubelet

Verify the status of kubelet again.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 04:37:14 UTC; 1s ago
      Docs: https://kubernetes.io/docs/home/
  Main PID: 13240 (kubelet)
     Tasks: 8 (limit: 4678)
    CGroup: /system.slice/kubelet.service
            └─13240 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kube

Based on the systemctl status kubelet output, we can say that starting kubelet helped, and now it’s in running state. Let’s verify it from the Kubernetes master node.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE   VERSION
controlplane    Ready       master    23m   v1.19.0
node01          Ready       <none>    23m   v1.19.0

As you can see, the worker node is now in Ready state.

Scenario 2: Worker Node is in NotReady state (kubelet is in activating [auto-restart] state)

In scenario 2, your worker node is again in NotReady state, and you’ve followed all the steps to start the kubelet (systemctl start kubelet), but that doesn’t help. This time, the kubelet service status is activating (auto-restart), which means it’s not able to start.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since Tue 2020-10-20 04:41:58 UTC; 1s ago

Here we’ll start our debugging with a command called journalctl. Using journalctl, you can view the logs collected by systemd. In this case, we will pass the -u option to see the logs of a particular service, kubelet.

# journalctl -u kubelet

You will see the following error message:

Oct 20 04:54:47 node01 kubelet[28692]: F1020 04:54:47.021580   28692 server.go:253] unable to load client CA file /etc/kubernetes/pki/my-test-file.crt: open /etc/kubernetes/pki/my-test-file.crt
Oct 20 04:54:47 node01 kubelet[28692]: goroutine 1 [running]:

Note: To reach the bottom of the command, press SHIFT+G.

Our next step is to identify the configuration file. Since this scenario is created by kubeadm, you will see the custom configuration file.

# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf
--kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"

This file also has the reference to the actual configuration used by kubelet. Open this file and check the line starting with clientCAFile:

# cat /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/my-test-file.crt

You will see the incorrect CA path for clients (/etc/kubernetes/pki/my-test-file.crt. Let’s check the valid path under /etc/kubernetes/pki.

# ls -l /etc/kubernetes/pki/
total 4
-rw-r--r-- 1 root root 1066 Oct 20 04:14 ca.crt

Replace it with the correct file path and restart the daemon.

clientCAFile: /etc/kubernetes/pki/my-test-file.crt

TO

clientCAFile: /etc/kubernetes/pki/ca.crt

# systemctl restart kubelet
# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 05:05:38 UTC; 7s ago

Again, verify it from the Kubernetes master node.

$ kubectl get nodes
NAME            STATUS      ROLES     AGE     VERSION
controlplane    Ready       master    23m     v1.19.0
node01          Ready       <none>    23m     v1.19.0

As you can see, the worker node is now in Ready state.

Scenario 3 Worker Node is in NotReady state (kubelet is in active [running] state)

In scenario 3, your worker node is again in NotReady state, and although you’ve followed all the steps for starting the kubelet (systemctl start kubelet), that doesn’t help. This time, the kubelet service status is loaded, which means that systemctl reads the unit from disk into memory, but it’s not able to start. We will start our debugging with the standard process by running the systemctl status command.

# systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
    Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Drop-In: /etc/systemd/system/kubelet.service.d
            └─10-kubeadm.conf
    Active: active (running) since Tue 2020-10-20 05:06:17 UTC; 6min ago
      Docs: https://kubernetes.io/docs/home/
  Main PID: 6904 (kubelet)
     Tasks: 13 (limit: 4678)
    CGroup: /system.slice/kubelet.service
            └─6904 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-pl

Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.104242	6904 kubelet.go:2183] node "node01" not found
Oct 20 05:12:56 node01 kubelet[6904]: I1020 05:12:56.121262	6904 kubelet_node_status.go:70] Attempting to register node node01
Oct 20 05:12:56 node01 kubelet[6904]: E1020 05:12:56.122256	6904 kubelet_node_status.go:92] Unable to register node "node01" with API server: Post "https://172.17.0.39:6553/api/v1/nodes"

If we look at the last line of the systemctl status command, we’ll see that the kubelet cannot communicate with the API server.

To get information about the API server, go to the Kubernetes master node and run kubectl cluster-info.

$ kubectl cluster-info
Kubernetes master is running at https://172.17.0.39:6443
KubeDNS is running at https://172.17.0.39:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

As you can see, there is a mismatch between the port in which the API server is running (6443) and the one to which kubectl is trying to connect (6553). Here we’ll follow the same step as in scenario 2, that is, identify the configuration file.

# cd /etc/systemd/system/kubelet.service.d/
$ cat 10-kubeadm.conf

# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"

This time, though, you should check /etc/kubernetes/kubelet.conf, where the details about the API server are stored.

If you open this file, you'll see that the port number defined for the API server is wrong; it should be 6443, as we saw in the output of kubectl cluster-info.

# cat /etc/kubernetes/kubelet.conf
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data:
    server: https://172.17.0.39:6553
  name: default-cluster

Change port 6553 to 6443 and restart the kubelet daemon.
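
If you'd rather not open an editor, a one-line substitution does the same job (a sketch; it assumes the wrong port 6553 appears nowhere else in the file):

# sed -i 's/6553/6443/' /etc/kubernetes/kubelet.conf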

# systemctl daemon-reload
# systemctl restart kubelet
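
As in scenario 2, verify the result from the Kubernetes master node; after a few seconds node01 should report Ready again:

$ kubectl get nodes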

Debugging the Control Plane

The Kubernetes control plane is the brain of the cluster and is responsible for managing it. To debug a control plane failure, we follow the same systematic approach and tools we used for worker node failures. In this case, we will replicate a scenario in which we deploy an application and the deployment fails.

Pod in a Pending State

What are some of the reasons that a pod goes into a Pending state?

  • The cluster doesn't have enough free CPU or memory to schedule the pod.
  • The current namespace has a resource quota, and the pod would exceed it.
  • The pod references a persistent volume claim that hasn't been bound yet.
  • The scheduler itself is failing, which is the scenario we debug below.
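
Each of these causes can be checked quickly. Here is a short triage sketch (the pod and namespace names are placeholders; substitute your own):

$ kubectl describe pod <pod name>            # the Events section usually names the blocking condition
$ kubectl get resourcequota -n <namespace>   # shows whether a quota is exhausted
$ kubectl get pvc                            # a claim stuck in Pending blocks the pod that mounts it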
Commands to Run to Identify the Issue

We will start our debugging with the kubectl get all command. So far, we have used kubectl get pods and kubectl get nodes, but to list all the resources in the current namespace, we pass all as the resource type instead. The command looks like this:

$ kubectl get all
NAME                        READY     STATUS      RESTARTS    AGE
pod/app-586bddbc54-xd8hs    0/1       Pending     0           16s

NAME                  TYPE          CLUSTER-IP    EXTERNAL-IP     PORT(S)     AGE
service/kubernetes    ClusterIP     10.96.0.1     <none>          443/TCP     83s

NAME                    READY     UP-TO-DATE    AVAILABLE     AGE
deployment.apps/app     0/1       1             0             16s

NAME                              DESIRED     CURRENT     READY     AGE
replicaset.apps/app-586bddbc54    1           1           0         16s

As you can see, the pod is in a Pending state. The scheduler is the component responsible for assigning pods to nodes. Let's check the scheduler status in the kube-system namespace.

$ kubectl get pods -n kube-system
NAME                                    READY     STATUS              RESTARTS    AGE
coredns-f9fd979d6-48wdf                 1/1       Running             0           5m54s
coredns-f9fd979d6-rl55d                 1/1       Running             0           5m54s
etcd-controlplane                       1/1       Running             0           6m2s
kube-apiserver-controlplane             1/1       Running             0           6m2s
kube-controller-manager-controlplane    1/1       Running             0           6m2s
kube-flannel-ds-amd64-qbn7x             1/1       Running             0           5m44s
kube-flannel-ds-amd64-wzcmn             1/1       Running             0           5m53s
kube-proxy-b645c                        1/1       Running             0           5m53s
kube-proxy-m4lnk                        1/1       Running             0           5m44s
kube-scheduler-controlplane             0/1       CrashLoopBackOff    5           5m2s

As you can see in the output, the kube-scheduler-controlplane pod is in the CrashLoopBackOff state, and we already know that when a pod is in this state, it repeatedly tries to start, crashes, and is restarted with an increasing back-off delay.

The next command we’ll run is the kubectl describe command to get detailed information about the pod.

$ kubectl describe pod kube-scheduler-controlplane -n kube-system
Events:
  Type      Reason      Age                     From                    Message
  ----      ------      ----                    ----                    -------
  Normal    Pulled      5m16s (x5 over 6m44s)   kubelet, controlplane   Container image "k8s.gcr.io/kube-scheduler:v1.19.0" already present on machine
  Normal    Created     5m16s (x5 over 6m44s)   kubelet, controlplane   Created container kube-scheduler
  Warning   Failed      5m16s (x5 over 6m44s)   kubelet, controlplane   Error: failed to start container "kube-scheduler": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: "kube-schedulerror": executable file not found in $PATH": unknown
  Warning   BackOff     103s (x27 over 6m42s)   kubelet, controlplane   Back-off restarting failed container

As you can see from the Failed event, the scheduler executable name is incorrect. To fix this, go to the Kubernetes manifests directory and open kube-scheduler.yaml.

# cd /etc/kubernetes/manifests/
$ cat kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-schedulerror
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --port=0

Under the command section, the name of the scheduler binary is wrong. Change it from kube-schedulerror to kube-scheduler.

Once you save the fix, the kubelet will recreate the scheduler pod, and the application pod will then be scheduled on a worker node.
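
Note that kube-scheduler runs as a static pod: the kubelet watches /etc/kubernetes/manifests and recreates the pod on its own as soon as the manifest file changes, so no kubectl apply is needed. You can watch it come back with the -w flag:

$ kubectl get pods -n kube-system -w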

$ kubectl get pods
NAME                    READY   STATUS                RESTARTS    AGE
app-586bddbc54-xd8hs    1/1     Running               0           13m

$ kubectl get all
NAME                        READY     STATUS      RESTARTS    AGE
pod/app-586bddbc54-xd8hs    1/1       Running     0           14m

NAME                  TYPE          CLUSTER-IP      EXTERNAL-IP     PORT(S)     AGE
service/kubernetes    ClusterIP     10.96.0.1       <none>          443/TCP     15m

NAME                    READY     UP-TO-DATE    AVAILABLE     AGE
deployment.apps/app     1/1       1             1             14m

NAME                              DESIRED     CURRENT     READY     AGE
replicaset.apps/app-586bddbc54    1           1           1         14m

Note: Another useful command you can use to debug the issue is kubectl logs with the --tail flag:

$ kubectl logs kube-scheduler-controlplane -n kube-system --tail=10 # (to get the last 10 lines)
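
If the scheduler container keeps crashing before it writes anything useful, the --previous flag prints the logs of the last terminated container instead of the current one:

$ kubectl logs kube-scheduler-controlplane -n kube-system --previous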

Summary

In this article, we have covered some of the ways to debug a Kubernetes cluster. Because Kubernetes is dynamic, it's hard to cover every use case, but the techniques we've discussed will help you get started on your Kubernetes debugging journey. If you want to go beyond debugging Kubernetes and debug the application itself, Thundra provides a rich feature set for production debugging with tracepoints and distributed tracing.
