Kubernetes Pods (CNFs) - Evictions due to Node Disk Pressure


Why do Pods get evicted? Eviction is the process of terminating a Pod that is assigned to a node, usually because the node does not have enough resources. Kubernetes evicts Pods from the node until enough resources are available again. The kubelet constantly monitors node resources and evicts Pods when needed, a process called node-pressure eviction.

Symptoms:

When running $kubectl get events -n namespace, the following errors are observed:

  • Failed to garbage collect required amount of images
  • Disk-pressure warnings for the associated namespace


The following errors are observed when running $kubectl describe pod -n namespace podname:

  • NodeHasDiskPressure
  • Attempting to reclaim ephemeral-storage
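As a rough sketch, the symptom strings listed above can be pulled out of event or describe output with a small filter. The function name disk_pressure_events is hypothetical, and the match is case-insensitive because the exact capitalization in the kubelet messages may differ from the text quoted in this article.

```shell
# disk_pressure_events: read kubectl output on stdin and keep only the lines
# that contain one of the symptom strings quoted in this article.
# (Hypothetical helper name; patterns are taken from the Symptoms section.)
disk_pressure_events() {
  grep -i -E 'failed to garbage collect required amount of images|NodeHasDiskPressure|Attempting to reclaim ephemeral-storage'
}

# Example usage (replace "namespace" with a valid namespace):
#   kubectl get events -n namespace | disk_pressure_events
```

If the filter prints nothing, none of the three symptom strings were present in the output that was piped in.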


Purpose:

The purpose of this article is to provide troubleshooting guidelines for scenarios where Kubernetes pods go into an evicted state due to disk pressure.

Cause:

In Kubernetes, Pods can be evicted from a Node due to insufficient resources.

In addition to terminating the Pod, whenever a node experiences disk pressure, node-pressure eviction can activate: the kubelet performs garbage collection and removes unused Kubernetes objects so that they stop consuming resources.

When a pod is terminated, Kubernetes can generate several core* temporary files, which, if not cleaned up properly, can lead to disk exhaustion.

While this process is automated, manual intervention may be required.
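To confirm that disk pressure is the cause before intervening manually, the node's DiskPressure condition can be queried directly. The following is a minimal sketch, assuming kubectl is configured for the cluster; the function name node_has_disk_pressure and the node name worker-1 are hypothetical.

```shell
# node_has_disk_pressure NODE: succeed (exit 0) when the node reports
# DiskPressure=True in its status conditions, fail otherwise.
# (Hypothetical helper; relies on kubectl's jsonpath output format.)
node_has_disk_pressure() {
  [ "$(kubectl get node "$1" \
        -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}')" = "True" ]
}

# Example usage (hypothetical node name):
#   if node_has_disk_pressure worker-1; then
#     echo "worker-1 is under disk pressure"
#   fi
```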

Procedure 1 (Clean up corefiles)

  1. SSH into the worker node as the root user.
  2. Obtain the file system disk usage by running the following command:

    $df -kh

  3. Confirm that the root (/) partition is highly utilized, e.g. over 85% full.
  4. Navigate to the /data/storage/corefiles directory

    $cd /data/storage/corefiles

  5. Obtain the total size of the directory by running the following command:

    $du -s -h

    Note: This value is the amount of space that will be cleaned up.

  6. List the files to confirm there are corefiles present.

     $ls -lrth

  7. Run the following command to remove all corefiles:

    $rm -rf core*

  8. Review the pod status by running the following command:

    $kubectl get pods -A -o wide | grep nodename

    Note: Replace nodename in the example above with a valid node name.

Confirm that the Pods are in a Running state as expected.
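The disk-usage check in step 3 and a preview of what step 7 would delete can be sketched as follows. This is an illustration under assumptions, not part of the official procedure: the helper names are hypothetical, the 85 threshold is the example value from step 3, and list_corefiles only lists regular files in the top level of the directory, which is narrower than the rm -rf core* command in step 7.

```shell
# usage_pct_from_df: read `df -P <path>` output on stdin and print the
# usage percentage of the first filesystem as plain digits.
usage_pct_from_df() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# is_over_threshold PCT THRESHOLD: succeed when PCT is greater than THRESHOLD.
is_over_threshold() {
  [ "$1" -gt "$2" ]
}

# list_corefiles DIR: print the core* files directly inside DIR without
# deleting anything, so the list can be reviewed first.
list_corefiles() {
  find "$1" -maxdepth 1 -type f -name 'core*'
}

# Example usage on the worker node:
#   pct=$(df -P / | usage_pct_from_df)
#   if is_over_threshold "$pct" 85; then
#     echo "root partition is ${pct}% full"
#     list_corefiles /data/storage/corefiles
#   fi
```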


Note: If the issue is not resolved, proceed to Procedure 2.

Procedure 2 (Clean up and re-instantiate CNF(s))

  1. SSH into the worker node as the root user.
  2. List the containers by running the following command to verify whether the DU container is running:

    $crictl ps -a


3. List the images by running the following command:

  $crictl images

4. Confirm that the DU pod image is listed. If it is present, the DU container must be stopped and removed, because the DU is the component encountering issues.

5. Stop and remove the DU container immediately using the following command:

  $crictl stop container_Id ; crictl rm container_Id

Note: Replace container_Id in the example above with a valid container ID.

Note: Once you stop a DU container, it goes into an exited state, which prompts a new DU container to be created. For this reason, run the two commands together on one line, as shown above, so that the exited container is removed immediately.

6.  Once the DU container has been terminated, terminate the CNF through the TCA UI.

7.  On successful termination from TCA, run the following command to remove the image:

   $crictl rmi imageId
 

Note: Replace imageId in the example above with a valid container image ID.

8. From the master node, run the following command to confirm that the DU and PTP pods have been terminated:

  $kubectl get pods -A -o wide | grep nodename

Note: Replace nodename in the example above with a valid node name (CNF name).


Note: Results should show only kube-system pods.

9. Re-instantiate the CNF from TCA. This pulls a fresh image from the registry and deploys the DU & PTP containers.

10. Once the instantiation has completed, run the following command to ensure the DU & PTP pods are in a Running state:

   $kubectl get pods -A -o wide | grep nodename


Note: Replace nodename in the example above with a valid node name (CNF).
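Steps 5, 8 and 10 of Procedure 2 can be sketched as two small helpers. This is a hedged illustration: the function names and the node name worker-1 are hypothetical. Unlike the one-line command in step 5, the sketch uses && so that the remove only runs if the stop succeeded. The pod check uses the spec.nodeName field selector instead of grep, which is an alternative to the command shown in the steps, not the article's own method.

```shell
# stop_and_remove_container ID: stop the container, then remove it right
# away so the exited container does not linger on the node.
stop_and_remove_container() {
  crictl stop "$1" && crictl rm "$1"
}

# pods_not_running: read `kubectl get pods -A --no-headers` output on stdin
# (columns: NAMESPACE NAME READY STATUS ...) and print every line whose
# STATUS is neither Running nor Completed.
pods_not_running() {
  awk '$4 != "Running" && $4 != "Completed"'
}

# Example usage (hypothetical container ID and node name):
#   stop_and_remove_container container_Id
#   kubectl get pods -A --no-headers \
#     --field-selector spec.nodeName=worker-1 | pods_not_running
```

An empty result from pods_not_running means every pod scheduled on that node is Running or Completed.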

If multiple evicted pods are still listed, proceed to Procedure 3.

Procedure 3 (Delete all pods in an Evicted state)

  1. Run the following command to delete any remaining Pods in an Evicted state:

  $kubectl get pods | grep Evicted | awk '{print $1}' | xargs kubectl delete pod
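The command above only covers the current namespace. A variant that works across all namespaces can be sketched as below; the function names are hypothetical, and the sketch assumes the default column layout of kubectl get pods -A --no-headers (NAMESPACE, NAME, READY, STATUS).

```shell
# evicted_pods: read `kubectl get pods -A --no-headers` output on stdin and
# print "NAMESPACE NAME" for every pod whose STATUS column is Evicted.
evicted_pods() {
  awk '$4 == "Evicted" { print $1, $2 }'
}

# delete_evicted_pods: delete each Evicted pod in its own namespace.
delete_evicted_pods() {
  kubectl get pods -A --no-headers | evicted_pods |
    while read -r ns name; do
      kubectl delete pod -n "$ns" "$name"
    done
}

# Example usage: review the list first, then delete.
#   kubectl get pods -A --no-headers | evicted_pods
#   delete_evicted_pods
```

Matching on the STATUS column rather than grepping the whole line avoids deleting a pod that merely has "Evicted" in its name.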

 

 

 

 

 

 

 

Version history
Revision #: 21 of 22
Last update: 01-25-2023 01:26 PM