Disk pressure happens when the disk on a cluster node is nearly full, which prevents your applications or Kubernetes processes from working well. Generally, to resolve this, you add more storage and migrate some workloads or components there, or you migrate workloads to a different server. Sometimes, however, this is not feasible and the way forward is to clear some space on the current cluster or node.
In this article, the setup we will consider uses MicroK8S as the Kubernetes distribution and [Rancher] Longhorn as the storage class. Longhorn comes with a beautiful dashboard that helps you check your disk usage, volumes and their backups.
I run a cluster with a few sites, each allocated between 10GB and 50GB of storage. Some sites are simple containers with no persistent storage requirements. Others require persistent storage, but that storage does not change much: updates to the site files are rare, and everything we change daily is saved directly to the databases. Normally, you shouldn’t even notice a change in disk space. All site data, except databases, takes about 130GB. However, a 512GB NVMe disk fills up after some time and applications can no longer run. Databases are hosted externally on dedicated database servers and are therefore not affected.
What Causes Increasing Disk Usage
This raises the question: what causes the increasing disk usage for seemingly static sites? After investigation, I noted two sources of increasing disk usage.
- Longhorn snapshots
With Longhorn, snapshots are saved inside the same volumes as live data. Keeping too many snapshots has two effects. It eats into the container’s volume space, which can cause containers not to work well, and it increases overall disk usage, which causes disk pressure.
- Unused images and containers
The other, quite subtle, cause is old images that are saved on the disk and never deleted. MicroK8S, which uses containerd, stores this data in the folder /var/snap/microk8s/common/var/lib/containerd.
Inside this folder, there are various subfolders which store the actual data. Most users use the overlayfs snapshotter, io.containerd.snapshotter.v1.overlayfs, since it's the default. You may find this folder consuming more space than your actual working data. This was the case for me.
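To see which subfolder is responsible, you can list the size of each entry in the containerd folder. The snippet below is a minimal sketch: the helper name containerd_usage is made up for illustration, and it assumes GNU du and sort are available. Reading the real folder usually requires root.

```shell
# Hypothetical helper: print the size of each entry in a containerd data
# folder, largest first. Defaults to the MicroK8S path mentioned above.
containerd_usage() {
  dir="${1:-/var/snap/microk8s/common/var/lib/containerd}"
  # du -s: one total per entry, -h: human-readable sizes
  # sort -rh: order human-readable sizes from largest to smallest
  du -sh "$dir"/* 2>/dev/null | sort -rh
}

# Only run automatically where the MicroK8S folder actually exists.
if [ -d /var/snap/microk8s/common/var/lib/containerd ]; then
  containerd_usage
fi
```

If io.containerd.snapshotter.v1.overlayfs is at the top of the list, unused images are the likely culprit.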
Fixing Disk Pressure
I’ll go straight to the point here, describing how to fix the two scenarios above.
- Longhorn snapshots
To prevent snapshots from taking too much of your Longhorn space, configure recurring snapshots with a retention count. Longhorn then deletes the oldest snapshots as new ones are created. For example, if I retain only 2 snapshots, each new snapshot causes the oldest one to be deleted, so there are never more than 2. See also the recommended disk maintenance practices when using Longhorn.
You can also manually delete snapshots from the Longhorn dashboard, though this is tedious if you have a lot of volumes.
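Recurring snapshots can also be defined declaratively instead of through the dashboard. The following is a sketch, not the exact setup used in this article. It assumes a Longhorn release that ships the RecurringJob custom resource and an installation in the longhorn-system namespace. The job name, schedule and file name are illustrative only.

```shell
# Write an example Longhorn RecurringJob manifest to the current directory.
# Assumptions: RecurringJob CRD available, Longhorn in longhorn-system.
cat > snapshot-retain-2.yaml <<'EOF'
apiVersion: longhorn.io/v1beta2   # older Longhorn releases use v1beta1
kind: RecurringJob
metadata:
  name: snapshot-retain-2         # hypothetical name
  namespace: longhorn-system
spec:
  task: snapshot                  # take snapshots, not backups
  cron: "0 2 * * *"               # example schedule: every day at 02:00
  retain: 2                       # keep only the 2 newest snapshots
  concurrency: 1                  # number of volumes processed at once
  groups:
    - default                     # volumes in Longhorn's default group
EOF

# Review the file, then apply it with:
# microk8s kubectl apply -f snapshot-retain-2.yaml
```

The retain field is what caps the number of snapshots per volume. Adjust it and the cron schedule to your own recovery needs.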
- Unused images and containers
The solution is to let the kubelet delete unused images. This process is called garbage collection.
To do this, the kubelet accepts two arguments that control when unused images are deleted automatically:
--image-gc-high-threshold: The percent of disk usage after which image garbage collection is always run. Values must be within the range [0, 100]. To disable image garbage collection, set to 100.
--image-gc-low-threshold: The percent of disk usage before which image garbage collection is never run. This is the lowest disk usage to garbage collect to. Values must be within the range [0, 100] and should not be larger than that of --image-gc-high-threshold.
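To get a feel for what these percentages mean, it helps to convert them into absolute sizes. The sketch below does so for a 512GB disk, assuming the whole disk is the filesystem the kubelet monitors. The helper name threshold_gb is made up for illustration. The values 85 and 80 are the kubelet defaults for the high and low thresholds, and 70 is just an example of a lower setting.

```shell
# Hypothetical helper: convert a usage percentage into GB for a given disk.
# Uses integer arithmetic, so results are rounded down.
threshold_gb() {
  disk_gb="$1"
  percent="$2"
  echo $(( disk_gb * percent / 100 ))
}

threshold_gb 512 85   # prints 435: with the default, cleanup starts here
threshold_gb 512 80   # prints 409: default level that cleanup frees down to
threshold_gb 512 70   # prints 358: an example of an earlier trigger
```

With the defaults, image cleanup only begins once roughly 435GB of the disk is in use, which leaves little headroom when snapshots share the same disk.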
You need to edit the file /var/snap/microk8s/current/args/kubelet and add the two arguments, each on its own line.
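As a rough sketch, the two lines can be appended with a small script. The thresholds 70 and 50 are example values, not necessarily the ones used on my cluster. The KUBELET_ARGS variable and the add_kubelet_arg helper are made up for illustration. Editing the real file normally requires root.

```shell
# Path of the kubelet arguments file; override KUBELET_ARGS to test elsewhere.
KUBELET_ARGS="${KUBELET_ARGS:-/var/snap/microk8s/current/args/kubelet}"

# Hypothetical helper: append "flag=value" unless the file already sets the flag.
add_kubelet_arg() {
  grep -q -- "^$1=" "$KUBELET_ARGS" 2>/dev/null \
    || printf '%s=%s\n' "$1" "$2" >> "$KUBELET_ARGS"
}

# Only touch the file if it exists and is writable.
if [ -w "$KUBELET_ARGS" ]; then
  add_kubelet_arg --image-gc-high-threshold 70   # example value
  add_kubelet_arg --image-gc-low-threshold 50    # example value
fi
```

Running the script twice does not duplicate the lines, because each flag is only added when it is missing.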
Then restart MicroK8S:
$ microk8s.stop && microk8s.start
Once the restart completes and all the needed Pods are back in the Running state, you should see significant disk space cleared and made available for scheduling. In my cluster, the schedulable disk space increased from 8.2GB to 168GB. Yes, a whopping 20 times!