Welcome to my technical notes space

This site is a collection of personal notes gathered from my experiments and projects, mainly focused on Kubernetes and its ecosystem.

What you'll find here isn't polished tutorials or official documentation. It's snippets of config, handy commands, real-world troubleshooting, and sometimes solutions to specific issues I've run into in production environments. Some notes might be incomplete or lack context — that's expected, since they were originally written for my own use.

I decided to publish them because, in my experience, you often run into weird edge cases that aren’t documented anywhere. If these notes can save someone a few hours debugging an obscure issue or figuring out a tricky setup, then it's worth it.

Happy reading — and good luck hunting bugs.

Kind

kind-quick-start

Kubernetes

etcd

Database space exceeded

failed to update node lease, error: etcdserver: mvcc: database space exceeded

The Etcd cluster has gone into a limited operation maintenance mode, meaning that it will only accept key reads and deletes.

Possible Solution

History compaction needs to occur:

$ export ETCDCTL_API=3
$ etcdctl alarm list
$ etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*'
143862581
$ etcdctl compact 143862581
$ etcdctl defrag
$ etcdctl alarm disarm

This operation should be done on each etcd cluster node.
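
To confirm that space was actually reclaimed after the defrag, check the database size reported by each endpoint (assuming etcdctl is configured with the same endpoints and certificates as above):

# the DB SIZE column should have dropped after compaction + defrag
ETCDCTL_API=3 etcdctl endpoint status --write-out=table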

API crashes and etcd slowness

Most of the time this issue is caused by a busy etcd database with too many objects (such as events), or by network slowness. In that case you will see a lot of etcd leader changes in Grafana. Too many Events/Jobs can be caused by a CronJob running in a loop, or by pods that keep crashing in error.

Possible Solution

Give Disk priority to etcd:

An etcd cluster is very sensitive to disk latencies. Since etcd must persist proposals to its log, disk activity from other processes may cause long fsync latencies. The upshot is etcd may miss heartbeats, causing request timeouts and temporary leader loss. An etcd server can sometimes stably run alongside these processes when given a high disk priority.

On Linux, etcd’s disk priority can be configured with ionice:

# best effort, highest priority
sudo ionice -c2 -n0 -p $(pgrep etcd)

Count the number of events in the etcd database:

ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only | grep /registry/events | cut -d'/' -f4 | sort | uniq -c | sort -nr

Identify the namespace which is generating too many events and try to purge them with kubectl:

kubectl delete events --all -n <NAMESPACE>

If the API server has crashed, it can be complicated to clean events with kubectl. In that case you can delete events directly in the etcd database.

ETCDCTL_API=3 etcdctl del /registry/events/<NAMESPACE> --prefix

Compact the etcd database on each master node to free space:

$ export ETCDCTL_API=3
$ etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*'
143862581
$ etcdctl compact 143862581
$ etcdctl defrag

Give etcd time to resync across all nodes.

Finally, check for Jobs and CronJobs which are creating too many events: suspend the CronJobs and delete their Jobs.
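
For example (a sketch; <CRONJOB_NAME> and <NAMESPACE> are placeholders):

# suspend the noisy CronJob so it stops spawning new Jobs
kubectl patch cronjob <CRONJOB_NAME> -n <NAMESPACE> -p '{"spec":{"suspend":true}}'

# delete the leftover Jobs (their pods are removed with them)
kubectl delete jobs --all -n <NAMESPACE>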

kube-tuning

🛠️ Kubernetes Node Stability and Performance: Tuning Kubelet for Better Resource Management

Author's note: This is a practical guide for Kubernetes operators and administrators looking to improve cluster resilience and performance by fine-tuning kubelet parameters. The focus is on preventing node crashes and optimizing resource usage, especially for clusters running in production environments.

🚨 Why Tuning Matters

In a Kubernetes cluster, nodes are the foundation—if a node goes down, all the workloads (pods) running on it are impacted. One common cause of node instability is poor resource management at the kubelet level. Without proper reservations and eviction policies, pods can consume all the system memory or CPU, leading to Out of Memory (OOM) errors or even system crashes.

This article covers two key areas of kubelet tuning:

  1. Resource Reservations and Evictions
  2. Graceful Node Shutdown Settings

The goal is to help you configure your nodes so they remain stable under load, avoid system-level OOMs, and terminate gracefully when needed, particularly on platforms like OpenStack.

⚙️ 1. Reserve Resources to Protect the Node

❗ Problem

By default, if you don't reserve any resources for system or kubelet processes, pods can consume 100% of a node’s memory or CPU. This can starve the system, cause critical services to crash, and render the node temporarily unusable.

✅ Solution: Use kubeReserved, systemReserved, and evictionHard

  • systemReserved: Resources set aside for system-level processes (e.g., systemd, journald).
  • kubeReserved: Resources reserved for Kubernetes components like kubelet, container runtime, etc.
  • evictionHard: Memory and storage thresholds at which kubelet starts evicting pods before the system runs out of resources completely.

🔧 Example Configuration

kubeReserved:
  cpu: 420m
  memory: 9Gi
systemReserved:
  cpu: 100m
  memory: 1Gi
evictionHard:
  memory.available: 100Mi
  nodefs.available: 10%
  imagefs.available: 15%
  nodefs.inodesFree: 5%

💡 These values can be adjusted based on your node specs (CPU cores, total memory). Below is a basic recommendation logic for automation tools like Ansible:

KubeletKubeReservedMemory: >-
  {% if ansible_memtotal_mb >= 256000 %}13Gi
  {% elif ansible_memtotal_mb >= 128000 %}9Gi
  {% elif ansible_memtotal_mb >= 64000 %}6Gi
  {% elif ansible_memtotal_mb >= 31900 %}4Gi
  {% elif ansible_memtotal_mb >= 16000 %}3Gi
  {% elif ansible_memtotal_mb >= 8000 %}2Gi
  {% elif ansible_memtotal_mb >= 4000 %}1Gi
  {% else %}255Mi
  {% endif %}
KubeletKubeReservedCpu: >-
  {% if ansible_processor_vcpus >= 64 %}740m
  {% elif ansible_processor_vcpus >= 32 %}420m
  {% elif ansible_processor_vcpus >= 16 %}260m
  {% elif ansible_processor_vcpus >= 8 %}180m
  {% elif ansible_processor_vcpus >= 4 %}140m
  {% elif ansible_processor_vcpus >= 2 %}100m
  {% elif ansible_processor_vcpus >= 1 %}60m
  {% else %}10m
  {% endif %}

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: {{ KubeletKubeReservedCpu }}
  memory: {{ KubeletKubeReservedMemory }}
systemReserved:
  cpu: 100m
  memory: 1Gi
evictionHard:
  memory.available: 100Mi
  nodefs.available: 10%
  imagefs.available: 15%
  nodefs.inodesFree: 5%

📘 Official Docs: Kubernetes Resource Reservations (Reserve Compute Resources for System Daemons)

📘 More details:

Part of this tuning could be enabled by default on images built with Image Builder. You can find the script here.
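
To check that the reservations are actually applied on a node, compare its Capacity and Allocatable values (Allocatable is roughly Capacity minus kubeReserved, systemReserved and the eviction threshold). For example:

kubectl describe node <NODE_NAME> | grep -A 6 -E 'Capacity:|Allocatable:'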

⏱️ 2. Configure Graceful Shutdown for Your Nodes

❗ Problem

During system shutdown or reboot (planned or unplanned), nodes can terminate without properly shutting down running pods. This can result in data loss, application errors, and inconsistent states.

✅ Solution: Enable and Tune shutdownGracePeriod

Kubelet uses systemd inhibitor locks to delay the node shutdown and give time for pods to terminate gracefully. This feature improves application reliability, especially for stateful or critical services.

shutdownGracePeriod: 60s
shutdownGracePeriodCriticalPods: 30s

This configuration:

  • Reserves 60 seconds to complete the node shutdown.
  • Gives 30 seconds to gracefully shut down regular pods.
  • Keeps the last 30 seconds for critical system pods.
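
To verify that the kubelet actually took the systemd inhibitor lock on a node (assuming systemd-logind is in use), list the active inhibitors:

# kubelet should show up holding a shutdown "delay" lock
systemd-inhibit --list | grep -i kubelet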

📘 Official Docs: Graceful Node Shutdown

✅ Final Thoughts

Tuning your kubelet settings is a low-effort, high-impact improvement that can drastically increase the resilience and performance of your Kubernetes nodes. Especially in production environments or cloud platforms like OpenStack, it's crucial to:

  • Reserve resources for essential system components
  • Define eviction thresholds to avoid OOM errors
  • Gracefully shut down workloads to avoid data corruption

By applying these best practices, you ensure that your nodes stay healthy and your applications remain available - even under heavy load or system shutdown events.

Debug Kubernetes

Test DNS

Test Cluster Dns using busybox pod:

kubectl exec -it busybox -n <NAMESPACE> -- nslookup kubernetes.default
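
If no busybox pod exists yet, start a throwaway one first (a minimal sketch; the image tag is just an example):

kubectl run busybox --image=busybox:1.36 --restart=Never -n <NAMESPACE> -- sleep 3600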

Cluster-Api

capi-quick-start

The goal of this page is to provide quick commands to get started with Cluster API in under 5 minutes. For more detailed information, please refer to the official Cluster API documentation at: https://cluster-api.sigs.k8s.io/

Prerequisites:

Install Kind

Install Kind following this link, or:

[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

Install Kubectl

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

Install Cluster-Api

Create a Kind cluster with the kind config kind-cluster.yaml in this repo.

kind create cluster --config kind-cluster.yaml

Install Clusterctl

clusterctl is the client used to deploy clusters with CAPI.

curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.9.5/clusterctl-linux-amd64 -o clusterctl
sudo install -o root -g root -m 0755 clusterctl /usr/local/bin/clusterctl
clusterctl version

Install CAPI and CAPO (Cluster API Provider OpenStack)

First, export this variable to enable the ClusterResourceSet feature:

export EXP_CLUSTER_RESOURCE_SET=true

Now we will install Cluster API (CAPI) and the Cluster API OpenStack controller (CAPO) in our kind cluster using clusterctl.

kubectl apply -f https://github.com/k-orc/openstack-resource-controller/releases/latest/download/install.yaml
clusterctl init --infrastructure openstack

Your kind cluster should now look like this:

ubuntu@jeff:~$ kubectl get pods -A | grep -v kube-system
NAMESPACE                           NAME                                                             READY   STATUS    RESTARTS        AGE
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-66bb86b8b8-d6jtb       1/1     Running   3 (20h ago)     5d17h
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-7bd59d5f69-bb69p   1/1     Running   2 (2d12h ago)   5d17h
capi-system                         capi-controller-manager-578674dd86-xhk7r                         1/1     Running   3 (20h ago)     5d17h
capo-system                         capo-controller-manager-79f47999df-w5p8k                         1/1     Running   3 (20h ago)     4d20h
cert-manager                        cert-manager-94d5c9976-pjw67                                     1/1     Running   2 (2d12h ago)   5d17h
cert-manager                        cert-manager-cainjector-6c49b5cdcc-bshqd                         1/1     Running   1 (2d12h ago)   5d17h
cert-manager                        cert-manager-webhook-595556d86b-zxm82                            1/1     Running   1 (2d12h ago)   5d17h
local-path-storage                  local-path-provisioner-7dc846544d-4tzbs                          1/1     Running   1 (2d12h ago)   5d18h
orc-system                          orc-controller-manager-df6c48588-mjdz5                           1/1     Running   3 (20h ago)     5d17h

Create your first Cluster CAPI

Manual step: ClusterIP

For the moment we don't have LBaaS on OpenStack for the API servers. Workaround: manually create a port on the prod network in the OpenStack console; its address will be your CLUSTER_API_IP.
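
A possible CLI equivalent of this manual step (a sketch; assumes the OpenStack CLI is configured, and <PROD_NETWORK> / <CLUSTER_NAME> are placeholders):

# create the port that will hold the API VIP
openstack port create --network <PROD_NETWORK> <CLUSTER_NAME>-api-vip

# note the fixed IP: it becomes your CLUSTER_API_IP
openstack port show <CLUSTER_NAME>-api-vip -c fixed_ips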

Prepare env vars for your cluster.

Manual step: secret cloud.yaml

Based on the cloud.yaml file, create your base64-encoded cloud.yaml secret.

export OPENSTACK_CLOUD_YAML_B64=$(cat cloud.yaml | base64 -w0)

# apply the secret in your kind cluster (run once)
envsubst < secret.yaml | kubectl apply -f -

Based on the env_cos_mutu file, create the vars file for your cluster and source it:

source env_mutu_svc
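
The exact variables depend on the template you use; as a rough, hedged example, such a vars file typically exports values like:

export CLUSTER_NAME=dev
export KUBERNETES_VERSION=v1.30.0
export CONTROL_PLANE_MACHINE_COUNT=3
export WORKER_MACHINE_COUNT=3
export OPENSTACK_CLOUD=mycloud                            # name defined in cloud.yaml
export OPENSTACK_SSH_KEY_NAME=mykey
export OPENSTACK_IMAGE_NAME=ubuntu-2204-kube-v1.30.0
export OPENSTACK_CONTROL_PLANE_MACHINE_FLAVOR=m1.medium
export OPENSTACK_NODE_MACHINE_FLAVOR=m1.large
export OPENSTACK_DNS_NAMESERVERS=8.8.8.8
export CLUSTER_API_IP=<PORT_FIXED_IP>                     # port created above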

Create the Calico ClusterResourceSet (CRS) for your future clusters:

# create crs
envsubst <  crs/crs-calico.yaml | kubectl apply -f - 

Now create your first cluster:

Create cluster command:

# create env_mutu cluster
envsubst <  cluster-template-without-lb.yaml | kubectl apply -f -

When the masters are available, connect to one via SSH and open /var/log/cloud-init-output.log. Copy/paste the configuration to set up the kubeconfig file so you can use kubectl on this master.

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

or

export KUBECONFIG=/etc/kubernetes/admin.conf

Check your cluster status using clusterctl:

clusterctl describe cluster dev

NAME                                                    READY  SEVERITY  REASON  SINCE  MESSAGE
Cluster/dev                                             True                     18h
├─ClusterInfrastructure - OpenStackCluster/dev
├─ControlPlane - KubeadmControlPlane/dev-control-plane  True                     18h
│ └─3 Machines...                                       True                     18h    See dev-control-plane-5djm7, dev-control-plane-tgs4l, ...
└─Workers
  └─MachineDeployment/dev-md-0                          True                     18h
    └─6 Machines...                                     True                     18h    See dev-md-0-9bh9b-89mq9, dev-md-0-9bh9b-95k5n, ...

Delete cluster command:

# delete env_mutu cluster
envsubst <  cluster-template-kubevip.yaml | kubectl delete -f -

Clean up CAPI in your kind cluster:

kubectl delete cluster mycluster -n namespace
clusterctl delete  --core cluster-api -b kubeadm -c kubeadm -i openstack

Upgrade components:

clusterctl upgrade plan
clusterctl upgrade apply --contract v1beta1

Notes:

Creating another cluster in kind:

clusterctl generate cluster capi-quickstart --flavor development \
  --kubernetes-version v1.32.0 \
  --control-plane-machine-count=1 \
  --worker-machine-count=1 \
  --infrastructure docker \
  > capi-quickstart.yaml

Migrate a legacy K8s kubeadm cluster to a CAPI K8s kubeadm cluster

Currently, this procedure is in an experimental stage and should be thoroughly tested before being used in a production environment. It is only compatible with an external ETCD and an external LBAAS.

It is designed to create a cluster.x-k8s.io/secret bundle to migrate a cluster created with kubeadm to a kubeadm-based cluster managed by Cluster API. At this stage, the script has been developed specifically for the Cluster API Provider OpenStack (CAPO).

My goal is to improve the process to transition from an external ETCD to a local ETCD on the control-plane nodes, and also to migrate from a local ETCD on legacy control-planes to a local ETCD on CAPI control-planes.

The key to this, based on my analysis, would be to force CAPI to add the first control-plane node using a kubeadm join instead of a kubeadm init.

In the case of an external ETCD, this works because the secrets and ETCD are already initialized. The kubeadm init command does not pose any issues, as the kubelet simply joins an already existing API endpoint.

Feel free to share any suggestions or ideas for improvements or future developments.

Migration Process Overview

The procedure is based on having a hybrid legacy/CAPI cluster during the migration.

It is carried out in five main steps:

  1. Retrieving the necessary secrets and configurations from the existing cluster.
  2. Preparing the Cluster API (CAPI) configuration.
  3. Importing the secrets into CAPI.
  4. Creating the CAPI control-plane and worker nodes on the existing cluster.
  5. Removing the old cluster nodes.

Prerequisites

  • Have a CAPI cluster. The CAPI controller should have access to the API URL of the cluster to manage (e.g. https://api.mylegacy_cluster.kubeadm).

1 - Retrieving the necessary secrets and configurations.

First, run the prepare_secrets.sh script on a control plane node, passing the name of the cluster you want to migrate as an argument. This name should match the cluster_name defined in CAPI. You can find the script Here.

The script will generate a file named ${CLUSTER_NAME}-secret-bundle.yaml.

./prepare_secrets.sh ${CLUSTER_NAME}

and get the file: ${CLUSTER_NAME}-secret-bundle.yaml

2 - Preparing the Cluster API (CAPI) configuration.

Manual step: secret cloud.yaml

Based on the cloud.yaml file, create your base64-encoded cloud.yaml secret.

export OPENSTACK_CLOUD_YAML_B64=$(cat cloud.yaml | base64 -w0)

# apply the secret in your Cluster API cluster (run once)
envsubst < secret.yaml | kubectl apply -f -

Based on the env_example file, create the vars file for your cluster and source it:

source env_example

Now, based on the 'cluster-template-migration.yaml', generate your cluster configuration, and pay close attention to the following parameters:

For the example, I hardcoded the parameters directly in the code, but it's recommended to pass these values as cluster environment variables instead.

  • External etcd endpoints in the KubeadmControlPlane section:
    ...
    clusterConfiguration:
      etcd:
        external:
          caFile: /etc/kubernetes/pki/etcd/ca.crt
          certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
          keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
          endpoints:
          - https://192.168.10.10:2379
          - https://192.168.10.11:2379
          - https://192.168.10.12:2379
    ...
  • External API endpoint in the OpenStackCluster section if you use CAPO:
    ...
    controlPlaneEndpoint:
      host: api.mydevcluster.com
      port: 443
    ...
  • Add a security group rule in the OpenStackCluster section if you use CAPO, to allow traffic between the old and new clusters:
  ...
  managedSecurityGroups:
    allNodesSecurityGroupRules:
    - direction: ingress
      etherType: IPv4
      name: Allow old security group
      description: "Allow all between old and new control plane and workers"
      remoteGroupID: "old-security-group-id"
  ...

3 - Importing the secrets into CAPI.

Apply your ${CLUSTER_NAME}-secret-bundle.yaml into your CAPI controller cluster:

kubectl apply -f ${CLUSTER_NAME}-secret-bundle.yaml

CAPI will now detect that a CA and secrets already exist and will not generate new ones.
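
You can double-check that the imported secrets match the names CAPI expects (typically <CLUSTER_NAME>-ca, -etcd, -sa and -kubeconfig):

kubectl get secrets -n <NAMESPACE> | grep "${CLUSTER_NAME}-"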

4 - Create the CAPI control-plane and CAPI worker nodes

# create capi ${CLUSTER_NAME} cluster
envsubst < cluster-template-migration.yaml | kubectl apply -f -

Since the etcd database is shared between the old and new cluster, and the PKI secrets (such as TLS certificates and private keys) are identical, creating this new Kubernetes cluster will actually result in the addition of new control plane and CAPI worker nodes to the existing cluster, rather than forming a separate cluster.

Get your cluster state:

You will now see both your old control plane and worker nodes, as well as the new ones, in your cluster using kubectl get nodes.
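
For example, from a control plane node (or with the cluster's kubeconfig):

kubectl get nodes -o wide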

In your CAPI cluster, running the clusterctl command will show only the new nodes that are managed by CAPI:

# Describe ${CLUSTER_NAME} cluster
clusterctl describe cluster ${CLUSTER_NAME}

NAME                                                    READY  SEVERITY  REASON  SINCE  MESSAGE
Cluster/dev                                             True                     22h
├─ClusterInfrastructure - OpenStackCluster/dev
├─ControlPlane - KubeadmControlPlane/dev-control-plane  True                     22h
│ └─3 Machines...                                       True                     22h    See dev-control-plane-2xjv4, dev-control-plane-lrt8m, ...
└─Workers
  ├─MachineDeployment/dev-az1                           True                     22h
  │ └─Machine/dev-az1-z6zr4-9dldr                       True                     22h
  ├─MachineDeployment/dev-az2                           True                     22h
  │ └─Machine/dev-az2-nx55k-s265c                       True                     22h
  └─MachineDeployment/dev-az3                           True                     22h
    └─Machine/dev-az3-95fng-hsqfv                       True                     22h


5 - Remove the old cluster nodes

Make sure to update your load balancer (HAProxy or MetalLB) to include the new control plane nodes and remove the old ones.
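
As a hedged illustration only (assuming an HAProxy TCP backend in front of the API servers; names and IPs are placeholders), the updated backend could look like:

backend k8s-apiserver
    mode tcp
    balance roundrobin
    option tcp-check
    # new CAPI control plane nodes
    server dev-control-plane-1 192.168.10.21:6443 check
    server dev-control-plane-2 192.168.10.22:6443 check
    server dev-control-plane-3 192.168.10.23:6443 check
    # old control plane entries removed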

Delete old nodes using:

# Delete old nodes
kubectl delete node <OLD_NODE_NAME>
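
Optionally, before deleting, drain each old node so its workloads reschedule cleanly onto the new CAPI nodes (sketch; same placeholder):

kubectl drain <OLD_NODE_NAME> --ignore-daemonsets --delete-emptydir-data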

Don’t forget to do the same for the old control plane nodes: stop containerd and kubelet to ensure that static pods are no longer running. And that’s it! You can now enjoy managing your cluster with CAPI 😊.

Synology

Enable L2TP/IPsec debug logs:

vim /var/packages/VPNCenter/etc/l2tp/ipsec.conf

Add or uncomment these parameters:

config setup
...
    plutodebug=all
    plutostderrlog=/var/log/pluto.log

Allow UDP 500, 1701 and 4500 on the NAS; don't NAT/forward port 1701 on the router.
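
While testing a connection, the debug log enabled above can be followed with:

tail -f /var/log/pluto.log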