CONTROL_PLANE_ENDPOINT_IP gets lost; VM initialization hangs (cluster-init). #389

Open · foobarzap opened this issue Jan 29, 2025 · 3 comments
Labels: kind/bug, kind/support

Comments

@foobarzap

What steps did you take and what happened:

First of all, thanks to all contributors for their work and for providing this
CAPI provider!

I was following the quickstart guide to set up a Kubernetes v1.30.8 cluster on
one of my Proxmox instances (running Proxmox 8.2.2).

The Ubuntu 24.04 VM template was created with Image Builder; by setting
PACKER_FLAGS I selected Kubernetes 1.30.8. Apart from having to change
builder.disks.format from qcow2 to raw, this went smoothly; the template was
created and received ID 114 on my Proxmox instance (Intel).
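
For reference, the build invocation roughly looked like this (the kubernetes_* packer variables and the make target name are taken from the Image Builder docs as I remember them, so treat them as illustrative rather than exact):

export PACKER_FLAGS="-var 'kubernetes_semver=v1.30.8' \
  -var 'kubernetes_series=v1.30' \
  -var 'kubernetes_deb_version=1.30.8-1.1'"
make build-proxmox-ubuntu-2404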

I fired up a local kind cluster on my macOS machine (ARM) for management and
initialized Cluster API like so:

clusterctl init --core cluster-api \
  --config ${CTL_CONFIG} \
  --bootstrap kubeadm \
  --control-plane kubeadm \
  --infrastructure proxmox \
  --ipam in-cluster \
  -v5

where CTL_CONFIG points to my clusterctl.yaml (contents: see below).
clusterctl picked Cluster API and the kubeadm providers at v1.9.4, the Proxmox
provider at v0.6.2, and the in-cluster IPAM provider at v1.0.0.

The next step was to generate the workload cluster manifest:

clusterctl generate cluster kubemox \
    --config ${CTL_CONFIG} \
    --infrastructure proxmox \
    --kubernetes-version v1.30.8 \
    --control-plane-machine-count 3 \
    --worker-machine-count 3 \
    -v5 > kubemox.yaml

The first VM to be created is always a control plane node and, as expected, it receives two
IPs: the CONTROL_PLANE_ENDPOINT_IP and the first IP from the pool defined in
NODE_IP_RANGES.
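
For what it's worth: as far as I understand, the CONTROL_PLANE_ENDPOINT_IP is the kube-vip VIP, so the node that currently holds it should show it as an additional address on its NIC. A quick way to check this from inside the VM (no interface name given, since that varies):

ip -4 addr show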

PROBLEM 1: Watching the summary of the VM in Proxmox, pretty soon after the
creation of further nodes begins, the CONTROL_PLANE_ENDPOINT_IP disappears
and I am no longer able to ping it. Occasionally and seemingly at random it comes
back, only to disappear again shortly afterwards ...

As far as I understand kube-vip, the CONTROL_PLANE_ENDPOINT_IP may move
between control planes from time to time. However, if no other control plane is
ready, this should not happen, should it? At least I see no other control plane holding it ...
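
In case it helps: in the default capmox templates kube-vip runs as a static pod on the control plane (as far as I can tell), so its logs can be checked directly on the node with crictl; the container ID is of course node-specific:

sudo crictl ps -a | grep kube-vip
sudo crictl logs <kube-vip-container-id>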

PROBLEM 2: Although I asked for 3 control planes and 3 worker nodes, fewer
VMs get created (due to the size of my management cluster?); the number varies
from try to try. Except for the first created VM, each one is labeled with
go-proxmox+cloud-init, but this label never disappears the way it did on the first VM.

I was able to SSH into the nodes, and journalctl -u kubelet revealed that all
kubelets (except on the first created VM) crashed due to the missing file
/var/lib/kubelet/config.yaml.
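
As far as I understand, that file is only written once kubeadm init/join actually runs, so a quick way to inspect a stuck node from inside the VM is:

cloud-init status --long
ls -l /var/lib/kubelet/config.yaml
sudo journalctl -u kubelet --no-pager | tail -n 20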

There was also a warning that the flag --pod-infra-container-image has been
deprecated ...

I also tried --flavor calico after creating the ConfigMap (as described in the quickstart guide), but it had no effect.
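
For reference, that attempt was generated roughly like this (same parameters as above, only the flavor added):

clusterctl generate cluster kubemox \
    --config ${CTL_CONFIG} \
    --infrastructure proxmox \
    --flavor calico \
    --kubernetes-version v1.30.8 \
    --control-plane-machine-count 3 \
    --worker-machine-count 3 \
    -v5 > kubemox.yaml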

Please let me know if I should provide further information and if so, how to get
my hands on it (I am a kubernetes newbie).

What did you expect to happen:

A running Kubernetes cluster on my Proxmox instance ;-)

Anything else you would like to add:

Side note 1: Trying to SSH into the machines works, but the first attempt is very slow.

Side note 2: When running kubectl apply -f kubemox.yaml I occasionally encountered the
following error:

cluster.cluster.x-k8s.io/kubemox created
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/kubemox-control-plane created
proxmoxmachinetemplate.infrastructure.cluster.x-k8s.io/kubemox-control-plane created
machinedeployment.cluster.x-k8s.io/kubemox-workers created
proxmoxmachinetemplate.infrastructure.cluster.x-k8s.io/kubemox-worker created
kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/kubemox-worker created
clusterresourceset.addons.cluster.x-k8s.io/kubemox-crs-0 created
Error from server (InternalError): error when creating "kubemox.yaml": Internal error occurred: failed calling webhook "validation.proxmoxcluster.infrastructure.cluster.x-k8s.io": failed to call webhook: Post "https://capmox-webhook-service.capmox-system.svc:443/validate-infrastructure-cluster-x-k8s-io-v1alpha1-proxmoxcluster?timeout=10s": dial tcp 10.96.178.119:443: connect: connection refused

However, even when this error occurs, Proxmox starts to instantiate machines
from my template after several retries; I have no idea whether this is related to my problem.
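
When the error occurred I simply re-applied the manifest once the webhook was reachable; kubectl apply is declarative, so re-running it is safe. A way to check whether the capmox webhook is up, using the names from the error message above:

kubectl get pods -n capmox-system
kubectl get endpoints -n capmox-system capmox-webhook-service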

Environment:

  • Cluster-api-provider-proxmox version: 0.6.2

  • Kubernetes version (kubectl version):

    • Client Version: v1.30.8
    • Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    • Server Version: v1.30.8

  • OS (e.g. from /etc/os-release): Ubuntu 24.04 (VM image), Proxmox VE 8.2.2

My clusterctl.yaml:

## -- Controller settings -- ##
PROXMOX_URL: "https://192.168.100.244:8006"
PROXMOX_TOKEN: "capmox@pve!capi"
PROXMOX_SECRET: " ... secret ..."

## -- Required workload cluster default settings -- ##
PROXMOX_SOURCENODE: "pve-mp"
TEMPLATE_VMID: "114"
ALLOWED_NODES: "[pve-mp]"
VM_SSH_KEYS: " .... keys ...."
## -- networking configuration-- ##
CONTROL_PLANE_ENDPOINT_IP: "192.168.100.90"
NODE_IP_RANGES: "[192.168.100.91-192.168.100.110]"
GATEWAY: "192.168.100.3"
IP_PREFIX: "24"
DNS_SERVERS: "[192.168.100.8,8.8.8.8]"
BRIDGE: "vmbr0"

## -- xl nodes -- ##
BOOT_VOLUME_DEVICE: "scsi0"
BOOT_VOLUME_SIZE: "100"
NUM_SOCKETS: "1"
NUM_CORES: "2"
MEMORY_MIB: "8192"

EXP_CLUSTER_RESOURCE_SET: "true" 
CLUSTER_TOPOLOGY: "true"
foobarzap added the kind/bug label on Jan 29, 2025
@foobarzap
Author

Meanwhile I have dug a bit deeper: it seems that something goes wrong during the creation of the VMs.

From the Proxmox logs:

Jan 29 15:59:20 pve-mp pvedaemon[1095]: <capmox@pve!capi> end task UPID:pve-mp:00000ACD:00006AE8:679A423B:qmclone:112:capmox@pve!capi: clone failed: can't lock file '/var/lock/pve-manager/pve-storage-local-lvm' - got timeout

VMs 115, 116 and 118 were created; 116 and 118 hung with the go-proxmox+cloud-init label attached.

capmox-controller-manager repeatedly complains that it cannot find VM 117:

E0129 18:26:36.117277 1 find.go:53] "unable to find vm" err="cannot find vm with id 117: 500 Configuration file 'nodes/pve-mp/qemu-server/117.conf' does not exist" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="default/kubemox-workers-zz9t9-v47sc" namespace="default" name="kubemox-workers-zz9t9-v47sc" reconcileID="9c44a201-9739-4e92-b534-2fa659c58af8" machine="default/kubemox-workers-zz9t9-v47sc" cluster="default/kubemox"

Maybe this is a pointer in the right direction? Any ideas?
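
For the record, this is how it looks from the Proxmox node itself (qm and pvesh are the standard PVE CLIs; the exact task-listing parameters may differ, so treat these as illustrative):

qm list                                   # VM 117 does not show up here
pvesh get /nodes/pve-mp/tasks --limit 20  # recent tasks, including the failed qmclone
journalctl -u pvedaemon | grep -i "clone failed"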

@mcbenjemaa
Member

@foobarzap

First of all, please try to create the cluster with Ubuntu 22.04.
Next, you need to make sure that you're using Ceph storage.

Please use kubectl describe on the objects and check the controller logs to see if you find anything unusual.
Please share your manifests.
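
For example, something along these lines (resource kinds and the controller deployment name are taken from the log excerpt above; adjust the namespace if yours differs):

kubectl describe proxmoxmachine -A
kubectl describe machine -A
kubectl logs -n capmox-system deploy/capmox-controller-manager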

@foobarzap
Author

Thank you for your response. I did not know that Ceph storage is mandatory; where did you get this information from? Indeed, a distributed file system may speed up the cloning process, but unfortunately I only have a single node.

In the Proxmox support forums I found a post that addresses the same problem. In a response to that post, a Proxmox staff member pointed out that sending clone requests in parallel, as the Proxmox provider apparently does, can lead to problems.

Maybe this is not a problem on faster machines, and you may prefer to close this issue. However, since the recommendation from the Proxmox team stands, it would be helpful if you considered an option to handle this case, e.g. by cloning the machines one after another, as recommended.
