AzureMachinePoolMachine marked as terminally failed when ProvisioningState is set to Failed #5303
Labels: kind/bug, priority/important-soon
/kind bug
What steps did you take and what happened:
When a VMSS VM has a provisioning state of Failed, the AzureMachinePoolMachine controller marks the AMPM as failed:

cluster-api-provider-azure/exp/controllers/azuremachinepoolmachine_controller.go, lines 298 to 299 in aad2249
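For readers without the permalink handy, here is a minimal sketch of the pattern being described. This is not the CAPZ source; the type and function names (ampmStatus, markTerminallyFailed) are hypothetical stand-ins:

```go
// Illustrative only: the real code lives at the permalink above and may differ.
package sketch

const provisioningStateFailed = "Failed"

// ampmStatus is a stand-in for the relevant AzureMachinePoolMachine status fields.
type ampmStatus struct {
	FailureReason  *string
	FailureMessage *string
}

// markTerminallyFailed mirrors the described behavior: a Failed VMSS VM
// provisioning state is translated directly into the terminal failure fields.
func markTerminallyFailed(status *ampmStatus, provisioningState string) {
	if provisioningState == provisioningStateFailed {
		reason := "UpdateError"
		message := "VMSS VM is in a Failed provisioning state"
		status.FailureReason = &reason
		status.FailureMessage = &message
	}
}
```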
The controller then does not continue to reconcile the AMPM any further:

cluster-api-provider-azure/exp/controllers/azuremachinepoolmachine_controller.go, lines 265 to 268 in aad2249
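Continuing the same illustrative sketch (hypothetical names, not the actual controller code), the skip amounts to an early return once those fields are populated:

```go
// shouldSkipReconcile mirrors the described behavior: once the terminal failure
// fields are set, the controller never re-evaluates the AMPM, even if the
// underlying VMSS VM later recovers on its own.
func shouldSkipReconcile(status ampmStatus) bool {
	return status.FailureReason != nil || status.FailureMessage != nil
}
```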
According to the Cluster API provider contract, CAPI will also stop reconciling this Machine (once the failure bubbles up to the Machine):
(emphasis mine)
However, VMSS VMs may sometimes go into ProvisioningState Failed and recover themselves again. We just had several of those. In the VMSS Activity Log they were marked as Health issues (PlatformInitiated Downtime).
What did you expect to happen:
I'd expect an AMPM to go into a terminally failed state only when it genuinely cannot recover.
Anything else you would like to add:
I'm happy to provide a PR for this, but I'd need some information first, for example whether there is a way to reliably determine that a VMSS VM is Failed and cannot be recovered.
As a quick fix (and to test this), I'll probably resort to not setting FailureReason/Message in this case; a rough sketch follows below.
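For concreteness, a minimal sketch of what that quick fix could look like, assuming a controller-runtime reconciler; handleFailedProvisioningState is a hypothetical helper, not an existing CAPZ function:

```go
package quickfix

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

const provisioningStateFailed = "Failed"

// handleFailedProvisioningState sketches the proposed quick fix: instead of
// recording FailureReason/FailureMessage (which is terminal under the CAPI
// contract), treat a Failed VMSS VM as potentially transient and requeue so
// the AMPM keeps being reconciled and can recover.
func handleFailedProvisioningState(provisioningState string) (ctrl.Result, error) {
	if provisioningState == provisioningStateFailed {
		// Deliberately do not set the terminal failure fields here.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}
```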
Environment:
- Kubernetes version (use kubectl version): 1.31.1
- OS (e.g. from /etc/os-release): linux/windows