Listener registration failing with RunnerScaleSetNotFoundException #3935
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
@nikola-jokic following on from the problems we saw a few weeks ago, could this be an issue on the GitHub API side?
Hey @tomhaynes, it seems like the scale set has been removed at
Hi, thanks for the response @nikola-jokic. I'm perhaps being stupid, but where do you see that timestamp? This is a development cluster, and it's shut down at midnight (2025-02-19T00:00:00Z). I did wonder if perhaps it's a race condition between the controller shutting down and it not correctly cleaning up the various CRDs that it controls? We are also seeing this error now:
Uninstalling a specific gha-runner-scale-set chart, cleaning up all associated CRDs, and reinstalling does appear to resolve the problem for that runner. Are there any logs to look out for in the controller that might indicate a non-graceful shutdown?
We looked into traces on the back-end side to understand what is going on. It is likely a race condition: if the controller shuts down without having enough time to clean up the environment, it can cause issues like this. As for the logs, this is also tricky. Basically, you would have to inspect the log and see that some steps that should have been taken are missing. Having said that, it would be a good idea to log as soon as the shutdown signal is received, so you can spot these issues by checking the logs below the termination mark. This solution cannot be perfect, especially when the controller is stopped without any graceful termination period, but it would help to diagnose issues with the cleanup process.
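As a rough illustration of that suggestion, here is a minimal Go sketch of logging the moment a termination signal arrives, so that any cleanup activity can be checked for below that line in the output. This is not the ARC controller's actual code; `runCleanup` and the timeout value are placeholders.

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

// runCleanup is a placeholder for whatever teardown the process owns
// (e.g. removing the listener and deregistering the scale set).
func runCleanup(ctx context.Context) error {
	return nil
}

func main() {
	// Cancel the context as soon as SIGTERM/SIGINT arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	// ... start controllers/listeners against ctx here ...

	<-ctx.Done()
	// The "termination mark": everything below this line in the log
	// should be cleanup activity.
	log.Println("shutdown signal received, starting cleanup")

	// Give the cleanup a bounded amount of time before the pod is killed.
	cleanupCtx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := runCleanup(cleanupCtx); err != nil {
		log.Printf("cleanup did not complete: %v", err)
	}
}
```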
We've possibly got slightly closer with one of the errors. A runner set is throwing this error:
And I can see that the runner scale set is missing when I look at the repository in the GitHub UI. What would have removed that runner scale set on the GitHub side? Could we raise a feature request to have the AutoscalingRunnerSet recreate it when this happens?
Could it be related to this? actions/runner#756
We've worked out a semi-unpleasant way to force re-registration:
...which at least avoids the finalizer hell of Helm uninstalls. It'd be great to understand what causes the runner scale sets to disappear on the GitHub repo side. Also, is there any way to request an API to list the runner scale sets on a repo? I saw this was raised in #2990 and the raiser was directed to the community page; I tried and failed to see whether it has been requested there.
So the deletion probably occurred inside the autoscaling runner set controller. The shell script you just wrote forces the controller to treat this as a new installation and re-create the resources properly, removing old resources and starting from scratch. As for the API documentation, we did talk about documenting the scale set APIs, but not just yet. There are some improvements we want to make, and some of them would be considered breaking changes.
The solution proposed by @tomhaynes also worked for me, as I was facing the exact same issue. Thanks a lot!
Hi @nikola-jokic, we've hit this issue quite a lot this morning. I've attached logs from an example listener, and from the controller covering the shutdown on Friday evening and the startup this morning. I can't see anything particularly relevant in them, mind. The controller does not seem to log anything at all at the point that it is shut down. Could you please answer the following:
Hey @tomhaynes, just to confirm I understood correctly: you removed the installation on Friday along with the controller, and you re-installed it this morning? To answer your questions, here is the area that is responsible for the cleanup. Basically, we uninstall the listener. Then we uninstall the ephemeral runner set, which can take some time depending on the number of runners that are online. Once we are finished, we delete the scale set. At that point, the cleanup is successful. To answer the second question, I would first need to understand how the environment is shut down. I think the root of the issue is probably there, so please help me understand the shutdown steps. You are right: the lifetime of the listener must be shorter than the lifetime of the controller. Please help me understand the shutdown, so I can try to reproduce it properly.
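To make that ordering concrete, here is a hedged Go sketch. The interface and method names are hypothetical stand-ins that only illustrate the sequence described above (listener first, then the ephemeral runner set, then the scale set registration); they are not the controller's real API.

```go
package cleanup

import (
	"context"
	"fmt"
)

// scaleSetCleanup is a hypothetical interface standing in for the pieces the
// controller tears down; the method names are illustrative only.
type scaleSetCleanup interface {
	DeleteListener(ctx context.Context) error
	// Deleting the ephemeral runner set can take a while, since it waits for
	// the runners that are currently online to be removed.
	DeleteEphemeralRunnerSet(ctx context.Context) error
	// Removes the scale set registration on the GitHub side.
	DeleteScaleSetRegistration(ctx context.Context) error
}

// Teardown runs the steps in the order described above.
func Teardown(ctx context.Context, c scaleSetCleanup) error {
	if err := c.DeleteListener(ctx); err != nil {
		return fmt.Errorf("delete listener: %w", err)
	}
	if err := c.DeleteEphemeralRunnerSet(ctx); err != nil {
		return fmt.Errorf("delete ephemeral runner set: %w", err)
	}
	if err := c.DeleteScaleSetRegistration(ctx); err != nil {
		return fmt.Errorf("delete scale set registration: %w", err)
	}
	return nil
}
```

The point of the ordering is that the GitHub-side scale set is only removed once nothing in the cluster references it; if the process is killed partway through, the two sides can end up out of sync.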
Hi @nikola-jokic, thanks for the quick reply. We don't remove the installation or any CRDs; we just gracefully terminate the Kubernetes worker nodes, i.e. shut down the pods via SIGTERM. The entire cluster is shut down at the same time, so there are no guarantees around which GHA pods shut down first. The next day or week, the environment powers back up with the CRDs unchanged. We've not changed this design for a long time; we used to use the Summerwind controllers but migrated over to ARC in around March last year.
Hmm, the controller's Helm chart doesn't support lifecycle hooks. I could add this in, but if this were the issue, surely others would be seeing the same...
Checks
Controller Version
0.9.3, 0.10.1
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
Having previously been healthy, our listeners are failing to register with the GitHub API, throwing the following error:
This causes them to repeatedly retry until we have exhausted our API limits, causing all runners to cease to work.
Describe the expected behavior
Successful registration
Additional Context
Controller Logs
Runner Pod Logs