[BUG]: VMSS agents go offline unexpectedly due to 401 auth error #5023
Comments
Hi @angaaruriakhil Do you see any attempts in the logs before the VM is terminated?
Hi @DenisNikulin5, no, I don't see that in the logs. It basically loops through this block of code and goes back to the 401. Let me know if more details are required; I scrubbed a lot to protect sensitive details.
We're seeing the same issue with a new VMSS Agent Pool. Existing pools seem unaffected.
Hey all, is there any temporary workaround for this issue?
Also seeing this issue. We rebuilt on the latest ubuntu2204 images on 5/12 and the problem still exists.
Hi. Like @angaaruriakhil mentioned in the OP, there's an issue with
I executed the |
We see the same issue at random on our containers running the same image with agent 4.248.0. The logs match those others have posted.
Also faced this issue; for the moment we work around it with a cloud-init configuration:
#cloud-config
runcmd:
- mkdir /agent
- chmod -R a+rwx /agent
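The workaround above opens up permissions on /agent. A minimal Python sketch for checking whether the account running the agent can actually write there, before and after applying it, might look like the following. The function name `is_dir_writable` is hypothetical, and the paths are taken from this thread; this is only a diagnostic aid, not part of the agent.

```python
import os
import tempfile


def is_dir_writable(path: str) -> bool:
    """Return True if the current user can create a file inside `path`."""
    if not os.path.isdir(path):
        return False
    try:
        # Creating (and auto-deleting) a temp file tests the real operation,
        # which is more direct than inspecting mode bits.
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # /agent and /agent/_diag are the paths mentioned in this issue.
    for candidate in ("/agent", "/agent/_diag"):
        state = "writable" if is_dir_writable(candidate) else "NOT writable"
        print(candidate, state)
```

Run it as the same user that runs the agent; a result for root says nothing about the agent account.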
What happened?
Context: We host VMSS in Azure as Azure DevOps agents, using images built from the runner-images code (the source code for the Microsoft-hosted Azure DevOps agents). We use the operating systems Windows 2022, Windows 2019, Ubuntu 22.04 and Ubuntu 20.04, and they are all affected by this issue. The issue happens intermittently across scale sets hosted in multiple regions, on all of the operating systems mentioned.
For at least the last month, VMs have intermittently failed with the following error, shown in the Diagnostics tab of each pool in Azure DevOps:
Pipeline agent went offline unexpectedly
The VM then goes offline and skips any jobs it was running. This causes big problems for us: if we are unlucky, a VM goes offline while it is running a business-critical job.
We have checked our proxy/all firewalls for networking blocks and there are no blocks reported from our VMSS subnets to any destination or port. All the outbound traffic being executed is allowed.
We saved an unhealthy Ubuntu 22.04 agent for investigation. The logs under /agent/_diag show a 401 error (scrubbed excerpts are attached in the log box; I don't want to share sensitive information). See the log box for the relevant logs.
We have similarly looked at the log files under:
All of which report nothing out of the ordinary.
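For anyone repeating this investigation, here is a rough Python sketch for scanning the agent logs for authentication failures. It assumes log lines follow the `[timestamp LEVEL Source] message` shape seen in the log excerpt in this issue, and that a 401 shows up as the token "401" or the word "Unauthorized" in the message; both are assumptions, and the names `parse_line`, `find_auth_errors` and `scan_diag` are made up for illustration.

```python
import re
from pathlib import Path
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple

# Matches lines shaped like: [2024-10-25 11:31:36Z INFO MessageListener] text
LINE_RE = re.compile(
    r"^\[(?P<ts>\S+ \S+) (?P<level>[A-Z]+)\s+(?P<source>\S+)\] (?P<msg>.*)$"
)
# Assumption: auth failures mention "401" or "Unauthorized" as whole words.
# The word boundary keeps "UnauthorizedAccessException" (a file permission
# error) out of the results.
AUTH_RE = re.compile(r"\b401\b|\bunauthorized\b", re.IGNORECASE)


class Entry(NamedTuple):
    timestamp: str
    level: str
    source: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one log line, or return None if it does not match the shape."""
    m = LINE_RE.match(line.rstrip("\n"))
    return Entry(m["ts"], m["level"], m["source"], m["msg"]) if m else None


def find_auth_errors(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield parsed entries whose message looks like an auth failure."""
    for line in lines:
        entry = parse_line(line)
        if entry and AUTH_RE.search(entry.message):
            yield entry


def scan_diag(diag_dir: str) -> Iterator[Tuple[Path, Entry]]:
    """Scan every *.log file in diag_dir (e.g. /agent/_diag)."""
    for log in sorted(Path(diag_dir).glob("*.log")):
        with log.open(errors="replace") as fh:
            for entry in find_auth_errors(fh):
                yield log, entry
```

Calling `scan_diag("/agent/_diag")` and printing each timestamp makes it easier to line the 401s up against the time the agent went offline.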
A similar issue reports this problem: #4826.
As reported in #4826, running the ./run.sh --diagnostics command also reports an error writing to the log.
Versions
Azure DevOps Services
Images built with runner-images on October 15th 2024
Azure Pipelines Agent v3.246.0
WA Linux Agent v2.11.1.12
Environment type (Please select at least one environment where you face this issue)
Azure DevOps Server type
dev.azure.com (formerly visualstudio.com)
Azure DevOps Server Version (if applicable)
No response
Operating system
Ubuntu 22.04, Ubuntu 20.04, Windows 2022, Windows 2019
Version control system
No response
Relevant log output
[2024-10-25 11:31:36Z INFO MessageListener] No message retrieved from session '{scrubbed}' within last 30 minutes.
Results from ./run.sh --diagnostics:
System.UnauthorizedAccessException: Access to the path '/agent/_diag/Agent_20241025-190446-utc.log' is denied.
---> System.IO.IOException: Permission denied
--- End of inner exception stack trace ---
at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at Microsoft.VisualStudio.Services.Agent.HostTraceListener.CreatePageLogWriter() in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 178
at Microsoft.VisualStudio.Services.Agent.HostTraceListener..ctor(String logFileDirectory, String logFilePrefix, Int32 pageSizeLimit, Int32 retentionDays) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 50
at Microsoft.VisualStudio.Services.Agent.HostContext..ctor(HostType hostType, String logFile) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostContext.cs:line 135
at Microsoft.VisualStudio.Services.Agent.Listener.Program.Main(String[] args) in /mnt/vss/_work/1/s/src/Agent.Listener/Program.cs:line 28
./run.sh: line 68: 24614 Aborted (core dumped) "$DIR"/bin/Agent.Listener run $*
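The trace above names the exact path the agent could not open. As a hedged sketch (helper names are hypothetical, and this only inspects the file system, not the agent itself), the following Python pulls that path out of the exception text and reports the owner and mode of the nearest directory that exists, which may help show which account the log directory belongs to.

```python
import os
import re
import stat
from typing import Optional

# Matches the message text of the UnauthorizedAccessException shown above.
DENIED_RE = re.compile(r"Access to the path '(?P<path>[^']+)' is denied")


def extract_denied_path(trace: str) -> Optional[str]:
    """Return the path named in the exception message, if any."""
    m = DENIED_RE.search(trace)
    return m["path"] if m else None


def nearest_existing(path: str) -> str:
    """Walk up from `path` until something that exists is found."""
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the top without finding anything
            break
        path = parent
    return path or os.curdir


def describe_permissions(path: str) -> str:
    """Format mode bits and numeric owner/group, similar to `ls -ld`."""
    st = os.stat(path)
    return f"{stat.filemode(st.st_mode)} uid={st.st_uid} gid={st.st_gid} {path}"


if __name__ == "__main__":
    trace = (
        "System.UnauthorizedAccessException: Access to the path "
        "'/agent/_diag/Agent_20241025-190446-utc.log' is denied."
    )
    denied = extract_denied_path(trace)
    if denied:
        print(describe_permissions(nearest_existing(denied)))
```

Comparing the reported uid with the uid of the account that runs ./run.sh shows whether the directory was created by a different user.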