Add Warm Pool #838

Open: wants to merge 1 commit into base: main

Conversation

nitrocode
Contributor

@nitrocode nitrocode commented May 4, 2021

Attempts to close #822

@nitrocode
Contributor Author

cc: @yob @chloeruka

@keithduncan
Contributor

Hi @nitrocode, thank you for opening this pull request! I agree this could be a valuable feature: it would remove the latency that our EC2 UserData and the BootstrapScriptUrl add to ASG scale-out time.

I’ve started looking at how the warm pool will behave and interact with the rest of our stack. One concern I have is how to prevent warm pool instances from starting their buildkite-agent and pulling work, only for the job to be interrupted when the ASG stops the instance. I’ve looked at the ASG events around instances moving in and out of the warm pool but haven’t been able to think of a reliable way to stall agent startup. What are your thoughts on this?

@nitrocode
Contributor Author

nitrocode commented May 31, 2021

I was hoping there were warm-pool-specific lifecycle events, but there are only instance launching and instance terminating events to hook into.

Perhaps there is a way to detect whether an EC2 instance is part of the warm pool and, if so, skip starting the agent.

Edit: the warm pool does have a distinct lifecycle state, Warmed:Pending:Wait, which could be used to NOT trigger the start of the agent. Alternatively, the Pending:Wait lifecycle state, which is available in both the warm pool and standard pool lifecycles, could be used to start the agent.
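
As a rough illustration of the "detect and skip" idea above, here is a minimal Python sketch, not something this PR implements. It assumes the agent should run only once the instance reports the `InService` lifecycle state, and that every warm pool state carries the `Warmed:` prefix (as `Warmed:Pending:Wait` does). The helper names `is_warm_pool_state`, `should_start_agent` and `fetch_lifecycle_state` are hypothetical; the only AWS call used is boto3's `describe_auto_scaling_instances`.

```python
# Sketch only: decide whether to start the agent from the ASG lifecycle state.

# State reported once an instance is serving in the main group.
IN_SERVICE = "InService"
# Warm pool lifecycle states share this prefix, e.g. "Warmed:Pending:Wait".
WARM_POOL_PREFIX = "Warmed:"


def is_warm_pool_state(lifecycle_state: str) -> bool:
    """Return True when the lifecycle state belongs to the warm pool."""
    return lifecycle_state.startswith(WARM_POOL_PREFIX)


def should_start_agent(lifecycle_state: str) -> bool:
    """Assumption: start the agent only when the instance is InService.

    This is deliberately conservative. It also holds the agent back during
    Pending and Pending:Wait, not just during the warm pool states.
    """
    return lifecycle_state == IN_SERVICE


def fetch_lifecycle_state(instance_id: str) -> str:
    """Look up the current lifecycle state of one instance (needs AWS access)."""
    import boto3  # imported lazily so the pure helpers above work without it

    client = boto3.client("autoscaling")
    response = client.describe_auto_scaling_instances(InstanceIds=[instance_id])
    instances = response["AutoScalingInstances"]
    if not instances:
        raise LookupError(f"{instance_id} is not managed by an Auto Scaling group")
    return instances[0]["LifecycleState"]
```

A boot script could call `should_start_agent(fetch_lifecycle_state(instance_id))` before enabling the service. It would still need to poll or be re-triggered when the instance later leaves the warm pool, so this does not solve the timing problem by itself.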

@keithduncan
Contributor

Interestingly, the limitations section of Warm pools for Amazon EC2 Auto Scaling calls out ECS and EKS managed node groups as having a similar issue:

If you try using warm pools with Amazon Elastic Container Service (Amazon ECS) or Elastic Kubernetes Service (Amazon EKS) managed node groups, there is a chance that these services will schedule jobs on an instance before it reaches the warm pool.

The best idea I’ve had so far for managing the systemd unit is to receive those events in a Lambda and lean on the SSM agent to execute the state change on the host. Though I’d still be concerned about the prevalence of race conditions in that setup 🤔
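
To make the Lambda plus SSM idea concrete, here is a hedged Python sketch, not a tested design. It assumes the lifecycle event's `detail` carries `EC2InstanceId` and a `Destination` of either `WarmPool` or `AutoScalingGroup`, and that the systemd unit is named `buildkite-agent`. The helper names `desired_action` and `build_send_command_args` are hypothetical. The SSM call is boto3's `send_command` with the `AWS-RunShellScript` document.

```python
# Sketch only: map an ASG lifecycle event to a systemctl command run via SSM.
from typing import Optional

SERVICE_NAME = "buildkite-agent"  # assumed systemd unit name


def desired_action(detail: dict) -> Optional[str]:
    """Pick a systemctl verb from the event's Destination field (assumed shape)."""
    destination = detail.get("Destination")
    if destination == "WarmPool":
        return "stop"  # instance is being parked, keep the agent off
    if destination == "AutoScalingGroup":
        return "start"  # instance is entering service, let the agent run
    return None  # unknown or missing destination: do nothing


def build_send_command_args(detail: dict) -> Optional[dict]:
    """Build keyword arguments for ssm.send_command, or None to skip."""
    action = desired_action(detail)
    if action is None:
        return None
    return {
        "InstanceIds": [detail["EC2InstanceId"]],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {"commands": [f"systemctl {action} {SERVICE_NAME}"]},
    }


def handler(event, context):
    """Lambda entry point (needs AWS access and a running SSM agent on the host)."""
    import boto3  # imported lazily so the helpers above can be tested offline

    args = build_send_command_args(event.get("detail", {}))
    if args is None:
        return {"skipped": True}
    response = boto3.client("ssm").send_command(**args)
    return {"command_id": response["Command"]["CommandId"]}
```

The sketch does not complete the lifecycle action or wait for the command to finish, and the agent could still pick up a job between boot and the `stop` command arriving, so the race condition concern stands.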

Maybe the best approach here is to ask AWS for guidance on how to warm pool a workload like the buildkite-agent that doesn’t use a load balancer?

@josh-ross-ai

Are there plans to add this feature to the next release?

@keithduncan
Contributor

Hi @joshross12, likewise, this feature isn’t slated for a particular release, and there are some technical hurdles to overcome before we can land it.

Specifically, my plan here is to experiment with using the SSM Agent to manage whether the buildkite-agent systemd service should be running or not. I would welcome suggestions for how that would look and how to ensure the process is reliable.

@ptarjan ptarjan mentioned this pull request Aug 26, 2021
@keithduncan keithduncan added the agent lifecycle Agent boot, job lifecycle, agent shutdown label Sep 6, 2021
@dieend

dieend commented Nov 9, 2021

Cross-posting this information from #822 (comment):
We have to remove MixedInstancesPolicy from the ASG config to enable a warm pool for that ASG.

Labels
agent lifecycle Agent boot, job lifecycle, agent shutdown asg-initiated-termination
Development

Successfully merging this pull request may close these issues.

Use asg warm pools for faster buildkite job starts
4 participants