
Sequencing of actions #6

Open
mjordan73 opened this issue Aug 15, 2018 · 1 comment

Comments


mjordan73 commented Aug 15, 2018

More of an observation than an out-and-out issue, but whilst experimenting with your lambda something struck me. In the main procedure you currently:

  • Evaluate minimum state of the cluster
  • Evaluate if scaling-in is appropriate on reserved memory grounds
  • Evaluate if scaling-in is appropriate on reserved CPU grounds
  • Terminate any already-drained host instances
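
To make the ordering concrete, here's a minimal runnable sketch of that sequence. This is plain Python with an in-memory cluster state rather than the lambda's real boto3 calls, and the function and field names (`run_lambda`, `desired`, `draining`, etc.) are illustrative, not the project's actual API:

```python
# Illustrative sketch only: an in-memory stand-in for the lambda's main
# procedure, with hypothetical names (the real code talks to ECS/ASG
# via boto3). 'cluster' models an ASG: desired size, minimum size,
# active instance IDs and draining instance IDs.

def run_lambda(cluster):
    actions = []
    # Snapshot instances that were already draining *before* this run;
    # these are the ones step 4 will terminate.
    already_drained = list(cluster['draining'])

    # 1. Minimum-state check uses the raw desired size, which still
    #    counts the draining instances from earlier runs.
    scale_in_candidate = cluster['desired'] > cluster['minimum']

    # 2/3. Reserved memory / CPU checks: with near-idle containers, the
    #      cluster looks over-provisioned, so an *active* host gets
    #      picked for draining.
    if scale_in_candidate and cluster['active']:
        inst = cluster['active'].pop()
        cluster['draining'].append(inst)
        actions.append('drain:' + inst)

    # 4. Only now terminate the hosts drained on previous runs.
    for inst in already_drained:
        cluster['draining'].remove(inst)
        cluster['desired'] -= 1
        actions.append('terminate:' + inst)
    return actions
```

Running this against the second-run state from the scenario below (desired 4, minimum 2, two active hosts, two draining) drains one of the two remaining active hosts before the desired size is corrected.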

So in our scenario I have two hosts running tasks (one per availability zone), i.e. a cluster with a minimum and desired size of 2. I then manually trigger our scale-up alarm (which sets the desired size to 4) to force some new hosts to be added. The first time I run the lambda after that, it sees the new hosts as surplus and starts draining them.

The interesting part is on the next run of the lambda, where the minimum-state evaluation (asg_on_min_state) doesn't take into account the number of draining instances about to be terminated (i.e. the desired size is still deemed to be 4). As we step further into the code (I'm currently only experimenting in a dev environment, and the containers I'm running do pretty much nothing), the reserved-memory evaluation then actually decides to start draining one of my remaining minimum 2 active boxes! Finally, it also terminates the two instances that were set to draining on the first run.

So with this kind of scenario in mind, would it not make sense to make termination of drained instances one of the first things to be evaluated (or at least do it before evaluating minimum state, so we have a desired-size figure that reflects only hosts that aren't on the cusp of termination), rather than the last? Either that, or make the asg_on_min_state procedure take into account instances that are draining or about to be terminated.
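
Sketched the same way as before (in-memory state and hypothetical names, not the lambda's real API), the proposed reordering would look roughly like this, terminating drained hosts first and also discounting any in-flight drains before the minimum-state comparison:

```python
# Illustrative sketch of the proposed fix: terminate already-drained
# hosts first so 'desired' is accurate, and additionally discount any
# still-draining instances before comparing against the minimum.

def run_lambda_reordered(cluster):
    actions = []
    # Terminate drained hosts up front, so the desired size reflects
    # only instances that will keep serving tasks.
    for inst in list(cluster['draining']):
        cluster['draining'].remove(inst)
        cluster['desired'] -= 1
        actions.append('terminate:' + inst)

    # Belt and braces: also subtract any remaining in-flight drains --
    # the "make asg_on_min_state aware of draining" alternative.
    effective_desired = cluster['desired'] - len(cluster['draining'])

    if effective_desired > cluster['minimum'] and cluster['active']:
        inst = cluster['active'].pop()
        cluster['draining'].append(inst)
        actions.append('drain:' + inst)
    return actions
```

With the same second-run state (desired 4, minimum 2, two active, two draining), this version terminates the two drained hosts, corrects the desired size to 2, and leaves both active hosts untouched.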


omerxx commented Sep 25, 2018

Hi @mjordan73, I think you're onto something.
I did notice some strange behaviour when experimenting with small-scale clusters, and I've heard of similar effects on clusters of 1–3 instances.
At larger scale (over 30 instances) things work as expected.
Assuming you are correct and draining instances are not taken into account (which actually makes sense, since they will not serve new tasks), this would explain the issue, and also why large-scale clusters, where each instance is relatively negligible, are unaffected.

If you've already made progress with this, please share it and I'll merge the changes.
Thanks!
