Track, manage and alter the state of an EC2 account with a focus on spot requests. The application understands EC2 regions, instances and pools of instances. It does not understand Taskcluster Provisioning specifics like Capacity or Utility factors.
- State Database
- CloudWatch Event listener
- Spot Instance Request Poller
- Periodic Housekeeping
- API
The state database is implemented in Postgres and has an interface provided by
the ./lib/state.js file. This database is written to be as simple as
possible while providing the data we need with transactional consistency where
appropriate. Only the absolute minimal amount of information is stored in the
database.
CloudWatch Events are the primary data source for information on instance state.
Information about instances is stored in the instances table. Whenever the
state (e.g. pending, running) of an instance changes, a message is sent
with the instance's id and the new state. When an event is received for the
creation of an instance, we need to look up some metadata using the
describeInstances EC2 API. When an event is received for an instance
shutdown, we delete the instance from the table unconditionally.
Whenever we get a message about an instance creation, we ensure that the spot
request it was associated with is removed from the list of spot requests we
need to poll.
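The event handling described above can be sketched as a small function over in-memory state. This is an illustration only, not the code in ./lib/state.js: the function name `applyStateChange`, the `spotRequestId` metadata field, the Map and Set standing in for the Postgres tables, and the choice of which states count as shutdowns are all assumptions. The `metadata` argument stands in for whatever the caller obtained from describeInstances.

```javascript
// States treated as shutdowns in this sketch (an assumption, not taken
// from the EC2 Manager source).
const SHUTDOWN_STATES = new Set(['shutting-down', 'terminated']);

// Apply one state-change message to local state.
//   state:    {instances: Map<id, object>, spotRequests: Set<spotRequestId>}
//             (hypothetical stand-ins for the instances and spotrequests tables)
//   event:    {id, newState} as carried by the message
//   metadata: result of a describeInstances lookup done by the caller;
//             not needed for shutdowns
function applyStateChange(state, event, metadata) {
  const {id, newState} = event;

  if (SHUTDOWN_STATES.has(newState)) {
    // Shutdowns are deleted unconditionally, with no lookup.
    state.instances.delete(id);
    return state;
  }

  // Insert the instance or update the one we already know about.
  const existing = state.instances.get(id) || {};
  state.instances.set(id, {...existing, ...metadata, id, state: newState});

  // The instance exists, so its spot request no longer needs polling.
  if (metadata && metadata.spotRequestId) {
    state.spotRequests.delete(metadata.spotRequestId);
  }
  return state;
}
```

Keeping the lookup outside the function makes the state transition easy to test without talking to EC2.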
The messages from CloudWatch Events reach us through an SQS queue. If the
instance's metadata isn't available through the describeInstances
endpoint, we redeliver the message a number of times. If the lookup is still
unsuccessful on the last attempt, we report it to Sentry and move on. The
periodic housekeeping will ensure that the instance is inserted into the state
database when appropriate.
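As a rough sketch of that redelivery policy, the decision for a single delivery could look like the following. The attempt limit of 5 and the names `MAX_ATTEMPTS` and `decideDelivery` are hypothetical; the text above only says the message is redelivered "a number of times".

```javascript
// Hypothetical limit on delivery attempts.
const MAX_ATTEMPTS = 5;

// Decide what to do with one delivery of a message.
//   metadata: describeInstances result, or null/undefined if unavailable
//   attempt:  1-based count of how many times the message has been delivered
function decideDelivery(metadata, attempt, maxAttempts = MAX_ATTEMPTS) {
  if (metadata) {
    // Lookup succeeded: the instance can be inserted into the state database.
    return {action: 'insert', metadata};
  }
  if (attempt < maxAttempts) {
    // Metadata not visible yet: put the message back for another try.
    return {action: 'redeliver', nextAttempt: attempt + 1};
  }
  // Last attempt failed: report the error and leave the rest to housekeeping.
  return {action: 'report'};
}
```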
Whenever a spot instance is requested from the API, we insert relevant metadata
from it into the spotrequests table. Periodically the ec2-manager will check
if any of the outstanding spot requests have been resolved. Often this check
finds nothing to do, because the request is fulfilled, thus generating a
CloudWatch event, long before we poll for it. If the poller discovers a spot
request which has entered a state where it will not be fulfilled, that
request is cancelled.
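One way to picture the poller's decision is a triage of the tracked requests by their EC2 status code. The grouping below is an assumption made for illustration: it keeps polling requests in `pending-evaluation` or `pending-fulfillment`, stops tracking `fulfilled` ones, and cancels everything else. The real service may draw these lines differently, and `triageSpotRequests` is a hypothetical name.

```javascript
// Status codes that mean "still waiting" in this sketch (assumed grouping).
const PENDING_CODES = new Set(['pending-evaluation', 'pending-fulfillment']);

// Sort tracked spot requests into what the poller should do next.
//   trackedIds:  ids currently in the local list of requests to poll
//   statusById:  object mapping request id -> status code reported by EC2
function triageSpotRequests(trackedIds, statusById) {
  const result = {keepPolling: [], stopTracking: [], cancel: []};
  for (const id of trackedIds) {
    const code = statusById[id];
    if (code === undefined || PENDING_CODES.has(code)) {
      // No answer yet, or still pending: check again on the next poll.
      result.keepPolling.push(id);
    } else if (code === 'fulfilled') {
      // An instance was created; nothing left to poll for.
      result.stopTracking.push(id);
    } else {
      // Treated here as "will not be fulfilled": cancel the request.
      result.cancel.push(id);
    }
  }
  return result;
}
```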
Every hour, the EC2 Manager will request the EC2 API's view of the state. When doing this it will kill any instance which has exceeded the absolute maximum run time. For all other instances and spot requests, the state returned by the EC2 API will be used to confirm the view of state that the EC2 Manager has. Any instance in local state which is not in EC2 API state will be deleted from local state and any instance in EC2 API state but not local will be added to local state.
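The housekeeping pass amounts to a diff between local state and the EC2 API's view. A minimal sketch, assuming instances are identified by id and carry a launch timestamp in milliseconds; the names `reconcile`, `launchedAt` and `maxRunTimeMs` are hypothetical, and the maximum run time is left as a parameter because no value is given here.

```javascript
// Compare local state with the EC2 API's view and list the needed changes.
//   localIds:     instance ids present in the local state database
//   ec2Instances: [{id, launchedAt}] as reported by EC2 (launchedAt in ms)
//   now:          current time in ms
//   maxRunTimeMs: absolute maximum run time for an instance
function reconcile(localIds, ec2Instances, now, maxRunTimeMs) {
  const local = new Set(localIds);
  const remote = new Set(ec2Instances.map(instance => instance.id));
  const kill = [];
  const add = [];

  for (const instance of ec2Instances) {
    if (now - instance.launchedAt > maxRunTimeMs) {
      // Exceeded the absolute maximum run time: kill it.
      kill.push(instance.id);
    } else if (!local.has(instance.id)) {
      // Known to EC2 but not to us: add to local state.
      add.push(instance.id);
    }
  }

  // Known to us but not to EC2: delete from local state.
  const remove = localIds.filter(id => !remote.has(id));

  return {kill, add, remove};
}
```

Returning a plan instead of performing the changes keeps the comparison separate from the EC2 and database calls that act on it.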
The API provided by EC2 Manager can be used to manage EC2 instances. Of
particular note is that the endpoint for submitting spot requests requires a
fully formed and valid LaunchSpecification.
git clone https://github.com/taskcluster/ec2-manager
cd ec2-manager
yarn
yarn test
When deploying, keep in mind the following:
- SSH Pubkey used in the LaunchSpecification must match the one configured in the EC2 Manager. If you submit a LaunchSpecification with a different public key, it will be rejected.
- This service is set up to auto deploy to the ec2-manager-staging Heroku app on pushes to the master branch. Deploying to production requires promotion