Incident Report: Teams were deleted during trial deployment of CLOWarden. #2356

austinlparker · 2024-09-18T17:54:35Z

This issue briefly covers the incident that took place on 09/18/24 starting at 8:49 AM Pacific/11:49 AM Eastern. All times in the document will be in Eastern time.

Timeline of events

Approx. 11:45 AM - A change was pushed to the OpenTelemetry CLOWarden configuration to address a new configuration file location (open-telemetry/people). This configuration file was deliberately limited -- it defined one team, and one repository. Prior to this change, a separate installation of CLOWarden was used in a separate organization to validate the expected outcome with no issues.

11:48 AM - A PR was made to the new config.yaml file. This change added a single user. CLOWarden picked up this change and began to validate the current state vs. the desired state. For an unknown reason, the behavior of CLOWarden differed in the live environment, and it began to apply changes to the current state of the open-telemetry organization prior to approving the PR.

11:49 AM - Teams are deleted via CLOWarden and the GitHub API.

11:49 AM - Infra SIG (@austinlparker and @trask) pull the most recent CLOWarden state backup (from community/config.yaml) and begin to address statefile issues in order to re-create teams from this backup.

~12:20 PM - The statefile passes validation, but CLOWarden fails to process it due to a lack of repository information in the config.yaml. Out of an abundance of caution, we decide that we do not want to add repository information without fully understanding the application code as the existing behavior was a surprise/unexpected. A Rust SME is invited into the SIG call (@TommyCpp) who suggests that we pursue an alternative restoration strategy via Python script and remove CLOWarden from the loop entirely. A brief discussion ensues, and all parties agree on this course of action.

~1:45 PM - A script is developed and tested to re-create the teams based on the backup directly via GitHub API.

~1:49 - The team recreation job begins. We begin to work on restoring team/repository permissions.

2:30 PM - A script to repair repository/team permissions has been written and tested. Beginning to roll out updates.

2:39 PM - Permissions have been modified for all repositories based on the backup and we have confirmed this in the GitHub UI.

Current Status

The CLOWarden app has been disabled.
Teams and Repository Permissions have been restored from backup.
The backup is from July 12th, 2024. Please double check permissions and CI/CD, as well as updating team affiliations for any modifications since July.

This incident is now closed. This thread will be left open for discussion until we have a more comprehensive retrospective and learnings available, which will be posted on opentelemetry.io. Thank you for your patience as we resolved this issue.

trask · 2024-09-18T18:01:34Z

it began to apply changes to the current state of the open-telemetry organization prior to approving the PR

this is the part that most baffles me

mtwo · 2024-09-18T18:09:24Z

Thanks @austinlparker and @trask for working so quickly to fix this! Infra / tooling changes like this are always fragile and are often impossible to test, especially when they do things like that ^

trask · 2024-09-18T18:33:55Z

TODO

review teams that were created since the backup that we used from 2 months ago
ping all teams here to ask them to validate their current membership since the backup we used was from 2 months ago

marcalff · 2024-09-18T18:45:21Z

Also missing configuration-approvers, which was created very recently:

Add brettmc to approvers opentelemetry-configuration#113
- (per @jack-berg) "Upon merging this PR, I'll create the configuration-approvers team."
- Add brettmc to approvers opentelemetry-configuration#113 (comment)
Add marcalff to approvers opentelemetry-configuration#114

austinlparker · 2024-09-18T19:40:59Z

Missing rust-approvers and rust-maintainers

https://github.com/orgs/open-telemetry/teams?query=rust

Looks like they're there?

austinlparker · 2024-09-18T19:41:23Z

@open-telemetry/rust-approvers @open-telemetry/rust-maintainers

trask · 2024-09-18T19:46:47Z

TODO

review teams that were created since the backup that we used from 2 months ago

ping all teams here to ask them to validate their current membership since the backup we used was from 2 months ago

it looks like we can get all of the changes since the backup via the GitHub audit logs:

I'll work on applying these changes...

austinlparker · 2024-09-18T19:47:49Z

Thanks, @trask!

florianl · 2024-09-18T19:48:54Z

ebpf-profiler-approvers and ebpf-profiler-maintainers (please see also recent changes from open-telemetry/opentelemetry-ebpf-profiler#151) seems to be missing for https://github.com/open-telemetry/opentelemetry-ebpf-profiler/.

trask · 2024-09-18T21:11:50Z

all changes to team membership since the July 12 backup have been applied (from the github audit log)

looking now to see if I can get repo permissions updates from the audit log...

trask · 2024-09-18T21:31:54Z

everything should be resolved now

please report any issues here, thanks!

yurishkuro · 2024-09-18T23:04:54Z

Could we link to the PRs mentioned in the timeline, to better understand the changes?
Was CLOWarden run from a workflow or as an external app?
Is there a dry-run option in CLOWarden where it logs what changes it would execute?

austinlparker · 2024-09-18T23:14:58Z

Could we link to the PRs mentioned in the timeline, to better understand the changes?

Was CLOWarden run from a workflow or as an external app?

Is there a dry-run option in CLOWarden where it logs what changes it would execute?

This PR (now repurposed) shows the changes from CLOWardens perspective. add manual management script gh-manager#2 (comment) - I linked to the specific comment it gave when modifying the limited state file; You can look at it in the PR history. Subsequent commits/messages are from when we tried to recover using CLOWarden vs. writing our own script.
CLOWarden is run as an external application that watches PRs in specified repositories.
No, there is no dry run option that I'm aware of or found. This is why we tried a small test on a separate organization. The setup for that test was an organization with an existing team that was not managed in the state file. A change was made to add a new team. In that instance, it did not automatically delete the existing team. The behavior in todays case was rather surprising given both our prior tests, as well as our expectations for how a desired state engine would function (for example, we expected that it would not make changes without human intervention/approval for a PR, which is what the docs suggest) lol

tegioz · 2024-09-19T14:10:13Z

Hi 👋

This is Sergio, one of the CLOWarden maintainers.

First of all, I'm sorry you run into some problems while testing CLOWarden. Please see my response here for more details about this incident: cncf/clowarden#261 (comment)

tegioz · 2024-09-19T15:07:09Z

No, there is no dry run option that I'm aware of or found.

You can simulate a dry-run by using the diff subcommand of the CLI tool.

The behavior in todays case was rather surprising given both our prior tests, as well as our expectations for how a desired state engine would function (for example, we expected that it would not make changes without human intervention/approval for a PR, which is what the docs suggest).

From CLOWarden's README file, in the How it works section:

CLOWarden's main goal is to ensure that the resources' desired state, as defined in the configuration files, matches the actual state in the corresponding services. This is achieved by running reconciliation jobs, that can be triggered on-demand or periodically.

CLOWarden monitors pull requests created in the configuration repository and, when applicable, it validates the proposed changes to access rules and creates reconciliation jobs to apply the necessary changes. This is what we call an on-demand reconciliation job: it's created as a result of a user's submitted pull request, and changes are applied immediately once the pull request is approved and merged.

Sometimes, however, this may not be enough. Changes can be applied manually to the service bypassing the configuration files (i.e. from the GitHub settings UI), and CLOWarden still needs to make sure that the actual state matches the desired state documented in the configuration files. So in addition to on-demand reconciliation jobs, CLOWarden runs periodic ones to ensure everything is all right all the time.

In any case, we'll try to improve the docs to make it clearer how it works (i.e. highlighting some important parts) and avoid potential problems in the future to other users.

austinlparker · 2024-09-19T15:11:46Z

I didn't see a way to configure the periodic reconciliation, or even an explanation of when it runs. Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version. When I ran the diff (which is how we had the backup to begin with), CLOWarden complained about it (you can see the complaints in the linked PR above).

Ultimately, we did not fully understand or appreciate how CLOWarden functioned here, and did not have the requisite expertise in Rust to really audit the code or appreciate its functionality. I believe that better documentation would have helped here, or perhaps the ability to clearly configure/understand the periodic reconciliation loop.

tegioz · 2024-09-19T15:32:39Z

Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version.

I understand, that's fine. We needed to be able to remove teams/permissions/etc manually added from the GitHub UI without maintainers interaction or approval, it was a requirement for us.

When I ran the diff (which is how we had the backup to begin with), CLOWarden complained about it (you can see the complaints in the linked PR above).

Could you point me to those logs please? I saw the clowarden-server logs in the issue you created, but I didn't see any run of the diff subcommand (dry-run). Would love to see what it reported.

I mean the output of this command, which should detail all changes that would be applied when the service is launched (all resources to be deleted in this case):

export GITHUB_TOKEN=...
clowarden-cli diff --org open-telemetry --repo gh-manager --branch main

austinlparker · 2024-09-19T15:34:27Z

I don't have the logs for the diff command unfortunately, I ran it several weeks ago.

yurishkuro · 2024-09-19T16:40:51Z

Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version.

Isn't that what you would expect from the automation? The config becomes the source of truth, and if the change to config is merged the automation should apply it without additional manual confirmation. Are we saying that in this case the config accurately represented SOTW but CLOWarden went and erased everything?

austinlparker · 2024-09-19T16:42:40Z

Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version.

Isn't that what you would expect from the automation? The config becomes the source of truth, and if the change to config is merged the automation should apply it without additional manual confirmation. Are we saying that in this case the config accurately represented SOTW but CLOWarden went and erased everything?

I would expect that if you had a PR open that modified the state, that would be the one that took effect, not the diff between the empty state and the PR, if that makes sense?

yurishkuro · 2024-09-19T16:56:34Z

My understanding of the explanation provided that PR had nothing to do with it. Both periodic and adhoc syncs are still run off the merged config. So if the config was actually empty in main then deleting all groups would be the expected behavior once the app is enabled. But if the config was actually reflective of SOTW but CLOWarden's some internal state was empty and it deleted everything - then yes, it would be a big problem. I still haven't seen the actual PR - the timeline says there was a config change where the config file location was changed, and perhaps to some new version of config that didn't represent SOTW - then again, the behavior was expected (but not before PR was merged).

if CLOWarden has a CLI that can apply the changes, I would avoid using external app and start running the CLI in dry mode for PRs, and only tell it to apply the changes in manual runs (e.g. a dispatched workflow, after diff is verified). Once we gain confidence in it, we can enable auto-runs.

austinlparker · 2024-09-19T17:02:41Z

My understanding of the explanation provided that PR had nothing to do with it. Both periodic and adhoc syncs are still run off the merged config. So if the config was actually empty in main then deleting all groups would be the expected behavior once the app is enabled. But if the config was actually reflective of SOTW but CLOWarden's some internal state was empty and it deleted everything - then yes, it would be a big problem. I still haven't seen the actual PR - the timeline says there was a config change where the config file location was changed, and perhaps to some new version of config that didn't represent SOTW - then again, the behavior was expected (but not before PR was merged).

if CLOWarden has a CLI that can apply the changes, I would avoid using external app and start running the CLI in dry mode for PRs, and only tell it to apply the changes in manual runs (e.g. a dispatched workflow, after diff is verified).

The PR is the PR I linked upthread. I understand from the explanation given why things worked the way they did, even if I didn't feel that it was clear it would happen this way before.

We created a new repo and added a new config file (config.yaml, seen here open-telemetry/gh-manager@4bceb97) that had the initial team we wanted to manage through CLOWarden. We did this because prior maintainer feedback was that certain teams did not wish to use CLOWarden, so our initial spike was going to be managing a limited set of teams. In testing, this appeared to work fine, but that is because we got lucky.
Once the new config file was created, the CLOWarden service was restarted in k8s to pick up the new config file location (previously, it was pointing to community/config.yaml). This is the critical difference between the test environment and the live one, we did not restart CLOWarden in that case.
As CLOWarden restarted, it read the config.yaml file that existed with only one team, and its periodic reconciliation engine deleted all other teams to match the state in that file.

This is all relatively academic because we aren't going to be using CLOWarden after this since we don't feel that we can support a tool we don't really understand, and it won't allow us to slowly roll out features to the entire organization.

yurishkuro · 2024-09-19T17:13:10Z

Fair enough - if we achieve the same with ~300 lines of Python, I would also go with that for better control.

trask · 2024-09-27T15:58:04Z

Closing this now that recovery is completed and no new issues have been reported for a bit. If there's any more issues or details people want just let us know, thanks.

kaylareopelle · 2024-10-16T22:08:42Z

@trask - Users of multiple ruby-related teams, including the ruby-maintainers and ruby-contrib-maintainers teams are are no longer able to update Project boards in the open-telemetry org.

Were some permissions related to Projects changed in the recent incident?

Someone may be able to follow these instructions to update our permissions.

trask · 2024-10-16T23:26:29Z

hi @kaylareopelle! yeah, this was an unfortunate side effect

I've just granted write access for both @open-telemetry/ruby-maintainers and @open-telemetry/ruby-contrib-maintainers to these projects:

kaylareopelle · 2024-10-16T23:35:07Z

Thank you, @trask! Everything's working on my end now! I believe approvers also previously had write access. Is that the norm for other SIGs?

If so, would you mind granting access to ruby-approvers and ruby-contrib-approvers too?

trask · 2024-10-16T23:41:24Z

done 👍

This comment was marked as outdated.

Sign in to view

This comment was marked as resolved.

Sign in to view

austinlparker mentioned this issue Sep 19, 2024

CLOWarden applied team changes before a PR was approved. cncf/clowarden#261

Closed

trask closed this as completed Sep 27, 2024

lalitb mentioned this issue Sep 30, 2024

REQUEST: GitHub team for publishing OpenTelemetry Rust crates #2369

Open

trask mentioned this issue Oct 30, 2024

REQUEST: Repository maintenance on opentelemetry-dotnet-instrumentation #2418

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incident Report: Teams were deleted during trial deployment of CLOWarden. #2356

Incident Report: Teams were deleted during trial deployment of CLOWarden. #2356

austinlparker commented Sep 18, 2024 •

edited

Loading

trask commented Sep 18, 2024

mtwo commented Sep 18, 2024

trask commented Sep 18, 2024 •

edited

Loading

This comment was marked as outdated.

marcalff commented Sep 18, 2024 •

edited

Loading

This comment was marked as resolved.

austinlparker commented Sep 18, 2024

austinlparker commented Sep 18, 2024

trask commented Sep 18, 2024

austinlparker commented Sep 18, 2024

florianl commented Sep 18, 2024 •

edited

Loading

trask commented Sep 18, 2024

trask commented Sep 18, 2024

yurishkuro commented Sep 18, 2024

austinlparker commented Sep 18, 2024 •

edited

Loading

tegioz commented Sep 19, 2024

tegioz commented Sep 19, 2024

austinlparker commented Sep 19, 2024

tegioz commented Sep 19, 2024 •

edited

Loading

austinlparker commented Sep 19, 2024

yurishkuro commented Sep 19, 2024

austinlparker commented Sep 19, 2024

yurishkuro commented Sep 19, 2024 •

edited

Loading

austinlparker commented Sep 19, 2024

yurishkuro commented Sep 19, 2024

trask commented Sep 27, 2024

kaylareopelle commented Oct 16, 2024

trask commented Oct 16, 2024

kaylareopelle commented Oct 16, 2024 •

edited

Loading

trask commented Oct 16, 2024

Incident Report: Teams were deleted during trial deployment of CLOWarden. #2356

Incident Report: Teams were deleted during trial deployment of CLOWarden. #2356

Comments

austinlparker commented Sep 18, 2024 • edited Loading

Timeline of events

Current Status

trask commented Sep 18, 2024

mtwo commented Sep 18, 2024

trask commented Sep 18, 2024 • edited Loading

This comment was marked as outdated.

marcalff commented Sep 18, 2024 • edited Loading

This comment was marked as resolved.

austinlparker commented Sep 18, 2024

austinlparker commented Sep 18, 2024

trask commented Sep 18, 2024

austinlparker commented Sep 18, 2024

florianl commented Sep 18, 2024 • edited Loading

trask commented Sep 18, 2024

trask commented Sep 18, 2024

yurishkuro commented Sep 18, 2024

austinlparker commented Sep 18, 2024 • edited Loading

tegioz commented Sep 19, 2024

tegioz commented Sep 19, 2024

austinlparker commented Sep 19, 2024

tegioz commented Sep 19, 2024 • edited Loading

austinlparker commented Sep 19, 2024

yurishkuro commented Sep 19, 2024

austinlparker commented Sep 19, 2024

yurishkuro commented Sep 19, 2024 • edited Loading

austinlparker commented Sep 19, 2024

yurishkuro commented Sep 19, 2024

trask commented Sep 27, 2024

kaylareopelle commented Oct 16, 2024

trask commented Oct 16, 2024

kaylareopelle commented Oct 16, 2024 • edited Loading

trask commented Oct 16, 2024

austinlparker commented Sep 18, 2024 •

edited

Loading

trask commented Sep 18, 2024 •

edited

Loading

marcalff commented Sep 18, 2024 •

edited

Loading

florianl commented Sep 18, 2024 •

edited

Loading

austinlparker commented Sep 18, 2024 •

edited

Loading

tegioz commented Sep 19, 2024 •

edited

Loading

yurishkuro commented Sep 19, 2024 •

edited

Loading

kaylareopelle commented Oct 16, 2024 •

edited

Loading