Incident Report: Teams were deleted during trial deployment of CLOWarden. #2356
Comments
this is the part that most baffles me |
Thanks @austinlparker and @trask for working so quickly to fix this! Infra / tooling changes like this are always fragile and are often impossible to test, especially when they do things like that ^ |
TODO
|
Also missing configuration-approvers, which was created very recently:
|
https://github.com/orgs/open-telemetry/teams?query=rust Looks like they're there? |
@open-telemetry/rust-approvers @open-telemetry/rust-maintainers |
it looks like we can get all of the changes since the backup via the GitHub audit logs:
I'll work on applying these changes... |
Thanks, @trask! |
ebpf-profiler-approvers and ebpf-profiler-maintainers (please see also recent changes from open-telemetry/opentelemetry-ebpf-profiler#151) seem to be missing for https://github.com/open-telemetry/opentelemetry-ebpf-profiler/. |
all changes to team membership since the July 12 backup have been applied (from the GitHub audit log). looking now to see if I can get repo permissions updates from the audit log... |
everything should be resolved now. please report any issues here, thanks! |
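The replay described above (apply every membership change recorded after the backup date) can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the event shape (`action`, `team`, `user`, `at`) and the function name `replay_membership` are assumptions made for the example, and a real audit log export would need to be mapped into this shape first.

```python
from datetime import datetime, timezone


def replay_membership(backup, events, since):
    """Apply membership events newer than `since` to a copy of `backup`.

    backup: dict mapping team name -> set of usernames (hypothetical shape)
    events: iterable of dicts with "action", "team", "user" and "at"
            (a timezone-aware datetime); these field names are assumptions
    since:  timestamp of the backup; older events are already reflected in it
    """
    # Copy so the backup itself is never mutated.
    teams = {name: set(members) for name, members in backup.items()}
    for event in sorted(events, key=lambda e: e["at"]):
        if event["at"] <= since:
            continue  # the backup already contains this change
        members = teams.setdefault(event["team"], set())
        if event["action"] == "team.add_member":
            members.add(event["user"])
        elif event["action"] == "team.remove_member":
            members.discard(event["user"])
    return teams
```

Replaying in timestamp order matters: an add followed by a remove for the same user must end with the user removed, regardless of the order in which the events were exported.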
Hi 👋 This is Sergio, one of the CLOWarden maintainers. First of all, I'm sorry you ran into some problems while testing CLOWarden. Please see my response here for more details about this incident: cncf/clowarden#261 (comment) |
You can simulate a dry-run by using the
From CLOWarden's README file, in the
In any case, we'll try to improve the docs to make it clearer how it works (i.e. highlighting some important parts) and avoid potential problems in the future to other users. |
I didn't see a way to configure the periodic reconciliation, or even an explanation of when it runs. Respectfully, I have some concerns about the design of CLOWarden where it is capable of mass destructive actions periodically without any confirmation from the user, even if there's a command line version. When I ran the diff (which is how we had the backup to begin with), CLOWarden complained about it (you can see the complaints in the linked PR above). Ultimately, we did not fully understand or appreciate how CLOWarden functioned here, and did not have the requisite expertise in Rust to really audit the code or appreciate its functionality. I believe that better documentation would have helped here, or perhaps the ability to clearly configure/understand the periodic reconciliation loop. |
I understand, that's fine. We needed to be able to remove teams/permissions/etc. that were manually added from the GitHub UI without maintainer interaction or approval; it was a requirement for us.
Could you point me to those logs please? I saw the clowarden-server logs in the issue you created, but I didn't see any run of the I mean the output of this command, which should detail all changes that would be applied when the service is launched (all resources to be deleted in this case):
|
I don't have the logs for the diff command unfortunately, I ran it several weeks ago. |
Isn't that what you would expect from the automation? The config becomes the source of truth, and if the change to config is merged the automation should apply it without additional manual confirmation. Are we saying that in this case the config accurately represented SOTW but CLOWarden went and erased everything? |
I would expect that if you had a PR open that modified the state, that would be the one that took effect, not the diff between the empty state and the PR, if that makes sense? |
My understanding of the explanation provided is that the PR had nothing to do with it. Both periodic and ad hoc syncs are still run off the merged config. So if the config was actually empty in if CLOWarden has a CLI that can apply the changes, I would avoid using the external app and start by running the CLI in dry-run mode for PRs, and only tell it to apply the changes in manual runs (e.g. a dispatched workflow, after the diff is verified). Once we gain confidence in it, we can enable auto-runs. |
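The rollout suggested above (dry-run by default, apply only on an explicit manual run) could look roughly like this. It is a generic sketch and not CLOWarden's behavior or CLI: the functions `plan_changes` and `reconcile`, and the `max_deletions` guard, are hypothetical names introduced for illustration.

```python
def plan_changes(desired, actual):
    """Return (to_create, to_delete) as sorted lists of team names."""
    to_create = sorted(set(desired) - set(actual))
    to_delete = sorted(set(actual) - set(desired))
    return to_create, to_delete


def reconcile(desired, actual, apply=False, max_deletions=0):
    """Return the actions to execute; an empty list means nothing is applied.

    apply=False is the dry-run default, suitable for PR checks.
    max_deletions is a hypothetical safety limit: a plan that deletes more
    teams than this is rejected instead of being applied.
    """
    to_create, to_delete = plan_changes(desired, actual)
    if not apply:
        return []  # dry run: the plan is computed but never executed
    if len(to_delete) > max_deletions:
        raise RuntimeError(
            f"plan deletes {len(to_delete)} teams, limit is {max_deletions}"
        )
    return [("create", t) for t in to_create] + [("delete", t) for t in to_delete]
```

With a guard like this, a nearly empty desired state compared against a full organization fails loudly rather than deleting everything, and raising the limit becomes a deliberate, reviewable step.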
The PR is the PR I linked upthread. I understand from the explanation given why things worked the way they did, even if I didn't feel that it was clear it would happen this way before.
This is all relatively academic because we aren't going to be using CLOWarden after this since we don't feel that we can support a tool we don't really understand, and it won't allow us to slowly roll out features to the entire organization. |
Fair enough - if we achieve the same with ~300 lines of Python, I would also go with that for better control. |
Closing this now that recovery is completed and no new issues have been reported for a bit. If there are any more issues or details people want, just let us know, thanks. |
@trask - Users of multiple ruby-related teams, including the ruby-maintainers and ruby-contrib-maintainers teams, are no longer able to update Project boards in the open-telemetry org. Were some permissions related to Projects changed in the recent incident? Someone may be able to follow these instructions to update our permissions. |
hi @kaylareopelle! yeah, this was an unfortunate side effect. I've just granted write access for both @open-telemetry/ruby-maintainers and @open-telemetry/ruby-contrib-maintainers to these projects: |
Thank you, @trask! Everything's working on my end now! I believe approvers also previously had write access. Is that the norm for other SIGs? If so, would you mind granting access to ruby-approvers and ruby-contrib-approvers too? |
done 👍 |
This issue briefly covers the incident that took place on 09/18/24 starting at 8:49 AM Pacific/11:49 AM Eastern. All times in the document will be in Eastern time.
Timeline of events
Approx. 11:45 AM - A change was pushed to the OpenTelemetry CLOWarden configuration to address a new configuration file location (open-telemetry/people). This configuration file was deliberately limited -- it defined one team, and one repository. Prior to this change, a separate installation of CLOWarden was used in a separate organization to validate the expected outcome with no issues.
11:48 AM - A PR was opened against the new config.yaml file. This change added a single user. CLOWarden picked up this change and began to validate the current state vs. the desired state. For an unknown reason, the behavior of CLOWarden differed in the live environment, and it began to apply changes to the current state of the open-telemetry organization before the PR was approved.
11:49 AM - Teams are deleted via CLOWarden and the GitHub API.
11:49 AM - Infra SIG (@austinlparker and @trask) pull the most recent CLOWarden state backup (from community/config.yaml) and begin to address statefile issues in order to re-create teams from this backup.
~12:20 PM - The statefile passes validation, but CLOWarden fails to process it due to a lack of repository information in the config.yaml. Out of an abundance of caution, we decide that we do not want to add repository information without fully understanding the application code, because the existing behavior was unexpected. A Rust SME (@TommyCpp) is invited into the SIG call and suggests that we pursue an alternative restoration strategy via Python script and remove CLOWarden from the loop entirely. A brief discussion ensues, and all parties agree on this course of action.
~1:45 PM - A script is developed and tested to re-create the teams based on the backup directly via GitHub API.
~1:49 PM - The team recreation job begins. We begin to work on restoring team/repository permissions.
2:30 PM - A script to repair repository/team permissions has been written and tested. Beginning to roll out updates.
2:39 PM - Permissions have been modified for all repositories based on the backup and we have confirmed this in the GitHub UI.
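The two restoration steps in the timeline (re-creating teams, then repairing repository/team permissions from the backup) can be sketched as a function that turns a backup into a list of GitHub REST API requests. This is an illustration under stated assumptions, not the script the Infra SIG ran: the backup layout (`teams` with `name`/`maintainers`/`members`, `repositories` with a `teams` role map), the role-to-permission mapping, and the name `build_restore_requests` are assumed for the example, and team names are assumed to already be valid slugs. The function only builds requests; it sends nothing.

```python
# Assumed mapping from config role names to GitHub permission names.
ROLE_TO_PERMISSION = {
    "read": "pull",
    "triage": "triage",
    "write": "push",
    "maintain": "maintain",
    "admin": "admin",
}


def build_restore_requests(org, backup):
    """Return (method, path, body) tuples that would restore the backup."""
    requests = []
    for team in backup.get("teams", []):
        slug = team["name"]  # assumption: the name is already slug-shaped
        requests.append(("POST", f"/orgs/{org}/teams", {"name": team["name"]}))
        for user in team.get("maintainers", []):
            path = f"/orgs/{org}/teams/{slug}/memberships/{user}"
            requests.append(("PUT", path, {"role": "maintainer"}))
        for user in team.get("members", []):
            path = f"/orgs/{org}/teams/{slug}/memberships/{user}"
            requests.append(("PUT", path, {"role": "member"}))
    # Permissions come second so every referenced team already exists.
    for repo in backup.get("repositories", []):
        for slug, role in repo.get("teams", {}).items():
            path = f"/orgs/{org}/teams/{slug}/repos/{org}/{repo['name']}"
            requests.append(("PUT", path, {"permission": ROLE_TO_PERMISSION[role]}))
    return requests
```

Keeping request construction separate from execution means the full list can be printed and reviewed before anything is sent, which is the kind of confirmation step discussed in the comments.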
Current Status
This incident is now closed. This thread will be left open for discussion until we have a more comprehensive retrospective and learnings available, which will be posted on opentelemetry.io. Thank you for your patience as we resolved this issue.