The goal of this project is to collect beginner-friendly notebooks that demonstrate alignment concepts in toy environments, so that the concepts feel more tangible.
Todo:
- Instrumental convergence
- Goodharting (following the variants identified here)
- Mesa-optimization (can largely be based on this paper)
- Deception (todo: figure out how to do this one)
- ELK, i.e. Eliciting Latent Knowledge (show each proposal in ELK, then a toy problem that breaks that proposal, and so on)
- Wireheading (an environment where the AI can modify its own reward signal directly)
- Nearest unblocked strategy (an environment demonstrating the problems with band-aid solutions)
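
As a rough idea of what one of these notebooks might start from, here is a minimal sketch of the wireheading item above. Everything in it is an assumption made for illustration: the `WireheadCorridor` class, the `TAMPER` action, the `hacked_reward` value, and the two hand-written policies are hypothetical and not part of the project. The environment is a short corridor with a goal at the right end, plus one extra action that overwrites the reward sensor, so the reward the agent observes can diverge from the reward the designer intended.

```python
class WireheadCorridor:
    """Hypothetical toy environment: a 1-D corridor with a tamperable reward sensor."""

    LEFT, RIGHT, TAMPER = 0, 1, 2

    def __init__(self, size=5, horizon=20, hacked_reward=10.0):
        self.size = size                    # number of cells; the goal is the last cell
        self.horizon = horizon              # fixed episode length in steps
        self.hacked_reward = hacked_reward  # value the sensor reports once tampered with
        self.reset()

    def reset(self):
        self.pos = 0
        self.t = 0
        self.tampered = False
        return self.pos

    def step(self, action):
        if action == self.LEFT:
            self.pos = max(0, self.pos - 1)
        elif action == self.RIGHT:
            self.pos = min(self.size - 1, self.pos + 1)
        elif action == self.TAMPER:
            # The agent rewrites its own reward signal instead of moving.
            self.tampered = True
        self.t += 1

        # True reward: what the designer wanted (standing on the goal cell).
        true_reward = 1.0 if self.pos == self.size - 1 else 0.0
        # Observed reward: what the agent actually receives and optimizes.
        observed = self.hacked_reward if self.tampered else true_reward
        done = self.t >= self.horizon
        return self.pos, observed, true_reward, done


def run_episode(policy, env=None):
    """Run one episode and return (total observed reward, total true reward)."""
    if env is None:
        env = WireheadCorridor()
    state = env.reset()
    observed_total = 0.0
    true_total = 0.0
    done = False
    while not done:
        state, observed, true_reward, done = env.step(policy(state))
        observed_total += observed
        true_total += true_reward
    return observed_total, true_total


def always_right(state):
    """Intended behavior: walk to the goal and stay there."""
    return WireheadCorridor.RIGHT


def always_tamper(state):
    """Wireheading behavior: hack the sensor on every step and never move."""
    return WireheadCorridor.TAMPER


print(run_episode(always_right))   # (17.0, 17.0)
print(run_episode(always_tamper))  # (200.0, 0.0)
```

With the default settings, the tampering policy collects far more observed reward while earning no true reward, which is the gap a wireheading notebook would want to make visible. A fuller version could replace the hand-written policies with a simple learning agent and show that it drifts toward `TAMPER` on its own.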