Goal: Implementation of a fast-learning deep neural network with general applicability and especially strong Angry Birds playing performance.
The idea originated from the Angry Birds AI Competition.
Key Features • Installation • Usage • Troubleshooting • Acknowledgements • Bibliography • License •
## Key Features

- Reinforcement learning framework with the following features (a short sketch of the resulting learning target follows this feature list):
- Annealing Epsilon-Greedy Policy
- Dueling Networks
- Prioritized Experience Replay
- Double Q-Learning
- Frame-stacking
- n-step Expected Sarsa
- Distributed RL
- Sequential Learning (for LSTMs and other RNNs)
- Monte Carlo Return Targets
- Environments:
- Angry Birds, with level generator
- Snake
- Tetris
- Chain Bomb (invented)
- Environment API: quickly switch between environments without needing to change the model
- Extensive utility library:
- Plotting: >40 different ways to compare several training run metrics (score, return, time etc.) on different domains (transitions, episodes, wall-clock time, etc.)
- Permanence: save and reload entire training runs, including model weights and statistics
- Pre-training: unsupervised model pre-training on randomly generated environment data
- Train sample control: sample states with extreme loss during training are logged and plotted automatically
- and much more
- Frame stacking
- Angry Birds environment parallelization
- Handling for (practically) infinite episodes
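To make the interplay of several of these features concrete, here is a minimal, illustrative sketch of how an n-step Double Q-Learning bootstrap target can be computed. This is not the repository's actual code; the names `online_q`, `target_q` and the argument layout are assumptions made purely for illustration.

```python
# Illustrative sketch only (not the repository's code): n-step Double
# Q-Learning target. `online_q` and `target_q` are assumed to be callables
# returning Q-value arrays of shape (batch, num_actions).
import numpy as np

def n_step_double_q_target(rewards, bootstrap_obs, dones, online_q, target_q,
                           gamma=0.999, n_step=1):
    """rewards: (batch, n_step) rewards r_t ... r_{t+n-1}
    bootstrap_obs: observations at step t+n used for bootstrapping
    dones: (batch,) flags, 1 if the episode ended within the n steps"""
    # Discounted sum of the n intermediate rewards
    discounts = gamma ** np.arange(n_step)                      # (n_step,)
    n_step_return = (rewards * discounts).sum(axis=1)           # (batch,)
    # Double Q-Learning: the online network selects the greedy action,
    # the target network evaluates it (reduces overestimation bias).
    best_actions = np.argmax(online_q(bootstrap_obs), axis=1)   # (batch,)
    bootstrap_values = target_q(bootstrap_obs)[
        np.arange(len(best_actions)), best_actions]             # (batch,)
    # Do not bootstrap past terminal transitions
    return n_step_return + (gamma ** n_step) * bootstrap_values * (1.0 - dones)
```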
## Installation

Just clone the repo; that's it.
## Usage

Just tune and run the agent from `src/train.py`. You can let the agent practice, observe how it plays, view plot statistics, and more. All generated output (models, plots, statistics, etc.) will be saved in `out/`.
Parameter | Reasonable value | Explanation | If too low, then | If too high, then
---|---|---|---|---
**General** | | | |
`num_parallel_inst` | 500 | Number of simultaneously executed environments | Training overhead dominates computation time | Possibly worse sample complexity, GPU or RAM out of memory
`num_parallel_steps` | 1000000 | Number of transitions done per parallel environment | Learning stops before agent performance is optimal | Wasted energy, overfitting
`policy` | "greedy" | The policy used for planning ("greedy" for max Q-value, "softmax" for random choice of softmaxed Q-values) | - | -
**Model input** | | | |
`stack_size` | 1 | Number of recent frames to be stacked for input, useful for envs with time dependency like Breakout | Agent has "no feeling for time", bad performance on envs with time dependency | Unnecessary computation overhead
**Learning target** | | | |
`gamma` | 0.999 | Discount factor | Short-sighted strategy, early events dominate return | Far-sighted strategy, late events dominate return, target shift, return explosion
`n_step` | 1 | Number of steps used for Temporal Difference (TD) bootstrapping | |
`use_mc_return` | False | If True, uses the Monte Carlo return instead of n-step TD | |
**Model** | | | |
`latent_dim` | 128 | Width of latent layer of the stem model | Model cannot learn the game entirely or makes slow training progress | Huge number of (unused) model parameters
`latent_depth` | 1 | Number of consecutive latent layers | Model cannot learn the game entirely or makes slow training progress | Many (unused) model parameters
`lstm_dim` | 128 | Width of LSTM layer | Model cannot learn the game entirely, slow training progress, or model is bad at remembering | Many (unused) model parameters
`latent_v_dim` | 64 | Width of latent layer of the value part of the Q-network | Similar to `latent_dim` | Similar to `latent_dim`
`latent_a_dim` | 64 | Width of latent layer of the advantage part of the Q-network | Similar to `latent_dim` | Similar to `latent_dim`
**Replay (training)** | | | |
`optimizer` | tf.Adam | The TensorFlow optimizer to use | - | -
`mem_size` | 4000000 | Number of transitions that fit into the replay memory | Overfitting to recent observations | RAM out of memory; too old transitions in replays, which for RNNs can lead to recurrent state staleness due to representational drift; large computation overhead due to replay sampling
`replay_period` | 64 | Number of (parallel) steps between each training session of the learner | Training overhead dominates computation time | Slow training progress
`replay_size_multiplier` | 4 | Determines replay size by multiplying the number of new transitions with this factor | Too strong focus on new observations, overfitting | Too weak focus on new observations, slow training progress
`replay_batch_size` | 1024 | Batch size used for learning, depends on GPU | GPU parallelization not used effectively | GPU out of memory
`replay_epochs` | 1 | Number of epochs per replay | Wasted training data, slow progress | Overfitting
`min_hist_len` | 0 | Minimum number of observed transitions before training is allowed | Unstable training or overfitting in the beginning | Wasted time
`alpha` | 0.7 | Prioritized experience replay exponent, controls the effect of priority | Priorities have low/no effect on training, slower training progress | Too strong priority emphasis, overfitting
`use_double` | True | Whether to use Double Q-Learning to tackle the moving target issue | - | -
`target_sync_period` | 256 | Number of (parallel) steps between synchronizations of online learner and target learner | Moving target problem | Slow training progress
`actor_sync_period` | 64 | Distributed RL: number of (parallel) steps between synchronizations of (online) learner and actor | ? | Slow training progress
**Learning rate** | | | |
`init_value` | 0.0004 | Starting value of the learning rate | Slow training progress | No training progress, bad maximum performance, or unstable training
`half_life_period` | 4000000 | Determines learning rate decay; number of played transitions after which the learning rate halves | Learning rate decreases too quickly | Learning rate stays too large
`warmup_transitions` | 0 | Number of episodes used for (linear) learning rate warm-up | Unstable training | Slow training progress
**Epsilon** | | | |
`init_value` | 1 | Starting value for the epsilon-greedy policy | Too little exploration, slow training progress | Too much exploration, slow training progress
`decay_mode` | "exp" | Shape of the epsilon value function over time | - | -
`half_life_period` | 700000 | Determines epsilon annealing; number of played transitions after which epsilon halves | Similar to `init_value` | Similar to `init_value`
`minimum` | 0 | Limit to which epsilon converges when decaying over time | Not enough long-term exploration, missed late-game opportunities for better performance | Similar to `init_value`
**Sequential training (for RNNs)** | | | |
`sequence_len` | 20 | Length of sequences saved in the replay buffer and used for learning | Slow training progress | Wasted computation and memory resources
`sequence_shift` | 10 | Number of transitions by which sequences are allowed to overlap | Few sequences, some time dependencies might not be captured | Too many similar sequences, overfitting
`eta` | 0.9 | For sequential learning: determines sequence priority. 0: sequence priority = average instance priority; 1: sequence priority = max instance priority | ? | ?
**Other** | | | |
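For orientation, the "reasonable values" above can be collected into a single configuration. The grouping and dictionary layout below are purely illustrative and may differ from the actual interface in `src/train.py`; the helper at the end sketches the exponential half-life decay described for the learning rate and epsilon.

```python
# Illustrative configuration assembled from the table's "reasonable values".
# The grouping/keys are assumptions; src/train.py may organize them differently.
hyperparams = {
    "general": {"num_parallel_inst": 500,
                "num_parallel_steps": 1000000,
                "policy": "greedy"},
    "model_input": {"stack_size": 1},
    "learning_target": {"gamma": 0.999, "n_step": 1, "use_mc_return": False},
    "model": {"latent_dim": 128, "latent_depth": 1, "lstm_dim": 128,
              "latent_v_dim": 64, "latent_a_dim": 64},
    "replay": {"optimizer": "tf.Adam", "mem_size": 4000000,
               "replay_period": 64, "replay_size_multiplier": 4,
               "replay_batch_size": 1024, "replay_epochs": 1,
               "min_hist_len": 0, "alpha": 0.7, "use_double": True,
               "target_sync_period": 256, "actor_sync_period": 64},
    "learning_rate": {"init_value": 0.0004, "half_life_period": 4000000,
                      "warmup_transitions": 0},
    "epsilon": {"init_value": 1, "decay_mode": "exp",
                "half_life_period": 700000, "minimum": 0},
    "sequential": {"sequence_len": 20, "sequence_shift": 10, "eta": 0.9},
}

def half_life_decay(step, init_value, half_life_period, minimum=0.0):
    """Exponential decay as described in the table: the value halves every
    `half_life_period` played transitions and converges to `minimum`."""
    return max(minimum, init_value * 0.5 ** (step / half_life_period))

# Example: epsilon after 1.4 million transitions with the values above
eps = half_life_decay(1400000, init_value=1, half_life_period=700000, minimum=0)
# eps == 0.25
```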
## Contributing

- Fork it!
- Create your feature branch: `git checkout -b my-new-feature`
- Commit your changes: `git commit -am 'Add some feature'`
- Push to the branch: `git push origin my-new-feature`
- Submit a pull request :D
## Troubleshooting

- Symptom: level objects don't show up in Science Birds after a level was loaded. The actual problem is that the objects spawn outside of the level boundaries. The reason turned out to be the OS's language/unit configuration. In my case the system language is de_DE, for which the decimal separator is a comma rather than a point (e.g. 2,7). Unity3D uses this system setting for its coordinate system, so coordinates like x=2.5, y=8.4 could not be interpreted correctly.
- Solution: start Science Birds with the locale en_US.UTF-8 so that Unity3D uses points for floats instead of commas, or (on Windows) set the OS's region to English (U.S.). A small launcher sketch follows below.
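On Linux, the locale can be forced when launching the game. The snippet below is only a sketch of that workaround; the executable name `ScienceBirds.x86_64` is a placeholder and depends on your Science Birds build.

```python
# Sketch of the Linux workaround: launch Science Birds with an English locale
# so Unity3D parses floats with a decimal point instead of a comma.
# The executable name below is a placeholder for your actual build.
import os
import subprocess

env = os.environ.copy()
env["LANG"] = "en_US.UTF-8"
env["LC_ALL"] = "en_US.UTF-8"
subprocess.Popen(["./ScienceBirds.x86_64"], env=env)
```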
## Acknowledgements

- The team behind Science Birds for a good framework