This repository contains a TensorFlow implementation of a Fully Convolutional Network (FCN) used to label image pixels in the context of semantic scene understanding:
The implementation is based on the Fully Convolutional Networks for Semantic Segmentation paper by Evan Shelhamer, Jonathan Long and Trevor Darrell (the original Caffe implementation can be found here).
The model uses a VGG16 network as the encoder; a decoder is then added to upsample the feature maps back to the original image size, using 1x1 convolutions followed by transposed convolutions. Additionally, skip connections are used to bring in finer spatial information from earlier layers.
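For illustration, here is a minimal sketch of such a decoder in TensorFlow 1.x; the tensor names, kernel sizes and layer arrangement are illustrative assumptions, not necessarily the exact code in this repository:

import tensorflow as tf

def decoder(vgg_layer3_out, vgg_layer4_out, vgg_layer7_out, num_classes):
    # 1x1 convolutions reduce each encoder output to num_classes channels
    score7 = tf.layers.conv2d(vgg_layer7_out, num_classes, 1, padding='same')
    score4 = tf.layers.conv2d(vgg_layer4_out, num_classes, 1, padding='same')
    score3 = tf.layers.conv2d(vgg_layer3_out, num_classes, 1, padding='same')

    # Transposed convolution upsamples x2, then the pool4 skip is added
    up7 = tf.layers.conv2d_transpose(score7, num_classes, 4, strides=2, padding='same')
    fuse4 = tf.add(up7, score4)

    # Upsample x2 again and add the pool3 skip
    up4 = tf.layers.conv2d_transpose(fuse4, num_classes, 4, strides=2, padding='same')
    fuse3 = tf.add(up4, score3)

    # Final x8 upsampling back to the original image resolution
    return tf.layers.conv2d_transpose(fuse3, num_classes, 16, strides=8, padding='same')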
This project was implemented using TensorFlow and you'll need a set of dependencies in order to run the code, in particular:
- Python 3 (3.6)
- TensorFlow (1.12)
- NumPy
- SciPy 1.1.0 (on pip: pip install scipy==1.1.0)
- Matplotlib
- imageio
- tqdm
- OpenCV
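On a fresh environment, the Python dependencies (besides TensorFlow) can typically be installed in one go; note that opencv-python is the pip package name for OpenCV:

$ pip install numpy scipy==1.1.0 matplotlib imageio tqdm opencv-python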
Given the complexity of the model, a GPU is strongly suggested for training. A good and relatively cheap option is an EC2 instance on AWS: for example, a p2.xlarge instance is a good candidate for this type of task (you'll have to request a limit increase for this instance type). Alternatively, a cheaper option (the one I used during training) is the GPU graphics instance g3s.xlarge; the instance is relatively slower, but its M60 GPU is newer and faster than the K80 on the p2.xlarge, even though it has less memory (8 GB vs 12 GB).
You can use the official Deep Learning AMI from Amazon, which contains most of the required dependencies aside from tqdm (see https://docs.aws.amazon.com/dlami/latest/devguide/gs.html).
The main.py script can be run as follows:
$ python main.py [flags]
Where flags can be set to:
- [--data_dir]: The folder containing the training data (default ./data)
- [--runs_dir]: The folder where the output is saved (default ./runs)
- [--model_folder]: The folder where the model is saved/loaded (default ./models/[generated_name])
- [--epochs]: The number of epochs (default 80)
- [--batch_size]: The batch size (default 25)
- [--dropout]: The dropout probability (default 0.5)
- [--learning_rate]: The learning rate (default 0.0001)
- [--l2_reg]: The amount of L2 regularization (default 0.001)
- [--scale]: True if scaling should be applied to layers 3 and 4 of VGG (default True)
- [--early_stopping]: The number of epochs after which the training is stopped if the loss didn't improve (default 4); see the sketch after this list
- [--seed]: Integer used to seed random ops for reproducibility (default None)
- [--cpu]: If True disables the GPU (default None)
- [--tests]: If True runs the tests (default True)
- [--train]: If True runs the training (default True); if a model checkpoint exists in the model_folder the weights will be reloaded
- [--image]: Image path to run inference for (default None)
- [--video]: Video path to run inference for (default None)
- [--augment]: Path to the target folder where to save augmented data from data_dir (default None)
- [--serialize]: Path of a non-existing folder where the pb version of the checkpoint saved during training will be stored (default None)
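As a rough illustration of the early stopping behaviour mentioned above (the loop and helper below are illustrative stand-ins, not the script's actual code):

import random

def train_one_epoch():
    # Stand-in for a real training epoch; returns the mean epoch loss
    return random.random()

epochs, early_stopping = 80, 4
best_loss, stale_epochs = float('inf'), 0
for epoch in range(epochs):
    loss = train_one_epoch()
    if loss < best_loss:
        best_loss, stale_epochs = loss, 0   # improvement: reset the counter
    else:
        stale_epochs += 1                   # no improvement this epoch
        if stale_epochs >= early_stopping:  # stop after 4 epochs without improvement
            break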
The script will save summaries for TensorBoard in the logs folder:
$ tensorboard --samples_per_plugin images=0 --logdir=logs
The summaries include the training loss, accuracy and intersection over union (IOU) metrics. Images with the predicted results are also saved:
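A minimal sketch of how such summaries can be wired up in TensorFlow 1.x (the placeholder shapes and tensor names are assumptions for illustration):

import tensorflow as tf

num_classes = 2  # road vs non-road
logits = tf.placeholder(tf.float32, [None, None, None, num_classes])
ground_truth = tf.placeholder(tf.float32, [None, None, None, num_classes])

predictions = tf.argmax(logits, axis=-1)
labels = tf.argmax(ground_truth, axis=-1)

# tf.metrics return a (value, update_op) pair; the update op must be run
# alongside the training step for the running metric to accumulate
iou, iou_op = tf.metrics.mean_iou(labels, predictions, num_classes)
acc, acc_op = tf.metrics.accuracy(labels, predictions)

tf.summary.scalar('iou', iou)
tf.summary.scalar('accuracy', acc)
summary_op = tf.summary.merge_all()
writer = tf.summary.FileWriter('logs')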
An example of running a training session for 10 epochs with a batch size of 10 and a learning rate of 0.001, saving the model into models\my_model:
$ python main.py --tests=false --epochs=10 --batch_size=10 --learning_rate=0.001 --model_folder=models\\my_model
An example of processing a single image image.png using a model saved into models\my_model:
$ python main.py --tests=false --model_folder=models\\my_model --image=image.png
An example of processing a video video.mp4 using a model saved into models\my_model:
$ python main.py --tests=false --model_folder=models\\my_model --video=video.mp4
An example of augmenting the dataset in the data folder and saving the result in data\augmented (expects the training data to be in data\data_road\training):
$ python main.py --tests=false --data_dir=data --augment=data\\augmented
An example of serializing a model to a protocol buffer in models\my_model\serialized from a checkpoint in models\my_model:
$ python main.py --tests=false --model_folder=models\\my_model --serialize=models\\my_model\\serialized
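Internally, serializing to a .pb typically means freezing the checkpoint variables into constants. A hedged sketch of that step in TensorFlow 1.x (the checkpoint path and output node name below are assumptions, not necessarily the ones this script uses):

import tensorflow as tf
from tensorflow.python.framework import graph_util

with tf.Session() as sess:
    # Restore the graph and weights from the training checkpoint
    saver = tf.train.import_meta_graph('models/my_model/model.meta')
    saver.restore(sess, tf.train.latest_checkpoint('models/my_model'))
    # Fold the variables into constants, keeping only the inference path
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), ['predictions'])  # output node name assumed
    with tf.gfile.GFile('models/my_model/serialized/model.pb', 'wb') as f:
        f.write(frozen.SerializeToString())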
In order to train the network we used the Kitti Road dataset, which can be downloaded from here. It contains both training and testing images; the ground truth images for the training set are labelled with the correct pixel categorization (road vs non-road):
The Kitti dataset contains 289 labelled samples. To improve the model's performance the dataset can be easily augmented: the repository contains a Python script that simply mirrors the images and applies a random amount of brightness:
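The sketch below shows what such an augmentation can look like with OpenCV; the function name and the brightness range are assumptions, not necessarily what the repository's script does:

import cv2
import numpy as np

def augment(image, label):
    # Mirror the image and its ground-truth label together
    flipped = cv2.flip(image, 1)
    flipped_label = cv2.flip(label, 1)

    # Apply a random brightness offset in HSV space, clipping to [0, 255]
    hsv = cv2.cvtColor(flipped, cv2.COLOR_BGR2HSV).astype(np.int16)
    hsv[..., 2] = np.clip(hsv[..., 2] + np.random.randint(-40, 41), 0, 255)
    brightened = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

    return brightened, flipped_label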
The training was performed with various hyperparameters, starting from the following baseline:
- Epochs: 50
- Batch Size: 15
- Learning Rate: 0.001
- Dropout: 0.5
- L2 Regularization: 0.001
- Scaling: False
Note that scaling is a technique described in the original implementation, where it is used in what the authors call "at-once" training: the outputs of pooling layers 3 and 4 of the VGG16 model are scaled before the 1x1 convolution is applied (see https://github.com/shelhamer/fcn.berkeleyvision.org/blob/1305c7378a9f0ab44b2c936f4d60e4687e3d8743/voc-fcn8s-atonce/net.py#L65).
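In TensorFlow terms this amounts to a constant multiplication before the 1x1 convolutions; a minimal sketch, with the 0.0001 and 0.01 factors taken from the referenced net.py and the placeholder tensor names assumed:

import tensorflow as tf

num_classes = 2
vgg_layer3_out = tf.placeholder(tf.float32, [None, None, None, 256])
vgg_layer4_out = tf.placeholder(tf.float32, [None, None, None, 512])

# Scale the pooling outputs before scoring them, as in the "at-once" recipe
pool3_scaled = tf.multiply(vgg_layer3_out, 0.0001)
pool4_scaled = tf.multiply(vgg_layer4_out, 0.01)

score3 = tf.layers.conv2d(pool3_scaled, num_classes, 1, padding='same')
score4 = tf.layers.conv2d(pool4_scaled, num_classes, 1, padding='same')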
Various experiments with different configurations were needed to tune the model:
The following shows a sample of images obtained with the various configurations:
As we can see, scaling smooths the results and augmenting the dataset helped produce more accurate predictions:
With the base learning rate (without decay) the model would converge but stop learning after around 30-40 epochs. Lowering the learning rate on the augmented dataset allowed training for 80 epochs, which yielded the best accuracy:
Augmented Dataset, Scaling ON and Learning Rate 0.0001
The parameters used for the final training (in one shot):
- Epochs: 80
- Batch Size: 25
- Learning Rate: 0.0001
- Dropout: 0.5
- L2 Regularization: 0.001
- Scaling: True