atilaromero/dissertation-atila

Neural networks applied to file fragment classification

File fragment classification aims to identify the original type of file from which a given block of data was extracted. Machine learning techniques, and neural networks in particular, have the potential to improve this field, because supporting a new file type with traditional methods is laborious and not automatic. While studies applying neural networks to file fragment classification have shown good results, applying those contributions in real scenarios requires a better understanding of how these methods respond to an increase in the number of supported file types and of what the main sources of error are.

This work focuses on the challenges of applying neural networks to file fragment classification and is organized into three parts. In the first part, the accuracy of several types of neural networks is compared on file fragments taken from the Govdocs1 dataset, using their file extensions as class labels. The results suggest that the file types composing the dataset play a more relevant role in the results than the architectural details of the neural network models. In the second part, the influence of the number of classes on accuracy is explored. The conclusion is that the number of classes in the dataset is relevant, but less important than which extensions are selected to compose the dataset. Finally, a procedure is devised to test the hypothesis that part of the errors in file fragment classification can be explained by the inability of the models to distinguish high entropy data from random data. The conclusion is that, for the model used, this inability can account for only about 1/3 of the errors. This suggests that, for future research on file fragment classification, the search for a procedure to automatically label inner file data structures may be a promising approach.
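To make the notion of "high entropy data" concrete, the sketch below splits a byte string into fixed-size fragments and computes the Shannon entropy of each one. It is an illustration only, not the procedure used in the dissertation: the 512-byte block size and the names `BLOCK_SIZE`, `split_into_blocks`, and `shannon_entropy` are assumptions introduced here.

```python
import math
from collections import Counter

# Assumed fragment size for illustration; this README does not state one.
BLOCK_SIZE = 512


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into full-size blocks, dropping a short trailing remainder."""
    last_start = len(data) - block_size
    return [data[i:i + block_size] for i in range(0, last_start + 1, block_size)]


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)  # iterating bytes yields ints in 0..255
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    import os

    # Hypothetical inputs: a constant block, then a block of random bytes.
    sample = b"\x00" * BLOCK_SIZE + os.urandom(BLOCK_SIZE)
    for index, block in enumerate(split_into_blocks(sample)):
        print(index, round(shannon_entropy(block), 3))
```

Fragments of compressed or encrypted content tend to score close to the 8 bits per byte upper bound, much like random bytes do, so a byte-level statistic such as this one gives little help in telling them apart. That is the kind of confusion the third part of the work sets out to measure.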

Link to full dissertation: https://github.com/atilaromero/dissertation-atila/raw/master/dissertation_atila.pdf
