File type identification of data fragments based on entropy and pattern recognition techniques

  • Khoa Canh Nguyen

    Student thesis: Doctoral Thesis


    The increase in computer-related crime has led law enforcement agencies to seize digital evidence in the form of network logs, text documents, videos, and images. However, digital evidence rarely arrives in complete form. In many cases, forensic analysts must therefore examine a single piece of data, such as a disk block or a network packet, and identify its file type before they can retrieve the information it contains. Much research effort has gone into this problem, yet no approach reported in the literature achieves high identification rates for most file types. This research project focuses on identifying the file type of high-entropy file fragments that belong to widely used file formats. In particular, it considers file fragments that share the same compression method, namely the deflate algorithm. It is shown that sharing an encoding method makes file fragment identification harder, because compression algorithms eliminate the statistical features that could otherwise be used to distinguish between file types. Deflate-encoded data must first be detected and then decompressed to recover the underlying data, which has discernible patterns; these patterns are then combined with pattern recognition techniques to identify the file types of the fragments. This thesis also exploits entropy, a measure of the randomness of data. Entropy is used to cluster file fragments into three groups, so that all fragments in a group fall within the same entropy range: low, medium, or high. Different pattern recognition models and different feature sets are built to identify the file type for executable, PDF compound, and PNG image file formats. Evaluation experiments performed on the public dataset Govdocs1 and on other datasets show that the proposed techniques achieve very good detection accuracy.
In digital forensics, practical tools are crucial. The experiments also show that the proposed methods can be implemented in real-life applications.
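
The entropy-based grouping described in the abstract can be illustrated with a minimal sketch. It computes the Shannon entropy of a fragment in bits per byte and assigns the fragment to a low, medium, or high group. The function names and the two thresholds (3.0 and 7.0) are assumptions chosen for illustration only; the abstract does not state the ranges used in the thesis.

```python
import math
from collections import Counter

# Hypothetical thresholds in bits per byte; the thesis abstract does not
# give the actual boundaries between the low, medium, and high groups.
LOW_MAX = 3.0
HIGH_MIN = 7.0


def shannon_entropy(fragment: bytes) -> float:
    """Return the Shannon entropy of a fragment in bits per byte (0.0 to 8.0)."""
    if not fragment:
        return 0.0
    total = len(fragment)
    counts = Counter(fragment)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_group(fragment: bytes,
                  low_max: float = LOW_MAX,
                  high_min: float = HIGH_MIN) -> str:
    """Assign a fragment to the 'low', 'medium', or 'high' entropy group."""
    h = shannon_entropy(fragment)
    if h < low_max:
        return "low"
    if h >= high_min:
        return "high"
    return "medium"
```

Compressed data such as deflate streams tends toward the upper end of the scale, which is why fragments in the high group cannot be separated by byte statistics alone and need the decompression step the abstract describes.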
    Date of Award: 2016
    Original language: English
    Supervisor: Dat Tran (Supervisor), Wanli Ma (Supervisor) & Dharmendra Sharma AM PhD (Supervisor)
