Decision tree algorithms for image data type identification

Research output: Contribution to journal › Article

Abstract

Identifying the file type of file fragments has been investigated for a long time, but it remains a challenge. The literature shows that high-entropy file fragments make the problem more complicated. In particular, popular file types share the same compression algorithms, such as the deflate algorithm, which makes file type identification for file fragments harder. Machine learning or empirical techniques are applied to deal with this problem. Compression algorithms are used to reduce the size of large files, including image files. Much research on file type identification has been done for the JPEG format, and the Rate of Change feature has proven to work effectively for it. Conversely, few efforts have been made for PNG, although it is a popular and widely used image format. In this article, we propose a new approach based on deflate-encoded data detection, entropy-based clustering, and decision tree techniques to identify PNG data fragments, which are deflate-encoded fragments. Experiments showed high accuracy rates for the proposed method.
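
The abstract's point about high-entropy fragments can be illustrated with a short computation. The following is a minimal sketch, not the authors' method: it assumes Shannon entropy is measured over byte values (bits per byte), and the function name, the 4096-byte fragment size, and the sample data are hypothetical choices made only for illustration. Random bytes stand in for compressed data here.

```python
import math
import os
from collections import Counter


def shannon_entropy(fragment: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not fragment:
        return 0.0
    total = len(fragment)
    counts = Counter(fragment)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical 4096-byte fragments, for illustration only.
text_like = (b"the quick brown fox jumps over the lazy dog " * 100)[:4096]
random_like = os.urandom(4096)  # assumption: a stand-in for compressed data

print(f"text-like fragment:   {shannon_entropy(text_like):.2f} bits/byte")
print(f"random-like fragment: {shannon_entropy(random_like):.2f} bits/byte")
```

Fragments whose byte distribution is close to uniform score near the 8 bits/byte maximum regardless of which format produced them, which is consistent with the abstract's observation that entropy alone does not separate file types sharing a compression algorithm.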
Original language: English
Pages (from-to): 67-82
Number of pages: 16
Journal: Interest Group in Pure and Applied Logics. Logic Journal
Volume: 25
Issue number: 1
DOIs: 10.1093/jigpal/jzw045
Publication status: Published - 2017

Fingerprint

File
Decision Tree
Compression
Entropy
Experiment
Machine Learning
Causes

Cite this

@article{9148d43a3e714adabb366eda34bdb3de,
title = "Decision tree algorithms for image data type identification",
keywords = "Decision tree algorithm, File fragment identification, Image data type identification, PNG, SVM, Shannon entropy",
author = "Khoa NGUYEN and Dat TRAN and Wanli MA and Dharmendra SHARMA",
year = "2017",
doi = "10.1093/jigpal/jzw045",
language = "English",
volume = "25",
pages = "67--82",
journal = "Interest Group in Pure and Applied Logics. Logic Journal",
issn = "1367-0751",
publisher = "Oxford University Press",
number = "1",

}

Decision tree algorithms for image data type identification. / NGUYEN, Khoa; TRAN, Dat; MA, Wanli; SHARMA, Dharmendra.

In: Interest Group in Pure and Applied Logics. Logic Journal, Vol. 25, No. 1, 2017, p. 67-82.

TY - JOUR

T1 - Decision tree algorithms for image data type identification

AU - NGUYEN, Khoa

AU - TRAN, Dat

AU - MA, Wanli

AU - SHARMA, Dharmendra

PY - 2017

Y1 - 2017

KW - Decision tree algorithm

KW - File fragment identification

KW - Image data type identification

KW - PNG

KW - SVM

KW - Shannon entropy

UR - http://www.scopus.com/inward/record.url?scp=85014717038&partnerID=8YFLogxK

UR - http://www.mendeley.com/research/decision-tree-algorithms-image-data-type-identification

U2 - 10.1093/jigpal/jzw045

DO - 10.1093/jigpal/jzw045

M3 - Article

VL - 25

SP - 67

EP - 82

JO - Interest Group in Pure and Applied Logics. Logic Journal

JF - Interest Group in Pure and Applied Logics. Logic Journal

SN - 1367-0751

IS - 1

ER -