Pairwise comparative classification for translator stylometric analysis

Heba El-Fiqi, Eleni Petraki, Hussein A. Abbass

Research output: Contribution to journal › Article

Abstract

In this article, we present a new type of classification problem, which we call the Comparative Classification Problem (CCP), where we use the term data record to refer to a block of instances. Given a single data record with n instances for n classes, the CCP is to map each instance to a unique class. This problem occurs in a wide range of applications where the independent and identically distributed (iid) assumption breaks down. The primary difference between CCP and classical classification is that in the latter, the assignment of a translator to one record is independent of the assignment of a translator to a different record. In CCP, however, the assignment of a translator to one record within a block excludes this translator from further assignments to any other record in that block. The interdependency in the data poses challenges for techniques relying on the iid assumption. In the Pairwise CCP (PWCCP), a pair of records is grouped together. The key difference between PWCCP and classical binary classification problems is that hidden patterns can only be unmasked by comparing the instances as pairs. In this article, we introduce a new algorithm, PWC4.5, which is based on C4.5, to manage PWCCP. We first show that a simple transformation, which we call Gradient-Based Transformation (GBT), can address the violation of the iid assumption in C4.5. We then evaluate PWC4.5 using two real-world corpora to distinguish between translators on Arabic-English and French-English translations. While the traditional C4.5 failed to distinguish between different translators, GBT demonstrated better performance. Meanwhile, PWC4.5 consistently provided the best results over C4.5 and GBT.
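The uniqueness constraint described in the abstract can be illustrated with a small sketch. This is not the authors' PWC4.5 algorithm or their Gradient-Based Transformation; it only contrasts independent per-instance labelling with a block-level assignment in which each class is used exactly once. The function names and the score values are hypothetical, chosen purely for illustration.

```python
from itertools import permutations


def independent_assignment(scores):
    """Classical setting: each instance takes its own best-scoring class."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def comparative_assignment(scores):
    """CCP-style setting: every class is used exactly once within the block.

    Brute force over all permutations, which is only practical for small
    blocks such as the pairwise (n = 2) case.
    """
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Hypothetical scores: rows are the two translations in one block,
# columns are candidate translators A (index 0) and B (index 1).
block = [[0.9, 0.1],
         [0.6, 0.4]]

print(independent_assignment(block))  # [0, 0]: translator A is chosen twice
print(comparative_assignment(block))  # [0, 1]: each translator is chosen once
```

In the pairwise case, assigning one translation to a translator fixes the other assignment, which is why the two instances have to be considered together rather than one at a time.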

Original language: English
Article number: 2
Pages (from-to): 1-26
Number of pages: 26
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 16
Issue number: 1
DOI: 10.1145/2898997
Publication status: Published - 2016

Cite this

@article{336fbfc5d7a542b2908da2f042654215,
title = "Pairwise comparative classification for translator stylometric analysis",
abstract = "In this article, we present a new type of classification problem, which we call Comparative Classification Problem (CCP), where we use the term data record to refer to a block of instances. Given a single data record with n instances for n classes, the CCP problem is to map each instance to a unique class. This problem occurs in a wide range of applications where the independent and identically distributed assumption is broken down. The primary difference between CCP and classical classification is that in the latter, the assignment of a translator to one record is independent of the assignment of a translator to a different record. In CCP, however, the assignment of a translator to one record within a block excludes this translator from further assignments to any other record in that block. The interdependency in the data poses challenges for techniques relying on the independent and identically distributed (iid) assumption. In the Pairwise CCP (PWCCP), a pair of records is grouped together. The key difference between PWCCP and classical binary classification problems is that hidden patterns can only be unmasked by comparing the instances as pairs. In this article, we introduce a new algorithm, PWC4.5, which is based on C4.5, to manage PWCCP. We first show that a simple transformation-that we call Gradient-Based Transformation (GBT)-can fix the problem of iid in C4.5. We then evaluate PWC4.5 using two real-world corpora to distinguish between translators on Arabic-English and French-English translations. While the traditional C4.5 failed to distinguish between different translators, GBT demonstrated better performance. Meanwhile, PWC4.5 consistently provided the best results over C4.5 and GBT.",
keywords = "Arabic translation, Classification, Translator stylometry",
author = "Heba El-Fiqi and Eleni Petraki and Abbass, {Hussein A.}",
year = "2016",
doi = "10.1145/2898997",
language = "English",
volume = "16",
pages = "1--26",
journal = "ACM Transactions on Asian and Low-Resource Language Information Processing",
issn = "2375-4699",
publisher = "Association for Computing Machinery (ACM)",
number = "1",
}

Pairwise comparative classification for translator stylometric analysis. / El-Fiqi, Heba; Petraki, Eleni; Abbass, Hussein A.

In: ACM Transactions on Asian and Low-Resource Language Information Processing, Vol. 16, No. 1, Article 2, 2016, p. 1-26.

TY - JOUR

T1 - Pairwise comparative classification for translator stylometric analysis

AU - El-Fiqi, Heba

AU - Petraki, Eleni

AU - Abbass, Hussein A.

PY - 2016

Y1 - 2016

AB - In this article, we present a new type of classification problem, which we call Comparative Classification Problem (CCP), where we use the term data record to refer to a block of instances. Given a single data record with n instances for n classes, the CCP problem is to map each instance to a unique class. This problem occurs in a wide range of applications where the independent and identically distributed assumption is broken down. The primary difference between CCP and classical classification is that in the latter, the assignment of a translator to one record is independent of the assignment of a translator to a different record. In CCP, however, the assignment of a translator to one record within a block excludes this translator from further assignments to any other record in that block. The interdependency in the data poses challenges for techniques relying on the independent and identically distributed (iid) assumption. In the Pairwise CCP (PWCCP), a pair of records is grouped together. The key difference between PWCCP and classical binary classification problems is that hidden patterns can only be unmasked by comparing the instances as pairs. In this article, we introduce a new algorithm, PWC4.5, which is based on C4.5, to manage PWCCP. We first show that a simple transformation-that we call Gradient-Based Transformation (GBT)-can fix the problem of iid in C4.5. We then evaluate PWC4.5 using two real-world corpora to distinguish between translators on Arabic-English and French-English translations. While the traditional C4.5 failed to distinguish between different translators, GBT demonstrated better performance. Meanwhile, PWC4.5 consistently provided the best results over C4.5 and GBT.

KW - Arabic translation

KW - Classification

KW - Translator stylometry

UR - http://www.scopus.com/inward/record.url?scp=84997146385&partnerID=8YFLogxK

U2 - 10.1145/2898997

DO - 10.1145/2898997

M3 - Article

VL - 16

SP - 1

EP - 26

JO - ACM Transactions on Asian and Low-Resource Language Information Processing

JF - ACM Transactions on Asian and Low-Resource Language Information Processing

SN - 2375-4699

IS - 1

M1 - 2

ER -