TY - JOUR
T1 - Pairwise comparative classification for translator stylometric analysis
AU - El-Fiqi, Heba
AU - Petraki, Eleni
AU - Abbass, Hussein A.
PY - 2016
Y1 - 2016
N2 - In this article, we present a new type of classification problem, which we call Comparative Classification Problem (CCP), where we use the term data record to refer to a block of instances. Given a single data record with n instances for n classes, the CCP problem is to map each instance to a unique class. This problem occurs in a wide range of applications where the independent and identically distributed assumption is broken down. The primary difference between CCP and classical classification is that in the latter, the assignment of a translator to one record is independent of the assignment of a translator to a different record. In CCP, however, the assignment of a translator to one record within a block excludes this translator from further assignments to any other record in that block. The interdependency in the data poses challenges for techniques relying on the independent and identically distributed (iid) assumption. In the Pairwise CCP (PWCCP), a pair of records is grouped together. The key difference between PWCCP and classical binary classification problems is that hidden patterns can only be unmasked by comparing the instances as pairs. In this article, we introduce a new algorithm, PWC4.5, which is based on C4.5, to manage PWCCP. We first show that a simple transformation-that we call Gradient-Based Transformation (GBT)-can fix the problem of iid in C4.5. We then evaluate PWC4.5 using two real-world corpora to distinguish between translators on Arabic-English and French-English translations. While the traditional C4.5 failed to distinguish between different translators, GBT demonstrated better performance. Meanwhile, PWC4.5 consistently provided the best results over C4.5 and GBT.
AB - In this article, we present a new type of classification problem, which we call Comparative Classification Problem (CCP), where we use the term data record to refer to a block of instances. Given a single data record with n instances for n classes, the CCP problem is to map each instance to a unique class. This problem occurs in a wide range of applications where the independent and identically distributed assumption is broken down. The primary difference between CCP and classical classification is that in the latter, the assignment of a translator to one record is independent of the assignment of a translator to a different record. In CCP, however, the assignment of a translator to one record within a block excludes this translator from further assignments to any other record in that block. The interdependency in the data poses challenges for techniques relying on the independent and identically distributed (iid) assumption. In the Pairwise CCP (PWCCP), a pair of records is grouped together. The key difference between PWCCP and classical binary classification problems is that hidden patterns can only be unmasked by comparing the instances as pairs. In this article, we introduce a new algorithm, PWC4.5, which is based on C4.5, to manage PWCCP. We first show that a simple transformation-that we call Gradient-Based Transformation (GBT)-can fix the problem of iid in C4.5. We then evaluate PWC4.5 using two real-world corpora to distinguish between translators on Arabic-English and French-English translations. While the traditional C4.5 failed to distinguish between different translators, GBT demonstrated better performance. Meanwhile, PWC4.5 consistently provided the best results over C4.5 and GBT.
KW - Arabic translation
KW - Classification
KW - Translator stylometry
KW - translator stylometry
KW - classification
UR - http://www.scopus.com/inward/record.url?scp=84997146385&partnerID=8YFLogxK
U2 - 10.1145/2898997
DO - 10.1145/2898997
M3 - Article
AN - SCOPUS:84997146385
SN - 2375-4699
VL - 16
SP - 1
EP - 26
JO - ACM Transactions on Asian and Low-Resource Language Information Processing
JF - ACM Transactions on Asian and Low-Resource Language Information Processing
IS - 1
M1 - 2
ER -