In this article, we present a new type of classification problem, which we call the Comparative Classification Problem (CCP), where we use the term data record to refer to a block of instances. Given a single data record with n instances for n classes, the CCP is to map each instance to a unique class. This problem occurs in a wide range of applications where the independent and identically distributed (iid) assumption breaks down. The primary difference between CCP and classical classification is that in the latter, the assignment of a class (in our application, a translator) to one instance is independent of the assignment of a class to any other instance. In CCP, however, assigning a translator to one instance within a block excludes this translator from being assigned to any other instance in that block. This interdependency in the data poses challenges for techniques that rely on the iid assumption. In the Pairwise CCP (PWCCP), a pair of instances is grouped together. The key difference between PWCCP and classical binary classification problems is that hidden patterns can only be unmasked by comparing the instances as pairs. In this article, we introduce a new algorithm, PWC4.5, which is based on C4.5, to handle PWCCP. We first show that a simple transformation, which we call Gradient-Based Transformation (GBT), can address the violation of the iid assumption in C4.5. We then evaluate PWC4.5 using two real-world corpora to distinguish between translators on Arabic-English and French-English translations. While the traditional C4.5 failed to distinguish between different translators, GBT performed better, and PWC4.5 consistently outperformed both C4.5 and GBT.
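The uniqueness constraint that separates CCP from classical classification can be illustrated with a minimal sketch. It is not the PWC4.5 algorithm or the Gradient-Based Transformation from the article; the score matrix, the function names, and the brute-force search are all assumptions made purely for illustration. Rows are the instances of one block, columns are the classes (translators), and each entry is a hypothetical classifier score.

```python
from itertools import permutations


def independent_labels(scores):
    """Classical view: each instance takes its best-scoring class on its own."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def block_assignment(scores):
    """CCP view: choose the one-to-one instance-to-class mapping
    with the highest total score (brute force over all permutations)."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )
    return list(best)


# Hypothetical scores for one block of two instances and two translators.
scores = [
    [0.9, 0.1],
    [0.6, 0.4],
]

print(independent_labels(scores))  # [0, 0]: both instances get translator 0
print(block_assignment(scores))    # [0, 1]: each translator is used exactly once
```

Labelling each instance independently gives both instances the same translator, which is not a valid CCP answer; the block-level assignment respects the constraint. Brute force is factorial in the block size, so it is only reasonable for very small blocks such as the pairs in PWCCP.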
|Number of pages||26|
|Journal||ACM Transactions on Asian and Low-Resource Language Information Processing|
|Publication status||Published - 2016|