TY - JOUR
T1 - Arabic text clustering using improved clustering algorithms with dimensionality reduction
AU - Sangaiah, Arun Kumar
AU - Fakhry, Ahmed E.
AU - Abdel-Basset, Mohamed
AU - El-henawy, Ibrahim
N1 - Publisher Copyright: © 2018, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2019/3/1
Y1 - 2019/3/1
N2 - Arabic text document clustering is important for providing conjectural navigation and browsing techniques by organizing massive amounts of data into a small number of defined clusters. However, representing documents as word vectors for clustering is often unsatisfactory because it ignores relationships between important terms. Cluster analysis separates data into groups or clusters for the purposes of improved understanding or summarization. Clustering has a long history, and many techniques have been developed in statistics, data mining, pattern recognition and other fields. This research proposes three approaches, namely unsupervised, semi-supervised, and semi-supervised with dimensionality reduction, to construct a clustering-based classifier for Arabic text documents. The approaches use k-means, incremental k-means, threshold + k-means and k-means with dimensionality reduction, applied after document preprocessing that removes stop words and extracts the root of each term in each document. A term weighting method is then applied to obtain the weight of each term with respect to its document, and a similarity measure is applied to compute the similarity of each document to the other documents. F-measure, entropy and a support vector machine (SVM) are used to calculate accuracy. The datasets are online dynamic datasets characterized by their availability and credibility on the internet. Arabic is a challenging language when applied in an inference-based algorithm, so selecting the appropriate dataset is a principal factor in such research. The accuracy of these methods is compared with that of other approaches, and the proposed methods show better accuracy and fewer errors for new classification test cases. The dimension reduction process is very sensitive, however, because increasing the ratio of reduction can destroy important terms.
KW - Arabic text
KW - Clustering algorithm and dimension reduction
KW - Pattern recognition
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=85042608730&partnerID=8YFLogxK
U2 - 10.1007/s10586-018-2084-4
DO - 10.1007/s10586-018-2084-4
M3 - Article
AN - SCOPUS:85042608730
SN - 1386-7857
VL - 22
SP - 4535
EP - 4549
JO - Cluster Computing
JF - Cluster Computing
ER -