A multi-layered approach for Arabic text diacritization

Aya S. Metwally, Mohsen A. Rashwan, Amir F. Atiya

Research output: A Conference proceeding or a Chapter in Book › Conference contribution › peer-review

Abstract

Text diacritization is a critical task that plays an important role in improving the performance of many NLP tasks for languages whose orthographies include diacritics. In this paper, we address the problem of Arabic text diacritization: our system diacritizes an input sequence of Arabic words both morphologically and syntactically. The operation of the system is divided into three layers: the first layer uses an HMM for the morphological diacritization of previously seen words, the second layer uses an external morphological analyzer for the morphological diacritization of out-of-vocabulary (OOV) words, and the third layer uses a CRF for the syntactic diacritization of all words. To evaluate the performance of the system, we used the benchmark LDC Arabic Treebank Part 3 datasets used by state-of-the-art systems. The proposed system achieved a morphological word error rate (WER) of 4.3% and a syntactic WER of 9.4%.
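
The three-layer flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function names (`diacritize`, `toy_hmm`, `toy_analyzer`, `toy_crf`), the per-word routing, and the Latin placeholder strings are all assumptions made for the example, and the toy layers merely tag words rather than running a real HMM, morphological analyzer, or CRF.

```python
from typing import Callable, List, Set

# Hypothetical layer signatures (assumptions, not taken from the paper).
MorphLayer = Callable[[List[str], int], str]    # (sentence, index) -> morphologically diacritized word
SyntaxLayer = Callable[[List[str]], List[str]]  # morph-diacritized sentence -> fully diacritized sentence


def diacritize(
    words: List[str],
    seen_vocab: Set[str],
    hmm_layer: MorphLayer,
    analyzer_layer: MorphLayer,
    crf_layer: SyntaxLayer,
) -> List[str]:
    """Route each word through layer 1 or layer 2, then apply layer 3 to all words."""
    morph: List[str] = []
    for i, word in enumerate(words):
        # Layer 1 handles previously seen words; layer 2 handles OOV words.
        layer = hmm_layer if word in seen_vocab else analyzer_layer
        morph.append(layer(words, i))
    # Layer 3 adds syntactic diacritics to every word.
    return crf_layer(morph)


# Toy stand-ins that only mark which layer processed each word.
def toy_hmm(sentence: List[str], i: int) -> str:
    return sentence[i] + "/hmm"


def toy_analyzer(sentence: List[str], i: int) -> str:
    return sentence[i] + "/oov"


def toy_crf(sentence: List[str]) -> List[str]:
    return [w + "+case" for w in sentence]


if __name__ == "__main__":
    result = diacritize(["ktb", "xyz"], {"ktb"}, toy_hmm, toy_analyzer, toy_crf)
    print(result)
```

Routing word by word is a simplification chosen for brevity; a real sequence model would typically decode a whole sentence at once.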
Original language: English
Title of host publication: Proceedings of 2016 IEEE International Conference on Cloud Computing and Big Data Analysis, ICCCBDA 2016
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 389-393
Number of pages: 5
ISBN (Electronic): 9781509025930
ISBN (Print): 9781509025954
Publication status: Published - 2 Aug 2016
Externally published: Yes
Event: 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA) - Chengdu, China
Duration: 5 Jul 2016 to 7 Jul 2016

Publication series

Name: Proceedings of 2016 IEEE International Conference on Cloud Computing and Big Data Analysis, ICCCBDA 2016

Conference

Conference: 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA)
Period: 5/07/16 to 7/07/16
