Abstract
Humans can recognize different types of written languages by their grammars and vocabularies. However, computers see everything as numbers. We present a computational algorithm for machine classification of written languages using the method of vector quantization. For a language document, each word is converted to a sequence of numbers, forming a vector of numerical values according to its characters. This collection of vectors is then represented by a codebook that contains a number of template vectors for classification. The proposed method is more effective for machine learning than the n-gram based method, which has been widely used for written language identification. Experimental results of classifying a set of five closely roman-typed scripts show the promise of the proposed method.
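The pipeline outlined in the abstract (words to numeric vectors, one codebook per language, classification against the codebooks) can be illustrated with a minimal sketch. The abstract does not give the details, so everything specific here is an assumption: the character coding (a=1 to z=26, anything else 0), the fixed vector length `WORD_LEN` with zero padding, plain k-means as a stand-in for the codebook design, and minimum average distortion as the decision rule. The function names and the toy training texts are hypothetical.

```python
import random

WORD_LEN = 8  # assumed fixed vector length; not specified in the abstract


def word_to_vector(word, length=WORD_LEN):
    """Map a word to a fixed-length vector of character codes.

    Assumed coding: a=1 ... z=26, any other character 0, zero-padded
    or truncated to `length`.
    """
    codes = [ord(c) - ord("a") + 1 if "a" <= c <= "z" else 0
             for c in word.lower()[:length]]
    return codes + [0] * (length - len(codes))


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train_codebook(vectors, size, iters=20, seed=0):
    """Build a codebook of `size` template vectors with plain k-means.

    This is a stand-in; the paper may use a different codebook design.
    Requires size <= len(vectors).
    """
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)),
                          key=lambda i: sq_dist(v, codebook[i]))
            cells[nearest].append(v)
        for i, cell in enumerate(cells):
            if cell:  # keep the old template if its cell is empty
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def distortion(vectors, codebook):
    """Average squared distance from each vector to its nearest template."""
    return sum(min(sq_dist(v, c) for c in codebook)
               for v in vectors) / len(vectors)


def classify(text, codebooks):
    """Return the language whose codebook gives the lowest distortion."""
    vectors = [word_to_vector(w) for w in text.split()]
    return min(codebooks, key=lambda lang: distortion(vectors, codebooks[lang]))


if __name__ == "__main__":
    # Toy training texts, far too small for a real experiment.
    training = {
        "english": "the quick brown fox jumps over the lazy dog",
        "spanish": "el veloz zorro marron salta sobre el perro perezoso",
    }
    codebooks = {
        lang: train_codebook([word_to_vector(w) for w in text.split()], size=4)
        for lang, text in training.items()
    }
    print(classify("the dog jumps", codebooks))
```

With realistic corpora, the codebook size and the vector length would need tuning; the values above are placeholders chosen only to keep the sketch short.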
Original language | English |
---|---|
Title of host publication | Proceedings of 2003 Seventh International Symposium on Signal Processing and Its Applications |
Editors | K Abed-Meraim, I Bloch |
Place of Publication | France |
Publisher | IEEE, Institute of Electrical and Electronics Engineers |
Pages | 513-516 |
Number of pages | 4 |
ISBN (Print) | 0-7803-7947-0 |
Publication status | Published - 2003 |
Event | 7th International Symposium on Signal Processing and Its Applications, France. Duration: 1 Jul 2003 → 4 Jul 2003 |
Conference
Conference | 7th International Symposium on Signal Processing and Its Applications |
---|---|
Country/Territory | France |
Period | 1/07/03 → 4/07/03 |