VQ-Based Written Language Identification

Dat Tran, T Pham

Research output: A Conference proceeding or a Chapter in Book › Conference contribution › peer-review



Humans can recognize different written languages by their grammars and vocabularies, whereas computers see everything as numbers. We present a computational algorithm for machine classification of written languages using vector quantization. For a language document, each word is converted to a sequence of numbers, forming a vector of numerical values according to its characters. This collection of vectors is then represented by a codebook containing a number of template vectors that are used for classification. The proposed method is more effective for machine learning than the n-gram based method, which has been widely used for written language identification. Experimental results from classifying a set of five closely related languages written in Roman script show the promising application of the proposed method.
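The pipeline described in the abstract (words to numeric vectors, one codebook per language, classification against the codebooks) can be illustrated with a minimal Python sketch. The abstract does not give the details, so everything specific here is an assumption made for illustration: the character encoding (Unicode code points), the fixed vector length `DIM` with zero padding, plain k-means as the codebook training procedure, and the minimum-average-distortion decision rule. The function names are hypothetical and are not taken from the paper.

```python
import random
import re

DIM = 8  # assumed fixed vector length; shorter words are zero-padded


def word_to_vector(word, dim=DIM):
    """Encode a word as character codes (an assumed encoding), truncated or padded to dim."""
    codes = [float(ord(c)) for c in word.lower()[:dim]]
    return codes + [0.0] * (dim - len(codes))


def text_to_vectors(text):
    """Split text into alphabetic words and encode each one as a vector."""
    return [word_to_vector(w) for w in re.findall(r"[^\W\d_]+", text)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(vectors, size, iters=20, seed=0):
    """Build a codebook with plain k-means (a stand-in; the paper's training may differ)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, min(size, len(vectors)))]
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))
            cells[nearest].append(v)
        for i, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def distortion(vectors, codebook):
    """Average squared distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify(text, codebooks):
    """Return the language whose codebook gives the lowest average distortion."""
    vectors = text_to_vectors(text)
    if not vectors:
        raise ValueError("text contains no alphabetic words")
    return min(codebooks, key=lambda lang: distortion(vectors, codebooks[lang]))


if __name__ == "__main__":
    # Toy training texts, invented for this sketch; real use needs far more data.
    samples = {
        "english": "the quick brown fox jumps over the lazy dog and runs away",
        "french": "le renard brun saute par dessus le chien paresseux et part",
    }
    codebooks = {
        lang: train_codebook(text_to_vectors(text), size=6)
        for lang, text in samples.items()
    }
    print(identify("the dog runs over the fox", codebooks))
```

With training texts this small the outcome is only indicative. The point of the sketch is the structure: each language is summarized by a handful of template vectors, and an unseen document is assigned to the language whose templates fit its word vectors most closely.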
Original language: English
Title of host publication: Proceedings of 2003 Seventh International Symposium on Signal Processing and Its Applications
Editors: K Abed-Meraim, I Bloch
Place of Publication: France
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Number of pages: 4
ISBN (Print): 0-7803-7947-0
Publication status: Published - 2003
Event: 7th International Symposium on Signal Processing and Its Applications, France
Duration: 1 Jul 2003 to 4 Jul 2003


Conference: 7th International Symposium on Signal Processing and Its Applications

