A proposed statistical model for spam email detection

Research output: A Conference proceeding or a Chapter in Book › Conference contribution

2 Citations (Scopus)

Abstract

A keyword list-based spam email detection system uses keywords in a blacklist to detect spam emails. To avoid detection, spammers write keywords as misspellings, for example «virrus», «vi-rus» and «viruus» instead of «virus». The system needs to update the blacklist from time to time to detect spam emails containing such misspellings. However, it is impossible to predict every possible misspelling of a given keyword and add it to the blacklist. This paper proposes a statistical framework to solve this problem. A keyword is represented as a Markov chain in which letters are states, and a Markov model is then built for the keyword. To decide whether an unknown word is a misspelling of a given keyword, a statistical hypothesis test is used. Experiments showed that the proposed statistical models could achieve a detection error rate of 0.1%.
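
The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the training procedure, the test statistic, or the threshold. This sketch assumes a first-order chain trained on the keyword alone, additive smoothing, a uniform background model as the alternative hypothesis, and an arbitrary threshold of 1.5. The names `train_markov`, `log_likelihood` and `is_misspelling` are hypothetical.

```python
import math
from collections import defaultdict

# Assumed symbol set: lowercase letters, digits and the hyphen.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"
START, END = "^", "$"
N_SYMBOLS = len(ALPHABET) + 1  # every symbol plus the end marker


def _chain(word):
    """Turn a word into a state sequence, dropping characters outside ALPHABET."""
    letters = [c for c in word.lower() if c in ALPHABET]
    return [START] + letters + [END]


def train_markov(words, smoothing=0.01):
    """Estimate smoothed letter-to-letter transition probabilities (assumed scheme)."""
    counts = defaultdict(lambda: defaultdict(float))
    for word in words:
        chain = _chain(word)
        for prev, nxt in zip(chain, chain[1:]):
            counts[prev][nxt] += 1.0

    def prob(prev, nxt):
        total = sum(counts[prev].values()) + smoothing * N_SYMBOLS
        return (counts[prev][nxt] + smoothing) / total

    return prob


def log_likelihood(word, prob):
    """Log-probability of the word's letter sequence under the keyword model."""
    chain = _chain(word)
    return sum(math.log(prob(p, n)) for p, n in zip(chain, chain[1:]))


def is_misspelling(word, keyword_prob, threshold=1.5):
    """Likelihood-ratio test: keyword model versus a uniform background model.

    The score is the log-likelihood ratio averaged over transitions; the
    threshold is an illustrative value, not one taken from the paper.
    """
    n_transitions = len(_chain(word)) - 1
    background = n_transitions * math.log(1.0 / N_SYMBOLS)
    score = (log_likelihood(word, keyword_prob) - background) / n_transitions
    return score > threshold


virus_model = train_markov(["virus"])
for candidate in ["virrus", "vi-rus", "viruus", "hello"]:
    print(candidate, is_misspelling(candidate, virus_model))
```

Under these assumptions the three misspellings from the abstract score above the threshold and «hello» scores below it. In practice the threshold would be tuned on labelled data to trade false alarms against missed detections.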

Original language: English
Title of host publication: 1st International Conference on Theories and Applications of Computer Science 2006, ICTACS 2006
Editors: Bao T Ho
Place of Publication: Vietnam
Publisher: World Scientific Publishing
Pages: 15-24
Number of pages: 10
ISBN (Print): 9812700633, 9789812700636
Publication status: Published - 2006
Event: 1st International Conference on Theories and Applications of Computer Science, ICTACS 2006 - Ho Chi Minh City, Viet Nam
Duration: 3 Aug 2006 to 5 Aug 2006

Conference

Conference: 1st International Conference on Theories and Applications of Computer Science, ICTACS 2006
Country: Viet Nam
City: Ho Chi Minh City
Period: 3/08/06 to 5/08/06
