Abstract
A keyword-list-based spam email detection system uses a blacklist of keywords to detect spam emails. To avoid detection, spammers write keywords as misspellings, for example «virrus», «vi-rus» and «viruus» instead of «virus». The system needs to update the blacklist from time to time to detect spam emails containing such misspellings. However, it is impossible to predict every possible misspelling of a given keyword and add them all to the blacklist. This paper proposes a statistical framework to solve this problem. A keyword is represented as a Markov chain in which letters are states. A Markov model is then built for the keyword. To decide whether an unknown word is a misspelling of a given keyword, a statistical hypothesis test is used. Experiments showed that the proposed statistical models could achieve a detection error rate of 0.1%.
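The idea of treating letters as Markov states and testing a word against the keyword model can be illustrated with a minimal Python sketch. The abstract does not specify the paper's estimation procedure or test statistic, so everything here is an assumption made for illustration: the first-order chain, the additive smoothing constant, the uniform background model used as the alternative hypothesis, the per-transition log-likelihood ratio, and the threshold of 1.0. The names `train_markov`, `log_likelihood` and `is_misspelling` are hypothetical.

```python
import math
from collections import defaultdict

# Assumed symbol set: lowercase letters plus the hyphen seen in «vi-rus».
ALPHABET = "abcdefghijklmnopqrstuvwxyz-"
START, END = "^", "$"  # artificial start and end states


def train_markov(words, smoothing=0.01):
    """Estimate smoothed first-order letter-transition probabilities."""
    counts = defaultdict(lambda: defaultdict(float))
    for word in words:
        symbols = [START] + list(word.lower()) + [END]
        for prev, cur in zip(symbols, symbols[1:]):
            counts[prev][cur] += 1.0
    targets = list(ALPHABET) + [END]
    model = {}
    for prev in [START] + list(ALPHABET):
        total = sum(counts[prev].values()) + smoothing * len(targets)
        model[prev] = {c: (counts[prev][c] + smoothing) / total for c in targets}
    return model


def log_likelihood(model, word):
    """Log-probability of the letter sequence under the keyword model."""
    symbols = [START] + list(word.lower()) + [END]
    return sum(math.log(model[p][c]) for p, c in zip(symbols, symbols[1:]))


def is_misspelling(model, word, threshold=1.0):
    """Accept the word as a misspelling if the average log-likelihood ratio
    (keyword model vs. uniform background) exceeds an assumed threshold."""
    if not word or any(ch not in ALPHABET for ch in word.lower()):
        return False
    n_transitions = len(word) + 1
    background = n_transitions * math.log(1.0 / (len(ALPHABET) + 1))
    score = (log_likelihood(model, word) - background) / n_transitions
    return score > threshold


model = train_markov(["virus"])
for candidate in ["virus", "virrus", "vi-rus", "viruus", "visit", "hello"]:
    print(candidate, is_misspelling(model, candidate))
```

In this sketch the model is trained on the single keyword «virus»; in practice the threshold would have to be tuned on labelled data, and the paper's own models may differ substantially from this simplification.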
Original language | English |
---|---|
Title of host publication | 1st International Conference on Theories and Applications of Computer Science 2006, ICTACS 2006 |
Editors | Bao T Ho |
Place of Publication | Vietnam |
Publisher | World Scientific Publishing |
Pages | 15-24 |
Number of pages | 10 |
ISBN (Print) | 9812700633, 9789812700636 |
Publication status | Published - 2006 |
Event | 1st International Conference on Theories and Applications of Computer Science, ICTACS 2006 - Ho Chi Minh City, Viet Nam. Duration: 3 Aug 2006 → 5 Aug 2006 |
Conference
Conference | 1st International Conference on Theories and Applications of Computer Science, ICTACS 2006 |
---|---|
Country/Territory | Viet Nam |
City | Ho Chi Minh City |
Period | 3/08/06 → 5/08/06 |