A proposed statistical model for spam email detection

Research output: A Conference proceeding or a Chapter in Book > Conference contribution

2 Citations (Scopus)

Abstract

Keyword list-based spam email detection systems use keywords in a blacklist to detect spam emails. To avoid detection, spammers write keywords as misspellings, for example «virrus», «vi-rus» and «viruus» instead of «virus». The system must update the blacklist from time to time to detect spam emails containing such misspellings. However, it is impossible to predict every possible misspelling of a given keyword and add it to the blacklist. This paper proposes a statistical framework to solve this problem. A keyword is represented as a Markov chain in which letters are states, and a Markov model is then built for the keyword. To decide whether an unknown word is a misspelling of a given keyword, a statistical hypothesis test is used. Experiments showed that the proposed statistical models could achieve a detection error rate of 0.1%.
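
The abstract does not give the model's details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a first-order Markov chain over letters with additive smoothing, a uniform background model as the alternative hypothesis, and an average log-likelihood ratio compared against a hand-picked threshold. The names `train_markov`, `log_likelihood_ratio` and `is_misspelling`, the alphabet, the smoothing constant and the threshold value are all hypothetical choices made for illustration.

```python
import math
import string
from collections import defaultdict

# Assumed symbol set: lowercase letters, digits and the hyphen.
ALPHABET = string.ascii_lowercase + string.digits + "-"
START, END = "^", "$"  # artificial start and end states


def train_markov(words, smoothing=0.01):
    """Estimate smoothed letter-to-letter transition probabilities.

    Training words are assumed to contain only symbols from ALPHABET.
    """
    states = list(ALPHABET) + [END]
    counts = defaultdict(lambda: defaultdict(float))
    for word in words:
        seq = [START] + list(word.lower()) + [END]
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    model = {}
    for a in [START] + list(ALPHABET):
        total = sum(counts[a].values()) + smoothing * len(states)
        model[a] = {b: (counts[a][b] + smoothing) / total for b in states}
    return model


def log_likelihood_ratio(word, model):
    """Average per-transition log ratio: keyword model vs. uniform background."""
    seq = [START] + list(word.lower()) + [END]
    background = 1.0 / (len(ALPHABET) + 1)  # uniform over letters plus END
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        total += math.log(model[a][b]) - math.log(background)
    return total / (len(seq) - 1)


def is_misspelling(word, model, threshold=1.5):
    """Accept the word as a variant of the keyword if the ratio exceeds the threshold."""
    if any(c not in ALPHABET for c in word.lower()):
        return False  # symbols outside the assumed alphabet are not scored
    return log_likelihood_ratio(word, model) > threshold


if __name__ == "__main__":
    model = train_markov(["virus"])
    for candidate in ["virrus", "vi-rus", "viruus", "hello", "visit"]:
        print(candidate, is_misspelling(candidate, model))
```

In this sketch the threshold trades false alarms against missed detections; in practice it would be tuned on labelled data, and the keyword model would be trained on more than a single spelling.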

Original language: English
Title of host publication: 1st International Conference on Theories and Applications of Computer Science 2006, ICTACS 2006
Editors: Bao T Ho
Place of Publication: Vietnam
Publisher: World Scientific Publishing
Pages: 15-24
Number of pages: 10
ISBN (Print): 9812700633, 9789812700636
Publication status: Published - 2006
Event: 1st International Conference on Theories and Applications of Computer Science, ICTACS 2006 - Ho Chi Minh City, Viet Nam
Duration: 3 Aug 2006 to 5 Aug 2006

Conference

Conference: 1st International Conference on Theories and Applications of Computer Science, ICTACS 2006
Country: Viet Nam
City: Ho Chi Minh City
Period: 3/08/06 to 5/08/06

Fingerprint

Electronic mail
Statistical tests
Error detection
Viruses
Markov processes
Statistical Models
Experiments

Cite this

Tran, D., Ma, W., & Sharma, D. (2006). A proposed statistical model for spam email detection. In B. T. Ho (Ed.), 1st International Conference on Theories and Applications of Computer Science 2006, ICTACS 2006 (pp. 15-24). Vietnam: World Scientific Publishing.
@inproceedings{1663a1ecd2c74d65acc9c152ba74b73c,
title = "A proposed statistical model for spam email detection",
author = "Dat Tran and Wanli Ma and Dharmendra Sharma",
year = "2006",
language = "English",
isbn = "9812700633",
pages = "15--24",
editor = "Ho, {Bao T}",
booktitle = "1st International Conference on Theories and Applications of Computer Science 2006, ICTACS 2006",
publisher = "World Scientific Publishing",

}

TY  - GEN
T1  - A proposed statistical model for spam email detection
AU  - Tran, Dat
AU  - Ma, Wanli
AU  - Sharma, Dharmendra
PY  - 2006
Y1  - 2006
UR  - http://www.scopus.com/inward/record.url?scp=84903694383&partnerID=8YFLogxK
M3  - Conference contribution
SN  - 9812700633
SN  - 9789812700636
SP  - 15
EP  - 24
BT  - 1st International Conference on Theories and Applications of Computer Science 2006, ICTACS 2006
A2  - Ho, Bao T
PB  - World Scientific Publishing
CY  - Vietnam
ER  -
