FCFilter: Feature selection based on clustering and genetic algorithms

Charles Ferreira, Deborah de Medeiros, Fabiana SANTANA

Research output: A Conference proceeding or a Chapter in Book › Conference contribution

3 Citations (Scopus)

Abstract

The search for patterns in large amounts of textual data, or text mining, can be at once rewarding and challenging. The patterns can reveal tendencies, similarities and predictions, but the information is usually implicit and difficult to validate. Classification is one of the most relevant research areas in text mining, and it usually consists of predicting the class of a textual document based on a set of documents previously organized into different classes, such as author or topic. Choosing the words that compose the feature set is crucial to proper classification. A well-selected feature set can improve the performance of the classification method and clarify the interpretation of the classification model fitted to the data. This paper introduces the Feature Cluster Filter (FCFilter) method for feature selection. FCFilter eliminates the need to input or optimize the number of clusters by grouping the words into a sufficiently high number of clusters. Genetic algorithms are applied to optimize the combination of groups that will provide the final feature set. The method selects features that are good predictors for text classification by clustering the features and keeping only the suitable clusters. Experiments performed to evaluate FCFilter with the Reuters-21578, SCY-Genes and SCY-Clusters datasets showed a significant reduction in the dimensionality of the feature-value table, with slight improvements in classification accuracy compared to the baselines. The results are promising and indicate potential improvements in research on feature selection for text mining.
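
The two stages named in the abstract (cluster the words, then let a genetic algorithm pick which clusters to keep) can be illustrated with a minimal sketch. This is not the authors' implementation: the k-means clustering, the leave-one-out 1-nearest-neighbour fitness, the size penalty of 0.01, the GA operators, and the toy data are all assumptions chosen to keep the example self-contained.

```python
import random

def kmeans_words(word_vectors, k, iters=10, seed=0):
    """Assumed clustering step: plain k-means over words, where each word
    is represented by its vector of per-document counts."""
    rng = random.Random(seed)
    centers = rng.sample(word_vectors, k)
    labels = [0] * len(word_vectors)
    for _ in range(iters):
        for i, v in enumerate(word_vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
        for c in range(k):
            members = [word_vectors[i] for i, l in enumerate(labels) if l == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def loo_accuracy(docs, y, feats):
    """Assumed fitness core: leave-one-out 1-NN accuracy on the chosen words."""
    if not feats:
        return 0.0
    hits = 0
    for i, d in enumerate(docs):
        nearest = min(
            (j for j in range(len(docs)) if j != i),
            key=lambda j: sum((d[f] - docs[j][f]) ** 2 for f in feats),
        )
        hits += y[nearest] == y[i]
    return hits / len(docs)

def select_clusters(docs, y, labels, k, pop=20, gens=30, seed=0):
    """GA over bit masks: bit c == 1 keeps every word in cluster c."""
    rng = random.Random(seed)

    def feats(mask):
        return [w for w, c in enumerate(labels) if mask[c]]

    def fitness(mask):
        f = feats(mask)
        # Hypothetical trade-off: accuracy minus a small penalty on size.
        return loo_accuracy(docs, y, f) - 0.01 * len(f) / len(labels)

    # Start from the all-clusters baseline plus random masks.
    population = [[1] * k] + [
        [rng.randint(0, 1) for _ in range(k)] for _ in range(pop - 1)
    ]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]          # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, k)             # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                # bit-flip mutation
                j = rng.randrange(k)
                child[j] = 1 - child[j]
            children.append(child)
        population = parents + children
    return feats(max(population, key=fitness))

# Toy data: rows are documents, columns are word counts.
# Words 0-3 track the class; words 4-5 are noise.
docs = [
    [3, 3, 0, 0, 1, 0], [4, 3, 0, 0, 0, 1], [3, 4, 0, 0, 1, 1],
    [0, 0, 3, 3, 1, 0], [0, 0, 4, 3, 0, 1], [0, 0, 3, 4, 1, 1],
]
y = [0, 0, 0, 1, 1, 1]
k = 3  # deliberately larger than the number of "true" word groups
word_vectors = [list(col) for col in zip(*docs)]
labels = kmeans_words(word_vectors, k)
selected = select_clusters(docs, y, labels, k)
print("cluster of each word:", labels)
print("selected word indices:", selected)
```

Because the GA chooses among whole clusters rather than individual words, the search space has `2**k` candidates instead of `2**(number of words)`, which is one plausible reading of why clustering first makes the optimization tractable.
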
Original language: English
Title of host publication: 2016 IEEE Congress on Evolutionary Computation (CEC)
Editors: Yew Soon Ong
Place of Publication: Vancouver, Canada
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 2106-2113
Number of pages: 8
ISBN (Electronic): 9781509006229
ISBN (Print): 9781509006236
DOI: https://doi.org/10.1109/CEC.2016.7744048
Publication status: Published - 2016
Event: 2016 IEEE Congress on Evolutionary Computation (CEC) - Vancouver, Canada
Duration: 24 Jul 2016 to 29 Jul 2016

Publication series

Name: 2016 IEEE Congress on Evolutionary Computation, CEC 2016

Conference

Conference: 2016 IEEE Congress on Evolutionary Computation (CEC)
Abbreviated title: CEC 2016
Country: Canada
City: Vancouver
Period: 24/07/16 to 29/07/16

Cite this

Ferreira, C., de Medeiros, D., & SANTANA, F. (2016). FCFilter: Feature selection based on clustering and genetic algorithms. In Y. S. Ong (Ed.), 2016 IEEE Congress on Evolutionary Computation (CEC) (pp. 2106-2113). [7744048] (2016 IEEE Congress on Evolutionary Computation, CEC 2016). Vancouver, Canada: IEEE, Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/CEC.2016.7744048
Ferreira, Charles ; de Medeiros, Deborah ; SANTANA, Fabiana. / FCFilter: Feature selection based on clustering and genetic algorithms. 2016 IEEE Congress on Evolutionary Computation (CEC). editor / Yew Soon Ong. Vancouver, Canada : IEEE, Institute of Electrical and Electronics Engineers, 2016. pp. 2106-2113 (2016 IEEE Congress on Evolutionary Computation, CEC 2016).
@inproceedings{36aee6cbd6064b5f9c3761ba59eb0aad,
title = "{FCFilter}: Feature selection based on clustering and genetic algorithms",
keywords = "Text mining, genetic algorithms, feature selection",
author = "Ferreira, Charles and {de Medeiros}, Deborah and SANTANA, Fabiana",
year = "2016",
doi = "10.1109/CEC.2016.7744048",
language = "English",
isbn = "9781509006236",
series = "2016 IEEE Congress on Evolutionary Computation, CEC 2016",
publisher = "IEEE, Institute of Electrical and Electronics Engineers",
pages = "2106--2113",
editor = "Ong, {Yew Soon}",
booktitle = "2016 IEEE Congress on Evolutionary Computation (CEC)",
address = "United States",

}

Ferreira, C, de Medeiros, D & SANTANA, F 2016, FCFilter: Feature selection based on clustering and genetic algorithms. in YS Ong (ed.), 2016 IEEE Congress on Evolutionary Computation (CEC)., 7744048, 2016 IEEE Congress on Evolutionary Computation, CEC 2016, IEEE, Institute of Electrical and Electronics Engineers, Vancouver, Canada, pp. 2106-2113, 2016 IEEE Congress on Evolutionary Computation (CEC), Vancouver, Canada, 24/07/16. https://doi.org/10.1109/CEC.2016.7744048

FCFilter: Feature selection based on clustering and genetic algorithms. / Ferreira, Charles; de Medeiros, Deborah; SANTANA, Fabiana.

2016 IEEE Congress on Evolutionary Computation (CEC). ed. / Yew Soon Ong. Vancouver, Canada : IEEE, Institute of Electrical and Electronics Engineers, 2016. p. 2106-2113 7744048 (2016 IEEE Congress on Evolutionary Computation, CEC 2016).


TY  - GEN
T1  - FCFilter: Feature selection based on clustering and genetic algorithms
AU  - Ferreira, Charles
AU  - de Medeiros, Deborah
AU  - SANTANA, Fabiana
PY  - 2016
Y1  - 2016

KW  - Text mining
KW  - genetic algorithms
KW  - feature selection
UR  - http://www.scopus.com/inward/record.url?scp=85008257093&partnerID=8YFLogxK
U2  - 10.1109/CEC.2016.7744048
DO  - 10.1109/CEC.2016.7744048
M3  - Conference contribution
SN  - 9781509006236
T3  - 2016 IEEE Congress on Evolutionary Computation, CEC 2016
SP  - 2106
EP  - 2113
BT  - 2016 IEEE Congress on Evolutionary Computation (CEC)
A2  - Ong, Yew Soon
PB  - IEEE, Institute of Electrical and Electronics Engineers
CY  - Vancouver, Canada
ER  - 

Ferreira C, de Medeiros D, SANTANA F. FCFilter: Feature selection based on clustering and genetic algorithms. In Ong YS, editor, 2016 IEEE Congress on Evolutionary Computation (CEC). Vancouver, Canada: IEEE, Institute of Electrical and Electronics Engineers. 2016. p. 2106-2113. 7744048. (2016 IEEE Congress on Evolutionary Computation, CEC 2016). https://doi.org/10.1109/CEC.2016.7744048