With the explosion of information on the Web and the increasing availability of documents in digital form, the need to acquire and organise information is becoming more and more significant. Automated text classification, that is, how to categorise documents into pre-defined categories, has been a matter of major concern. A typical text classification (TC) system has two main components, namely document representation and classification. Approaches to document representation have used term frequency (TF), inverse document frequency (IDF), term category dependency (TCD) and term co-occurrence (TCO). Recent approaches to document representation look for semantic relationships among terms. The relations can be extracted from particular collections such as Wikipedia. Although these popular approaches to document representation have been successful in text classification, TF and IDF struggle to differentiate documents and cannot achieve high classification results on some abstract and complex corpora. TCO-based methods use term relations; however, the semantic aspects of those relations remain an open question.

In summary, the following problems are the particular issues examined in this thesis:
1. How to extract potential relationships among terms to provide additional information for document representation?
2. How to take advantage of the extracted relationships to build document representations?
3. How to consider category information to justify weightings of features, which can be used by popular text classifiers to improve the classification results?

The thesis contributes the following solutions to these problems:
1. A Relation Extraction Framework: The goal of this framework is not only to present a method for extracting relationships among terms from a given text document, but also to present preliminary work toward building semantic relation extraction for TC. Moreover, it shows an alternative way of generating features from a text document instead of using the popular term-based approaches.
2. A Term Weighting Approach based on Relation Extraction and Graph Model: The aims of this method are not only to address the issues of popular term weighting approaches based on term frequencies and term co-occurrences, but also to present a way of taking advantage of extracted relations in weighting terms. This work can be considered an example of, and an opening to, further investigation into applying other kinds of semantic relations and graph centrality ranking to various types of text classification tasks.
3. An Adaptable Term Weighting Framework for Text Classification: The framework shows how to apply the TCD measure in weighting terms for TC. The round-robin process in the framework helps to find a suitable term weighting scheme for the document based on category information.
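As a rough illustration of the idea of weighting terms through a graph of relations and a centrality ranking, the following minimal sketch may help. It is not the method of the thesis: the abstract does not specify how relations are extracted or which centrality measure is used, so the sketch assumes that sentence-level co-occurrence stands in for an extracted relation and that degree centrality is the ranking. The function names `build_term_graph` and `degree_centrality_weights` and the sample sentences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_term_graph(sentences):
    """Link two terms whenever they appear in the same sentence.

    Assumption: sentence-level co-occurrence is only a stand-in for the
    extracted relations that the abstract refers to.
    """
    graph = defaultdict(set)
    for sentence in sentences:
        terms = sorted(set(sentence.lower().split()))
        for term in terms:
            graph[term]  # make sure isolated terms still become nodes
        for a, b in combinations(terms, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality_weights(graph):
    """Weight each term by its degree divided by (number of nodes - 1)."""
    n = len(graph)
    if n < 2:
        return {term: 0.0 for term in graph}
    return {term: len(neighbours) / (n - 1) for term, neighbours in graph.items()}


# Hypothetical toy document, one string per sentence.
doc = [
    "graph models rank terms",
    "terms carry relations",
    "relations build graph models",
]
weights = degree_centrality_weights(build_term_graph(doc))

# Print terms from most to least central.
for term, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term}: {weight:.2f}")
```

In this sketch a term connected to many other terms receives a higher weight than one that appears in a single short sentence, which is the general intuition behind centrality-based weighting. A real system would replace the co-occurrence step with an actual relation extractor and could swap in another centrality measure.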
Relation extraction in term weighting for text classification
Huynh, D. T. (Author). 2010
Student thesis: Master's Thesis