Abstract
Biomedical Named Entity Recognition (Bio-NER) is a complex natural language processing (NLP)
task for extracting important concepts (named entities), such as RNA (ribonucleic acid),
protein, cell type, cell line, and DNA (deoxyribonucleic acid), from biomedical texts. It
attempts to automatically discover biomedical knowledge and clinical concepts and terms
from text-based digital health records. For automatic recognition and discovery, researchers
have extensively investigated different types of machine learning models, leading to
sophisticated NER systems. However, most computer-based NER systems require
manual annotations and handcrafted features specifically designed for each type of biomedical
or clinical entity. The feature generation process for biomedical and clinical NLP texts
requires extensive manual effort and significant background knowledge from biomedical, clinical
and linguistic experts; it also generalizes poorly across application contexts and is
difficult to adapt to new biomedical entity types or clinical concepts and terms.
Recently, there have been increasing efforts to apply different machine
learning models to improve the performance and generalization capabilities of automatic
Named Entity Recognition (NER) systems for different application contexts. However, their
performance and robustness in biomedical and clinical contexts are far from optimal, owing to
the complex nature of medical texts, the lack of annotated dictionaries, and the need for
multi-disciplinary experts to annotate massive training data for the manual feature
generation process.
In this thesis, a novel computational framework based on robust machine learning models is
proposed for the biomedical and clinical named entity recognition task. Validation of the
proposed framework models, which are based on shared representations, on both the Bio-NER
and clinical-NER domains resulted in improved NER performance. The innovative
machine learning models based on shared representations were built using different types of
machine learning algorithms, including traditional shallow machine learning algorithms based
on MaxEnt (Maximum Entropy) and CRF (Conditional Random Fields), and several Deep
Learning variants, including FFN (Feedforward Networks), RNN (Recurrent Neural Networks),
and Hybrid CNN (Convolutional Neural Networks), enhanced with hyperparameter
optimization techniques, allowing the characteristics of context-specific biomedical and
clinical entity types to be captured from unstructured free text. The experimental validation
of the proposed Biomedical NER framework based on large-scale deep machine learning
algorithms was done on Bio-NER and clinical-NER benchmark datasets representing different
biomedical and clinical entity types. It resulted in a significant improvement in
performance, by a large margin, over traditional NER systems based on shallow learning
models, even with limited training data, unbalanced datasets, and the
challenge of recognizing a large set of entities from small datasets. The reason behind the
improved performance and generalization of the proposed deep learning-based computational
framework models could be the embedding of context-specific shared information at the
character and word levels between different biomedical and clinical entities, as modelled
by the deep learning architectures used. The contributions from this research can create new
opportunities in the form of a generalized, robust, high-performing computer-based NER
framework that can work across a wide range of inter-related health domains with many
polysemous names and entities, including biomedical, clinical, chemical, medical and public
health contexts.
Date of Award | 2020 |
---|---|
Original language | English |
Supervisor | Rachel DAVEY (Supervisor), Girija CHETTY (Supervisor) & Dat TRAN (Supervisor) |