Robust Biomedical Name Entity Recognition based on deep machine learning

  • Rob Phan

    Student thesis: Doctoral Thesis

    Abstract

    Biomedical Named Entity Recognition (Bio-NER) is a complex natural language processing
    (NLP) task that extracts important concepts (named entities) from biomedical texts, such as
    RNA (ribonucleic acid), protein, cell type, cell line, and DNA (deoxyribonucleic acid), and
    attempts to automatically discover biomedical knowledge and clinical concepts and terms
    from text-based digital health records. For automatic recognition and discovery, researchers
    have extensively investigated different types of machine learning models, leading to
    sophisticated NER systems. However, most computer-based NER systems require manual
    annotations and handcrafted features specifically designed for each type of biomedical or
    clinical entity. The feature generation process for biomedical and clinical texts requires
    extensive manual effort and significant background knowledge from biomedical, clinical and
    linguistic experts. It also generalizes poorly across application contexts and is difficult
    to adapt to new biomedical entity types or clinical concepts and terms. Recently, there have
    been increasing efforts to apply different machine learning models to improve the
    performance and generalization capabilities of automatic NER systems for different
    application contexts. However, their performance and robustness in biomedical and clinical
    contexts remain far from optimal, due to the complex nature of medical texts, the lack of
    annotated dictionaries, and the need for multi-disciplinary experts to annotate massive
    training data for the manual feature generation process.
    In this thesis, a novel computational framework based on robust machine learning models is
    proposed for the biomedical and clinical named entity recognition task. Validation of the
    proposed framework's models, which are based on shared representations, on the Bio-NER and
    clinical-NER domains showed improved NER performance. These models were built using
    different types of machine learning algorithms, including traditional shallow machine
    learning algorithms such as Maximum Entropy and CRF (Conditional Random Fields), and several
    deep learning variants, including FFN (Feedforward Networks), RNN (Recurrent Neural
    Networks) and hybrid CNN (Convolutional Neural Networks), enhanced with hyperparameter
    optimization techniques, allowing the characteristics of context-specific biomedical and
    clinical entity types to be captured from unstructured free text. The proposed biomedical
    NER framework, based on large-scale deep machine learning algorithms, was experimentally
    validated on Bio-NER and clinical-NER benchmark datasets representing different biomedical
    and clinical entity types. It improved performance by a large margin over traditional NER
    systems based on shallow learning models, even with limited training data, unbalanced
    datasets, and the challenge of recognizing a large set of entities from small datasets. The
    improved performance and generalization of the proposed deep learning-based framework may
    be due to the embedding of context-specific shared information at character and word level
    between different biomedical and clinical entities, as modelled by the deep learning
    architectures used. The contributions of this research can create new opportunities for a
    generalized, robust, high-performing computer-based NER framework that can work across a
    wide range of inter-related health domains with many polysemous names and entities,
    including biomedical, clinical, chemical, medical and public health contexts.
    Date of Award: 2020
    Original language: English
    Supervisors: Rachel DAVEY, Girija CHETTY & Dat TRAN
