Chronic disease status identification from de-identified clinical records based on machine learning

  • Kunal Rajput

    Student thesis: Doctoral Thesis


    Automatic detection and identification of chronic disease conditions from de-identified clinical records can give timely support to the medical decision-making process. The identified risk factors can expedite the preventative actions needed for tracking debilitating chronic diseases and associated co-morbidities. For instance, chronic and long-term unmanaged risk factors such as obesity, diabetes, hypertension, and hyperlipidaemia can lead to coronary heart disease, also known as coronary artery disease (CAD), a leading cause of death worldwide. The costs of managing chronic diseases such as diabetes, hypertension, and hyperlipidaemia, which are risk factors for CAD, are high and place an enormous burden on healthcare systems worldwide. Hence, several national clinical guidelines address CAD risk assessment through monitoring, detection, and tracking of risk factors, such as smoking behavior, obesity and lifestyle factors, diabetes, and other co-morbid conditions, and through calculation and tracking of coronary risk scores. With the rapid adoption of electronic health record (EHR) systems, most patient data are stored in de-identified electronic format. Because multidisciplinary care teams are involved in managing chronic diseases and tracking CAD risk factors, vital health data specific to a patient are difficult to obtain, as they are scattered across various systems in various formats. Most of the time, the main data required for determining coronary risk are buried in unstructured clinical narratives and hand-over notes, often stored in a de-identified format. Existing solutions for chronic disease surveillance, which involve detecting and identifying disease status and risk factors from de-identified records, are based on manual methods requiring significant human effort and domain expertise. 
Further, information extraction performed manually from de-identified clinical records and text-based narratives can be error-prone, expensive, and prohibitively time-consuming. The key elements for detecting and tracking disease status are often embedded as free text in de-identified clinical records, discharge notes, and summaries. Since these text notes and records contain private information about patients, including personal and disease-related information, they normally exist in a protected or de-identified form, in which the private items, normally referred to as PHI (protected health information) indicators, are masked yet remain embedded within the clinical text containing the information needed to detect and track the risk factors associated with the disease. While the de-identification process itself is highly complicated, chronic disease surveillance is more complex and challenging. It involves understanding and making sense of de-identified clinical text notes (embedded with PHIs and other clinical information) and extracting the meaningful information needed for monitoring and tracking disease status, such as symptoms, risk factors, disease indicators, events, medications, and allergic reactions. This is due to the difficulty of extracting linguistic and semantic relationships between PHIs and disease status when the clinical text records are in a de-identified, unstructured form (as clinical notes and narratives). As a typical clinical discharge summary or text note contains several PHIs at the same time, it is more difficult to make sense of de-identified medical text records with several masked PHIs embedded in them: context and structure are lost in the embedded text, and traditional natural language processing (NLP) and text mining techniques do not perform well. 
This thesis focuses on detecting the vital risk factors and associated chronic disease conditions from de-identified medical text records, using traditional machine learning and novel deep learning techniques. It proposes a novel computational framework for developing disease detection models based on different machine learning models.
    As it is extremely difficult to access de-identified medical records from hospital systems, the NLP challenge shared tasks and associated benchmark datasets made available by the i2b2 consortium give researchers in computing and information technology an opportunity to compare their findings, to extend previous work undertaken on the several i2b2 datasets, and to share outcomes and results that improve the state of the art.
    For a baseline comparison, the traditional approaches proposed in earlier work reported in the literature are compared with the novel contributions made in this work. Most of the earlier work that uses i2b2 NLP challenge task datasets for experimental evaluation is based on manual approaches requiring human experts with domain knowledge from several fields, including clinical, computing, natural language processing, and linguistics. As a result, making sense of the de-identified clinical text notes, extracting knowledge, and building computer-based models on this workflow is a complex and very challenging endeavor.
    In the initial exploratory stages of this thesis, a sentence-level segmentation approach for building disease status detection models was developed as the first step, based on shallow machine learning approaches using the PART, Naïve Bayes, Random Forest, and Hoeffding tree algorithms. This served as the baseline reference for the innovative disease detection algorithms developed in the rest of the thesis. The next step was the development of a document-level segmentation approach using more efficient and established shallow learning approaches, with the Random Forest, Naïve Bayes, Logistic Regression, and Gradient Boosting Classifier algorithms. The findings from this stage helped in addressing the imbalanced and sparse data problem: algorithms based on ensemble techniques are well known to perform well in other application contexts, such as engineering and astronomy, and they indeed led to enhanced performance of the disease detection models. Then, these robust models based on ensemble techniques were extended to investigate the impact of multiple co-morbid disease conditions on debilitating cardiovascular disease risk assessment, and were examined with a new set of evaluation metrics for assessing improvement in performance and robustness, including accuracy, precision, recall, F1-score, confusion matrix, and ROC-AUC measures. As extraction of useful TF-IDF text features from de-identified clinical texts became increasingly difficult, particularly with clinical text data containing markers for multiple co-morbidities, new deep learning models were introduced. Since the deep learning models do not require manual feature engineering, the reliance on TF-IDF features was reduced. Four new models based on deep learning and document-level classification were proposed as the next step in enhancing the efficacy of the disease detection models: Bidirectional LSTM (BI-LSTM), CNN, Bidirectional GRU (BI-GRU), and BILSTM-BIGRU cascade models. 
Also, a new performance metric, the micro-averaged F1-score, was used, which can provide a better evaluation of machine learning models under class imbalance and sparse data. Finally, a sentence-level classification approach with these deep learning algorithms was proposed, leading to enhanced performance as assessed by micro-averaged F1-scores. This incremental development, enhancement, and refinement of the proposed AI-based deep learning computational framework was evaluated experimentally on several publicly available benchmark datasets from the clinical NLP i2b2 shared task challenges, leading to significant improvement in performance and robustness compared with other competing methods and systems in the challenge tasks organized by the i2b2 consortium.
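As an illustration of the micro-averaged F1-score mentioned in the abstract, the following minimal Python sketch shows how micro and macro averaging differ on an imbalanced label set. It is not the thesis implementation: the function names (`f1`, `micro_macro_f1`), the label names, and the toy predictions are all hypothetical and chosen only for this example.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: 8 "absent" and 2 "present" disease-status labels,
# with a classifier that always predicts the majority class.
y_true = ["absent"] * 8 + ["present"] * 2
y_pred = ["absent"] * 10

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

Micro averaging pools true positives, false positives, and false negatives across all classes before computing a single score, so every individual prediction carries the same weight; macro averaging instead gives every class the same weight regardless of how often it occurs.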
    Date of Award: 2020
    Original language: English
    Supervisor: Girija Chetty (Supervisor), Rachel Davey (Supervisor) & Dat Tran (Supervisor)
