Missing data in pathology databases

  • Sheik Faisal

    Student thesis: Master's Thesis


    Hepatitis viruses are a major health threat in Australia and a major health burden worldwide. There are several types of hepatitis virus; this study focuses on Hepatitis C virus (HCV). The objectives of the study are to enhance the predictive power of routinely performed diagnostic pathology laboratory results by identifying patterns in biomarkers, so that HCV infection is identified earlier, and to investigate the effects of missing values on the selection of assays. To overcome the problem of missing data, Multiple Imputation, a principled statistical imputation technique, was used to fill in the missing values. The imputed dataset was analyzed to construct predictive models using decision tree and logistic regression algorithms in R and PASW 18. ALT (alanine aminotransferase) was identified as the key predictor of HCV infection by all the logistic regression models and all the decision tree models. A higher level of ALT in the blood is indicative of HCV infection, that is, of an increased level of HepC (Hepatitis C antibody) in the blood. The pooled logistic regression model suggests that an increased level of ALT (i.e. greater than 35 U/L) almost doubles the odds of HCV infection. This is further affirmed by the decision tree models: all the rules in the tree models suggest that increased ALT levels indicate the presence of HCV infection. The study has not produced a powerful predictive model that could be used on general patients to detect the presence of HCV infection, but it has provided useful information on the types of blood tests (the variables that need to be considered) to be conducted on patients who show any symptoms of HCV infection.
    Date of Award: 2011
    Original language: English
    Supervisor: Alice Richardson (Supervisor) & Brett LIDBURY (Supervisor)
