Inference of biogeographical ancestry from human genotype

  • Elaine Cheung

    Student thesis: Doctoral Thesis

    Abstract

    In the absence of a successful identity match from evidentiary biological samples during forensic investigations, the inference of biogeographical ancestry (BGA) from human genotype can assist investigators by providing leads and forensic information. The accuracy of inferences, however, is dependent on the selection of ancestry-informative markers (AIMs), input of reference populations, and the choice of classifier. The aim of this research was to refine the techniques that influence the accuracy and interpretation of BGA inferences. A comparison of model-based and distance-based classifiers was performed in both non-admixed and admixed individuals, as well as a comparison of selection techniques to obtain the most informative AIMs.
    With the number of candidate AIMs growing, a marker selection strategy was derived to balance the cumulative differentiation potential amongst reference populations so that no population is excessively over- or under-represented. Four statistical metrics for calculating differentiation potential were compared: allele frequency differences (δ), Rosenberg’s informativeness for assignment (In), F statistics (FST), and the effective number of alleles (Ae). Of these, In was the most accurate metric overall for ranking AIMs for five main continental populations: African (AFR), European (EUR), Central/South Asian (SAS), East Asian (EAS), and American (AME). Microhaplotype (MH) markers were found to outperform bi-allelic and tri-allelic single nucleotide polymorphisms (SNPs) for BGA differentiation and offer the added capability of mixture deconvolution.
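The four metrics named above can be illustrated on a single marker. The following is a minimal sketch in Python, not the thesis's own implementation: the function names are hypothetical, In follows Rosenberg's published definition with natural logarithms and equally weighted populations, FST is computed with a simple Nei-style heterozygosity ratio (an assumption, since the abstract does not name the estimator), and Ae is taken as the reciprocal of the sum of squared allele frequencies.

```python
import math

def delta(pop_a, pop_b):
    """Absolute difference in the frequency of the first allele (bi-allelic marker)."""
    return abs(pop_a[0] - pop_b[0])

def informativeness(freqs):
    """Rosenberg's informativeness for assignment (In).

    freqs: one allele-frequency list per population, all the same length.
    Populations are weighted equally; natural logarithms are used.
    """
    k = len(freqs)
    total = 0.0
    for j in range(len(freqs[0])):
        mean_p = sum(pop[j] for pop in freqs) / k
        if mean_p > 0:
            total -= mean_p * math.log(mean_p)
        total += sum(pop[j] * math.log(pop[j]) for pop in freqs if pop[j] > 0) / k
    return total

def fst(freqs):
    """Nei-style FST = (H_T - H_S) / H_T (assumed estimator, equal population weights)."""
    k = len(freqs)
    mean_p = [sum(pop[j] for pop in freqs) / k for j in range(len(freqs[0]))]
    h_total = 1.0 - sum(p * p for p in mean_p)
    h_within = sum(1.0 - sum(p * p for p in pop) for pop in freqs) / k
    return (h_total - h_within) / h_total if h_total > 0 else 0.0

def effective_alleles(pop):
    """Effective number of alleles (Ae) within one population."""
    return 1.0 / sum(p * p for p in pop)

# Hypothetical allele frequencies for one bi-allelic SNP in two populations.
pop_a = [0.9, 0.1]
pop_b = [0.2, 0.8]
print(delta(pop_a, pop_b), informativeness([pop_a, pop_b]), fst([pop_a, pop_b]))
```

All three pairwise metrics are zero when the populations share identical frequencies and grow as the frequencies diverge, which is why any of them can be used to rank candidate AIMs.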
    Approaches including Pritchard’s STRUCTURE program, a generic genetic distance algorithm (GDA), multinomial logistic regression, and the HID SNP Genotyper plugin software were compared for their ability to correctly assign BGA in admixed individuals. Each classifier was trained with continental populations sourced from well-established databases such as the Human Genome Diversity Panel (HGDP-CEPH) and the 1000 Genomes Project. Artificially admixed genotypes were simulated from unambiguously non-admixed individuals to represent admixture ratios of 1:1, 3:1, 2:1:1, and 1:1:1:1 between AFR, EUR, EAS, and AME populations. The prediction accuracy of each classifier was compared using either the area under the receiver operating characteristic curve or the mean divergence per population per individual. AME individuals were the most difficult to classify as they tended to have unreported EUR co-ancestry.
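One way to picture the simulation of artificially admixed genotypes is sketched below in Python. The abstract does not describe the sampling scheme, so this is an assumption: at each locus, each of the two alleles is drawn from a source individual chosen with probability proportional to the admixture ratio. The function name `simulate_admixed` and the data layout are hypothetical.

```python
import random

def simulate_admixed(sources, ratio, rng=None):
    """Build one artificial admixed genotype.

    sources: population label -> genotype of one non-admixed individual,
             given as a list of (allele, allele) tuples, one per locus.
    ratio:   population label -> integer weight, e.g. {"AFR": 3, "EUR": 1}.
    """
    rng = rng or random.Random()
    labels = list(ratio)
    weights = [ratio[label] for label in labels]
    n_loci = len(sources[labels[0]])
    genotype = []
    for locus in range(n_loci):
        alleles = []
        for _ in range(2):
            # Pick the contributing population, then one of its two alleles.
            pop = rng.choices(labels, weights=weights, k=1)[0]
            alleles.append(rng.choice(sources[pop][locus]))
        genotype.append(tuple(sorted(alleles)))
    return genotype

# Hypothetical source individuals typed at three loci.
sources = {
    "AFR": [("A", "A"), ("C", "T"), ("G", "G")],
    "EUR": [("G", "G"), ("T", "T"), ("A", "G")],
}
print(simulate_admixed(sources, {"AFR": 3, "EUR": 1}, random.Random(1)))
```

Under this scheme a 3:1 ratio gives an expected 75% contribution from the first population across many loci; three- and four-way ratios such as 2:1:1 and 1:1:1:1 only require more entries in `sources` and `ratio`.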
    Principal component analysis and principal coordinate analysis rely on quantitative data to make qualitative assignments based on the clustering of data points. An R script was developed that plots coloured points defined by the first three to four principal component or coordinate values. This allows the variation between points to be assessed using colour comparisons in addition to spatial clustering.
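The colour-coding idea can be sketched as follows. This is an illustrative Python version, not the R code developed in the thesis: it assumes the principal component (or coordinate) scores have already been computed, and it maps the first three to the red, green, and blue channels by min-max scaling. The function name `pcs_to_hex` is hypothetical, and a fourth component would need a further visual channel such as transparency.

```python
def pcs_to_hex(scores):
    """Map the first three component scores of each sample to a hex colour.

    scores: list of rows, each holding at least three numeric component values.
    Each component is min-max scaled to 0..255 across all samples; a component
    with no variation is placed at mid-intensity.
    """
    columns = list(zip(*[row[:3] for row in scores]))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    colours = []
    for row in scores:
        rgb = []
        for value, low, high in zip(row[:3], lows, highs):
            scaled = 0.5 if high == low else (value - low) / (high - low)
            rgb.append(round(scaled * 255))
        colours.append("#{:02x}{:02x}{:02x}".format(*rgb))
    return colours

# Hypothetical scores for three samples on three components.
print(pcs_to_hex([[-1.0, 0.0, 2.0], [1.0, 0.0, 4.0], [0.0, 0.0, 3.0]]))
```

Samples with similar scores on all three components receive similar colours, so colour differences flag variation that a two-dimensional scatter plot alone may hide.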
    In conclusion, our comparative analyses showed that distance-based metrics were the most sensitive to admixture but were also likely to misclassify individuals from populations that were poorly represented by reference data. Admixture proportions assigned by model-based likelihood estimators, namely STRUCTURE and the HID SNP Genotyper plugin software, were the most accurate for BGA inference.
    Date of Award: 2019
    Original language: English
    Supervisor: Dennis Mcnevin (Supervisor), James Robertson (Supervisor), Tamsin KELLY (Supervisor), Michelle Gahan (Supervisor) & Fabiana Santana (Supervisor)
