Inference of biogeographical ancestry from human genotype

  • Elaine Cheung

    Student thesis: Doctoral Thesis

    Abstract

    In the absence of a successful identity match from evidentiary biological samples during
    forensic investigations, the inference of biogeographical ancestry (BGA) from human genotype
    can assist investigators by providing leads and forensic information. The accuracy of
    inferences, however, is dependent on the selection of ancestry-informative markers (AIMs),
    input of reference populations, and the choice of classifier. The aim of this research was to
    refine the techniques that influence the accuracy and interpretation of BGA inferences. A
    comparison of model-based and distance-based classifiers was performed in both non-admixed
    and admixed individuals, as well as a comparison of selection techniques to obtain the most
    informative AIMs.
    With the number of candidate AIMs growing, a marker selection strategy was derived to
    balance the cumulative differentiation potential amongst reference populations so that no
    population is excessively over- or under-represented. Four statistical metrics for calculating
    differentiation potential were compared: allele frequency differences (δ), Rosenberg’s
    informativeness for assignment (I_n), the F statistic (F_ST), and the effective number of
    alleles (A_e). I_n was the most accurate metric overall for ranking AIMs for five main continental
    populations: African (AFR), European (EUR), Central/South Asian (SAS), East Asian (EAS),
    and American (AME). Microhaplotype (MH) markers were found to outperform bi-allelic and
    tri-allelic single nucleotide polymorphisms (SNPs) for BGA differentiation and offer the added
    capability of mixture deconvolution.
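As a rough illustration of the four ranking metrics named above, the following Python sketch computes each one for a single marker from per-population allele frequencies. It is not the thesis code: the function names are hypothetical, populations are weighted equally, and F_ST is taken here in its simple heterozygosity form, (H_T - H_S) / H_T, which is only one of several estimators in use.

```python
import math


def delta(p1, p2):
    """Absolute difference in the frequency of one allele between two populations."""
    return abs(p1 - p2)


def effective_alleles(freqs):
    """Effective number of alleles: 1 / sum of squared allele frequencies."""
    return 1.0 / sum(p * p for p in freqs)


def informativeness(pop_freqs):
    """Rosenberg's informativeness for assignment (I_n) for one marker.

    pop_freqs is a list of K populations, each a list of allele frequencies.
    Natural logarithms are used, and 0 * log(0) is treated as 0.
    """
    k = len(pop_freqs)
    total = 0.0
    for j in range(len(pop_freqs[0])):
        p_mean = sum(pop[j] for pop in pop_freqs) / k
        if p_mean > 0:
            total -= p_mean * math.log(p_mean)
        for pop in pop_freqs:
            if pop[j] > 0:
                total += pop[j] * math.log(pop[j]) / k
    return total


def fst(pop_freqs):
    """F_ST as (H_T - H_S) / H_T with equal population weights (an assumption)."""
    k = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    h_s = sum(1.0 - sum(p * p for p in pop) for pop in pop_freqs) / k
    mean = [sum(pop[j] for pop in pop_freqs) / k for j in range(n_alleles)]
    h_t = 1.0 - sum(p * p for p in mean)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


# A marker fixed for different alleles in two populations is maximally informative.
fixed = [[1.0, 0.0], [0.0, 1.0]]
print(delta(1.0, 0.0), informativeness(fixed), fst(fixed))
```

For two populations, I_n reaches its maximum of log 2 when the marker is fixed for different alleles, and all of these metrics fall to zero (or to an uninformative baseline for A_e) when allele frequencies are identical across populations.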
    Approaches including Pritchard’s STRUCTURE program, a generic genetic distance algorithm
    (GDA), multinomial logistic regression, and the HID SNP Genotyper plugin software were
    compared for their ability to correctly assign BGA in admixed individuals. Each classifier was
    trained with continental populations sourced from well-established databases such as the
    Human Genome Diversity Panel (HGDP-CEPH) and the 1000 Genomes Project. Artificially
    admixed genotypes were simulated from unambiguously non-admixed individuals to represent
    admixture ratios of 1:1, 3:1, 2:1:1, and 1:1:1:1 between AFR, EUR, EAS, and AME
    populations. The prediction accuracy of each classifier was compared using either the area
    under the receiver operating characteristic curve or the mean divergence per population per
    individual. AME individuals were the most difficult to classify as they tended
    to have unreported EUR co-ancestry.
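The abstract does not say how the artificial genotypes were built or how divergence was defined, so the sketch below is only one plausible reading. It assumes each allele of the simulated individual is copied from a source individual chosen in proportion to the admixture ratio, and it treats divergence as the mean absolute difference between estimated and expected ancestry proportions. The names `simulate_admixed` and `mean_divergence` are hypothetical.

```python
import random


def simulate_admixed(parents, ratio, rng):
    """Build one admixed genotype from non-admixed source genotypes.

    parents: list of genotypes, each a list of (allele, allele) tuples per locus.
    ratio:   admixture weights in the same order as parents, e.g. (3, 1).
    rng:     a random.Random instance, passed in for reproducibility.
    """
    child = []
    for locus in range(len(parents[0])):
        alleles = []
        for _ in range(2):
            # Pick the source individual by ratio, then one of its two alleles.
            source = rng.choices(parents, weights=ratio, k=1)[0]
            alleles.append(rng.choice(source[locus]))
        child.append(tuple(alleles))
    return child


def mean_divergence(estimated, expected):
    """Mean absolute difference between estimated and expected proportions."""
    return sum(abs(e - x) for e, x in zip(estimated, expected)) / len(expected)


# Toy data: two source individuals, each homozygous for a different allele.
rng = random.Random(1)
parent_a = [("A", "A")] * 2000
parent_b = [("B", "B")] * 2000
child = simulate_admixed([parent_a, parent_b], (3, 1), rng)
share_a = sum(g.count("A") for g in child) / (2 * len(child))
print(round(share_a, 2), mean_divergence([share_a, 1 - share_a], [0.75, 0.25]))
```

With this scheme a 3:1 ratio yields roughly 75% of alleles from the first source, so a classifier's estimated proportions can be scored against the known ratio.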
    Principal component analysis and principal coordinate analysis rely on quantitative data to
    make qualitative assignments based on the clustering of data points. An R script was developed
    that plots coloured points defined by the first three to four principal component or coordinate
    values. This allows the variation between points to be assessed using colour comparisons in
    addition to spatial clustering.
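The colour-coding idea can be sketched as follows, in Python rather than the R used in the thesis. This is an assumption about one way to do it, not the author's script: the first three component values of each point are rescaled to the 0-255 range and used as red, green, and blue channels, so that points with similar component values receive similar colours. The function name `scores_to_hex` is hypothetical, and the component scores are taken as already computed.

```python
def scores_to_hex(scores):
    """Map the first three component values of each point to an RGB hex colour.

    scores: list of rows, each holding at least three principal component
            (or principal coordinate) values for one individual.
    """
    columns = list(zip(*[row[:3] for row in scores]))
    lows = [min(col) for col in columns]
    # A component with no variation gets span 1.0 to avoid division by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    colours = []
    for row in scores:
        channels = [round(255 * (row[i] - lows[i]) / spans[i]) for i in range(3)]
        colours.append("#{:02x}{:02x}{:02x}".format(*channels))
    return colours


# Three toy individuals described by three component values each.
example = [[-1.0, 0.0, 2.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(scores_to_hex(example))
```

The resulting hex strings can be passed as point colours to any plotting library, which lets a third component be read from colour even on a two-dimensional scatter plot.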
    In conclusion, our comparative analyses showed distance-based metrics were the most sensitive
    to admixture but also likely to misclassify individuals from populations that were poorly
    represented by reference data. Admixture proportions assigned by model-based likelihood
    estimators, namely STRUCTURE and the HID SNP Genotyper plugin software, were the most
    accurate for BGA inference.
    Date of Award: 2019
    Original language: English
    Supervisors: Dennis Mcnevin, James Robertson, Tamsin KELLY, Michelle Gahan & Fabiana Santana
