Inference of biogeographical ancestry from human genotype

  • Elaine Cheung

Student thesis: Doctoral Thesis

Abstract

In the absence of a successful identity match from evidentiary biological samples during
forensic investigations, the inference of biogeographical ancestry (BGA) from human genotype
can assist investigators by providing leads and forensic information. The accuracy of
inferences, however, is dependent on the selection of ancestry-informative markers (AIMs),
input of reference populations, and the choice of classifier. The aim of this research was to
refine the techniques that influence the accuracy and interpretation of BGA inferences. A
comparison of model-based and distance-based classifiers was performed in both non-admixed
and admixed individuals, as well as a comparison of selection techniques to obtain the most
informative AIMs.
With the number of candidate AIMs growing, a marker selection strategy was derived to
balance the cumulative differentiation potential amongst reference populations so that no
population is excessively over- or under-represented. Four statistical metrics for calculating
differentiation potential were compared: allele frequency difference (δ),
Rosenberg’s informativeness for assignment (In), F statistics (FST), and the effective number of
alleles (Ae). The In metric was the most accurate overall for ranking AIMs for five main continental
populations: African (AFR), European (EUR), Central/South Asian (SAS), East Asian (EAS),
and American (AME). Microhaplotype (MH) markers were found to outperform bi-allelic and
tri-allelic single nucleotide polymorphisms (SNPs) for BGA differentiation, and they offer the
additional capability of mixture deconvolution.
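
As an illustration only, and not the code used in the thesis, the four metrics can be sketched for a single marker given per-population allele frequencies. The function names are hypothetical, In follows Rosenberg's published definition with natural logarithms, and FST is computed here in Nei's heterozygosity-based form, which is an assumption because the abstract does not state which estimator was used.

```python
import math


def delta(p1, p2):
    """Absolute allele frequency difference between two populations."""
    return abs(p1 - p2)


def informativeness(freqs):
    """Rosenberg's In for one marker.

    freqs: one list of allele frequencies per population (each sums to 1).
    """
    k = len(freqs)
    total = 0.0
    for j in range(len(freqs[0])):
        p_mean = sum(pop[j] for pop in freqs) / k
        if p_mean > 0:
            total -= p_mean * math.log(p_mean)
        for pop in freqs:
            if pop[j] > 0:
                total += pop[j] * math.log(pop[j]) / k
    return total


def fst_nei(freqs):
    """(H_T - H_S) / H_T from expected heterozygosities (assumed estimator)."""
    k = len(freqs)
    h_s = sum(1 - sum(p * p for p in pop) for pop in freqs) / k
    mean = [sum(pop[j] for pop in freqs) / k for j in range(len(freqs[0]))]
    h_t = 1 - sum(p * p for p in mean)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def effective_alleles(pop):
    """Effective number of alleles, Ae = 1 / sum(p_i^2), in one population."""
    return 1.0 / sum(p * p for p in pop)
```

A marker fixed for different alleles in two populations gives the maximum values (δ = 1, In = ln 2, FST = 1), while identical frequencies give zero for all three; Ae is highest when many alleles occur at similar frequencies, as can happen with multi-allelic microhaplotypes.
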
Approaches including Pritchard’s STRUCTURE program, a generic genetic distance algorithm
(GDA), multinomial logistic regression, and the HID SNP Genotyper plugin software were
compared for their ability to correctly assign BGA in admixed individuals. Each classifier was
trained with continental populations sourced from well-established databases such as the
Human Genome Diversity Panel (HGDP-CEPH) and the 1000 Genomes Project. Artificially
admixed genotypes were simulated from unambiguously non-admixed individuals to represent
admixture ratios of 1:1, 3:1, 2:1:1, and 1:1:1:1 between AFR, EUR, EAS, and AME
populations. The prediction accuracy of each classifier was compared either by using the area
under the receiver operating characteristic curve or by calculating the mean divergence per
population per individual. AME individuals were the most difficult to classify because they tended
to have unreported EUR co-ancestry.
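
A minimal sketch of how such a simulation could be set up is given below. It is not the procedure used in the thesis: the per-allele sampling scheme, the function names, and the reading of "mean divergence" as the mean absolute difference between estimated and expected ancestry proportions are all assumptions made for illustration.

```python
import random


def expected_proportions(ratio):
    """Convert an admixture ratio such as {"AFR": 3, "EUR": 1} to proportions."""
    total = sum(ratio.values())
    return {pop: part / total for pop, part in ratio.items()}


def simulate_admixed(sources, ratio, rng):
    """Build one artificial genotype from non-admixed source individuals.

    sources: population label -> genotype, a list of (allele, allele) pairs
             per locus, taken from one non-admixed individual.
    ratio:   population label -> integer part of the admixture ratio.
    Each allele is drawn from a source population chosen with probability
    equal to its expected proportion (assumed scheme).
    """
    props = expected_proportions(ratio)
    pops = list(props)
    weights = [props[pop] for pop in pops]
    n_loci = len(sources[pops[0]])
    genotype = []
    for locus in range(n_loci):
        alleles = []
        for _ in range(2):
            pop = rng.choices(pops, weights=weights)[0]
            alleles.append(rng.choice(sources[pop][locus]))
        genotype.append(tuple(alleles))
    return genotype


def mean_divergence(estimated, expected):
    """Mean absolute difference between estimated and expected proportions."""
    return sum(abs(estimated.get(pop, 0.0) - expected[pop])
               for pop in expected) / len(expected)
```

For a 3:1 AFR:EUR simulation the expected proportions are 0.75 and 0.25, so a classifier reporting 0.70 and 0.30 would have a mean divergence of 0.05 under this definition.
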
Principal component analysis and principal coordinate analysis rely on quantitative data to
make qualitative assignments based on the clustering of data points. An R script was developed
to plot coloured points defined by the first three to four principal component or coordinate
values. This allows the variation between points to be assessed using colour comparisons in
addition to spatial clustering.
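
The colouring idea can be sketched as follows, in Python rather than R and without reproducing the thesis script. The sketch assumes that the first three component scores have already been computed and that each is rescaled to one channel of an RGB colour; the actual mapping used in the thesis, including how a fourth component is handled, is not described in the abstract.

```python
def scores_to_colours(scores):
    """Map (pc1, pc2, pc3) scores to hex RGB colours, one per individual.

    Each component is min-max scaled to 0..255 across all individuals, so
    individuals with similar scores receive similar colours. A component
    with no variation is set to mid-intensity.
    """
    columns = list(zip(*scores))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    colours = []
    for row in scores:
        channels = []
        for value, low, span in zip(row, lows, spans):
            scaled = 0.5 if span == 0 else (value - low) / span
            channels.append(round(scaled * 255))
        colours.append("#{:02x}{:02x}{:02x}".format(*channels))
    return colours
```

The resulting colours can be passed to any plotting routine, so that two individuals lying close together on the first two axes but differing on the third appear in visibly different colours.
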
In conclusion, our comparative analyses showed distance-based metrics were the most sensitive
to admixture but also likely to misclassify individuals from populations that were poorly
represented by reference data. Admixture proportions assigned by model-based likelihood
estimators, namely STRUCTURE and the HID SNP Genotyper plugin software, were the most
accurate for BGA inference.
Date of Award: 2019
Original language: English
Supervisors: Dennis Mcnevin, James Robertson, Tamsin Kelly, Michelle Gahan & Fabiana Santana
