Species distribution models can be highly sensitive to algorithm configuration

W. Hallgren, F. Santana, S. Low-Choy, Y. Zhao, B. Mackey

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

In pursuit of more robust provenance in the field of species distribution modelling, an extensive literature search was undertaken to find the typical default values, and the range of values, for configuration settings of a large number of the most commonly used statistical algorithms available for constructing species distribution models (SDMs), as implemented in R packages (such as dismo and biomod2) or in other species distribution modelling programs such as MaxEnt. We found that SDM studies rarely document their algorithm configuration option settings, and that justifications for these settings, when present, were minimal. Such settings were often the R default values, or were the result of trial and error. This is potentially concerning because: (i) it detracts from the robustness of the provenance for such SDM studies; (ii) a lack of documentation of configuration option settings in a paper prevents the replication of an experiment, which contravenes one of the main tenets of the scientific method; (iii) inappropriate or uninformed configuration option settings are particularly concerning if they represent a poorly understood ecological variable or process, and if the algorithm is sensitive to such settings, this could result in erroneous and/or unrealistic SDMs. Therefore, this study sets out to comprehensively test the sensitivity of eight widely used SDM algorithms to variation in configuration option settings: MaxEnt, Artificial Neural Network (ANN), Generalized Linear Model (GLM), Generalized Additive Model (GAM), Multivariate Adaptive Regression Splines (MARS), Flexible Discriminant Analysis (FDA), Surface Range Envelope (SRE) and Classification Tree Analysis (CTA). A process of expert elicitation was used to derive a range of appropriate values with which to test the sensitivity of our algorithms.
We chose to use species occurrence records for two species, Koala (Phascolarctos cinereus) and Thorny Devil (Moloch horridus), in order to investigate how algorithm sensitivity depends on the species being modelled. Results were assessed by comparing the modelled distribution of the control SDM (default settings) to the modelled distribution from each sensitivity test SDM (i.e. non-default configuration settings). This was done using the visual and statistical measures of predictive performance available in the Biodiversity and Climate Change Virtual Laboratory (BCCVL), including the area under the receiver operating characteristic curve (AUC). The aim of our study was to draw conclusions as to how the sensitivity of SDM algorithms to their configuration option settings may detract from the reliability of SDM results, given the often unjustified and unscrutinized use of the default settings, and the generally infrequent and largely perfunctory attention to this issue in most of the published SDM literature. Our results indicate that all of the algorithms tested showed sensitivity to alternative (non-default) values for some of their configuration settings, and that this sensitivity is often species-dependent. We therefore conclude that the choice of configuration settings in these widely used SDM algorithms can have a large impact on the resulting projected distribution. This has important ramifications for decision-making and policy outcomes wherever SDMs are used to inform species and biodiversity management plans and policy settings. This study demonstrates that assigning suitable values for these settings is a very important consideration, and that the chosen values should always be published along with the model. Documenting all configuration settings is necessary to increase the scientific robustness, transparency and reproducibility of species distribution modelling studies.
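The control-versus-sensitivity-test comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or the BCCVL implementation: the function names `auc` and `auc_shift`, and the six-site data set, are hypothetical and chosen only to show how a change in one configuration setting could be summarised as a shift in the area under the ROC curve.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the Mann-Whitney statistic.

    This equals the probability that a randomly chosen presence site
    receives a higher score than a randomly chosen absence site
    (ties count as one half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_shift(
    labels: Sequence[int],
    control_scores: Sequence[float],
    test_scores: Sequence[float],
) -> float:
    """AUC of the sensitivity-test model minus AUC of the control model."""
    return auc(labels, test_scores) - auc(labels, control_scores)


# Hypothetical evaluation data: 1 = presence, 0 = absence, at six sites.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical suitability scores from the control model (default settings).
control = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
# Hypothetical scores after changing one configuration setting.
alternative = [0.9, 0.6, 0.3, 0.7, 0.4, 0.1]

print(f"control AUC     = {auc(labels, control):.3f}")
print(f"alternative AUC = {auc(labels, alternative):.3f}")
print(f"shift           = {auc_shift(labels, control, alternative):+.3f}")
```

A shift near zero would suggest the algorithm is insensitive to that setting for the species in question. AUC alone does not capture changes in the mapped distribution, which is why the abstract also refers to visual comparison.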

Original language: English
Article number: 108719
Pages (from-to): 1-23
Number of pages: 23
Journal: Ecological Modelling
Volume: 408
Early online date: 19 Jul 2019
DOIs: 10.1016/j.ecolmodel.2019.108719
Publication status: Published - 15 Sep 2019

Fingerprint

distribution
provenance
biodiversity
modeling
species occurrence
discriminant analysis
transparency
artificial neural network
decision making
climate change
test
experiment
policy
documentation
trial
programme
method
analysis
laboratory
management plan

Cite this

Hallgren, W. ; Santana, F. ; Low-Choy, S. ; Zhao, Y. ; Mackey, B. / Species distribution models can be highly sensitive to algorithm configuration. In: Ecological Modelling. 2019 ; Vol. 408. pp. 1-23.
@article{9af074eeb20e4662a480e21c8431ca42,
title = "Species distribution models can be highly sensitive to algorithm configuration",
keywords = "ANN, GLM, Configuration option settings, CTA, FDA, GAM, Koala, MARS, MaxEnt, Provenance, SRE, Thorny devil, Transparency",
author = "W. Hallgren and F. Santana and S. Low-Choy and Y. Zhao and B. Mackey",
year = "2019",
month = "9",
day = "15",
doi = "10.1016/j.ecolmodel.2019.108719",
language = "English",
volume = "408",
pages = "1--23",
journal = "Ecological Modelling",
issn = "0304-3800",
publisher = "Elsevier",

}

TY - JOUR

T1 - Species distribution models can be highly sensitive to algorithm configuration

AU - Hallgren, W.

AU - Santana, F.

AU - Low-Choy, S.

AU - Zhao, Y.

AU - Mackey, B.

PY - 2019/9/15

Y1 - 2019/9/15

KW - ANN

KW - GLM

KW - Configuration option settings

KW - CTA

KW - FDA

KW - GAM

KW - Koala

KW - MARS

KW - MaxEnt

KW - Provenance

KW - SRE

KW - Thorny devil

KW - Transparency

UR - http://www.scopus.com/inward/record.url?scp=85069604417&partnerID=8YFLogxK

U2 - 10.1016/j.ecolmodel.2019.108719

DO - 10.1016/j.ecolmodel.2019.108719

M3 - Article

VL - 408

SP - 1

EP - 23

JO - Ecological Modelling

JF - Ecological Modelling

SN - 0304-3800

M1 - 108719

ER -