Collinearity: a review of methods to deal with it and a simulation study evaluating their performance

Carsten Dormann, Jane Elith, Sven Bacher, Carsten Buchmann, Gudrun Carl, Gabriel Carre, Jaime Marquez, Bernd Gruber, Bruno Lafourcade, Pedro Leitao, Tamara Munkemuller, Colin McClean, Patrick Osborne, Björn Reineking, Boris Schroder, Andrew Skidmore, Damaris Zurell, Sven Lautenbach

    Research output: Contribution to journal, Article

    2141 Citations (Scopus)

    Abstract

    Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can be a problem for parameter estimation because it inflates the variance of regression parameters and hence potentially leads to the wrong identification of relevant predictors in a statistical model. Collinearity is a severe problem when a model is trained on data from one region or time, and predicted to another with a different or unknown structure of collinearity. To demonstrate the reach of the problem of collinearity in ecology, we show how relationships among predictors differ between biomes, change over spatial scales and through time. Across disciplines, different approaches to addressing collinearity problems have been developed, ranging from clustering of predictors, threshold-based pre-selection, through latent variable methods, to shrinkage and regularisation. Using simulated data with five predictor-response relationships of increasing complexity and eight levels of collinearity, we compared ways to address collinearity with standard multiple regression and machine-learning approaches. We assessed the performance of each approach by testing its impact on prediction to new data. In the extreme, we tested whether the methods were able to identify the true underlying relationship in a training dataset with strong collinearity by evaluating its performance on a test dataset without any collinearity. We found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. Our results highlight the value of GLM in combination with penalised methods (particularly ridge) and threshold-based pre-selection when omitted variables are considered in the final interpretation. However, all approaches tested yielded degraded predictions under change in collinearity structure, and the 'folk lore' threshold of correlation coefficients between predictor variables of |r| > 0.7 was an appropriate indicator for when collinearity begins to severely distort model estimation and subsequent prediction. The use of ecological understanding of the system in pre-analysis variable selection and the choice of the least sensitive statistical approaches reduce the problems of collinearity, but cannot ultimately solve them.
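
As an illustration only (this is not the authors' simulation code), threshold-based pre-selection with the |r| > 0.7 rule mentioned in the abstract could be sketched in Python roughly as follows. The synthetic data, the predictor names x1 to x3, and the tie-breaking rule (keep whichever predictor is more strongly correlated with the response) are all assumptions made for this example.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.7):
    """List predictor pairs whose absolute Pearson correlation exceeds threshold."""
    r = np.corrcoef(X, rowvar=False)  # columns of X are the predictors
    found = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                found.append((names[i], names[j], float(r[i, j])))
    return found


def preselect(X, y, names, threshold=0.7):
    """Greedy pre-selection (an assumed variant, one of several in use).

    Predictors are visited in order of decreasing |correlation with y|;
    a predictor is kept only if its |r| with every predictor kept so far
    does not exceed the threshold.
    """
    r = np.corrcoef(X, rowvar=False)
    strength = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = sorted(range(X.shape[1]), key=lambda j: strength[j], reverse=True)
    kept = []
    for j in order:
        if all(abs(r[j, k]) <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in sorted(kept)]


# Hypothetical data: x2 is a noisy copy of x1, x3 is independent of both.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
x3 = rng.normal(size=n)
y = 2.0 * x1 + x3 + 0.5 * rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

pairs = correlated_pairs(X, names)
kept = preselect(X, y, names)
print("pairs above threshold:", pairs)
print("predictors kept:", kept)
```

Dropping a predictor this way does not remove its influence from the system being studied, which is why the abstract stresses that omitted variables must still be considered when interpreting the final model.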
    Original language: English
    Pages (from-to): 27-46
    Number of pages: 20
    Journal: Ecography
    Volume: 36
    Issue number: 1
    DOI: https://doi.org/10.1111/j.1600-0587.2012.07348.x
    Publication status: Published - 2013

    Fingerprint

    simulation
    prediction
    methodology
    artificial intelligence
    biome
    statistical models
    shrinkage
    multiple regression
    testing
    method
    ecology
    ecosystems
    analysis
    parameter
    test
    parameter estimation
    indicator
    machine learning

    Cite this

    Dormann, C., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carre, G., ... Lautenbach, S. (2013). Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27-46. https://doi.org/10.1111/j.1600-0587.2012.07348.x
    @article{cfb9e5c721164c97ac8467267c3ca770,
    title = "Collinearity: a review of methods to deal with it and a simulation study evaluating their performance",
    author = "Carsten Dormann and Jane Elith and Sven Bacher and Carsten Buchmann and Gudrun Carl and Gabriel Carre and Jaime Marquez and Bernd Gruber and Bruno Lafourcade and Pedro Leitao and Tamara Munkemuller and Colin McClean and Patrick Osborne and Bj{\"o}rn Reineking and Boris Schroder and Andrew Skidmore and Damaris Zurell and Sven Lautenbach",
    year = "2013",
    doi = "10.1111/j.1600-0587.2012.07348.x",
    language = "English",
    volume = "36",
    pages = "27--46",
    journal = "Ecography",
    issn = "0906-7590",
    publisher = "Wiley-Blackwell",
    number = "1",

    }
