Model evaluation: The misuse of statistical techniques when evaluating observations versus predictions

Keywords

Deviance metrics
modelling efficiency
bias
slope
deviance

How to Cite

McPhee, M., Richetti, J., Croke, B., & Walmsley, B. (2024). Model evaluation: The misuse of statistical techniques when evaluating observations versus predictions. Socio-Environmental Systems Modelling, 6, 18758. https://doi.org/10.18174/sesmo.18758

Abstract

Mathematical modellers, decision support developers, statisticians, and students routinely evaluate the differences between observed and model-predicted values. It is far too easy to conduct model evaluation simply by fitting a linear regression to the data. This paper presents steps on ‘how to’ evaluate a model using deviance metrics rather than reporting r² from a fitted linear regression, and aims to provide sound reasoning, with data, against using r². It addresses five previously published arguments against fitting a linear regression when conducting model evaluation: i) misapplication of regression; ii) ambiguity of null hypothesis tests; iii) lack of sensitivity; iv) the fitted line is irrelevant to validation; and v) violation of regression assumptions. Statistical, deviance, and quality control metrics are outlined. Three models using the BeefSpecs drafting tool are reported. Each model (n = 80) had an r² of 0.43. For models 1, 2, and 3, respectively, the mean bias was 0.06, -2.90, and -0.11 mm, and the root mean square error of prediction (RMSEP) was 1.72, 3.37, and 3.70 mm. The modelling efficiency (MEF) was 0.39, -1.34, and -1.83, and 91, 51, and 56% of predictions fell within the upper and lower quality control limits. These metrics highlight the pitfall of reporting r² from a regression: models with an identical r² can differ widely in bias and predictive accuracy. The minimum recommended steps of ‘how to’ conduct model evaluation are a plot of the residuals with quality control limits and a table of metrics including mean observed, mean predicted, mean bias, RMSEP, and MEF.
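The table of metrics recommended above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the sign convention for bias (observed minus predicted), the Nash-Sutcliffe form of MEF, the fixed symmetric quality control limit, the function name `evaluation_metrics`, and the five-point data set are all assumptions made for the example.

```python
import math


def evaluation_metrics(observed, predicted, control_limit):
    """Return a table of evaluation metrics (hypothetical helper).

    Assumptions, not taken from the paper:
    - residual = observed - predicted
    - MEF = 1 - SSE / SS(observed about its mean)
    - quality control limits are a fixed +/- control_limit on the residuals
    """
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    sse = sum(r * r for r in residuals)
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    cross = sum((o - mean_obs) * (p - mean_pred)
                for o, p in zip(observed, predicted))

    return {
        "mean_observed": mean_obs,
        "mean_predicted": mean_pred,
        "mean_bias": sum(residuals) / n,
        "rmsep": math.sqrt(sse / n),
        "mef": 1.0 - sse / ss_obs,
        # r2 of a simple linear regression = squared Pearson correlation
        "r2": cross * cross / (ss_obs * ss_pred),
        "pct_within_limits":
            100.0 * sum(abs(r) <= control_limit for r in residuals) / n,
    }


# Made-up data: model B is model A with a constant 3-unit over-prediction.
observed = [4.0, 6.0, 8.0, 10.0, 12.0]
predicted_a = [5.0, 5.0, 9.0, 9.0, 12.0]
predicted_b = [p + 3.0 for p in predicted_a]

model_a = evaluation_metrics(observed, predicted_a, control_limit=2.0)
model_b = evaluation_metrics(observed, predicted_b, control_limit=2.0)

for name, m in (("A", model_a), ("B", model_b)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this toy example both models have the same r² (0.9), because a constant offset does not change the correlation, yet model B has a mean bias of -3, a negative MEF, and only 40% of its residuals inside the limits. That is the same pattern the three BeefSpecs models show at a shared r² of 0.43.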

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Copyright (c) 2024 Malcolm McPhee, Jonathan Richetti, Barry Croke, Brad Walmsley