Forecasting models have provided timely and critical information about the course of the COVID-19 pandemic, predicting both the timing of peak mortality and the total magnitude of mortality, which can guide health system response and resource allocation. Out-of-sample predictive validation, that is, checking how well past versions of forecasting models predict subsequently observed trends, provides insight into future model performance. As data and models are updated regularly, a publicly available, transparent, and reproducible framework is needed to evaluate them in an ongoing manner. We reviewed 384 published and unpublished COVID-19 forecasting models, and evaluated seven models for which publicly available, multi-country, and date-versioned mortality estimates could be downloaded. These were the models produced by DELPHI-MIT (Delphi), Youyang Gu (YYG), the Los Alamos National Laboratory (LANL), and Imperial College London (Imperial), together with three models produced by the Institute for Health Metrics and Evaluation (IHME): a curve fit model (IHME-CF), a hybrid curve fit and epidemiological compartment model (IHME-CF SEIR), and a hybrid mortality spline and epidemiological compartment model (IHME-MS SEIR). Collectively, the models covered 171 countries, as well as the 50 states of the United States and Washington, D.C., and accounted for >99% of all reported COVID-19 deaths on July 11th, 2020. As expected, errors in mortality predictions increased with the number of weeks of extrapolation. Among the most recent models, released in June, the best-performing model at four weeks of forecasting was the IHME-MS SEIR model, with a cumulative median absolute percent error of 6.4%, followed by YYG (6.5%) and LANL (8.0%). Looking across models, errors in cumulative mortality predictions were highest in sub-Saharan Africa and lowest in high-income countries, reflecting differences in data availability and prediction difficulty in earlier vs. later stages of the epidemic.
For peak timing prediction, among models released in April, median absolute error values at six weeks ranged from 23 days for the IHME-CF model to 36 days for the YYG model. In sum, we provide a publicly available dataset and evaluation framework for assessing the predictive validity of COVID-19 mortality forecasts. We find substantial variation in predictive performance between models, and note large differences in average predictive validity between regions, highlighting priority areas for further study in sub-Saharan Africa and other emerging-epidemic contexts.
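The two summary metrics reported above, median absolute percent error in cumulative mortality and median absolute error in peak timing, can be illustrated with a minimal Python sketch. This is not the evaluation code used in the study: the function names, the choice to skip locations with zero observed deaths, and all input values are hypothetical and chosen only to show the arithmetic.

```python
from datetime import date
from statistics import median


def median_absolute_percent_error(predicted, observed):
    """Median across locations of 100 * |predicted - observed| / observed.

    Assumption for this sketch: locations with zero observed deaths are
    skipped so that the percent error is always defined.
    """
    errors = [
        100.0 * abs(p - o) / o
        for p, o in zip(predicted, observed)
        if o > 0
    ]
    if not errors:
        raise ValueError("no locations with nonzero observed deaths")
    return median(errors)


def median_absolute_peak_error_days(predicted_peaks, observed_peaks):
    """Median across locations of |predicted peak date - observed peak date| in days."""
    errors = [abs((p - o).days) for p, o in zip(predicted_peaks, observed_peaks)]
    return median(errors)


# Hypothetical cumulative deaths for three locations at a fixed forecast horizon.
predicted_deaths = [110, 95, 200]
observed_deaths = [100, 100, 250]
mape = median_absolute_percent_error(predicted_deaths, observed_deaths)

# Hypothetical predicted and observed dates of peak daily mortality.
predicted_peaks = [date(2020, 4, 20), date(2020, 5, 1), date(2020, 4, 30)]
observed_peaks = [date(2020, 4, 10), date(2020, 5, 15), date(2020, 4, 2)]
peak_error = median_absolute_peak_error_days(predicted_peaks, observed_peaks)

print(f"Median absolute percent error: {mape:.1f}%")
print(f"Median absolute peak timing error: {peak_error} days")
```

Using the median rather than the mean keeps a single location with a very large error from dominating the summary, which matters when locations differ widely in data quality and epidemic stage.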
IHME COVID-19 Model Comparison Team. Predictive performance of international COVID-19 mortality forecasting models. medRxiv. 14 July 2020. doi:10.1101/2020.07.13.20151233.