In the first post in this “All models are wrong” series (which I decided was a series about 2 minutes ago), I talked about how the narrative that the COVID-19 models are wrong is misleading. Although some of the more extreme scenarios they describe have not come to fruition, that is by design. When models are used to make forecasts, they require assumptions about uncertain parameters and future policies and behaviors. Because of this, models typically describe a variety of different scenarios, and by definition, at most one of those scenarios can actually occur. So, “the models are wrong” is a truism that says nothing about their usefulness.
So how can we evaluate models? How can we tell if a given model is useful? In this post, I will discuss a few criteria that one should consider when evaluating a model. I will also discuss why one of the most popular models, the Institute for Health Metrics and Evaluation (IHME) model from the University of Washington, falls short and should not be trusted.
Principle #1: A good model doesn’t just match old data; it generalizes to new data. When scientists create a model, they typically fit it to the data. They do this by tuning the parameters of the model until the trends seen in the model match the trends seen in the data. You can think of this like turning the knobs on a stereo until it sounds just right. For example, most people have heard of a line of best fit. This is a model with two parameters: the slope of the line (how steep it is) and the intercept of the line (its output value when the input is zero). Given a collection of data points (pairs of input values and output values), you can determine the best possible slope and intercept by adjusting those parameters until additional changes cease to reduce the error in the model, which you can think of as the differences between the model’s predictions and the observed outputs. More sophisticated models may have hundreds, thousands, or even more parameters, but they can be tuned in a similar way. Here is a quick demo to see how this line fitting works: https://www.desmos.com/calculator/clmmvuatjs
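If you want to play with this beyond the Desmos demo, here is a minimal sketch in Python (using NumPy, with made-up data points) of fitting the slope and intercept by minimizing the squared error:

```python
import numpy as np

# Made-up (x, y) data points: pairs of inputs and observed outputs.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])

# np.polyfit "turns the knobs" for us: it finds the slope and intercept
# that minimize the sum of squared differences between predictions and data.
slope, intercept = np.polyfit(x, y, deg=1)

predictions = slope * x + intercept
error = np.sum((predictions - y) ** 2)  # total squared error after fitting

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, squared error = {error:.2f}")
```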
When you have many parameters to tune, models have a tendency to “overfit” the data. This means, essentially, that the model fits the data too well. Now, I know what you are thinking: how do you fit the data too well? Isn’t fitting the data the whole point? Well, yes and no. The question is: which data do you want the model to fit? Usually we want the model to explain not just the data it has seen but new data as well. It turns out that at some point in the modeling process you will inevitably encounter a tradeoff between the accuracy on the data used to fit the model and the accuracy on new, unseen data.
You can see this in the example below. Here, the model (the black curve) was fit to the red data points but not the blue. On the left, the model only had three knobs to turn, so it was not able to overfit the data and seems to capture the trends in both sets of data points equally well. However, on the right, the model was much more flexible, and as a result it fits the red points extremely well, but there is little evidence that all those extra "wiggles" actually improved the performance on the blue points at all.
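If you want to recreate something like this yourself, here is a minimal sketch (Python/NumPy, with simulated data; the polynomial degrees are just stand-ins for "few knobs" versus "many knobs"):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate noisy data from a simple underlying trend.
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# "Red" points: used for fitting. "Blue" points: held out.
fit_idx = rng.choice(x.size, size=20, replace=False)
hold_idx = np.setdiff1d(np.arange(x.size), fit_idx)

for degree in (3, 15):  # few knobs vs. many knobs
    coeffs = np.polyfit(x[fit_idx], y[fit_idx], deg=degree)
    pred = np.polyval(coeffs, x)
    fit_err = np.mean((pred[fit_idx] - y[fit_idx]) ** 2)
    hold_err = np.mean((pred[hold_idx] - y[hold_idx]) ** 2)
    print(f"degree {degree:2d}: error on fitted points = {fit_err:.3f}, "
          f"error on held-out points = {hold_err:.3f}")
```

The flexible fit will typically crush the error on the fitted points while doing no better, and often worse, on the held-out ones.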
To make this more concrete, let’s consider the context of modeling COVID-19 deaths. Suppose you were to divide the data into two sets, with one set (called the training data) containing the deaths from before May 1 and the other set (called the testing data) containing the deaths after May 1. If the goal is to fit the model, then you might start by tuning the knobs (parameters) of the model to try to get it to match the death totals in the training data. As you turn the knobs, you might keep track of the errors in the model’s predictions as compared to the observations in the training data. If you are turning the knobs in the right way, the total error will go down and down (the pre-May 1 predictions would get more and more accurate) until it bottoms out. At that point, your model is as good as it’s going to get; it has been fit to the data.
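In code, that split might look something like this (a minimal sketch in Python/pandas with made-up numbers, not real death data):

```python
import numpy as np
import pandas as pd

# Made-up daily death counts indexed by date (purely illustrative numbers).
dates = pd.date_range("2020-03-01", "2020-05-15", freq="D")
deaths = pd.Series(np.random.default_rng(1).poisson(lam=50, size=len(dates)), index=dates)

# Split by date: everything before May 1 is used to fit (train) the model,
# everything on or after May 1 is held back to test its predictions.
train = deaths[deaths.index < "2020-05-01"]
test = deaths[deaths.index >= "2020-05-01"]

print(f"training days: {len(train)}, testing days: {len(test)}")
```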
But what about the testing data? When you start tuning the parameters, the error on the testing data typically goes down as well, at least at first. In other words, even though you were not looking at the testing data when you were turning the knobs, those changes initially improve your predictions after May 1. This is good news! It suggests that your model is actually getting better. The trends that it is describing are meaningful because they generalize to new data! However, when you have a lot of knobs to tune, you will often find that at some point, before the error on the training data bottoms out, the error on the testing data starts to go up. In our example, this would mean that while the model is getting better at predicting deaths before May 1, it starts doing worse at predicting deaths after May 1. This happens because there are a lot of factors that influence variables like death totals, some of which are random. The model cannot possibly hope to accurately predict things that are completely random, and in trying to do so, it overreacts to the noise (the randomness) at the expense of the signal (the meaningful trend). If this happens, then your model is overfitting the training data, and you should probably stop tuning your parameters, as additional changes are likely to continue to reduce the accuracy of the model on new data.
The picture below illustrates this trade-off. You can think of the horizontal axis as representing the number of knobs the model has and/or your willingness to turn them. If you want your model to generalize well, you should tune the knobs until you reach the lowest error on the test data and NOT the training data, since that will lead to the model most likely to generalize to new and unseen data.
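If you want to trace out that trade-off curve numerically, here is a minimal sketch (Python/NumPy, simulated data; the polynomial degree stands in for the number of knobs):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated noisy data, split into training and testing sets.
x = np.linspace(0, 1, 60)
y = np.exp(2 * x) + rng.normal(scale=0.5, size=x.size)
x_train, y_train = x[::2], y[::2]   # every other point for training
x_test, y_test = x[1::2], y[1::2]   # the rest for testing

# More knobs (higher polynomial degree) keep helping on the training data,
# but past some point the testing error typically stops improving and can
# start creeping back up.
for degree in range(1, 12):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"{degree:2d} knobs: training error {train_err:.3f}, testing error {test_err:.3f}")
```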
In the context of epidemic forecasting, you don’t just want a model that explains the past; you want one that explains what is going to happen next! The model should do reasonably well on the past by design; however, the model will only be useful if it actually matches the future. It is far more important to get tomorrow’s prediction right than yesterday’s. So, when evaluating a model, it is entirely appropriate to look back at its past predictions to see if it did a good job of predicting the future. That brings us to the second principle.
Principle #2: The errors in a good model’s future predictions are random, not systematic. If a model consistently overestimates or underestimates the quantity it is trying to predict, then that is a sign that something is fundamentally wrong. In the short run, errors are unavoidable. If you overestimate R0 by a bit, your predictions are going to be too high, and if you underestimate it, they will be too low. However, the hope is that as new data comes in, modelers will retrain the model (i.e., update the parameters) to fix these issues and that in the long run the model will not display systematic biases toward overestimation or underestimation. When refining an epidemic model, scientists might use a process something like this:
1. Use the data from days 1 through N to make a prediction for day N+1.
2. Collect the data for day N+1.
3. Compute the error for the prediction.
4. Tune the knobs of the model to reduce that error, then repeat with the next day’s data.
If you find that the predictions are consistently too high or consistently too low, then that is a sign that the model may be fundamentally flawed (a simple check for this kind of systematic bias is sketched below).
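Here is a minimal sketch of that loop and of a bias check (Python; the "model" is just a naive 7-day average used as a stand-in, and the data are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up daily death counts (illustrative only).
observed = rng.poisson(lam=60, size=40).astype(float)

errors = []
for n in range(7, len(observed)):
    history = observed[:n]                 # data from days 1..N
    prediction = history[-7:].mean()       # stand-in model: 7-day average
    actual = observed[n]                   # collect the data for day N+1
    errors.append(prediction - actual)     # positive = overestimate

errors = np.array(errors)
share_over = np.mean(errors > 0)
print(f"mean error: {errors.mean():+.1f}")
print(f"fraction of days overestimated: {share_over:.0%}")
# A model without systematic bias should hover near 50%; consistently above
# or below that is a warning sign.
```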
Principle #3: A good model accounts for uncertainty, particularly about the distant future. When scientists create models, they must grapple with a great deal of uncertainty. The parameters are unknown. The data is imperfect. The underlying mechanisms for the system are not fully understood. In light of all this uncertainty, predictions about tomorrow are hard and predictions about the distant future are even harder. This challenge is particularly pronounced in epidemiology where the quantities of interest grow exponentially and the effects of small errors are magnified exponentially in the long run.
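To see how quickly small errors get magnified, here is a tiny numerical illustration (Python; the growth rates are made up, not calibrated to COVID-19):

```python
# Two estimates of a daily growth rate that differ by only half a percentage point.
true_rate = 1.05        # cases grow 5% per day (made-up number)
estimated_rate = 1.055  # our estimate is off by 0.5 percentage points

cases = 1000
for day in (1, 7, 30, 60):
    truth = cases * true_rate ** day
    forecast = cases * estimated_rate ** day
    print(f"day {day:2d}: truth {truth:9.0f}, forecast {forecast:9.0f}, "
          f"relative error {forecast / truth - 1:+.1%}")
```

A barely noticeable error in the growth rate turns into a forecast that is off by roughly a third two months out.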
So, what are modelers to do? These challenges are often addressed by building probabilistic models that describe a multitude of different scenarios (different transmission rates, recovery rates, etc.) that are all plausible based on the data available at the time. The results from simulating all of these different scenarios can then be aggregated to produce an average outcome. This is often what gets reported in the media. However, a better way to aggregate them is to describe a range of outcomes using what is sometimes called the “cone of uncertainty”. It is called a cone because the uncertainty grows over time. We know what happened in the past (for the most part), so the uncertainty about the past is relatively low. However, when we look to the future, any errors in the estimates of the parameters compound themselves, so the predictions will tend to diverge over time. As a result, the uncertainty in the predictions for a week from now is greater than the uncertainty in the predictions for tomorrow. If you see a discussion of a model that does not address this uncertainty in some way (by displaying error bands, for example), then that should immediately raise questions in your mind.
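Here is a minimal sketch of how such a cone can be produced (Python/NumPy; the scenario model and all the numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

n_scenarios = 5000
horizon = 28  # days ahead

# Each scenario draws a plausible daily growth rate (made-up range).
rates = rng.uniform(0.98, 1.06, size=n_scenarios)

cases_today = 1000
days = np.arange(1, horizon + 1)
# outcomes[i, t] = projected cases in scenario i on day t+1.
outcomes = cases_today * rates[:, None] ** days[None, :]

low, mid, high = np.percentile(outcomes, [5, 50, 95], axis=0)
for t in (1, 7, 14, 28):
    print(f"day {t:2d}: median {mid[t-1]:7.0f}, "
          f"90% band [{low[t-1]:7.0f}, {high[t-1]:7.0f}] "
          f"(width {high[t-1] - low[t-1]:7.0f})")
# The band widens with the forecast horizon: that widening is the cone.
```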
Conveying uncertainty is hard to do and it can make your predictions feel wishy-washy. Because of this, that uncertainty sometimes gets left out of press releases and media reports giving an illusion of certainty. That is one of the reasons why I always try to go to the original source when I read about models in the media.
Principle #4: A model’s “failed predictions” should generally lie within the cone of uncertainty. Models are never going to get things exactly right. We can hope that the average outcome predicted by the model will track with the observed outcomes, but due to uncertainty, we shouldn’t necessarily expect that to happen. However, we can expect that if a model is worth its salt, then the observed outcomes will lie within the cone of uncertainty. In other words, when the model tells you the range of outcomes it expects, which often corresponds to the middle 80%, 90%, or 95% of simulated outcomes, you should expect the actual outcome to lie within that range around 80%, 90%, or 95% of the time. If it doesn’t, then either something is wrong with the model itself or the modelers are underestimating their uncertainty. Either way, that is a red flag, and you should be suspicious of the model.
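Checking this after the fact is straightforward. A minimal sketch (Python/NumPy, with made-up intervals and outcomes):

```python
import numpy as np

# Made-up history of daily forecasts: each position holds the lower and upper
# bounds of a 90% prediction interval, plus the outcome observed later.
lower = np.array([40, 55, 60, 70, 80, 85, 90, 95, 100, 110])
upper = np.array([90, 105, 115, 130, 150, 160, 170, 180, 195, 210])
actual = np.array([75, 98, 120, 125, 140, 175, 165, 150, 190, 230])

inside = (actual >= lower) & (actual <= upper)
coverage = inside.mean()

print(f"observed outcomes fell inside the 90% band {coverage:.0%} of the time")
# For a well-calibrated model this should be close to 90%; a much lower
# number means the model is overconfident about its own uncertainty.
```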
Why I don’t trust the IHME model. The IHME model is arguably the most prominent COVID-19 model out there (https://covid19.healthdata.org/united-states-of-america). It has been cited by multiple governors and even the president as justification for policy decisions. This is a shame, because this model fails on all four counts. It has a track record of overfitting the past and failing to predict death totals even a few days into the future. In recent weeks it has underestimated the death totals day in and day out. They keep tweaking the model so that it better fits the past, but it continues to do a poor job of predicting the future. As a result, its long-term predictions have fluctuated dramatically (a common symptom of overfitting). It does report uncertainty, but the uncertainty is not a cone as it should be. They claim their uncertainty decreases over time! This is highly implausible and likely means that they are underestimating their uncertainty about the future. To make matters worse, the observed death totals have fallen outside their band of uncertainty far more often than they should. There are many critiques of this model out there. Here is just a sampling:
Unfortunately, some government officials are continuing to use the model! That is a problem, because bad models like this one, particularly when they are used to inform public policy, undermine the credibility of good ones. The IHME has revised its model recently to address some of these concerns. It now uses an SEIR model instead of the dubious statistical model it used before, but given its track record, I remain skeptical. Here are a few alternative models that have less questionable histories:
Columbia University: https://cuepi.shinyapps.io/COVID-19/
Northeastern University: https://covid19.gleamproject.org/
University of Texas: https://covid-19.tacc.utexas.edu/projections/
Los Alamos: https://covid-19.bsvgateway.org/
My favorite model right now is actually an independent model (https://covid19-projections.com/). It has a nice discussion of the shortcomings of the IHME model and a comparison of the performance of various prominent models (https://covid19-projections.com/about/). I don’t think that any of these models should be trusted blindly, but together they can give us a sense of where we might be headed. Hopefully, if we are willing to examine HOW models are wrong, we can better evaluate which ones are useful and then use them effectively for making more informed decisions.
Do you know a simple way to download the data from any of those models' projections?

I haven't tried to do that yet, but it looks like this GitHub repository (https://github.com/reichlab/covid19-forecast-hub) is aggregating the data from a variety of different models, so the answer should be in there somewhere.
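One rough sketch of how that might look (the folder and file names below are assumptions about the hub's layout, not verified, and will likely need adjusting to whatever actually exists in the repository):

```python
# Rough sketch: clone the forecast hub and load one forecast file with pandas.
#
#   git clone https://github.com/reichlab/covid19-forecast-hub.git
import pandas as pd

# Hypothetical path; replace with a real team/model folder and file from the repo.
path = "covid19-forecast-hub/data-processed/SOME-TEAM-MODEL/2020-05-11-SOME-TEAM-MODEL.csv"
forecasts = pd.read_csv(path)
print(forecasts.head())
```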