Mark J. Panaggio

Are the models actually wrong?

Updated: Aug 13, 2020

Disclaimer: The views expressed in this post are my own and do not represent those of my employer.


The title is a reference to a quote from George Box that I have cited repeatedly on this blog: "All models are wrong, but some are useful." The answer is, of course, that yes, the models are wrong; all models are. But they are not as wrong as some have claimed, and, as we will see, they are still quite useful.


This post is in response to a question from a reader. I am open to making this a regular occurrence, so if you have a question related to data, mathematical/statistical modeling, and current events, feel free to send an email to panaggio [at] u.northwestern.edu.

I have really enjoyed your well-written and researched blogs. I had an idea for another one: I thought it might be interesting to revisit some of the models from the start of the pandemic to see how we are doing. - Jeff, Florida

Thanks, Jeff. That's an interesting question. The CDC has been collecting predictions from a variety of different models since early on in the pandemic. Most of the models come from groups at research universities, but a handful of individuals submit their predictions as well. After receiving the question, I did a little digging and it looks like they make the data relatively easy to download. So, with a little web-scraping (thanks BeautifulSoup!) I was able to download all of the predictions dating back to late April/early May.
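For the curious, here is a minimal sketch of the kind of scraping involved. The listing URL, page structure, and CSV links are placeholders, not the actual layout of the CDC's forecast pages.

```python
# Minimal scraping sketch. The URL and link structure are placeholders;
# the real forecast pages may be organized differently.
import requests
import pandas as pd
from bs4 import BeautifulSoup

INDEX_URL = "https://example.com/covid-forecasts/"  # hypothetical listing page

html = requests.get(INDEX_URL).text
soup = BeautifulSoup(html, "html.parser")

# Collect links to the weekly forecast files (assumed to be CSVs).
csv_links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].endswith(".csv")]

# Download each file and stack them into one big table of predictions.
predictions = pd.concat([pd.read_csv(link) for link in csv_links],
                        ignore_index=True)
print(predictions.shape)
```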


In all, this included over 75,000 rows of data, so I decided to focus just on national models and left out those that made state-specific predictions. That left a total of 60 different models. Of those, only 12 made predictions in 9 or more of the 11 weeks. So, I ignored the rest under the assumption that the models that had been submitted consistently were the ones with the most resources and effort behind them and were therefore likely to be the most reliable.
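The filtering step might look something like this sketch; the file name and the column names ("location", "model", "week") are assumptions about the downloaded data, not the real schema.

```python
import pandas as pd

# Hypothetical combined file produced by the scraping step.
predictions = pd.read_csv("all_predictions.csv")

# Keep national forecasts only.
national = predictions[predictions["location"] == "US"]

# Keep models that submitted forecasts in at least 9 of the 11 weeks.
weeks_per_model = national.groupby("model")["week"].nunique()
consistent = weeks_per_model[weeks_per_model >= 9].index
national = national[national["model"].isin(consistent)]

print(f"{len(consistent)} models kept, {len(national)} rows remaining")
```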


So, that left me with 12 models including a few familiar ones:

This includes 10 models from different universities (including the infamous IHME model which I have complained about before), one ensemble model (a composite of multiple models), and one model from a hobbyist who has no affiliation to a research lab (impressive!).


Each model includes predictions for the cumulative death totals 1 week out, 2 weeks out, 3 weeks out, and 4 weeks out. A couple of models submitted predictions further out than that, but not enough to do much analysis with. They also included a 95% confidence interval for their predictions, which represents a range of values that they think would include the true value about 95% of the time. This is a way of measuring the model's uncertainty.


If you recall, back in May I wrote about four criteria that you should look for in a good model.

  1. A good model doesn’t just match old data; it generalizes to new data.

  2. A good model accounts for uncertainty, particularly about the distant future.

  3. A model’s “failed predictions” should generally lie within the cone of uncertainty.

  4. The errors in future predictions for a good model are random, not systematic.

Let's take a look at how they did.


Criterion #1: A good model doesn’t just match old data; it generalizes to new data.

The good news is that this is a prediction problem, so none of the models knew the correct answer before making their predictions. This means that we can partially judge whether the model generalizes to new data by looking at how far the predicted death totals were from the actual death totals.


It is worth noting that some of these models are used to make policy decisions. So if a model predicts a high death total in a particular area and local government/health officials intervene to counteract this (by altering guidelines, closing businesses, instituting mask mandates, etc.), then the very act of making the prediction can cause the prediction not to pan out. Unfortunately, we have no way of examining the alternate reality where the models don't influence behavior. So, counterintuitively, the most influential models may actually fare worse in these sorts of comparisons. Conversely, the models that fare best in this comparison are not necessarily the most useful ones; they are simply the ones whose assumptions are most consistent with our behavior over the period of interest (which can be influenced by the models themselves).


I used a quantity known as the mean absolute error (MAE) to measure the error in the predicted death totals. The MAE takes the difference between each predicted death total and the corresponding actual death total, takes the absolute value to make it positive, and then averages those absolute differences across all of the predictions. The results are plotted in the bar chart below.
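In code, the calculation boils down to something like this (made-up numbers, not any model's actual predictions):

```python
import numpy as np

def mean_absolute_error(predicted, actual):
    """Average of the absolute differences between predictions and observations."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean(np.abs(predicted - actual))

# Toy example: errors of 1500, 1000, and 2000 average out to 1500.
print(mean_absolute_error([125000, 131000, 138000],
                          [126500, 130000, 140000]))  # 1500.0
```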



Another way to look at this is to compute the percent error in the predicted number of deaths. Since the total number of deaths grows every week while the number of new deaths does not, the percent errors in the cumulative death totals seem to get smaller and smaller over time. However, if you focus instead on the number of new deaths reported over each forecast interval, you get a clearer picture. I will post this data in tabular format at the end of the post.
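Here is a sketch of that calculation on new deaths, with made-up numbers; subtracting the cumulative total at the time of the forecast (the "baseline") to get new deaths is my framing, not a field in the CDC files.

```python
def percent_error_new_deaths(pred_cumulative, actual_cumulative, baseline):
    """Percent error in new deaths over the forecast window.

    `baseline` is the cumulative death total when the forecast was made,
    so subtracting it converts cumulative totals into new deaths.
    """
    pred_new = pred_cumulative - baseline
    actual_new = actual_cumulative - baseline
    return 100 * (pred_new - actual_new) / actual_new

# Toy example: forecast made when there were 120,000 cumulative deaths.
print(percent_error_new_deaths(131000, 130000, 120000))  # 10.0
```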


Here we see a couple of standouts. The UMass-MB model looks like the cream of the crop, followed by the LANL and MOBS models. The IHME model doesn't look bad either (this dataset started around the time they made a major overhaul that resulted in some substantial improvements). In general, the models seem to be able to predict the number of deaths a week out to within one thousand and a month out to within a couple thousand, and, as expected, the error grows for longer-term predictions. Given that we have been averaging over 5,000 deaths per week (roughly 20,000 deaths over four weeks), an error of a couple thousand puts the top-performing models within about 10% on their estimates of the number of deaths that will be reported 4 weeks from now. I think that is pretty impressive. Keep in mind that all of the models examined here are already among the best available, so there is no shame in finishing at the bottom of this ranking, and that may even be evidence that those models are victims of their own success in terms of influencing behavior changes.


Criteria #2 and #3: A good model accounts for uncertainty, particularly about the distant future. A model’s “failed predictions” should generally lie within the cone of uncertainty.

All of the models included uncertainty estimates in the form of a confidence interval, so one way to check whether a model accurately described its uncertainty is to see how often (i.e. what percentage of the time) the true death total landed inside of its interval; a capture rate of about 95% corresponds to a well-calibrated uncertainty estimate. In case this jargon is confusing, let's think about an example.


I like to imagine the process of constructing a confidence interval as a ring toss. You don't have to hit the peg exactly, but you need to get the ring somewhere around it. The bigger the ring, the easier it is to land it on the peg, but if it is too big then the game becomes kind of pointless. The same applies to a confidence interval: if your interval is too big, then it's easy to capture the true value, but your predictions aren't very informative. Ideally, you want the smallest possible interval that captures the right answer consistently (95% of the time).


Suppose model X predicted that within 4 weeks there would be a total of 125,000 deaths. It would of course be unreasonable to expect exactly 125,000 deaths, but how close is close enough? Ideally, we want to quantify what range of values is plausible. When these models are created, they typically involve some sort of randomized simulation, so they produce a range of outcomes. If the modelers ran their simulation 1000 times and found that 2.5% of the time they got fewer than 120,000 deaths and 2.5% of the time they got more than 132,000 deaths, then the middle 95% of the outcomes would fall between 120,000 and 132,000, and that would be the confidence interval. One would hope that, most of the time, the true value would lie somewhere in that range.
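As a concrete illustration of that recipe (with randomly generated stand-in outcomes, not any model's actual simulation output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 1000 simulated 4-week death totals (made-up distribution).
simulated_totals = rng.normal(loc=126_000, scale=3_000, size=1000)

# The middle 95% of the simulated outcomes forms the confidence interval.
lower, upper = np.percentile(simulated_totals, [2.5, 97.5])
print(f"95% interval: {lower:,.0f} to {upper:,.0f}")
```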


I plotted the percentage of the time each model captured the observed death totals in the bar chart below. Note: there is a typo in the plot; the y-axis is a proportion, not a percentage. Multiply by 100 to get a percentage.
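The calculation behind that chart amounts to checking, for each forecast, whether the observed total fell between the interval's endpoints. A minimal sketch with toy numbers:

```python
import numpy as np

def coverage_rate(lower, upper, actual):
    """Proportion of forecasts whose interval contained the observed value."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return ((actual >= lower) & (actual <= upper)).mean()

# Toy example: 3 of the 4 intervals capture the observed value -> 0.75
print(coverage_rate([120, 125, 131, 140],
                    [130, 135, 141, 150],
                    [128, 137, 135, 149]))
```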


Notice that UMass-MB wins here, with a confidence interval that captured the true value 100% of the time (if anything, their confidence intervals might be too big). They are followed by YYG, MOBS, LANL, and the Ensemble model, which only failed to capture the true value once. IHME (still arguably the most popular model among media types) doesn't fare as well here, failing to capture the true value 1 week out over half of the time.


Another way to look at this is to plot the progression of predictions and observed values over time. I generated one plot for each model. Feel free to scroll through them (the legend has the model name). I marked points where the interval did not capture the true value in red.


The nice thing about this type of plot is that it makes it easier to see how confident the models were in their predictions. You can see that the best models tended to be a bit more conservative by using wider confidence intervals. So interestingly, the more accurate models were less confident and the less accurate models were more confident in their predictions. I think there might be a lesson in there somewhere.


Criterion #4: The errors in future predictions for a good model are random, not systematic.


If a model consistently overestimates or underestimates the thing it is trying to predict, that is a problem. It means the model has a fundamental flaw. You can evaluate this by looking at the errors (observed minus predicted) over time to see if they are consistently positive (underestimates) or negative (overestimates). I did this for each of the models above, again coloring in red the points where the confidence interval did not include the true value (zero error).
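That check boils down to computing the signed errors and looking at whether they hover around zero. A minimal sketch with made-up numbers:

```python
import numpy as np

def signed_errors(actual, predicted):
    """Observed minus predicted: positive = underestimate, negative = overestimate."""
    return np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)

# Toy example: errors that are mostly positive (and a positive average)
# would suggest the model systematically underestimates deaths.
errors = signed_errors([130000, 136000, 142000], [128000, 137000, 139000])
print(errors)         # [ 2000. -1000.  3000.]
print(errors.mean())  # ~1333.3, i.e. an underestimate on average
```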


Here you can see that the models do seem to be a bit biased. They seem more likely to overestimate than underestimate the number of deaths. It looks like the LANL model is the most balanced in this regard.


Summary

So, all in all, I think the models are doing quite well. Given the complexity of the task (predicting the number of deaths on a national scale despite uncertainty about mortality rates, case counts, how people will behave, and which policies will be in place), I think the fact that the models are able to consistently predict the number of new deaths a month out to within 20% is pretty impressive. And they seem to be getting better over time. A handful of models (although not the ones that get most of the media attention), specifically UMass-MB, MOBS, LANL, and YYG, are particularly well-calibrated and have relatively low errors. In aggregate, they suggest that by late August we should be somewhere around 170,000 deaths and that the death rate is likely to accelerate over that time.



What about my models?

Now you may be wondering: enough about other people's predictions, what about your predictions? Well, I am not a big fan of making predictions. This blog has always been more about helping people understand models than about creating my own. When I have created my own models, they have tended to be qualitative rather than quantitative. In other words, I was more interested in finding simple models that are easy to explain and can describe the trends than in complicated models that are hard to understand but make more accurate predictions. That said, there are a couple of things that my models have predicted.


On April 28, I wrote about what epidemic models suggested might happen if we “reopened the economy”. My model suggested that when people returned to normal, the number of infections would start to rise again surpassing the pre-lockdown peak. Here is a plot from that post (note the vertical axis refers to cases in a small town of 50,000 people).
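To give a flavor of that behavior, here is a generic SIR sketch with a temporary lockdown; it is not the model from the April 28 post, and every parameter value is an illustrative assumption.

```python
# Generic SIR model with a temporary lockdown: the transmission rate drops
# during the lockdown window and returns to normal afterward, producing a
# second wave larger than the pre-lockdown peak. All parameters are
# illustrative assumptions, not fitted values.
import numpy as np

N = 50_000                                 # small-town population, as in the plot
beta_normal, beta_lockdown = 0.30, 0.08    # assumed transmission rates
gamma = 0.10                               # assumed recovery rate (1/10 days)
lockdown = (30, 90)                        # assumed lockdown window (days)

S, I, R = N - 10.0, 10.0, 0.0
infected = []
for day in range(300):
    beta = beta_lockdown if lockdown[0] <= day < lockdown[1] else beta_normal
    new_infections = beta * S * I / N
    recoveries = gamma * I
    S, I, R = S - new_infections, I + new_infections - recoveries, R + recoveries
    infected.append(I)

print(f"Peak infections: {max(infected):,.0f} on day {int(np.argmax(infected))}")
```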

Cards on the table, this was not an original idea. Many others have made the same point, including another infamous model from Imperial College London dating back to mid-March (see figure 3), the one that included a scenario with 2.2 million deaths (which they called unlikely) as well as a number of other scenarios (closer to what actually happened) with between 20k and 500k US deaths (they modeled the UK, but predicted the US would have about 4x as many). In that paper, they suggested that there would be multiple waves and that intermittent lockdowns would be needed to avoid overwhelming hospitals.

How did that turn out? Here is a plot with national data from the CDC. It has a similar shape to the blue/magenta curves, but the peak is not quite as sharp. This can be explained by the fact that policies and behaviors have not been uniform around the country. If you focus on some of the states that have been more aggressive about reopening (Florida, Arizona, Georgia, Texas), you get something that looks awfully familiar. So, although the precise timescale does not match (nor did I attempt to make it match), the trends seem quite consistent. I am calling that a win for my predictions and a loss for mankind.


On July 18, when cases were rising but we were still seeing fewer than 800 deaths per day, I presented a model that suggested that over the next 26 days daily deaths would rise to somewhere between 1000 and 3000 (depending on which variables you included out of cases, positive test rate, and total tests).
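For illustration only, a lagged regression of daily deaths on earlier case counts and positive test rates might look like the sketch below. The lag length, column names, and data source are assumptions, and this is not the actual model from the July 18 post.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_lagged_model(df, lag_days=26):
    """Fit new_deaths ~ new_cases + positive_test_rate, lagged by `lag_days`."""
    features = df[["new_cases", "positive_test_rate"]].shift(lag_days)
    data = pd.concat([features, df["new_deaths"]], axis=1).dropna()
    model = LinearRegression()
    model.fit(data[["new_cases", "positive_test_rate"]], data["new_deaths"])
    return model

# Usage sketch (assumes a daily time series with these column names):
# df = pd.read_csv("daily_us_data.csv")
# model = fit_lagged_model(df)
# projected = model.predict(df[["new_cases", "positive_test_rate"]].tail(26))
```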


Here we are a couple of weeks later, and the 7-day average has surpassed 1100 deaths per day and is still rising (see the magenta dots above). It looks like the total is going to fall a bit short of the 1600 on the plot I displayed, but it is on track to be well within the interval. However, if you include just total tests and the positive test rate in the model, it looks a lot better, as shown below.

So, I will let you decide if that deserves partial credit.


Conclusions

I do not claim to be a great prognosticator. Besides, these were models built in an hour or two and were not extensively tested; they were more about illustrating concepts than making precise forecasts. So, if you want reliable predictions, you are probably better off looking to the models I discussed at the beginning of the post.


However, I do hope you come away from this post with the recognition that some aspects of pandemics are actually quite predictable. We should not be surprised that cases are high right now (after most of the country decided to pretend that the pandemic was over) or that deaths are rising (when cases and hospitalizations have been spiking for weeks). This was quite predictable.


Do you know what else is predictable? Contrary to what some claim, this is not going to go away in November because of an election. The virus doesn't care whether Donald Trump or Joe Biden is president. Until we can completely eliminate it from our shores by isolating the sick until it dies out (a long shot) or acquire herd immunity (preferably through vaccination rather than infection), this threat is not going away. A relatively small percentage of the population (15% or less) has been infected, and we are still not sure whether lasting immunity is even on the table. If immunity is not long-lasting and vaccinations are ineffective, then there is a possibility that this virus will become endemic and will remain a part of life (without the rapid spikes) going forward. I'm still holding out hope that won't happen.


If we do everything right over the next couple of months, then by November we might get back to where many countries in Europe are today with a modest number of cases, relaxed restrictions, but with a need for continued vigilance. However, if we continue to operate as we have been, living in a fantasy world where we pretend this is over, then those death totals are going to continue to climb.


There are still a lot of things we don't know.

  1. How will school reopenings affect transmission?

  2. How will people respond as case numbers go up and down?

  3. How will spending more time indoors during the cold winter months affect things?

  4. What happens when we are dealing with flu season at the same time?

It is too early to know the answers to those questions because we have very little data that can help us. The good news is that as we collect more data, there will continue to be lots of experts working very hard to crunch the numbers, build models, and give us insight into what has happened in the past and what might happen in the future. Contrary to the narrative in some circles, their models have actually performed relatively well thus far, and they suggest that things like wearing masks and continued social distancing could have a big impact on how the next few months unfold.


P.S. Here is the percent error for the predicted number of new deaths, as promised:


Note: This post was updated on 8/12 to reflect the most recent batch of predictions as well as to clarify a couple of points that were confusing.
