Last week, the New York Times published an excellent series of visualizations tracking the COVID-19 outbreak in rural areas across the United States. At the very end of the article, however, they added a misleading table of the counties with the highest average daily case rates. The intent was to show that the most recent cases are hitting rural areas harder than urban and suburban areas of the country, and they did so by highlighting the counties with less than 10,000 people.
In reaction, Professor Scott E. Page of the University of Michigan tweeted:
Duh? Of course the counties with the highest per capita outbreaks have small populations. So do the counties with the lowest per capita rates. This is basic statistics! @nytimes pic.twitter.com/QJIlyZJ5ay
— Scott Page (@Scott_E_Page) October 25, 2020
Professor Page is exactly correct. Unfortunately, while his ire is appropriate when directed at the New York Times - they should know better - a more detailed explanation is warranted for the underlying statistics.
But first, a disclaimer: I am merely demonstrating a property of statistics using COVID-19 data as a background. The spread of COVID-19 is far from random, and cases should never be treated as a simplistic probability distribution.
A Simulation
The U.S. Census Bureau maintains an estimate of county populations, most recently for 2019. The 2020 data will no longer be an estimate, as the census was taken this year, but the data is still being tabulated.
Let’s take a look at this dataset using pandas:
import pandas
df = pandas.read_csv('co-est2019-alldata.csv', encoding='ISO-8859-1')
df.shape
# (3193, 164)
counties = df[df['SUMLEV'] == 50]
counties = counties.filter(['CTYNAME', 'STNAME', 'POPESTIMATE2019'])
counties.rename(columns={"CTYNAME": "County", "STNAME": "State", "POPESTIMATE2019": "Population"}, inplace=True)
counties.set_index(['County', 'State'], inplace=True)
counties.shape
# (3142, 1)
The dataset contains a lot of information, including population estimates for prior years and state-level aggregations. Narrowing the dataset down, we can determine some basic statistics:
- 3,142 total counties
- Total population of 328,239,523
- The least populated is Kalawao County, Hawaii, with 86 people
- The most populated is Los Angeles County, California, with 10,039,107 people
- The counties have a mean population of 104,468 and median of 25,726 indicating a heavy right-hand skew
- There are 718 counties - about 23% - with less than 10,000 people
Shown visually in a histogram with a bar for each 10,000 people:
When this was written, there had been 468,793 new COVID-19 cases in the United States in the past week, which is about 0.1428% of the total population. That is a daily average of 66,970 - or 2.04 daily cases per 10,000 people.
For our simulation, we’ll distribute these cases randomly throughout the United States. Once again: this is not how COVID-19 spreads; this is to simply illustrate a property of statistics.
import random
sample = random.sample(range(328239523), 468793)
We can then create bins that represent each county:
by_population = counties.sort_values(by=['Population'])
last = 0
intervals = []
for value in by_population['Population'].values:
intervals.append((last, last+value))
last += value
bins = pandas.IntervalIndex.from_tuples(intervals)
And place the sample set into those bins, determining an average daily case rate per 10,000 people:
out = pandas.cut(sample, bins, right=False)
by_population['Weekly Cases'] = out.value_counts().values
by_population['Avg Daily/10k'] = by_population['Weekly Cases'] / by_population['Population'] * 10000 / 7.0
by_population.sort_values(by=['Avg Daily/10k'], ascending=False, inplace=True)
Here’s the 20 counties with the highest case rates from our simulation - with full names redacted so no one confuses this for real data. Notice anything interesting?
County | State | Population | Weekly Cases | Avg Daily/10k |
---|---|---|---|---|
L* County | Nebraska | 664 | 4 | 8.61 |
L* County | Texas | 169 | 1 | 8.45 |
S* Municipality | Alaska | 1,183 | 6 | 7.24 |
H* County | New Mexico | 625 | 3 | 6.86 |
H* County | Nebraska | 682 | 3 | 6.28 |
G* County | Texas | 1,409 | 6 | 6.08 |
K* County | Texas | 762 | 3 | 5.62 |
P* County | Montana | 1,077 | 4 | 5.31 |
G* County | Washington | 2,225 | 8 | 5.14 |
A* County | California | 1,129 | 4 | 5.06 |
Q* County | Georgia | 2,299 | 8 | 4.97 |
L* County | Texas | 3,233 | 11 | 4.86 |
C* County | Montana | 1,252 | 4 | 4.56 |
P* County | West Virginia | 8,247 | 26 | 4.50 |
S* County | South Dakota | 6,376 | 20 | 4.48 |
W* County | North Dakota | 3,834 | 12 | 4.47 |
F* County | Nebraska | 2,979 | 9 | 4.32 |
P* County | West Virginia | 6,969 | 20 | 4.10 |
C* County | Oklahoma | 2,137 | 6 | 4.01 |
F* County | North Dakota | 3,210 | 9 | 4.01 |
Shown visually, with counties that have less than 10,000 people in orange:
Note that the striations are caused by case counts being discrete and not continuous - you can’t have half a case!
So, our simulation shows that smaller population counties have both the highest and lowest case rates, proving Professor Page’s point.
But Why?
Let’s look at the county that had the highest case rate in our simulation. With a population of just 664, in order to have greater than the national average of of 2.04 daily cases per 10,000 people, the county would only need to have a single case the entire week! We can determine the chance of that occurring in our simulation using a binomial distribution given that there was a 0.1428% of any resident having a case in the preceding week:
from scipy.stats import binom
p = binom(664, 0.001428)
binom(664, 0.001428).pmf(1)
# 0.3676444783149298
So there was about a 36.8% percent chance of the county having one new case during the week. Once again, this is possible because our cases are randomly determined - in reality, cases are far from random.
But that’s the probability of exactly one case. We can find the probability of one or more cases with:
1 - binom(664, 0.001428).pmf(0)
# 0.6128215783303926
Which increases the chance that this county would have higher than the national average of cases to 61.3%.
And the chance of 4 or more cases?
rv = binom(664, 0.001428)
1 - (rv.pmf(0) + rv.pmf(1) + rv.pmf(2) + rv.pmf(3))
# 0.01589406080799649
So there was a 1.6% chance of this county having more than 4 times the national average!
To represent all these likelihoods visually:
To have that same rate, the largest county in America, Los Angeles, would have to have 60,476 cases or more in a week. And what is the probability of that occurring if the cases are random? Once again we can use the binomial distribution:
rv = binom(10039107, 0.001428)
1 - sum([rv.pmf(x) for x in range(0, 60477)])
# 1.2370320878751784e-08
Or about 0.000001237%. In actuality, Los Angeles had 10,801 cases over the previous week, which was below the 14,335 we’d expect if the county had the national case rate.
The more sample points we have, the easier it is to determine if a conclusion is significant. And that’s what the New York Times forgot: if you’re going to compare small populations to large populations, then naturally your small populations will be more varied to the extremes - both high and low!
Outbreak Likelihoods
What if we can use these likelihoods to re-examine the original question: are recent COVID-19 cases hitting rural areas harder? Thankfully, the New York Times does an excellent job in regards to data provenance, and uploaded the case counts used in the original article to a GitHub repository.
Let’s create a table with the counties that have the least likely case rates above the national average:
County | State | Weekly Cases | Population | Likelihood | Avg Daily/10k |
---|---|---|---|---|---|
Norton | Kansas | 279 | 5,361 | ~0% | 74.35 |
Grand Forks | North Dakota | 699 | 69,451 | ~0% | 14.38 |
Burleigh | North Dakota | 925 | 95,626 | ~0% | 13.82 |
Dodge | Wisconsin | 836 | 87,839 | ~0% | 13.60 |
Sheboygan | Wisconsin | 1,027 | 115,340 | ~0% | 12.72 |
Marathon | Wisconsin | 1,125 | 135,692 | ~0% | 11.84 |
Winnebago | Wisconsin | 1,339 | 171,907 | ~0% | 11.13 |
El Paso | Texas | 6,494 | 839,238 | ~0% | 11.05 |
Outagamie | Wisconsin | 1,451 | 187,885 | ~0% | 11.03 |
Fond du Lac | Wisconsin | 783 | 103,403 | ~0% | 10.82 |
Minnehaha | South Dakota | 1,312 | 193,134 | ~0% | 9.70 |
Mobile | Alabama | 2,632 | 413,210 | ~0% | 9.10 |
Brown | Wisconsin | 1,626 | 264,542 | ~0% | 8.78 |
Cass | North Dakota | 1,075 | 181,923 | ~0% | 8.44 |
Lubbock | Texas | 1,510 | 310,569 | ~0% | 6.95 |
Waukesha | Wisconsin | 1,789 | 404,198 | ~0% | 6.32 |
Milwaukee | Wisconsin | 4,145 | 945,726 | ~0% | 6.26 |
Utah | Utah | 2,394 | 636,235 | ~0% | 5.38 |
Salt Lake | Utah | 4,078 | 1,160,437 | ~0% | 5.02 |
Cook | Illinois | 11,597 | 5,150,233 | ~0% | 3.22 |
This table is not a ranking by likelihood, since all these counties have likelihoods approaching zero! These zero values are a strong indication that the random distribution we’ve chosen for our model is completely inappropriate. We’ve also run into an unexpected limit: the probabilities are so small that they’ve reached the minimum supported float value of our computer.
We can visualize these likelihoods for all counties, with those with less than 10,000 people once again in orange:
In the above visualization, the counties with case rates near the national average will be in the center of the y-axis. Counties with rates higher than expected will be towards the top of the y-axis, and counties with lower rates towards the bottom.
We could also re-examine what it means for a county to be rural. Instead of using a population limit, we can use the definition of the Federal Office of Rural Health Policy: a rural county is any county that is not part of a metropolitan area.
We can create a new visualization, with our newly defined rural counties still in orange:
It’s difficult to draw any conclusion from this visualization. While there appears to be more rural counties with likelihoods above the national average and urban counties below it, a more rigorous numerical analysis is needed. If anything, we’ve just introduced new biases by using a naive model.
Perhaps with a more appropriate probability distribution these results could have more meaning, but our analysis will always suffer from the original mistake committed by the New York Times: in order to make definitive conclusions about the spread of COVID-19 we need more sample points, and in the case of a county-by-county analysis the smaller population counties will always have less. Forgetting this underlying property of statistics led them to publish a misleading table of data.