Coronavirus: We Need Better Data Hygiene
I want to share some thoughts on two related issues that have been bothering me since the coronavirus outbreak started: (i) how poor data hygiene is muddling our thinking; and (ii) the irresponsible and harmful behavior of most media outlets.
We live in a data-driven society; whenever something big happens, we look for data to glean insights and take decisions.
In the case of the coronavirus, the data so far are simply terrible, and it is imperative that we recognize this, and recognize the need to handle them with care:
The number of confirmed coronavirus cases depends on how many tests you perform, and on whom you test.
The press splashes the number of new daily cases on the front page and sells it as a measure of contagion. If the number of new cases jumps from 1,000 to 2,000, the headline proclaims that contagion is moving at an exponential rate. But if the number of tests has also doubled, say from 10,000 to 20,000, the “positivity rate” (share of tests that turn out positive) is still the same. Maybe contagion is accelerating, but you can’t tell from these numbers. In fact, I would expect accelerating contagion to result in a rising positivity rate: the share of infected people in a same-sized sample should be higher today than yesterday. The chart below plots tests and new cases for Italy; they move very closely together. The positivity rate is high, and we need to ask why (see below), but it is not increasing. We find more cases because we test more people.
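To make the arithmetic concrete, here is a minimal Python sketch using the round numbers from the example above (1,000 cases on 10,000 tests, then 2,000 cases on 20,000 tests). The function name `positivity_rate` is just an illustrative label, not part of any official reporting standard.

```python
def positivity_rate(new_cases: int, tests: int) -> float:
    """Share of tests that come back positive."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return new_cases / tests


# Round numbers from the example in the text: cases and tests both double.
yesterday = positivity_rate(1_000, 10_000)
today = positivity_rate(2_000, 20_000)

# New cases doubled, yet the share of positive tests did not move.
print(f"yesterday: {yesterday:.1%}, today: {today:.1%}")
```

The headline number doubles while the positivity rate stays put, which is why the count of new cases alone cannot tell us whether contagion is accelerating.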
This cuts both ways: the Coronavirus monitor of the Italian Sole 24 Ore has argued that when we see the daily percentage increase in new cases drop below 10% it will be a sign that contagion is being contained. But as the stock of confirmed cases—the denominator—rises, eventually the daily percent increase will fall simply because your daily testing sample is limited. Take an extreme case: contagion is spreading fast to the entire population; you test 10,000 people a day, and get 10,000 new cases every day. Starting on day twelve, your 10,000 new daily cases work out to less than 10% of the stock (110,000 on day eleven), and the daily percentage will keep falling even if the entire population is getting infected.
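The extreme case above can be checked with a few lines of Python. This is only a sketch of that thought experiment: a fixed 10,000 tests a day, every one of them positive, and the daily increase measured against the previous day's stock of confirmed cases. The function name `first_day_below` is an illustrative choice.

```python
def first_day_below(daily_new: int, threshold: float) -> int:
    """First day on which new cases / previous day's stock falls below threshold."""
    stock = daily_new  # confirmed cases at the end of day 1
    day = 1
    while True:
        day += 1
        growth = daily_new / stock  # today's new cases relative to yesterday's stock
        if growth < threshold:
            return day
        stock += daily_new  # add today's cases to the running total


# 10,000 new cases every single day, 10% containment threshold.
print(first_day_below(10_000, 0.10))
```

On day eleven the ratio is exactly 10% (10,000 over 100,000), so the first day strictly below the threshold is day twelve, even though in this scenario nothing is being contained at all.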
In most countries the number of tests gives us only a very small statistical sample. Health authorities have warned that covid-19 is more contagious than the flu. The flu on average infects 10% of the population across Europe and the US. This suggests that over 35 million people might already be infected in the US, and even more in Europe, with over 6 million just in Italy. If that’s the case, even 200,000 to 300,000 tests give you a very small sample.
Can the number of already infected people be that large? We don’t know. But China admitted that the first covid case might have occurred last November. It seems likely that the virus had arrived in both the US and Europe by December. The December to mid-March period is the height of flu season, so it should have spread widely. The virus does not arrive in your country the day you start testing, as press reports seem to imply.
The sample is not only small, but biased. We tend to test mostly people showing symptoms, and often severe symptoms, which inflates the positivity rate as well as the mortality rate. This has happened with previous epidemics and is normal, but we need to keep in mind that the mortality rate we are measuring now is not the “true” mortality rate. It will come down as we test more.
I keep seeing charts comparing the rise in the absolute number of new cases in different countries. A medical doctor (MD) who doubles as a CNN anchor tweeted one such chart (from the FT), argued that the absolute number of cases is rising faster in the US than in Italy or any other country, and called it “truly terrifying”. There is so much wrong with all this. Look at the chart:
The chart shows the US having reached about 15,000 confirmed cases at a point in the supposed contagion timeline when Italy had recorded only 7-8,000. Surely this means contagion is moving twice as fast in the US as in Italy, and Italy is the worst hit! Terrifying, right? Except that the number of cases doesn’t tell me much if I don’t know how many people you have tested. Here is a different chart:
This gives us a very different picture: for a given number of tests, the number of cases in Italy is considerably higher than in the US, and the gap is widening. I have skin in the game in both countries, and I worry a lot more about Italy based on these numbers.
Different countries have used different selection criteria for testing people, resulting in some interesting outcomes. Look at the next chart: the share of infected people in their twenties is 27% in Korea and zero in Italy.
How can that be? It’s clearly implausible that Italians in their twenties have all escaped contagion. It must be the case that Korea tested a much larger number of young people. This is not a criticism or praise of either country. Korea was able to ramp up the number of tests quickly while Italy contended with a large outbreak in Lombardy. But since the mortality rate for people in their twenties is zero in both countries (see next chart), this immediately skews the overall mortality rate, which is currently 9% in Italy and just 1% in Korea.
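The composition effect described above can be sketched in Python. All the numbers here are hypothetical and chosen only for illustration: two countries share exactly the same age-specific mortality rates, but one tests broadly (so most confirmed cases are younger people) and the other mostly tests older, sicker patients. The names `FATALITY_BY_AGE`, `broad_testing` and `narrow_testing` are assumptions of this sketch, not data from either country.

```python
from typing import Mapping

# Hypothetical age-specific mortality rates, identical in both countries.
FATALITY_BY_AGE = {"under_60": 0.002, "60_plus": 0.10}


def overall_fatality(case_shares: Mapping[str, float]) -> float:
    """Case-weighted average of the age-specific mortality rates."""
    if abs(sum(case_shares.values()) - 1.0) > 1e-9:
        raise ValueError("case shares must sum to 1")
    return sum(share * FATALITY_BY_AGE[age] for age, share in case_shares.items())


# Hypothetical age mix of confirmed cases under two testing policies.
broad_testing = {"under_60": 0.80, "60_plus": 0.20}   # many young people tested
narrow_testing = {"under_60": 0.30, "60_plus": 0.70}  # mostly older patients tested

print(f"broad testing:  {overall_fatality(broad_testing):.1%}")
print(f"narrow testing: {overall_fatality(narrow_testing):.1%}")
```

With these made-up inputs the measured overall mortality rate is more than three times higher under narrow testing, although the risk faced by any individual of a given age is identical in the two cases.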
A more important issue is the fact that mortality rates for people over the age of 60 are more than twice as high in Italy as in Korea.
An additional caveat: Countries have been using different tests; tests are fallible, and I have not seen any transparent information on the likely percentage of false positives and false negatives: how often a test mistakes a regular flu for covid-19, and how often it does not recognize a covid-19 that’s staring it in the face.
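To see why error rates matter, here is a rough Python sketch of how false positives and false negatives distort the measured positivity rate. Every number in it is hypothetical: I have no reliable figures for the tests actually in use, so the sensitivity, specificity and share of truly infected people among those tested are pure assumptions for illustration.

```python
def expected_results(tested: int, infected_share: float,
                     sensitivity: float, specificity: float) -> dict:
    """Expected test outcomes for a group of tested people.

    sensitivity: probability that an infected person tests positive
    specificity: probability that a non-infected person tests negative
    """
    infected = tested * infected_share
    healthy = tested - infected
    true_pos = infected * sensitivity
    false_neg = infected - true_pos          # infections the test misses
    false_pos = healthy * (1 - specificity)  # e.g. a regular flu read as covid-19
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "measured_positivity": (true_pos + false_pos) / tested,
    }


# Hypothetical inputs: 10,000 people tested, 10% truly infected,
# a test that catches 80% of infections and wrongly flags 5% of the rest.
result = expected_results(10_000, 0.10, sensitivity=0.80, specificity=0.95)
print(result)
```

Under these assumed error rates, the test reports about 1,250 positives out of 10,000 when only 1,000 people are infected, and roughly a third of the reported positives are false. Without published error rates we cannot tell how far the real numbers are off, or in which direction.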
Data are extremely important—but we need to pay much closer attention to what they can and cannot tell us. Based on these numbers, governments are taking momentous decisions that are having a massive impact on our daily lives and will have a massive impact on our economies and on health outcomes down the line, and individuals are taking decisions that can help or hinder the crisis resolution efforts.
Here is where the media are proving irresponsible, in my view.
The media’s focus remains, as always, to find the shocking, inflammatory headline that will drive up “engagement”, i.e. clicks or viewership. And fomenting fear always works. So the media:
Give us the impression that the virus arrived in our country the day we started testing for it, and is spreading at the same speed with which we can scale up our testing capabilities;
Try to make us believe that the US is on a worse trajectory than any other country so far, with the silent implication that, given its larger population, the consequences will be catastrophic;
Feed us dramatic footage of the rare younger infected person who is seriously ill in the hospital, without ever reminding us that the mortality rate for people under 60 years of age is close to zero;
Turn the daily White House press conferences into a deplorable spectacle where reporters scream like demented children, and for every sensible question you have three that just fish for the “gotcha” moment;
Amplify the proliferating messages from asymptomatic celebrities who assure us with somber and earnest faces that they are doing ok so far.
This is irresponsible and harmful. It feeds panic, and panic feeds irrational behavior and further complicates the already hard challenge that governments face.
The stakes are high. To combat the coronavirus, a number of governments have adopted draconian measures that will plunge our economies into a very deep near-term recession, causing millions of people to lose their jobs. If the lockdown is short-lived, we might recover relatively quickly—though we’ll be left to deal with much higher debt and a massive monetary overhang. If the lockdown lasts even just a few months, we have no idea how hard the economic impact will be—and the social and human costs that go with it. The public debate on the next steps has to be based on a more transparent understanding of the data available.