Coronavirus Data Hygiene Revisited - What Would Hans Rosling Say?
In a March blog post I highlighted the poor quality of the data available to assess the coronavirus crisis, and how their misuse creates confusion that hinders both our understanding of the situation and the policy response.
Two months later, I see very little improvement.
The first thing we would like to know from the data is, how deadly is this coronavirus? Yes, we also want to know how contagious it is, but contagion is something we can to some extent control with social distancing measures, so the most important question is, if a certain share of the population catches the disease, how many will die?
After months of testing and analyses, do we have a good sense of what the fatality rate is? No. The chart below shows the case fatality rates (horizontal axis) for the 30 countries with the highest number of recorded cases (vertical axis, in logs). Data are from Worldometer as of May 15th.
It’s all over the place. The case fatality rate currently ranges from 0.1% in Singapore to 16.4% in Belgium. Germany and France have almost the same number of recorded cases, yet the fatality rate in France is over three times that of Germany. The huge difference in fatality rates between countries like Singapore, Germany, and France cannot all be explained by the health of the population or the quality of the health care system; a lot of it has to come from the data.
What is going on?
Testing – Case Fatality Rate vs Infection Fatality Rate
Countries have ramped up viral testing, which checks for people currently infected. Viral testing is still mostly restricted to people showing symptoms; therefore it will miss those who are asymptomatic and those who have already had the virus and gotten over it without being tested.
In other words: what we are measuring is the Case Fatality Rate, or the number of recorded deaths as a fraction of the number of recorded cases. What we would like to know is the Infection Fatality Rate, or the total number of deaths as a fraction of the total number of infected people.
Some countries have begun serology tests, which check for antibodies indicating that a person’s immune system has already been exposed to the virus. These studies suggest that the number of people who have already been infected might be considerably larger than we thought.
To fix ideas, keep in mind that the number of confirmed Covid-19 cases in the US is about 0.4% of the population. Stanford University health academics conducted a serology study in Santa Clara County, California, and initially estimated that between 2.5% and 4.2% of the county’s population had been infected. The first version of the study was criticized for how its authors had adjusted for the difference between the composition of the sample and that of the county’s population. A revised version that responds to the criticism places the estimate at 2.8%, or about 54,000 people, compared with about 1,000 confirmed cases at the time. A study in the Netherlands produced similar results. A study in Boston showed a positivity rate of over 30%. A study in Germany found a positivity rate of 14%; this would imply that the number of people who have been infected is 70 times higher than the number of recorded cases. Another study found a positivity rate of 14% for New York State (8 times higher than recorded cases) and about 20% for New York City.
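The gap between serology estimates and recorded cases can be expressed as a simple multiplier. The short Python sketch below is only an illustration: the function names are made up for this example, and it takes the figures quoted above at face value. The second function backs out the share of the population that must have been confirmed if a given positivity rate and multiplier are both correct; that derived number does not appear in the studies themselves.

```python
def undercount_multiplier(estimated_infections, confirmed_cases):
    """How many estimated infections there are per confirmed case."""
    if confirmed_cases <= 0:
        raise ValueError("confirmed_cases must be positive")
    return estimated_infections / confirmed_cases


def implied_confirmed_share(positivity_rate, multiplier):
    """Share of the population with a confirmed case, implied by a
    serology positivity rate and an undercount multiplier."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return positivity_rate / multiplier


# Santa Clara: about 54,000 estimated infections vs about 1,000 confirmed cases.
santa_clara = undercount_multiplier(54_000, 1_000)
print(f"Santa Clara multiplier: {santa_clara:.0f}x")

# Germany and New York State: 14% positivity with multipliers of 70 and 8.
for place, multiplier in (("Germany study", 70), ("New York State", 8)):
    share = implied_confirmed_share(0.14, multiplier)
    print(f"{place}: implied confirmed share {share:.2%}")
```

The same 14% positivity rate is consistent with very different multipliers, because the multiplier also depends on how many cases had been recorded in each place.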
These results are important because they can give us a sense of the uncertainty around the true infection fatality rate, which is what really matters. Current data indicate that 0.4% of the US population has a confirmed infection, yielding a case fatality rate of 6%. If 3% of the population has been infected, instead of 0.4%, the true (infection) fatality rate would be 0.9% (taking as given the number of total fatalities, but see below); if 14% of the population has been infected the true fatality rate drops to 0.2%; if 30% of the population has been infected the true fatality rate drops to 0.1%.
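The scaling behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not the calculation actually used for this post: it assumes the number of deaths is held fixed, so the infection fatality rate is the case fatality rate multiplied by the ratio of the confirmed share to the assumed true share of infections. The function name and the rounded inputs (0.4% and 6%) are illustrative. With these rounded inputs the 3% scenario comes out nearer 0.8% than the 0.9% quoted above, presumably because the text works from unrounded case and death counts.

```python
def implied_ifr(cfr, confirmed_share, infected_share):
    """Infection fatality rate implied by an assumed true infected share.

    Assumes deaths are fixed: deaths = cfr * confirmed_share * population,
    so IFR = deaths / (infected_share * population), and the population
    cancels out.
    """
    if infected_share <= 0:
        raise ValueError("infected_share must be positive")
    return cfr * confirmed_share / infected_share


CFR = 0.06               # rounded case fatality rate from the text
CONFIRMED_SHARE = 0.004  # rounded share of the US population with a confirmed case

# Scenarios for the true infected share, taken from the serology studies above.
for infected_share in (0.004, 0.03, 0.14, 0.30):
    ifr = implied_ifr(CFR, CONFIRMED_SHARE, infected_share)
    print(f"infected share {infected_share:.1%} -> implied IFR {ifr:.2%}")
```

The relationship is inversely proportional: every doubling of the assumed infected share halves the implied fatality rate, which is why the denominator matters so much.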
In a knee-jerk reaction, many commentators have quickly dismissed these studies as deeply flawed because of limited sample size, risks of sample bias, and risk of false positives. This is a misguided reaction. Of course these initial studies suffer from important limitations, but they are a step in the right direction. And importantly, all data currently available on the coronavirus suffer from major limitations; we need to take all results with a pinch of salt and keep trying to get better data, not dismiss those we don’t like and treat as revealed truth those that validate our existing opinion.
The actual number of Covid-19 deaths—the numerator of the fatality rate—should be easier to estimate. Alas, it is not. Several countries have worried they might be under-counting Covid-19 victims. As a consequence:
1. Most Covid-19 victims have additional pathologies. The US and some other countries (like the UK and Italy) have decided that whenever Covid-19 is one of the conditions, Covid-19 should be recorded as the cause of death. (For the US, this was confirmed in a press conference by Dr. Birx, a top expert on the White House Coronavirus Task Force.) We don’t do this for any other disease: if a cancer patient catches the flu and dies, it’s not recorded as a flu death.
2. The CDC has decided to record “probable” covid-19 deaths as well as confirmed ones. If you look at the criteria, you will find that to classify a death as a probable covid-19 fatality it’s enough, for example, that the victim had a fever (even if ‘subjective’ and not measured) and a headache, and lived in an area with ongoing contagion or was in a category at risk (over 60 years old or with other pathologies).
So even on fatalities we still don’t know what’s going on. According to the Washington Post, Dr. Birx has expressed serious reservations about the CDC data, saying they might overestimate mortality by 25%. Dr. Fauci instead thinks the death toll is “almost certainly higher”. The fact that these two top experts, who work closely together, cannot agree on the data is perhaps the best proof of the limitations and poor quality of the data themselves.
I have more sympathy for Dr. Birx’s view: I think we are more likely to be overcounting, given the loose criteria for probable deaths. Paradoxically, we are probably also counting as Covid-19 fatalities a number of people who died of other causes, such as strokes and heart attacks, because the lockdown prevented them from getting the necessary care.
We will eventually get a better sense of the numbers. If we are overcounting coronavirus deaths, we should see an implausible decline in the share of deaths attributed to other causes such as heart attacks, Alzheimer’s disease and cancer. But it will take some time. You can check out the CDC website here and you will see that fatality data for at least the last several weeks are extremely incomplete. To give you a sense, below is the chart of all deaths on a weekly basis for 2020 so far, built with the CDC’s own visualization tool:
Bottom line: beware of the data
Given the persistently poor quality of the data, we should be very wary of using them to design precise litmus tests to guide key policy decisions.
What would Hans Rosling say?
This brings me to the title of this post.
Dr. Hans Rosling, the renowned and brilliant Swedish physician, spent considerable time working in the poorer parts of Africa and Asia. This opened his eyes to how little we know about the most important issues affecting our lives (developments in health care, education, economics and more) because we ignore even the most basic data. He therefore embarked on a mission to urge us to always look for data and to teach us how to interpret the data with care.
In his brilliant book Factfulness (reviewed here), Dr. Rosling relates what now sounds like a prescient cautionary tale. While he was involved in fighting the Ebola epidemic in Liberia, Rosling saw that health organizations had started including “suspected” Ebola deaths in the count. And as he put it:
“The numbers behind the official World Health Organization and the US Center for Disease Control and Prevention (CDC) “suspected cases” curve were far from certain. Suspected cases means cases that are not confirmed. There were all kinds of issues: for example, people who at some point had been suspected of having Ebola but who, it turned out, had died from some other cause were still counted as suspected cases. As fear of Ebola increased, so did suspicion, and more and more people were ‘suspected’. As the normal health services staggered under the weight of dealing with Ebola and resources had to move away from treating other life-threatening conditions, more and more people were dying of non-Ebola causes. Many of these deaths were also treated as ‘suspect’. So the rising curve of suspected cases got more and more exaggerated and told us less and less about the trend in actual, confirmed cases.”
When Dr. Rosling examined the data, he realized that confirmed cases (based on blood tests) had already peaked two weeks earlier and were falling even as suspected cases kept climbing. He released his results; the World Health Organization published them. However, in the US, Dr. Rosling writes, the CDC decided to stick with suspected cases “to maintain a sense of urgency”.
To be clear, Dr. Rosling was not aiming to declare victory and relax efforts. He thought it was important to look at the evidence and feared that looking at the wrong data would mis-direct resources and policies. His premature death in 2017 has deprived us of an outstanding individual whose clarity of thought, balance and objectivity would have been immensely beneficial in this crisis. We can at least reflect on Dr. Rosling’s lesson from the Ebola experience:
“Data was absolutely key. And because it will be key in the future too, when there is another outbreak somewhere, it is crucial to protect its credibility and the credibility of those who produce it. Data must be used to tell the truth, not to call to action, no matter how noble the intentions.”
Over the past couple of months, the quantity of the coronavirus data has increased, but their quality has not. We are still very far from a reliable estimate of how deadly the virus is. What is worse, the debate on the data and even the classification of the data seem too often guided by the desire to push a specific agenda—however well-meaning that might be—rather than to get to a better understanding of the underlying reality.
Given the persistent poor quality of the data, we should resist the temptation to use them as a litmus test to guide crucial policy decisions. They can provide us more information, but they cannot give us any precise benchmarks.
We’d be better advised to follow common sense and rely most heavily on the data and information that appear more solid: focus on protecting the people at higher risk; monitor the capacity of the health care system; encourage people to keep adopting basic precautions while we gradually relax the more draconian restrictions to restart economic activity.
Much like the virus, dirty data are going to be with us for a while, and we have to learn to live with them, taking the right precautions to prevent them from infecting and impairing our decision-making.