News Release

The paradox of big data spoils vaccination surveys

Researchers analyze where COVID vaccination surveys went wrong

Peer-Reviewed Publication

Harvard University

When Delphi-Facebook and the U.S. Census Bureau provided near-real-time estimates of COVID-19 vaccine uptake last spring, their weekly surveys drew on responses from as many as 250,000 people.

The large data sets provided statistically tiny margins of error, a key measure of a poll’s accuracy, and raised confidence that the numbers were correct. But when the Centers for Disease Control and Prevention later provided figures of actual reported vaccination rates, the two polls were off — by a lot. By the end of May, the Delphi-Facebook study overestimated vaccine uptake by 17 percentage points — 70 percent versus 53 percent, according to the CDC — and the Census Bureau’s Household Pulse Survey did the same by 14 percentage points.

A comparative analysis by statisticians and political scientists from Harvard, Oxford, and Stanford universities concludes that the surveys fell victim to the “Big Data Paradox,” the mathematical tendency of big data sets to minimize one type of error (that due to small sample size) but to magnify another that tends to get less attention: errors due to systematic biases that make the surveyed sample a poor representation of the larger population.
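To see how small the nominal margin of error is next to the gap the CDC figures revealed, the short sketch below applies the textbook formula for a proportion to the numbers quoted above (250,000 respondents, 70 percent estimated, 53 percent per the CDC). It assumes simple random sampling, which is not how either survey was actually weighted, and the function name is illustrative rather than anything from the study.

```python
import math


def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion.

    Assumes a simple random sample; this is a simplification and not
    the weighting scheme used by either survey.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


n_respondents = 250_000   # weekly responses cited in the release
survey_estimate = 0.70    # Delphi-Facebook estimate, end of May
cdc_benchmark = 0.53      # CDC-reported uptake, end of May

moe = margin_of_error(survey_estimate, n_respondents)
actual_error = survey_estimate - cdc_benchmark

print(f"nominal margin of error: {moe * 100:.2f} percentage points")
print(f"actual error:            {actual_error * 100:.0f} percentage points")
print(f"ratio of actual error to nominal margin: {actual_error / moe:.0f}")
```

Under these assumptions the nominal margin comes out to a fraction of a percentage point, while the actual error is many times larger. The formula measures only sampling noise, so it says nothing about who chose to respond.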

The “Big Data Paradox” was identified and named by one of the study’s authors, Harvard statistician Xiao-Li Meng, the Whipple V.N. Jones Professor of Statistics, in his 2018 analysis of polling during the 2016 presidential election. Famous for predicting a Hillary Clinton presidency, those election polls were skewed by what is termed “nonresponse bias,” which in this case was the tendency of Trump voters to either not respond or define themselves as “undecided.”

The danger posed by the paradox, Meng said, is that a biased big data survey has the potential to be worse than no survey at all, because with no survey, researchers still understand that they don’t know the answer. When underlying bias is poorly understood, as in the 2016 election, it can be masked by the confidence given by the large sample size, leading researchers and subsequent consumers of survey results to mistakenly think they know the answer.

“This is the Big Data Paradox: the larger the data size, the surer we fool ourselves when we fail to account for bias in data collection,” the paper’s authors wrote in their analysis, published Dec. 8 in the journal Nature.

Those misleading results can be particularly harmful when actions are taken based on them, the authors point out. The governor of a state where a survey shows that 70 percent are vaccinated against COVID, for example, might relax public health measures. If actual vaccination rates are closer to 55 percent, instead of fostering a return to normal life, the step could result in a spike in cases and a rise in COVID deaths.

“All around the world, policymakers and scientific advisors are trying to make sense of COVID data,” said Seth Flaxman, associate professor at Oxford University, a 2008 alumnus of Harvard’s computer science and mathematics program, and corresponding author of the paper. “Reported cases are a fraction of true infections, COVID-19 attributed deaths are a severe undercount of the true toll of this pandemic, and electronic medical records do not give us the full picture of long COVID. When it comes to survey data, all sorts of data quality issues, such as vaccinated respondents being more likely to respond to surveys and marginalized groups being underrepresented, can lead to incorrect estimates.”

Though it is broadly known that survey accuracy comes from both data quantity and data quality, data quantity has stolen the spotlight in recent years as technology has dramatically increased our ability to both collect and process massive data sets. These data sets potentially offer insights never before possible, particularly into subpopulations previously difficult to study. But if attention isn’t paid to data quality, which comes from ensuring the sample is representative of the larger population or from understanding how it differs so results can be adjusted, the results can be misleading.

“There’s this drive to get the biggest data sets possible and modern technology, big data, has made that possible,” said Shiro Kuriwaki, a first author of the paper who received his Ph.D. in government from Harvard last spring and is now a postdoctoral fellow at Stanford. “What that allows is analysis at a more granular level than ever before, but we need to be mindful that biases in the data get worse with bigger sample size, and that can carry right to the subgroups.”

Meng said he began thinking about the problems posed by big data during a visit to Harvard a decade ago by a U.S. Census Bureau official. The official met with a group of statisticians and asked them about the handling of data sets that were becoming available covering large percentages of the U.S. population. Using the hypothetical example of tax data collected by the IRS, he asked whether the statisticians would prefer a sample covering 5 percent of the population that they knew was representative of the larger population or IRS data that they weren’t sure was representative but covered 80 percent of the population. The statisticians chose the 5 percent. “What if it was 90 percent?” the Census Bureau official asked. The statisticians still chose the 5 percent, because if they understood the data, their answer would likely be more accurate than even a much larger set with unknown biases.
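The statisticians’ choice can be illustrated with a small simulation. The sketch below is not from the study: it invents a population of 100,000 people with a yes/no trait, then compares a 5 percent simple random sample with a sample covering roughly 80 percent of the population in which people with the trait are more likely to be included. The population size, trait rate, and inclusion probabilities are all hypothetical values chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

POP_SIZE = 100_000  # hypothetical population size
TRUE_RATE = 0.53    # hypothetical share of people with the trait

# 1 = has the trait, 0 = does not
population = [1 if rng.random() < TRUE_RATE else 0 for _ in range(POP_SIZE)]
actual_rate = sum(population) / POP_SIZE

# Small, representative sample: 5 percent drawn uniformly at random.
small_sample = rng.sample(population, POP_SIZE // 20)
small_estimate = sum(small_sample) / len(small_sample)

# Large, biased sample: people with the trait are included with
# probability 0.90, others with probability 0.69 (hypothetical values
# that give roughly 80 percent coverage overall).
INCLUSION_PROB = {1: 0.90, 0: 0.69}
large_sample = [y for y in population if rng.random() < INCLUSION_PROB[y]]
large_estimate = sum(large_sample) / len(large_sample)

print(f"actual rate:           {actual_rate:.3f}")
print(f"5% random sample:      {small_estimate:.3f} (n={len(small_sample)})")
print(f"~80% biased sample:    {large_estimate:.3f} (n={len(large_sample)})")
```

With these made-up inclusion probabilities, the large sample should overshoot the actual rate by several percentage points, while the small random sample should land much closer, even though it is about one-sixteenth the size. Changing the seed changes the exact figures but not the pattern.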

“Every data set is going to have certain quirks, but the question is whether the quirk matters to whatever your problem is,” said Meng, whose work is partially funded by the National Science Foundation. “Social media has tons of data just sitting there. And they may think they have a public sample, but may not realize that their population is biased to start.”

Indeed, nonresponse bias remains pernicious even if survey researchers are aware of its dangers. For example, a 2020 article by Kuriwaki and another coauthor of the current study, Harvard undergraduate Michael Isakov, correctly predicted overconfidence in the 2020 presidential election polls despite new methods being introduced in the aftermath of 2016.

“In the current paper, we found that while both the Delphi-Facebook and Census Bureau researchers attempted to account for potential issues, their corrections were simply not enough to alleviate all of the bias,” Isakov said.

The study, conducted with Dino Sejdinovic at Oxford, identifies areas of potential bias in the vaccination polls. The Delphi-Facebook polls were drawn from daily Facebook users but didn’t account for factors such as education level (two in 10 respondents did not have a college degree, compared with four in 10 of all U.S. adults) or race and ethnicity (the fraction of black and Asian respondents was only half of what it is in the general population). The Census Bureau study corrected for both education and race/ethnicity, but neither survey collected data on partisanship of respondents, which may have been an important factor in vaccine uptake. Also, neither adjusted its sample to represent the distribution of urban and rural areas, another factor the authors said may have been at play.
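One standard way to adjust for a factor such as education is to reweight each group so it counts in proportion to its share of the population. The sketch below shows the idea using the education shares quoted above (two in 10 respondents versus four in 10 adults without a college degree). The uptake rates within each group are hypothetical placeholders, not figures from the study, and the function name is illustrative.

```python
def weighted_mean(group_means: dict, shares: dict) -> float:
    """Average the group means, weighting each group by its share."""
    return sum(group_means[g] * shares[g] for g in group_means)


# Shares quoted in the release.
sample_shares = {"no_degree": 0.2, "degree": 0.8}
population_shares = {"no_degree": 0.4, "degree": 0.6}

# HYPOTHETICAL uptake rates among respondents in each group.
uptake_in_sample = {"no_degree": 0.55, "degree": 0.75}

raw_estimate = weighted_mean(uptake_in_sample, sample_shares)
adjusted_estimate = weighted_mean(uptake_in_sample, population_shares)

print(f"unadjusted estimate:         {raw_estimate:.2f}")
print(f"reweighted by education:     {adjusted_estimate:.2f}")
```

Reweighting of this kind only corrects for the factors included in the weights. If respondents within a group still differ from nonrespondents in that group, for example by partisanship or by urban versus rural residence, the adjusted estimate remains biased.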

“The U.S. government is spending billions of dollars this year doing targeted outreach to try to get people who are not vaccinated, vaccinated,” said Valerie Bradley ’14, an alumna of Harvard’s statistics program, Ph.D. student at Oxford University, and a first author of the paper. “And if you are guiding that based on the Census Household Pulse or Facebook survey, you might be pouring literally billions of dollars into the wrong communities.”

By comparison, researchers running a more traditional poll, conducted by Axios-Ipsos with just 1,000 respondents, took pains to ensure the sample was representative of the larger population. They accounted for education, race, ethnicity, and political partisanship, and even provided tablets with internet access to “offline” respondents to ensure their points of view were registered. Despite the smaller sample size, the Axios-Ipsos estimates of vaccine uptake were similar to the vaccination figures reported by the CDC.

The ultimate effect of the uncorrected bias in the large polls, the authors said, was that the Delphi-Facebook poll, despite surveying 250,000 respondents, had a bias-adjusted effective sample size of less than 10 in April 2021, a 99.99 percent reduction from its raw average weekly sample size. Similarly, the Census Household Pulse, which tallied 75,000 responses weekly, had an effective sample size 99 percent lower than its raw sample size in May 2021.
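A rough sense of where a figure like “less than 10” comes from can be had by asking how large a simple random sample would need to be for its typical squared error to match the error actually observed. The sketch below is a back-of-envelope simplification, not the paper’s calculation: it ignores the finite-population correction and plugs in the end-of-May Delphi-Facebook figures quoted earlier (70 percent estimated versus 53 percent per the CDC) rather than the April data the authors used.

```python
def effective_sample_size(benchmark: float, estimate: float) -> float:
    """Size of a simple random sample whose expected squared error
    equals the observed squared error of the survey estimate.

    Simplification: ignores the finite-population correction.
    """
    variance = benchmark * (1.0 - benchmark)    # variance of a yes/no outcome
    squared_error = (estimate - benchmark) ** 2
    return variance / squared_error


n_raw = 250_000  # weekly Delphi-Facebook responses cited in the release
n_eff = effective_sample_size(benchmark=0.53, estimate=0.70)
reduction = 1.0 - n_eff / n_raw

print(f"effective sample size: about {n_eff:.1f}")
print(f"reduction from raw sample size: {reduction:.4%}")
```

Even with these simplifications, the result falls below 10, a reduction of more than 99.99 percent from the raw sample size and consistent with the order of magnitude the authors report.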

“If you have the resources, invest in data quality far more than you invest in data quantity,” Meng said. “Bad quality data is essentially wiping out the power you think you have. That’s always been a problem, but it’s magnified now because we have big data.”
