chapter 5 Getting a sense of Big Data and well-being
A case study on the promise of commercial Big Data
One of the most high-profile examples of the possibilities of Big Data involves a tale that begins in 2009, when a new virus was discovered. This new illness spread quickly and combined elements of bird flu and swine flu. This story opens Mayer-Schönberger and Cukier’s book, Big Data: A Revolution That Will Transform How We Live, Work, and Think (2013), which you may remember is mentioned earlier in the chapter as a much-cited originator of the term ‘datafication’. The authors explain that the only way authorities could curb the spread of this new virus was by knowing where it already was.
In the US, the Centers for Disease Control and Prevention (CDC) requested that doctors inform them of cases. However, the information on the pandemic that the CDC had to work with was out of date. This was a consequence of the nature of the data collected and their ‘data journey’. There were multiple data journeys to consider: data were collected at the point someone went to the doctor, which could be days after initial symptoms, let alone contraction; sharing data with the CDC was a time-consuming procedure; the CDC only processed the data once a week. Thus, the picture was probably weeks out of date, making intervention or behavioural analysis difficult. In other words, while the datasets were large, even potentially fairly detailed, these Big Data were too slow.
Coincidentally, so Mayer-Schönberger and Cukier tell us, a few weeks before the new disease made the headlines, Google engineers published a paper in a high-profile journal, Nature, which explained how Google could ‘predict’ the spread of the winter flu in the US. This was possible just through analysing what people had typed into their search engine (and, of course, knowing where those people typing were). It compared the CDC data on the spread of seasonal flu from 2003 to 2008 with the 50 million most common search terms in America.
The Google engineers looked for correlations between what people typed into the Google search engine and the spread of the disease. Mayer-Schönberger and Cukier point out that Google’s method doesn’t require traditional infrastructures to distribute mouth swabs or for people to go to doctors’ surgeries.
‘Instead, it is built on ‘big data’—the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value. With it, by the time the next pandemic comes around, the world will have a better tool at its disposal to predict and thus prevent the spread.’ (Mayer-Schönberger and Cukier 2013, 2–3)
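To make the logic of this kind of correlation-hunting concrete, here is a minimal sketch in Python. It is not Google’s method: the weekly figures, the search terms and the function names `pearson` and `rank_terms` are all invented for illustration. It simply ranks a handful of hypothetical search terms by how closely their weekly counts track hypothetical reported cases.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has nothing to correlate with
    return cov / sqrt(var_x * var_y)


def rank_terms(search_counts, reported_cases):
    """Rank search terms by how closely their counts track reported cases."""
    scores = {
        term: pearson(counts, reported_cases)
        for term, counts in search_counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented weekly figures, for illustration only (hypothetical data).
reported_cases = [10, 20, 40, 80, 60, 30]
search_counts = {
    "flu symptoms": [100, 180, 390, 820, 590, 310],
    "cough remedy": [50, 70, 120, 200, 170, 90],
    "basketball scores": [300, 320, 360, 410, 380, 330],  # seasonal, not about flu
    "holiday recipes": [400, 380, 300, 200, 250, 350],
}

for term, score in rank_terms(search_counts, reported_cases):
    print(f"{term}: {score:.2f}")
```

In this toy example, a term that has nothing to do with illness but happens to follow the same seasonal rhythm scores almost as highly as the flu-related terms. The correlation tells us that the two series move together, not why they do.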
Sadly, a pandemic with wider societal and well-being effects arrived after I started writing this book, and despite the promise of Big Data, it did not prevent the spread. Data hold a very important place in the story of COVID-19 and its management, but all data have limitations in how they can inform human action to change reality, as do the different ways of analysing data. Indeed, data are not just there but are managed and used by people with their own interests. Data do not speak for themselves but are interpreted. All data realities also involve selective processes in what data are important and what data are not. These limits are not always made as clear as they should be.
Mayer-Schönberger and Cukier’s promise of Big Data as revolutionary and transformational in the US was clearly jumping the gun. Not only was the pandemic not prevented by way of predictive analytics, but part of COVID-19 data management has very much involved doctors’ surgeries and mouth swabs, in the UK at least. To clarify, I was randomly selected from data held on people registered with a GP to participate in a survey in August 2020. I was contacted by the Real-time Assessment of Community Transmission (REACT) Study, which is in fact a series of studies using home testing to understand more about COVID-19 and its transmission in communities in England. The logic behind the study was that not all people with the virus were being tested at this point, either because they were asymptomatic or for some other reason. This was one of a few projects to collect data from a sample of the population, over time, in order to understand how the virus was spreading.
This process relied on old infrastructures: I received a letter by Royal Mail, I signed up online, and then I was sent a mouth swab—also by post. That all worked fine for me, but there was a series of steps registering different barcodes and I found myself wondering how accessible this was for everyone (when I say everyone, I often think of my once tech-savvy Dad, who’d have been bewildered at this whole process). After completing these steps, a courier was ordered to collect the test. I sat in patiently waiting for my test to be collected, slightly anxious about what felt like a huge responsibility, and acutely aware that I might need to be ready to run out and meet a courier with my test.
I live in a high-rise with no working bell or intercom (and a bunch of other things that don’t work). For three separate days, I watched for details of the courier on the app, and out of my window, waiting for them to appear on the road, or call to say I should come down. But there was no sighting of the courier in real life and no phone call. When the app showed they were coming, they disappeared without attempting to collect. After three attempts, I was told that this particular courier company was infamous for not bothering to try to collect from my flats, because it was too inconvenient. So, in my case, while some aspects of the traditional data infrastructure (the post) worked fine, they didn’t necessarily all work together as they might. As a result, my test remained uncollected, expired and had to be securely disposed of, and my data became ‘missing data’.
What surprised me was how the information system assumed you would live somewhere that was easy to access. As we know, many people from our poorest communities live in high-rises where the lift doesn’t work, or where the people in the flats are themselves difficult for a courier to access. The contexts in which data are collected (or not) can be both extraordinary and mundane, and we often don’t hear these stories: when collection works, the odd occasion when it doesn’t, and what that might mean for the data. Yet these contexts have a huge impact on who is readable in data and how we understand well-being and inequality.
So why did COVID-19 data collection end up using more traditional infrastructures in the UK? On a larger scale, why did the world not use Google data as Mayer-Schönberger and Cukier predicted? As it turns out, Google Flu Trends (GFT) missed the peak of the 2013 flu season by 140%, and Google subsequently closed the project (REF). In 2014 a paper called ‘The Parable of Google Flu: Traps in Big Data Analysis’ was published in another high-profile academic journal, Science. The authors concluded that while there was potential in these sorts of methodologies, and while Google’s efforts in projecting the flu may have been well meaning (which could be called into question), the method and data were opaque. This made it potentially ‘dangerous’ to rely on GFT for any decision-making, as the context of the data and the analyses were not made explicit to public decision-makers. Of course, it is also perhaps unlikely that Google had designed the tool for public decision-making contexts, considering what government officials need to understand for this kind of decision-making.
There are other limits to the data, such as the sample. Google has a reputation for ubiquity, yet it is not the only search engine available: people choose other search engines for various reasons. Crucially, Google also does not have global reach. Most services offered by Google China, for example, were blocked by the Great Firewall in the People’s Republic of China. This was not even the first time it was banned in China. So, even if GFT were still in action, would it have pre-empted the COVID-19 outbreak in Wuhan, China, before more official announcements?
If we are to think about how Big Data have transformed how we live, as Mayer-Schönberger and Cukier want us to, then we must also consider how ‘datafication’ has changed people’s practices. More and more of us scour the internet, hoping to reassure ourselves that recently developed symptoms are minor ailments. This is—as we discovered in Chap. 2—part of the anxiety introduced with audit culture: we consult technologies as a default because we can, rather than should. We search for confirmation that nothing is wrong, rather than only searching when something is wrong. In countries where access to healthcare is diminished, people are actively encouraged to search the internet before interacting with health services. Consequently, this limits the predictability of search data, as their contexts have changed.
In the case of COVID-19, people searched for symptoms they didn’t necessarily have, especially in the second quarter of 2020, when most nations were in lockdown and the severity and ramifications of the disease were becoming clearer. The implications of this are that searches would not necessarily have reflected the infected state of an individual that could be aggregated to reveal community or population infections, or more importantly, predict transmission so that it might be controlled in some way. Instead, searches for COVID-19 symptoms may well be a predictor of concern or anxiety. Ironically, then, Google searches are arguably a better indicator of negative subjective well-being than of COVID-19.
The very idea of data being reliable has led to our need to feel sure, to have objective confirmation that all was OK, is OK or will be OK, and has led to an increased reliance on data. In the case of Google searches, this reliance has triggered people to search for verification of risk or safety. So how might we have cut through the ‘noise’ that the definitions at the beginning of this chapter point to, in order to know how the virus was spreading? We are back at the chicken-and-egg dilemma: do people search about COVID-19 because they have symptoms? Or do people search about COVID-19 because they are worried about it and feel compelled to search for confirmation, or search on behalf of friends or loved ones? I watched someone use their internet searches to check our colleague’s proclaimed symptoms against the common signs of swine flu: a very collegiate individual, but one whose search history told a story of their friend’s (potential) disease state, rather than their own. In this latter case, then, Google searches were more indicative of personality than of health or even subjective well-being, although perhaps well-being data all the same.
Bigger datasets make correlation more powerful than causation, explain Mayer-Schönberger and Cukier, devoting a whole chapter to it in their book. Google queries went from 14 billion per year in 2000 to 1.2 trillion a decade later. There are even websites that show a live running tally of how many searches have been made in a day. (Internet Live Stats offers plenty more up-to-date data on data, if you are interested.) If Big Data were all about scale, then GFT would have been more, not less, likely to work on the premise of correlation as search numbers increased. The scale at which we have correlations using ‘Big Data’ may be an indicator of causation, but not proof. Is this the end of the promise of Big Data, though? If we return to a case of COVID-19 and Big Data, what might we find?
- Bates et al. 2016
- More information is available on REACT’s data collection and management here: https://www.ipsos.com/ipsos-mori/en-uk/covid-19-swab-test-faqs#nameaddress.
- REACT was commissioned by the Department of Health and Social Care (DHSC) and is being carried out by Imperial College London in partnership with Ipsos MORI and Imperial College Healthcare NHS Trust. https://www.imperial.ac.uk/medicine/research-and-impact/groups/react-study/.
- Lazer et al. 2014
- Lazer and Kennedy 2015
- A review of literature on data and data practices, Kennedy et al. (2020), found that tech and policy were considered different worlds when it comes to data practices, and with different aims, although that is evolving.
- See Internet Live Stats, ‘Google search statistics’ (Internet Live Stats n.d.).