Chapter 5 Getting a sense of Big Data and well-being
What even is ‘Big Data’?
Big data generally capture what is easy to ensnare—data that are openly expressed (what is typed, swiped, scanned, sensed, etc.; people’s actions and behaviours; the movement of things)—as well as data that are the ‘exhaust’, a by-product … It takes these data at face value, despite the fact that they may not have been designed to answer specific questions and the data produced might be messy and dirty.
(Kitchin 2014, Chap. 2, p. 3 of individual chapter version)
Rob Kitchin is possibly one of the most cited definers of ‘Big Data’, opening books and dissertations up and down the land. Yet, as we are about to discover, Kitchin himself tells us that while the term ‘Big Data’ is repeatedly defined1, big data themselves defy categorical labelling. So, it is not clear-cut, because differentiating what ‘it’ is and what they are not is often side-stepped, or comes with caveats.((In fact, what a lot of people refer to as Big Data are not ‘Big’ at all by the initial standards of definition. They are just large datasets, or newer types of data in datasets that are not even large.)) We encountered something similar before, if you remember, in Chap. 2. When it comes to understanding what well-being is, those inclined to measure are sometimes keen to measure well-being to understand it, rather than define what it is that is being measured. In a similar way, those describing Big Data are often more concerned with what Big Data does (or do), rather than what Big Data is, or are.
In this chapter on Big Data, we will discover that how they are used can defy some of the old definitions of how to use data or what data are for. So, let us start with some definitions and what is different. For Kitchin, the lack of ‘ontological clarity’ of Big Data (that is, clarity about the individual concepts and categories of Big Data and the relations between them) means the term acts as a vague, catch-all label for a wide selection of data1. Despite this, he has reviewed how other people define it and proposes the key traits of Big Data. These qualities are outlined in Table 5.1. Given the word ‘big’, it is probably no surprise that volume is one of ‘the 3Vs’ identified by Doug Laney back in 2001; the other two are velocity and variety. Other qualities include exhaustivity, resolution, indexicality, relationality, extensionality and scalability2. But what does this mean? How do these characteristics help us understand the data?
Table 5.1 Ways that Big Data are different
Label/definition | Origin | Meaning | Pre Big Data | Big Data |
---|---|---|---|---|
Volume | Laney (2001) | Consisting of enormous quantities of data | Limited to large | Very large |
Velocity | Laney (2001) | Created in real-time | Slow, freeze-framed/bundled | Fast, continuous |
Variety | Laney (2001) | Being structured, semi-structured and unstructured | Narrow((Kitchin and McArdle’s (2016) original table says ‘Limited to wide’ here (p. 2), but I think ‘limited in width’, or narrow, makes more sense)) | Wide |
Exhaustivity | Mayer-Schönberger and Cukier (2013) | An entire system is captured, rather than being sampled | Samples | Entire populations |
Resolution and identification | Dodge and Kitchin (2005) | Fine-grained (in resolution) and uniquely indexical (in identification) | Coarse and weak to tight and strong | Tight and strong |
Relationality | Boyd and Crawford (2012) | Containing common fields that enable the conjoining of different datasets | Weak to strong | Strong |
Flexible and scalable | Marz and Warren (2012) | Can add/change new fields easily and can expand in size rapidly | Low to middling | High |
Adapted from tables in Kitchin (2014) and Kitchin and McArdle (2016)
Having established a series of classifications for Big Data, Kitchin tested his taxonomy of traits with co-author McArdle a few years later3. They applied the categories to 26 datasets which are widely considered Big Data and drawn from across seven sources: mobile communication, websites, social media/crowdsourcing, sensors, cameras/lasers, transaction process generated data and administrative data (Kitchin and McArdle 2016). The authors find that all seven traits in Table 5.1 apply to only ‘a handful’ of these datasets4. This shows how difficult it is to diagnose what Big Data actually are. Rather than the qualities of the data themselves, it might be more useful to instead turn to thinking about the contexts of data again: where they come from, and what they do5.
The key difference in the characteristics of Big Data is context, which is often missing when the data are presented. Table 5.2 represents how difficult it is to diagnose what Big Data actually are without considering the qualities that affect their use. It shows there are additional Vs: veracity, value and variability. These are concerned with how the data suit their re-purposing. Given the multiple insights and applications of data outside of their original setting, it can be even more difficult to find certainty from them. This is because the data were collected, generated and produced for a specific reason, or as a by-product, that differs from how they are re-used.
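To make the logic of this test concrete, here is a minimal sketch in Python of checking datasets against the seven traits in Table 5.1. The dataset names and the traits assigned to each are invented purely for illustration; they are not Kitchin and McArdle's 26 datasets or their findings, and the helper names `missing_traits` and `has_all_traits` are hypothetical.

```python
# The seven traits listed in Table 5.1.
TRAITS = (
    "volume",
    "velocity",
    "variety",
    "exhaustivity",
    "resolution_and_identification",
    "relationality",
    "flexible_and_scalable",
)

# Hypothetical datasets with invented trait judgements (illustration only).
datasets = {
    "hypothetical_sensor_feed": set(TRAITS),
    "hypothetical_admin_register": {
        "volume",
        "exhaustivity",
        "resolution_and_identification",
        "relationality",
    },
    "hypothetical_social_media_sample": {
        "volume",
        "velocity",
        "variety",
        "relationality",
        "flexible_and_scalable",
    },
}


def missing_traits(held):
    """Return the Table 5.1 traits that a dataset does not show."""
    return [trait for trait in TRAITS if trait not in held]


def has_all_traits(held):
    """A dataset passes the strict test only if no trait is missing."""
    return not missing_traits(held)


# Datasets that satisfy all seven traits at once.
all_seven = [name for name, held in datasets.items() if has_all_traits(held)]
```

Under a strict reading, only datasets in `all_seven` would count as Big Data, which is why so few widely cited examples survive the test.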
The value of Big Data is the variety of insights that are possible and that can be used for other purposes. However, there are many things in the data that may not be useful. This also means using Big Data can increase the risk of confounding more traditional causal explanations. Instead, the messiness of Big Data lends them to correlation, which can yield many insights and be used to predict well-being for individuals and society. We shall return to correlations and well-being in our case studies later in this chapter.
Table 5.2 Some qualities of Big Data
Label/definition | Origin | Qualities of data that affect their use |
---|---|---|
Veracity | Marr (2014) | The data can be messy, noisy and contain uncertainty and error. |
Value | Marr (2014) | Many insights can be extracted and the data repurposed. |
Variability | McNulty (2014) | Data whose meaning can be constantly shifting in relation to the context in which they are generated. |
Synthesised from Kitchin and McArdle (2016)
Table 5.3 looks at sources of different kinds of data typically used to predict well-being along with their pros and cons. These sources were drawn from an article in a journal for Data Science Analytics6, and I have synthesised these with Kitchin’s seven sources (mobile communication, websites, social media/crowdsourcing, sensors, cameras/lasers, transaction process generated data and administrative data) retaining commentary from Voukelatou et al. on the pros and cons for their use to understand well-being. You may look at these and feel like these data sources seem like strange ways to understand people’s well-being: the difference in origins and what they may be used for. You may also note that the authors’ presentation of the pros and cons, based on these sources, does not really prompt consideration for the people whose data they are, more their ease of use for the Data Scientist.
Returning to contexts of use: mobile phone data, for example, have a primary purpose which is for billing, or because apps need location data to work (such as maps or for local restaurant recommendations). This is very different from these data being used to understand trends about people and society. Our previous examples of data re-use (or secondary analysis) have largely involved data that were collected in national surveys, or through more qualitative methods with smaller samples to understand a specific aspect of people and society more deeply in some way. Notably, even if the research question is different when data are re-used in Chap. 3’s examples, the purpose of the data’s collection is not as different, or as removed, as this ‘exhaust’, ‘by-product’ nature of the data Kitchin refers to.
The process which has come to be known as ‘datafication’7 describes the increased demand for and uses of data. As we have seen in previous centuries, appetite for numbers (pandemics being one accelerator of data desire) has coincided with technological evolutions with numbers. In turn, and as we have seen over the last four chapters, different disciplines have increased and expanded their capacities for data and knowing the human experience in their own, particular way, and ‘new sciences’ have been declared. ‘Big Data’, as data with the qualities presented above, result from mounting capacity and faster instruments that increase the possibilities for the origins and volumes of data that can be stored in expanding databases, or in different databases which can be readily linked for a variety of purposes. As we have also seen before, it can be difficult to decide which came first: appetite for data, or capacity to expand on data possibilities.
Table 5.3 Sources of Big Data and their pros and cons for well-being measurement
Data Source | Pros | Cons |
---|---|---|
Mobile communications data (including GPS) | Captures temporal, spatial and social dimensions; worldwide diffusion; repeatability; unbiased and classified; real-time monitoring | Not publicly available; sparsity; geographically imprecise; limited coverage in rural areas; indoor/altitude spatial inaccuracy |
Social media | Measuring social dynamics; publicly available | Privacy issues; overrepresentation; social desirability bias; disturbance of normal activities to post |
Health and fitness (including mental health and well-being apps) | Cost-effective; prediction of near-term risk of events; reduced respondent burden | Not publicly available; not necessarily representative of the population; requests for data input can disrupt daily activities; data can neglect moment-to-moment variations in mood |
News | Variety of subject domains; variety of data; range of targets; updated 24 hours a day; archived historical news | Gatekeeping bias; coverage bias; statement bias |
Transaction process generated data | Modelling of dynamic household behaviour; temporal accuracy; long-term coverage; quality | Dependency on retailer’s permission; legal constraints |
Websites and searches | Publicly available; speed, convenience, flexibility and ease of analysis; timeliness; observation of people’s behaviour through searches | Population size varies across domains; relevant queries difficult to identify; bias of content and terms; comparability of different search terms on different days |
Crowdsourcing | Large number of data; speed, relative low-cost measurement of daily behaviour and activity | Risk of low-quality results; trade-off between quality and cost; use of self-reports; paid participation of users |
Administration data | Accurate; temporal stability; valid for community-level understanding and cross-cultural comparisons | Paid participation of users; limited understanding of human experience in administration data |
NOTES: Synthesised from Rob Kitchin’s seven sources (mobile communication; websites; social media/crowdsourcing; sensors; cameras/lasers; transaction process generated data; and administrative data) and Voukelatou et al. (2020), with the data examples in this chapter
In the age of Big Data, these newer data sources hold a wide variety of easy-to-capture data points, including observations of how we feel, where we are (or were), who we know, what we spend—and on what. These provide information on what products we have clicked on, and those we have not bought8. They can show how and where we spend our spare time and our money, both off and online. They are, therefore, incredibly valuable for research and commerce.
It is not these individual data points that are important per se, but the links between them that make them valuable. Through linking, assumptions can be made about how our behaviour, such as online spending, or improved mood, can be replicated in another place or time. These insights are also linked with other more familiar data points from administrative records, for example: where we were born, how much we earn, whether we own our own house. Other data are produced by loyalty cards, smartphones and in-house devices, such as Alexa, expanding such linking opportunities. Those who may try to avoid ‘being known’ by these other data will try to bypass the systems that gather these data. However, this resistance also becomes data in and of itself; avoidance still produces digital traces that can be used to gather insights. Corporations may still create an automated profile of sorts, and assumptions will be made about the kind of products ‘the resistors’ buy. The persistence of data practices and their seeming inescapability are the reason we are starting to think about the experience of Big Data as something we ‘live with’9 and as something we ‘feel’.
This chapter covers some of the pervasiveness of Big Data, alongside the possibilities that come with that. Crucially, we look at what that means for well-being. We start by looking at the ways that data about mundane aspects of our lives are increasing, alongside how normalised increasing data collection, analysis and re-use are. These ‘data practices’ present new possibilities and realities of data-driven systems and decision-making that affect culture and society.
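The linking described above depends on what Table 5.1 calls relationality: common fields that allow different datasets to be joined. As a rough sketch only, the Python below joins two small sets of records on a shared identifier. Every record, field name and value here is invented for illustration, and `link_on` is a hypothetical helper rather than any organisation's actual method.

```python
# Hypothetical loyalty-card records (all values invented).
loyalty_card = [
    {"person_id": "p1", "weekly_spend": 42.50},
    {"person_id": "p2", "weekly_spend": 17.00},
]

# Hypothetical administrative records (all values invented).
admin_records = [
    {"person_id": "p1", "home_owner": True},
    {"person_id": "p3", "home_owner": False},
]


def link_on(field, left, right):
    """Join two lists of records on a shared field, keeping only matches."""
    # Index the right-hand records by the common field for quick lookup.
    index = {row[field]: row for row in right}
    # Merge each left-hand record with its match, where one exists.
    return [{**row, **index[row[field]]} for row in left if row[field] in index]


linked = link_on("person_id", loyalty_card, admin_records)
```

Only the person who appears in both sources ends up with a combined profile, pairing spending with home ownership. Neither dataset held that pairing on its own, which is the sense in which the links, rather than the individual data points, carry the value.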
In this chapter, we touch on some of the uncomfortable aspects of these new realities, before historicising Big Data as well-being data to contextualise contemporary concerns regarding data practices that can be harmful. The second half of the chapter uses case studies to explore these concerns about well-being and data. Firstly, we consider a high-profile case that was billed as the promise of Big Data: Google Flu Trends (GFT), looking back from the age of COVID-19. Three further, short examples show the possibilities of social media data, place-based data, and health and fitness data to understand well-being for social and cultural policy and culture and society more generally.
- Kitchin 2014, Chap. 2, p. 3
- Kitchin and McArdle 2016; Kitchin 2014
- Kitchin and McArdle 2016
- Kitchin and McArdle 2016, 9
- Oman n.d.
- Voukelatou et al. 2020
- as coined by Mayer-Schönberger and Cukier 2013
- Turow 2011
- Kennedy et al. 2020