Understanding Well-being Data

Chapter 5: Getting a Sense of Big Data and Well-being

What even is ‘Big Data’?

Big data generally capture what is easy to ensnare—data that are openly expressed (what is typed, swiped, scanned, sensed, etc.; people’s actions and behaviours; the movement of things)—as well as data that are the ‘exhaust’, a by-product … It takes these data at face value, despite the fact that they may not have been designed to answer specific questions and the data produced might be messy and dirty.

(Kitchin 2014, Chap. 2, p. 3 of individual chapter version)

Rob Kitchin is possibly one of the most cited definers of ‘Big Data’, opening books and dissertations up and down the land. Yet, as we are about to discover, Kitchin himself tells us that while the term ‘Big Data’ is repeatedly defined¹, big data themselves defy categorical labelling. So, it is not clear-cut, because differentiating what ‘it’ is and what they are not is often side-stepped, or comes with caveats. (In fact, much of what people refer to as Big Data is not ‘Big’ at all by the initial standards of definition: just large datasets, or newer types of data in datasets that are not even large, and so arguably not Big at all.) We encountered something similar before, if you remember, in Chap. 2. When it comes to understanding what well-being is, those inclined to measure are sometimes keen to measure well-being to understand it, rather than define what it is that is being measured. In a similar way, those describing Big Data are often more concerned with what Big Data does (or do), rather than what Big Data is, or are.

In this chapter on Big Data, we will discover that how they are used can defy some of the older definitions of how to use data or what data are for. So, let us start with some definitions and what is different. For Kitchin, the lack of ‘ontological clarity’ of Big Data (as the individual concepts and categories of Big Data and the relations between them) means the term acts as a vague, catch-all label for a wide selection of data¹. Despite this, he has reviewed how other people define it and proposes the key traits of Big Data. These qualities are outlined in Table 5.1. Given the word ‘big’, it is probably no surprise that volume is one of ‘the 3Vs’ identified by Doug Laney back in 2001; the other two are velocity and variety. Other qualities include exhaustivity, resolution, indexicality, relationality, extensionality and scalability². But what do these mean? How do these characteristics help us understand the data?

Table 5.1 Ways that Big Data are different

Label/definition | Origin | Meaning | Pre Big Data | Big Data
Volume | Laney (2001) | Consisting of enormous quantities of data | Limited to large | Very large
Velocity | Laney (2001) | Created in real-time | Slow, freeze-framed/bundled | Fast, continuous
Variety | Laney (2001) | Being structured, semi-structured and unstructured | Narrow* | Wide
Exhaustivity | Mayer-Schönberger and Cukier (2013) | An entire system is captured, rather than being sampled | Samples | Entire populations
Resolution and identification | Dodge and Kitchin (2005) | Fine-grained (in resolution) and uniquely indexical (in identification) | Coarse and weak to tight and strong | Tight and strong
Relationality | Boyd and Crawford (2012) | Containing common fields that enable the conjoining of different datasets | Weak to strong | Strong
Flexible and scalable | Marz and Warren (2012) | Can add/change new fields easily and can expand in size rapidly | Low to middling | High

*Kitchin and McArdle’s (2016) original table says ‘Limited to wide’ here (p. 2), but I think ‘narrow’ (limited in width) makes more sense.

Adapted from tables in Kitchin (2014) and Kitchin and McArdle (2016)

Having established a series of classifications for Big Data, Kitchin tested his taxonomy of traits with co-author McArdle a few years later³. They applied the categories to 26 datasets that are widely considered Big Data, drawn from across seven sources: mobile communication, websites, social media/crowdsourcing, sensors, cameras/lasers, transaction process generated data and administrative data (2016). The authors find that all seven traits in Table 5.1 apply to only ‘a handful’ of these datasets⁴. This shows how difficult it is to diagnose what Big Data actually are. Rather than the qualities of the data themselves, it might be more useful to turn instead to thinking about the contexts of data again: where they come from, and what they do⁵.

A key difference in the characteristics of Big Data is context, which is often missing when the data are presented. Table 5.2 shows how difficult it is to diagnose what Big Data actually are without considering the qualities that affect their use. It sets out additional Vs: veracity, value and variability, which are concerned with how well the data suit their re-purposing. Given the multiple insights and applications of data outside their original setting, it can be even more difficult to find certainty in them. This is because the data were collected, generated and produced for a specific reason, or as a by-product, that differs from how they are re-used.

The value of Big Data lies in the variety of insights that are possible and that can be used for other purposes. However, there is much in the data that may not be useful. This also means using Big Data can increase the risk of confounding more traditional causal explanations. Instead, the messiness of Big Data lends them to correlation, yielding many insights that can be used to enable prediction of well-being for individuals and society. We shall return to correlations and well-being in our case studies later in this chapter.
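
To make that point concrete, here is a minimal, invented sketch of how a strong correlation found in repurposed data can support prediction while offering no causal explanation. Everything in it is simulated for illustration: the ‘sunshine’, ‘ice cream sales’ and ‘mood’ variables are hypothetical and refer to no dataset discussed in this chapter.

```python
# A minimal, invented sketch: in messy, repurposed data, a strong correlation
# can support prediction without offering a causal explanation. All variables
# and numbers here are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical confounder: hours of sunshine per day
sunshine = rng.uniform(0, 12, n)

# Both quantities below are driven by sunshine, not by each other
ice_cream_sales = 5 * sunshine + rng.normal(0, 5, n)   # a behavioural trace
reported_mood = 0.3 * sunshine + rng.normal(0, 1, n)   # a well-being proxy

# The correlation is strong enough to predict mood from sales...
r = np.corrcoef(ice_cream_sales, reported_mood)[0, 1]
slope, intercept = np.polyfit(ice_cream_sales, reported_mood, 1)
predicted_mood = slope * ice_cream_sales + intercept
print(f"correlation: {r:.2f}")

# ...but banning ice cream would not change anyone's mood: the association
# runs through sunshine, so the prediction works while the causal story fails.
```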

Table 5.2 Some qualities of Big Data

Label/definition | Origin | Qualities of data that affect their use
Veracity | Marr (2014) | The data can be messy, noisy and contain uncertainty and error
Value | Marr (2014) | Many insights can be extracted and the data repurposed
Variability | McNulty (2014) | Data whose meaning can be constantly shifting in relation to the context in which they are generated

Synthesised from Kitchin and McArdle (2016)

Table 5.3 looks at the sources of different kinds of data typically used to predict well-being, along with their pros and cons. These sources were drawn from an article in a data science and analytics journal⁶, and I have synthesised them with Kitchin’s seven sources (mobile communication, websites, social media/crowdsourcing, sensors, cameras/lasers, transaction process generated data and administrative data), retaining commentary from Voukelatou et al. on the pros and cons of their use to understand well-being. You may look at these and feel that these data sources seem like strange ways to understand people’s well-being, given the difference between their origins and what they may be used for. You may also note that the authors’ presentation of the pros and cons does not really prompt consideration of the people whose data they are, but rather of their ease of use for the data scientist.

Returning to contexts of use: mobile phone data, for example, have a primary purpose, which is billing, or enabling apps that need location data to work (such as maps or local restaurant recommendations). This is very different from these data being used to understand trends about people and society. Our previous examples of data re-use (or secondary analysis) have largely involved data that were collected in national surveys, or through more qualitative methods with smaller samples, to understand a specific aspect of people and society more deeply in some way. Notably, even where the research question is different when data are re-used in Chap. 3’s examples, the purpose of the data’s collection is not as different, or as removed, as the ‘exhaust’, ‘by-product’ nature of the data Kitchin refers to.

The process which has come to be known as ‘datafication’⁷ describes the increased demand for and uses of data. As we have seen in previous centuries, appetite for numbers (pandemics being one accelerator of data desire) has coincided with technological evolutions in working with numbers. In turn, and as we have seen over the last four chapters, different disciplines have increased and expanded their capacities for data and for knowing the human experience in their own, particular way, and ‘new sciences’ have been declared. ‘Big Data’, as data with the qualities presented above, result from mounting capacity and faster instruments that increase the possibilities for the origins and volumes of data that can be stored in expanding databases, or in different databases which can be readily linked for a variety of purposes. As we have also seen before, it can be difficult to decide which came first: the appetite for data, or the capacity to expand on data possibilities.

Table 5.3 Sources of Big Data and their pros and cons for well-being measurement

Data source | Pros | Cons
Mobile communications data (including GPS) | Captures temporal, spatial and social dimensions; worldwide diffusion; repeatability; unbiased and classified real-time monitoring | Not publicly available; sparsity; geographically imprecise; limited coverage in rural areas; indoor/altitude spatial inaccuracy
Social media | Measuring social dynamics; publicly available | Privacy issues; overrepresentation; social desirability bias; disturbance of normal activities to post
Health and fitness (including mental health and well-being apps) | Cost-effective; prediction of near-term risk of events; reduced respondent burden | Not publicly available; not necessarily representative of the population; requests for data input can disrupt daily activities; data can neglect moment-to-moment variations in mood
News | Variety of subject domains; variety of data; range of targets; updated 24 hours a day; archived historical news | Gatekeeping bias; coverage bias; statement bias
Transaction process generated data | Modelling of dynamic household behaviour; temporal accuracy; long-term coverage; quality | Dependency on retailer’s permission; legal constraints
Websites and searches | Publicly available; speed, convenience, flexibility and ease of analysis; timeliness; observation of people’s behaviour through searches | Population size varies across domains; relevant queries difficult to identify; bias of content and terms; comparability of different search terms on different days
Crowdsourcing | Large amounts of data; speed; relatively low-cost measurement of daily behaviour and activity | Risk of low-quality results; trade-off between quality and cost; use of self-reports; paid participation of users
Administrative data | Accurate; temporal stability; valid for community-level understanding and cross-cultural comparisons | Limited understanding of human experience in administrative data

Notes: Synthesised from Rob Kitchin’s seven sources (mobile communication; websites; social media/crowdsourcing; sensors; cameras/lasers; transaction process generated data; and administrative data) and Voukelatou et al. (2020), with the data examples in this chapter

In the age of Big Data, these newer data sources hold a wide variety of easy-to-capture data points, including observations of how we feel, where we are (or were), who we know, what we spend, and on what. They provide information on which products we have clicked on, and those we have not bought⁸. They can show how and where we spend our spare time and our money, both off and online. They are, therefore, incredibly valuable for research and commerce.

It is not these individual data points that are important, per se; it is the links between them that make them valuable. Through linking, assumptions can be made about how our behaviour, such as online spending or improved mood, can be replicated in another place or time. These insights are also linked with other, more familiar data points from administrative records, for example: where we were born, how much we earn, whether we own our own house. Other data are produced by loyalty cards, smartphones and in-house devices, such as Alexa, expanding such linking opportunities. Those who try to avoid ‘being known’ by these other data will try to bypass the systems that gather them. However, this resistance also becomes data in and of itself; avoidance still produces digital traces that can be used to gather insights. Corporations may still create an automated profile of sorts, and assumptions will be made about the kind of products ‘the resistors’ buy. The persistence of data practices and their seeming inescapability are the reason we are starting to think about the experience of Big Data as something we ‘live with’⁹ and as something we ‘feel’.
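
As a minimal illustration of how such linking works, the hypothetical sketch below conjoins two invented datasets on a shared identifier, the kind of ‘common field’ that the relationality trait in Table 5.1 refers to. The names, values and identifiers are made up for illustration only; they do not come from any real system or source discussed here.

```python
# A minimal, hypothetical sketch of 'relationality': two small datasets that
# share a common field (an invented person_id) can be conjoined so that the
# individual data points become more valuable in combination than alone.
# All datasets, column names and values are illustrative inventions.
import pandas as pd

# Hypothetical 'exhaust' data: online transactions logged as a by-product of shopping
transactions = pd.DataFrame({
    "person_id": [101, 101, 102, 103],
    "category": ["books", "gym", "books", "takeaway"],
    "spend": [12.99, 25.00, 8.50, 18.20],
})

# Hypothetical administrative-style records: where someone lives, their income band
admin_records = pd.DataFrame({
    "person_id": [101, 102, 103],
    "postcode_area": ["S1", "M4", "LS6"],
    "income_band": ["middle", "lower", "middle"],
})

# The common field enables the conjoining of different datasets
linked = transactions.merge(admin_records, on="person_id", how="left")

# Once linked, assumptions can be made across contexts, e.g. spend per income band
print(linked.groupby("income_band")["spend"].sum())
```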

This chapter covers some of the pervasiveness of Big Data, alongside the possibilities that come with that. Crucially, we look at what that means for well-being. We start by looking at the ways that data about mundane aspects of our lives are increasing, alongside how normalised the increasing collection, analysis and re-use of data have become. These ‘data practices’ present new possibilities and realities of data-driven systems and decision-making that affect culture and society.

In this chapter, we touch on some of the uncomfortable aspects of these new realities, before historicising Big Data as well-being data to contextualise contemporary concerns regarding data practices that can be harmful. The second half of the chapter uses case studies to explore these concerns about well-being and data. Firstly, we consider a high-profile case that was billed as the promise of Big Data: Google Flu Trends (GFT), looking back from the age of COVID-19. Three further, short examples show the possibilities of social media data, place-based data, and health and fitness data to understand well-being for social and cultural policy and culture and society more generally.

  1. Kitchin 2014, Chap. 2, p. 3
  2. Kitchin and McArdle 2016; Kitchin 2014
  3. Kitchin and McArdle 2016
  4. Kitchin and McArdle 2016, p. 9
  5. Oman n.d.
  6. Voukelatou et al. 2020
  7. As coined by Mayer-Schönberger and Cukier 2013
  8. Turow 2011
  9. Kennedy et al. 2020