
Degree project

An Explorative Study on the Perceived Challenges and Remediating Strategies for Big Data among Data Practitioners

Authors: Patrick Pilipiec, Olga Soprano
Supervisor: David Randall
Examiner: Päivi Jokela

Date: 2020-07-10

Course Code: 4IK50E, 15 credits
Subject: Information Systems

Level: Graduate

Department of Informatics


Abstract

Background: Worldwide, new data are generated exponentially. The emergence of the Internet of Things has resulted in products that are designed first and foremost to generate data. Big data are valuable, as they have the potential to create business value. Therefore, many organizations are now investing heavily in big data. Despite the incredible interest, big data analytics involves many challenges that need to be overcome. A taxonomy of these challenges, created from the literature, is available. However, this taxonomy fails to represent the view of data practitioners. Little is known about what practitioners do, what problems they have, and how they view the relationship between analysis and organizational innovation.

Objective: The purpose of this study was twofold. First, it investigated what data practitioners consider to be the main challenges of big data that may prevent the creation of organizational innovation. Second, it investigated what strategies these data practitioners recommend to remediate these challenges.

Methodology: A survey using semi-structured interviews was performed to investigate what data practitioners view as the challenges of big data and what strategies they recommend to remediate those challenges. The study population was heterogeneous and consisted of 10 participants that were selected using purposive sampling. The interviews were conducted between February 27, 2020 and March 24, 2020. Thematic analysis was used to analyze the transcripts.

Results: Ninety per cent of the data practitioners experienced working with low-quality, unstructured, and incomplete data as a very time-consuming process. Various challenges related to the organizational aspects of analyzing data emerged, such as a lack of experienced human resources, insufficient knowledge of management about the process and value of big data, a lack of understanding about the role of data scientists, and issues related to communication and collaboration between employees and departments. Seventy per cent of the participants experienced insufficient time to learn new technologies and techniques. In addition, twenty per cent of practitioners experienced challenges related to accessing data, but those challenges were primarily reported by consultants. Twenty per cent argued that organizations do not use a proper data-driven approach. However, none of the practitioners experienced difficulties with data policies, because this had already been taken care of by the legal department. Nevertheless, uncertainties still exist about what data can and cannot be used for analysis. The findings are only partially consistent with the taxonomy. More specifically, the reported challenges of data policies, industry structure, and access to data differ significantly. Furthermore, the challenge of data quality was not addressed in the taxonomy, but it was perceived as a major challenge by practitioners.

Conclusion: The data practitioners only partially agreed with the taxonomy of challenges. The dimensions of access to data, data policies, and industry structure were not considered a challenge to creating organizational innovation. Instead, practitioners emphasized that the dimension of organizational change and talent, and to a lesser extent also the dimension of technology and techniques, involve significant challenges that can severely impact the creation of organizational innovation using big data. In addition, novel and significant challenges such as data quality were identified. Furthermore, for each dimension, the practitioners recommended relevant strategies that may help others to mitigate the challenges of big data analytics and to use big data to create business value.


Table of Contents

Abstract
1 Introduction
1.1 Research Problem
1.2 Background
1.3 Research Gap
1.4 Importance and Relevance of Study
1.5 Research Objective and Question
1.6 Allocation of Responsibilities
2 Literature Review
2.1 Big Data
2.2 Traditional Characteristics of Big Data
2.3 Extended Characteristics of Big Data
2.4 History of Big Data
2.5 Velocity of Generated Data
2.6 Interest and Innovative Potential of Big Data
2.7 Applications of Big Data
2.8 Hadoop
2.9 Hadoop as Big Data Ecosystem
2.10 Challenges of Big Data
2.11 Strategies to Remediate Challenges of Big Data
3 Theoretical Frameworks
3.1 Socio-Technical Approach of Systems Thinking
3.2 Practice Theory
3.3 Diffusion of Innovation Theory
4 Methodology
4.1 Research Strategy and Method
4.2 Study Population
4.3 Data Collection
4.4 Data Analysis
4.5 Reliability and Validity
4.6 Ethical Considerations
5 Results
5.1 Study Population
5.2 Creation of Business Value using Big Data
5.3 Challenges and Remediating Strategies of Big Data
6 Discussion
6.1 Discussion
6.2 Strengths and Limitations
6.3 Recommendations
7 Conclusion
References
Appendices
Appendix A – Interview Guide
Appendix B – Detailed Participant Characteristics


1 Introduction

In this chapter, we present the research problem and the background of this study. Subsequently, we address the research gap, and the importance and relevance of this study, with an initial literature review. The chapter continues with scoping this study by formulating the research objective and research question. Finally, the responsibilities for this study are allocated among its authors.

1.1 Research Problem

The problem that this study addresses is the current absence of empirical work on what data practitioners do, what problems they have, and specifically how they view the relationship between analysis and organizational innovation. We identify a gap between theoretical assumptions concerning the work of data analysis and what practitioners themselves see as the main issues.

1.2 Background

We are living in the era of big data (Choi, 2018). The widespread adoption of the World Wide Web, smartphones, satellite technology, sensor technology, genomic data, and social media platforms has significantly digitalized our society (Galeano and Peña, 2019, Rouhani et al., 2017). Just as importantly, information from these disparate sources can be triangulated. Consequently, humans continuously generate data on a massive scale and often in real-time (Galeano and Peña, 2019).

In addition, following Moore’s Law, CPU performance and storage capacity have increased exponentially, while storage costs have decreased drastically (Tsai et al., 2015, Kiersz, 2019, Galeano and Peña, 2019). Consequently, the confluence of many years of advances in technological innovation, the proliferation of data generation, and advances in machine learning algorithms has enabled the processing and extraction of valuable insights from massive datasets (Dormehl, 2014, Elshawi et al., 2018, Northcott, In Press, Rabhi et al., 2019, Sakr, 2016, Tabesh et al., 2019, Wall and Krummel, 2020).

These user-generated data potentially hold great value (Jeske and Calvard, 2020). For example, data have become an immensely valuable resource for answering some of the world’s most challenging questions in climate change and healthcare (Favaretto et al., 2019).

Coupled with the enormous ongoing interest from corporations, big data are now even considered the most valuable resource, one that can replace oil (Alharthi et al., 2017, The Economist, 2017). Having said that, data only have value if that value is unlocked: data have value only if they affect policy and practice, and this is true at all levels, from government, through a variety of business and commercial interests, down to so-called ‘smart city’ innovation and changes in domestic environments (Rath and Solanki, 2019, Ng et al., 2017, Kaleem et al., 2019).

Unsurprisingly, the innovative potential of analyzing big data has spurred great interest in the commercial sector (Wamba et al., 2015). Therefore, most contemporary software and online services were primarily built to collect vast amounts of data such as telemetry, and are often offered for free to promote their use (Fitzpatrick, 2010, Puri, 2015). These data are utilized to develop machine learning algorithms that learn to understand customers and that can predict their preferences and future behaviors (Chen and Zhang, 2014).

An example is the streaming service Netflix, which collects vast amounts of data to train algorithms that serve potentially relevant content (Jahnke, 2019). In addition, Netflix collects telemetry about the viewing behavior of customers, such as when and why streaming is interrupted, with the purpose, among others, of creating tailored content that has proven quite popular, even addictive, for many customers and that is only available on Netflix (Jahnke, 2019).

Despite several fairly successful applications, only a small percentage of organizations have clearly benefited from their investments in big data analysis (Ross et al., 2013). Most organizations struggle with extracting insights from their data (Zeng and Glaister, 2017). In fact, many challenges need to be overcome before big data can be successfully analyzed to create business value (Tarafdar et al., 2013). Indeed, the management theorist Thomas Davenport estimated that only 0.5 per cent of big data is ever actually used (Davenport, 2014).

In their systematic literature review, Wamba and colleagues reported a comprehensive taxonomy to understand the challenges of big data using five domains (Wamba et al., 2015).

These challenges involve access to data, data policies, industry structure, organizational change and talent, and technology and techniques (Wamba et al., 2015). These challenges were identified based on academic publications.

1.3 Research Gap

However, this taxonomy does not necessarily reflect the challenges perceived by practitioners. In fact, in the literature, a paucity of evidence exists about consensus between academics and practitioners concerning what challenges need to be overcome for big data to affect organizational innovation. Likewise, it cannot be excluded that practitioners perceive other challenges that were not considered by academics. To our knowledge, the strategies that practitioners employ to remediate the challenges of big data have also not yet been studied.

1.4 Importance and Relevance of Study

There is little empirical evidence which describes the possibilities and challenges of big data analytics from the point of view of practitioners in different domains, such as data analysts.

Knowing too little about what the technical and organizational challenges might be could explain why real-life big data projects are still less common than one might expect and why these projects are prone to failure (Davenport, 2014). In addition, Davenport points out that we know very little about the consequences of big data for organizational structures or for customer relationships (Davenport, 2014). We aim to rectify this through an exploratory qualitative interview study. Moreover, in the field of data science, application and academia are strongly intertwined. To strengthen this mutual dependency, it is beneficial to consider academics and practitioners in the same study to assess possible differences and similarities in their perspectives.

1.5 Research Objective and Question

This study aimed to address the observed lacuna in the literature. The purpose of this study was twofold. First, it investigated what data practitioners consider to be the main challenges of big data that may prevent the creation of organizational innovation. Second, it investigated what strategies these data practitioners recommend to remediate these challenges.

Therefore, the following research question was formulated:

What challenges do data practitioners perceive regarding big data to create organizational innovation, and what strategies do they recommend to address these challenges?

1.6 Allocation of Responsibilities

Because this study is jointly conducted, the responsibilities of both researchers need to be described, and are allocated as follows.

As far as is humanly possible, both authors have contributed equally to the production of the manuscript. Where one person contributed more than the other in respect of a specific element (for instance, with regard to the conduct of interviews, where access issues meant some practical trade-offs), results were carefully discussed and agreed upon by both authors, and their deliberations were verified with their supervisor. As far as possible, then, there was an equal contribution to all elements. Not least, this was important when arriving at a degree of inter-rater reliability. In the nature of qualitative and emergent research of this kind, a theoretical framework evolved, and both authors agreed on how that framework was to be described. They further discussed and jointly delivered conclusions regarding ethical issues, limitations of the research, and what the contribution of the research was.


2 Literature Review

This section introduces the concept of big data and describes its evolving characteristics.

Thereafter, the history of big data is outlined. Subsequently, the velocity of generated data, and the interest and innovative potential of big data, are discussed. These innovations are then illustrated using various applications of big data. Because big data analysis requires non-conventional tools, Hadoop is discussed as both a framework and an ecosystem. This chapter concludes with the challenges of big data and the remediating strategies.

2.1 Big Data

We currently live in the era of big data (Choi, 2018). The concept of big data first emerged around 2011 (Surbakti et al., 2020), when various innovations that generate massive amounts of data became widely adopted (Genender-Feltheimer, 2018). However, the term big data itself was already used at an IEEE conference in 1997 (Cox and Ellsworth, 1997).

The term big data originated from organizations that process and analyze enormous amounts of data, such as Facebook, Google, and Yahoo (Garlasu et al., 2013). The innovations that create such data include, among others, cloud computing (Passacantando et al., 2016), e-commerce, the Internet of Things (IoT), search engines, smartphones, social media, and wireless sensor networks (Ding et al., 2016, Takaishi et al., 2014, Choi, 2018, Surbakti et al., 2020, Watson and Wixom, 2007).

Due to the tremendous potential commercial value of big data, its popularity continues to skyrocket (Jeske and Calvard, 2020). Researchers have, however, not yet formed a consensus about one comprehensive definition for big data (Rouhani et al., 2017). Instead, a large variety of different definitions exists in the literature. Several authors (e.g., Mehta and Pandit, 2018, Wamba et al., 2015, Mikalef et al., 2018) have aggregated these definitions into an overview. Nevertheless, on an abstract level, big data may be described as “data that are so voluminous and complex that traditional data-processing applications are inadequate to deal with them” (Genender-Feltheimer, 2018). There are a number of reasons for this, summed up famously as the 3Vs.

A definition for big data that is frequently cited in the literature was published by Gartner (Alharthi et al., 2017). Gartner defines big data as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation” (Gartner, 2020). Therefore, big data is data that is large in size, that is created at high speed, originates from different sources, and is stored using different data types (Campos et al., 2017).


2.2 Traditional Characteristics of Big Data

In essence, Gartner characterizes big data using three dimensions, namely as data that is high in volume, velocity, and variety (Ghasemaghaei, In Press). These dimensions are commonly referred to as the 3Vs of big data (Wamba et al., 2015, McAfee and Brynjolfsson, 2012, Davis, 2014, Sun et al., 2015, Kwon and Sim, 2013).

The dimension ‘volume’ refers to data that consumes a large amount of disk space, such as terabytes or petabytes, or a dataset that contains a massive number of records and variables (Gandomi and Haider, 2015, Akter and Wamba, 2016, George et al., 2016). The number of observations can easily entail billions or even trillions of records (Bansal and Kagemann, 2015). Note however, as will be further discussed in paragraph 2.4, that the definition of ‘voluminous’ is not static but changes drastically with technological advancements over time.

The dimension ‘velocity’ refers to the high speed at which data are generated, processed, and analyzed, and it encapsulates the nature of big data as real-time or near-real-time data streams (Tabesh et al., 2019, Chen and Zhang, 2014, Ghasemaghaei, 2018, Ghasemaghaei and Calic, 2019b, Rabhi et al., 2019, Kruse et al., 2016). The velocity of newly created data is further discussed in paragraph 2.5.

The dimension ‘variety’ refers to the heterogeneity of the data types, data formats, and data sources (Chen and Zhang, 2014, Ghasemaghaei, 2018, Lam et al., 2017, Russom, 2011, Davenport et al., 2012). For example, data can originate from audio, images, relational databases, sensors, spreadsheets, text documents, or videos (Tabesh et al., 2019, Ghasemaghaei, In Press, Favaretto et al., 2019, SAS Institute, 2020). These data sources can be internal and external (e.g., open data and purchased data) (Grover et al., 2018). In addition, big data can have a structured, semi-structured, unstructured or multi-structured format (Alharthi et al., 2017, Rabhi et al., 2019, Iglesias et al., 2015), but it often has no predefined data structure (Tabesh et al., 2019).

Overall, although many definitions of big data proliferate in the literature, various authors discussed the commonality that all definitions recognize at least the dimensions volume, velocity, and variety to describe the essence of big data (Ghasemaghaei et al., 2018, Lam et al., 2017, Chen et al., 2015, Gupta et al., 2018, Ward and Barker, 2013, Iglesias et al., 2015, Chen et al., 2012, Kwon et al., 2014).

2.3 Extended Characteristics of Big Data

However, in the literature, there are also authors that complement the characterization of big data using 3Vs by additionally using the dimensions value (4Vs) (Dijcks, 2013, Gogia, 2012), veracity (5Vs) (White, 2012, Lakshen et al., 2016, Ayed et al., 2015, Chen et al., 2014, Elshawi et al., 2018), variability (6Vs) (Gandomi and Haider, 2015), and visualization (7Vs) (Jukic et al., 2015).

The dimension ‘value’ refers to the usefulness of the data, or the extent to which economic benefits can be extracted from the data (Wamba et al., 2015, Lakshen et al., 2016). Big data provides value when the analysis of these data results in the extraction of hidden patterns and insights that can be used to identify trends and create knowledge models (Elshawi et al., 2018, Yaqoob et al., 2016). The value of data is also assumed to be positively correlated with the volume of data. Therefore, the analysis of voluminous data can yield the most economic value (Gandomi and Haider, 2015).

The dimension ‘veracity’ refers to the extent to which data are accurate, authentic, correct, and trustworthy (Lakshen et al., 2016, Yaqoob et al., 2016, Demchenko et al., 2013). In addition, it refers to the capability to identify and eliminate ambiguous, biased, and inaccurate data (Tabesh et al., 2019, Beulke, 2011). Although some data sources are inherently unreliable, such as sentiments in written reviews and opinions that customers publish to social media, such data can still be used to extract economically relevant and valuable insights (Gandomi and Haider, 2015, Tabesh et al., 2019).

The dimension ‘variability’ refers to constant changes in the data (Owais and Hussein, 2016) or to the fluctuating rates at which data flows (Gandomi and Haider, 2015). Furthermore, variability can additionally refer to different interpretations that are yielded from the same data (Seddon and Currie, 2017, Mikalef et al., 2018).

The dimension ‘visualization’ refers to the need for illustrative and well-crafted visualizations that are created using the big data sets, and which clearly tell a story to the reader (Jukic et al., 2015). The visualization of big data is a popular field that has received a great amount of attention in the literature (Jukic et al., 2015).

The identification of extended dimensions may explain why no consensus has yet been formed to define big data uniformly. In fact, the original 3Vs of big data have even been extended to 42 dimensions (Shafer, 2017). The selection of a subset of these dimensions, which may be performed to meet special requirements, obviously results in a different definition for the term big data (Chen and Zhang, 2014).

2.4 History of Big Data

The definition of voluminous data that was discussed in paragraph 2.2 should be regarded in close relation to time (cf. Schaller, 1997). For example, large datasets that existed 25 years ago are nowadays considered small. By a similar logic, contemporary enormous datasets that exceed multiple terabytes in size may be considered small 25 years into the future.

This principle, termed Moore’s Law (Schaller, 1997), is fundamental to the emergence of big data. Moore’s Law estimates that CPU performance and storage capacity double roughly every 18 months (Tsai et al., 2015). The development of computer hardware thus occurs exponentially over time (Kiersz, 2019). High processing power and low storage costs are two fundamental prerequisites that enable the collection and analysis of big data (Galeano and Peña, 2019).
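To illustrate the scale this implies, a rough, back-of-the-envelope calculation of our own (assuming a constant 18-month doubling period, as cited above) gives the growth factor G after t years:

\[ G(t) = 2^{t/1.5}, \qquad G(25) = 2^{25/1.5} \approx 2^{16.7} \approx 1.1 \times 10^{5}. \]

In other words, hardware capacity grows by roughly five orders of magnitude over 25 years, which is consistent with the observation above that datasets considered large 25 years ago appear small by today’s standards.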

Furthermore, the emergence of big data can be traced back to more than 50 years of developments in data management technology, and Surbakti et al. (2020) provide a comprehensive discussion of these historical advances. Although these technologies for data management primarily facilitated the generation and subsequent storage of a large number of records in relational databases (Galeano and Peña, 2019, Wang et al., 2018), it required two additional revolutionary innovations before humans could become generators and consumers of big data (Galeano and Peña, 2019).

The first innovation was the World Wide Web, which was developed at CERN in Switzerland by Tim Berners-Lee (Galeano and Peña, 2019). The World Wide Web facilitated fast communication via the Internet and provided the technical infrastructure for the subsequent development of social media platforms and Web 2.0, which enabled the creation of user-generated content (Galeano and Peña, 2019). Over the years, the Internet has seen an exponential growth in users worldwide, which resulted in a fast adoption of these social media platforms, often collectively referred to as Web 2.0 (Kaplan and Haenlein, 2010, Moro et al., 2016).

The second innovation was smartphones, which provided a novel communication channel to transmit and receive information wirelessly (Galeano and Peña, 2019). In addition, the computing power of smartphones advanced to the point where they could run third-party apps that utilize the World Wide Web, are constantly connected to the Internet, and stimulate the vast generation of new data, including telemetry, social data, and weblogs (Galeano and Peña, 2019, Bryant et al., 2008).

Both innovations made a significant contribution to the creation of our contemporary digitalized society (Galeano and Peña, 2019, Rouhani et al., 2017). The World Wide Web and smartphones have profoundly altered our working environment and changed how we communicate with family and friends (Galeano and Peña, 2019, Jeske and Calvard, 2020).

Few of us are aware, let alone understand, that we are constantly generating a massive amount of new data (Genender-Feltheimer, 2018, Zwitter, 2014).

In the subsequent years, from 2001 to 2008, many developments for big data analysis evolved, including Hadoop, which is discussed in paragraph 2.8 (Wang et al., 2018).

Consequently, on a massive scale, humans have become generators of data (Galeano and Peña, 2019). These data, which among others include social data, can have great economic value (Galeano and Peña, 2019). Since 2009, big data analytics started to revolutionize decision-making in organizations (Wang et al., 2018). For this reason, virtually all contemporary software and hardware emphasize the creation of data, and even unexpected devices such as alarms, lighting, refrigerators, televisions, washing machines, and window blinds are increasingly developed with big data as a key criterion for their design (Yaqoob et al., 2016, Hashem et al., 2016, Tsai et al., 2015). As a result, data now exist everywhere (see also paragraph 2.5) (Galeano and Peña, 2019).

Overall, the confluence of many years of advances in technological innovation, the proliferation of data generation, and advances in machine learning algorithms to analyze these data has led to the phenomenon of big data, and it has enabled organizations, in principle, to transform large datasets into information, extract valuable insights, and engage in data-driven decision making (Rabhi et al., 2019, Tabesh et al., 2019, Wall and Krummel, 2020, Northcott, In Press, Dormehl, 2014, Sakr, 2016, Elshawi et al., 2018).

2.5 Velocity of Generated Data

Smolan and Erwitt published historical statistics concerning the time required to produce 5 billion gigabytes of data (Smolan and Erwitt, 2012). From the beginning of recorded history until the year 2003, a total of 5 billion gigabytes was generated (Smolan and Erwitt, 2012). In 2011, the same amount of data was already produced every two days, while it took only 10 minutes to generate these data in 2013, and just 10 seconds in 2015 (Smolan and Erwitt, 2012).

Furthermore, in 2018, Marr published an extensive account of the velocity at which new data were created (Marr, 2018). It was estimated that in 2012, worldwide, a total of 2.5 quintillion bytes were generated daily, and that 90 per cent of these data were unstructured (Marr, 2018, Dobre and Xhafa, 2014, Kruse et al., 2016). Furthermore, the data created in the preceding two years were estimated to account for an astonishing 90 per cent of all data that had ever existed in the world (Marr, 2018, Henke et al., 2016, Gobble, 2015).

To place this into perspective, Marr published an extensive overview of statistics to illustrate the velocity at which data were generated in various industries (Marr, 2018). A subset of these statistics is presented in Table 1.

Table 1. Velocity of data generated per industry

Internet: All search engines combined processed 5 billion search queries per day. Google processed 3.5 billion search queries daily (i.e. 40,000 queries per second).

Social Media: Users shared 527,760 photos on Snapchat per minute. Twitter processed 456,000 Tweets every minute. Per minute, 4,146,600 videos were viewed on YouTube. Instagram processed 46,740 new photos every minute. Every minute, Facebook processed 510,000 new comments.

Communication: 16 million text messages were sent every minute. 99,000 swipes on Tinder were made per minute. Per minute, 156 million legitimate emails were sent. 154,200 calls were made on Skype every minute.

Services: The Weather Channel processed more than 18 million requests per minute. Uber processed almost 50,000 trips every minute.

Adapted and modified from Marr (2018).

Note that, as a result of, among other factors, the widespread adoption of the Internet of Things and other digital devices that generate real-time or near-real-time data, the reported velocity is expected to be much greater at the time of writing this manuscript in 2020 (Marr, 2018, Gandomi and Haider, 2015, Genender-Feltheimer, 2018).

In fact, it was estimated that all existing data in the digital universe would total 40 zettabytes in 2020 (Santos, 2016, Lam et al., 2017, Sivarajah et al., 2017) and that it would further grow to 175 zettabytes in 2025 (Reinsel et al., 2018). To illustrate, 40 zettabytes is approximately 40 trillion gigabytes. Indeed, the total volume of data that exists worldwide today is measured in zettabytes (Alharthi et al., 2017), and in the not-too-distant future, data may well be measured in terms of yottabytes.
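As a check on this conversion (our own arithmetic, using the decimal SI prefixes 1 ZB = 10^21 bytes and 1 GB = 10^9 bytes):

\[ 40\ \text{ZB} = 40 \times 10^{21}\ \text{B} = 4 \times 10^{13}\ \text{GB} = 40\ \text{trillion gigabytes}. \]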

2.6 Interest and Innovative Potential of Big Data

Data has become the most valuable resource of the 21st century, replacing oil (The Economist, 2017, Alharthi et al., 2017). By revolutionizing productivity and competition among organizations and in public administration, big data has a significant potential to contribute to economic growth (Manyika et al., 2011, Chen and Zhang, 2014, Wamba et al., 2015). In fact, the International Data Corporation (IDC) estimated that the worldwide revenue of the market for big data analytics would exceed 203 billion U.S. dollars in 2020 (Press, 2017).

Worldwide, big data continues to generate enormous attention among academics, industry, and practitioners (Wamba et al., 2015, Dubey et al., 2019, Aydiner et al., 2019, Delen and Zolbanin, 2018, Malomo and Sena, 2017, Matthias et al., 2017, Srinivasan and Swink, 2018).

For example, academics are developing new algorithms, processes, and research methods to extract more insights from data (Zomaya and Sakr, 2017, Elshawi et al., 2018, Bolón-Canedo et al., 2015). Data holds great value and the ability to collect and process big data has resulted in a new gold rush for data (Tabesh et al., 2019, Wamba et al., 2015).

Organizations can indeed benefit from analyzing big data because it can provide a new source for knowledge, innovation, and productivity (Chen and Zhang, 2014, Manyika et al., 2011, Acharya et al., 2018, Ghasemaghaei, 2019a, Maroufkhani et al., 2019, Ghasemaghaei et al., 2017, Ashrafi and Ravasan, 2018, LaValle et al., 2011). In every industry, senior managers question whether they utilize the full potential that their data can yield (LaValle et al., 2011). Generating knowledge from data, and accurately predicting phenomena, have the potential to become the primary force for future competition (Manyika et al., 2011, Northcott, In Press).

As a result, many organizations in every major industry continue to invest much time and financial resources, initiate structural changes to their organization, and adopt tools for big data analytics, with the purpose of extracting valuable insights from data (Tabesh et al., 2019, Mayhew et al., 2016, Mazzei and Noble, 2017). Such enthusiasm is considered legitimate because the size of data is expected to grow exponentially over time, and the value of these data then also increases (Tabesh et al., 2019). Organizations in many industries are now collecting vast amounts of data (Ghasemaghaei and Calic, 2019a, Larson and Chang, 2016, Mehta and Pandit, 2018, Kruse et al., 2016, Feldman et al., 2012, Giacalone et al., 2018, Tambe, 2014, LaValle et al., 2011). Exploiting big data can provide organizations with a significant competitive advantage (Maroufkhani et al., 2019, Ghasemaghaei et al., 2017, Brands, 2014, Alharthi et al., 2017).

For example, big data can provide value to retailers and e-commerce and increase their return on investment by up to 20 per cent (Cardona, 2013), by analyzing the buying patterns and geospatial location of customers (Ghasemaghaei, In Press). Big data can also be used to better forecast demand and supply, and thereby enable a more efficient management of resources in the supply chain (Barbosa et al., 2018). Moreover, big data can be used to develop artificial intelligence that increases the quality of decision-making, and it may even automate decision-making, thereby having a positive effect on the outcomes of an organization (Makridakis, 2017, Ghasemaghaei, 2019b, Neilson et al., 2019). Similarly, process mining may be used to optimize business processes, or to improve service delivery and customer service (Hartmann et al., 2016). More generally, the collection of more data from various sources can improve the reliability of data-driven decision making because the quality of the underlying data increases, namely by reducing bias and errors that are more likely to occur in smaller datasets (Ghasemaghaei and Calic, 2019b).

The innovative potential of big data was also confirmed by Wamba and colleagues, who conducted a systematic literature review to aggregate the created values that were reported in the literature (Wamba et al., 2015). Big data can create value by increasing transparency in decision-making, facilitating the identification of needs using experimentation, segmenting the population, supporting and even automating human decision-making, and providing insights to develop innovative services, products, and business models (Wamba et al., 2015).

To conclude, the worldwide interest and innovative potential of big data resulted in the emergence of a new paradigm, named data-intensive science, that utilizes big data to address and solve big data problems (Chen and Zhang, 2014, Bell et al., 2009).

2.7 Applications of Big Data

As already stated, big data was found to have abundant applications in many industries (Foster et al., 2017, Japec et al., 2015, Sagiroglu and Sinanc, 2013).

Among others, big data can be applied in astronomy (Chen and Zhang, 2014), banking (Srivastava and Gopalkrishnan, 2015), education (Soares, 2012), ecology (Kelling et al., 2009, Monkman et al., 2018), government (Sobek et al., 2011, Chen et al., 2012, Mervis, 2012), healthcare (Brinkmann et al., 2009, Field et al., 2009, Callebaut, 2012, Chen et al., 2012, Cole et al., 2012, Kruse et al., 2016), hospitality (Padma and Ahn, 2020), manufacturing (Brown et al., 2011, Dubey et al., 2019), online newspapers (Iglesias et al., 2015), public health (Yang et al., 2013, Dai and Hao, 2017), retail (Brown et al., 2011, McAfee and Brynjolfsson, 2012, Lee et al., 2013), services (Acker et al., 2011, Demirkan and Delen, 2013, Johnson, 2012, Kauffman et al., 2012, Kolker et al., 2012, Kubick, 2012, McAfee and Brynjolfsson, 2012), technology (Bradbury, 2011, Reddi et al., 2011, Allen et al., 2012, Chen et al., 2012, Highfield, 2012, Huwe, 2012, Smith et al., 2012), and transportation (Neilson et al., 2019).

Two industries that received enormous attention in the literature are e-commerce and healthcare, which will be described in more detail.

First, due to fierce competition, the e-commerce industry is one of the largest adopters of big data analytics (Akter and Wamba, 2016). Akter and Wamba found in their systematic literature review that the applications of big data in e-commerce involve the identification of customer needs, decision-making and performance improvement, innovations in products and markets, improving transparency and infrastructure, and market segmentation (Akter and Wamba, 2016). Four types of big data are used for these applications, namely click-stream data, transactional data, video data, and voice data (Akter and Wamba, 2016). Furthermore, the emergence of social media platforms has provided the e-commerce industry with a new channel to influence consumers using personalized advertisements (Moro et al., 2016, Lariscy et al., 2009).

Second, big data analytics has the potential to revolutionize healthcare (Alonso et al., 2017, Wall and Krummel, 2020). For example, big data may facilitate a better understanding of chronic conditions and age-related diseases such as dementia, and it may identify new treatment alternatives to address these diseases (Kruse et al., 2016). In addition, these analyses may help to identify and remove waste from the process, thereby reducing the cost of healthcare, increasing the efficiency in healthcare, and improving outcomes for patients (Hillestad et al., 2005, Mehta and Pandit, 2018). The enormous potential of big data was also confirmed by Kruse and colleagues, who found in their systematic literature review that big data can provide 11 opportunities to healthcare (Kruse et al., 2016). The reported opportunities are better decision-making; better accessibility, structure, and quality of data; detection of fraud; detection of threats to health; early detection of diseases; globalization; improved quality of care; management of population health; patient-centric healthcare; personalized medicine; and a reduction of costs (Kruse et al., 2016). Andreu-Perez et al. provide an overview of the way in which both genetic and non-genetic data might be used (Andreu-Perez et al., 2015).

Furthermore, in their systematic literature review, Mehta and Pandit found that big data can be applied in many areas in healthcare (Mehta and Pandit, 2018). These areas are cardiovascular disease, diabetes, drug discovery and clinical research, elderly care, gynecology, genomics, mental health, nephrology, oncology, ophthalmology, personalized healthcare, precision medicine, and urology (Mehta and Pandit, 2018).


2.8 Hadoop

Concerning the definition of big data discussed in paragraph 2.1, in addition to data, Alharthi and colleagues stated that big data also refers to the “tools and practices for analyzing, processing, and managing these massive, complex, and rapidly evolving data sets” (Alharthi et al., 2017). One of the influential tools required for processing large datasets is Hadoop, which supports distributed processing (Tambe, 2014, Yaqoob et al., 2016).

Apache Hadoop (Apache Hadoop, 2019) is an open-source framework for the analysis of big data that utilizes distributed processing to distribute the dataset and its processing across multiple computers in a cluster (Merelli et al., 2014, Cunha et al., 2015). The origin of Hadoop can be traced back to Google (Tambe, 2014), but it was later adopted by the Apache Foundation. Written in Java, Hadoop scales out easily by adding more computers to the cluster (also called horizontal scaling), and it guarantees high availability and fault tolerance at the application layer (Apache Hadoop, 2019, Tambe, 2014). Hadoop has two key components, namely the Hadoop Distributed File System (HDFS) and MapReduce (Alonso et al., 2017).

First, HDFS is a filesystem that combines the local filesystems of the individual computers in the cluster into one large filesystem (White, 2015). A key characteristic of HDFS is that it stores metadata about the filesystem separately from the actual application data (Huang et al., 2015). As a result, HDFS enables the storage of enormous individual files across multiple computers, even if the file size exceeds the storage capacity of an individual computer (Huang et al., 2015). Therefore, HDFS is a cluster filesystem for reliably storing very large datasets that are distributed across many computers in a cluster (Alonso et al., 2017).

Second, MapReduce was developed by Google as a big data solution to swiftly process and index billions of webpages that it crawled (Alonso et al., 2017, O’Driscoll et al., 2013).

MapReduce utilizes a programming model that divides the processing of data into smaller tasks (Saravana et al., 2015). As a result, MapReduce utilizes a series of mapping and reducing tasks to process complex and unstructured data, while it automatically shuffles the data in between these tasks (Saravana et al., 2015).
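To make the map/shuffle/reduce flow concrete, the sketch below is a minimal, single-machine word count in plain Python; it is our own conceptual illustration of the programming model, not Hadoop's actual Java API:

from collections import defaultdict

# Conceptual single-machine illustration of the MapReduce model; in Hadoop,
# the map and reduce tasks run distributed across the cluster and the
# shuffle happens automatically between them.

def map_phase(document):
    # Map: emit a (key, value) pair for every word in one input record.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values emitted for one key.
    return key, sum(values)

documents = ["big data needs big clusters", "value hides in big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)  # {'big': 3, 'data': 2, 'needs': 1, 'clusters': 1, 'value': 1, 'hides': 1, 'in': 1}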

2.9 Hadoop as Big Data Ecosystem

Over the years, Hadoop has evolved into a big data ecosystem (Apache Hadoop, 2019). As a result, other projects were developed that integrate with Hadoop to solve certain problems (Apache Hadoop, 2019). Notable projects, among others, are Pig, Hive, Kafka, Storm, and Spark.

Because dividing a processing job in MapReduce into a series of mapping and reducing tasks can quickly become complex, Apache Pig (Apache Pig, 2018) was developed as an alternative. Pig utilizes the high-level language Pig Latin to write complicated programs for the analysis of big data (Apache Pig, 2018). The platform then automatically compiles this program into a series of low-level MapReduce tasks, which are then executed in Hadoop (Chennamsetty et al., 2015).

Apache Hive (Apache Hive, 2014) was developed to analyze enormous datasets that reside in distributed storage such as HDFS (Merelli et al., 2014). As a software for data warehouses, Hive enables the querying of these data using a language that is comparable to SQL (Thusoo et al., 2010). Subsequently, Hive compiles these SQL-like queries into low-level MapReduce tasks and submits these to be executed in Hadoop (Grover et al., 2015).

Initially developed for log processing, Apache Kafka (Apache Kafka, 2017) is a scalable messaging system for reading and writing large streams of data (Kreps et al., 2011). These real-time messages can be routed to applications for stream processing (Apache Kafka, 2017). The key advantages of Kafka include a distributed cluster, fault-tolerance, high efficiency, high throughput, replication, scalability, and stability (Yaqoob et al., 2016, Apache Kafka, 2017).
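As a rough illustration of how such a messaging system is used, the sketch below writes and then reads a small stream with the third-party kafka-python client; the broker address and topic name are hypothetical placeholders, and Kafka clients exist for many other languages as well:

from kafka import KafkaProducer, KafkaConsumer

# Write a small stream of events to a Kafka topic (broker address and
# topic name are placeholders for illustration only).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for event in [b'{"user": 1, "page": "/home"}', b'{"user": 2, "page": "/cart"}']:
    producer.send("clickstream", value=event)
producer.flush()

# Read the same stream back; in practice, a stream processor such as Storm
# or Spark would typically consume these messages instead of a simple loop.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)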

A computation system that can utilize Kafka is Apache Storm (Apache Storm, 2019). While Hadoop was developed for batch processing, Storm is a distributed system for the real-time processing of streaming data (Yaqoob et al., 2016, Apache Storm, 2019). Due to better parallelism and in-memory computing, Storm is incredibly fast as it can process more than a million individual tasks per second per node (Apache Storm, 2019, Patel and Sharma, 2014).

Overall, it was estimated that Storm runs 100 times faster than Hadoop (Patel and Sharma, 2014). The primary advantages of Storm are compatibility with every programming language, ease of use, fault-tolerance, guaranteed data processing, and scalability (Yaqoob et al., 2016). Therefore, Storm is a perfect system for, among others, real-time analytics and online machine learning (Apache Storm, 2019).

Apache Spark (Apache Spark, 2018) is a highly scalable and high-performance platform that specializes in in-memory data analytics (Alonso et al., 2017). It therefore supports many algorithms for machine learning and natural language processing (Patel and Sharma, 2014).

Spark can be used for both batch and stream processing, and due to its in-memory processing, Spark was found to process the logistic regression algorithm 110 times faster than Hadoop (Apache Spark, 2018).
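To give a flavor of what working with Spark looks like, the following is a minimal word-count sketch using PySpark (Spark's Python API); it is our own illustration, and the HDFS input path is a hypothetical placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lower, split

# Start (or reuse) a Spark session; on a cluster this script would
# typically be submitted via spark-submit rather than run standalone.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file (hypothetical HDFS path) into a DataFrame with one
# column named "value" that contains one line of text per row.
lines = spark.read.text("hdfs:///data/sample_reviews.txt")

# Split each line into words, count occurrences, and show the most frequent.
words = lines.select(explode(split(lower(col("value")), r"\s+")).alias("word"))
word_counts = words.where(col("word") != "").groupBy("word").count()
word_counts.orderBy(col("count").desc()).show(10)

spark.stop()

The same transformations can also be expressed against Spark's streaming APIs, which is what enables the combined batch and stream processing mentioned above.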

2.10 Challenges of Big Data

Although the analysis of big data can yield significant benefits, many organizations struggle with extracting insights from these data and subsequently using these insights to create business value (Zeng and Glaister, 2017). In fact, 85 per cent of big data projects have failed (Asay, 2017), and the outcome of 67 per cent of big data projects was below average (Baldwin, 2015). Indeed, the investments in big data have only paid off for a small percentage of organizations (Ross et al., 2013).

The analysis of big data involves many challenges that should be dealt with appropriately in order to successfully create business value (Tarafdar et al., 2013).

In the era of big data, privacy and the security of data have become among the most important challenges (Cai and He, 2019), and they have received a significant amount of attention in academia and industry (Sun et al., 2020b). As data have become a valuable asset and an important source of insights, increasingly more organizations are motivated to collect, process, store, and reuse these data (Mayer-Schönberger and Cukier, 2014). Although not all data involve personal information, individuals may feel that their personal data and privacy are violated when organizations process data related to, for example, their financial transactions, health records, and social media updates (Mayer-Schönberger and Cukier, 2014). The General Data Protection Regulation (GDPR) was introduced in the European Union not only to restrict the processing and storage of personal data, but also to provide clarity and certainty about how and under which conditions these data can still be utilized (Greene et al., 2019).

Based on the work of Manyika and colleagues (Manyika et al., 2011), Wamba and colleagues conducted a systematic literature review and found that the challenges of big data could be classified into five comprehensive dimensions (Wamba et al., 2015). These challenges are, in decreasing order of the reported frequency, technology and techniques, access to data, organizational change and talent, industry structure, and data policies (Wamba et al., 2015). An overview of each dimension and a description that lists the related attributes is presented in Table 2.

Table 2. Five dimensions of challenges for big data in decreasing order of reported frequency

Technology and techniques: Technologies encompass storage, computing, and analytical software, while techniques are more related to new types of analyses of big data. Both are needed to help individuals and organizations to integrate, analyze, visualize, and consume the growing torrent of big data.

Access to data: The access and integration of information from various data sources is the key for the realization of big data-enabled firm transformative opportunities.

Organizational change and talent: Currently, organizational leaders often lack the understanding of the value of big data and how to unlock this value. In addition, many organizations do not have the talent in place to derive insights from big data. Furthermore, many organizations today do not structure workflows and incentives in ways that optimize the use of big data to make better decisions and take more informed action.

Industry structure: The full business capture and realization from big data will be a function of the industry structure (e.g., industry with a relative lack of competitive intensity and performance transparency, high competition vs. low competition, high performance transparency vs. low performance transparency, and high concentrate profit pools vs. low concentrate profit pools).

Data policies: Privacy (e.g., personal data such as health and financial records), security, intellectual property, and liability.

Replicated and slightly modified from Wamba and colleagues (Wamba et al., 2015).

Other authors (e.g., Neilson et al., 2019, Chen and Zhang, 2014, Surbakti et al., 2020, Kruse et al., 2016, Alharthi et al., 2017, Jeske and Calvard, 2020) reported comparable challenges of big data, but Wamba and colleagues arguably provide the most comprehensive review (Wamba et al., 2015).

2.11 Strategies to Remediate Challenges of Big Data

As far as we know, no comprehensive overview has yet been published of recommended strategies to remediate the challenges reported by Wamba and colleagues (Wamba et al., 2015), although Alharthi and colleagues report on some potential strategies to address several barriers of big data (Alharthi et al., 2017). Their recommended solutions address the technological, human, and organizational barriers of big data analytics.

For technological barriers, recommendations were made to address the infrastructural readiness and the complexity of data. First, cheaper commodity hardware, instead of expensive specialized hardware, they suggest, should be utilized to significantly increase the storage capacity and processing power (Alharthi et al., 2017). Likewise, the use of Hadoop is recommended to solve the challenges of complex data, fast data growth, and the variety of data formats (Alharthi et al., 2017).

Human barriers involve a lack of skills and privacy concerns. A remediating strategy to address insufficient skills among workers is to utilize workforce development programs that enable workers to participate in specialized data-related educational programs (Alharthi et al., 2017). In addition, it is recommended that partnerships are established between universities and industry (Alharthi et al., 2017). Furthermore, privacy-related challenges can be addressed by consulting a legal expert in the field of privacy legislation and by incorporating best practices on how practitioners should work with personal information and other sensitive data (Alharthi et al., 2017).

Existing organizational culture is considered an organizational barrier (Alharthi et al., 2017). Change management, which requires a well-defined organizational vision and management support, is needed to motivate the workers in the organization to adopt and use big data (Alharthi et al., 2017). However, interventions that address organizational culture are often focused on the long term, because organizational culture is particularly difficult to change (Handy, 1993, Schein and Schein, 2017).


3 Theoretical Frameworks

This chapter introduces the three theoretical frameworks that guided this study and influenced our thinking. We should be clear that these various theories were not deployed as a means to organize or analyze our results, but to provide us with some perspective on the issues we might face. They serve, to quote the sociologist Herbert Blumer, to ‘sensitize’ or ‘illuminate’ our thinking. “Sensitizing concepts are constructs that are derived from the research participants’ perspective, using their language and expressions, and that sensitize the researcher to possible lines of inquiry” (Given, 2008). This is appropriate in an exploratory study of this kind.

First, the socio-technical approach of systems thinking is discussed. Second, Practice Theory is outlined to understand the contrasting views of academics and data practitioners with respect to the challenges and remediating strategies of analyzing big data. Third, the Diffusion of Innovation Theory is discussed because it enables an understanding of the different rates at which technologies and innovations are adopted.

3.1 Socio-Technical Approach of Systems Thinking

This study is framed within a socio-technical approach that recognizes the interaction between individuals and technologies. Socio-technical thinking is used in terms of socio-technical systems and the related values (Alter, 2019). The premise of socio-technical thinking can be formulated as: (1) the mutual constitution of people and technologies; (2) the contextual inclusiveness of this mutuality; and (3) the significance of collective action (Mumford, 2006).

The socio-technical approach focuses on the interdependence of linked relationships among the features of any technological object or system and the social norms, rules of use, and involvement of a comprehensive range of human stakeholders (Mumford, 2006). The key principle in socio-technical thinking is to consider both the technical and social aspects of any situation, as these aspects highly influence each other and because technical aspects may not be compatible with the social aspects of human work (Mumford, 2006).

The mutual constitution of individuals and technologies suggests that both humans and technologies may have some ability to act in each situation (Mumford, 2006). These actions are not independent of the surrounding activities. The fundamental background of the mutual constitution is co-evolution among that which is considered technological and that which is social (Mumford, 2006). By attending to material triggers, actions of social groups, pressures from contextual impacts, and the complex processes of development, adoption, adaptation, and use of technologies in people’s social world, it focuses on the interdependency among technologies and humans in organizations. The socio-technical perspective, and the principle of mutual constitution, allows us to recognize and assess the complex and dynamic interactions among technological capacities, social histories, context, people’s choices, and action (Mumford, 2006).

The socio-technical perspective is premised on the embedding of information technology and information systems into the world of situated action, which is tightly tied to the characteristics of where the actions happen (Mumford, 2006). It focuses on situating work and seeks to examine all contextual factors, even those with limited influence. The contextual elements are not reduced, removed, or taken as fixed, but are defined dynamically (Mumford, 2006). Mumford’s work has been extremely influential in a number of information systems contexts, including change management (Clegg and Walsh, 2004) and participatory design (Mumford and Henshall, 1979).

The last element of the socio-technical premise is collective action, in the pursuit of goals by interested parties. The fundamental premise is that several parties will pursue one or more shared goals, while focusing on the design, development, deployment, and the uses of information technology or information systems that are both shaped by and that form the nature of collective action (Mumford, 2006). The common interests and various goals are intertwined with both the contextual and the technological elements (Mumford, 2006).

An important principle within socio-technical systems thinking is the principle of controlling change at the source (Søndergaard et al., 2007). To deal with change, individuals need to be given the ability and the knowledge to react to changes at the source. This involves facilitating employees and giving them control over their work environment to ensure a quick and flexible response to changes and variations. Organizations should therefore ensure that employees have the necessary resources (Søndergaard et al., 2007).

3.2 Practice Theory

In this study, we are interested in contrasting the views of academics and data practitioners with respect to the challenges and remediating strategies of analyzing big data. Practice Theory may be relevant to understand these contrasting views.

Practice Theory is the name for a family of theories that are influential in the field of research in management, organizations, and technology. Examples of researchers that are associated with Practice Theory are Pierre Bourdieu (Bourdieu, 1977, Bourdieu, 1990), Anthony Giddens (Giddens, 1979, Giddens, 1986), and Schatzki (2001). Although Practice Theory is a broad intellectual field and the theoretical principles work differently in the theories from the various theorists, the main principle of consequentialism is prevalent throughout Practice Theory (Feldman and Orlikowski, 2011).

Practice Theory argues that everyday actions are consequential in creating the structural outlines of social life, and because the act of engaging in the actions is consequential for the development of the activity, any activity becomes a practice (Feldman and Orlikowski, 2011). Practice Theory focuses on human activity where individual behavior is always embedded within a web of social practices (Vaara and Whittington, 2012). The idea is to expose the taken-for-granted practices that shape social life, by applying a critical lens in order to reveal the unrecognized (Vaara and Whittington, 2012).

Practice Theory explains the creation of the socio-material world through the macro dynamics in organizations in everyday life, and it tries to understand how actions produce outcomes (Feldman and Orlikowski, 2011). When focusing on practice ontology, we realize that the practices produce organizational reality (Feldman and Orlikowski, 2011). The entailments of taking on a practice lens in the studies can allow us to see that theorizing practice is in itself a practice, namely one that produces different kinds of consequences in the world (Feldman and Orlikowski, 2011).

Orlikowski (2007) suggests a ‘practice lens’ and theorizes the relationship between everyday practices and technologies in use. The view of technology in organizations suggests that through their regularized engagement with a technology or its features in their constant practices, users regularly enact technology structures (Feldman and Orlikowski, 2011).

Technologies in practice are (re)constituted in people’s ongoing interactions with the technologies at hand. Therefore, it is not technologies per se, nor how they may be used in general, that matter; rather, it is the specific technologies in practice that are periodically produced in everyday activities that are consequential for the shaping of organizational results (Feldman and Orlikowski, 2011). The following citation illustrates this: “When viewing technology use through a practice lens, the specific outcomes of stability or change are seen as consequential only in the context of the dynamic relations and performances through which such (provisional) stability and change are achieved in particular instances of practice” (Feldman and Orlikowski, 2011).

Practice Theory has been applied in an abundance of studies, and two examples of these applications that are most relevant for the present study will be discussed. Scholars have made substantial use of Practice Theory to investigate the phenomena of strategy formulation and knowledge in practice.

Example 1: Strategy – For the formulation of strategies, Practice Theory was used to comprehend the relational and enacted nature of strategizing (Feldman and Orlikowski, 2011). Strategy as a practice focuses on what actors do, as opposed to something that organizations have. This is an understanding of “strategy in the making” as a dynamic accomplishment rather than a static outcome (Feldman and Orlikowski, 2011).

Example 2: Knowledge – The practice theorists focus on insights into human knowledgeability and view knowledge as a consequential activity based in everyday practice. Knowledgeability is defined as the ability to continue to gain knowledge within the routines of social life that are constructed within practice rather than passively registered (Feldman and Orlikowski, 2011).

For the purposes of the present study, Practice Theory contrasts real-world activities with the theoretical orientations to structures and knowledge that are commonly found in organization science and in information systems. It enables a focus on certain kinds of factors that influence the way in which practices are constructed, and it treats these factors as matters for practitioners themselves rather than for theorists. These factors include, according to Reckwitz (2002), the importance of routines and, according to Schmidt (2018), a normative element. Put simply, we expect the interviews to show that people have ordinary routines through which they conduct their business and that these routines are recognized as normal and appropriate. In the present study, we aim to identify the ordinary, normal, and appropriate ways in which people understand the business of big data analytics.

3.3 Diffusion of Innovation Theory

The Diffusion of Innovation Theory has been widely used to explain how innovations spread (and what barriers to their spread there may be), and it is suitable for investigating the adoption of a technology. Diffusion research includes technological innovations, and Rogers (2003) uses the terms technology and innovation synonymously. For Rogers, a “technology is a design for instrumental action that reduces the uncertainty in the cause-effect relationships involved in achieving a desired outcome” (Rogers, 2003). A technology can therefore include hardware and software.

Diffusion is defined as “the process by which an innovation is communicated through certain channels over time among the members of a social system” (Rogers, 2003). When new ideas are created, diffused, and eventually adopted or rejected, this leads to certain consequences and results in social change (Rogers, 2003).

An innovation is defined as an object, idea, or practice that one perceives as being novel (Rogers, 2003). However, technological innovation creates uncertainty regarding its expected consequences. Therefore, to reduce the uncertainty of adopting an innovation, individuals should be informed about its advantages and disadvantages and be made aware of all of its consequences (Rogers, 2003).

The second element of the diffusion of innovations process involves the communication channels that participants use to create and share information with each other, with the objective of increasing their understanding of the innovation (Rogers, 2003).

In addition, the time aspect should be included in diffusion research, as its inclusion is one of the strengths of this research tradition (Rogers, 2003). Time does not exist independently of events; it is part of every activity, in which an individual proceeds from first knowledge of an innovation to the adoption or rejection of that innovation (Rogers, 2003).

Innovations are diffused using a process. The innovation-decision process outlines the phases through which an individual or a decision-making unit goes from first knowledge of an innovation to establishing an attitude toward the innovation, to a decision to adopt or reject, to implementation of the new idea, and to confirmation of this choice (Rogers, 2003). The five phases are: knowledge, persuasion, decision, implementation, and confirmation (Rogers, 2003).

According to the Diffusion of Innovation Theory, the innovation-decision process starts with the knowledge stage. In this phase, an individual learns about the existence of the innovation and seeks thorough information about it (Rogers, 2003). There are three types of knowledge, namely awareness-knowledge, how-to knowledge, and principles-knowledge (Rogers, 2003). Awareness-knowledge represents the knowledge of the innovation’s existence, which can motivate the individual to learn more about the innovation and potentially to adopt it (Rogers, 2003). How-to knowledge involves information about how the innovation should be used correctly (Rogers, 2003). The third type of knowledge is principles-knowledge, which includes the functioning principles describing how and why an innovation works (Rogers, 2003).

Rogers (2003) defined five important determinants that can affect the rate of adoption, namely relative advantage, compatibility, complexity, trialability, and observability. Of these, the first three are particularly important. Relative advantage, which has also been found to be a consistent predictor of the adoption of information technology, can be measured by aspects such as increased business opportunities, improved customer service, enhanced competitiveness, and value creation (Sun et al., 2020a).

An innovation can also be adopted without this knowledge, but the misuse of the innovation may then cause its discontinuance. One of the biggest barriers to the use of a technology is the lack of a vision or an understanding of why or how to integrate the technology. Education and practice should therefore be provided to create the how-to and know-why experience and to create new knowledge. In fact, an individual may have all the essential knowledge, but this does not imply that the individual will also adopt the innovation. Overall, an individual’s attitudes towards new technologies may influence the adoption or rejection of the innovation (Sahin, 2006).

4 Methodology

This chapter first presents the research strategy and method. Subsequently, the study population, data collection, and data analysis are described. The chapter concludes with a discussion of the reliability, validity, and ethical considerations of this study.

4.1 Research Strategy and Method

The purpose of this study was to investigate what data practitioners (hereafter practitioners) consider the challenges of big data that may prevent creating organizational innovation, as well as what strategies these practitioners recommend to remediate these challenges.

Research typically involves one of three research paradigms, namely the positivist, critical, and interpretive approaches (Thanh and Thanh, 2015). Scholars normally base the research that they conduct on various assumptions (Creswell and Creswell, 2003). These assumptions are usually described as involving a theory of reality that focuses on the nature of reality (ontology) and a theory of knowledge (epistemology) that focuses on the relationship between the inquirer and the known (Denzin and Lincoln, 2011). Thus, behind a positivist approach is an ontology which holds that the world consists of objects (and of behaviors) of so-called ‘natural kinds’. What this means is that it is possible to compare one thing with another using numerical or statistical devices, because comparison between objects is straightforward. It is not difficult to measure the effects of combining an acid with a salt, because in chemistry no-one is interested in debating what an ‘acid’ might be. Another way of putting it is that the world divides up the way it does independently of what an observer might believe. Sometimes, a post-positivist paradigm is described, one in which it is recognized that the assumptions behind traditional positivism are not always justified, but which nevertheless justifies the belief that the world has an independent status. The post-positivists state that reality can never be completely apprehended, only approximated. Post-positivism relies on multiple methods as a way of capturing as much reality as possible (Denzin and Lincoln, 2011). The so-called ‘realists’ and ‘critical realists’ are an example of this; they adopt methodologies which they know will never get absolutely to the ‘truth’ but which, through triangulation of different methods, will get somewhat closer to it (Denzin and Lincoln, 2011).

Many scientists today would describe themselves as realists rather than positivists.

The ’interpretivist’ paradigm is rather different since it is based on the view that the world, physical and social, is socially constructed (though it does not follow that everything is constructed in the same way) (Hacking, 2000). Interpretivist traditions stem largely from idealist and phenomenological philosophies, which argue that the world is constituted out of the concepts we use, and that those concepts are created socially through our common use of language and through our common cultural viewpoints. There are many variations of interpretivism, but the most common today is called ‘social constructionism’. The difference between these two broad paradigms is obvious. In one, the world has a distinct and objective form, independent of the observer, while in the other the world takes its shape from observer beliefs. It is the difference between ‘objectivity’ and ‘relativism’ (Creswell and Creswell, 2003). Regardless of these philosophical underpinnings, interpretivism has methodological consequences.

Measurement has little value in such a perspective, because interpretivism is more concerned with how human beings go about making sense of their worlds and with their meaningful behavior.

Research in the field of information systems is primarily based on the paradigms of positivism and interpretivism/constructivism (Orlikowski and Baroudi, 1991).

Constructivism or social constructivism (which is often unified with interpretivism) “(…) believes that individuals seek understanding of the world in which they live and work. Individuals develop subjective meanings of their experiences – meanings directed toward certain objects or things” (Creswell and Creswell, 2003).

The interpretive paradigm enables researchers to view the world through the perceptions of participants (Thanh and Thanh, 2015). In line with this paradigm, the objective of the present study was to investigate the views, the backgrounds, and the experiences of data practitioners. We should stress that, in this paper, we were not much interested in the philosophical background, but focused on the interconnection between constructivism/interpretivism and the qualitative methods used in the field of information systems.

For the research strategy, a survey was selected to investigate what practitioners consider the challenges of big data, as well as what strategies they recommend to remediate these challenges. A survey also enables the identification of discrepancies between the challenges that academics and practitioners describe. Qualitative researchers have widely accepted the survey as a reliable and effective method to investigate and understand complex real-life matters (Harrison et al., 2017). Furthermore, an inductive approach is considered necessary because no previous research has been performed on this subject. Consequently, no quantitative measures are available for analyses using a deductive approach (Garson, 2013).

A semi-structured interview was selected as the research method for data collection (DiCicco-Bloom and Crabtree, 2006). This method is particularly relevant because it enabled an in-depth exploration of the subjective opinions and strategies of practitioners concerning big data (Kallio et al., 2016, Sharma and Petosa, 2014). A method that combines the structure of pre-defined questions with the flexibility of in-depth questioning was therefore necessary. We took a pragmatic view of the number of participants required for our study, using purposive sampling, because of the practical difficulty of obtaining a bigger sample.

4.2 Study Population

Participants were selected using purposive sampling (Mason, 2002, Trost, 1986, Robinson, 2014). Four selection criteria were applied. First, participants were employed in a data-related occupation, such as data science, data engineering, business intelligence, or information
