
DATA QUALITY IN THE INTERFACE OF INDUSTRIAL MANUFACTURING AND MACHINE LEARNING

Author: Teoman Duran Timocin

Mentor: Jukka Hohenthal

Master thesis, 30 HP

Master's Programme in Management, Communication and IT

Department of Business Administration

Uppsala University

Summer 2020


Abstract

Innovations are coming together and are changing business landscapes, markets, and societies. Data-driven technologies create new, or raise existing, expectations on products, services, and business processes. Industrial companies must reconstruct both their physical environment and their mindset to adapt successfully. One of the technologies paving the way for data-driven acceleration is machine learning. Machine learning technologies require a high degree of structured digitalization and data to be functional. The technology has the potential to extract immense value for manufacturers because of its ability to analyse large quantities of data. The author of this thesis identified a research gap regarding how industrial manufacturers need to approach and prepare for machine learning technologies. Research indicated that data quality is one of the significant issues when organisations try to approach the technology. Earlier frameworks on data quality have not yet captured the aspects of manufacturing and machine learning as one. By reviewing data quality frameworks that include machine learning or manufacturing perspectives, the thesis aims to contribute an area-specific data quality framework in the interface of machine learning and manufacturing. To gain further insights and to complement current research in these areas, qualitative interviews were conducted with experts on machine learning, data and industrial manufacturing. The study finds that ten different data quality dimensions are essential for industrial manufacturers interested in machine learning. The insights from the framework contribute knowledge to data quality research, as well as providing industrial manufacturing companies with an understanding of machine learning data requirements.


Table of Contents

1. Introduction
   Problem Description
   Purpose and Research Objectives
   Delimitations
   Thesis Structure
2. Background
   Industrial Data and its Lifecycle
   Data Quality
   2.2.1. Data Quality Challenges in Industrial Companies
   Machine Learning
   2.3.1. Definitions of Machine Learning
   2.3.2. The Role of Data in Machine Learning
   The Key Dimensions of Data Quality in the Context of Machine Learning
3. Literature Review
   Frameworks in the Context of Data Quality
   3.1.1. Advantages of Using Frameworks
   Related Work
   Towards Creating the Thesis Framework
   3.3.1. Wang and Strong (1996) – Beyond Accuracy: What Data Quality Means to Data Consumers
   3.3.2. Cichy and Rass (2019) – An Overview of Data Quality Frameworks
   3.3.3. Hubauer et al. (2013) – Analysis of Data Quality Issues in Real-World Industrial Data
   3.3.4. Khan and Klaus (2016) – A Perspective on Industry 4.0: From Challenges to Opportunities in Production Systems
   3.3.5. Gudivada et al. (2017) – Data Quality Considerations for Big Data and Machine Learning
   Creation of the Theoretical Basis for the Thesis Framework
   Summary and Conceptualisation of the Basis for the Thesis Framework
4. Methodology
   Research Design
   Quality Aspects of the Study
   Research and Interview Stages
   Data Collection
   4.4.1. Evaluation of Interviewee Group One
   4.4.3. Operationalisation
   Data Analysis
   Criticism of Methodology
5. Empirical Data Result
   Intrinsic Data Quality
   Contextual Data Quality
   Representational Data Quality
   Accessibility Data Quality
6. Analysis - Evaluation of Dimensions
   Usefulness of Frameworks
   Evaluation of Intrinsic Data Quality
   Evaluation of Contextual Data Quality
   Evaluation of Representational Data Quality
   Evaluation of Accessibility Data Quality
7. Discussion of Thesis Data Quality Framework
8. Conclusion
9. Suggestions for Further Research and Contribution
10. References
Appendix A


Abbreviations

Artificial Intelligence (AI) – A system of technologies that fulfils the characteristics of "intelligence"

Cyber-physical systems (CPS) – Transformative technologies for managing interconnected systems between physical assets and computational capabilities

Data Governance – The management of the availability, usability, integrity, and security of data used in an enterprise

Information and communication technologies (ICT) – Technologies that give access to information through telecommunications, e.g. the internet and wireless networks

Industry 4.0 – A concept for industrial companies based on technologies such as IoT, CPS, ICT, and cloud computing, in which machines augmented with wireless connectivity and sensors, connected to a system, can visualise the entire production line, control it, and make decisions on their own

Intelligence – The ability to learn and understand, to solve problems, and to make decisions

Internet of Things (IoT) – A system of interrelated computing devices, mechanical and digital machines, objects, or people that are provided with unique identifiers (UIDs) and the capability to transfer data over a network without requiring human-to-computer or human-to-human interaction

Machine Learning (ML) – A form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions (see the full definition in section 2.3.1)


1. Introduction

This chapter introduces the background, problem description and purpose, connecting the thesis to the significant industrial changes expected from data-driven technologies such as machine learning. Furthermore, the scope, delimitations and general thesis structure are addressed.

The demand for intelligent data-driven technologies in manufacturing industry processes is gradually increasing, as pressure from market competition, customer demand and governmental regulations regarding energy efficiency, emissions and renewables steadily rises (Ge, Song, Ding & Huang, 2017). An Accenture industry report on manufacturing and data-driven technologies, Manufacturing the Future by Schaeffer, Cabanes and Gupta (2017), claims that 85% of industrial companies feel pressured to innovate in the area of industrial equipment to sustain and develop their competitive edge. Furthermore, 78% believe that AI technologies, often associated with machine learning, will have a significant impact on industrial equipment and the industry. Researchers have also taken notice of this change and claim that we are currently entering the spring of a new industrial revolution with data-driven technologies at its centre. Synonyms, or closely related terms for this revolution, are Industry 4.0 (Germany), Smart Production (USA), Smart Factory (South Korea), Cognitive Production and "the Age of Cyber-Physical Systems" (Thoben, Wiesner, & Wuest, 2017; Xu et al., 2018; Zhong, Xu, Chen, & Huang, 2017; Zhong, Xu, Klotz, & Newman, 2017).

Over the last twenty years, the application areas of machine learning in business have become vast, as a result of the greater amount of available and transparent data and improved algorithms (Larose, 2005; Smola & Vishwanathan, 2008). These improvements have paved the way for new fields of application, such as automatization, optimization, control and diagnostics in the manufacturing industry. Machine learning has therefore become an area of interest for data-heavy sectors all around the world and is expected to significantly affect the manufacturing industry and its processes (Hermann, Pentek, & Otto, 2015; Lu, 2017). However, there are many prerequisites for deriving value from data-driven technologies such as machine learning. The exact nature and form of the challenges may vary depending on the contextual situation of each company, but some challenges, such as requirements on data, remain constant as they are fundamental for the technology (Ge et al., 2017; Jardine, Lin, & Banjevic, 2006).


Deploying machine learning technologies is, however, a complicated matter, as massive amounts of data alone are not enough to meet the requirements for implementation. There are other data requirements, e.g. on data quality, that need to be met for the technology to provide value. Data quality requirements are, according to the interviewees in this study, rarely fulfilled in industrial manufacturing companies. The technology gap is thus often a barrier for manufacturers who intend to implement machine learning technologies. To deal appropriately with the issues of data quality and make data useable, Cichy and Rass (2019) emphasise that organisations need a decision strategy for the steps of planning, obtaining, storing, sharing, maintaining, applying and disposing of data. Couture (2012) suggests an approach to dealing with data quality and complex systems by identifying the underlying system dependencies. The identification can be made by breaking down data quality into its different dimensions and attributes and putting these in relation to a goal. Cichy and Rass further claim that these are the underlying reasons why data quality management comprises a variety of frameworks and methodologies regarding the assessment and improvement of data. This thesis will use this approach to help industrial manufacturers gain a better understanding of the data quality challenges that stand between them and machine learning technologies. Additionally, the thesis will contribute a new area-specific framework to close the existing research gap.

Problem Description

Research indicates quality issues in how industrial companies handle raw and processed data. The large quantities of unprocessed data create obstacles during the evaluation, creation and implementation processes of machine learning technologies. Processed data of a certain quality is often required for machine learning technologies to work. The data quality challenges in industrial companies cause barriers such as significantly higher project costs and implementation times, born out of factors such as manual data collection and inaccurate data from processes (Panian, 2010; Polyzotis, Roy, Whang, & Zinkevich, 2017). To help different sectors deal with their data quality issues, researchers have produced data quality frameworks with varying purposes, e.g. frameworks for big data or for data quality in healthcare. However, when searching for data quality frameworks in the context of machine learning and manufacturing, an absence of research and scientific guidelines on data quality requirements and considerations was identified. The data quality frameworks closest to the area of machine learning in industrial settings focus either on data quality and machine learning (Gudivada et al., 2017) or on data quality and manufacturing (Hubauer et al., 2013). No identified framework puts the two areas directly in relation to each other; this gap needs to be resolved.


To fill the research gap and support industrial companies in navigating data strategies in the context of machine learning, the thesis embarks on an investigation of the data quality issues characterising the industrial sector in the context of machine learning. The research is based on empirically driven studies and seven interviews with experts in the areas. The value of compiling an empirically driven framework in the area lies in bringing data quality frameworks closer to the reality of industrial companies, in comparison to general frameworks (Wang and Strong, 1996). A data quality framework focusing on industrial manufacturing and machine learning helps industrial manufacturers evaluate their data quality situation with more accuracy through context-based information. Consequently, industrial manufacturers can create better decision bases when setting up machine learning acquisition strategies and roadmaps for implementation.

Purpose and Research Objectives

This paper aims to reduce the research gap by identifying the characterising industrial practices associated with data quality and machine learning challenges. Increased insight into manufacturing practices and challenges enables industrial manufacturing companies and researchers to identify the causes of data quality issues. Data quality is currently a significant barrier in the industrial manufacturing industry when implementing machine learning technologies. A data quality framework of the characterising data quality issues in the industrial manufacturing sector will help managers and researchers break down data quality issues into smaller parts and enable a more systematic approach to managing the problems. A systematic approach can help in navigating data strategies and creating roadmaps when preparing to acquire machine learning technologies to stay competitive in Industry 4.0.

RO 1 Create a framework of the key aspects of data quality in the context of machine learning in an industrial setting.

Delimitations


regarding challenges, possibilities and demands are excluded. These aspects can be related to the research objective and are important to consider when conducting projects in practice.

Further limitations are set around the empirical data collection. The author interviewed professionals working with data, data quality and/or machine learning, as they were deemed the most relevant group when targeting a small population for interviews. Professionals at other levels who work indirectly with the areas may have valuable insights too, but are not included, as a small population needs to be drawn from a specific group to add up to a valid perspective on the matter.

Thesis Structure

Chapter two gives background information about the thesis subjects by defining data and data quality, providing insight into current data quality frameworks from the literature, and by introducing machine learning and the role of data within it.


2. Background

This chapter first introduces the basic theory of the industrial data lifecycle to give an understanding of the different kinds of data that exist and of what a typical process surrounding data in industrial companies looks like. This information is essential to the topic of data quality and machine learning. Afterwards, the concept of data quality is presented and put in the context of manufacturing and machine learning.

Industrial Data and its Lifecycle

Before entering the subject of data quality, there needs to be a basic understanding of data and its life cycle (the sequence of activities from creation to disposal). Data can be compared to a product in a manufacturing system. A manufacturing process begins with the input of raw materials, which are later processed, with a physical product as the output. Data goes through a very similar process, see Table 1 (Wang et al., 1995). The primary sources of raw data in an industrial factory are equipment, products, human operators, information systems and networks. Data processing is essential for data to become cleansed, structured and tailored for its purpose. The processing is vital, as data only becomes valuable when translated into something concrete and understandable for users (Tao, Qi, Liu & Kusiak, 2018).

Table 1. An analogy between physical products and data products

           Product Manufacturing     Data Manufacturing
Input      Raw materials             Raw data
Process    Materials processing      Data processing
Output     Physical products         Data products

Cichy and Rass (2019) suggest that there are two ways of seeing data. The first way to view data is presented in Table 2. The view has many similarities with the analogy Wang et al. (1995) presented: component data are processed raw data from, e.g., the factory floor, which are stored temporarily until the final information product is produced.


Table 2. Data types

Raw data item – Smaller units which are used to create information and component data (e.g. data from machines or workers on the factory floor)

Component data item – Data constructed from raw data items and stored temporarily until the final product is manufactured (e.g. data from the factory floor that is structured and made ready for use)

Information product – Data which is the output of performing a manufacturing activity on data (insights)

The second, widely recognised way to present data is by structure, see Table 3. The way data is organised can have a significant impact on its usability in machine learning. Supervised machine learning is one of the two most common kinds in the industrial environment, and it needs structured data to function (Ge, Song, Ding, and Huang, 2017). The thesis will therefore refer to structured data when discussing data. To obtain structured data, industrial companies need to turn raw data into component data. There are, however, underlying factors which can affect the process of turning raw data into structured data. The area of data quality can help clarify what these underlying factors are (the three data structures are illustrated in a short sketch after Table 3).

Table 3. Data Structures

Structured data – Generalization or aggregation of items described by elementary attributes defined within a domain. Example: relational data tables

Semi-structured data – Data that have a structure with some degree of flexibility. Example: web page, XML file

Unstructured data – A generic sequence of symbols, typically coded in natural language. Example: email text
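To make the distinction in Table 3 concrete, the following minimal Python sketch shows the same sensor reading in unstructured, semi-structured and structured form. The values and field names are invented for illustration; only the structured form supports the direct querying that supervised machine learning relies on when consuming tabular data.

import sqlite3

# The same sensor reading in three forms (invented example values).
unstructured = "Sensor 7 on line B read 81.4 C at quarter past nine."          # free text
semi_structured = "<reading sensor='7' line='B' temp_c='81.4' time='09:15'/>"  # XML-like

# Structured: elementary attributes defined within a domain, stored relationally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor INTEGER, line TEXT, temp_c REAL, ts TEXT)")
conn.execute("INSERT INTO readings VALUES (7, 'B', 81.4, '2020-06-01T09:15:00')")

# Only the structured row can be queried directly, which is the property
# supervised machine learning depends on for tabular training data.
print(conn.execute("SELECT temp_c FROM readings WHERE sensor = 7").fetchone())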

The actual process surrounding data collection and refinement in an industrial manufacturing setting is more complex than the previous examples and analogies. Figure 1 is an example of a data life cycle process for the rotating equipment at Siemens Energy Services, which has more than 50 service centres globally. All factories are linked to a common data centre which stores the data of the service centres. The data is collected by, e.g., sensors and later sorted into different categories depending on the source and nature of the data. Some data are raw data, while some are component data (see Table 2), pre-processed data or event data describing the status of units and their functions. Data becomes component data early in the process, when pre-sorted by small chunks of code before being sent to the data collector (see Figure 1). The data collector accumulates the information and regularly sends it to the data centre, where it is later passed on to the service centre to become processed and structured (see Table 3). The data (see Table 2) that passes through the Siemens appliance structure takes many forms, ranging from serial numbers, identification codes and general characteristics of units' components to measurements from sensors and monitoring devices, pre-processed data, event data, and processed data from the data centre. The processed data in the data centre can later be used to produce information products. Figure 1 is a good example of how the data life cycle can look in an industrial manufacturing setting. The data collection and data processing processes are, however, seldom perfect in practice. They leave room for a wide variety of data quality issues, such as how complete and believable data is (Hubauer, Lamparter, Roshchin, Solomakhina & Watson, 2013). The specific data quality issues that characterise the industrial sector are further discussed in chapter 2.2.1.

Figure 1. Appliance structure and data flow

Data Quality

Data quality has become critical for companies on both an operational and a strategic level, especially in the area of computing- and data-intensive applications. Data quality can, in some cases, be the deciding factor between failure and success. According to Sadiq (2013), some of the costs of low data quality are higher operating costs, more flawed and/or delayed decisions, issues in aligning organisations and greater difficulty in setting and executing strategies. Data quality challenges have existed since the early days of computing but remain a problem for many industries worldwide. With the progress of technology, data quality has become critical in many domains, inside and outside the business world. The importance of data quality is reflected in the wide range of often domain-specific areas in which studies have been conducted: big data and machine learning (Gudivada, Apon, & Ding, 2017), cyber-physical systems (Sha & Zeadally, 2015), ERP systems (Cao & Zhu, 2013), accounting information systems (Xu, 2015), drug databases (Curé, 2012), sensor data streams (Klein, 2007), linked data (Kontokostas et al., 2014), data integration (Bansal & Kagemann), multimedia data (Na, Baik & Kim, 2001), and customer databases (Sneed & Majnar, 2011). The area of big data management (Gudivada et al., 2017) likewise concerns data and quality. It is, therefore, no surprise that the industrial manufacturing sector, characterised by being data-heavy, has taken a greater interest in data-driven technologies and data quality (Gudivada et al., 2017). Industrial manufacturing companies aiming to implement data-driven technologies with high quality requirements on data face many challenges in their internal processes. To comprehend industrial manufacturers' data quality issues, an understanding of the underlying dimensions of data quality is essential. With insight into the dimensions, the subject of data quality can be structured and the existing challenges made more comprehensible when discussing industrial data quality.

The term quality was defined in 1986 by the International Standards Organisation as "the totality of features and characteristics of an entity that bears on its ability to satisfy stated and implied needs". The definition is very similar to the one used for data quality in 2020: "The degree to which a set of inherent characteristics of data fulfils requirements". Not much has changed in the general description of the term. However, going back to 1996, a more elaborate description was given by Wang and Strong, who defined data quality as "fitness for use" and split data quality into four subcategories containing fifteen quality attributes, presented in a hierarchical framework. Their research aimed at making information system professionals better understand and meet the data quality needs of their data consumers. The most critical dimensions, according to the empirical two-step study of Wang and Strong (1996), are accuracy, timeliness, precision, reliability, currency, completeness, and relevancy. The study also mentions accessibility and interpretability as essential dimensions. Notable is the acknowledgement that the key dimensions vary depending on the purpose of the data. The study was based upon a data quality literature review and a two-step survey of data consumers. The first step focused on mapping what data quality dimensions there are, with a survey group consisting of 25 data consumers from the industry and 112 students with work experience as data consumers. The second step focused on which dimensions were important to the data consumers, using a quantitative approach in which the survey group consisted of 1500 alumni who regularly used data to make decisions (see also section 3.3.1).


Figure 2. A Conceptual framework of Data Quality

Later, Abate et al. (1998) tried to define data quality but found the literature lacking any unified perspective on the area. They still found one common denominator among the researchers: the notion of data quality is inherently user-centric. Data was deemed to have the right quality when meeting the requirements stated in a specification reflecting the implied needs of the user, meaning that data quality needs to be determined from a context-based point of view. Daniella et al. (2002) took the definition one step closer to the modern version by defining data quality through the categories of data accuracy and data completeness:

• Data accuracy (the extent to which registered data conform to the truth)
• Data completeness (the extent to which all necessary data that could have been registered actually have been registered; both notions are computed in the sketch below)
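As a minimal illustration of how these two definitions translate into measurable quantities, the Python sketch below computes completeness and accuracy as simple ratios. The records, field names and ground truth are invented for illustration.

# Hedged sketch: completeness and accuracy as simple ratios over invented records.
records = [
    {"product_no": "A1", "batch": "B7", "weight_kg": 12.0},
    {"product_no": "A2", "batch": None, "weight_kg": 11.8},   # batch not registered
    {"product_no": "A3", "batch": "B7", "weight_kg": 99.0},   # wrong weight
]
ground_truth_weights = {"A1": 12.0, "A2": 11.8, "A3": 12.1}

# Completeness: share of necessary values actually registered.
fields = ["product_no", "batch", "weight_kg"]
registered = sum(r[f] is not None for r in records for f in fields)
completeness = registered / (len(records) * len(fields))

# Accuracy: share of registered values that conform to the truth.
correct = sum(r["weight_kg"] == ground_truth_weights[r["product_no"]] for r in records)
accuracy = correct / len(records)

print(f"completeness = {completeness:.2f}, accuracy = {accuracy:.2f}")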

The modern definition keeps in line with the former definitions but takes a narrower view. Khatri and Brown (2011), Hazen et al. (2014) and Gudivada, Apon, and Ding (2017) all use essentially the same definition: "the rate at which data satisfies its usage requirements".


General data quality dimensions also indicate which dimensions should be included in data quality frameworks targeting industrial manufacturers and machine learning. The twelve frameworks reviewed by Cichy and Rass were analysed comparatively to find similarities and differences in which dimensions of data quality were considered the most significant. The authors therefore looked at how frequently dimensions were mentioned across the 12 frameworks. They presented the results in a bar chart, excluding dimensions mentioned fewer than two times (see Figure 3). The most frequently mentioned dimensions were Completeness (10/12), Timeliness (9/12) and Accuracy (8/12); other frequently mentioned dimensions were Accessibility (5/12) and Consistency (5/12). The dimensions in Figure 3 give a representative picture of the most common data quality dimensions in general frameworks.

Figure 3. Most recurrent dimensions and attributes in data quality frameworks


Manufacturing processes, in turn, must meet this demand. Still, according to the studies of Hubauer et al. (2013) discussed in chapter 2.2.1, there are weak links which cause issues in several dimensions, some being Accuracy and Believability.


2.2.1. Data Quality Challenges in Industrial Companies

Few publications address the issues of data quality in the specific context of the industrial environment. Two case studies were, however, found and chosen to generate insight into current industrial data quality challenges. The author of this thesis collected the information from the case studies by indexing them for information related to data quality issues in industrial companies. The author is aware of the risk of taking information out of context when using the index method and was therefore thorough in reading each whole article before beginning the process. By comparing the case studies to general data quality frameworks, the author can distinguish which dimensions appear to be industry-specific and which data quality issues appear not to be. The literature is later complemented with empirical data on the state of industrial companies to get a more representative picture and clearer differences.

The first study, by Hubauer et al. (2013), is part of the project "Optique", which aims to improve data quality in industrial companies. It is conducted jointly by several European universities and two industrial giants, Siemens AG and Statoil. The study researched the state of different data quality dimensions in the Siemens Energy Sector, which maintains thousands of power generation facilities. They analysed and evaluated a variety of data sets from different processes to identify the most critical data quality issues, summarised in Table 4.


Table 4. Critical data quality issues in real-world industrial data

Completeness, Accessibility – The fullness of information is a common issue; data is often missing or not sufficiently detailed, even though fullness is an essential characteristic of data. Identified sources are accessibility issues caused by, e.g., outlier data from remote regions and bad connections. Another cause is the absence of data because of faulty devices such as sensors and bad connections between devices and databases. Issues range from having no data at all between time periods to having only "raw" or event data.

Consistent Representation – Information often comes from multiple sources with different formulations of the same issues, e.g. different monitoring and control systems in industrial companies. It is essential to have homogeneous data, but variations in, e.g., timestamps (yyyy/mm/dd and dd/mm/yyyy) and denotations (Product number 1234 or Product number [1234]) are common. These factors badly affect analysis and statistics (a normalisation sketch follows this table).

Free of Errors, Believability, Accuracy, Duplication – Data need to be correct, precise and relevant for machine learning. Challenges in this area occur in industrial companies because of faulty devices in the data transportation chain, different settings and values on machines (e.g. timestamps), occasional outliers, varying value ranges on, e.g., sensors, noisy data, oscillation, and duplication of data from using several sensors on the same line.

Ease of Manipulation – Heterogeneous databases cause data issues when source keys differ or are missing between systems. Heterogeneity causes data to vary between systems, makes it hard to have complete and accurate data, and causes issues during data merging. Manually entered data is also easily distorted by limitations and differences in systems, e.g. the same recorded issue might have different data IDs on different machines. People may also interpret data differently.

Timeliness, Appropriate Amount of Information – What data is necessary can vary depending on the purpose. Still, for thorough analyses it is critical to have all data available and updated; not having it can cause severe data limitations.
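The Consistent Representation row can be illustrated with a minimal normalisation sketch in Python. The timestamp formats and denotations mirror the examples in Table 4, while the function and field names are invented.

from datetime import datetime

# Hedged sketch: the same facts arrive with different timestamp formats
# and denotations, and are mapped onto one canonical representation.
raw_rows = [
    {"ts": "2020/06/01", "product": "Product number 1234"},
    {"ts": "01/06/2020", "product": "Product number [1234]"},
]

def normalise(row):
    ts = row["ts"]  # fall back to the raw value if no known format matches
    # Try the known source formats until one parses (yyyy/mm/dd vs dd/mm/yyyy).
    for fmt in ("%Y/%m/%d", "%d/%m/%Y"):
        try:
            ts = datetime.strptime(row["ts"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    product_id = "".join(ch for ch in row["product"] if ch.isdigit())
    return {"ts": ts, "product_id": product_id}

print([normalise(r) for r in raw_rows])
# Both rows now agree: {'ts': '2020-06-01', 'product_id': '1234'}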

The second is a study regarding challenges in the manufacturing industry and preparation for Industry 4.0 by Khan and Klaus (2016). It used a qualitative case research strategy, as Industry 4.0 and smart manufacturing are relatively new research areas that pose new problems; practice-based problems were therefore seen as the most fitting approach for finding new insights. The study can be criticised for not having complete width and depth. Still, seeing that it is regularly cited and was presented in the proceedings of the first international scientific conference "Intelligent Information Technologies for Industry" (IITI'16), it is deemed trustworthy enough to be used in combination with other studies. The study of Khan and Klaus presents and discusses the top five challenges faced by industrial companies, where data challenges are included as one of the main areas. The following table summarises their insights:

Table 5. Common data quality challenges in the industrial environment

Data Heterogeneity, Duplication, Consistent Representation, Ease of Manipulation – The lack of a standardized data management approach, caused by rigid structures and low adaptability in systems, is one of the current hindrances in big industrial companies. Furthermore, redundant data in different formats, extensions and enrichments are stored in various parts and systems of the company, which leads to inconsistency in data and semantics and allows different interpretations of the data.

Availability, Timeliness – Low system integration causes data to be processed periodically, which creates issues in having real-time data and therefore makes it hard to execute the real-time data analyses that are vital for many machine learning technologies.

Free of Errors, Believability, Accuracy – No continuous checks of the data and its quality, or countermeasures to address low data quality, are performed (a sketch of such a check follows this table).
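The last row points at the absence of continuous quality checks. A minimal sketch of what such a recurring check could look like is given below; the rule thresholds and field names are invented, and a real deployment would route the findings to logging or alerting.

# Hedged sketch: a tiny rule-based validator that could run after each
# batch of sensor data arrives. Thresholds and fields are invented.
def check_batch(rows):
    problems = []
    for i, row in enumerate(rows):
        if row.get("temp_c") is None:
            problems.append((i, "missing value"))        # completeness
        elif not (-40.0 <= row["temp_c"] <= 400.0):
            problems.append((i, "value out of range"))   # accuracy / free of errors
    return problems

batch = [{"temp_c": 81.4}, {"temp_c": None}, {"temp_c": 1200.0}]
for index, issue in check_batch(batch):
    print(f"row {index}: {issue}")  # would be logged or alerted in practice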


Machine Learning

Machine learning has its roots in statistical modelling, which has been used in industry for many years. Statistical models were, in the beginning, based on linear regression (Cochran, 1976). The field of machine learning has since developed into several subareas, with one of the more recent prominent fields being artificial neural networks (ANN). The drivers of this development are found in the progress of technology, allowing more processing power, a more significant amount of available data, and transparency (Larose, 2005; Smola & Vishwanathan, 2008). These drivers have made new fields of application possible, such as automatization, optimization, control and diagnostics in the manufacturing industry (Alpaydin, 2014; Pham & Afify, 2005). Machine learning, like all statistical tools, is deeply connected to its data. The machine learning method referred to in this thesis is supervised learning, which together with unsupervised learning represents 80–90% of the models used within the industrial environment (Ge, Song, Ding, and Huang, 2017). Supervised machine learning is dependent on structured data, described as "generalization or aggregation of items described by elementary attributes defined within a domain", which can be found in relational data tables (Cichy and Rass, 2019). There are other areas of machine learning that could be interesting, but as supervised learning is one of the most common in the industrial environment, the project limitation has been set accordingly. Before elaborating on the subject of data quality and its essential role in machine learning, a clear definition of machine learning is needed to create congruency in perspective between author and reader.
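As a minimal sketch of supervised learning on structured (tabular) data, of the kind this thesis restricts itself to, consider the following Python example. The readings and labels are invented, and scikit-learn is merely one common library choice, not one prescribed by the literature above.

# Hedged sketch: supervised learning on a small structured table.
from sklearn.tree import DecisionTreeClassifier

# Each row: [temperature_c, vibration_mm_s]; label: 1 = failure within 24 h.
X = [[70, 1.1], [85, 4.0], [66, 0.9], [90, 5.2], [72, 1.3], [88, 4.7]]
y = [0, 1, 0, 1, 0, 1]

model = DecisionTreeClassifier().fit(X, y)   # learn from labelled history
print(model.predict([[86, 4.5]]))            # predict for a new reading -> [1]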

2.3.1. Definitions of Machine Learning

Wang, Ma and Zhou (2009) reviewed the definitions of machine learning, its core structure, applications and a variety of machine learning methods. They define the subject of machine learning with the following high-level approach:

"A subject that studies how to use computers to simulate human learning activities and to study self-improvement methods of computers to identify existing knowledge, obtain new knowledge and skills, and continuously improve performance and achievement."

A very similar, broad but slightly more specific definition of machine learning comes from Sharp et al. (2018), who take an industrial manufacturing approach to the subject:

"Machine learning is the concept of having a computer update a model or response based upon new data or experiences through its learning lifetime."


Goodfellow et al. (2016) offer a more low-level definition:

"Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions."

The definitions are all connected, going from the high-level definition of Wang et al. (2009) to the low-level definition of Goodfellow et al. (2016). The high-level definitions aim more to explain the "purpose" of machine learning, while the low-level view focuses more on the "how". Machine learning has its basis in the interface of computers and mathematics. The core of machine learning is the algorithms and models which we continuously feed with information (data). What differentiates machine learning algorithms from others is the ability to "learn" from data, learning in the sense that a system attains the ability to solve a specific task (Goodfellow et al., 2016). However, Cabitza, Locoro and Banfi (2018) argue that the "learning" characteristic lies in machine learning's ability to create automatic procedures for incremental function optimization.

Machine learning can either be an independent entity or part of a system of technologies, for example an AI system. In such a system of technologies, different software components have different purposes. The "AI" technology could, for example, give a machine the ability to act intelligently (John McCarthy, 1955), while the machine learning algorithm or model helps the AI in its decision making. Machine learning helps the AI by being the function in the system of technologies which analyses the collected data and learns from it, telling the AI what the optimal decision is in a changing environment. One could say that the primary goal of "AI" is success, while the purpose of machine learning is to increase accuracy without being explicitly programmed to do so.

2.3.2. The Role of Data in Machine Learning

The very lifeblood of machine learning technologies is relevant data in a useable condition and format (Ge et al., 2017; Jardine, Lin, & Banjevic, 2006). It is essential to have an overview and data in good condition not only when machine learning is operational, but also earlier, when designing the machine learning models, as, e.g., data type and data quality are deciding factors for which model should be chosen (Jardine et al., 2006; Ge et al., 2017). It is therefore essential to have data standards and policies in place to fulfil the data quality demands of machine learning.

Machine learning uses data for multiple purposes. Training data (data used to train the machine learning technology) and test data (data used to test the machine learning model) differ slightly from serving data (data from the deployment area). Training and test data need to be of higher quality: the less accurate the data is, the more unpredictable and less trustworthy the machine learning technology will be. A common phrase when discussing the potential of machine learning is "machine learning is only as good as its data". Without high-quality data to train one's machine learning technology with, this will be reflected in the value of its output. Bad data quality could even result in the machine learning technology becoming misleading, draining value from the process it is involved in rather than adding value.

The basic concept of how machine learning can work is visualised below in Figure 4, from Shin and Park (2010). Putting input data (new data) through the algorithms leads to the machine learning technology analysing the data and coming up with the best answer based on the new input data and the historical data (old data), something it generally becomes better at over time as it continuously optimizes its function. It basically solves tasks through statistical analysis of new and old data. A small synthetic illustration of the dependence on training data quality follows Figure 4.

Figure 4. Example of a machine learning framework of continuous improvement in manufacturing processes
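The dependence on training data quality can be made concrete with a small synthetic experiment: the same learner is trained once on clean labels and once on labels corrupted by a simulated faulty logging process, and the corrupted variant typically scores markedly worse on held-out data. All data, thresholds and noise rates below are invented.

import random
from sklearn.linear_model import LogisticRegression

random.seed(0)
X = [[random.uniform(60, 95)] for _ in range(400)]   # e.g. a temperature reading
y = [int(x[0] > 80) for x in X]                      # true rule: failure above 80

# Simulated faulty logging: 60% of rows are recorded as "failure" regardless.
noisy_y = [1 if random.random() < 0.6 else label for label in y]

X_train, X_test, y_test = X[:300], X[300:], y[300:]
clean = LogisticRegression().fit(X_train, y[:300])
dirty = LogisticRegression().fit(X_train, noisy_y[:300])

print("trained on clean labels:", clean.score(X_test, y_test))
print("trained on noisy labels:", dirty.score(X_test, y_test))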

The main advantage of machine learning is its ability to solve complex tasks quickly, with decisions based on ample amounts of real-world data instead of intuition. The data analysis makes it possible for machine learning technologies to adapt to new situations. Through continuously collected data and improvement of its function, it can make the right decisions in changing environments (Russell, Norvig & Ernest, 2010; Pennachin & Goertzel, 2007; Lison, 2015). To achieve this continuous development, machine learning needs quality data both for the creation of the software and during its operation. The data needs to be useful and relevant for the environment in which it operates, and in a condition in which it is applicable to the machine learning technology. There is thus a clear connection between data quality and the value machine learning can provide.


The Key Dimensions of Data Quality in the Context of Machine Learning

Gudivada et al.'s (2017) paper is a substantial extension of work presented at ALLDATA 2016, the second international conference on big data, small data, linked data and open data. The paper presents a collection of key data quality dimensions from the research submissions which are important for the interface of big data and machine learning. Gudivada et al. (2017) claim that the major challenges that need to be addressed in machine learning regarding data quality are similar to those of big data. If these issues are not dealt with, they will lead to numerous issues and costs later in the process of implementing and operating machine learning. Not meeting the requirements could lead to incorrect machine learning models and an impaired ability to make accurate predictions due to, e.g., fluctuations in the training set, negatively affecting the output. To help avoid such issues, Gudivada et al. narrowed down the key data quality areas that need to be considered for machine learning. As some dimensions, such as gender bias and business rules, are not directly linked to data quality, even though they play an important role in some contexts, they are excluded from the summary of Gudivada et al.'s (2017) key dimensions presented in Table 6. The primary reason for their exclusion is that they lack relevance for the general industrial manufacturing company.

Table 6. Data Quality Dimensions

Accuracy – Refers to the correctness of data; whether the recorded value conforms with the actual value for its intended use. Example: a false reason was given for why a machine stopped functioning in a production line, or sensors give large amounts of uncertain data.

Timeliness – Recorded values are up to date for the task at hand. Example: infrequent updates may lead to decision-making based on old data (incorrect input is prone to produce incorrect output).

Completeness – The requisite values are recorded (not missing) and are of adequate depth and breadth; in other words, the data needs to paint a complete picture of reality. Example: if a recorded product has data for product number and date, but no batch number (where necessary for the purpose), it is considered incomplete.

Believability / Consistency – Indicates the trustworthiness of the source as well as its content. Example: duplicated data inputs of, e.g., production date in data sets, and/or a range for when the production occurred, e.g. between 2020-01-01 and 2020-01-30, instead of a specific date such as 2020-01-13.

Heterogeneity / Semantic consistency – Data need to consist of one type of "language". Example: machine X sends X data to X database and machine Y sends Y data to Y database; these later need to be collected and converted to a language fitting machine learning (illustrated in the sketch after this table).

Accessibility – Data need to be accessible to be used. High security and many outliers may create accessibility issues.
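The Heterogeneity / Semantic consistency row can be illustrated with a minimal unification sketch. The two record layouts, the code table and the target schema are all invented for illustration.

# Hedged sketch: two machines log the same kind of event in different
# "languages"; both are mapped onto one shared schema before any
# machine learning consumes them.
machine_x = {"prod_nr": "1234", "stop_reason": "JAM", "time": "2020-06-01 09:15"}
machine_y = {"productId": 1234, "haltCode": 3, "timestamp": "01/06/2020 09:15"}

HALT_CODES = {3: "JAM"}  # machine Y encodes stop reasons numerically

def from_x(r):
    return {"product_id": int(r["prod_nr"]), "reason": r["stop_reason"]}

def from_y(r):
    return {"product_id": r["productId"], "reason": HALT_CODES[r["haltCode"]]}

unified = [from_x(machine_x), from_y(machine_y)]
print(unified)  # both events now share one vocabulary and schema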


The study can, however, be criticised for not being transparent regarding its methodology. To complement Gudivada et al.'s (2017) research further, a collection of what four research papers consider key success factors in the interface of data quality and machine learning is presented below. These studies have a narrower technical focus, which can provide further insights and/or confirmation of important data quality dimensions for machine learning. The information was collected through indexing, with the same cautions taken as in the previously indexed studies. The study by Polyzotis et al. (2017) is based on the empirical data of industry specialists from Google and professors researching data management in the production of machine learning models. Their research paper highlights the data management issues that arise in the context of machine learning pipelines deployed in production. The table below reflects the compiled data quality issues which, according to the study, frequently arise.

Table 7. Key areas of Data Quality for the purpose of Machine Learning

Confounding factors / Heterogeneousness – Data comes from multiple sources and is collected in different ways, with varying purposes of initial use. This significantly affects the usability of the data, as it becomes hard to manage and understand when merging the data for a new common purpose.

Completeness / Missing data – If a large proportion of a critical attribute has missing values, it could create gaps in the statistical data that lower the accuracy of the machine learning output.

Consistent representation – Since multiple sources and different ways of collecting data are used, variations arise in the data. Homogeneous data are necessary for machine learning to operate successfully.

Duplication of data – Data duplication is a major challenge and creates false statistics, but is considered a larger problem in environments where the data is created by external users from, e.g., internet forums (illustrated in the sketch after this table).

Semantic data integration
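The duplication row can be illustrated with a minimal deduplication sketch; the events and the choice of deduplication key are invented, and real pipelines usually need fuzzier matching than exact key equality.

# Hedged sketch: the same physical event recorded by two sensors on one
# line inflates the statistics until near-duplicates are collapsed.
events = [
    {"line": "B", "ts": "2020-06-01T09:15:00", "event": "stop"},
    {"line": "B", "ts": "2020-06-01T09:15:00", "event": "stop"},  # second sensor
    {"line": "B", "ts": "2020-06-01T11:02:00", "event": "stop"},
]

seen, deduplicated = set(), []
for e in events:
    key = (e["line"], e["ts"], e["event"])
    if key not in seen:
        seen.add(key)
        deduplicated.append(e)

print(len(events), "recorded ->", len(deduplicated), "distinct events")  # 3 -> 2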


3. Literature Review

This chapter reviews data quality frameworks to highlight the need for a data quality framework in the interface of machine learning and manufacturing. Strategies for designing the requirements needed to create and present a model are reviewed in the first section. The second section reviews and discusses earlier data quality frameworks and related studies, which serve as the foundation for creating a theoretical basis for the thesis data quality framework in the interface of machine learning and manufacturing.

Frameworks in the Context of Data Quality

Clegg (1989) described frameworks in software development in two ways: "a reusable design of all or part of a system that is represented by a set of abstract classes and the way their instances interact" and "a skeleton of an application that can be customized". Tonette and Maria (2009) describe frameworks as a manuscript that synthesises theoretical and/or empirical theories and concepts to develop a foundation for a new theory. Cichy and Rass (2019) hold that there are two kinds of frameworks. The first has a more general approach, focusing on methodology, definitions, assessment and improvement processes within an organisation in a comprehensive way. The second approach is specialised, with more depth, where the framework functions as a step-by-step tool useable, e.g., in software development to help the user with tasks such as data profiling or data validation. Others see data quality frameworks as designs of systems that show how the different data quality dimensions and attributes interact with each other, and as tools for the evaluation and analysis of data quality processes (Corrales, Ledezma & Corrales, 2018; Eppler and Wittig, 2000; Willshire and Meyen, 1997). More specifically, Willshire and Meyen add that data quality frameworks can be used for defining models of current and future data environments, for identifying and analysing the importance of attributes and dimensions in specific contexts, and for providing guidance for data quality improvements. To evaluate and analyse data, one must somehow measure whether it meets "the requirements for its purpose". The common way to do that is through various metrics, e.g. dimensions and attributes (Sebastian-Coleman, 2012).


Table 8. Important characteristics when creating Data Quality frameworks

Comprehensive – A framework needs to consider all influencing aspects as equal and not be weighted toward one specific issue. A weighted framework may cause tunnel vision and not bring all issues to the surface.

A balance between rigour and general perspective – Experts may desire a balanced view of the issue, while non-experts prefer the general perspective.

Transparent results – Suggests a systematic and reproducible approach, in that the same dimensions, elements and indicators can be applied across a wide range of situations.

Furthermore, Wang and Strong (1996) hold that there are three ways of researching data quality frameworks: the intuitive, the theoretical and the empirical approach. The intuitive approach, the most common, is based on what researchers find, through their own experience and intuitive understanding, to be the most important dimensions and attributes. The advantage of this approach is that each study can select the dimensions most relevant to its goal. The intuitive approach usually results in a small set of dimensions and attributes structured in a hierarchy. The theoretical approach centres on how data become deficient during data manufacturing processes. The advantage of this approach is its potential to deliver a comprehensive set of data quality attributes that are intrinsic to a data product (Wang and Strong, 1996). The two approaches are criticised for focusing too much on the development characteristics of data instead of its use characteristics. They fail to provide an adequate basis for improving data quality and are, according to Wang and Strong, significantly worse than the empirical approach. The empirical approach, which Wang and Strong used, analyses data collected from data consumers to determine the characteristics they use to assess whether the data are fit for their purpose. The approach is user-centric, which avoids the issues of the theoretical and intuitive approaches. Wang and Strong further add that it may reveal characteristics that researchers have not considered part of data quality. The drawback of taking the empirical approach, however, is that the correctness or completeness of the results cannot be proven via fundamental principles (Wang and Strong, 1996).


Frameworks can thus support organisations in the analysis of their current and desired state of data quality. A framework can thereby increase knowledge of purpose and methodology for how organisations can improve their data quality processes.

3.1.1. Advantages of Using Frameworks

Frameworks, in general, can be used for a wide variety of purposes. They provide reusable processes and a way of handling errors, whether as a straightforward solution (software) (Clegg, 1989) or as a guide for understanding errors in different contexts (organisation) (Tonette & Maria, 2009). Frameworks also allow organisations to create better information improvement processes (Cichy & Rass, 2019) and help employees carry out more extensive projects through multiple smaller purposeful actions in line with a greater purpose, as the big picture is clearer (Clegg, 1989; Tonette & Maria, 2009). By giving organisations and their employees the means to understand the relations between the different data quality dimensions and attributes, frameworks let them utilise technical resources more effectively, which results in higher quality output, as a connection is drawn between the smaller actions and the greater purpose. Data quality frameworks are ultimately a means for communication and education: they are not tailored to each company but rather guide the understanding of the concept, so that organisations can evaluate themselves and take purposeful actions for their specific context. The general knowledge in companies can therefore be improved, and processes and strategies better understood across teams, as knowledge gains another transfer channel and can become more widespread in organisations (Grant, 1996).

Related Work

Data quality research has traditionally focused on operational data that is structured and normally stored in relational databases. Wang, Storey and Firth (1995) made a literature analysis of more than 70 data quality management papers from computing, business, management and industry-specific literature between 1970 and 1993. They concluded that the literature at the time aimed at syntactic correctness (e.g. constraint enforcement that prevents "garbage data" from being entered).


Similar directions were later pursued by, e.g., Shankaranarayanan (2005). However, new areas of issues were introduced with data warehousing, since it required the integration of data from various data sources (Gudivada, 2017). The new areas concerned data acquisition, cleansing, transformation, linking and integration (Maydanchik, 2007), which is what the study of Batini et al. (2009) focused on (see Table 9). With further improvements and the rise of big data and technologies such as cloud and machine learning, which require real-time processing of data, the rules of the game have once again changed, and the issue of low data quality has returned as a major impeding force, as data need to be managed and made ready faster (Gudivada, 2017).


Table 9. Data Quality studies conducted between 1996 and 2019

Wang and Strong (1996) – Empirical. A two-step empirical study to make information system professionals better understand and meet their data consumers' data quality needs by presenting a data quality framework. Field: data quality, data quality dimensions.

Helfert and Herrmann (2002) – Empirical. A metadata-based data quality framework (ProDQM) for data warehouse systems, based on cooperative information systems. Field: data quality, frameworks, information systems.

Shankaranarayanan (2005) – Theoretical. A study on how to manage data quality in a data warehouse through the framework Total Data Quality Management (TDQM). Field: data quality management, data warehouse, information product, metadata.

Batini et al. (2009) – Theoretical. A review of a few well-known and established methodologies for assessing and improving the quality of heterogeneous data. Field: data quality, data quality measurement, data quality assessment, methodology, information system, quality dimension.

Hubauer et al. (2013) – Empirical. A case study of data quality challenges at the industrial company Siemens Energy. Field: industrial data, data quality, data quality dimensions.

Khan and Klaus (2016) – Empirical. A study of the common challenges and opportunities in production systems. Field: Industrie 4.0, digital manufacturing, smart systems, smart factory, future factory, cyber-physical systems, data quality, IoT.

Polyzotis et al. (2017) – Empirical. Describes data management challenges when producing machine learning, based on the experience of four industry and research experts within the area. Field: machine learning, data, data quality, modelling.

Ge et al. (2017) – Theoretical. Discusses data mining, machine learning and analytics in the process industry. Field: data, big data, data analytics, data mining, machine learning, process industry, CPS, data-driven monitoring.

Gudivada (2017) – Empirical. Describes the nature of data quality issues in the context of big data and machine learning, based on four cases and the ALLDATA conference 2016. Field: data quality, data quality assessment, data cleaning, big data, machine learning, data transformation.

Blum and Schuh (2017) – Theoretical. Presents a real-time reference architecture for order processing in manufacturing and measures for how to improve data quality. Field: data quality, Industrie 4.0, data analytics, digital twin, digital shadow, real-time architecture.

Cichy and Rass (2019) – Theoretical. A review of how dimensions are mentioned in 12 data quality frameworks applicable to a wide range of business environments, and a guide for determining the suitability of frameworks. Field: data quality, frameworks.

Saqlain et al. (2019) – Theoretical. A framework for an IoT-based industrial data management system (IDMS) which can deal with real-time data processing in manufacturing. Field: Industrial Internet of Things (IIoT), smart factory, data management framework, IoT middleware, factory floor.

Azimi and Pahl (2019) – Theoretical. A framework for machine learning-driven data and information models, discussing a range of quality factors for data and specific machine-learning-generated information models. Field: data quality, information value, machine learning, big data, data quality improvement, data analysis.


Towards Creating the Thesis Framework

There is a wide variety of data quality frameworks which have been applied across different industries with varying purposes (Cichy and Rass, 2019). By comparing frameworks and insights from research directed towards the areas of data quality, data quality & machine learning, and data quality & industrial manufacturing, the thesis will highlight typical characteristics and requirements for the respective areas. The distinctive characteristics will later be compiled in a framework highlighting machine learning and data quality challenges in an industrial manufacturing context.

As the different studies will be compared to each other, they need a common ground. The typical way to evaluate and analyse data is through various metrics and dimensions (Sebastian-Coleman, 2012). Studies therefore need to discuss data quality on a dimensional level to be included, which means that they distinguish between, e.g., data accuracy and timeliness, two frequently mentioned data quality dimensions. Furthermore, as Carson (2001) highlighted, the studies cannot be weighted towards one issue but have to take a general approach towards the subject of data quality. That means a study cannot focus on one data quality dimension only, but has to include several and consider them equally.

Moreover, studies with strong empirical drivers are chosen over purely theoretical studies to get a user-centric thesis framework and potentially reveal characteristics that researchers have not considered part of data quality (Wang and Strong, 1996). The drawback of using an empirical approach is that the correctness and completeness of the results cannot be proven via fundamental principles; it is nevertheless considered the favoured approach here. A theoretical approach could provide a more comprehensive set of data quality attributes, but since the purpose is to narrow down the data quality dimensions and identify which of them characterise industrial manufacturers in the context of machine learning, rather than to describe data quality dimensions in general, the theoretical approach is considered less effective (Wang and Strong, 1996). There is, however, one exception where a theoretically driven study is used, and that is for the general perspective on data quality dimensions: the study by Cichy and Rass (2019) serves the purpose of helping to distinguish the data quality dimensions characterising the industrial manufacturing and machine learning context from the general data quality dimensions.


Other reviewed studies were excluded for being purely theoretically driven, e.g. the study of Ge et al. (2017); for being heavily focused on specific issues outside the parameters of the thesis objective, such as real-time data processing (Saqlain et al., 2019; Blum and Schuh, 2017) and specific machine learning functions (Azimi and Pahl, 2019); or for having too heavy a focus on a particular dimension, such as heterogeneous data (Batini, 2009). These studies' contributions are very relevant for the overall research field but fall outside this thesis's focus. The selected studies that form the theoretical basis for the thesis framework are presented in sections 3.3.1–3.3.5.

3.3.1. Wang and Strong (1996) – Beyond Accuracy: What Data Quality Means to Data Consumers

The data quality framework developed by Wang and Strong has been used effectively in industry and government to help information system managers understand and better meet their consumers' data quality needs. As mentioned when presenting the study in chapter 2 (Background), the empirical study is based on a two-step survey of data consumers. The first step focused on mapping which data quality dimensions exist, with a survey group consisting of 25 data consumers from industry and 112 students with work experience as data consumers in industry. The second step took a quantitative approach to which dimensions were important to data consumers; the survey group consisted of 1500 alumni in a variety of industry positions who regularly used data to make decisions, and they ranked the given dimensions from 1 to 9 depending on how important they were for fulfilling the data requirements.

Wang and Strong used this data to statistically evaluate the 118 different data quality dimensions mentioned during their empirical research and to define data quality as an umbrella term through a conceptual hierarchical framework with four categories and fifteen dimensions. The purpose was to gain a better understanding of the area, as it is, according to Wang and Strong, vital to understand what data quality means to data consumers in order to improve it. Their hierarchical order can be seen in figure 2 (A conceptual framework of data quality) on page 13. The categories are defined as follows (a compact encoding of the full hierarchy is sketched after the list):

• “The Intrinsic data quality category denotes that data have quality in their own right.”
• “The Contextual data quality category highlights the requirement that data quality must be considered within the context of the task at hand.”
• “Representational data quality and accessibility data quality categories emphasize the importance of the role of systems.”
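The hierarchy can be written down as a simple mapping from the four categories to the fifteen dimensions as they are commonly cited from Wang and Strong (1996); the Python encoding itself is only an illustrative sketch, not part of the original study.

```python
# Wang and Strong's (1996) four categories with their commonly cited
# fifteen dimensions, encoded as a plain mapping for illustration.
WANG_STRONG_FRAMEWORK = {
    "intrinsic": ["accuracy", "objectivity", "believability", "reputation"],
    "contextual": ["value-added", "relevancy", "timeliness",
                   "completeness", "appropriate amount of data"],
    "representational": ["interpretability", "ease of understanding",
                         "representational consistency", "concise representation"],
    "accessibility": ["accessibility", "access security"],
}

# Sanity check: the framework comprises fifteen dimensions in total.
assert sum(len(dims) for dims in WANG_STRONG_FRAMEWORK.values()) == 15
```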


The categorisation system and hierarchy which the study constructed are considered valuable and will be included as a structure for the thesis framework.

3.3.2. Cichy and Rass (2019) – An Overview of Data Quality Frameworks

The research of Cichy and Rass aimed at providing a decision guide to data quality frameworks through a meta-analysis of 12 data quality frameworks applicable to a wide range of business contexts. The requirements for a framework to be included in the study were general applicability with regard to data, information systems and type of business; the studies should furthermore define relevant data quality attributes, data quality assessment steps and data quality improvement steps. Their method was to comparatively survey the data quality frameworks in terms of definition, assessment and improvement of data quality, concluding in a decision guide for which framework can be used in which context from a user perspective. The research was also mainly centred around structured and semi-structured data, which is well in line with this thesis.

The most interesting part of their study lies in the meta-analysis, where they compare definitions and the frequency with which dimensions are mentioned in the 12 studies. They identified a wide range of data quality dimensions but found completeness, timeliness and accuracy to be the overall most crucial quality attributes. The meta-analysis brings value when forming a general perspective on data quality frameworks, as it gives a representative picture of the most common dimensions in general data quality frameworks (see figure 3 on page 14). This general perspective is necessary for later distinguishing the specific data quality dimensions characterising machine learning and industrial manufacturing.
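The counting step behind such a meta-analysis is straightforward to illustrate: given which dimensions each surveyed framework mentions, tally how often every dimension occurs. In the sketch below, the three frameworks and their dimension lists are invented placeholders; Cichy and Rass surveyed 12 real frameworks.

```python
# Tally how often each data quality dimension is mentioned across a set of
# frameworks. The framework names and dimension lists are placeholders.
from collections import Counter

frameworks = {
    "framework_a": ["completeness", "timeliness", "accuracy", "consistency"],
    "framework_b": ["completeness", "accuracy", "accessibility"],
    "framework_c": ["completeness", "timeliness", "accuracy"],
}

frequency = Counter(dim for dims in frameworks.values() for dim in dims)
for dimension, count in frequency.most_common():
    print(f"{dimension}: mentioned in {count} of {len(frameworks)} frameworks")
```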

3.3.3. Hubauer et al., (2013) – Analysis of Data Quality Issues in Real-World Industrial Data

As earlier mentioned, this is a case study on data quality in the Siemens Energy Sector, which maintains thousands of power generation facilities with gas and steam turbines as major components. The authors analysed and evaluated a variety of data sets from different processes, e.g. by examining noisy data through software and techniques, to identify what characterised the data and its quality from a dimensional perspective and to propose approaches for dealing with the issues. This study was chosen because it identifies and discusses industrial data quality challenges through a dimensional discussion. The key insight generated from the paper is that Siemens Energy faced issues of imperfect data when implementing software with high data quality demands, such as machine learning, to perform diagnosis, prognosis and analysis in their processes.

Further insights regarding dimensional data quality challenges are presented in Table 4 on page 16. The study thereby provides valuable insights for forming an understanding of the industrial environment and is comparable with the other studies included in the thesis. It can, however, be criticised for not being transparent about how many data sets were analysed and how common the data quality issues were relative to each other. Furthermore, as a single case study it does not give a representative view of the general industrial challenges and therefore needs to be complemented by another study. Two studies together give a broader understanding of the common data quality challenges in the industrial environment; while still not wholly representative, they are further complemented by the empirical research in this thesis.
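As an illustration of the kind of dimensional screening the case study describes, the sketch below flags missing readings, physically implausible values and flat-lined (frozen) sensors in a turbine-style time series. The column name, plausible range and run-length threshold are hypothetical and are not taken from the Siemens material.

```python
# Screen a sensor time series for three typical industrial data quality issues:
# missing values (completeness), out-of-range values (accuracy) and frozen
# sensors that repeat the same reading (noise/plausibility). All thresholds
# and column names are hypothetical.
import pandas as pd

def screen_sensor_data(df: pd.DataFrame,
                       value_col: str = "temperature_c",
                       low: float = -40.0, high: float = 900.0,
                       flatline_len: int = 5) -> pd.DataFrame:
    out = df.copy()
    out["missing"] = out[value_col].isna()
    # Only non-missing values can be judged against the plausible range.
    out["out_of_range"] = out[value_col].notna() & ~out[value_col].between(low, high)
    # A frozen sensor repeats the same value; label runs of identical readings
    # and flag runs at least flatline_len long.
    run_id = out[value_col].ne(out[value_col].shift()).cumsum()
    out["flatlined"] = out.groupby(run_id)[value_col].transform("size") >= flatline_len
    return out
```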

3.3.4. Khan and Klaus (2016) – A Perspective on Industry 4.0: From Challenges to Opportunities in Production Systems

The study by Khan and Klaus focuses on challenges in the manufacturing industry and on preparation for Industry 4.0. They used an iterative approach and the qualitative method of a case research strategy, since Industry 4.0 and smart manufacturing are relatively new research areas that pose new problems; they therefore found practice-based problems to be the most fitting approach for finding new insights. To complement their case study of an industrial company, they distributed questionnaires at IT exhibitions and used informal interviews, company documentation and interviews with industry experts and consultants regarding current problems and challenges in the industrial environment.

They do not present the exact number of interviews or questionnaires in the study, and the low transparency makes it hard to understand the complete width and depth of the work. However, seeing that the study is regularly cited and was presented in the proceedings of the First International Scientific Conference “Intelligent Information Technologies for Industry” (IITI'16), it is deemed trustworthy enough to be used in combination with other studies. Khan and Klaus present and discuss the top five challenges faced by industrial companies in the dawn of Industrie 4.0, where data challenges are included as one of the main areas of common issues. Their insights on common data quality issues are presented in Table 5 (Common data quality issues in industrial companies). Their discussion of data quality is on a dimensional level and is therefore comparable with the other studies included in the thesis.


3.3.5. Gudivada, et al., (2017) – Data Quality Considerations for Big Data and Machine Learning

The researchers describe the nature of data quality issues and their characterisation in the context of big data and machine learning in organisations, and ultimately present an IT architecture and data quality framework for the context. The study summarises key dimensions of data quality in the interface of big data, data governance and machine learning, based on metadata from ALLDATA 2016, the second international conference on big data, small data, linked data and open data (see table 6), and on metadata from four case studies. This research is closely related to the thesis objective: it contributes insights into what kinds of challenges companies are facing and neglecting in the area of data quality and machine learning, and it gives further indicators of which dimensions should be prioritised when forming the thesis framework from the perspective of machine learning.

The study can be criticised for not being completely transparent regarding how the metadata was used, but seeing that the work was published in The International Journal on Advances in Software 10.1 (2017), pp. 1-20, it is deemed a trustworthy source of information. The data quality dimensions and insights into challenges are further confirmed by the study by Polyzotis et al. (2017) who, as mentioned earlier in chapter 2, conducted an empirical study of data management challenges when creating machine learning models, independently of the paper by Gudivada et al. (2017). Given the similarities between the studies, and given that Gudivada et al. use metadata from several studies and cases, the paper can be viewed as representative of the general challenges in the area of data quality and machine learning.

Creation of the Theoretical Basis for the Thesis Framework

This section brings the insights from the selected studies together to identify which issues and data quality dimensions characterise the industrial manufacturing and machine learning context. The vertical section with categories in Table 10 is based upon the framework categories of Wang and Strong (1996). The horizontal section represents the different data quality dimensions included in the general data quality perspective, the machine learning perspective and the industrial perspective, based upon the nature of each study.
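As an illustration of how such a cross-tabulation can be assembled, the sketch below builds a small matrix with data quality dimensions as rows and the three perspectives as columns, marking whether a perspective's literature includes a dimension. The dimension sets used here are placeholders, not the thesis results.

```python
# Build a dimensions-by-perspectives matrix of the kind underlying Table 10.
# The dimension sets per perspective are illustrative placeholders.
import pandas as pd

perspectives = {
    "general": {"accuracy", "completeness", "timeliness"},
    "machine_learning": {"accuracy", "completeness", "consistency"},
    "industrial_manufacturing": {"accuracy", "timeliness", "accessibility"},
}

all_dims = sorted(set().union(*perspectives.values()))
table = pd.DataFrame(
    {name: [dim in dims for dim in all_dims] for name, dims in perspectives.items()},
    index=all_dims,
).replace({True: "x", False: ""})
print(table)  # an "x" marks a dimension included in that perspective
```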
