Data Quality in Data Warehouses: a Case Study


Per Bringle

Submitted by Per Bringle to the University of Skövde as a dissertation towards the degree of M. Sc. by examination and dissertation in the Department of Computer Science.

August 1999

I certify that all material in this dissertation which is not my own work has been identified and that no material is included for which a degree has already been conferred upon me.

_________________________________________________

Per Bringle


Abstract

Companies today experience problems with poor data quality in their systems. Because of the enormous amount of data in companies, the data has to be of good quality if companies want to take advantage of it. Since the purpose of a data warehouse is to gather information from several databases for decision support, it is absolutely vital that the data is of good quality. There exist several ways of determining or classifying data quality in databases. In this work the data quality management in a large Swedish company’s data warehouse is examined, through a case study, using a framework specialized for data warehouses. The quality of data is examined from syntactic, semantic and pragmatic points of view. The results of the examination are then compared with a similar case study previously conducted in order to find any differences and similarities.

Keywords: data quality, data warehouse, evaluation framework


1 Introduction
1.1 Data quality management
1.2 Data quality and data warehouse
1.3 Framework for understanding data quality
1.3.1 Syntactic data quality
1.3.2 Semantic data quality
1.3.3 Pragmatic data quality
1.3.4 Stakeholders
1.4 The problem
1.4.1 Aim and objectives
1.4.2 Description of the objectives
1.5 Preview of forthcoming chapters

2 Background
2.1 Overview
2.2 Data quality
2.2.1 The problem area
2.2.2 Terminology
2.2.3 Data quality categorization
2.3 Data warehouse
2.4 Case studies and research design

3 Research design
3.1 Study questions
3.2 Study propositions
3.3 Unit of analysis
3.4 Linking data to the propositions
3.5 Criteria for interpreting the findings

4 The case
4.1 Unit of analysis
4.2 The data warehouse
4.3 The interviews
4.3.1 Syntactic data quality
4.3.2 Semantic data quality
4.3.3 Pragmatic data quality
4.4 Summarized answers

5 Analysis of the case
5.1 Evaluation with the framework as a basis
5.2 Comparison between the case studies

6 Analysis of the case study process

7 Conclusions and discussion
7.1 The case
7.2 The framework
7.3 Future work in the area

8 References


Introduction

1 Introduction

This project aims to undertake a case study in a large Swedish company, in order to examine data quality management in one of its data warehouses. The examination will be done using a framework proposed by Shanks and Darke [SHA98]. This framework is suited for analysing case study data from empirical studies of data quality. The results from the examination will then be compared with the case study performed by Shanks and Darke [SHA98] in order to find any differences and similarities. The main goal of this dissertation is to show how the data quality practices are handled in the company. My personal view of the framework and how applicable it was to the case will also be discussed, but it is not the main issue in this dissertation.

1.1 Data quality management

A problem with the management of data in today’s companies is the quality of the data [TAY98, STR97]. The amount of data in a company can be enormous. To be able to use this data in a way that benefits the company, the data has to be of good quality. Otherwise the data just costs money to store and the company gets nothing out of it. The data can be very useful if it is used in the right way. Tayi and Ballou [TAY98] compare data with raw material in their paper. They look upon data as the raw material of the information age. The difference between raw material and data is that raw material can only be used once, while data can be used over and over again. It is easy to calculate the value of raw material, but the value of data depends on how it is used. If the quality of the data is poor, there is a risk that the data will not be used. This is why the quality of data has to be good.


1.2 Data quality and data warehouse

Tayi and Ballou [TAY98] describe the problem with data quality when it concerns data warehouses:

“The capability of judging the reasonableness of the data is lost when users have no responsibility for the data’s integrity and when they are removed from the gatherers. Such problems are becoming increasingly critical as organisations implement data warehouses.”

Orr [ORR98] states that the single biggest problem in companies that have implemented a data warehouse is the poor quality of data in their legacy systems.

Why then is good data quality even more important in data warehouses? Because a data warehouse is so large and contains so much data, it is vital that the data is of good quality.

Another argument is that the data warehouse fetches its data from different sources. There are also many people from different departments working with the data warehouse, who often use different descriptions for the same data.

According to McFadden [MCF96] there has not been much scholarly research done in the data warehouse area when it comes to identifying costs and benefits or evaluating various issues.


Different ways of determining or classifying data quality have been proposed in the literature [SHA98], [WAN95], [WAN96], [WA+96], [RED98]. The framework that I will use in this study was proposed by Shanks and Darke [SHA98] and can be used to understand the quality of the data in a data warehouse. With this framework as a basis I will conduct a case study in a large Swedish company in order to evaluate the company’s data quality practices.

1.3 Framework for understanding data quality

The framework proposed by Shanks and Darke [SHA98] provides a way to understand data quality in data warehouses. It supports both intrinsic and extrinsic data quality dimensions and is based on semiotic theory, which is the study of symbols and words [SHA98]. The components of the framework are:

• a set of data quality goals

• means to achieve the goals that are defined via desirable properties

• improvement strategies

• measures for each property

• stakeholders [SHA98]

The three semiotic levels handled by the framework are syntactic data quality, semantic data quality and pragmatic data quality. Table 1 summarizes the goals, properties, improvement strategies and measures for each level.


TABLE 1. Summary of goals, properties, improvement strategies and measures [SHA98]

| Semiotic level | Goal | Property | Improvement strategy | Measure |
|---|---|---|---|---|
| Syntactic | Consistent | Well-defined (perhaps formal) syntax | Corporate data model, Syntax checking, Training for data producers | Percentage of inconsistent data values |
| Semantic | Comprehensive and Accurate | Complete, Unambiguous, Meaningful, Correct | Training for data producers, Minimise data transformations and transcriptions | Percentage of errors in data or population sample |
| Pragmatic | Usable and Useful | Timely, Understood, Concise, Easily accessed, Reputable | Monitoring data consumers, Explanation and visualisation, High quality data delivery systems, Data tagging | Time of update, User surveys, Effect on decision making processes and outcomes |

1.3.1 Syntactic data quality

The syntactic level concerns the quality of the structure of the data. The goal at this level is consistency, meaning that the data values should use a consistent symbolic representation. This is important when comparing and consolidating data. To reach good syntactic data quality it is important to have a well-defined syntax for the data. A number of improvement strategies exist. A corporate data model can be developed in order to achieve the same syntax in the entire company. Another improvement strategy is to have automatic syntax checking when the data is inserted into the data warehouse. A third improvement strategy is to train the people who produce the data, so that they know how to insert the data correctly. To measure consistency in a data warehouse, the percentage of inconsistent data values can be calculated. [SHA98]
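As an illustration only, the syntactic measure (the percentage of inconsistent data values) could be computed with automatic syntax checking along the following lines. This is a minimal sketch and not part of the framework in [SHA98]: the column names, the regular expressions and the function `percent_inconsistent` are hypothetical assumptions made for the example.

```python
import re

# Hypothetical syntax rules: column name -> pattern every value must match.
SYNTAX_RULES = {
    "order_date": re.compile(r"\d{4}-\d{2}-\d{2}"),  # ISO date, e.g. 1999-08-15
    "country": re.compile(r"[A-Z]{2}"),              # two-letter country code
}


def percent_inconsistent(rows, rules=SYNTAX_RULES):
    """Return the percentage of checked values that violate their syntax rule."""
    checked = 0
    violations = 0
    for row in rows:
        for column, pattern in rules.items():
            if column not in row:
                continue  # only values that are present are checked
            checked += 1
            if pattern.fullmatch(str(row[column])) is None:
                violations += 1
    return 100.0 * violations / checked if checked else 0.0


# Made-up sample data for the illustration.
sample_rows = [
    {"order_date": "1999-08-15", "country": "SE"},
    {"order_date": "15/08/1999", "country": "SE"},      # inconsistent date syntax
    {"order_date": "1999-08-16", "country": "Sweden"},  # inconsistent country syntax
    {"order_date": "1999-08-17", "country": "AU"},
]

print(percent_inconsistent(sample_rows))
```

In this made-up sample two of the eight checked values break their rule. Such a check only covers the symbolic representation of the values; whether a well-formed value is also correct belongs to the semantic level.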


1.3.2 Semantic data quality

The goal of semantic data quality is to have comprehensive and accurate data. Data is comprehensive if, for each relevant state in the real-world system, there exists a data value in the data warehouse. Data is accurate if it corresponds to the real-world state it represents. For the data to be comprehensive and accurate it should be complete, unambiguous, meaningful and correct, and also consistent, which was the goal of the syntactic level. The data is complete if each state in the real-world system is mapped to a data value in the data warehouse. If no two states in the real-world system are mapped to the same data value in the data warehouse, the data is unambiguous. If the data is meaningful, there should not exist any data in the data warehouse that cannot be mapped to a state in the real-world system. For the data to be correct, the states in the real-world system should be mapped to the right data values in the data warehouse. [SHA98]
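The four semantic properties can be read as conditions on a mapping from real-world states to data values. The sketch below is an illustration under that reading only; the function `semantic_properties`, its parameters and the example states are hypothetical and do not come from [SHA98].

```python
def semantic_properties(states, stored_values, mapping, expected):
    """Check the four semantic properties for a state -> data value mapping.

    states:        set of relevant real-world states
    stored_values: set of data values present in the data warehouse
    mapping:       dict from a state to the data value that represents it
    expected:      dict from a state to the data value it should map to
    """
    mapped_values = list(mapping.values())
    return {
        # complete: every relevant state is mapped to a data value
        "complete": all(state in mapping for state in states),
        # unambiguous: no two states are mapped to the same data value
        "unambiguous": len(mapped_values) == len(set(mapped_values)),
        # meaningful: every stored value can be mapped back to some state
        "meaningful": stored_values <= set(mapped_values),
        # correct: every mapped state points to the right data value
        "correct": all(mapping[s] == expected.get(s) for s in mapping),
    }


# Hypothetical example: order states and the codes stored for them.
states = {"order placed", "order shipped", "order cancelled"}
mapping = {"order placed": "P", "order shipped": "S"}
stored_values = {"P", "S", "X"}
expected = {"order placed": "P", "order shipped": "S", "order cancelled": "C"}

result = semantic_properties(states, stored_values, mapping, expected)
```

Here the data is not complete, because "order cancelled" has no data value, and not meaningful, because the stored value "X" corresponds to no state. It is unambiguous and correct for the states that are mapped.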

To improve semantic data quality, the data producers can be trained so that they will produce comprehensive and accurate data. Another improvement strategy is to ensure that the number of transformations and transcriptions of the data, from the time that the data is captured until it is stored in the data warehouse, remains at a low level. A measure for the semantic level is the percentage of errors in the data, or in a population sample, when the data is compared to the states in the real-world system. [SHA98]

1.3.3 Pragmatic data quality

The third semiotic level deals with data quality at the pragmatic level. The goal at this level is to achieve usable and useful data. Data is usable if it can be easily accessed and used by the stakeholders of the data warehouse. Useful data helps stakeholders in completing their tasks in the company. [SHA98]

For data to be usable and useful to stakeholders it should be timely, understood, concise, easily accessed and reputable, and also consistent, comprehensive and accurate, which are the goals of the syntactic and semantic levels. For the data to be timely, it should be updated so that it is valid for the specific task. The stakeholders must be able to understand the data that resides in the data warehouse. Data also has to be concise. For data to be accessible it should be easy to retrieve and manipulate. Reputability stands for how the stakeholders regard the data in terms of its source, content and credibility. [SHA98]

To improve pragmatic data quality, the data consumers can be monitored in their use of the data warehouse to ensure that the data is up to date for the specific tasks. Explanation tools and visualisation tools can be used to make the data warehouse easier to use and to help the stakeholders understand the data. For the data consumers to use the data warehouse, it is essential that high quality data delivery systems are used. Data tagging can also be used to improve pragmatic data quality. To measure pragmatic data quality, the amount of time passed since the last update can be measured. User surveys can be conducted in order to find out what the stakeholders think about the data in the data warehouse. The effect on the stakeholders’ decision making processes and outcomes can also be measured. Apart from the time of update, these measures are highly subjective, since it is difficult to obtain direct values for them. [SHA98]
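Of the pragmatic measures, the time of update is the one that lends itself most directly to calculation. The following sketch shows one possible way to relate the time since the last update to the needs of a specific task. The function `is_timely`, the dates and the age limits are hypothetical assumptions for illustration.

```python
from datetime import datetime, timedelta


def is_timely(last_update, now, max_age):
    """Return (age, timely): the time since the last update, and whether
    that age is within the limit that a given task can tolerate."""
    age = now - last_update
    return age, age <= max_age


# Hypothetical load and check times.
last_load = datetime(1999, 8, 9, 6, 0)
checked_at = datetime(1999, 8, 12, 6, 0)

# The same data can be timely for one task and stale for another.
age, timely_for_weekly = is_timely(last_load, checked_at, timedelta(days=7))
_, timely_for_daily = is_timely(last_load, checked_at, timedelta(days=1))
```

With these values the data is three days old, which is acceptable for a task that tolerates data up to a week old but not for one that needs daily figures. This reflects the point that timeliness is judged relative to the specific task.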


1.3.4 Stakeholders

Four categories of stakeholders exist in the framework:

• Data producers

• Data custodians

• Data consumers

• Data managers

The data producers are the ones that create or collect the data that is inserted into the data warehouse. Those who design, develop and operate the data warehouse are called data custodians. The data consumers are the ones that use the data that resides in the data warehouse. The responsibility of managing the entire data warehousing process is given to the data managers. [SHA98]

1.4 The problem

1.4.1 Aim and objectives

The aim of this project is to examine the data quality practices in a data warehouse in a large Swedish company with the framework proposed by Shanks and Darke [SHA98] as a basis.

In order to accomplish my aim, I have set up four objectives to pursue along the way:

• Formulate a research design.

• Describe the data quality aspects of the data warehouse in the company.


• Evaluate the data quality practices with the framework proposed by Shanks and Darke [SHA98] as a basis.

• Analyse the outcome of the evaluation.

1.4.2 Description of the objectives

The first objective is to formulate a research design. This is a way to prepare for the case study. Here I will write about the type of information I am looking for in the company. The purpose of the case study is also stated here, together with a description of the unit of analysis. I will use the framework proposed by Shanks and Darke [SHA98] as a basis for the information I want to gather from the company.

The second objective is to describe the data quality aspects of the data warehouse in the company. In doing this I will give some introduction to the company’s data warehouse and its purposes. I will describe the data warehouse from a data quality view. With this objective I hope to give some understanding of how the company uses its data warehouse and what data quality aspects are considered.

The third objective is to evaluate the data quality practices with the framework proposed by Shanks and Darke [SHA98] as a basis. The framework helps in understanding data quality in data warehouses by providing the three semiotic levels of data quality. With the framework as a basis I will conduct a case study in a large Swedish company. The case study will include interviews in the company in order to show the opinion of the stakeholders of the data warehouse and their thoughts on the quality of the data in the data warehouse. The material collected from the interviews will then be used for the analysis.

The fourth objective is to analyse the outcome of the evaluation. The analysis will be based on the material gathered from the interviews and its results will then be linked to the case study performed by Shanks and Darke [SHA98] in order to find any differences and similarities. This analysis will hopefully provide a view of how the company looks upon data quality in their data warehouse and the users’ views on their work with the data warehouse.

1.5 Preview of forthcoming chapters

Below follows a presentation of the forthcoming chapters.

Chapter 2 provides the background knowledge that this work is based on.

Chapter 3 focuses on the research design which will be used in the case study.

Chapter 4 describes the case and the interviews conducted at the case site.

Chapter 5 presents an analysis of the data quality in the data warehouse. A comparison between my case study and the case study conducted by Shanks and Darke [SHA98] is also made.

Chapter 6 presents an analysis of the case study process.

Chapter 7 contains conclusions and reflections on the study.


Background

2 Background

2.1 Overview

In this chapter I will provide some background material that will help in putting the rest of this paper into perspective. I will describe the concepts of data quality and data warehouse. A description of the case study method follows. I will also describe the framework that this work is based upon, i.e. “Understanding Data Quality in Data Warehousing: A Semiotic Approach” by Graeme Shanks and Peta Darke [SHA98].

2.2 Data quality

2.2.1 The problem area

Some problems with data quality have been presented in the literature. Tayi and Ballou [TAY98] mention some of them. With multiple users of the data, the semantics of the data become very important. If people in different departments attach different meanings to the same data, the meaning of the data disappears. For example, if the financial department talks about sales in Dollars and the marketing department talks about sales in Yen, there is a big risk of misunderstanding between the departments, which could lead to devastating consequences.

Tayi and Ballou also mention the problem that data quality has low priority in a company. The importance of data quality is well known in companies, but in the budget the data quality area tends to be of lower priority.


Another problem described by Tayi and Ballou is that it is difficult to determine the level of data quality to achieve in a company. If the level of data quality is set too high, the cost of achieving and maintaining that level can increase considerably, but if the level of data quality is poor, the usability of the data will be low. Or as Orr puts it [ORR98]:

“The real concern with data quality is to ensure not that the data quality is perfect, but that the quality of the data in our information systems is accurate enough, timely enough, and consistent enough for the organization to survive and make reasonable decisions.”

Orr describes another problem with data quality. Data in databases is static, but the real world is dynamic. Therefore the quality of the data is likely to worsen the longer it stays in the database.

2.2.2 Terminology

There is little variation in what is understood by the term “data quality”, but its definition differs in the literature. Orr [ORR98] describes data quality as “the measure of the agreement between the data views presented by an information system and that same data in the real world.” In other words, data quality shows how well the data matches the real world it represents.

Strong et al. [STR97] define high-quality data as “data that is fit for use by data consumers”.


Shanks and Darke [SHA98] use the definition “fitness for purpose” when talking about quality and point out that quality should include both the intrinsic characteristics of the data itself and the assessments made by users of the data. Shanks and Darke divide the quality of data into two categories, metadata and content data. The metadata describes the content data and how to access it. The quality of the metadata depends on how well the data warehouse’s conceptual model is done. It is important that the quality of the content data is good, or else the users will have difficulty in assessing and understanding the data and thereby will make no use of the data warehouse. The framework provided by Shanks and Darke focuses on data quality. [SHA98]

Below I present the definitions for data, information and meaning used by Shanks and Darke [SHA98] which I will adopt.

Data:

“Data is a collection of symbols that are brought together because they are considered relevant to some purposeful activity.”

Information:

“The information carried by a symbol relates to who produces it, why and how it was produced and its relationship to the real world system state it signifies.”


Meaning:

“Meaning is defined as the particular meaning that people derive from symbols and is generated from the information that accompanies data.”

2.2.3 Data quality categorization

How to categorize data quality differs from author to author, although the meaning is often much the same. Below I will present a couple of ways to categorize data quality.

Wang, Reddy and Kon [WAN95] describe data quality as a multi-dimensional and hierarchical concept. They look upon data quality from the user’s perspective. To the user the data must be accessible, interpretable, useful and believable. Wang, Strong and Guarascio [WA+96] have constructed a framework where the data quality concept is divided into four categories, i.e. intrinsic data quality, contextual data quality, representation data quality and accessibility data quality.

These frameworks are aimed at regular databases, while the Shanks and Darke [SHA98] framework is for understanding data quality in data warehouses. A deeper description of the framework is given in section 1.3.

2.3 Data warehouse

A data warehouse supports decision making through its capabilities for analysing integrated, corporate-wide historical data. The role of a data warehouse is to store and organize data in order for it to be analysed [INM94]. Typically, a data warehouse is very large and contains both current and historical data. Applications exist that help the data warehouse users to ask ad hoc queries through an easy graphical interface [SHA97].

The founder of the concept of data warehousing is widely known to be Bill Inmon [POW99]. The definition of a data warehouse according to him is [INM94]:

“A data warehouse is a subject oriented, integrated, time variant, non-volatile collection of data in support of management’s decision-making process.”

By subject-oriented is meant that the data warehouse is organized around the major subjects of a company. The major subjects in an insurance company could be customer, policy, premium and claim, while the operational entities may be auto, health, life and casualty [INM96]. The second distinctive characteristic of a data warehouse is that the data in it is integrated. By integrated it is meant that the data in the data warehouse comes from different sources. In storing data from different sources some problems arise that normally do not occur in conventional operational databases, e.g. differences in naming conventions and differences in encoding. A typical example is the different ways to measure things: there could be problems determining whether data is in centimetres, inches or yards.
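The measurement example can be made concrete with a small sketch of what integration involves: values from different sources are brought to one common encoding before they are stored. The record layout and the function `to_centimetres` are hypothetical; the conversion factors follow from the definitions of the international inch and yard.

```python
# Conversion factors to centimetres (1 in = 2.54 cm, 1 yd = 91.44 cm).
CM_PER_UNIT = {"cm": 1.0, "in": 2.54, "yd": 91.44}


def to_centimetres(value, unit):
    """Convert a length to centimetres; reject units that are not known."""
    try:
        return value * CM_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None


# Hypothetical records from three sources that each use their own unit.
source_records = [
    {"source": "A", "length": 100.0, "unit": "cm"},
    {"source": "B", "length": 10.0, "unit": "in"},
    {"source": "C", "length": 2.0, "unit": "yd"},
]

# Integrated view: every length is expressed in the same unit.
integrated = [
    round(to_centimetres(r["length"], r["unit"]), 2) for r in source_records
]
```

The conversion is only possible when each source states its unit. A value whose unit is unknown is rejected rather than guessed, which is exactly the situation described above where it cannot be determined what the data is measured in.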

Time-variant data means that the data is accurate as of some moment in time. Data that is not time-variant is accurate right now. Operational databases use non-time-variant data, which can be updated. The data in a data warehouse cannot be replaced; a new snapshot of the source is taken instead, which does not erase the old data. Time-variancy also incorporates a time horizon that is much longer than that of an operational database. The data in a data warehouse typically represents five to ten years, while the time horizon in an operational database stretches from about 60 to 90 days. Non-volatility means that once the data is inserted into the data warehouse, it is not changed. The only two kinds of operations that are allowed in a data warehouse are initial loading of data and access of data. In operational databases, insertion, change and deletion of data occur on a regular basis. [INM94]
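The difference between updating in place and taking snapshots can be sketched as follows. This is an illustrative toy model, not a description of any particular data warehouse product: the class `SnapshotStore` and its methods are hypothetical names chosen for the example.

```python
from datetime import date


class SnapshotStore:
    """Append-only store: each load adds a dated snapshot and nothing
    that was loaded earlier is changed or deleted (non-volatile)."""

    def __init__(self):
        # List of (snapshot_date, key, value) tuples in load order.
        self._snapshots = []

    def load(self, snapshot_date, records):
        """Append a new snapshot of the source data."""
        for key, value in records.items():
            self._snapshots.append((snapshot_date, key, value))

    def as_of(self, key, when):
        """Return the value from the latest snapshot on or before `when`
        (time-variant access), or None if there is no such snapshot."""
        candidates = [
            (d, v) for d, k, v in self._snapshots if k == key and d <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]

    def history(self, key):
        """Return every stored (date, value) pair for the key."""
        return [(d, v) for d, k, v in self._snapshots if k == key]


store = SnapshotStore()
store.load(date(1998, 1, 1), {"units_sold": 120})
store.load(date(1999, 1, 1), {"units_sold": 150})  # old value is kept
```

After the second load, a query as of mid-1998 still returns the 1998 value, while a query as of a later date returns the 1999 value. An operational database would instead overwrite the old value, so the earlier state could no longer be retrieved.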

2.4 Case studies and research design

There are several ways to conduct case studies [BRA99]. According to Yin [YIN94], a case study is advantageous as a method when:

“a “how” or “why” question is being asked about a contemporary set of events over which the investigator has little or no control”.

To be able to compare this case study with the case study conducted by Shanks and Darke [SHA98], it is desirable that the two resemble each other. To increase the resemblance, a research design similar to the one used by Shanks and Darke should be used. Using the same research design should lead to a similar implementation of the respective cases and therefore a higher probability of resembling cases. Although [SHA98] does not explicitly say which research design was used for the study, I decided to use a research design by Yin, which is mentioned in the research design area in an article by Darke, Shanks and Broadbent [DAR98]. Since Shanks and Darke are the authors of the framework, the method is unlikely to run counter to the one actually used. Yin’s book is also recommended on ISWorld’s website of case study references [MYE99].

Every empirical study has a research design, a plan, either explicitly or implicitly [YIN94].

Yin defines the research design as:

“...the logical sequence that connects the empirical data to a study’s initial research questions and, ultimately, to its conclusions.”

The research design is above all a plan that helps the researcher to achieve answers that relate to the initial research questions [YIN94].

Yin [YIN94] states five components of research design that are especially important:

1. a study’s questions

2. its propositions, if any,

3. its unit(s) of analysis,

4. the logic linking the data to the propositions, and

5. the criteria for interpreting the findings.


Study questions:

In determining what research method is appropriate for the study in hand, the researcher can review the form of questions that suits the study. Yin suggests that the choice of research method can be determined from the questions “who”, “what”, “where”, “how” and “why”. The case study is most appropriate when it is the “how” and “why” questions that are most important. In this step the researcher should clarify the nature of the questions that he/she is looking for, with regard to the forms described above. [YIN94]

Study propositions:

Study propositions are made to force the researcher to move in the right direction about what to examine in the study. If the study is exploratory, a purpose of the study should be stated instead. Criteria that will be used to determine whether or not the study has been successful should also be stated. [YIN94]

Unit of analysis:

The unit of analysis defines what the case is. A case can be anything from a person to an organization or country. [YIN94]

Linking data to the propositions:

This component and the next are the ones that have been least developed in case studies [YIN94]. In my interpretation, this component serves to let the researcher state how the data that is gathered shall be considered with regard to the study’s propositions or purpose. If propositions are made, the researcher should find evidence in the data that supports the propositions.

Criteria for interpreting the findings:

This component serves, according to my interpretation of it, to write down how to interpret the findings from the study, i.e. to make sure that the researcher, before conducting the study, issues criteria that will guide the researcher to judge whether the propositions are negated or supported by the study. Yin [YIN94] mentions that the issue of setting precise criteria can be problematic.


Research design

3 Research design

3.1 Study questions

To be able to aim at some goal with the interviews I have formulated some study questions at a high level. These questions will work as some kind of “overall questions” that the answers from the interviews shall try to answer. The study questions consist of two “how” questions:

• How does the company handle data quality in the data warehouse?

• How do the stakeholders of the data warehouse experience the quality of the data in the data warehouse?

3.2 Study propositions

Since this study is descriptive no propositions will be stated. Instead a purpose of the study will be used that enables a determination to be made whether or not the study has been successful.

The purpose of the interviews will be to gather as much information as is needed to be able to describe the data warehouse and its data quality issues and to be able to learn about the stakeholders’ experience of the data quality issues. The information will hopefully be sufficient to be able to make a comparison with the case study performed by Shanks and Darke [SHA98].


3.3 Unit of analysis

The unit of analysis is the stakeholders of a data warehouse in an international sales department within a large Swedish industrial company. Since the time of the project is limited, a single case is necessary in order to get some results within the time period. The aim is to interview at least one person from each group of stakeholders that Shanks and Darke [SHA98] have pointed out, i.e. data producer, data custodian, data consumer and data manager. By interviewing one person from each of these groups I think that I will get sufficient information to make a description and an analysis of the data warehouse and the data quality practices in the company.

3.4 Linking data to the propositions

I will use the framework proposed by Shanks and Darke [SHA98] as a way to explain how the data quality practices are handled in the data warehouse. The framework will provide help in evaluating the level of data quality in the data warehouse. The outcome of this evaluation will be compared with the outcome from the case study performed by Shanks and Darke.

3.5 Criteria for interpreting the findings

To be able to know if my findings are the ones that I have been looking for, I have set up some criteria that will help determine the value of the findings. The findings have to be of such a nature that they are applicable to the problem. It is important that the findings are based on the framework presented by Shanks and Darke [SHA98]. This framework provides help in determining the quality of the data in the data warehouse. The findings have to be comparable to the ones presented in the case study conducted by Shanks and Darke [SHA98]. This comparison will provide information on how the company in this case study handles data quality practices compared to another company that is situated in Australia.


The case

4 The case

4.1 Unit of analysis

The owner of the data warehouse is an international sales department within a large Swedish manufacturing company. The data warehouse has been up and running since 1997. The case study was performed in 1999.

There was a need for a better overview of sold units in the sales department, which led to the start of a data warehouse. The data warehouse was developed with help from the Information Technology department. The responsibility for the content in the data warehouse lies with the international sales department, while the responsibility for the data quality in the data warehouse informally lies with the Information Technology department.

Four people were interviewed in a total of three interviews. The first interview person was one of the developers of the data warehouse and is also one of the administrators. From this person I collected general information on the data warehouse and information on how the data warehouse was built and its purposes. I also gathered some information on data quality problems in the data warehouse and information on the data warehouse from a syntactic and semantic point of view.

The second interview person was the developer of the "source data"-section. This person is the administrator of the "source data"-section. From these persons I gathered information on how the data warehouse was built and its purposes. I also collected information on data quality problems in the data warehouse and information on the data warehouse from a syntactic, semantic and some from a pragmatic point of view.

The fourth interview person is a user of the system. The person is also part of the user group, which is a group that consists of users and administrators of the data warehouse. The purpose of this group is to ease the communication between the users and the administrators.

From this person I collected information on the data warehouse's purposes and data quality problems in the data warehouse. I also gathered information on the data warehouse from a semantic, pragmatic and some from a syntactic point of view.

4.2 The data warehouse

The data warehouse contains sales information for the department. The sales can be viewed in different ways, e.g. sales divided by the different models of the product, by geographic area, etcetera. The data warehouse also contains economic information on the sales, but there has been trouble gathering this information, which has led to a temporary halt in this information since the beginning of this year. In the future a spare parts section is also planned, which will show the sales of spare parts for a product.

The data warehouse gets its information from two operational databases. These databases are more detailed than the data warehouse. The data that is about to be transferred into the data warehouse first goes through a filter, which analyses the data and decides what data will be supplied to the data warehouse. This data is then transformed to a format that suits the data warehouse. The data is stored in a database and then loaded into the data warehouse program that the users use. The data warehouse can also be reached from the intranet via a web browser.
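The flow described above (filter, transform, load) can be illustrated with a minimal sketch. It is not the company's implementation, which was not examined at code level. The field names, the `record_type` rule used as the filter and the `UNASSIGNED` fallback are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class SourceRow:
    # Hypothetical fields of a detailed row in an operational database.
    order_id: str
    model: str
    country: str
    units: int
    record_type: str  # assumed values: "sale" or anything else


def select_for_warehouse(rows):
    """Filter step: keep only the rows the warehouse should receive."""
    return [r for r in rows if r.record_type == "sale"]


def transform(rows, market_area_by_country):
    """Transform step: aggregate detailed rows to (market area, model) totals."""
    totals = {}
    for r in rows:
        # Countries without a known market area are grouped under a marker
        # value instead of being dropped silently (an assumption).
        area = market_area_by_country.get(r.country, "UNASSIGNED")
        key = (area, r.model)
        totals[key] = totals.get(key, 0) + r.units
    return totals


def load(totals, store):
    """Load step: write the aggregated totals into the warehouse store."""
    store.update(totals)
    return store


def run_pipeline(rows, market_area_by_country, store):
    selected = select_for_warehouse(rows)
    return load(transform(selected, market_area_by_country), store)
```

The aggregation in the transform step reflects the observation that the operational databases are more detailed than the data warehouse.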

Since the data comes from a database that itself has received the data from other systems, the data often has gone through some control before it goes into the operational database that serves as a basis for the data warehouse. In this way, the data that is supposed to go into the data warehouse is often of good quality.

A total of 32 users have logged on to the data warehouse during the year. Not all 32 users are, however, users that use the data warehouse on a regular basis. The main users of the data warehouse are sales-responsible persons in the department. They use the data warehouse mostly to get a clear picture of how many units they sold the week before and thereby can do follow-ups and update the prognosis for the rest of the year.

There have been problems in communication between the people who decide what market areas exist and which countries belong to which market area, and the administrators of the data warehouse. A country has at times suddenly been moved to another market area, which has led to a data warehouse that has not always agreed with reality.

The administrators are aware of this problem and hope that some solution can be reached.

The main focus concerning data quality problems in the company is to get the sales numbers into the data warehouse as quickly as possible. Sometimes when something has been changed, the loading of the data into the data warehouse can be delayed by one day because of the changes that have to be made. This has to do with the communication problem described earlier.

4.3 The interviews

4.3.1 Syntactic data quality

The data seems to be syntactically correct. The data doesn’t suffer from differences in the syntax in different places in the data warehouse. The persons interviewed haven’t found any evidence showing that the data is inconsistent. Syntax-checking is used before the data is loaded into the data warehouse. If some data is syntactically wrong, it is automatically corrected before it goes into the data warehouse. There is, however, no guarantee that the number of sold units is correct. If something new comes in or some changes are made to the market areas, this is logged and can be reviewed by the administrators.
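As an illustration of this kind of syntax-checking, the following sketch normalises a record before loading and logs unknown market areas for review. The record layout, the two-letter country code format and the decision to reject uncorrectable records are my assumptions; the interviews did not reveal the actual rules.

```python
import logging
import re

logger = logging.getLogger("dw_load")

# Assumed format: two upper-case letters. The real coding scheme is unknown.
COUNTRY_CODE = re.compile(r"^[A-Z]{2}$")


def check_record(record, known_market_areas):
    """Return a syntactically corrected copy of record, or None if it
    cannot be corrected automatically."""
    fixed = dict(record)

    # Automatic correction: trim whitespace and normalise letter case.
    fixed["country"] = str(record["country"]).strip().upper()
    if not COUNTRY_CODE.match(fixed["country"]):
        logger.warning("uncorrectable country code: %r", record["country"])
        return None

    # Automatic correction: accept units given as text, e.g. "12 ".
    try:
        fixed["units"] = int(str(record["units"]).strip())
    except ValueError:
        logger.warning("uncorrectable unit count: %r", record["units"])
        return None

    # New market areas are accepted but logged for the administrators.
    if fixed["market_area"] not in known_market_areas:
        logger.info("new market area seen: %s", fixed["market_area"])

    return fixed
```

A check of this kind only establishes that a value is well formed. A unit count of 12 passes even if the true number sold was 21, which is why syntactic correctness gives no guarantee about the number of sold units.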

No corporate data model has been used for the development of the data warehouse. Standard definitions exist in the company, but these aren’t always applicable in the different departments. The administrators of the “source data”-section received instructions on which information the data warehouse should contain and could state the codes for the source data themselves.

The producers of the data haven’t had any special training on how to produce the data. The persons interviewed do not see any need for it since there isn’t any trouble with this aspect.

There is, however, standardisation of how the data should be written down in the system that serves as a basis for the data warehouse. No measures of syntax errors or the like are made.

4.3.2 Semantic data quality

The data warehouse and the data in it seem to be sufficient for the moment. Expanding the information in the data warehouse has been considered. There is, for example, a wish to be able to see what the different models are composed of, in order to get a better level of detail of the sales. With this information it would be easier to calculate the exact needs of the parts that the different models are built from. The data warehouse was also meant to contain information about sold spare parts, but there was some trouble implementing this section. This work is on hold for the moment, but should be taken care of in the future.

The data isn’t complete in the sense that everything relevant in reality should exist in the data warehouse. But the purpose of the data warehouse has been to show the sales in units in different market areas. The data warehouse would be more complete if there were more

The data isn’t ambiguous in the data warehouse. Ambiguity wasn’t recognised as a problem by any of the persons interviewed. Every data value in the data warehouse can be mapped to one real world system state. There has, however, been some trouble with the history of the data. When a product is sold on to a country other than the one it was originally sold to, the history of the product should also be moved to the new country. Changing the history of products can give rise to uncertainties about where the product was originally sold, where it is at the moment, etcetera.

The data in the data warehouse is meaningful. All data in the data warehouse can be mapped to reality. There isn’t anything in the data warehouse that shouldn’t be there. The data in the data warehouse is correct so far as the interviewed persons know. The only problem with this is that the markets change from time to time and the data warehouse has had some trouble keeping up with these changes.

There aren’t any controls that show whether or not the data covers everything, that it is correct and so on. If the users have something to complain about or some ideas about the data warehouse, they can talk directly to the administrators or use the user group to communicate their thoughts. For the moment the user group isn’t used so the communication goes directly to the administrators.

4.3.3 Pragmatic data quality

The data warehouse is very simple to work with. It has a graphical appearance and can also show the numbers in tables. The data warehouse supports the users in the way that they get a quick overview of the sales. For the most part, this is something that can be calculated and shown to the user through the operational databases, but this takes longer and is more difficult to obtain. There is some level of detail on the sales that the data warehouse provides, which can’t be gathered elsewhere. If the user wants graphics on the sales without using the data warehouse, this has to be done manually, which takes longer and is more complex.

Another good thing with the graphs produced from the data warehouse is that they can be used for presentations in the company. Before the data warehouse these graphs had to be produced manually.

The data is updated for the work at hand. Updates are done regularly every week. This is sufficient for the users. There aren’t any wishes to update the data warehouse more often. There has, however, been some trouble getting the information quickly enough into the data warehouse. The data warehouse is updated each Monday afternoon, but some weeks the data warehouse hasn't been updated until Tuesday. This mainly has to do with market area changes that have to be considered. The history of numbers reaches 3 years back in time, which the administrators think is enough for the users. From the user side, however, advantages can be found in having 5 years of numbers in the history instead of 3 years.
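The company does not measure how late the weekly update is. A simple measure of this kind could look like the sketch below, which is my own illustration and assumes that only the date of the latest load is recorded. The function names are hypothetical.

```python
from datetime import date, timedelta


def most_recent_monday(today: date) -> date:
    """Return the Monday on or before today (Monday has weekday() == 0)."""
    return today - timedelta(days=today.weekday())


def load_delay_days(last_load: date, today: date) -> int:
    """Number of days the latest expected weekly load is, or was, late.

    The load is expected on the most recent Monday. If the load has been
    done, the delay is the distance from that Monday to the load date.
    If it is still outstanding, the delay is counted up to today.
    """
    expected = most_recent_monday(today)
    if last_load >= expected:
        return (last_load - expected).days
    return (today - expected).days
```

With this definition a load on Monday gives a delay of 0 and a load on Tuesday gives a delay of 1, which corresponds to the one-day delays mentioned by the interviewees.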

The data is very easily understood. There isn't any room for misunderstandings. The data in the data warehouse is concise: it contains nothing more and nothing less than is needed. The data is very easy to work with. Easy access and the graphical interface make it simple to use. If the user wants numbers instead of graphs, this is no problem. The user can switch between the

data. The users trust the data and its sources. Some users, however, want to look at the numbers instead of the graphs.

No controls exist over how current the data is for the tasks at hand, but both the users and the administrators agree that the updates are kept at a good level. Graphics are used to make the usage of the data warehouse easier and help on definitions exists, although these definitions may be a little bit old. The users are very familiar with the current definitions anyway. An on-line handbook on how to navigate in the data warehouse also exists. Measures for updates and what the users think of the data in the data warehouse aren’t used.

4.4 Summarized answers

Syntactic data quality:

The data quality on the syntactic level is good. No corporate data model has been used in the development of the data warehouse. No measures of any kind are used.

Semantic data quality:

The data quality on the semantic level is also good, except in some respects for the property of data to be complete. There is a desire to have a greater level of detail on the units sold, plus sales on spare parts. No measures are used to ensure good semantic data quality.

Pragmatic data quality:

The data quality on the pragmatic level is very good. All properties are covered. No user surveys or the like are used to gather what the users think of the data.

5 Analysis of the case

5.1 Evaluation with the framework as a basis

The company has good data quality on all three levels in the framework. I believe the high syntactic data quality has much to do with the fact that the data warehouse is so small and that its development was simplified. The development was simplified mainly because the data was already of good quality, since it had been adjusted for other systems before. The syntax control of the data also contributes to its high quality.

The fact that the data warehouse is so small probably contributes to the high data quality on the semantic level. The size makes the maintenance of the data warehouse easy and good control over the content can be achieved. To start small and upgrade with more and more information will probably make it easier to maintain high data quality in the data warehouse.

When the size of the data warehouse grows the high data quality on the semantic level will be harder to maintain. The section with economic information on the sales in the data warehouse is, however, currently dormant. The company is working on a solution to this problem and hopes to solve it soon.

The biggest problem today seems to be the difficulties associated with market area changes and the consequences that these bring. The communication between the people that are in charge of the market areas and the administrators of the data warehouse falls short when market areas are about to change.

The high data quality on the pragmatic level is in my opinion mainly due to the easy and user-friendly interface. The user gets a quick overview of the sales and can switch between graphical mode and number mode. The ease of use is likely to increase the usage of the data warehouse. The existence of a web version that can access the data warehouse via the intranet will probably help to spread its usage around the company. The interval between updates seems to be kept at an appropriate level. Both the administrators and the users seem satisfied with this.

5.2 Comparison between the case studies

When compared with the case study performed by Shanks and Darke [SHA98] it looks as if the company in my case study (from now on called company B) has come a bit further. In the organisation in the case study performed by Shanks and Darke (from now on called company A), the syntactic and semantic data quality levels were well recognised, but not the pragmatic data quality level. In company B all three levels are of good quality, especially the pragmatic level.

Both case studies show that the goal of consistent data has been reached. Company A had developed a corporate data model that provided standardised data definitions in the organisation. Some kind of standardisation of definitions exists in company B, but when developing the data warehouse this wasn’t used. The administrators try, however, to use standardised definitions in the data warehouse. The syntactic level of data quality was the first level to address in both cases.

The goal of comprehensive and accurate data has been reached in both cases. While company A needs to contact customers to maintain the semantics of the data warehouse, company B can solve this internally in the company. This leads to a better starting point of achieving comprehensive and accurate data. It also requires less human resources.

At the pragmatic level it seems as though company B has reached further. The organisation (company A) seems to be pleased so far with having syntactically and semantically correct data in its data warehouse. The company (company B) has a pragmatic level of data quality that is well established.

The difference in the level of pragmatic data quality probably has to do with the usage of the data warehouse. The data warehouse in company A has little to do with decision-making, while the data warehouse in company B has very much to do with decision-making.

6 Analysis of the case study process

The research design was constructed according to Yin’s research design. This is some kind of guarantee that the case study has been carefully considered. The criteria for interpreting the findings have been fulfilled in a satisfactory manner.

The interviews were in-depth interviews performed with the framework as a basis for the questions. A weakness with the interviews is that, though most of the information was first-hand information, some of it was second-hand information. Another weakness is that only one user was interviewed. The possibility that this particular user stands out from all users cannot be excluded. In this case, however, my subjective opinion is that the user I interviewed seemed to be a regular user with no extreme views. I interviewed persons from all stakeholder groups except the data producer group. Although this is also a weakness, I think I received enough information to cover the questions in my research design and fulfilled the purpose of the interviews.

Most of the answers are confirmed in interviews with other people, which increases confidence in generalizing the answers for the company. My personal beliefs and interests have most likely affected the case study, but hopefully not in a way that reduces its credibility. I tried to stay as objective as I could. The interviews were recorded on tape, which guarantees the preservation of the original source.

My objectives have been reached to the extent that I had expected. The objectives differed in extent and importance. The main objective was a description of the data warehouse with help from the framework and, to some degree, the evaluation of data quality. The objective to analyse the data quality and compare it with the case study conducted by Shanks and Darke was of less importance, but still interesting to pursue. The main reason why this objective was of less importance is that the preconditions for an extensive analysis between the two case studies were not satisfied. This is because the description of the case study conducted by Shanks and Darke is brief, basically just a summary of the results of the study. As a result, no extensive analysis could be made, only a superficial one.

By reading about the case study conducted by Shanks and Darke I get the impression that it is about the same size and level as the one I have conducted. This is, however, largely a conjecture, but it serves as a basis to ground the analysis in. The two cases address data warehouses with different purposes. One is for maintaining customer data and the other is for presenting sales information in a company. Hence the two data warehouses contain different types of data. This is another reason why the comparison between the two cases shouldn’t be paid much attention in itself.

It is not possible to generalize the results in this case since this is a single case and the prerequisites to be able to conduct a case study similar to the one conducted by Shanks and Darke were not in place.

7 Conclusions and discussion

7.1 The case

The company could benefit by using some kind of corporate data model for future uses. The data warehouse will probably grow and the bigger it gets, the more difficult it will be to administrate. The usage of a corporate data model will be more important if the data warehouse is to be connected to or communicate with other data warehouses in the company.

User surveys, measures etcetera could be useful when determining what users want from the data warehouse. The user group, which is meant to work as a communication link between the department and the Information Technology department, is dormant today, and communication goes directly to the administrators. The user group should be used more to improve communication between users and administrators. Regular meetings could be held, at least twice yearly, to give the group some structure.

A higher level of detail is required in the data warehouse. Users are interested in being able to view the sales and get a better picture through different presentations of the models.

Implementing this could lead to better knowledge on sales, which increases knowledge of what and how many units to produce.

The company would gain by setting up conventions about the usage of the data warehouse. Since it is the Information Technology department that is responsible for the maintenance of the data warehouse, this department should state the conventions in cooperation with the international sales department. These conventions should state what, when and how to do things concerning the data warehouse. If, for example, a market area is about to change, the conventions should be followed in order to make sure that everything that needs to be done is done correctly from the beginning. There are some uncertainties about where the responsibility for data quality lies. In order to clarify this matter, the responsibilities of each stakeholder should be written down in the conventions. In this way there will not be any uncertainties concerning who should do what and where the responsibilities lie in respect of the data warehouse.

The web version of the data warehouse could be used more in the departments around the world. The company should inform potential users around the world about the data warehouse's existence and its benefits. The more users of the data warehouse there are, the better the utilization of the investment and, hopefully, the better and faster the decisions made.

7.2 The framework

I found the framework rather easy to use. There were, however, occasional difficulties in understanding what is meant by the terminology. I was confused concerning the goals and the properties. It could be hard to distinguish between these in some cases. It was sometimes a little tricky to differentiate between comprehensive (goal) and complete (property) and between accurate (goal) and correct (property). There is just a brief explanation of the terms involved. The term conciseness was, for example, not explained, which leads to a personal interpretation by the researcher. For the framework to be understandable in different companies and organisations around the world, the terms in the framework should be explained rigorously so that no doubts can be raised. The definitions could be extended with examples to help the companies and organisations to more easily identify themselves with the definitions and the framework.

How to apply the framework to understand the data quality in a company’s data warehouse wasn’t described in the paper. Only the framework itself was described. The goal of the framework should, in my opinion, be a framework easy enough for any company or organisation to use on their own and a framework that helps companies to improve the data quality in the data warehouse. Once the company has reached the goal of understanding its data quality, the framework should contain information on how to move on and improve data quality. A tutorial on how to use the framework could be helpful. Charts or tables could be used to help in visualising the results of an investigation. Some kind of weighting of the properties could also be used to extend the framework. In this way the companies and organisations could more easily learn the importance of the different properties.
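One possible form of such a weighting is a weighted mean over the properties, sketched below. This is my own illustration and not part of the framework by Shanks and Darke. The scoring scale, the weights and the function name are hypothetical, and an organisation would have to choose them itself.

```python
def weighted_quality_score(scores, weights):
    """Weighted mean of property scores.

    scores  maps property name -> assessed score (scale chosen by the user)
    weights maps property name -> relative importance (non-negative)
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score given for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[p] * w for p, w in weights.items()) / total_weight


# Hypothetical example on a 1-5 scale for some semantic properties.
semantic_scores = {"complete": 3, "unambiguous": 5, "meaningful": 5, "correct": 4}
semantic_weights = {"complete": 2, "unambiguous": 1, "meaningful": 1, "correct": 2}
semantic_level = weighted_quality_score(semantic_scores, semantic_weights)
```

A property with a high weight and a low score then stands out as the first candidate for improvement.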

7.3 Future work in the area

Further case studies with the framework as a basis should lead to improved usage of the framework and could also give some feedback to the authors of the framework, which hopefully will help them in further developing the framework. The more case studies conducted, the better the ground is laid to be able to generalize the results.

8 References

[BRA99] Braa, K. and Vidgen, R., 1999, Interpretation, intervention, and reduction in the organizational laboratory: a framework for in-context information system research. Accounting Management and Information Technologies, 9(1): 25-47.

[DAR98] Darke, P., Shanks, G. and Broadbent, M., 1998, Successfully completing case study research: combining rigour, relevance and pragmatism. Information Systems Journal, 8(4): 273-289.

[INM94] Inmon, W. H. and Hackathorn, R. D., 1994, Using the data warehouse, John Wiley and Sons, New York.

[INM96] Inmon, W. H., 1996, Building the data warehouse (Second edition), John Wiley and Sons, New York.

[MCF96] McFadden, F. R., 1996, Data Warehouse for EIS: Some issues and impacts, In Proceedings of the Twenty-ninth Hawaii International Conference on Systems Sciences, Vol. II, (Nunamaker J. F. and Sprague R. H., eds.), Los Alamitos: IEEE Computer Society Press, pp 120-127.

[MYE99] Myers, M. D., 1999, References on case study research. http://www.auckland.ac.nz/msis/isworld/case.htm (as is: 27th May 1999).

[ORR98] Orr, K., 1998, Data quality and systems theory. Communications of the ACM, 41(2): 66-71.

[POW99] Powell Publishing, Inc., 1999, http://www.datawarehouse.com/sigs/survival/k_inmon.htm (as is: 12th May 1999).

[RED98] Redman, T. C., 1998, The impact of poor data quality on the typical enterprise. Communications of the ACM, 41(2): 79-82.

[SHA97] Shanks, G., O’Donnell, P. and Arnott, D., 1997, Data Warehousing: Lessons from the Field, In Proceedings of the 2nd Australian Data Administration Management (DAMA) Conference, Sydney.

[SHA98] Shanks, G. and Darke, P., 1998, Understanding Data Quality in Data Warehousing: A Semiotic Approach, In Proceedings MIT Conference on Information Quality, (Chengilar-Smith I. and Pipino L., eds.), Boston, pp 247-264.

[STR97] Strong, D. M., Lee, Y. W. and Wang, R. Y., 1997, Data quality in context. Communications of the ACM, 40(5): 103-110.

[TAY98] Tayi, K. G. and Ballou, D. P., 1998, Examining data quality. Communications of the ACM, 41(2): 54-57.

[WAN96] Wand, Y. and Wang, R., 1996, Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11): 86-95.

[WA+96] Wang, R. Y., Strong, D. and Guarascio, L. M., 1996, Beyond accuracy: What data quality means to data consumers, Journal of Management Information Systems, 12(4): 5-33.

[WAN95] Wang, R. Y., Reddy, M. P. and Kon, H. B., 1995, Toward quality data: An attribute-based approach. Decision Support Systems, 13(3,4): 349-372.

[YIN94] Yin, R. K., 1994, Case study research: design and methods, 2nd edn. Sage Publications, Thousand Oaks.