A Model for Company Document Digitization (CODED)


Proposal for a Process Model for Digitizing Company Documents

ANTON BOTHIN

KTH

SKOLAN FÖR ELEKTROTEKNIK OCH DATAVETENSKAP


Degree Programme in Information and Communication Technology

Date: December 9, 2020

Supervisor: Mira Kajko-Mattsson

Examiner: Leif Lindbäck

School of Electrical Engineering and Computer Science

Host company: Peter Eriksson Fastighets AB

Swedish title: En modell för dokumentdigitalisering inom företag

Swedish subtitle: Förslag till en modell för att digitalisera företagsdokument


© 2020 Anton Bothin


Abstract

Many companies wish to transition toward a more digital workflow. However, many of them lack the technical expertise required to undertake such an endeavor. To assist companies in this area, a digitization process model could be used as a stepping-stone toward successful digitization. Currently, however, no such digitization process model exists.

The purpose of this thesis is to suggest such a digitization process model. The goal is to help companies in digitizing their documents and their workflow. The research question used to reach this goal pertains to how a digitization process model should be structured.

Because no digitization process models currently exist, different process models within the field of software engineering were analyzed as a basis. The research was qualitative and explorative in nature, and it followed design science as its research paradigm. An extensive literature study was conducted before development of the model began. The model was evaluated using interviews together with action research. These interviews focused on evaluating the model based on five predefined criteria: (1) interviewee credibility, (2) semantic correctness, (3) syntactic correctness, (4) usefulness, and (5) process flexibility.

The result of this thesis is the company document digitization process model (CODED), which, as the name suggests, is a proposed process model for document digitization. The model is based on information gathered partly from the literature study and partly from the interviews. The literature study showed the model to be unique, since no similar model was found to exist prior to this thesis. The interviews showed the model to be valid, since it fulfilled all of the defined evaluation criteria.

Keywords

Digitization, Document storage, Process model, Qualitative research, Workflow optimization


Summary (Sammanfattning)

Many companies want to move toward digitizing their workflow. However, many of these companies lack the technical expertise required. To assist companies with this, a process model could be used as a tool for successful document digitization. The problem is that no such process model for document digitization currently exists.

The purpose of this thesis is to propose a process model for digitization. The goal is to help companies digitize both their documents and their workflow. How such a process model could be structured is the research question of this report.

There was no existing model to build upon. Therefore, other models within the field of software engineering were used as a basis during the research. The research was qualitative and explorative, and it used design science as a research paradigm. An extensive literature study was carried out before development of the process model began. The model was evaluated through interviews together with action research. The interviews focused on evaluating the model against five criteria: (1) credibility, (2) semantic correctness, (3) syntactic correctness, (4) usefulness, and (5) flexibility.

The result of this thesis is a proposed process model for digitizing documents, named CODED (company document digitization process model). The proposed model is based both on information gathered from the literature study and on information from the interviews. The literature study showed that the process model is unique, since no similar model existed previously. The interviews showed that the model is valid, since it fulfilled the defined evaluation criteria.

Keywords (Nyckelord)

Digitization, Document storage, Process model, Qualitative research, Workflow optimization


Acknowledgments

A big thanks is extended to PEFAB and the people working there. I would like to thank them for introducing me to the problem of document digitization, and for allowing me to evaluate the CODED process model at their company.

I would like to extend a special thanks to the people who have been directly involved in giving feedback on the model. This includes both the people at PEFAB and the two industrial respondents who took time from their busy schedules to familiarize themselves with the CODED model. Without these people, the thesis could not have been completed. Finally, I would like to express my gratitude to my supervisor, Associate Professor Mira Kajko-Mattsson, for the useful feedback and engagement provided throughout the writing process.

Stockholm, December 2020

Anton Bothin


List of Figures

2.1 Illustration of a relational database

2.2 Illustration of a document-data model

3.1 Overview of research strategy

3.2 Design science paradigm [1]

3.3 The thesis’ research phases

5.1 Activity groups included in the preliminary CODED model

6.1 Activity groups included in the Improved CODED model, part 1

6.2 Activity groups included in the Improved CODED model, part 2

6.3 Blueprint of the Improved CODED model, part 1

6.4 Blueprint of the Improved CODED model, part 2

6.5 Blueprint of the Improved CODED model, part 3


List of Tables

1.1 Definitions of Process and Process Model

2.1 Definitions of One-to-Many and Many-to-Many

2.2 Differences between the Relational model and the Document-data model

4.1 Interview questionnaire

7.1 Interview questionnaire, final round of evaluation


List of acronyms and abbreviations

CODED Company Document Digitization

PEFAB Peter Eriksson Fastighets AB


Chapter 1 Introduction

During the last few years, a continuous upward trend in digitization has been observed: more and more companies are transitioning toward becoming completely, or at least predominantly, digital. In fact, over 90 % of the world’s data had already been digitized by the year 2013 [2]. It is therefore not hard to imagine that companies which have yet to become digitized may find themselves left behind.

Digitizing company documents is not always straightforward. For some companies, it may be enough to scan documents and store them, either locally or using one of the many existing cloud services. For many companies, however, this is not enough. Even if the documents are stored digitally, this will not, by itself, create a digitized workflow, and a digitized workflow is what most companies are after.

1.1 Background

To digitize the workflow, a custom software solution will most likely need to be developed, and this software solution will presumably be modeled after the current workflow. Therefore, the digital storage of documents should also mimic the existing document structure. This is only a small part of the digitization lifecycle; there are several other steps that the company may need to keep in mind. For instance, they may need to perform software or hardware migration, they may require additional personnel or training, and they might even need to redefine their business model.


The digitization process is a difficult one, and it requires both technical expertise and knowledge about the particular company [3,4]. This difficulty stems from the many different things that need to be considered during a digitization endeavor. Here, a robust process model could be of great help. The process model should serve as a stepping-stone, guiding companies during their digitization process. Without such a process model, the company may end up with an inadequate result.

To summarize, many companies currently work with physical documents, which forces their workflow to be manual. In order to digitize their workflow, they also need to digitize their documents. With so many different things that need consideration, achieving a satisfactory result becomes increasingly difficult.

1.2 Problem

The problem addressed in this thesis is that, to the best of the author’s knowledge, there currently exists no general digitization process model to help companies tailor their own digitization solutions. Such a process model would greatly aid companies in successfully digitizing both their documents and their workflow, since it would deal with many of the difficulties mentioned in Section 1.1.

1.3 Research Question

To solve the previously mentioned problem regarding the difficulty of carrying out a successful digitization endeavor without a proper process model, the following research question has been posed:

How should a digitization process model be structured: which activities are to be included, how should they be organized, and which guidelines should be put in place to help with the digitization process?

This research question served as the basis for this thesis and for the overall research process.


1.4 Purpose

The purpose of this thesis is to suggest a digitization process model. This process model, named Company Document Digitization (CODED), will systematically list a series of activities that are to be accomplished in order to help companies fulfill their specific needs when undergoing a digitization endeavor.

1.5 Goals

The goal of this thesis is twofold. First, the short-term goal is to help companies in successfully digitizing both their documents and their workflow. Second, the long-term goal is to advance research within the field of digitization.

1.6 Research Method

The focus of this thesis was placed on how a good digitization process model should be designed. This required in-depth knowledge within the field of digitization; a qualitative research method was therefore chosen. In addition, the study followed design science as a research paradigm [1]. An extensive literature study was conducted as the main method for gathering information. Interviews were also conducted in order to gather feedback about the process model; the research is therefore also explorative in nature.

During the later evaluation phase of the study, action research was used. This was done by implementing parts of the process model in an industrial setting.

In summary, a qualitative literature study was conducted in order to gather information about different digitization models. The information gathered was then used to design the CODED process model. This was complemented by interviews with industry experts. The model was then further evaluated using action research. Further details of the research method are presented in Chapter 3.


1.7 Commissioned Work

A company that is currently undergoing a digitization endeavor is PEFAB, which owns and manages commercial properties in Stockholm [5]. Currently, PEFAB manually creates its end-of-year financial statements using binders. This takes both time and resources and is, overall, a very inefficient method. The company would therefore like to develop a digital document storage solution that mimics the structure of its end-of-year financial binders. This would allow the company to digitize the workflow, that is, the construction of these binders, which would substantially speed up the end-of-year work. To assist in this endeavor, PEFAB has commissioned the author of this thesis to analyze different digitization models and to help choose a suitable model for digitizing its documents and workflow. A more comprehensive description of the commissioning company can be found in Appendix A.

1.8 Target Audience

The target audience for this thesis is both industry and academia. Within industry, the thesis is directed at all software and non-software companies that need help with their digitization process. The results of this thesis will also be useful for academia. There currently exists no digitization process model focused on helping companies tailor their own digitization solutions; this thesis can therefore be used as a basis for further research within the field.

The thesis can also be used for education to highlight different aspects that are of importance during a digitization process.

1.9 Scope and Limitations

This thesis focuses on the process of digitizing documents in order to achieve a digitized workflow, that is to say, when it is not simply enough to store the documents digitally. Such a process can be quite exhaustive, and due to time constraints some limitations have been put in place:

• When digitizing documents, one has to choose a database model to be used for digital storage. There exists a plethora of different models, such as navigational, hierarchical, relational, object-oriented, network, and NoSQL databases [6]. Comparing all these database models to one another could constitute its own research report, and is therefore beyond the scope of this thesis. Thus, during the action research, when some steps of the model were implemented, only two database models were compared: the relational model and the document-data model.

• Some companies require the database to be distributed, whether for reliability, performance, or scalability reasons. This introduces a whole new layer of complexity and is outside the scope of this thesis. The focus, during the action research, is on databases that are stored on a single central server.

• When considering document storage, security has to be taken into account. The topic of secure storage is, however, vast and would warrant a research paper of its own. Security is therefore not covered in this thesis.

Table 1.1 – Definitions of Process and Process Model

Process: A set of related activities that leads to the production of a software system [7].

Process Model: A simplified representation of a software process [7].

1.10 Terminology

This thesis uses standardized terminology in the field of software engineering.

There are, however, some terms used in this thesis that do not have a widely agreed-upon definition, mainly process and process model. The definitions of these terms are presented in Table 1.1. For the sake of clarity, these terms are also expanded upon below.

A process, as stated in Table 1.1, is a set of activities which, upon completion, will result in a software system. Processes are often outlined by some process model. The process model will, as a baseline, contain a set of activities and a partial order in which they are to be performed. A process model can, however, also include instructions for how to choose the activities and which roles should perform them [7].
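The notion of a set of activities with a partial order can be made concrete with a small sketch. The Python example below is illustrative only: the activity names are hypothetical and are not taken from the CODED model. Each activity is mapped to the activities that must be completed before it, and any ordering that respects these prerequisites counts as a valid way of performing the process.

```python
from graphlib import TopologicalSorter

# Hypothetical activities (not from the CODED model). Each activity maps to
# the set of activities that must be completed before it can start.
process_model = {
    "segment_data": set(),
    "choose_database": set(),
    "extract_text": {"segment_data"},
    "store_documents": {"choose_database", "extract_text"},
}


def is_valid_order(model, order):
    """Return True if `order` performs every activity exactly once and
    never starts an activity before all of its prerequisites are done."""
    if sorted(order) != sorted(model):
        return False
    done = set()
    for activity in order:
        if not model[activity] <= done:
            return False
        done.add(activity)
    return True


# One of possibly several orderings permitted by the partial order.
one_valid_order = list(TopologicalSorter(process_model).static_order())
```

Because the order is only partial, "segment_data" and "choose_database" may be performed in either sequence, while "store_documents" always has to come last.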


1.11 Benefits, Ethics, and Sustainability

The results of this thesis should benefit any company that is currently undergoing, or is planning to undergo, a digitization endeavor. The main focus, however, is placed on companies that specifically plan on mimicking their current workflow in a digital environment. Even though focus has been placed on industrial adaptations, academia should also benefit. The suggested process model can be used within academia to illustrate relevant steps in a digitization process.

The subject matter of this thesis also touches upon ethics and sustainability. One goal of this thesis is to help companies succeed in their digitization endeavor, which should in turn reduce the amount of paper used within the company. Reducing the amount of paper is beneficial from both an ethical and a sustainability perspective: not only does it reduce the need for deforestation, it also reduces the amount of travel within companies, since documents no longer need to be hand-delivered. While writing this thesis, the IEEE Code of Ethics [8] was followed to ensure that the research upheld ethical standards. The ethical requirements pertaining to qualitative research were also followed [9].

1.12 Thesis outline

The subsequent chapters of this thesis are outlined in the following manner:

• Chapter 2: Digitization: This chapter presents the theoretical and practical background needed to understand the remainder of the thesis.

• Chapter 3: Research Methodology: This chapter describes the research methodology. It presents the type of research, research strategies, research phases, and research instruments. This chapter also tackles possible threats to the validity of the research results.

• Chapter 4: Evaluation Model: This chapter presents the evaluation model, which will be used to evaluate the suggested digitization process model.

• Chapter 5: Preliminary CODED Model: This chapter presents the preliminary design of the CODED model. It will go through the structure of the model, the included activities, and the structure of these activities. The first round of evaluation is also presented and analyzed.


• Chapter 6: Improved CODED Model: This chapter presents the improved design of the model. The structure of this chapter is similar to that of Chapter 5.

• Chapter 7: Partial evaluation of the model within PEFAB: This chapter describes how well the CODED model works within PEFAB, that is to say, how well the solution satisfies their needs. The chapter will go through the implementation of parts of the CODED process model, and evaluate this implementation by interviewing representatives from PEFAB.

• Chapter 8: Analysis and Discussion: This chapter compiles, analyzes, and discusses the results of the thesis. It also explains how the validity threats discussed in Chapter 3 were addressed.

• Chapter 9: Conclusions and Future work: This chapter discusses the results of the research, makes conclusions, and proposes future work.


Chapter 2 Digitization

This chapter goes through the background knowledge needed to understand the rest of the thesis. The chapter starts by introducing different digitization techniques in Section 2.1. Depending on the digitization endeavor, some of these techniques may not need to be utilized. Thus, only a portion of the techniques presented are utilized during the partial evaluation of the CODED process model, which is presented in Chapter 7. All relevant techniques found during the literature study are nevertheless presented, since they are implicitly part of the included activities pertaining to the development process. When each technique could be applied is specified in Section DP. The overview of digitization techniques is followed by an overview of two database models that can be used for digital document storage; this overview is provided in Section 2.2. Only two database models are presented due to time and space restrictions. When selecting a database in a real-life scenario, a more exhaustive comparison should be performed. Alongside the digitization techniques, the comparison between the two database models is also used during the partial implementation of the CODED model, described in Chapter 7. Finally, Section 2.3 goes through general process steps gathered from related publications. The entire CODED process model is based upon these steps.


2.1 Digitization Techniques

This section covers different techniques, or steps, that can be used when digitizing physical documents. All techniques presented in this section have been researched extensively. However, due to space and time limitations, they are not presented in detail. Instead, each is only introduced as a possible step toward digitization. The two techniques presented are data segmentation and automatic identification and data capture. These are presented in Subsections 2.1.1 and 2.1.2, respectively.

2.1.1 Data Segmentation

Before a company can effectively digitize their documents, they first have to segment the data that these documents contain. This can be done in several different ways. One of the most common ways, however, is to categorize the data based on its type: text, numerical, Boolean, audio/visual, and so forth.

There exist some techniques for automatic data segmentation, most pertaining to image segmentation, which is the act of automatically splitting an image into related segments [10]. Image segmentation could, for example, be used to isolate text segments from a PDF. It is, however, often the case that data segmentation needs to be done manually. Data segmentation can therefore be a very costly process, both in terms of time and money.
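As a minimal sketch of type-based segmentation, the Python example below sorts the fields of an already-extracted document into text, numerical, and Boolean segments. The field names, the sample values, and the classification rules (for instance, treating "yes" and "no" as Boolean and accepting a decimal comma) are assumptions made for illustration; a real endeavor would need rules tailored to the company's documents.

```python
def classify_value(raw: str) -> str:
    """Assign a raw field value to a coarse data type (assumed rules)."""
    value = raw.strip()
    if value.lower() in {"true", "false", "yes", "no"}:
        return "boolean"
    try:
        # Accept a decimal comma as well as a decimal point.
        float(value.replace(",", "."))
        return "numerical"
    except ValueError:
        return "text"


def segment_by_type(fields: dict[str, str]) -> dict[str, list[str]]:
    """Group field names by the data type of their values."""
    segments: dict[str, list[str]] = {"text": [], "numerical": [], "boolean": []}
    for name, raw in fields.items():
        segments[classify_value(raw)].append(name)
    return segments


# Hypothetical fields taken from a single invoice.
invoice_fields = {
    "customer": "Example AB",
    "amount": "1250,50",
    "paid": "yes",
    "reference": "2020-117",
}
segments = segment_by_type(invoice_fields)
```

Here "reference" ends up in the text segment even though it consists of digits, which hints at why fully automatic segmentation is hard and manual review is often required.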

2.1.2 Automatic Identification and Data Capture (AIDC)

Automatic Identification and Data Capture (AIDC) is an umbrella term for any method that automatically identifies data, collects it, and enters it into a computer system. Some common technologies that would be classified as part of AIDC include: QR codes, bar codes, and biometrics. [11]

For some companies, it is enough to store their documents as digital hard copies, meaning that only an image of the document is stored. However, many companies need to store their documents as soft copies. In this context, a soft copy is a computer-readable version of the document, which requires extraction of the document’s content. Here, AIDC techniques can be used for extracting text, as well as other data, from the physical documents. There exists a plethora of different techniques used for extracting different types of data. However, documents most commonly contain pure text; therefore, this subsection describes only a single AIDC technique: Optical Character Recognition (OCR).

Table 2.1 – Definitions of One-to-Many and Many-to-Many

One-to-Many: A table entry of Table A may be linked to many table entries of Table B, while an entry of Table B may only be linked to a single entry of Table A [13,14].

Many-to-Many: A table entry of Table A may be linked to many table entries of Table B, and an entry of Table B may be linked to many table entries of Table A [13,14].

OCR is a technique used for automatically converting handwritten or printed text into computer-readable text. This is an incredibly useful technique. One thing that needs to be kept in mind, however, is that the document will usually be separated into isolated text segments, and a correct ordering of these segments is not guaranteed [12].
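The ordering problem can be illustrated with a short sketch. The Python example below does not call any OCR library; it assumes that an OCR step has already produced text segments with the coordinates of their top-left corner, and it assumes a simple single-column layout. Under those assumptions, segments are grouped into lines by their vertical position and then read from left to right. The `TextSegment` type and the `line_tolerance` parameter are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass


@dataclass
class TextSegment:
    text: str
    x: float  # left edge of the segment
    y: float  # top edge of the segment, growing downward


def reading_order(segments, line_tolerance=5.0):
    """Sort segments top-to-bottom, then left-to-right within a line.

    Segments whose top edges lie within `line_tolerance` of the first
    segment of a line are treated as belonging to that line.
    """
    lines = []
    for seg in sorted(segments, key=lambda s: s.y):
        if lines and abs(seg.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(seg)
        else:
            lines.append([seg])
    return [seg for line in lines for seg in sorted(line, key=lambda s: s.x)]


# Hypothetical OCR output, delivered in arbitrary order.
unordered = [
    TextSegment("1250", x=120, y=41),
    TextSegment("Invoice", x=10, y=10),
    TextSegment("Total:", x=10, y=40),
    TextSegment("2020", x=120, y=11),
]
ordered_text = " ".join(seg.text for seg in reading_order(unordered))
```

Multi-column pages, tables, and rotated text would break this heuristic, so it should be read as a starting point rather than a general solution.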

2.2 Digital document storage

This section provides a limited overview of two different database models that can be used for digital document storage. Subsection 2.2.1 introduces the terminology that will be used, along with its definitions. Afterwards, Subsections 2.2.2 and 2.2.3 give a brief description of relational and NoSQL databases, respectively.

2.2.1 Definitions

The terminology used in Section 2.2 refers to different types of relationships that can be used within the different database models, mainly one-to-many and many-to-many.

Table 2.1 contains the definitions that will be used during the remainder of this thesis. To make these definitions clearer, an example of each relationship is given below:


Figure 2.1 – Illustration of a relational database

• One-to-Many: Suppose we have two tables, one containing books and one containing pages. Each book is then linked to several pages, while each page is linked to only a single book.

• Many-to-Many: Suppose we instead have one table containing books and one table containing authors. Each book may then be linked to multiple authors, and each author may be linked to multiple books.
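The two examples above can be sketched with plain Python data structures. The identifiers ("B1", "A1", and so on) and the helper functions are hypothetical and serve only to show where each kind of link is stored: a one-to-many link lives on the "many" side, while a many-to-many link needs a separate collection of pairs.

```python
# One-to-many: every page records the single book it belongs to.
pages = [
    {"page_id": 1, "book_id": "B1"},
    {"page_id": 2, "book_id": "B1"},
    {"page_id": 3, "book_id": "B2"},
]

# Many-to-many: each (book, author) pair is stored as its own entry.
authorship = [("B1", "A1"), ("B1", "A2"), ("B2", "A1")]


def pages_of(book_id):
    """All pages linked to one book."""
    return [p["page_id"] for p in pages if p["book_id"] == book_id]


def authors_of(book_id):
    """All authors linked to one book."""
    return [author for book, author in authorship if book == book_id]


def books_by(author_id):
    """All books linked to one author."""
    return [book for book, author in authorship if author == author_id]
```

In this sketch a page can never belong to two books, because it has room for only one `book_id`, whereas book "B1" has two authors and author "A1" has two books.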

2.2.2 Relational databases

The relational model has been around for a long time, having been proposed as early as the 1970s [15]. The model is based on relational algebra, a query language that takes instances of relations as both input and output, which makes it both simple and reliable. This has led to widespread usage, and it is now considered one of the most popular database models [16,17].

An example of a relational database model is shown in Figure 2.1. As one can see, a relational database consists of several tables. These tables are split into rows and columns: a row signifies a data entry, while the columns represent the types of data that each entry contains. One can access an entry either by using its identifier, also called the primary key, or by searching for specific data values using queries. These queries are written in Structured Query Language (SQL), which is a declarative programming language. Different tables can relate to each other, which is done using columns containing foreign keys. These keys point to the primary keys of other tables, creating a relationship between those table entries [13,14,16,17].

Relational databases support one-to-many relations, and many-to-many relations can be achieved by creating a table with several one-to-many relationships. A one-to-many relation can be seen in Figure 2.1 between employees and documents: an employee can be linked to several documents, but each document can only be linked to one employee. A possible many-to-many relation can also be found between employees and deleted documents. Each employee can be linked to several deleted documents, and each deleted document could theoretically be linked to several employees, although this should not happen in practice. The structure of a relational model is less complex compared to other database models, which is a result of the simplicity of implementing different types of relations between tables [13,14,16,17].
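A minimal sketch of the employee and document relation, using Python's built-in `sqlite3` module with an in-memory database, may help make primary keys, foreign keys, and queries concrete. The table and column names below are assumptions chosen for illustration and are not taken from Figure 2.1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked to
conn.executescript("""
CREATE TABLE employee (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE document (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    -- Foreign key: each document points to exactly one employee.
    employee_id INTEGER NOT NULL REFERENCES employee(id)
);
""")
conn.executemany(
    "INSERT INTO employee (id, name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob")],
)
conn.executemany(
    "INSERT INTO document (id, title, employee_id) VALUES (?, ?, ?)",
    [(10, "Invoice", 1), (11, "Lease", 1), (12, "Receipt", 2)],
)

# Access a single entry through its primary key.
by_key = conn.execute(
    "SELECT title FROM document WHERE id = ?", (11,)
).fetchone()

# Search by data value, following the foreign key with a join.
alice_docs = conn.execute(
    "SELECT d.title FROM document d "
    "JOIN employee e ON e.id = d.employee_id "
    "WHERE e.name = ? ORDER BY d.id",
    ("Alice",),
).fetchall()
```

A many-to-many relation would be added in the same style, through a third table that holds one foreign key to each of the two tables it links.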

2.2.3 NoSQL databases

Many different database models have been grouped together under the NoSQL label. The common subcategories within NoSQL are key-value stores, document-data models, column-family stores, and graph data models [18]. Due to time constraints, covering all NoSQL databases lies outside the scope of this thesis; instead, only the document-data model is covered. A comparison between a wider variety of database models could be carried out in future research.

In the document-data model, each entry is stored as a document containing fields with data. Just as with the relational model, an entry can be found by submitting a query to the database based on relevant fields. A unique identifier is usually also included in the document, which allows the database to function similarly to a key-value store if required [18].

Figure 2.2 – Illustration of a document-data model

Table 2.2 – Differences between the Relational model and the Document-data model

Relational model               | Document-data model
Used for relational storage    | Used for hierarchical storage
Structure is predefined        | Structure is dynamic
Supports vertical scalabilityⁱ | Supports horizontal scalabilityⁱⁱ

ⁱ Increase capacity of existing hardware/software by adding resources.
ⁱⁱ Increase capacity by connecting several hardware/software entities, distributing the workload.

There is no predefined structure that the documents need to follow. A new field can be added to future document entries without needing to change all existing entries. Figure 2.2 shows an illustration of a document-oriented database. Here, it can be seen that all information is stored in a single document. This differs from the relational database shown in Figure 2.1, where documents and deleted documents were stored in their own tables. In the document-data model, one-to-many relationships are modeled using array fields. Examples of this are the created documents and deleted documents fields shown in Figure 2.2. Many-to-many relations cannot be modeled cleanly without duplicating data [18].
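The document-data model described in this subsection can be sketched with plain Python dictionaries standing in for stored documents. No particular database product is implied: the `_id` field name, the helper functions, and the sample values are assumptions made for illustration, while the created documents and deleted documents array fields follow the description above.

```python
# Each employee is one self-contained document. The one-to-many links to
# created and deleted documents are embedded as array fields.
employees = [
    {
        "_id": "e1",
        "name": "Alice",
        "created_documents": [{"title": "Invoice"}, {"title": "Lease"}],
        "deleted_documents": [],
    },
    {
        "_id": "e2",
        "name": "Bob",
        "created_documents": [{"title": "Receipt"}],
        "deleted_documents": [{"title": "Draft"}],
        # A new field on a later entry; older entries need no change.
        "office": "Stockholm",
    },
]


def find(collection, **criteria):
    """Query by field values; documents lacking a field simply do not match."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


def get_by_id(collection, doc_id):
    """Key-value style access through the unique identifier."""
    index = {doc["_id"]: doc for doc in collection}
    return index.get(doc_id)
```

Looking up "e1" returns the employee together with all created documents in a single read, which is the hierarchical style of storage that the model favors.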

2.2.4 Differences

The relational model is built upon relational algebra, while the document-data model is based on traditional document storage [16,17,18]. Therefore, these two models can be seen as very different, both in terms of their structure and their functionality. An exhaustive list of comparisons is thus too far-reaching to present within this thesis. Instead, the most relevant and impactful differences have been selected; these differences are presented in Table 2.2.

The most apparent difference between the two models might be the structure. The relational model is, as the name suggests, used for relational storage, meaning that it excels at storing and linking data that is connected in any way. This differs from the document-data model, which has a hard time linking data from different documents, or storage containers, without extensive data duplication.

The document-data model does not excel at linking data; instead, it excels at storing unstructured or semi-structured data. It does not require a defined structure; thus, several different types of documents can be stored and accessed in a transparent manner. This is useful if a company wants to store soft copies of data from many different types of documents, that is, documents that may have very different structures containing different types of data. This is not possible in a relational database, since the linking of tables requires a predefined structure.

The final difference, which is only touched upon briefly, is the scalability of the two models. The tables included in a relational database are often tightly connected; thus, it becomes difficult to distribute the database over several machines, that is to say, to scale the database horizontally. This is not a problem for the document-data model, where each document can be detached without any major problems for the overall structure. Both models support vertical scalability, which denotes the possibility of increasing capacity by introducing additional resources. However, the effect this has on the document-data model is not as visible, since querying is usually done sequentially.
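To illustrate why self-contained documents are comparatively easy to scale horizontally, the sketch below assigns each document to one of several machines based on a hash of its identifier. This is a simplified illustration under assumed names (`shard_for`, `distribute`, and the `_id` field are hypothetical); it is not how any specific database distributes data, and distributed storage itself remains outside the scope of this thesis.

```python
import hashlib


def shard_for(doc_id: str, shard_count: int) -> int:
    """Map a document identifier to a shard index in a stable way."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribute(documents, shard_count):
    """Place each self-contained document on exactly one shard."""
    shards = [[] for _ in range(shard_count)]
    for doc in documents:
        shards[shard_for(doc["_id"], shard_count)].append(doc)
    return shards


# Hypothetical documents; only the identifier matters for placement.
documents = [{"_id": f"doc-{i}"} for i in range(10)]
shards = distribute(documents, 3)
```

Because no document refers to data stored in another document, a lookup by identifier only has to contact the one shard that `shard_for` returns. Tightly linked relational tables offer no equally simple split, since a join may need rows that ended up on different machines.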

2.3 Process models in publications

Two different process models may vary greatly, both in their design and structure. However, by reading through several publications pertaining to the creation of process models, some insights about general trends can be gathered. This was done during the middle parts of the literature study, which is described in Subsection 3.3.1; the general trends gathered are presented herein. In most cases, one could divide the process model into two distinct phases: (1) an Inception phase, and (2) an Implementation phase. The inception phase is presented in Subsection 2.3.1. Here, the company should identify a problem or need, and decide whether this problem or need requires rectification. During the second phase, labeled the implementation phase, development occurs. This phase is described and elaborated upon in Subsection 2.3.2.

2.3.1 Inception Phase

The first phase of any process model is identifying a need. There are, however, many steps that should be taken before deciding to proceed with the endeavor. The number of steps, and their purpose, can vary between different process models. However, some steps always seem to appear in one form or another.

Most research found focused on migrating or reengineering current software systems. In these cases, the company already has a system in place. Hence, during the inception phase, the company needs to take several steps in order to determine whether the system contains enough business value for it to be kept operational. [7] However, in the case of digitization, there is often no current software system in place. Therefore, these steps are not deemed to be relevant.

The remaining steps, which appear throughout other research publications, are the following: (1) Identify a need, (2) Evaluate current process, (3) Evaluate competence within company, (4) Assess business value, (5) Assess cost and effort, (6) Conduct feasibility study, and (7) Decide how to proceed. These steps have been conceptualized based on information found during the literature study, and they are presented below in the listed order:

• Identify a need: Any endeavor that a company may choose to undergo should always be based on a need. [19,20] This is true for choosing to digitize company documents, as well as for any other project that requires a process model. The need often comes from a perceived problem, such as the workflow being inefficient.

• Evaluate current process: Once a need has been established, the company should proceed by evaluating the current process. This is done in order to gain insight into how big the need is; if the need only concerns a small fraction of the users, or if the need is perceived to be small, then it may not be worth proceeding further with the inquiry. [7]

• Evaluate competence within company: After the current process has been evaluated, if the need is perceived to be big, then the company may proceed with looking into the competence of the planned users. Would the users be able to utilize the digital system, or would competency development need to take place? [3, 21] The answer to this question changes how much effort the undertaking requires.

• Assess business value: The company needs to assess what value they can expect to gain from proceeding with the endeavor. This value can come in many different forms, but it should ultimately increase the well-being of the company as a whole. [7]

• Assess cost and effort: Once the company has assessed their expected business value, they also need to assess the cost and effort that can be expected to be required. This will later be compared to the perceived business value to determine if the endeavor is worth proceeding with. [7]

• Conduct feasibility study: Once all the prior steps have been taken, the company should, as a final measure, perform a feasibility study. What is the expected cost, the perceived value, and the chances of success? The company has to take possible conflicts and alternative solutions into account when performing this feasibility study. [7]

• Decide how to proceed: As a final step, the company should decide whether they want to proceed or not. This decision should be based on all the information gathered during the prior steps.
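
The steps above can be sketched as a simple go/no-go check. This is only an illustration under stated assumptions: the literature does not prescribe any numeric scale, so the fields of `InceptionAssessment`, the idea of expressing value and cost in a common unit, and the function name `decide_to_proceed` are all hypothetical.

```python
from dataclasses import dataclass

# The seven inception steps, in the order listed above.
INCEPTION_STEPS = [
    "Identify a need",
    "Evaluate current process",
    "Evaluate competence within company",
    "Assess business value",
    "Assess cost and effort",
    "Conduct feasibility study",
    "Decide how to proceed",
]


@dataclass
class InceptionAssessment:
    """Outcome of steps 1-6; all fields are hypothetical simplifications."""
    need_identified: bool       # step 1
    need_is_significant: bool   # step 2: is the need big enough to act on?
    training_effort: float      # step 3: extra effort for competency development
    business_value: float       # step 4: expected value, in an assumed common unit
    cost_and_effort: float      # step 5: expected cost, in the same unit
    feasible: bool              # step 6: result of the feasibility study


def decide_to_proceed(a: InceptionAssessment) -> bool:
    """Step 7: base the decision on the information gathered in steps 1-6."""
    if not (a.need_identified and a.need_is_significant):
        return False
    total_cost = a.cost_and_effort + a.training_effort
    return a.feasible and a.business_value > total_cost
```

For example, an assessment with a significant need, an expected value of 100, and a combined cost of 60 would lead to a decision to proceed, provided the feasibility study is positive.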

2.3.2 Implementation Phase

Once the company has gone through all the business-related activities, and decided that they wish to proceed with the endeavor, only then can they begin with implementation. Generalizing the steps in this phase is difficult, because the development process may look vastly different depending on the sought-after result. However, some steps can still be identified; these are presented below:

• Formulate system requirements: A need has been found, and it has been decided that the company should venture to solve this need. System requirements now need to be established in order to guarantee that the new system fulfills the need. The system requirements should specify what the system should do, how it should behave, and what the performance requirements are. Other requirements and constraints that should be taken into account during design, development, or deployment should also be listed. These system requirements greatly ease the implementation phase, and are often later used to test the validity of the implemented solution. [7]

• Conduct a prestudy: Before the company starts development on a solution, they should first study different possible solutions. This should be done both in order to increase the likelihood of achieving the system requirements, and in order to examine whether better approaches exist. [7]

• Develop solution: This step is usually split over several different activities due to its size. Here, however, it is presented as a single step, since the activities in the step vary greatly depending on what needs to be developed.

• Confirm that system requirements have been achieved: Once a possible solution has been developed, the company needs to confirm that all requirements have been fulfilled. If all requirements are fulfilled, then the endeavor may be considered a success. Otherwise, reiteration of the development process is required. This is done until all requirements have been successfully achieved. [7]
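
The develop-and-confirm cycle described above can be sketched as a loop over requirement checks. The sketch assumes that each requirement can be expressed as a predicate over the current solution; the names `implement` and `develop`, as well as the `max_iterations` safety limit, are hypothetical and not part of the reviewed process models.

```python
from typing import Callable, Dict, List, Tuple

# A requirement is modelled as a check that the solution either passes or fails.
Requirement = Callable[[dict], bool]


def implement(
    requirements: Dict[str, Requirement],
    develop: Callable[[dict, List[str]], dict],
    max_iterations: int = 10,
) -> Tuple[dict, int]:
    """Develop a solution and reiterate until every requirement is fulfilled.

    `develop` receives the current solution and the names of the unmet
    requirements, and returns an updated solution.
    """
    solution: dict = {}
    unmet = list(requirements)  # nothing is fulfilled before development starts
    for iteration in range(1, max_iterations + 1):
        solution = develop(solution, unmet)
        # Confirm which system requirements are still not achieved.
        unmet = [name for name, check in requirements.items() if not check(solution)]
        if not unmet:
            return solution, iteration
    raise RuntimeError(f"Requirements still unmet after {max_iterations} iterations: {unmet}")
```

The iteration limit is added only so that the sketch terminates; the text itself states that reiteration continues until all requirements are achieved.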

Chapter 3 Research Methodology

This chapter describes the research methodology that has been applied during this thesis. First, Section 3.1 presents the research strategy that was used. Section 3.2 describes and motivates the choice of research method. Later, Section 3.3 lists all research phases. Section 3.4 presents the research instruments used for data collection and evaluation, while Section 3.5 presents the selection of respondents. Finally, Sections 3.6 and 3.7 describe the possible validity threats and ethical requirements that have been taken into account.

3.1 Research Strategy

This thesis addresses a research area that is both vast and complex. Furthermore, the complexity was increased by the fact that there existed no previously published reports pertaining to digitization process models. Finding experts within the field therefore also proved to be difficult. In order to address the research question in an efficient manner, an appropriate research strategy was chosen. The research strategy was selected so as to make the best use of the given time frame.

An overview of the research strategy is presented in Figure 3.1. As one can see, the research strategy is based on the design science paradigm, which in turn uses inductive reasoning throughout each component. There are six components in total: (1) research methods, (2) research phases, (3) research instruments, (4) respondents, (5) validity threats, and (6) ethical requirements. All components are further discussed in later sections of this chapter.

[Figure content: Inductive Reasoning; Research Method (qualitative, inductive, explorative, hermeneutics); Validity Threats (credibility, dependability, transferability, conformability); Respondents (purposeful sampling, selection criteria); Research Instruments (literature, evaluation criteria); Research Phases (4 main phases); Ethical Requirements (information, consent, confidentiality); Design Science Paradigm]

Figure 3.1 – Overview of research strategy

3.2 Research Methods

In this section, the research methods that were used are presented and motivated. First, Subsection 3.2.1 introduces the research type used, and gives arguments as to why this research type was chosen. Second, Subsection 3.2.2 presents the design science paradigm.

3.2.1 Qualitative Research

The research area that this thesis tackled was relatively unexplored. During the literature study, no previously published works pertaining to digitization processes were found, and only a small number of publications related to similar topics were found. The research databases that were used, together with a partial list of the keywords used, are presented in Subsection 3.3.1.

In order to answer this thesis’ research question despite the above-mentioned challenges, a qualitative approach was chosen. This entails that the research focused on gaining an in-depth, contextualized understanding within the domain of digitization processes. Since the research was qualitative in nature, several established qualitative research methods were used, such as conducting in-depth interviews and carrying out an exhaustive literature study. By using these methods as a means of data collection, extensive knowledge about digitization processes was gained. This knowledge was later analyzed in order to answer the why, how, and what of the process; the knowledge was also used to understand influences and contexts within the domain. [22]

While the data collection was done using qualitative methods, data evaluation was done using the hermeneutic method together with an evaluation model.

The hermeneutic method is based on extracting and interpreting information, both from empirical evidence, and from other sources of knowledge within the specified domain [23,24].

A qualitative approach was chosen for multiple reasons. Qualitative research is interpretative in its nature, and is focused on gaining in-depth knowledge within an unexplored and unstructured domain. The purpose is to provide new knowledge within the domain, interpreted from previously existing knowledge. [22] Research within digitization processes is almost nonexistent, which makes the qualitative approach well suited to this thesis.

The other possible approach would have been a quantitative approach. The quantitative approach focuses on answering a hypothesis or a set of concrete questions; it is therefore more suited toward experiments, simulations, or statistical inferences. [25] The quantitative approach is ill-suited for answering questions requiring in-depth knowledge within a specific field, which this thesis’ research question requires.

[Figure content: Research strategies, research methods, and creative methods (upper rectangle); activities: Explicate problem, Outline artifact and define requirements, Design and develop artifact, Demonstrate artifact, Evaluate artifact; Knowledge Base]

Figure 3.2 – Design science paradigm [1]

3.2.2 Design Science Paradigm

This thesis strictly followed the design science paradigm during the research phase. Design science can be seen as a template used for defining research strategies and research methods [1]. As was shown in Figure 3.1, the entire research strategy used in this thesis is based upon design science. A general outline of the design science paradigm is illustrated in Figure 3.2. The paradigm consists of five activities: (1) explicate problem, (2) outline artifact and define requirements, (3) design and develop artifact, (4) demonstrate artifact, and (5) evaluate artifact. These activities are governed by the choice of research strategies and methods, as shown by the upper rectangle in Figure 3.2. The result of each activity contributes to the knowledge base, which therefore gradually develops during the progression of the research phase.

The choice of strategies and methods is not defined by design science. These choices were therefore made based on existing prerequisites, together with requirements that needed to be satisfied. The chosen research strategies and methods are illustrated in Figure 3.1, and each choice is motivated throughout this chapter.

The activities that are defined by the design science paradigm are represented by corresponding research phases, which are described in Section 3.3. Below, a comparison between the template activities and the defined research phases is made:

• Explicate problem: This activity corresponds to the Literature study phase. Here, a lack of digitization process models was explicated as the problem.

• Outline artifact and define requirements: This activity is covered by the Preliminary Design phase. First, the artifact is outlined in the Creation of evaluation model sub-phase. Afterwards, the requirements are defined in the Design of preliminary digitization process model sub-phase.

• Design and develop artifact: This activity spans two phases: the Preliminary Design phase, and the Evaluation and Improvement phase.

• Demonstrate artifact and Evaluate artifact: These two activities are covered by the Preliminary Design, Evaluation and Improvement, and Finalization phases. The artifact is evaluated by the modeller (the author of this thesis) at the end of the Preliminary Design phase. After that, the artifact is both demonstrated to the interviewees and evaluated during the first stage of the Evaluation and Improvement phase. The artifact is also evaluated during the Finalization phase, in which feedback from the interviewees is used for improving the CODED model one final time.
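
The comparison above can also be written down as a lookup table, which makes it easy to see which activities each research phase covers. The sketch below simply transcribes the list; the joint bullet for demonstrating and evaluating the artifact is kept as a single entry, and the helper name `activities_per_phase` is hypothetical.

```python
# Design science activities mapped to the research phases that cover them,
# transcribed from the comparison above.
ACTIVITY_TO_PHASES = {
    "Explicate problem": ["Literature Study"],
    "Outline artifact and define requirements": ["Preliminary Design"],
    "Design and develop artifact": [
        "Preliminary Design",
        "Evaluation and Improvement",
    ],
    "Demonstrate artifact and Evaluate artifact": [
        "Preliminary Design",
        "Evaluation and Improvement",
        "Finalization",
    ],
}


def activities_per_phase(mapping: dict) -> dict:
    """Invert the mapping: for each research phase, list the activities it covers."""
    result: dict = {}
    for activity, phases in mapping.items():
        for phase in phases:
            result.setdefault(phase, []).append(activity)
    return result


PHASE_TO_ACTIVITIES = activities_per_phase(ACTIVITY_TO_PHASES)
```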

3.3 Research Phases

In this section, the different research phases carried out during this thesis are presented. These phases correspond to the activities that are part of the design science paradigm, as described in Subsection 3.2.2. The research phases that were conducted are shown in Figure 3.3; they are as follows: (1) Literature Study, (2) Preliminary Design, (3) Evaluation and Improvement, and (4) Finalization. These phases are further elaborated upon in Subsections 3.3.1 to 3.3.4, respectively. Since evaluation happened throughout three of the four research phases, a rundown of each evaluation is presented in Subsection 3.3.5, for clarity’s sake.

3.3.1 Literature Study

The literature study phase was split up into three sub-phases: (1) study of digitization techniques, (2) study of process models, and (3) study of evaluation models. The main source of information was published research articles and books within the software engineering field. Here, well-known research databases were used, including but not limited to: Google Scholar [26], IEEE Xplore [27], Scopus [28], ACM [29], and Springer [30]. Some of the keywords used when searching for related work included: digitization, digitization process, digitization model, digitization guidelines, digitization method, digitization evaluation, digital transformation, digital document storage, evaluation models, and theoretical evaluation models. The keyword digitization was also switched out for digitalization during the search, even though most sources consider these terms to be distinct [31,32].

[Figure content: Literature Study (study of digitization techniques, study of process models, study of evaluation models); Preliminary Design (creation of evaluation model, design of preliminary digitization process model, evaluation of process model from perspective of the modeller); Evaluation and Improvement (evaluation of process model from perspective of a user, improvement of process model); creation of final digitization process model]

Figure 3.3 – The thesis’ research phases

During the initial sub-phase, study of digitization techniques, a vast assortment of publications was found. Of these, however, only a handful related to the digitization of paper documents. Nonetheless, there was no shortage of publications pertaining to digital storage and database solutions. Because of the vast number of different database models used for document storage, a limitation had to be imposed. Time constraints did not allow for a comprehensive comparison of database models; therefore, only two were selected for comparison: the relational model, and the document data model.

The relational model was chosen because of its popularity. The document data model, in contrast, was chosen because it is categorized as a NoSQL model, a category that differs vastly from the relational model.

The study of process models was met with many obstacles. Searching for existing digitization process models proved to be a fruitless endeavor. Instead, process models pertaining to different, but related, fields had to be studied. The process models found pertained to cloud migration and software reengineering [19,20,21,33,34].

Study of evaluation models, which was the final sub-phase, also proved to be difficult. There existed a plethora of publications pertaining to evaluation models; however, most of these related to evaluation of implemented systems. Because of this, only a small handful of publications were relevant [35,36,37]. The criteria used in this thesis’ evaluation model, presented in Chapter 4, have been based on these works.

3.3.2 Preliminary Design

The preliminary design phase consisted of three sub-phases: (1) creation of evaluation model, (2) design of preliminary digitization process model, and (3) evaluation of process model from perspective of the modeller. The creation of evaluation model sub-phase dealt with outlining an evaluation model which could be used during the later phases. Ultimately, five criteria were included in the evaluation model: (1) interviewee credibility, (2) semantic correctness, (3) syntactic correctness, (4) usefulness, and (5) process flexibility. These criteria were evaluated via both interviews and action research. The complete evaluation model is presented in Chapter 4.

Design of a preliminary digitization process model, which was the second sub-phase, dealt with outlining a rudimentary version of the model. This design was based on knowledge gathered from the literature study. Evaluation of the preliminary model, which was the third and final sub-phase, happened in parallel to its creation. Here, evaluation was done from the perspective of the modeller, that is, the author of this thesis. This was done since one should always evaluate a model from both the perspective of the modeller and of the users to achieve optimal results [36]. Accordingly, the modeller tried to answer, for each criterion, whether it was achieved by the model. These answers were given through experience gathered from the literature study, together with common sense. Two of the five evaluation criteria were ignored during this first round of evaluation: (1) interviewee credibility, and (2) process flexibility. Interviewee credibility was excluded since no interviews were performed. Process flexibility was excluded during this sub-phase since the modeller did not feel that they possessed enough expertise to adequately evaluate the flexibility of the model. Development and evaluation of the preliminary process model continued until the modeller believed that the process model fulfilled the underlying requirements.

3.3.3 Evaluation and Improvement

The evaluation and improvement phase was iterative in its nature. It had two sub-phases: (1) evaluation of process model from perspective of a user, and (2) improvement of process model. These sub-phases were repeated until the first sub-phase concluded that the process model achieved all evaluation criteria. In total, two iterations were performed.

The first sub-phase conducted its evaluation from the perspective of a user. Here, the process model was evaluated by conducting an interview with one expert, according to the evaluation criteria listed in Chapter 4. This evaluation could have been from either an academic or an industrial perspective, depending on the background of the interviewee; however, in this case, the experts interviewed in both iterations had an industrial background. A questionnaire was designed in order to structure the interviews and increase repeatability and comparability; this questionnaire is presented together with the evaluation criteria in Chapter 4.

The feedback received during this first sub-phase was then carried over to the second sub-phase. Here, the process model was improved upon according to the feedback. Once improvements were completed, the process model was evaluated once again. This phase repeated itself until the process model sufficiently passed the evaluation criteria.

3.3.4 Finalization

Once evaluation was concluded, only minor improvements remained before the final version could be completed. During this phase, all results gathered from the literature study, interviews, and action research were taken into account. All of this knowledge together contributed toward the final version of the CODED model.

Evaluation was also conducted during this final phase, although here it was conducted using action research. Time constraints did not allow for a full-scale implementation in an industry setting, so the model could only be evaluated before implementation. This last evaluation phase tried to compensate for that by performing some of the theoretical steps of the process model in a simulated industry setting; that is, some of the steps of the process model were conducted for a proposed digitization project. The steps performed were the following: (1) Define requirements guiding the digitization process, (2) Determine the most appropriate digitization form for the process, (3) Choose a database, and (4) Design database structure. Just as in the first evaluation, the interviewee credibility and process flexibility criteria were ignored: no interviews were conducted during this phase, and evaluating process flexibility would have been difficult since the model was only partially implemented, and only for one possible digitization endeavor.

3.3.5 Rundown of Evaluation

Initial evaluation happened at the end of the Preliminary Design phase. Here, evaluation was performed by the author of this thesis, using knowledge gained from the literature study together with common sense. This evaluation was done in order to ascertain the value of the model before presenting it to an independent reviewer. This evaluation is presented in Section 5.2. Without change to the model, it was again evaluated in the first step of the Evaluation and Improvement phase. This time, evaluation happened in the form of an interview with an industry expert. This evaluation is presented in Section 5.3.

The third evaluation happened during the second iteration of the Evaluation and Improvement phase. Again, evaluation happened in the form of an interview with an industry expert. The third round of evaluation is presented in Section 6.2. The final round of evaluation happened throughout the Finalization phase. Here, action research was applied by implementing parts of the process model in a real-life industry setting. This evaluation is presented in Chapter 7.
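
As a compact overview, the four evaluation rounds and the criteria applied in each can be sketched as a small data structure. The sketch assumes that both interview rounds covered all five criteria, since the interviews were conducted according to the full evaluation model; the names `ROUNDS` and `criteria_for` are hypothetical.

```python
# The five evaluation criteria of the evaluation model.
CRITERIA = (
    "interviewee credibility",
    "semantic correctness",
    "syntactic correctness",
    "usefulness",
    "process flexibility",
)

# Criteria that were ignored in the rounds conducted without interviews.
EXCLUDED_WITHOUT_INTERVIEW = {"interviewee credibility", "process flexibility"}

ROUNDS = [
    {"round": 1, "phase": "Preliminary Design", "method": "modeller's own evaluation"},
    {"round": 2, "phase": "Evaluation and Improvement, iteration 1", "method": "interview"},
    {"round": 3, "phase": "Evaluation and Improvement, iteration 2", "method": "interview"},
    {"round": 4, "phase": "Finalization", "method": "action research"},
]


def criteria_for(round_info: dict) -> list:
    """Return the criteria applied in a round.

    Assumption: interview rounds apply all five criteria; the other
    rounds skip the two criteria listed in EXCLUDED_WITHOUT_INTERVIEW.
    """
    if round_info["method"] == "interview":
        return list(CRITERIA)
    return [c for c in CRITERIA if c not in EXCLUDED_WITHOUT_INTERVIEW]
```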

3.4 Research Instruments

The research instruments used during this thesis were the following: (1) literature study, (2) evaluation model, and (3) interview questionnaire. These instruments are presented below:

• Literature study: Due to the fact that no existing digitization process models could be found, the study had to rely on published work from related domains. The literature study therefore proved to be an essential component of this research.

• Evaluation model: The evaluation model consists of several criteria used for evaluating the CODED model; they are presented in Chapter 4. These criteria were used together with the questionnaire answers during the evaluation phase.

• Interview questionnaire: The interview questions were designed to be semi-structured and open-ended. This decision was made in order to gain new knowledge while at the same time being able to compare the answers. All interviews were conducted over video conference, which made recording the interviews and transcribing the answers an easy task.

3.5 Sampling Method

Before looking for possible respondents to interview, a sampling method had to be defined. The sampling technique used for recruiting participants was purposeful sampling. That is, the interviewees were selected based on their ability to provide in-depth and detailed information. However, there also existed an element of convenience sampling; the author prioritized respondents who had a higher likelihood of accepting participation. [38,39] The following selection criteria were used for determining suitable respondents: (1) the respondent should have been, or currently be, involved with a digitization process; and (2) the respondent should have at least ten years of experience within the field of software engineering. Using these criteria, it could be assured that the respondents would contribute useful knowledge for answering the posed research question. In total, two respondents were chosen; both satisfied the above-mentioned selection criteria.
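
The two selection criteria can be expressed as a simple filter over candidate respondents. This is a minimal sketch; the `Candidate` fields and the example candidates in the usage note are hypothetical and do not describe the actual respondents.

```python
from dataclasses import dataclass

# Selection criterion (2): at least ten years within software engineering.
MIN_YEARS_OF_EXPERIENCE = 10


@dataclass
class Candidate:
    """A possible respondent; all fields are hypothetical simplifications."""
    label: str
    involved_in_digitization: bool      # criterion (1): past or current involvement
    years_in_software_engineering: int  # criterion (2)


def is_suitable(candidate: Candidate) -> bool:
    """Return True only if the candidate satisfies both selection criteria."""
    return (
        candidate.involved_in_digitization
        and candidate.years_in_software_engineering >= MIN_YEARS_OF_EXPERIENCE
    )


def select_respondents(candidates: list) -> list:
    """Keep the candidates that satisfy both selection criteria."""
    return [c for c in candidates if is_suitable(c)]
```

For instance, a candidate with twelve years of experience and prior involvement in a digitization process would be kept, whereas one with only four years of experience would be filtered out.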

3.6 Validity

Validity of a qualitative study is assessed using four criteria: (1) credibility, (2) dependability, (3) transferability, and (4) conformability. These criteria are used to test the strength, appropriateness, and soundness of the chosen research method. [40] They are further described below:

• Credibility: This criterion deals with the believability and trustworthiness of the CODED process model. Since evaluation of the model mostly took the form of interviews, the credibility of the model largely depended on the credibility of the respondents. In order for the results to be credible, it needed to be assured that the respondents gave credible answers. This was assured by establishing Interviewee credibility as an evaluation criterion. This criterion helped assure that each respondent had enough experience to give believable and trustworthy answers; the criterion is further described in Section 4.1.

• Dependability: This criterion deals with the repeatability of the research process. Qualitative research always suffers from a lack of repeatability; it cannot be replicated in the same way as quantitative research can. This is due to the fact that the qualitative process varies greatly depending on the context. In order to prove dependability, the research therefore needs to be repeated several times within different industrial contexts.

• Transferability: This criterion refers to the generalizability of the study’s findings. The study should aim to have a degree of generalizability that is high enough for the results to be transferable from one context to another. This was achieved in two major ways: (1) by using several different sources of data for validation, and (2) by selecting respondents that possessed different angles of expertise.

• Conformability: This criterion refers to what degree the results can be confirmed by other researchers. The criterion was dealt with by: (1) thoroughly describing the process phases, and (2) confirming the transcribed answers with each interviewee.

3.7 Ethical Requirements

When conducting any form of research, it is important to follow some form of ethical guidelines. In the case of qualitative research, there exist four ethical requirements that should be followed: (1) information requirement, (2) consent requirement, (3) confidentiality requirement, and (4) usufruct [9]. These requirements are further elaborated upon below:

• Information requirement: Each respondent needs to be informed about the research, its purpose, and the rights associated with participation. This requirement was fulfilled in two ways: (1) by beginning each interview with a presentation of the research and its purpose; and (2) by informing each respondent about the voluntary nature of their participation. Furthermore, each respondent was informed that they could rescind their participation at any point if they wished to do so.

• Consent requirement: Each respondent should individually be able to decide about their participation in the study. The requirement was fulfilled by informing each respondent about their participation, and having them give verbal informed consent about the continuation of the interview. In addition, each respondent was informed that they may discontinue the interview at any point.

• Confidentiality requirement: Each participating party has the right to remain anonymous. The commissioning company, PEFAB, did not request any form of anonymity. Each respondent was informed that their name and workplace would not be disclosed, and was asked whether they wished further anonymity. Only one of the respondents requested further anonymity.

• Usufruct: The outcome of the research should only be used for its intended purposes. This thesis had as its intended purpose to design a digitization process model. The material gathered has only been used toward this purpose; the research has not been used for any commercial or other purposes.

Chapter 4 Evaluation Model

The evaluation model consists of five evaluation criteria: (1) interviewee credibility, (2) semantic correctness, (3) syntactic correctness, (4) usefulness, and (5) process flexibility. The evaluation criteria were designed to help answer the research question posed in Section 1.3. This is true for all evaluation criteria except the first one, interviewee credibility, which was instead introduced to increase the validity of the model. The evaluation criteria are presented in Sections 4.1 to 4.5, respectively. Besides these criteria, a set of open-ended questions was also asked in order to gain further insight. These questions are presented in Section 4.6.

Evaluation was primarily performed through interviews. When conducting interviews, it is important to have a planned structure in place, since there is otherwise a risk of straying from the topic at hand. To combat this risk, an interview questionnaire was put in place, as can be seen in Figure 4.1. Throughout this chapter, each evaluation criterion will be related to its corresponding interview questions.

4.1 Interviewee Credibility

As mentioned in Section 3.6, it is important to ascertain the believability and trustworthiness of each respondent. One has to make sure that the interviewees have solid knowledge within the field in question. Otherwise, the feedback provided will not be credible. The purpose of this thesis was to present a process model for digitizing company documents. Therefore, it was of utmost
