
Degree Project in Information and Communication Technology, Second Cycle, 30 Credits
Stockholm, Sweden 2017

Continuous Event Log Extraction for Process Mining

HENNY SELIG

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


Abstract

Process mining is the application of data science technologies on transactional business data to identify or monitor processes within an organization. The analyzed data often originates from process-unaware enterprise software, e.g.

Enterprise Resource Planning (ERP) systems. The differences in data management between ERP and process mining systems result in a large fraction of ambiguous cases, affected by convergence and divergence. The consequence is a chasm between the process as interpreted by process mining and the process as executed in the ERP system. In this thesis, a purchasing process of an SAP ERP system is used to demonstrate how ERP data can be extracted and transformed into a process mining event log that expresses ambiguous cases as accurately as possible. As the content and structure of the event log already define the scope (i.e. which process) and granularity (i.e.

activity types), the process mining results depend on the event log quality. The results of this thesis show how the consideration of case attributes, the notion of a case and the granularity of events can be used to manage the event log quality. The proposed solution supports continuous event extraction from the ERP system.

Keywords

Process Mining, Event Log, Data Convergence and Divergence, Continuous Log Extraction, Case Identification


Abstract (Swedish)

Process mining is the application of data science techniques to transactional data in order to identify or monitor processes within an organization. The analyzed data often originates from process-unaware enterprise software, such as SAP systems, which are centered around business documents. The differences in data management between Enterprise Resource Planning (ERP) and process mining systems result in a large fraction of ambiguous cases, affected by convergence and divergence. This results in a gap between the process as interpreted by process mining and the process as executed in the ERP system. In this thesis, a purchasing process of an SAP ERP system is used to show how ERP data can be extracted and transformed into a process-mining-oriented event log that expresses ambiguous cases as precisely as possible. Since the content and structure of the event log already define the scope (which process) and granularity (the activity types), the process mining results depend on the quality of the event log. The results of this thesis show how definitions of the notion of a case and of the granularity of events can be used to improve that quality. The described solution supports continuous event log extraction from the ERP system.

Keywords

Process Mining, Event Log, Data Convergence, Data Divergence, Continuous Log Extraction, Case Identification


Table of Contents

1 Introduction ... 1

1.1 Background ... 1

1.2 Problem... 2

1.3 Purpose ... 2

1.4 Delimitations ... 3

1.5 Outline ... 3

2 Data Extraction for Process Mining ... 4

2.1 Types of Process Mining ... 4

2.2 The Event Log – Theoretical Challenges ... 5

2.2.1 Terminology and Characteristics of the Event Log ... 6

2.2.2 Format of the Event Log ... 7

2.3 SAP ERP as a Data Source ... 8

2.3.1 SAP Business Processes and Purchase-To-Pay ... 8

2.3.2 Identification of Process-Relevant ERP Data ... 9

2.4 Guiding Principles to Data Extraction ... 12

2.5 The Event Log - Practical Challenges ... 13

2.6 Data Quality of the Event Log ... 14

2.7 Convergence, Divergence, and Granularity of the Event Log ... 15

2.7.1 Convergence and Divergence ...15

2.7.2 Granularity of Case Definition ...17

2.7.3 Hierarchies and Granularity of Event Definition ...19

2.8 Requirements for Continuous Event Log Extraction ... 19

2.9 Existing Research on Log Extraction ... 20

3 Research Approach ... 23

3.1 Methods and Research Questions ... 23

3.2 Criteria for Design Decisions ... 24

3.3 Criteria for Data Quality Evaluation ... 25

3.4 Ethical, Legal and Sustainability Considerations ... 26

4 Prototype Implementation ... 28

4.1 Solution Design ... 28

4.2 SAP OData Services ... 29

4.3 Web Service 1: Extracting Changes from SAP ... 30

4.3.1 Implementation of the Change Extraction ...30

4.3.2 Design Decisions Regarding Change Extraction ...32

4.3.3 Problems of Change Extraction ...34

4.4 Web Service 2: Extracting Case Attributes ... 35

4.4.1 Implementation of the Attribute Extraction ...35

4.4.2 Design Decisions Regarding the Attribute Extraction...36

4.4.3 Problems of Attribute Extraction ...36

4.5 Integration Service: Managing Data Extraction and Transformation ... 37

4.5.1 Implementation of the Integration Service ...37

4.5.2 Design Decisions Regarding the Integration Service ...38

4.6 Observations and Challenges of the Integration Service ... 40

4.6.1 Caseid ...40

4.6.2 Activities and Events ...41

4.6.3 Attributes ...43


5 Evaluation of the Proposed Solution ... 45

5.1 Evaluation of Design ... 45

5.2 Evaluation of Data Quality... 46

5.3 Evaluation of Addressing the Challenges ... 49

5.4 Evaluation of the Project Results ... 50

6 Conclusions and Outlook ... 52

6.1.1 Conclusions ...52

6.1.2 Outlook ...53

References ... 54

List of Acronyms

BPMN  Business Process Model and Notation
ERP   Enterprise Resource Planning
IDoc  Intermediate Document
PO    Purchase Order
PR    Purchase Requisition
PTP   Purchase-To-Pay
SaaS  Software-as-a-Service
XES   eXtensible Event Stream



1 Introduction

Optimization of processes has been performed ever since there were processes to improve. Various scenarios may lead to a company’s desire to analyze and optimize existing processes, e.g. the switch from a brick-and-mortar business to online commerce, cost reduction initiatives, or IT transformations. Process mining is the application of data mining techniques to business processes, based on their IT footprint, to gain new insights into them and to monitor them.

This thesis presents the results of a project that analyzed the possibilities to integrate process mining with SAP ERP. The goal was the standardization of log creation, identification of relevant business activities, and continuous extraction of such activities from SAP ERP systems. A special focus lay on the handling of ambiguous data within the SAP system.

1.1 Background

Process Mining is a combination of process science, business analysis and data mining [1], [2]. Process mining techniques allow the discovery and conformance checking of processes based on already existing data stored in IT systems. The static data is given a secondary value by extracting and transforming it into an event log that represents the process flow within an organization. The event log contains a step-by-step overview of which process steps happen at a given time. Related activities are mapped into cases, i.e. process instances, so that each event can be assigned to a specific execution or case, for example a purchasing transaction. Process mining techniques can then be applied to this event log of process steps and cases to gain insights into the processes, e.g. the execution flow. From these results, analysts can detect bottlenecks, criteria that impact the flow, and compliance issues.

To perform process mining, accurate and complete event logs are required.

Event logs contain entries for every event in the system, related to a specific process. An event might be the creation of a document, a change, or a deletion.

It must be recognizable to which process execution an event belongs, e.g.

whether a change was applied to document A or to document B.

Many processes within companies rely on central enterprise IT systems, such as Enterprise Resource Planning (ERP) systems. These systems form the core of a company’s IT landscape, as many business activities, e.g. accounting, sales, and purchasing, are executed in them. For example, to purchase goods, the purchase order as well as the delivery and invoices are created and documented in the ERP system. By looking at a company’s core IT systems, its core processes can therefore be analyzed as well.

The focus of this project lay on retrieving event logs that can be used for conformance checking, i.e. to evaluate whether or not the execution of a process follows the designed model. Conformance checking should not be a one-time activity, but a continuous analysis. This way, bottlenecks or other impediments within an organization can be identified at the time of occurrence, and

(7)

countermeasures can be taken immediately. However, the identification of relevant data and the continuous extraction require manual work and domain knowledge about the process. This work focuses on standardizing the selection, extraction and transformation of process events for SAP ERP systems, while maintaining customizability of the process mining application where necessary.

Multiple publications by different researchers focus on the creation of an event log, addressing different areas, e.g. automatic case identification, automatic selection of relevant tables, and others [3], [4].

Building an integration of these concepts and showing the benefits and limitations of the existing technologies is the purpose of this work. A special focus lies on the identification of ambiguous data and its handling.

1.2 Problem

Extraction and preparation of event logs are a complex matter. Firstly, all relevant data for a process needs to be identified. Within SAP ERP systems, the information is not provided collectively, but scattered across multiple tables.

For a purchasing process, more than 30 tables are needed to capture all data related to the process. Only people with sufficient domain knowledge are able to identify the required data sources, and the identification needs to be performed for every analyzed process; finding an automated identification of relevant data is the long-term goal. Secondly, the ERP system is oriented along business documents, not along business processes. To find the process data behind the static data, the business documents and related activities need to be matched into cases, i.e. the same execution of a process. For example, an invoice always belongs to a purchase order, and this connection needs to be identified.

As the ERP system is process-unaware, there are different interpretations of a correct match. Therefore, a suitable way of case identification has to be found which is also able to handle complex cases. Thirdly, this identification is further complicated when data is extracted continuously rather than in a single export.

Decisions need to be taken on a partial view of the data, while the whole picture is only known after the process execution has been completed. Lastly, extraction of data must be performed in a way that does not interfere with the ongoing business activities in the ERP system, as ERP systems form an essential backbone of most companies.

1.3 Purpose

This thesis presents the results of a project that was conducted with the goal of optimizing the extraction and transformation of event data from SAP systems for process mining by addressing the four problems mentioned above. The project consisted of analyses of extraction methods, case identification and data quality improvements. The findings from the analyses were implemented in a prototype of an extraction connector between SAP ERP and a process mining tool.

Existing research and publications laid the foundation of this project and were adopted where sensible. This required the adaptation of some of the concepts in order to be used in conjunction with other components. By reducing the impact


of the aforementioned problems, the project aimed to accelerate the adoption of process mining within a company, and to offer continuous log extraction with as little interruption of the ERP system as possible. As process mining results rely heavily on the granularity and adaptability of the event logs, a special focus of this project lay on the quality of logs.

The project was conducted together with Signavio GmbH. The format of the event log created by the presented prototype follows the requirements of Signavio’s process mining solution, Signavio Process Intelligence. The findings of this project shall help Signavio to accelerate process mining projects.

However, the data extraction and transformation proposed in this thesis are generally applicable and not limited to Signavio Process Intelligence.

1.4 Delimitations

This work focuses solely on the extraction of event logs, not on process mining itself. The extracted event logs can be used, after possible format adaptations, with different classical process mining solutions. Process mining algorithms that are based on a different data structure, e.g. the artifact-centric approach [5], are not covered.

As ERP systems from different vendors have special characteristics, this work focuses on data extraction from SAP ERP. Due to SAP-specific technical decisions and the SAP data model, adaptations would be required to apply the concept to other ERP systems. However, similar work for other ERP systems exists (e.g. [5], [3], [6]). Furthermore, this work is based on a purchasing process example. An adaptation of the data selection in the SAP system would be required to transfer the concept to other processes.

1.5 Outline

To show the challenges and proposed solutions for the extraction of SAP ERP data for process mining, chapter 2 introduces the concept of event logs for process mining, as well as the purpose and requirements of the event logs. This chapter will also discuss existing approaches and known challenges when extracting data from SAP ERP systems. Chapter 3 introduces the research problem and the methods chosen to approach it. The prototype solution to the problem will be introduced in chapter 4, along with the challenges faced during the design phase. Chapter 5 will give an evaluation of the prototype, with a special focus on ambiguous cases. Chapter 6 will summarize the contributions and give an outlook on future work and open questions.



2 Data Extraction for Process Mining

According to the Process Mining Manifesto of the IEEE Task Force on Process Mining, the “idea of process mining is to discover, monitor and improve real processes (i.e., not assumed processes) by extracting knowledge from event logs readily available in today's (information) systems” [1, p.1]. The following chapters will introduce the applications of this idea and the challenges in acquiring these readily available logs.

2.1 Types of Process Mining

As the above quote shows, process mining deals with existing data and reveals valuable information about the processes. According to van der Aalst [2], there are three purposes of process mining:

- Discovery: The analysis of data that originates from a company’s IT systems reveals how the tasks of processes are executed within the company. The outcome is a business process model that shows how the processes within an organization are actually performed, not how they are intended to be performed.

- Conformance: The data from IT systems is used to verify an existing business process model. This means that the organization has defined a process model and uses process mining to verify whether the real process and the designed process are identical.

- Enhancement: As in process conformance, a process model already exists for process enhancement. However, the difference lies in the purpose. Enhancement focuses on improving the existing model, e.g.

repairing it in case it was wrong, or extending it in case a bottleneck was identified. Conformance concerns the improvement of the process execution, while enhancement concerns the improvement of the process model.

All three types have in common that they rely on event logs coming from IT systems. The event logs may look the same for all three applications of process mining; however, this work focuses on the latter two. All three types can give insights into past process executions, for example whether activities were skipped, re-ordered, took additional time, or are likely to be exposed to fraud.

For these latter two scenarios, real-time insights into processes offer an additional benefit, as potential bottlenecks or issues can be identified at the time of execution.

van der Aalst, as well as other researchers [1], [7], differentiates strictly between the application of process mining for process conformance and process enhancement. The difference lies in the handling of any deviations between the to-be situation (process model) and the as-is situation (event logs). In the conformance case, the as-is situation needs to be adapted, e.g. employees trained or IT systems optimized. In the enhancement case, the process model needs to be adapted, e.g. specified on a more detailed level or corrected because it contained a logical error. In both conformance and enhancement of processes, the process model and the event logs are generated and analyzed in the same


way. For the purpose of this project, this differentiation between enhancement and conformance checking is not required. As the event log generation is the focus of this work, the event log shall be equally suitable for both process mining applications.

Both conformance and enhancement of models require the availability of a business process model a priori. The tools used for this work are based on Business Process Model and Notation (BPMN), a modelling standard maintained by the Object Management Group [7]. Transition systems, Petri nets, process trees and other languages can also be used for process modelling.

However, BPMN is one of the most widely used languages to model business processes [2].

Beyond the question of choosing a modelling language, building a model that is good in terms of content is difficult. van der Aalst describes some of the common errors [2]:

- Simplification: The model is too optimistic and does not cover more complicated cases

- Idealization of human behavior: The model assumes overly optimistic human behavior, and does not allow for varying performance or workload.

- Wrong abstraction level: The model is not in accordance with the level of detail of event logs for process mining. It can either be too detailed for IT systems to be able to capture events, or too abstract to be able to answer relevant questions.

While the first two might be identified by applying process mining, this last point needs to be considered when creating the logs. It is possible to capture events on different levels, so the model and the log extraction need to match.

This issue will be discussed in chapter 2.7.3.

According to van der Aalst, these modelling problems "stem from a lack of alignment between hand-made models and reality. […] Discovery techniques […] allow for viewing the same reality on different angles and at different levels of abstraction" [2, p.88]. This means that discovery techniques can help in understanding the as-is situation, but modelling is still required. Without modelling, no improvements to the actual process can be introduced. By offering an approach to retrieve accurate, high-quality event logs, this research will help to complete the circle of optimizing a model (theoretically), improving a process (practically), and checking the effects and monitoring the executions (analytically).

2.2 The Event Log – Theoretical Challenges

Process mining cannot be done without available data logs. Data can come from any or multiple data sources that hold correct and valuable information about processes. A data source can be a flat file, transaction log, database table, Data Warehouse, ERP system, and many others. It can be structured or unstructured, of different quality and might have to be combined from multiple sources [2]. In this work, the data source is a single SAP ERP system.


When using data from an ERP system, the data typically cannot be used immediately, but needs to be extracted from the source system and transformed into a suitable format before it can be loaded into the process mining system. This extraction, transformation and load (ETL) process is a standard approach that is also used for creating and maintaining Data Warehouses. Though Data Warehouses may hold information valuable for process mining, the data is often modelled for Online Analytical Processing (OLAP) and is not process-oriented [2]. In this project, the extraction was done from the ERP system directly. This way, there is no loss of information and continuous, near-real-time extraction is possible (multiple updates per hour).

2.2.1 Terminology and Characteristics of the Event Log

A process mining event log exhibits some typical elements and characteristics.

The first point to note is that there is one event log per process [2]. That means that if multiple processes shall be analysed by process mining, each of those processes requires its own event log, just as each has its own process model. This becomes clear when comparing two processes. To analyse a sales process, the process mining tool requires sales data, e.g. order volume, customer or product information. This data might come from a CRM or an ERP system.

When analysing a machine maintenance process, completely different data is required, e.g. about the plant, procurement duration, or costs. Though the structure of the event logs will look the same (both will need a timestamp and an activity type), the business logic and activities will differ.

A second characteristic of an event log is that every entry in the log refers to a single case [2]. A case (also called process instance) refers to a single execution of a process, and is identified by a caseid. For a purchasing process, a case may consist of creation of a purchase requisition (PR), approval of the PR, creation of a purchase order (PO), approval of the PO, receipt of goods, receipt of an invoice, and the payment of the invoice. Each of the steps can be mapped to this particular case, and each of these steps is represented as an event in the log.

Events, the single entries in the log, need an order within a specific case, and therefore need a timestamp. They also have a specific activity assigned, which is the business task that was executed. They can also carry other attributes. [2]

When comparing cases with each other, different activities or different orders of activities can be observed. A variant is a specific order of activities that is followed by at least one case of the process. A variant is therefore an execution of the process steps in a specific sequence. [2]

Figure 2-1 visualizes these definitions. An event log consists of events, which contain activities that can be seen in the IT systems and mapped to the process model. Every event is mapped to a case, a specific execution of the process.

Every case has a sequence of ordered events. Each sequence forms a variant.



Figure 2-1: Definition of activities, cases and variants

2.2.2 Format of the Event Log

There is some information that every event log needs to contain: a timestamp at which the activity was performed, an event type identifying the performed activity, and a case identifier relating the event to a process instance [3]. Additionally, there can be attributes describing either the event or the case, e.g. the currency and amount of an invoice.

In 2010, the IEEE Task Force on Process Mining adopted XES (eXtensible Event Stream) as a standard for process mining event logs [1]. XES is XML-based, and every event log entry comprises an event type, a timestamp and any number of additional attributes. Attributes can be assigned on different levels, i.e. log level, case level, or event level. An XES log contains any number of cases (traces in XES terminology), where each case consists of the sequential events of this case [2]. This requires that every event is already assigned to a case, and that all events related to a case are already known. Especially when continuously extracting events from an ERP system, as intended in this project, this is not possible, and the identification of cases across events is more complex. This, and the already existing interface of the process mining tool used, led to the decision not to use the XES format for this project. Instead, a relational model as shown in figure 2-2 is used.

Figure 2-2: Extract from an event log


The process mining tool used for this project differentiates between an event log and an attribute log. This has the advantage that attributes can also be added at a later point in time without changing the event log, which simplifies attribute assignment in continuous log extraction scenarios. Both the event log and the attribute log require a specified format. The event log is a file containing one row of caseid, activity name and timestamp per event (compare figure 2-2). Thereby it contains the minimal information needed to map an event to a case and to an activity of the process model in the correct time order. The attribute log consists of the caseid and several attributes that are specified for each process.
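To make this layout concrete, the following minimal sketch writes an event log and an attribute log in the relational format described above as CSV files. The column names, file names and example values are hypothetical illustrations, not taken from the prototype or from Signavio Process Intelligence.

```python
import csv

# Minimal event log: one row per event, with caseid, activity name and timestamp.
events = [
    ("PO-4500000001", "Create Purchase Requisition", "2017-03-01 09:14:00"),
    ("PO-4500000001", "Create Purchase Order",       "2017-03-02 11:02:00"),
    ("PO-4500000001", "Receive Invoice",             "2017-03-20 08:45:00"),
]

# Attribute log: one row per case, with case-level attributes such as supplier and amount.
attributes = [
    ("PO-4500000001", "ACME Corp", "EUR", "1250.00"),
]

with open("event_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["caseid", "activity", "timestamp"])
    writer.writerows(events)

with open("attribute_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["caseid", "supplier", "currency", "amount"])
    writer.writerows(attributes)
```

Keeping the attribute log separate means new attributes can be appended for a case later without rewriting already extracted event rows, which is the property exploited in the continuous extraction scenario.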

2.3 SAP ERP as a Data Source

The architecture of an SAP ERP system is separated into a database layer and the ABAP application layer. Due to this separation, the application is database-independent, and different relational databases (Oracle, Microsoft SQL Server, …) can be used with the ERP system. This variety of databases makes it difficult to extract data from the database directly. An abstraction from the specific database implementation is offered by the ABAP application layer. The logic for database tables and the relations between these tables are implemented in the ABAP application layer. Also at this layer, SAP systems implement the logic for authorization and user management, as well as standardized access methods, failure handling, load balancing and the ABAP framework for custom coding. These services, provided by the ABAP application layer, are critical for productive ERP systems and will be influential factors in the design decisions discussed in chapter 3.2.

SAP ERP systems handle entries in the database tables as representations of documents, e.g. sales order documents, purchase order documents, material movement documents, etc. That means the SAP ERP system is oriented along business documents, not along business processes. For mining these systems, entries from multiple tables need to be transformed into event logs.

For this, a re-engineering of parts of the relational schema and foreign key relationships between tables is required. The structure of these tables, automation of this identification and a sample process will be discussed in the following.

2.3.1 SAP Business Processes and Purchase-To-Pay

The SAP ERP end users perform their tasks by connecting to the ABAP application server. The various components of the application that perform different business activities are accessed via transaction codes. Conversely, the transaction code a user has entered can identify the activity type that was performed. For example, transaction code ME21 can be used to create a purchase order. Unfortunately for the process miner, these transactions are interconnected, and from the same transaction code a user can switch to changing a purchase requisition. That means that knowing which transaction code the end user executed is insufficient to identify the activity.


SAP systems can be highly customized, but there are some standard processes that are similar between companies. Typical examples are the sales process (order-to-cash, OTC) and the purchasing process (purchase-to-pay, PTP).

Though there are some company-specific differences, these processes are roughly the same, independent of the company’s industry, ERP system version, and country. For this project, a purchase-to-pay process in an SAP ERP 6.0 EHP 7 system (released 2013) is used as a reference. Figure 2-3 shows the activities and associated transaction codes of the process.

Figure 2-3: The SAP standard procurement tasks and related transaction codes

The PTP process can be modelled at different detail levels. Figure 2-3 shows the PTP process with only the SAP standard steps [8]. These include the creation of a purchase requisition (PR) by the requesting department, e.g. the manufacturing plant. Then a purchase order (PO) is created and sent to the supplier, normally by the purchasing department assigned to the specific material class. Afterwards, the goods or services are delivered and the invoice is received. Next, the invoice is forwarded to the accounting department and cleared. Depending on the company’s needs, additional or optional steps can be designed. Optional steps can be the creation of the PR, approvals, and vendor confirmations. These typically depend on the ordered volume. In other scenarios, a PO might not even be required if the value of the goods is below a certain threshold. For this project, a simple process as shown in figure 2-3 is assumed, as this is the basis for more detailed process models and therefore offers more flexibility.

2.3.2 Identification of Process-Relevant ERP Data

The SAP ERP system stores purchase orders, material movements, invoices and other activities as entries in database tables. For example, a line is added to table EKKO for every newly created purchase order. Each line of the table represents a single purchase order. However, EKKO only holds the header information for each purchase order, for example the creation date, the supplier, etc. The items of a purchase order, i.e. which product is ordered in which quantity, are stored in table EKPO, the item table related to EKKO. A purchase order with two different items therefore has one entry in table EKKO and two entries in table EKPO.

This separation of header and item data is a common theme in SAP ERP. To know which items belong to which header, SAP uses foreign key relations.

These logical connections are not necessarily implemented on database level.

Therefore, reliable information on the logical connection needs to be extracted on SAP application layer level.
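As an illustration of this header/item separation, the following sketch combines simplified EKKO (header) and EKPO (item) rows, assumed to be already extracted via the application layer, into a single purchase order creation event with an item-count attribute. The field names (EBELN, EBELP, AEDAT, MATNR, MENGE) follow common SAP naming but should be treated as assumptions here, as should the example values.

```python
# Hypothetical, simplified rows from the PO header table (EKKO) and item table (EKPO).
ekko = [{"EBELN": "4500000001", "AEDAT": "2017-03-02"}]
ekpo = [
    {"EBELN": "4500000001", "EBELP": "00010", "MATNR": "MAT-1", "MENGE": 5},
    {"EBELN": "4500000001", "EBELP": "00020", "MATNR": "MAT-2", "MENGE": 2},
]

# Group item rows by their document number (EBELN), the logical foreign key to the header.
items_by_po = {}
for item in ekpo:
    items_by_po.setdefault(item["EBELN"], []).append(item)

# One "Create Purchase Order" event per header row; the related items become attributes.
for header in ekko:
    event = (header["EBELN"], "Create Purchase Order", header["AEDAT"])
    case_attributes = {"item_count": len(items_by_po.get(header["EBELN"], []))}
    print(event, case_attributes)
```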


SAP uses additional foreign keys to represent relationships between documents and their predecessors. Relationships between purchase requisition items and purchase order items are established, as well as between any other involved business documents. Within the SAP ERP system, foreign keys are based on item level, not on documents or headers. Automatic identification of foreign key relations between tables to identify a process flow is not possible; domain knowledge about the relations and their meaning is required [4].

Due to the architecture of the SAP system and the typical business scenarios, multiple predecessors can point to the same successor. It is also possible that multiple successors reference back to the same predecessor. From a database perspective, these situations can be identified as many-to-many relationships of foreign keys (m-to-n relationships). This influences the definition of cases, as multiple business documents of the same type can be assigned to a case. In turn, this impacts the ability of the event log to accurately represent reality.

This situation, known as divergence and convergence of cases, will be discussed in chapter 2.7.

From the header and item tables we can retrieve information about the document relations and therefore the caseid of the process instance. The caseid identifies the documents of a specific case, though the caseid is not part of the ERP data. It is assigned to the business document for process mining purposes only. From those tables, additional attributes of the business documents involved in the process can be obtained. Only a fraction of all columns of the tables needs to be extracted, as not all information stored in the ERP system is relevant for process mining. In the PTP example, these attributes may include the supplier name, quantity and price. However, these tables contain only the most recent state of the data. To track changes of data, it must be retrieved from somewhere else.

Tracking changes of ERP data is an important feature for reasons of transparency and auditing. In SAP ERP, change and logging tables are used for this purpose. There are two types of change tables. The first type tracks changes of specific object types. For example, table EKBE (purchase order history) records changes of purchase orders. The second type tracks changes of any type of object or table. Tables CDHDR (change table header) and CDPOS (change table items) are such tables. Tracking in the change tables CDHDR and CDPOS needs to be enabled for each table, field and activity, e.g. for EKKO all inserts, updates and deletes on any table field shall be tracked. Due to legal auditing requirements, most relevant tables and fields are tracked by default.
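The following sketch indicates how change events could be derived from simplified CDHDR and CDPOS rows. The field names (OBJECTID, CHANGENR, UDATE, UTIME, TABNAME, FNAME, ...) follow the usual SAP naming but are used here only as illustrative assumptions, not as the prototype's actual extraction logic.

```python
# Hypothetical, simplified change header (CDHDR) and change item (CDPOS) rows.
cdhdr = [
    {"OBJECTCLAS": "EINKBELEG", "OBJECTID": "4500000001",
     "CHANGENR": "0000000042", "UDATE": "2017-03-05", "UTIME": "10:31:12"},
]
cdpos = [
    {"CHANGENR": "0000000042", "TABNAME": "EKPO", "FNAME": "MENGE",
     "VALUE_OLD": "5", "VALUE_NEW": "8"},
]

# Index the change items by change number so they can be joined to their header.
items_by_change = {}
for pos in cdpos:
    items_by_change.setdefault(pos["CHANGENR"], []).append(pos)

# Each change header plus one of its items yields one fine-grained event,
# e.g. "Change EKPO: MENGE", with a full timestamp from UDATE and UTIME.
events = []
for hdr in cdhdr:
    for pos in items_by_change.get(hdr["CHANGENR"], []):
        activity = f'Change {pos["TABNAME"]}: {pos["FNAME"]}'
        timestamp = f'{hdr["UDATE"]} {hdr["UTIME"]}'
        events.append((hdr["OBJECTID"], activity, timestamp))

print(events)
```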

Figure 2-4 gives an overview of the most relevant tables and their foreign key relationships for the PTP process. Knowing the required tables and relations is not intuitive. For the PTP process, more than 30 tables can be required, including both transactional and master data. The figure shows the most important transactional tables in the centre, the existence of foreign keys between them, and the business area they are assigned to. The outer circle shows a very short extract of the master data tables used in the PTP process, also arranged based on their business area. As these master data tables have foreign key references to most of the involved transactional tables, these references are not shown in the figure.



Figure 2-4: Relevant SAP tables for the PTP process, including master and transactional data

From transactional data, event information and attributes can be retrieved.

This needs to be put into context with master data, e.g. the meaning of company codes, the location of the plants, or the currency conversion rates on the day of the transaction. Each process may require different tables or different fields of a table. An automatic identification of relevant tables would speed up process mining of new processes, especially non-standard processes. However, this is complex and error-prone. We will discuss approaches from research and practice in chapter 2.9, and show the limitations of automatic table identification.

The data structure is the same for all SAP ERP systems. Therefore, all companies that use SAP ERP for their purchasing process use roughly the same business documents, tables and fields. The customization is hidden in the content of the tables. If a specific document type is rarely used, e.g. if a purchase requisition is only rarely required, the corresponding table is also only rarely changed.

Customizing settings for purchase order types are stored in table T161. Though the content can differ widely between companies, the settings will still be stored in that table. Due to this, a standardized extraction of purchasing data for process mining is feasible and still reflects the company-specific requirements and processes.


2.4 Guiding Principles to Data Extraction

Before the extraction, and therefore before the process mining, can take place, a decision is needed on which selection of data is suitable and on how to convert the data into the required format described in chapter 2.2.2. For this, the IEEE Task Force on Process Mining advises following six guiding principles [1]:

- GP1: Event data should be treated as first-class citizens
- GP2: Log extraction should be driven by questions
- GP3: Concurrency, choice and other basic control-flow constructs should be supported
- GP4: Events should be related to model elements
- GP5: Models should be treated as purposeful abstractions of reality
- GP6: Process mining should be a continuous process

GP3 refers to process modelling notations. This needs to be supported by the process mining tool and is unrelated to the event log extraction. The only requirement this poses for a log is the need for timestamps. GP5 deals with the interpretation of results, which is also unrelated to the event log extraction. This leaves GP1, GP2, GP4, and GP6 as the principles relevant for the log extraction.

GP1 states that event data should be treated as a first-class citizen. In other words, process-oriented data should be preferred over other data. The Process Mining Manifesto [1] states criteria to evaluate the data quality for process mining. This will be discussed in more detail in the next chapters. It is important to note at this point that data should be selected based on its suitability for process mining.

GP2, question-driven log extraction, is important before a project can start.

Extracting data is not about extracting much data, but about extracting the required data. In order to do that, leading questions must be defined. Such a question for a procurement process could be: How often and under which circumstances are cash discounts for early invoice payment being used?

Defining these questions is a non-technical task and requires domain knowledge. This principle must be considered when trying to standardize log extraction, as a standardized extraction, especially of attributes, predefines the purpose of the process mining and the answers it can provide [9].

GP4, the relationship between logs and model elements, might seem trivial at first. The mapping of elements in the process model and in the log is important when checking for conformance or enhancements. The mapping itself is done in the process mining tool and is therefore not directly related to the log extraction. However, two factors of this principle are relevant before logs can be extracted. Firstly, the granularity of events in the log needs to match the granularity in the model. If the logs are more fine-grained than the model, the mining application might be able to handle this; if not, the log creation would be required to summarize events until the granularity level of the activities in the process model is reached. In the opposite scenario, where the model is more detailed than the extracted logs, the analysis does not work.

Apart from the granularity of the extracted events, the events themselves need to express relevant business activities. These events should not relate to technical activities in the SAP system, but to activities within the process scope.


Lastly, GP6 states that process mining should be a continuous process. When looking at the log creation, this expresses the need to extract event data continuously, or at least regularly, from the source systems. Processes may be ever-changing, and in order to be able to navigate a business, information on the current status of processes is needed. If real-time data is available, exceptional cases can be identified and dealt with before they lead to issues.

This work discusses special considerations for log extraction if logs are retrieved in near-real-time.

2.5 The Event Log - Practical Challenges

So far, it was discussed what an event log should look like and which information it should contain. To produce the desired output, a number of challenges have to be met, depending on the data source. According to van der Aalst [2] there are five typical challenges when creating an event log:

- C1 correlation: When data is scattered across multiple tables or even data sources, it is difficult to identify which event belongs to which caseid (see also [3]). This is also a major challenge in this work and will be discussed in chapter 2.7.

- C2 timestamps: Times between servers of different data sources may differ, the servers might be located in different timezones or be subject to daylight saving time, or timestamps can be too coarse-grained (e.g. date only, no information on the time of day). This can make it hard to establish the correct sequence of activities. Additionally, start and end times of events can be overlapping. Especially in SAP systems, we will see that date and time are often stored separately and that in some instances only a date, not the time, is saved. For events that happen on the same day, e.g. the creation of a purchase requisition and its release, this might lead to a wrong order of events if only one of the events includes a timestamp (see the sketch after this list).

- C3 snapshots: Within the log, the beginning or the end of a case is possibly not recorded if the log is cut off by a time range. This will be especially interesting when extracting logs continuously from a source.

The process mining application will have to deal with such inconsistent and unfinished process instances. Another challenge related to snapshots is the decision whether a case is finished or not. Werner [5]

shows, in an SAP example, that many cases can have few activities (“transactions”), while a few cases have many activities, including recurring loops at the last activity. It is therefore not possible to identify the end of a case by knowing the last activity, as the same activity type might reoccur later. In a complete log, the end can be identified, as the whole picture is known. In a continuous log, however, the process mining application needs to deal with this incompleteness of information.

- C4 scoping: Only a selection of tables from a large source like an ERP system is relevant for process mining. Selecting all relevant, but no irrelevant, data can be difficult. There are different approaches by other researchers to achieve this, which will be discussed in chapter 2.9.


- C5 granularity: The events in the event log may be at a different granularity level than the relevant activities for the end user (compare also GP4 in chapter 2.4). An example is the creation of a purchase order in SAP ERP, where the purchase order itself and each item that belongs to it might create an entry in the event log. From an end user’s view, this is one activity. The handling of such events can be done at different levels, mostly depending on the process mining goal [9]. As it impacts the event log creation and also interferes with challenge C1 (correlation), the topic of granularity will be further discussed in chapter 2.7.
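As a small illustration of the timestamp challenge (C2), the following sketch orders the events of a case when some entries carry only a date: a domain-defined activity precedence breaks ties between date-only events on the same day. The activity names and precedence values are assumptions for illustration, not part of the thesis prototype.

```python
from datetime import datetime

# Hypothetical precedence: a creation should sort before its release when both
# were recorded on the same day without a time component.
PRECEDENCE = {"Create Purchase Requisition": 0, "Release Purchase Requisition": 1}

def sort_key(event):
    caseid, activity, date_str, time_str = event
    day = datetime.strptime(date_str, "%Y-%m-%d").date()
    clock = (datetime.strptime(time_str, "%H:%M:%S").time()
             if time_str else datetime.min.time())
    # Sort by date, then by time of day if known, then by domain precedence.
    return (day, clock, PRECEDENCE.get(activity, 99))

events = [
    ("PR-10", "Release Purchase Requisition", "2017-03-01", None),
    ("PR-10", "Create Purchase Requisition", "2017-03-01", None),
]
print(sorted(events, key=sort_key))  # creation is placed before the release
```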

Apart from these more general challenges, there are some challenges that are specific to ERP systems. Especially correlation (C1) and granularity (C5) can lead to data quality issues.

2.6 Data Quality of the Event Log

The quality of the process mining results depends on the quality of the logs [2].

In order to be able to measure the quality of a log, criteria have to be defined.

Scientific publications have scarcely covered any benchmarks for process mining event logs [1], [3]. There are some publicly available event logs that can be used to evaluate process mining tools. However, processing a self-extracted event log using an established process mining tool does not yield any information about the quality of the event log. It only provides a validation of the syntax.

There are some characteristics which can be used to evaluate process data. van der Aalst classifies quality issues of the event log into the following categories [2]:

- Missing in the log: The entity exists in reality but cannot be found in the event log. This can happen if, for example, the activities do not require the IT system, e.g. manual tasks like the wrapping and unwrapping of goods.

- Missing in reality: The entity does not exist in reality but was recorded in the event log. For example, received goods are considered to be of correct quality either after a quality check has been performed or automatically after a specific time has passed since receipt. The change of the “quality-was-checked” flag might thus be triggered either by a user action or by a background job, which is not necessarily clear from the log. The first case should be recorded in the event log, as it is a real event, while the second should not.

- Concealed in log: An entity was recorded and exists in reality, but it is unclear where the entity was recorded, or it cannot be assigned correctly.

For example, if multiple invoices are cleared by a single payment, the individual amounts paid by the transactions are hidden inside the larger payment and need to be split for analysis.

All three problems can occur at any level in the log. A single attribute, an event, or even a whole case can be affected. All these problems arise from the data within the system, and are merely a consequence of the problems inside the source system, here SAP ERP. Extraction and transformation of data will not be able to overcome these challenges, as long as these quality issues arise from the source ERP system.


The IEEE Task Force on Process Mining categorizes data sources into five levels of data quality. The higher the level of the data, the more useful it is for process mining. ERP, CRM and other classical business IT systems, as used in this work, are classified as level 3, which is the minimum level for being able to perform process mining at all [1]. Level 4 and level 5 event logs originate from process-oriented systems like BPM or workflow systems. Process mining can be directly applied to data of these levels. Level 3 logs, on the other hand, which originate from ERP systems, are usually correct, i.e. the events took place as described. But sometimes the information is hidden or unclear, which corresponds to the quality issue categories defined above. The challenge with these logs is that they are in an unsuitable format and need extensive transformations.

This, however, only classifies the quality of the data source, not the quality of the event log after transformations were applied. The event log quality depends on the transformations applied to the original data. It has a good quality for process mining if it is accurate, precise, and comprehensible. It is hard to define how this can be measured. Partly, event log quality can be improved by addressing the correct identification of cases.

2.7 Convergence, Divergence, and Granularity of the Event Log

A variety of decisions on how to address challenges C1 (correlation) and C5 (granularity) impact the conclusiveness of the event log. The process mining tool can only find information that is available in the event log, and the log must therefore address the questions that the process mining project shall answer.

2.7.1 Convergence and Divergence

Defining the correlation of documents and events, and thereby identifying cases within the SAP ERP system, is challenging, as there is not always a 1-to-1 relationship. A case of a PTP process may have multiple invoices referring to the same purchase order, or, the other way around, multiple purchase orders may be billed in a single invoice. These problems are known as divergence and convergence of cases [5], [6].

Divergence of data describes the situation when the same activity is performed multiple times for the same case. Figure 2-5 shows this for a purchase order, for which two deliveries are received. From a process mining point of view, this looks like performing the same activity multiple times, though technically in the SAP system, the activity is performed on different documents.

Convergence of data describes the opposite situation. The same event cannot be assigned to a single case, but is related to multiple cases. From a process mining point of view, it looks as if the same activity was performed on multiple cases (one event per case), though in the SAP system, it was performed only once. Compare figure 2-6 for an example.



Figure 2-5: Divergence of cases visualized from different views to show the gap between the activities in reality and how they are perceived by process mining

Figure 2-6: Convergence of cases visualized from different views to show the gap between the number of activities in reality and the number of activities in the event log

Convergence and divergence of cases are the consequence of having n-to-m relationships of foreign keys, i.e. allowing multiple references to the same object in both directions [4], [5]. Solving convergence and divergence is hardly possible without workarounds. One of these workarounds is that, in a convergent case, the convergent event is duplicated and appended to the event log once for each of the affected cases (compare figures 2-5 and 2-6).


In a divergent case, there is only the possibility to repeat the same activity for the case, as the activities are actually performed on different objects, e.g. different invoices. In the SAP PTP process, convergence and divergence can occur at any involved document, i.e. PR, PO, delivery, invoice and payment.
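The following sketch illustrates the duplication workaround for a convergent event: a single invoice clears two purchase orders, so the one real-world activity is appended once per affected case. Document numbers and timestamps are made up for illustration only.

```python
# Hypothetical convergent situation: one invoice clears two purchase orders.
invoice = {"doc": "INV-9001", "timestamp": "2017-03-20 08:45:00",
           "cleared_pos": ["4500000001", "4500000002"]}

event_log = []
for po in invoice["cleared_pos"]:
    # The same real-world activity is duplicated, once for each affected case.
    event_log.append((po, "Receive Invoice", invoice["timestamp"]))

print(event_log)
# [('4500000001', 'Receive Invoice', ...), ('4500000002', 'Receive Invoice', ...)]
```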

One approach to deal with convergence and divergence is using an artifact-centric approach [5]. This approach allows divergence and convergence by creating event logs for each of the involved artifact instances (each representing a document like a PO or an invoice). The event logs only contain the changes of the artifact, and interactions require multiple event logs for the interacting artifacts. This approach must be supported by the process mining tool, as it differs from a traditional event log. As the process mining tool used in this work does not support this, the artifact-centric approach is not followed.

Figure 2-7: Schematic representation of artifact-centric event logs, adapted from [5]

2.7.2 Granularity of Case Definition

Convergence and divergence can partly be solved by selecting a different level of granularity in the log [6]. Granularity of cases refers to the aggregation of documents. A case can be defined as the sequence following a single purchase order line item. As an alternative, multiple purchase order line items can be combined into a case at the level of the header table, and the PO itself is the defining document for a case. Therefore, coming from the same original data, more or fewer cases are identified. Some activities in the SAP system are tracked at item level, while others are tracked at header level. The decision for a granularity also impacts the way of handling these events. If a header-level activity, e.g. the release of a PO, is performed and cases are tracked at item level, this activity needs to be entered in the event log for every case related to the PO header. It therefore occurs more often in the event log than it was performed in reality.

In the opposite scenario, cases are tracked at header level, and an item-level activity, e.g. invoice receipt, occurs on all items at once. This activity should be merged into a single activity. This results in different variants for the same process instance, depending on the notion of the case. Compare figure 2-8 for an example. It shows how the same PO items are seen by the process mining


tool if the case granularity is defined on PO item level or on PO level. On PO level, it is hard to differentiate between activities that affected all items and activities that only affected a fraction of the PO items. This therefore requires a correct repetition and merging of activities. Convergence and divergence can occur at every level of granularity and must therefore be considered during event log creation.

Figure 2-8: The granularity of a case impacts the number of events related to the cases, and the variants retrieved during process mining, as shown in an example of a PO with two items
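The following sketch illustrates the merging step for header-level case granularity: item-level events that represent the same activity at the same time are collapsed into a single event per PO. The caseid scheme (PO number plus item number) and the example data are assumptions for illustration only.

```python
# Hypothetical item-level events; the caseid encodes PO number and item number.
item_events = [
    ("4500000001-00010", "Receive Invoice", "2017-03-20"),
    ("4500000001-00020", "Receive Invoice", "2017-03-20"),
    ("4500000001-00010", "Receive Goods",   "2017-03-10"),
]

merged = set()
header_events = []
for caseid, activity, ts in item_events:
    po = caseid.split("-")[0]      # header-level caseid: the PO number
    key = (po, activity, ts)
    if key not in merged:          # keep one event per activity and timestamp
        merged.add(key)
        header_events.append(key)

print(sorted(header_events, key=lambda e: e[2]))
# [('4500000001', 'Receive Goods', '2017-03-10'), ('4500000001', 'Receive Invoice', '2017-03-20')]
```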

In addition to the granularity level of cases, it must be decided whether a case has a leading document that identifies the case. A leading document can be any business document involved in the process, e.g. the PR, the PO or the invoice.

A new case is created with the creation of the document, and any referenced documents are automatically assigned to the same case. This means that there cannot be any convergence or divergence at this document in the process, though they can occur earlier or later in the process. Deciding on a fixed leading document has the advantage that a case is always clearly identifiable by a business document. However, this presupposes that the document is included in all cases, so incomplete process instances cannot be identified this way. If a purchase order is the leading document, no cases consisting only of purchase requisitions can be identified, even though rejected PRs (without POs) might offer interesting insights to evaluate during process mining. Also, when choosing as leading document a document that normally appears late in the process, e.g.

the invoice in PTP, continuous extraction is not possible, as the case can only be clearly identified after the invoice was received. As earlier documents are not always mandatory parts of the process, deciding on a leading document can be misleading and can narrow the process mining results.

An alternative to setting a fixed leading document is setting a dynamic leading document. In that scenario, the first document that cannot yet be mapped to an existing case becomes the leading document of a new case. This makes the identification of a new case more complex and might cause confusion when analyzing the results of process mining at a later stage, since the documents are not necessarily created in the intended order. If a PO was created before a PR, the PO might be the leading document, even if in most other cases the PR would be the first document. Also, the first document occurring might be a bad choice if it is typically a converging document. In that scenario, a hierarchy


of leading documents could be defined, based on which cases are identified. For example, one can define that a PO is the leading document, unless there is a case without a PO, in which case the PR becomes the leading document.

The sequence of document creation is a problem when identifying cases in a continuous extraction environment. Foreign key relations sometimes, but not always, point in both directions. In the PO, there is a reference to a PR, and in the PR, there is a reference to the PO. When continuously extracting data, the relation needs to be set at the time of extracting the creation event. If the reference cannot be made at the time of creation, a later reassignment of a case is usually impossible. Therefore, it is most suitable for such an environment to set the dynamic leading document to the first document occurring in the case.
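A minimal sketch of such a dynamic case assignment follows, assuming that every incoming document carries a (possibly empty) list of references to predecessor documents; the document numbers and the caseid scheme are hypothetical, not the prototype's implementation.

```python
# Sketch: the first document that cannot be mapped to an existing case opens a new case.
case_of_document = {}   # document number -> caseid
next_case = 1

def assign_case(doc, predecessors):
    """Return the caseid for a newly extracted document, following its references."""
    global next_case
    for pred in predecessors:
        if pred in case_of_document:           # reference to an already known case
            case_of_document[doc] = case_of_document[pred]
            return case_of_document[doc]
    caseid = f"case-{next_case}"               # no known predecessor: open a new case,
    next_case += 1                             # this document becomes its leading document
    case_of_document[doc] = caseid
    return caseid

print(assign_case("PR-10", []))                 # case-1 (PR is leading document)
print(assign_case("PO-4500000001", ["PR-10"]))  # joins case-1 via its PR reference
print(assign_case("PO-4500000002", []))         # case-2 (PO created without a PR)
```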

2.7.3 Hierarchies and Granularity of Event Definition

Decisions on granularity, leading documents, and other representational questions bias the potential outcome of process mining [2]. Conformance checking against a process model or the discovery of a new process model can only include the level of detail that is already present in the event log. Few process mining tools support the representation of hierarchical activities and can thereby express process models on different granularity levels [2], [3].

Events can be extracted from the data source on different granularity levels.

When a purchase order is changed, it is possible to record that it was changed, or which part (database field) was changed, or even whether the value was increased or decreased. When extracting process models from cases with fine-granular events, more distinct variants will be found. It therefore may make sense to group the events into hierarchies. An increase of the order volume is still a change of the order volume, which in turn is a change of the purchase order.

By allowing such hierarchical grouping, fewer variants are seen on a higher level, which can benefit the analysis at that level. But the details are still kept, and when analyzing individual cases, the information is available.
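The following sketch shows one possible way to realize such a hierarchical grouping: a simple mapping lifts fine-grained activity names to coarser ones, so the same event log can be analyzed at different levels. The activity names and the hierarchy itself are illustrative assumptions.

```python
# Hypothetical activity hierarchy: each entry maps an activity to its parent.
HIERARCHY = {
    "Increase Order Volume": "Change Order Volume",
    "Decrease Order Volume": "Change Order Volume",
    "Change Order Volume":   "Change Purchase Order",
    "Change Delivery Date":  "Change Purchase Order",
}

def lift(activity, levels=1):
    """Map an activity to a coarser one, climbing the hierarchy `levels` times."""
    for _ in range(levels):
        activity = HIERARCHY.get(activity, activity)
    return activity

print(lift("Increase Order Volume"))            # Change Order Volume
print(lift("Increase Order Volume", levels=2))  # Change Purchase Order
```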

Even though the process mining tool used for this project does not currently support hierarchical logs, the question of event granularity is important when extracting the event log. The event log is intended to be used for conformance checking of processes, and the event log entries should, per the previously defined guiding principle GP4 (chapter 2.4), be related to the modelled elements of the process model. The granularity must therefore be defined at an equal level in the to-be process model and in the event log.

However, the same process may be modelled at different levels of granularity.

In that scenario, either multiple event logs on different levels of granularity must be defined, or a decision for a target granularity level must be taken.

2.8 Requirements for Continuous Event Log Extraction

The Process Mining Manifesto [1, p.9] states that "it should also be possible to conduct process mining based on real-time event data." The authors compare it to driving a car and having GPS data projected on a map while driving. Using real-time event data can improve cycle times (the time a process instance needs from the beginning to the end) or avoid quality issues (e.g. in the sense of Total Quality Management or Six Sigma [1]). However, they do not state what they mean by real-time. For this work it is assumed that near-real-time data, i.e. multiple updates per hour, is sufficient to react to events in the process execution, especially as many relevant events only carry date information and no complete timestamp.

Most of the common process mining algorithms assume unchanging event log data [3], [2]. However, this is an issue that the process mining application, not the event log extraction, needs to deal with. The main challenge for continuous event data is the case assignment: the difficulty of identifying the beginning and end of a case also makes it harder to decide whether an incoming event belongs to an existing case or opens a new one [2].

There seem to be few reports on required processing speed or typical data volume when continuously extracting data from SAP ERP systems. The volume depends on the number of analyzed processes, as well as the number of cases of each process per day. It is therefore highly dependent on the company that runs the ERP system. The required extraction speed, i.e. the time lag between the execution of an activity and its extraction from the log, mostly depends on the use case. In most cases, when a process instance usually spans multiple days, lagging behind a few hours in the process mining application would still be acceptable. This would be the case for a typical PTP process, when the goal is the analysis of late-paid invoices with average payment terms of 30 days. If, on the other hand, the analysis covers the sales process of an online retailer that offers overnight delivery of goods, a few hours of lag may make the process mining tool unusable for identifying acute bottlenecks within the process; it can of course still be used to identify conformance issues or drifts in the process. In conclusion, continuous extraction requirements are mainly influenced by the purpose of process mining.

2.9 Existing Research on Log Extraction

For the challenges of data extraction and event log creation, researchers and companies have developed different approaches. In this project, former research on the identification, extraction and transformation of SAP ERP data into event logs for process mining was enhanced. Ideally, one could combine the existing approaches into a single project. However, this is not possible without adaptation, as will be shown in this work. Still, some of the most promising approaches considered for this work are introduced below. They are analyzed for their applicability to this work's use case and their integrability with one another.

In his PhD thesis, which won the IEEE Task Force on Process Mining's best thesis award in 2016, Burratin [3] published an algorithm to automatically identify processes and cases within a large data set (refer to chapter 10 of [3]). He applied relational algebra to tuples of activity, timestamp and multiple attributes. Based on this input, the algorithm automatically detects related attributes in the dataset; these attributes identify processes and cases that belong together. This approach has the advantage that the cases are identified in a single pass over the data, but it is computationally heavy and requires additional domain knowledge to speed up the computations. In an SAP system, these identifying attributes should be the foreign key relationships between tables.

But as all tables are heavily interconnected, domain knowledge is required to define a cut into different processes. This cut corresponds to the manual definition of relevant tables based on expert knowledge, as it is done today. Additionally, the attributes defining a case are not necessarily the same as those that are relevant for describing the case. Often, the identifiers of a case are technical IDs, which have little meaning for the later analysis of the case. The definition of additional attributes can therefore not be automated using this approach.

This approach is not applied in this work, as it does not reduce the amount of manual work.

In his master thesis in 2004, van Giessel [11] developed a tool called table-finder that identifies relevant data for an application; an application in this context is roughly a business process. This approach was intended to reduce the manual work as well as the extensive domain knowledge required to extract event logs. The tool is tailored to SAP systems. In multiple loops it identifies relevant tables based on their hierarchical structure, which compares well to a company's value chain, as different hierarchical levels are considered. But the approach has its limitations. On the one hand, only processes within the same SAP component can be identified. On the other hand, irrelevant tables are potentially marked as relevant for a process and have to be removed manually, and change records must be added afterwards, as these tables belong to technical components of the SAP system. Moreover, the foundation of van Giessel's approach was shown to be error-prone: Mendling et al. [12] detected mistakes in 2008 in the SAP reference model and its Event-driven Process Chains (EPCs), on which van Giessel's approach is built.

Piessens [4] focuses in his master thesis from 2011 on the extraction of data from the SAP source system. He evaluated extraction via IDocs and via database interfaces and decided to use database interfaces. He based this decision on the incorrect assumption that only two types of databases (Oracle and MaxDB) are supported by SAP, while in reality SAP supported four database types at that time (today: six), not counting multiple versions. The described process is otherwise straightforward, but requires manual work. This approach allows for continuous data extraction, as the case mapping is done after the extraction of the log. Piessens also provides an approach to deal with difficult cases, mainly divergent and convergent ones. This work therefore applies some of the concepts of Piessens in the extraction and case identification, enhanced by additional considerations regarding continuous extraction and the handling of ambiguous cases, events and attributes.

Buijs [6] dedicated his master thesis in 2010 to developing a graphical tool that allows process experts to map relations of table fields. The fields can also be mapped to attributes of XES logs, so that process experts can extract the data they want without the need to program or perform technical configuration. The approach presented by Buijs is not specific to any ERP system and benefits from the structure of the XES log format. Though the XES format will not be used in this project, the idea of allowing non-technical users to define attribute mappings is interesting and can be seen as a possible extension to this project.

Lu [5] evaluated artifact-centric log extraction in her master thesis in 2013. This approach focuses on artifacts, i.e. documents related to the business process such as a purchase order, their lifecycle, and their interactions. From there, all events related to the artifact chain are extracted and transferred into a proclet notation. Due to the artifact-centric approach, data divergence and convergence are allowed, as these are divided into smaller processes that represent an artifact's lifecycle. The interactions are traced both on event level and on trace level, and thereby artifacts are assigned to (multiple) cases. However, the automatic identification does not work without manually applied domain knowledge. As the artifact-centric approach results in multiple event logs, one per involved artifact, using this type of log with traditional process mining tools requires further processing of the logs, and the benefits of the artifact-centric approach are lost.

Apart from research, there are also commercial products that focus on the extraction of event logs from SAP ERP systems. There are extraction solutions from consulting firms, e.g. Process Analytics Factory [13] and Deloitte [14], that put the process analysis for a given scenario into focus, often in a project-based approach. ABSI (Automated Business Scenarios Identifier), which also originates from a consulting firm, Banyas, focuses on the continuous analysis and follow-up of individual cases. It does not reuse existing data from the system, but replicates master data and tracks all changes to transactional data in a database of its own (within the SAP system), from which it identifies processes. Though technical details are not disclosed, the approach reportedly identifies processes and process variations automatically by tracking data in the SAP system [15].

The work presented above covers most of the questions raised in this project. It has been studied for its validity for this project, its applicability in the project, and its integrability with the other requirements and solutions in use.

None of the studied approaches can be applied without changes. What will be reused is

a) the identification of cases by following foreign key relationships (as in [4], [11], [3] and others); a minimal sketch of this idea is given after this list.

b) the evaluation based on the SAP IDES system (as in [4], [10] and others). This is useful on the one hand to evaluate the solution on the known difficult cases, and on the other hand it offers the possibility to compare results.

c) the focus on handling convergent and divergent cases (as in [3], [4], [5] and others). These approaches will be studied from a traditional process mining point of view, but will be enhanced by also analysing the impact on events and case attributes. Especially the handling of attributes was not covered in depth by the mentioned research.
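To illustrate idea a) in a batch setting, the following minimal sketch groups documents that are transitively linked by foreign key references into one case; the document identifiers, reference pairs and the union-find grouping are illustrative assumptions, not the continuous solution developed in this work.

```python
# Hypothetical sketch: batch case identification by following foreign key
# relationships. Transitively linked documents end up in the same case.
from collections import defaultdict

def identify_cases(links):
    """links: iterable of (document_id, referenced_document_id) pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for doc, ref in links:
        parent[find(doc)] = find(ref)      # merge the two groups

    groups = defaultdict(list)
    for doc in list(parent):
        groups[find(doc)].append(doc)
    return {f"case-{i}": docs for i, docs in enumerate(groups.values(), start=1)}

# identify_cases([("PO-4500001", "PR-100045"), ("INV-900001", "PO-4500001")])
# -> {"case-1": ["PO-4500001", "PR-100045", "INV-900001"]} (order may vary)
```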
