
An Automated Methodology for Identification and Analysis of Erroneous Production Stop Data

Master Degree Project in Virtual Product Realization
One Year Level, 18 ECTS

Spring Term 2020

Sopal Soman

Supervisors: Sunith Bandaru

Kim Andersson (VOLVO GTO)

Examiner: Amos Ng


Abstract

The primary aim of the project is to automate the process of identifying erroneous entries in stop data originating from a given production line. Machines or work stations in a production line may be stopped due to various planned (scheduled maintenance, tool change, etc.) or unplanned (breakdowns, bottlenecks, etc.) reasons. It is essential to keep track of such stops for diagnosing inefficiencies such as reduced throughput and high cycle time variance. With the increased focus on Industry 4.0, many manufacturing companies have started to digitalize their production processes. Among other benefits, this has enabled production data to be captured in real-time and recorded for further analysis. However, such automation comes with its own problems. In the case of production stop data, it has been observed that in addition to planned and unplanned stops, the data collection system may sometimes record erroneous or false stops. There are various known reasons for such erroneous stop data, including not accounting for lunch breaks, national holidays, or weekends, loss of communication with the data collection system, etc. Erroneous stops can also occur due to unknown reasons, in which case they can only be identified through a statistical analysis of stop data distributions across various machines and workstations. This project presents an automated methodology that uses a combination of data filtering, aggregation, and clustering for identifying erroneous stop data with known reasons, referred to as known faults. Once the clusters of known faults are identified, they are analyzed using association rule mining to reveal machines or workstations that are simultaneously affected. The ultimate goal of automatically identifying erroneous stop data entries is to obtain better empirical distributions for stop data to be used with simulation models. This aspect, along with the identification of unknown faults, is open for future work.


Acknowledgements

I would like to express my heartfelt gratitude to Dr. Sunith Bandaru, Associate Professor, Department of Engineering Science, for his support and supervision throughout the curriculum and for making this project possible.

I also take the privilege to thank Mr. Kim Andersson, Simulation Engineer, VOLVO GTO, Pentahuset, Skövde, for providing me with the best atmosphere and facilities for this project.

Unflinching encouragement and support from friends at the University of Skövde and the professionals at Volvo helped greatly in the betterment of the project. I thank all of them from the bottom of my heart.

Skövde, March 2020

Sopal Soman


Certificate of Authenticity

Submitted by Sopal Soman to the University of Skövde as a Master's Degree Thesis at the School of Engineering.

I certify that all material in this Master Thesis Project which is not my own work has been properly referenced.

Signature.

Sopal Soman


Table of Contents

1. Introduction
1.1 Background
1.2 Frame of Reference
1.3 Aim and Objectives
1.4 Scope and Limitation
2. Literature Review
2.1 Down Time
2.2 Simulation
2.2.1 Simulation Modelling Part
2.2.2 Application of Simulation
3. Research Methodology
3.1 Design and Creation
3.2 Data Collection Techniques
4. Identification of Erroneous Stop Data
4.1 Problem Description
4.1.1 Known Faults Occurring in the Stop Data
4.1.2 Unknown Faults
4.2 Proposed Methodology
4.2.1 Clustering
4.2.2 Association Rule Mining and Apriori Algorithm
5. Results and Discussion
5.1 Scatter Plots
5.1.1 Scatter plot by work areas
5.1.2 Scatter plot by the month of the year
5.1.3 Scatter plot by the hour of the day
5.1.4 Scatter plot by day of the week
5.2 Clustering
5.2.1 Deciding the optimum number of clusters for KMeans clustering
5.3 Cluster Analysis
5.3.1 Frequency graph of top 10 work stations in each cluster
5.3.2 Histogram of stop difference in each cluster
5.4 Application of Association Rule on Each Cluster
6. Discussion
7. Conclusion
7.1 Further Research
Appendix A


Table of Figures

Figure 1: Pattern shows all stations having a similar start and end time
Figure 2: Pattern shows only one part of the line has a similar start and end time
Figure 3: Pattern shows alternate stations having a similar start and end time
Figure 4: Patterns having a falling start time
Figure 5: K-Means clustered data
Figure 6: Pseudocode for KMeans clustering
Figure 7: Gaussian-mixture clustered data
Figure 8: Pseudocode for association rule mining and the Apriori algorithm
Figure 9: Pseudocode for finding the known faults in the stop data
Figure 10: Scatter plot of all work areas together
Figure 11: Scatter plot of work area WA7
Figure 12: Scatter plot of work area WA35
Figure 13: Scatter plot of work area WA33
Figure 14: Scatter plot by the month of the year
Figure 15: Scatter plot by the hour of the day
Figure 16: Scatter plot of threshold 0.01 × stop difference
Figure 17: Scatter plot of threshold 0.02 × stop difference
Figure 18: Scatter plot of threshold 0.03 × stop difference
Figure 19: Scatter plot of threshold 0.05 × stop difference
Figure 20: Scatter plot of threshold 0.1 × stop difference
Figure 21: Scatter plot of threshold 0.2 × stop difference
Figure 22: Scatter plot of threshold 0.3 × stop difference
Figure 23: Scatter plot of threshold 0.4 × stop difference
Figure 24: Scatter plot of threshold 0.5 × stop difference
Figure 25: DBSCAN
Figure 26: Gaussian-mixture clustering
Figure 27: KMeans clustering with scaling
Figure 28: KMeans clustering without scaling
Figure 29: KMeans clustering with number of clusters = 2
Figure 30: KMeans clustering with number of clusters = 3
Figure 31: KMeans clustering with number of clusters = 4
Figure 32: KMeans clustering with number of clusters = 5
Figure 33: KMeans clustering with number of clusters = 6
Figure 34: KMeans clustering with number of clusters = 7
Figure 35: KMeans clustering with number of clusters = 8
Figure 36: Frequency of top 10 work stations in cluster 1
Figure 37: Frequency of top 10 work stations in cluster 2
Figure 38: Frequency of top 10 work stations in cluster 3
Figure 39: Frequency of top 10 work stations in cluster 4
Figure 40: Frequency of top 10 work stations in cluster 5
Figure 41: Frequency of top 10 work stations in cluster 6
Figure 42: Distribution of stop difference in cluster 1, with a zoomed view on the right
Figure 43: Distribution of stop difference in cluster 2
Figure 44: Distribution of stop difference in cluster 3
Figure 45: Distribution of stop difference in cluster 4, with a zoomed view on the right
Figure 46: Distribution of stop difference in cluster 5, with a zoomed view on the right
Figure 47: Distribution of stop difference in cluster 6

Index of Tables

Table 1: Association rule on cluster 1
Table 2: Association rule on cluster 2
Table 3: Association rule on cluster 3
Table 4: Association rule on cluster 4
Table 5: Association rule on cluster 5
Table 6: Association rule on cluster 6

Terminology

DES: Discrete Event Simulation
SD: Stop Data
SDS: System Dynamics Simulation
TPM: Total Productive Maintenance

1. Introduction

1.1 Background

Industries focus on producing quality products at low production cost in a short time. A long-standing problem for every production industry is interruption of the production lines; such interruptions and their after-effects cause losses. As the number of machines and employees grows, the chance of interruptions in the line increases, whether from machine faults or human mistakes. In a competitive market, every industry aims to bring new and more products to market quickly; whenever the production line is interrupted, no products are produced, which directly hurts the company's profit.

All industries therefore work on solving this problem in one way or another to obtain a better result.

Unexpected errors such as breakdowns are more complicated, and clearing them takes considerable time, so losses in production capacity occur. Machines are sensitive, so errors arise easily, and employees also make mistakes during work. Industries record the interruptions in a dataset, analyze the data, and make changes in the simulation model to improve production.

Error-free data is a major strength of an organization, and every organization works to eliminate errors from its data. Errors in the downtime data adversely affect production because they cause problems in the simulation models; simulating with error-free downtime data gives much better results.

Simulation and modeling began to emerge during the Second World War, when the mathematicians John von Neumann and Stanislaw Ulam developed roulette wheel techniques to study the behavior of neutrons, since trial experiments were too costly at the time. After the technique succeeded on the neutron problem, it gained publicity and was adapted and improved for different problems in various industries (uh.edu, 2020). Simulation is one of the main pillars of automation: a variety of constraints and data can be applied and compared with each other. Simulation models can give results for various input constraints that, in a real-case scenario, would take a huge amount of time and effort to test experimentally. Simulation results are accurate, and simulation can also solve hard problems. The current industry scenario is entirely different from the past: companies rely heavily on simulation and optimization, and virtual modeling and experimentation are gaining popularity.


1.2 Frame of Reference

1.2.1 Down Time

When a manufacturing plant stops for a period because of unplanned events such as unscheduled maintenance, material issues, machine setup, or breakdowns, that period is recorded as downtime.

During this period no products are produced, which is why it is also known as idle time. Loss of production capacity is the major impact of downtime, so industries focus on reducing it. There are several effective strategies for doing so:

(1) "Track downtime precisely": replace manual downtime tracking with an automated tracking system.

(2) "Classify downtime and its reason": capture the root causes of each downtime occurrence.

(3) "Disclose the downtime in real-time": show the current status of the whole plant, including which production lines are down, on a dashboard.

(4) "Attack the downtime causes". (Vorne Industries, 2020)

Downtime can also be addressed by performing risk audits in the plant; finding obsolete equipment is one example. The losses caused by downtime are calculated, often in monetary terms. Sensors installed in the equipment detect deviations of a machine from its normal condition.

Employees should be trained to detect machines that show early signs of problems. Regular maintenance and proper documentation are also important in reducing the problems. By applying these procedures, downtime in a manufacturing plant can be reduced to a great extent (Marendra, 2020).

1.2.2 Scheduled Production Stops

Production in the plant is temporarily stopped for preplanned activities with a fixed time frame. Such stops are not considered a loss, as they are essential for the plant in one way or another. Scheduled production stops include scheduled maintenance, lunch breaks, holidays, weekends, changeovers, etc. Scheduled maintenance protects the plant from the larger losses caused by machine breakdowns and also reduces the possibility of downtime.

1.2.3 Unscheduled Production Stops

Unexpected stops disturb the production process; no products are produced during this time, so they are considered a loss for the production plant. These stops are unplanned and have no particular time frame. Unscheduled production stops include breakdowns, machine setup, unscheduled maintenance, tool changes, etc. Every industry focuses on eliminating these stops from its production lines, since they increase downtime and create large losses.

1.2.4 Simulation

Simulation means understanding the behavior and characteristics of a real or conceptual system with the help of a virtual model. Continuous and discrete are the two main groups of simulation. In a continuous system the variables change continuously over time, and such systems are handled by continuous simulation models. In discrete event simulation (DES), the variables of a discrete system change only at distinct points in time. DES is applied in many areas, such as manufacturing (Morshedzadeh, Oscarsson, Ng, Aslam, & Frantzen, 2018).

1.3 Aim and Objectives

Aim: To develop an automated methodology for identification and analysis of known faults from production stop data.

The objectives are:

1. Automate the operations of identifying known faults in the machine lines.

2. Create the necessary actions to eliminate the known faults.

3. Perform clustering of the stop data to find the best-fit distribution.

4. Perform cluster analysis using statistical plots and association rule mining.

1.4 Scope and Limitation

The results of the project, namely error-free data and an updated simulation model, improve the chances of increasing the production rate. This applies to production industries of all types around the world, because the problem addressed here is common to all of them. Automating the work also saves time and effort.

The unknown errors in the stop data are very difficult to identify because of the lack of knowledge about their characteristics and behavior.


2. Literature Review

The literature review was carried out to obtain deeper knowledge of the project domain, and the reviewed works support the theories described in the project. The review mainly covers downtime and simulation, especially Discrete Event Simulation (DES).

2.1 Down Time

(Chellappa, 2018) deals with delay time analysis, a proven method for reducing downtime and a tool for fault management. The article aims to encourage the implementation of downtime techniques in small and medium enterprises (SMEs), and the algorithm discussed in the study is effective for reducing downtime in such enterprises. There should be a time gap between the first indication of a defect and the failure that the defect causes; this gap is recorded as the delay time. By analyzing the delay time, the failure can be avoided and downtime thereby reduced. Mathematical formulas are used for the analysis of the delay time. The study also examines the root causes of surprise breakdowns, and proposes designing a new system or changing the existing design according to the solution of the root cause of the breakdown problem.

(Islam, Bagum, & Rashed, 2012) conducted a study of 31 ready-made garment (RMG) organizations in Bangladesh, aiming to identify the typical operational disturbances and their root causes. In total, the study identifies 11 disturbances and 26 root causes, and it also deals with the effects of operational disturbances on business performance. The most detrimental operational disturbances are "machine malfunctions, defective products, frequent production schedule changeover, absenteeism, and unexpected WIP", and the most significant root causes are "lack of skills, incorrect information, the problem in the information flow, priority setting conflicts, etc." Rises in production cost and delivery time, material wastage, wasted time, increased WIP, etc. are some effects of the operational disturbances, which prevent the industry from meeting its quality, cost, and deadline targets.

(Martin, 1989), in his studies, gives an idea of the different types of downtime from a non-technical point of view:

Planned downtime: expected to occur; it comes with forewarning and its time-frame is predictable.

Planned/surprise downtime: it occurs with forewarning but takes more time to solve than planned.

Surprise downtime: it occurs unexpectedly, but its time-frame is predictable.

Surprise/surprise downtime: it occurs unexpectedly and has no predictable time-frame.

Martin's study also deals with planning for downtime: planning does not take much effort, but the result is very helpful and prevents or reduces downtime to a certain extent. It was carried out at the LSU Libraries, which hold a collection of documents, although it is a time-consuming process; integrating computers into the task gives better planning results. (Pandey & Raut, 2016) deal with the role of Total Productive Maintenance (TPM) in a manufacturing industry. To reduce breakdown-related problems, the article suggests executing Keikaku-Hozen (planned maintenance) actions; its main objective is to keep the machines of a label manufacturing plant at reduced downtime. The downtime arises from problems in the maintenance of the machines, and the solution is to find the root cause of the maintenance issues so that the problem does not recur in the plant. Thanks to the implementation of TPM and root cause analysis, downtime was reduced by 50%. (Rahman, Hoque, & Uddin, 2014) provide information about TPM, one of the important maintenance strategies: implementing TPM increases the overall effectiveness of the equipment, and mean downtime analysis helps evaluate the performance of TPM in the production plant. The paper focuses on downtime causes and other downtime factors in TPM, analyzing a Pareto chart of downtime. The analysis distinguishes inescapable, partially removable, and completely removable downtimes, and prioritizes the factors affecting downtime when reducing it; after the implementation of TPM there was a 14.5% reduction in downtime. The article by (Boyd & Radson, 1998) aims to reduce the rate of process downtime, which causes losses for industries, reduces product quality, hurts competitiveness, and increases repair costs. Mathematical calculations estimate the downtime rate as the number of downtime incidents in a particular period divided by the total exposure period. The article also creates a model of the downtime severity rate and describes the statistical analysis used to evaluate such rates.

The paper by (Liao & Chen, 2004) deals with machine breakdowns, which create problems in a continuous production line by disturbing the planned activities; the focus is on breakdowns that occur regularly in the textile industry. The proposed solution is a heuristic that allows a longer idle time and thereby reduces the breakdown rate. The article by (Geng, Lv, Zhou, Li, & Wang, 2014) presents a compensation-based methodology for predicting maintenance time in a virtual model, carried out at the initial stage of product design with maintenance tasks included in the simulation model. (Luczak & Sallmann, 1995) describe problems in automatic molding plants, where the material flow in the production line is largely continuous, so even a small interruption disturbs the whole process and causes downtime. Apart from technical breakdowns, downtime is also caused by other disturbances; proper scheduling of the molding plant and the melting shop, together with proper maintenance, helps reduce it. The paper by (Patti & Watson, 2010) focuses on simulation performed in FACTS Analyzer and studies the effect of different combinations of downtime duration and frequency on the performance of a given production system. For a constant total downtime, varying the combinations gives varying results, and the analysis shows that infrequent, long-duration stops have a greater negative impact on system performance than frequent, short-duration ones.

2.2 Simulation

Simulation helps in understanding the system as a whole: it lets us view the system in front of us and apply many variations to it. Many literature reviews showcase the importance of simulation in the production field. This report mainly focuses on the modeling and application parts of simulation.

2.2.1 Simulation Modelling Part

Virtual models of production systems play an important role in industry: they allow different experiments to be applied to a production system in a short time, so that companies can introduce new products to the market quickly. Compared to experiments in a real-world scenario, virtual models are cheaper and accurate, since the exact nature of a production system can be captured by discrete event simulation models. (Morshedzadeh, Oscarsson, Ng, Aslam, & Frantzen, 2018) discuss a methodology for DES models at various levels of a PLM system. The data required for the simulation is obtained from the BoM, BoP, and BoR. With the help of DES software, a model of the manufacturing system is created and exported to the PLM system. To make changes in the model, an information structure related to the BoP is developed, which helps create a proper model of the product, process, or resource. The paper by (Sachidananda, Erkoyuncu, Steenstra, & Michalska, 2016) describes the importance of DES modeling in the manufacturing sector, where industries are forced to introduce more products as market demand increases. The paper deals with a biopharmaceutical manufacturing company: the simulation model supports dynamic decision making, and with its help the company improved its manufacturing process through reduced bottlenecks, lower operating costs, shorter throughput time, and proper utilization of resources. (Zhang, Zhou, Ren, & Laili, 2019) study the importance of simulation and modeling in manufacturing and the use of these techniques in product design, testing, etc. They analyze different simulation techniques in manufacturing, such as unit simulation and integrated simulation, in depth, focusing on current and future trends and on improving existing simulation techniques to produce better output in manufacturing. (Schroer & Tseng, 1988) study a simulation approach to manufacturing systems in which a system is designed and modeled with the help of simulation generators and the results are summarized. The simulation generators are defined in GPSS, which decreases the time for building, validating, and modifying simulation models and thus gives better results. The article by (Holm, 1989) studies thermal effects using simulation models: the DEROB system is used for programming, evergreen as well as deciduous vegetation covering the external walls is simulated under different climatic conditions, and the results are validated against field measurements. The improved results show that artificial cooling or heating can be avoided or reduced under a given design and climate. (Yildirim, Tansel, & Sabuncuoglu, 2009) show the significance of simulation models in military deployment planning.

(Carteni & Luca, 2012) deal with different DES models used in planning a container terminal. The main aim of the paper is to find the best-fit approach for simulating the duration of handling activities; different models with identical logical structures, but with different estimation processes for the handling activity time duration, are compared. From a tactical planning view, the model that simulates shorter time durations is more realistic, efficient, and accurate, and DES models effectively simulate single handling equipment actions or single container movements. (Parthanadee & Buddhakulsomsiri, 2010) use modeling and analysis of real-time dispatching rules for production scheduling, taking a fruit canning industry as their case. The article considers various characteristics that such industries possess: the quantity and quality of the raw materials, and different products sharing various raw materials, which makes them interdependent and leads to the formation of bottlenecks. The study of (Paulista, Peixoto, & Rangal, 2019) covers DES and its application in industry, analyzing consumption and photovoltaic energy generation. A simulation model was made and analyzed at second as well as minute resolutions; the results illustrate that the model's second resolution is best over short time periods, while the minute resolution gives the best financial results. The journal paper by (Malandri, Briccoli, Mantecchini, & Paganelli, 2018) focuses on improving the performance of baggage handling in airports, which helps reduce passengers' waiting times. Using real data, the model is studied and simulated with a detailed analysis of constraint and non-constraint operations; the simulation created a smooth flow of baggage, proper administration, and a good traveling experience for passengers. The paper by (Nguyen, MPH, Megiddo, & Howick, 2020) concerns the use of simulation in the mitigation and investigation of HAIs, including the use of DES, system dynamics, and agent-based models for the analysis of HAIs. This review gives a wide view of present simulation models in order to find gaps, improve the models in the future, or develop new ones. The paper by (Turner, o.a., 2019) uses a DES model to analyze the effects of truck and driver resources on system throughput, resource utilization, etc. The simulations are run on different harvest scenarios and identify variations in performance as labor constraints and truck numbers change. This helps producers find bottlenecks and understand the impact of more vehicles and labor on the efficiency of grain transportation.

2.2.2 Application of Simulation

(Reiner & Trcka, 2004), in their journal paper "Customized supply chain design: Problems and alternatives for a production company in the food industry", discuss a simulation-based supply chain analysis that should be very product-specific. They built a target system for evaluating the supply chain, which analyzes various improvement alternatives. Studying a product-specific supply chain in the food industry, they analyze the changes made and show the demand uncertainties. The main aim of such problem-solving supply chains is to decrease uncertainties such as the forecast horizon, input data, and inherent uncertainties. The paper concludes that the improvement and simulation model enabled further supply chain studies, and suggests that universal statements about supply chains are not always true: for example, the bullwhip effect cannot always be reduced by a shorter supply chain, contrary to the literature. (Steringer, Zorrer, Zambal, & Eitzinger, 2019) used discrete event simulation across multiple system life cycles to support zero-defect composite manufacturing in the aerospace industry; their article depicts the automatic inspection of composites used in that industry. Manual fiber inspection was replaced by automatic inline inspection, and a discrete event simulation model was developed with a series of experiments to quantify the improvements. They implement a new approach using fiber orientation sensors to measure the fiber orientation. The paper concludes that with the DES model and ZDM system, productivity increased by 18%; the simulation proved very useful, can be used in a real plant situation, and gave prized support at different stages of the ZAero production system. The paper by (Guimaraes, Leal, & Mendes, 2018) is about selecting the correct DES software for implementation in a manufacturing unit, which improves the performance of the production unit. For this, the current status and previous background of the company are analyzed and studied for simulation; the company's maturity model reveals whether it can work with computer simulation, and the use of AHP makes the selection process simpler. (Barrera-Diaz, Oscarsson, Lidberg, & Sellgren, 2018) deal with the analysis and implementation of simulation in a manufacturing plant, using DES with an automatically working data-handling system at the output. The paper also standardizes the output data obtained from DES, studying several different projects; the analysis shows that the implementation is useful for improving production in automated as well as other manufacturing plants. The journal paper by (Palacin, Monne, & Alonso, 2011) focuses on improving a solar cooling installation using dynamic simulation and experimental diagnosis. The main objective is to develop a tool and use it to analyze and evaluate improvements in the different energies of the cooling installation; the work concludes with the design and validation of the model and a comparison of the model and its results with a newly added geothermal sink.

(Caterino, o.a., 2020) show the importance of simulation for verifying the development of current production lines, designing new lines by eliminating existing faults, and optimizing whole processes. The study covers two workplaces on an automotive assembly line where three employees perform assembly work. Simulation, digital twins, virtual reality, and augmented reality create a virtual model of the whole production system, which helps evaluate production system performance and all other related factors; the simulation combines big data and analytics techniques with artificial intelligence. When the completion index is higher, Work in Process (WIP) decreases, which implies that the assembly line performs well. A drawback is the computational time: simulating an 8-hour work shift required 48 hours. The article by (Mourtzis, Vasilakopoulos, Zervas, & Boli, 2019) studies a manufacturing system and evaluates it with discrete event simulation based on real information obtained from the metal industry. To improve throughput, different individual parameters and their effects on the system are analyzed. The paper also proposes the teaching factory concept, which emphasizes the educational side (Education 4.0): its main aim is to use ICT technologies to link the real industrial environment to the classroom, so that students experience the real working environment with its stress, problems, and pressure, gaining valuable practical experience. The journal also gives deep knowledge of the modern digital world, including virtual environments, simulations, etc. (Freiberg & Scholz, 2015) analyze the benefits of investment in modern manufacturing equipment, taking into account factors such as cost savings, investment, lead time, and work in progress. They compare the existing manufacturing method with the method using the new equipment by means of dynamic discrete simulation, and conclude that investment decisions should not rest merely on financial calculations but also on factors like productivity, quality, and flexibility. The paper also concludes that discrete simulation is a good tool for investment decisions, being fast and accurate, and that it helps determine production costs and benefits as well as investment decisions. (Antonelli, Litwin, & Stadnicka, 2018) compare discrete event simulation (DES) and system dynamics simulation (SDS). SDS is suited to simulating the dynamic behavior of a system, which is particularly useful for manual work, while DES simulates the operations of a production system under different scenarios; combining the two makes it possible to analyze the performance of a whole product family. The paper by (Velumani & Tang, 2017) deals with the complexity of batch manufacturing involving a variety of parts and processes. The authors propose DES as an effective tool for identifying machine status, bottlenecks, work in progress, and even process changes for improved productivity, thereby reducing the complexities associated with the process. The article by (Zupan & Herakovic, 2015) is about optimizing a production line through line balancing and DES. A simulation model of the production line and the workplace was built, and performance before and after improvements such as line balancing was recorded; the results show that applying line balancing together with DES gives a better result, optimizes the production line, and increases productivity. (Piccinini, Previdi, Cimini, Pinto, & Pirola, 2018) describe the reconfiguration of a plant and schedule optimization in a flexible manufacturing unit, proposing a methodology that runs from identifying problems through analysis, solutions, and assessment. The main objective is to decrease the total cycle time by implementing a scheduling algorithm and assessing the internalization of production; integrating simulation with a schedule optimization algorithm gives better results through direct analysis. The journal paper by (Speeding & Sun, 1999) focuses on Activity Based Costing (ABC) evaluation of a manufacturing plant using DES. WITNESS software is used to build a semi-automated model of a PCB assembly line, giving an overview of how to implement ABC in a manufacturing unit with the help of simulation techniques. The DES implementation of ABC supports efficient capital justification and cost allocation, and can also be used to improve decision making and process improvement. The paper by (Sepulveda, Thompson, Baesler, Alvarez, & Cahoon, 1999) studies a cancer center, with the main aim of increasing patient throughput with the existing resources at maximum utilization. Various floor layouts, scheduling options, etc. are analyzed and evaluated with the simulation model; resources are relocated and changes made so that patient throughput increases by 30%, and since patient flow is increased, the waiting room of the building can be made smaller.

3. Research Methodology

The research methodology provides a step-by-step strategy that formulates the whole procedure of the research. Possible research strategies include surveys, experiments, case studies, design and creation, action research, etc. Among these, design and creation is the most suitable for this research.

3.1 Design and Creation

The main focus of this research strategy is the development of an IT application. IT applications include constructs, models, methods, and instantiations, and the output of the research should offer one of these, or a combination of them, as a knowledge contribution (March & Smith, 1995). The main aim of developing IT applications is to solve complicated problems. The research includes the analysis, design, and development of the IT product. Depending on the IT system, it can play three roles: the IT application is the research project's main focus; the IT application is only an end-product, with the main focus on the development process; or the IT application acts as a vehicle for other purposes.

It is a problem-solving approach consisting of a five-step process (Vaishnavi, Kuechler, & Petter, 2004).

Awareness: the identification of a problem. Problems can be found by reading about further research areas in the literature, from a client needing something new, from new technological developments, from field research, etc.; the output is a new research proposal.

In this thesis, the need to improve the current data for better results motivated the research; the main focus is to solve the problems related to the stop data.

Suggestion: design ideas that describe the problem in a better way and suggest how to handle it in order to reach a good solution.

Many ideas were raised during discussions, and suggestions evolved to improve them. The final idea was obtained by considering all the suggestions that arose during the discussions, after which the development process began. In this thesis, many ideas arose for automating the identification of the known faults; considering all of them, Python code was found best suited for automating the operations.

Development: the design ideas discussed in the suggestion step are implemented, based on the proposed IT application.

Python code is developed to find the known faults; eliminating these faults from the stop data makes the data more accurate and precise. The existing simulation model is then updated with the new stop data, and the updated model is introduced in the production line.

Evaluation: the implemented IT application is evaluated to find any deviations from the requirements.

The Python code and the updated simulation model are evaluated against criteria including accuracy, performance, accessibility, and usability, and necessary changes are made if there is any deviation from the requirements.

Conclusion: how the developed application works in its environment is determined. In this step, all the results of the design process are consolidated and the knowledge obtained is identified.

Unexpected deviations found in the study, which have not yet been discussed, could become topics for further research.

For this project, automating the operations to identify the known faults is done with Python programming. Python code is used to find the known faults in the stop data, after which the simulation model is improved with the updated, error-free stop data. This research strategy is the most suitable for the project compared to the other strategies. The IT application in this project is the Python code together with the best-fit simulation model. The design and creation research strategy has both positive and negative sides:

Advantages

In computing fields such as software development, it is the expected mode of research.

Beyond abstract theories and knowledge, it produces something concrete to show (the IT application).

There are many opportunities for proposing and developing new IT applications.

For people who like creative and technical development work, it is a good platform.

Disadvantages

Artistic as well as technical skills are important for the development process; without them it becomes very difficult to obtain a good result.

Generalizing the IT application to other areas is very difficult.

Continuous changes and the advancement of technologies in this field can invalidate the research results. (Oates, 2006)

3.2 Data Collection Techniques

As this project is carried out in cooperation with a company, the main data sources are company files such as Excel sheets and Word documents. The company also provides the existing simulation model so that it can be updated with the new data.

Other data collection techniques used in this project are observation, questionnaires, opinions, and documents. Observation in the company's production area is important for a better understanding of the problem and the factors affecting it: the time taken to switch a machine on and off, the time for normal maintenance, etc. are observed, giving a clear picture of the stop data from a practical point of view.

Questionnaires and opinions help identify the exact requirements of the client and reveal how the problem occurs, its after-effects, and open questions about the stop data and the simulation model. The answers give a detailed picture of the problem and of ways to solve it.

Expert opinions describe the company's existing ways of solving the same and related problems, and the experts clarify questions about the stop data and the simulation model from a practical standpoint. Questionnaires make it possible to obtain a large amount of information about the problem from experts and employees in a short period.

The company's in-house procedure manual gives details that cannot be obtained from sources such as the internet or case studies, and helps clarify certain problems occurring during the project. Informal descriptions written by experienced employees explain how the company has handled such problems during working time in the past, which provides valuable details for the project.

The thesis mostly depends on two datasets: the DUGA stop log dataset and the maintenance dataset. One year of DUGA data is used, and both known and unknown faults are identified from it. The DUGA data comes in an old and a new version: in the old dataset the stops are divided into different shifts, whereas in the new dataset the stops are not based on shifts.


4. Identification of Erroneous Stop Data

4.1 Problem Description

A company's strength lies in its high-performance machinery and highly skilled workers, which help it reach its targets on time so that profits rise. Companies usually need quick results, so what happens when there are interruptions in the production line?

The client company observed some time ago that the desired throughput was not being achieved. An investigation revealed that the production lines, which consist of several work stations, suffer interruptions. According to the client, interruptions in the production lines are documented as "Stop Data" (SD). Both scheduled and unscheduled production stops are responsible for the interruptions: unscheduled stops include breakdowns, machine setup, maintenance, tool changes, etc., while scheduled stops include weekends, lunch breaks, holidays, etc. Any interruption in the production line is logged as a stop in the SD, but the logged stops may or may not be real. The company has only limited ways of identifying and analyzing these stops, all difficult and time-consuming; upgrading the means of identifying each stop's nature is the main goal of the project. To create clean stop data, the faults must be eliminated and a better simulation model built.

The dataset used here covers one production area with two production lines and a total of 53 work stations, together with the stops in that area over one year. Each stop is recorded in terms of time: the time at which a machine stopped goes into the start time column of the stop data, and the time at which the stopped machine restarted into the end time column. Their difference gives the exact duration of the stop and is stored in the "Total duration" column. Another column, "Duration within planned production time", indicates how much of the stop falls within shift hours; it is obtained by comparing the start and end times with the calendar dataset.
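As a minimal illustration of this structure (a sketch with assumed column names, not the company's actual schema), the two derived columns can be computed with pandas:

import pandas as pd

# Hypothetical excerpt of the stop log: one row per stop, with the stop
# (StartTime) and restart (EndTime) instants of a work station.
stops = pd.DataFrame({
    'Station':   ['WS01', 'WS02', 'WS02'],
    'StartTime': pd.to_datetime(['2019-05-03 10:30:00'] * 3),
    'EndTime':   pd.to_datetime(['2019-05-03 14:45:00'] * 3),
})

# Total duration = end time minus start time.
stops['TotalDuration'] = stops['EndTime'] - stops['StartTime']

# Duration within planned production time: intersect each stop with a planned
# shift window taken from the calendar dataset (a single made-up shift here).
shift_start = pd.Timestamp('2019-05-03 06:00:00')
shift_end = pd.Timestamp('2019-05-03 14:00:00')
overlap = (stops['EndTime'].clip(upper=shift_end)
           - stops['StartTime'].clip(lower=shift_start))
stops['DurationWithinPlanned'] = overlap.clip(lower=pd.Timedelta(0))
print(stops)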

4.1.1 Known Faults Occurring In The Stop Data

1. A stop is logged when it should have been a planned stop

When an entire department is planned to stop, this should be entered in the annual calendar for the whole department. All stops are directly connected to the calendar dataset, which includes the timing of shift changes, planned maintenance, holidays, etc. Stops for planned maintenance, shift changes, and holidays that are coded in the calendar are considered planned stops. If a planned stop occurs but was never entered in the calendar data, it becomes a fault in the stop data: the stop is planned, but the missing calendar code turns it into an unplanned stop. This is a common fault, caused by a manual mistake in coding the calendar.

Improper maintenance of the machines results in breakdowns that disturb the whole production line, so the machines are stopped every Friday for maintenance, which should be coded in the calendar. If the planned maintenance is not coded in the calendar dataset, the stop becomes an unplanned event logged in the stop data. For example, every Friday 4 hours of the 8-hour shift are reserved for maintenance, during which the production line is stopped. This is normally coded in the annual calendar, but if by mistake it is not, those 4 hours are logged in the stop data as a fault. All three shifts must be properly coded in the calendar, otherwise they too are logged in the stop data.
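A minimal sketch of flagging this fault automatically, assuming a hypothetical recurring Friday maintenance window (10:00-14:00) and the stops table from the sketch above:

from datetime import time

# A stop lying entirely inside the Friday maintenance window should have been
# calendar-coded, so it is flagged as known fault 1 (a planned stop logged as
# unplanned). The window itself is an assumption for illustration.
MAINT_START, MAINT_END = time(10, 0), time(14, 0)

def missed_planned_stop(row):
    start, end = row['StartTime'], row['EndTime']
    return (start.dayofweek == 4            # 4 = Friday
            and start.time() >= MAINT_START
            and end.time() <= MAINT_END)

stops['MissedPlanned'] = stops.apply(missed_planned_stop, axis=1)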

2. A stop is logged because other equipment in the line is stopped.

In a production line, the flow is disturbed by a lack of raw material, by waiting time caused by unfinished work at the previous work station, etc. When there is no raw material, the machines are turned off one by one, which is normally coded in the machines. If the operator forgets to code this in the machine, stops are logged in the stop data; they are not considered real stops, because the shutdown is planned and coded into the machines.

From an energy consumption point of view, turning the machines off saves energy. A production line contains many interconnected workstations: the finished output of workstation 1 is the input of the adjacent workstation 2, and so on to the last workstation in the line. If there is no raw material to process, workstation 1 turns off after finishing its work, which then affects workstation 2 through the missing input. Machines that stay on without doing any work waste energy, so they are coded to turn off whenever there is no work. If there is any problem with this coding, or it is missing, the stop is logged in the stop data and counts as a fault, since it is a non-real stop.

3. A stop is logged when the machine was running

While the machine is running, it might send a stop signal to the data collection system, so a stop is recorded even though the machine is working. This is due to a programming mistake in the PLC. According to the stop data the machine is stopped, but in reality there is no interruption of production and products are being produced at that time. This fault is more complicated than the other known faults in the stop data.

4. A stop is logged for fractions of a second

A stop lasting a fraction of a second is recorded in the stop data and can be neglected. A machine that stops and restarts within a fraction of a second is not a real stop; the cause is again a programming mistake in the PLC. There is no physical reason for a machine to stop for less than a second, so such stops can only come from errors in the programming. For example, the start time of such a stop may be 06:30:24.01 and the end time 06:30:24.02; the difference between them is a fraction of a second.
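Continuing with the hypothetical stops table, such entries can be removed with a simple duration threshold (the one-second cutoff is an assumption):

import pandas as pd

# Known fault 4: stops shorter than one second are not real stops.
sub_second = (stops['EndTime'] - stops['StartTime']) < pd.Timedelta(seconds=1)
stops = stops[~sub_second]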

5. A stop is logged repeatedly

The same stop may be logged more than once in the stop data, which is considered a fault. For example, one workstation has a stop from 10.30 am to 2.45 pm, and another stop of the same workstation is recorded on the same day with the same times and duration. The stop of the workstation has been repeated: only one of the entries is valid, and the others are duplicates of the same stop. Such repeated stops are faults in the stop data.
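On the same hypothetical stops table, repeated entries can be detected as exact duplicates of the (station, start time, end time) triple:

# Known fault 5: identical rows are duplicate logs of one stop.
# Keep the first entry as the valid stop and flag the rest as faults.
dup = stops.duplicated(subset=['Station', 'StartTime', 'EndTime'], keep='first')
repeated = stops[dup]    # duplicate, non-real entries
stops = stops[~dup]      # one valid entry per real stop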

4.1.2 Unknown Faults

Some unknown faults also occur in the stop data, and they are very difficult to find. Procedures exist for finding the known faults, but building such a procedure requires full details about the fault; otherwise the output is not accurate. There is no protocol for finding unknown faults, because the details about them are lacking. A deep analysis of the data can reveal some of these details: apart from the patterns caused by the known faults, other patterns are present in the data, and focusing on them can uncover some unknown faults. Unknown faults may be caused by sensor problems in the machines, accidents during work, or environmental factors such as natural disasters, pandemics, etc.


4.2 Proposed Methodology

Certain patterns occur in the stop data corresponding to the known faults, so finding these patterns is equivalent to finding the known faults. The patterns are directly related to the start time and end time in the stop data: the start time is the time at which a machine stops, and the end time is the time at which the stopped machine restarts. These are the two key parameters used for finding the known faults in the stop data.

The patterns having a similar start and end time

1. Shows all stations having a similar start and end time

Figure 1 represents a pattern in which all machines share a similar start time and end time for a particular period. It might occur because someone purposely switched off the machines for a weekend, holiday, or lunch break but forgot to enter it into the calendar, so it was logged in the stop data. The pattern involves between 2 and 53 work stations, since there are 53 work stations in the production line: if the line is completely stopped for a holiday or weekend, the pattern covers all 53 stations, while a minimum of two stations is required to compare start and end times. A code sketch for detecting this pattern is shown after Figure 1.

Figure 1: Pattern shows all stations having a similar start and end time
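As an illustration of how this pattern can be detected (a sketch, not the thesis's exact implementation), stops sharing an identical start and end time can be grouped and the participating stations counted, reusing the hypothetical stops table from above:

# Group stops on identical (start, end) pairs and count distinct stations.
pattern = (stops.groupby(['StartTime', 'EndTime'])['Station']
                .nunique()
                .reset_index(name='StopCount'))

# Windows shared by at least two stations are candidates for this pattern;
# a count of 53 means the whole line (e.g. an uncoded holiday or weekend).
candidates = pattern[pattern['StopCount'] >= 2]
print(candidates.sort_values('StopCount', ascending=False).head())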

2. One portion or part of the line

Only one portion of the plant is switched off for a preplanned activity, producing a smaller pattern than the one above. The pattern in Figure 2 occurs when a portion of the production line is stopped for planned maintenance that was not recorded in the calendar. The stops again share a similar start and end time, but only a few work stations are involved.

Figure 2: Pattern shows only one part of the line has a similar start and end time

3. Part of all the operation

The patterns in Figure 3 occur at regular intervals because they are planned stops: when a planned stop is made on machines that have a parallel machine, it is recorded in the stop data as a non-real stop, since the parallel machines that are not stopped can continue producing at a lower speed. In other words, it may be a planned stop in part of the line. Another possible cause of a similar pattern is the breakdown of a common system connected to some of the machines in the line, involving electricity, some liquid, ventilation, etc.

Figure 3: Pattern shows alternate stations having a similar start and end time

The patterns having a falling start time

Figure 4 represents patterns with a falling start time. For example, consider a machine that has raw material for producing 50 parts: after producing them, it turns off because of the lack of raw material and to save energy. Since there are buffers in the line, the next machine receives extra raw material for producing 5 more parts, so the second machine runs longer than the first before it too is turned off. Likewise, the remaining machines turn off one after another, creating a falling start time. Because of coding problems in the machines, these stops are logged in the data, but they are not real stops.

Figure 4: Patterns having a falling start time

All these patterns are identified in the stop data and eliminated from it to produce clean stop data; the operation of identifying the known faults is automated with Python code.

4.2.1 Clustering

Clustering is a method of grouping a given dataset so that the data points in each cluster are more similar to each other than to those in other clusters. Clustering is important here for finding the best-fit distribution of the stop data. It is an unsupervised learning method commonly used in statistical data analysis, and it can extract valuable information from data based on which cluster each point falls into. It is used for pattern recognition, data compression, image analysis, machine learning, etc. The main clustering algorithms are K-Means, DBSCAN, Mean-Shift, etc.

The data given for the thesis is grouped based on stop duration and the number of machines stopped, and clustering algorithms are the best option for this grouping. K-Means clustering, the Gaussian-mixture algorithm, and the DBSCAN clustering algorithm are used in this project and are explained below (Seif, 2018).


4.2.1.1 K-Means Clustering Algorithm

K-means is a widely used clustering algorithm for grouping data, mainly in machine learning and data science. It is easy to develop and implement. The basis of this method is vector quantization, which groups the given data into k clusters, where the value of k is given by the user, making the data points inside a cluster similar to each other while keeping the clusters distinct. Applications of k-means include market segmentation, image segmentation, document clustering, etc. (Dabbura, 2018).

The k-means algorithm starts by specifying the number of clusters needed for the given data; the number of clusters, k, is determined by the user based on the data at hand. The next step is to determine the initial centroids by randomly selecting k points. Each iteration then calculates the squared distance of all points to each centroid and assigns every data point to the nearest cluster. After the assignment, the centroid of each cluster is recalculated as the mean of all data points in that cluster. The iterations stop either when the assignments no longer change, because the clusters have stabilized, or when the defined number of iterations has been completed.

Figure 5: K-Means Clustered data

In the graph, the user sets the number of clusters, k, to 2, and the algorithm provides randomly selected centroids for forming the clusters. The centroids are represented in Figure 5 as yellow "*" markers; each centroid gathers the nearby data points into one cluster.

Figure 6 represents the pseudocode for K-means clustering. KMeans is imported from sklearn.cluster, and preprocessing is also imported from sklearn. The input data need to be scaled because the values of StopDifference and StopCount are in different ranges, so scaling the data is important for getting a good result. Here, the number of clusters given is six, and the color and name of each cluster are given. The advantage of this clustering algorithm is that it is very fast. The main disadvantage is that the number of groups and the initial group centers are given by the user, which may not always be correct.

Given m1, ..., mk as randomly selected centroids from a1, ..., an

REPEAT
    FOR i = 1..n DO
        cij = 1 if j = argminj ||ai − mj||², 0 otherwise
    END FOR
    FOR j = 1..k DO
        nj = Σi cij
        mj = (1/nj) Σi cij · ai
    END FOR
UNTIL convergence
END

data = ['StopDifference', 'StopCount']
scale data
number of clusters = 6
colors = [R, B, G, Y, O, V]
classes = [cluster 1, cluster 2, cluster 3, cluster 4, cluster 5, cluster 6]
FOR i in range(0, 6)
    identify the adjoining centroid
    allocate data points to the cluster
END
plot title (Scatter Plot)
plot x-axis (Stop Difference)
plot y-axis (Number of machines stopped)

Figure 6: Pseudocode for KMeans Clustering
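For concreteness, a runnable sketch of the step that Figure 6 describes is shown below. The StopDifference and StopCount columns follow the text above, but the toy data generated here is only a stand-in for the real aggregated stop data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the aggregated stop data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'StopDifference': rng.exponential(scale=300, size=200),  # stop duration
    'StopCount': rng.integers(1, 54, size=200),              # machines stopped
})

# Scale both features, since StopDifference and StopCount are in different ranges.
X = StandardScaler().fit_transform(df[['StopDifference', 'StopCount']])

# Six clusters, as in the pseudocode; n_init restarts avoid a poor random start.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
df['Cluster'] = kmeans.labels_

plt.scatter(df['StopDifference'], df['StopCount'], c=df['Cluster'])
plt.title('Scatter Plot')
plt.xlabel('Stop Difference')
plt.ylabel('Number of machines stopped')
plt.show()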

4.2.1.2 Gaussian-mixture Clustering Algorithm

The Gaussian-mixture algorithm uses the expectation-maximization algorithm for processing the data. It also creates confidence ellipsoids and computes the Bayesian information criterion to determine the number of clusters. For learning mixture models, this algorithm is the fastest compared to other algorithms (scikit-learn developers, 2020). A Gaussian-mixture model is a probabilistic model that identifies subpopulations that are normally distributed within a total population. As with K-means, the user has to give the number of clusters.

The Gaussian-mixture clustering algorithm works much like K-means, but the former has some advantages over the latter. The first is that K-means does not account for variance, i.e., the width of the bell shape; in a two-dimensional plot, the variance determines the shape of each cluster. K-means therefore works best when the clusters are circular, with each cluster's radius reaching its farthest point, but when the clusters are not circular, the Gaussian-mixture algorithm becomes the better choice, since it can take any shape according to the data points, as in Figure 7. Another advantage is that the Gaussian-mixture algorithm performs soft classification, compared to the hard classification of K-means (Maklin, 2020).

Figure 7: Gaussian-Mixture Clustered data
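A minimal sketch of this algorithm is given below, assuming the same scaled StopDifference/StopCount features as in the K-means example; the toy data generated here is only a stand-in for the real stop data.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the scaled stop-data features.
rng = np.random.default_rng(0)
raw = np.column_stack([rng.exponential(300, 200), rng.integers(1, 54, 200)])
X = StandardScaler().fit_transform(raw)

# As with K-means, the user supplies the number of clusters (components).
gmm = GaussianMixture(n_components=6, random_state=0).fit(X)
labels = gmm.predict(X)       # hard assignment, comparable to K-means labels
probs = gmm.predict_proba(X)  # soft classification: one probability per component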

4.2.1.3 DBSCAN

DBSCAN is a density-based clustering algorithm that is more user-friendly than the previous two. DBSCAN starts from a point in the data and examines all points within a given distance of it, called epsilon, labelling the scanned points as visited. The algorithm then moves to each neighboring point and checks the points within the epsilon distance of that point in turn. Points within the epsilon distance become part of the same cluster, while points that cannot be reached this way are marked as noise. The iteration proceeds until no further points lie within the epsilon distance; DBSCAN then picks another arbitrary unvisited point and repeats all the steps until every point in the dataset has been visited.

One advantage of the DBSCAN algorithm is that it does not require a preset number of clusters and can also identify noise. Since the algorithm works based on density, it cannot perform well when the dataset has clusters of varying density, which is its one main disadvantage.
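A minimal sketch of how DBSCAN could be applied to the same kind of scaled stop features is given below; the toy data and the eps and min_samples values are assumptions that would need tuning against the real stop data.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the aggregated stop data (duration and machines stopped).
rng = np.random.default_rng(0)
raw = np.column_stack([rng.exponential(300, 200), rng.integers(1, 54, 200)])
X = StandardScaler().fit_transform(raw)

# eps is the neighborhood radius; min_samples points within eps form a core point.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, 'clusters found;', int(np.sum(labels == -1)), 'points marked as noise')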

4.2.2 Association Rule Mining And Apriori Algorithm

Association rule mining helps find associations and relations between items in a dataset and identifies how often they occur together. Mining association rules is divided into two steps, controlled by a minimum support and a minimum confidence threshold. Applying the minimum support to the database identifies all the frequent itemsets, and applying the minimum confidence to the frequent itemsets creates the rules.

Identifying all the frequent itemsets in a large database is the most challenging part of association rule mining. There are many algorithms for this, such as Apriori, FP-Growth, and Eclat.

In 1994, R. Srikant and R. Agrawal introduced the Apriori algorithm. Apriori is used for mining datasets and applying association rules to a database, and it works on databases containing large amounts of data. It first identifies the frequent items in the database and extends them into larger itemsets, keeping only those that remain frequent. This information can then be used to form association rules that describe the general structure of the database. Apriori often finds application in the medical field, for example to discover adverse drug reactions, and in evaluating the sales at a department store. Its main advantages are that it is suitable for large datasets and that it is easy to implement and understand. Its limitations are that scanning the whole dataset makes it an expensive method and the process can be very slow (Srinivasan, 2020).

Steps involved in the Apriori algorithm (Malik, 2020):

1. Set a minimum threshold for support and confidence.

2. Take all the subsets having a higher support value than the minimum.

3. Select the rules from these subsets having a confidence value higher than the minimum.

4. Order the rules in descending order of lift.

The metrics in the association rules are (Raschka, 2020):

1. Support

Association rule mining uses three different supports: antecedent support, consequent support, and support. In a rule, X represents the antecedent and Z represents the consequent. Antecedent support is the support of X and consequent support is the support of Z. The metric 'Support' is the combined support of X and Z, and it depends on the antecedent and consequent supports. Support is used to find how frequently an itemset occurs in the dataset, i.e., whether the itemset appears more often than the minimum threshold.

Support(X→Z) = Support(X ∪ Z)

2. Confidence

Confidence is the ratio of the rule's support to the antecedent support. The confidence for (X→Z) and the confidence for (Z→X) differ, so this metric is not symmetric. If the antecedent and consequent always appear together, the confidence is maximal, i.e., 1.

Confidence(X→Z) = Support(X→Z) / Support(X)

3. Lift

Lift measures how much more often the antecedent and consequent occur together than would be expected if they were independent. When the value of lift is 1, the antecedent and consequent are independent.

Lift(X→Z) = Confidence(X→Z) / Support(Z)

4. Leverage

Leverage is the difference between the support of the rule as a whole and the product of the antecedent and consequent supports taken independently. If the antecedent and consequent are independent, the leverage is zero.

Leverage(X→Z) = Support(X→Z) − Support(X) × Support(Z)

5. Conviction

Conviction shows the dependency of the consequent on the antecedent. If the conviction value is high, the consequent is highly dependent on the antecedent, and vice versa. As with lift, the value of conviction is 1 when the antecedent and consequent are independent.

Conviction(X→Z) = (1 − Support(Z)) / (1 − Confidence(X→Z))
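As a small worked illustration (the numbers here are made up for clarity, not taken from the stop data): suppose stations X and Z appear together in 20 of 100 stop patterns, X appears in 40 of them and Z in 25. Then Support(X→Z) = 20/100 = 0.20, Confidence(X→Z) = 0.20/0.40 = 0.50, and Lift(X→Z) = 0.50/0.25 = 2.0, meaning the two stations stop together twice as often as expected under independence. Leverage(X→Z) = 0.20 − 0.40 × 0.25 = 0.10, and Conviction(X→Z) = (1 − 0.25)/(1 − 0.50) = 1.5.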

Figure 8 represents the pseudocode for the association rules. TransactionEncoder is imported from mlxtend.preprocessing, while apriori and association_rules are imported from mlxtend.frequent_patterns. The output varies based on the values given for the minimum support and the minimum threshold in the code (KDnuggets, 2020).

IF k > a+1 THEN
    Xa+1 = aprioriGen(Xa)
    FOR each xa+1 ∈ Xa+1 DO
        confidence = Support(Yk) / Support(Yk − xa+1)
        IF confidence >= minConfidence THEN
            output rule (Yk − xa+1) → xa+1
        ELSE
            remove xa+1 from Xa+1
        END IF
    END FOR
END IF

Figure 8: Pseudocode for the association rules
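To complement the pseudocode, the following is a minimal runnable sketch of this step using the mlxtend imports mentioned above. The transactions are hypothetical lists of stations (OP10, OP20, OP30) that stopped together, standing in for itemsets built from the real stop data, and the 0.5 and 0.7 thresholds are illustrative assumptions.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions: each list holds the stations that stopped together.
transactions = [
    ['OP10', 'OP20', 'OP30'],
    ['OP10', 'OP20'],
    ['OP20', 'OP30'],
    ['OP10', 'OP20', 'OP30'],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets above the minimum support, then rules above the
# minimum confidence threshold; both thresholds are illustrative.
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric='confidence', min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])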
