Study on Record Linkage regarding Accuracy and Scalability
Johannes Dannelöv
VT 2018
Examensarbete, 15 hp Supervisor: Lili Jiang Examiner: Jerry Eriksson
Kandidatprogrammet i datavetenskap, 180 hp
The idea of record linkage is to find records that refer to the same entity
across different data sources. There are multiple synonyms for record
linkage, such as data matching, entity resolution, entity disambiguation,
and deduplication. Record linkage is useful in many practices, including
data cleaning, data management, and business intelligence. Machine learning
methods, including both unsupervised and supervised learning, have been
applied to address the problem of record linkage. The rise of the big data
era has presented new challenges: the trade-off between accuracy and
scalability presents a few critical issues for the linkage process. The
objective of this study is to present an overview of state-of-the-art
machine learning algorithms for record linkage, a comparison between them,
and an exploration of the optimization possibilities of these algorithms
based on different similarity functions. The optimization is evaluated in
terms of accuracy and scalability. Results showed that supervised
classification algorithms, even with a relatively small training set,
classified sets of data in a shorter time and had approximately the same
accuracy as their unsupervised counterparts.
For much support and interesting conversations I want to thank my supervisor Lili Jiang.
She has made many contributions that helped me tackle this broad field of computer science
and reach the results of my work. Thanks, too, to Hamon Ansari for being supportive during
the whole process of my research, both as a friend and an adviser. Finally, a special thanks
to my family; I could not have done this without them, literally.
1 Record Linkage
1.1 Background
1.2 Challenges
1.3 Objective
1.4 Contributions
2 The General Approach
2.1 Preprocessing
2.2 Blocking/indexing
2.3 Field comparison
2.4 Classification
2.5 Evaluation
3 Similarity Functions
3.1 Cosine similarity
3.2 Jaro-Winkler
3.3 Edit distance
4 Related Work
4.1 Deterministic linkage
4.2 Probabilistic linkage
4.3 Machine Learning Approach
4.3.1 Supervised Machine Learning
4.3.2 Unsupervised Machine Learning
5 Experimental Studies
5.1 Resources
5.1.1 Hardware & Software
5.1.2 Toolbox
5.1.3 Dataset
5.2 Record Linkage Strategies and Optimization
5.3 Evaluation on Accuracy
5.4 Evaluation on Scalability
6 Results and Reflections
6.1 Results
6.2 Reflections
7 Conclusions and Future Work
1 Record Linkage
1.1 Background
Record linkage is the task of finding records that refer to the same entity across different data sources (e.g., websites, databases). Many different names refer to this practice: data matching, entity resolution, entity disambiguation, deduplication, etc. [1]. The principal aim of record linkage is to find similarities between two records. A geneticist sparked the theoretical idea of decision rules for defining matches and non-matches between different sets of data around the 1950s [4]. The idea soon showed promise in computations on large health files. Fellegi and Sunter later laid the mathematical foundation for this idea [2], introducing similarity functions to calculate an estimate of how similar two records are. The contributions from these three persons spawned the idea of record linkage.
Table 1 contains two records of a person. It is representative of the record linkage problem, since it illustrates that the same person may have different attribute values in different datasets and still correspond to the same real-world entity.
Table 1

Attribute        Dataset-1        Dataset-2
First Name       Alice            Alicia
Last Name        Smith            Smith
Date of Birth    19950821-1320    199508211320
Phone Number     265-5984156      151-0484631
By using record linkage approaches it is possible to identify relations and categorize data according to real-world entities. Table 1 shows that the same person attribute may be represented in different forms in different data sources. By using comparison functions on the numeric and textual data in a linkage process, and then applying classification algorithms, we can determine whether a pair is a match or not. With the growing number of distributed and heterogeneous datasets, effective record linkage solutions are very valuable for creating a unified view of data [3]. This report focuses on the field comparison and classification phases of the general record linkage approach.
1.2 Challenges
Many different solutions to this problem have been researched and developed. The underlying data is what ultimately affects the result of a linkage process [15]. Because of this, the most commonly used algorithms follow a general step-by-step approach, but due to the high correlation between data and result, the configuration of each step may vary. Therefore, based on the no free lunch theorem^1, there is no well-defined way to solve a real-application problem.
Scalability issues present big challenges when the attributes in records are complicated and the scale of the datasets is large. When considering all pairs from a dataset A of size |A| and a dataset B of size |B|, the number of similarity checks grows to |A| × |B|. For example, imagine that |A| = 1000000, |B| = 1000000, and that a similarity function that takes two records and calculates a similarity needs 1 ms per pair. The total time it takes to check all the pairs then adds up to 10^-3 × 10^12 = 10^9 s. The computational cost will grow with an increasing number of pairs to evaluate.
1.3 Objective
The objectives of our research include i) an overall view of the problem of record linkage [1, 11, 12], and ii) a comparison of different state-of-the-art record linkage algorithms – especially supervised and unsupervised record linkage machine learning algorithms – with regard to optimizing the accuracy and scalability of record linkage.
1.4 Contributions
Through a literature review, development, deployment and experimental studies, we
• investigate the accuracy and scalability of state-of-the-art machine learning algorithms for record linkage, and
• explore the differences between three commonly used similarity functions in terms of accuracy and scalability.
^1 https://en.wikipedia.org/wiki/No_free_lunch_theorem
2 The General Approach
The general approach for tackling a record linkage problem has been well described by Peter Christen [24]. As shown in Figure 1, the steps toward information about matching pairs generally involve preprocessing of the different input data as a first step. Evaluation is often considered at the very end of this process. Our approach to the evaluation stage is described in the experimental setup in Section 5.
2.1 Preprocessing
Preprocessing – which can also be described as data cleaning – is used to get records into a canonicalized form [15]; this is very important for recognizing similarities. Most real-world data is noisy, with differently formatted fields, which can make the cost of data integration very high. Even simple cleaning methods – like removing possible typos – have been shown to improve linkage results [11]. Other methods worth mentioning include the transformation of textual fields into a phonetic encoding, which makes it easier to reconcile, e.g., mistyped names.
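As an illustration, a minimal cleaning step of this kind might lowercase fields, strip punctuation, and collapse whitespace. This is only a sketch; the field names and example values are hypothetical, not taken from the datasets used later.

```python
import re

def clean_field(value):
    """Normalize a textual field into a canonicalized form:
    lowercase, punctuation stripped, whitespace collapsed."""
    value = value.lower()
    value = re.sub(r"[^\w\s]", "", value)       # strip punctuation
    value = re.sub(r"\s+", " ", value).strip()  # collapse whitespace
    return value

record = {"first_name": "  Alice ", "last_name": "O'Brien-Smith"}
cleaned = {k: clean_field(v) for k, v in record.items()}
# cleaned == {"first_name": "alice", "last_name": "obriensmith"}
```

Two records that differ only in casing or punctuation then compare as identical in later steps.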
Figure 1: A general approach to solve a record linkage problem: each data set is preprocessed, candidate pairs are formed through blocking/indexing, fields are compared, and the resulting weight vectors are classified into matches, non-matches and possible matches; possible matches go to clerical review, and the outcome is evaluated.
2.2 Blocking/indexing
Fellegi and Sunter introduced the blocking issue when they established the mathematical foundation for record linkage [2]. A complete check of all record pairs can become computationally very hard [1]. By using different blocking/indexing techniques, one reduces the number of pairs to check: if two records are completely dissimilar on the blocking attributes, their comparison is blocked. The blocked pairs do not need to be evaluated, and the upcoming similarity measures can instead be applied only to the remaining candidate pairs.
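As a sketch of the simplest kind of blocking (the records here are made up for illustration), indexing data set B by a blocking key means only pairs that agree exactly on that key are ever compared, instead of all |A| × |B| pairs:

```python
from collections import defaultdict

def block_pairs(records_a, records_b, key):
    """Yield only the candidate pairs whose blocking key agrees exactly."""
    index_b = defaultdict(list)
    for rb in records_b:
        index_b[rb[key]].append(rb)   # index B once by the blocking key
    for ra in records_a:
        for rb in index_b.get(ra[key], []):
            yield (ra, rb)

A = [{"name": "alice", "dob": "1995-08-21"},
     {"name": "bert",  "dob": "1959-08-21"}]
B = [{"name": "alicia", "dob": "1995-08-21"},
     {"name": "bert",   "dob": "1995-08-21"}]

candidates = list(block_pairs(A, B, key="dob"))
# Only 2 of the 4 possible pairs remain: those sharing dob "1995-08-21".
```

Here blocking on "dob" halves the comparisons; on realistically sized data the reduction is far larger.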
Table 2
Attribute Data set 1 Data set 2
First Name Bert Bert
Date of Birth 1959-08-21-1320 1995-08-21-3039
The blocking/indexing step is a research area of its own. Many different methods exist, ranging from trivial techniques, where blocking is done on attributes that are expected to be exactly the same, e.g. ”Date of Birth” in Table 2 [21], to machine learning techniques [21, 14]. For the practice of scaling and storing big data, blocking/indexing is of great importance [1], but it is out of the scope of this study.
2.3 Field comparison
The end result of record linkage is to output a unified dataset without duplicate entities.
Duplicate entities between the datasets – which are about to be merged – are found by comparing fields. The effectiveness of finding duplicates depends on the similarity between attributes/fields. Similarity measurements vary depending on the type of data to be evaluated; for example, when measuring similarity between names, the focus lies on finding lexicographical differences. A comparison usually ends with a vector containing numerical similarity values [17, 11, 16].
2.4 Classification
Given the output from field comparisons, classification algorithms are used to categorize each pair of records as ”matches” or ”non-matches”. Our research mostly focuses on this step in the record linkage process. At this point the non-matches consist of the blocked records, and additional non-matches are found by using different categories of decision models. A later chapter describes the state-of-the-art decision models. Possible matches are handled by a clerical review process, which often needs user input; a time-consuming task.
2.5 Evaluation
The performance of record linkage can be evaluated in terms of different measurements, such as accuracy, clustering purity/impurity, and clustering precision/recall/F-measure. When considering scalability, the focus of the evaluation lies on the running time for training and testing the learning algorithms. The analyses of record linkage strategies use metrics that depend on the following counts, which measure found matches versus true matches:
• True Positives (TP),
• True Negatives (TN),
• False Positives (FP),
• False Negatives (FN).
Accuracy is defined as

    Accuracy = (TP + TN) / (TP + FP + TN + FN). (2.1)

Precision is

    Precision = TP / (TP + FP),

recall is

    Recall = TP / (TP + FN),

and F-measure is

    F = (2 · Precision · Recall) / (Precision + Recall).
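These metrics follow directly from the four counts; a minimal sketch (the counts below are hypothetical, chosen only so the arithmetic is easy to check):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all pairs classified correctly (Eq. 2.1)."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Fraction of declared matches that are true matches."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true matches that were found."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts from a linkage run:
tp, tn, fp, fn = 90, 890, 10, 10
# accuracy = 980/1000 = 0.98; precision = recall = F = 0.9
```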
Accuracy matters for scalability, but in order for a method to be useful on large amounts of data, other factors become important as well. Scalability emphasizes the practicality of applying a method to data: accuracy still matters, but so does how fast the linking can be made.
3 Similarity Functions
As shown in Figure 1, the final output of a general approach consists of the following sets:

    M: the set of matches,  U: the set of non-matches.

Capital letters A and B correspond to two sets that contain records. These records are referred to as α(a), which belongs to A, and β(b), which belongs to B. Here a and b represent characteristics such as a name. The set of matches is therefore

    M = {(a, b); a = b, a ∈ A, b ∈ B},

and the set of non-matches is

    U = {(a, b); a ≠ b, a ∈ A, b ∈ B}.

The size of a set, e.g. A, is denoted by |A|. The set of ordered pairs is

    A × B = {(a, b); a ∈ A, b ∈ B},

with a total size of |A| × |B|.
When Fellegi and Sunter laid the mathematical groundwork for record linkage [2], they introduced a comparison vector function γ[α(a), β(b)]. This comparison vector is observed during the linkage process. With its total comparison space Γ consisting of matches, non-matches and those we cannot decide, the result is that a pair of elements (a, b) belongs to the set M, U or P, where P is the set of undecided pairs. On agreement (a match), γ[α(a), β(b)] = 1, and on disagreement, γ[α(a), β(b)] = 0.
In the following subsection, three commonly used similarity functions are introduced, which will be used to optimize the state-of-the-art algorithms.
3.1 Cosine similarity
The cosine similarity is the dot product of two vectors, α(a) and β(b), divided by the product of their Euclidean norms. The Euclidean norm is calculated as

    ‖α‖ = √(α · α) (3.1)

and the dot product is

    α · β = Σ_{i=1}^{N} α_i β_i, (3.2)

making the cosine similarity function, using (3.1) and (3.2),

    cosine(α, β) = Σ_{i=1}^{N} α_i β_i / ( √(Σ_{i=1}^{N} α_i²) · √(Σ_{i=1}^{N} β_i²) ). (3.3)
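In text applications, α and β are typically token- or character-frequency vectors derived from the two field values. A minimal sketch of Eq. (3.3) on plain numeric vectors:

```python
import math

def cosine_similarity(alpha, beta):
    """Cosine similarity per Eq. (3.3): dot product divided by
    the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(alpha, beta))
    norm_a = math.sqrt(sum(a * a for a in alpha))
    norm_b = math.sqrt(sum(b * b for b in beta))
    return dot / (norm_a * norm_b)

cosine_similarity([1, 2, 3], [2, 4, 6])  # ≈ 1.0 (same direction)
cosine_similarity([1, 0], [0, 1])        # 0.0 (orthogonal)
```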
3.2 Jaro-Winkler
The similarity between two strings can be measured using the Jaro-Winkler similarity. This function is an adaptation of a method developed by Jaro [18]. The Jaro-Winkler similarity sim_jw is obtained from the Jaro similarity sim_j:

    sim_j = 0, when m = 0;
    sim_j = (1/3) · ( m/|α(a)| + m/|β(b)| + (m − t)/m ), otherwise. (3.4)

In the above equation, m is the number of matching characters and t is half the number of transpositions needed to obtain the other string's character ordering. The Jaro-Winkler similarity corresponds to the following equation:

    sim_jw = sim_j + l · p · (1 − sim_j), (3.5)

where l is the length of the common prefix, starting at the beginning of the string and continuing for up to 4 characters, and p is a scaling factor determining how much the common prefix should affect the similarity score; sim_j is given in (3.4). The final result used as the distance between the two strings is obtained by taking 1 − sim_jw.
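As a sanity check of these formulas, here is a compact (unoptimized) sketch of Eqs. (3.4) and (3.5). It assumes the conventional scaling factor p = 0.1 and the usual Jaro match window of max(|s1|, |s2|)/2 − 1:

```python
def jaro(s1, s2):
    """Jaro similarity, Eq. (3.4)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):                       # count matches m
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                                      # transpositions
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler similarity, Eq. (3.5), with prefix length capped at 4."""
    sj = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return sj + l * p * (1 - sj)

jaro_winkler("martha", "marhta")  # ≈ 0.9611
```

On "martha"/"marhta" there are m = 6 matches and t = 1 transposition, giving sim_j ≈ 0.944, which the 3-character common prefix lifts to ≈ 0.961.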
3.3 Edit distance
Edit distance is measured as the number of operations it takes to transform α(a) into β(b). It is classified as an edit-based (character-level) similarity function. The operations form a finite set of rules of the form

    δ(a, b) = t,

where t represents the number of transformations. The set of allowed operations varies between edit-based similarity measurements. Edit distance accepts:

• Insertion: δ(ε, c), meaning an insertion of the character c.
• Deletion: δ(c, ε), meaning a deletion of the character c.
• Substitution: δ(c0, c1), meaning a substitution of c0 by c1 [23].
For example, the edit distance between ”Hello” and ”Jello” is 1, the edit distance between ”Good” and ”Goodbye” is 3, and the edit distance between any string and itself is 0.
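The examples above can be reproduced with the standard dynamic-programming (Levenshtein) formulation of edit distance; a minimal sketch:

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions turning s into t (row-by-row DP)."""
    prev = list(range(len(t) + 1))          # distance from "" to t[:j]
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

edit_distance("Hello", "Jello")   # -> 1
edit_distance("Good", "Goodbye")  # -> 3
```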
4 Related Work
The choice of strategy depends on the data that is used for the linkage. In this section an outline of the main approaches is presented. Although these approaches can be mixed, the differences between them are clear, as will be described. What the strategies have in common is that they are all used to find matches, non-matches and possible matches. Two classic approaches exist: deterministic linkage and probabilistic linkage. Modern research mainly involves the use of different machine learning methods. Below follows a description of these three approaches.
4.1 Deterministic linkage
For the unlikely scenario that the data in a real-world application is perfectly clean, a deterministic approach may yield very good results. It has been used in medical research, where the process has been applied iteratively, showing a high validity of the resulting match pairs [10]. Given a record pair, a deterministic algorithm inspects agreement on a given identifier, and the outcome is either a true or a false response. Exact matching is one type of linkage which can be used when a unique identifier exists. Exact matching is very precise – as the name suggests – but a unique identifier rarely exists when conducting a linkage. Another type is rule-based matching, where the comparison is made character by character and, depending on the strategy, an error may indicate a non-match or may initiate another comparison [24, 19, 4].
4.2 Probabilistic linkage
One problem with the deterministic approach is that it needs clean and correct data. Consider for example these strings: s = Bert, t = bert. A strict deterministic approach would consider s and t as corresponding to different entities; clearly, this is prone to errors. If instead t = Burt, we would consider this a completely different name. In probabilistic linkage, the importance of errors is taken into account by weighting them [10, 2, 19].
4.3 Machine Learning Approach
Record linkage can be viewed as a clustering/classification problem. Nowadays, machine learning solutions are commonly used, including supervised and unsupervised learning methods. The use of machine learning algorithms has proven valuable. The non-learning counterparts may or may not fit for a specific dataset while learning is more adaptable [20, 22, 19, 13].
4.3.1 Supervised Machine Learning
Supervised machine learning methods are trained using labeled training sets, i.e., inputs for which the desired output is known. Examples of commonly used methods are:
• SVM (Support Vector Machine) is a supervised learning algorithm and one of the best-known classifiers; it separates the classes with a hyperplane [5, 6].
• Logistic regression is used to solve classification problems, where the target is a categorical variable. The activation function in logistic regression is the sigmoid function. Logistic regression is simple and effective, with probability output [6].
• The Naive Bayes method assumes that all features are independent given the class label. Bayesian networks are based on conditional probability. Naive Bayes is often a good choice when little training data is available [7].
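As an illustrative sketch (using made-up comparison vectors, not the thesis data), a supervised classifier learns a decision boundary from labeled comparison vectors and then predicts match/non-match for new pairs. Any of the three classifiers above could stand in for the logistic regression used here:

```python
from sklearn.linear_model import LogisticRegression

# Each row is a comparison vector (name similarity, address similarity);
# labels: 1 = match, 0 = non-match. Toy data for illustration only.
X_train = [[0.95, 0.90], [0.88, 0.97], [0.92, 0.85],
           [0.10, 0.20], [0.30, 0.05], [0.15, 0.25]]
y_train = [1, 1, 1, 0, 0, 0]

clf = LogisticRegression().fit(X_train, y_train)
predictions = clf.predict([[0.90, 0.93], [0.20, 0.10]])
# The first unseen pair should be classified as a match, the second not.
```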
4.3.2 Unsupervised Machine Learning
Unsupervised learning gets its name from the fact that this type of learning algorithm does not need a set of training data. These methods infer a function to describe hidden structures in unlabeled data. Examples are:
• KMeans is a partitional clustering approach. K points are used to represent the clustering result; each point corresponds to the center (mean) of a cluster. The number K must be specified in advance [8].
• ECM (Expectation Conditional Maximization) is used in statistics and machine learning as a method for optimizing a result iteratively by increasing a lower bound in each step, where the lower bound is set by assuming independent attributes [9].
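For record linkage with K = 2, the comparison vectors are clustered without labels and the cluster with the higher similarity centre is interpreted as the matches. A sketch with made-up vectors (not the thesis data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled comparison vectors (name similarity, address similarity).
X = np.array([[0.95, 0.90], [0.88, 0.97], [0.92, 0.85],
              [0.10, 0.20], [0.30, 0.05], [0.15, 0.25]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Cluster numbering is arbitrary: take the cluster whose centre has the
# larger similarity values as the "match" cluster.
match_cluster = int(np.argmax(kmeans.cluster_centers_.sum(axis=1)))
```

No training labels were needed, but the match/non-match interpretation of the clusters has to be assigned afterwards; that extra step is one reason unsupervised methods can be less convenient in practice.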
5 Experimental Studies
This section covers the platform and resources used in this report to address the record linkage problem. It explains the choice and usage of datasets and, from that, moves on to an explanation of the chosen linkage strategy. This section also specifies which algorithms and functions have been tested. Finally, a description of the standard record linkage metrics that have been used is provided.
The main focus of the experiments was on comparison, analysis and optimization of well-established machine learning algorithms that apply learnable similarity functions for predicting matches and non-matches. Two problems that arise when dealing with larger sets of data were mitigated by restricting the problem domain. According to earlier studies, when dealing with record linkage on a large scale, the algorithms become more sensitive to an increasing number of datasets and to incomplete data [1, 25]. To tackle this, we restricted the problem domain to two datasets. Preprocessed data were used for testing the algorithms to make sure that the datasets had the same canonicalized form [13].
5.1 Resources
The resources used in the experimental part are described in this section.
5.1.1 Hardware & Software
In order to gather test data, we have used the specification described in Table 3.
Table 3: Specification of the hardware and software used when running the tests.
Name Version
OS Ubuntu 16.04 LTS 64-bit
CPU Intel Core 2 Duo CPU P7550, 2.26GHz * 2
RAM 4 GB
Hard drive HDD 500 GB
Programming language Python 2.7
IDE Jupyter notebook
5.1.2 Toolbox
The toolbox used comes from an open-source project called ”The Record Linkage Toolkit”^1. It includes libraries containing two datasets as well as state-of-the-art classification algorithms for record linkage. For different configurations the authors have used the Jellyfish toolkit, which includes state-of-the-art similarity measurement functions.
Different configurations of the algorithms, provided by the toolkit^1, were used in the experiments. In order to try comparison algorithms we made adaptations of one dataset; the other provided dataset was preconfigured for testing classification algorithms. The two datasets and their application gave us an opportunity to isolate the problem when testing the comparison and classification parts of the general record linkage process.
5.1.3 Dataset
Two types of datasets were used for the comparison of different algorithms and configurations. These datasets come from the Record Linkage toolkit. For the evaluation of supervised and unsupervised machine learning algorithms the Krebsregister dataset was used. It consists of data from a cancer study made by the Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI) and the University Medical Center of Johannes Gutenberg University in Mainz, Germany. The data in the Krebsregister is precompared using phonetic transformation and equality measurements. The published set consists of 5749132 record pairs; 20931 pairs are known matches obtained from an extensive review involving several documentarists.
The second set was imported from the FEBRL (Freely Extensible Biomedical Record Linkage) project. It consists of randomly generated personal data. The records kept in the dataset consist of both string and numerical data (see Table 4). The data has been formatted such that capital letters have been removed beforehand, as can be seen in the table. This makes it possible to analyze the comparison and classification steps in isolation, under the assumption of clean, non-precompared data.
Table 4: This is a sample from the FEBRL dataset, showing attributes and field values.

rec id       given name  surname  street nr  address 1      address 2  suburb
rec-796-org  louise      heenan              fernlea                   lakes entrance
rec-690-org  edward      dent     28         shipard place  glenview   picton
^1 https://recordlinkage.readthedocs.io/en/latest/
5.2 Record Linkage Strategies and Optimization
To find a solution, the experiments were based on a two-step process. The first step was to try different machine learning algorithms on the Krebsregister data. The size of the dataset and the number of known precompared matches in the Krebsregister dataset made it possible to focus on the classification part.
The algorithms used are considered state-of-the-art: Logistic Regression, Naive Bayes and Support Vector Machine, representing supervised algorithms, together with the unsupervised learning algorithms KMeans and ECM.
In the second part of our experiments we considered different similarity functions for pairwise matching on the FEBRL dataset. FEBRL consists of personal information in the form of the attributes visualized in Table 4. These tests represent an optimization possibility, since the choice of similarity function greatly affects scalability, as will be seen. The following enumeration lists the attributes and their corresponding setup possibilities.
• given name: Jaro-Winkler, Edit distance or Cosine distance
• surname: Jaro-Winkler, Edit distance or Cosine distance
• street nr: Numerical equality matching
• address: Jaro-Winkler, Edit distance or Cosine distance
• suburb: Jaro-Winkler, Edit distance or Cosine distance
One of the three functions was used for all textual fields when comparing, followed by an evaluation of the results of each measurement function in terms of how many matches were found.
5.3 Evaluation on Accuracy
We explored the accuracy of the different similarity functions (cosine similarity, Jaro-Winkler similarity, edit distance) via the number of found matches. A match in the comparison function evaluation is defined as having three of the fields surpass a threshold value set to a likeness of 85%. The true matches in the FEBRL dataset are known beforehand, and therefore we knew how many true matches existed; a function either found too many or too few matches. The function with the closest number of found matches is what we classified as optimal. This methodology is used when describing the comparison functions in the record linkage toolkit documentation. To further verify the correctness of the similarity function implementations, we tested them on a smaller amount of data and then performed a clerical review of the matching status of the found pairs.
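The match rule used here can be sketched as follows (the field names and similarity values are hypothetical; the rule is "at least three fields at or above the 85% threshold"):

```python
THRESHOLD = 0.85       # similarity a field must reach
REQUIRED_FIELDS = 3    # fields that must reach it for a match

def is_match(similarity_vector, threshold=THRESHOLD, required=REQUIRED_FIELDS):
    """A pair is declared a match if at least `required` of its field
    similarities reach the threshold."""
    return sum(1 for s in similarity_vector if s >= threshold) >= required

# Similarities for (given name, surname, address, suburb):
is_match([0.91, 0.88, 0.97, 0.40])  # -> True  (three fields >= 0.85)
is_match([0.91, 0.88, 0.60, 0.40])  # -> False (only two fields)
```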
We evaluated the different algorithms (SVM, Logistic regression, Naive Bayes, KMeans, ECM) in terms of accuracy. Regarding each classification algorithm the accuracy is defined by Eq. 2.1, resulting in a number between 0 and 1 where 1 is completely accurate and 0 is completely non-accurate.
5.4 Evaluation on Scalability
We evaluated the different algorithms (SVM, Logistic regression, Naive Bayes, KMeans,
ECM) and different similarity functions (Cosine similarity, Jaro-Winkler similarity, edit
distance) in terms of the computational runtime with regards to different data sizes.
6 Results and Reflections
6.1 Results
To check for concordance with related results we consulted different summary studies [1, 11, 17, 12]. Regarding accuracy (see Figure 2), there is a difference between the three shown classification algorithms. The KMeans algorithm performs relatively inaccurately. The ECM algorithm is more accurate than KMeans and logistic regression.
ECM, Naive Bayes and Support Vector Machine accurately classified almost the same number of true pairs; see the data presented in Table 5. The accuracy for these classification algorithms is calculated using Eq. 2.1.
Table 5: A comparison of the classification algorithms Expectation Conditional Maximization (ecm), Naive Bayes (nb) and Support Vector Machine (svm) in terms of accuracy, for different dataset sizes.
Size Accuracy (ecm) Accuracy (nb) Accuracy (svm)
200000 0.8988 0.8988 0.8989
300000 0.9337 0.9337 0.9338
500000 0.9616 0.9615 0.9616
1000000 0.9826 0.9825 0.9826
2000000 0.9930 0.9930 0.9930
3000000 0.9965 0.9964 0.9965
4000000 0.9982 0.9982 0.9982
5000000 0.9993 0.9992 0.9993
Figure 2: The relative accuracy of the state-of-the-art machine learning methods (ecm, kmeans, logreg) applicable to a record linkage problem, for dataset sizes up to 5 × 10^6 pairs. The accuracy was calculated using Eq. 2.1.
For the scalability issue, Figure 3 presents a visualization of the obtained test results. The supervised methods outperform the unsupervised ones with regard to running time.
Figure 3: Running time (measured in seconds and shown on a logarithmic scale with base 10) of the different classifiers (ecm, kmeans, logreg, nb, svm) on different data sizes.
The second step in the approach of studying optimization possibilities included the testing of widely used similarity functions: Jaro-Winkler, edit distance and cosine similarity. All three functions behave similarly on the three sets of data, finding approximately the same number of matches; see Figure 4. The runtime on the three different datasets, however, showed a large dissimilarity between the chosen functions; see Figure 5.
Figure 4: The matching results of the similarity functions (cosine, edit distance, Jaro-Winkler). The comparisons were made on three different sets: Set 1 consists of 500 true matches and a total of 499500 pairs, Set 2 of 1000 true matches and 39775 pairs, and Set 3 of 3000 true matches and a total of 48883 pairs.
Figure 5: The time it takes to calculate similarities (cosine, edit distance, Jaro-Winkler) for different numbers of pairs.
6.2 Reflections
Testing the accuracy for different sizes of data taken from the Krebsregister dataset revealed a difference in performance between the classification algorithms; see Figure 2. The KMeans algorithm is commonly used in applications, but this study shows its impracticality when it comes to scaling, relative to the supervised algorithms. The tendency of the two unsupervised algorithms to scale poorly in running time could result from the fact that they do not need a labeled training set. This is a drawback in terms of speed, but it also makes them applicable in more situations. The difference between supervised and unsupervised learning methods on the used data can be seen in Figures 2-3 and Table 5.
The guideline that was set up for evaluating the scalability of the similarity functions was that they should find a decent number of correct matches. After verifying that they met that guideline, we evaluated the time complexity on the different sets of data; see Figure 5. The Jaro-Winkler similarity function was the most efficient, according to the tests, when focusing on the time it took to calculate the similarities. This observation is consistent with previous studies, even on different datasets, which makes it a candidate for usage in an optimized record linkage strategy.
We would like to emphasize that the strategies we have used in this research are not 100% reliable, in the sense that they do not provide 100% accuracy for the found matches; see Figures 2 and 4. This may cause problems if they are used for linking datasets where accuracy is critical, e.g. in medical research. In this study we did not take such matters into account.