Thesis no: MGCS-2014-06
Clustering of Image Search Results to Support Historical Document
Recognition
Javier Espinosa
Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden
This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.
The thesis is equivalent to 10 weeks of full-time studies.
Contact Information:
Author:
Javier Espinosa
E-mail: jaes13@student.bth.se
University advisor:
Dr. Niklas Lavesson
Dept. Computer Science & Engineering
Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57
Abstract
Context. Image searching in historical handwritten documents is a challenging problem in computer vision and pattern recognition. The amount of documents which have been digitized is increasing each day, and the task of finding occurrences of a selected sub-image in a collection of documents is of special interest to historians and genealogists.
Objectives. This thesis develops a technique for image searching in historical documents, divided into three phases. First, the document is segmented into sub-images according to the words on it. These sub-images are defined by a feature vector with measurable attributes of their content. Based on these vectors, a clustering algorithm computes the distance between vectors to decide which images match the one selected by the user.
Methods. The research methodology is experimentation. A quasi-experiment is designed based on repeated measures over a single group of data. The image processing, feature selection, and clustering approach are the independent variables, whereas the accuracy measurements are the dependent variable. This design provides a measurement net based on a set of outcomes related to each other.
Results. The statistical analysis is based on the F1 score to measure the accuracy of the experimental results. This test analyses the accuracy of the experiment with regard to the true positives, false positives, and false negatives detected. The average F-measure for the experiment conducted is F1 = 0.59, whereas the actual performance of the method is a matching ratio of 66.4%.
Conclusions. This thesis provides a starting point for the development of a search engine for historical document collections based on pattern recognition. The main research findings focus on image enhancement and segmentation for degraded documents, and on image matching based on feature definition and cluster analysis.
Keywords: Historical documents, Computer vision, Feature extraction, Clustering.
Acknowledgments
First and foremost, I thank my supervisor Dr. Niklas Lavesson. Without his support, encouragement, academic guidance, and patience this thesis would not have been possible. His professionalism is an example to follow.
Secondly, I thank Arkiv Digital AD AB for providing the data for the thesis; its image database has been the foundation of all the experiments conducted.
Thirdly, I thank my parents for giving me the opportunity and privi- lege of studying abroad, in addition to their unconditional trust and support.
Last but not least, I would like to thank the friends who have always supported me, even without understanding my work. One way or another, they have always helped me.
Contents

Abstract
Acknowledgments
1 Introduction
    1.1 Terminology
2 Background
3 Related Work
    3.1 Identification of Gap
4 Aim and Objectives
5 Approach
    5.1 Software library
    5.2 Image processing
        5.2.1 Binarization
        5.2.2 Image filtering
        5.2.3 Segmentation
    5.3 Feature extraction
    5.4 Clustering
    5.5 Data preparation
    5.6 Data representation
    5.7 Cluster borders
6 Research Methodology
    6.1 Experimental design
    6.2 Data collection
    6.3 Data analysis
    6.4 Validity threats
        6.4.1 Internal validity
        6.4.2 External validity
        6.4.3 Construct validity
        6.4.4 Statistical validity
7 Results
    7.1 Image processing
    7.2 Clustering
8 Analysis
    8.1 Image processing analysis
    8.2 Features analysis
    8.3 Clustering approach
9 Conclusions and Future Work
References
Chapter 1
Introduction
Computer-based historical document recognition and analysis has become an important research area due to the amount of documents from the past which have been digitized. In order to learn from and preserve our cultural heritage, national archives and companies such as Arkiv Digital AD AB [1] have scanned or photographed millions of historical documents. Historical documents are usually handwritten and many of them are damaged. Thus, it is more time consuming to extract knowledge from such documents than from high-quality, typed documents.
For genealogists and historians, finding multiple occurrences of a name, a marker, or some other fragment of text in a collection of historical documents is still a manual task. There exists no solution for automatically searching for a selected sub-image of one historical handwritten document image in a collection of document images. The application of content-based image retrieval techniques in conjunction with machine learning may introduce a solution to reduce the time it takes to manually skim through all the documents to interpret the contents (Dharani and Aroquiaraj, 2013; Zakariya et al., 2010). Several possible solutions exist for the searching problem, but the following background is focused on image processing for sub-image extraction and clustering for image matching.
1.1 Terminology
To avoid misunderstandings and to make the explanations in the rest of the document clear, some terms are defined according to their specific meaning in this context. A document denotes a collection of digital images from the same book.
[1] http://www.arkivdigital.net/
An image is understood as the digital image of one page from a book. And finally, a sub-image is an image extracted from one of the aforementioned pages.
Chapter 2
Background
In relation to image processing, the Optical Character Recognition (OCR) methodology provides a solution to extract the elements from handwritten or machine-printed documents (Vamvakas et al., 2008; Stamatopoulos et al., 2009). The OCR methodology can be divided into three general steps: a binarization step to obtain a black-and-white copy with imperfections removed; a recursive segmentation of lines, words and characters; and the character recognition using a database elaborated for the case. The last step presents a disadvantage in relation to database availability, since the character recognition is a supervised task. OCR techniques only work properly with machine-printed documents, which use only one type of font, making the recognition process manageable. Handwritten documents, however, present different writing styles and fonts; thus, creating a database for each document becomes infeasible. The support for very old handwritten documents is scarce (Vamvakas et al., 2008). However, the first two steps emerge as a common factor for historical document image processing.
Image binarization is the first step in image processing. In historical documents it is common to have a low contrast between the background colour and the ink, physical noise, or even ink bleeding through from the other side (Su et al., 2013, 2011).
The binarization operation consists of transforming a colour picture into a black-and-white picture. In the process, each pixel is turned black or white based on a threshold. An appropriate configuration of this threshold is essential to remove the undesirable elements, since any element in the background may cause a bad segmentation in the next step.
Once the picture has been pre-processed, in the segmentation phase the picture is divided into different sub-pictures which can contain separated lines, words or characters (Messaoud et al., 2012). This step is decisive for the remaining work, and its success depends on the accuracy of the segmentation model. There is currently no standard for segmentation. Different algorithms have been developed and each one can be used depending on the type of document and the distribution of its content (Likforman-Sulem et al., 2007).

Figure 1: Example page of a historical document image. Source: Arkiv Digital AD AB, www.arkivdigital.se
Machine learning techniques are used for image recognition tasks. According to Mohri et al. (2012), there are different techniques depending on the training data available and the test data used to evaluate the algorithm. Supervised learning, for example, builds its model making use of a set of labelled data called the training set. In semi-supervised learning the learner makes use of a training set with labelled and unlabelled data, and makes predictions for the unknown points. And unsupervised learning only makes use of unlabelled data. Clustering is an example of unsupervised learning.
Content and pattern recognition are the two different objectives once the segmentation step is complete. If the purpose of the technique is to obtain the transcript of the document, content recognition techniques may be applied. In this case, we have a classification problem where supervised or semi-supervised machine learning techniques may be applied (Ball and Srihari, 2009; Vajda et al., 2011; Marinai et al., 2005). A training set is also a necessary condition to predict which character matches in the transcription. Historical documents present disadvantages which maximize the human effort, and such approaches only work for particular cases. Each document has been written by different authors, and each author has a different writing style. To carry out content recognition in that case, a special dataset with a large amount of labelled characters, based on the author's writing style, is required for the training data.
On the other hand, we have content-based recognition techniques. In this case the transcription of the document is unavailable, but in comparison with the previous technique this approach presents more advantages. The possibility of using an unsupervised model minimizes the human interaction, and previous labelling work over the same document is not required to proceed with the recognition process (Panichkriangkrai et al., 2013; Phan et al., 2012). In this scenario we have an input image used as a pattern, and a bundle of sub-pictures corresponding to all the segmented words previously obtained. The recognition is performed in two phases: feature (or metadata) extraction, and clustering. Each image is represented as a vector of features, and these features define the content of the image without any knowledge about it. The distance between each feature vector and the others is used in the clustering phase to make the decision. A clustering algorithm is used because there is no target value to be predicted, but the images can be divided into groups. The aforementioned distances between vectors are used by the clustering algorithm to separate the images which match the pattern from those which do not.
Chapter 3
Related Work
Image recognition and document analysis form a wide-open research area. There are many applications based on image matching, such as the Google and Bing image search engines (Tang et al., 2012; Jing et al., 2013). These web applications use image features to evaluate the image and calculate distances to rank the outputs according to their similarity. Other studies also aim to develop a search method for handwritten documents: Imura and Tanaka (2010) propose a method based on a string-matching technique. This method treats the text in the document as pattern images by defining them with a vector of statistical features of character shapes. The text search is made by comparing the feature vectors, measuring the distance between them to make the decision.
Following the pattern matching work, previous studies in pattern recognition have focused their attention on Asian languages. In studies such as Vajda et al. (2011) or Marinai et al. (2005), the common objective of the authors is character retrieval from historical documents. The study conducted by Marinai et al. (2005) uses a segmentation procedure based on a recursive column-row cut method, and a clustering technique divided into three steps: character normalization, feature extraction, and classification. The feature extraction consists of eight different values corresponding to the gradient in eight different planes. The clustering method used was K-means, and the accuracy was measured as the ratio of correctly clustered characters to the total. The evaluation was made manually, and the approach achieved a total accuracy of 71%.
In the study conducted by Vajda et al. (2011), the pre-processing steps include binarization and histogram-based segmentation. Once the characters are separated, two features are used to define the content: the histogram of oriented gradients, and the pixel-wise difference for global distributions. Similar characters are obtained by measuring the feature differences between the pattern and the rest of the characters. To evaluate the accuracy, the same manual method explained in the previous paragraph is used. The accuracy achieved in this study is between
86% and 97%.
Studies of documents with different alphabets present other techniques for handwriting recognition. Semi-supervised learning techniques are applied in order to minimize the human effort in the transcription of the document. In Vajda et al. (2011), clustering algorithms are used to gather the characters into different groups, which are then manually labelled. The process used by Ball and Srihari (2009) reverses the order: samples of handwritten characters are labelled before using unsupervised learning for recognition.
For the image pre-processing steps, previous studies of historical documents have improved the binarization and segmentation techniques. Rangsanseri and Rodtook (2001) develop the idea of local thresholds to balance the background contrast or the irregular illumination in the document. The local thresholds are calculated dynamically for each pixel based on its neighbourhood. In degraded documents, this local threshold outperforms those based on static values. Su et al. (2013) also propose a binarization technique based on adaptive image contrasting. An adaptive threshold based on the image gradient is used for edge detection in the text. This improvement is tolerant to typical types of document degradation and to irregular illumination at the moment of digitization.
The segmentation method studied by Likforman-Sulem et al. (2007) is based on histogram projections. A handwritten text can vary with respect to font and size depending on the author's style, and the histogram projection adapts the segmentation according to these factors. A horizontal histogram projection is calculated to detect the lines, in which each peak represents a different line, and the same procedure with a vertical projection is used for the word segmentation.
One step beyond the last segmentation example is the framework introduced by Messaoud et al. (2012). The authors combine the pixel projection with angle correction, in order to address the problem of text-lines with an angle deviation with respect to the horizontal guideline. The method was tested using the IAM historical database (IAM-HistDB) and images from the ICFHR 2010 competition, outperforming the results based only on pixel projections.
3.1 Identification of Gap
Content-based image search in historical documents is still an open research area. Previous studies have developed specific techniques to address one particular study object, but there is no general technique for image searching in documents written in Latin alphabets. The principal challenge remains in finding elements in a collection of document images which match a previously selected sub-image. These can be names, markers, or some other fragments of text. In other words, the handwriting recognition problem is reduced to an image matching problem.
In Figure 2 there is an example of one possible image searching objective. The same name, Augusta, occurs in consecutive rows, apparently written by the same person. With a suitable content recognition and image searching technique, the detection of the same name in image documents would be possible with a sub-image input example. After selecting the first Augusta, the sample in the second row would be recognized regardless of subtle differences between the writing styles of each letter.
Figure 2: Example page from the Uppsala Cathedral household records, years 1881-1883. The first column represents a machine-typed index and the subsequent eight columns contain tabular, handwritten information about individuals from the Uppsala Cathedral parish. Source: Arkiv Digital AD AB, www.arkivdigital.se
Chapter 4
Aim and Objectives
The aim of this thesis is to develop a technique for image searching in historical handwritten document images. Image processing techniques such as binarization and segmentation, in conjunction with clustering algorithms, will be analyzed and evaluated in order to achieve high performance. As a result of achieving the aim, a complete methodology for image searching in historical documents, and a window-based software tool to perform the sub-image selection and matching will be provided.
The objectives to fulfill the stated aim are to:
1. Find an optimal configuration for binarization and segmentation to correctly extract the words in a document.
2. Study which combination of features is optimal to define a text sub-image without any additional knowledge about the content.
3. Design a clustering approach to analyse all the sub-images features and provide a solution for the image matching problem.
4. Develop a window-based software tool to perform the image recognition experiments and evaluations, enabling a visual tool to validate the results.
Chapter 5
Approach
The approach is separated into three different phases: image processing, feature extraction, and clustering analysis. The objective of the image processing is to split the documents into sub-images containing all the objects of interest. The image separation is a trade-off, and it is necessary to find the balance in the binarization and segmentation steps to separate the content information from the background. Different thresholds have to be tuned to find a suitable configuration for the studied datasets.
Once the image documents are fragmented into sub-images, it is necessary to define their content without any knowledge about it. The content of a picture can be expressed by defining a set of features, creating a vector for each image. The feature vector will be used later by the clustering algorithm to match the different sub-images.
The last step of the process is the cluster analysis. With a previously selected sample of the document, the clustering algorithm identifies which sub-images match the sample. The aforementioned feature vector is used to measure the disparity between images in order to make the decision.
The steps of the approach are now explained in detail, including a description of the software library used for the application development.
5.1 Software library
OpenCV [1] is an open source computer vision and machine learning software library. It includes programming functions focused on real-time image processing.
The window-based software tool for the project is developed on this platform.
The library includes a set of algorithms, e.g., for binarization, segmentation, and
[1] http://opencv.org/
recognition. This platform also has a large user community and it is extensively used in companies and by researchers. For these reasons, OpenCV provides the tools to develop the aforementioned application. OpenCV has interfaces for different programming languages, but since its primary interface is in C++, the software tool is written in that language to make the most of the platform.
5.2 Image processing
The image processing procedure is conducted in three steps: binarization, filtering, and segmentation. The binarization process consists of converting a colour image to black-and-white using a threshold for all the pixels in the image. Afterwards the image is filtered to remove all the remaining imperfections, like background noise, holes or bumps. Later, the segmentation process splits the binary image into multiple sub-images using vertical and horizontal projections of pixels.
5.2.1 Binarization
A digital image is processed as an array of pixels, and each pixel is defined with 8 bits, from 0 for black to 255 for white. To binarize an image it is necessary to use a threshold to decide the new value of each pixel. This threshold is defined as a number between 0 and 255 and it can be static or dynamic. A static threshold uses the same value for all the pixels of the image, whereas a dynamic threshold recalculates its value for each pixel according to a formula. Figure 3 illustrates a static threshold operation.
For the binarization of historical documents the chosen threshold is dynamic, considering that it provides a better result when the background or the illumination is irregular (Su et al., 2013). The dynamic threshold calculates its value for each individual pixel as a weighted sum of the neighbouring pixels. The weighted sum is calculated as a cross-correlation with a Gaussian window, subtracted by a chosen constant C. In this way, the function takes into account a predefined number of neighbouring pixels for the adaptive threshold. Moreover, the threshold type used is inverted: the dark pixels become white and the bright pixels black. White pixels are represented with the value 255, so it is convenient to invert the threshold in order to avoid histograms filled with high numbers, which would be a disadvantage when finding the maxima and minima to cut the image.
The binarization function is defined according to formula 1, where maxValue is 255 for a black-and-white image, and T(x, y) is the calculated threshold, i.e., the aforementioned weighted sum of the neighbouring pixels subtracted by the constant C.

    dst(x, y) = { maxValue   if src(x, y) > T(x, y)
                { 0          otherwise                (1)

Figure 3: Example of threshold operation. The red contour represents the pixel values of a picture, and the horizontal blue line a static threshold. The pixels above the threshold are set to the predefined maximum value and the others are set to zero. Source: OpenCV Documentation, docs.opencv.org
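The inverted dynamic threshold can be sketched in plain Python as follows. This is a minimal illustration, not the thesis implementation: the actual tool uses OpenCV's Gaussian-weighted adaptive threshold, whereas this sketch uses a plain neighbourhood mean, and the function name and parameters are illustrative assumptions.

```python
def binarize_inverted(img, block=3, C=2):
    """Turn a grayscale image (list of lists, values 0-255) into an
    inverted binary image: ink (dark pixels) becomes 255, background 0.

    The per-pixel threshold T is the mean of a block x block
    neighbourhood minus the constant C (OpenCV instead uses a
    Gaussian-weighted sum; the plain mean here is an assumption)."""
    h, w = len(img), len(img[0])
    r = block // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # neighbourhood around (x, y), clipped at the image borders
            vals = [img[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            T = sum(vals) / len(vals) - C
            # inverted test: pixels darker than T become white (255)
            out[y][x] = 255 if img[y][x] < T else 0
    return out
```

For example, a dark stroke pixel (value 50) on a bright background (value 200) falls below its local threshold and is mapped to 255, while the uniform background stays at 0.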
5.2.2 Image filtering
In handwritten documents it is common to find undesirable elements like ink drops, background marks that persist after the binarization step, or even holes and bumps (Nina et al., 2011). These imperfections may cause a decrease in the segmentation accuracy. The purpose of this filtering process is to remove those elements from the black-and-white image.
The filtering process is based on erosion and dilation operations. These functions have opposite effects on the image. The erosion operation enlarges the dark areas of the image and reduces the bright regions, whereas dilation reduces the dark areas and dilates the brighter zones. Figure 4 illustrates the effect of both operations on the same image. The objective of the erosion filter is to remove undesired elements in the images such as ink drops, whereas the dilation filter is aimed at enlarging the text lines in order to facilitate text detection.
Figure 4: Example of filtering, from left to right: original image, erosion filtering, dilation filtering. Source: OpenCV Documentation, docs.opencv.org
The image filtering process is conducted in two different phases: the dilation filter is applied before the segmentation, and the erosion filter before the feature extraction. In order to increase the accuracy of the segmentation, the dilation operation is performed to expand the white pixels. The segmentation process is based on the projection of these pixels; therefore, thicker lines cause a better detection of lines and words.

The erosion filter is applied to the sub-images once they are extracted. The extracted features are based on the distribution of the pixels over the image, but after the dilation, the content of the images loses quality regarding the writing style. The objective of the erosion operation is to return the image to its original form, and to remove the aforementioned imperfections.
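Both filters can be sketched in plain Python on an inverted binary image where ink is white (255). The thesis tool uses OpenCV's erode and dilate functions; this stand-alone version assumes a fixed 3x3 square kernel, which is an illustrative choice rather than the thesis configuration.

```python
def dilate(img):
    """Grow the white (255) regions by one pixel: a pixel becomes
    white if any pixel in its 3x3 neighbourhood is white."""
    h, w = len(img), len(img[0])
    return [[255 if any(img[j][i] == 255
                        for j in range(max(0, y - 1), min(h, y + 2))
                        for i in range(max(0, x - 1), min(w, x + 2)))
             else 0
             for x in range(w)] for y in range(h)]

def erode(img):
    """Shrink the white regions by one pixel: a pixel stays white
    only if every pixel in its 3x3 neighbourhood is white."""
    h, w = len(img), len(img[0])
    return [[255 if all(img[j][i] == 255
                        for j in range(max(0, y - 1), min(h, y + 2))
                        for i in range(max(0, x - 1), min(w, x + 2)))
             else 0
             for x in range(w)] for y in range(h)]
```

Dilating before segmentation thickens the strokes so the pixel projections form clearer bands; eroding the extracted sub-images afterwards shrinks the strokes back towards their original width.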
5.2.3 Segmentation
The method used for the segmentation process is based on horizontal and vertical
pixel projections. These projections are calculated as the sum of pixels along
a particular direction. This concept is inspired by the histogram projections presented by dos Santos et al. (2009), which are used to separate lines and words in handwritten documents.
Figure 5: Example of horizontal and vertical pixel projection histograms. Source:
http://archive.nlm.nih.gov/
The results of the projections are column and row vectors where the maximum values represent the lines with the highest amount of white pixels, and vice versa. These maximum points are used to identify where the image has to be cut. A recursive function over all the images is used to conduct the process, separating first the lines and then the objects of interest in them.
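The projection-and-cut idea can be sketched as follows. This is a simplified illustration of the histogram-projection approach described above, not the thesis code; the min_pixels cut threshold is a hypothetical parameter standing in for the tuned configuration mentioned later in this chapter.

```python
def horizontal_projection(img):
    """Count the white (255) pixels in each row of a binary image."""
    return [row.count(255) for row in img]

def find_bands(projection, min_pixels=1):
    """Return (start, end) index pairs where the projection stays at
    or above min_pixels: each band is one candidate text line. The
    same function applied to a vertical projection yields words."""
    bands, start = [], None
    for i, v in enumerate(projection):
        if v >= min_pixels and start is None:
            start = i
        elif v < min_pixels and start is not None:
            bands.append((start, i))
            start = None
    if start is not None:
        bands.append((start, len(projection)))
    return bands
```

Applied recursively (rows first, then columns within each detected line band), this reproduces the line-then-word separation described above.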
Handwritten documents present particular threats for an accurate segmentation (Sanchez et al., 2008). Due to the quality of the picture or the writing style, many lines in the document may appear diagonally, making the horizontal line detection harder. In other cases, the same line may contain sub-lines, as in the figure below.
Figure 6: Example of line with detection threats. The document style is not the same in all the columns, some of them only have one word per column whereas others have three lines of text. Source: Arkiv Digital AD AB, www.arkivdigital.se
The vertical segmentation to extract complete words also presents threats in handwritten documents. The process is based on the search for minima in the projection, which are points that identify a space between written elements. But this theory is only reliable in the ideal case where the background is clear and there is a lack of other elements. If the document presents tables, lines, or similar elements, the minima of the projection may be unreliable.

In order to solve the problem, and considering that the objective is not to identify the content but to detect the text, the sub-images are extracted between minima regardless of whether the word is complete or not. The extraction is not necessarily full-text based. The same vertical segmentation process is conducted over the user sample in order to obtain the same sub-images as those extracted from the original image. Further, the cluster analysis is conducted over each sub-image extracted from the user selection. This selection is made in order to address the problem of the extracted images containing incomplete words.
To evaluate the validity of the previous techniques, it is necessary to manually count the number of incorrectly extracted elements. The ratio between the correctly separated elements and the total gives us the accuracy of the method. If the previous techniques do not achieve a suitable result, the steps are repeated, tuning the parameters of the process until a suitable configuration is obtained.
5.3 Feature extraction
Once the elements are extracted into separate images, the following step consists of defining the content. For this purpose, a feature vector is created for each image. Kumar and Bhatia (2014) classify the feature extraction models into three groups: statistical features, signal processing methods, and geometrical or topological features.

For this particular case of handwritten historical documents, the selected features have to be meaningful and efficient (Impedovo and Pirlo, 2011). The features must define the content of each sample, and they have to provide a balance between computing time and quality performance. Statistical and geometrical features are based on the distribution of pixels and the contour shapes. Focused on the writing style, they may provide more information with less complexity (Impedovo et al., 2012). However, signal transformations and series expansions are focused on deformations like rotations or translations, properties which are more relevant for pictures than for written documents.
To extract the features, the image is no longer modified due to its basic level of segmentation, similar in some cases to a syllable extraction. Each sub-image is analysed separately and the same numerical attributes are extracted for each. First the geometrical features are extracted; these features only depend on the distribution of the pixels. The rows and columns of the sub-image are the first parameters, measuring the dimensions of the sub-image. The number of white pixels and their density in the sub-image are the subsequent parameters, calculated in order to measure the amount of text. The last geometrical parameters are the mean and the standard deviation of the image, understanding the image as an array of data.
The statistical features are calculated afterwards. They compare the extracted sub-images against the user samples by template matching overlapping. The overlapped images are compared using three different methods based on square differences, correlation, and clustering coefficients. The result is a similarity ratio between both images. This triad of features is computed for each sub-sample of the user selection.
Once the features are extracted for each image, they are used to construct a feature vector. This vector contains the numerical features previously calculated, and represents the content of the image for the pattern recognition. Therefore, all the image features may be represented as a matrix where each row belongs to an image and each column to a feature.
    Features =
                 Feature 1   Feature 2   ...   Feature N
    Element 1    a_{1,1}     a_{1,2}     ...   a_{1,n}
    Element 2    a_{2,1}     a_{2,2}     ...   a_{2,n}
      ...          ...         ...       ...     ...
    Element M    a_{m,1}     a_{m,2}     ...   a_{m,n}
                                                          (2)
Once the features matrix is built the next step is the cluster analysis.
5.4 Clustering
For the analysis and the classification of the extracted images according to their features, a clustering algorithm is used. Jain et al. (2004) classify clustering algorithms based on the input and output data representation, and on the optimization process. For the sub-images scenario, the input data are the feature vectors, or the feature matrix, and the output data are a pair of clusters with the matched and non-matched images. The clustering algorithm for this image matching has to be centre-based. The main cluster is built using the user sample as centre; it contains the matched images, while the other images belong to the second cluster.
Well-known centre-based clustering algorithms, such as K-means or K-medoids, require a priori knowledge of the centres of the clusters. Therefore, they cannot be used, since the centre of the non-matching cluster is undefined. The algorithm to match the images needs the centre of the main cluster and a function to mark out its area. The clustering algorithm developed is explained in detail in the subsequent sections.
For this approach, the outputs given are the input image, the extracted images, and those in the match cluster. This way, the evaluation is made manually by counting the correctly matched images and their total. The ratio between the correctly matched images and the total gives us the accuracy of the feature extraction and the clustering step. Thus, the total accuracy of the developed technique may be expressed as the combination of the two previously calculated accuracies.
Regarding the input data and the expected outputs, the algorithm developed is divided into two phases: analyse the matrix of features previously built, and divide the data into two clusters, one with the matched elements and the other with the non-matched. The algorithm is centre-based: the representation of the sample provided by the user defines the centre of the matching cluster. The objectives to reach the aim of the algorithm are to prepare the data for analysis, and to define the function that represents the border of the cluster.
According to the classification of Dubes and Jain (1988), this clustering algorithm is exclusive, intrinsic, and partitional. Exclusive clustering means that the clusters are non-overlapping: one element cannot belong to both clusters. It is intrinsic because a dataset to train the clusters is unavailable; there is no a priori knowledge about the elements to classify. And it is partitional, or non-hierarchical, because it creates single partitions without sub-categories or groups.
5.5 Data preparation
Before proceeding with the clustering algorithm, it is necessary to prepare or transform the data. The feature vector contains attributes related to the distribution of pixels in the image, and triads of features measuring the difference with the user-selection sub-images. To prepare the data it is necessary to take only the triad of features which compares the image to the pertinent user sub-image.
    Features = ( a_0, ..., a_5, a_6, a_7, a_8, ..., a_{n-2}, a_{n-1}, a_n )   (3)

where a_0, ..., a_5 are the pixel-distribution features, (a_6, a_7, a_8) form Triad 1, and (a_{n-2}, a_{n-1}, a_n) form Triad m.
Next, it is necessary to remove the features which do not bring information to the clustering approach. In the previous chapter, it is explained that the extracted features have to be concise and meaningful. But it is possible that different features provide the same information, or no information at all, for differentiating one image from the others. For that reason, the cross-correlation, variation, and standard deviation of the features are calculated beforehand. A high-dimensional data matrix may cause lower accuracy; in such matrices the distances between the closest and farthest neighbours may be similar. Therefore, the features with little variation (no information) or high cross-correlation (same information) are discarded.
Once the number of features is reduced, in order to avoid predominant attributes, the next step to prepare the data is to standardize the values. This way all the measures are on the same scale. The data is transformed according to formula 4, creating a new scale for all the attributes where the results lie between -1 and 1:
    a'_{m,n} = a_{m,n} / max_m |a_{m,n}|   (4)
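Formula 4 applied column by column can be sketched as follows. The `or 1` guard against an all-zero column is an added assumption, not part of the thesis formula, and the function name is illustrative.

```python
def scale_features(matrix):
    """Scale each feature (column) of the features matrix into
    [-1, 1] by dividing every value by the largest absolute value
    in its column, as in formula 4."""
    cols = len(matrix[0])
    # per-column maxima; 'or 1' avoids dividing by zero when a
    # column is entirely zero (an assumption beyond the formula)
    maxima = [max(abs(row[j]) for row in matrix) or 1
              for j in range(cols)]
    return [[row[j] / maxima[j] for j in range(cols)]
            for row in matrix]
```

For instance, the matrix [[2, -10], [1, 5]] is scaled to [[1.0, -1.0], [0.5, 0.5]], so no single attribute dominates the distance computation.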
These new feature values, with a suitable number of dimensions and the same scale, create a new matrix, which is used to represent the data in the following steps.
5.6 Data representation
The data is represented using a proximity matrix. This matrix contains the distances between the values of the sample image and the others, measuring the difference of all the transformed feature values.
The distances, or proximity indices, are calculated according to the formula 5 for the Euclidean distance, which is the most common way to measure a distance between two points:
    p_{m,n} = sqrt( sum_{k=1}^{d} (a_{m,k} - a_{n,k})^2 )   (5)
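The Euclidean proximity indices of formula 5 can be sketched as follows, using the scaled feature values; the helper names are illustrative, not the thesis implementation.

```python
from math import sqrt

def euclidean(u, v):
    """Euclidean distance between two scaled feature vectors of
    equal dimension d, as in formula 5."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def proximity_to_sample(sample, vectors):
    """One row of the proximity matrix: the distance from the
    user-sample feature vector to every extracted sub-image
    vector."""
    return [euclidean(sample, v) for v in vectors]
```

Sub-images whose proximity index falls within the border of the main cluster, centred on the user sample, are reported as matches.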