Autonomous testing of web forms


KEVIN YERAMIAN

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master in Computer Science
Date: December 13, 2019
Supervisor: Johan Boye
Examiner: Viggo Kann


Abstract


Sammanfattning


1 Introduction
1.1 Research Questions
1.2 Approach
1.3 Limitations of the project
1.4 Sustainability and ethics
1.5 Societal aspects

2 Background and theory
2.1 Structure of a web page
2.1.1 HTML & CSS
2.1.2 JavaScript
2.1.3 Webform
2.1.4 Requests & Communication
2.2 Keyword search methods
2.2.1 Extension
2.2.2 Madumbo’s Method
2.3 Scraping and Crawler
2.4 Deep / Machine Learning
2.4.1 Natural Language Processing
2.4.2 Word embedding and Vectorization
2.4.3 Term Frequency-Inverse Document Frequency
2.4.4 Hyperparameter Optimization
2.4.5 Overfitting and Underfitting
2.5 Pre-processing methods
2.5.1 Scaling and Normalization methods
2.5.2 Features reduction
2.5.3 Data Cleaning
2.6 Classification methods
2.6.1 One-vs-One and One-vs-Rest
2.6.2 Bootstrap aggregating
2.6.3 KNN
2.6.4 Decision Tree and Random Forest
2.6.5 Logistic Regression
2.6.6 Support Vector Machine
2.6.7 Multi-layer Perceptron
2.7 Evaluation methods
2.7.1 F1
2.7.2 Cross-Validation

3 Method
3.1 Gathering data from the web
3.1.1 Initialization of the crawler
3.1.2 KeyWord Scoring System
3.1.3 Classification and submission detection of web forms
3.1.4 Design of the Crawler
3.1.5 Data Representation
3.2 Pre-processing
3.2.1 Data Selection
3.2.2 Data Cleaning
3.2.3 Vectorization
3.2.4 Data reduction
3.2.5 Classification of the data

4 Results
4.1 Machine Learning models
4.2 Dimensionality reduction
4.3 Crawler
4.4 Machine Learning model and KWSS comparison

5 Discussion and conclusion
5.1 Discussion
5.2 Future work
5.3 Conclusion

Bibliography

1 Introduction

The digitalization of society has led to an explosion in the number of web pages [1]. Millions of users use web pages simultaneously, and a simple bug can have a dramatic impact. In order to foresee potential bugs, companies resort to hiring Quality Assurance (QA) teams.

These teams either perform manual checks or develop scripted tests that guarantee that the website is working as expected. In particular, they verify specific behaviors, such as a user buying a product, a sign-up or any other expected scenario. The company Madumbo has developed a computer program that mimics user interaction. It aims at reproducing user behavior in order to test a website, and the robot reports any bugs encountered during website visits. The main idea is that the states of a site are made accessible by a succession of interactions. An interaction can be a click, a movement of the mouse, a key press or a drag and drop. These mechanisms can fail due to bugs, or due to cases that have been overlooked in the implementation. Some website states are only available after a form has been filled. Access to these states is crucial in order to test the entire website. Therefore, any program that tests websites must include a functionality to fill web forms. A form consists of fields, some visible information such as the headline of each field, and some invisible information such as attributes of the form. In order to fill a form, it is necessary to (a) classify each input as belonging to a certain type, such as “e-mail address”, (b) generate a string that is suitable for each type, (c) fill all fields of the form according to their type, and (d) submit the form.

Madumbo is a company that works on the automation of web testing. They have already developed a method to fill web forms. On each input of a form, they look for specific relevant keywords in the HTML. This process is similar to the methods used by browser extensions that fill web forms. These methods are straightforward and do not use modern techniques. What is new is that we develop a method that automatically labels inputs and trains machine learning algorithms. Studies have been done on the classification of inputs of web forms using Machine Learning [2, 3]. We distinguish our work by developing a method that labels inputs automatically, which is a limitation of the previous papers, and by exploiting all the information contained in the input. Our approach also differs in that we classify each input independently, without any extra data or human intervention.

1.1 Research Questions

The problem of classifying inputs using Machine Learning and Deep Learning raises several questions. During this project, we focused on the following ones:

1. How can an autonomous robot fill and pass a web form?

2. How can we use Natural Language Processing algorithms to fill web forms?

3. What are the pros and cons of using NLP algorithms to classify short texts?

1.2 Approach

A website and the different views that it displays can be considered as a succession of states. It is therefore sufficient to accurately classify the inputs of a form in order to fill it and submit it correctly. Several authors [4, 5] have already studied the classification of short texts. Their work focuses on text extracts from books or structured datasets, which is a slightly different situation from the one faced here.

relevant for us. Moreover these hidden attributes are prone to spelling mistakes.

Our approach follows two steps. First, we create a dataset of labeled web forms; then, we use Machine Learning techniques to learn how to classify inputs. In order to create that dataset, we have developed a crawler that can navigate through websites. On each form encountered, we look for keywords that let us classify the inputs, and we submit the completed form. We have designed a method that can detect with high accuracy whether or not a form has been correctly submitted. If a form is successfully submitted, we assign the prediction to the input as its label. The second step consists of training Machine Learning algorithms and comparing their results against the current keyword method.

1.3 Limitations of the project

This section describes the scope of the project and the practical goals set for our models. The project is limited to English forms. We classify fields as belonging to one of 13 types: “city”, “company”, “day”, “email”, “first name”, “last name”, “month”, “number”, “password”, “telephone”, “user name” and “zip code”. The types are limited to the most frequent types of inputs used for text boxes [6]. This is a relatively serious restriction; the choice was made in order to establish a proof of concept before eventually increasing the number of types. A wide variety of web forms can be handled with these 13 classes. We could add the class “street” to the list. However, we estimated that this would add unnecessary complexity to our proof of concept, because a street name can be split into “street name”, “street type” and “street number”.

1.4 Sustainability and ethics

This thesis is not focused on ethical issues, but we respect the privacy of the crawled websites by following the robots.txt standard [7]. This standard defines rules that robots (a crawler in our case) have to respect when visiting, scraping or interacting with a website.

1.5 Societal aspects

2 Background and theory

2.1 Structure of a web page

A successful web page is loaded quickly by a web browser and offers a great user experience. While a human can easily navigate through a web page, the code behind it can be tricky to understand. This explains why companies have teams of engineers to develop and maintain websites. This subsection focuses on the mechanisms behind web pages and highlights the tradeoff between standardization and flexibility of web technologies.

2.1.1 HTML & CSS

The Hypertext Markup Language (HTML) is used to create web pages [8]. An element of a page is defined with markup tags. For example, the tag <p> is used to create a paragraph. This HTML structure is often described as the skeleton of the web page. Elements of a page can be nested (a block can contain several other blocks, each containing several rows, which in turn contain text, images...). We talk about parent and child elements. An element has properties such as background color, size, position, or other properties applying to the text it may contain. This nested structure involves an inheritance of properties: all the elements contained in another element inherit the properties of that element and combine them with the properties defined at their own level.

Cascading Style Sheets (CSS) is a language that describes the appearance of the elements of a page. CSS allows developers to set the properties of several elements at the same time by means of selectors. CSS can be included in the HTML to change the properties of the HTML elements. CSS is often described as the “skin” covering the HTML skeleton. JavaScript functions, written in the language presented in the next section, are the muscles that allow the skeleton and the skin to be animated.

2.1.2 JavaScript

JavaScript (JS) is a language used for web page interaction. It is understood and executed by every major browser (Google Chrome, Firefox, Internet Explorer, Edge, Safari, Opera...). JS is a dynamically typed language: a variable cannot be declared with a type, and types are resolved during execution. Moreover, JS makes it possible to trigger functions when a user interacts with an element of the page, for example when a user clicks on a button. JS allows developers to build complex systems via the mechanism of attaching, firing and handling events.

All these mechanisms make JS a very flexible and easy-to-use language. It is also possible to use JS outside a browser, which is the case in this project.

2.1.3 Webform

The web form is a basic element of HTML, declared with the tag “<form>”. A web form can be used in many cases: search bar, sign-up page, sign-in page, contact form, poll... A form is composed of inputs, which can have different types [6]: text boxes (infinite domain fields: any text can be entered) and selection boxes (finite domain fields: e.g. choose between “foo” and “bar”). The submission mechanism of a web form usually relies on:

• a URL that defines where the data are sent.

• a function that checks whether the values are correct, in addition to the checks done by the input attributes themselves.
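To make the structure of a form concrete, the following minimal sketch collects the attributes of the inputs of a form with the `html.parser` module of the Python standard library. The sample form, its attribute values and the class name `FormInputCollector` are hypothetical and only serve as an illustration; this is not the extraction code used in the project.

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect the attributes of every <input> found inside a <form>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.action = None  # URL where the data are sent
        self.inputs = []    # one attribute dictionary per <input>

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
            self.action = dict(attrs).get("action")
        elif tag == "input" and self.in_form:
            self.inputs.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Hypothetical sign-up form, used only for illustration.
SAMPLE = """
<form action="/signup" method="post">
  <label for="mail">E-mail</label>
  <input id="mail" name="user_email" type="email" required>
  <input name="pwd" type="password">
</form>
"""

collector = FormInputCollector()
collector.feed(SAMPLE)
print(collector.action)
print([attributes.get("type") for attributes in collector.inputs])
```

The attribute dictionaries gathered this way (here `name`, `id` and `type`) are the kind of non-visual information that a classifier of inputs can exploit.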


2.1.4 Requests & Communication

The exchange between a client and a server is made possible by a communication protocol. In most cases, HTTP (Hypertext Transfer Protocol) or HTTPS (Hypertext Transfer Protocol Secure) is used. POST and GET are two methods of the HTTP protocol used to send data. The server issues a status code in response to the client’s request. The standard distinguishes 5 classes of status codes [9]:

• 1xx (Informational): The request was received, continuing process.

• 2xx (Successful): The request was successfully received, understood, and accepted.

• 3xx (Redirection): Further action needs to be taken in order to complete the request.

• 4xx (Client Error): The request contains bad syntax or cannot be fulfilled.

• 5xx (Server Error): The server failed to fulfill an apparently valid request.

We limit our strategy to websites which respect these rules. A more in-depth discussion about the strategy adopted is in section 3.
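As an illustration of the five classes listed above, a status code can be mapped to its class from its first digit. This is a minimal sketch; the function name `status_class` is an assumption made for the example.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Return the class of an HTTP status code from its first digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]


for code in (200, 302, 404, 500):
    print(code, status_class(code))
```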

2.2 Keyword search methods

There are numerous browser extensions and open source projects which offer solutions to fill web forms, and some of them are able to do this autonomously. Such extensions use simple methods to identify input fields. This part compares the company's in-house method with the methods of available extensions, and then explains the enhancements made in this project compared to Madumbo’s method.

2.2.1 Extension


GitHub repositories [10, 11, 12]. We saw that these extensions analyze input attributes and look for specific keywords. This method is under-exploited because the list of keywords is too short and too few attributes are used. The following example shows that the “First Name” field is not recognized by a web extension even though it is a simple case.

Figure 2.1: Web form from https://login.yahoo.com/account/create filled with the “Form Filler” extension

so “first name” is not recognized. The same errors occurred for most of the fields.

2.2.2 Madumbo’s Method

Madumbo’s method checks the attributes of the input against different keywords and abbreviations. The method can link an input field to a visible text using the attribute “for” of the HTML element “label”. It also uses the web convention that an input has an optional attribute “type” that should describe the input (text, date, email, file, image, month, password, etc.). In most cases, the attribute “type” is not useful (e.g. “text”) or is missing, so it is necessary to look at other attributes. This method immediately assigns a label when a keyword matches the attribute “type”. It works on simple cases, when the text contains only one type of keyword. However, it fails to assign a label to a text like “Name of the company” or “Name of the company’s CEO”: both texts match “company” and “name”, with different meanings. Another limitation is that keywords are sometimes case-, space-, hyphen- or underscore-sensitive. The various forms and spellings of words (plural, UK vs. US English) can also mislead the keyword search. An exhaustive list of keywords does not solve the problem because the computing time would be too long.

2.3 Scraping and Crawler

Scraping groups together the methods that extract text or, more generally, data from websites. There are multiple methods to scrape websites and select specific attributes or elements.

The robustness is the quality of the extracted data. The scalability is the extraction speed.


2.4 Deep / Machine Learning

Machine Learning (ML) and Deep Learning (DL) have progressively outperformed classical methods [13] in many areas such as classification or prediction. In our case, we want to classify the input fields of a web form. The final goal is to fill the form in order to explore all the pages of a web product. This problem is linked to Natural Language Processing, which gathers all the methods that process language.

2.4.1 Natural Language Processing

NLP is a subfield of computer science, information engineering and artificial intelligence concerned with the interactions between computers and human (natural) languages. It focuses on how a computer can process and analyze human language and text. Modern NLP techniques are based on ML and DL [14, 15].

2.4.2 Word embedding and Vectorization

ML and DL models are based on mathematics and optimization in vector spaces. Representing words as vectors is therefore a crucial task in NLP, and many methods have been developed to solve this problem. Recent methods use ML and DL thanks to the amount of data available. Word embedding is a set of methods which focus on mapping a word to a vector. One of the popular methods is Word2Vec [14], which transforms a word into a vector and preserves semantic and syntactic relationships. Words like “dog” and “cat” should have vectors which are close to each other. The vector representation also preserves conceptual proximity: the vector of “king” minus the vector of “man” plus the vector of “woman” is close to the vector of “queen”. One of the main limitations of such a representation is that a word can have different meanings depending on the context of the sentence, whereas Word2Vec only assigns a single representation to a word.

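2.4.3 Term Frequency-Inverse Document Frequency

Term Frequency-Inverse Document Frequency (TF-IDF) is a method used to transform a word or a text into a vector. The formula is split into two parts, Term Frequency and Inverse Document Frequency. Term Frequency (TF) represents the frequency of the term w in a specific document i.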

\[ TF(w) = \frac{\text{Number of occurrences of term } w \text{ in } i}{\text{Total number of words in } i} \tag{2.1} \]

Inverse Document Frequency (IDF) represents the information provided by the word w.

\[ IDF(w) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents that contain } w}\right) \tag{2.2} \]
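Equations (2.1) and (2.2) can be sketched in a few lines of Python. The sketch assumes that documents are already tokenized into lists of words, that the queried word appears in at least one document, and that the TF-IDF weight is the product of the two terms (the usual definition). The toy corpus `DOCS` is hypothetical.

```python
import math


def term_frequency(word, document):
    """TF (2.1): occurrences of `word` divided by the number of words."""
    return document.count(word) / len(document)


def inverse_document_frequency(word, documents):
    """IDF (2.2): log of (number of documents / documents containing `word`)."""
    containing = sum(1 for document in documents if word in document)
    return math.log(len(documents) / containing)


def tf_idf(word, document, documents):
    """TF-IDF weight of `word` in `document`: product of TF and IDF."""
    return term_frequency(word, document) * inverse_document_frequency(word, documents)


# Hypothetical toy corpus of tokenized input descriptions.
DOCS = [
    ["first", "name"],
    ["last", "name"],
    ["email", "address"],
]

# "name" appears in two documents out of three, "first" in only one,
# so "first" receives the higher weight in the first document.
print(tf_idf("name", DOCS[0], DOCS))
print(tf_idf("first", DOCS[0], DOCS))
```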

2.4.4 Hyperparameter Optimization

The parameterization of an ML algorithm can have a significant impact on the result [17]. ML algorithms are sensitive to changes of hyperparameters, and finding the best combination can be hard and time-consuming due to the number of possible combinations.

The best-known optimization method is Grid Search, which is simply an exhaustive search through a subset of the hyperparameter space. A set of candidate values is manually specified for each hyperparameter, and the Cartesian product produces all the possible combinations. A model is trained and evaluated for each of them. Grid Search is easily parallelizable.
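The Cartesian product at the heart of Grid Search can be sketched with `itertools.product`. In this minimal sketch, the hyperparameter names `c` and `depth` and the scoring function `toy_score` are hypothetical stand-ins for "train a model and evaluate it".

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination of `param_grid` and return the best one.

    `param_grid` maps a hyperparameter name to the list of values to try.
    `evaluate` takes a dictionary of hyperparameters and returns a score
    to maximize.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical score, maximal for c = 1.0 and depth = 3."""
    return -(params["c"] - 1.0) ** 2 - (params["depth"] - 3) ** 2


GRID = {"c": [0.1, 1.0, 10.0], "depth": [1, 3, 5]}
best, score = grid_search(GRID, toy_score)
print(best)
```

Each combination is evaluated independently of the others, which is why the loop can be distributed over several workers.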

2.4.5 Overfitting and Underfitting

Overfitting occurs when a model fits the training data too tightly [18].

Underfitting occurs when a model is not able to capture patterns in the data. The sign of underfitting is a low variance but a high bias. The model is too simple to capture relevant relations between features and target outputs.

Both concepts lead to a poor result on the test set. The test set helps to detect overfitting and underfitting.

2.5 Pre-processing methods

This section enumerates and explains the pre-processing methods used in the project. Most data science work consists of cleaning and organizing the data [19]. This step is crucial for any ML or DL project.

2.5.1 Scaling and Normalization methods

Scaling and Normalization are important pre-processing steps for numeric values [20]. They are of great help when two numerical features have very different scales, for instance the price of a house and its area. Such differences can badly affect ML models which are based on a distance or a similarity. Most of the time, for such models, features with high magnitudes have a more significant impact, which can lead to poor results.

Standardization uses the mean and the standard deviation of the data to compute a new distribution of the values with a mean of 0 and a standard deviation of 1.

Mean normalization uses the mean, the maximal value and the minimal value to compute a new distribution of the values between -1 and 1.

Min-Max Scaling uses the maximal value and the minimal value to compute a new distribution of the values between 0 and 1.

Unit vector methods use the norm to compute a new vector with a norm of 1.
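The four methods can be sketched with the Python standard library. This is a minimal sketch under assumptions: the standardization uses the population standard deviation, the unit vector uses the Euclidean norm, and the inputs are non-constant lists of numbers (otherwise a division by zero occurs).

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Rescale to a mean of 0 and a (population) standard deviation of 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def mean_normalize(values):
    """Center on the mean and divide by the range: results lie in [-1, 1]."""
    mu, lowest, highest = mean(values), min(values), max(values)
    return [(v - mu) / (highest - lowest) for v in values]


def min_max_scale(values):
    """Map the minimum to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def unit_vector(values):
    """Divide by the Euclidean norm so that the result has a norm of 1."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


print(min_max_scale([1, 2, 3]))   # [0.0, 0.5, 1.0]
print(mean_normalize([1, 2, 3]))  # [-0.5, 0.0, 0.5]
print(unit_vector([3, 4]))        # [0.6, 0.8]
```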


2.5.2 Features reduction

The amount of exploitable data is growing exponentially, but processing high-dimensional data is costly. Dimensionality reduction mitigates the curse of dimensionality [21].

The high dimension of the data can also lead to overfitting when there are few samples. In that case, Machine Learning methods will find a pattern that seems to help to classify, but which is too specific to the training dataset. Moreover, visualizing and interpreting a high-dimensional dataset can be an issue. There are two different types of feature reduction: feature selection and feature projection. Feature selection reduces the dimension by selecting a subset of features, while feature projection changes the spatial representation of the data. In that case, a projection onto a lower-dimensional space is computed based on linear or non-linear transformations, and the new features in this new space are then used.

Singular-value Decomposition (SVD) is a feature projection method [22]. The SVD computes the factorization of the data matrix $M = U \Sigma V^{\top}$. $U$ and $V$ contain vectors that form orthonormal bases. $\Sigma$ is a diagonal matrix in which each coefficient represents the dilation of the space along one direction. A coefficient with a high magnitude is relevant, so the reduction removes the coefficients with a low magnitude.

2.5.3 Data Cleaning

Data Cleaning is an essential part of the pre-processing step [23]. Many samples in a dataset can be incomplete or incoherent. A number of techniques exist to handle this problem; a statistical method, for example, can help to predict missing data points. In NLP, data cleaning includes methods like spelling correction and stop word removal.

Lowercase is the process of converting all the letters to lower case. This method is useful for case-sensitive algorithms such as word frequency counts.

Removing Punctuation is necessary because punctuation does not add information that helps to classify the text. It also reduces the size of the text for each instance.

Removing Stop Words. A stop word is a common word used in a text, like “the” or “of”. It is not relevant for classifying text, and it adds noise for frequency-based algorithms. Removing stop words reduces the size of a text.

Common Words are not necessary to classify text in this project. Words like “input”, “element” and “field” are irrelevant because they do not provide any information on the type of text. All our data are inputs, so the word “input” does not give more information.

Spelling Correction helps to avoid errors due to developers' misspellings.

Stemming is the process of removing suffixes like “ing” or “s”. This method is useful for frequency-based algorithms such as word frequency counts, because the inflected forms of a word are counted together.

Lemmatization is the process of transforming a word into its lemma (its dictionary form). Determining the lemma helps the learning process by reducing the number of distinct words present in the text. For example, “studies” becomes “study”.

Accented letters are present in the data, and an ASCII representation does not handle accents. Removing accents is therefore essential. Words like “cliché” become “cliche”.

Removing spaces and HTML special characters. Extra spaces and HTML special characters like ", ’, < or > do not give more information.
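Several of the cleaning steps above can be chained in a short function using only the Python standard library. This is a minimal sketch, not the pipeline of the project: the stop word list is a tiny illustrative sample, and spelling correction, stemming and lemmatization are left out because they require external resources.

```python
import re
import unicodedata

STOP_WORDS = {"the", "of", "a", "an"}         # tiny illustrative list
COMMON_WORDS = {"input", "element", "field"}  # common words named in the text


def clean_text(text):
    """Lowercase, strip accents, remove punctuation, stop and common words."""
    # Lowercase so that "Name" and "name" are counted together.
    text = text.lower()
    # Decompose accented letters and keep the ASCII base letter.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace punctuation and special characters (including "_") by spaces.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Drop stop words and common words; split() also collapses extra spaces.
    words = [w for w in text.split() if w not in STOP_WORDS | COMMON_WORDS]
    return " ".join(words)


print(clean_text("The  Name_of the <input> Field: Cliché!"))  # name cliche
```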

2.6 Classification methods

supervised version of the algorithm.

2.6.1 One-vs-One and One-vs-Rest

One-vs-One (OvO) [24] is a strategy to reduce a multiclass problem to multiple binary classification problems. The approach of OvO is to use K(K-1)/2 binary classifiers, where K is the number of classes. A binary classifier is trained for each pair of classes. In the end, we have K(K-1)/2 predictions (for three classes A, B and C: A against B, A against C and B against C). The class that gets the most predictions (votes) is the prediction for the multiclass problem.

One-vs-Rest (OvR) [25] is a strategy to reduce a multiclass problem to multiple binary classification problems. The approach of OvR is to use K binary classifiers, where K is the number of classes. Each classifier represents one class: it is trained with that class labeled as “positive” and the rest of the classes labeled as “negative”. OvR requires that the binary classifiers output a value which represents the certainty of the prediction. Instead of voting, OvR predicts the label by selecting the classifier which is the most confident in its prediction.
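The two decision rules can be sketched without training any classifier. In this minimal sketch, the outputs of the binary classifiers (the pairwise winners for OvO, the confidence values for OvR) are hypothetical values given by hand.

```python
from collections import Counter
from itertools import combinations


def ovo_pairs(classes):
    """All K(K-1)/2 pairs of classes: one binary classifier per pair."""
    return list(combinations(classes, 2))


def ovo_predict(pair_winners):
    """Majority vote over the winners returned by the pairwise classifiers."""
    return Counter(pair_winners).most_common(1)[0][0]


def ovr_predict(confidences):
    """Select the class whose one-vs-rest classifier is the most confident."""
    return max(confidences, key=confidences.get)


CLASSES = ["A", "B", "C"]
print(ovo_pairs(CLASSES))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]

# Hypothetical outputs: A wins against B, C wins against A and against B.
print(ovo_predict(["A", "C", "C"]))                 # C
# Hypothetical confidence of each one-vs-rest classifier.
print(ovr_predict({"A": 0.2, "B": 0.7, "C": 0.1}))  # B
```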

2.6.2 Bootstrap aggregating

Bootstrap aggregating, or Bagging, is a machine learning technique [26] that improves the stability and the accuracy of machine learning algorithms. The process is designed to reduce the variance and avoid overfitting. The concept of bagging is to train an ML algorithm on many different subsets of the training set and to apply a voting system to predict the class.
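The two ingredients of bagging, resampling and voting, can be sketched as follows. The sketch assumes that the subsets are bootstrap samples drawn with replacement; the base learner `train_majority_model` is a deliberately trivial, hypothetical stand-in for a real ML algorithm, and the dataset `DATA` is a toy example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from `data` with replacement."""
    return [rng.choice(data) for _ in data]


def bagging_predict(models, sample):
    """Majority vote over the predictions of the individual models."""
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]


def train_majority_model(subset):
    """Hypothetical base learner: always predicts its most frequent label."""
    label = Counter(label for _, label in subset).most_common(1)[0][0]
    return lambda sample: label


rng = random.Random(0)
DATA = [(0, "email"), (1, "email"), (2, "email"), (3, "city")]
models = [train_majority_model(bootstrap_sample(DATA, rng)) for _ in range(25)]
print(bagging_predict(models, 5))
```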

2.6.3 KNN

K-nearest neighbors (KNN) [27] is an algorithm that can classify

with the most occurrences tend to bias it. One way to overcome this problem is to attach a weight to each vote of the K closest neighbors: the closer the neighbor is to the sample, the higher the weight.
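The weighted vote can be sketched as follows. The sketch assumes the Euclidean distance and a weight equal to the inverse of the distance, which is one common choice among others; the training points `TRAIN` are hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn(train, query, k):
    """Classify `query` by a distance-weighted vote of its k nearest neighbors.

    `train` is a list of (point, label) pairs, where points are tuples of
    numbers. Each neighbor votes with a weight of 1 / distance.
    """
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        # The small constant avoids a division by zero for identical points.
        votes[label] += 1.0 / (math.dist(point, query) + 1e-9)
    return max(votes, key=votes.get)


# Hypothetical training set in which class "B" is the most frequent.
TRAIN = [
    ((0.0, 0.0), "A"), ((0.1, 0.0), "A"),
    ((1.0, 1.0), "B"), ((1.1, 1.0), "B"), ((0.9, 1.0), "B"),
]

# With k = 5, an unweighted vote would return "B" (3 votes against 2),
# whereas the weighted vote favors the two much closer "A" points.
print(weighted_knn(TRAIN, (0.2, 0.1), k=5))  # A
```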

2.6.4 Decision Tree and Random Forest

Decision Tree (DT) [28] is an algorithm that classifies samples based on a succession of splits conditioned on the sample values. At each node of the tree, a condition on the sample values determines the next node. At the end of the tree, a leaf represents the predicted class. One class can be associated with several leaves. In order to learn which feature and condition constitute each node, the algorithm tests different features and optimizes a criterion. The most used criteria are the entropy and the Gini coefficient.

Random Forest (RF) is an algorithm based on bagging and decision trees [29]. The principle of bagging is applied to decision trees to reduce variance and overfitting. RF differs from plain bagging in that a random subset of features is also used on each subset of data. This process is called “feature bagging”.
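The two split criteria can be computed from the labels that reach a node. This minimal sketch assumes the Gini impurity, the form of the Gini criterion commonly used for decision trees, and the Shannon entropy in bits; a pure node scores 0 with both criteria.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each class among `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of the squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p)."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


print(gini(["a", "a", "b", "b"]))     # 0.5
print(entropy(["a", "a", "b", "b"]))  # 1.0
```

A split is good when the nodes it creates have a lower impurity than the node that is split.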

2.6.5 Logistic Regression

Logistic Regression (LR) [30] is an ML algorithm that classifies samples based on the following hypothesis:

\[ \log\left(\frac{p(X|1)}{p(X|0)}\right) = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_J x_J \tag{2.3} \]

where $x_i$ represents the i-th value of $X$ and $a_i$ a coefficient that needs to be estimated.

Equation (2.3) is equivalent to:

\[ \log\left(\frac{p(1|X)}{p(0|X)}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_J x_J \tag{2.4} \]

Equation (2.4) shows a regression problem, a relation of dependency between a variable and a set of explanatory variables.

Equation (2.4) can be transformed into:

\[ p(1|X) = \frac{e^{b_0 + b_1 x_1 + b_2 x_2 + \dots + b_J x_J}}{1 + e^{b_0 + b_1 x_1 + b_2 x_2 + \dots + b_J x_J}} \tag{2.5} \]

where

\[ b_0 = a_0 + \log\left(\frac{p(1)}{p(0)}\right), \qquad b_j = a_j, \; j \geq 1 \]

The $b_j$ coefficients are estimated using maximum likelihood estimation on the observed data. The explanation of maximum likelihood is out of the scope of this project.
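Once the coefficients are known, equations (2.4) and (2.5) give the probability of class 1 directly. In this minimal sketch, the coefficient values are hypothetical and the estimation step itself is not shown.

```python
import math


def log_odds(b, x):
    """Right-hand side of (2.4): b[0] + b[1] * x[0] + ... + b[J] * x[J - 1]."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))


def probability_of_class_1(b, x):
    """Equation (2.5): p(1|X) = e^z / (1 + e^z), where z is the log-odds."""
    z = log_odds(b, x)
    return math.exp(z) / (1.0 + math.exp(z))


# Hypothetical coefficients b0, b1, b2 and a sample with two values.
B = [0.5, 2.0, -1.0]
X = [1.0, 3.0]
p = probability_of_class_1(B, X)
print(p)
# Taking the logarithm of p / (1 - p) recovers the log-odds of (2.4).
print(math.log(p / (1.0 - p)), log_odds(B, X))
```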

2.6.6 Support Vector Machine

The Support Vector Machine (SVM) is commonly used to classify datasets with only two classes. The following explanations concern a linearly separable dataset with only two labels. SVM uses the one-vs-one scheme to classify datasets with more than two labels. The concept is to build a hyperplane defined by $w$ and $b$: if a point $x_i$ is “below” the hyperplane, the point is classified as class A. Otherwise, the point is classified as class B. The hypothesis is that the two classes can be encoded as -1 and 1. In order to build the hyperplane, the following function is minimized with respect to $w$, with a dataset of pairs $(x_i, y_i)$ that represent points and labels.

\[ \frac{1}{n} \sum_{i=0}^{n} \max\left(0, 1 - y_i (x_i w - b)\right) + \lVert w \rVert \tag{2.6} \]

$x_i$ is correctly classified if the product of $y_i$ with $x_i w - b$ is positive. Otherwise, the product is negative and the error increases. The isolated term is called the regularizer. It imposes a penalty on the complexity of the model and helps to prevent overfitting. In order to minimize the function, we use the subgradient method. This method is like gradient descent but handles non-differentiable objective functions.

SVM is also capable of performing a non-linear classification. The minimization problem (2.6) can be reworded as a constrained optimization problem:

\[ \text{maximize } f(c_1 \dots c_n) = \sum_{i=0}^{n} c_i - \frac{1}{2} \sum_{i=0}^{n} \sum_{j=0}^{n} y_i c_i (x_i \cdot x_j) y_j c_j \tag{2.7} \]

\[ \text{subject to } \sum_{i=0}^{n} c_i y_i = 0 \text{ and } 0 \leq c_i \leq \frac{1}{2n}, \qquad w = \sum_{i=0}^{n} c_i y_i x_i \]
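The objective (2.6) can be evaluated for a given hyperplane with a few lines of Python. This is a minimal sketch under assumptions: the decision value of a point is taken as $x_i w - b$, the regularizer is the Euclidean norm of $w$, and the one-dimensional dataset is hypothetical. The minimization itself (subgradient method) is not shown.

```python
def svm_objective(w, b, points, labels):
    """Mean hinge loss over the dataset plus the norm of w (regularizer)."""
    hinge_losses = [
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) - b))
        for x, y in zip(points, labels)
    ]
    norm_w = sum(wi * wi for wi in w) ** 0.5
    return sum(hinge_losses) / len(hinge_losses) + norm_w


# Hypothetical one-dimensional dataset with labels encoded as -1 and 1.
POINTS = [(2.0,), (-2.0,)]
LABELS = [1, -1]

# Both points are on the correct side with a margin larger than 1:
# the hinge loss is 0 and only the regularizer remains.
print(svm_objective([1.0], 0.0, POINTS, LABELS))   # 1.0
# With the opposite orientation, both points are misclassified.
print(svm_objective([-1.0], 0.0, POINTS, LABELS))  # 4.0
```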

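two transformed points. The following equation shows the idea behind the kernel trick.

\[ \text{maximize } f(c_1 \dots c_n) = \sum_{i=0}^{n} c_i - \frac{1}{2} \sum_{i=0}^{n} \sum_{j=0}^{n} y_i c_i (\varphi(x_i) \cdot \varphi(x_j)) y_j c_j = \sum_{i=0}^{n} c_i - \frac{1}{2} \sum_{i=0}^{n} \sum_{j=0}^{n} y_i c_i \, k(x_i, x_j) \, y_j c_j \tag{2.8} \]

\[ \text{subject to } \sum_{i=0}^{n} c_i y_i = 0 \text{ and } 0 \leq c_i \leq \frac{1}{2n}, \qquad w = \sum_{i=0}^{n} c_i y_i \varphi(x_i) \]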

2.6.7 Multi-layer Perceptron

The Multi-layer Perceptron (MLP) is an ML algorithm [backpropagationAndMLP] based on artificial neural networks (ANN). An MLP has at least three layers: an input layer, a hidden layer and an output layer. Each layer is composed of nodes, connected to the next layer.

The node $j$ receives weighted information $x_i w_{ij}$ from the nodes $i$ of the previous layer and applies a nonlinear function $\varphi$, which is called the activation function.

Figure 2.4: Node of a Multi-layer Perceptron from www.kdnuggets.com

The transfer function in the schema is usually a sum, and the threshold function is used after the activation function to avoid the explosion of the value. Below are examples of activation functions:

\[ f(x) = \frac{1}{1 + e^{-x}} \tag{2.9} \]

\[ f(x) = \tan^{-1}(x) \tag{2.10} \]

The weights $w_{ij}$ can be initialized with a Gaussian or uniform distribution. The weights are updated with the backpropagation method.
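The computation performed by a single node can be sketched as follows: a sum of the weighted inputs followed by an activation function. The input and weight values are hypothetical, and the optional threshold function and the backpropagation update are not shown.

```python
import math


def sigmoid(x):
    """Activation function (2.9)."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs, weights, activation=sigmoid):
    """Sum the weighted inputs x_i * w_ij, then apply the activation."""
    return activation(sum(x * w for x, w in zip(inputs, weights)))


# Hypothetical inputs and weights: the weighted sum is 0.5 - 0.5 = 0.
print(node_output([1.0, 2.0], [0.5, -0.25]))  # 0.5
# The arctangent of (2.10) can be used as the activation instead.
print(node_output([1.0], [1.0], activation=math.atan))
```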

2.7 Evaluation methods

2.7.1 F1

Figure 2.5: Confusion matrix

The F1-score is based on precision and recall. The next equations show the formulas for precision and recall:

\[ Precision = \frac{TP}{TP + FP} \tag{2.11} \]

\[ Recall = \frac{TP}{TP + FN} \tag{2.12} \]

The F1-score is the harmonic mean of the precision and the recall. In order to compute it, we need to calculate the precision and the recall. The following equation represents the F1-score:

\[ F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \tag{2.13} \]
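Equations (2.11) to (2.13) translate directly into code. In this minimal sketch, the counts of true positives (TP), false positives (FP) and false negatives (FN) are hypothetical values, and the functions assume that the denominators are not zero.

```python
def precision(tp, fp):
    """Equation (2.11)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Equation (2.12)."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Equation (2.13): harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
print(precision(8, 2))    # 0.8
print(recall(8, 8))       # 0.5
print(f1_score(8, 2, 8))
```

With these counts the precision is high but the recall is low, and the F1-score (8/13, about 0.62) lies closer to the lower of the two values, as expected from a harmonic mean.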

2.7.2 Cross-Validation

Cross-validation (CV) is a model validation technique [33]. It uses to

the number of iterations. At each iteration, a different part is picked for the validation set and the rest is used for the training set. The following example summarizes the CV used in the project:
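Independently of the configuration used in the project, the rotation of the validation part can be sketched with a generic k-fold split. The function name `k_fold_indices` and the values of `n_samples` and `k` are assumptions made for this illustration, and no shuffling is applied.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k iterations."""
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(10, 3):
    print("train:", train, "validation:", validation)
```

Every sample is used exactly once for validation and k - 1 times for training.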

3 Method

The purpose of this project is to develop a method that fills web forms using NLP methods. The current method looks for keywords in the HTML. Our approach is to train ML algorithms in order to classify the inputs of a form. Supervised methods are trained on a labeled dataset. The code is written in Python and in TypeScript, a language close to JS but with typed variables, which is compiled to JS. The first part focuses on the data extraction and the process that assigns labels to the data. The second part describes the pre-processing applied to the data and diverse vectorization techniques. The third part summarizes the machine learning algorithms and validation procedures used.

3.1 Gathering data from the web

This part focuses on the extraction of data from web forms. There is no database of HTML web forms with labeled inputs. Additionally, we want to use the non-visual attributes of inputs. Therefore, we need to create our own database. To do so, we have implemented a crawler which scrapes data from websites.

3.1.1 Initialization of the crawler

We decided to implement a simple version of the crawler without a distributed system [34]. Moreover, the crawler does not follow the links found in websites. We only look for web forms on each website from a list of more than 5 million websites. Our computing power is limited to one computer.

3.1.2 KeyWord Scoring System

The company model directly assigns a label when a keyword matches. We identify two problems: the attributes can contain keywords of more than one type, and keyword matching is too sensitive to case and separators. Our approach is to use regular expressions (regexes) instead of keywords, together with a scoring system. A regex handles separators, numbers or extra letters better: the regex “user.?id” matches “userid”, “usersid”, “user-id”, “user id”, “user_id”, “user1id”, “user2id”, etc. A scoring system manages multiple types of keywords better: instead of directly selecting a label, we keep the number of matches. The scoring system allows us to use the same regex for several classes; for example, “firstname”, “lastname” and “fullname” all contain “name”. Furthermore, the distinction between visual and background (i.e. HTML) text is also taken into account: a match in visual text is more relevant than a match in background text, so we give extra points for visual text. After looking for all regexes, we sum the scores and assign the label of the class with the highest score. If none of them match, we assign the “null” class. The following pseudo-code gives details on the scoring method:

Algorithm 1 KeyWord Scoring System (KWSS)

Input: HTML input object, regex list, score visual, score non visual
Output: Label

score ← initialization to 0
visualAttributes ← getVisualAttributes(input)
nonVisualAttributes ← getNonVisualAttributes(input)
for label ∈ labels do
    for regex ∈ regexList[label] do
        numberVisualMatches ← match(regex, visualAttributes)
        numberNonVisualMatches ← match(regex, nonVisualAttributes)
        score[label] ← score[label] + numberVisualMatches · scoreVisual + numberNonVisualMatches · scoreNonVisual
return arg max over label ∈ labels of score[label]


is useful to classify “first name”, “last name”, “full name” and “username”, since all of these classes contain the word “name”. The tests helped us find the score to assign when a word matches: we assign a score of five when a regex matches visual text and one otherwise.
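The scoring method of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the regex dictionary is a small hypothetical subset of the list given in the appendix, the function names `count_matches` and `kwss` are assumptions, and ties between classes are resolved arbitrarily (first class wins).

```python
import re

# Hypothetical subset of the regex list (the full list is in the appendix).
REGEXES = {
    "username": [r"pseudo", r"user.?id", r"user.?name"],
    "first name": [r"first.?name"],
    "last name": [r"lastname", r"famil", r"last.?name", r"surname", r"second.?name"],
    "city": [r"city", r"ville", r"village", r"town", r"municipality"],
}

SCORE_VISUAL = 5      # points per match in visual text (value given in the text)
SCORE_NON_VISUAL = 1  # points per match in non-visual (HTML) attributes


def count_matches(pattern, texts):
    """Count case-insensitive, non-overlapping matches of pattern in all texts."""
    return sum(len(re.findall(pattern, t, flags=re.IGNORECASE)) for t in texts)


def kwss(visual, non_visual, regexes=REGEXES):
    """Return the label with the highest score, or "null" if nothing matches."""
    scores = {}
    for label, patterns in regexes.items():
        scores[label] = sum(
            count_matches(p, visual) * SCORE_VISUAL
            + count_matches(p, non_visual) * SCORE_NON_VISUAL
            for p in patterns
        )
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "null"
```

For instance, `kwss(["Last name"], ["lname"])` only matches the regex `last.?name` in the visual text, so the input is labeled “last name”, while an input whose attributes match no regex receives the “null” class.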

3.1.3 Classification and submission detection of web forms

Our goal is to assign labels to inputs automatically. The keyword search method classifies inputs and fills text boxes with predefined texts. Our approach is to fill web forms using the predictions of the keyword search method. Then we submit the form and identify whether the inputs were correctly filled. We assign a label to each input if the form is accepted. Otherwise, we do not assign any label.

This method faces the problem of detecting a successful submission. The submission mechanism is complex. There are 3 steps of verification.

• Inputs are checked independently.

• The submission triggers a verification of all inputs with a specific function.

• The data are sent to the server for additional verification.

It is hard to know whether the last verification succeeded. We needed to identify patterns to know if the form was correctly filled. We suppose that a form is correctly filled if we reach the next page.


Figure 3.1: Analysis of requests during submissions

The analysis shows that a positive (200–399) response to the request of the last verification is necessary but not sufficient to be sure that a form was correctly filled. Moreover, the analysis highlights that sometimes the page does not change and only a local reload of the page is done. Search forms, for example, do not reload the whole page. Thus, a page change is not a reliable test. We manually submitted forms and analyzed the behavior of websites (Appendix A). Unfortunately, the analysis of the use of cookies is not conclusive: their use varies and no pattern was found. The conclusion is that it is not possible to detect a correct submission only by examining cookies and requests. Furthermore, we observed that the submission process is related to the type of form: the submission process of a sign-up form is different from that of a search form. Finally, we conclude that a single submission detection method can’t be applied to every form. Indeed, there are too many specific cases, and some developers do not follow standards.


of form. The behavior of the rest of the categories was analyzed. The conclusion of the investigation points to particular behaviors of forms. This is the list of behaviors encountered during the submission:

• After a correct submission of a “Sign-in” form, the form is not in the page anymore. A positive response status is received. An incorrect submission leads to staying on the same page.

• “Sign-up” forms are not present after a correct submission. An incorrect submission leads to staying on the same page. The page is not always reloaded after correct submission.

• “contact” forms are not present after a correct submission. An incorrect submission leads to staying on the same page and the form is still present.

• “search” forms are present after a correct submission.
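The behaviors listed above suggest a simple decision rule per form type. The following Python sketch is one possible reading of that list, not the heuristic actually implemented in the project: the function name, the form type labels and the three arguments are assumptions made for illustration.

```python
def is_submission_accepted(form_type, status_code, form_still_present):
    """Guess whether a submitted form was accepted.

    form_type: "sign-in", "sign-up", "contact" or "search" (hypothetical labels).
    status_code: HTTP status of the response to the submission request.
    form_still_present: True if the form is still found in the page afterwards.
    """
    # A positive response (200-399) is necessary but not sufficient.
    if not 200 <= status_code <= 399:
        return False
    if form_type == "search":
        # Search forms stay in the page after a correct submission,
        # so the presence of the form carries no information.
        return True
    if form_type in ("sign-in", "sign-up", "contact"):
        # These forms disappear from the page after a correct submission.
        return not form_still_present
    # Unknown form type: no reliable behavior was identified.
    return False
```

With this rule, a “sign-up” form that disappears after a 302 response is considered passed, whereas the same form still present after a 200 response is not.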


Figure 3.2: Detection of the validation of a sign-up form


In order to solve this problem, we implement a variant of the keyword search algorithm to detect the type of a form. We look for specific keywords in the HTML code of the form and apply a scoring system. The following pseudo-code summarizes the method that assigns labels:

Algorithm 2 Automatic Labeling

Input: HTML form object
Output: labeled data or null

formTypeGuess ← KeyWordsMethodForm(form)
inputTypeGuess ← empty array
for input ∈ form do
    inputTypeGuess.add(KeyWordsMethodInput(input))
isFormPassed ← heuristicFormSubmission(form)
if isFormPassed then
    return formTypeGuess, inputTypeGuess
else
    return null
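Algorithm 2 can be written compactly in Python. In this sketch the three callables stand in for KeyWordsMethodForm, KeyWordsMethodInput and heuristicFormSubmission, whose implementations are not given here, and the form is assumed to be a dictionary with an "inputs" list; both choices are illustrative assumptions.

```python
def automatic_labeling(form, classify_form, classify_input, submit_and_check):
    """Return (form_type_guess, input_type_guesses) if the form passes, else None.

    form: dictionary with an "inputs" list (hypothetical structure).
    classify_form, classify_input, submit_and_check: callables standing in for
    KeyWordsMethodForm, KeyWordsMethodInput and heuristicFormSubmission.
    """
    form_type_guess = classify_form(form)
    input_type_guesses = [classify_input(i) for i in form["inputs"]]
    # Labels are only kept when the submission heuristic accepts the form.
    if submit_and_check(form):
        return form_type_guess, input_type_guesses
    return None
```

Passing the classifiers and the submission check as arguments keeps the labeling logic independent from the browser automation, which makes it easy to exercise with stub functions.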

3.1.4 Design of the Crawler


3.1.5 Data Representation

This subsection explains which data was selected for each form. Crawling and scraping are costly. A common mistake is to select too little data and realize later that a piece of information is missing. The following list contains all the information we selected:

• The HTML of the page before the submission of the form.

• The HTML of the page after the submission of the form.

• The HTML of the form submitted.

• A list of information of each input of the form.

• The guess of the type of the form (sign-up, sign-in, contact us, search, others).

• A boolean that represents if the form was passed.

• The URL of the website.

• A number that identifies the form in the page.

• The HTML of the inputs.

• The label assigned by the scoring method.

• The text generated.
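The list above maps naturally onto a record per form with a nested record per input. The following Python sketch shows one possible layout; the class and field names are hypothetical and only mirror the items of the list.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class InputRecord:
    html: str            # HTML of the input
    label: str           # label assigned by the scoring method
    generated_text: str  # text generated to fill the input


@dataclass
class FormRecord:
    url: str               # URL of the website
    form_index: int        # number that identifies the form in the page
    page_html_before: str  # HTML of the page before the submission
    page_html_after: str   # HTML of the page after the submission
    form_html: str         # HTML of the submitted form
    form_type_guess: str   # sign-up, sign-in, contact us, search, others
    passed: bool           # whether the form was passed
    inputs: List[InputRecord] = field(default_factory=list)


record = FormRecord(
    url="https://example.com/signup",  # placeholder URL
    form_index=0,
    page_html_before="<html>...</html>",
    page_html_after="<html>...</html>",
    form_html="<form>...</form>",
    form_type_guess="sign-up",
    passed=True,
    inputs=[InputRecord('<input name="email">', "email", "john@example.com")],
)
row = asdict(record)  # nested dictionaries, ready to be serialized
```

Storing the raw HTML alongside the derived fields follows the remark above: information that was not saved during the crawl cannot be recovered without crawling again.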

3.2 Pre-processing


3.2.1 Data Selection

Generally, a corpus of documents is given to the pre-processing algorithm. In our case, a corpus of HTML inputs is given. The HTML is not directly usable because too many attributes are irrelevant. Our approach is to use only standardized attributes and to concatenate them. We chose the following hidden attributes:

• “id”

• “name”

• “className”

• “value”

• “type”

• “title”

We selected the visual attributes:

• “placeholder”: It represents the text inside a text box.

• “label”: The label is not a real attribute, it is an HTML object linked to the input.

Most of the hidden attributes are noisy: frameworks add IDs and irrelevant words.

3.2.2 Data Cleaning

Data cleaning aims to remove irrelevant information and helps the vectorization process. Indeed, vectorization techniques can be sensitive to the case or the ending of a word. Therefore, the following steps were applied using the Python package NLTK [35]:

1. Convert all letters to lower case because algorithms are sensitive to the case.

2. Remove markup contained in attributes because it is irrelevant.

3. Convert all characters with an accent. Many techniques do not

4. Convert the text from Unicode to ASCII in order to remove specific characters.

5. Remove all separators, such as commas.

6. Remove the irrelevant words. We identified as irrelevant “field”, “input”, “element”, “edit” and “form”.

7. Remove all numbers. Some frameworks generate IDs to match with CSS or to enumerate fields, like “name1” or “email2”.

8. Apply stemming and lemmatization to the text, to remove plurals and conjugations and to transform each word into a stem. Lemmatization transforms “studies” into “study”, whereas a stemming algorithm transforms “studies” into “studi”.

9. Remove stop words. Stop words are irrelevant for the classification.

10. Remove isolated letters. After the process, isolated letters can be present, for example because an ID contained both letters and numbers.


Figure 3.7 shows the number of inputs for each label after the pre-processing. The number of data points before the pre-processing is not relevant because many of them were empty after the pre-processing. The classes are unbalanced, which makes the learning process more difficult.

Figure 3.5: Example of a web input and the code associated from https://www.surveymonkey.com

The example in figure 3.5 shows an input from a web form and the associated code. The visual information "Password" corresponds to the attribute "placeholder". "type", "id" and "name" are hidden attributes which can be extracted. In this case, it is very simple to guess that the input needs to be filled with a password.

Therefore, after the pre-processing, the text "password password password password" is extracted to train ML algorithms. This string corresponds to "placeholder", "type", "id" and "name".
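The extraction of such a string can be sketched with the Python standard library alone. This is a simplified illustration: it covers lower-casing, markup removal, accent and non-ASCII removal, separators, numbers, irrelevant words and isolated letters, but it leaves out stemming, lemmatization and stop-word removal, which rely on NLTK in the project. The function names and the attribute values in the usage note are assumptions.

```python
import re
import unicodedata

IRRELEVANT = {"field", "input", "element", "edit", "form"}


def clean(text):
    """Apply a simplified version of the cleaning steps to one attribute value."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)            # remove markup
    text = unicodedata.normalize("NFKD", text)      # split accents from letters
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = re.sub(r"[^a-z]+", " ", text)            # drop separators and numbers
    # Drop irrelevant words and isolated letters.
    tokens = [t for t in text.split() if t not in IRRELEVANT and len(t) > 1]
    return " ".join(tokens)


def input_to_text(attributes, order=("placeholder", "type", "id", "name")):
    """Concatenate the cleaned values of the selected attributes."""
    return " ".join(filter(None, (clean(attributes.get(k, "")) for k in order)))
```

Assuming the four attributes of the example all hold the value "password" (up to case), `input_to_text` returns "password password password password"; a noisy value such as "Prénom_field1" is reduced to "prenom".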

3.2.3 Vectorization

Vectorization is the process that transforms a text into a vector. This step is necessary because ML algorithms manipulate vectors. The process needs to capture the meaning of a text and transform it into a vector. At this point, the text is a list of words separated by spaces.

The first algorithm used is the TF-IDF method. We apply TF-IDF to each token of a text. The text is represented as a sparse vector; it is sparse because TF-IDF uses all the tokens of the corpus. The dimension is the number of distinct tokens in the corpus, and the value is 0 when a token is not present in the text. We end up with 18943 dimensions. Decreasing the dimensionality is needed in order to reduce the computation time of the training part.
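To make the sparsity concrete, the following Python sketch computes TF-IDF vectors for a tiny corpus. It assumes one common variant of the weighting (term frequency divided by text length, multiplied by log(N / document frequency)); the exact variant used in the project is not specified here, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return the vocabulary and one dense TF-IDF vector per text."""
    docs = [text.split() for text in corpus]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of texts that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[tok] / len(doc)) * math.log(n_docs / df[tok]) if doc else 0.0
            for tok in vocab
        ])
    return vocab, vectors


vocab, vectors = tfidf(["password password", "email address", "email"])
```

Each vector has one entry per token of the whole corpus, so a short text such as "password password" is zero everywhere except at the position of "password". With the real corpus, the same construction yields the 18943 dimensions mentioned above.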


representation of a text is the sum of the vectors. The problem with this method is that each word needs to be present in the vocabulary of the model. In our case, we use hidden attributes that contain abbreviations. The pre-processing is not perfect, and some tokens are random letters. Word2Vec can’t transform the concatenation of two words like "wordvector" or "city-name".

The second algorithm is FastText [15]. Like Word2Vec, FastText transforms words independently. The main difference is that FastText is trained on character k-grams, so a word can’t be out of vocabulary. We transform a text by concatenating its tokens with underscores. FastText is able to transform words like “word_vect” or "city_name". Vectors generated by FastText have 300 dimensions.
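The reason why a k-gram model has no out-of-vocabulary words can be illustrated by listing the character k-grams of a token. The sketch below assumes the usual subword scheme, in which the word is wrapped in the boundary markers "<" and ">"; the k-gram lengths 3 to 6 and the function name are illustrative assumptions, not settings taken from the project.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character k-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# k-grams shared by an unseen concatenation and a known word
shared = set(char_ngrams("city_name")) & set(char_ngrams("city"))
```

A token such as "city_name" shares k-grams like "cit" and "ity" with "city", so a vector can still be built from known pieces even if the full token never appeared in the training data.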

3.2.4 Data reduction

Data reduction is the process that reduces dimensionality. This process is useful to save computing time. In our case, the vectorization with the TF-IDF method gives a dimensionality of 18943. In order to process the data, we need to reduce the dimension. Principal Component Analysis (PCA) is the most used method to reduce dimension. Unfortunately, it is not suited to sparse matrices because the centering (normalization) step of PCA makes the matrix dense. Instead of applying a PCA, we decided to use the Singular Value Decomposition (SVD). It keeps only the singular vectors associated with the largest singular values. We decided to reduce the dimensionality to 500, which is an arbitrary value.
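The reduction can be written down explicitly. The notation below is introduced here for illustration: X is the TF-IDF matrix with n texts and d = 18943 columns, k = 500 is the target dimension, and U_k, Σ_k, V_k contain the k largest singular values with their singular vectors.

```latex
\[
X \approx U_k \Sigma_k V_k^{\top},
\qquad X \in \mathbb{R}^{n \times d},\;
U_k \in \mathbb{R}^{n \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{d \times k},
\]
\[
Z = X V_k = U_k \Sigma_k \in \mathbb{R}^{n \times k},
\qquad d = 18943,\; k = 500 .
\]
```

Each row of Z is the reduced representation of one text. Since X is never centered, it can stay in sparse form during the decomposition.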

3.2.5 Classification of the data


Results

This chapter will present the results obtained at the end of the project. Firstly the results from ML models, then a section with results from practical cases.

4.1 Machine Learning models

We trained 5 models with different hyperparameters. The comparison between FastText and TF-IDF is limited because our dataset is split into two groups: texts that contain more than one language and texts that contain only English. TF-IDF can be used on texts that contain more than one language, whereas FastText cannot, so we decided to limit our dataset to texts that contain only English words. For each model, we applied a grid search to find the best parameters. In order to limit overfitting, we applied 3-fold cross-validation to each parameter combination; cross-validation estimates the effect of overfitting. The following tables show the average accuracy obtained on the test set for each model:
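The combination of grid search and 3-fold cross-validation can be sketched in plain Python. This is a generic illustration and not the code used for the experiments: `evaluate` is a hypothetical callback that trains a model with the given parameters on the training indices and returns its accuracy on the validation indices, and the folds are contiguous (no shuffling).

```python
from itertools import product


def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


def grid_search(param_grid, n_samples, evaluate, k=3):
    """Return (best_params, best_mean_score) over all parameter combinations."""
    names = list(param_grid)
    best = (None, float("-inf"))
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Average the validation score over the k folds.
        scores = [evaluate(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean = sum(scores) / len(scores)
        if mean > best[1]:
            best = (params, mean)
    return best
```

For KNN, the grid would for example be `{"n_neighbors": [5, 10, 20, 30, 100], "weights": ["uniform", "distance"]}`, which gives the 10 combinations reported in Table 4.1, each evaluated on 3 folds.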


| KNN (neighbours, weighting) | TF-IDF + SVD | fastText |
|---|---|---|
| (5, uniform) | 0.921 | 0.925 |
| (5, distance) | 0.927 | 0.931 |
| (10, uniform) | 0.915 | 0.923 |
| (10, distance) | 0.923 | 0.930 |
| (20, uniform) | 0.904 | 0.916 |
| (20, distance) | 0.917 | 0.927 |
| (30, uniform) | 0.892 | 0.908 |
| (30, distance) | 0.913 | 0.923 |
| (100, uniform) | 0.795 | 0.882 |
| (100, distance) | 0.851 | 0.910 |

Table 4.1: KNN results (accuracy) with different parameters (number of neighbours and weighting).

| RF (criterion, estimators) | TF-IDF + SVD | fastText |
|---|---|---|
| (gini, 100) | 0.936 | 0.92 |
| (gini, 300) | 0.937 | 0.923 |
| (gini, 500) | 0.937 | 0.923 |
| (gini, 800) | 0.936 | 0.923 |
| (gini, 1000) | 0.936 | 0.923 |
| (entropy, 100) | 0.934 | 0.922 |
| (entropy, 300) | 0.934 | 0.92 |
| (entropy, 500) | 0.934 | 0.922 |
| (entropy, 800) | 0.935 | 0.923 |
| (entropy, 1000) | 0.934 | 0.922 |


| MLP (layers, momentum, learning rate) | TF-IDF + SVD | fastText |
|---|---|---|
| ((100, 100), 0.9, 0.001) | 0.838 | 0.947 |
| ((100, 100), 0.9, 0.1) | 0.931 | 0.947 |
| ((75, 75), 0.5, 0.1) | 0.938 | 0.95 |
| ((75, 75, 75), 0.99, 0.1) | 0.938 | 0.937 |
| ((75, 75), 0.99, 0.01) | 0.947 | 0.935 |
| ((75, 75), 0.99, 0.1) | 0.938 | 0.937 |

Table 4.3: MLP top results (accuracy) with different parameters (layers, momentum and learning rate), with a maximum of 200 iterations and a constant learning rate.

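| SVM (C) | 0.1 | 1.0 | 10.0 |
|---|---|---|---|
| TF-IDF + SVD | 0.0375 | 0.108 | 0.340 |
| fastText | 0.277 | 0.906 | 0.923 |

| LR (C) | 0.1 | 1.0 | 10.0 |
|---|---|---|---|
| TF-IDF + SVD | 0.902 | 0.918 | 0.935 |
| fastText | 0.943 | 0.948 | 0.947 |

Table 4.4: SVM and LR results (accuracy) with different values of the regularization parameter (C).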


on the RF; however, using RF with FastText seems to be a good trade-off. This choice takes into account that the amount of data and the number of labels will grow. The only problem is that FastText can only process text that contains one language. Indeed, a FastText model is trained on one language and can only transform words of that language into vectors.

The following matrix shows the results obtained on the test set (25% of the set) by FastText / RF with the criterion parameter set to ’gini’ and the number of estimators set to 100.


4.2 Dimensionality reduction

This part focuses on dimensionality reduction and the importance of this step.

The training step depends on the dimension of the data and on the algorithms. The following table shows the computing times (in minutes) for all algorithms depending on the dimension:

| Method (dimension) | KNN | RF | SVM | LR | MLP |
|---|---|---|---|---|---|
| TF-IDF (18943) | 1.52 | 4.92 | 27.63 | 0.038 | 6.89 |
| TF-IDF + SVD (500) | 0.08 | 0.13 | 0.31 | 0.037 | 0.24 |
| FastText (300) | 0.05 | 0.11 | 0.08 | 0.035 | 0.15 |

Table 4.5: Computing times (in minutes) during the training for all algorithms depending on the dimension

Table 4.5 shows the importance of dimensionality reduction. With the full TF-IDF representation, SVM takes more than 27 minutes to train for a single parameter value, and the grid search with cross-validation took more than 4 hours. The same problem occurred for MLP: we tried 36 combinations of parameters, and it took more than 12 hours. In contrast, the MLP training took only 16 minutes with 300 dimensions.
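These totals are consistent with a simple estimate: the time of one training run multiplied by the number of parameter combinations and by the number of folds. The sketch below makes that arithmetic explicit. It assumes that each fit costs the single-run time of Table 4.5 and that 3 values of C were tried for SVM (as in Table 4.4); the helper name is hypothetical.

```python
def total_minutes(minutes_per_fit, n_param_combinations, n_folds=3):
    """Rough cost of a grid search with cross-validation, in minutes."""
    return minutes_per_fit * n_param_combinations * n_folds


# SVM on raw TF-IDF: 3 values of C, 3 folds
svm_hours = total_minutes(27.63, 3) / 60
# MLP on raw TF-IDF: 36 combinations, 3 folds
mlp_hours = total_minutes(6.89, 36) / 60
# MLP on FastText vectors (300 dimensions)
mlp_fasttext_minutes = total_minutes(0.15, 36)
```

Under these assumptions the estimate gives roughly 4.1 hours for SVM, 12.4 hours for MLP on the raw TF-IDF vectors, and about 16 minutes for MLP on the 300-dimensional FastText vectors.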

4.3 Crawler

This part focuses on the results obtained by the crawler. Table 4.6 shows metrics on the websites processed by the crawler:

| Metric | Count |
|---|---|
| websites visited | 285598 |
| websites with forms | 41116 |
| forms passed | 19585 |
| forms submitted | 62703 |
| inputs labeled | 25737 |
| inputs filled | 82329 |


we do not detect the correct submission. The limited number of classes has a direct impact on the number of forms that we are able to pass. Indeed, forms have many different types of input, and our solution only handles 13 types. Several web forms do not check the values in depth, so we can pass the form even if we fill it with incorrect data.

4.4 Machine Learning model and KWSS comparison

This section shows a comparison between the ML model and KWSS on practical examples.

| Input | RF (entropy, 100) + FastText | KWSS |
|---|---|---|
| visual: "company name" | last name | company |
| visual: "last name" | last name | last name |
| visual: "company" | last name | last name |
| visual: "email" | email | email |
| visual: "email address" | email | email |
| visual: "city" | city | city |
| visual: "city name" | lastname | city |
| visual: "city name" hidden:"name" | lastname | city |
| visual: "city name" hidden:"city name" | lastname | city |
| visual: "city name" hidden:"city" | lastname | city |
| visual: "name" hidden:"first_name" | lastname | first name |
| visual: "name" hidden:"first_name" | lastname | first name |
| visual: "city" hidden:" city city name" | city | city |

Table 4.7: Comparison between RF method with FastText and KWSS on different inputs


Discussion and conclusion

This chapter summarizes the results obtained in the previous chapter and the conclusions that can be drawn from them. It then discusses possible future work and enhancements of the project. Finally, the chapter ends with a global conclusion on the project.

5.1 Discussion

The accuracy obtained by RF using FastText is sufficient given the errors in the data. The use of FastText guarantees a low dimensionality and avoids the out-of-vocabulary problem of Word2Vec. A limitation of FastText is the time and memory needed to initialize the model: the initialization can take minutes, and the model takes more than 10 gigabytes of RAM.

Moreover, the project improves the method that fills web forms developed by Madumbo. This project is part of a bigger one that aims to navigate through websites and trigger bugs. The whole project is implemented in TypeScript, whereas the ML code is in Python. Maintaining two languages is costly; the original idea was to use deep learning, since a network trained in Python can be exported and used with TensorFlow.js, which is a JavaScript library.

The amount of data is not sufficient to ensure that the ML model is stable. We found cases where the ML models do not classify simple texts correctly. Furthermore, ML models confused close classes like “first name”, “last name”, “surname” and “full name”. The form can still be passed but the prediction is false.


5.2 Future work

This project was created from scratch over the last months, with limited time. Despite limited computing power and time, we aimed to reach a version of the project that can classify inputs in order to pass a web form. Directions for future work and improvements of the current techniques are explained below.

• The project only handles 13 different classes. The future goal is to increase this number. For example, the class “street” can easily be added to the labels and detected by keywords.

• The technique from [36] can improve the detection of submissions. The authors propose to intentionally submit a wrong solution in order to detect the behavior of the form. This technique would help us detect error pages.

• The previous idea of intentionally submitting a wrong solution suggests a new method: a classical pattern of web forms is to display an error message when a form is not correctly submitted. Detecting this message can give us more information on the input and on the correctness of our prediction.

• The information contained in specific attributes can improve KWSS. For example, some inputs have the attributes “max” and “min”, or the type “range”. These mean that a number should fill the box. Moreover, the attribute “pattern” holds a regex that the web page uses to check the value of the input box. This regex can be used to generate a value which should be correct, or to directly check our prediction.

• Previously, we talked about the label element, which is connected to an input. The link between the label object and the input is not always trivial: developers sometimes forget to link the label to the ID of the input. The paper [37] gives a solution that finds the link between two objects in the HTML structure. The label gives crucial information because it is visual text.


Moreover, most of the submissions were rejected due to one or more fields being badly predicted. Data cleaning methods can also be applied to detect the correct label input and mislabeled data.

• The current method does not detect the links between inputs. Indeed, a more in-depth analysis of the structure of a form should be done. What type of input follows an input of type “first name”? We think that this information can enhance the current method for predicting the label of an input.

• The last improvement concerns the language used. The whole project is written in TypeScript and JavaScript. The initial goal was to develop a DL algorithm and use TensorFlow.js. The next step is to have enough data to train DL models and export the network.

This list gives us the direction to follow and points out the limits of the current results.

5.3 Conclusion


[1] Total number of Websites. http://www.internetlivestats.com/total-number-of-websites/. Accessed: 2019-02-10.

[2] Guilherme A. Toda et al. “A Probabilistic Approach for Automatically Filling Form-based Web Interfaces”. In: Proc. VLDB Endow. 4.3 (Dec. 2010), pp. 151–160. ISSN: 2150-8097. DOI: 10.14778/1929861.1929862. URL: http://dx.doi.org/10.14778/1929861.1929862.

[3] Gustavo Zanini Kantorski, Viviane Pereira Moreira, and Carlos Alberto Heuser. “Automatic Filling of Hidden Web Forms: A Survey”. In: SIGMOD Rec. 44.1 (May 2015), pp. 24–35. ISSN: 0163-5808. DOI: 10.1145/2783888.2783898. URL: http://doi.acm.org/10.1145/2783888.2783898.

[4] Mengen Chen, Xiaoming Jin, and Dou Shen. “Short Text Classification Improved by Learning Multi-Granularity Topics.” In: Jan. 2011, pp. 1776–1781. DOI: 10.5591/978-1-57735-516-8/IJCAI11-298.

[5] Ge Song et al. “Short Text Classification: A Survey”. In: Journal of Multimedia 9 (2014), pp. 635–643.

[6] Jayant Madhavan et al. “Google’s Deep Web Crawl”. In: Proc. VLDB Endow. 1.2 (Aug. 2008), pp. 1241–1252. ISSN: 2150-8097. DOI: 10.14778/1454159.1454163. URL: http://dx.doi.org/10.14778/1454159.1454163.

[7] robot.txt. https://www.robotstxt.org/robotstxt.html. Accessed: 2019-06-25.

[8] Web Page. https://www.w3.org/TR/WCAG10-HTML-TECHS/. Accessed: 2019-06-21.

[9] List of HTTP status codes. https://en.wikipedia.org/wiki/List_of_HTTP_status_codes. Accessed: 2019-02-10.


[10] Fill form extension. https://chrome.google.com/webstore/detail/autofill/nlmmgnhgdeffjkdckmikfpnddkbbfkkk/. Accessed: 2019-10-25.

[11] Fill form extension. https://chrome.google.com/webstore/detail/form-filler/bnjjngeaknajbdcgpfkgnonkmififhfo?hl=en.

[12] Fill form extension. https://github.com/husainshabbir/form-filler.

[13] Yann LeCun, Y Bengio, and Geoffrey Hinton. “Deep Learning”. In: Nature 521 (May 2015), pp. 436–44. DOI: 10.1038/nature14539.

[14] Tomas Mikolov et al. “Distributed Representations of Words and Phrases and their Compositionality”. In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges et al. Curran Associates, Inc., 2013, pp. 3111–3119. URL: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

[15] Armand Joulin et al. “Bag of Tricks for Efficient Text Classification”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Valencia, Spain: Association for Computational Linguistics, Apr. 2017, pp. 427–431. URL: https://www.aclweb.org/anthology/E17-2068.

[16] Stephen E. Robertson. “Understanding inverse document frequency: on theoretical arguments for IDF”. In: Journal of Documentation 60 (2004), pp. 503–520.

[17] James S. Bergstra et al. “Algorithms for Hyper-Parameter Optimization”. In: Advances in Neural Information Processing Systems 24. Ed. by J. Shawe-Taylor et al. Curran Associates, Inc., 2011, pp. 2546–2554. URL: http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf.


[19] Breaking the 80/20 rule: How data catalogs transform data scientists productivity. https://www.ibm.com/blogs/bluemix/2017/08/ibm-data-catalog-data-scientists-productivity/. Accessed: 2019-02-10.

[20] Sotiris B. Kotsiantis, Dimitris Kanellopoulos, and Panayiotis E. Pintelas. “Data Preprocessing for Supervised Leaning”. In: 2007.

[21] Curse of dimensionality. https://en.wikipedia.org/wiki/Curse_of_dimensionality. Accessed: 2019-02-10.

[22] V. Klema and A. Laub. “The singular value decomposition: Its computation and some applications”. In: IEEE Transactions on Automatic Control 25.2 (1980), pp. 164–176.

[23] Erhard Rahm and Hong Hai Do. “Data Cleaning: Problems and Current Approaches”. In: IEEE Data Eng. Bull. 23 (Jan. 2000), pp. 3–13.

[24] Anderson Rocha and Siome Klein Goldenstein. “Multiclass from Binary: Expanding One-Versus-All, One-Versus-One and ECOC-Based Approaches”. In: IEEE transactions on neural networks and learning systems 25 (Feb. 2014), pp. 289–302. DOI: 10.1109/TNNLS.2013.2274735.

[25] Ryan Rifkin and Aldebaro Klautau. “In Defense of One-Vs-All Classification”. In: J. Mach. Learn. Res. 5 (Dec. 2004), pp. 101–141. ISSN: 1532-4435. URL: http://dl.acm.org/citation.cfm?id=1005332.1005336.

[26] Leo Breiman. “Bagging Predictors”. In: Mach. Learn. 24.2 (Aug. 1996), pp. 123–140. ISSN: 0885-6125. DOI: 10.1023/A:1018054314350. URL: http://dx.doi.org/10.1023/A:1018054314350.

[27] T. Cover and P. Hart. “Nearest neighbor pattern classification”. In: IEEE Transactions on Information Theory 13.1 (Jan. 1967), pp. 21–27. ISSN: 0018-9448. DOI: 10.1109/TIT.1967.1053964.

[28] J. R. Quinlan. “Induction of Decision Trees”. In: Mach. Learn. 1.1 (Mar. 1986), pp. 81–106. ISSN: 0885-6125. DOI: 10.1023/A:1022643204877. URL: http://dx.doi.org/10.1023/A:1022643204877.


[30] P. McCullagh and J.A. Nelder. Generalized Linear Models, Second Edition. Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series. Chapman & Hall, 1989. ISBN: 9780412317606. URL: http://books.google.com/books?id=h9kFH2%5C_FfBkC.

[31] Corinna Cortes and Vladimir Vapnik. “Support-Vector Networks”. In: Mach. Learn. 20.3 (Sept. 1995), pp. 273–297. ISSN: 0885-6125. DOI: 10.1023/A:1022627411411. URL: https://doi.org/10.1023/A:1022627411411.

[32] Cyril Goutte and Eric Gaussier. “A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation”. In: vol. 3408. Apr. 2005, pp. 345–359. DOI: 10.1007/978-3-540-31865-1_25.

[33] Jayant Madhavan et al. “Google’s Deep Web Crawl”. In: Proc. VLDB Endow. 1.2 (Aug. 2008), pp. 1241–1252. ISSN: 2150-8097. DOI: 10.14778/1454159.1454163. URL: http://dx.doi.org/10.14778/1454159.1454163.

[34] V. Shkapenyuk and T. Suel. “Design and implementation of a high-performance distributed Web crawler”. In: Proceedings 18th International Conference on Data Engineering. 2002, pp. 357–368.

[35] NLTK package. http://www.nltk.org/api/nltk. Accessed: 2019-06-21.

[36] Luciano Barbosa and Juliana Freire. “Siphoning Hidden-Web Data through Keyword-Based Interfaces”. In: SBBD. 2004, pp. 309–321.

[37] Zou Y. Upadhyaya B. Khomh F. “Extracting RESTful services from Web applications”. In: 5th IEEE International Conference on Service-Oriented Computing and Applications (SOCA) (2012).

[38] Carla E. Brodley and Mark A. Friedl. “Identifying Mislabeled


• https://www.dell.com/Identity/global/LoginOrRegister/

• https://secure.telegraph.co.uk/secure/registration/

• https://account.sonyentertainmentnetwork.com/liquid/reg/account/create-account!input.action

• https://www.contentful.com/sign-up/

• https://www.powtoon.com/account/signup/

• https://www.paypal.com/welcome/signup/

• https://www.import.io/signup/

• https://www.hushmail.com/signup/trial/

• https://freshdesk.com/signup

• https://www.ibm.com/account/reg/us-en/signup?formid=urx-19776

• https://www.campaignmonitor.com/signup/

• https://www.fastmail.com/signup/

List of regexes for each class:

• city: ['city', 'ville', 'village', 'town', 'municipality']

• username: ['pseudo', 'user.?id', 'user.?name']

• company: ['business', 'organisation', 'organization', 'company', 'company.?name']

• zipcode: ['postal', 'code', 'postal.?code', 'zip', 'zip.?code']

• first name: ['first.?name']

• last name: ['lastname', 'famil', 'last.?name', 'surname', 'second.?name']

• tel: ['phone', 'telephone', 'fax', 'smartphon', 'tel', 'home.?tel', 'work.?tel']

• password: ['password', 'pwd', 'pass']

• day: ['day', 'date.?d', 'dd']

• month: ['month', 'm?date', 'date.?m', 'mm']

• url: ['website', 'url', 'site']
