Automatic Patent Classification

Automatic Patent Classification Documents

PAPER WITHIN: Product development

AUTHOR: Nala Yehe

TUTOR: Rachid Oucheikh, Lirandë Pira

JÖNKÖPING, June 2020

This exam work has been carried out at the School of Engineering in Jönköping in the subject area of data analysis. The work is part of the two-year university master's programme in Software Product Engineering.

Examiner: Anders Adlemo

Supervisor: Rachid Oucheikh, Lirandë Pira

Scope: 30 credits


Abstract

Patents have great research value and benefit the industrial, commercial, legal, and policymaking communities. Effective analysis of patent literature can reveal important technical details and relationships; it can also explain business trends, suggest novel industrial solutions, and support crucial investment decisions. Patent documents should therefore be analyzed carefully so that their value can be exploited. Patent analysts generally need expertise in several fields, including information retrieval, data processing, text mining, field-specific technology, and business intelligence. In practice, it is difficult to find and train an analyst who meets these multidisciplinary requirements in a relatively short period of time.

Patent classification is also crucial when processing patent applications, because it lets people manage and maintain patent texts better and more flexibly. In recent years, the number of patents worldwide has increased dramatically, which makes an automatic patent classification system very important. Such a system can replace time-consuming manual classification and thus give patent analysis managers an effective way of managing patent texts. This paper designs a patent classification system based on data mining methods and machine learning techniques and uses the KNIME software to conduct a comparative analysis. The comparison covers different machine learning methods and different parts of a patent.

The purpose of this thesis is to use text data processing methods and machine learning techniques to classify patents automatically. It mainly comprises two parts: data preprocessing and the application of machine learning techniques. The research questions are: Which part of a patent as input data performs best in relation to automatic classification? Which of the implemented machine learning algorithms performs best regarding the classification of IPC keywords?

This thesis uses design science research as its research method. It uses the KNIME platform to apply the machine learning techniques, which include decision tree, XGBoost linear, XGBoost tree, SVM, and random forest. The implementation comprises data collection, data preprocessing, feature word extraction, and the application of classification techniques. A patent document consists of many parts, such as the description, abstract, and claims. In this thesis, we feed these three parts separately to our models as input data and then compare the performance obtained with each part.

Based on a comparison of the results obtained from these three experiments, we suggest using the description part in the classification system, because it shows the best performance in English patent text classification. The abstract can serve as an auxiliary input for classification. However, classification based on the claims part, as proposed by some scholars, did not achieve good performance in our research. Furthermore, the BoW and TF/IDF methods can be used together to extract feature words efficiently. In addition, we found that the SVM and XGBoost techniques perform better than the other techniques in the automatic patent classification system.


Keywords

XGBoost; support vector machine (SVM); random forest; decision tree; machine learning; text data mining; patent classification; IPC


1 Introduction

1.1 Background

1.2 Purpose and research questions

1.3 Delimitations

1.4 Outline

2 Theoretical background

2.1 Patent Documents

2.1.1 Patents

2.1.2 Text data mining

2.1.3 Classification standards

2.1.4 Patents document classification research direction

2.2 Classification procedure

2.3 Data Processing

2.3.1 Data preprocessing

2.3.2 Feature extraction (TF/IDF method)

2.4 Classification Techniques

2.4.1 Support Vector Machine (SVM)

2.4.2 Decision tree (DT)

2.4.3 Random forest (RF)

2.4.4 XGBoost

2.5 Results Analysis

2.6 Tools

2.6.1 PyCharm

2.6.2 KNIME

3 Methodology

3.1 Literature review and data collection

3.2 Design Science research

3.2.2 Results comparison

4 Research procedure and implementation

4.1 Data collection

4.2 Data preprocessing

4.3 Feature words extraction

4.4 Classification techniques

5 Discussion and conclusions

5.1 Discussion of method

5.2 Discussion of findings

5.2.1 Results and analysis

5.2.2 Comparison results

5.2.3 Summary of results

5.2.4 Comparison with other studies

5.3 Conclusion and perspective


List of figures & list of tables

List of figures

Figure 1. Research procedure.

Figure 2. Medicine as the search string.

Figure 3. Program running for patent data collection.

Figure 4. Sample of patent txt file.

Figure 5. Patent classification system.

Figure 6. Processing data workflow. (Thiel, 2014)

Figure 7. Feature words extraction workflow. (Thiel, 2014)

Figure 8. SVM technique workflow. (Berthold & Thiel, 2012)

Figure 9. XGBoost linear technique workflow. (Berthold & Thiel, 2012)

Figure 10. XGBoost tree technique application. (Berthold & Thiel, 2012)

Figure 11. Decision tree technique. (Berthold & Thiel, 2012)


List of tables

Table 1. Classification results using only description data

Table 2. Classification results using only abstract data

Table 3. Classification results using only claims data text

Table 4. Comparison of accuracy

Table 5. Comparison of recall

Table 6. Comparison of precision


Introduction

1 Introduction

This thesis aims to use machine learning techniques to classify patent documents automatically. Currently, most patent documents are classified manually (PRV, 2019). This report describes how to obtain patent text data, clean it, find the feature words of the text in a patent document, and then classify the document. In addition, chapter two introduces the technical background and chapter three introduces the methods used for data processing. The thesis also covers the experimental procedure, the results, and their comparison.

1.1 Background

A patent document describes the rights granted to the patent holder for a limited period of time. As Olsson & Söderström (2019) explain, these rights prevent others from commercially exploiting the invention, for example by selling, importing, or distributing it, within the specified time and the prescribed area. Patents can also serve as a measure of technological innovation and development and can promote socio-economic and technological progress (PRV, 2019).

In recent years, the number of patent documents has increased dramatically (WIPO IP Services, 2019). As the number of applications continues to grow, it is important to categorize and retrieve patent documents quickly so that patents can be used in areas such as patent analysis (Aristodemou, Tietze, Athanassopoulou, & Minshall, 2017).

According to statistical research by the World Intellectual Property Organization (WIPO), patent applications are filed for 90%-95% of inventions every year (WIPO IP Services, 2019). Patent research has a positive impact on product sales, company performance, and stock value. Companies holding patents also pay better: studies have shown that such companies can pay wages 26% higher than other companies (WIPO IP Services, 2019). Patents thus contain rich scientific and technological information that reflects the development level and trend of science and technology. How to mine patents and obtain useful patent information has therefore become a research focus for experts and scholars in the field (Olsson & Söderström, 2019). In addition, some practitioners work on patent document transfer, aiming to present patent documents in ways that are more interesting and attractive to the public. This also demonstrates that research on patent documents is significant (IPscreener, 2019).

The patent classification system uses the International Patent Classification (IPC) as the main classification standard, but there are other classification standards in Europe, such as CPC and DPK. In the future, CPC may become the most popular classification standard (Lee & Hsiang, 2019).

1.2 Purpose and research questions

At present, patent texts are classified mainly by hand. Patent examiners of the intellectual property department or experts in related fields are commonly required to assign classification categories to newly filed patent texts. This manual approach is time-consuming, and it is prone to drawbacks such as uncertainty and misclassification (PRV, 2019).


With the advancement of computer technology, automatic classification of patent texts can be used as an aid. Automatic or semi-automatic computer assistance can reduce the uncertainty and the errors of manual classification. At the same time, it can reduce the workload of the examiner and improve the efficiency of classification (Wang, et al., 2020). However, according to the current literature, the relevant research is still at an experimental stage. Some researchers prefer to use the abstract or another single part of a patent for analysis and classification (Lee & Hsiang, 2019). In addition, the majority of researchers in the field are still working on developing a suitable method for automatic classification (IPscreener, 2019). Accuracy has not yet reached an acceptable level, and large-scale automatic classification of patent texts has not been achieved. Applying machine learning to the automatic classification of patent texts makes it possible to divide a large number of patent texts automatically according to their semantic features, which helps people grasp the rich technical information contained in the texts. This study is therefore of practical significance (Li, Hu, Cui, & Hu, 2018).

The purpose of this thesis is to use text data processing methods and machine learning techniques to classify patents automatically (Tran & Kavuluru, 2017). It mainly comprises two parts: data preprocessing and the application of machine learning techniques. The thesis compares different machine learning methods applied to different parts of a patent. Some researchers report that the SVM technique shows the best performance in automatic patent classification (Lee & Hsiang, 2019). Others report that the XGBoost technique achieves state-of-the-art results in many machine learning competitions (Wang, et al., 2020). Decision tree and random forest are two techniques that researchers commonly choose when classifying text data (Lee, Kwon, Myeongjung, & Kwon, 2018). Therefore, the machine learning techniques used in this thesis are decision tree, XGBoost, SVM, and random forest.

Regarding the input data, some researchers mention that the claims can be used as input for patent classification (Suominen, Toivanen, & Seppänen, 2017), and Lee and Hsiang state that the claims part alone is sufficient (Lee & Hsiang, 2019). Most researchers focus on improving the performance of patent classification by using the abstract and title of a patent (Christopher, Lin, & Spieckermann, 2011). Some researchers also mention that the description part usually gives detailed information about a patent, which might be useful for classification (Suominen, Toivanen, & Seppänen, 2017). In this thesis, we therefore feed three parts of a patent, the abstract, the description, and the claims, separately to our models as input data. We then compare the performance obtained with the three parts in order to suggest which part is most suitable for automatic patent classification and which machine learning technique should be used.

The main contributions of our research are summarized as follows:

• We use XGBoost techniques in patent automatic classification.

• We use the abstract, description, and claims as three different inputs to the patent classifiers and compare their performance.

• We use the SVM, XGBoost, DT, and RF techniques with the TF/IDF method to build classifiers and compare their performance.

The aim of the thesis is to develop a framework that automatically classifies patent documents.


An important aspect that needs to be evaluated in the framework relates to the different parts of a patent as input data that could be applied and their individual performance. The exact meaning of performance in this thesis is explained in section 2.5. This leads to the first question:

Which part of a patent as input data performs best in relation to automatic classification?

Another important part of the framework is the evaluation of different machine learning algorithms, to identify the algorithm with the best performance.

This leads to the second research question:

Which of the implemented machine learning algorithms performs best regarding the classification of IPC keywords?

1.3 Delimitations

The total number of patent documents is large, and the classification schemes are fine-grained, with about tens of thousands of categories (USPTO, 2020). Moreover, within a general category, a patent document may be classified into several different subcategories. To keep the experiment operable, only a small range of data is used, and experiments are conducted only for a few small categories within one large category. The goal is to be able to carry the experiment through to completion.

In this thesis, we choose one hundred documents from five different categories as the test data. We used design, medicine, human, plant, and Swedish as the keywords to download patents from the US patent website, and we randomly chose twenty patents from each category as the test data. The data set is therefore small, and the five categories were chosen at random. In addition, we restrict ourselves to the SVM, decision tree, random forest, and XGBoost machine learning techniques and the TF/IDF method to build the classifiers. These choices are limitations of our research.

1.4 Outline

The rest of the report is organized as follows.

The second chapter is the theoretical background. It introduces the detailed process of the automatic classification of patent documents and gives a brief explanation of patents. It also covers the data mining methods, the machine learning techniques, and the elements used for comparison. Different methods of building classifiers are sorted out and summarized.

The third chapter is the methodology. It presents the methods used in this research, including the research design, the implementation plan, and the data collection.

The fourth chapter covers results analysis and discussion. The experimental analysis, the discussion, and the comparison of results are detailed there.

The fifth chapter presents the conclusion and perspectives of this study. It sums up the work and its achievements, covers the limitations of the thesis, and offers personal opinions on potential future work.


Theoretical background

2 Theoretical background

This chapter introduces the theoretical background. It covers the description of patent documents, text data mining methods, machine learning techniques for classification, and the elements used to evaluate the results.

2.1 Patent Documents

2.1.1 Patents

A patent document is a document that describes a patent application. Its structure comprises basic information, a background introduction, a description, and claims. The basic information includes the patent number, title, CPC/IPC code, references, drawings, etc. (Olsson & Söderström, 2019).

When the document is approved and ready to be published, it should carry one or more patent classification codes, which make it easy to search for the patent in a specific category. Different patent authorities use different codes for different classification standards (Li, Hu, Cui, & Hu, 2018). Patents can serve as a measure of technological innovation and development for a country and can also promote socio-economic and technological progress. In addition, patent documents can be used to predict future technology trends and to detect infringement. Automatic patent classification allows patent experts to reduce the amount of work involved, including the manual analysis of patent documents and the determination of patent quality (Wu, Chang, Tsao, & Fan, 2016). Manual classification incurs a higher cost, because professional experts must participate, and it is also subject to uncertainties and misclassifications (PRV, 2019).

2.1.2 Text data mining

Data mining is the process of extracting previously unknown, potentially useful information and knowledge from a large amount of data. The data is typically incomplete, noisy, fuzzy, and random. When the object of data mining is text, that is, when the data type is text data, the process is called text mining (Aggarwal C & Zhai, 2012).

2.1.3 Classification standards

The classification criteria for patent documents differ between regions. Currently, three standards are most widely used: IPC (International Patent Classification), CPC (Cooperative Patent Classification), and DPK (German patent classification) (PRV, 2019).

Europe and the United States both have high-quality classification systems for patent documents, but their patent classification standards differ. CPC was therefore developed jointly by the European Patent Office and the US Patent Office. The CPC is adapted from the existing European patent document classification standard and complies with the international standard (IPC). It includes more details and sub-categories (European patent office & United States patent and trademark office, 2010). In the future, CPC might be the most used standard (Lee & Hsiang, 2019).


2.1.4 Patents document classification research direction

For patent documents, the main research directions in this field are text data preprocessing, feature extraction methods, and the selection and improvement of machine learning algorithms. Most research concentrates on the selection and improvement of machine learning algorithms. For each step of the automatic classification process of patent texts, relevant experts have conducted in-depth research and achieved great results. Moreover, the literature shows that a breakthrough in any of these steps, whether preprocessing, feature selection, or the selection and improvement of the machine learning algorithm, can greatly improve the classification results (Hongjie, et al., 2018).

Most researchers have classified patents using text data such as the abstract and title (Xia, LI, & Lv, 2016). More recently, a new direction has been proposed in which the claims part of a patent document alone is considered sufficient for classification (Lee & Hsiang, 2019). The claims section covers the subject matter covered by a patent document or application. It includes the scope of legal protection sought or provided by dominant or intergovernmental associations (Olsson & Söderström, 2019).

2.2 Classification procedure

Text classification is an effective way to extract useful and meaningful information from text data (Wang, et al., 2020). It is used in information extraction, text retrieval, dynamic summarization, and other fields. A text classification system usually consists of three steps: text processing, model training, and prediction. Text processing includes cleaning the training data, representing it as vectors, and selecting feature words (Aggarwal C & Zhai, 2012). Model training uses machine learning techniques to build the model. In prediction, the same preprocessing is applied to a new document to obtain its document vector, and the trained model then predicts the corresponding category (Wang, et al., 2020). Figure 1 shows the procedure of the text classification system.

First, the text data set is randomly divided into a training set and a test set. The training set and the test set are then preprocessed separately. The quality of the preprocessing also affects the accuracy of the classification. Preprocessing includes removing stop words and low-frequency words. In some cases, however, low-frequency words have an important influence on text classification and should not be removed (Wang, et al., 2020).
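As an illustration only, the random division into a training set and a test set can be sketched in Python. The function name `split_dataset`, the 20% test fraction, the fixed seed, and the toy corpus are assumptions made for this sketch; the thesis itself works with KNIME workflows rather than hand-written code.

```python
import random

def split_dataset(documents, test_fraction=0.2, seed=42):
    """Randomly split labelled documents into a training set and a test set."""
    shuffled = list(documents)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled corpus: (text, category) pairs.
corpus = [(f"patent text {i}", "medicine" if i % 2 else "plant") for i in range(10)]
train_set, test_set = split_dataset(corpus)
print(len(train_set), len(test_set))  # 8 2
```

Every document ends up in exactly one of the two sets, so the test set stays unseen during training.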

Because a text contains a very large number of feature words, the dimension of the constructed model becomes too large, which affects both the accuracy of the classification and the computational performance (Tseng & Lin, 2007). Therefore, it is necessary to select features based on the importance of the feature words for text classification.

Finally, machine learning techniques are used to construct a classifier, which is then used to process and classify the data and is evaluated on the test data set (Christopher, Lin, & Spieckermann, 2011). The classification and evaluation step in Figure 1 points back to the preceding preprocessing, feature selection, and weight calculation steps: in the KNIME workflow, the quality of the preprocessing, the feature selection method, and the text representation model is evaluated on the basis of the classification results (AG, 2020).


Figure 1. Research procedure.

Once a classifier is available, the input can be a patent document whose patent code is unknown. After the same processing, the classifier outputs a suggested category for the patent (Li, Hu, Cui, & Hu, 2018).

2.3 Data Processing

The task of text classification is to assign a text to a category or class quickly and automatically. Nowadays, with technological and economic development, more and more useful data is hidden in chaotic data (Wang, et al., 2020), so finding useful information or the data that is needed is an interesting research area. Methods such as SVM, KNN, CNN, and K-means can be used for this purpose (Li, Hu, Cui, & Hu, 2018). Some research shows that SVM suits the properties of text data: a high-dimensional input space, few irrelevant features, sparse document vectors, and the fact that most text classification problems are linearly separable (Li, Hu, Cui, & Hu, 2018). Comparisons of different methods also show that SVM achieves better classification accuracy (Tran & Kavuluru, 2017). Patent documents are text, so patent classification falls within the scope of text classification, and machine learning and text mining methods can be applied. SVM is therefore our first choice. In addition, the XGBoost method is commonly used in text classification with good performance (Wang, et al., 2020), so it is our second choice.


2.3.1 Data preprocessing

Data preprocessing converts text into usable data. This step removes information in the document that is not needed, such as noisy words, document formatting, punctuation, symbols, and special characters. It may also remove words that carry little meaning, such as articles, pronouns, prepositions, conjunctions, and auxiliary words (Xia, LI, & Lv, 2016).
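A minimal sketch of such a cleaning step is given below. It is not the KNIME workflow used in this thesis: the function name `preprocess`, the regular expression, and the very short stop-word list are assumptions chosen for illustration.

```python
import re

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "is", "are",
              "to", "in", "it", "for", "by"}

def preprocess(text):
    """Lowercase the text, drop punctuation, digits and symbols, then remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())   # keeps alphabetic runs only
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("The device (10) is a valve for controlling the flow of a fluid."))
# ['device', 'valve', 'controlling', 'flow', 'fluid']
```

The remaining tokens are the candidates from which feature words are later selected.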

2.3.2 Feature extraction (TF/IDF method)

The text of each patent document contains many different feature words, and extracting them is the key first step in classification. Because of the large volume of patent texts, the vector space obtained from the processed text data has a particularly high dimension. A patent may contain highly correlated text, some words are interrelated, and some words appear in different categories with different meanings. It is therefore important to extract a suitable feature vocabulary for the classification experiments (Tseng & Lin, 2007).

Several methods are currently used to select feature words (Aggarwal C & Zhai, 2012). Although all of them can be used in text data mining, most researchers choose the TF/IDF method to extract feature words, and some report that TF/IDF shows the best performance in feature word extraction for patent classification (Li, Hu, Cui, & Hu, 2018). The KNIME tutorial also uses TF/IDF as the common choice for text classification (Berthold & Thiel, 2012). We therefore choose TF/IDF as the method to extract feature words.

According to the literature, TF/IDF is a popular and common method in the text classification field (Li, Hu, Cui, & Hu, 2018) and is usually considered the first choice. The method calculates a weight for each word. The TF/IDF weighting is based on the idea that words that occur frequently in the document concerned but rarely in other documents have a great influence on text classification. The weight is determined by two values, TF and IDF. TF means term frequency: the higher the TF value, the greater the influence of the word on the classification of the text. IDF means inverse document frequency: a high IDF value indicates that the feature word is unlikely to appear in other documents, which means that the word is better able to distinguish between text categories (Aggarwal C & Zhai, 2012). The weight is calculated as follows:

$$
W(t, d) = \frac{tf(t, d) \cdot \log\left(\frac{N}{n(t)} + a\right)}{\sqrt{\sum_{t \in d} \left[ tf(t, d) \cdot \log\left(\frac{N}{n(t)} + a\right) \right]^{2}}}
$$

Here W(t, d) is the weight of the feature word t in the text d, tf(t, d) is the frequency of the feature word t in d, N is the total number of texts, n(t) is the number of texts containing the feature word t, a is a small positive value, and log(N / n(t) + a) is the inverse text frequency function. (Aggarwal C & Zhai, 2012)
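The weighting can be sketched in a few lines of Python. This is an illustrative reading of the formula in which the raw weights of each text are divided by their Euclidean length; the function name `tfidf_weights`, the value a = 0.01, and the toy documents are assumptions, not values taken from the experiments.

```python
import math
from collections import Counter

def tfidf_weights(docs, a=0.01):
    """Return one {term: weight} dict per document, using length-normalized TF/IDF."""
    n_docs = len(docs)                                               # N
    doc_freq = Counter(term for doc in docs for term in set(doc))    # n(t)
    weighted = []
    for doc in docs:
        tf = Counter(doc)                                            # tf(t, d)
        raw = {t: tf[t] * math.log(n_docs / doc_freq[t] + a) for t in tf}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted

# Hypothetical, already preprocessed documents (lists of tokens).
docs = [["valve", "fluid", "valve"], ["plant", "seed"], ["valve", "plant"]]
weights = tfidf_weights(docs)
print(weights[0])
```

In the first toy document, "fluid" occurs only once but in no other document, so it receives a larger weight than "valve", which occurs twice but also appears in another document.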

2.4 Classification Techniques

Classification algorithms are an important part of text classification and machine learning. Classification can in fact be seen as prediction performed on data. Current classification algorithms fall into three main groups: algorithms based on statistical classification, such as Naive Bayes, the K-nearest-neighbor algorithm, the support vector machine, and the maximum entropy model; algorithms based on connections between neurons, such as artificial neural networks; and algorithms based on classification rules, such as decision trees (Taher, Jisan, & Rahman, 2019).

2.4.1 Support Vector Machine (SVM)

The principle of the support vector machine algorithm is to find a separating hyperplane that separates the different categories. Training consists of finding this hyperplane, such that the positive and negative examples fall on opposite sides of it. The best hyperplane is the one that maximizes the distance to the nearest positive and negative examples and lies at an equal distance from both. A patent text of unknown category is then assigned to the category on whose side of the hyperplane it falls (Aggarwal C & Zhai, 2012).

The support vector machine algorithm has several advantages over other methods, for example its ability to handle high-dimensional problems, its insensitivity to correlations between text features, and its high accuracy (Araghinejad & Modaresi, 2014).
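The prediction step of a linear SVM can be illustrated as follows. The sketch assumes that a hyperplane, given by a weight vector w and an offset b, has already been found by training; the concrete values and the function names are hypothetical, and the training itself is not shown.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def distance_to_hyperplane(w, b, x):
    """Geometric distance of x from the hyperplane (the quantity the margin is built from)."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical, already trained hyperplane over two TF/IDF features.
w, b = [1.0, -1.0], 0.0
print(predict(w, b, [0.9, 0.1]))   # 1
print(predict(w, b, [0.2, 0.7]))   # -1
```

Training an SVM amounts to choosing w and b so that the smallest such distance over the training examples is as large as possible.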

2.4.2 Decision tree (DT)

The decision tree classification algorithm is an example-based inductive learning method. It extracts a tree-like classification model from a given unordered set of training samples. Compared with other machine learning classification algorithms, the decision tree algorithm is relatively simple and can be applied as long as the training samples can be expressed using feature vectors and categories (Abdelaal, Ahmed, Ghribi, & Alansary, 2019).
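To make the idea of inducing a tree from samples concrete, the sketch below chooses the root split of a tree over feature words. The use of Gini impurity as the split criterion is an assumption of this sketch (the text above does not name a criterion), and the function names and toy samples are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split_word(samples):
    """Pick the word whose presence/absence gives the lowest weighted Gini impurity."""
    vocabulary = sorted({word for words, _ in samples for word in words})
    best_word, best_score = None, float("inf")
    for word in vocabulary:
        present = [label for words, label in samples if word in words]
        absent = [label for words, label in samples if word not in words]
        score = (len(present) * gini(present) + len(absent) * gini(absent)) / len(samples)
        if score < best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Hypothetical training samples: (set of feature words, category).
samples = [
    ({"drug", "dose"}, "medicine"),
    ({"drug", "tablet"}, "medicine"),
    ({"seed", "leaf"}, "plant"),
    ({"seed", "dose"}, "plant"),
]
print(best_split_word(samples))  # ('drug', 0.0)
```

A full tree repeats this choice recursively on the two resulting subsets until the nodes are pure or a stopping rule applies.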

2.4.3 Random forest (RF)

A random forest is composed of many decision trees, and there is no correlation between the different trees. When a new input sample is to be classified, each decision tree in the forest judges and classifies it separately and produces its own classification result. Once the forest has been built, the classifier collects the results of all decision trees and selects the category with the most votes. RF is very efficient at processing large data sets (Lee, Kwon, Myeongjung, & Kwon, 2018).
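The voting step can be sketched as follows. Each "tree" is reduced here to a single hand-written rule purely for illustration; in a real random forest the trees are learned from random subsets of the samples and features. The name `forest_predict` and the rules are assumptions.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Let every tree classify the sample, then return the most voted category."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hypothetical "trees", each reduced to one rule on the feature words.
trees = [
    lambda words: "medicine" if "drug" in words else "plant",
    lambda words: "medicine" if "dose" in words else "plant",
    lambda words: "plant" if "seed" in words else "medicine",
]
print(forest_predict(trees, {"drug", "seed"}))  # plant
```

Here the first rule votes for "medicine" and the other two for "plant", so the majority decides.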

2.4.4 XGBoost

XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible, and portable. It provides parallel tree boosting, which solves many problems quickly and accurately. The same code runs on the major distributed environments and can handle problems with billions of samples. It is also a common machine learning technique for text data processing (Wang, et al., 2020).
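The core idea of tree boosting, adding one small tree after another so that each corrects the errors left by the previous ones, can be sketched in plain Python. This is not XGBoost itself: it omits regularization, second-order information, and parallelism, and it uses squared loss on a single numeric feature. All names and values below are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals of a 1-D feature."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else 0.0
        error = sum((r - (left_mean if x <= threshold else right_mean)) ** 2
                    for x, r in zip(xs, residuals))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    """Additively combine stumps, each fitted to the current residuals (squared loss)."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    for _ in range(rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        predictions = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                       for x, p in zip(xs, predictions)]
    return predictions

xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 1.0, 1.0]
fitted = gradient_boost(xs, ys)
print([round(p, 3) for p in fitted])  # [0.0, 0.0, 1.0, 1.0]
```

Each round halves the remaining error on this toy data, which shows why many weak trees together can fit the targets closely.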


2.5 Results Analysis

Several indicators are needed to compare the results of the experiments. On the KNIME platform, the results report accuracy, recall, and precision (Berthold & Thiel, 2012). Accuracy is the proportion of documents whose classification is predicted correctly. Recall is the ratio of the documents correctly assigned to a category to all documents that actually belong to that category. Precision is the ratio of the number of retrieved relevant documents to the total number of retrieved documents, that is, the share of documents assigned to a category that truly belong to it. The closer a value is to 1, the better the performance. Usually, we use accuracy to measure the effect of classification (Lee, Kwon, Myeongjung, & Kwon, 2018).

These measures are expressed as follows:

• Accuracy = (number of correctly classified patents) / (number of all patent documents in the experiment)

• Recall = (number of patents correctly classified into a category) / (number of patents that should be in this category)

• Precision = (number of patents correctly classified into a category) / (number of patents classified into this category)
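The three measures can be computed directly from the true and predicted categories. The sketch below uses made-up labels; the function names and the example data are assumptions and do not reproduce the results of this thesis.

```python
def accuracy(y_true, y_pred):
    """Share of all documents whose predicted category is correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision_recall(y_true, y_pred, category):
    """Precision and recall for one category."""
    true_pos = sum(t == category and p == category for t, p in zip(y_true, y_pred))
    predicted = sum(p == category for p in y_pred)   # classified into this category
    actual = sum(t == category for t in y_true)      # should be in this category
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Hypothetical true and predicted categories for five patents.
y_true = ["medicine", "medicine", "plant", "plant", "design"]
y_pred = ["medicine", "plant", "plant", "plant", "design"]
print(accuracy(y_true, y_pred))                    # 0.8
print(precision_recall(y_true, y_pred, "plant"))   # (0.6666666666666666, 1.0)
```

In this toy example every plant patent is found (recall 1.0), but one medicine patent is wrongly placed in the plant category, which lowers the precision for plant to 2/3.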

2.6 Tools

2.6.1 PyCharm

PyCharm is an IDE with a set of tools that help developers work more efficiently in the Python language. It offers functions such as debugging, syntax highlighting, smart code completion, unit testing, and so on. In addition, the IDE provides advanced features that support professional web development with some frameworks (JetBrains, 2020).

2.6.2 KNIME

KNIME is an open source data mining software based on Eclipse. It carries out data extraction, transformation, and loading operations in the data warehouse, as well as data mining, through workflows. A workflow is built from nodes, each of which provides a specific function. The nodes are independent of each other; each can be executed separately and passes its output data to the next node (AG, 2020).

A workflow is formed by dragging nodes from the Node Repository area in the lower left corner to the Workflow Editor in the middle. A node has three states. When the node has just been dragged into the work area, its red light is on, indicating that data cannot pass through; the node must be configured before it can be executed. When the configuration is complete and correct, the yellow light is on, indicating that the node is ready to be executed. When the execute command is selected, the node runs; a green light means that it has been executed successfully and that its data has been passed to the next node (AG, 2020).

KNIME includes several types of nodes. IO nodes are used for input and output operations on files, tables, and data models. Database nodes operate on a database through the JDBC driver. Data manipulation nodes perform filtering, transformation, and simple statistical calculations on the data passed in from the previous node. Data view nodes display the data as box charts, pie charts, histograms, data curves, etc. Statistical model nodes encapsulate statistical algorithms such as linear regression and polynomial regression. Data mining nodes provide Bayesian analysis, cluster analysis, decision trees, neural networks, and other major data mining classification models, together with the corresponding predictors (Berthold & Thiel, 2012).

3 Methodology

3.1 Literature review and data collection

A scientific research project must start with a literature review to gain knowledge of the state of the art. In this important task, we look for information in reliable sources such as published papers, the JU library, Google Scholar, IEEE, and Elsevier. This gives a good understanding of the current state of automatic patent document classification.

Data collection: the required data is obtained directly from the official website of the US Patent and Trademark Office, which has patent documents from all over the world (Li, Hu, Cui, & Hu, 2018). The data is stored in txt format. The collection also includes the original full-text pdf documents, which contain all the text and pictures of each patent, so the data needs to be cleaned.

The International Patent Classification (IPC) is a complex, hierarchical patent classification system with broad classes and narrow subcategories. “The latest version of the IPC contains eight parts, about 120 classes, about 630 subclasses, and about 69,000 groups” (USPTO, 2020). For example, in the IPC classification, “A is human necessities, B is performing operations, transportation, C is chemistry, metallurgy, D is textiles, paper, E is fixed structure, F is Mechanical engineering, lighting, heating, weapons, blasting, G is physics, and H is electricity” (USPTO, 2020). Each part is subdivided into classes, whose symbols consist of the part symbol followed by two digits, such as D01. Similarly, each class is divided into subclasses, whose symbols extend the class symbol, typically with a capital letter, such as A01B or D8724 (USPTO, 2020). Because the IPC covers a very large number of patent documents and subcategories, this paper selects five categories from the crawled data as the research object. The first four characters of the patent code are taken as the category of a patent in this paper.
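The part, class, and subclass levels described above can be illustrated with a small parser. This is a minimal sketch that assumes the standard IPC layout of one section letter, two digits, and one subclass letter; the function name and the regular expression are illustrative and are not part of the collection program.

```python
import re

# Part (section) titles as quoted in the text above.
IPC_SECTIONS = {
    "A": "human necessities",
    "B": "performing operations, transportation",
    "C": "chemistry, metallurgy",
    "D": "textiles, paper",
    "E": "fixed structure",
    "F": "mechanical engineering, lighting, heating, weapons, blasting",
    "G": "physics",
    "H": "electricity",
}

# One section letter, two class digits, one subclass letter, e.g. "A01B".
SUBCLASS_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])")


def split_ipc_symbol(symbol):
    """Split an IPC symbol into (section, class, subclass), or return None."""
    match = SUBCLASS_PATTERN.match(symbol.strip().upper())
    if match is None:
        return None
    section, digits, letter = match.groups()
    return section, section + digits, section + digits + letter


levels = split_ipc_symbol("A01B")
section_title = IPC_SECTIONS[levels[0]]
```

For the symbol A01B, the sketch yields the part A (human necessities), the class A01, and the subclass A01B.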

During the data collection, we randomly chose plant, Swedish, design, human, and medicine as search strings in our program and downloaded the matching patents automatically from the US patent website. With design as the keyword, we downloaded patents from D8724 and D8723, which belong to the “washing, cleaning, or drying machine” category. With medicine as the keyword, we downloaded category 1052, which is the “investigating or analyzing materials by determining their chemical or physical properties” category. With human as the keyword, we downloaded category 1053, which is the “loudspeaker, microphones, gramophone pick-ups or like acoustic electromechanical transducers; deaf-aid sets; public address systems” category. With plant as the keyword, we downloaded category PP313, which is the “new plants or processes for obtaining them; plant reproduction by tissue culture techniques” category. With Swedish as the keyword, we downloaded patents from different categories. To choose patents randomly for our research, we decided to use US patent codes starting with 104, which belong to the “bioinformatics” category, as our input data (USPTO, 2020). This experiment processed about 1400 patents in the fields of medicine, design, and others, including 182 documents in the PP313 category, 442 documents in the 1052 category, and 278 documents in the Swedish category. For this research, 100 texts from those categories were selected for test classification, because we had only one laptop for the experiment and KNIME could not process more files at one time on it. Therefore, we chose twenty patents from each category.

The detailed procedure is as follows. For example, Figure 2 shows the code of our program with medicine chosen as the search string.

Figure 2. Medicine as the search string.

Running the program, as shown in Figure 3, downloads the documents automatically.

Figure 3. Program running for patent data collection.

Figure 4 shows an example of a data txt file. It includes the patent code number and all the text of the description in the full pdf document.

Figure 4. Sample of patent txt file.

After that, the data is prepared for the classification tasks.
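The reduction from about 1400 downloaded patents to 100 (twenty per search string, as detailed in Section 4.1) can be sketched as a random selection per category. The function name, the seed, and the file names are hypothetical; the counts for PP313, 1052, and Swedish follow the text, while the counts for design and human are assumed for illustration.

```python
import random


def sample_per_category(patents_by_category, per_category=20, seed=0):
    """Pick `per_category` patents at random from every category."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    selection = {}
    for category, patents in patents_by_category.items():
        # Never ask for more patents than the category contains.
        count = min(per_category, len(patents))
        selection[category] = rng.sample(patents, count)
    return selection


# Hypothetical file names; counts for "design" and "human" are assumptions.
category_sizes = {"PP313": 182, "1052": 442, "Swedish": 278, "design": 250, "human": 250}
patents_by_category = {
    category: [f"{category}_{i}.txt" for i in range(size)]
    for category, size in category_sizes.items()
}
selection = sample_per_category(patents_by_category)
total_selected = sum(len(files) for files in selection.values())
```

Sampling the same number of patents from every category keeps the small data set balanced, so no single category dominates the accuracy.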

3.2 Design Science research

3.2.1 Design Science research method

This thesis uses the design science method. It uses existing machine learning techniques to design an experiment and compare results. In this thesis, we develop an artifact to measure the accuracy of different machine learning techniques. Therefore, we consider the design science method more suitable than the experimental method (Wieringa, 2014). The experimental research method involves the distinction between two basic conditions, exposed and unexposed independent variables (Tanner, 2002). That is, there are an experimental group and a control group, and an experiment can have multiple experimental and control conditions (Tanner, 2002). The research in this article is based on the application and discussion of patent classification, so experiments are not suitable as the research method.

3.2.2 Results comparison

The training workflow reads the data set, processes it, applies the machine learning techniques, and analyzes the results. Comparing the results shows which machine learning techniques and which part of a patent perform better.

4 Research procedure and implementation

The automatic classification of patent text falls in the general field of text classification, so this paper combines text classification techniques with machine learning algorithms to classify patent text automatically. Figure 5 shows the structure of the automatic patent text classification procedure. The workflow of the procedure is as follows:

Firstly, the experiment needs a data set. We analyzed the official US patent website, implemented a patent collection program, and downloaded more than 1400 patent abstract texts in the fields of design, medicine, plant, Swedish, and human. The five words serve as search strings in our program, which downloads the patents automatically. The IPC standard has eight categories, from which we randomly chose plant, human, medicine, and design. Because we do our research in Sweden, we added Swedish as the fifth search string.

Secondly, we analyze the downloaded patent documents and extract the main data, including the patent title, classification code, and patent content. The processed data set is then randomly divided into a training set and a test set according to a fixed ratio.
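The random division into a training set and a test set can be sketched as follows. The text does not state the ratio, so the 70/30 split, the seed, and the function name are assumptions made only for illustration; in the actual workflow this step is carried out inside KNIME.

```python
import random


def split_dataset(records, train_ratio=0.7, seed=0):
    """Shuffle the records and cut them into a training and a test set."""
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records standing in for the 100 selected patents.
records = [f"patent_{i}" for i in range(100)]
train_set, test_set = split_dataset(records)
```

Shuffling before cutting matters because the downloaded files are grouped by search string; without it the test set would contain only the last categories.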

Thirdly, we preprocess the training set and the test set separately. The preprocessing tasks include stop word removal, feature extraction, and so on.

Then, for the training set, we use the TF/IDF method to construct the text vectors and the decision tree, random forest, XGBoost, and SVM techniques to train the classification models. TF/IDF shows the best performance in feature word extraction (Berthold & Thiel, 2012). SVM and XGBoost perform well in classification (Lee, Kwon, Myeongjung, & Kwon, 2018) (Wang, et al., 2020). Decision trees and random forests are commonly used in classification (Lee & Hsiang, 2019). Therefore, we chose these techniques for our research.

Finally, we build the classifier and use the test set to evaluate the classification accuracy of the model.

Figure 5. Patent classification system.

4.1 Data collection

The automatic classification of patent text requires a patent text data set, and there is currently no open patent text data set ready for use. After studying the US patent website, we developed a program that automatically requests several categories of patents within a defined field, which requires crawler technology. Web crawling is the process of downloading the content behind a URL from the server to the local machine, much as a browser does: the client sends an HTTP request for the URL to the server, and the response returned by the server is interpreted and displayed to the user (Vadivel, Shaila, Mahalakshmi, & Karthika, 2012).

We randomly chose plant, Swedish, design, human, and medicine as search strings and downloaded about 1400 patents from the US patent website. From these, we randomly chose 100 patents, because KNIME could process no more than 100 patents at one time on our laptop. We therefore took twenty patents from each category, one hundred in total.

The patent collection program gathers patent text in different fields. The patent content includes drawings, patent code, application date, publication date, IPC classification number, applicant, inventor, abstract, description, and other information. Our patent text classification program needs only part of this information, so we use Beautiful Soup to parse the pages (Richardson, 2020) and extract the description, the category, and the title of the invention.

In short, there are three main steps to obtain the text data. The first step is to obtain the relevant patent names through the keyword search and to obtain the patent code number from each patent document name. The second step is to find the link to the corresponding single-page pdf preview from the patent code number. The last step is to get the link to the full pdf document from the single-page pdf preview page. We use the Beautiful Soup library in PyCharm for the whole data collection step. We did not create an interface for entering a search string; the keywords are hard-coded as input, and the program is then run to get the patent files. This is the first step in our research, and the collected data can then be used in the classifiers.
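The last of the three steps, finding the link to the full pdf document on a preview page, can be sketched as follows. To keep the example free of dependencies it uses Python's built-in html.parser instead of Beautiful Soup, which is what the collection program actually uses. The page markup, the URL, and all names are hypothetical, since the layout of the patent website is not reproduced in this paper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect the targets of <a href="..."> links that point to PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the address of the page.
            self.pdf_links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_links


# Hypothetical preview page; the markup of the real site may differ.
preview_page = """
<html><body>
  <a href="/help">Help</a>
  <a href="docs/0001234-full.pdf">Full document</a>
</body></html>
"""
links = find_pdf_links(preview_page, "https://patents.example/preview/0001234/")
```

In a real crawler the page text would come from an HTTP response rather than from a string, and the same link-following idea would be applied once per step.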

4.2 Data preprocessing

This part aims at removing noisy words and useless data from the patents, leaving a vocabulary that is valuable for feature word extraction. From this step on, this research uses the KNIME application as the analysis tool. Figure 6 shows the workflow of data import and preprocessing. The data preprocessing step loads, cleans, and prepares the data for classification. An Excel reader node reads the data, and a Strings to Document node then converts the specified strings in the file into documents. For each row, a document is created and attached to the row. Through this node, we confirm the category to which the text data belongs and put the category and the corresponding document together in one row.

Preprocessing is required because the corpus was collected by a crawler: the crawled content contains some HTML tags, and the non-text parts of the data also need to be removed. The preprocessing step in the workflow consists of the following six nodes.

• Punctuation Erasure

The first node removes all punctuation marks from the data, which allows the machine to process the text data better (Berthold & Thiel, 2012).

• N Chars Filter node

The second node removes all words with fewer than N characters from the text data, which also makes it easier to process (Berthold & Thiel, 2012). The N value set in this article is 3, so all words shorter than three characters, such as is, am, to, and of, are deleted.

• Number filter

The third node removes the numbers in the text data, including integers, decimals, and negative numbers (Berthold & Thiel, 2012).

• Case Converter

The fourth node deals with proper nouns and capitalized words in a patent text. It converts all characters in the text data to lowercase in order to unify the data (Berthold & Thiel, 2012).

• Removal of stop words

The fifth node is the stop word filter, which removes English stop words (Berthold & Thiel, 2012). Stop words are common words that are not representative of a document. This step and the second step (N Chars Filter) pursue similar goals; because the second step has already removed the shortest words, this step runs faster and the combined filtering is more comprehensive. Text data processed by these two steps is more convenient for the subsequent classification (AG, 2020).

• Snowball stemmer

The sixth node is the Snowball stemmer, which finds the stem of each word. English words take several forms, such as singular and plural or different tenses. The stemmer node maps these different forms of the same word to one stem, which facilitates subsequent data processing (Berthold & Thiel, 2012).

Figure 6. Processing data workflow. Source: Thiel, 2014
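The six preprocessing nodes can be imitated in plain Python to make their order and effect concrete. This is a sketch under stated assumptions and not the KNIME implementation: the stop word list is a small illustrative subset, and naive_stem is a crude suffix stripper standing in for the Snowball stemmer. All names are hypothetical.

```python
import re
import string

# Small illustrative stop word list; a real English list is much longer.
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "are", "was", "which"}
NUMBER_PATTERN = re.compile(r"^[+-]?\d+([.,]\d+)?$")


def naive_stem(word):
    """Very rough suffix stripping, a stand-in for the Snowball stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, min_chars=3):
    # 1. Punctuation erasure
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()
    # 2. N chars filter: drop words shorter than min_chars characters
    words = [w for w in words if len(w) >= min_chars]
    # 3. Number filter: drop tokens that are numbers
    words = [w for w in words if not NUMBER_PATTERN.match(w)]
    # 4. Case converter: lowercase everything
    words = [w.lower() for w in words]
    # 5. Stop word filter
    words = [w for w in words if w not in STOP_WORDS]
    # 6. Stemming
    return [naive_stem(w) for w in words]


sample = "The 2 washing machines are cleaning 1400 plates, and drying them."
tokens = preprocess(sample)
```

For the sample sentence the sketch returns wash, machin, clean, plat, dry, and them. The truncated stems show why a stemmer is used only to group word forms and not to produce readable words.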

4.3 Feature words extraction

After the feature words in the patent text have been obtained, the computer cannot process them directly. They therefore need to be converted into a format that the computer can handle (Wang, et al., 2020). Figure 7 shows the workflow. This phase first uses the Bag of Words (BoW) node. BoW ignores the contextual relationships between words in the text and considers only the weight of each word (Berthold & Thiel, 2012). The weight is related to the frequency with which the word appears in the text. This part consists of three steps: tokenizing, statistically revising the word feature values, and normalizing (Thiel, 2014). The bag-of-words model first performs word segmentation.

After word segmentation, counting the occurrences of each word in a text gives the word-based features of the text. Putting the words of each text sample together with their frequencies vectorizes the text. After vectorization, the TF-IDF method is used to correct the feature weights, the features are normalized, and the data is passed to the machine learning techniques for classification (Aggarwal C & Zhai, 2012) (Berthold & Thiel, 2012).

TF-IDF is the abbreviation of Term Frequency-Inverse Document Frequency, namely “Word Frequency-Inverse Text Frequency” (Aggarwal C & Zhai, 2012). It consists of two parts, TF and IDF. The technical description is given in part two of this paper.

frequency. For example, although the word frequency of “to” is high, the word appears in almost all texts and thus does not help in classification. Its importance should be lower than that of a word related to the field of the patent, even though such a word has a lower frequency. IDF reflects the importance of a word and is used to correct the word feature value, which would otherwise be expressed only by the word frequency. IDF reflects how frequently a word appears across all texts. If a word appears in many texts, its IDF value should be low. Conversely, if a word appears in relatively few texts, its IDF value should be high. Professional terms such as “Machine Learning” are an example: the IDF value of such words should be high. In the extreme case, if a word appears in all texts, its IDF value should be zero. Figure 7 shows the workflow for extracting feature words (Guo & Yang, 2016).

Figure 7. Feature words extraction workflow. Source: Thiel, 2014
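The bag-of-words counting, TF-IDF weighting, and normalization described in this section can be sketched in plain Python. The exact TF-IDF variant used by the KNIME nodes is not stated here, so the sketch assumes relative term frequency, IDF = log(N / df), and L2 normalization; the function name and the three tiny documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Turn tokenized documents into L2-normalized TF-IDF vectors."""
    n_docs = len(documents)
    vocabulary = sorted({word for doc in documents for word in doc})
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    # IDF = log(N / df), which is zero for a word found in every document.
    idf = {word: math.log(n_docs / doc_freq[word]) for word in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)  # bag of words: order is ignored
        length = len(doc) or 1  # avoid dividing by zero for empty documents
        weights = [counts[word] / length * idf[word] for word in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, idf, vectors


# Hypothetical tokenized documents.
docs = [
    ["plant", "to", "grow", "plant"],
    ["speaker", "to", "sound"],
    ["plant", "to", "sound"],
]
vocabulary, idf, vectors = tf_idf_vectors(docs)
```

In this example the word to occurs in all three documents, so its IDF and therefore its weight is zero, while grow, which occurs in only one document, receives the highest IDF.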

4.4 Classification techniques

Research reports from the past decades show that the text classification problem has been studied extensively and solved in many practical applications. Especially with the deepening of research on natural language processing (NLP) and text mining, many researchers are currently interested in developing applications that use text classification methods (IPscreener, 2019). Most text and document classification systems can be decomposed into four stages: feature extraction, dimensionality reduction, classifier selection, and evaluation. Patent documents are one direction in text classification research. Because manual patent classification is too costly, replacing or supporting it with machine classification is a meaningful problem (Wang, et al., 2020). After extracting the feature words, we use KNIME to build and apply the workflows. We start with the SVM method. The workflow is depicted in Figure 8.

Figure 8. SVM technique workflow. Source: Berthold & Thiel, 2012

Then, we use the second technique, the XGBoost algorithm. The workflows are shown in Figure 9 and Figure 10.

Figure 9. XGBoost linear technique workflow. Source: Berthold & Thiel, 2012

Figure 10. XGBoost tree technique application. Source: Berthold & Thiel, 2012

Then, we use the decision tree and random forest techniques. The workflows are shown in Figure 11 and Figure 12.

Figure 11. Decision tree technique. Source: Berthold & Thiel, 2012

Figure 12. Random forest technique. Source: Berthold & Thiel, 2012

5 Discussion and conclusions

This section includes a discussion of the method, a discussion of the findings, and a comparison of the results.

5.1 Discussion of method

Some researchers classified more than 600,000 patents and obtained a highest precision of 83.98 when they used only the abstract and title as input data (Lee & Hsiang, 2019). We used 100 patents and obtained a highest precision of 84.5 when we used the abstract as input data. This agreement supports the reliability of our results. Besides, we reached a similar conclusion that the SVM technique shows the best performance, which supports the validity of our research. However, we used a small data set because of the limitations of the laptop. Using a large-scale data set in the classifiers could therefore improve the work and give more reliable results.

In our research, we combine the DT, RF, SVM, and XGBoost techniques with the TF/IDF method in the classifiers. The results offer a comparison of the different techniques. Since different scholars use different parts of a patent to study classification, this thesis uses the abstract, description, and claims as input data. The abstract is the most common choice among scholars (Christopher, Lin, & Spieckermann, 2011). A small number of scholars mentioned that the claims can be used as well (Lee & Hsiang, 2019). Besides, some researchers mentioned that the description as input data might achieve good patent classification performance (Suominen, Toivanen, & Seppänen, 2017). Therefore, we use the abstract, claims, and description of a patent as three research objects for classification and comparison. The results show which part is suitable for patent classification. In short, we think our method is suitable for our research and the results are reliable.

5.2 Discussion of findings

5.2.1 Results and analysis

This part shows the results of the experiment. Recall and precision are reported only for the category called 1052. Results are available for all five categories, but to keep the presentation clear and simple, one randomly chosen category is shown.

Table 1. Classification results using only description data

Technique                              Accuracy   Recall on 1052   Precision on 1052
1. SVM                                 92.86%     0.85             0.75
2. XGBoost linear                      88.89%     0.875            0.778
3. XGBoost tree                        78.57%     0.724            0.714
4. Decision tree (Gradient Boosted)    87.5%      0.8              0.75
5. Random forest                       85.75%     0.875            0.875

Table 1 gives a clear comparison of accuracy. The SVM technique has an accuracy of 92.86%, so it performs better than the other methods. The XGBoost linear technique has an accuracy of 88.89%, the second-best performance. SVM and XGBoost linear therefore perform better than the other techniques when the data object is the patent description. The decision tree and random forest techniques come close to XGBoost linear, with accuracies of 87.5% and 85.75% respectively. The XGBoost tree technique performs poorly when only the description is used as text data; its accuracy is only 78.57%.

Table 2 shows the results when only the abstract is used as the text data.

Table 2. Classification results using only abstract data

XGBoost tree shows the best performance when only the abstract is used as text data, because it has the highest accuracy. The SVM technique follows with an accuracy of 89.81%. XGBoost linear and random forest perform similarly, with accuracies of 87.75% and 88.91% respectively. The decision tree technique shows the worst performance, with an accuracy of only 83.83%. Therefore, XGBoost tree and SVM perform better when only the abstract is taken as the object of data processing.

Table 3 shows the results of using only the claims as the patent text data.

Table 3. Classification results using only claims data text

Based on the result table, the XGBoost tree technique shows the best performance because it has the highest accuracy. The SVM and decision tree techniques have the same accuracy, 63.64%. XGBoost linear and random forest do not show high accuracy. Among the five techniques, only XGBoost tree shows an acceptable performance, which means that only XGBoost tree could be applied when the claims are taken as the object of patent text data. Besides, when we feed only the claims part to our models, none of the results is good.

5.2.2 Comparison results

Comparing the three experiments shows which part of a patent is best suited as the object of classification. With the description, automatic classification works best: the accuracy exceeds 92% with the SVM technique. The abstract is also relatively good, with an accuracy of around 90%. The claims perform worse, with an accuracy below 73%. Therefore, it is recommended to use the description in automatic patent classification and to feed it as input data to the classifier.

Table 4 compares the accuracy.

Table 4. Comparison of accuracy

Technique                           Description   Claims    Abstract
SVM                                 92.86%        63.64%    89.81%
XGBoost linear                      88.89%        62.23%    87.75%
XGBoost tree                        78.57%        72.73%    90.90%
Decision tree (Gradient Boosted)    87.5%         63.64%    83.83%
Random forest                       85.75%        61.76%    88.91%

The comparison of the machine learning algorithms shows that the SVM method is the best and the XGBoost tree method is second. They perform better than the other techniques, especially when the text data is the description part. The random forest and decision tree algorithms are therefore not very suitable for automatic patent classification, particularly when only a small data set is available.
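The comparison per input part can be reproduced from the accuracy values in Table 4 with a short script. The numbers are copied from the table; the dictionary layout and the function name are only one possible, hypothetical way to organize them.

```python
# Accuracy values in percent, copied from Table 4.
accuracy = {
    "SVM": {"Description": 92.86, "Claims": 63.64, "Abstract": 89.81},
    "XGBoost linear": {"Description": 88.89, "Claims": 62.23, "Abstract": 87.75},
    "XGBoost tree": {"Description": 78.57, "Claims": 72.73, "Abstract": 90.90},
    "Decision tree": {"Description": 87.5, "Claims": 63.64, "Abstract": 83.83},
    "Random forest": {"Description": 85.75, "Claims": 61.76, "Abstract": 88.91},
}


def rank_techniques(table, part):
    """Return the technique names ordered from highest to lowest accuracy."""
    return sorted(table, key=lambda technique: table[technique][part], reverse=True)


parts = ("Description", "Claims", "Abstract")
best = {part: rank_techniques(accuracy, part)[0] for part in parts}
```

With the values of Table 4, the highest accuracy belongs to SVM for the description and to XGBoost tree for both the claims and the abstract.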

The second performance metric we use for the comparison and analysis of techniques is the recall value. Table 5 shows the results of recall value.

Table 5. Comparison of recall

Technique                           Description   Claims   Abstract
SVM                                 0.85          0.667    0.85
XGBoost linear                      0.875         0.667    0.875
Decision tree (Gradient Boosted)    0.8           0.667    0.875
Random forest                       0.875         0.818    0.809

The recall value is the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document library, that is, the recall rate of the retrieval system. The closer the recall value is to 1, the better the performance. According to the table, XGBoost linear and random forest perform better than the others when the description is the text input. The random forest technique shows the best performance when the claims are the input data. The XGBoost linear and decision tree techniques perform better than the others when the abstract is chosen as the input data.

Table 6 illustrates the precision value result of category 1052.

Table 6. Comparison of precision

Technique                           Description   Claims   Abstract
SVM                                 0.75          0.65     0.667
XGBoost linear                      0.778         0.667    0.778
XGBoost tree                        0.714         0.875    0.780
Decision tree (Gradient Boosted)    0.75          0.65     0.75
Random forest                       0.875         0.55     0.845

According to Table 6, random forest and XGBoost linear perform better when the description is used as text data. The XGBoost tree and random forest techniques perform better when the abstract is used as the processing data. The XGBoost tree technique performs better when the claims are used as the data input.

5.2.3 Summary of results

Table 7 shows the comparison results from accuracy, recall value, and precision value.

Table 7. Comparison of performance

             Description                           Claims                         Abstract
Accuracy     SVM, XGBoost tree                     XGBoost tree                   SVM, XGBoost tree
Recall       SVM, XGBoost linear, Random forest    XGBoost tree, Random forest    SVM, XGBoost linear, Decision tree
Precision    XGBoost linear, Random forest         XGBoost tree                   XGBoost tree, XGBoost linear

Table 7 serves as a summary to answer the two research questions of this thesis. The first question is: which part of a patent performs best as input data for automatic classification? The description part shows the best accuracy. We can therefore affirm that the description of a patent can be used as the main data part to accomplish automatic classification of patent documents effectively.

The second research question is: which of the implemented machine learning algorithms performs best regarding the classification of IPC keywords? The answer depends on the data input. SVM is the most suitable technique when the patent description text is used. On the other hand, the XGBoost tree technique should be used when the abstract or claims text is the main input data.

5.2.4 Comparison with other studies

Most researchers report that the SVM technique shows the best performance in patent classification, and most of them use the abstract or title as input data. We obtained a similar result: the SVM technique shows the best performance when we use the description as input data. When we use the abstract as input data, however, the XGBoost tree technique performs best. We think the first reason is that those researchers did not include XGBoost when they used the abstract for classification. The second reason might be that the description of a patent performs better than the abstract as input data.

However, some researchers mentioned that “using patent claims alone is sufficient for classification” (Lee & Hsiang, 2019). We obtained a different result: in our research, the patent claims were not suitable for patent classification. The reason might be that our test data is smaller than theirs, so that our accuracy with claims as input data is lower. Besides, they compared the claims with the abstract and title and found that the claims perform better. Had they also compared the results with the description, the description might have performed better.

5.3 Conclusion and perspective

Generally, in a text classification task the text document is represented as a fixed-length feature vector. Machine learning methods are then used to train a model and make predictions based on these feature vectors. The simplest and most common text representation is the bag-of-words model, which is the method used in this article. It counts the words in each document to form the feature vector. The weights of the feature vector are usually based on the TF-IDF weighting technique. The advantage of TF-IDF feature weighting is that it makes full use of the distinguishing words in a document. However, this method also has shortcomings. Simple word counting cannot capture the deep semantic information of words and may cause high-dimensionality problems (Tseng & Lin, 2007). Nevertheless, the performance in this article is still relatively good.

To better answer the research questions in this article, we ran experiments on three different data inputs, namely the abstract, description, and claims. According to the results, the description part is the most important data, with which we can achieve the best performance in English patent text classification. The description text can be used as the main classification input, with the abstract as an auxiliary input. However, the classification based on the claims part proposed by some scholars did not achieve good performance in our research.

In this study, the classification of patent texts is based on only a small sample data set, and the text corpus includes only the patent abstract, description, and claims, which are studied separately. Other patent-related information has not been fully used. The classification still needs more data and further research and verification before its results can be applied to the analysis of patents. On the other hand, with the development of technology and the combination of deep neural networks with natural language processing, classification methods based on word vectors and convolutional neural networks have also proved their efficiency. In the future, we can continue to study these methods and use them in further research on the automatic classification of patent texts.
