
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-A--19/007--SE

Relevance feedback-based optimization of search queries for Patents

Sijin Cheng

Supervisor: Marco Kuhlmann | Examiner: Arne Jönsson



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Sijin Cheng


Abstract

In this project, we design a search query optimization system based on the user's relevance feedback, generating customized query strings for existing patent alerts. Firstly, the Rocchio algorithm is used to generate a search string by analyzing the characteristics of related and unrelated patents. Then a collaborative filtering recommendation algorithm is used to rank the query results; it considers the previous relevance feedback and patent features, instead of only the similarity between query and patents as traditional methods do.

In order to further explore the performance of the optimization system, we design and conduct a series of evaluation experiments using TF-IDF as a baseline method. Experiments show that, with the generated search strings, the proportion of unrelated patents in the search results is significantly reduced over time. In 4 months, the precision of the retrieved results improves from 53.5% to 72%. Moreover, the ranking performance of the proposed method is better than that of the baseline method: in terms of precision, the top-10 results of the recommendation algorithm are about 5 percentage points higher than the baseline, and the top-20 results about 7.5 percentage points higher. It can be concluded that the proposed approach can effectively optimize patent search results by learning from relevance feedback.


Acknowledgments

I would like to express my sincere gratitude to everyone who helped me during the thesis.

Thanks to IamIP for offering this thesis proposal and giving me the opportunity to do such an interesting project. In particular, thanks to my supervisor Falah Hosini for being supportive and engaged in our project, and many thanks for his valuable time spent on describing the requirements and rating the datasets. I also want to thank my supervisor at Linköping University, Marco Kuhlmann, for helping me with many key technical problems. He could always quickly point out the problem and give me useful advice to pull me out of confusion. A big thank-you to my examiner, Arne Jönsson, for his expectations, valuable comments and kind understanding. I am also grateful to Professor Kristian Sandahl at Linköping University for making a structured and clear time schedule for the project. Many thanks to Ola Leifler for his honest reviews of and comments on the thesis report.

Further, I want to thank my supervisor at Harbin Institute of Technology (HIT), Tianyi Zang. He helped me a lot with understanding the requirements, and provided suggestions, comments and guidance on the paper.

Finally, I would like to thank everyone mentioned above for their honest and encouraging comments. Last but not least, I thank my family and friends for their support and understanding during my hard time.


Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables ix

1 Introduction 1

1.1 Motivation . . . 1

1.2 Aim . . . 3

1.3 Research questions . . . 3

1.4 Delimitations . . . 3

2 Theory 4

2.1 Text preprocess . . . 5

2.2 Feature extraction . . . 5

2.3 Text representation . . . 6

2.4 Query reformulation algorithm . . . 7

2.5 Recommendation Algorithm . . . 7

2.6 Model optimization . . . 12

2.7 Evaluation method . . . 13

3 Method 15

3.1 The environment of system implementation . . . 15

3.2 Dataset . . . 15

3.3 Implementation Process . . . 17

3.4 Evaluation . . . 23

4 Results 27

4.1 The number of extension words . . . 27

4.2 Parameters . . . 30

4.3 Dataset size . . . 31

4.4 Rank algorithms . . . 35

4.5 Iterative experiment . . . 37

5 Discussion 48

5.1 Results . . . 48

5.2 Method . . . 50

5.3 The work in a wider context . . . 51


List of Figures

2.1 The architecture of query optimization system . . . 5

2.2 An application of Rocchio algorithm . . . 8

2.3 Architecture diagram of recommendation engine . . . 9

2.4 Decomposition of "user-item" rating matrix . . . 11

2.5 Precision and recall . . . 13

3.1 Entity relationship diagram (ERD) . . . 16

3.2 Function modules diagram . . . 17

3.3 Processes of patent preprocess . . . 18

3.4 Flowchart of Query reformulation . . . 20

3.5 Flowchart of Patent query . . . 22

3.6 The architecture of the recommendation engine . . . 22

3.7 Flowchart of recommendation engine . . . 23

3.8 Distribution of related patents and unrelated patents in the dataset . . . 24

4.1 The distribution of relevant patents in the search results varies with the number of extended words . . . 28

4.2 Precision varies with the number of extension words . . . 29

4.3 Recall varies with the number of extension words . . . 29

4.4 Relative recall rate varies with the number of extension words . . . 30

4.5 The number of related patents in the results changes with the parameters . . . 31

4.6 Precision of the results changes with the parameters . . . 32

4.7 Recall of the results changes with the parameters . . . 32

4.8 Relative recall of the results changes with the parameters . . . 33

4.9 Size of all test datasets . . . 33

4.10 Proportion of related patents and unrelated patents in each dataset . . . 34

4.11 The relationship between the number of related patents and the size of the dataset . . . 35

4.12 Precision varies with dataset size . . . 35

4.13 Recall varies with dataset size . . . 36

4.14 Relative recall varies with dataset size . . . 36

4.15 Precision of top-10 results . . . 37

4.16 Precision of top-20 results . . . 38

4.17 Recall of top-10 results . . . 38

4.18 Recall of top-20 results . . . 38

4.19 Statistical characteristics of the original Patents . . . 39

4.20 Distribution of related patents in weekly sub-dataset . . . 40

4.21 The number of related patents in the results changes with time . . . 42

4.22 Precision of the results changes with time . . . 42

4.23 Recall of the results changes with time . . . 43

4.24 The number of related patents changes with time (Month) . . . 43

4.25 Precision of the results changes with month . . . 44


4.27 Top-20 Precision of the results changes with time (week) . . . 46

4.28 Top-10 Precision of the results changes with time (month) . . . 46

4.29 Top-20 Precision of the results changes with time (month) . . . 47


List of Tables

2.1 Example word list of stemmed words and original words . . . 6

2.2 user-patent scoring matrix . . . 9

2.3 An example of User-based CF . . . 10

2.4 An example of Item-based CF . . . 10

3.1 Development tools and environment . . . 15

3.2 The statistics of the IamIP dataset . . . 16

3.3 patent_info . . . 16

3.4 company_info . . . 17

3.5 Alert_administration . . . 17

3.6 ranking . . . 17

3.7 Rocchio algorithm . . . 21

3.8 weight vector . . . 21

3.9 The statistics of the dataset (cancer research and development) . . . 24

3.10 The distribution related to the subject (DNA, RNA, Protein Sequence) . . . 24

3.11 The statistics of subdatasets . . . 25

4.1 Generated queries on setting different numbers of extended words . . . 27

4.2 Search results of running generated queries on sub-dataset d8 . . . 28

4.3 Generated queries under different parameter settings . . . 30

4.4 Results of running generated queries on sub-dataset d8 . . . 31

4.5 Relative recall . . . 32

4.6 Sub-datasets of d6 . . . 33

4.7 Generated queries on different datasets . . . 34

4.8 Results of running generated queries on sub-dataset d8 . . . 34

4.9 Relative recall . . . 36

4.10 Precision and recall rate of related patents in Top-10 and Top-20 . . . 37

4.11 The statistical characteristics of the weekly dataset . . . 39

4.12 Generated query from original patents . . . 39

4.13 Subscription results of the first month . . . 40

4.14 Training dataset and generated query of the second month . . . 40

4.15 Subscription result of the second month . . . 41

4.16 Training dataset and generated query of the third month . . . 41

4.17 Subscription results of the third month . . . 41

4.18 Training dataset and generated query of the fourth month . . . 41

4.19 Subscription results of the fourth month . . . 42

4.20 The number of related patents (month) . . . 43

4.21 The precision of the search results every month . . . 44

4.22 The number of related patents in the top10 and top20 of the results . . . 44

4.23 The precision of the top10 and top20 results . . . 45

4.24 The number of related patents every month . . . 45


1 Introduction

1.1 Motivation

In recent years, science and technology have been developing rapidly, the importance of intellectual resources has become increasingly prominent, and intellectual property has become the core of competition among enterprises. Patents, as the most typical form of intellectual property, are the largest and fastest-growing source of technical information. According to a report issued by the Patent Technology Monitoring Team (PTMT) [38], since 2013 more than 300,000 patents have been granted each year at the U.S. Patent and Trademark Office (USPTO). While people benefit from the progress of knowledge, they increasingly demand to obtain patents more efficiently and accurately. As the amount of information surges, however, the efficiency of discovering target patents among a large number of patents decreases. This phenomenon is called information overload.

One of the solutions to this problem is information retrieval technology, which can be implemented in search engines. With the development of the Internet and database management systems, a large number of information management and retrieval systems have been built. These systems play an irreplaceable role in helping users obtain information efficiently. However, patents have some special characteristics compared with other scientific information:

1) The particularity of the writing style. When writing a thesis, the author usually uses familiar descriptions, which makes it easier for readers to understand the author's expressions. In patent writing, however, in order to expand the scope of protection of the patent and increase the possibility of patent authorization, applicants are likely to use vague terminologies and expressions. Sometimes they even create new terms for patent writing.

2) The complexity of the data format. A complete patent contains much information, such as the international filing date, priority(s), inventor(s), International Patent Classification (IPC), Cooperative Patent Classification (CPC), simple family member(s), claims and drawings. The field types cover raw text, date information, classification codes, and graphics.

In the content of the patent document, the IPC and CPC are classification numbers, which can be used to retrieve target patents in a patent search engine. However, this approach has some limitations. On the one hand, the patent classification number is not always known to the user. Users usually want patents related to a specific topic or technique; however, they are sometimes unfamiliar with the specific terms and therefore have trouble in their searches. In this case, the search query cannot always accurately locate the patents [40]. On the other hand, technology is constantly changing. As time goes on, new competitors enter the market and new materials emerge, which are difficult to observe with static search queries. In order to grasp the trends of important technologies and materials, discover the state of technological development, and allocate valuable resources rationally, a company needs to keep a continuous watch on the patent activities within its field and capture the most advanced and competitive information among the numerous updated patents. This requirement can typically be accomplished by setting a search alert, which informs the user whenever new patents related to the search query are published. IamIP is one of the companies that supply services for setting patent search alerts and managing patents. The aim of IamIP is to revolutionize intellectual property management, which is currently manual and paper-based. IamIP's platform is an electronic patent search and patent management tool; its dataset of about 100 million patents is updated every week [24]. A customer can set a search alert by defining an initial search query to capture the patents for a specific technology. At present, a search alert system whose search string is continuously updated and optimized is desired, so that changes in users' preferences can be followed.

In order to make search queries update and optimize continuously, we try to make full use of the user's feedback on previous patents, exploring the user's preferences by analyzing feedback scores and applying them to optimize search queries. For example, an environmental company is currently focusing on new garbage recycling technologies. They have set a patent alert on IamIP's platform, which alerts them whenever patents related to recycling technology are newly published. However, some unrelated patents are unavoidably sent to this customer, forcing them to spend much time manually filtering out unrelated patents every week. Our optimization system could help to speed up their work by understanding their feedback on the previous patents, and reduce repetitive work. Ideally, the proportion of related patents in the results of the search alert increases as time goes on. This application scenario is highly similar to that of recommendation systems [1], which make use of the user's previous feedback, identify the user's preferences, and recommend results that may be interesting to the user.

A recommendation system is essentially a system that predicts whether a user likes an item, and how much [1]. Recommendation algorithms have been widely used in many fields such as e-commerce, movie websites, personalized music services, social networks and reading. For example, YouTube recommends tailored videos to users, and Twitter and Facebook recommend friends to users based on their social networks [1]. Notably, Amazon has grown rapidly in recent years thanks to its integration of the recommendation system into the whole process of product discovery, user purchase and payment. As estimated by a Wall Street analyst, the purchase conversion rate of Amazon's online recommendation system is up to 60% [32]. They calculate the similarity between items offline; when a user logs in, the system checks the user's purchase history and then shows all the items that are similar to these history items using some ranking (such as a best-selling ranking). Another example is Cinematch, a movie recommendation system developed by Netflix [18][17]. Users can score movies, and the Cinematch system then calculates multiple regression correlations online based on these user scores. Finally, it recommends movies with high predicted scores to the user. Similarly, Google News uses the same strategy to discover user preferences for different categories of news based on user tags, thereby recommending news that may be interesting to users. However, they found that the news is updated very quickly, which led to a new problem: when a group of news articles appears, no users have scored them yet, so a score-based recommendation algorithm cannot judge whether the user would like the news or not. In this case, a content-based analysis was introduced into the original algorithm, establishing a blended recommendation model containing a score-based model and a content-based model [10].
Inspired by the idea of the recommender system, we decided to introduce a recommendation algorithm into the patent information retrieval system, to assist in optimizing patent retrieval. We combined a query reformulation algorithm and a recommendation algorithm to generate new search queries that better locate the technology topics and rank the patents related to the topic highly. In our design, the user's feedback on patents is the most important information; the feedback is considered in both the query reformulation process and the ranking process. The search query is reformulated by analyzing the feedback and the corresponding patent features. In this process, the features of the related patents are retained, while the features of the unrelated patents are re-learned. As for the recommendation algorithm, it first learns the user-patent rating pattern to build a recommendation model, and the model is then used for prediction. The optimization system we propose can be combined with the existing patent retrieval system. It can optimize the patent search process and reduce the repetitive work of reading patents. Furthermore, it can increase the efficiency of patent searchers and save users time and energy.

1.2 Aim

The aim of this project is to design a search query optimization system for an existing patent alert. The optimization should make full use of the user's feedback information and rank the most related patents highest. Further, the performance of the optimization system should be evaluated by designing and executing a series of experiments.

1.3 Research questions

The research questions explored in this paper are as follows.

1. How much can the proportion of unrelated results in patent searches be reduced by reformulating search queries after analyzing the user's relevance feedback?

2. Compared to the Term Frequency-Inverse Document Frequency (TF-IDF) method, does the recommendation algorithm work better for ranking retrieved results?

1.4 Delimitations

Our work focuses on the design and implementation of the optimization method. We have implemented all the core parts of the method and carried out experiments based on them. However, we did not implement a complete system that could collect data and run the optimization process automatically.


2 Theory

The great research value contained in patents arouses the strong interest of many researchers. In recent decades, many patent-related workshops have been set up for researchers to communicate, and some important international conferences and institutions contribute to these workshops. For example, in 2002, NTCIR (NII Testbeds and Community for Information access Research) established a special symposium for patent search; they also released patent datasets for different research topics, such as cross-language patent search, technical status research and patent classification. In 2009 and 2010, CLEF (Conference and Labs of the Evaluation Forum, formerly known as Cross-Language Evaluation Forum), an open evaluation platform for information retrieval in Europe, set up CLEF-IP as a topical seminar on patent search. They also provided about 1.3 million English patents for researchers to freely download and test.

Currently, the content of patent research can be divided into three main categories: 1) patent search; 2) in-depth analysis of patent content; 3) other research related to patents, such as patent forecasting and patent partner recommendation [47]. Some researchers have worked on introducing recommendation algorithms into patent information retrieval systems. Take Zhong Weijun as an example: after tracking and recording users' access to patents for a long time, he analyzed the patent documents that were frequently accessed, discovered relevant patents using association rules, and finally recommended these relevant patents to users. However, his approach needs to track the patent search records of a huge number of users, and it has to calculate the similarity of all patents. In contrast, Krestel et al. [28] used TF-IDF to pre-select candidate patents when trying to find similar patents for a given patent. This pre-selection greatly improves the efficiency of the recommendation algorithm. They also proposed a topic-based recommendation method combining a language model and a topic model, which introduced the Latent Dirichlet Allocation (LDA) model to patent recommendation to obtain more accurate results. Similarly, in our work, we first pre-select a candidate patent set from the total dataset, then select patents from the candidate set by combining a language model and the collaborative filtering recommendation algorithm. At the same time, in order to improve the credibility of the candidate set, a query reformulation algorithm is applied to form a better-targeted search string.

Figure 2.1 shows the architecture of the query optimization system that we designed. This section introduces some technologies related to text processing, including the methods of mapping text into a vector space, feature selection for text representation, and some language models [2].

Figure 2.1: The architecture of the query optimization system

2.1 Text preprocess

In the field of patent mining, research related to patent metadata is relatively well developed, and nowadays more and more researchers are interested in patent text mining. For patents, about 80% of the information is stored in the form of text. How to obtain useful information from this huge text resource has become a hot issue in the field of information retrieval. Since text is unstructured data, processing a large amount of it is very difficult, and a great deal of research has been conducted on methods for discovering knowledge from large text collections quickly and efficiently. Among this research, text representation is a vital issue, which can be regarded as a necessary prerequisite for all further work.

Before doing text representation, we need to preprocess the patent document. The first step of text preprocessing is tokenization. The word is the smallest meaningful unit in natural language [40]. The purpose of tokenization is to divide sentences into single words; the results of tokenization are used for further processing. Therefore, the accuracy of word segmentation is the basis of the performance of natural language processing.

After tokenization, some terms need to be ignored when building the vocabulary, a step called filtering. Stop words and punctuation usually need to be filtered out. Stop words are words that occur frequently in the text but carry little meaning, such as definite and indefinite articles ("the", "a", "an", "that", "those", etc.) and prepositions ("on", "at", "above"). Similarly, words that appear frequently in the texts but are not representative of the document should also be ignored [5].

Stemming is the last step of text preprocessing; it [44] considers the morphological changes of words. In other words, inflected forms of the same word should be regarded as the same word. PorterStemmer() is a popular stemmer function. Table 2.1 shows a list of words comparing the original words and the corresponding stemmed words. As we can see, some stemmed words may become hard for users to read and understand. For example, after stemming, 'surface' becomes 'surfac', which is difficult to understand.
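The preprocessing pipeline described above (tokenization, stop-word filtering, stemming) can be sketched as follows. This is only a toy illustration in pure Python: the stop-word list is abbreviated, and simple_stem is a simplified suffix-stripper that merely mimics the Porter examples in Table 2.1, not the real Porter algorithm (for that, a library stemmer such as NLTK's PorterStemmer would be used).

```python
import re

# Abbreviated stop-word list for illustration; real systems use larger lists.
STOP_WORDS = {"the", "a", "an", "that", "those", "on", "at", "above", "is", "of"}

def tokenize(text):
    """Split a sentence into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def simple_stem(word):
    """Toy suffix-stripping stemmer (NOT the full Porter algorithm).
    It reproduces the examples shown in Table 2.1."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a final consonant, e.g. "gripp" -> "grip"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def preprocess(text):
    """Tokenize, filter stop words, then stem the remaining tokens."""
    return [simple_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("A device comprising a gripping surface formed on the tool"))
# -> ['devic', 'compris', 'grip', 'surfac', 'form', 'tool']
```

Note how "comprising" becomes "compris" and "surface" becomes "surfac", matching the readability issue discussed above.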

Table 2.1: Example word list of stemmed words and original words

	stem      original word
	compris   comprising
	surfac    surface
	form      formed
	grip      gripping

2.2 Feature extraction

After text preprocessing, we have a group of words. We then need to select features and assign weights to them. If all words were used directly in the vector representing the text, the dimension of the text vector would be quite large, which leads to several problems. On the one hand, a high-dimensional text vector means a heavy workload for further computation. On the other hand, it influences the precision of the algorithms. The most effective way to solve this problem is to reduce the vector dimension by feature selection [23]. Therefore, it is necessary to find the most representative features of the text [44].

Feature selection

The main purpose of feature extraction is to decrease the number of terms to be considered as far as possible without damaging the meaning of the text [13]. By reducing the dimension of the text vector, the calculation can be simplified and the efficiency of text processing improved accordingly.

To reduce the feature terms, there are two methods:

1. set a fixed number as the maximum number of feature terms;

2. set a certain threshold and select the terms whose value is greater than the threshold. The specific value of the threshold depends on the selection method, such as document frequency, mutual information or information gain [19].

Weight calculation

After getting a group of features, the next step is to assign weights to them. Weight calculation methods include the Boolean weighting method, the word frequency statistics method and the TF-IDF method; the TF-IDF method is the most popular [29]. The idea of TF-IDF is based on the assumption that words that appear frequently in a text should also appear frequently in similar texts [42]. Therefore, the term frequency (TF) can reflect the characteristics of similar texts. In addition, the TF-IDF method introduces the concept of inverse document frequency (IDF), which considers how many documents contain a given term. The TF-IDF algorithm uses the product of TF and IDF as the feature weight. This method selects terms that occur with high frequency in one document but with low frequency in the other documents. That is, the terms with higher weights have a stronger ability to distinguish the document [29][34].
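As a sketch of the weighting scheme just described, the following computes TF-IDF weights for a few toy documents. The token lists are invented, and the exact formula used here (raw count for tf, log(N/df) for idf) is one common variant among several; practical implementations often add smoothing.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for a list of tokenized documents.

    tf  = raw term count within the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each term is counted once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["protein", "cell", "cancer"],
    ["protein", "binding", "protein"],
    ["patent", "claim", "cell"],
]
w = tf_idf(docs)
# "protein" occurs in 2 of the 3 documents while "binding" occurs in only 1,
# so in the second document "binding" receives the higher weight: it is the
# more distinguishing term, exactly as the text argues.
```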

2.3 Text representation

Bag-of-Words (BOW)

The bag-of-words is a traditional model for describing a document as a text vector, a process called vectorization. Once the documents are vectorized, the similarity between two documents can be calculated from the distance between their vectors. The traditional BOW can be regarded as the superposition of the one-hot vectors of the words [4][39]. Suppose the word vector of "protein" is [0, 0, 0, 0, 1, 0, ...] and the word vector of "cell" is [1, 0, 0, 0, 0, 0, ...]; then a text containing only the words "protein" and "cell" can be expressed as [1, 0, 0, 0, 1, 0, ...]. The problem of this model is obvious: it is difficult to effectively calculate the similarity between two documents using the vector distance, due to the high dimensionality and sparsity of the feature vector. However, the traditional BOW has several optimizations. One choice is to replace the non-zero values in the text vector with TF-IDF weights; such a feature vector performs better than the traditional BOW.
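The superposition idea can be illustrated directly. The six-term vocabulary below is a made-up example, chosen so that a text containing only "protein" and "cell" maps to the vector given in the text.

```python
def bow_vector(tokens, vocab):
    """Bag-of-words vector: the superposition of the tokens' one-hot vectors.

    vocab maps each term to its dimension index; out-of-vocabulary
    tokens are simply skipped.
    """
    vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec

# Hypothetical 6-term vocabulary matching the indices in the text's example
vocab = {"cell": 0, "dna": 1, "rna": 2, "claim": 3, "protein": 4, "patent": 5}

print(bow_vector(["protein", "cell"], vocab))  # -> [1, 0, 0, 0, 1, 0]
```

With a realistic vocabulary of tens of thousands of terms, such vectors become extremely long and mostly zero, which is exactly the high-dimensionality and sparsity problem noted above.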

Vector Space Model

To make calculation and manipulation convenient, we can represent text by weighting the feature words and establishing a mathematical model. The vector space model is a commonly used mathematical model for describing text vectors.

The Vector Space Model (VSM) transfers the processing of text to operations on vectors in a vector space [9]. When documents are represented as vectors in the space, the similarity between documents can be measured by calculating the similarity between the vectors, which is intuitive and easy to understand [9][33]. In addition, the importance of the feature terms can be reflected by setting different weights in the vector.
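In the VSM, similarity between document vectors is typically measured with the cosine of the angle between them. A minimal sketch follows; the three document vectors are invented TF-IDF-style weights over a tiny three-term space.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented 3-term weight vectors for three documents
d1 = [0.8, 0.0, 0.3]
d2 = [0.6, 0.1, 0.2]
d3 = [0.0, 0.9, 0.0]

# d1 and d2 share the same dominant term, so their similarity is high;
# d1 and d3 have no overlapping terms, so their similarity is 0.
```

Because cosine similarity depends only on the angle, it is insensitive to document length, which is one reason it is the standard choice in the VSM.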

2.4 Query reformulation algorithm

Query reformulation technology is widely used in information retrieval systems. In order to improve the results of the patent search, the query reformulation method is used for optimization [12][45]. Query reformulation refers to rewriting queries automatically, so that the query tends to retrieve more relevant documents. The original query is typically extended by adding terms from synonyms or from the initial retrieval results [35]. The user's patent relevance feedback can be used to train the word vectors; thereby, the words that are representative of relevant patents can be selected as expansion words to reformulate the query string.

In our design, we use the Rocchio algorithm to formulate the new query by merging the relevance feedback information into the vector space model. The process starts from the original query, then moves the new query closer to the centroid of the relevant documents and, at the same time, away from the centroid of the unrelated documents.

Rocchio algorithm

The Rocchio algorithm is a classical algorithm for using relevance feedback [8]. It provides a method of merging relevance feedback information into a vector space model. For example, given an original query in information retrieval, we have some feedback about relevant documents and unrelated documents. The purpose of this process is to find an optimal query vector q which has the highest similarity to the relevant documents and the lowest similarity to the unrelated documents. The Rocchio algorithm supplies an intuitive solution: the suggested optimal query vector q can be obtained by the following formula:

\vec{q} = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j

where q0 is the original query vector; Dr and Dnr are the sets of relevant and unrelated documents; and α, β, and γ are three weights. The new query starts from q0, moves towards the centroid of the relevant documents, and moves away from the centroid of the unrelated documents. If γ = 0, the system only pays attention to positive feedback and the weight of negative terms is ignored. The effect of applying the Rocchio algorithm to relevance feedback is shown in figure 2.2.
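As an illustration, the update above can be sketched in a few lines of Python. The vocabulary, feedback vectors and weight values below are invented for demonstration and are not the parameters used in this project.

```python
# A minimal sketch of the Rocchio update q = alpha*q0 + beta*centroid(Dr)
# - gamma*centroid(Dnr); all values here are invented for illustration.

def rocchio(q0, relevant, unrelated, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the updated query vector; all vectors are plain lists of floats."""
    dims = len(q0)

    def centroid(docs):
        # component-wise mean of a list of document vectors
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]

    cr, cnr = centroid(relevant), centroid(unrelated)
    return [alpha * q0[i] + beta * cr[i] - gamma * cnr[i] for i in range(dims)]

# vocabulary: [energy, heat, light, power]; original query "heat light"
q0 = [0.0, 1.0, 1.0, 0.0]
relevant = [[1.0, 0.5, 0.0, 0.0]]   # one document marked relevant
unrelated = [[0.0, 0.0, 0.5, 0.5]]  # one document marked unrelated
print(rocchio(q0, relevant, unrelated))
```

The resulting vector gains weight on terms that occur in the relevant document (here "energy") and loses weight on terms from the unrelated one.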

2.5 Recommendation Algorithm

Figure 2.2: An application of Rocchio algorithm

In recent years, recommendation systems have developed rapidly against the background of the information explosion. Recommendation is not a new technology; in fact, we have benefited from recommendation services for a long time. The newspaper is actually a recommendation system: its content is selected manually by the newspaper editor, a kind of manual recommendation. Another type of recommendation is the aggregation system; for example, bookstores have best-selling lists, cinemas have box office rankings, and there are time-based recommendations (such as up-to-date book recommendations). These recommendation systems are based only on the simplest statistics. Though the idea is simple, it is very effective in many cases. However, the two methods mentioned above, manual selection and simple aggregation, are not customized for the user: no matter who opens the newspaper or goes into the bookstore, they receive the same recommended information.

In IamIP's scenario, the methods mentioned above may also work to some extent. We could recommend patents that are selected manually by experienced patent searchers, or we could send the most recently updated patents to the user. However, the first method requires a lot of repetitive work, costing much time and energy, while with the second method the results may contain many unrelated patents, which would decrease user satisfaction. The purpose of our work is to recommend the most relevant patents to the user while increasing the efficiency of the process.

Typically, a recommendation system recommends items that the user may be interested in. To find these items, the system uses the user-item interaction scores, together with the features of items, to predict scores for unscored items based on the known scores. The item features can be extracted from the information of the item itself or from the social environment in which the user is located. Based on the method of extracting features, recommendation engines can be divided into the following four categories [1]:

A recommendation engine [31] uses special information filtering (IF) technology to recommend content (such as movies, music, books, news, pictures, web pages, etc.) to users who may be interested.

Content-based recommendation

This method calculates and recommends items that are similar to the items for which the user has shown a preference. For example, if you always buy books related to history online, then a content-based recommendation engine would recommend some popular historical books to you [11].

Collaborative filtering recommendation (CF)

This method recommends items to the user based on other users who have similar taste. For example, when you buy clothes online, a collaborative filtering-based recommendation engine would analyze your dressing style based on your previous purchase or browsing history, find other users with similar taste, then discover the clothes they liked and recommend them to you.

Knowledge-based recommendation

This method recommends items to the user using association rule discovery algorithms. There are many discovery algorithms for association rules, such as Apriori, AprioriTid, DHP, and FP-tree.

Hybrid recommender systems

This method combines two or more of the methods mentioned above to get a more comprehensive recommendation.

Collaborative filtering recommendation (CF)

Collaborative filtering algorithms have been implemented successfully in many recommendation systems. They can be divided into two types: memory-based CF and model-based CF [3][1]. Memory-based algorithms assume that similar users give similar scores to the same items, and that similar items are given similar scores by the same user. Therefore, the important step of this method is to find similar users/items, then predict the target score based on those similar users/items. Model-based algorithms, such as matrix-decomposition models and network-based probability models, use a model to make recommendations. The model is established to represent the scoring pattern and can be trained using the previous interaction information. [30]

CF algorithms are based on a user-patent scoring matrix, where each row represents a user, each column represents an item, and scores are given by users to patents [15]. Table 2.2 shows a user-patent scoring matrix. Based on the previous scores, we could predict that the score of user 3 for patent 5 is 5, because P5 is similar to P4. Similarly, the rating of user 2 for patent 2 could be 3.

Table 2.2: user-patent scoring matrix

     P1  P2  P3  P4  P5
U1   5   1   1   3   3
U2   3   -   3   1   1
U3   3   1   1   5   -

Figure 2.3: Architecture diagram of recommendation engine


Table 2.3: An example of User-based CF

        Item A  Item B  Item C  Item D
User A  1               1       R
User B          1
User C  1               1       1

Table 2.4: An example of Item-based CF

        Item A  Item B  Item C
User A  1               1
User B  1       1       1
User C  1               R

A: User model: this part is responsible for taking the user's score data from the database or cache and analyzing the preferences to generate the feature vector for the current user. The output of this part is the feature vector.

B: Item list: the feature-item matrix is another input of the recommendation algorithm; it is generated from the item feature vectors combined with the candidate item list.

C: Recommendation: taking the user-based collaborative filtering algorithm (User-based CF) as an example, the items in the initial recommendation set are scored and predicted based on the user's preference; then the initial recommendation results are filtered based on some filtering rule, such as top-N score. These filtered results are the output of the recommendation system.

User-based Collaborative filtering algorithm (User-based CF)

The main idea of user-based CF is to discover a group of users with similar preferences, which can be identified by similar scores given to the same items. The algorithm recommends to one user the top-interested items of the others [22]. The steps are:

(1) Find neighbors of a specific user based on the users' scores for items.

(2) Recommend the favorite items of the neighbors to the current user.

Computationally, a user's scores for all items can be regarded as the user vector, and neighbors can be discovered by calculating the similarity between user vectors. After finding the top-k neighbors, the item scores of the given user can be predicted based on the neighbors' previous scores for the items [46]. Finally, the items are ranked based on the predicted scores. There is an example in table 2.3: for User A, based on the user-item interaction scores, User C is the neighbor of User A. Therefore, item D would be recommended to User A because User C shows interest in item D.
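The steps above can be sketched with the binary ratings of Table 2.3 (1 = liked, 0 = no interaction). Cosine similarity is one common choice for the user-user similarity; the single-neighbour setup is a simplification for illustration.

```python
# A minimal user-based CF sketch over the binary matrix of Table 2.3.
import math

ratings = {
    'User A': {'Item A': 1, 'Item B': 0, 'Item C': 1, 'Item D': 0},
    'User B': {'Item A': 0, 'Item B': 1, 'Item C': 0, 'Item D': 0},
    'User C': {'Item A': 1, 'Item B': 0, 'Item C': 1, 'Item D': 1},
}

def cosine(u, v):
    dot = sum(u[i] * v[i] for i in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(target):
    # (1) find the most similar neighbour, (2) suggest items the neighbour
    # liked that the target has not interacted with yet
    others = [u for u in ratings if u != target]
    neighbour = max(others, key=lambda u: cosine(ratings[target], ratings[u]))
    return [i for i, r in ratings[neighbour].items()
            if r and not ratings[target][i]]

print(recommend('User A'))   # User C is the nearest neighbour
```

Running this suggests Item D to User A, matching the example in table 2.3.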

Item-based Collaborative filtering algorithm (Item-based CF)

The principle of item-based CF is similar to user-based CF, but item-based CF finds similar items from the user-item interaction matrix. All users' scores for an item are used as an item vector to calculate the similarity between items. There is an example shown in table 2.4. For item C, according to the historical scores of users, item A is similar to item C. Users A, B and C all highly scored item A, so it can be inferred that item C may also be highly scored by users A, B, and C. Therefore, item C is recommended to user C in this case [46].
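A corresponding sketch for the item-based variant, again with illustrative binary ratings: each item is represented by the column of all users' ratings for it, as in Table 2.4.

```python
# Item-based CF sketch: each item vector holds the ratings of
# (User A, User B, User C) for that item; values are illustrative.
import math

item_vectors = {
    'Item A': [1, 1, 1],
    'Item B': [0, 1, 0],
    'Item C': [1, 1, 0],   # User C's rating for Item C is the one to predict
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# similarity of every other item to Item C
sims = {i: cosine(v, item_vectors['Item C'])
        for i, v in item_vectors.items() if i != 'Item C'}
most_similar = max(sims, key=sims.get)
print(most_similar)  # Item A, so Item C is suggested to users who liked Item A
```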

Matrix decomposition based Collaborative filtering algorithm

In recommendation systems, the matrix decomposition model has received much attention because of its good theoretical foundation and extensibility. The traditional matrix decomposition model obtains the latent feature matrices of users and items by decomposing the user-item scoring matrix. This algorithm predicts the target user's score for a particular item based on the product of two matrices: the user-feature matrix and the feature-item matrix. Salakhutdinov et al. explained the traditional matrix decomposition model from the perspective of probability, and then proposed probabilistic matrix factorization (PMF) and Bayesian probabilistic matrix factorization (BPMF) [16][7]. The decomposition model can efficiently process large-scale data with good performance. In the Netflix competition, Koren et al. [27] proposed the SVD++ algorithm, an improvement of Singular Value Decomposition (SVD), which is one of the common matrix factorization methods used in collaborative filtering. On the basis of PMF, they integrated user bias, item offset and users' implicit feedback into SVD++. The success of this method pushed the application of the matrix decomposition model in recommendation systems to a higher level.

In recent years, the matrix factorization method has been used in a popular model: the Latent Factor Model (LFM). The Latent Factor Model was first introduced into recommendation systems by Koren of Yahoo Research in 2009 [27]. Its characteristic is that the user and the item are connected by implicit features, and the user-item scoring matrix is completed by dimension-reducing matrix factorization [26]. The idea of the Latent Factor Model is to use the matrix factorization method to decompose the user-item scoring matrix into a "user-implicit feature" matrix and an "item-implicit feature" matrix, as shown in figure 2.4 [26].

Figure 2.4: Decomposition of "user-item" rating matrix

The matrix R is a user-item matrix; the value R_ij represents the score of user i for item j. For a user, after we predict his scores for all items, the items can be ranked and recommended to him based on the predicted scores. The LFM algorithm extracts several classes or features as a bridge between the user and item. Then, the matrix R can be represented as a product of matrix P and matrix Q [43]. The matrix P is a user-class matrix, where P_ij represents the interest of user i in class j; the matrix Q is a class-item matrix, where Q_ij represents the weight of item j in class i [1]. Therefore, LFM calculates the interest of user U in item I according to the following formula:

R_UI = P_U · Q_I = Σ_{k=1}^{K} P_{U,k} Q_{k,I}

Next, we need to calculate the parameter values in matrix P and matrix Q. The usual approach is to calculate the parameters by optimizing a loss function: by minimizing the loss function, the parameters of the model can be calculated from the training dataset. In the user-item dataset K = (U, I), if (U, I) is a positive sample, then R_UI = 1, otherwise R_UI = 0. The loss function is defined as follows:

C = Σ_{(U,I)∈K} (R_UI − R̂_UI)² = Σ_{(U,I)∈K} (R_UI − Σ_{k=1}^{K} P_{U,k} Q_{k,I})² + λ‖P_U‖² + λ‖Q_I‖²

where λ in the above formula is a regularization term, which is used to prevent over-fitting; λ needs to be tuned experimentally according to the specific application scenario [41]. The process of calculating P and Q is called the training process. In order to determine whether the trained model can be used for prediction, some known scores are ignored during the training process; these form a known test set. Once P and Q are found, evaluation metrics such as root mean square error (RMSE) and mean absolute error (MAE) are used to measure performance on this known test set. Only if the prediction accuracy of the trained model on the known test set is higher than a predefined threshold can the model be used as a prediction model for the real unknown set. [1]

2.6 Model optimization

Gradient descent algorithms can be used for the optimization process of the loss function.

Batch Gradient Descent (BGD)

Gradient descent (GD) is an iterative optimization algorithm whose purpose is to find the minimum value of a function [25]. To find a local minimum of the function, steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point are taken [1]. If each minimization step uses all the training data, the method is referred to as batch gradient descent (BGD): in each iteration, all data is used to compute the gradient. Every such iteration is called an epoch.

Algorithm 1: Batch Gradient Descent (BGD)

1: U ← random(u, k)
2: I ← random(k, i)
3: for each known rating do
4:     Utemp ← µ + α(Σ e·υ − λµ)
5:     Itemp ← υ + α(Σ e·µ − λυ)
6: end for
7: U ← Utemp
8: I ← Itemp

U: user matrix; µ: element of U
I: item matrix; υ: element of I
Utemp, Itemp: temporary matrices where updated elements are stored
e: error for each element
λ: regularization term
α: learning rate

In order to make gradient descent work well, the learning rate must be set to a suitable value. This parameter determines the speed of moving toward the optimal value. If the learning rate is too large, it is possible to skip over the optimal value in the process of finding the minimum; if it is too small, too many iterations are needed to find the optimal value, which slows down the algorithm. Therefore, the choice of learning rate is important.

Stochastic gradient descent (SGD)

One of the problems with the BGD algorithm is that it runs very slowly on large datasets. The reason is that BGD needs to accumulate the errors of all instances in the training dataset before updating each parameter; when there are millions of instances, this takes a long time. In this case, another more suitable iterative algorithm, stochastic gradient descent (SGD), can be used. Similar to BGD, SGD is also an iterative optimization algorithm. However, the training process of SGD is much faster because the matrix factors are updated continuously throughout the epoch, after each individual rating.


Specifically, stochastic gradient descent finds a local minimum of the function by updating the model parameters on each pair of training data iteratively. First, a set of parameter values is initialized, then optimized step by step until a local minimum is found. At each step, the stochastic gradient descent method moves in the direction in which the gradient of the loss function decreases fastest.

Algorithm 2: Stochastic Gradient Descent (SGD)

1: U ← random(u, k)
2: I ← random(k, i)
3: for each known rating do
4:     Utemp ← µ + α(e·υ − λµ)
5:     Itemp ← υ + α(e·µ − λυ)
6:     U ← Utemp
7:     I ← Itemp
8: end for
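The SGD loop above can be made concrete with a small self-contained sketch. The toy rating matrix, the number of latent factors K, the learning rate and the epoch count below are all invented for illustration; they are not values used in this project.

```python
# A runnable sketch of SGD updates of the factor matrices P and Q for R ~ P*Q^T.
import random

R = [[5, 3, 0, 1],
     [4, 0, 0, 1],
     [1, 1, 0, 5],
     [0, 1, 5, 4]]          # 0 marks an unknown rating
n_users, n_items, K = 4, 4, 2
alpha, lam, epochs = 0.01, 0.02, 2000

random.seed(0)
P = [[random.random() for _ in range(K)] for _ in range(n_users)]
Q = [[random.random() for _ in range(K)] for _ in range(n_items)]

for _ in range(epochs):
    for u in range(n_users):
        for i in range(n_items):
            if R[u][i] == 0:
                continue                          # skip unknown entries
            e = R[u][i] - sum(P[u][k] * Q[i][k] for k in range(K))
            for k in range(K):
                puk = P[u][k]                     # cache before updating
                P[u][k] += alpha * (e * Q[i][k] - lam * puk)
                Q[i][k] += alpha * (e * puk - lam * Q[i][k])

pred = sum(P[0][k] * Q[0][k] for k in range(K))
print(round(pred, 2))   # close to the known rating R[0][0] = 5
```

After training, the product P·Qᵀ reproduces the known entries closely, while the zero entries receive predicted scores that can be used for ranking.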

2.7 Evaluation method

Root Mean Squared Error (RMSE) and Mean Absolute Error(MAE)

During the training process, root mean square error (RMSE) can be used to evaluate the performance of the training model. The system generates a predicted score r̂ for each user-item pair (u, i) in the test set, where the real score r of these user-item pairs is known. During training, these scores are treated as unknown because we ignore them; during evaluation, we compute RMSE and MAE from the real scores and the predicted scores. [21]

RMSE = √( Σ_{t=1}^{n} (r̂_t − r_t)² / n )

The mean absolute error (MAE), an alternative to RMSE, is shown below.

MAE = Σ_{t=1}^{n} |r̂_t − r_t| / n
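Both metrics are straightforward to implement; the sample predictions and real scores below are invented for illustration.

```python
# Direct implementations of the RMSE and MAE formulas above.
import math

def rmse(predicted, actual):
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

def mae(predicted, actual):
    n = len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n

print(rmse([4.0, 2.5, 5.0], [5.0, 3.0, 5.0]))
print(mae([4.0, 2.5, 5.0], [5.0, 3.0, 5.0]))
```

Because RMSE squares the errors, it penalizes large individual errors more heavily than MAE does.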

Precision and Recall [22]

Figure 2.5: Precision and recall

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

where TP (true positives) is the number of retrieved documents that are relevant, FP (false positives) is the number of retrieved documents that are not relevant, and FN (false negatives) is the number of relevant documents that are not retrieved.
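A minimal sketch computing both metrics from sets of retrieved and relevant document ids; the id lists are invented for illustration.

```python
# Precision and recall computed from binary relevance labels.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)             # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved patents are relevant; 3 of the 6 relevant patents found
p, r = precision_recall([1, 2, 3, 4], [2, 3, 4, 5, 6, 7])
print(p, r)  # 0.75 0.5
```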

3 Method

The patent search system already exists at the company IamIP and can be accessed and reused directly. Therefore, we focus on the search optimization subsystem in this project. To design the optimization subsystem, several processes need to be considered: preprocessing the patent text, extracting keywords based on the feedback to reformulate the search query, ranking the results, and evaluating the performance of the optimization using suitable metrics.

The information in a complete patent record includes Patent Title, Abstract, Publication(s), Applicant(s), International filing date, Priority(s), Inventor(s), IPC, CPC, Simple family member(s), Claims and Drawing. In this project, we only focus on the title and abstract of the patent, since these contain most of the important information of a patent.

3.1 The environment of system implementation

The development tools and operating environment of this project are shown in the following table 3.1:

Table 3.1: Development tools and environment

Name              Environment or tools
Operating System  Mac OS 10.13.6
IDE               Anaconda 1.8.5
Language          Python 2.7.11
Database          MySQL 5.6.35
Packages          nltk 3.2; numpy 1.10.2; pandas 0.18.0; scikit-learn; GraphLab 2.1 (recommender)

3.2 Dataset

The dataset was collected and examined by IamIP during four months from December 2017 to March 2018. It is a CSV-format file; each row contains a complete patent, including Patent Title, Abstract, Publication(s), Applicant(s), International filing date, and Priority(s). The user provides relevance feedback in the form of ratings, in which users select numerical values from a specific evaluation system (e.g. a five-, three- or one-star rating system) that specify the relatedness of the various patents. Patent ids are in the range [1, 182]. The statistics of the IamIP dataset are shown in table 3.2.

Table 3.2: The statistics of the IamIP dataset

month most related (5) less related (3) unrelated (1) total number

2017/12 7 6 52 65

2018/01 0 7 32 39

2018/02 5 4 38 47

2018/03 12 14 5 31

Following the ACID principles of database design, i.e. Atomicity (A), Consistency (C), Isolation (I) and Durability (D), the database is designed with the following four data tables. Figure 3.1 shows the relational schema of the dataset.

Figure 3.1: Entity relationship diagram (ERD)

The data tables and fields in the database are as follows.

(1) Patent information patent_info (shown as table 3.3)

Stores patent information, including patent number, patent title, patent abstract, and publication time. Except for the id and publication time, the other fields can be NULL.

(2) Company information company_info (shown as table 3.4)

Stores customer company information, including company number and company name. To avoid disclosing company information, we did not collect any real information about the customer companies; the company names in the dataset are just pseudonyms.

Table 3.3: patent_info

Field      Type     Null      Description
patent_id  int(11)  NOT NULL  Primary key
title      text
abstract   text


Table 3.4: company_info

Field       Type     Null      Description
Company_id  int(11)  NOT NULL  Primary key
name        text

Table 3.5: Alert_administration

Field       Type     Null      Description
Admin_id    int(11)  NOT NULL  Primary key
topic       text
company_id  int(11)  NOT NULL  Foreign key

Table 3.6: ranking

Field       Type     Description
Admin_id    int(11)  Primary key / foreign key
patent_id   int(11)  Primary key / foreign key
score       int

(3) Alert administration Alert_administration (shown as table 3.5)

Each company can set more than one alert, and each alert focuses on a specific topic. This table administrates the topic of each alert of the companies. The relevance feedback is given per administrator based on the topic.

(4) Ranking ranking (shown as table 3.6)

Stores the administrator's rating for the patent, including the administrator id, patent id and relevance score.

3.3 Implementation Process

Four function modules are implemented in our project, shown as figure 3.2:

Figure 3.2: Function modules diagram

Patent preprocess module: Convert the title and abstract of the patent document into feature vectors in space.


Query reformulation module: input the vector representations of the patent titles and abstracts, and the corresponding score of each patent document; output the reformulated search string.

Patent retrieval: input the search string and search dataset, output search results.

Rank module: input the user's scores for the original documents, as well as the feature vectors of all patents (the new patents plus the original patents); output the predicted scores for the new patents and rank the candidate results based on the predicted scores.

Preprocess

The first step is to convert the raw text into structured data, making it possible for the computer to identify and process it. Figure 3.3 shows the patent preprocessing steps:

Figure 3.3: Processes of patent preprocess

text preprocess

We defined a function tokenize(text) for text preprocessing. The input of this function is raw text, and it includes the following steps: tokenization, stopword removal and punctuation removal.

The package nltk, a package for Python programs to work with human language data, is used to implement this function. There are numerous ways to tokenize text in nltk [6].

1. word_tokenize(s)

Tokenizers divide strings into lists of substrings. If s = "The train ticket from Linkoping to Stockholm cost $10.11. Do you want to book one ticket now?", then after using word_tokenize(s), the output is

['The', 'train', 'ticket', 'from', 'Linkoping', 'to', 'Stockholm', 'cost', '$', '10.11', '.', 'Do', 'you', 'want', 'to', 'book', 'one', 'ticket', 'now', '?']   (3.1)

2. wordpunct_tokenize(s)

NLTK also provides a tokenizer that splits text on whitespace and punctuation. For the same example as above, the output of this function is:

['The', 'train', 'ticket', 'from', 'Linkoping', 'to', 'Stockholm', 'cost', '$', '10', '.', '11', '.', 'Do', 'you', 'want', 'to', 'book', 'one', 'ticket', 'now', '?']   (3.2)

3. sent_tokenize(s)


The tokenization can also be operated at the level of sentences by using the sentence tokenizer directly.

[['The', 'train', 'ticket', 'from', 'Linkoping', 'to', 'Stockholm', 'cost', '$', '10.11', '.'], ['Do', 'you', 'want', 'to', 'book', 'one', 'ticket', 'now', '?']]   (3.3)

Considering the application scenario of our project, the function word_tokenize in the module nltk is used for tokenization. After that, stop words can be removed by using stopwords.words("english"), which imports a list of stop words from the nltk module. In addition, we manually define 34 stopwords in this domain. Furthermore, we mark 15 commonly used punctuation symbols and remove them from the tokenized word list.
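The pipeline just described (tokenize, lowercase, remove stopwords, remove punctuation) can be illustrated with a self-contained stand-in. The real implementation uses nltk's word_tokenize and English stopword list; the regex and the tiny stopword set here are simplifications for demonstration only.

```python
# A self-contained stand-in for the tokenize(text) pipeline described above.
import re

STOPWORDS = {'the', 'a', 'an', 'from', 'to', 'do', 'you', 'want', 'one', 'now'}
PUNCTUATION = set('.,;:!?()[]{}"\'$')

def tokenize(text):
    # crude word_tokenize: keep decimal numbers whole, split off other symbols
    tokens = re.findall(r"\d+\.\d+|\w+|[^\w\s]", text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]      # remove stopwords
    return [t for t in tokens if t not in PUNCTUATION]      # remove punctuation

print(tokenize("The train ticket from Linkoping to Stockholm cost $10.11."))
```

For the running example this leaves only the content-bearing tokens, with the price kept as a single token.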

vectorizer

In this part, we defined two functions: process(row_text=[], patent_index=[]) and process_withquery(row_text=[], query=[], patent_index=[]).

The difference between the two functions is their parameters: in addition to the parameters of process(), process_withquery() has another parameter for the query. In this way, process_withquery() takes the query into consideration when extracting feature words. The output of this function is two matrices with the same feature words: one is the patent-feature matrix, the other is the feature vector of the query. It maps the query and the patent set into the same feature vector space, so the similarity between the query and the patents can be calculated directly when searching.

Scikit-learn [37] is used to implement these two functions, including feature selection, weight calculation and the transformation to a matrix. Below is a description of the parameters of TfidfVectorizer, which converts raw documents to a matrix of tf-idf features:

TfidfVectorizer(max_df=1.0, min_df=1, max_features=None, use_idf=True, stop_words=None, tokenizer=None, ngram_range=(1, 1)) [36]

1. max_df: float in range [0.0, 1.0] or int, default=1.0

We set it to 0.8; when building the vocabulary, terms that occur in too many documents are ignored.

2. min_df: float in range [0.0, 1.0] or int, default=1.0

Similarly, terms that have a document frequency lower than the given threshold are ignored; this is also called the cut-off. We set min_df to 0.2 in our project.

3. max_features: int or None, default=None

When building the vocabulary, only the most frequent terms in the corpus are considered. We set it to 200000 to control the dimension of the vector.

4. use_idf: boolean, default=True

Enables inverse-document-frequency reweighting.

5. stop_words: string 'english', list, or None (default)

If stop_words is set to 'english', a built-in stop word list for English is used.

6. tokenizer: callable or None (default)


7. ngram_range: tuple (min_n, max_n)

For example, if the lower boundary is set to 1 and the upper boundary to 3, then unigrams, bigrams and trigrams are considered.
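The effect of the max_df and min_df thresholds can be illustrated with a small hand-rolled version of the vocabulary pruning and tf-idf weighting. The toy documents below are invented; the project itself uses sklearn's TfidfVectorizer with max_df=0.8 and min_df=0.2.

```python
# Hand-rolled illustration of max_df / min_df pruning followed by tf-idf:
# terms appearing in more than max_df or fewer than min_df of the documents
# are dropped before weighting.
import math

docs = [['heat', 'pump', 'energy'],
        ['heat', 'light', 'energy'],
        ['heat', 'light', 'power'],
        ['heat', 'solar', 'power'],
        ['heat', 'solar', 'energy']]

n = len(docs)
df = {}                                  # document frequency of each term
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

max_df, min_df = 0.8, 0.2
vocab = sorted(t for t, c in df.items() if min_df <= c / n <= max_df)
print(vocab)   # 'heat' (df = 1.0) is pruned as too frequent

def tfidf(doc):
    # term frequency times inverse document frequency, over the pruned vocab
    return [doc.count(t) / len(doc) * math.log(n / df[t]) for t in vocab]

print(tfidf(docs[0]))
```

Terms that occur in every document (like 'heat' here) carry no discriminative information, which is exactly what the max_df cut removes.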

Transform to matrix

To transform the vectors to a matrix, fit_transform() is used. We set the feature terms as the column index and the patent index as the row index.

Query reformulation

Figure 3.4 shows the flowchart of query reformulation.

Figure 3.4: Flowchart of Query reformulation

Among its components, Patent_process uses the process function from the preprocess module; the input is a set of patent texts and the output is a set of patent-feature vectors.

Rocchio algorithm refers to a function that we defined: rocchio(vector=[], unrelated=[], related=[])

The parameters of the function are the patent vectors, the patent indices of the related patents and the patent indices of the unrelated patents. We multiply the related patents and the unrelated patents by different weights, then sum the values of each column to calculate the feature weights. The features in the collection are then ranked by weight. Finally, we select the number of expansion words, return a set of feature terms that represent the related patents, and use them to formulate a new search string.

The output of this step is a weight vector over all selected features, calculated by giving different weights to different patents based on relevance. After that, we find the indices with the highest sum_weight and take the top-n terms to form the new query. A simple example showing how the algorithm works follows. If the initial query is "heat light", as shown in the example in table 3.7, the parameters are:

Table 3.7: Rocchio algorithm

energy heat light power relevance

D1 1 0.5 0 0 5

D2 0.5 1 1 0.5 3

D3 1 1 1 0 1

D4 0 0 0.5 0.5 1

α=1, β=0.4, λ=0.4, γ=0.2

The weight vector of all features is shown in table 3.8. If n in top-n is set to 3, then the new query is "energy light heat".

Table 3.8: weight vector

feature energy heat light power

weight 1.8 0.8 1.5 0.1

Patent retrieval

Figure 3.5 shows the flowchart of the patent query:

Among its components, patent_process refers to the function process_withquery in the preprocess module. It maps the patent set together with the query into a matrix. Then, the similarity calculation is performed using the cosine similarity between the query and each patent. Finally, when the top-n patents are output as the search result, we use concat() to splice the results. concat() is a function of the pandas package that can splice data horizontally or vertically; here we vertically splice the top-n patents together as the query result.
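The retrieval step can be sketched as follows: score every patent vector against the query vector with cosine similarity and keep the top-n. The vectors below are invented for illustration.

```python
# Rank patents by cosine similarity to the query vector and keep the top-n.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [1.0, 0.8, 0.0]
patents = {'P1': [0.9, 0.7, 0.1],
           'P2': [0.0, 0.1, 0.9],
           'P3': [0.5, 0.9, 0.0]}

# sort patent ids by similarity to the query, highest first, keep top-2
top_n = sorted(patents, key=lambda p: cosine(query, patents[p]), reverse=True)[:2]
print(top_n)
```

P1 and P3 share the query's dominant dimensions and outrank P2, whose weight lies in an unrelated dimension.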

Ranking

We choose the factorization recommender model to implement the CF algorithm. One advantage of matrix factorization is that it allows the incorporation of additional information [43]. In our design, we set patent features as the additional information for the recommendation. First, the model is trained on the previous relevance feedback, considering the similarity between patents; it then predicts scores for newly captured patents. Lastly, the patents with the top-n scores are given as the most related results. Figure 3.6 shows the architecture of the recommendation engine in our project. It was designed based on the recommendation engine architecture (figure 2.3) that was introduced in section 2.5.

We retrieve the results using the new query, and use the recommendation algorithm to rank the query results. The flowchart of the recommendation algorithm is shown in figure 3.7.

We define the function model(user_id=[], patent_id=[], ranking=[], factor=[]). This model learns the latent features between users and patents based on user and patent rating information. The textual features of the patents are added as side information for the recommendation. We implemented the function by adapting factorization_recommender from the package GraphLab [20].

For the function factorization_recommender, the main information is the rating of the user for the patent; the features of the patents are side information for creating the recommendation model. The user-patent matrix should contain three columns: user_id, patent_id, and rating; each row represents an observed interaction between a user and a patent. This matrix is the input of the


Figure 3.5: Flowchart of Patent query


Figure 3.7: Flowchart of recommendation engine

model function; the model then discovers latent factors between users and patents to calculate the parameters. As we mentioned in section 2.5, this process is called the training process. [20]

Side information for the items is added to the observations through an SFrame named item_data. This SFrame must have a column with the same name as the item_id, and can provide any additional item-specific information. We can also add user features as side information. [20]

When creating the recommender model, the user_id, patent_id, and rating of the training set are the input parameters of the model function; in addition, we form an SFrame to store the features of the original patent set and the new patent set.

The predicted ratings for the test set can be obtained by using the function predict(test). The parameter of this function is the collection of new patents whose scores need to be predicted.

3.4 Evaluation

Dataset

We are going to use the Cancer Moonshot Patent Data [14] from the United States Patent and Trademark Office (USPTO) to further evaluate the performance of the system. The dataset contains detailed information on published patents related to cancer research and development (R&D), covering approximately 270,000 patent documents from 1976 to 2016. Various fields and subjects related to cancer are covered in this dataset, including drugs, diagnostics, surgical equipment, data analysis, and genome-based inventions. In addition to compiling key patent data fields, the authors also constructed a set of fields to indicate whether a patent belongs


to a certain high-level technology category. These high-level technology categories can serve as our standard of whether a patent is related to the specific technology: 1 represents related, 0 represents unrelated. The statistics of this dataset are shown in table 3.9.

Table 3.9: The statistics of the dataset (cancer research and development)

Technology topics            patents  related  unrelated
DNA, RNA, Protein Sequence   269353   85144    184207
Food and Nutrition           269353   3849     265504
Drugs and Chemistry          269353   205019   64334
Radiation Measurement        269353   396      268957
Cells and Enzymes            269353   41234    228119

After removing some patents without a title or abstract, 264,000 patents were selected for the evaluation experiments. Among them, there are 83,694 patents related to the subject 'DNA, RNA, Protein Sequence', accounting for about 32%. The distribution of related patents and unrelated patents in the dataset is shown in figure 3.8.

Table 3.10: The distribution related to the subject (DNA, RNA, Protein Sequence)

total   related (1)  unrelated (0)
264000  83694        180306

Figure 3.8: Distribution of related patents and unrelated patents in the dataset

For doing experiment, the total data set has been randomly divided into 8 equal parts, ensuring that the proportion of related patents is keep about 32%. The statistics of the sub-datasets obtained by the division is shown as table 3.11:

The number of extended words

When using the query reformulation algorithm, the first issue to consider is the reasonable choice of the number of extension words. In order to figure out the suitable number of exten-sion words, we designed a series of experiments to evaluate the number of different extenexten-sion words by comparing the precision of the search results, which are retrieved from the same search dataset by generated search string. We trained on dataset d6, analyzed the related patent to obtain feature words, and set the number of extension words from 2 to 16. Recorded



Table 3.11: The statistics of the sub-datasets

     total   related (1)   unrelated (0)
d1   33000   10536         22464
d2   33000   10407         22593
d3   33000   10500         22500
d4   33000   10336         22664
d5   33000   10495         22505
d6   33000   10432         22568
d7   33000   10536         22464
d8   33000   10452         22548

the generated search strings under the different numbers of extension words, and then recorded the search results obtained by running each search string on dataset d8. Finally, we compared the recall and precision of the total result set, the top-10 results, and the top-20 results.
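The thesis text does not show this step as code; a minimal sketch of extracting n extension words with a simple TF-IDF score and joining them into a search string (the function names are ours, not the system's API) could look like:

```python
from collections import Counter
import math

def top_tfidf_terms(related_docs, n_terms):
    """Rank the terms of the related patents by a simple TF-IDF score
    and return the n_terms highest-scoring ones as extension words."""
    tf = Counter()   # term frequency over the whole related set
    df = Counter()   # document frequency per term
    for doc in related_docs:
        tokens = doc.lower().split()
        tf.update(tokens)
        df.update(set(tokens))
    n_docs = len(related_docs)
    scores = {t: tf[t] * math.log((1 + n_docs) / (1 + df[t])) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:n_terms]]

def build_search_string(terms):
    """Join the extension words into a simple OR-query."""
    return " OR ".join(terms)

docs = ["dna sequence analysis", "protein sequence alignment", "dna replication"]
terms = top_tfidf_terms(docs, n_terms=3)
query = build_search_string(terms)
```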

Parameters of Rocchio algorithm

In the Rocchio algorithm, the weight selection of related and unrelated terms is very important. Early research focused on related terms only. As the research developed, some researchers proposed that unrelated terms should also be considered, and some even proposed considering only unrelated terms.

In this experiment, we aim to choose suitable parameters for Rocchio in this particular application scenario. We designed five sets of parameters, from (1:0) and (0.9:-0.1) down to (0.6:-0.4), and compared the reformulated search strings trained on dataset d1 under the different parameter settings. Finally, we analyzed the precision of the query results, as well as the precision and recall of the top-10 and top-20 results.
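For reference, the Rocchio update that these (related : unrelated) weight pairs parameterize can be sketched over TF-IDF vectors as follows; fixing α to 1 and clipping negative weights to zero are common practical choices of ours, not something prescribed by the thesis:

```python
import numpy as np

def rocchio(q, related, unrelated, alpha=1.0, beta=0.9, gamma=-0.1):
    """Rocchio update: move the query vector towards the centroid of
    related documents and (for negative gamma) away from the centroid
    of unrelated documents.  (beta, gamma) correspond to the weight
    pairs evaluated in the experiment, e.g. (1, 0) ... (0.6, -0.4)."""
    q_new = alpha * np.asarray(q, dtype=float)
    if len(related):
        q_new = q_new + beta * np.mean(related, axis=0)
    if len(unrelated):
        q_new = q_new + gamma * np.mean(unrelated, axis=0)
    # clip negative term weights to zero (common practical choice)
    return np.maximum(q_new, 0.0)

# Toy vectors over a 3-term vocabulary
new_q = rocchio([1, 0, 0], [[0, 1, 0], [0, 1, 1]], [[0, 0, 1]])
```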

Dataset size

In order to explore the impact of dataset size on the query reformulation algorithm, we iteratively divided the sub-dataset d6 into six smaller sub-datasets, forming a series of datasets of various sizes. We ran the query reformulation algorithm to extract a search string by analyzing the text features of the patents, then retrieved search results by running the newly generated search string on dataset d8. Finally, we compared the precision and recall of related patents in the search results, together with the proportion of related patents in the top-10 and top-20.
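The precision and recall figures compared throughout these experiments can be computed with a small helper; a sketch, using the 1/0 relevance labels of the dataset:

```python
def precision_recall(ranked_labels, n_related_total, k=None):
    """Precision and recall of a ranked result list.
    ranked_labels: 1 (related) / 0 (unrelated) per retrieved patent,
    in rank order; k: cut-off (None = evaluate the whole result set);
    n_related_total: number of related patents in the search dataset."""
    labels = ranked_labels[:k] if k else ranked_labels
    hits = sum(labels)
    precision = hits / len(labels) if labels else 0.0
    recall = hits / n_related_total if n_related_total else 0.0
    return precision, recall

# Toy ranking: 7 related among the top 10, 20 related in the dataset
ranked = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
p10, r10 = precision_recall(ranked, n_related_total=20, k=10)
```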

The effect of recommendation algorithm

In order to explore whether the recommendation algorithm helps in ranking the results, we built on the dataset-size experiment and further ranked the search results using the recommendation algorithm. Finally, we compared the precision of the top-10 and top-20 results with the results obtained using TF-IDF ranking.

Iterative test

Application scenario:

IamIP’s platform is a tool for patent search and patent management. The dataset is updated with approximately 100,000 new patents per week [24]. Customers can set up search alerts by defining an initial search string to capture the latest competitive intelligence for specific technology requirements.



The patent platform updates thousands of new patents every week. Using the user-defined search string, 50 patents are retrieved as search results every week, so a user is informed of 10 patents per working day. The user then gives feedback (related/unrelated) on these patents. In this way, 50 feedback items can be collected per week, and therefore about 200 patents with feedback per month. The initial training set of this iterative experiment is a collection of 2,400 patents gathered over one year. This collection of patents is used for optimizing the search string by learning from the relevance feedback.

For this test we randomly selected 2,400 patents from dataset d7 as the initially collected patents with user relevance feedback. After that, we suppose the platform captures 8,250 new patents per week and receives about 50 feedback items from users per week. The database of the system is updated 4 times a month (once a week), and the feedback collected in the latest month is continually added to the training set, which means that the optimization system iterates once a month. We simulated the iteration of the query alert among the different sub-datasets. By performing the iterative test 4 times over the 4 datasets, we could evaluate whether our system generates a good search string for a specific topic and ranks the related results higher.
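The monthly cycle described above can be sketched as a loop; reformulate and collect_feedback below are hypothetical stand-ins for the system's query-reformulation and feedback-collection components:

```python
def run_iterations(initial_feedback, weekly_batches, reformulate,
                   collect_feedback, n_iterations=4):
    """Simulate the monthly optimization cycle: each iteration consumes
    4 weekly batches of new patents, gathers the user feedback on them,
    adds it to the training set, and reformulates the search string."""
    training_set = list(initial_feedback)
    search_strings = []
    for it in range(n_iterations):
        month = weekly_batches[4 * it: 4 * (it + 1)]
        for week in month:
            training_set.extend(collect_feedback(week))
        search_strings.append(reformulate(training_set))
    return search_strings, training_set

# Toy stand-ins: 50 feedback items per weekly batch of 8,250 patents,
# starting from 2,400 patents with feedback collected in one year
batches = [list(range(8250)) for _ in range(16)]
feedback = lambda week: week[:50]
reform = lambda ts: f"query over {len(ts)} feedback items"
strings, ts = run_iterations(list(range(2400)), batches, reform, feedback)
```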
