
April 2021

Searching and Recommending Texts Related to Climate Change

Karolin Gjöthlén


Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit

Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Phone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Searching and Recommending Texts Related to Climate Change

Karolin Gjöthlén

This project considers the design of a machine learning system to efficiently search a database of texts related to climate change. Efficient search and navigation of such a database make it easier to find actionable information, detect trends, or derive other useful information. A key feature of such an information retrieval system is the numerical representation of a text. This project implements and compares three different ways to represent a text in a vector space.

Specifically, we contrast Bag-of-Words, Term Frequency - Inverse Document Frequency, and Doc2Vec in this context.

The reported results cover two cases. Firstly, we observe that all three embeddings outperform a naive (fixed, expert rule-based) method for retrieving a text. In this case, the query contains part of the text with a small modification, while the result of the query should be the text itself. The Bag-of-Words approach turns out to be best in class for this task.

Secondly, we consider the task where the query is a random string, and the desired result is based on a manual comparison of the results. Here we observe that the Doc2vec approach is best in class. If the random queries become abstract-like, the Bag-of-Words approach performs almost as well.


This project considers the design of a machine learning system for efficiently searching a database of texts related to climate change. Efficient search and navigation of such a database make it easier to detect trends or find useful information. A key feature of such an information retrieval system is the numerical representation of such a text. This project implements and compares three different ways to represent a text in a vector space. Specifically, we compare Bag-of-Words, Term Frequency - Inverse Document Frequency, and Doc2Vec in this context.

The reported results cover two cases: in the first case, we observe that all three implementations outperform a naive method for retrieving a text. Here, the query contains part of the text with a minor modification, while the result should be the text itself. The Bag-of-Words method turns out to be best suited for this task.

In the second case, the query is a random string, while the desired result is based on a manual comparison of the results. Here we observe that the Doc2vec method performs best. If the query resembles an expected result, the Bag-of-Words method performs almost as well.


Contents

1 Introduction
  1.1 Scope
2 Background
  2.1 The INTERACT Project
  2.2 Motivation
  2.3 Recommender Systems and NLP
3 Method/Approach/Techniques
  3.1 Recommender Systems
  3.2 Natural Language Processing (NLP)
  3.3 Vector Representation
    3.3.1 Bag of Words
    3.3.2 Doc2vec
  3.4 Transformation
    3.4.1 TF-IDF
  3.5 Related Work
    3.5.1 Keyword Search
    3.5.2 Regular Expressions
    3.5.3 BERT
  3.6 Evaluation Methods
4 System Structure
  4.1 The Dataset
  4.2 Training of the System
    4.2.1 Bag of Words
    4.2.2 TF-IDF
    4.2.3 Doc2vec
  4.3 Making a Recommendation
    4.3.1 Bag of Words
    4.3.2 TF-IDF
    4.3.3 Doc2vec
    4.3.4 Cosine Similarity Measure
    4.3.5 Precision and Recall
5 Evaluation of Results
6 Results and Discussion
  6.1 A String That Is a Part of an Abstract as Input
  6.2 A Made Up Search String as Input
    6.2.1 Bag of Words
    6.2.2 TF-IDF
    6.2.3 Doc2vec
  6.3 An Abstract as Input
    6.3.1 Bag of Words
    6.3.2 TF-IDF
    6.3.3 Doc2vec
7 Conclusions
8 Future Work
A Top 5 for Search String as Input
  A.1 Bag of Words
  A.2 TF-IDF
  A.3 Doc2vec
B Top 5 for Abstract as Input
  B.1 Bag of Words
  B.2 TF-IDF
  B.3 Doc2vec
C Results From Original and Modified Shorter Strings as Input
  C.1 Input
  C.2 Output
D Results From Original and Modified Longer Strings as Input
  D.1 Input
  D.2 Output


1 Introduction

A key feature of an information system is to make it easier to browse and find information relevant to the user. This project focuses on the implementation and testing of a way to represent texts about climate change. One desired outcome is a snowball effect where people find interesting information about climate change, allowing a faster spread of knowledge, which in turn might lead to more action being taken in this area.

Texts about climate change, like the ones described above, can be obtained from the EU-funded project INTERACT [17], which is described further in Section 2.1. However, comparable data can also be gathered from other sources, for example from Google's own website for dataset searches [16].

Three different approaches have been used when implementing a recommender system, in order to see which of them works best in the setting of texts related to climate change. The approaches are Bag of Words [28], TF-IDF [12], and Doc2vec [23]; the first two do not take the order of words into account, while the last one does. Each method structures the documents as vectors, which are then transformed in different ways within the vector space they exist in. How they are transformed and what happens after that is explained in Section 3.

1.1 Scope

The goal of the project is to investigate different recommender systems and their usefulness in the context of texts related to climate change research. The focus is not on making a user-friendly interface, but rather on investigating whether natural language processing in combination with a recommender system works well in the given context. Three different implementations are developed, tested, and compared.

Regarding limitations on the data, the project focuses only on text-based data and not e.g. images or temperature measurements. Unfortunately, the original plan of having the dataset contain only texts from the INTERACT projects was not possible. This was due to the inability of the responsible parties to provide said data within the time frame of this master thesis. Instead, the dataset contains abstracts of articles and reports published in the Journal of Industrial Ecology, all related to climate change. The dataset only contains abstracts of the texts, and not the whole published articles, because of memory issues that occur with a large dictionary. Using abstracts also opens up the possibility of providing more texts than when providing the whole report or article.


2 Background

The evidence for rapid climate change is compelling. It includes, among other things, rising global temperatures, warming oceans, shrinking ice sheets, glacial retreat, decreased snow cover, rising sea levels, declining Arctic sea ice, extreme weather events, and ocean acidification [3]. The environmental challenges facing our planet, and the Arctic in particular, are on such a great scale that a global cooperation effort is needed. Indeed, studies show that the Arctic is warming more than twice as fast as the rest of the world [13].

2.1 The INTERACT Project

Preventing climate change in the Arctic is not something that a country or an organization can do alone, which is why an international cooperation was initiated: the EU-funded project INTERACT (International Network for Terrestrial Research and Monitoring in the Arctic) [17]. The cooperation has a long history, starting in 2001. However, the network has expanded during the last six years, with new members joining from Russia and North America, to become the network it is today. INTERACT's network currently consists of 88 terrestrial field bases with a multitude of researchers around the world. They all focus on building capacity for identifying, understanding, predicting, and responding to diverse environmental changes throughout the wide environmental and land-use parts of the Arctic. The stations involved are spread across countries such as the US, Canada, Norway, Sweden, Finland, Switzerland, Austria, and some other European countries [2].

As of the writing of this thesis, the third edition of the project, INTERACT III [8], is being carried out. INTERACT III focuses on nine different issues, or work packages (WP), all related to environmental sciences and cooperation in the Arctic, see Table 1. One of the work packages, WP6, focuses on how to use new techniques to come up with solutions to problems, new issues, and hidden topics not discussed before, all related to climate change. Another work package, WP2, focuses on how to make sure that the discovered data, predictions, and results are shared between scientists, work packages, and stations. A combination of these two work packages is the focus of this master thesis project, even though it officially belongs to WP6. The company AFRY [27] is responsible for supervising this thesis project, providing guidance and expertise.

The mission of the stations is to collect data in the Arctic region. Therefore they produce reports and other texts.

WP no  Description
1      Project Coordination
2      Station Managers Forum (SMF)
3      Giving Access to the Arctic
4      Unpredictable Arctic – extreme weather events
5      Connecting the Arctic: Transport and Communication
6      Climate Action: Making data widely available
7      Preparing for a future world: improving education and awareness at all societal levels
8      Cleaner Arctic, cleaner world: documenting and reducing pollution
9      The Arctic Resort: increasing benefits and reducing impacts from developing Arctic tourism

Table 1 The Work Packages of the INTERACT III Project

2.2 Motivation

Research and data from the first two INTERACT projects, together with similar projects, have resulted in large repositories of information [15], [18]. Consumers of this information include climate researchers, climate activists, and curious citizens interested in questions related to the climate. However, the availability of large resources causes challenges. These arise from the fact that users cannot utilize the available resources effectively when the quantity of information requires an unreasonable amount of time to familiarize with and grasp. Thus, the risk of overloading users with information enforces new requirements on the software systems that manage the information. One of these requirements is to find and present relevant data.

Recommender Systems (RS) are a way to address some of the problems caused by information overload [11]. When navigating through a vast amount of data, finding relevant subtopics is crucial to making the information more accessible and usable. The aim is for the information to reach more people than before, in order to spread information about climate change, thus leading to more action being taken.


2.3 Recommender Systems and NLP

Users of applications such as Netflix [21] and Spotify [26] have definitely encountered a recommender system. When using e.g. Spotify, the recommended products, from now on referred to as items, are songs, artists, and podcasts. The system uses information about the user, such as what they have previously listened to, to recommend music that the user will most likely enjoy.

Aggarwal, in his book about recommender systems [1], claims that the main goal of recommender systems is to increase product sales. This statement is based on the recommended items being products for sale, but it can be applied to texts as well. We can look at two cases: one with texts behind paywalls and another where the texts are open for everyone. In the first case, relevant texts recommended to the user will hopefully lead to more users buying reports or subscribing to services providing reports, thereby increasing revenue. In the second case, the aim is not to make money but to spread information. If users of the system have easier access to reports of interest, they are more likely to read the information. It also increases the visibility of similar reports or articles.

To make use of recommender systems in the context of items containing texts, Natural Language Processing (NLP) is used. NLP can be seen as the transformation of natural language into a language that a computer can understand. It includes, among other things, fetching the text data, dividing it into words, removing irrelevant words (e.g. prepositions, noise, very rare words), and representing the words as vectors [4]. NLP is explained further in Section 3.2.

There are different ways of representing words as vectors. In this thesis, three different approaches are compared: Bag of Words, TF-IDF and Doc2vec. They are all described in Section 3.


3 Method/Approach/Techniques

3.1 Recommender Systems

According to Aggarwal, the author of Recommender Systems, there are two ways to formulate a recommender system problem: the prediction version and the ranking version [1]. The first, the prediction version, predicts the rating of an item for a given user, based on earlier information about user-item combinations. The second, the ranking version, does not focus on rating items for given users, but rather on ranking items. The focus is to create a list of items that the user will most likely find relevant and then display the top k of them. This is referred to as the top-k recommendation problem, and it is the approach used from here on.

For a recommender system to be considered good, it must strive to achieve certain goals. These goals include relevance, novelty, serendipity (or unexpectedness), and diversity [1]. The items recommended need to be relevant, but also novel to the user, i.e. texts that the user has not already read. Preferably, the items should also not be too obvious or expected; the user should discover something new. Last but not least, the items should be diverse, increasing the chance of the user finding something they like.

3.2 Natural Language Processing (NLP)

For a computer to be able to process natural language text, the text needs to be preprocessed. This is necessary for the program to handle the data in a correct way.

The preprocessing can be performed using the following steps [9]:

1. Tokenization: Divide the text into words, also called tokens, to be able to handle the text more easily. The division is made wherever there is a space, resulting in a sequence of tokens.

2. Remove stop words and punctuation: Remove unnecessary words that occur often in most texts, such as “was”, “them”, “by”, and “both”, and remove punc- tuation (!?:;., etc.). This is to be able to focus on what distinguishes different texts.

3. Lowercase: All letters are turned into lower case to make sure there is no differ- ence between Climate (if the word is at the beginning of a sentence) and climate.


4. Part-of-speech tagging (POS-tag): Each word is tagged as an adjective, verb, noun, or adverb, depending on the context of the word [20]. The POS-tagging is necessary to make the next step possible.

5. Lemmatization: A word's lemma is the root form of the word. For example, writing, wrote, and written all have the lemma write. The same goes for adjectives (happy, happier, happiest → happy), nouns (apple, apples → apple), and adverbs (good, better, best → good). In this step the word is checked against a dictionary to get the correct lemma, with help from the POS-tag.

6. Stemming: Turn all words into their word stem by removing the ending of the word. The stem does not need to be a real word. For example fishing, fished, and fisher have the stem fish and argue, argued, argues, arguing have the stem argu.

Here is an example of texts with their corresponding results after preprocessing:

1. "There is snow in the Arctic." → ['there', 'snow', 'arctic']

2. "Snow is white and cold." → ['snow', 'white', 'cold']

3. "My favorite color is white." → ['my', 'favorit', 'color', 'white']
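A compact sketch of this pipeline, using a toy stop-word list and a toy suffix-stripping rule in place of NLTK's full stop-word list, POS tagger, lemmatizer, and stemmer (both toys are assumptions chosen so the three example sentences come out as above):

```python
import re

# Toy stop-word list and suffix rule; the project itself uses NLTK's full
# stop-word list, POS tagger, lemmatizer, and stemmer instead.
STOP_WORDS = {"is", "the", "in", "and"}

def preprocess(text):
    # 1. Tokenization: split on whitespace.
    tokens = text.split()
    # 2./3. Strip punctuation and lowercase each token.
    tokens = [re.sub(r"[^\w]", "", t).lower() for t in tokens]
    # 2. Remove stop words (and any tokens emptied by the previous step).
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    # 6. Toy stemming rule: drop a final 'e' from longer words
    #    (favorite -> favorit), a stand-in for a real stemmer.
    return [t[:-1] if t.endswith("e") and len(t) > 5 else t for t in tokens]

print(preprocess("There is snow in the Arctic."))  # ['there', 'snow', 'arctic']
print(preprocess("My favorite color is white."))   # ['my', 'favorit', 'color', 'white']
```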

3.3 Vector Representation

We want to represent the documents as vectors to be able to transform and compare them easily. There are multiple ways to represent a document as a vector. Two approaches are Bag of Words and Doc2vec. Their main difference lies in whether they take the order of words into consideration or not.

3.3.1 Bag of Words

The Bag of Words method treats the documents regardless of the order of their words. To get the length of each vector representing a document, we need to create the dictionary. The dictionary is the combination of all unique words within the corpus, where the corpus refers to all processed documents. In the previous example the dictionary will be a list containing ('there', 'snow', 'arctic', 'white', 'cold', 'my', 'favorit', 'color'). The document vectors are then represented in the document-term matrix, shown in Table 2: for each word in the dictionary, we count how many times it appears within the document and place that number on the corresponding row. In this example, no document contains the same word more than once, and therefore the matrix only contains ones and zeros.

          doc 1  doc 2  doc 3
'there'   1      0      0
'snow'    1      1      0
'arctic'  1      0      0
'white'   0      1      1
'cold'    0      1      0
'my'      0      0      1
'favorit' 0      0      1
'color'   0      0      1

Table 2 Document-Term Matrix
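The construction of the dictionary and the document-term matrix can be sketched as follows; this is a plain-Python illustration of Table 2, not the project's actual (sparse) implementation:

```python
# Plain-Python illustration of Table 2; the real system stores this
# matrix sparsely (see Section 4.2.1).
docs = [
    ["there", "snow", "arctic"],          # "There is snow in the Arctic."
    ["snow", "white", "cold"],            # "Snow is white and cold."
    ["my", "favorit", "color", "white"],  # "My favorite color is white."
]

# The dictionary: every unique word in the corpus, in order of first appearance.
dictionary = []
for doc in docs:
    for word in doc:
        if word not in dictionary:
            dictionary.append(word)

# One row per dictionary word, one column per document, holding term counts.
matrix = [[doc.count(word) for doc in docs] for word in dictionary]

print(dictionary)  # ['there', 'snow', 'arctic', 'white', 'cold', 'my', 'favorit', 'color']
print(matrix[1])   # counts of 'snow' per document: [1, 1, 0]
```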

3.3.2 Doc2vec

Doc2vec [23] is a method for word embedding and text processing based on artificial neural networks [10]. Doc2vec is an extension of Word2vec [24], but applied to whole documents. Thus, to understand Doc2vec, it is best to start by explaining Word2vec.

The goal of Word2vec is to make vector representations of each word, where words similar to each other appear close in the vector space. In this vector space, we also want to be able to add and subtract word vectors to/from each other and get reasonable answers, e.g. king - man + woman = queen. These vectors can be created as a “side effect” of a neural network trained to predict a word based on a specific context.

Both the Word2vec and Doc2vec networks, as seen in Figure 1, have three layers: the input layer, the hidden layer, and the output layer. For the Word2vec network, the input layer encompasses word vectors describing the context, and the output layer is the word we expect from that given context. This input and expected output can be seen as a sliding window with e.g. five spots. The two spots to the left and right are the "context" of the word in the middle spot, which is the expected output. The sliding window moves one spot to the right after each run. An example can be seen in Figure 2, with a window of width 5.

In the middle spot, we transform the input vectors using weight matrices (W, of dimension V × N). The network is trained to deliver the correct output by comparing the produced output to the expected output and then sending that information back through the system. This method is known as back-propagation. After training a given number of times, or when the outputs predicted by the network are deemed accurate enough, the weight matrices are obtained. These weights are what we will use as the vector representation of each word.

Figure 1 Overview of the Doc2vec Network

Figure 2 First Two Steps of Sliding Window. The green spot is the word we want to predict and the yellow spots are the words we feed as input to the network.

This is similar for Doc2vec, but we also put in a vector representation of the document, which is just a document tag that could be the index of the document. This vector will also get a weight matrix, and that is what is used as the vector representation of the document [14].

3.4 Transformation

3.4.1 TF-IDF

TF-IDF [12] stands for Term Frequency - Inverse Document Frequency and is a method used for weighting the elements in the document-term matrix created with the Bag of Words method. TF-IDF operates by assigning a weight to a word based on the frequency and extent by which it appears. A large weight corresponds to a word appearing with a high frequency in a small number of documents. A small weight corresponds to words that are present in several or all documents (such as stop words).

For each element, we look at a specific term and a specific document, which gives four numbers to take into account: the number of times the word occurs in the document (the number found in the document-term matrix), the number of words in the document, the number of documents in the corpus, and the number of documents that contain the word. See Equations (1), (2), and (3).

TF = (number of times the word occurs in the document) / (number of words in the document)   (1)

IDF = (number of documents in the corpus) / (number of documents that contain the word)   (2)

TF-IDF = TF × IDF   (3)

The transformation is performed on every element in the matrix but will not affect the elements that are zero. This is because the TF term is zero when the word does not appear in the text.

If TF-IDF were to be applied to the document-term matrix seen in Table 2, we would get the transformed matrix seen in Table 4; the TF and IDF factors of each entry can be seen explicitly in Table 3. This transformation is made with 3 as the size of the corpus (number of documents) and 3, 3, and 4 as the document lengths.

          doc 1      doc 2      doc 3
'there'   1/3 × 3/1  0          0
'snow'    1/3 × 3/2  1/3 × 3/2  0
'arctic'  1/3 × 3/1  0          0
'white'   0          1/3 × 3/2  1/4 × 3/2
'cold'    0          1/3 × 3/1  0
'my'      0          0          1/4 × 3/1
'favorit' 0          0          1/4 × 3/1
'color'   0          0          1/4 × 3/1

Table 3 A TF-IDF Transformation of a Document-Term Matrix (each entry written as TF × IDF)

          doc 1  doc 2  doc 3
'there'   1      0      0
'snow'    1/2    1/2    0
'arctic'  1      0      0
'white'   0      1/2    3/8
'cold'    0      1      0
'my'      0      0      3/4
'favorit' 0      0      3/4
'color'   0      0      3/4

Table 4 A TF-IDF Transformed Document-Term Matrix
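The values of Table 4 can be verified with a short script implementing Equations (1)-(3) exactly as defined above (note that this definition of IDF has no logarithm, unlike some other TF-IDF formulations):

```python
# Recompute Table 4 from the counts of Table 2 using Equations (1)-(3).
from fractions import Fraction

counts = {  # rows of Table 2
    "there":   [1, 0, 0], "snow":    [1, 1, 0], "arctic": [1, 0, 0],
    "white":   [0, 1, 1], "cold":    [0, 1, 0], "my":     [0, 0, 1],
    "favorit": [0, 0, 1], "color":   [0, 0, 1],
}
doc_lengths = [3, 3, 4]
n_docs = 3

def tf_idf(word, d):
    tf = Fraction(counts[word][d], doc_lengths[d])             # Equation (1)
    idf = Fraction(n_docs, sum(1 for c in counts[word] if c))  # Equation (2)
    return tf * idf                                            # Equation (3)

print(tf_idf("snow", 0))   # 1/2
print(tf_idf("white", 2))  # 3/8
print(tf_idf("my", 2))     # 3/4
```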

3.5 Related Work

There are multiple ways to choose which texts are to be recommended to a specific user. Two different ways to solve this are either recommendation based on the appearance of identical words, or actually trying to "understand" the meaning of the text and making the recommendation based on that. The first method can be a simple keyword search, but the second needs to be more sophisticated. Examples of methods that attempt to understand texts include the three implemented in this thesis, but also a method called BERT, which is explained in Section 3.5.3.


3.5.1 Keyword Search

Keyword search is the most basic way for a user to find what they are looking for. Vector representation is not used in keyword search. Each word in the search string is compared with each word in the corpus, and the text with the most matches is recommended.

3.5.2 Regular Expressions

Regular expressions are quite similar to keyword search; they simply compare the characters and decide on what result to give. What makes regular expressions different from a simple keyword search is that they compare the exact order of all characters and words. The sentence "The arctic ice is melting" can give a result, but inserting a new character in the middle, even just an extra space or a comma, might change the result completely.
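A minimal illustration of this brittleness, using Python's re module and a made-up sentence:

```python
import re

# An exact regular-expression search succeeds only when the query's
# characters match the text exactly; the corpus sentence is hypothetical.
corpus_text = "The arctic ice is melting, and sea levels are rising."

print(bool(re.search("arctic ice is melting", corpus_text)))   # True
print(bool(re.search("arctic  ice is melting", corpus_text)))  # False: extra space
print(bool(re.search("arctic ice, is melting", corpus_text)))  # False: extra comma
```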

3.5.3 BERT

Researchers at Google AI Language have developed the language representation model BERT (Bidirectional Encoder Representations from Transformers) [6]. Unlike Word2vec, BERT's vector representations are non-unique in the sense that the same word can have several representations, depending on the context in which the word appears. This can be explained as BERT using local vector representations and Word2vec using global vector representations. This matters if we have a corpus containing e.g. texts mentioning the word bank as a financial institution but also bank as in "river bank". In BERT, bank will be two different vectors, one close to words such as money, financial, and credit, and the other close to words such as river and water. The corpus used in this project does not contain many homonyms, because all texts are related to the same topic, and therefore this feature is not as relevant as for other corpora where the texts differ more.

3.6 Evaluation Methods

The dataset used when training and evaluating the implemented models contains 10,000 unique abstracts in random order, all in one file, each separated by a "\n" (newline).


We will look at the results given when asking the three different systems (Bag of Words, TF-IDF, and Doc2vec) for the top five abstracts most similar to the input. The input will be one of three different types:

• an abstract that is already in the dataset

• a made up sentence

• a sentence taken from an abstract in the dataset

Undoubtedly, when using an abstract as input, the top 1 expected result should be that exact abstract; otherwise, something is wrong with the implementation. The top 5 results for all the inputs will be compared with each other based on their content. The content of the results will also be compared with the content of the input, to find any similarities in meaning. If there are no similarities, the method is not working as expected. Both the words themselves and their meaning will be taken into account. For instance, water level and sea level refer to the same phenomenon even though different words are used to describe it.

Further, a sentence taken from an abstract will be used, to see if that abstract is among the top 5 recommended ones when searching. In addition, the sentence will be modified to see if the method still recommends the correct abstract or not. The implemented methods will be compared to a naive/simple approach to searching. This approach is Regular Expressions [19], where the exact string entered into the search field is searched for (except that capitalized letters are not taken into account). Regular expressions were introduced more thoroughly in Section 3.5.2.

The evaluation will be done by subjectively evaluating the output; no interviews will be held and no forms will be sent out. These two methods have been used in other thesis projects similar to this one [7], but due to insufficient time caused by the issue of obtaining data, there was no room for either interviews or forms.

4 System Structure

The recommender system is written in Python 3.7.4. Python was chosen because of libraries such as NLTK [22] (for tokenizing, stemming, and lemmatizing), SciPy [25] (which has classes for sparse matrices), Gensim's Doc2Vec [23], and pickle (for saving models and matrices after training). The system contains one part for training and one part for recommending. The training part is explained below in Section 4.2 and the recommendation part is walked through in Section 4.3.

4.1 The Dataset

The data are found in a separate file and are read into the program by opening the file and reading it line by line. Each line is an abstract, and a line break indicates a new abstract. Following this, all abstracts are put into a list, which is then used by the preprocess function.

Google's Dataset Search was used to find a climate change dataset provided by the University of Illinois at Chicago [5]. This dataset is not without flaws. In fact, there might be cases of one line containing two abstracts, or of one line containing only half an abstract. This is because the data were not given in the desired format when extracted: multiple abstracts (all published the same year) appeared on one row, not separated by any character, not even a space, e.g. "This is the first abstract. It contains words.This is the second abstract. It also contains words.This is the third abstract.". This missing space was, however, what made it possible to split the abstracts.

Regular expressions were used to find these critical points in the text, where a sentence ends with a period immediately followed by the next sentence without a space, and a newline was inserted at each such point. Since there is no guarantee that spelling mistakes do not exist, e.g. a sentence followed by another sentence without a space, this method is not 100% accurate. Alternatively, an abstract can incorrectly end with a period followed by a space; when such an abstract is joined with another one, the two will have a space between them even though they should not.
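The splitting heuristic can be sketched as a single substitution; the pattern below is an assumed reconstruction of the idea (a period immediately followed by a capital letter), not the thesis's exact expression:

```python
import re

# Insert a newline wherever a period is immediately followed by a capital
# letter, i.e. where one abstract runs into the next without a space.
raw = ("This is the first abstract. It contains words."
       "This is the second abstract. It also contains words."
       "This is the third abstract.")

abstracts = re.sub(r"\.(?=[A-Z])", ".\n", raw).split("\n")
print(len(abstracts))  # 3
print(abstracts[1])    # This is the second abstract. It also contains words.
```

As noted above, this heuristic is imperfect: it will also split mid-abstract wherever a sentence happens to end right before a capital letter with no space.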

The provider of the data specifies the number of abstracts to be 36,997, and using the described method 34,926 abstracts were found. The difference might be a mix of lines containing two abstracts and abstracts spanning more than one line; the former subtracts abstracts while the latter adds them. In other words, they compensate for each other. In any case, the discrepancy is small and considered good enough. Of these 30,000+ abstracts, 10,000 were randomly chosen to become the dataset. The reason for only keeping 10,000 is computational constraints: the bigger the dataset, the longer it takes to train the models and use the recommender system.

4.2 Training of the System

Before training the system, we need to preprocess the data. This includes tokenization, removal of stop words and punctuation, POS-tagging, lemmatization, stemming, and lowercasing. How these steps work is explained in Section 3.2.

4.2.1 Bag of Words

After preprocessing the data, all words are put together in a dictionary. The dictionary contains all words of all texts, but with duplicates removed. This dictionary is what is called a "bag of words": just words in a list, with no regard to their order. From the dictionary, the document-term matrix is created, as described in Section 3.3.1. This matrix is going to be a sparse matrix, since most of its elements will be zeros, and therefore the data type scipy.sparse.coo_matrix ("A sparse matrix in COOrdinate format") is used to represent it. If a list of lists were used to represent the matrix, a vast amount of memory would be spent only on saving zeros; with SciPy's sparse matrices, only the values that differ from zero are saved, together with the size of the matrix. The document-term matrix is then filled with the count of how many times a term appears in a document. The type of the elements in the matrix is set to NumPy's np.int8, which only occupies 8 bits instead of the wider default integer type, a decision taken in order to save memory. The output of this part is a pickle file containing the document-term matrix, which is later used in the recommending part, where it is compared with the vector created for the input search string. A pickle file is a way to save information (e.g. variables, models) from the program even after it has finished executing.
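A sketch of how the example matrix of Table 2 could be stored this way; the row and column coordinates below are those of the example, and only the ten non-zero counts are kept:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Table 2 stored in COOrdinate format: only the non-zero counts and their
# (row, column) positions are kept, with 8-bit integer entries.
rows = [0, 1, 1, 2, 3, 3, 4, 5, 6, 7]  # dictionary index of each word
cols = [0, 0, 1, 0, 1, 2, 1, 2, 2, 2]  # document index
data = [1] * 10                        # term counts

dtm = coo_matrix((data, (rows, cols)), shape=(8, 3), dtype=np.int8)
print(dtm.nnz)           # 10 stored values instead of 8 x 3 = 24 elements
print(dtm.toarray()[1])  # row for 'snow': [1 1 0]
```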

4.2.2 TF-IDF

The first steps of this part have already been carried out in the functions of the Bag of Words method. The document-term matrix created there is transformed with the TF-IDF method, which is explained in Section 3.2. Because values in the matrix will be transformed, the matrix is converted to another of SciPy's sparse matrix types: scipy.sparse.dok_matrix.


The output is a pickle file containing the transformed document-term matrix.
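The transformation step can be sketched as below. This uses the common tf · log(N/df) weighting; the exact TF-IDF variant used in the thesis may differ, and the toy counts are made up:

```python
import numpy as np
from scipy.sparse import coo_matrix, dok_matrix

# Toy term-document count matrix (terms as rows, documents as columns).
counts = coo_matrix(np.array([[2, 0, 1],
                              [0, 3, 0],
                              [1, 1, 1]], dtype=np.int8))

arr = counts.toarray().astype(float)
n_docs = arr.shape[1]
df = (arr > 0).sum(axis=1)      # document frequency of each term
idf = np.log(n_docs / df)       # inverse document frequency

# Values change type and magnitude, so a dok_matrix holds the result.
tfidf = dok_matrix(arr.shape, dtype=float)
for i, j in zip(*arr.nonzero()):
    tfidf[i, j] = arr[i, j] * idf[i]
```

Note how a term that occurs in every document (the last row) gets weight zero: it carries no discriminating information, which is the point of the IDF factor.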

4.2.3 Doc2vec

The preprocessing of data for training the Doc2vec model is a light version of the preprocessing explained above: no removal of stop words and punctuation, POS tagging, lemmatization, or stemming is carried out; only tokenization and lowercasing are applied. When preparing the data for Doc2vec it is desirable to keep as many words as possible, to get a better understanding of the context. If stop words are removed, some context might be missing. The model is trained using the Gensim model Doc2Vec with the corpus as the only required input. The output of the training is pickle files, saved into a folder specifically created for them. They are then fetched by the recommender file and used when calculating which abstracts to recommend.

4.3 Making a Recommendation

Making a recommendation is done in the same way for all the methods (Bag of Words, TF-IDF, and Doc2vec). The search string is preprocessed in the same way as the corresponding model/matrix. After that, it is represented as a vector in the same space as the corresponding matrix. The similarity is decided based on the cosine similarity, which is described in Section 4.3.4.

4.3.1 Bag of Words

The search string is preprocessed in the same way as the dataset for the Bag of Words method. It is then transformed into a vector, which can also be considered a 1×m matrix, where m is the length of the original dictionary. The dictionary does not have to expand with any new terms from the search string: if the search string contains new words, they will not match any of the words already in the dictionary for the dataset and therefore do not make any difference.

The vector representation of the search string is then compared with each of the columns in the document-term matrix with the cosine similarity measure.


4.3.2 TF-IDF

The search string is preprocessed and transformed into a vector, just as for the Bag of Words method described in the section above. Before calculating the cosine similarity, the vector is transformed with the TF-IDF method, using information both from the search string vector and from the document-term matrix. The vector representation of the search string is then compared with each of the columns in the TF-IDF-transformed matrix, using the cosine similarity measure.

4.3.3 Doc2vec

Preprocessing the search string is done the same way as for the dataset before training the model; this light preprocessing is described in Section 4.2.3. The search string is then represented as a vector, using information from the trained model. Then, the vector representation of the search string is compared with the representations of all texts in the dataset, which are fetched from the pickle file created in the training step.

4.3.4 Cosine Similarity Measure

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It computes the angle between the two vectors and takes the cosine of that angle. This cosine value estimates how close the vectors are, in other words, how similar they are. The cosine of 0°, 90°, and 180° is 1, 0, and -1, respectively. It follows that two almost identically oriented vectors have a value close to 1. The formula for the cosine similarity between two vectors A and B is given in Equation (4).

similarity(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖) = ( Σ_{i=1}^{n} A_i B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )   (4)

4.3.5 Precision and Recall

Another applied measurement is precision and recall, which is used in the evaluation of the results to determine the accuracy of the model. This measurement observes how many times a model correctly or incorrectly predicts a class.


Figure 3 Visualization of true positives, false positives, true negatives and false negatives.

Precision answers the question "how many selected items are relevant?" and recall answers the question "how many relevant items are selected?". The selected elements that are positive are called true positives (TPs), the selected elements that are negative are called false positives (FPs), the positive elements that are not selected are called false negatives (FNs), and the negative elements that are not selected are called true negatives (TNs).

See Figure 3.

Precision = TPs / (TPs + FPs)   (5)

Recall = TPs / (TPs + FNs)   (6)

As seen in Equations 5 and 6, precision is calculated by dividing the number of true positives by the number of selected elements, and recall by dividing the number of true positives by the number of relevant elements.
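The two formulas translate directly into code. The helper name and the example sets below are hypothetical:

```python
def precision_recall(selected, relevant):
    """Precision = TP/(TP+FP) and recall = TP/(TP+FN) for sets of item ids."""
    tp = len(set(selected) & set(relevant))
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# A top-5 result list in which two of the three relevant abstracts appear.
p, r = precision_recall([1, 2, 3, 4, 5], [2, 5, 7])
```

With 2 true positives out of 5 selected and 3 relevant items, precision is 0.4 and recall is 2/3.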


Figure 4 A PCA Plot of the Doc2vec Representations of All Abstracts in the Dataset

5 Evaluation of Results

Results have been produced from multiple different types of input: both strings and a whole abstract, with variations of some of the strings.

To see how good the methods are as recommender systems, we will look at an abstract (see Appendix B) and a made-up sentence (see Appendix A) as inputs and compare the results to each other. It would have been a good idea to choose a made-up sentence based on wanted results that exist in the dataset. The made-up sentence "differences in ice thickness over a longer period of time, affecting the water levels globally" did not give as good results as expected, but that may be because it is too specific and results matching this specific description do not exist.

The PCA plot in Figure 4 shows a representation of the embeddings created with the Doc2vec method. PCA stands for principal component analysis and offers a way to reduce the number of dimensions, making it possible to visualize vectors from a vector space with more than 2 dimensions. The x and y axes do not have specified labels, since only the positions of the representations relative to each other are relevant. The choice of k is arbitrary, since how the abstracts are grouped is not important; the grouping is just a visualization aid. The red dot is the representation of the input.
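The dimensionality reduction behind such a plot can be sketched with a plain SVD-based PCA; the random embeddings below merely stand in for the real Doc2vec vectors:

```python
import numpy as np

# Stand-ins for the Doc2vec embeddings (hypothetical 50-dimensional vectors).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 50))

# PCA via SVD: centre the data and project onto the top-2 principal axes.
centred = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
coords = centred @ vt[:2].T     # the 2-D points that would be plotted
```

Each row of `coords` is one abstract's position in the 2-D plot; only the relative distances between points carry meaning.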


To see how good the methods are as search systems, we will look at sentences taken directly from abstracts within the dataset: 35 shorter ones, approximately one sentence long (see Appendix C), 35 longer ones, approximately two sentences long (see Appendix D), and variations of each one of them. The result of this is a boolean telling whether the correct abstract is within the top 5 of the search or not. For the naive method we are comparing with, a regular expression search, the result is true for the non-modified search string and false for the modified ones.
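The naive baseline can be sketched as a literal, case-insensitive regular-expression search; the toy abstracts and queries below are made up:

```python
import re

# Two toy abstracts; the query is taken verbatim from the first one.
abstracts = ["Ice thickness is decreasing rapidly in the Arctic.",
             "Water footprints differ widely between assessment methods."]
query = "ice thickness is decreasing"
modified = "the decrease of ice thickness"   # same meaning, reworded

def regex_hits(q):
    # Literal substring search: the query is escaped so regex metacharacters
    # are matched verbatim, and matching ignores case.
    return [i for i, a in enumerate(abstracts)
            if re.search(re.escape(q), a, flags=re.IGNORECASE)]
```

The unmodified query finds its abstract, while the reworded one finds nothing, which is exactly the true/false behavior described above.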


6 Results and Discussion

When creating (and modifying) the dense matrices used for all three methods, more memory is used than a regular computer possesses, so the computer used could not execute the computations. When using sparse matrices, explained in Section 4.2, the memory usage was improved, but in exchange, execution took longer.

A thing to keep in mind when looking at the results is that numbers and other characters outside a-z are not taken into account, since they are removed in the preprocessing. Thus, a query that includes, e.g., a year or a specific temperature will not have those characters considered by any of the three methods, in contrast to the regular expression method.

When it comes to searching and recommending published reports and the like, it is worth mentioning that many of them have parts like "in this report we will discuss...", "the subject of this report is", and similar. The longer the text, the less these types of sentences matter. This goes for both the input and the texts in the dataset. For shorter texts, however, they could impact the result.

Three different results have been taken into account when evaluating: how well the systems work with a string extracted from an abstract as input, with a made-up string as input, and with a whole abstract as input, as mentioned before.

6.1 A String That Is a Part of an Abstract as Input

Given a string that is a part of an abstract in the dataset, Tables 5 and 6 show the precision and recall for both a shorter sentence and a longer one. Appendices C and D show whether the correct abstract is found among the top 5 recommended abstracts; for the regular expression (regex) method, the tables simply show whether the abstract is found or not, with T meaning true and F meaning false. All four methods are tested using the original (orig.) string and a modification (mod.) of the string. The modifications applied are exchanging words for their synonyms, changing the sentence from a statement to a question, or moving the last part of a sentence to the beginning while keeping its meaning. All sentences used are to be found in Appendices C and D.

In Tables 5 and 6 the precision and recall are shown. How precision and recall are calculated is explained in Section 4.3.5. In this case, because the numbers of selected


Method                                   Precision   Recall
Bag of Words, original string            0.900       0.900
Bag of Words, modified string            0.800       0.800
TF-IDF, original string                  0.267       0.267
TF-IDF, modified string                  0.233       0.233
Doc2vec, original string                 0.067       0.067
Doc2vec, modified string                 0.067       0.067
Regular expressions, original string     1           1
Regular expressions, modified string     0           0

Table 5 Precision and Recall With a Shorter String as Input

Method                                   Precision   Recall
Bag of Words, original string            1           1
Bag of Words, modified string            1           1
TF-IDF, original string                  0.600       0.600
TF-IDF, modified string                  0.567       0.567
Doc2vec, original string                 0.700       0.700
Doc2vec, modified string                 0.700       0.700
Regular expressions, original string     1           1
Regular expressions, modified string     0           0

Table 6 Precision and Recall With a Longer String as Input


are to be found in Appendix D.

All three methods perform better with a longer string, whilst the naive method performs the same. For the unmodified string, the only method that performs as well as the naive one is Bag of Words, and only for the long string. However, for the modified string, all methods perform better than the naive one, no matter the length of the string.

If the search string is modified, the probability of finding the correct abstract using the regular expression method is 0%. For the Bag of Words method the probability is 69% given a modified short string, since 24 out of 35 correct abstracts were found; for the original short string, the probability is 77%. See Table 7 and Table 8 for all results, which are the basis of these calculations.

From these results, all three implemented methods perform better than the naive method when it comes to a modified string, i.e. a string that is not formulated in exactly the same way as in the searched-for abstract.

6.2 A Made Up Search String as Input

The results when using a made-up search string for Bag of Words, TF-IDF, and Doc2vec can be seen in Appendix A. The string "differences in ice thickness over a longer period of time, affecting the water levels globally" is expected to give results related to this sentence. The general subjects of the recommended abstracts for the three methods can be seen below.

6.2.1 Bag of Words

1. water scarcity and differences in methods for assessing water use
2. analysis of the impact of water ownership and water market trade strategy
3. usage of water footprint as a measure of freshwater use and resources
4. indicators used to measure the performance of water technologies
5. the application of water footprinting methodologies


texts. The other words of the search string, such as differences, ice, and thickness, do not appear as many times in any abstract as the word water does. The reason for this is probably that the word water is more common than any of the other words in the search string, looking at the abstracts in the dataset.

None of the recommended abstracts are related to the meaning of the whole search string.

6.2.2 TF-IDF

1. a multivariate data assimilation experiment conducted in order to improve the global representation of both the ocean and sea ice fields through the inclusion of sea ice concentration (SIC) data
2. the financial costs of a national transition to electric vehicles
3. determining the carbon footprints of fermentable sugars stored as raw thick juice
4. assessing alternative retrofit strategies for the roof and exterior walls of dwellings
5. environmentally friendly hemp concrete, used for building construction

The reason an abstract about raw thick juice is recommended by the TF-IDF method could be the word thickness. In the context of comparing ice thickness with juice thickness this does not make sense, but since the TF-IDF method does not take context into account it is not unexpected. The same goes for the text about retrofit strategies for roofs and exterior walls of dwellings and the one about hemp concrete, which also mentions the word thick multiple times (when talking about the thickness of insulation and wall thickness).

None of the recommended abstracts are related to the meaning of the whole search string, except for the first one that discusses for instance surface air temperatures and areas containing sea ice.

6.2.3 Doc2vec

1. water self-sufficiency potential of urban systems from urban runoff
2. impacts of climate on groundwater through natural and human-induced processes
3. waste solvent technologies as alternatives to the disposal of spent acetone-water mixtures
4. evaluating water systems of the water sector in Copenhagen
5. methodology for assessing the water footprint of machine tools

All abstracts recommended by the Doc2vec method are about water levels in some way but none of them are related to ice thickness.

None of the methods recommended the same abstract, which is due to the methods taking different measures into account. It is, for example, expected that the Bag of Words method recommends longer abstracts, since the chance of finding more mentions of a specific word increases with the total number of words in a text. All of the results mentioned the word water at least once, except for two of the abstracts recommended by the TF-IDF method.

To get a better result, it might have been more successful to use a less specific string. The reason the string used did not result in any particularly good recommendations could be that there are no abstracts in the dataset related to this search string.

6.3 An Abstract as Input

The full abstract used as input can be read in Appendix B, as well as the outputs, but its main topic is the problem of setting the system boundary of waste disposal for the Life Cycle Assessment (LCA) method. The general topics of the recommended abstracts for the three methods can be seen below. All of the methods recommend the input abstract itself as the top 1 most similar, but the rest of the top 5 differ between the methods.

6.3.1 Bag of Words

1. present a comprehensive discussion of system boundaries in LCA and to develop an appropriate boundary delimitation method
2. to examine the common performance indicators used to assess the environmental benefits of municipal waste systems
3. evaluation of the effect on environmental load decline by LCA, applied to a new recycling technology
4. a comparison between Environmental Impact Assessment (EIA) and the product Life Cycle Assessment (LCA), and to establish how to evaluate the effect of re-


The first recommendation is highly related to the input and discusses the same problem, albeit from another angle. The second one does not mention LCA, but also discusses how to assess the environmental impacts associated with the stages of the life cycle of a product, process, or service, which is the definition of LCA. The third and fourth recommendations were also related to the input, but the fourth wrongfully consisted of two abstracts (although both of them are related to the input). Read more about why the datasets contain some lines that are two abstracts put together in Section 4.1.

6.3.2 TF-IDF

1. environmental impact of lead-free electronics through their entire life cycle
2. a study on the exergetic efficiencies of different unit technologies and management systems, based on an inventory analysis
3. discussion about contents in hybrid perovskite solar cells (PVSCs)
4. to measure and evaluate the interactions among various variables in the complex urban ecosystem (CUE), through employing an equilibrium model, a metabolism model, and a harmonious development model

The first recommendation discusses the life cycle of electronics, which is related to LCA and its boundaries. The second and third ones are not relevant to the input. The fourth one also discusses models to use when evaluating systems related to environmental issues.

6.3.3 Doc2vec

1. the reduction effects of CO2 emission from typical steel products by the reduction agent injection into the blast furnace, analyzed by LCA methodology, and LCA conducted on residential gas appliances designed based on the eco-design guide that city gas suppliers planned, for certifying validity and clarifying problems of it
2. the contribution of misplaced special waste to environmental impacts from the incineration of residual household waste, quantified through LCA modeling
3. an LCA comparison between a traditional nickel plating process (the Watts bath) and the citrate bath
4. LCA employed to compare the environmental impact of incineration and the landfilling of municipal solid waste in Sao Paulo City, Brazil

The first recommendation contains two abstracts, just as for the Bag of Words method.

All of the recommended abstracts are about a specific setting where LCA is used to compare or analyze. A suggestion for further work could be to aim for a result that mixes abstracts about the method itself with different settings where it can be used for analysis.

Two of the methods returned a result containing two abstracts; the reason the dataset contains such entries is discussed in Section 4.1. The explanation for why we get two doubled abstracts could be that a text containing two abstracts is longer, and therefore the chances of the looked-for words appearing more times are greater. The only method where this does not matter is TF-IDF, since the word counts are scaled down for long texts by the division by the length of the text, and that was also the only method that did not have a doubled abstract in its result.

7 Conclusions

Looking at the results: when the input is a string that is part of an abstract, the exact string is known, and it is a short sentence, the best method to use is regular expressions (see Table 8). However, when the string becomes longer than or equal to two sentences, Bag of Words is equally good. Also, given that the probability of knowing exactly what to search for is low, the best method to use in this specific context is Bag of Words.

What can be seen in the results is that Bag of Words, TF-IDF, and Doc2vec all perform better with a longer string. This is probably because more words give a more specific description of what is sought.

With a made-up search string as input, three out of five abstracts on the TF-IDF method's top-5 list are recommendations based on the word thickness, even though that word is not relevant unless considered in its context, which is the thickness of ice. The two other methods gave results related to the input.

The recommender system takes around 69 minutes to produce a top-5 result for Bag of Words, 79 minutes for TF-IDF, and 150 seconds for Doc2vec, with a dataset of 10,000 abstracts. This rules out Bag of Words and TF-IDF when a quick result is wanted, although they can still be useful when results are not urgent. Doc2vec is the most convenient method when results are wanted quickly, but it instead takes a lot of time to train.

The fact that all methods recommend the input abstract itself as the most related is a good sign that the recommender system is working well. Regarding TF-IDF, given both that only half of its recommendations were relevant and that it takes a long time to run, it is not useful. Both Bag of Words and Doc2vec produced relevant recommendations and are therefore good choices.


8 Future Work

Due to time constraints when comparing three different methods, there was not enough time to make sure their implementations are optimal. It would be interesting to explore only one of the methods with various parameters and compare the results. A suggestion would be to tune the parameters of one of the methods and compare different variations of that method with each other.

The testing with a random string as input could be improved. If a string were used that definitely has related items in the dataset, the evaluation could also be automated. This would, however, take some time, since extensive knowledge about all texts in the dataset is needed.

If the results could be evaluated automatically, the evaluation would be much more effective. Two solutions could be to either use a tagged dataset, with tags for which texts are related to each other, or to use a user-item-based system, with information about users' ratings of items on which new ratings can be based. Then the recommender system could check itself on how good its predictions are.

Since the memory capacity of the available computer was not enough, the methods containing matrices had to be implemented with sparse matrices. These take less memory space but in exchange take longer to work with; the creation of vector representations of the texts in the dataset took hours. It would be interesting to investigate how much the time could be shortened when using faster ways of representing the matrices. To solve the problem with memory space, a cloud-based solution could be used.



A Top 5 for Search String as Input

Using the string "differences in ice thickness over a longer period of time, affecting the water levels globally" as input to the methods Bag of Words, TF-IDF, and Doc2vec, the top 5 recommended texts shown below are given for the corresponding method. All of them are abstracts of published reports, all related to climate change.

A.1 Bag of Words

1. Water scarcity, a function of supply and demand, is a regional issue with global repercussions, given that i) the increasing human population and demand for animal products will increase water demand and that ii) global climate change is altering rainfall patterns worldwide. Water can be divided into "blue" (surface and groundwater), "green" (soil water subject to evapotranspiration) and "grey" water (water necessary to dilute pollutants to acceptable levels). On a global scale, agriculture represents 70% of blue water use. One main difference among all methods for assessing water use is whether and how they include green and grey water with blue water. The "water footprint" approach includes green and grey water, whereas life cycle assessment approaches tend to exclude them or to include only the variation in green water availability resulting from land use change. A second difference is whether water use is reported as a volume of water (L) or a volume weighted by a water stress index (L water equivalents). Because of these differences and the few livestock systems studied, methods give wildly different results for the same livestock product. Ultimately, water scarcity depends on blue water use.

The contribution of livestock to water scarcity can be reduced by decreasing their water consumption and/or that of the irrigated crops they consume.

2. Reforms in the Murray-Darling Basin over the past several decades have led to well developed water entitlement and allocation markets. Irrigators now use a diversity of water trade and ownership approaches, ranging from owning relatively large amounts of water entitlements relative to their annual demand and selling when they have excess water, to owning smaller amounts (or less secure) water entitlements and relying heavily on water allocation markets to meet annual demands. Some irrigators do not trade at all.

Although the benefits of water markets in reallocating water have been well established, there has been very little empirical analysis of the impact that water ownership and water market trade strategy has had on irrigators' farm net incomes. This study uses irrigation industry survey data collected over a five year period from 2006/2007 to 2010/2011 across the Murray-Darling Basin to investigate the relationship that water trade strategy and water ownership have with farm viability (namely farm net income and rate of return). Although this is an interesting period to investigate these relationships, it must be noted that it was a period of extreme water scarcity and high water prices; hence any interpretation of results must take this into account. It was found that the actual volume of water received (which is a measure of water allocations for that region and size and security of water entitlements) is a more significant and positive influence on farm net income than water ownership per se, with this result strongest in the horticulture industry. Water reliability is not as important in the broadacre industry as in other industries. Selling water allocations was a significant and positive influence on farm net income and rate of return. Buying water entitlements was sometimes associated negatively with farm net income and rate of return in our time period, with no statistical significance found for the impact of selling water entitlements in the current year. (C) 2014 Elsevier Ltd. All rights reserved.

3. With increasing pressures on water resources throughout the world, the role of business in managing water resources is increasingly important. Corporately, the management of water resources is changing from one of compliance with local regulations to corporate water stewardship, where industry not only conserves water use within its direct control, "inside the facility fenceline", but also acts responsibly and seeks to influence water managed within the supply chain, the local communities and the wider regional, national and international watersheds, "beyond the fenceline". Understanding the risks and opportunities at these different scales can be developed through water footprinting. The water footprint is a measure of freshwater use and impacts, a more comprehensive measure of freshwater resources appropriation than traditional water abstraction and discharge measures. The water footprint calculates the total water consumption and impact on water resources of a product or service. The footprint, when mapped against water availability, allows businesses to understand where the pressures may lie, so that appropriate strategies can be developed. This chapter presents the method, developed from its implementation across different products, and highlights how the results of the water footprint can help develop and manage appropriate water stewardship strategies.

4. The effective management of fresh water through the use of water technologies is central to international debate. However, available indicators used to measure the performance of water technologies have several limitations: they do not comprehensively assess the quantity and quality of water use; they are not able to measure the benefits of locally recovered resources; and they are not simple to apply in a life-cycle perspective.

The goals of this paper are to develop a set of indicators based on the Cumulative Energy Demand (CED) and Energy Pay-Back Time (EPBT) models used in the energy sector, to compare the performance of water-use technologies in different locations, and therefore to measure their benefits in terms of recovered water resources in different contexts. The [...] (Italy) and installed in Rovigo (Italy). To determine their effectiveness, a simulation of their application to the same technology in different Italian locations was performed.

The results confirmed the applicability of the designed set of indicators and the effectiveness of the WPBT in measuring their performance in different contexts. To obtain comprehensive information on the quantity and quality of water used, it is recommended that CWD and WPBT be used together. (C) 2014 Elsevier Ltd. All rights reserved.

5. Efforts undertaken for reducing environmental impacts of energy production have been primarily focused on carbon reduction, while the fact that energy production also requires water has been largely overlooked. During the last decade, and despite the fact that global warming still remains today the focus of many environmental evaluations, water scarcity issues have increasingly received attention. Although there is an increasing demand for large industrial companies to calculate and report on their water footprint, water resources have only recently been addressed in life cycle assessment (LCA) and their assessment still lacks wide application. The paper presents a practitioner's experience with respect to the application of three recently developed water footprinting methodologies that are considered current state-of-the-art. The methods are applied with the objective of estimating the water footprint of combined cycle gas turbines with different cooling technologies. The study reveals that absolute values of water footprints (L-eq. kWh(-1)) are very different among methods, and therefore results are not directly comparable in terms of their absolute values. In contrast, rankings among power plants agree across the different methods when large differences in water consumption/impact exist. However, the ranking may differ between methods when differences are small. It therefore remains impossible to select one method as the preferred method to use. This study therefore serves as support to the recently emerging working groups (e.g. WULCA) that aim at harmonizing the different existing methodologies. (C) 2014 Elsevier Ltd. All rights reserved.
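Top-5 lists like the ones above are produced by embedding each abstract and the query in the same vector space and ranking the documents by cosine similarity to the query. The following is a minimal, self-contained sketch of such TF-IDF retrieval; the toy corpus, tokeniser and idf smoothing are illustrative assumptions, not the implementation used in this thesis.

```python
import math
from collections import Counter

# Toy stand-ins for the abstracts in the database (illustrative only).
documents = [
    "water footprint of combined cycle gas turbines",
    "sea ice concentration data assimilation",
    "corporate water stewardship beyond the fence line",
    "cumulative water demand indicators for water use technologies",
    "water allocations and farm net income",
]

def tokenize(text):
    return text.lower().split()

def build_idf(corpus_tokens):
    """Smoothed inverse document frequency for every corpus term."""
    n = len(corpus_tokens)
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    return {term: math.log((1 + n) / (1 + df[term])) + 1 for term in df}

def embed(tokens, idf):
    """Sparse tf-idf vector; query terms unseen in the corpus are dropped."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

corpus_tokens = [tokenize(d) for d in documents]
idf = build_idf(corpus_tokens)
doc_vecs = [embed(tokens, idf) for tokens in corpus_tokens]

query_vec = embed(tokenize("water footprint of power plants"), idf)
ranking = sorted(range(len(documents)),
                 key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
for rank, i in enumerate(ranking[:5], start=1):
    print(rank, documents[i])
```

Because "footprint" occurs in only one document, its high idf weight pulls the gas-turbine abstract to the top of the ranking, which is the behaviour that distinguishes TF-IDF from plain Bag-of-Words counts.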

A.2 TF-IDF

1. A multivariate data assimilation experiment was conducted in order to improve the global representation of both the ocean and sea ice fields through the inclusion of sea ice concentration (SIC) data. Our method corrects the surface forcing and ocean temperature fields (as well as the SIC field) through the use of three-dimensional variational analysis. The adjustments to surface air temperatures resulting from the SIC assimilation are estimated on the basis of two constraints. First, we assume that the interfacial temperature difference between the surface air and the average value at the ”top” of the grid (which represents a weighted mean according to the relative coverage of sea ice to open water within the grid) is maintained at the pre-assimilation value. Similarly, the vertical temperature structure for each of the five sea ice categories considered
