
Question answering on introductory Java programming concepts using the Transformer

LUKAS SZERSZEN

KTH ROYAL INSTITUTE OF TECHNOLOGY


Civilingenjör Datateknik
Date: March 14, 2021
Supervisor: Richard James Glassey
Examiner: Olof Bälter


Abstract

AI applications for education could help students learn in their introductory programming courses. Many applications for education try to simulate a human tutoring session that engages the student in a dialogue. During the session, they can ask questions and have them answered while working through an exercise. Refining the question-answering capability of such applications may prove to be a base for supplementary education tools. These could be used by students in introductory programming courses to ask questions to review concepts in programming, facilitating the teaching done by professors.

This thesis investigates question answering on introductory Java programming using the Transformer model. The focus is on the extent to which the model can answer questions on Java concepts when trained on questions and answers from the online programming forum Stack Overflow. A total of five Transformer models with default parameters were trained on posts segmented with different granularities using byte-pair encoding. Each model was evaluated using perplexity as an automatic metric and a qualitative evaluation done by the author.


Sammanfattning

AI applications could be used in education to help students with their learning in introductory programming courses. Many applications with an educational purpose try to simulate a human tutoring environment in which the student, in a dialogue, can ask questions and have them answered while working through exercises. Improving the ability of this kind of application to answer questions could form a basis for supplementary educational material. Such applications could be used by students as part of introductory programming courses; the student could then ask questions to get help when going through concepts and topics in programming, in order to facilitate learning.

This thesis investigates the answering of questions on introductory-level Java programming with the help of the Transformer model. The focus is on the extent to which the model can answer questions about Java concepts when it is trained on questions and answers from the online forum Stack Overflow. In total, five Transformer models with default parameters were trained on forum posts segmented with different granularities using byte-pair encoding. Each model was evaluated with perplexity as well as with a qualitative evaluation done by the author.


Contents

1 Introduction
1.1 Problem Specification
1.2 Research Question
1.3 Limitations

2 Background
2.1 AI In Education
2.1.1 Intelligent Tutoring Systems
2.2 Machine Translation
2.2.1 Sequence to sequence model
2.2.2 Recurrent Encoder-Decoder
2.2.3 Transformer
2.2.4 Sub-word NMT
2.2.5 Word Embeddings
2.2.6 Transfer Learning
2.2.7 Evaluation
2.3 Related Work
2.3.1 Intelligent Tutoring Systems
2.3.2 Java question answering
2.3.3 Conversational Modeling & Dialogue Generation

3 Method
3.1 Dataset
3.1.1 Data collection
3.1.2 Dataset statistics
3.2 OpenNMT
3.3 Preprocessing
3.4 Model
3.5.1 Training setup
3.5.2 Optimizer
3.5.3 Regularization
3.5.4 Pre-trained subword embeddings
3.6 Evaluation
3.6.1 Automatic evaluation - Perplexity
3.6.2 Qualitative evaluation
3.7 Method limitations
3.7.1 Hyperparameter optimization
3.7.2 Data preparation

4 Results
4.1 Metric Evaluation
4.2 Qualitative Analysis
4.2.1 Evaluation on test dataset
4.2.2 Out-of-corpus questions

5 Discussion
5.1 Generated Response Analysis
5.2 Segmentation
5.3 Pre-trained Embeddings
5.4 Metric & qualitative evaluation
5.5 Implications for AI in education
5.6 Limitations and future research
5.7 Sources of Error
5.8 Sustainability, ethics and societal aspects

6 Conclusion

Bibliography

A Appendix
A.1 Abbreviations


1 Introduction

Learning programming and Computer Science is a difficult task. Programming is not learned by memorization of facts or concepts from course textbooks. It requires students to apply their knowledge to solve programming problems. This process develops the programming skills of the student, but more importantly, their ability to problem-solve. A student's ability to problem-solve is crucial to mastering programming as they continue in their education and are required to solve harder problems [1].

Teachers responsible for introductory programming relate to the struggles of their students in learning the fundamentals of programming [2]. A common factor is that students lack generic problem-solving skills, which are essential to learning programming [2]. As students progress through their introductory course, they fail to connect skills and techniques learned from past programming assignments to new ones. As a consequence, students develop incorrect solutions in the long term [2, 3].

Help and guidance from a teacher are essential to students' learning. Introductory programming courses are often comprised of a larger than usual body of students. Teachers and Teaching Assistants (TA) are limited in both quantity and the amount of time they can spend on helping students [4]. This is a problem throughout education. A large number of students makes it difficult to provide the necessary individual help during the limited amount of dedicated time.

Providing individual help, or individual one-on-one tutoring with a human tutor, has long been established as the most effective means of teaching and improving learning outcomes [5, 6]. As it is not realistic to provide each student with an individual tutor, other means of facilitating learning have been sought after. For this reason, technology has had an increasingly integral role in providing solutions that enhance learning. The hope is that technology can supply students and teachers with tools that can mimic, or supplement, individual tutoring, without the time and accessibility constraints of humans.

A variety of computer-assisted learning tools have been developed and introduced into teaching. A common example is that of an Intelligent Tutoring System (ITS). ITS are applications for education that simulate one-on-one personal tutoring. These systems tutor students through practice problems on which they receive feedback. An integral feature is tracking the student's development and knowledge. ITS have been shown to be beneficial to learning when combined with traditional teaching methods, and when used on their own [7]. The recent advancement of Artificial Intelligence (AI) has led to an emergence of research looking to incorporate AI into education. Various AI-supported tutoring approaches and tools have been developed to help students learn basic programming [1]. While different AI-supported ITS approaches to teaching programming exist, one such approach is dialogue-based tutoring. A dialogue-based ITS approach centers on the student engaging with an AI agent in a dialogue about specific programming concepts and problems [1].

The dialogue-based approach to AI-facilitated tutoring centers on the student being able to ask questions to initiate a dialogue, and the system's ability to respond. The responses provided by the agent progress the conversation through a tutoring session. Different configurations for dialogue-based approaches to basic programming exist, and there is not one generic structure for the tutoring session. Instead, the level of dialogue is adapted to the task and the concept of programming which is being taught [1].

The broader class of systems which answer questions are referred to as question answering (QA) systems. Such systems are related to the fields of neural machine translation, conversation modeling, and dialogue generation. The neural approach has become increasingly common due to the promising results shown by neural sequence-to-sequence models in predicting the next sentence given the previous sentence [4, 8]. For example, they have been shown to be able to find solutions to IT help-desk technical problems over the course of a few dialogue turns in a conversation [8].

Applied within AI-dialogue-based approaches, such question answering systems could help students to repeat and practice topics and concepts in their courses by asking questions for which they receive responses. This would provide a much-needed supplement to class-based tutoring, without the need for external help from a human.

1.1 Problem Specification

Question Answering (QA) can be approached in different ways, including QA systems utilizing Deep Learning. The approach taken in this thesis is to treat QA as a neural machine translation problem. This has been done successfully through the adaptation of the seq2seq framework in the technical domain. It is a generative approach, where an encoder-decoder model generates the answer to a question [8, 13].

The primary aim of this thesis is to investigate a question answering system using the Transformer for introductory Java concepts. In order to establish this, the thesis will employ both a quantitative evaluation using automatic metrics and a qualitative evaluation. Emphasis will be put on the qualitative assessment of the generated responses. This primarily involves assessing whether the generated response is an answer to the question posed, and thereby whether the system can help students who require help.

Evaluating systems that generate responses with automatic metrics has proven to be difficult. Common metrics in machine translation rely on word overlap and do not take into consideration the multiple possible answers to a question [14]. Perplexity is an automatic metric used to evaluate the certainty of a probabilistic model. It has been used to evaluate dialogue systems, such as chatbots, and question answering systems [15]. Perplexity is used to quantitatively evaluate the model(s) trained in this thesis.

Transfer learning (TL) with pre-trained word embeddings is a method currently receiving a lot of attention in NLP. Pre-trained embeddings have the potential to reduce the need for domain-specific data, and to improve the performance of models on the tasks on which they are being tested. Neural machine translation is known to require large quantities of data and it is, therefore, crucial to consider any means of alleviating any shortage of data [16]. As this thesis deals with a data-scarce scenario, it is of interest to employ pre-trained embeddings and observe whether it leads to an improvement in the quality of responses.

Such a system could also aid teachers by providing insight into what autonomous tools could be developed as a future means of helping students learn.

1.2 Research Question

The premise for the thesis is as established in the previous sections: how well does a Transformer model perform when applied to Question Answering for introductory Java programming concepts? This results in the research question:

To what extent can a Transformer question answering model trained on Stack Overflow data answer questions on introductory Java programming concepts?

The research question is centered on the extent to which a Transformer model can answer questions on introductory Java programming concepts. The intention is to evaluate whether the model provides an actual answer to the posed question. Additional aspects to analyze are whether the response is on the same topic as the question, related to it, or incoherent noise. The evaluation of the research question is based on the combination of the automatic metric score and the qualitative analysis. The question is intentionally broad so as to evaluate it in terms of whether such a system can be of help in learning and education, as opposed to only evaluating the deep learning aspects.

1.3 Limitations

A core limitation of the model will be that it cannot generate Java code. The intention is to investigate whether it can help students learn by asking questions about programming concepts. If the model were trained to produce code, it could be used as a means to extract code for programming assignments. This would in turn bypass the entire programming process meant to educate students. Instead, the thesis centers on investigating whether a more concept-focused knowledge can be attained, which students could use to review material, reflect, and facilitate their learning process.

A further limitation is the amount of domain-specific data available for training. As a consequence, the data could prove to be too scarce for training the network.

2 Background

The following chapter covers how AI is being used in education and the theoretical concepts of machine translation. It introduces and explains the terminology used within machine translation and the development of the neural models within the discipline. It also covers the concepts of Transfer Learning, Pre-trained Embeddings, and the Perplexity metric. The chapter is concluded with a section on studies related to the work in this thesis.

2.1 AI In Education

Applications in education that use artificial intelligence are on the rise and have begun to receive more attention. The long-standing promise of AI in education is to improve student learning outcomes, as well as to provide teachers with tools to accurately map their students' knowledge. While the study of AI in education dates back thirty years, along with the development of AI applications, it is only in recent times that the wide-scale adoption of AI in education is being considered [17, 18].

The AI software applications used in education can be defined according to three categories. The categories group the software according to whom the software is meant to aid, or whom it is "facing". The three categories are "learner-facing", "teacher-facing" and "system-facing". Learner-facing applications are software which students use to learn course material and which adapt to the individual student's level of knowledge. This category of software is commonly referred to as "Intelligent Tutoring Systems", or ITS. In general, ITS offer the following features [17, 18]:

• Adaptation of the learning material to the student’s level


• Monitoring and mapping of the student’s strengths, weaknesses and/or gaps in knowledge.

• Providing automated feedback

• Enable collaborative work between students

Teacher-facing AI applications are meant to support teachers in their work [17, 18]. A significant portion of teacher workload is administrative work, and the time spent on it detracts from the time spent on teaching. Teacher-facing applications offer the capability to relieve the teacher of administrative work. They contain features meant to automate tasks such as assessment, plagiarism detection, administration, and feedback [17, 18]. In addition to reducing administrative work, teacher-facing software is meant to provide insight into the progress of individual students, and of the class as a whole [17, 18]. Providing insights about their students could facilitate how teachers organize their teaching activities, by following up on students who fall behind, or by grouping students with a similar level of knowledge. An important characteristic of these applications is that they are not there to replace the teacher but to facilitate and enhance their work [17, 18].

System-facing applications are tools for those managing and administering schools [17, 18]. This category of AI applications in education currently has the fewest existing tools [17, 18]. Through sharing of data between schools, system-facing tools can be used for predicting which schools are in need of a school inspection. The scarcity of such tools can be attributed to the lack of shared data among schools and administrations [17, 18].

With the increased push to include AI applications in education, there is an increased emphasis on the ethical implications of AI [17, 18]. Monitoring and collecting student data must come with assurances of ethical use, as well as security. Furthermore, incorporating AI into the classrooms is not meant to make teachers redundant, but to improve and facilitate their work. As such, there’s a call for research within AI in education to include ethical implications of the tools being developed [17, 18].

2.1.1 Intelligent Tutoring Systems

ITS model a student's capabilities and provide personalized instruction based on the modeling [1, 7, 19]. An integral part of ITS centered on adaptive learning paths is to provide the student with supportive hints which help them through the exercise. There are also ITS whose core feature is to engage the student in a dialogue. The various features of ITS replicate the highly social and adaptive role of a human tutor [1, 7, 19].

Computer science education is the most common subject for developing and using ITS [19]. They have been used to detect and remedy knowledge misconceptions, provide diagnostic feedback and detection of mastery in programming, give automated guidance on programming, and hold natural language discussions on topics of programming. Different ITS for computer science education have been developed and used. There are ITS teaching programming in C, C++, Java, or Lisp, or concepts such as linked lists, software design, database design, SQL queries, and others [19].

Intelligent tutoring systems lack a generic architecture with which to describe them. Despite this, ITS can be defined in terms of the capabilities they provide. A precise definition of an ITS is software that interacts with a student for the purpose of learning and has the following features [19]:

• Performs some form of tutoring, such as asking and/or answering questions, assigning tasks, providing hints and feedback

• The student's input is used to model the cognitive, motivational, and affective states in a multidimensional space

• The modeling function is used to adapt the tutoring function(s)

Increasing demands have been placed on the adaptability of ITS, and some research excludes ITS which adapt content only through pre-scripted testing and branching based on student responses [19].

2.2 Machine Translation

The following sections cover the theoretical background of machine translation. This includes primarily the neural approaches based on the sequence to sequence framework.

2.2.1 Sequence to sequence model

Neural conversation models predict the next sentence given the previous sentence(s) [20]. These models implement the encoder-decoder architecture, also referred to as the sequence-to-sequence (seq2seq) framework [8, 13]. Sequence-to-sequence (seq2seq) modeling is the general class of models that convert one sequential input into another and is effective at mapping sequences and generating sequential data [20, 21].

The framework first demonstrated its effectiveness in the domain of Machine Translation (MT) [8, 13, 21]. Machine translation is the task where one human language is translated into another [20]. The translation process between one language and another can be described as the conversion of a sequence of words in the source language to a sequence of words in the target language [20]. The sentence input into a machine translation system is the source sequence, whereas the translated output is the target sequence [20].

It was applied to Question Answering (QA) by framing it as a Machine Translation problem [13, 8]. The adaptation entails that the question is a sequence of words that is mapped to a sequence of words in the answer [13]. Applying the encoder-decoder to question answering showed that the model could produce short, yet coherent responses to questions in specific and general domains [13, 8].

Neural Machine Translation (NMT) is the set of models which use neural networks for MT [20]. In NMT, a probabilistic model is trained to maximize the probability of a target sentence $Y = \{y_1, \ldots, y_n\}$ given the input source sentence $X = \{x_1, \ldots, x_m\}$, where $n$ and $m$ are the respective variable lengths. The translation can be expressed as the conditional probability $p_\theta(Y \mid X)$, where $\theta$ is the set of parameters trained to minimize the negative log-likelihood [20]:

$$ -\sum_{i=1}^{n} \log p_\theta(y_i \mid y_1, \ldots, y_{i-1}, X) \qquad (2.1) $$

Each target word $y_i$ is thus conditioned on the previously generated target words and on the entire source sentence. The parameters (weights) $\theta$ are trained on data formatted as aligned source and target sentences, referred to as parallel corpora [20].

When applied to question answering the question is treated as the source sentence, which is mapped to an answer. The parallel corpora are aligned in such a way that one question is written on one line in the source corpus, and the answer is written on the same line in the target corpus [8, 13].

2.2.2 Recurrent Encoder-Decoder

Recurrent neural networks are used in tasks dealing with sequential data where one sequence is mapped to another [13, 21]. They are proficient at mapping long-distance relationships, which makes them effective at NMT because it improves the translation quality for long sentences [8, 13, 21]. This characteristic is helpful as positions of words are not a one-to-one mapping between translations. Words at the beginning of a sentence may have more or less of an impact on the words in the same position in the translation. Recurrent neural networks enable effective mappings between words as the information is encoded for the entire sequence [8, 13, 21].

The Recurrent Encoder-Decoder is a model comprised of two recurrent neural networks. One network acts as an Encoder and the other as a Decoder [21]. The encoder takes a variable-length sequence and encodes the information in the sequence into a fixed-length vector, called the context. The decoder decodes the context along with the target sequence back into a variable-length sequence, being the translation of the source sequence.

The encoder processes each symbol in the sequence $X = \{x_1, \ldots, x_m\}$ sequentially. For each symbol $x_t$ at time step $t$, the information is encoded into a hidden state $h_t$. Formulated as an expression [21]:

$$ h_t = f(h_{t-1}, x_t) \qquad (2.2) $$

The hidden state is updated by passing the current symbol along with the previous hidden state into the activation function $f$. The inclusion of the previous hidden state encodes the information from the previous symbols along with the information of the current symbol [21, 13, 8].

The decoder functions in the same manner as the encoder. It generates an output of variable length $Y = \{y_1, \ldots, y_n\}$ by predicting the next target symbol $y_t$ from the previous symbol $y_{t-1}$, the hidden state $s_{t-1}$, and the encoded context $c$ [21]:

$$ s_t = f(c, s_{t-1}, y_{t-1}) \qquad (2.3) $$

The final step of the decoder is to generate valid probabilities for each symbol in the sequence [21]:

$$ Y_t = g(c, s_t, y_{t-1}) \qquad (2.4) $$

where generation ends when $y_t$ is the token marking the end of the sequence. The activation function $g$ is commonly the softmax function, applied to every symbol $j = 1, \ldots, K$ and producing values between 0 and 1 [20, 21]:

$$ \frac{\exp(w_j h_t)}{\sum_{j'=1}^{K} \exp(w_{j'} h_t)} \qquad (2.5) $$

where $w_j$ are the rows of the model's weight matrix [21]. Completing the training of the RNN-based Encoder-Decoder model maximizes the conditional log-likelihood of the target sequence conditioned on the source sequence [21].
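To make the recurrent encoder-decoder concrete, the following is a minimal PyTorch sketch of the two networks described above. It is an illustrative example only, not the model used in this thesis; the embedding and hidden dimensions, the use of a GRU cell, and the class names are assumptions made for the sketch.

    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

        def forward(self, src):
            # src: (batch, src_len) token ids; the final hidden state is the context c
            _, context = self.rnn(self.embed(src))
            return context

    class Decoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, prev_token, state):
            # prev_token: (batch, 1) previously generated symbol y_{t-1};
            # state: previous decoder hidden state s_{t-1}, initialised with the context
            output, state = self.rnn(self.embed(prev_token), state)
            logits = self.out(output)   # a softmax over logits corresponds to eq. (2.5)
            return logits, state

Decoding proceeds one symbol at a time, feeding each predicted token and the updated state back into the decoder until the end-of-sequence token is produced.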

2.2.3 Transformer

An issue with the encoder-decoder model is that the encoder has to encode all of the symbols in the source sequence into a single, fixed-length, context vector. This entails that the model has to encode all information for very long sequences into a single vector. A consequence of this is that the encoder-decoder model starts to perform worse the longer the input sequence is [22]. This is because the longer the sequence, the more information is lost by the end of the encoding [9, 22].

A solution that enables the effective mapping of long-term dependencies is an attention mechanism. An attention mechanism improves the translation quality on longer sequences. The intuition is that it helps the model to focus on and encode the relevant parts of the sequence. This is in contrast to treating all the information as equally important for all parts of the translation [9, 22].

In addition to the limited mapping of long-term dependencies in sentences, the sequential processing of the preceding hidden state $h_{t-1}$ in recurrent encoder-decoder models results in a processing bottleneck. As a consequence, RNN-based NMT models require long training times and are limited by memory [9].

The Transformer is an architecture that does away with recurrent models and relies entirely on attention and feed-forward networks. Its attention mechanism, Multi-Head Attention, is highly parallelizable, leading to improved utilization of the GPU, which lowers the training time and makes it possible to map longer sequences [9].

The full architecture of the Transformer is depicted in figure 2.1.

Figure 2.1: The Transformer architecture [9]

The attention mechanism used in the Transformer is Scaled Dot-Product Attention, calculated as:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \qquad (2.6) $$

Attention is calculated on the (Q)uery, (K)ey, and (V)alue matrices. It differs from dot-product attention in that the term $\sqrt{d_k}$ is used to scale the dot-product, where $d_k$ is the dimension of the Key [9]. Which matrices serve as Q, K, and V depends on where the attention is applied. In self-attention, the Query, Key, and Value matrices all come from the output of the preceding layer. In "Encoder-Decoder Attention", the query matrix is the output of the previous decoder layer, while the key and value matrices are comprised of the output from the encoder [9, 23].

Replacing the RNN-based components are stacked layers of attention mechanisms connected to feed-forward networks. The attention mechanism developed is called Multi-Head Attention. It calculates the Scaled Dot-Product Attention in parallel on several linear projections of the Query, Key, and Value inputs. Each result is concatenated and multiplied by an additional weight matrix, forming the final result of the Multi-Head Attention component [9, 23].
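As a concrete illustration of equation (2.6), the following is a minimal NumPy sketch of Scaled Dot-Product Attention; it is an illustrative example, not the OpenNMT implementation used later in the thesis, and the dimensions are arbitrary. Multi-Head Attention applies this same computation in parallel on several linear projections of Q, K, and V.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)              # QK^T / sqrt(d_k) from eq. (2.6)
        weights = softmax(scores, axis=-1)           # one attention distribution per query
        return weights @ V                           # weighted sum of the value vectors

    # Self-attention: queries, keys and values all come from the same layer output
    x = np.random.randn(5, 64)                       # 5 positions, 64-dimensional representations
    out = scaled_dot_product_attention(x, x, x)      # out has shape (5, 64)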

The encoder is composed of a stack of N identical layers. Each layer is composed of two sub-layers. The first sub-layer is the Multi-Head Attention mechanism, here in the form of "self-attention", connected to the second sub-layer, a feed-forward network [9, 23]. In addition to this, residual connections are included which bypass the Multi-Head Attention and feed-forward layers. Both the Multi-Head Attention and the feed-forward layer have their outputs added to the residual connection, followed by normalization [9].

The Decoder is also composed of a stack of N identical layers. The first sub-layer is a modified “self-attention“ which masks future positions. This is referred to as Masked Multi-Head Attention. It prevents the decoder from “peeking“ into the future during training and copying the right target word. The second sub-layer is a Multi-Head Attention layer, here in the form of “Encoder-Decoder Attention“ [9, 23]. The last sub-layer is a feed-forward layer. Residual connections and normalization are applied around each sub-layer in the same manner as in the encoder. The final output from the decoder stack is fed to a linear layer and then converted to probabilities by applying the softmax function [9].

Inputs to the Transformer are transformed into real-value vectors through an embedding algorithm. Due to the lack of recurrence, and thereby sequential processing, the Transformer cannot track the positions of the symbols. Therefore, the embeddings are positionally encoded using sine and cosine positional encoding [9].

2.2.4 Sub-word NMT

A common problem in NMT is dealing with large vocabularies. As NMT models create probability distributions over words, large vocabularies hamper the performance of the model. In addition to this, rare words can be difficult to translate, and/or may not be translated at all [24].

A solution that reduces the size of vocabularies is to break up the words into segments of varying granularity [24]. In more general terms, NMT processes symbols in sequences. These symbols can be words, sub-words, and characters. The common approach for NMT is to transform words into sub-words [24]. Sub-words are words broken up into segments. For example, "deforestation" could be split into "de@@", "forest@@", and "ation". The @@ symbol signifies that a word isn't finished and should be merged [24]. Sub-words reduce the size of the vocabulary, as more words can be represented by the merging of sub-words, rather than mapping each word. Furthermore, splitting words into sub-words enables a model to generate out-of-vocabulary words. This improves the performance of the model and helps to better model the probability distribution for (sub)words in the vocabulary [24].

Byte-pair encoding (BPE) is a simple data compression technique adapted to transform words into sub-words [24]. The algorithm first decomposes each word into a sequence of characters, with an additional end-of-word symbol. It counts all symbol pairs and replaces the most common pair with a new symbol [24]. Through each merge operation, a new character n-gram is produced. Throughout the iterations, frequent n-grams, including whole words, can eventually be merged into a single symbol [24].
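A minimal sketch of the merge loop described above, adapted from the algorithm in [24]; it is a simplified illustration, not the OpenNMT tokenizer used later in the thesis, and the toy word frequencies are invented.

    from collections import Counter

    def learn_bpe_merges(word_freqs, num_merges):
        # word_freqs: word -> frequency; each word starts as a sequence of characters
        vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols, freq in vocab.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += freq            # count adjacent symbol pairs
            if not pairs:
                break
            best = max(pairs, key=pairs.get)          # most frequent pair is merged next
            merges.append(best)
            new_vocab = {}
            for symbols, freq in vocab.items():       # replace the pair with a single new symbol
                merged, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        merged.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                new_vocab[tuple(merged)] = freq
            vocab = new_vocab
        return merges

    print(learn_bpe_merges({"lower": 5, "low": 7, "newest": 3, "widest": 2}, num_merges=10))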

2.2.5 Word Embeddings

Deep Learning models in NLP process text as real number vectors. The vectors are mapped to a continuous vector space through an embedding algorithm [25]. These vectors are referred to as word embeddings. Each number in the vector represents a different aspect of the word. For example, some elements in the vector may express that the word represents an animal, another its gender, whether it's plural or singular, etc. As such, word embeddings capture dimensions of meaning: syntactic and semantic similarities between words [26, 27]. A common example that demonstrates how embeddings capture the meaning of words:

emb(king) − emb(man) + emb(woman) = emb(queen)

Subtracting the vector for "man" from the vector for "king" and adding the vector for "woman" results in a vector that is closest to the vector for "queen". The example above is a demonstration of how a word embedding model captures word similarity [26].
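The analogy can be checked directly against a set of pre-trained vectors. The sketch below assumes emb is a dictionary from words to vectors (for example loaded from a GloVe file); the function names are hypothetical.

    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def nearest_word(query_vec, emb, exclude=()):
        # return the word whose embedding is most similar to query_vec
        best, best_sim = None, -1.0
        for word, vec in emb.items():
            if word in exclude:
                continue
            sim = cosine(query_vec, vec)
            if sim > best_sim:
                best, best_sim = word, sim
        return best

    # nearest_word(emb["king"] - emb["man"] + emb["woman"], emb,
    #              exclude=("king", "man", "woman"))   # expected: "queen"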

There are various approaches to creating word embeddings, one being the GloVe model. GloVe captures global dimensions of meaning. It utilizes global word-word co-occurrence combined with a neural network trained to generate a word given its surrounding words [27, 28]. The intuitive explanation for the model is that it builds on the benefits of two approaches: count-based and prediction-based. Count-based methods capture global statistics, and prediction-based methods capture syntactic and semantic similarities between words. This combination captures the probability of a word based on its global context, the word-word co-occurrence for the entire corpus, as opposed to in one sentence [27, 28].

The GloVe model's cost function is defined as:

$$ J = \sum_{ij} f(X_{ij}) \left( v_i^{T} u_j - \log X_{ij} \right)^2 \qquad (2.7) $$

where the weighting function $f$ is:

$$ f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} \qquad (2.8) $$

$X$ denotes the word-word co-occurrence matrix computed at the start, wherein $X_{ij}$ refers to the $i,j$ element, which is the number of times word $j$ appears in the context of word $i$. The weighting function prevents the most frequent words from overweighting the loss function, where the maximum occurrence frequency $x_{max}$ is set to 100, and $\alpha$ is a constant set to $\alpha = 3/4$. The center word vectors are denoted by $v$, and $u$ are the word vectors for the context [27, 28].

2.2.6 Transfer Learning

Transfer learning has been applied to several tasks in NLP, as the method has shown to be highly effective at improving results [29, 30].

Pre-trained word embeddings are a form of transfer learning which has received a lot of attention in NLP [16]. First, a language model is trained on a labeled dataset. The model's embeddings can then be used to initialize the embeddings for words in the vocabulary of another model. In effect, the number of parameters that need to be trained from scratch is reduced, and the learned representations of words are leveraged. Despite their effectiveness in multiple NLP tasks, pre-trained embeddings remain uncommon in traditional NMT scenarios [16]. The availability of bilingual corpora several orders of magnitude larger than the annotated data for other tasks has meant that pre-trained embeddings only yield promising results in specific NMT scenarios [16].

There is, however, research that presents practical guidelines for when pre-trained word-embeddings can be of use in NMT: In cases where little training data is available, but not to the degree that the system cannot be trained [29]. In addition to this, pre-trained embeddings perform well on similar translation pairs. In regards to QA, pre-trained word embeddings have been successfully incorporated with Machine Comprehension models for context-based question answering [29].

2.2.7 Evaluation

2.2.7.1 Metric Evaluation

Several automatic metrics exist for evaluating NMT models [14]. Standard practice for evaluating machine translation is to use automatic metrics which rely upon word overlap. The most common one for translation is the Bilingual Evaluation Understudy, or BLEU [14, 31]. BLEU is a metric that relies on n-gram overlaps, and while it remains the predominant metric in translation tasks, it has been criticized for not correlating with human judgment. Additional automatic metrics have been proposed, while stressing the need to include human evaluators, despite them being expensive [14, 31].

Metrics that rely on statistical word-overlap similarity have been shown not to correlate with human judgment in regards to the selection of valid responses [14].

Perplexity is a performance metric used to evaluate probabilistic models and has been used to evaluate the performance of dialogue response generation models [14, 15]. An intuitive definition of perplexity is as a measurement of uncertainty. It can be expressed as "how confused is the model about its decision?" [20]. In the context of language modeling, it expresses a value that represents how many words, on average, would have to be picked from a probability distribution to pick the correct one [20]. For example, a perplexity value of 5 entails that the model was, on average, choosing between 5 words.

In more precise terms, perplexity measures how well a probabilistic model predicts a sample from a dataset [20]. Generally, a lower perplexity is indicative of a better model [15]. Perplexity has been used to measure the performance of dialogue generation systems, such as chatbots, and question answering, precisely because of the possibility of multiple correct answers [15]. This is because it measures the certainty of a response [15]. Furthermore, it also features in evaluations of NMT models, as low perplexity values have been shown to have a strong correlation with high-quality translations in machine translation [32].

Perplexity is defined as the exponent of the average negative log-likelihood per symbol [20]:

$$ \mathrm{ppl} = \exp\!\left( \frac{ -\sum_{i=1}^{|Y|} \log P(y_i \mid y_{i-1}, \ldots, y_1, X) }{ |Y| } \right) \qquad (2.9) $$

where $X$ is the source sequence, $Y$ is the true target sequence, and $y_i$ is the i-th target token [20, 15].
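Equation (2.9) can be computed directly from the log-probabilities the model assigns to the tokens of the reference answer. The sketch below uses invented probabilities purely to illustrate the calculation.

    import math

    def perplexity(token_log_probs):
        # token_log_probs: log P(y_i | y_{i-1}, ..., y_1, X) for each target token
        avg_nll = -sum(token_log_probs) / len(token_log_probs)
        return math.exp(avg_nll)

    # A 4-token reference answer with made-up model probabilities
    log_probs = [math.log(p) for p in (0.25, 0.10, 0.50, 0.20)]
    print(perplexity(log_probs))   # ~4.5: the model is, on average, choosing among 4-5 tokens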

2.2.7.2 Human Evaluation


Another popular manner in which to evaluate chatbots is a user satisfaction score. In such an evaluation, participants rate the satisfaction of the interaction, typically by rating on a Likert scale. In addition to this, studies have also asked participants to rate specific aspects, such as appropriateness and naturalness [31]. Current research is encouraged to include human evaluations along with automatic metrics [14].

2.3 Related Work

2.3.1 Intelligent Tutoring Systems

Intelligent Tutoring Systems for teaching programming concepts have been developed and used in individual university courses [7]. These ITS are also referred to as "Intelligent Programming Tutors", or IPT [7]. IPTs vary in the degree to which they teach programming. There are systems that only teach a few concepts, such as conditionals and loops, or specific programming techniques such as recursion, and others that offer a more complete suite of learning functions [7].

The "Collaborative, constructive, Inquiry-based Multimedia E-learning" (CIMEL) ITS is an IPT for teaching object-oriented design [33]. CIMEL ITS teaches students programming concepts through exercises for which students design UML diagrams in an Eclipse IDE environment. The designs are reviewed by the system, which sends feedback back to the student based on the continuous modeling of the student's abilities. In addition to the exercises, the system provides reference and lesson materials on programming concepts. An integral part of CIMEL ITS is that the system adapts the exercises and quizzes based on the student's knowledge [33].

CIMEL ITS is an example of an IPT which offers a rich suite of functions for teaching programming concepts meant to transition into actual programming skills. However, research has yet to show whether CIMEL ITS improves actual programming skills [7, 33]. It has been shown to help understand concepts, but it remains to be determined whether it can be used as a preamble to teaching Java, specifically whether it improves a student's ability to write Java programs [7, 33].


tests [34].

While Lisp-tutor has been shown to improve actual programming skills, there are limitations to its features [7]. One key aspect, pointed out by the researchers, is the limited dialogue of the tutor [7, 34]. There are very few adaptive dialogue options, and the actual dialogue is rather restrictive. It is stressed that improving the dialogue, and thereby the interaction, would improve the instruction of Lisp-tutor, and bring it closer to simulating human tutoring [34].

Other successful ITSs have been implemented and shown to improve results when used in computer science education. Examples of successful ITS are ILMDA, SHERLOCK, PACT, ANDES, among others [7]. However, of primary interest to this thesis are those ITS which adopt a dialogue-centric approach. The ITS covered up to this point mainly provide various means of exercise feedback, and adaptive navigation through learning material [7].

ProPL is a dialogue-based ITS built for teaching problem-solving skills for programming. Students engage with ProPL to write pseudocode solutions to problems using natural language. A key aspect of the system is that the tutor initiates different types of questions. ProPL has four types of questions to initiate dialogue with [1, 35]:

• Identifying a programming goal
• Describing a schema for attaining this goal
• Suggesting pseudocode steps that achieve the goal
• Placing the steps within the pseudocode

Students then engage with the system in a conversation regarding the errors and misconceptions that the student may have committed [1, 35]. The system's dialogues are less restrictive than, for example, Lisp-tutor, in that it can engage in sub-dialogues if the student's answers are unclear, misunderstood, or could be improved upon for an ideal answer [1, 35]. Highlighting the effectiveness of the dialogue approach taken by ProPL, the system has been shown to improve students' problem-solving skills on composition problems, as compared to students relying only on reading material [1, 35].

A study investigating a virtual tutor for introductory Java programming built on IBM Watson found that the tutor was generally able to give the correct answer to the question posed by the student(s) [36]. However, the answers confused some participants, as too much detail was provided in the answer. Another limitation shown was that the tutor lacked adaptability. The Watson tutor failed at detecting intent and nuances in the questions for classes in Java, and thereby responded with insufficient detail [36].

The study concludes that using IBM Watson for the development of a virtual tutor for introductory Java programming is possible [36]. Limitations of the Watson lite platform are brought up as challenges for practical use in courses. Of primary concern is that all answers need to be provided in a detailed manner when building the prototype, which proved to be very time-consuming [36]. A further concern is the lack of adaptability of the answers to a particular student's needs. A key conclusion is that IBM Watson could be used as a complementary tool, but does not exhibit the skill and adaptability of a human teacher [36].

2.3.2 Java question answering

A pertinent study has been carried out in the domain of NMT-based QA on Java questions. The study used data from the community-driven online forum Stack Overflow to develop and train a tutorial QA system for introductory Java programming [4]. The tutorial system was meant to answer general questions about Java in the context of a potential tutorial system used in a programming course. Three different approaches were investigated: a retrieval-based system, a generative system using the seq2seq framework, and a hybrid model [4]. It is evident that this study is of high relevance to the approach and goal of this thesis. For these reasons, additional details on this study are provided.

A data dump from Stack Overflow was used to extract question-and-answer pairs from posts. The posts were filtered to only include posts on Java. In addition to this, questions that included code in their description were omitted. As a last pre-processing step, the question was split into its description and title. In total, the study gathered 107961 question-description-answer triplets [4].


[4].

The seq2seq framework was implemented according to the recurrent encoder-decoder model with attention [4, 37]. The model was trained on description-answer pairs. The rationale for using description-to-answer was that, due to their similar lengths, the task could be formulated as a Machine Translation problem. Thereby, it could capture and map the sequences to one another [4].

Lastly, the hybrid model combines the retrieval model with the generative model. In this model, the retrieval model encodes the question's title and description; the output from the retrieval model is used to fetch the top 10 most probable sentences. These 10 sentences are embedded and then processed by an encoder-decoder model to produce the final output [4].

The primary conclusion of interest from the study is that the generative model was able to produce coherent responses [4]. The model was able to generate code as part of the answers to questions relating to specific programming questions, such as how to initialize an array. The answers produced were short and did not necessarily answer the question. Nevertheless, it does indicate a potential to further investigate generative models for QA systems on introductory programming [4].

2.3.3 Conversational Modeling & Dialogue Generation

The sequence-to-sequence framework has shown promising results in the fields of conversational modeling and dialogue generation [8]. The approaches taken in these fields are similar to those taken in this thesis. They demonstrate that a machine translation approach can be used for different text generation tasks. Of key interest are the limitations shown when generating responses, and the characteristics of those responses.

In a prominent study on conversational modeling, the recurrent encoder-decoder model was trained on two types of datasets [8]: a closed-domain IT-helpdesk troubleshooting dataset, and an open-domain movie transcript dataset. Two models were trained, one on each dataset. Upon completion of training, a conversation was held with each model. Results from the conversations show that the models were able to identify context, remember facts, and generalize to new questions [8]. A promising result was that the model was able to resolve IT issues, such as connection issues. A drawback was that the responses were simple, short, and at times unfulfilling, and did not progress the conversation forward [8].

A related study on context-sensitive response generation incorporated the context with the message when training the model [38]. That is, the message and the context were concatenated into one vector during training. Overall, the context-aware model produced responses that were generic, commonplace, and did not necessarily drive the conversation forward [38]. Evaluation of the responses is made further difficult as there are many plausible responses [38].

A recurring phenomenon when using sequence-to-sequence models for dialogue generation is that the responses are short and generic. For open-domain, social conversation models, common responses are: "I don't know", "I'm ok", "I'm fine". In addition to the generic responses, there are nonsensical and repeating responses, for example "no. no. no. no". That is, the responses lack diversity and do not provide a meaningful contribution to the conversation. Studies have attributed this to the seq2seq models optimizing for the likelihood of outputs given inputs, resulting in a high probability being assigned to "safe" responses which do not drive the conversation forward [39, 40].

Various attempts have been made to resolve the generation of generic responses. A suggested approach was to use a different loss function. Instead of maximum likelihood, maximum mutual information was investigated as a loss function [39]. The results show that the change in loss function resulted in the model generating more diverse, interesting, and appropriate responses. This result was reflected in both the automatic metric and the human evaluation [39].

3 Method

This chapter describes the steps taken to construct and prepare the dataset to train the model. It motivates steps taken to train the model in a data-scarce scenario with the use of pre-trained embeddings and different granularities of subword segmentation. The last section(s) cover how the model was evaluated using perplexity and a qualitative evaluation, along with the limitations of the method.

3.1 Dataset

The text data for this thesis was collected from the online forum Stack Overflow (SO). Stack Overflow is an online community for programmers of any level to post questions related to programming, as well as to discuss topics in programming. A post on the forum is open to the community to view and answer. Posts are self-moderated through a peer upvoting system, where the answers to a post can be voted on by the community to determine their quality. Various answers may be proposed and upvoted, but it is the author of the post who can accept an answer as the "accepted answer". An accepted answer indicates that the post, or question, has been resolved.

An example of a question on Stack Overflow can be seen in figure 3.1. The question is comprised of a question title, "When to use LinkedList over ArrayList in Java", and below it the description detailing the question. Below the question post is an answer. As indicated by the green tick beside the answer, the answer has been accepted. The number of votes is displayed above the green tick; in this case, the accepted answer has the highest number of upvotes, but this is not a requirement. Furthermore, a list of meta-tags is displayed below the question: "java", "collections", etc. The tags serve to categorize the topics of the post on the forum.

Figure 3.1: A question on Stack Overflow (https://stackoverflow.com/questions/322715/when-to-use-linkedlist-over-arraylist-in-java/24607151)

Accessing the Stack Overflow data can be done in various ways. Anonymized data dumps for Stack Exchange's community-driven forums, of which Stack Overflow is a part, are available as XML databases. In addition to the XML databases, Stack Exchange forum data can be accessed through the Stack Exchange API.

The dataset constructed consists of two aligned, parallel corpora. One corpus contains the question description, and the other contains the accepted answer. These corpora are aligned such that a question's description is written on one line in the question (source) corpus, and the accepted answer is written on the same line in the answer (target) corpus. The rationale for structuring the corpora this way is that a question's description and its answer are of similar length, which aids in framing the problem as a machine translation problem, as shown in previous technical-domain NMT [13, 8, 4].

The parallel corpora can be contrasted with the corpora used in [4]. A similar "question description" to "answer" corpus was created with data from Stack Overflow [4]. It differs in that no code is included in any of the corpora for this thesis, whereas code was allowed in the answers corpus in [4]. The focus in that study was on a tutorial-style QA system that would be able to answer with code [4]. This is in contrast to this thesis, which focuses on programming concepts (as stated in section 1.3).

3.1.1 Data collection

The API was used to gather questions and answers, or posts, from the Stack Overflow forum. This was done by calling the two API endpoints /questions and /answers/{ids}. All questions posted on Stack Exchange forums can be accessed through the /questions endpoint. The /answers/{ids} endpoint returns all the answers for a list of answer ids. In general, the strategy was to gather the descriptions of questions tagged "java" over a long interval, and for these questions, extract the accepted answers.

Questions tagged "Java" from the 1st of January 2014 to the 13th of February 2020 were collected. Additional criteria were defined programmatically to filter out questions from the API response. Questions that did not have an accepted answer were discarded. The "tagged" parameter on the questions endpoint includes all questions from Stack Overflow which have the "Java" tag, regardless of any additional tags. In an attempt to keep the questions related to Java programming, a list of excluded tags was defined. If a given question's tags contained any of the excluded tags in addition to the "Java" tag, it was discarded.

The list of excluded tags is quite extensive, but does not guarantee that all questions unrelated to fundamental Java programming are discarded. Examples of excluded tags are "c++", "clojure", "spring", amongst many more. The main principle of the excluded tags is to exclude questions dealing with other programming languages, specific frameworks, IDEA, and operating system issues, while retaining questions tagged with "Java" and e.g. "Data-structures" and/or "Algorithms".


From the collected questions, the question id and description, along with the id of the accepted answer to the question, were extracted. The description of the questions is formatted with HTML tags. Specifically, the description is enclosed in paragraph tags, <p>, and pre-formatted text tags, <pre>. The pre-formatted tags are used to format code segments. Only text enclosed in the <p> tag was extracted. Ignoring the pre-formatted segments is meant to exclude any code from the description, but is not a guarantee.

For each question, the corresponding (accepted) answer was collected by calling the aforementioned answers endpoint. In the same manner as for the question's description, the text was extracted from the paragraph tags of the answer, excluding any pre-formatted code segments. As a final step, all the text gathered was formatted into tuples of question-answer (where the question is the question's description).
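The collection step can be sketched with the requests library as below. The endpoint names and parameters follow the public Stack Exchange API, but the excluded-tag list is only an illustrative subset, and paging, rate limiting, and API keys are omitted; this is not the exact script used in the thesis.

    import requests

    API = "https://api.stackexchange.com/2.2"
    EXCLUDED_TAGS = {"c++", "clojure", "spring"}       # illustrative subset of the excluded tags

    def fetch_java_questions(fromdate, todate, page=1):
        # 'filter=withbody' includes the HTML body of each question
        params = {"site": "stackoverflow", "tagged": "java", "fromdate": fromdate,
                  "todate": todate, "page": page, "pagesize": 100, "filter": "withbody"}
        return requests.get(f"{API}/questions", params=params).json()["items"]

    def accepted_pairs(questions):
        pairs = []
        for q in questions:
            if "accepted_answer_id" not in q:           # discard unresolved questions
                continue
            if EXCLUDED_TAGS & set(q["tags"]):          # discard off-topic tag combinations
                continue
            answer = requests.get(f"{API}/answers/{q['accepted_answer_id']}",
                                  params={"site": "stackoverflow", "filter": "withbody"})
            pairs.append((q["body"], answer.json()["items"][0]["body"]))
        return pairs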

3.1.2 Dataset statistics

Before splitting the answers and questions into aligned parallel corpora, additional filtering and cleaning steps were taken. Using regular expressions, white space was normalized, and newline breaks, URLs, non-ASCII characters, and Java-style comments were removed. Code segments had been inadvertently included, as some were not properly enclosed in the pre-formatted tag. As a consequence, there are questions and answers which still contain code segments.

The collected dataset comprised 242378 question-answer pairs. Using the open-source NLTK tokenizer to tokenize the words in each question and answer, the average lengths were calculated to be ≈104.33 and ≈89.74 tokens for the questions and answers, respectively. A summary of the statistics is available in table 3.1. Plotting a frequency distribution over the tokenized lengths for the questions and answers, illustrated in figures 3.2 and 3.3, it can be observed that there are large outliers present in both corpora. Not seen in the graphs, the longest question and answer contain 3585 and 2475 word-level tokens, respectively.
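A sketch of the cleaning and statistics steps is given below; the regular expressions are illustrative approximations of the filters described, not the exact patterns used in the thesis.

    import re
    from nltk.tokenize import word_tokenize   # requires the NLTK 'punkt' models

    def clean(text):
        text = re.sub(r"https?://\S+", " ", text)                       # remove URLs
        text = re.sub(r"/\*.*?\*/|//[^\n]*", " ", text, flags=re.S)     # remove Java-style comments
        text = re.sub(r"[^\x00-\x7F]", " ", text)                       # remove non-ASCII characters
        return re.sub(r"\s+", " ", text).strip()                        # normalize whitespace

    def average_length(corpus):
        lengths = [len(word_tokenize(clean(t))) for t in corpus]
        return sum(lengths) / len(lengths)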

                              Question    Answer
Average length (tokens)       104.33      89.37
Longest sequence (tokens)     3585        2475
Total dataset size (pairs)    242378

Table 3.1: Raw dataset statistics

In a study done on Java programming QA with SO data, questions and answers which were too long were gradually discarded until a relatively stable average length was attained [4]. The average lengths arrived at were 71.45 and 87.54 words for questions and answers, respectively [4]. A max length of 125 for questions and 175 for answers was set for their recurrent encoder-decoder model [4]. In contrast, a study on optimizing hyperparameters for NMT using the Transformer on an aggregated English-to-Czech dataset [41] showed that, if combined with the right batch size, a max sentence length of ≥150, and up to 400, resulted in better translation quality, as measured with BLEU, than using smaller lengths [41]. The conclusion was to set a reasonably high sentence length, if the hardware allows it, such that too much of the source and target corpora is not discarded [41].

There are factors to consider when setting the maximum (and minimum) sentence length in NMT. Setting too small a sentence length causes the model to be biased towards generating shorter responses, which could lead to poorer responses. With too long an input, it can become difficult to map the sequences, and there are also memory limits to consider when training the model [4, 41].

Basing the maximum input length on a study on optimizing parameters for NMT using the Transformer, a length of 250 was selected as the maximum threshold [41]. A max length of 250 covers ≈94.56% of the answers and ≈95.65% of the questions. A minimum threshold of 8 tokens was also selected. This was based on inspecting the corpora manually, after noticing a high percentage of short answers in figure 3.3. The answer corpus contained several answers which were only one short sentence, such as "try this:" or "take a look at my code:". Such short answers were a consequence of removing pre-formatted code segments. If either the question or the answer was outside the threshold range, the entire pair was discarded. This was done to retain the alignment between questions and answers in the corpora.
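The threshold filtering can be expressed as a single pass over the aligned pairs; a minimal sketch (token counts assumed to come from the same tokenizer as above):

    MIN_LEN, MAX_LEN = 8, 250

    def filter_pairs(pairs, length_fn):
        # pairs: list of (question, answer); length_fn returns the token count of a text
        kept = []
        for question, answer in pairs:
            q_len, a_len = length_fn(question), length_fn(answer)
            if MIN_LEN <= q_len <= MAX_LEN and MIN_LEN <= a_len <= MAX_LEN:
                kept.append((question, answer))   # keep a pair only if both sides are in range
        return kept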

Filtering out question-answer pairs which do not fall in the range resulted in a final dataset with 212202 question-answer pairs, a total reduction of ≈12.5% from the original dataset. The resulting average lengths are ≈93.85 and ≈75.72, for questions and answers, respectively. The re-computed statistics, as well as the percentage covered of the raw corpora, are tabulated in table 3.2.

3.2 OpenNMT


Figure 3.2: Question length frequency distribution, raw dataset

OpenNMT is an open-source toolkit for neural machine translation [42]. The project includes tools that cover the full NMT workflow: tokenization, predefined model architectures, an inference engine, and tools for deployment and monitoring of the NMT system. The aim of OpenNMT is to provide researchers with a suite of tools with high extensibility and readable code. Its primary target group is researchers with a solid background in machine translation, deep learning, and understanding of larger codebases [42].

Figure 3.3: Answer length frequency distribution, raw dataset

3.3 Preprocessing

The parallel corpora of size 212202 were shuffled, while maintaining the alignment between questions and answers, and split into train-validation-test sets. The proportions of the splits are shown in table 3.3. The validation and test splits each have 2000 question-answer pairs, and the remaining 208202 pairs are used for training. A split of 2000-5000 samples for validation and test is recommended by the maintainers of OpenNMT, even for larger NMT datasets with 10M samples (https://forum.opennmt.net/t/validation-data/128).
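A sketch of the shuffle-and-split step, keeping questions and answers aligned by shuffling the pairs rather than the individual corpora (file names and the seed are placeholders):

    import random

    def split_pairs(pairs, n_valid=2000, n_test=2000, seed=1234):
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)               # shuffling pairs preserves alignment
        test = pairs[:n_test]
        valid = pairs[n_test:n_test + n_valid]
        train = pairs[n_test + n_valid:]
        return train, valid, test

    def write_parallel(pairs, src_path, tgt_path):
        with open(src_path, "w") as src, open(tgt_path, "w") as tgt:
            for question, answer in pairs:
                src.write(question + "\n")               # line i of the source file is answered
                tgt.write(answer + "\n")                 # by line i of the target file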

OpenNMT provides tokenizers and scripts for tokenization and transfor-mation of word-level tokens into subwords. The OpenNMT tokenizer6, was


Total Dataset Size (pairs)   Average Length (tokens)   Coverage (%)
                             Question      Answer      Question      Answer
212202                       93.85         75.72       95.65         94.56

Table 3.2: Final dataset statistics

Split Size

Train 208202

Evaluation 2000

Test 2000

Table 3.3: Size of the data splits

The OpenNMT tokenizer was used to tokenize and byte-pair encode (BPE) all of the data samples. When applying BPE, the number of merge operations has to be specified. The merge operations determine the granularity of the segments. Common NMT "recipes" use 32000 merge operations.
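To make the notion of a merge operation concrete, the toy sketch below learns a handful of BPE merges over a small word-frequency dictionary, following the greedy pair-merging idea of the original BPE algorithm. It is a conceptual illustration only, not the OpenNMT tokenizer implementation, and the example words are made up.

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are represented as space-separated symbols (characters initially).
vocab = {"s t a t i c": 5, "s t r i n g": 7, "p r i n t": 4}
num_merges = 10  # the number of merge operations controls the granularity

for _ in range(num_merges):
    pair_counts = get_pair_counts(vocab)
    if not pair_counts:
        break
    best_pair = max(pair_counts, key=pair_counts.get)
    vocab = merge_pair(best_pair, vocab)

print(vocab)  # fewer, longer segments the more merges are applied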

Recent research has shown that a more prudent choice of merge operations should be considered. In particular, in NMT architectures using the Transformer, a smaller number of BPE merge operations tends to be optimal [43]. Between 0 and 4000 merge operations have been shown to improve the BLEU score for Transformer-based NMT [43].

An additional factor to consider when applying BPE is the size of the dataset. If a large number of merge operations is set for a small dataset, all tokens may ultimately be merged back into the original words. This was observed when BPE with 32000 merge operations was applied to the data samples. The resulting encoded corpora had very little to no segmentation. Instead, the corpora had been merged back into word-level tokens. With the aforementioned factors in mind, four sets of train-eval-test datasets were encoded, each with a different number of BPE merge operations. The datasets were encoded with 32000, 16000, 4000, and 3000 merge operations.

Each dataset was preprocessed using the OpenNMT preprocessing script. Preprocessing the datasets builds the word and feature vocabularies and assigns each word to an index within the vocabulary. The resulting vocabulary sizes for each BPE configuration are shown in table 3.4.


The preprocessing script was run with the set of parameters shown in figure 3.4:

$ python preprocess.py \
    -share_vocab \
    -src_vocab_size 50000 \
    -tgt_vocab_size 50000 \
    -src_seq_length 250 \
    -tgt_seq_length 250

Figure 3.4: Preprocess script with parameters used

• The parameters src_seq_length and tgt_seq_length set the maximum sequence length for questions and answers. Because the default value is 50, these parameters have to be set to 250 to retain the samples in the prepared datasets.

• The vocab_size options specify the maximum vocabulary size and were set to the default value, 50000.

• Merging of the vocabularies is done by including the share_vocab option. Because the questions and answers are both in the same "language", with the same language structure and set of words, the vocabularies for each dataset were merged. Merging of vocabularies is done in NMT for languages that share an alphabet, or in technical domains where the terminology is shared.

BPE Merge Operations   Questions Vocab Size   Answers Vocab Size   Merged Size
32000                  26595                  26817                32568
16000                  13795                  13879                16366
4000                   3697                   3697                 4430
3000                   2812                   2808                 3399

Table 3.4: Vocabulary sizes for each BPE configuration


3.4 Model

The Transformer available in the OpenNMT-py library is implemented according to the annotated PyTorch version done by the Harvard NLP group (https://nlp.seas.harvard.edu/2018/04/03/attention.html). For its hyperparameters, OpenNMT uses the values listed in the paper for the Transformer as their default [9]. The implementation and hyperparameters have been confirmed to reproduce the results of the original Transformer paper. Following the description of the model given in its paper, the dimension of the input and output of the model is d_model = 512, and the inner layer of the feed-forward network has dimension d_ff = 2048. The "base" configuration of the Transformer uses a stack of 6 encoder and 6 decoder layers, and 8 attention heads [9].

The equivalent dimensions are configured in OpenNMT-py by setting the parameters of the train.py script to the values shown in table 3.5. The dimension of the embedding layer, the attention layers, and the inner layer of the feed-forward network are set by word_vec_size, rnn_size, and transformer_ff, respectively. Specifying that the encoders and decoders should be "transformer blocks" with no recurrence is done with the encoder_type and decoder_type options. The number of layers and attention heads is set with the layers and heads parameters.

Parameter       Value
rnn_size        512
word_vec_size   512
transformer_ff  2048
heads           8
layers          6
encoder_type    transformer
decoder_type    transformer

Table 3.5: Transformer size parameters for train.py


3.5 Training

Five Transformer models were trained in total. A baseline model was trained on the dataset encoded with 32000 byte-pair merge operations and without pre-trained embeddings. Four additional models with pre-trained subword embeddings were trained, one on each of the four byte-pair encoded datasets.

The following section describes the hyperparameters used for the Transformer and the pre-trained embeddings. Where appropriate, the hyperparameters used are compared to the ones in [9]. This is done to establish any difference between the default values used in the paper introducing the Transformer and those in the default version supplied by OpenNMT-py. The most significant hyperparameters are summarized in table 3.6, and the full command used to train the model, train.py, is given in appendix A.2.

Parameter        Value
optim            adam
adam_beta2       0.998
decay_method     noam
train_steps      100000
batch_size       8192
warmup_steps     8000
batch_type       tokens
label_smoothing  0.1
dropout          0.1

Table 3.6: Transformer training hyperparameters for train.py

3.5.1 Training setup

Each of the models was trained on one Nvidia Tesla K80 GPU for 100000 training steps. The number of steps was set after initial test experiments, in which it was observed to be large enough to allow the model to converge to a minimum.


The batch size was set to 8192. It is important to note that the option batch_type was set to tokens, which enables dynamic sizing of batches. With this option set, the batch size is an approximate number of tokens in a batch.
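The idea behind token-based batching can be sketched as follows: examples are accumulated into a batch until an approximate token budget is reached. This is only an illustration of the concept; OpenNMT-py's actual batching (padding, bucketing by length, and so on) is more involved.

def batch_by_tokens(examples, max_tokens=8192):
    # examples: an iterable of token lists. Yields batches whose total
    # token count is approximately max_tokens.
    batch, tokens_in_batch = [], 0
    for example in examples:
        if batch and tokens_in_batch + len(example) > max_tokens:
            yield batch
            batch, tokens_in_batch = [], 0
        batch.append(example)
        tokens_in_batch += len(example)
    if batch:
        yield batch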

3.5.2 Optimizer

The Adam optimizer was used with the value for the second decay rate, β2, set to 0.998. This differs from the "base" configuration in the original Transformer paper, which set it to 0.98. It is, however, the value used in the OpenNMT-py tutorial for the Transformer and was therefore kept. For the learning rate, the "noam" schedule is used. The schedule increases the learning rate linearly for the specified number of warmup steps and then decays it proportionally to the inverse square root of the step number [41]. Warmup was set to 8000 steps, as defined in the OpenNMT-py tutorial, as opposed to the 4000 warmup steps used in the Transformer paper.
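For reference, the noam schedule as given in the original Transformer paper can be written as the short function below, with d_model = 512 and warmup_steps = 8000 matching the configuration used here.

def noam_learning_rate(step, d_model=512, warmup_steps=8000):
    # Linear warmup for the first warmup_steps, then decay proportional
    # to the inverse square root of the step number.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)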

3.5.3 Regularization

Dropout and label smoothing are used for regularization. Dropout is a method that randomly ignores, or "drops", units in the network. The benefit of using dropout is that it helps to prevent over-fitting [44]. An over-fit model does not generalize the knowledge gained from training and only performs well on the training dataset. A dropout value of 0.1, the probability that a unit will be dropped, is used both in the paper and in this thesis.

Label smoothing helps to distribute the NMT model's probability mass, which is commonly assigned to just a single word/token with a probability of over 99%. The peaked distribution is "smoothed" during training by modifying the target distribution used in the loss such that less probability is given to the most likely choice [45]. Label smoothing is set to a value of 0.1. The higher the value, called "temperature", the smoother the distribution, i.e., the less probability is assigned to the most probable choice [45].
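As an illustration, a common formulation of label smoothing mixes the one-hot target with a uniform distribution over the vocabulary, as sketched below. This is the standard textbook formulation and may differ in minor details from the exact OpenNMT-py implementation.

def smoothed_targets(target_index, vocab_size, epsilon=0.1):
    # Give every token a uniform share of epsilon, and the correct
    # token the remaining 1 - epsilon on top of its uniform share.
    uniform = epsilon / vocab_size
    distribution = [uniform] * vocab_size
    distribution[target_index] += 1.0 - epsilon
    return distribution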

3.5.4 Pre-trained subword embeddings

Due to the varying degrees of word segmentation in the datasets, pre-trained byte-pair encoded embeddings, BPEmb, were used. BPEmb trains its embeddings using GloVe on byte-pair encoded Wikipedia articles [46].


The embeddings are available in 275 different languages, with different embedding dimensions and numbers of merge operations. Embeddings with a dimension of 300 and 200000 merge operations were selected for this thesis.

The pre-trained embeddings are used to initialize the matching subwords in each of the (merged) vocabularies. The percentage of subwords initialized to pre-trained embeddings for each vocabulary is shown in table 3.7:

BPE Merge Operations   Coverage (%) in Vocabulary
32000                  32.54
16000                  45.66
4000                   66.66
3000                   70.23

Table 3.7: Percentage (%) of tokens initialized to pre-trained embeddings in the vocabulary
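A sketch of how the initialization and the coverage figures in table 3.7 could be computed is given below. The pretrained_vectors mapping stands in for the BPEmb lookup, and the random initialization of unmatched tokens is an illustrative assumption.

import numpy as np

def init_embeddings(vocab, pretrained_vectors, dim=300, seed=0):
    # vocab: list of subword tokens; pretrained_vectors: dict mapping
    # a token to its pre-trained vector (a numpy array of length dim).
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim))
    matched = 0
    for i, token in enumerate(vocab):
        vector = pretrained_vectors.get(token)
        if vector is not None:
            matrix[i] = vector
            matched += 1
    coverage = 100.0 * matched / len(vocab)
    return matrix, coverage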

3.6 Evaluation

The evaluation comprises two parts: an automatic evaluation and a qualitative analysis. An automatic evaluation with the perplexity metric is performed during the training of each model and on the test dataset after training. The perplexity value attained by each model on the test dataset is used to select the best-performing model for the subsequent evaluation. The purpose of the qualitative analysis is to demonstrate the responses generated by the selected model on the test dataset and on questions outside of the test corpus. A similar qualitative evaluation was carried out in [4].

3.6.1 Automatic evaluation - Perplexity


It is important to note that subword perplexities across different segmentations cannot be compared directly. For the comparison to be valid, the granularity of the tokens in the denominator has to be the same. It is therefore necessary to convert the perplexities to word-level perplexity. Perplexity is calculated by the translation script in the following manner:

ppl = exp(-total_subword_score / total_subwords)

Figure 3.5: Subword Perplexity calculation

Here, -total_subword_score is the cumulative negative log-likelihood of the model's predictions, and total_subwords is the number of tokens in all predictions.

To calculate the word-level perplexity for each model, the following calculation was carried out:

ppl = exp(-total_subword_score / total_words)

Figure 3.6: Word Perplexity calculation

That is, the only difference is in the denominator, which has to be the total number of word-level tokens in the predictions, as opposed to the total number of subwords.

The translate.py script was modified to report the cumulative negative log-likelihood (-total_subword_score) and the total number of tokens (subwords). The number of tokens is included only to confirm the calculation of the subword perplexity. An example of the output is seen in figure 3.7. The figure shows the average score (calculated as total_score/total_words), the perplexity, the cumulative negative log-likelihood (total score), and the total amount of words. The total amount of words here refers to the total number of tokens, which in this case are subwords.

PRED AVG SCORE: -0.7479, PRED PPL: 2.1126, PRED TOTAL SCORE: -69340.2578, PRED total words: 92710

Figure 3.7: Example of evaluation values reported as part of translate.py

To calculate the total number of words, the generated responses from each model's evaluation were detokenized. That is, each subword in the generated responses was merged back into complete words.


The completed sentences were then tokenized into word-level tokens using the NLTK tokenizer. From each word-level response dataset, the total number of words was counted and used to calculate the word-level perplexity as described in figure 3.6.
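A sketch of this word-level perplexity computation is shown below. The detokenization assumes a joiner-marker convention for the BPE segments (the actual marker depends on the tokenizer configuration), and total_subword_score stands for the cumulative prediction score reported by the modified translate.py (PRED TOTAL SCORE in figure 3.7).

import math
from nltk.tokenize import word_tokenize  # requires the NLTK "punkt" models

JOINER = "￭"  # assumed joiner marker; depends on the tokenizer configuration

def detokenize(subword_line):
    # Merge subwords back into complete words by removing joiner markers.
    return (subword_line.replace(" " + JOINER, "")
                        .replace(JOINER + " ", "")
                        .replace(JOINER, ""))

def word_level_perplexity(total_subword_score, predictions):
    # predictions: list of subword-level response strings.
    total_words = sum(len(word_tokenize(detokenize(line))) for line in predictions)
    return math.exp(-total_subword_score / total_words)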

3.6.2 Qualitative evaluation

In the qualitative evaluation, the (detokenized) responses of the best-performing model are compared to the expected responses in the test dataset. Of primary interest is to judge whether the responses answer the posed question, i.e., the question from the test dataset. In addition to this, it is of interest to assess the quality of the response: is the response coherent, does it relate to the same topic as the question, or is it completely irrelevant?

As part of the qualitative evaluation, questions outside of the test dataset are posed to the model. The intention is to examine the response to questions to which the answer is not known. Four questions are posed to the model:

• What is a LinkedList?

• What is the difference between an interface and abstract class in Java?
• How do I find the shortest path in a Graph?

• How do I create an array of integers in Java?

The first three questions are meant to demonstrate how the model responds to conceptual questions. This is in contrast to the fourth question, which is a specific question where the expected answer would contain code. The idea is to compare the responses to the desired, conceptual questions with how the model answers the more typical questions found on Stack Overflow and in the dataset.

3.7 Method limitations

3.7.1 Hyperparameter optimization

References
