DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2020

Distinguishing between human and computer generated questions

SEBASTIAN EMTELL

KTH

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Bachelor in Computer Science
Date: September 3, 2020
Supervisor: Ric Glassey
Examiner: Pawel Herman

School of Electrical Engineering and Computer Science

Swedish title: Skilja på människa- och datorgenererade frågor



Abstract

This thesis aims to determine how accurate humans are at distinguishing between human written programming questions and programming questions generated by Heilman's algorithm. Not a lot of previous work has been done comparing computer generated questions to human written questions. Heilman's algorithm is relatively old but widely researched.

A survey was shared with personal contacts of the author of this paper in order to gather responses. The survey aimed at making the choice harder than the simple format of "Is this question written by a human or generated by a computer?" by using groups of three questions and asking the questions "Which question is generated by a computer?" and "Which question is generated by a human?" respectively, including the option "None of the above".

The results showed that people were able to distinguish between human written programming questions and programming questions generated by Heilman's algorithm about 44.5 percent of the time. However, this number was slightly higher for the participants with programming experience (54.5 percent) and slightly lower for the participants without programming experience (38.5 percent). The results showed no significant difference whether the question was "Which question is generated by a computer?" or "Which question is generated by a human?".


Sammanfattning

This thesis aims to determine how well humans can distinguish between programming questions written by humans and programming questions generated with Heilman's algorithm. Not much previous work has been carried out in the area of comparing computer generated questions with questions written by humans. Heilman's algorithm is relatively old but has been studied extensively.

A survey was distributed to personal contacts of the author of this thesis. The purpose of the survey was to make the choice harder than the simple format of "Is this question written by a human or generated by a computer?" by using groups of three questions, asking the questions "Which question is generated by a computer?" and "Which question is generated by a human?" respectively, and including the option "None of the above".

The results showed that people were able to distinguish between human written programming questions and programming questions generated with Heilman's algorithm about 44.5 percent of the time. However, this number was slightly higher for the participants with programming experience (54.5 percent) and slightly lower for the participants without programming experience (38.5 percent). The results showed no significant difference regardless of whether the question was "Which question is generated by a computer?" or "Which question is generated by a human?".


Contents

1 Introduction
1.1 Purpose
1.2 Research Question
1.3 Approach

2 Background
2.1 Natural Language Processing
2.2 Automatic Question Generation
2.2.1 Preprocessing
2.2.2 Question Generation
2.2.3 Postprocessing
2.3 Related Works
2.4 Heilman's Algorithm
2.4.1 Preprocessing
2.4.2 Question Generation
2.4.3 Postprocessing

3 Method
3.1 Dataset
3.2 Computer Generated Questions
3.3 Human Written Questions
3.4 The Survey
3.5 Limitations

4 Results
4.1 Demographics
4.2 Computer Generated Questions
4.3 Human Written Questions
4.4 Result Summary

5 Discussion
5.1 Result Analysis
5.2 Comparison with Related Work
5.3 AQG in CS1 Education
5.4 Future Work

6 Conclusion

Bibliography

A The survey


Chapter 1 Introduction

According to Mendoza and Zavala [1], introductory computer programming courses (CS1) have an unusually high failure rate of 63 percent around the world. Because of this, many pedagogies and innovative approaches have been developed in order to make students learn computer programming more effectively. A few examples are collaborative learning, pair programming, peer-led instruction, flipped classrooms and live coding. In CS1 courses, having relevant questions for tests and quizzes is especially important since proficiency in programming is typically achieved through substantial practice and repetition.

Answering questions is a fundamental tool in education with a variety of applications, such as formal assessment and as an instrument to support learning. According to Thalheimer [2], answering questions in the form of quizzes, tests and exams can improve learning drastically, whether the questions are presented before or after the associated learning material.

It is also important to use relevant questions, since irrelevant questions can create misconceptions about which concepts are important, which can impair the learning process. Some of the benefits of answering well-defined questions are: providing practice in retrieving information from memory, giving feedback about misconceptions, focusing the student's attention on the most important learning material and repeating core concepts, and providing a second chance to learn, relearn or reinforce what has already been learned.

However, manually generating questions is a challenging task that requires a lot of training, expertise and resources [3].

Automatic Question Generation (AQG) methods were developed as a solution to the problem of generating a large number of high quality questions and have the potential to be used in several technologies such as intelligent tutoring systems, dialogue systems [4] and educational technologies [5]. AQG refers to using algorithms to generate questions from learning material. AQG can be used to reduce the resources needed when creating questions for either learning purposes or formal assessment, allowing the educators to spend more time on other educational activities. However, AQG presents various challenges, one of which arises when questions are generated from sentences containing pronouns instead of names. Such sentences can produce unacceptable questions such as "What country is he the president of?", which is unusable since there is no way of knowing who "he" is. Another challenge is choosing the correct question word. Choosing the wrong question word significantly diminishes the quality of the question and can even make questions unusable, such as "When wrote Kallocain?".

While many different AQG algorithms have been developed, relatively little research has been performed on measuring their performance against human written questions. This thesis aims to expand the research on AQG algorithms by examining whether it is possible to differentiate between questions generated by a computer and questions written by a human.

1.1 Purpose

This thesis aims to investigate the human ability to differentiate between programming questions generated by computers and humans. For the computer generated questions, the AQG algorithm developed by Heilman [6] will be used.

1.2 Research Question

How accurate are humans at differentiating between programming questions generated by Heilman’s algorithm and programming questions written by a human?

1.3 Approach

In order to answer the research question, a survey with two parts will be used. In the first part the participants are given three questions and asked which one of the questions they believe to be generated by a human. In the second part the participants are given three questions and asked which one of the questions they believe to be generated by a computer. They also have the alternative to choose "None of the above" if they believe none of the questions to be generated by a computer/human.

This format is meant to make it more challenging to find the correct alternative than using a simple question like "Is this question generated by a human or a computer?", where there are only two alternatives to choose from.


Chapter 2 Background

This chapter presents some theoretical background related to AQG. The first section gives a brief introduction to the area of Natural Language Processing (NLP) and how it relates to AQG. The second section describes the AQG process in more detail. The third section presents related work in the area of AQG. The fourth section describes the AQG algorithm used in this thesis (Heilman's algorithm) in detail.

2.1 Natural Language Processing

Natural Language Processing (NLP) is a subfield within both linguistics and artificial intelligence that has existed for a long time. Some sources claim it dates back to the 1950s [7], while others claim it began in the late 1940s [8].

NLP has no single agreed-upon definition. However, Liddy [8] defines it as follows:

“Natural Language Processing is a theoretically motivated range of com- putational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human- like language processing for a range of tasks or applications.”

According to Liddy [8] the field of NLP can be divided into two subfields: language processing and language generation. Language processing is about understanding natural language and providing a structured representation of it that a computer can comprehend. Language generation is about using the structured representation of the text in order to produce natural language. AQG combines both of these subfields, since an algorithm that generates questions from a text needs to both understand the text and generate new text in the form of questions.

When it comes to language processing there exist many different levels of understanding, such as morphological, lexical, syntactic and semantic understanding. The levels most related to AQG are syntactic and semantic understanding. Syntactic understanding refers to understanding the grammatical structure of a language and is used by AQG algorithms to guide question generation. Semantic understanding refers to the meaning of a sentence, and AQG algorithms that incorporate it require a deeper understanding of the input, beyond lexical and syntactic understanding [8] [3].

Whether AQG is a part of NLP [9] or AQG simply includes some principles of NLP [10], it is clear that the fields of AQG and NLP are closely related.

2.2 Automatic Question Generation

The question generation process can be divided into three stages. The first is the preprocessing stage, where the input text is parsed and transformed into a more suitable format, for example by simplifying or classifying sentences. The second stage is question generation, where the parsed input from the preprocessing stage is used to construct questions. The postprocessing stage makes sure the generated questions uphold a certain level of quality, for example by removing faulty questions or ranking questions based on their quality [3].

2.2.1 Preprocessing

Preprocessing can be divided into two types: standard preprocessing and QG-specific preprocessing. Standard preprocessing consists of steps used in common NLP tasks, for example segmentation, sentence splitting, tokenization, POS tagging and coreference resolution. QG-specific preprocessing is used to identify input that is more relevant for generating questions, for example through sentence simplification, sentence classification and content selection [3].
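As a rough illustration of these standard preprocessing steps, the sketch below runs sentence splitting, tokenization and POS tagging with the spaCy library; the library, the model name and the example sentence are choices made purely for the illustration, not the toolchain behind Heilman's algorithm.

```python
# Illustrative standard preprocessing: sentence splitting, tokenization and
# POS tagging. spaCy is used here only as an example NLP toolkit.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline, assumed installed

text = "Mr X, the president of Y, obtained Z. He later met Mr Q."
doc = nlp(text)

for sent in doc.sents:          # sentence splitting
    print("Sentence:", sent.text)
    for token in sent:          # tokenization
        # part-of-speech tag and dependency label for each token
        print(f"  {token.text:<10} {token.pos_:<6} {token.dep_}")
```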

2.2.2 Question Generation

This stage performs the main task of the algorithm, i.e. generating questions. The procedures used to generate questions can be divided into three categories: templates, rules and statistical methods. Templates consist of a set structure with fixed text and placeholders that are substituted with values in order to generate questions. Rules are usually used to annotate sentences with syntactic and/or semantic information that can then be used to select a suitable question type and determine how to manipulate the input to generate questions. Approaches using statistical methods learn how to generate questions from training data [3].
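The following toy sketch shows the template idea in isolation: fixed question text with placeholders filled in from the input. The template wording and slot values are invented for the example and are not taken from any particular AQG system.

```python
# Toy template-based question generation: fixed text with placeholders.
TEMPLATES = {
    "definition": "What is {concept}?",
    "purpose": "What is the purpose of {concept}?",
}

def generate_from_template(name: str, **slots) -> str:
    """Fill the placeholders of a named template with the given slot values."""
    return TEMPLATES[name].format(**slots)

print(generate_from_template("definition", concept="a class"))
# What is a class?
print(generate_from_template("purpose", concept="a constructor"))
# What is the purpose of a constructor?
```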

2.2.3 Postprocessing

This stage makes sure the output questions uphold a certain level of quality. Postprocessing is often divided into two processes: verbalisation and question ranking. Verbalisation is used to improve the grammar and fluency of questions or to provide variations of questions. Question ranking utilizes statistical models to rank questions and is used to prioritize high quality questions [3].

2.3 Related Works

Heilman [6] presented an algorithm for generating low-level factual questions. Because of this, it can be considered suitable for generating questions aimed at beginner to intermediate level students [11]. The algorithm overgenerates questions under the assumption that many of them will be unacceptable; the questions are then ranked based on quality and presented in descending order of rank.

Chali and Golestanirad [12] presented an improved version of Heilman's algorithm which used community-based question answering systems such as Yahoo! to improve the ranking of the generated questions. The system they used is rule-based for the specific questions while also incorporating templates for the basic questions. With this system they state that they managed to achieve higher syntactic correctness and topic relevance than Heilman's algorithm. They also hypothesize that further improvements could be made by exchanging their current rule-based system for learning techniques.

Earlier work in the field of AQG has mostly focused on analyzing the quality of questions generated by specific algorithms and then improving upon them or developing new algorithms. Little work has been done comparing questions generated by an algorithm to human written questions.

However, some recent work has been conducted on comparing human and computer generated questions. The study by Bonnevier and Damne [13] compared human written programming questions to programming questions generated by two different AQG algorithms, one of which was Heilman's algorithm. This was done through a survey where the participants were given two questions at a time and asked which question was generated by a computer, as well as why they thought so. The end of the survey also contained two multiple-choice questions where the participant was asked to pick out the human written questions from a group of 10 questions.

Bonnevier and Damne [13] arrived at the conclusion that people can distinguish between human and computer generated questions about 50 percent of the time and that the most distinguishing factors were obvious errors, sentence structure and formulation.

2.4 Heilman’s Algorithm

Heilman [6] presented an algorithm for generating short fact-based questions from an input text. The generated questions are no longer than one sentence and the answers are often shorter. Since only short fact-based questions are generated, there is always one clear answer, unlike the format of essay questions.

The approach used was to overgenerate questions, creating a large set of questions of varying quality, including very similar questions and faulty questions. The questions were then ranked based on quality and presented so that the questions of high quality were shown first.

Heilman's algorithm can be divided into the basic stages of the question generation process: preprocessing, question generation and postprocessing. What happens at each stage is presented in the following sections.

2.4.1 Preprocessing

The preprocessing stage of Heilman's algorithm transforms complex input sentences into simpler factual statements. This is done in two steps: first, simplified factual statements are extracted from the input sentences; then, pronouns are replaced by their antecedents.

Since sentences can convey many pieces of important information, we first need to extract simplified factual statements before forming fact-based questions. There are many ways of extracting the factual statements from a sentence. Below is one example of how it can look.

For example, from the sentence:

Mr X, the president of Y, obtained Z.


We can extract:

Mr X obtained Z.

and

Mr X is the president of Y.

When the simplified factual statements have been extracted, they sometimes contain unresolved pronouns, which almost always lead to vague questions.

For example, if a factual statement is extracted outside of its original context, it might lead to questions like:

When did he obtain Z?

This is not a comprehensible question if it is not known that "he" means Mr X.

Hence we have to change the pronoun "he" back into what it was in its original context, which is Mr X, so that we can create the question:

When did Mr X obtain Z?
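A toy sketch of these two preprocessing steps on the example above is shown below; the single regular-expression rule and the hand-written antecedent table are simplifications made for the illustration and do not reflect Heilman's actual rules or coreference resolution.

```python
import re

def extract_statements(sentence: str):
    """Split a sentence of the form 'Subject, description, rest.' into two
    simplified factual statements (a toy rule covering only this pattern)."""
    m = re.match(r"(?P<subj>[^,]+), (?P<appos>[^,]+), (?P<rest>.+)\.", sentence)
    if not m:
        return [sentence]
    subj, appos, rest = m.group("subj"), m.group("appos"), m.group("rest")
    return [f"{subj} {rest}.", f"{subj} is {appos}."]

def resolve_pronouns(text: str, antecedents: dict) -> str:
    """Replace pronouns with their antecedents using a hand-written lookup
    table, standing in for real coreference resolution."""
    for pronoun, antecedent in antecedents.items():
        text = re.sub(rf"\b{pronoun}\b", antecedent, text)
    return text

print(extract_statements("Mr X, the president of Y, obtained Z."))
# ['Mr X obtained Z.', 'Mr X is the president of Y.']
print(resolve_pronouns("When did he obtain Z?", {"he": "Mr X"}))
# When did Mr X obtain Z?
```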

2.4.2 Question Generation

The question generation stage takes a factual statement as input and generates a set of possible questions. The question generation uses the words who, what, where, when, whose and how many to transform the factual statement into a question. However, it does not generate questions about verb phrases, such as What did Mr X do in 1991?

Each statement can produce many questions. For example, the statement:

Mr X met Mr Q

Can generate the questions:

Who met Mr Q?

and

Who did Mr X meet?

and

Did Mr X meet Mr Q?

The purpose of this stage is to generate as many questions as possible. The postprocessing stage will then handle ranking the questions based on quality and present the questions of highest quality.
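A toy sketch of this overgeneration step for the example above is given below; the flat subject-verb-object representation and the hand-supplied base form of the verb are simplifications made for the illustration, whereas the algorithm itself works on parsed sentences.

```python
def overgenerate(subject: str, verb_past: str, verb_base: str, obj: str):
    """Produce several candidate questions from one subject-verb-object
    statement; weak candidates are meant to be filtered out later."""
    return [
        f"Who {verb_past} {obj}?",            # ask about the subject
        f"Who did {subject} {verb_base}?",    # ask about the object
        f"Did {subject} {verb_base} {obj}?",  # yes/no variant
    ]

for question in overgenerate("Mr X", "met", "meet", "Mr Q"):
    print(question)
# Who met Mr Q?
# Who did Mr X meet?
# Did Mr X meet Mr Q?
```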


2.4.3 Postprocessing

The question generation stage generates a large number of questions, many of which are likely to be unacceptable. Therefore, the purpose of the postprocessing stage is to rank these questions using a statistical model of question acceptability and to present the questions most likely to be acceptable first.

The statistical model uses least squares linear regression to model the quality of questions. The model produces values from 0 to 5, where a higher value indicates a higher level of question acceptability. The values are determined by combining the assessment of several features, such as grammaticality and vagueness. The total number of features used in the combined assessment is 179.


Chapter 3 Method

This chapter describes how the survey was formulated in order to answer the research question, as well as how the survey was conducted.

3.1 Dataset

The dataset used for both the human written questions and computer generated questions was an excerpt from the first chapter of the book Objects First With Java-A Practical Introduction Using BlueJ by Barnes and Kölling [14] which mostly focused on the basics of Java such as objects and classes.

3.2 Computer Generated Questions

Heilman's algorithm was used to generate the computer generated questions for the survey. Using the ranking system provided by Heilman, the highest ranking questions were chosen in order until all spots on the survey were filled. All questions chosen had a rating above 1.98 (on Heilman's scale from 0 to 5, where higher is better). Duplicate questions, and questions that were too similar to a question already chosen, were discarded during the selection process.
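A small sketch of this selection step is given below, assuming the generator returns (question, score) pairs; the use of difflib's SequenceMatcher and the 0.8 similarity cut-off are stand-ins chosen for the illustration, replacing the informal "too similar" judgement described here.

```python
from difflib import SequenceMatcher

def select_questions(ranked, min_score=1.98, needed=5, max_similarity=0.8):
    """Pick the highest scoring questions above the score threshold, skipping
    near-duplicates of questions that have already been chosen."""
    chosen = []
    for question, score in sorted(ranked, key=lambda pair: pair[1], reverse=True):
        if score <= min_score:
            break  # remaining questions are ranked too low
        if any(SequenceMatcher(None, question, kept).ratio() > max_similarity
               for kept in chosen):
            continue  # too similar to a question already on the survey
        chosen.append(question)
        if len(chosen) == needed:
            break
    return chosen

ranked = [("What is a class?", 3.1), ("What is a class ?", 3.0),
          ("Who is the author of BlueJ?", 1.5)]
print(select_questions(ranked, needed=2))  # ['What is a class?']
```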

3.3 Human Written Questions

The human written questions were written by a person experienced in the field of introductory programming, using the same dataset as the computer generated questions. The person writing the questions had English as a second language, and therefore some small grammatical errors existed in the human written questions.

3.4 The Survey

The survey was divided into three parts. The first part consisted of questions about the participant, in order to capture the general demographics of the participants.

The second part consisted of groups of three questions where the participant was asked to choose which question of the three they believed to be generated by a computer. The participants also had the option to choose "None of the above" if they believed none of the questions to be generated by a computer.

The third part followed the format of the second part, but instead of choosing which question they believed to be generated by a computer they chose the question they believed to be written by a human.

The purpose of using groups of three questions and adding the alternative "None of the above", rather than the two-question pick-one format used by Bonnevier and Damne [13], was to decrease the influence of luck, as well as to further stimulate the analytical thinking of the participants in order to obtain accurate results.

The survey was distributed to personal contacts of the author of this thesis. The full survey can be found in Appendix A.

3.5 Limitations

• The human written questions were written by a person who had English as a second language, which raises the possibility of the questions containing grammatical errors.

• The number of participants was limited due to a limited personal network and time constraints, which means the participants may not accurately represent a larger population.

• There is a sample bias since the participants of the study all belong to the personal network of the author of this thesis.

• The fact that the questions were in English while most of the participants had Swedish as their native language may have affected the results.


• Since the questions were about introductory programming, the meaning of the questions may not have been understood by the participants with little programming experience, which could have influenced their answers.


Chapter 4 Results

The results from the survey are presented below. The results are divided into three parts corresponding to the three parts of the survey. A total of 53 people answered the survey.

4.1 Demographics

This section presents the demographics of the participants, acquired from the first part of the survey.

Figure 4.1: gender distribution of the participants


As displayed in figure 4.1, more men than women took the survey, and a few participants did not identify with either gender.

Figure 4.2: age distribution of the participants

As displayed in figure 4.2, many different age groups participated in the study, with the smallest being 10-19 and 30-39. The largest age group was 20-29, which made up about a quarter of the participants. No one in the age group 0-9 participated in the study.


Figure 4.3: programming experience distribution of the participants

As displayed in figure 4.3, most of the participants had 0 years of programming experience. The groups with 1-3, 4-6 and 10+ years of programming experience were all fairly equal in size. The smallest group was the one with 7-9 years of programming experience, with only one participant.

4.2 Computer Generated Questions

This section presents the participants' answers to the questions in the second part of the survey, where the participant was asked to choose the computer generated question. Each pie chart is divided into four sections corresponding to the answer alternatives, where the green option is the correct answer. The numbers in the chart indicate how many participants chose each option.


Figure 4.4: result of question 1

Figure 4.5: result of question 2


Figure 4.6: result of question 3

Figure 4.7: result of question 4


Figure 4.8: result of question 5

4.3 Human Written Questions

This section presents the participants' answers to the questions in the third part of the survey, where the participant was asked to choose the human written question. Each pie chart is divided into four sections corresponding to the answer alternatives, where the green option is the correct answer. The numbers in the chart indicate how many participants chose each option.


Figure 4.9: result of question 6

Figure 4.10: result of question 7


Figure 4.11: result of question 8

Figure 4.12: result of question 9


Figure 4.13: result of question 10

4.4 Result Summary

The total number of correct answers was 236 out of 530. Of those 236 correct answers, 120 were in the second part of the survey and 116 were in the third part.
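As a quick check of the figures used later in the discussion (each of the 53 participants answered all ten question groups, five per part), the counts above correspond to the following proportions:

\[
\frac{236}{530} \approx 44.5\,\%, \qquad
\frac{120}{265} \approx 45.3\,\%, \qquad
\frac{116}{265} \approx 43.8\,\%,
\]

compared with the \(1/4 = 25\,\%\) expected from guessing uniformly at random among the four alternatives.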

The results were also analyzed based on whether the participant had programming experience or not. The result can be seen below:

Figure 4.14: result summary

As displayed in figure 4.14, the participants with programming experience guessed correctly more often than those without programming experience.

Due to the low number of participants, dividing the results into smaller groups based on the demographics of the participants would give unreliable results.


Chapter 5 Discussion

This chapter discusses the results with the aim of answering the research question: How accurate are humans at differentiating between programming questions generated by Heilman's algorithm and programming questions written by a human?

5.1 Result Analysis

Since the total number of correct answers was 236 out of 530, the average participant of this study was able to guess correctly 44.5 percent of the time. This is considerably higher than the 25 percent that would be expected if the answers were chosen at random, since each question has four alternatives. This result indicates that the participants were able to distinguish between the human written and computer (Heilman's algorithm) generated questions in this study to some degree.

Since the percentage of correct answers was higher for participants with programming experience (54.5 percent) than for those without it (38.5 percent), having programming experience seems to have made it easier to distinguish between the human written and computer generated programming questions in this study. This result was to be expected, since programming experience makes it easier to understand the questions and therefore assists in the process of distinguishing between them.

Since the numbers of correct answers in the second and third parts of the survey were close to equal (120 and 116), it seems the participants found it equally challenging to pick out computer generated questions among human written ones as to pick out human written questions among computer generated ones.


5.2 Comparison with Related Work

The similar study by Bonnevier and Damne [13] arrived at the result that people could distinguish between human written and computer generated questions about 50 percent of the time. The percentage was slightly higher when the participants were asked to pick one of two questions (60 percent) and slightly lower when asked to pick any number of questions from a group of ten questions (43.4 percent). Picking any number of questions from a group is the process which most closely resembles the approach in this study, and the result of 43.4 percent is also close to the result in this study, 44.5 percent. This further supports the accuracy of the results in this study; however, it is important to note that the study by Bonnevier and Damne [13] used two algorithms, only one of which was Heilman's algorithm. They also stated that Heilman's algorithm performed slightly worse than the other algorithm.

5.3 AQG in CS1 Education

The results of this study indicate that even a well researched algorithm such as Heilman's is still not reliable enough to be used in programming education by itself. Some of the questions created by the algorithm could easily be identified as computer generated, such as the first question of the study, where as many as 66 percent of the participants identified the question correctly. The algorithm also has some problems due to the fact that it overgenerates questions, for example that several questions can be close to identical, only formulated slightly differently. It has some potential to be used in education if an educator looks through the questions and picks the ones that are of high enough quality; that way, the time needed to create exercise questions could be reduced drastically. However, it still falls short of the performance needed to be used in automatic online quizzes and the like without human involvement.

There is also the ethical perspective to consider. In education there needs to exist a certain measure of trust between educator and student. The students trust that the learning material provided by the educators is created for the specific purpose of boosting their knowledge and intuition in the subject.

Therefore, the educators have a responsibility to put thought and consideration into the learning material in order to provide the best possible learning experience for the students. If an algorithm is used to generate learning material of lacking quality without any additional involvement from the educators, the students may lose their trust in the education they receive.

5.4 Future Work

One of the largest limitations of this study was that the participants were few and sample biased, since they all belonged to the author's personal network. One possibility for future work would be to increase the sample size and include enough diversity to accurately represent a larger group of people. On top of this, changing the topic of the questions from programming to general questions would remove the limitation that the questions require certain knowledge to be understood.

Another possible future study could use Heilman's algorithm to create learning material and use it in CS1 education in order to determine its potential in practice.


Chapter 6 Conclusion

The participants of this study, consisting of both people with and without experience in computer programming, were able to distinguish between human written questions and questions generated by Heilman's algorithm about 44.5 percent of the time. The percentage was higher for the participants with programming experience, at 54.5 percent, and lower for the participants without programming experience, at 38.5 percent.

Whether the question was "Which question was generated by a computer?" or "Which question was generated by a human?" seemed to make no significant difference.

Heilman's algorithm can be considered too unreliable to be used in education by itself. However, if the process of generating learning material is managed by an educator using the algorithm as a tool, it could potentially reduce the time needed to create learning material.


Bibliography

[1] Benito Mendoza and Laura Zavala. "On the Use of Semantic-Based AIG to Automatically Generate Programming Exercises". In: The 49th ACM Technical Symposium on Computer Science Education (2019).

[2] Will Thalheimer. "The learning benefits of questions". In: Work-Learning Research (2003).

[3] Ghader Kurdi et al. "A Systematic Review of Automatic Question Generation for Educational Purposes". In: International Journal of Artificial Intelligence in Education (2019).

[4] Marilyn Walker, Owen Rambow, and Monica Rogati. "SPoT: A Trainable Sentence Planner". In: Proceedings of NAACL (2001).

[5] Arthur Graesser et al. "AutoTutor: An Intelligent Tutoring System With Mixed-Initiative Dialogue". In: IEEE Transactions on Education (2005).

[6] Michael Heilman. "Automatic Factual Question Generation from Text". In: Language Technologies Institute, School of Computer Science, Carnegie Mellon University (2011).

[7] Madeleine Bates. "Models of natural language understanding". In: Proceedings of the National Academy of Sciences (1995).

[8] Elizabeth Liddy. "Natural Language Processing". In: Encyclopedia of Library and Information Science (2001).

[9] Sheetal Rakangor and Y.R. Ghodasara. "Literature Review of Automatic Question Generation Systems". In: International Journal of Scientific and Research Publications (2015).

[10] Andreas Papasalouros and Maria Chatzigiannakou. "Semantic Web and Question Generation: An Overview of The State of The Art". In: International Association for Development of the Information Society (2018).

[11] Maria Chinkina and Detmar Meurers. "Question Generation for Language Learning: From ensuring texts are read to supporting learning". In: LEAD Graduate School and Research Network, Department of Linguistics (2017).

[12] Yllias Chali and Sina Golestanirad. "Ranking Automatically Generated Questions Using Common Human Queries". In: Proceedings of the 9th International Natural Language Generation Conference (2016).

[13] Linnea Bonnevier and Sara Damne. "Human vs Computer Generated Questions". In: KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science (2020).

[14] David J Barnes and Michael Kölling. Objects First With Java: A Practical Introduction Using BlueJ. Pearson Education, 2004.


Appendix A The survey


Figure A.1: part 1 of the survey


Figure A.2: part 2 of the survey


Figure A.3: part 3 of the survey


Figure A.4: end of the survey


www.kth.se

TRITA-EECS-EX-2020:684
