
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2020

Human vs Computer Generated Questions

- A Survey Study

LINNEA BONNEVIER

SARA DAMNE


Human vs Computer Generated Questions

- A Survey Study

LINNEA BONNEVIER AND SARA DAMNE

Bachelor's Thesis in Computer Science
Date: June 2020

Supervisor: Richard Glassey
Examiner: Pawel Herman

Swedish title: Människa- vs Datorskrivna Frågor - En Enkätstudie
School of Electrical Engineering and Computer Science


Abstract

This thesis investigates whether humans can tell apart computer generated and human written questions. Limited work has previously been done in the area of comparing the results of different algorithms. This thesis used generated questions from two different algorithms. One was recently developed and more advanced than the other, but not as thoroughly tested. The other algorithm was older but widely researched.

A survey was conducted and shared across social media to gather large amounts of data on people's perception of different questions. The survey focused on testing common misconceptions, such as grammar and spelling, to see if they affected the choices made by the participants. The survey also tested whether questions could be distinguished in a more general setting, without directions or questions with obvious errors.

The results showed that people are able to tell the generated questions apart about 50 percent of the time. The results were consistent and did not vary between different questions or approaches. The most distinguishing factors were obvious errors such as misspellings and grammatical errors, as well as sentence structure and formulation.


Sammanfattning

This thesis investigates whether humans can tell the difference between questions written by an algorithm and questions written by a human. Limited work has previously been done in the area of comparing results from different algorithms. The report used generated questions from two different algorithms. One was recently developed and more advanced than the other, but sparsely tested. The other algorithm was older but thoroughly tested.

A survey was conducted and shared on various social media to gather large amounts of data on people's perception of different questions. The survey focused on testing common distinguishing features, such as grammar and spelling, to see if they affected the participants' choices. The survey also tested whether questions could be distinguished in a more general setting, without given task instructions or questions with obvious errors.

The results showed that people can distinguish the computer generated questions about 50 percent of the time. The results were consistent and did not vary between different questions or different approaches. The most distinguishing factors were obvious errors such as misspellings and grammatical errors, sentence structure and formulation.


Contents

1 Introduction
  1.1 Purpose
    1.1.1 Research question
  1.2 Approach
  1.3 Thesis Outline
2 Background
  2.1 Natural Language Processing
  2.2 Automated Generated Questions
  2.3 Related Work
3 AGQ Algorithms
  3.1 Algorithm 1 - Heilman's algorithm
    3.1.1 Prework
    3.1.2 Part 1 - Transforming sentences
    3.1.3 Part 2 - Creating questions
    3.1.4 Part 3 - Ranking questions
  3.2 Algorithm 2 - SQUASH
    3.2.1 Training datasets
    3.2.2 The SQUASH pipeline
4 Method
  4.1 Datasets
  4.2 Computer generated questions
  4.3 Human written questions
  4.4 The Survey
  4.5 Limitations
5 Results
  5.1 Part I - Demographics
  5.2 Part II - Which Question is Computer Generated
  5.3 Part III - Which Questions are Written by a Human?
  5.4 General Comments
6 Discussion
  6.1 Part II
  6.2 Part III
  6.3 General comments
  6.4 Comparison with Related Work
  6.5 Future Research
7 Conclusion
References
List of Figures
List of Tables
8 Appendix
  8.1 Appendix A - Survey


1 Introduction

How are you doing? What is the capital of Nigeria? Why are they asking about Nigeria? How many people have walked on the moon? When will these questions end? Questions are everywhere around us, asked by others, by ourselves and even by computer programs.

The use of Automated Generated Questions (AGQ) dates back around 40 years[13]. The questions are produced by feeding a text to an algorithm that then generates questions from it. AGQ can be used to generate questions for frequently asked questions pages on websites, for educational purposes and more[10].

AGQ presents a set of challenges. One of the biggest is selecting the correct question word for each noun. Choosing the wrong word significantly impacts the quality of the questions and can in the worst case make the question unusable, such as ”Who is the capital of Nigeria?”. Another challenge is paraphrasing, where two sentences can have the same meaning but consist of different words, which makes an algorithm interpret the sentences differently.

Many different algorithms that generate questions given an input text have been developed, yet little extensive work has been done on examining the results of these algorithms. To use AGQ systems in practice, the quality of the generated questions needs to be assured. Most AGQ systems have been evaluated individually, but little research has been done comparing different algorithms with each other. Furthermore, not much research has been done comparing AGQ questions with human written ones.

This thesis aims to extend the research on comparing the results from different AGQ algorithms and to see how they compare to human written questions.

1.1 Purpose

This thesis aims to investigate and challenge the perception of human vs computer generated questions. The focus lies on whether people can tell the different questions apart and, if so, on finding patterns showing what AGQ algorithms can improve.

1.1.1 Research question

• To what extent can people correctly distinguish between questions written by humans and questions automatically generated by computers? And what main differences can be identified?

1.2 Approach

In order to answer the research question, a survey will be conducted where the participants are asked whether a series of questions are computer generated or not. They will also be asked why they answered one way or another. From this, the aim is to get a sense of what aspects or details people think are computer generated vs human written and to see if there are any clear patterns.

1.3 Thesis Outline

In the second chapter a background to AGQ is presented. In chapter three, two AGQ algorithms are described in detail. In the fourth chapter the method used is described and presented. The fifth chapter presents the results. The sixth chapter contains the discussion and the seventh chapter concludes the thesis.


2 Background

In the following chapter the theoretical background for the thesis is presented. The first section covers Natural Language Processing (NLP) and gives an overview of its areas of application. One particular area of application is Automated Generated Questions (AGQ), which is covered in the second section. The chapter ends with a section on current related work in AGQ.

2.1 Natural Language Processing

The research and use of NLP began in the 1950s and lies in the overlap between artificial intelligence and linguistics[15]. It can briefly be explained as a process that computerizes the analysis of text. NLP is a crucial part of AGQ. Elizabeth D. Liddy at Syracuse University defines NLP as follows.

”Natural Language Processing is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications.” [12]

With naturally occurring texts Liddy means texts in any language, mode or genre, written or oral, that are not created for the analysis itself. Liddy states that the only requirement is that the text is in a human language used with the purpose of communicating with one another. Lastly, Liddy includes for a range of tasks or applications. This points to the fact that NLP is not an application used in itself. It is instead a part of many other areas of use, a couple of which are discussed further down. It should be mentioned that NLP is a broad research field and there is no single, agreed-upon definition. The definition above is just one among many[12].

NLP was originally referred to as Natural Language Understanding (NLU). Both NLP and NLU include paraphrasing an input text, translating texts into different languages and answering questions about the content of the text. That understanding has been updated to processing is mostly due to the fact that understanding also includes the ability to draw inferences from a text, which is something NLP is not capable of, even if it still remains a general research goal[12].

At a lower level of understanding, NLP systems deal with lexical structures, syntax and morphology. NLP systems can also analyze a text at a higher level of understanding, where the system adds semantics, pragmatic aspects and discourse[12].

NLP is used in a series of different areas. Any application that uses some form of text could utilize NLP. Some of the most common applications that use NLP are:

• Information retrieval


• Information extraction

• Question answering

• Summarization

• Machine translations

• Dialogue systems [12]

2.2 Automated Generated Questions

One branch of NLP is text generation, which includes automatically generating questions. There are different approaches to dealing with this complex task.

The most common approach creates questions from one single sentence. These systems therefore start by splitting the input text on every new sentence. They then try to simplify the sentences by removing unnecessary parts or words, before they transform the sentences into questions based on their own sets of rules and patterns[13].

There are also systems that create questions based on paragraphs instead of single sentences. These systems require the use of discourse cues that imply which sentences can be used together and from which text segments more conceptual questions can be constructed[13].

Other approaches include, for example, template based AGQ, where questions are mapped to different templates that cover common subjects, or systems that combine NLP and information retrieval in order to find possible answers from which to create questions. Which approach is the most suitable for a system depends on the purpose of the generated questions and on the type of data used as input[1].

There are a number of challenges when it comes to AGQ. For starters there are lexical challenges, especially when it comes to nouns. Identifying which nouns can be paired with which question word is of great importance. The wrong question word can make a question significantly worse or even make it completely unusable. For example, if the text contains the name of a person, the word who will be the appropriate question word, but if the name is not the name of a person but of a place, the question word should instead be where. This kind of knowledge requires either large databases that specify what type of noun a word is, or that the system makes use of machine learning to teach it which word goes with which. This problem becomes even more complex if the system includes many different languages[6].
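A minimal sketch of this category-to-question-word selection is given below. It is our own Python illustration, not taken from any particular system; the noun.* category labels follow WordNet-style supersense naming, and the dictionary contents and fallback behaviour are assumptions made for the example.

```python
# Illustrative only: a toy lookup from noun category to question word.
# The categories and the fallback are assumptions for this sketch.
QUESTION_WORD_BY_CATEGORY = {
    "noun.person": "Who",
    "noun.location": "Where",
    "noun.time": "When",
    "noun.quantity": "How many",
}

def question_word(noun_category: str) -> str:
    # Fall back to "What" when the noun type is unknown or untyped.
    return QUESTION_WORD_BY_CATEGORY.get(noun_category, "What")

print(question_word("noun.person"))    # Who   (e.g. "John F. Kennedy")
print(question_word("noun.location"))  # Where (e.g. "Nigeria")
print(question_word("noun.artifact"))  # What  (fallback)
```

Real systems must of course first decide which category a noun belongs to, which is where the databases or machine learning mentioned above come in.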

Another lexical challenge with AGQ occurs with paraphrasing. Even though a system could identify a sentence such as John F. Kennedy was murdered. as a who-question, it might not understand that The president of the United States got murdered. has the same meaning and should also result in a who-question. Other challenges include idioms, vagueness, world knowledge, etc.[6]

2.3 Related Work

Heilman[6] published an article in 2011 that presented an algorithm that generated factual questions. The system overgenerated questions and later ranked these according to a set of rules. This algorithm is described in detail in section 3.1.

Chali and Golestanirad presented in 2016 a proposed improvement of Heilman's algorithm that generated questions for topics of interest. Chali and Golestanirad mostly aimed to improve the ranking system in order to improve the quality of the generated questions. The system used templates to generate and rank general questions and implemented rules in order to generate and rank specific questions. They state that this work could probably be further developed by applying learning techniques to the templates using different corpora[3].

Krishna and Iyyer presented an algorithm, called SQUASH, that turns a document of text into a hierarchy of questions and answers. The algorithm built upon earlier work and used comprehension datasets for training[10]. This algorithm is described in detail in section 3.2.

Earlier work regarding comparing generated questions has mostly focused on analyzing the quality of generated questions and ranking or grading these according to different systems. Focus has primarily been on checking the quality of a specific algorithm rather than comparing generated questions to human written ones.

Heilman[6] performed a study where teachers were given six articles, for each of which they were asked to create three questions. The participants could choose to either write a question on their own, choose a computer generated question or revise a generated question. The algorithm used to generate these questions is described in section 3.1. Almost 59 percent of the participants chose to use at least one generated question without editing it for all six articles. 94 percent chose to use at least one generated question for more than three of the articles. Of the three questions asked for, on average 2.2 were computer generated for each article. On average only 0.7 questions were revised generated questions. None of the participants chose to use this option for all articles, although 53 percent chose revised questions for more than three articles[6].

Furthermore, Chali and Golestanirad[3] compared their algorithm to Heilman's algorithm. A study with three participants was conducted where the participants were asked to grade 20 questions based on syntactic correctness and topic relevance. Half of the questions were generated using Heilman's algorithm and half were generated using the writers' own algorithm. Grades were given on a scale of 1 to 5. For syntactic correctness the Heilman algorithm scored 3.13 compared to the writers' own system, which scored 4.05. Based on topic relevance the scores were 3.42 vs 4.06[3].

Krishna and Iyyer[10] let people determine whether generated questions from their algorithm, SQUASH (described in section 3.2), were of proper quality when it came to grammar and pragmatics. The questions' relevance and usefulness were also taken into consideration. Their study showed that 85.8 percent of the generated questions were of good quality and that 78.1 percent were considered relevant[10].

In the next chapter more details about Heilman's algorithm and the SQUASH algorithm are presented. These are the two algorithms that this study has chosen to use for its survey. Heilman's algorithm was chosen since it was one of the earlier question generating algorithms[6] and many algorithms since then have been based upon it[3]. The SQUASH algorithm was chosen since it was fairly new, published in July 2019[10].

3 AGQ Algorithms

The purpose of this section is to describe in detail the algorithms used in this study. First the Heilman algorithm is presented, followed by the SQUASH algorithm.

3.1 Algorithm 1 - Heilman’s algorithm

The first algorithm used in this study was an algorithm published in 2011 by Michael Heilman at Carnegie Mellon University. The purpose of the algorithm was to, given an input in the form of an article of text, generate a list of ranked, fact based questions. The approach was to overgenerate questions from the given text and then rank these to present good quality questions first[6].

This algorithm creates questions on a fact based level. The expected answers are short, ranging from a few words to a sentence long. No longer, essay-like answers are expected and therefore no questions corresponding to that type of answer are created[6].

The algorithm consists of three parts, plus a prework stage: transforming sentences from the input text, creating questions from these sentences and ranking the generated questions[6].

3.1.1 Prework

Before the first step the algorithm does some prework on the given text. Here it uses a couple of external sources. The first one is the Stanford Parser[9]. The Stanford parser splits the text into sentences, tokenizes the sentences and then parses the text. This results in syntax trees where one tree represents one sentence. Sentences over 50 tokens (words) are not parsed, since such sentences are considered too complicated to transform into questions[6].

In the prework the algorithm also uses the Tregex tree searching language[11]. This tool identifies and labels relevant syntactic elements such as the subject of the sentence. The tool also identifies relations between words, e.g. dominance and precedence, which gives information such as who or what a pronoun refers to. These child nodes can then be deleted and the syntax tree modified, so that questions will be less vague[6].

Lastly in the prework the algorithm labels nouns more precisely. It uses a supersense tagger described by Ciaramita and Altun in the article Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger published in 2006[5], together with the lexical database WordNet[14], which labels common nouns by category such as person, location or organization. For example, the name of a person will be labeled noun.person. This helps determine what question word to use for a question based on that sentence[6].
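As a rough, hands-on illustration of this kind of labeling, the sketch below uses NLTK's WordNet interface, which exposes the same noun.* lexicographer categories. This is our own stand-in for illustration, not the Ciaramita and Altun tagger that Heilman's system actually uses, and the printed categories depend on the installed WordNet version.

```python
# Sketch of supersense-style noun labelling via NLTK's WordNet interface.
# Assumes: pip install nltk, followed by nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def supersense(noun: str) -> str:
    """Return the WordNet lexicographer category of the most common sense."""
    synsets = wn.synsets(noun, pos=wn.NOUN)
    return synsets[0].lexname() if synsets else "noun.unknown"

for word in ["teacher", "Paris", "Tuesday"]:
    print(word, "->", supersense(word))
# Typical output (WordNet-version dependent):
# teacher -> noun.person
# Paris   -> noun.location
# Tuesday -> noun.time
```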

3.1.2 Part 1 - Transforming sentences

The first step in transforming sentences from the input text is to extract simpler factual statements from the given sentences. This means that unnecessary words, such as descriptive words, are removed[6].

The second step in this part is to replace vague pronouns. This is done by using the labeling created by the Tregex tool described in 3.1.1. In this stage vague pronouns are replaced based on the node-to-node relationship found in the corresponding syntax tree. This prevents questions such as When was she born? [6].

3.1.3 Part 2 - Creating questions

This part starts by identifying answer phrases; hence, the answer to each question is part of the result of the algorithm. The final output, after all parts, will be a list of questions together with their corresponding answers and ranks. Answer phrases can belong to one of three different categories: noun phrases, prepositional phrases and subordinate clauses. Each of these categories is then paired with different kinds of question words[6]. For example, a noun phrase is paired with who or what, and so on. The algorithm can generate questions with the following question words: who, what, where, when, whose and how many, as well as yes or no questions. One sentence can, and often does, result in many questions. As mentioned in the beginning of section 2.3, the algorithm is designed to overgenerate questions, so for each sentence it tries to create many different questions[6].

Part 2 also takes care of formatting, such as adding a question mark, removing additional white spaces, etc.[6]


3.1.4 Part 3 - Ranking questions

The last part of the algorithm is to rank the generated questions so that the user will not have to go through a lot of questions to find one of decent quality. In order to do this, the creator of the algorithm developed an internal system which assesses the quality of different features of the questions and then calculates a total score. Each feature is a value between 1 and 5, where 5 is the highest and 1 is the lowest[6].

There are 179 features categorized into nine categories:

• Length: includes features based on the number of tokens in the question, the answer and the corresponding original phrase.

• Question words: checks whether the question has a valid question word, which is considered better than yes/no questions, and if so which question word the question uses.

• Negation: assesses the question based on whether words linked to negation are part of the question.

• N-Gram Language Model Features: includes features based on the likelihood of the relation between the length of the question, the length of the answer and the length of the original sentence.

• Grammatical features: looks at the tokens that make up the question and values it based on grammatical rules.

• Transformations: features based on what transformations were made during part 1.

• Vagueness: based on how many pronouns the question and answer have and whether these are specific names etc. or vague substitutes.

• Pronoun replacement: a boolean value that is true if pronouns have been replaced in the creation of the question. In that case, features regarding pronouns, like the ones in the vagueness category, will be taken into consideration.

• Histograms: also related to the length of the question. Includes features that indicate whether the length has exceeded different predecided thresholds.

All these features are then combined to calculate the final ranking score of the question. The output is sorted based on this ranking score, with the highest ranked question first[6].
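To make the mechanics concrete, the sketch below shows a toy version of the score-and-sort step in Python. The feature names, weights and 1-5 values are purely hypothetical assumptions of this sketch; the real system combines all 179 features described above, whereas this only illustrates combining feature values into a total score and sorting on it.

```python
# Toy ranking sketch: hypothetical feature names and weights, not Heilman's.
FEATURE_WEIGHTS = {
    "question_word_quality": 1.5,
    "grammaticality": 2.0,
    "vagueness": -1.0,      # vaguer questions should rank lower
    "length_penalty": -0.5,
}

def score(features: dict) -> float:
    """Combine per-feature values (here on a 1-5 scale) into one total score."""
    return sum(FEATURE_WEIGHTS.get(name, 0.0) * value
               for name, value in features.items())

questions = [
    ("Who was the first person on the moon?",
     {"question_word_quality": 5, "grammaticality": 5, "vagueness": 1, "length_penalty": 2}),
    ("When is it?",
     {"question_word_quality": 4, "grammaticality": 4, "vagueness": 5, "length_penalty": 1}),
]
# Sort with the highest ranked question first, as in the algorithm's output.
for text, feats in sorted(questions, key=lambda qa: score(qa[1]), reverse=True):
    print(round(score(feats), 1), text)
```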

3.2 Algorithm 2 - SQUASH

The second algorithm used in this study was developed by Kalpesh Krishna and Mohit Iyyer at the University of Massachusetts Amherst in 2019. They call the algorithm SQUASH, Specificity-controlled QUestion AnSwer Hierarchies. The algorithm aims to generate question-answer (QA) pairs from a given text. The algorithm provides the user with questions ranked from general overview questions to more specific, detailed questions. The algorithm also generates yes/no questions. The main purposes of questions generated by this system are educational use and frequently asked questions on, for example, websites[10].

SQUASH generates questions with the help of templates taken from different datasets (see 3.2.1). The algorithm trains on these datasets and then generates similar questions from a selected answer span of the user's input text[10].

SQUASH used QA datasets which divided the QA pairs into two categories, general and specific. These two categories were a fundamental part of the algorithm. All generated questions were labeled according to how specific or general they were. The hierarchy in which the result was returned was also based on these two categories, with the more specific questions at the top[10].

3.2.1 Training datasets

SQUASH used already existing datasets but modified them to suit the QA hierarchy that the system aimed to generate. The first dataset that the algorithm utilized was SQuAD[16], which was designed for individual questions, meaning it did not compare the questions to each other. Other datasets used by SQUASH include QuAC[4] and CoQA[17], which contain relations between questions in a sequential order. To modify these datasets SQUASH added a number of rules that further labeled QA-pairs in a general-to-specific hierarchy[10].

For questions that could not be labeled using these rules, a data-driven approach was used instead. The creators of SQUASH first labeled 1000 questions taken from QuAC manually, using the same labels as above, and then let a single-layer CNN binary classifier[8] label the rest of the questions in the datasets used[10].

3.2.2 The SQUASH pipeline

Given an input text the system divides it into paragraphs. For each paragraph it then generates questions through five steps: answer span selection, conditional question generation, extracting answers for the generated questions, filtering out questions and structuring the questions into the wanted hierarchy[10].

Answer Span Selection

The algorithm considers all sentences as possible answers for general and specific questions. For specific questions it also identifies all entities and numerics as possible answers[10].

Conditional Question Generation

The question generation part of the system was trained on the datasets before the user's text was handled. The training was done with a neural encoder-decoder model. At the training stage the answer spans from the original texts were used as input. For every span there was a limit of 10 questions, in order to reduce duplicates and unanswerable questions. At test time the algorithm used the answer span, its label and which paragraph it related to. It used beam search to overgenerate questions for each span. First it generated three question candidates, which were mostly quite general or generic. To add diversity, it thereafter generated ten more question candidates with a more random specificity level and different focuses within the answer span[10].

Answering Generated Questions

After the questions were generated, the algorithm checks that each question corresponds to the answer given in the original text. In the cases where the answer was found in another span of the paragraph, the answer was altered to fit the generated question[10].

Question filtering

The system was programmed to overgenerate questions from each paragraph, resulting in the need to filter out bad questions afterwards. This is done in two different stages. First, simple heuristics were used to remove low-quality questions as well as duplicates and generic questions. In the second stage the remaining QA-pairs were filtered based on whether the question was generated from irrelevant or repeated entities or numerics. Questions were also filtered out if the question was unanswerable or if the answer overlapped very little with the question. Lastly, for paragraphs where many questions still remained, only the question with the highest quality for each answer span was selected[10].
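The Python sketch below outlines the kind of two-stage filtering described above. It is a simplified stand-in, not the exact SQUASH heuristics: the dictionary keys, the generic-question list and the single quality score are assumptions made for the illustration.

```python
# Simplified filtering sketch; field names and heuristics are assumptions.
GENERIC_QUESTIONS = {"What is this about?", "What happened?"}  # placeholders

def filter_questions(candidates):
    """candidates: dicts with 'question', 'answer_span', 'answerable', 'score'."""
    # Stage 1: drop duplicates, generic questions and unanswerable candidates.
    seen, stage_one = set(), []
    for cand in candidates:
        key = cand["question"].strip().lower()
        if key in seen or cand["question"] in GENERIC_QUESTIONS or not cand["answerable"]:
            continue
        seen.add(key)
        stage_one.append(cand)

    # Stage 2: keep only the highest scoring question per answer span.
    best_per_span = {}
    for cand in stage_one:
        span = cand["answer_span"]
        if span not in best_per_span or cand["score"] > best_per_span[span]["score"]:
            best_per_span[span] = cand
    return list(best_per_span.values())
```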

Forming hierarchy

The QA-pairs were structured in a hierarchy relative to the rest of the questions from the same paragraph. A question labeled as specific was linked to the general question whose answer it overlapped the most with. If no such question was found, it was mapped to the closest general question whose answer came before the specific question's answer in the original text. The structured result was then presented to the user in a tree structure. It was shown in a bottom-up order, with the specific questions first, followed by their parent questions and so on[10].
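A minimal sketch of this linking rule is given below. It is our own illustration: answers are modelled as (start, end) character offsets in the source text, which is an assumption of the sketch rather than a detail taken from the SQUASH paper.

```python
# Sketch of attaching a specific question to a general one by answer overlap.
def overlap(a, b):
    """Length of the overlap between two (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def attach_specific(specific_span, general_qas):
    """general_qas: list of (question, (start, end)) pairs in document order."""
    best = max(general_qas, key=lambda qa: overlap(specific_span, qa[1]))
    if overlap(specific_span, best[1]) > 0:
        return best[0]
    # Fallback: the closest general question whose answer starts earlier.
    preceding = [qa for qa in general_qas if qa[1][0] <= specific_span[0]]
    return (preceding[-1] if preceding else general_qas[0])[0]

generals = [("What is covered in this chapter?", (0, 120)),
            ("What is a CPU?", (121, 300))]
print(attach_specific((150, 170), generals))  # -> What is a CPU?
```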


4 Method

This section describes the method used to answer the research question. The focus is specifically on how questions were generated using different methods and how the survey was conducted. The purpose of the survey was to gather diverse opinions and make it possible to detect patterns in what people assumed were human written and computer generated questions.

4.1 Datasets

The same dataset was used for generating the human and the computer questions. Two different texts on introductory programming were used. The first text was taken from Objects First With Java - A Practical Introduction Using BlueJ by David J. Barnes and Michael Kölling[2], and focuses on basic Java, mostly objects and classes. The other text was a sample chapter from a programming book published by Pearson[7], with a focus on hardware and assembly.

4.2 Computer generated questions

The Heilman[6] algorithm was used to generate questions. Using Heilman's built-in rating system (see 3.1.4), we chose the questions that were rated above two, were unique¹ and had no clear errors.

The SQUASH algorithm[10] was used to generate questions. Questions requiring context or questions considered too vague were left out.

4.3 Human written questions

To generate the human written questions we used two different techniques. The first one was to gather a group of humans to write questions based on a specific text. The participants were given instructions to write simple fact based questions based on the text. The people participating in this process all had English as a second language. Therefore, natural grammar and spelling errors occurred in the human written questions, and these were not corrected. A total of 22 questions were generated and seven questions were presented in the survey.

The second text used included study questions at the end of the chapter. A total of 34 questions were selected from these, of which eight were presented in the survey.

¹ Unique as in sharing no clear similarities to other questions.


4.4 The Survey

A survey (see appendix A) was conducted and shared via social media. The survey was shared in various groups on Facebook, on our personal feeds, LinkedIn and Reddit. In total 367 people participated.

The survey was divided into three different parts. The first part consisted of demographic questions: age, gender and years of programming experience.

The second part was centered around comparing one human and one computer question generated from the same text. In this section, common misconceptions such as spelling and grammar errors were tested to see if they could be factors that sway a person to choose one of the options over the other. This part consisted of five questions. For example:

What does a programmer need to do to create, create, and test computer programs?
vs
What happens once the keys has been found?

The participants could also voluntarily answer a why question after each one to share their thoughts on why a specific option was chosen.

In the last section we had five human written questions and five computer generated questions in two queries, one for each algorithm. This was to see if specific questions stood out as either human or computer generated. This section also included a general question on what factors had made the participants choose their answers. The order of the questions in this part was randomized and therefore different for all the participants.

After completing the survey the participants had the possibility to view the correct answers in an external file. This was added after feedback from participants who wished to see the correct answers and compare with each other. The full survey was conducted in English, with the possibility to write longer answers in Swedish. The full survey can be found in appendix A.

The data generated by the survey has been used to make the graphs and diagrams using Microsoft Excel and Google Sheets. Data was analysed in batches and divided into different interest groups, for example by years of programming experience. The comments written by the participants were divided by the authors into different categories such as content, grammar, spelling, formulation, word choice and sentence structure. See section 5 for the data and results of the survey.
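As an illustration of how the per-group averages reported in section 5 (tables 1-3) can be computed from a raw survey export, a small Python sketch is given below. The column names and file name are hypothetical; the actual analysis for this thesis was carried out in Excel and Google Sheets.

```python
# Hypothetical sketch: average correct answers per experience group from a
# CSV export of the survey (column and file names are assumptions).
import csv
from collections import defaultdict

def average_correct_by_group(path: str) -> dict:
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of correct, count]
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            group = row["programming_experience"]        # hypothetical column
            correct = int(row["correct_answers_part2"])  # hypothetical column
            totals[group][0] += correct
            totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

# Example usage: average_correct_by_group("survey_responses.csv")
```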


4.5 Limitations

• The human generated questions were written by non-native English speakers and may therefore include language mistakes.

• Two algorithms were used to generate questions. Therefore, conclusions drawn from this survey might not be applicable to AGQ algorithms in general.

• The number of participants was limited and might therefore not represent a larger population.

• Since the survey consisted of questions about introductory programming, participants without knowledge in that area might have been misled by specific words or context.

• For every question in the survey the participant was given the choice to answer what they based their decision on. This was given as an open-ended question in order not to guide the participants in any direction. These answers were later categorized by the authors, into categories such as grammar errors, which may have led to misjudgements or misinterpretations.


5 Results

In this section the results of the survey are presented with graphs and charts. The first subsection presents the demographics of the participants. The second presents the results of the second part of the survey, asking which questions were believed to be generated by a computer. The third presents the results of which questions were thought to be written by a human. The final subsection presents general comments from participants regarding their choices.

5.1 Part I - Demographics

A total of 367 people answered the survey.

Figure 1: Age Distribution of Participants

As seen in figure 1, the majority of the participants were in their 20s.

Figure 2: Representation of previous programming experience

As seen in figure 2, about half of the participants had previous programming experience.


Figure 3: Distribution of years of programming experience

Figure 3 shows the number of years the participants with programming experience had. Almost 45 percent had less than one year of experience.

5.2 Part II - Which Question is Computer Generated

NOTE: In all the charts in this section red means human written question and blue means computer generated question.

In figure 4 (below) the count of the answers for questions one through four in the survey is displayed.

The first pair of bars represents question 1. As seen, almost 60 percent answered correctly. The most common comments on why the participants answered this way concerned formulation, e.g. ”Not flowing as well as the other question. Probably got some parameters but cant ”understand” that it can be said in other words.” or ”Seems to be a question posed to text where the answer is in the same format. Too complicated a question. Too many clauses.”. Other common reasons were word choice and content: ”First question is more specific”, ”The word ”executed”, seemed cold”².

For question 2 (the next pair of bars in figure 4) almost 65 percent answered correctly. The question tested general formulation. Here many commented on the use of the word actually, which many interpreted as a human word to add to a sentence since it does not add any information or extra meaning, e.g. ”The word actually makes it sound more casual, human.”. 100 participants pointed out this word as the reason for their answer. Other common factors were formulation and content, such as ”Other response is more informal” or ”It sounds less natural”.

² Some comments have been translated from Swedish by the authors of this thesis.


Figure 4: Results of questions 1 through 4, blue represents the computer gener- ated questions.

The result of question 3 is shown by the third pair of bars (as seen from the left) in figure 4. Question 3 had the purpose of testing how errors in formalities affected people's opinions, in this case a question starting with a lower-case letter. 66.5 percent answered correctly, as seen in figure 4. 66 participants based their answer on this formality, most claiming it to be a human error. Comments like ”I assume it's programmed to always begin with a capital letter.” were a common answer to why the participants answered the way they did. Many also listed content (51 participants) and formulation (66 participants) as their reason why, e.g. ”seems more awkwardly written”, ”sounds like a google translation” or ”formal language”.

Question 4 (last pair of bars in figure 4) tested an obvious error where the word create appeared twice in the computer generated question. As seen in figure 4, 63.2 percent answered correctly and 123 participants answered that their answer was due to the double occurrence of create. The second question also contains a grammar error, which 50 participants pointed out as their reason why. Whether this was attributed to a human error or a computer error varied: ”It should be ”have”, not ”has”, computer not likely to make grammar errors” as opposed to ”The question I believe to have been generated by a computer uses incorrect verb agreement (correct would be “keys have been” or “key has been”).”.


Figure 5: Results of question 5, blue represents the computer generated question.

Question 5³ had three options, where one was computer generated and two were human written⁴. 43.4 percent answered correctly, as seen in figure 5 above. The responses to why varied, mostly between spelling, content and formulation. Some examples of comments are ”the last question has typos, so im assuming a human wrote the question very quickly, the second one is too specific”, ”just seems like a very simple and clear question” and ”Seems like a request with a simple answer required to provide a result”.

Group                                Average correct answers
Total                                2.97
No programming experience            2.79
With programming experience          3.14
Less than one year of programming    3.05
One to two years of programming      3.11
Three or more years of programming   3.32

Table 1: Average correct answers for part II of the survey.

On average, people with programming experience answered more questions correctly, as presented in table 1. In total, participants answered 2.97 questions correctly, which is almost 60 percent. Other notable results are that with more years of programming experience, the average number of correct answers increases. The highest number of correct answers is therefore found in the group with three or more years of experience.

³ This question only received 256 responses.
⁴ The green and red options in figure 5.

5.3 Part III - Which Questions are Written by a Human?

Algorithm 1

Figure 6: Distribution of answers for algorithm 1. Red are human written and blue computer generated.

The computer generated questions for the first query were generated with the Heilman algorithm. There was a large spread in which questions were believed to be written by a human (as seen in figure 6). The red questions in the graph were written by a human.

No directions were given to the participants about how many of the ten questions were human written; on average a participant chose 4.97 questions as human written.

The one question that stood out was ”When is followed by a boolean condition?”, which was wrongly chosen as a human question by only 60 participants (second bar from the right in figure 6).

On average, participants got 5.69 of the alternatives correct (see table 2), which is almost 57 percent. People without programming experience scored 5.81 correct answers compared to people with programming knowledge, who scored 5.57 correct answers. The overall highest scoring group were participants with more than three years of experience, marking a total of 6.19 correct answers out of ten.

Group                                Average correct answers
Total                                5.69
No programming experience            5.81
With programming experience          5.57
Less than one year of programming    5.27
One to two years of programming      5.51
Three or more years of programming   6.19

Table 2: Average correct answers for algorithm 1, part III.

Algorithm 2

Figure 7: Distribution of answers for algorithm 2. Red are human written and blue computer generated.

The computer generated questions for this query are from the SQUASH algorithm. Figure 7 displays that there is a smaller spread in which questions were believed to be written by a human compared to figure 6. The red questions were written by a human.

As earlier, no indication of how many questions were human written was given. The average number of questions guessed as human written was slightly lower than for algorithm 1, with 4.72 per participant.

Group                                Average correct answers
Total                                5.09
No programming experience            4.69
With programming experience          5.49
Less than one year of programming    5.59
One to two years of programming      5.51
Three or more years of programming   5.33

Table 3: Average correct answers for algorithm 2, part III.

Table 3 shows how the number of correct answers depends on years of programming experience. In total, people had an average of 5.09 answers right, which is slightly over 50 percent correct. Participants with programming knowledge did score higher, with an average of 5.49 correct. In comparison, people with no prior knowledge scored 4.69 correct answers. The highest scoring group was people with less than one year of experience, with 5.59 correct answers.

5.4 General Comments

The last why question had a total of 200 answers. This question aimed to get general comments on what had made the participants answer the way they did throughout the survey. The most common factor mentioned, with 88 participants mentioning it, was grammar. Popular comments were ”mainly grammar that isn't natural-sounding” and ”Poor grammar or nonsensical question makes me think it was computer generated.”. Out of the 88 participants, 24 specified that grammar mistakes were supposedly made by a computer and 19 that they were made by a human.

87 participants mentioned that sentence structure was a factor for them. Examples of comments on this were ”A sentence built in the wrong order directly made me think it was computer generated”, ”Mainly sentence structure, i think the more formal questions were computer generated” or ”clunky, weird wording, sentence structure with an hindered flow. Something about some sentences seems ”off”.”. Some gave more in-depth comments, such as ”Some of the sentences sound overly formal, and don't really resemble how we talk (or at least they don't resemble how native English speakers talk). A good example is: ”A CPU understands instructions that are written only in what language?” That sounds kind of stilted, and a more natural way to phrase it would be ”A CPU only understands instructions written in what language?” [...]”.


Word choice was another common factor according to 54 people. Some common comments included ”If the wording sounded more relaxed or casual than formal I thought it was a human.”, ”Grammar and word choice makes some sentences feel not quite right, while still technically making sense” or ”If it felt like ”spoken language” I thought it was a human”⁵.

Another common factor was spelling, with 39 participants mentioning this. Of these, 23 people claimed that spelling errors were a human error. Common comments include ”[...] Words spelled wrong (particular) makes me think human. [...]” or ”Spelling errors and when there is a simpler ways of writing the same thing to me indicates human written sentences.”.

In other cases, the length, complexity and substance of a question were mentioned as other contributing factors. Common comments included ”Shorter sentences with a straight forward structure makes me think it's a computer, longer with more complex word order feels human. Sentences that has the subject of the sentence wrong also feels like a computer.”, ”Jeopardy-style questions (as in the clues given to the contestants on Jeopardy) seemed computer generated” or ”Some of the sentences sound overly formal, and don't really resemble how we talk (or at least they don't resemble how native English speakers talk).”. Formal language was specifically mentioned as a contributing factor by many, who believed this to be the case for computer generated questions.

A selected few answered that they based all answers on random guesses, with comments like ”It's more of a feeling. I assume that computer generated questions need to have simple answers”, ”I was going by gut feeling mostly.” or ”i was a bit unsure since i did not have knlwldege of the specific subject that the questions related to. Hard to see if it was human or computer, but some of the questions seemed weird.”.

Some people even changed their perception throughout the survey, such as ”I think a computer writes as short and concise answers as possible. I now realize I changed my mindset mid survey. I first thought that the computer was the ”smarter” and has the most correct questions, but later I answered ”a human had corrected that question”. I thought the computer probably starts all questions with what or when”⁶ or ”At first I thought the bad ones were human made to throw us off but it might've not been the case. Sentenced that were too verbose or asked weird questions were red flags”.

⁵ Some comments have been translated to English by the authors.
⁶ The comment has been translated by the authors.


6 Discussion

In this section the results are discussed in order to answer the research question To what extent can people correctly distinguish between questions written by humans and questions automatically generated by computers? And what main differences can be identified? The discussion is divided into five parts: one per section of the results, one comparing with related work and one on future research suggested by the authors.

6.1 Part II

For the first four questions in part II the results were similar. Around 60 percent answered correctly in all cases. When comparing a computer generated question to a human written one the majority of the participants could in fact tell the difference, or guess correctly.

The number of correct answers was similar for questions 1-4, in spite of the fact that some of these questions tested obvious errors. Hence the numbers alone cannot really tell whether a significant, simple error (such as grammar, misspelling etc.) makes people able to tell humans and computer programs apart. Even so, the comments left by the participants of the survey show that these errors in fact have been the main reason why people answered the way they did.

Comparing the different errors in the questions shows that question four's error contributed the most to people's answers to why. 123 participants, almost 34 percent, claimed that the double create was their reason why. Question two did not have an obvious error, but one alternative did contain the informal word actually, which 100 participants, around 27 percent, based their answer on. In contrast, the error in question three (no upper-case letter in the beginning of the question) was only given as a reason by 66 people, 18 percent. Worth mentioning is that the same number of people commented on the formulation as their reason why on this question. In the latter two, questions two and three, the error occurred in the human written alternative, whereas in question four the error occurred in the computer generated alternative. That the error in the computer generated alternative was a reason for more participants could mean that people found it easier to detect errors not considered human errors. It could also just be that the double create was a more obvious error than the other two.

The results from question five show that participants found it more difficult to distinguish which question was computer generated when faced with a third option. 43.4 percent of the participants answered correctly, which is significantly lower than in the first four questions. None of the alternatives for question five had any obvious, simple errors, just as in question one. Question one still had a higher correctness rate, closer to questions 2, 3 and 4. Note also that the computer generated alternative is the same in questions one and five, which makes the large difference even more interesting. This means that participants who answered correctly on question one answered incorrectly on question five. This could either be because the participants changed their perception of computer vs human while taking the survey, or because the participants could not detect any specific patterns and instead based their answers on comparisons between the different alternatives, resulting in different answers depending on what other alternatives were given. The comments on why for question five were similar to those for the other questions, with formulation and content being common reasons.

In part II, participants with previous programming experience performed slightly better than the participants without any programming experience. The results also show that the number of correct answers increased with years of programming experience. The difference was still very slim and does not support any bigger conclusions. The better scores could very well just be due to the fact that these participants had a better understanding of the questions' meaning and therefore were not distracted by the context.

6.2 Part III

The results from the first multiple choice question show the spread of which questions were believed to be written by a human. The number of votes for each option, both for the human and the computer written questions, was quite similar, with only one question standing out. ”When is followed by a boolean condition?” received a significantly lower number of votes. This is likely due to the obvious error in the question, the wrongful use of when instead of what.

Out of the top five voted options, two were the computer generated questions ”What are abstract classes?” and ”What is stored in a byte?”, which received 222 and 198 votes respectively. The reason for them being among the top could be that they did not have any obvious errors, neither spelling nor grammatical. Their structure and word choices were relatively simple and straightforward, an attribute that many stated in the why-questions to be typical of human written questions. The two options were also closely related to the human generated questions ”What is the assembly language?” and ”What is a boolean?”, which both received about the same number of votes.

The second algorithm's results were more evenly distributed compared to the first algorithm's, with the questions ”What is the purpose of this chapter?” and ”What can computers do?” being among the five highest voted questions despite being computer generated. According to the why-questions, people believed that questions requiring more complex and nuanced answers were written by a human. Both of the above questions could therefore be categorised into this group and therefore received more votes.

The two questions with the fewest votes, ”What kind of datatype have been most common?” and ”What does software control?”, were written by a human and a computer respectively. The first question had an obvious grammar error, but both also had non-specific and unclear answers. According to the results, both of these factors were mainly connected to computer generated questions.

Since no direction was given on how many of the alternatives in each query were human written, it is interesting that the average number of questions guessed as human written is close to five for both. For algorithm two the average was slightly lower, at 4.72, which could be due to these questions having an overall higher standard.

As for programming experience, part III showed no major difference to part II, although for the second algorithm programming knowledge was a bigger contributing factor in the number of correct answers. For that algorithm the groups differed by 0.8 correct answers. This could be due to the fact that the questions did not contain as many obvious errors as in previous parts, and more knowledge about the general subject would be required to understand the questions.

In part III the questions were phrased in the opposite way to those in part II. Part III asked which questions were human written, whereas part II asked which alternatives were computer generated. This difference was mostly introduced to not make the survey too repetitive and thereby make participants lose interest. This difference could of course also have an effect on people's answers, for example if participants found it easier to detect human written questions or vice versa. Since this is not the main difference between part II and part III, no such conclusions can be made from this study.

6.3 General comments

The general comments reflected the results and comments from parts II and III well. Grammar, formulation/sentence structure and obvious errors such as spelling or formality were repeated in the general comments. Therefore this does not lead to any new discussion, but it does emphasize the arguments already made in sections 6.1 and 6.2.

That some people answered that their answers were based on guesses and intuition shows, to some extent, that there is no certain way of distinguishing who or what wrote a specific question. A question with a certain feel to it, but that makes sense and does not have any errors, would in practice probably still be usable.

The last, general comment question in the survey did supply examples of possible factors in the question text. Grammar, spelling and sentence structure were among these examples, which could be a reason why they were common answers. Since these factors were common answers in parts II and III of the survey as well, there is no reason to think that this was the only reason people mentioned them.


6.4 Comparison with Related Work

The study by Heilman[6] mentioned in section 2.3 showed that teachers, if presented with the option, chose to use generated questions as a helping tool when asked to write questions about a given text. This survey's results go against this, since they show that people can identify computer generated questions the majority of the time. However, this does not necessarily imply that the generated questions have a low quality and cannot be used for educational purposes. In the general comments from this survey, the formulation of the questions was a common factor in why a question was considered computer generated. Many participants commented on stiff formulation, something that does not necessarily make a question bad or unusable. In general, even if a question can be identified as computer generated, it does not mean that the question cannot be of high quality. Therefore it is difficult to compare the results from this survey with Heilman's results.

As discussed in section 6.2, the results from part III of the survey showed that the participants found it slightly harder to distinguish between human written and computer generated questions for the SQUASH algorithm than for Heilman's algorithm. This suggests a tendency for newer algorithms to generate questions of higher quality. The same tendency can be seen in section 2.3, where the algorithm created by Chali and Golestanirad scored slightly better than Heilman's algorithm.

Krishna and Iyyer claimed that the questions from the SQUASH algorithm were of good quality in 85.8 percent of cases[10]. One could argue that this goes against the results of this survey, since the overall result was that people could detect whether a question was computer generated. However, the results from this survey also show that the SQUASH algorithm did do better than Heilman's algorithm, which could be argued to further support the result from Krishna and Iyyer's article.

6.5 Future Research

A new branch within this field could be to automate this kind of study by designing an algorithm to tell human written and computer generated questions apart. A system like this could serve both as a tool to test algorithms under development and as a way to test existing algorithms to see how good they are and what needs improvement.

This survey used texts on introductory programming, which half of the participants had no further knowledge of. Future work could therefore be to conduct a more general survey with topics that more people are familiar with, or that do not require as much context in order to understand the generated questions, to see if this would give a different result.

If doing a similar survey, an interesting approach could be to include more questions similar to question five in this study, the question where the participants were supposed to choose one computer written question among three alternatives instead of two. By increasing the number of alternatives, participants are forced to make a more thoughtful decision, compared to simply comparing one question to another. Small errors or different formulations would then not be as critical, since even if one alternative is obvious the participants will still have to decide between the remaining two.

After publishing the survey we received feedback from participants who wanted to receive the correct answers after completion. This was not intended from the beginning but was highly appreciated once added. A simpler way to do it could have been to create a quiz instead, where the participants would get the answers directly.


7 Conclusion

On average, this study showed that people could distinguish between human written and computer generated questions about 50 percent of the time. The most distinguishing factors were obvious errors, sentence structure and formulation.

There was no significant difference in people's results between comparing a human written question to a computer generated one and choosing human written questions freely among a set of questions. Neither was there a clear indication that previous knowledge of programming improved participants' results in this survey.

The survey showed a tendency for the newer of the two algorithms used to be slightly harder to distinguish from human written questions. This illustrates that the research in this area is making progress.

Overall, this study showed that the algorithms used can generate questions of high quality, but that the algorithms need to improve at sorting out errors and low quality questions in order to be of more use in practice.


References

[1] Andrea Andrenucci and Eriks Sneiders. “Automated question answering: Review of the main approaches”. In: Third International Conference on Information Technology and Applications (ICITA'05). Vol. 1. IEEE. 2005, pp. 514–519.
[2] David J Barnes and Michael Kölling. Objects First With Java - A Practical Introduction Using BlueJ. 2004.
[3] Yllias Chali and Sina Golestanirad. “Ranking automatically generated questions using common human queries”. In: Proceedings of the 9th International Natural Language Generation conference. 2016, pp. 217–221.
[4] Eunsol Choi et al. “Quac: Question answering in context”. In: arXiv preprint arXiv:1808.07036 (2018).
[5] Massimiliano Ciaramita and Yasemin Altun. “Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger”. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. 2006, pp. 594–602.
[6] Michael Heilman. “Automatic factual question generation from text”. In: Language Technologies Institute School of Computer Science Carnegie Mellon University 195 (2011).
[7] “Introduction to Computers and Programming”. In: (2008).
[8] Yoon Kim. “Convolutional neural networks for sentence classification”. In: arXiv preprint arXiv:1408.5882 (2014).
[9] Dan Klein and Christopher D Manning. “Fast exact inference with a factored model for natural language parsing”. In: Advances in neural information processing systems. 2003, pp. 3–10.
[10] Kalpesh Krishna and Mohit Iyyer. “Generating Question-Answer Hierarchies”. In: arXiv preprint arXiv:1906.02622 (2019).
[11] Roger Levy and Galen Andrew. “Tregex and Tsurgeon: tools for querying and manipulating tree data structures”. In: LREC. Citeseer. 2006, pp. 2231–2234.
[12] Elizabeth D Liddy. “Natural language processing”. In: (2001).
[13] Karen Mazidi and Rodney Nielsen. “Linguistic considerations in automatic question generation”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2014, pp. 321–326.
[14] George A Miller et al. “Introduction to WordNet: An on-line lexical database”. In: International journal of lexicography 3.4 (1990), pp. 235–244.
[15] Prakash M Nadkarni, Lucila Ohno-Machado, and Wendy W Chapman. “Natural language processing: an introduction”. In: Journal of the American Medical Informatics Association 18.5 (2011), pp. 544–551.
[16] Pranav Rajpurkar et al. “Squad: 100,000+ questions for machine comprehension of text”. In: arXiv preprint arXiv:1606.05250 (2016).
[17] Siva Reddy, Danqi Chen, and Christopher D Manning. “Coqa: A conversational question answering challenge”. In: Transactions of the Association for Computational Linguistics 7 (2019), pp. 249–266.

List of Figures

1 Age Distribution of Participants
2 Representation of previous programming experience
3 Distribution of years of programming experience
4 Results of questions 1 through 4, blue represents the computer generated questions.
5 Results of question 5, blue represents the computer generated question.
6 Distribution of answers for algorithm 1. Red are human written and blue computer generated.
7 Distribution of answers for algorithm 2. Red are human written and blue computer generated.

List of Tables

1 Average correct answers for part II of the survey.
2 Average correct answers for algorithm 1, part III.
3 Average correct answers for algorithm 2, part III.


8 Appendix

8.1 Appendix A - Survey


TRITA-EECS-EX-2020:360
