Spell checker for a Java application
Stavningskontroll till en Java-applikation (Swedish title)
Arvid Viktorsson, Illya Kyrychenko
Faculty of Health, Science and Technology, Computer Science
C-level thesis
Many text-editor users depend on spellcheckers to correct their typographical errors. The absence of a spellchecker can create a negative experience for the user; in today's advanced technological environment, spellchecking is an expected feature. 2Consiliate Business Solutions owns a Java application with a text editor that does not have a spellchecker. This project investigates and implements available techniques and algorithms for spellchecking and automated word correction. During implementation, the techniques were tested for their performance and the best solutions were chosen for this project. All the techniques were gathered from earlier literature on the topic and implemented in Java using standard Java libraries. Analysis of the results shows that it is possible to create a complete spellchecker by combining available techniques, and that the quality of a spellchecker largely depends on a well-defined dictionary.
We would like to thank Per-Erik Svensson, who has been our supervisor at 2consiliate, for all the help and tips we received during this project. Finally, we want to thank 2consiliate as a whole for allowing us to conduct our thesis under their wing.
Contents
1 Introduction 1
1.1 Project goal and motivation . . . 1
1.2 Expected results . . . 1
1.3 Results . . . 3
1.4 Overview . . . 5
2 Background 7
2.1 Spell-checking in software . . . 7
2.1.1 High-level problems . . . 8
2.1.2 Low-level problems . . . 8
2.2 Related Work . . . 9
2.2.1 Techniques for nonword error detection . . . 9
2.2.2 Techniques for isolated-word error correction . . . 11
2.2.3 Context-dependent word errors . . . 13
3.3.1 Generating word suggestions: hash table . . . 24
3.3.2 Generating word suggestions: trie . . . 28
3.3.3 Comparing the hash table and the trie . . . 28
3.3.4 Improving the Trie . . . 29
3.4 Ranking words . . . 30
3.5 Summary . . . 33
4 Results and evaluation 35
4.1 Results . . . 35
4.2 Evaluation . . . 36
4.2.1 Dictionary . . . 36
4.2.2 Hash vs Trie . . . 37
4.2.3 Nonword error detection . . . 37
4.2.4 Isolated-word error correction . . . 38
4.3 Problems . . . 40
4.3.1 Parsing . . . 40
4.3.2 Limitations . . . 40
4.4 Summary . . . 41
5 Conclusion 43
5.1 Project Evaluation . . . 43
5.2 Future Work . . . 43
5.3 Concluding remarks . . . 45
A Table of characters 49
B Review text 50
C Word suggestion tables 52
D Source Code 57
E UML Diagram 67
List of Figures
1.1 Screenshot showcasing 2c8 Modelling Tool's text editor without spellchecker 2
1.2 Screenshot showcasing 2c8 Modelling Tool's text editor with spellchecker 4
3.1 Trie of the words ROLL and ROUND . . . 21
3.2 Dictionary lookup performance comparison between Trie and hashSet 24
3.3 Step by step illustration of the Damerau-Levenshtein algorithm for the words he and hi . . . 27
3.4 Word suggestions performance comparison with fail rate of 10% between Trie and hashSet . . . 29
3.5 Trie of the words DEPENDABLE, DEPENDENT and DEPENDING 29
E.1 UML Diagram . . . 68
Listings
3.1 Insert method for Trie . . . 22
3.2 Search method for Trie . . . 23
3.3 Java implementation of Damerau-Levenshtein distance algorithm . . . 26
D.1 Trie node class. . . 57
D.2 Soundex encoding class. . . 57
D.3 QWERTY Keyboard Map. . . 60
D.4 Levenshtein distance for Trie. . . 62
Chapter 1
Introduction
1.1 Project goal and motivation
The goal of the project is to develop a spellchecking module for an existing Java application called 2c8 Modelling Tool, shown in Figure 1.1. The Java application is a productivity modeling tool in which a user can model process flows and events for certain activities. The models are represented with graphs, where each graph can have various nodes and edges. The nodes have a description feature where the user can enter custom text that describes what the node represents in the model. At the moment, the program does not have any spellchecking in the description tool. Both the owner of the tool and its users have expressed a wish for a spellchecker. In addition, we thought the spellchecking problem was interesting, since nearly all widely used software programs have some form of spellchecking and auto-correction, and it is something often taken for granted by most users.
1.2 Expected results
errors, with the help of the algorithms and methods presented in Chapter 2. Mainly, we strive to implement the Damerau-Levenshtein distance algorithm as the base for the spellchecker and three supporting algorithms to rank the words it generates. These three algorithms are Soundex, keyboard distance and probability. The algorithms mentioned above, and others, are discussed in Section 2.2. Other algorithms will be considered if time allows. We will strive to mimic the results of the most widely used spellchecking programs and services, namely Google's spellchecking in Google Docs and SAOL. We intend to create a Swedish dictionary and an English dictionary for the spellchecker. We also plan to include functionality where users can add custom words to their own personalized dictionary.
1.3 Results
We have succeeded with the main objective of this project, a functioning spellchecker. We have created a Swedish and an English dictionary, but the English dictionary is incomplete at the time of writing. The dictionary is stored in a tree data structure called a Trie, presented in Section 2.2.1. To check whether a word is correctly spelled, a simple tree search is done on the Trie; this is described in greater detail in Section 3.2.2. For incorrectly spelled words, a list of suggested words is extracted from the Trie by traversing it and comparing dictionary words with the misspelled word using the Damerau-Levenshtein distance algorithm, presented in Section 2.2.2. Soundex, keyboard distance and probability ended up being the only methods we used to sort the list of suggested words into the most relevant order. The suggested words generated by the spellchecker are presented in Appendix C. The final result inside 2c8 Modelling Tool can be seen in Figure 1.2. The integration of the spellchecking module with the main program was done in collaboration with our supervisor from 2c8.
and will be implemented in the future.
1.4 Overview
Chapter 2
Background
This chapter defines the main problems in spellchecking and introduces relevant terms. Section 2.1 introduces high-level and low-level problems. Section 2.2 discusses some of the solutions to the previously mentioned problems. Section 2.3 talks about the importance of having a user defined dictionary. Finally, Section 2.4 defines the objectives.
2.1 Spell-checking in software
Software text editors have been around for a long time. Ever since modern computers were invented, and brought to a wider audience, computer users have required the ability to insert text into computers to either program them, compose emails, make notes or write papers.
2.1.1 High-level problems
The article Techniques for automatically correcting words in text [1] defines three fundamental problems in spell-checking: nonword error detection, isolated-word error correction and context-dependent word correction. Given two strings of characters a and b, {a → b} denotes a as the intended spelling and b as the actual spelling. Nonword errors are spelling errors of the form {taxes → taqes, dictionary → ditcionary, rebell → rebll}. The nonword problem is about detecting that a spelling error has occurred. Isolated-word error correction is a technique which, in addition to detecting a misspelled word, suggests correctly spelled words. Lastly, context-dependent word correction detects misspelled words of the form {taxes → takes, their → they're, car → bar}, where takes, they're and bar are correctly spelled words but not in the context in which they are used.
2.1.2 Low-level problems
According to the article A technique for computer detection and correction of spelling errors, written by Damerau, F. J. in 1964 [2], a wrong letter, a missing letter, an extra letter or a single transposition between letters accounts for the majority of spelling errors. To correct these errors, four operations are defined: substitution, insertion, deletion and transposition.
• Substitution changes one of the characters in a string of characters, keeping the original length. For example, correcting the misspelled word {banana → babana} requires one edit. More specifically, it would take one substitution, of b with n, to transform babana into banana.
• Insertion and deletion, as the names imply, transform strings by either inserting or deleting characters. The string hel is one insertion edit away from help and one deletion edit away from he.
• Transposition flips two adjacent characters, effectively changing two characters in one operation, without adding or removing any characters. Steret becomes Street after one transposition of the characters e and r.
These operations form a metric called edit-distance that was introduced by Wagner, R. A. and Fischer, M. J. in an article named The string-to-string correction problem, written in 1974 [3]. Each of the edit operations is defined as one edit-distance. The minimum edit-distance is defined as the minimum number of operations required to transform a word a into a word b.
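As a concrete illustration of the metric, the operations above can be sketched as a naive recursive computation (our sketch, covering substitution, insertion and deletion only; the efficient dynamic-programming version, including transposition, appears in Chapter 3):

```java
public class EditDistance {
    // Naive recursive minimum edit distance (substitution, insertion, deletion).
    // Exponential in the worst case; suitable only as an illustration of the metric.
    static int distance(String a, String b) {
        if (a.isEmpty()) return b.length();
        if (b.isEmpty()) return a.length();
        int cost = a.charAt(0) == b.charAt(0) ? 0 : 1;
        return Math.min(
                distance(a.substring(1), b.substring(1)) + cost, // substitute / match
                Math.min(distance(a.substring(1), b) + 1,        // delete from a
                         distance(a, b.substring(1)) + 1));      // insert into a
    }

    public static void main(String[] args) {
        System.out.println(distance("hel", "help"));     // 1: one insertion
        System.out.println(distance("hel", "he"));       // 1: one deletion
        System.out.println(distance("banana", "babana")); // 1: one substitution
    }
}
```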
2.2 Related Work
Spell-checking is one of the oldest problems in computer science. An article named The First Three Spelling Checkers [4] sheds some light on the development of the first spell checkers in the late 1950s at the Massachusetts Institute of Technology. Many techniques and solutions have been developed over the years. This section goes over some of the better known ones, many of which are discussed by Kukich [1].
2.2.1 Techniques for nonword error detection
Nonword problems only require detecting whether a word is correctly spelled or not. To accomplish this, some form of static dictionary, a set of possible words or a collection of letter patterns, is required. The main techniques are divided into three groups: dictionary lookup, n-grams and tree structures [1].
Dictionary lookup
N-Grams
N-grams are used for both nonword and context-dependent problems. A gram can be either a character or a character string. A 1-gram (unigram) consists of a single gram, a 2-gram (bigram) chains two grams together, a trigram consists of three chained grams, and so on. For nonword problems, binary n-grams are used. In binary n-grams each gram is a binary array, a string of 0s and 1s. A binary bigram forms a two-dimensional array and a binary trigram a three-dimensional array. Rows and columns represent valid letter combinations in existing dictionary words.
To construct binary bigrams for an English dictionary, the two-dimensional array would have to be of size 26x26, accounting for all alphabetical letters. All bits of the array are set to 0 except at positions i and j, where i and j represent letters in the alphabet; a set bit implies that there is a word in the dictionary where the alphabet letter at i is followed by the alphabet letter at j. In other words, binary bigrams collect all possible two-letter combinations that exist in the dictionary, trigrams collect all possible combinations of three letters, and so on.
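The construction above can be sketched as follows (a minimal illustration with a hypothetical two-word dictionary; note that a bigram table can only detect letter pairs that never occur, so some nonwords still pass the check):

```java
public class BinaryBigrams {
    // 26x26 table: table[i][j] is true if letter i is ever followed by letter j
    // in some dictionary word.
    static boolean[][] table = new boolean[26][26];

    static void train(String[] dictionary) {
        for (String word : dictionary) {
            String w = word.toLowerCase();
            for (int i = 0; i + 1 < w.length(); i++) {
                table[w.charAt(i) - 'a'][w.charAt(i + 1) - 'a'] = true;
            }
        }
    }

    // A word fails the check if any adjacent letter pair never occurs
    // in the dictionary.
    static boolean plausible(String word) {
        String w = word.toLowerCase();
        for (int i = 0; i + 1 < w.length(); i++) {
            if (!table[w.charAt(i) - 'a'][w.charAt(i + 1) - 'a']) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        train(new String[] {"hello", "world"});
        System.out.println(plausible("hell"));  // true
        System.out.println(plausible("hqllo")); // false: "hq" never occurs
    }
}
```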
Trie (prefix tree)
The Trie is a tree-type data structure that was introduced in the article File searching using variable length keys, written by René de la Briandais in 1959 [6]. It was later named Trie, a name derived from the word retrieval. It is sometimes referred to as a prefix tree to distinguish it from other tree data structures.
The Trie, as described in the article The Adaptive Spelling Error Checking Algorithm based on Trie Tree [7], can be used for both nonword and isolated-word problems. In the
2.2.2 Techniques for isolated-word error correction
Suggested spelling, when an incorrect word is given, is one of the most important features in modern spellchecking. Suggesting relevant words is also one of the main sub-problems in isolated-word error correction. For example, in the typo {forest → forets} both forest and forgets are just one edit distance away from forets, and both are valid and sensible suggestions. Several algorithms and techniques have been developed over the years to rank suggestions with similar edit distances in the most relevant order possible.
Many techniques have been suggested to help solve this problem, everything from edit distance to advanced machine learning. This subsection discusses some of the most relevant techniques which fit into the scope of this project.
Minimum edit-distance
As mentioned earlier, the minimum edit-distance is a metric which measures the number of edits needed to transform one string of characters into another. An algorithm which calculates this metric was introduced by Damerau in 1964. Two years later, Levenshtein came out with a similar algorithm for the same metric [1]. Both of their algorithms have been combined into an algorithm known as the Damerau-Levenshtein distance. This algorithm covers all four low-level spelling errors: substitution, insertion, deletion and transposition. Using minimum edit-distance alone usually yields too many irrelevant suggestions. For example, the misspelled word forets has over 300 words within edit-distance one or two. Minimum edit-distance therefore has to be used in combination with other techniques in order to filter and rank suggested words.
Soundex
Soundex, patented by Odell and Russell in 1918 [8], is a way to group similarly sounding letters and encode words based on Soundex groups. For example, the words rope and robe both encode to R010, while rome encodes to R050. The former two words encode to the same value because p and b belong to the same Soundex group. This makes Soundex useful for error types where the user misspells words by writing them the way they sound or by confusing similar sounding letters. The Soundex encoding technique can be used to rank up suggested words where the misspelled word and a correctly spelled word have the same encoding value.
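An encoding consistent with the examples above can be sketched as follows (our reconstruction: the grouping is taken from standard Soundex, but vowels and h, w, y are kept as the digit 0 rather than dropped, which is what the codes R010 and R050 imply; the thesis's actual encoder is in Appendix D.2):

```java
public class SoundexVariant {
    // Group digit for each letter; vowels and h, w, y map to '0' in this
    // variant, which keeps them in the code (standard Soundex drops them).
    static char digit(char c) {
        switch (c) {
            case 'b': case 'f': case 'p': case 'v': return '1';
            case 'c': case 'g': case 'j': case 'k': case 'q':
            case 's': case 'x': case 'z': return '2';
            case 'd': case 't': return '3';
            case 'l': return '4';
            case 'm': case 'n': return '5';
            case 'r': return '6';
            default: return '0'; // vowels, h, w, y
        }
    }

    // Keep the first letter, encode the rest as group digits.
    static String encode(String word) {
        String w = word.toLowerCase();
        StringBuilder sb = new StringBuilder();
        sb.append(Character.toUpperCase(w.charAt(0)));
        for (int i = 1; i < w.length(); i++) {
            sb.append(digit(w.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("rope")); // R010
        System.out.println(encode("robe")); // R010: same group as rope
        System.out.println(encode("rome")); // R050
    }
}
```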
Keyboard related suggestions
Some spelling errors arise due to clumsiness of the typist. Most of these errors occur by pressing a wrong, neighboring key on the keyboard. For example, on a QWERTY keyboard layout, the key H has the following neighbors: Y, U, J, N, B, G and T. If the misspelling {home → jome} occurs, two suggested correctly spelled words might be home and dome. By analyzing the keyboard layout, home would score higher than dome because j is a neighbor of h.
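The neighbor test above can be sketched with a small lookup map (only the keys needed for this example are listed here; the full QWERTY map used in the thesis is in Appendix D.3):

```java
import java.util.Map;

public class KeyboardDistance {
    // Partial QWERTY neighbor map: for each key, the keys physically adjacent
    // to it. Illustrative subset only.
    static Map<Character, String> neighbors = Map.of(
            'h', "yujnbgt",
            'j', "uikmnh",
            'd', "erfcxs");

    static boolean areNeighbors(char typed, char intended) {
        return neighbors.getOrDefault(typed, "").indexOf(intended) >= 0;
    }

    public static void main(String[] args) {
        // For the typo {home -> jome}, "home" beats "dome" because the typed
        // letter j is a physical neighbor of h but not of d.
        System.out.println(areNeighbors('j', 'h')); // true
        System.out.println(areNeighbors('j', 'd')); // false
    }
}
```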
Bayes’Theorem
Bayes' theorem [9] states that

P(A|B) = P(B|A) P(A) / P(B)    (2.1)

where A and B are events and P(B) ≠ 0. P(A|B) denotes the probability of event A occurring given that event B is true, and P(A) is the probability that A occurs.
Bayes' theorem can be applied to solving isolated-word problems [1]. In the context of isolated-word problems, A and B are words; P(A|B) is the probability that A is the correctly spelled word given a misspelled word B; P(B|A) is the probability that the intended word A would be typed as B; and P(A) is the probability that word A appears in any valid text.
P(B) is irrelevant because it is the same for every A. The formula can therefore be rewritten as P(A|B) ∝ P(B|A)P(A), and the best suggestion is the candidate word A that maximizes this product for the given misspelling. There are three main problems with constructing a good probability model for spell-checking: P(B|A) needs to be modeled appropriately, relevant candidate words have to be picked, and P(A) has to come from a reliable source of data. The first two problems are usually solved with edit-distances, while the third problem does not have any straightforward solution.
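The ranking idea can be sketched as follows (a toy model with made-up corpus counts and a constant error likelihood, not real data; a real error model would depend on the specific edit):

```java
import java.util.Map;

public class NoisyChannelRank {
    // Hypothetical counts: how often each candidate A appears in some
    // reference corpus. Illustrative values only.
    static Map<String, Integer> corpusCount = Map.of("forest", 120, "forgets", 15);

    // P(A): relative frequency of A in the corpus.
    static double prior(String a) {
        int total = corpusCount.values().stream().mapToInt(Integer::intValue).sum();
        return (double) corpusCount.get(a) / total;
    }

    // P(B|A): a crude error model. Here it is a constant per single-edit typo,
    // so ranking reduces to comparing priors.
    static double likelihood = 0.01;

    // Proportional to P(A|B); P(B) cancels out when comparing candidates.
    static double score(String a) {
        return likelihood * prior(a);
    }

    public static void main(String[] args) {
        // For the misspelling "forets", forest outranks forgets on prior alone.
        System.out.println(score("forest") > score("forgets")); // true
    }
}
```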
2.2.3 Context-dependent word errors
Regardless of the development of isolated-word error correction, there will always remain a residual class of errors that isolated-word detection techniques cannot handle. It is the class of real-word errors, where one correct word is exchanged for another. Some of these errors are the result of simple typos {from → form, form → farm} or cognitive or phonetic lapses {there → their, ingenious → ingenuous}. Some are grammatical or syntactic mistakes where the writer uses the wrong inflected form {walked → walks, was → were} or the wrong function word {her → his, of → for}. There can also be semantic anomalies {in five minuets, lave a message}. In addition, some errors occur due to insertions or deletions of whole words {the program crashes due to some minor crashes errors, this will cost extra for the} or improper spacing, including both run-ons and splits {your self → yourself, myself → my self}. These types of errors require information from the context for both correction and detection. To handle context-dependent word errors there must exist a full-blown natural-language-processing (NLP) tool with capabilities such as robust natural language parsing, semantic understanding, pragmatic modeling and discourse structure modeling.
According to Kukich [1], 40% of misspellings were real-word errors in a 1987 analysis of 925 handwritten student essays. Kukich's study of a 40,000-word corpus of typed textual conversations dramatized the need for context-dependent spelling correction. For example, the word book can be both a verb and a noun, see below.
One way to handle real-word errors is to view them as violations of natural language processing constraints and use NLP tools to detect and correct them. Researchers in NLP identify at least five different levels of processing constraints:
• A lexical level
– Nonword errors violate constraints on the formation of valid words; therefore they are classified as lexical errors.
• A syntactical level
– Errors where there is a lack of subject-verb number agreement would be defined as syntactic errors. These errors violate syntactic constraints.
• A semantic level
– Errors that do not violate syntactic constraints but result in semantic deviations, e.g. see you in ten minuets, are semantic errors.
• A discourse structure level
– Spelling errors that do not follow the inherent coherence of a text. For example, enumeration violations should be considered discourse structure errors, e.g. Arvid's car has four wheels: two in the front and one at the back.
• A pragmatic level
– Spelling errors that reflect deviations related to the discourse participants' goals and plans would be classified as pragmatic errors, e.g. Has Patrik washed his carpet today? (where car was intended) [1].
N-gram solution for context-dependent word errors
Have I met your friend?
Have I met youre friend?
Assuming the N-gram model contains the first combination of words above, the model would not recognize the second alternative as a valid combination of words and would suggest the first one instead.
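The idea can be sketched with word bigrams (a minimal sketch trained on the single example sentence above; a real model would be trained on a large corpus):

```java
import java.util.HashSet;
import java.util.Set;

public class WordBigrams {
    // Set of word pairs observed in the training text.
    static Set<String> bigrams = new HashSet<>();

    static void train(String sentence) {
        String[] w = sentence.toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < w.length; i++) {
            bigrams.add(w[i] + " " + w[i + 1]);
        }
    }

    // An unseen pair is a hint that one of the two words may be a
    // real-word error in this context.
    static boolean seen(String first, String second) {
        return bigrams.contains(first.toLowerCase() + " " + second.toLowerCase());
    }

    public static void main(String[] args) {
        train("Have I met your friend");
        System.out.println(seen("your", "friend"));  // true
        System.out.println(seen("youre", "friend")); // false: flag as suspect
    }
}
```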
2.3 Personal dictionary
In every industry and company there are words that are frequently used, widely recognized and well defined, but not included in national or other dictionaries. It therefore becomes frustrating for the author when these words are always flagged as misspelled by the program. An extension to the generated dictionary will be created for these industry-specific words that are not included in the existing dictionary. These words are added either automatically, when the exact same spelling has been used several times, or manually by the user.
2.4 Objectives
The main objective of this project is to implement solutions for nonword error detection, isolated-word error correction and, for simpler grammatical errors, context-dependent word error correction, in a module for a Java application.
For nonword problems, both the hash set and the Trie offer nearly identical performance in speed, as shown in a study by Xu and Wang [7]. To check whether a word is correctly spelled, the hash set has to produce a hash address for the given word, while the Trie has to visit n nodes, where n is the length of the word. The difference in execution time between these two operations is insignificant.
letting a spell-checking algorithm skip entire subtrees when the edit-distance gets too far from the entered word. On the downside, Trie data structures are very memory intensive. After implementing both the hash set solution and the Trie, we want to evaluate how the two solutions compare performance-wise and decide whether the speed the Trie offers outweighs the memory cost.
Isolated-word problems also include the problem of ranking different suggestions by deciding which ones are more relevant to a misspelled word. The three main solutions for this, not counting edit-distance since edit-distance needs to be implemented either way, are Soundex, keyboard layout and probabilistic error correction. It would be ideal to implement all of them in order to pick the best spelling suggestions, but using them all together might end up increasing the response time of the text editor.
2.5 Summary
Chapter 3
Implementation
This chapter describes which methods from the previous chapter we have used, the main steps we have gone through to build our spellchecker, and how we implemented the algorithms in Java. All the Java code was written in the Apache NetBeans (11.2) [11] IDE using standard Java libraries. We chose NetBeans because the 2c8 Modelling Tool is developed in NetBeans and the IDE is used by the company; we have not used any NetBeans-specific features, so any IDE would suffice for this project. For storing and generating words we picked two methods, the hash table dictionary lookup and the Trie. Both were implemented in parallel and then tested to determine which one is better suited for this project.
3.1 Generating a dictionary
A complete dictionary has to be obtained for a reliable spellchecker. Unfortunately, there are not many open source projects available for this, and those that are available are not tailored specifically for spellcheckers.
We have decided to use Wiktionary [12] as the source for our dictionary. Wiktionary covers many languages and its content is freely available to anyone. The downside of Wiktionary is that the raw source structure is unorganized, because many different people edit Wiktionary and not everyone follows the same conventions when inserting and editing words. Wiktionary can be downloaded as a raw XML dump of its entire content. We have used Java's XML SAXParser [13] to process the Wiktionary XML file. In Wiktionary, each word is given its own article. Using the SAXParser we parsed all article names for the desired languages to fill the dictionaries. However, not all article names are valid words: there are categories such as slang, intentional misspellings, many unnecessary abbreviations and so on. We had to tailor the parser so that it only extracted the words we wanted. This is where the unorganized nature of Wiktionary became problematic. In the XML code, each Wiktionary word has tags which describe and categorise the word. One Wiktionary editor may have used the tags {slang|en} for slang words, while another editor used {'slang', 'en'}. Simply filtering on the word slang was not an option either, because many valid words have slang senses associated with them. This resulted in a lot of manual and tedious checking of the dictionary to make sure undesired words did not get past the parsing filters.
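The basic extraction step can be sketched with Java's SAX API (a minimal sketch parsing a tiny inline XML fragment; the real dump is streamed from a file, and the tag-based filtering described above is omitted here):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TitleExtractor {
    // Collect the text of every <title> element, which in a MediaWiki dump
    // is the article (word) name.
    public static List<String> extractTitles(String xml) throws Exception {
        List<String> titles = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private boolean inTitle = false;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if (qName.equals("title")) { inTitle = true; text.setLength(0); }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) text.append(ch, start, length); // may arrive in chunks
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("title")) { inTitle = false; titles.add(text.toString()); }
            }
        });
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>roll</title></page>"
                   + "<page><title>round</title></page></mediawiki>";
        System.out.println(extractTitles(xml)); // [roll, round]
    }
}
```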
In addition, while parsing Wikipedia articles for our probability method (discussed in Section 3.4), we discovered that many words found on Wikipedia were not in Wiktionary. We have included many of those words in our dictionary as well.
3.2 Storing a dictionary
A spellchecker needs to have access to a set of correct words, a dictionary. Several methods have been developed for storing dictionaries in computer memory, including, but not limited to, hash tables, n-grams and tree data structures. We have picked the hash table and the Trie methods for this project, as they are the most researched and developed methods and fit into the scope of this project. We will evaluate the performance of the hash table and the Trie for nonword error detection and isolated-word error correction based on execution speed in Java.
3.2.1 Hash table
A hash table is an array data structure where each element value is associated with a key. A hash function, usually taking the key as one of its parameters, produces a hash code or hash address that determines where the element is stored. Depending on the hash function, hash code collisions may occur: a collision occurs when two different keys hash to the same hash code.
Implementing dictionary lookup: Hash table
Java offers an implementation of the hash table in its HashMap class, which stores entries as two arguments, a key and a value. We used Java's HashSet, which is backed by a HashMap internally but only requires a key, the element itself, as its argument.
A mathematical set is a collection that does not contain duplicate elements. For example, after insertion of the values 1, 1, 2, 3, 4, 4, a set S will have the following members: S = {1, 2, 3, 4}. Java's HashSet follows the same principle as a set but also hashes its members to hash codes for faster lookup. For two different values v1 and v2, the HashSet computes hash codes c1 and c2 with a hash function h() and stores the values in a table. Collisions arise when v1 and v2 get the same hash code.
Figure 3.1: Trie of the words ROLL and ROUND
3.2.2 Trie
A Trie is an ordered tree structure which stores strings of characters. A node's key is a character of some string which the tree stores, and the depth of the node is the position of that character within the string. All descendants of a node share the common prefix associated with that node, while the empty string is associated with the root. Branches of the Trie can thus share a common prefix of characters, and a string is formed by traversing down the Trie from the root node to a leaf node.
In Figure 3.1, ROLL and ROUND start with the same two characters, R and O. Therefore ROLL and ROUND share the prefix R-O.
Implementing a Trie
Each node features a hash set with a key k and a value v, where k ∈ α (see Table A.1 in Appendix A) represents a node value and v is a pointer to a sub-tree. For example, in the tree in Figure 3.1, node O has the key value O and pointers to the children nodes L and U. Each node can have up to |α| unique characters as possible children and therefore the same number of possible pointers. In addition, each node has a boolean value indicating whether the node is the last character of a valid word or not.
Listing 3.1: Insert method for Trie
private void insert(Trie root, String word) {
    Trie node = root;
    for (int i = 0; i < word.length(); i++) {
        char letter = word.charAt(i);

        if (!node.children.containsKey(letter)) {
            node.children.put(letter, new Node());
        }
        node = node.children.get(letter);
    }
    node.isWord = true;
}
Listing 3.1 demonstrates how a word is inserted into the tree. The for loop starts from the root node, travels down the tree and creates children nodes as needed. As an example, consider the two words roll and round from Figure 3.1 and an empty Trie. The insert method starts at the root node, which has the value null, and the letter r. Since the tree is empty, the root does not have any children nodes; the if statement evaluates to true and r is inserted into the root's children. The next line assigns r as the current node, and since r is a newly created node it has no children. The same process is repeated for the letters o, l and l. When the end of the word is reached, the for loop exits and isWord is set to true for the last node. When inserting the word round, insert starts at the root again. The if condition is false for the first letter r because r already exists as one of the children. Creation of a new node is skipped, the current node is set to r and the current letter to o. Node r has a child o, so no new node is needed. The current node is now o and the current letter is u. Node o does not have u as a child, thus u is added to the children, a new sub-tree is created and the nodes n and d are added.
Lookup travels down the Trie letter by letter until the end of the word is reached. The word is only valid if the last node's isWord is set to true. Listing 3.2 shows the search method that is used for Trie lookup.
Listing 3.2: Search method for Trie
private Boolean search(Trie root, String word) {
    if (root == null) {
        return false;
    }

    Trie node = root;

    for (int i = 0; i < word.length(); i++) {
        node = node.children.get(word.charAt(i));
        if (node == null) {
            return false;
        }
    }
    return node.isWord;
}
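Listings 3.1 and 3.2 assume a node class like the one in Appendix D.1. A self-contained sketch of the structure they rely on, usable for experimentation (field and method shapes are our assumptions where the appendix is not reproduced here; we use a single Trie class for both the tree and its nodes):

```java
import java.util.HashMap;
import java.util.Map;

public class TrieDemo {
    static class Trie {
        Map<Character, Trie> children = new HashMap<>();
        boolean isWord = false;
    }

    static void insert(Trie root, String word) {
        Trie node = root;
        for (int i = 0; i < word.length(); i++) {
            char letter = word.charAt(i);
            if (!node.children.containsKey(letter)) {
                node.children.put(letter, new Trie());
            }
            node = node.children.get(letter);
        }
        node.isWord = true; // mark the last character of a valid word
    }

    static boolean search(Trie root, String word) {
        Trie node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) return false;
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        Trie root = new Trie();
        insert(root, "roll");
        insert(root, "round");
        System.out.println(search(root, "roll")); // true
        System.out.println(search(root, "rol"));  // false: prefix only
    }
}
```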
3.2.3 Dictionary lookup performance comparison
As discussed previously, in theory a hash table is faster than a Trie for dictionary lookup. For our hash table implementation we use Java's hashSet, and to look up whether a word is in the set we use hashSet's hash function; for more details, see Java's hashSet documentation [14]. In comparison, the Trie's search method must travel down the Trie for each letter in the word, visiting each child node that corresponds to the letter. search returns true only if the end of the word is reached and the final node's boolean value is true. The hash function for hashSet has constant time complexity, but the Trie's search depends on the length of the word. Figure 3.2 shows the performance difference between the two.
Figure 3.2: Dictionary lookup performance comparison between Trie and hashSet
3.3 Isolated word corrections
It is crucial to find the correct spelling of a word that was misspelled. Several algorithms have been devised to provide spelling suggestions; each fulfills different tasks and covers different types of spelling errors. First, a candidate list of suggested words must be generated. For that we use the Damerau-Levenshtein algorithm. The Damerau-Levenshtein distance measures the difference between two strings x and y as the minimum number of character edits required to transform x into y. As described in Section 2.2.2, the algorithm covers all four of the low-level spelling errors.
3.3.1 Generating word suggestions: hash table
Design
In a hash table solution, a spellchecking algorithm must visit every word in the dictionary to calculate the edit distance. Given a misspelled word a and a dictionary set D, the algorithm compares every word di in D with a and calculates the minimum edit distance between them using the Damerau-Levenshtein distance algorithm. A word di can be considered a candidate for the correct spelling of a if the minimum edit distance between them is below a certain threshold.
First, any given word b is checked with a hash table lookup to determine whether it is misspelled, by checking if b ∈ D. Then, if b is misspelled, the character counts of di and b are compared. If the difference in character count exceeds the threshold, there is no need to run the Damerau-Levenshtein distance algorithm: if, for example, the first word is four characters shorter than the second, then the minimum edit distance is at least four, since four insertions or deletions are required just to bring the words to the same length. If the two words are of similar length, they are compared with the Damerau-Levenshtein distance algorithm to determine whether di is a suitable candidate.
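The candidate loop with the length-based shortcut might look like the following sketch. The class and method names are ours, and the compact distance function stands in for Listing 3.3:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CandidateFinder {

    // Collects dictionary words within maxDist edits of the misspelled word a.
    // The cheap length check runs first: if the lengths differ by more than
    // maxDist, at least that many insertions or deletions are needed, so the
    // expensive distance computation can be skipped.
    static List<String> candidates(String a, Set<String> dictionary, int maxDist) {
        List<String> result = new ArrayList<>();
        for (String d : dictionary) {
            if (Math.abs(a.length() - d.length()) > maxDist) continue;
            if (damerauLevenshtein(a, d) <= maxDist) result.add(d);
        }
        return result;
    }

    // Compact optimal-string-alignment form of the algorithm described in
    // Listing 3.3: counts insertions, deletions, substitutions and adjacent
    // transpositions.
    static int damerauLevenshtein(String x, String y) {
        int n = x.length(), m = y.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
                if (i > 1 && j > 1 && x.charAt(i - 1) == y.charAt(j - 2)
                        && x.charAt(i - 2) == y.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost); // transposition
                }
            }
        }
        return d[n][m];
    }
}
```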
Damerau-Levenshtein algorithm
Listing 3.3: Java implementation of Damerau-Levenshtein distance algorithm
We will use a short step by step example to show how the algorithm calculates the edit-distance. Consider two words he and hi, where the edit distance between them is one since one substitution edit is required to transform he into hi. Figure 3.3 demonstrates, step by step, how the distance is calculated.
Figure 3.3: Step by step illustration of the Damerau-Levenshtein algorithm for words he and hi
Prior to the first step, a two-dimensional array is built with the elements of the first row, i, and the first column, j, set to 0, 1, 2. Each step looks at three array elements relative to its current position: the element to the left, the element above, and the north-west element. The current position of the algorithm is marked with > <. One is added to the left and upper elements, and the cost is added to the north-west element; the smallest of the three results becomes the value of the current element. The cost is 0 if the letters of the two words match and 1 if they mismatch.
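The matrix walk of Figure 3.3 can be reproduced in code by keeping the whole array; this sketch follows the description above (Listing 3.3 itself is not reproduced here, so the names are ours):

```java
class EditMatrix {
    // Builds the full Damerau-Levenshtein (optimal string alignment) matrix,
    // so that every step of Figure 3.3 can be inspected. d[i][j] holds the
    // distance between the first i letters of x and the first j letters of y.
    static int[][] matrix(String x, String y) {
        int n = x.length(), m = y.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;   // first column: 0, 1, 2, ...
        for (int j = 0; j <= m; j++) d[0][j] = j;   // first row: 0, 1, 2, ...
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // from above: deletion
                        d[i][j - 1] + 1),         // from the left: insertion
                        d[i - 1][j - 1] + cost);  // north-west: match/substitution
                if (i > 1 && j > 1 && x.charAt(i - 1) == y.charAt(j - 2)
                        && x.charAt(i - 2) == y.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost); // transposition
                }
            }
        }
        return d;
    }
}
```

For he and hi this yields the rows 0 1 2 / 1 0 1 / 2 1 1, with the final distance of one in the bottom-right corner, exactly as in Figure 3.3.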
3.3.2 Generating word suggestions: trie
The Trie also uses the Damerau-Levenshtein distance algorithm. Unlike the hash table, the Trie does not need to go through every word in the dictionary. Instead, the Trie travels from node to node and remembers the current edit distance. If the edit distance passes the threshold, an entire Trie subtree is skipped, because the edit distance cannot decrease as a word gets longer, and another subtree is tested next. The edit distance is calculated in the same way as for the hash table. The biggest difference is that when the hash table compares a dictionary word to a given word, it knows the whole dictionary word, while the Trie starts at the root and then goes to the first letter, the second letter, and so on; the Trie does not know the whole word it is currently checking. This means that the Damerau-Levenshtein distance algorithm needs to be applied recursively so that all of a node's children are visited. The Trie implementation of the Damerau-Levenshtein distance algorithm is shown in Appendix D.4, in the Java method on line 83.
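The recursive node-by-node search can be sketched like this. For brevity the sketch computes plain Levenshtein rows (insertion, deletion, substitution), whereas the thesis's version in Appendix D.4 also counts transpositions; all names here are ours:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Suggestion search over a Trie: one edit-distance row is computed per node,
// and a whole subtree is skipped as soon as every value in the current row
// exceeds the threshold, since the distance can only grow with depth.
class SuggestTrie {
    final Map<Character, SuggestTrie> children = new HashMap<>();
    String word; // set on the node where a dictionary word ends

    void insert(String w) {
        SuggestTrie node = this;
        for (char c : w.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new SuggestTrie());
        node.word = w;
    }

    List<String> suggest(String target, int maxDist) {
        List<String> out = new ArrayList<>();
        int[] firstRow = new int[target.length() + 1];
        for (int j = 0; j <= target.length(); j++) firstRow[j] = j;
        for (Map.Entry<Character, SuggestTrie> e : children.entrySet())
            e.getValue().walk(e.getKey(), target, firstRow, maxDist, out);
        return out;
    }

    private void walk(char c, String target, int[] prevRow, int maxDist,
                      List<String> out) {
        int cols = target.length() + 1;
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1; // one more letter consumed from the Trie side
        int rowMin = row[0];
        for (int j = 1; j < cols; j++) {
            int cost = target.charAt(j - 1) == c ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1),
                              prevRow[j - 1] + cost);
            rowMin = Math.min(rowMin, row[j]);
        }
        if (word != null && row[cols - 1] <= maxDist) out.add(word);
        if (rowMin <= maxDist) // otherwise prune: skip this entire subtree
            for (Map.Entry<Character, SuggestTrie> e : children.entrySet())
                e.getValue().walk(e.getKey(), target, row, maxDist, out);
    }
}
```

Because siblings share the parent's row, the computation for a common prefix such as depend is done once for every word below it.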
3.3.3 Comparing the hash table and the trie
As can be seen in Figure 3.2, the hash table is more efficient than the Trie when searching the dictionary to determine whether a word is valid. However, when a word is misspelled and suggestions are to be given, Figure 3.4 shows that the Trie is a much faster method for finding words within a certain edit distance. We have tested sets of words where every tenth word is misspelled; for example, in the 100-word group, 10 words are misspelled.
This result is not unexpected. Consider the following example to clarify why the Trie is so much faster when searching for word suggestions within a certain edit distance. Figure 3.5 shows how three words, dependable, dependent and depending, which share the prefix depend, are stored in the Trie. When the Trie
Figure 3.4: Word suggestions performance comparison, with a fail rate of 10%, between Trie and hashSet
Figure 3.5: Trie of the words DEPENDABLE, DEPENDENT and DEPENDING
searches, the comparison for the shared prefix depend is performed only once and reused for all three words, whereas the hash table must run the full comparison against every dictionary word, even if the word is completely different.
The results from this test clearly highlighted the effectiveness of the Trie over the hash set. We have therefore chosen the Trie structure and tailored all further algorithms to it.
3.3.4 Improving the Trie
Performance of the Trie can be further improved by adding an integer value, longestWord, to each node. longestWord holds the length of the longest word that exists in the node's subtree; a node whose subtree contains no word longer than nine letters has, for example, longestWord value nine. When the algorithm searches for a word with 14 letters and a
max edit-distance of three, meaning only words that are at least 11 characters long are of interest, it will be able to tell from the first node d that no such word can be found in the subtree. This way the algorithm can skip some subtrees entirely after visiting only their first node. As an example, we searched for the word radioanstonomer with a max edit distance of three: the algorithm visited 2332 words before implementing longestWord and 775 words after. However, short words showed no improvement in performance.
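The length check amounts to one comparison at the root of each subtree; a small sketch (the helper name is ours):

```java
class LongestWordPruning {
    // During insert, every node on the path keeps the maximum word length
    // seen below it, as in the thesis's node class (Appendix D.1):
    //     node.longestWord = Math.max(node.longestWord, word.length());
    //
    // During suggestion search, a subtree can be skipped outright when even
    // its longest word is too short to be within maxDist of the searched word.
    static boolean canSkipSubtree(int longestWord, int searchedLength, int maxDist) {
        // E.g. searching a 14-letter word with maxDist 3: a subtree whose
        // longest word has fewer than 11 letters cannot hold a candidate.
        return longestWord < searchedLength - maxDist;
    }
}
```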
For dictionary lookup, the hash set is the faster method. Combining the hash set and the Trie could be a way to improve performance at the cost of memory: once the hash set signals that a word is misspelled, the Trie is used to generate the word suggestions. This was never done, however, for the reasons mentioned in Section 4.2.2.
3.4 Ranking words
As mentioned in Section 2.2.2, many techniques have been suggested to help solve the word ranking problem, everything from edit distance to advanced machine learning. The following techniques are most suited to us and within our scope: the Damerau-Levenshtein distance algorithm, Soundex, keyboard distance and probability. Depending on various factors and the results of the above-mentioned algorithms, the suggested words are ranked differently.
Edit-distance
Edit distance is the first step for generating the word suggestions. The max edit-distance used for the word search is as follows:
max edit-dist = { 1,                if word.len ≤ 3
                { ⌈word.len × R⌉,   if word.len > 3        (3.1)

where R = 0.3; R could be raised to allow more errors.
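Equation 3.1 translates directly into a small helper (a sketch; the method name is ours):

```java
class MaxEditDistance {
    static final double R = 0.3; // raising R allows more errors

    // Equation 3.1: short words allow exactly one edit, longer words
    // allow a number of edits that scales with their length.
    static int maxEditDist(int wordLen) {
        return wordLen <= 3 ? 1 : (int) Math.ceil(wordLen * R);
    }
}
```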
After a word list with suggested words has been generated, the algorithm could simply sort all the words by edit distance, where a lower edit distance corresponds to a better suggestion. But, as discussed in Section 2.2.1, that is not enough for a robust spellchecker.
Soundex
The Soundex algorithm does its best to cover phonetic errors. We have somewhat altered the classic Soundex algorithm for our program. In standard Soundex, the first letter is left as it is, and the following letters are encoded as digits only until the code is four characters long. For example, for the long word nonpredictable the Soundex code is N516: the first letter of the word is N, all vowels encode to zeros so they are ignored, the next n is in group 5, p is in group 1 and r is in group 6. The last digit 6 represents the letter r, and the rest of the word is ignored. Our program encodes the entire word, generating the extended Soundex code N51632314. We made that change to cover errors in the latter part of the word. For example, the similarly sounding misspelling nonpredictaple also encodes as N51632314, but the misspelling nonpredictarle gets a different code, N51632364. The implementation of the Soundex encoder, with altered Soundex groups, can be seen in Appendix D.2.
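A full-word Soundex encoder in this spirit can be sketched as follows. It uses the standard English Soundex groups, whereas the thesis uses altered groups for Swedish (Appendix D.2), so the real program's codes may differ; with the standard groups, however, it reproduces the codes discussed above:

```java
// Full-word Soundex sketch. Unlike classic Soundex, the whole word is
// encoded rather than just the first three digit groups, so errors late
// in the word still change the code.
class FullWordSoundex {
    private static char code(char c) {
        switch (Character.toLowerCase(c)) {
            case 'b': case 'f': case 'p': case 'v': return '1';
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return '2';
            case 'd': case 't': return '3';
            case 'l': return '4';
            case 'm': case 'n': return '5';
            case 'r': return '6';
            default: return '0'; // vowels (and h/w, in this simplified sketch)
        }
    }

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        out.append(Character.toUpperCase(word.charAt(0)));
        char prev = code(word.charAt(0));
        for (int i = 1; i < word.length(); i++) {
            char c = code(word.charAt(i));
            // Skip vowels and collapse adjacent letters with the same code;
            // a vowel in between resets prev, so repeats after vowels count.
            if (c != '0' && c != prev) out.append(c);
            prev = c;
        }
        return out.toString();
    }
}
```

Note that nonpredictaple encodes the same as nonpredictable because b and p share group 1, while the r in nonpredictarle introduces a 6 and changes the code.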
Keyboard distance
Keyboard distance covers typographical errors. Each QWERTY keyboard key has neighboring keys associated with it; for example, the key S has the following neighbors: A, Q, W, E, D, C, X and Z. We have created a small static database for each keyboard key using Java's hashMap, which can be seen in Appendix D.3. Each keyboard key is a hashMap key, and all its neighboring keys are stored in a list as the hashMap value for the associated key. Consider the misspelling {rest → reat}, where the substituted characters s and a are keyboard neighbors.
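The static neighbor database can be sketched like this (only a few keys are shown; the thesis's full map is in Appendix D.3, and the helper name areNeighbors is ours):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each key maps to the keys physically adjacent to it on a QWERTY keyboard.
class KeyboardNeighbors {
    static final Map<Character, List<Character>> NEIGHBORS = new HashMap<>();
    static {
        NEIGHBORS.put('s', Arrays.asList('a', 'q', 'w', 'e', 'd', 'c', 'x', 'z'));
        NEIGHBORS.put('a', Arrays.asList('q', 'w', 's', 'z'));
        NEIGHBORS.put('t', Arrays.asList('r', 'y', 'g', 'f'));
    }

    // True if the two characters sit next to each other on the keyboard,
    // which makes a substitution between them a likely typo, as in
    // rest -> reat where s was replaced by its neighbor a.
    static boolean areNeighbors(char a, char b) {
        List<Character> n = NEIGHBORS.get(Character.toLowerCase(a));
        return n != null && n.contains(Character.toLowerCase(b));
    }
}
```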
Probability
As mentioned in Section 2.2.2 about Bayes' theorem, to use the probability method in a spellchecker it has to be modelled appropriately. The two main parameters in formula 2.2 are P(A|B) and P(A). We have decided to model P(A|B) after the edit distance, meaning the lower the edit distance, the higher the probability. For P(A) we have parsed Wikipedia articles and counted the number of times each word occurs in them; the larger the occurrence count, the higher the probability. We have expanded the general probability formula to include other factors that take keyboard distance and Soundex into account. The next section describes the formula.
Combining the algorithms
We have devised a point system that combines the methods discussed in previous sections. The goal of the point system is to order a list of dictionary words so that the most probable words, related to the misspelled word, are higher on the list after sorting the list in ascending order based on the points. For two strings, a and b, where a is a misspelled word and b is a dictionary word, points for b are calculated in the following way:
4 × editDistance − keyboard − Soundex = G        (3.2)

points = 2000000 × G − wikiOccurrence            (3.3)
editDistance is the edit distance between a and b. keyboard is either zero or one and indicates whether mismatching characters at the same position in the two strings are keyboard neighbors. Soundex is also either zero or one: zero if a and b have different Soundex codes and one if the codes are the same. Finally, wikiOccurrence is the number of times b appears in Wikipedia.
The suggested words are first divided into groups based on the value of equation 3.2. For example, consider a misspelled word taie and two dictionary words take and tare. Equation 3.2 would evaluate take to three, since 4 × 1 − 1 − 0 = 3, and tare to four, effectively putting them into two different groups. Multiplying by 2000000 further separates the groups, so that when the wikiOccurrence subtraction is applied the groups do not change; wikiOccurrence only moves words within a group. Consider another word tale; like take, it would also evaluate to three. The word that occurs more often (has a higher wikiOccurrence value) gets fewer points and a higher position on the list.
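Equations 3.2 and 3.3 combine into one small scoring function (a sketch; the parameter and method names are ours, and lower points mean a better suggestion):

```java
// Point system from equations 3.2 and 3.3: the multiplier 2000000 keeps
// words in separate edit-distance groups, while wikiOccurrence only
// reorders words inside a group (assuming occurrence counts stay below
// the multiplier).
class SuggestionScore {
    static long points(int editDistance, int keyboard, int soundex,
                       long wikiOccurrence) {
        long g = 4L * editDistance - keyboard - soundex;  // equation 3.2
        return 2_000_000L * g - wikiOccurrence;           // equation 3.3
    }
}
```

For the taie example: take has edit distance 1 with a keyboard-neighbor substitution, so G = 3 and its points start at 6000000, while tare starts at 8000000 and can never overtake take through occurrence counts alone.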
3.5 Summary
Chapter 4
Results and evaluation
This chapter presents the results, evaluates the methods and algorithms implemented in the previous chapter, and highlights some of the problems we had during the implementation phase. Section 4.1 summarises the results of this project. Section 4.2 evaluates the results. Section 4.3 highlights some of the problems and limitations we encountered during the implementation phase. Finally, Section 4.4 summarises this chapter.
4.1 Results
The main goal set for the project was to create a spellchecker that provides relevant word suggestions for misspelled words. This has been accomplished. The dictionary has been created by parsing Wiktionary and Wikipedia XML dumps, and the spellchecker uses this dictionary as the source of correct words. The dictionary is stored in a Trie rather than a hash set because the Trie proved more efficient in our tests. Four main algorithms are used to compare and produce relevant word suggestions from the Trie: Soundex, keyboard distance, probability and the Damerau-Levenshtein distance algorithm.
The spellchecker can only detect a misspelled word if the dictionary does not contain it.
The spellchecker was successfully integrated with the main program, as shown in Figure 1.2. Parts of the source code can be found in Appendix D and a UML diagram in Appendix E.1. The SpellChecker class, which implements the interface ISpellCheker, is the main communication point with the 2c8 modeling tool. The 2c8 modeling tool sends a string to the spellCheck method, which runs the string through all the algorithms mentioned in the previous chapters and returns a list of suggested words if the word was misspelled.
4.2 Evaluation
This section evaluates major parts of the project.
4.2.1 Dictionary
As mentioned in Section 3.1, a complete dictionary has to be attained for a reliable spellchecker, and there are not many reliable open source resources available. Wiktionary [12] has been used as a source for our dictionary.
These parsing issues affect the quality of the dictionary; e.g. some spelling suggestions contain words from one of the groups mentioned above, or similar ones. Considering the number of words the resulting dictionaries hold, it is very difficult to assess the dictionary's quality.
4.2.2 Hash vs Trie
We have successfully compared the hash table and the Trie methods. Section 3.2.3 compares dictionary lookup performance for the hashSet and the Trie: the result, shown in Figure 3.2, is that the hashSet is a faster method than the Trie for checking whether the dictionary contains a specific word. When the hashSet and the Trie are compared for generating word suggestions in Section 3.3.3, the results are reversed dramatically, as can be seen in Figure 3.4.
Section 3.3.4 mentions that it could be possible to combine a hashSet and a Trie by using the hashSet for error detection and the Trie for error correction. However, as seen in Figure 3.2, even for thousands of words the Trie dictionary lookup takes under 20 ms, which is practically unnoticeable for the user. The final implementation of the spellchecker therefore does not use a hash set in any way: the dictionary is stored in the Trie structure, and the Damerau-Levenshtein distance algorithm is applied to the Trie to generate word suggestions.
4.2.3 Nonword error detection
For nonword error detection, a news article written by Nicholas Ringskog Ferrada-Noli [15], which can be found in Appendix B, was processed by the spellchecker. The objective of this test was to evaluate the quality of the dictionary on an arbitrary prewritten Swedish article.
In the provided example, out of 217 correctly spelled words, three were marked as incorrectly spelled: kongeniala, konsertarrangör and språngbräda, which can be found on lines 7, 12 and 13 respectively. For this specific example the false positive rate is 1.4%. For other, randomly chosen articles the false positive rate is about the same, averaging 2%.
4.2.4 Isolated-word error correction
To evaluate isolated-word correction, we tested the spellchecker's word suggestions for each of the low-level problems mentioned in Section 2.1.2: substitution, insertion, deletion and transposition. Spelling errors can occur for different reasons: random typographical errors, typographical errors due to the typist's clumsiness, and phonetic errors. We showcase each of these types of misspelling for each low-level problem. For typographical errors and phonetic errors, the keyboard distance algorithm and the Soundex algorithm, respectively, are the main ranking factors for word suggestions. For random typographical errors, the keyboard distance or Soundex algorithm becomes the main ranking factor only if the random error happens to be typographical or phonetic in nature; otherwise the ranking has to rely on edit distance and probability alone.
Since the relevance of a suggested word is subjective, we compared our suggestions for misspelled words with SAOL's [16] spelling suggestions and Google's spelling suggestions generated by Google Docs. Misspelled words were submitted out of context to Google Docs. We have divided the errors into groups related to the above-mentioned misspelling reasons. All the suggestion comparisons are presented in tables in Appendix C.
Substitution
In a substitution error, one or more characters are substituted with another character, and another substitution is required to correct it. {motorcykel → motorcycel,
clumsiness. {segelbåt → selelbåt, segelbåt → selelbåg} are examples of random substitution corrections. See Tables C.1, C.2 and C.3.
Insertion
In an insertion error, one or more characters are added to a correctly spelled word, and a deletion is required to correct it. Random insertion {kampanj → katmpanj, kampanj → katmpmanj}, accidental insertion due to clumsiness {lastbil → lagstbil, lastbil → lagstbqil} and phonetic insertion {polaritet → polanritet, polaritet → polanmritet}. See Tables C.4, C.5 and C.6.
Deletion
With deletion, one or more characters are missing from a correctly spelled word, and an insertion is required to correct it. Random deletion {flygplan → flygpln, flygplan → fygpln} and phonetic deletion {utveckling → utvekling, utveckling → utveklin}. See Tables C.7 and C.8.
Transposition
For transposition, two nearby characters in a correctly spelled word are swapped; to correct a transposition, another transposition of the same two characters is required. Random transposition {kultur → kutlur, kultur → kutlru}, accidental transposition {motordelen → motoredlen, motordelen → mtooredlen} and phonetic transposition {polisbricka → polisbrikca, polisbricka → poilsbrikca}. See Tables C.9, C.10 and C.11.
4.3 Problems
This section highlights some of the problems that occurred during the project.
4.3.1 Parsing
Wiktionary
In Wiktionary some words do not have tags indicating their category. This has forced us to develop extensive filters to remove unwanted words.
Wikipedia
In Section 3.1 it is mentioned that not everyone who edits pages on Wiktionary and Wikipedia follows the same conventions, and that the raw source code structure can be unorganized. This was evident in some parts of the generated dictionary.
When Wikipedia was parsed for the probability metric, we added words that appear in Wikipedia but not in Wiktionary in order to expand our dictionary. However, some words in the Wikipedia articles were misspelled, and those words were added to our dictionary because, under our assumption that Wikipedia content is correctly spelled, the program could not decide whether a word was correct. Therefore we had to manually review the dictionary to remove misspelled words. Most of these words were removed by counting total word occurrences and excluding words which appeared fewer than ten times in the entire Wikipedia dump.
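The frequency-based cleanup step can be sketched as a simple filter over the word counts (the names here are ours):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Words harvested from Wikipedia are kept only if they occur at least
// minCount times in the whole dump, which removes most one-off misspellings
// at the cost of also dropping genuinely rare words.
class DictionaryFilter {
    static Set<String> keepFrequent(Map<String, Integer> wikiCounts, int minCount) {
        return wikiCounts.entrySet().stream()
                .filter(e -> e.getValue() >= minCount)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}
```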
4.3.2 Limitations
Inadequate language knowledge
Downside of Wiktionary
Whether they are well-known words used in serious texts or lesser-known slang words used on the internet, words can be added to Wiktionary by any user. This creates a limitation, because the content of Wiktionary is based on subjective opinions rather than an objective standard such as SAOL.
Shortage of authentic errors
Texts published by a university or some authority are always proofread. Therefore, it is hard to find serious texts that include phonetic or typographical errors. We tried to create texts with misspellings to evaluate the program more extensively. However, in these tests we wrote the misspelled words on purpose, while real users will not misspell words on purpose. So a big limitation for this project was the shortage of texts with authentic errors to test the program.
4.4 Summary
Chapter 5
Conclusion
5.1 Project Evaluation
The project has progressed well, and the schedule set with our supervisor at the university has been followed. We have been at the premises of 2c8 throughout the period, and they have provided the necessary technical equipment. When thoughts and problems have arisen, our 2c8 supervisor has always been helpful in the best possible way. Due to the restrictions of the COVID-19 pandemic during the latter half of the project, contact with our supervisor at the university was conducted only through digital means. This has not had any significant impact on the project.
5.2 Future Work
This section provides some suggestions for future work to improve the spellchecker and some missing features that we did not manage to implement within the time limit.
Dictionary
base words. A deeper knowledge of the language in question is required, but if done correctly the dictionary should be free of undesired words.
Combining algorithms and point based system
In equation 3.4 we devised a point system, based on equation 2.2, which takes all the algorithms into account. This is by no means a perfect point system: it could be adjusted to improve the suggestion rankings, or scrapped for another point/ranking system entirely.
Context dependent word correction
We have not managed to get anything done for context-dependent word correction. This is the hardest problem in spellchecking and requires either a very well-defined set of linguistic rules or a massive amount of example data. Our initial idea was to implement context-dependent word correction for the Swedish words de, dem, en and ett. When parsing Wikipedia articles we would store all the neighboring words that surround one of the above-mentioned words, thus creating some form of context for them.
Personal dictionary
The importance of creating a personal dictionary was presented in Section 2.3. As mentioned, every industry and company uses words which they have a definition for but which are not included in national or other dictionaries. It would be a desirable addition to the program if it were possible to create a personal dictionary. Expressing oneself through accepted industry words is common, and users should not be limited because the program marks these words as incorrect.
Soundex
Soundex was originally designed for the English language. We have made some adjustments for Swedish, but people who have a more advanced education in the Swedish language would most likely be able to improve Soundex for the spellchecker.
5.3 Concluding remarks
Spellchecking is an old problem in computer science, which boils down to string manipulation and string comparison. Many solutions exist, from very simple to very advanced. Big companies such as Facebook, Google and Microsoft have access to vast resources and data, and most likely do not use the methods we implemented in this project but rather more advanced techniques such as AI and machine learning. With the resources we have had and the scope of this project, our resulting product is adequate, in our opinion of course.
During the development of the spellchecker it became clear just how important the quality of the dictionary is. Without a proper set of correctly spelled words, it does not matter how good the spellchecking methods and algorithms are. From the beginning, we did not think we would have to spend so much time building an adequate dictionary. We have come to the conclusion that it is better if the dictionary is missing some words rather than containing unwanted words, especially with the addition of a personal dictionary, where the user can add both custom words and false positives.
References
[1] K. Kukich, "Techniques for automatically correcting words in text", ACM Computing Surveys (CSUR), vol. 24, no. 4, pp. 377–439, 1992.
[2] F. J. Damerau, “A technique for computer detection and correction of spelling errors”, Communications of the ACM, vol. 7, no. 3, pp. 171–176, 1964.
[3] R. A. Wagner and M. J. Fischer, “The string-to-string correction problem”,
Journal of the ACM (JACM), vol. 21, no. 1, pp. 168–173, 1974.
[4] (Feb. 2020). The first three spelling checkers. [Online; accessed 19. Feb. 2020], [Online]. Available: https://web.archive.org/web/20121022091418/http: //www.stanford.edu/~learnest/spelling.pdf.
[5] D. Knuth, "The Art of Computer Programming, vol. 3 (Sorting and Searching)", Addison-Wesley Publishing Company, vol. 3, pp. 481–489, 1973.
[6] R. De La Briandais, "File searching using variable length keys", Papers presented at the March 3–5, 1959, Western Joint Computer Conference, pp. 295–298, 1959.
[7] Y. Xu and J. Wang, “The Adaptive Spelling Error Checking Algorithm based on Trie Tree”, Atlantis Press, Jul. 2016, issn: 2352-5401.
[8] M. Odell and R. Russell, "U.S. Patent Numbers 1,261,167 (1918) and 1,435,663 (1922)", Technical report, Patent Office, Washington, 1918.
[9] K. Vännman, Matematisk statistik. Lund: Studentlitteratur, 2002.
[11] Apache NetBeans. (Mar. 2020). Welcome to Apache NetBeans. [Online; accessed 5. May 2020], [Online]. Available: https://netbeans.apache.org.
[12] (Mar. 2020). Wiktionary. [Online; accessed 21. Apr. 2020], [Online]. Available: https://www.wiktionary.org.
[13] (Oct. 2018). SAXParser (Java Platform SE 7 ). [Online; accessed 21. Apr. 2020], [Online]. Available: https://docs.oracle.com/javase/7/docs/api/ javax/xml/parsers/SAXParser.html.
[14] (Oct. 2018). HashSet (Java Platform SE 7 ). [Online; accessed 21. Apr. 2020], [Online]. Available: https://docs.oracle.com/javase/7/docs/api/java/ util/HashSet.html.
[15] N. R. Ferrada-Noli. (Apr. 2020). Recension: Barnen djupt saknade när Pippi fi-rar 75 år på Berwaldhallen. [Online; accessed 28. Apr. 2020], [Online]. Available: https://www.dn.se/kultur- noje/recension- barnen- djupt- saknade-nar-pippi-firar-75-ar-pa-berwaldhallen/.
Appendix A
Table of characters
’ - A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Å Ä Ö
a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö
Appendix B
Review text
1  Som tur är finns Berwaldhallen Play och Pippi Långstrump 75 år visas på detta vis
2  istället. Men för vem? Hela poängen med att göra en konsert som låter barn komma
3  till ett fint konserthus och höra en livs levande symfoniorkester är ju just det, den
4  fysiska upplevelsen, skriver Nicholas Ringskog Ferrada-Noli.
5  Böckerna står visserligen på egna ben, men frågan är om Astrid Lindgrens berät-
6  telser hade fått samma fäste i svensk kultur om det inte vore för filmerna och tv-
7  serierna och för de kongeniala sångerna som komponerades till dessa. Hade verkli-
8  gen Pippi Långstrump kunnat bli en sådan ikonisk figur om det inte vore för Jan
9  Johanssons Här kommer Pippi Långstrump, som går rakt in i en treårings hjärta
10  med sin berusande rytm och triumfatoriska melodi? Eller Georg Riedels sånger, vars
11  bästa melodier är lika lysande självklara som Benny Anderssons eller Max Martins.
12  Om man som konsertarrangör vill få små barn att upptäcka musikens magiska
13  värld är det därför logiskt att välja just Pippi Långstrump som språngbräda. Pippi
14  Långstrump 75 år är visserligen inte den första satsningen Berwaldhallen gör för
15  barn, men var på planeringsstadiet ovanligt ambitiös: fyra sjungande skådespelare,
16  en regissör från Astrid Lindgrens Värld och ett helt nytt verk, Benjamin Staerns
17  Pippi lyfter hästen. Konserten sålde slut. Sedan kom coronaviruset till Sverige och
18  alla publika konserter ställdes in.
Appendix C
Table C.1: Word suggestions for substitution errors for fotbollsplan.
Edit-distance
1 2
fotbollsplam forbollsplam
Our SAOL Google Our SAOL Google
fotbollsplan fotbollsplan fotbollsplan fotbollsplan fotbollsplan fotbollsplan fotbollslag fotbollsplans fotbollslag fotbollsplans
fotbollsplans fotbollsspelaren fotbollsplans fotbollsspelaren fotbollshall fotbollsspelare fotbollshall fotbollsspelare fotbollspris fotbollsplanen fotbollspris fotbollsplanen fotbollsklub fotbollsspelarna fotbollsklub fotbollsspelarna
Table C.2: Word suggestions for substitution errors for word motorcykel.
Edit-distance
1 2
motorcycel motorkycel
Our SAOL Google Our SAOL Google
motorcykel motorcykel motorcykel motorcykel motorcykel motorcykel motorcykeln motorcykeln motorcykeln motorcykeln
motorcykels motorcykels motortypen motorcykels motorcyklel motorcykelns motortyper motorskydden motorbyte motocrossen motorcykels motorskyddet motorcicles motorcyklar motorfel motorcykelns
Table C.3: Word suggestions for substitution errors for word segelbåt.
Edit-distance
1 2
selelbåt selelbåg
Our SAOL Google Our SAOL Google
segelbåt segelbåt segelbåt segelbåt cirkelbåge selbåge segelbåts segelbåts segelbar segelbåt
54
Table C.4: Word suggestions for insertion errors for word kampanj.
Edit-distance
1 2
katmpanj katmpmanj
Our SAOL Google Our SAOL Google
kampanj kampanj kampanj kampanj kampanj kampanj kampanja hatkampanj kampvana kardemummans
kampanjs champagne kampanjs matmammans skampanj kampanja kampanja kampsång kampanil kampanjs skampanj komplimang
skyttekompani kardemumman
Table C.5: Word suggestions for insertion errors for word lastbil.
Edit-distance
1 2
lagstbil lagstbqil
Our SAOL Google Our SAOL Google
lastbil lastbil lastbil lastbil lastbil lastbil lagstil blixtbild lagstil blixtbild
lastbils lackstövel lastbils lastcykel lagspel lastbils lastbils flygstil taxibil taxibil labil lyxbil lagspel
Table C.6: Word suggestions for insertion errors for polaritet.
Edit-distance
1 2
polanritet polanmritet
Our SAOL Google Our SAOL Google
polaritet polaritet polaritet polaritet polarområdet polaritet polaritets polaritets polaritets polaritet
paritet popularitet olinjaritet polarområdets polarisen polariteter planarbetet koloniområdet bipolaritet polariteten bipolaritet polarområdes
Table C.7: Word suggestions for deletion errors for utveckling.
Edit-distance
1 2
utvekling utveklin
Our SAOL Google Our SAOL Google
utveckling utveckling utveckling utveckling utveckling utveckling utvecklings utvecklings utveckla utvecklaren
utvecklling utväxling utvecklas utvecklings uveckling utvecklingen utvecklat utväxling utpekning utvikning utvecklar utvecklande
utvikning utsegling utvecklad utvikning
Table C.8: Word suggestions for deletion errors for flygplan.
Edit-distance
1 2
flygpln fygpln
Our SAOL Google Our SAOL Google
flygplan flygplan flygplan flygplan flygplan flygplan flygeln flygeln fågeln vägplan
flygel flygplatsen flygeln vågplan flygplans flygplans tygeln byggplan
flygelns flygbladen bygeln fågeln flygen flygbilden flygeln
Table C.9: Word suggestions for transposition errors for kultur.
Edit-distance
1 2
kutlur kutlru
Our SAOL Google Our SAOL Google
kultur kultur kultur kultur kultur kultur kullar kulturs kuttra kulturs
56
Table C.10: Word suggestions for transposition errors for word motordelen.
Edit-distance
1 2
motoredlen mtooredlen
Our SAOL Google Our SAOL Google
motordelen motmedlen motordelen motordelen tåredalen motordelen motordelar motorleden morellen stormelden
motoraxeln motorhotellen motmedlen motorbåten mutsedeln motorleden motellen motorfelen naturmedlen morellen naturmedlen maktmedlen
Table C.11: Word suggestions for transposition errors for polisbricka.
Edit-distance
1 2
polisbrikca poilsbrikca
Our SAOL Google Our SAOL Google
polisbricka polisbricka polisbricka polisbricka polisbricka polisbricka polisbrickan mosbricka polisbrickan
polisbrickas polisbrickas polisbrickans plastbricka polisbrickorna polisbrickans
Appendix D
Source Code
Listing D.1: Trie node class.
package com.mycompany.spellchecker;

import java.util.HashMap;
import java.util.Map;

public class Trie {
    String word;
    int longestWord = Integer.MIN_VALUE;
    int probability = 0;
    Map<Character, Trie> children = new HashMap<>();
}
Listing D.2: Soundex encoding class.
public class Soundex
{
    public String getGode(String word)
    ...
            wordArray[0];

        for (int i = 1; i < wordArray.length; i++)
            if (wordArray[i] != wordArray[i - 1] && wordArray[i] != '1')
                output += wordArray[i];

        return output;
    }
}
Listing D.3: QWERTY Keyboard Map.
import java.util.HashMap;

public class KeyboardMap {