Spell checker for a Java application
Stavningskontroll till en Java-applikation (Swedish title)
Arvid Viktorsson, Illya Kyrychenko
Faculty of Health, Science and Technology, Computer Science
C-level thesis
Many text-editor users depend on spellcheckers to correct their typographical errors. The absence of a spellchecker can create a negative experience for the user; in today's advanced technological environment, spellchecking is an expected feature. 2Consiliate Business Solutions owns a Java application with a text editor that does not have a spellchecker. This project investigates and implements available techniques and algorithms for spellchecking and automated word correction. During implementation, the techniques were tested for their performance and the best solutions were chosen for this project. All the techniques were gathered from earlier literature on the topic and implemented in Java using standard Java libraries. Analysis of the results shows that it is possible to create a complete spellchecker by combining available techniques, and that the quality of a spellchecker largely depends on a well-defined dictionary.
We would like to thank Per-Erik Svensson, who has been our supervisor at 2consiliate, for all the help and tips we received during this project. Finally, we want to thank 2consiliate as a whole for allowing us to conduct our thesis under their wing.
Contents
1 Introduction 1
1.1 Project goal and motivation . . . 1
1.2 Expected results . . . 1
1.3 Results . . . 3
1.4 Overview . . . 5
2 Background 7
2.1 Spell-checking in software . . . 7
2.1.1 High-level problems . . . 8
2.1.2 Low-level problems . . . 8
2.2 Related Work . . . 9
2.2.1 Techniques for nonword error detection . . . 9
2.2.2 Techniques for isolated-word error correction . . . 11
2.2.3 Context-dependent word errors . . . 13
3.3.1 Generating word suggestions: hash table . . . 24
3.3.2 Generating word suggestions: trie . . . 28
3.3.3 Comparing the hash table and the trie . . . 28
3.3.4 Improving the Trie . . . 29
3.4 Ranking words . . . 30
3.5 Summary . . . 33
4 Results and evaluation 35
4.1 Results . . . 35
4.2 Evaluation . . . 36
4.2.1 Dictionary . . . 36
4.2.2 Hash vs Trie . . . 37
4.2.3 Nonword error detection . . . 37
4.2.4 Isolated-word error correction . . . 38
4.3 Problems . . . 40
4.3.1 Parsing . . . 40
4.3.2 Limitations . . . 40
4.4 Summary . . . 41
5 Conclusion 43
5.1 Project Evaluation . . . 43
5.2 Future Work . . . 43
5.3 Concluding remarks . . . 45
A Table of characters 49
B Review text 50
C Word suggestion tables 52
D Source Code 57
E UML Diagram 67
List of Figures
1.1 Screenshot showcasing 2c8 Modelling Tool's text editor without spellchecker 2
1.2 Screenshot showcasing 2c8 Modelling Tool's text editor with spellchecker 4
3.1 Trie of the words ROLL and ROUND . . . 21
3.2 Dictionary lookup performance comparison between Trie and hashSet 24
3.3 Step by step illustration of the Damerau-Levenshtein algorithm for the words he and hi . . . 27
3.4 Word suggestions performance comparison with fail rate of 10% between Trie and hashSet . . . 29
3.5 Trie of the words DEPENDABLE, DEPENDENT and DEPENDING 29
E.1 UML Diagram . . . 68
Listings
3.1 Insert method for Trie . . . 22
3.2 Search method for Trie . . . 23
3.3 Java implementation of Damerau-Levenshtein distance algorithm . . . 26
D.1 Trie node class. . . 57
D.2 Soundex encoding class. . . 57
D.3 QWERTY Keyboard Map. . . 60
D.4 Levenshtein distance for Trie. . . 62
Chapter 1
Introduction
1.1 Project goal and motivation
The goal of the project is to develop a spellchecking module for an existing Java application called 2c8 Modelling Tool, shown in Figure 1.1. The Java application is a productivity modeling tool in which a user can model process flows and events for certain activities. The models are represented with graphs, where each graph can have various nodes and edges. The nodes have a description feature where the user can enter custom text that describes what the node represents in the model. At the moment, the program does not have any spellchecking in the description tool. Both the owner of the tool and its users have expressed a wish for a spellchecker. In addition, we thought the spellchecking problem was interesting, since nearly all widely used software programs have some form of spellchecking and auto-correction, and it is something often taken for granted by most users.
1.2 Expected results
errors, with the help of the algorithms and methods presented in Chapter 2. Mainly, we strive to implement the Damerau-Levenshtein distance algorithm as the base for the spellchecker and three supporting algorithms to rank the words it generates. These three algorithms are Soundex, keyboard distance and probability. The algorithms mentioned above, and others, are discussed in Section 2.2. Other algorithms will be considered if time allows. We will strive to mimic the results of the most widely used spellchecking programs and services, namely Google's spellchecking in Google Docs and SAOL. We intend to create a Swedish dictionary and an English dictionary for the spellchecker. We also plan to include functionality where users can add custom words to their own personalized dictionary.
1.3 Results
We have succeeded with the main objective of this project, a functioning spellchecker. We have created a Swedish and an English dictionary, but the English dictionary is incomplete at the time of writing. The dictionary is stored in a tree data structure called a Trie, presented in Section 2.2.1. To check whether a word is correctly spelled, a simple tree search is done on the Trie; this is described in greater detail in Section 3.2.2. For incorrectly spelled words, a list of suggested words is extracted from the Trie by traversing it and comparing dictionary words with the misspelled word using the Damerau-Levenshtein distance algorithm, presented in Section 2.2.2. Soundex, keyboard distance and probability ended up being the only methods we used to sort the list of suggested words into the most relevant order. The suggested words generated by the spellchecker are presented in Appendix C. The final result inside 2c8 Modelling Tool can be seen in Figure 1.2. The integration of the spellchecking module with the main program was done in collaboration with our supervisor from 2c8.
and will be implemented in the future.
1.4 Overview
Chapter 2
Background
This chapter defines the main problems in spellchecking and introduces relevant terms. Section 2.1 introduces high-level and low-level problems. Section 2.2 discusses some of the solutions to the previously mentioned problems. Section 2.3 talks about the importance of having a user defined dictionary. Finally, Section 2.4 defines the objectives.
2.1 Spell-checking in software
Software text editors have been around for a long time. Ever since modern computers were invented, and brought to a wider audience, computer users have required the ability to insert text into computers to either program them, compose emails, make notes or write papers.
2.1.1 High-level problems
The article Techniques for automatically correcting words in text [1] defines three fundamental problems in spell-checking: nonword error detection, isolated-word error correction and context-dependent word correction. Given two strings of characters a and b, {a → b} denotes a as the intended spelling and b as the actual spelling. Nonword errors are spelling errors of the form {taxes → taqes, dictionary → ditcionary, rebell → rebll}. The nonword problem is about detecting that a spelling error has occurred. Isolated-word error correction is a technique which, in addition to detecting a misspelled word, suggests correctly spelled words. Lastly, context-dependent word correction detects misspelled words of the form {taxes → takes, their → they're, car → bar}, where takes, they're and bar are correctly spelled words but not in the context in which they are used.
2.1.2 Low-level problems
According to the article A technique for computer detection and correction of spelling errors, written by Damerau, F. J. in 1964 [2], a wrong letter, a missing letter, an extra letter or a single transposition between letters accounts for the majority of spelling errors. To correct these errors, four operations are defined: substitution, insertion, deletion and transposition.
• Substitution changes one of the characters in a string of characters, keeping the original length. For example, correcting the misspelled word {banana → babana} requires one edit. More specifically, it would take one substitution, of b with n, to transform babana into banana.
• Insertion and deletion, as the names imply, transform strings by either inserting or deleting characters. The string hel is one insertion edit away from help and one deletion edit away from he.
• Transposition flips two adjacent characters, effectively changing two characters in one operation, without adding or removing any characters. Steret becomes Street after one transposition of the characters e and r.
These operations form a metric called edit-distance that was introduced by Wagner, R. A. and Fischer, M. J. in an article named The string-to-string correction problem, written in 1974 [3]. Each of the edit operations is defined as one edit-distance. The minimum edit-distance is defined as the minimum number of operations required to transform a word a into a word b.
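As a concrete illustration of the metric, the operations above can be sketched as a naive recursive computation (our sketch, covering substitution, insertion and deletion only; the efficient dynamic-programming version, including transposition, appears in Chapter 3):

```java
public class EditDistance {
    // Naive recursive minimum edit distance (substitution, insertion, deletion).
    // Exponential in the worst case; suitable only as an illustration of the metric.
    static int distance(String a, String b) {
        if (a.isEmpty()) return b.length();
        if (b.isEmpty()) return a.length();
        int cost = a.charAt(0) == b.charAt(0) ? 0 : 1;
        return Math.min(
                distance(a.substring(1), b.substring(1)) + cost, // substitute / match
                Math.min(distance(a.substring(1), b) + 1,        // delete from a
                         distance(a, b.substring(1)) + 1));      // insert into a
    }

    public static void main(String[] args) {
        System.out.println(distance("hel", "help"));     // 1: one insertion
        System.out.println(distance("hel", "he"));       // 1: one deletion
        System.out.println(distance("banana", "babana")); // 1: one substitution
    }
}
```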
2.2 Related Work
Spell-checking is one of the oldest problems in computer science. An article named The First Three Spelling Checkers [4] sheds some light on the development of the first spell checkers in the late 1950s at the Massachusetts Institute of Technology. Many techniques and solutions have been developed over the years. This section goes over some of the better known ones, many of which are discussed by Kukich [1].
2.2.1 Techniques for nonword error detection
Nonword problems only require detecting whether a word is correctly spelled or not. To accomplish this, some form of static dictionary, a set of possible words or a collection of letter patterns, is required. The main techniques are divided into three groups: dictionary lookup, n-grams and tree structures [1].
Dictionary lookup
N-Grams
N-grams are used for both nonword and context-dependent problems. A gram can be either a character or a character string. A 1-gram (unigram) consists of a single gram, a 2-gram (bigram) chains two grams together, a trigram consists of three chained grams, and so on. For nonword problems, binary n-grams are used. In binary n-grams each gram is a binary array, a string of 0s and 1s. A binary bigram forms a two-dimensional array and a binary trigram a three-dimensional array. Rows and columns represent valid letter combinations in existing dictionary words.
To construct binary bigrams for an English dictionary, the two-dimensional array would have to be of size 26x26, accounting for all alphabetical letters. All bits of the array are set to 0 except at positions i and j, where i and j represent letters in the alphabet; a set bit implies that there is a word in the dictionary where the alphabet letter at i is followed by the alphabet letter at j. In other words, binary bigrams collect all possible two-letter combinations that exist in the dictionary, trigrams collect all possible combinations of three letters, and so on.
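The construction above can be sketched as follows (a minimal illustration with a hypothetical two-word dictionary; note that a bigram table can only detect letter pairs that never occur, so some nonwords still pass the check):

```java
public class BinaryBigrams {
    // 26x26 table: table[i][j] is true if letter i is ever followed by letter j
    // in some dictionary word.
    static boolean[][] table = new boolean[26][26];

    static void train(String[] dictionary) {
        for (String word : dictionary) {
            String w = word.toLowerCase();
            for (int i = 0; i + 1 < w.length(); i++) {
                table[w.charAt(i) - 'a'][w.charAt(i + 1) - 'a'] = true;
            }
        }
    }

    // A word fails the check if any adjacent letter pair never occurs
    // in the dictionary.
    static boolean plausible(String word) {
        String w = word.toLowerCase();
        for (int i = 0; i + 1 < w.length(); i++) {
            if (!table[w.charAt(i) - 'a'][w.charAt(i + 1) - 'a']) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        train(new String[] {"hello", "world"});
        System.out.println(plausible("hell"));  // true
        System.out.println(plausible("hqllo")); // false: "hq" never occurs
    }
}
```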
Trie (prefix tree)
The Trie is a tree-type data structure that was introduced in the article File searching using variable length keys, written by René de la Briandais in 1959 [6]. It was later named Trie, a name derived from the word retrieval. It is sometimes referred to as a prefix tree to distinguish it from other tree data structures.
The Trie, as described in the article The Adaptive Spelling Error Checking Algorithm based on Trie Tree [7], can be used for both nonword and isolated-word problems. In the
2.2.2 Techniques for isolated-word error correction
Suggested spelling, when an incorrect word is given, is one of the most important features in modern spellchecking. Suggesting relevant words is also one of the main sub-problems in isolated-word error correction. For example, in the typo {forest → forets} both forest and forgets are just one edit distance away from forets, and both are valid and sensible suggestions. Several algorithms and techniques have been developed over the years to rank suggestions with similar edit distances in the most relevant order possible.
Many techniques have been suggested to help solve this problem, everything from edit distance to advanced machine learning. This subsection discusses some of the most relevant techniques which fit into the scope of this project.
Minimum edit-distance
As mentioned earlier, the minimum edit-distance is a metric which measures the number of edits needed to transform one string of characters into another. An algorithm which calculates this metric was introduced by Damerau in 1964. Two years later, Levenshtein came out with a similar algorithm for the same metric [1]. Both of their algorithms have been combined into an algorithm known as the Damerau-Levenshtein distance. This algorithm covers all four low-level spelling errors: substitution, insertion, deletion and transposition. Using minimum edit-distance alone usually yields too many irrelevant suggestions. For example, the misspelled word forets has over 300 words within edit-distance one or two. Minimum edit-distance therefore has to be used in combination with other techniques in order to filter and rank suggested words.
Soundex
Soundex, patented by Odell and Russell in 1918 [8], is a way to group similarly sounding letters and encode words based on Soundex groups. For example, the words rope and robe both encode to R010, while rome encodes to R050. The former two words encode to the same value because p and b belong to the same Soundex group. This makes Soundex useful for error types where the user misspells words by writing them the way they sound or by confusing similar sounding letters. The Soundex encoding technique can be used to rank up suggested words where the misspelled word and a correctly spelled word have the same encoding value.
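An encoding consistent with the examples above can be sketched as follows (our reconstruction: the grouping is taken from standard Soundex, but vowels and h, w, y are kept as the digit 0 rather than dropped, which is what the codes R010 and R050 imply; the thesis's actual encoder is in Appendix D.2):

```java
public class SoundexVariant {
    // Group digit for each letter; vowels and h, w, y map to '0' in this
    // variant, which keeps them in the code (standard Soundex drops them).
    static char digit(char c) {
        switch (c) {
            case 'b': case 'f': case 'p': case 'v': return '1';
            case 'c': case 'g': case 'j': case 'k': case 'q':
            case 's': case 'x': case 'z': return '2';
            case 'd': case 't': return '3';
            case 'l': return '4';
            case 'm': case 'n': return '5';
            case 'r': return '6';
            default: return '0'; // vowels, h, w, y
        }
    }

    // Keep the first letter, encode the rest as group digits.
    static String encode(String word) {
        String w = word.toLowerCase();
        StringBuilder sb = new StringBuilder();
        sb.append(Character.toUpperCase(w.charAt(0)));
        for (int i = 1; i < w.length(); i++) {
            sb.append(digit(w.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("rope")); // R010
        System.out.println(encode("robe")); // R010: same group as rope
        System.out.println(encode("rome")); // R050
    }
}
```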
Keyboard related suggestions
Some spelling errors arise due to clumsiness of the typist. Most of these errors occur by pressing a wrong, neighboring key on the keyboard. For example, on a QWERTY keyboard layout, the key H has the following neighbors: Y, U, J, N, B, G and T. If the misspelling {home → jome} occurs, two suggested correctly spelled words might be home and dome. By analyzing the keyboard layout, home would score higher than dome because j is a neighbor of h.
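The neighbor test above can be sketched with a small lookup map (only the keys needed for this example are listed here; the full QWERTY map used in the thesis is in Appendix D.3):

```java
import java.util.Map;

public class KeyboardDistance {
    // Partial QWERTY neighbor map: for each key, the keys physically adjacent
    // to it. Illustrative subset only.
    static Map<Character, String> neighbors = Map.of(
            'h', "yujnbgt",
            'j', "uikmnh",
            'd', "erfcxs");

    static boolean areNeighbors(char typed, char intended) {
        return neighbors.getOrDefault(typed, "").indexOf(intended) >= 0;
    }

    public static void main(String[] args) {
        // For the typo {home -> jome}, "home" beats "dome" because the typed
        // letter j is a physical neighbor of h but not of d.
        System.out.println(areNeighbors('j', 'h')); // true
        System.out.println(areNeighbors('j', 'd')); // false
    }
}
```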
Bayes’Theorem
Bayes' theorem [9] states that

P(A|B) = P(B|A) P(A) / P(B)    (2.1)

where A and B are events and P(B) ≠ 0. P(A|B) denotes the probability of event A occurring given that event B is true, and P(A) is the probability that A occurs.
Bayes' theorem can be applied to solving isolated-word problems [1]. In the context of isolated-word problems, A and B are words; P(A|B) is the probability that A is the correctly spelled word given a misspelled word B; P(B|A) is the probability that the intended word A would be typed as B; and P(A) is the probability that word A appears in any valid text.
P(B) is irrelevant because it is the same for every A. The formula can therefore be rewritten as P(A|B) ∝ P(B|A)P(A), and the best suggestion is the candidate word A that maximizes this product for the given misspelling. There are three main problems with constructing a good probability model for spell-checking: P(B|A) needs to be modeled appropriately, relevant candidate words have to be picked, and P(A) has to come from a reliable source of data. The first two problems are usually solved with edit-distances, while the third problem does not have any straightforward solution.
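The ranking idea can be sketched as follows (a toy model with made-up corpus counts and a constant error likelihood, not real data; a real error model would depend on the specific edit):

```java
import java.util.Map;

public class NoisyChannelRank {
    // Hypothetical counts: how often each candidate A appears in some
    // reference corpus. Illustrative values only.
    static Map<String, Integer> corpusCount = Map.of("forest", 120, "forgets", 15);

    // P(A): relative frequency of A in the corpus.
    static double prior(String a) {
        int total = corpusCount.values().stream().mapToInt(Integer::intValue).sum();
        return (double) corpusCount.get(a) / total;
    }

    // P(B|A): a crude error model. Here it is a constant per single-edit typo,
    // so ranking reduces to comparing priors.
    static double likelihood = 0.01;

    // Proportional to P(A|B); P(B) cancels out when comparing candidates.
    static double score(String a) {
        return likelihood * prior(a);
    }

    public static void main(String[] args) {
        // For the misspelling "forets", forest outranks forgets on prior alone.
        System.out.println(score("forest") > score("forgets")); // true
    }
}
```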
2.2.3 Context-dependent word errors
Regardless of the development of isolated-word error correction, there will always remain a residual class of errors that isolated-word detection techniques cannot handle. It is the class of real-word errors, where one correct word is exchanged for another. Some of these errors are the result of simple typos {from → form, form → farm} or cognitive or phonetic lapses {there → their, ingenious → ingenuous}. Some are grammatical or syntactic mistakes where the writer uses the wrong inflected form {walked → walks, was → were} or the wrong function word {her → his, of → for}. There can also be semantic anomalies {in five minuets, lave a message}. In addition, some errors occur due to insertions or deletions of whole words {the program crashes due to some minor crashes errors, this will cost extra for the} or improper spacing, including both run-ons and splits {your self → yourself, myself → my self}. These types of errors require information from the context for both correction and detection. To handle context-dependent word errors there must exist a full-blown natural-language-processing (NLP) tool with capabilities such as robust natural language parsing, semantic understanding, pragmatic modeling and discourse structure modeling.
According to Kukich [1], 40% of misspellings were real-word errors in a 1987 analysis of 925 handwritten student essays. Kukich's study of a 40,000-word corpus of typed textual conversations dramatized the need for context-dependent spelling correction. For example, the word book can be both a verb and a noun, see below.
One way to handle real-word errors is to view them as violations of natural language processing constraints and use NLP tools to detect and correct them. Researchers in NLP identify at least five different levels of processing constraints:
• A lexical level
– Nonword errors violate constraints on the formation of valid words; therefore they are classified as lexical errors.
• A syntactical level
– Errors where there is a lack of subject-verb number agreement would be defined as syntactic errors. These errors violate syntactic constraints.
• A semantic level
– Errors that do not violate syntactic constraints but result in semantic deviations, e.g. see you in ten minuets, are semantic errors.
• A discourse structure level
– Spelling errors that do not follow the inherent coherence of a text. For example, enumeration violations should be considered discourse structure errors, e.g. Arvid's car has four wheels: two in the front and one at the back.
• A pragmatic level
– Spelling errors that reflect deviations related to the discourse participants' goals and plans would be classified as pragmatic errors, e.g. Has Patrik washed his carpet today? (where car was intended) [1].
N-gram solution for context-dependent word errors
Have I met your friend?
Have I met youre friend?
Assuming the N-gram model contains the first combination of words above, the model would not recognize the second alternative as a valid combination of words and would suggest the first one instead.
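The idea can be sketched with word bigrams (a minimal sketch trained on the single example sentence above; a real model would be trained on a large corpus):

```java
import java.util.HashSet;
import java.util.Set;

public class WordBigrams {
    // Set of word pairs observed in the training text.
    static Set<String> bigrams = new HashSet<>();

    static void train(String sentence) {
        String[] w = sentence.toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < w.length; i++) {
            bigrams.add(w[i] + " " + w[i + 1]);
        }
    }

    // An unseen pair is a hint that one of the two words may be a
    // real-word error in this context.
    static boolean seen(String first, String second) {
        return bigrams.contains(first.toLowerCase() + " " + second.toLowerCase());
    }

    public static void main(String[] args) {
        train("Have I met your friend");
        System.out.println(seen("your", "friend"));  // true
        System.out.println(seen("youre", "friend")); // false: flag as suspect
    }
}
```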
2.3 Personal dictionary
In every industry and company there are words that are frequently used, widely recognized and well defined, but not included in national or other dictionaries. It therefore becomes frustrating for the author when these words are always flagged as misspelled by the program. An extension to the generated dictionary will be created for these industry-specific words that are not included in the existing dictionary. These words are added either automatically, when the exact same spelling has been used several times, or manually by the user.
2.4 Objectives
The main objective of this project is to implement solutions for nonword error detection, isolated-word error correction and, for simpler grammatical errors, context-dependent word error correction, in a module for a Java application.
For nonword problems, both the hash set and the Trie offer nearly identical performance in speed, as shown in a study by Xu and Wang [7]. To check whether a word is correctly spelled, the hash set has to produce a hash address for the given word, while the Trie has to visit n nodes, where n is the length of the word. The difference in execution time between these two operations is insignificant.
letting a spell-checking algorithm skip entire subtrees when the edit-distance gets too far from the entered word. On the downside, Trie data structures are very memory intensive. After implementing both the hash set solution and the Trie, we want to evaluate how the two solutions compare performance-wise and decide whether the speed the Trie offers outweighs the memory cost.
Isolated-word problems also include the problem of ranking different suggestions by deciding which ones are more relevant to a misspelled word. The three main solutions for this, not counting edit-distance since edit-distance needs to be implemented either way, are Soundex, keyboard layout and probabilistic error correction. It would be ideal to implement all of them in order to pick the best spelling suggestions, but using them all together might end up increasing the response time of the text editor.
2.5 Summary
Chapter 3
Implementation
This chapter describes which methods from the previous chapter we have used, the main steps we have gone through to build our spellchecker, and how we implemented the algorithms in Java. All the Java code was written in the Apache NetBeans (11.2) [11] IDE using standard Java libraries. We chose NetBeans because the 2c8 Modelling Tool is developed in NetBeans and the IDE is used by the company; we have not used any NetBeans-specific features, so any IDE would suffice for this project. For storing and generating words we picked two methods, the hash table dictionary lookup and the Trie. Both were implemented in parallel and then tested to determine which one is better suited for this project.
3.1 Generating a dictionary
A complete dictionary has to be obtained for a reliable spellchecker. Unfortunately, there are not many open source projects available for this, and those that are available are not tailored specifically for spellcheckers.
We have decided to use Wiktionary [12] as the source for our dictionary. Wiktionary covers many languages and its content is freely available to anyone. The downside of Wiktionary is that the raw source structure is unorganized, because many different people edit Wiktionary and not everyone follows the same conventions when inserting and editing words. Wiktionary can be downloaded as a raw XML dump of its entire content. We have used Java's XML SAXParser [13] to process the Wiktionary XML file. In Wiktionary, each word is given its own article. Using the SAXParser we parsed all article names for the desired languages to fill the dictionaries. However, not all article names are valid words: there are categories such as slang, intentional misspellings, many unnecessary abbreviations and so on. We had to tailor the parser so that it only extracted the words we wanted. This is where the unorganized nature of Wiktionary became problematic. In the XML code, each Wiktionary word has tags which describe and categorise the word. One Wiktionary editor may have used the tags {slang|en} for slang words, while another editor used {'slang', 'en'}. Simply filtering on the word slang was not an option either, because many valid words have slang senses associated with them. This resulted in a lot of manual and tedious checking of the dictionary to make sure undesired words did not get past the parsing filters.
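The basic extraction step can be sketched with Java's SAX API (a minimal sketch parsing a tiny inline XML fragment; the real dump is streamed from a file, and the tag-based filtering described above is omitted here):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TitleExtractor {
    // Collect the text of every <title> element, which in a MediaWiki dump
    // is the article (word) name.
    public static List<String> extractTitles(String xml) throws Exception {
        List<String> titles = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private boolean inTitle = false;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if (qName.equals("title")) { inTitle = true; text.setLength(0); }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) text.append(ch, start, length); // may arrive in chunks
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("title")) { inTitle = false; titles.add(text.toString()); }
            }
        });
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>roll</title></page>"
                   + "<page><title>round</title></page></mediawiki>";
        System.out.println(extractTitles(xml)); // [roll, round]
    }
}
```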
In addition, while parsing Wikipedia articles for our probability method (discussed in Section 3.4), we discovered that many words found on Wikipedia were not in Wiktionary. We have included many of those words in our dictionary as well.
3.2 Storing a dictionary
A spellchecker needs to have access to a set of correct words, a dictionary. Several methods have been developed for storing dictionaries in computer memory, including, but not limited to, hash tables, n-grams and tree data structures. We have picked the hash table and the Trie methods for this project, as they are the most researched and developed methods and fit into the scope of this project. We will evaluate the performance of the hash table and the Trie for nonword error detection and isolated-word error correction based on execution speed in Java.
3.2.1 Hash table
A hash table is an array data structure where each element value is associated with a key. A hash function, usually taking the key as one of its parameters, produces a hash code or hash address that determines where the element is stored. Depending on the hash function, hash code collisions may occur: a collision occurs when two different keys hash to the same hash code.
Implementing dictionary lookup: Hash table
Java offers an implementation of the hash table in its HashMap class, which stores entries as two arguments, a key and a value. We used Java's HashSet, which is backed by a HashMap internally but only requires a key, the element itself, as its argument.
A mathematical set is a collection that does not contain duplicate elements. For example, after insertion of the values 1, 1, 2, 3, 4, 4, a set S will have the following members: S = {1, 2, 3, 4}. Java's HashSet follows the same principle as a set but also hashes its members to hash codes for faster lookup. For two different values v1 and v2, the HashSet computes hash codes c1 and c2 with a hash function h() and stores the values in a table. Collisions arise when v1 and v2 get the same hash code.
Figure 3.1: Trie of the words ROLL and ROUND
3.2.2 Trie
A Trie is an ordered tree structure which stores strings of characters. A node's key is a character of some string which the tree stores, and the depth of the node is the position of that character within the string. All descendants of a node share the common prefix associated with that node, while the empty string is associated with the root. Branches of the Trie can thus share a common prefix of characters, and a string is formed by traversing down the Trie from the root node to a leaf node.
In Figure 3.1, ROLL and ROUND start with the same two characters, R and O. Therefore ROLL and ROUND share the prefix R-O.
Implementing a Trie
Each node features a hash set with a key k and a value v, where k ∈ α (see Table A.1 in Appendix A) represents a node value and v is a pointer to a sub-tree. For example, in the tree in Figure 3.1, node O has the key value O and pointers to the children nodes L and U. Each node can have up to |α| unique characters as possible children and therefore the same number of possible pointers. In addition, each node has a boolean value indicating whether the node is the last character of a valid word or not.
Listing 3.1: Insert method for Trie
private void insert(Trie root, String word) {
    Trie node = root;
    for (int i = 0; i < word.length(); i++) {
        char letter = word.charAt(i);

        if (!node.children.containsKey(letter)) {
            node.children.put(letter, new Node());
        }
        node = node.children.get(letter);
    }
    node.isWord = true;
}
Listing 3.1 demonstrates how a word is inserted into the tree. The for loop starts from the root node, travels down the tree and creates children nodes as needed. As an example, consider the two words roll and round from Figure 3.1 and an empty Trie. The insert method starts at the root node, which has the value null, and the letter r. Since the tree is empty, the root does not have any children nodes; the if statement evaluates to true and r is inserted into the root's children. The next line assigns r as the current node, and since r is a newly created node it has no children. The same process is repeated for the letters o, l and l. When the end of the word is reached, the for loop exits and isWord is set to true for the last node. When inserting the word round, insert starts at the root again. The if condition is false for the first letter r because r already exists as one of the children. Creation of a new node is skipped, the current node is set to r and the current letter to o. Node r has a child o, so no new node is needed. The current node is now o and the current letter is u. Node o does not have u as a child, thus u is added to the children, a new sub-tree is created and the nodes n and d are added.
Lookup travels down the Trie letter by letter until the end of the word is reached. The word is only valid if the last node's isWord is set to true. Listing 3.2 shows the search method that is used for Trie lookup.
Listing 3.2: Search method for Trie
private Boolean search(Trie root, String word) {
    if (root == null) {
        return false;
    }

    Trie node = root;

    for (int i = 0; i < word.length(); i++) {
        node = node.children.get(word.charAt(i));
        if (node == null) {
            return false;
        }
    }
    return node.isWord;
}
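Listings 3.1 and 3.2 assume a node class like the one in Appendix D.1. A self-contained sketch of the structure they rely on, usable for experimentation (field and method shapes are our assumptions where the appendix is not reproduced here; we use a single Trie class for both the tree and its nodes):

```java
import java.util.HashMap;
import java.util.Map;

public class TrieDemo {
    static class Trie {
        Map<Character, Trie> children = new HashMap<>();
        boolean isWord = false;
    }

    static void insert(Trie root, String word) {
        Trie node = root;
        for (int i = 0; i < word.length(); i++) {
            char letter = word.charAt(i);
            if (!node.children.containsKey(letter)) {
                node.children.put(letter, new Trie());
            }
            node = node.children.get(letter);
        }
        node.isWord = true; // mark the last character of a valid word
    }

    static boolean search(Trie root, String word) {
        Trie node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) return false;
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        Trie root = new Trie();
        insert(root, "roll");
        insert(root, "round");
        System.out.println(search(root, "roll")); // true
        System.out.println(search(root, "rol"));  // false: prefix only
    }
}
```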
3.2.3 Dictionary lookup performance comparison
As discussed previously, in theory a hash table is faster than a Trie for dictionary lookup. For our hash table implementation we use Java's hashSet, and to look up whether a word is in the set we use hashSet's hash function; for more details, see Java's hashSet documentation [14]. In comparison, the Trie's search method must travel down the Trie for each letter in the word, visiting each child node that corresponds to the letter. search returns true only if the end of the word is reached and the final node's boolean value is true. The hash function for hashSet has constant time complexity, but the Trie's search depends on the length of the word. Figure 3.2 shows the performance difference between the two.
Figure 3.2: Dictionary lookup performance comparison between Trie and hashSet
3.3 Isolated word corrections
It is crucial to find the correct spelling of a word that was misspelled. Several algorithms have been devised to provide spelling suggestions; each fulfills different tasks and covers different types of spelling errors. First, a candidate list of suggested words must be generated. For that we use the Damerau-Levenshtein algorithm. The Damerau-Levenshtein distance measures the difference between two strings x and y as the minimum number of character edits required to transform x into y. As described in Section 2.2.2, the algorithm covers all four of the low-level spelling errors.
3.3.1 Generating word suggestions: hash table
Design
In a hash table solution, a spellchecking algorithm must visit every word in the dictionary to calculate the edit distance. Given a misspelled word a and a dictionary set D, the algorithm compares every word di in D with a and calculates the minimum edit distance between them using the Damerau-Levenshtein distance algorithm. A word di can be considered a candidate for the correct spelling of a if the minimum edit distance between them is below a certain threshold.
First, any given word b is checked with a hash table lookup to determine whether it is misspelled, by checking if b ∈ D. Then, if b is misspelled, the character counts of di and b are compared. If the difference in character count exceeds the threshold, there is no need to run the Damerau-Levenshtein distance algorithm: if, for example, the first word is four characters shorter than the second, then the minimum edit distance is at least four, since four insertions or deletions are required just to bring the words to the same length. If the two words are of similar length, they are compared with the Damerau-Levenshtein distance algorithm to determine whether di is a suitable candidate.
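The candidate loop with the length-based shortcut might look like the following sketch. The class and method names are ours, and the compact distance function stands in for Listing 3.3:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CandidateFinder {

    // Collects dictionary words within maxDist edits of the misspelled word a.
    // The cheap length check runs first: if the lengths differ by more than
    // maxDist, at least that many insertions or deletions are needed, so the
    // expensive distance computation can be skipped.
    static List<String> candidates(String a, Set<String> dictionary, int maxDist) {
        List<String> result = new ArrayList<>();
        for (String d : dictionary) {
            if (Math.abs(a.length() - d.length()) > maxDist) continue;
            if (damerauLevenshtein(a, d) <= maxDist) result.add(d);
        }
        return result;
    }

    // Compact optimal-string-alignment form of the algorithm described in
    // Listing 3.3: counts insertions, deletions, substitutions and adjacent
    // transpositions.
    static int damerauLevenshtein(String x, String y) {
        int n = x.length(), m = y.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
                if (i > 1 && j > 1 && x.charAt(i - 1) == y.charAt(j - 2)
                        && x.charAt(i - 2) == y.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost); // transposition
                }
            }
        }
        return d[n][m];
    }
}
```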
Damerau-Levenshtein algorithm
Listing 3.3: Java implementation of Damerau-Levenshtein distance algorithm
We will use a short step by step example to show how the algorithm calculates the edit-distance. Consider two words he and hi, where the edit distance between them is one since one substitution edit is required to transform he into hi. Figure 3.3 demonstrates, step by step, how the distance is calculated.
Figure 3.3: Step by step illustration of the Damerau-Levenshtein algorithm for words he and hi
Prior to the first step, a two-dimensional array is built with the elements of the first row, i, and the first column, j, set to 0, 1, 2. Each step looks at three array elements relative to its current position: the element to the left, the element above, and the north-west element. The current position of the algorithm is marked with > <. One is added to the left and upper elements, and the cost is added to the north-west element; the smallest of the three results becomes the value of the current element. The cost is 0 if the letters of the two words match and 1 if they mismatch.
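The matrix walk of Figure 3.3 can be reproduced in code by keeping the whole array; this sketch follows the description above (Listing 3.3 itself is not reproduced here, so the names are ours):

```java
class EditMatrix {
    // Builds the full Damerau-Levenshtein (optimal string alignment) matrix,
    // so that every step of Figure 3.3 can be inspected. d[i][j] holds the
    // distance between the first i letters of x and the first j letters of y.
    static int[][] matrix(String x, String y) {
        int n = x.length(), m = y.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;   // first column: 0, 1, 2, ...
        for (int j = 0; j <= m; j++) d[0][j] = j;   // first row: 0, 1, 2, ...
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // from above: deletion
                        d[i][j - 1] + 1),         // from the left: insertion
                        d[i - 1][j - 1] + cost);  // north-west: match/substitution
                if (i > 1 && j > 1 && x.charAt(i - 1) == y.charAt(j - 2)
                        && x.charAt(i - 2) == y.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost); // transposition
                }
            }
        }
        return d;
    }
}
```

For he and hi this yields the rows 0 1 2 / 1 0 1 / 2 1 1, with the final distance of one in the bottom-right corner, exactly as in Figure 3.3.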
3.3.2 Generating word suggestions: trie
The Trie also uses the Damerau-Levenshtein distance algorithm. Unlike the hash table, the Trie does not need to go through every word in the dictionary. Instead, the Trie travels from node to node and remembers the current edit distance. If the edit distance passes the threshold, an entire Trie subtree is skipped, because the edit distance cannot decrease as a word gets longer, and another subtree is tested next. The edit distance is calculated in the same way as for the hash table. The biggest difference is that when the hash table compares a dictionary word to a given word, it knows the whole dictionary word, while the Trie starts at the root and then goes to the first letter, the second letter, and so on; the Trie does not know the whole word it is currently checking. This means that the Damerau-Levenshtein distance algorithm needs to be applied recursively so that all of a node's children are visited. The Trie implementation of the Damerau-Levenshtein distance algorithm is shown in Appendix D.4, in the Java method on line 83.
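The recursive node-by-node search can be sketched like this. For brevity the sketch computes plain Levenshtein rows (insertion, deletion, substitution), whereas the thesis's version in Appendix D.4 also counts transpositions; all names here are ours:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Suggestion search over a Trie: one edit-distance row is computed per node,
// and a whole subtree is skipped as soon as every value in the current row
// exceeds the threshold, since the distance can only grow with depth.
class SuggestTrie {
    final Map<Character, SuggestTrie> children = new HashMap<>();
    String word; // set on the node where a dictionary word ends

    void insert(String w) {
        SuggestTrie node = this;
        for (char c : w.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new SuggestTrie());
        node.word = w;
    }

    List<String> suggest(String target, int maxDist) {
        List<String> out = new ArrayList<>();
        int[] firstRow = new int[target.length() + 1];
        for (int j = 0; j <= target.length(); j++) firstRow[j] = j;
        for (Map.Entry<Character, SuggestTrie> e : children.entrySet())
            e.getValue().walk(e.getKey(), target, firstRow, maxDist, out);
        return out;
    }

    private void walk(char c, String target, int[] prevRow, int maxDist,
                      List<String> out) {
        int cols = target.length() + 1;
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1; // one more letter consumed from the Trie side
        int rowMin = row[0];
        for (int j = 1; j < cols; j++) {
            int cost = target.charAt(j - 1) == c ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1),
                              prevRow[j - 1] + cost);
            rowMin = Math.min(rowMin, row[j]);
        }
        if (word != null && row[cols - 1] <= maxDist) out.add(word);
        if (rowMin <= maxDist) // otherwise prune: skip this entire subtree
            for (Map.Entry<Character, SuggestTrie> e : children.entrySet())
                e.getValue().walk(e.getKey(), target, row, maxDist, out);
    }
}
```

Because siblings share the parent's row, the computation for a common prefix such as depend is done once for every word below it.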
3.3.3 Comparing the hash table and the trie
As can be seen in Figure 3.2, the hash table is more efficient than the Trie when searching the dictionary to determine whether a word is valid. However, when a word is misspelled and suggestions are to be given, Figure 3.4 shows that the Trie is a much faster method for finding words within a certain edit distance. We have tested sets of words where every tenth word is misspelled; for example, in the 100-word group, 10 words are misspelled.
This result is not unexpected. Consider the following example to clarify why the Trie is so much faster when searching for word suggestions within a certain edit distance. Figure 3.5 shows how three words, dependable, dependent and depending, which share the prefix depend, are stored in the Trie. When the Trie
Figure 3.4: Word suggestions performance comparison, with a fail rate of 10%, between Trie and hashSet
Figure 3.5: Trie of the words DEPENDABLE, DEPENDENT and DEPENDING
searches, the comparison for the shared prefix depend is performed only once and reused for all three words, whereas the hash table must run the full comparison against every dictionary word, even if the word is completely different.
The results from this test clearly highlighted the effectiveness of the Trie over the hash set. We have therefore chosen the Trie structure and tailored all further algorithms to it.
3.3.4 Improving the Trie
Performance of the Trie can be further improved by adding an integer value, longestWord, to each node. longestWord holds the length of the longest word that exists in the node's subtree; a node whose subtree contains no word longer than nine letters has, for example, longestWord value nine. When the algorithm searches for a word with 14 letters and a
max edit-distance of three, meaning only words that are at least 11 characters long are of interest, it will be able to tell from the first node d that no such word can be found in the subtree. This way the algorithm can skip some subtrees entirely after visiting only their first node. As an example, we searched for the word radioanstonomer with a max edit distance of three: the algorithm visited 2332 words before implementing longestWord and 775 words after. However, short words showed no improvement in performance.
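The length check amounts to one comparison at the root of each subtree; a small sketch (the helper name is ours):

```java
class LongestWordPruning {
    // During insert, every node on the path keeps the maximum word length
    // seen below it, as in the thesis's node class (Appendix D.1):
    //     node.longestWord = Math.max(node.longestWord, word.length());
    //
    // During suggestion search, a subtree can be skipped outright when even
    // its longest word is too short to be within maxDist of the searched word.
    static boolean canSkipSubtree(int longestWord, int searchedLength, int maxDist) {
        // E.g. searching a 14-letter word with maxDist 3: a subtree whose
        // longest word has fewer than 11 letters cannot hold a candidate.
        return longestWord < searchedLength - maxDist;
    }
}
```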
For dictionary lookup, the hash set is the faster method. Combining the hash set and the Trie could be a way to improve performance at the cost of memory: once the hash set signals that a word is misspelled, the Trie is used to generate the word suggestions. This was never done, however, for the reasons mentioned in Section 4.2.2.
3.4 Ranking words
As mentioned in Section 2.2.2, many techniques have been suggested to help solve the word ranking problem, everything from edit distance to advanced machine learning. The following techniques are most suited to us and within our scope: the Damerau-Levenshtein distance algorithm, Soundex, keyboard distance and probability. Depending on various factors and the results of the above-mentioned algorithms, the suggested words are ranked differently.
Edit-distance
Edit distance is the first step for generating the word suggestions. The max edit-distance used for the word search is as follows:
max edit-dist = { 1,                if word.len ≤ 3
                { ⌈word.len × R⌉,   if word.len > 3        (3.1)

where R = 0.3; R could be raised to allow more errors.
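Equation 3.1 translates directly into a small helper (a sketch; the method name is ours):

```java
class MaxEditDistance {
    static final double R = 0.3; // raising R allows more errors

    // Equation 3.1: short words allow exactly one edit, longer words
    // allow a number of edits that scales with their length.
    static int maxEditDist(int wordLen) {
        return wordLen <= 3 ? 1 : (int) Math.ceil(wordLen * R);
    }
}
```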
After a word list with suggested words has been generated, the algorithm could simply sort all the words by edit distance, where a lower edit distance corresponds to a better suggestion. But, as discussed in Section 2.2.1, that is not enough for a robust spellchecker.
Soundex
The Soundex algorithm does its best to cover phonetic errors. We have somewhat altered the classic Soundex algorithm for our program. In standard Soundex, the first letter is left as it is, and the following letters are encoded as digits only until the code is four characters long. For example, for the long word nonpredictable the Soundex code is N516: the first letter of the word is N, all vowels encode to zeros so they are ignored, the next n is in group 5, p is in group 1 and r is in group 6. The last digit 6 represents the letter r, and the rest of the word is ignored. Our program encodes the entire word, generating the extended Soundex code N51632314. We made that change to cover errors in the latter part of the word. For example, the similarly sounding misspelling nonpredictaple also encodes as N51632314, but the misspelling nonpredictarle gets a different code, N51632364. The implementation of the Soundex encoder, with altered Soundex groups, can be seen in Appendix D.2.
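A full-word Soundex encoder in this spirit can be sketched as follows. It uses the standard English Soundex groups, whereas the thesis uses altered groups for Swedish (Appendix D.2), so the real program's codes may differ; with the standard groups, however, it reproduces the codes discussed above:

```java
// Full-word Soundex sketch. Unlike classic Soundex, the whole word is
// encoded rather than just the first three digit groups, so errors late
// in the word still change the code.
class FullWordSoundex {
    private static char code(char c) {
        switch (Character.toLowerCase(c)) {
            case 'b': case 'f': case 'p': case 'v': return '1';
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return '2';
            case 'd': case 't': return '3';
            case 'l': return '4';
            case 'm': case 'n': return '5';
            case 'r': return '6';
            default: return '0'; // vowels (and h/w, in this simplified sketch)
        }
    }

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        out.append(Character.toUpperCase(word.charAt(0)));
        char prev = code(word.charAt(0));
        for (int i = 1; i < word.length(); i++) {
            char c = code(word.charAt(i));
            // Skip vowels and collapse adjacent letters with the same code;
            // a vowel in between resets prev, so repeats after vowels count.
            if (c != '0' && c != prev) out.append(c);
            prev = c;
        }
        return out.toString();
    }
}
```

Note that nonpredictaple encodes the same as nonpredictable because b and p share group 1, while the r in nonpredictarle introduces a 6 and changes the code.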
Keyboard distance
Keyboard distance covers typographical errors. Each QWERTY keyboard key has neighboring keys associated with it; for example, the key S has the following neighbors: A, Q, W, E, D, C, X and Z. We have created a small static database for each keyboard key using Java's hashMap, which can be seen in Appendix D.3. Each keyboard key is a hashMap key, and all its neighboring keys are stored in a list as the hashMap value for the associated key. Consider the misspelling {rest → reat}, where the substituted characters s and a are keyboard neighbors.
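The static neighbor database can be sketched like this (only a few keys are shown; the thesis's full map is in Appendix D.3, and the helper name areNeighbors is ours):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each key maps to the keys physically adjacent to it on a QWERTY keyboard.
class KeyboardNeighbors {
    static final Map<Character, List<Character>> NEIGHBORS = new HashMap<>();
    static {
        NEIGHBORS.put('s', Arrays.asList('a', 'q', 'w', 'e', 'd', 'c', 'x', 'z'));
        NEIGHBORS.put('a', Arrays.asList('q', 'w', 's', 'z'));
        NEIGHBORS.put('t', Arrays.asList('r', 'y', 'g', 'f'));
    }

    // True if the two characters sit next to each other on the keyboard,
    // which makes a substitution between them a likely typo, as in
    // rest -> reat where s was replaced by its neighbor a.
    static boolean areNeighbors(char a, char b) {
        List<Character> n = NEIGHBORS.get(Character.toLowerCase(a));
        return n != null && n.contains(Character.toLowerCase(b));
    }
}
```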
Probability
As mentioned in Section 2.2.2 about Bayes' theorem, to use the probability method in a spellchecker it has to be modelled appropriately. The two main parameters in formula 2.2 are P(A|B) and P(A). We have decided to model P(A|B) after the edit distance, meaning the lower the edit distance, the higher the probability. For P(A) we have parsed Wikipedia articles and counted the number of times each word occurs in them; the larger the occurrence count, the higher the probability. We have expanded the general probability formula to include other factors that take keyboard distance and Soundex into account. The next section describes the formula.
Combining the algorithms
We have devised a point system that combines the methods discussed in previous sections. The goal of the point system is to order a list of dictionary words so that the most probable words, related to the misspelled word, are higher on the list after sorting the list in ascending order based on the points. For two strings, a and b, where a is a misspelled word and b is a dictionary word, points for b are calculated in the following way:
4 × editDistance − keyboard − Soundex = G        (3.2)

points = 2000000 × G − wikiOccurrence            (3.3)
editDistance is the edit distance between a and b. keyboard is either zero or one and indicates whether mismatching characters at the same position in the two strings are keyboard neighbors. Soundex is also either zero or one: zero if a and b have different Soundex codes and one if the codes are the same. Finally, wikiOccurrence is the number of times b appears in Wikipedia.
The suggested words are first divided into groups based on the value of equation 3.2. For example, consider a misspelled word taie and two dictionary words take and tare. Equation 3.2 would evaluate take to three, since 4 × 1 − 1 − 0 = 3, and tare to four, effectively putting them into two different groups. Multiplying by 2000000 further separates the groups, so that when the wikiOccurrence subtraction is applied the groups do not change; wikiOccurrence only moves words within a group. Consider another word tale; like take, it would also evaluate to three. The word that occurs more often (has a higher wikiOccurrence value) gets fewer points and a higher position on the list.
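Equations 3.2 and 3.3 combine into one small scoring function (a sketch; the parameter and method names are ours, and lower points mean a better suggestion):

```java
// Point system from equations 3.2 and 3.3: the multiplier 2000000 keeps
// words in separate edit-distance groups, while wikiOccurrence only
// reorders words inside a group (assuming occurrence counts stay below
// the multiplier).
class SuggestionScore {
    static long points(int editDistance, int keyboard, int soundex,
                       long wikiOccurrence) {
        long g = 4L * editDistance - keyboard - soundex;  // equation 3.2
        return 2_000_000L * g - wikiOccurrence;           // equation 3.3
    }
}
```

For the taie example: take has edit distance 1 with a keyboard-neighbor substitution, so G = 3 and its points start at 6000000, while tare starts at 8000000 and can never overtake take through occurrence counts alone.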
3.5 Summary
Chapter 4
Results and evaluation
This chapter presents the results, evaluates the methods and algorithms implemented in the previous chapter, and highlights some of the problems we had during the implementation phase. Section 4.1 summarises the results of this project. Section 4.2 evaluates the results. Section 4.3 highlights some of the problems and limitations we encountered during the implementation phase. Finally, Section 4.4 summarises this chapter.
4.1 Results
The main goal set for the project was to create a spellchecker that provides relevant word suggestions for misspelled words. This has been accomplished. The dictionary has been created by parsing Wiktionary and Wikipedia XML dumps, and the spellchecker uses this dictionary as the source of correct words. The dictionary is stored in a Trie rather than a hash set because the Trie proved more efficient in our tests. Four main algorithms are used to compare and produce relevant word suggestions from the Trie: Soundex, keyboard distance, probability and the Damerau-Levenshtein distance algorithm.
The spellchecker can only detect a misspelled word if the dictionary does not contain it.
The spellchecker was successfully integrated with the main program, as shown in Figure 1.2. Parts of the source code can be found in Appendix D and a UML diagram in Appendix E.1. The SpellChecker class, which implements the interface ISpellCheker, is the main communication point with the 2c8 modeling tool. The 2c8 modeling tool sends a string to the spellCheck method, which runs the string through all the algorithms mentioned in the previous chapters and returns a list of suggested words if the word was misspelled.
4.2 Evaluation
This section evaluates major parts of the project.
4.2.1 Dictionary
As mentioned in Section 3.1, a complete dictionary has to be attained for a reliable spellchecker, and there are not many reliable open source resources available. Wiktionary [12] has been used as a source for our dictionary.
These parsing issues affect the quality of the dictionary; e.g. some spelling suggestions contain words from one of the groups mentioned above, or similar ones. Considering the number of words the resulting dictionaries hold, it is very difficult to assess the dictionary's quality.
4.2.2 Hash vs Trie
We have successfully compared the hash table and the Trie methods. Section 3.2.3 compares dictionary lookup performance for the hashSet and the Trie: the result, shown in Figure 3.2, is that the hashSet is a faster method than the Trie for checking whether the dictionary contains a specific word. When the hashSet and the Trie are compared for generating word suggestions in Section 3.3.3, the results are reversed dramatically, as can be seen in Figure 3.4.
Section 3.3.4 mentions that it could be possible to combine a hashSet and a Trie by using the hashSet for error detection and the Trie for error correction. However, as seen in Figure 3.2, even for thousands of words the Trie dictionary lookup takes under 20 ms, which is practically unnoticeable for the user. The final implementation of the spellchecker therefore does not use a hash set in any way: the dictionary is stored in the Trie structure, and the Damerau-Levenshtein distance algorithm is applied to the Trie to generate word suggestions.
4.2.3 Nonword error detection
For nonword error detection, a news article written by Nicholas Ringskog Ferrada-Noli [15], which can be found in Appendix B, was processed by the spellchecker. The objective of this test was to evaluate the quality of the dictionary on an arbitrary prewritten Swedish article.
In the provided example, out of 217 correctly spelled words, three were marked as incorrectly spelled: kongeniala, konsertarrangör and språngbräda, which can be found on lines 7, 12 and 13 respectively. For this specific example the false positive rate is 1.4%. For other, randomly chosen articles the false positive rate is about the same, averaging 2%.
4.2.4 Isolated-word error correction
To evaluate isolated-word correction, we tested the spellchecker's word suggestions for each of the low-level problems mentioned in Section 2.1.2: substitution, insertion, deletion and transposition. Spelling errors can occur for different reasons: random typographical errors, typographical errors due to the typist's clumsiness, and phonetic errors. We showcase each of these types of misspelling for each low-level problem. For typographical errors and phonetic errors, the keyboard distance algorithm and the Soundex algorithm, respectively, are the main ranking factors for word suggestions. For random typographical errors, the keyboard distance or Soundex algorithm becomes the main ranking factor only if the random error happens to be typographical or phonetic in nature; otherwise the ranking has to rely on edit distance and probability alone.
Since the relevance of a suggested word is subjective, we compared our suggestions for misspelled words with SAOL's [16] spelling suggestions and Google's spelling suggestions generated by Google Docs. Misspelled words were submitted out of context to Google Docs. We have divided the errors into groups related to the above-mentioned misspelling reasons. All the suggestion comparisons are presented in tables in Appendix C.
Substitution
In a substitution error, one or more characters are substituted with another character, and another substitution is required to correct it. {motorcykel → motorcycel,
clumsiness. {segelbåt → selelbåt, segelbåt → selelbåg} are examples of random substitution corrections. See Tables C.1, C.2 and C.3.
Insertion
In an insertion error, one or more characters are added to a correctly spelled word, and a deletion is required to correct it. Random insertion {kampanj → katmpanj, kampanj → katmpmanj}, accidental insertion due to clumsiness {lastbil → lagstbil, lastbil → lagstbqil} and phonetic insertion {polaritet → polanritet, polaritet → polanmritet}. See Tables C.4, C.5 and C.6.
Deletion
With deletion, one or more characters are missing from a correctly spelled word, and an insertion is required to correct it. Random deletion {flygplan → flygpln, flygplan → fygpln} and phonetic deletion {utveckling → utvekling, utveckling → utveklin}. See Tables C.7 and C.8.
Transposition
For transposition, two nearby characters in a correctly spelled word are swapped; to correct a transposition, another transposition of the same two characters is required. Random transposition {kultur → kutlur, kultur → kutlru}, accidental transposition {motordelen → motoredlen, motordelen → mtooredlen} and phonetic transposition {polisbricka → polisbrikca, polisbricka → poilsbrikca}. See Tables C.9, C.10 and C.11.
4.3 Problems
This section highlights some of the problems that occurred during the project.
4.3.1 Parsing
Wiktionary
In Wiktionary some words do not have tags indicating their category. This has forced us to develop extensive filters to remove unwanted words.
Wikipedia
In Section 3.1 it is mentioned that not everyone who edits pages on Wiktionary and Wikipedia follows the same conventions, and that the raw source code structure can be unorganized. This was evident in some parts of the generated dictionary.
When Wikipedia was parsed for the probability metric, we added words that appear in Wikipedia but not in Wiktionary in order to expand our dictionary. However, some words in the Wikipedia articles were misspelled, and those words were added to our dictionary because, under our assumption that Wikipedia content is correctly spelled, the program could not decide whether a word was correct. Therefore we had to manually review the dictionary to remove misspelled words. Most of these words were removed by counting total word occurrences and excluding words which appeared fewer than ten times in the entire Wikipedia dump.
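The frequency-based cleanup step can be sketched as a simple filter over the word counts (the names here are ours):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Words harvested from Wikipedia are kept only if they occur at least
// minCount times in the whole dump, which removes most one-off misspellings
// at the cost of also dropping genuinely rare words.
class DictionaryFilter {
    static Set<String> keepFrequent(Map<String, Integer> wikiCounts, int minCount) {
        return wikiCounts.entrySet().stream()
                .filter(e -> e.getValue() >= minCount)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}
```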
4.3.2 Limitations
Inadequate language knowledge
Downside of Wiktionary
Whether they are well-known words used in serious texts or lesser-known slang words used on the internet, words can be added to Wiktionary by any user. This creates a limitation, because the content of Wiktionary is based on subjective opinions rather than an objective standard such as SAOL.
Shortage of authentic errors
Texts published by a university or some authority are always proofread. Therefore, it is hard to find serious texts that include phonetic or typographical errors. We tried to create texts with misspellings to evaluate the program more extensively. However, in these tests we wrote the misspelled words on purpose, while real users will not misspell words on purpose. So a big limitation for this project was the shortage of texts with authentic errors to test the program.
4.4 Summary
Chapter 5
Conclusion
5.1 Project Evaluation
The project has progressed well, and the schedule set with our supervisor at the university has been followed. We have been at the premises of 2c8 throughout the period, and they have provided the necessary technical equipment. When thoughts and problems have arisen, our 2c8 supervisor has always been helpful in the best possible way. Due to the restrictions of the COVID-19 pandemic during the latter half of the project, contact with our supervisor at the university was conducted only through digital means. This has not had any significant impact on the project.
5.2 Future Work
This section provides some suggestions for future work to improve the spellchecker and some missing features that we did not manage to implement within the time limit.
Dictionary
base words. A deeper knowledge of the language in question is required, but if done correctly the dictionary should be free of undesired words.
Combining algorithms and point based system
In equation 3.4 we devised a point system, based on equation 2.2, which takes all the algorithms into account. This is by no means a perfect point system: it could be adjusted to improve the suggestion rankings, or scrapped for another point/ranking system entirely.
Context dependent word correction
We have not managed to get anything done for context-dependent word correction. This is the hardest problem in spellchecking and requires either a very well-defined set of linguistic rules or a massive amount of example data. Our initial idea was to implement context-dependent word correction for the Swedish words de, dem, en and ett. When parsing Wikipedia articles we would store all the neighboring words that surround one of the above-mentioned words, thus creating some form of context for them.
Personal dictionary
The importance of creating a personal dictionary was presented in Section 2.3. As mentioned, every industry and company uses words which they have a definition for but which are not included in national or other dictionaries. It would be a desirable addition to the program if it were possible to create a personal dictionary. Expressing oneself through accepted industry words is common, and users should not be limited because the program marks these words as incorrect.
Soundex
Soundex was originally designed for the English language. We have made some adjustments for Swedish, but people who have a more advanced education in the Swedish language would most likely be able to improve Soundex for the spellchecker.
5.3 Concluding remarks
Spellchecking is an old problem in computer science, which boils down to string manipulation and string comparison. Many solutions exist, from very simple to very advanced. Big companies such as Facebook, Google and Microsoft have access to vast resources and data, and most likely do not use the methods we implemented in this project but rather more advanced techniques such as AI and machine learning. With the resources we have had and the scope of this project, our resulting product is adequate, in our opinion of course.
During the development of the spellchecker it became clear just how important the quality of the dictionary is. Without a proper set of correctly spelled words, it does not matter how good the spellchecking methods and algorithms are. From the beginning, we did not think we would have to spend so much time building an adequate dictionary. We have come to the conclusion that it is better if the dictionary is missing some words rather than containing unwanted words, especially with the addition of a personal dictionary, where the user can add both custom words and false positives.
References
[1] K. Kukich, "Techniques for automatically correcting words in text", ACM Computing Surveys (CSUR), vol. 24, no. 4, pp. 377–439, 1992.
[2] F. J. Damerau, “A technique for computer detection and correction of spelling errors”, Communications of the ACM, vol. 7, no. 3, pp. 171–176, 1964.
[3] R. A. Wagner and M. J. Fischer, “The string-to-string correction problem”,
Journal of the ACM (JACM), vol. 21, no. 1, pp. 168–173, 1974.
[4] (Feb. 2020). The first three spelling checkers. [Online; accessed 19. Feb. 2020], [Online]. Available: https://web.archive.org/web/20121022091418/http: //www.stanford.edu/~learnest/spelling.pdf.
[5] D. Knuth, "The Art of Computer Programming, vol. 3 (Sorting and Searching)", Addison-Wesley Publishing Company, vol. 3, pp. 481–489, 1973.
[6] R. De La Briandais, "File searching using variable length keys", Papers presented at the March 3–5, 1959, Western Joint Computer Conference, pp. 295–298, 1959.
[7] Y. Xu and J. Wang, “The Adaptive Spelling Error Checking Algorithm based on Trie Tree”, Atlantis Press, Jul. 2016, issn: 2352-5401.
[8] M. Odell and R. Russell, "U.S. Patent Numbers 1,261,167 (1918) and 1,435,663 (1922)", Technical report, Patent Office, Washington, 1918.
[9] K. Vännman, Matematisk statistik. Lund: Studentlitteratur, 2002.
[11] Apache NetBeans. (Mar. 2020). Welcome to Apache NetBeans. [Online; accessed 5. May 2020], [Online]. Available: https://netbeans.apache.org.
[12] (Mar. 2020). Wiktionary. [Online; accessed 21. Apr. 2020], [Online]. Available: https://www.wiktionary.org.
[13] (Oct. 2018). SAXParser (Java Platform SE 7 ). [Online; accessed 21. Apr. 2020], [Online]. Available: https://docs.oracle.com/javase/7/docs/api/ javax/xml/parsers/SAXParser.html.
[14] (Oct. 2018). HashSet (Java Platform SE 7 ). [Online; accessed 21. Apr. 2020], [Online]. Available: https://docs.oracle.com/javase/7/docs/api/java/ util/HashSet.html.
[15] N. R. Ferrada-Noli. (Apr. 2020). Recension: Barnen djupt saknade när Pippi fi-rar 75 år på Berwaldhallen. [Online; accessed 28. Apr. 2020], [Online]. Available: https://www.dn.se/kultur- noje/recension- barnen- djupt- saknade-nar-pippi-firar-75-ar-pa-berwaldhallen/.
Appendix A
Table of characters
’ - A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Å Ä Ö
a b c d e f g h i j k l m n o p q r s t u v w x y z å ä ö
Appendix B
Review text
1  Som tur är finns Berwaldhallen Play och Pippi Långstrump 75 år visas på detta vis
2  istället. Men för vem? Hela poängen med att göra en konsert som låter barn komma
3  till ett fint konserthus och höra en livs levande symfoniorkester är ju just det, den
4  fysiska upplevelsen, skriver Nicholas Ringskog Ferrada-Noli.
5  Böckerna står visserligen på egna ben, men frågan är om Astrid Lindgrens berät-
6  telser hade fått samma fäste i svensk kultur om det inte vore för filmerna och tv-
7  serierna och för de kongeniala sångerna som komponerades till dessa. Hade verkli-
8  gen Pippi Långstrump kunnat bli en sådan ikonisk figur om det inte vore för Jan
9  Johanssons Här kommer Pippi Långstrump, som går rakt in i en treårings hjärta
10  med sin berusande rytm och triumfatoriska melodi? Eller Georg Riedels sånger, vars
11  bästa melodier är lika lysande självklara som Benny Anderssons eller Max Martins.
12  Om man som konsertarrangör vill få små barn att upptäcka musikens magiska
13  värld är det därför logiskt att välja just Pippi Långstrump som språngbräda. Pippi
14  Långstrump 75 år är visserligen inte den första satsningen Berwaldhallen gör för
15  barn, men var på planeringsstadiet ovanligt ambitiös: fyra sjungande skådespelare,
16  en regissör från Astrid Lindgrens Värld och ett helt nytt verk, Benjamin Staerns
17  Pippi lyfter hästen. Konserten sålde slut. Sedan kom coronaviruset till Sverige och
18  alla publika konserter ställdes in.
Appendix C
Table C.1: Word suggestions for substitution errors for fotbollsplan.
Edit-distance
1 2
fotbollsplam forbollsplam
Our SAOL Google Our SAOL Google
fotbollsplan fotbollsplan fotbollsplan fotbollsplan fotbollsplan fotbollsplan fotbollslag fotbollsplans fotbollslag fotbollsplans
fotbollsplans fotbollsspelaren fotbollsplans fotbollsspelaren fotbollshall fotbollsspelare fotbollshall fotbollsspelare fotbollspris fotbollsplanen fotbollspris fotbollsplanen fotbollsklub fotbollsspelarna fotbollsklub fotbollsspelarna
Table C.2: Word suggestions for substitution errors for word motorcykel.
Edit-distance
1 2
motorcycel motorkycel
Our SAOL Google Our SAOL Google
motorcykel motorcykel motorcykel motorcykel motorcykel motorcykel motorcykeln motorcykeln motorcykeln motorcykeln
motorcykels motorcykels motortypen motorcykels motorcyklel motorcykelns motortyper motorskydden motorbyte motocrossen motorcykels motorskyddet motorcicles motorcyklar motorfel motorcykelns
Table C.3: Word suggestions for substitution errors for word segelbåt.
Edit-distance
1 2
selelbåt selelbåg
Our SAOL Google Our SAOL Google
segelbåt segelbåt segelbåt segelbåt cirkelbåge selbåge segelbåts segelbåts segelbar segelbåt
54
Table C.4: Word suggestions for insertion errors for word kampanj.
Edit-distance
1 2
katmpanj katmpmanj
Our SAOL Google Our SAOL Google
kampanj kampanj kampanj kampanj kampanj kampanj kampanja hatkampanj kampvana kardemummans
kampanjs champagne kampanjs matmammans skampanj kampanja kampanja kampsång kampanil kampanjs skampanj komplimang
skyttekompani kardemumman
Table C.5: Word suggestions for insertion errors for word lastbil.
Edit-distance
1 2
lagstbil lagstbqil
Our SAOL Google Our SAOL Google
lastbil lastbil lastbil lastbil lastbil lastbil lagstil blixtbild lagstil blixtbild
lastbils lackstövel lastbils lastcykel lagspel lastbils lastbils flygstil taxibil taxibil labil lyxbil lagspel
Table C.6: Word suggestions for insertion errors for polaritet.
Edit-distance
1 2
polanritet polanmritet
Our SAOL Google Our SAOL Google
polaritet polaritet polaritet polaritet polarområdet polaritet polaritets polaritets polaritets polaritet
paritet popularitet olinjaritet polarområdets polarisen polariteter planarbetet koloniområdet bipolaritet polariteten bipolaritet polarområdes
Table C.7: Word suggestions for deletion errors for utveckling.
Edit-distance
1 2
utvekling utveklin
Our SAOL Google Our SAOL Google
utveckling utveckling utveckling utveckling utveckling utveckling utvecklings utvecklings utveckla utvecklaren
utvecklling utväxling utvecklas utvecklings uveckling utvecklingen utvecklat utväxling utpekning utvikning utvecklar utvecklande
utvikning utsegling utvecklad utvikning
Table C.8: Word suggestions for deletion errors for flygplan.
Edit-distance
1 2
flygpln fygpln
Our SAOL Google Our SAOL Google
flygplan flygplan flygplan flygplan flygplan flygplan flygeln flygeln fågeln vägplan
flygel flygplatsen flygeln vågplan flygplans flygplans tygeln byggplan
flygelns flygbladen bygeln fågeln flygen flygbilden flygeln
Table C.9: Word suggestions for transposition errors for kultur.
Edit-distance
1 2
kutlur kutlru
Our SAOL Google Our SAOL Google
kultur kultur kultur kultur kultur kultur kullar kulturs kuttra kulturs
56
Table C.10: Word suggestions for transposition errors for word motordelen.
Edit-distance
1 2
motoredlen mtooredlen
Our SAOL Google Our SAOL Google
motordelen motmedlen motordelen motordelen tåredalen motordelen motordelar motorleden morellen stormelden
motoraxeln motorhotellen motmedlen motorbåten mutsedeln motorleden motellen motorfelen naturmedlen morellen naturmedlen maktmedlen
Table C.11: Word suggestions for transposition errors for polisbricka.
Edit-distance
1 2
polisbrikca poilsbrikca
Our SAOL Google Our SAOL Google
polisbricka polisbricka polisbricka polisbricka polisbricka polisbricka polisbrickan mosbricka polisbrickan
polisbrickas polisbrickas polisbrickans plastbricka polisbrickorna polisbrickans
Appendix D
Source Code
Listing D.1: Trie node class.
package com.mycompany.spellchecker;

import java.util.HashMap;
import java.util.Map;

public class Trie {
    String word;
    int longestWord = Integer.MIN_VALUE;
    int probability = 0;
    Map<Character, Trie> children = new HashMap<>();
}
Listing D.2: Soundex encoding class.
public class Soundex
{
    public String getGode(String word)
    ...
            wordArray[0];

        for (int i = 1; i < wordArray.length; i++)
            if (wordArray[i] != wordArray[i - 1] && wordArray[i] != '1')
                output += wordArray[i];

        return output;
    }
}
Listing D.3: QWERTY Keyboard Map.
import java.util.HashMap;

public class KeyboardMap {