
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer Science

Bachelor thesis, 16 ECTS | Datateknik

2016 | LIU-IDA/LITH-EX-G--16/069--SE

Spell checker in CET Designer

Rasmus Hedin

Supervisor: Amir Aminifar
Examiner: Zebo Peng


Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Rasmus Hedin


Abstract

A common feature of text input tools is spell checking. It exists in search engines, email clients and of course in word processors such as Microsoft Word. A spell checker that works while you type lets you be more efficient than if you had to check the spelling with a separate proofing tool. Spell checking is a common request from the users of the room planning software CET Designer, which is developed by Configura. In this thesis the Windows spell checking API is evaluated and compared to alternative spell checkers. A prototype of a spell checker integrated in the CET Designer text tool is then implemented with the Windows spell checking API.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Scope
  1.5 Background

2 Theory
  2.1 Spelling error detection
  2.2 Spelling error correction
  2.3 Evaluation metrics

3 Method
  3.1 Pre-study
  3.2 Requirement specification
  3.3 Implementation

4 Results
  4.1 Pre-study
  4.2 Implementation

5 Discussion
  5.1 Results
  5.2 Method
  5.3 The work in a wider context

6 Conclusion
  6.1 Research questions
  6.2 Requirement analysis
  6.3 Future work


List of Figures

3.1 Communication between C++ DLL, C# exe and CM

3.2 Windows language bar. Current input language is English. The language can be switched by clicking it or by pressing Windows key+Space.

4.1 Words/second for test file 1 and 2

4.2 Overall harmonic mean for test file 1 and 2

4.3 Misspelled words marked with wavy underline


List of Tables

3.1 Specs of the computer used for evaluation
4.1 Spell checker performance and accuracy (about 6000 words)
4.2 Spell checker performance and accuracy (about 30000 words)


1 Introduction

1.1 Motivation

CET Designer is a room planning tool in which end users often add information text and labels to the reports they create. The text tool does not have a spell checker, and adding one is a common request from end users. A spell checker would make text handling easier and faster for them. Today, almost every application where users input text can check the spelling. Correctly spelled text appears far more professional than text containing spelling errors, and appearing professional is important for making clients feel that they can trust a company.

If a user of the current version of CET Designer wants to spell check a text, a separate program needs to be used. Once the text has been checked it can be inserted into the CET Designer reports. This takes time and effort, and if users feel that it is not worth checking the spelling in a separate program, unchecked texts may end up in the reports.

If a spell checker is integrated into CET Designer, texts can easily be checked. This leads to more efficient spell checking compared to using a separate spell checking program.

1.2 Aim

The goal of this thesis is to investigate whether the Windows built-in spell checker is good enough for CET Designer or whether another spell checker is more suitable. The most suitable alternative will then be used for the spell checker that is integrated into the CET Designer text tool.

1.3 Research questions

With this thesis work, we would like to answer the following research questions:

• Can the Windows spell checking API be used to check spelling in CET Designer's text tool?
• What alternatives exist to check the spelling of a text in CET Designer's text tool?
• Which alternative gives the best performance and accuracy?

1.4 Scope

For this thesis, only a prototype of an integrated spell checker will be implemented. Only the performance of the Windows spell checking API and other suitable solutions will be evaluated, not the techniques used to check spelling.

1.5 Background

1.5.1 Configura

Configura is a company operating globally, with its headquarters in Linköping, Sweden, and commercial operations in Grand Rapids, Michigan, USA and Kuala Lumpur, Malaysia. The company, which is privately owned, was founded in 1990 and has over 110 employees worldwide. Configura creates Parametric Graphical Configuration (PGC) software solutions for leading international industries. PGC is a development framework for implementing fast, efficient and intuitive software for graphical configuration. The software is suitable for office furniture but can also be used for kitchen & bath, material handling and industrial machinery. Software developed by Configura includes CET Designer, Configura (the original software platform) and InstantPlanner.

1.5.2 CET Designer

CET Designer is space-planning software that makes it easier to specify and sell products in a variety of industries. The software is a complete solution that handles every step of the sales and order process quickly and accurately. In the 2D and 3D virtual environments the user can simply drag and drop components, and the software calculates pricing to prevent the user from making calculation mistakes.

1.5.3 CM

Dissatisfaction with C++, and the long work cycles caused by having to restart and recompile the application after any source code change, motivated the development of CM (Configura Magic).

CM is an object-oriented programming language that supports extensible syntax and incremental development. Allocated memory for objects and values is automatically reclaimed by a garbage collector. The source files are compiled to machine code, but just-in-time (JIT): the compiler only translates code into machine code when it can no longer delay doing so. Code can be changed and added during application runtime since compilation is interleaved with execution. This leads to shorter development cycles without losing performance, and developers get feedback on the code they are working on faster.

CET Designer is programmed in CM, and therefore CM will be used for the integration of a spell checker. CM integrates well with DLLs and the Windows API.

2 Theory

In this chapter we will introduce the concept of spell checkers and spell correctors. We will also describe some metrics that can be used to evaluate spell checkers.

2.1 Spelling error detection

Liang [7] and Peterson [11] define two types of spelling programs: spell checkers and spell correctors. Spell correctors are explained in the next section. A spell checker is given an input text and detects the words that are incorrectly spelt in a given language. A word is defined as a continuous string of characters that exists in the given language. A non-word, on the other hand, is a string that is not found in a given word list or dictionary, or whose word form is incorrect. Liang [7] describes two main techniques used for spell checking: dictionary lookup and n-gram analysis.

Spell checkers can combine several techniques to check the spelling of words. Analyzing which techniques perform best and which are used by different spell checkers is out of the scope of this thesis, but it is very likely that spell checkers use a combination of techniques.

2.1.1 Dictionary lookup

With dictionary lookup, the spell checker first divides the given text into words and possible words, i.e. removes blank spaces, numbers, special characters and duplicate words. The next step is to detect the misspelled words: if a word is not found in the dictionary it is reported as misspelled. Pollock [12] points out that the dictionary must be highly efficient and suggests that it can be stored as hash codes or as patterns of bits distributed over a long string.
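
As a minimal sketch of the dictionary lookup technique (an illustration only, not code from the thesis; the naive tokenization and the toy dictionary are assumptions), a hash-set based check in C++ could look as follows:

#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Return the words of `text` that are not found in `dictionary`.
// Tokenization is deliberately naive: split on whitespace, keep letters only.
std::vector<std::string> findMisspelled(const std::string& text,
                                        const std::unordered_set<std::string>& dictionary) {
    std::vector<std::string> misspelled;
    std::istringstream stream(text);
    std::string token;
    while (stream >> token) {
        std::string word;
        for (char c : token)
            if (std::isalpha(static_cast<unsigned char>(c)))
                word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (!word.empty() && dictionary.count(word) == 0)
            misspelled.push_back(word);      // not in the dictionary: report as misspelled
    }
    return misspelled;
}

int main() {
    // Toy dictionary; a real one would be loaded from a word list file.
    std::unordered_set<std::string> dictionary = {"this", "is", "a", "test"};
    for (const auto& w : findMisspelled("This iz a test", dictionary))
        std::cout << w << " is not in the dictionary\n";
}

A hash set gives close to constant-time lookups, which is in line with Pollock's point that the dictionary storage must be efficient.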

2.1.2 n-gram analysis

An n-gram is a subsequence of n consecutive letters of a word; n-grams of length 1, 2 and 3 are called unigrams, bigrams and trigrams respectively. When using n-grams for spell checking one needs a table containing the probability of each n-gram being valid in the language. This table is pre-compiled from a sizable corpus in which the frequency of each n-gram is counted. The frequency counts are then stored for each n-gram in some data structure, e.g. an array or a vector, so they can be efficiently obtained and compared. This is a probabilistic approach to spell checking.

When spell checking, an input word is decomposed into n-gram elements. Each element is then given a weight depending on the frequency of that element in the corpus. The weights are used to calculate a peculiarity index. A pre-defined threshold is used to determine the peculiarity of an n-gram: n-grams below the threshold are considered the most peculiar and are flagged as potential errors. Words with potential errors may be considered misspelled. Since the method is probabilistic, it can flag words incorrectly.
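
A minimal sketch of the n-gram idea (an illustration only, not code from the thesis; the trigram table, the threshold and the simple "rare trigram" rule are hypothetical simplifications of the weighted peculiarity index described above):

#include <iostream>
#include <string>
#include <unordered_map>

// Pre-compiled trigram frequency table (in practice built from a large corpus).
using TrigramTable = std::unordered_map<std::string, int>;

// Flag a word as a potential error if any of its trigrams falls below a
// frequency threshold. The threshold value here is purely illustrative.
bool looksPeculiar(const std::string& word, const TrigramTable& table, int threshold = 2) {
    if (word.size() < 3)
        return false;                          // too short to form a trigram
    for (size_t i = 0; i + 3 <= word.size(); ++i) {
        auto it = table.find(word.substr(i, 3));
        int frequency = (it != table.end()) ? it->second : 0;
        if (frequency < threshold)
            return true;                       // rare trigram: flag as potential error
    }
    return false;
}

int main() {
    TrigramTable table = {{"the", 120}, {"her", 45}, {"ere", 30}};   // toy counts
    std::cout << std::boolalpha
              << looksPeculiar("there", table) << "\n"    // false: all trigrams are common
              << looksPeculiar("thxre", table) << "\n";   // true: "thx" and "hxr" are unseen
}

Note that no dictionary is needed; the decision is based entirely on how typical the letter sequences are, which is why this approach can both miss real errors and flag unusual but correct words.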

2.2 Spelling error correction

A spell corrector both detects spelling errors and tries to suggest the most likely correct words. When the misspelled words have been found, the spell corrector can be used to find suggestions for the correct spellings. To do this it also uses the dictionary and provides a list of suggestions. The words in this list can be ordered by some estimate of how likely it is that a suggestion is the intended word.
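
The thesis does not specify how the evaluated spell checkers rank their suggestions; purely as an illustration, one common approach is to rank dictionary words by Levenshtein (edit) distance to the misspelled word, roughly as in this sketch:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein distance.
int editDistance(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); ++i) d[i][0] = static_cast<int>(i);
    for (size_t j = 0; j <= b.size(); ++j) d[0][j] = static_cast<int>(j);
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({d[i - 1][j] + 1,                             // deletion
                                d[i][j - 1] + 1,                             // insertion
                                d[i - 1][j - 1] + (a[i - 1] != b[j - 1])});  // substitution
    return d[a.size()][b.size()];
}

// Suggest dictionary words within a maximum distance, closest first.
std::vector<std::string> suggest(const std::string& word,
                                 const std::vector<std::string>& dictionary,
                                 int maxDistance = 2) {
    std::vector<std::pair<int, std::string>> candidates;
    for (const auto& entry : dictionary) {
        int distance = editDistance(word, entry);
        if (distance <= maxDistance)
            candidates.emplace_back(distance, entry);
    }
    std::sort(candidates.begin(), candidates.end());   // sort by distance, then alphabetically
    std::vector<std::string> result;
    for (const auto& c : candidates) result.push_back(c.second);
    return result;
}

int main() {
    std::vector<std::string> dictionary = {"hello", "help", "world", "held"};
    for (const auto& s : suggest("helo", dictionary))
        std::cout << s << "\n";   // prints "held", "hello", "help" (all at distance 1)
}

Real spell correctors typically combine such a distance with word frequencies or context so that the most probable intended word is listed first.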

2.3 Evaluation metrics

2.3.1 Precision and Recall

The following definitions will be used for the evaluation metrics explained below.

• True positive (Tp): correct word reported as correct
• True negative (Tn): misspelled word reported as misspelled
• False positive (Fp): misspelled word reported as correct
• False negative (Fn): correct word reported as misspelled

In the information retrieval domain, precision and recall are used to measure how well an information retrieval system can retrieve relevant items requested by a user [17]. In the context of a spell checker, a relevant item is a correct word when trying to detect correct words and a misspelled word when trying to detect misspelled words. Precision measures how many of the retrieved items are relevant, and recall measures how many of the relevant items were retrieved. Starlander et al. [16] define precision and recall as the capacity of a system to correctly classify correct vs. incorrect words.

Precision and recall can be calculated for both the ability to detect correct words and the ability to detect wrong words. By evaluating a spell checker’s capacity to detect both correct and incorrect words one gets a more refined view of the performance of a spell checker.

To measure a spell checker's capacity to detect correct words, the following measurements can be used, where Pc and Rc are precision and recall respectively:

\[ P_c = \frac{T_p}{T_p + F_p} \qquad R_c = \frac{T_p}{T_p + F_n} \]

To measure a spell checker's capacity to detect misspelled words, the following measurements can be used, where Pi and Ri are precision and recall respectively:

\[ P_i = \frac{T_n}{T_n + F_n} \qquad R_i = \frac{T_n}{T_n + F_p} \]

These metrics have values between 0 and 1, where an ideal spell checker has a value of 1. If the precision for incorrect words (Pi) is 1, it means that all the words that were reported as misspelled actually were misspelled. A recall for incorrect words (Ri) of 1 means that all misspelled words in the text were found and reported. In other words, a spell checker should have precision and recall as close to 1 as possible to be classed as good.

Starlander et al. also measure the Predictive Accuracy (PA) of a spell checker, which gives a more overall view of a spell checker's capacity to handle all words accurately. However, Van Huyssteen et al. [5] explain that the problem with this measurement is that the difference between a good and a not so good spell checker is relatively small, which makes it difficult to use in an evaluation of a spell checker.

The last measurement that Starlander et al. use is the harmonic mean of precision and recall, for both the correct and the incorrect capacity. According to van Huyssteen et al. this measurement is better because it penalises spell checkers that, for example, mark all words as correct to get a high correct recall (Rc). An ideal spell checker has harmonic means of 1, just like the other measurements. The harmonic means for correct (fmc) and incorrect (fmi) words are defined as follows:

\[ fm_c = \frac{2}{\frac{1}{R_c} + \frac{1}{P_c}} \qquad fm_i = \frac{2}{\frac{1}{R_i} + \frac{1}{P_i}} \]

Van Huyssteen et al. found that the precision measurement can be problematic because it deviates depending on the percentage of spelling errors in the text. Therefore they calculate the adjusted error precision (Pia) with a normalised error percentage. The normalisation adjusts the error percentage to a specific value, which is useful when comparing measurements from input texts with different amounts of errors: if one input text has an error percentage of 4% and another has 6%, the tests are easier to compare if the error percentage is made the same in both cases. Van Huyssteen et al. chose to set the normalisation percentage to 6% to get a realistic value, but note that further research is needed to determine an optimal percentage. The same normalisation percentage is used in this thesis. The measurement is defined as follows:

\[ P_{ia} = \frac{T_n \cdot \frac{\text{normalisation}\%}{\%\text{errors}}}{T_n \cdot \frac{\text{normalisation}\%}{\%\text{errors}} + F_n} \]

The harmonic mean measurements can be combined into an overall harmonic mean (fmo), but with Pia instead of Pi. It is defined as follows:

\[ fm_o = \frac{4}{\frac{1}{R_c} + \frac{1}{P_c} + \frac{1}{R_i} + \frac{1}{P_{ia}}} \]
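
A small worked example (with hypothetical counts, not taken from the evaluation in chapter 4) shows how the measurements fit together. Suppose a test file contains 94 correct and 6 misspelled words (6% errors, so the normalisation factor is 1 and \(P_{ia} = P_i\)), and a spell checker produces \(T_p = 92\), \(F_n = 2\), \(T_n = 5\), \(F_p = 1\). Then:

\[ P_c = \frac{92}{92 + 1} \approx 0.989, \quad R_c = \frac{92}{92 + 2} \approx 0.979, \quad P_i = P_{ia} = \frac{5}{5 + 2} \approx 0.714, \quad R_i = \frac{5}{5 + 1} \approx 0.833 \]

\[ fm_c = \frac{2}{\frac{1}{0.979} + \frac{1}{0.989}} \approx 0.984, \quad fm_i = \frac{2}{\frac{1}{0.833} + \frac{1}{0.714}} \approx 0.769, \quad fm_o = \frac{4}{\frac{1}{R_c} + \frac{1}{P_c} + \frac{1}{R_i} + \frac{1}{P_{ia}}} \approx 0.863 \]

The weak incorrect-word side pulls the overall score well below the nearly perfect correct-word side, which is exactly the penalising behaviour the harmonic mean is meant to provide.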

2.3.2 Performance in terms of time

Another metric that can be used when measuring performance of a spell checker is how fast it is. A spell checker that checks the input text in less time than another spell checker has a better performance. The time, however, is not that important in the context of CET Designer since users only write short texts. But for the purpose of comparing spell checkers the time taken to check texts will be measured.

3 Method

In this chapter the method is explained: first the evaluation of the spell checkers, then the requirement specification of the prototype, and finally how the prototype was implemented.

3.1 Pre-study

The purpose of this thesis was to integrate an already existing spell checker into CET Designer. Before anything could be implemented, a study was made to find a suitable spell checker for the integration. The Windows built-in spell checking API was suggested by Configura, but other spell checkers were also investigated for comparison.

The metrics precision and recall explained in the theory chapter were used to evaluate the spell checkers we found after the exploration.

A test file with common English words, correct and misspelled, was created to be used as input to the spell checkers under test. A list of the 5000 most frequent words [19] was downloaded and used for the common English words. This list was made from the Corpus of Contemporary American English, which contains 450 million words. The Birkbeck spelling error corpus [9] contains documents of misspellings and texts containing misspellings; from this corpus 959 misspelled words were extracted. These two lists were then used together as input to the spell checkers to test their performance. All collected words, misspelled and correct, were put in the same file, one word per row. The file was divided into two sections separated by the word CORRECT_SPELLING: misspelled words were put in the first section and correct words in the second section, after the word CORRECT_SPELLING.

The specifications of the computer that was used for evaluating the spell checkers can be seen in table 3.1. Only the time taken by the spell checkers depends on the specs of the computer. The time is not that important in the context of CET Designer's text tool since users don't write long texts, but it was chosen to be included in the evaluation.

Processor          Intel Core i7-6700K @ 4.00 GHz
RAM                16 GB (4x4 GB) DDR4 3000 MHz
Operating system   Windows 10 Enterprise 64-bit

Table 3.1: Specs of the computer used for evaluation

The evaluation of the spell checkers was made with a test program for each spell checker, built from the spell checkers' existing sample code. The overall harmonic mean explained in the theory chapter was used to compare the spell checkers. Algorithm 1 shows how the values true positive (Tp), true negative (Tn), false positive (Fp) and false negative (Fn) were determined. These values were then used to calculate the precision and recall measurements, which in turn were used to calculate the overall harmonic mean. The test program was divided into two phases, since the text file was divided into two sections: in the first phase, the spelling error phase, all words in the first section were checked, and in the second phase all correct words were checked.

while read word from text file do
    check word;
    if word reported as misspelled then
        if in spelling error phase then
            ++Tn;
        else
            ++Fn;
        end
    else
        if in spelling error phase then
            ++Fp;
        else
            ++Tp;
        end
    end
end

Algorithm 1: Determine evaluation metrics for spell checker
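
As a sketch of how such a two-phase test program could be structured (this is not the actual evaluation code; the file name and the stubbed word check are assumptions for illustration), the counting from algorithm 1 might look like this in C++:

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

// Stub: replace with a call to the spell checker under test.
// A toy dictionary stands in here so the sketch is self-contained.
bool reportedAsMisspelled(const std::string& word) {
    static const std::unordered_set<std::string> dictionary = {"the", "of", "and", "test"};
    return dictionary.count(word) == 0;
}

int main() {
    std::ifstream in("testwords.txt");      // hypothetical file name, one word per row
    std::string word;
    bool spellingErrorPhase = true;         // the first section holds the misspelled words
    int tp = 0, tn = 0, fp = 0, fn = 0;

    while (in >> word) {
        if (word == "CORRECT_SPELLING") {   // sentinel between the two sections
            spellingErrorPhase = false;
            continue;
        }
        if (reportedAsMisspelled(word)) {
            if (spellingErrorPhase) ++tn; else ++fn;
        } else {
            if (spellingErrorPhase) ++fp; else ++tp;
        }
    }

    std::cout << "Tp=" << tp << " Tn=" << tn << " Fp=" << fp << " Fn=" << fn << "\n";
}

From the four counters the precision, recall and harmonic mean measurements in section 2.3 can then be computed directly.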

To make it easier to plan and to know when the project is finished, a requirement specification was created. Before anything could be written down as requirements, some exploration of what a spell checker should do was done. Much inspiration for what the spell checker should do and look like was taken from the proofing tools in Microsoft Word, LibreOffice and OpenOffice. Van Huyssteen et al. [5] defined some characteristics of a spell checker that were also used when writing the requirement specification. The spell checker should support multiple languages, but only English was used in the evaluation.

3.2 Requirement specification

3.2.1 Main requirements

• The spell checker should be integrated with CET Designer's text tool
• Misspelled words in CET Designer's text tool should be detected and indicated/displayed
• If a word is detected as misspelled the user should get suggestions of the correct spelling
• The user should be able to ignore words marked as misspelled
• The spell checker should support multiple languages, including: Dutch, English (UK), English (US), Finnish, French, German, Norwegian, Spanish and Swedish

3.2.2 Secondary requirements

• The spell checker should be made with "user friendliness" in mind
• The user should be able to add words to the dictionary

3.3 Implementation

After the pre-study was finished and the spell checker to use had been chosen, the implementation could start. Based on the results of the pre-study, described in section 4.1, the Windows spell checking API was chosen for the implementation of the integrated spell checker.

To be able to use the Windows spell checking API functions in CM, both a C++ Dynamic-Link Library (DLL) and a C# exe were implemented. CM can then start the C# exe to use the spell checking functions. Figure 3.1 shows the communication between the C++ DLL, the C# exe and CM.

Figure 3.1: Communication between C++ DLL, C# exe and CM

3.3.1 C++ DLL

The C++ DLL declares and exports functions that can be used by the C# exe. The Windows spell checking API is used in the C++ DLL to check the spelling of words. In order to make the functions in the C++ DLL available to the C# exe they need to be exported. To export a function, the keyword __declspec(dllexport) is added to the left of the function declaration; this tells the compiler that the function is exported from the DLL. The keyword extern "C" is also needed in front of the declaration so that the C# code can link to the functions. extern "C" can be added in front of every function or to a block of functions. Listing 3.1 shows an example of the export.


GetSupportedLanguages

Retrieves the languages that are currently supported. A string containing the available languages separated by semicolons is created and returned via a wchar_t* that is provided as an argument to the function. The string contains only the tag for each language and not the whole name, e.g. en-US and sv-SE for American English and Swedish respectively.

CheckSpelling

This function checks the spelling of a given word in a specified language. The language is specified with a language tag, e.g. en-US and sv-SE for American English and Swedish respectively. When the word has been checked, the corrective action, suggestions, replacement and indices are returned as one wchar_t* each; these wchar_t*s are taken as arguments by the function. The corrective action is the action that should be performed for the checked word and can be either GET_SUGGESTIONS, REPLACE or NONE. Suggestions contain suggestions for the correct spelling, if any. Replacement contains a replacement word if the spell checker is absolutely sure what a misspelled word should be replaced with. Indices contain the start index and length of each spelling error in the checked text. (A standalone sketch of the underlying Windows spell checking API calls is shown after this list of functions.)

AddWord

This function takes a word and a language tag as function arguments. The word will be added to the dictionary of the language specified by the language tag. This word will then be considered as correct and can also be used when the spell checker is looking for suggestions.

GetInputLanguageLcid

Returns the current keyboard input language.

IsAlpha

Returns true if a character is alphabetical in a specified language, i.e., it is not a special character or a digit.
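
Purely as an illustration of what these DLL functions build on (this is not the thesis's DLL code; the hard-coded word and the omitted error handling are simplifications), a minimal standalone use of the Windows Spell Checking API could look as follows:

#include <objbase.h>
#include <spellcheck.h>
#include <cstdio>

int wmain() {
    CoInitializeEx(nullptr, COINIT_APARTMENTTHREADED);

    // Create the factory and one spell checker for the wanted language tag.
    ISpellCheckerFactory* factory = nullptr;
    CoCreateInstance(__uuidof(SpellCheckerFactory), nullptr, CLSCTX_INPROC_SERVER,
                     IID_PPV_ARGS(&factory));
    ISpellChecker* checker = nullptr;
    factory->CreateSpellChecker(L"en-US", &checker);

    // Check a word and walk through the reported spelling errors.
    IEnumSpellingError* errors = nullptr;
    checker->Check(L"helo", &errors);

    ISpellingError* error = nullptr;
    while (errors->Next(&error) == S_OK) {
        CORRECTIVE_ACTION action;
        error->get_CorrectiveAction(&action);

        if (action == CORRECTIVE_ACTION_GET_SUGGESTIONS) {
            IEnumString* suggestions = nullptr;
            checker->Suggest(L"helo", &suggestions);
            LPOLESTR suggestion = nullptr;
            ULONG fetched = 0;
            while (suggestions->Next(1, &suggestion, &fetched) == S_OK) {
                wprintf(L"suggestion: %ls\n", suggestion);
                CoTaskMemFree(suggestion);
            }
            suggestions->Release();
        } else if (action == CORRECTIVE_ACTION_REPLACE) {
            LPWSTR replacement = nullptr;
            error->get_Replacement(&replacement);
            wprintf(L"replace with: %ls\n", replacement);
            CoTaskMemFree(replacement);
        }
        error->Release();
    }

    errors->Release();
    checker->Release();
    factory->Release();
    CoUninitialize();
}

Note that the factory and the checker are created only once here; as discussed in section 5.1, creating them anew for every word slows the spell checking down considerably.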

3.3.2 C# exe

In order to use the functions declared in the C++ DLL they first need to be imported. To import a function from a DLL in C#, the attribute DllImport, which takes the name of the DLL as an argument, is added to the declaration of the function. Functions with the same names as in the C++ DLL are declared in C# with the keywords static and extern. See listing 3.1 for example code.

A string in C++ is not the same as a string in C#, so some way to transfer the words is needed. In the C++ functions, wchar_t* is used in both directions, from and to C#. To transfer a string from C#, a string can be used. To receive a string, a StringBuilder is used, which represents a string of characters [10]. Some example code is provided in listing 3.2.

The C# exe acts as a wrapper for the DLL functions so that they can be called from CM. The imported functions were therefore wrapped in methods with similar names that act as proxy methods. The strings received from CheckSpelling were stored in a SpellChecker class, which also contains the proxy methods. Get methods were also implemented to make it easy to retrieve the strings stored in the SpellChecker class. The C# code is compiled to a .NET assembly that can be loaded in CM.


// Export function from DLL
extern "C" __declspec(dllexport) void foo();

// Import function from DLL
[DllImport("myDll.dll", CallingConvention = CallingConvention.Cdecl)]
public static extern void foo();

Listing 3.1: Export and import of DLL functions

// C++
extern "C" __declspec(dllexport) void foo(wchar_t* str);

// C# to C++
[DllImport("myDll.dll", CallingConvention = CallingConvention.Cdecl)]
public static extern void foo(string str);

// C++ to C#
[DllImport("myDll.dll", CallingConvention = CallingConvention.Cdecl)]
public static extern void foo(StringBuilder str);

Listing 3.2: Transfer string between C++ and C#

3.3.3 CM: SpellChecker and SpellingError

A SpellChecker class was also implemented in CM to wrap all spell checking functions from the C# exe. To be able to use functions from a .NET assembly, the assembly first needs to be loaded into CM, which can be done with the NetObj class. NetObj can load a .NET assembly and create an instance of a .NET class with loadAssembly and createInstance respectively.

SpellingError class

Represents a misspelled word. Contains the word, index in text, length and corrective action.

checkSpelling

Takes a string containing the text to be checked as argument. This text is separated into words and then each word is checked one by one with checkSpellingWord. Returns an array of SpellingError with one entry for each misspelled word.

checkSpellingWord

Takes a string containing a word to be checked as argument. If the word is not in the ignore list it will be checked with the checkSpelling function from the .NET spell checker.

currentLanguageTag

Returns the current language tag e.g. "en-US". The current language tag represents the language in the language and region settings in CET Designer control panel.


currentInputLanguageTag

Returns the current input language tag e.g. "en-US". The current input language tag represents the input language that is set in the language bar. See figure 3.2.

isSupported

Takes a string containing a language tag and returns true if it is supported by the spell checker.

Figure 3.2: Windows language bar. Current input language is English. The language can be switched by clicking it or by pressing Windows key+Space.

3.3.4 CM: FormattedTextArea

FormattedTextArea is a multi-line input control which can be used by the user to write text. Methods used for checking the spelling were added to this class and are explained below.

selectMisspelledWord

Selects the word at the marker if it is misspelled.

checkSpellingLeft

Checks the spelling of the word to the left of the marker.

checkSpellingRight

Checks the spelling of the word to the right of the marker.

checkSpellingBoth

Checks the spelling of both the word to the left and to the right of the marker.

splitWord

Splits the word at the marker and inserts space.

skipPunctuationsLeft

Skips punctuation marks to the left of the marker.

skipPunctuationsRight

Skips punctuation marks to the right of the marker.


checkSpelling

Checks the spelling of the word at the marker. This method is called when the space bar is pressed. If the marker is at the end of a word, the last typed word is checked. If it is in the middle of a word, the word is split and the spelling of the two new words is checked. If the characters directly to the left and right of the marker are spaces or empty, nothing is checked.

replaceWord

Replaces the misspelled word at the marker with a replacement word provided by the spell checker. This method is used when the corrective action is REPLACE.

markWordAsMisspelled

Marks the word at the marker as misspelled. Removes the word from the TextParagraphElement and inserts a MisspelledTextParagraphElement instead. When this element is drawn, a red wavy underline is drawn underneath the misspelled word.

markedAsMisspelled

Returns true if the word at the marker is marked as misspelled.

unmarkWordAsMisspelled

Unmarks a word marked as misspelled.

unmarkAllWords

Unmarks all misspelled words in the text.

findAndUnmark

Takes a string as an argument, goes through the whole text and unmarks all misspelled words matching the argument.

checkSpellingAll

Checks the spelling of the whole text.

showDropDown

Shows a drop down menu with suggestions of correct spelling and the options "Ignore word" and "Add to dictionary".

3.3.5 Marking misspelled words

The dialog window (DrawTextEditDialog) that is used to edit text contains a FormattedTextArea, which is a multi-line input control where the text is entered. The text is divided into Paragraphs which contain ParagraphElements. ParagraphElement is an abstract class that is inherited by TextParagraphElement, an element containing text. It is also inherited by e.g. SetBoldParagraphElement and SetColorTextParagraphElement, which can be used to make the text bold or change its color.

SetBoldParagraphElement and SetColorTextParagraphElement were used in the beginning to mark a word as misspelled, just to see if the spell checker was working as expected. However, this was not a good solution since users may want to change the format of the text with these elements. Users cannot change the format at the moment, but it is a requested feature. Instead of changing the format of the text for each misspelled word, a red wavy line can be drawn below the misspelled word. This marker is used by many proofing tools, e.g. Microsoft Word, to show that a word is misspelled. FPSSpellChecker, created by Gullet [3], also uses wavy underlines to mark misspelled words. The wavy underline is drawn with multiple short lines alternating between going up and down, always moving forward, until the whole word is covered (see algorithm 2). This solution was used in the implementation of the spell checker for this thesis.

To draw the red wavy underlines, the class TextParagraphElement was extended by a new class, MisspelledTextParagraphElement. This ensures that the text can be drawn without changing any other code. When this element is drawn, the text is drawn just like in TextParagraphElement but the red wavy underline is also drawn. A string with the language tag of the language the text was misspelled in is stored in MisspelledTextParagraphElement so that correct suggestions can be shown to the user.

Input: x, y, width
currentX ← x
while currentX <= x + width do
    moveTo(currentX, y)
    if currentX + 3 <= x + width then
        drawLineTo(currentX + 3, y + 3)
    end
    if currentX + 6 <= x + width then
        drawLineTo(currentX + 6, y)
    end
    currentX ← currentX + 6
end

Algorithm 2: Draw wavy underline below a word

4 Results

This chapter will present the results of the pre-study and implementation.

4.1 Pre-study

4.1.1 Found spell checkers

The spell checkers that were found and considered relevant are presented and explained below.

Windows spell checking API

Windows 8 and later versions offer a spell checking API that is designed for C/C++ developers. The API is made available via a set of COM-like interfaces and supports many different languages. It can check spelling, give suggestions, detect repeated words, and ignore and add words to the user dictionary. It is also possible to create your own spell checker that can be used with the spell checking API, and it is free to use [18].

Hunspell

Hunspell is a spell checker that is used in e.g. LibreOffice, OpenOffice, Mozilla Firefox, Mozilla Thunderbird and Google Chrome. The C++ library is licensed under the GPL/LGPL/MPL tri-license. It is based on MySpell, so MySpell dictionaries can be used for the spell checking. It is free to use [4].

Spellex Spell Check Engine

Spellex Windows DLL Spell Check Engine is a tool from Spellex that can be used by developers to integrate spell checking functionality into Windows applications. It includes dictionaries for American English and UK English. Brazilian (Portuguese), Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese (Iberian), Spanish, and Swedish dictionaries can be installed. The cost is $1670 for one developer and $499 for each additional developer; this includes the English US and English UK dictionaries. Additional dictionaries can be added for $899 per dictionary [15].

Sentry Spelling Checker Engine

Sentry Spelling Checker Engine is a DLL for Windows that can be called from Windows applications to add spell checking functionality. It includes American, British and Canadian English dictionaries, but user dictionaries can also be used. Other supported languages are Danish, Dutch, Finnish, French, German, Norwegian, Spanish and Swedish, and each dictionary can be obtained for $300. The license cost is $700 for one developer and $6300 for any number of developers [14].

Intellexer Spellchecker SDK

Intellexer Spellchecker SDK provides spell checking functionality to Windows applications. It comes with a common DLL interface and interfaces for C++ and .NET. It supports common English and Russian. The license cost is $500 [6].

4.1.2 Evaluation of spell checkers

After some exploration of available spell checkers, three solutions were chosen for evaluation: the Windows spell checking API (as suggested by Configura), Hunspell and Sentry. The Windows spell checker is free to use and supports the required languages. Sentry also supports the required languages; it is not free to use, but a free trial could be obtained for the evaluation. Spellex and Intellexer were not included because no trial version could be obtained.

Evaluation with about 6000 words

                     Windows   Hunspell   Sentry
Time (s)             0.733     0.12       0.063
Time w. suggest (s)  4.69      2162       0.172
Pc                   0.997     0.997      0.977
Rc                   1.000     0.986      0.999
Pi                   1.000     0.933      0.992
Ri                   0.985     0.983      0.875
Pia                  1.000     0.965      0.996
fmc                  0.999     0.992      0.987
fmi                  0.993     0.957      0.930
fmo                  0.996     0.946      0.954

Table 4.1: Spell checker performance and accuracy (about 6000 words)

Table 4.1 shows the performance of the spell checkers when testing with about 6000 words. Sentry performed best in terms of time taken to check the input text, both with and without suggestions. Hunspell performed very poorly in the suggestion test, with a time of 2162 seconds. The other values presented in the table should be 1 for an ideal spell checker. An fmo score of 1 means that all words were reported correctly, i.e. all misspelled words were reported as misspelled and all correct words were reported as correct. The spell checker with the highest overall harmonic mean (fmo) is considered to have the best accuracy. In this test all three spell checkers got a score very close to 1, which is very good. However, the Windows spell checking API performed best with a score of 0.996.

Evaluation with about 30000 words

          Windows   Hunspell   Sentry
Time (s)  3.484     0.48       0.266
Pc        1.000     0.999      0.995
Rc        0.989     0.984      0.886
Pi        0.748     0.666      0.201
Ri        0.985     0.983      0.875
Pia       0.856     0.799      0.335
fmc       0.992     0.992      0.937
fmi       0.882     0.882      0.484
fmo       0.953     0.934      0.639

Table 4.2: Spell checker performance and accuracy (about 30000 words)

Table 4.2 shows the performance and accuracy of the spell checkers when testing with about 30000 words. This test was only done without suggestions, and Sentry performed best in terms of time in this test as well. Compared to the first test, the overall harmonic mean (fmo) decreased for all three spell checkers, with Sentry's score decreasing the most. Windows and Hunspell performed well in this test too, and Windows got the highest score of 0.953.

Figure 4.1 shows the performance of the spell checkers in terms of words per second when testing without suggestions. The values are derived from the times in tables 4.1 and 4.2. The spell checker with the highest value is considered the one with the best performance. We can see that the spell checkers performed better in the second test, with about 30000 words, and that Sentry was the fastest spell checker.

Figure 4.2 shows the overall harmonic mean (fmo) of the spell checkers. We can see that in the first test the spell checkers performed similarly. In the second test the spell checkers did not perform as well as in the first test, and Sentry's accuracy decreased a lot. The possible reasons for Sentry's drop are discussed in section 5.1.


Figure 4.2: Overall harmonic mean for test file 1 and 2

4.2 Implementation

The project resulted in a prototype of a spell checker integrated in CET Designer's text tool. Figure 4.3 shows the edit dialog that is used to edit texts. The edit dialog already existed in CET Designer, so the spell checking functionality was added to it. The additions made to the edit dialog were a button and a check box. The button is used to check the whole text for spelling errors, and the check box is used to enable/disable the auto-correct function. If the check box is checked, the spell checker replaces words automatically, but only if the Windows spell checking API suggests doing so. Auto-correction can happen, for example, when one character is missing or when two characters need to be swapped.

When a word is detected as misspelled by the spell checker, it is marked with a wavy underline. A red wavy underline was chosen because it is commonly used in other word processors and spell checkers to mark misspelled words. All three of the word processors MS Word, LibreOffice and OpenOffice use a red wavy underline to show that a word is misspelled.

It was decided that the check spelling button should only check and mark misspelled words in the text. When pressing a similar button in other word processors, a dialog is commonly displayed in which the misspelled words are shown one by one and the user can choose to replace a word with one of the suggested words. The dialog can also be used to ignore words or add them to the dictionary. There was not enough time to implement this functionality, but it is considered future work.

Another function that was implemented is check as you type. When the space bar is pressed, the last typed word is spell checked and marked or auto-corrected if it is detected as misspelled. This function is also available in the word processors MS Word, LibreOffice and OpenOffice.

If a word is marked as misspelled, the user can right-click on it to open a drop-down menu as shown in figure 4.4. The drop-down menu displays suggestions if any were found; otherwise no suggestions are displayed. In the drop-down menu the user can also choose to ignore the misspelled word for the rest of the session or add it to the dictionary. If one of the latter options is chosen, the whole text is searched for the misspelled word, which is then unmarked as misspelled. When CET Designer is closed, the list of ignored words is lost, so the next time it is started none of the previously ignored words will be ignored. The add word function is, on the other hand, more permanent since it adds the word to the user's dictionary stored on the computer.


Figure 4.3: Misspelled words marked with wavy underline

5 Discussion

In this chapter we will discuss the results and the method used. Some societal aspects will also be discussed.

5.1 Results

In the pre-study we evaluated three spell checkers: the Windows spell checking API, Hunspell and Sentry Spelling Checker Engine.

When testing with the first test file, all three spell checkers performed very well in terms of overall harmonic mean. The Windows spell checker performed best, but the other two were not far behind. In the second test the overall harmonic means decreased for all three spell checkers, especially for Sentry. Windows was the spell checker that performed best in this test as well in terms of the overall harmonic mean of precision and recall. A reason why the results were so high for this test text may be that very common correct words were used. If common words are used, there is a high chance that a spell checker's dictionary contains those words and the spell checker will therefore report them correctly. With dictionary lookup this means that some correct words may be reported as misspelled, but misspelled words will not be missed, since words not found in the dictionary are reported as misspelled. n-gram analysis is not affected in the same way since it does not use a dictionary.

The second input text contained a lot of names of persons and locations, and this could be the reason for Sentry's decreased accuracy in the second test. To test this theory, an additional small test was performed where names were sent to the spell checkers as input. Windows and Hunspell reported most of the names as correct, but Sentry reported them as misspelled. This seems to be the reason for Sentry's decreased accuracy in the second test. The best guess as to why Sentry does not recognise names is that the corpus used when compiling the dictionary or the n-gram probability table did not contain names.

When testing the speed of the spell checkers, Sentry performed best. Running the tests multiple times and taking an average of the time taken would have made the result more reliable, since the time can vary between runs. This was not done for this thesis, but if the test were to be done again an average time would be the better and more reliable choice.


In figure 4.1 we can see that the words per second increased with the larger test file. The reason is that the initialization needed before the words can be checked takes some time. This time becomes less significant if more words are checked at a time.

The time metric varied a lot between the spell checkers. The Windows spell checker performed similarly with and without suggestions, although it took a little longer in the suggestions test. Hunspell was much faster than the Windows spell checker without suggestions, but with suggestions it took a very long time and was therefore not a good alternative to integrate into CET Designer. This very poor suggestion time may be because the dictionary that was used was not optimized for lookups. Sentry performed best in terms of time taken to finish, both with and without suggestions. At first, the Windows spell checker took around 50 seconds to check 6000 words. After some optimization of the code the time could be decreased to under a second without suggestions. Originally it was very slow since an ISpellCheckerFactory as well as an ISpellChecker were created for every word; by creating only one of each of these objects, the spell checking went much faster.

One factor that made the Windows Spell Checking API very attractive was the easy installation of languages for spell checking. When a language pack is installed in Windows, it comes with a spell checker and dictionary. With the other spell checkers, dictionaries would need to be shipped to the users with CET Designer.

5.2 Method

In the pre-study, only one test file with commonly used and misspelled words was used at first. According to van Huyssteen et al. [5], the test text needs to contain at least 30000 words to get a stable result. For the evaluation in this thesis only about 6000 words were used at first. 959 misspelled words were found, which is enough to get an error rate of about 3.2% if 30000 words were tested. Van Huyssteen et al. had an error rate of about 3% in their test texts, so this is enough misspelled words. To get a more reliable result more words were collected, but there was not enough time to confirm that the words were correct, since that time was needed for the implementation. The list of the 5000 most frequent words was free to download and use. From the same site [19], lists of 20000 and 60000 words could be obtained, but they cost $60 and $90 respectively and were therefore not an option.

Some tests with sample text data from the American National Corpus (ANC) [1] were made. When using this sample, Sentry [14] detected a lot of correct words as misspelled. Many of these words were names and years with a format that Sentry did not consider correct. Sentry's function for spell checking a text only returns when it finds a misspelling and does not report when a word is detected as correct. This is an issue when evaluating, since all words need to be reported to get correct precision and recall. Therefore the Sentry evaluation program needed to be changed to make it report every word. This could have affected the time taken to finish the test, but Sentry was still the fastest spell checker.

We saw that both the Windows spell checker and Hunspell had precision and recall close to 1, which means that they cover many of the words in the English language. However, as noted before, these values might differ with other test files, so checking more words would give a more accurate result. The Windows spell checker had better precision and recall, but that does not mean that Hunspell is much worse, since the values could change if Hunspell used a different dictionary or if different words were used.

The prototype spell checker implemented in this thesis supports multiple languages, but only American English was tested in the evaluation. To fully test the performance of the spell checkers, different languages could have been tested to see if the results would differ. This would give a more complete evaluation, and perhaps another spell checker than Windows would then have been chosen for the implementation.

5.3 The work in a wider context

A study was made by Galletta et al. [2] in which undergraduate and graduate students were asked to edit and correct a business letter using Microsoft Word 2003. One half of the test subjects had the spell checking feature turned on and the other half had it turned off. The purpose of the study was to investigate how spell checkers affect the subjects' ability to correct spelling errors. The subjects were categorized into subjects with high verbal ability and low verbal ability. The authors expected that the high verbal subjects, as well as the subjects using the spell checker feature, would perform better than low verbal subjects who did not use the spell checker feature. They also expected that the low verbal subjects would not know when to ignore the incorrect advice from the spell checker.

Galletta et al. concluded that for obvious errors correctly reported by the spell checker, the subjects performed better with the spell checker than without it, but for false positives and false negatives the performance was worse for both high and low verbal subjects. They saw that with spell checking enabled the subjects left behind nearly twice as many errors, which demonstrates that a spell checker can give a false sense of security.

Andrea Lunsford and Karen Lunsford [8] made a study in which they tried to replicate a study made twenty-two years earlier by Andrea Lunsford and her research partner Robert Connors. The purpose was to chart how student writing errors had changed over the course of twenty years. Lunsford and Connors assembled a list of the most common errors in student papers in 1988, and a similar list was assembled in 2008 by Andrea and Karen Lunsford to be compared to the previous one. They found that spelling errors and the use of wrong words were more common in the later study. According to the authors this result should be used with caution, but it shows that the errors students make have changed over the years. Whether this is because of the use of proofing tools cannot be said for sure, but it seems to have something to do with it.

A spell checker is a great tool that helps users write correct texts. However, spell checkers are not perfect and can always be improved, so one should be aware that spell checkers also make mistakes. As Ian McNeilly, director of the National Association for the Teaching of English, put it: "...if people are blindly writing things and expecting automated programs to address all of their inaccurate spellings, that's a concern - because they won't." [13]

6 Conclusion

In this chapter we will answer the research questions and see what requirements have been satisfied. Some future work will also be mentioned.

6.1 Research questions

6.1.1 Can Windows spell checking API be used to check spelling in CET Designer's text tool?

Yes: a prototype of a spell checker integrated into CET Designer's text tool was implemented with the Windows spell checking API.

6.1.2 What alternatives exist to check spelling of a text?

The alternatives that were found and that seemed to be suitable for CET Designer integration were Hunspell, Spellex Spell Check Engine, Sentry Spelling Checker Engine and Intellexer Spellchecker SDK.

6.1.3 Which alternative gives the best performance and accuracy?

In terms of classifying words as correct or incorrect, all three spell checkers performed very well, with overall harmonic means close to 1, which is the ideal value for a spell checker. The winner, however, was the Windows spell checking API with a value of 0.996 in the first test and 0.953 in the second test. The values are shown in tables 4.1 and 4.2. This means that almost all correct words were reported as correct and almost all misspelled words were reported as misspelled by the spell checker.

In terms of time taken to spell check a text, Sentry had the highest performance. Hunspell had similar time performance without suggestions, but with suggestions it had no chance of competing with Windows and Sentry.

6.2 Requirement analysis

All main requirements have been satisfied, as well as the secondary requirement that the user should be able to add misspelled words to the dictionary. The secondary requirement that the spell checker should be made with "user friendliness" in mind was not satisfied.

6.3 Future work

6.3.1 Spelling correction dialog

In many proofing tools, like the ones in Microsoft Word, LibreOffice and OpenOffice, a dialog is displayed after the spell checker has been run. This dialog shows one spelling error at a time with some possible corrective actions. Common actions are to ignore the word, add the word to the dictionary, or change the word to one chosen from a list of suggestions. This could be a good addition to the spell checker in CET Designer since it would make it easier for the user to go through all spelling errors and correct them.

6.3.2 Optimization

The implemented prototype of a spell checker is not very fast at the moment, but it functions as a proofing tool. One thing that could make it faster is optimizing the calls to the C++ DLL. The test program for evaluating the Windows spell checking API only created one ISpellCheckerFactory, which is ideal. But every time a call to the CheckSpelling function in the C++ DLL is made, the ISpellCheckerFactory is created and then closed before the result is returned. This is not very efficient, since only one word is checked at a time and a lot of ISpellCheckerFactory objects are created. The ISpellCheckerFactory is also used to create an ISpellChecker each time the function is called. A solution could be to send the whole text in one call, but this does not work at the moment since CheckSpelling returns false if one or more words are misspelled. To make this work, all spelling errors would need to be returned at once. They could for example be returned as a string formatted in some clever way so that it is easy to extract each individual spelling error; the index of the error in the text, the length and the corrective action need to be returned for every spelling error. The suggestion and replacement words could be returned in this call, or maybe in another call since suggestions are only needed when a user requests them. Note that requesting suggestions in a later call by sending only the misspelled word can give different suggestions, since the spell checker is aware of the context in the first call.
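
Purely as an illustration of the "formatted string" idea above (the struct and the encoding are hypothetical and not part of the thesis code), all errors of a text could be packed like this and split again on the receiving side:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical value object; the real DLL works with the index, length and
// corrective action obtained from the Windows spell checking API.
struct SpellingErrorInfo {
    int start;
    int length;
    int correctiveAction;   // e.g. 0 = NONE, 1 = GET_SUGGESTIONS, 2 = REPLACE
};

// Pack all errors of one text into a single string such as "3,4,1;12,5,2".
std::wstring packErrors(const std::vector<SpellingErrorInfo>& errors) {
    std::wostringstream out;
    for (size_t i = 0; i < errors.size(); ++i) {
        if (i > 0) out << L';';
        out << errors[i].start << L',' << errors[i].length << L','
            << errors[i].correctiveAction;
    }
    return out.str();
}

int main() {
    std::vector<SpellingErrorInfo> errors = {{3, 4, 1}, {12, 5, 2}};
    std::wcout << packErrors(errors) << L"\n";   // prints 3,4,1;12,5,2
}

Each error is encoded as start,length,action and the errors are separated by semicolons; the receiver splits on ';' and ',' to recover the values for each spelling error.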

One thing that needs to be considered is that the text in CET Designer is divided into paragraphs. This will make the spelling error indices point to the wrong positions in the text if the whole text is sent to the spell checker, since each paragraph has its own indices.

The C# layer only acts as a wrapper for the C++ DLL. The interface of the DLL is a C interface, which makes it possible to import the DLL directly in CM. It could be worthwhile to remove this layer as an optimization, but then some editing of the CM code needs to be done and it may not be worth it.

One last thing that needs to be considered is that if the whole text is checked, the Windows spell checking API can detect repeated words. This gives the corrective action CORRECTIVE_ACTION_DELETE, which is currently sent back to the caller as NONE.

6.3.3 Spell checker settings

At the moment the only controls for the spell checker are the check spelling button and the auto-correct check box located in the edit dialog. If more settings are added they will just fill the dialog window and make it more difficult to use. Therefore all settings could be moved to the control panel, with only a settings button in the dialog to open the control panel. The auto-correct setting would be moved there, and maybe a setting for choosing which language the spell checker should use could be added as well. These settings would preferably go under the language and region settings.

6.3.4 Resource lookup

At the moment the labels of the "Check spelling" button and check box are in English. Since CET Designer supports multiple languages, the button and check box labels need to change depending on the current language setting in the control panel. In CM this can easily be done by adding entries in a resource file ending in .rs and then using $<resource name> to make a resource lookup for each label to get the correct one for the current language. "Ignore word", "Add to dictionary" and "no suggestions" in the drop-down menu also need to be looked up. After the spelling has been checked, a dialog window is displayed with a message that the spell checker has finished; this message is also only in English at the moment.


Bibliography

[1] American National Corpus. URL: http://www.anc.org/.

[2] Dennis F. Galletta, Alexandra Durcikova, Andrea Everard, and Brian M. Jones. “Does spell-checking software need a warning label?” In: Communications of the ACM 48.7 (2005), pp. 82–86.

[3] Matt Gullet. A Spell Checking Engine. URL: http://www.codeproject.com/Articles/879/A-Spell-Checking-Engine.

[4] Hunspell. URL: https://hunspell.github.io/.

[5] Gerhard B. van Huyssteen, E.R. Eiselen, and M.J. Puttkammer. “Re-evaluating evaluation metrics for spelling checker evaluations”. In: Proceedings of the First Workshop on International Proofing Tools and Language Technologies. 2004, pp. 91–99.

[6] Intellexer Spellchecker SDK. URL: http://spellchecker.intellexer.com/.

[7] Hsuan Lorraine Liang. “Spell checkers and correctors: a unified treatment.” In: (2009).

[8] Andrea A. Lunsford and Karen J. Lunsford. “‘Mistakes are a fact of life’: A national comparative study”. In: College Composition and Communication (2008), pp. 781–806.

[9] Roger Mitton. Birkbeck spelling error corpus. 1985. URL: http://ota.ox.ac.uk/headers/0643.xml (visited on 04/14/2016).

[10] Passing strings from C# to C++ dll and back — minimal example. URL: http://stackoverflow.com/questions/20752001/passing-strings-from-c-sharp-to-c-dll-and-back-minimal-example (visited on 05/26/2016).

[11] James L. Peterson. “Computer programs for detecting and correcting spelling errors”. In: Communications of the ACM 23.12 (1980), pp. 676–687.

[12] J.J. Pollock. “Spelling error detection and correction by computer: some notes and a bibliography”. In: Journal of Documentation 38.4 (1982), pp. 282–291. DOI: 10.1108/eb026733.

[13] Poor spelling of ‘auto-correct generation’ revealed. BBC News. URL: http://www.bbc.com/news/education-18158665.

[14] Sentry Spelling Checker Engine. URL: https://www.wintertree-software.com/dev/ssce/windowssdk.html.

[15] Spellex Windows DLL Spell Check Engine. URL: http://www.spellex.com/.

[16] Marianne Starlander and Andrei Popescu-Belis. “Corpus-based Evaluation of a French Spelling and Grammar Checker.” In: LREC. 2002.

[17] Kai Ming Ting. “Precision and Recall”. In: Encyclopedia of Machine Learning. Ed. by Claude Sammut and Geoffrey I. Webb. Boston, MA: Springer US, 2010, p. 781.

[18] Windows Spell Checking API. URL: https://msdn.microsoft.com/en-us/library/windows/desktop/hh869852(v=vs.85).aspx.

[19] Word frequency data: Corpus of Contemporary American English. URL: http://www.wordfrequency.info/ (visited on 04/14/2016).
