
Automatic Induction of Word Classes in Swedish Sign Language

Johan Sjons

Department of Linguistics
Thesis (15 ECTS credits)
Degree of Master of Arts in Computational Linguistics (1 year, 60 ECTS credits)
Spring 2013
Supervisors: Kristina Nilsson Björkenstam, Robert Östling and Anna-Lena Nilsson
Examiner: Henrik Liljegren


Abstract

Identifying word classes is an important part of describing a language. Research on sign languages often lacks distinctions crucial for identifying word classes, e.g. the difference between sign and gesture.

Additionally, sign languages typically lack a written form, which often constrains quantitative research on sign language to the use of glosses translated to the spoken language of the area. In this thesis, such glosses have been extracted from The Swedish Sign Language Corpus. The glosses were mapped to utterances based on the Swedish translations in the corpus, and these utterances served as input data to a word space model, producing a co-occurrence matrix. This matrix was clustered with the K-means algorithm. The extracted utterances were also clustered with the Brown algorithm. Using V-measure, the clusters were compared to a gold standard manually annotated with word classes. The Brown algorithm performs significantly better at inducing word classes than a random baseline. This work shows that unsupervised learning is a feasible approach for doing research on word classes in Swedish Sign Language. However, future studies of this kind should employ a deeper linguistic analysis of the language as part of choosing the algorithms.

Sammanfattning

Identifying word classes is an important step in describing a language. Linguistic research on sign languages lacks many of the distinctions that must be made in order to identify word classes, for instance what separates signs from gestures. In addition, sign languages lack standardized written forms, which means that quantitative studies of sign language often have to rely on glosses translated from the spoken language of the area. In this study, data of that kind, transcribed glosses, was extracted from The Swedish Sign Language Corpus and mapped to utterances, based on the Swedish translations in the corpus. The K-means algorithm and the Brown algorithm were used to cluster the data, where the former took as input a co-occurrence matrix populated with the extracted glosses, and the latter the extracted utterances. The evaluation metric V-Measure was used to compare the clusters against a manually annotated gold standard. The results show that the Brown algorithm performs best, in that it produces clusters that beat a random word class induction. Future studies would benefit from drawing on more of the linguistic knowledge that does exist about word classes in Swedish Sign Language when choosing an algorithm for word class induction.

Keywords

Swedish Sign Language, Word Class Induction, Part-of-Speech, Brown algorithm, Word Space Models, Clustering, K-means algorithm, Computational Linguistics


Table of Contents

1 Introduction
2 Background
  2.1 Word Classes in Sign Languages
  2.2 Word Classes in Swedish Sign Language
  2.3 Induction of Word Classes
    2.3.1 Supervised and Unsupervised Learning
    2.3.2 Word Space Models
    2.3.3 Euclidean and Cosine Distance
    2.3.4 Clustering and the K-means Algorithm
    2.3.5 The Brown Algorithm
    2.3.6 Cluster Evaluation and Evaluation of Word Class Induction Systems
    2.3.7 V-Measure
  2.4 Purpose
3 Data and Method
  3.1 The Swedish Sign Language Corpus
  3.2 Extracting Data
    3.2.1 Processing the Glosses
    3.2.2 Mapping Glosses to Utterances
  3.3 Gold Standard
  3.4 The Co-occurrence Matrix
  3.5 Clustering with the K-means Algorithm
  3.6 Clustering with the Brown Algorithm
  3.7 Evaluation
4 Results
  4.1 Frequencies of Glosses in Swedish Sign Language
  4.2 The Word Space Model
  4.3 The K-means Algorithm
  4.4 The Brown Algorithm
5 Discussion
  5.1 Discussion of Background
  5.2 Discussion of Data and Method
    5.2.1 Discussion of Processing the Glosses
    5.2.2 Discussion of Mapping the Glosses
  5.3 Discussion of Results
    5.3.1 Discussion of the Word Space Model
    5.3.2 Discussion of the K-means Algorithm
    5.3.3 Discussion of the Brown Algorithm
    5.3.4 Discussion of Evaluation
  5.4 Future Work
6 Conclusions


1 Introduction

Word classes, or parts-of-speech, are the categories into which words are divided depending on their morphological, syntactic or semantic features. Defining word classes in an undescribed language is a difficult task – though an important one – since word classes are very useful, for instance in research (e.g. when trying to find certain constructions) and in education (e.g. giving the learner a better understanding of the language). Data of natural language annotated with word classes is also necessary for almost any higher-level processing of natural language, e.g. syntactic parsing, speech recognition and speech synthesis.

Depending on the language, the features that should be taken into consideration when defining word classes will differ. For instance, when trying to find word classes for an isolating language, the syntax probably constitutes a more convenient feature to investigate than the morphology, while in synthetic languages morphology will be crucial. However, defining word classes manually may demand a lot of knowledge about the language in question, or will at any rate be time-consuming. Therefore an automatic method for defining word classes can be applied as a first step or as a complement to a more manual approach.

Defining word classes in an undescribed language automatically is often referred to as word class induction, and is done with some unsupervised learning algorithm, that is, an algorithm that automatically finds structure in unlabeled data. Preferably, this data is in text format.

In this thesis, where the language in focus is Swedish Sign Language (SSL), an algorithm that takes syntactic features into account is preferable, since this language has relatively little inflection, while word order is more important.

Previous research on word classes in SSL is not very extensive, reflecting the fact that research on this language started in the 1970s. This thesis is therefore explorative and tentative.

2 Background

This section briefly reviews some of the issues in research on word classes in sign languages. Additionally, the word classes that have been identified in SSL, and what characterizes them, are outlined. Finally, some of the established methods for automatically identifying word classes in unlabeled data are presented.

2.1 Word Classes in Sign Languages

One of the first steps in describing a language is often to identify the word classes of that language (Schwager and Zeshan, 2008). Schachter (2007, p. 1) states that all languages have word classes, although these can vary in kind and number. However, research on word classes is under-represented in the sign language literature, because this field carries problems of a theoretical and methodological nature (Schwager and Zeshan, 2008). For instance, it is not all that simple to list all the signs of a sign language in a dictionary, partly because this would require a clear distinction between signs and gestures (Johnston and Schembri, 1999).

Furthermore, sign languages typically lack writing systems, due to the problems of representing signs orthographically (Hopkins, 2008; Johnston and Schembri, 1999). Schwager and Zeshan (2008, p. 510) address this issue: "There is no satisfactory way of recording the dynamic, three-dimensional properties of sign language utterances on paper". Describing a sign language through a writing system would require the system to represent, in some way, handshape, orientation, movement and location (Hopkins, 2008).

One way of representing sign language orthographically has therefore become to translate each sign into a gloss of the spoken language (Hopkins, 2008). Since this is widely used in the academic world (Hopkins, 2008), data in text format of spontaneous communication – something that Schwager and Zeshan (2008) point out is important to have for studying word classes – is typically given in the format of transcribed glosses. An example of a resource that contains such data is The Swedish Sign Language Corpus (Mesch et al., 2012). This corpus is presented further in Section 3.1.

The core of the problem of word class research for sign languages, however, is according to Schwager and Zeshan (2008) that no general criteria for defining word classes in sign languages have been suggested. This might be connected to another fact, namely that studies on identifying word classes in sign languages "are very few and far between" (Schwager and Zeshan, 2008, p. 514).

2.2 Word Classes in Swedish Sign Language

Within the field of word classes in SSL, some distinctions have been made (SOU, 2006a, pp. 27-33). However, this field could benefit from being investigated further, for example through a corpus study (SOU, 2006a, p. 18).

In SOU (2006a, p. 27), it is said that the semantic and in particular the syntactic features determine the word classes of the signs. The authors also present a table of the defined word classes in SSL, with examples (SOU, 2006a, p. 27); see Table 1.¹

Table 1: Table adapted from SOU (2006a, p. 27). My translations.

NOUNS          concrete       BOY, HORSE, CARROT, BALL, HOME
               abstract       SUGGESTION, IDEA, FANTASY, SOUL
VERBS          intransitive   WALK, YAWN, WAKE UP, GROW, GO-HOME, PREGNANT
               transitive     FETCH, CATCH-SIGHT-OF, BUY, LOVE
               auxiliaries    WILL, PERF (perfect tense marker)
ADJECTIVES                    GOOD, BAD, OLD, RED, SWEET
ADVERBS                       NOW, NEVER, WHEN, HERE, THERE, WHERE
NUMERALS                      ONE, TWENTY, FIRST
PRONOUNS       possessive     INDEX-c ('I', 'we', 'one'), INDEX-x ('you', 'he', 'she', 'it')
               interrogative  WHO, WHAT
CONJUNCTIONS                  PLUS ('and'), CAUSE ('because')
PREPOSITIONS                  IN, FROM

In SSL, it is not always possible to determine whether a sign in isolation is a noun or a verb, since some nouns and verbs are performed in similar ways. To determine the word class in these cases, the context is crucial, although mouthing – which for nouns is often a loan from Swedish – can be a useful non-manual cue for making the distinction between the two (SOU, 2006a, p. 27). Additionally, for some signs, called sign pairs, the nouns have a repeated movement, whereas the verbs have a more protracted movement (SOU, 2006a, p. 28).

Nouns also typically lack inflection, although some signs inflect for plural, either through a repeated movement (SOU, 2006a, p. 29) or with a horizontal movement (Wallin, 1994).

Verbs, on the other hand, have a rather rich morphology. For instance reduplication can imply that an action is repeated, ongoing or protracted (Bergman, 1983; Börstell, 2011).

Some signs are, in spite of their meaning, not adjectives, since they cannot be used attributively. Their function is instead restricted to predication, e.g. WOMAN PREGNANT², and they are therefore defined as verbs (SOU, 2006a, p. 33). These verbs form a small category (Bergman, 1983).

¹ Table adapted from SOU (2006a, p. 27). All the translations in this thesis are my own.

² This example is taken directly from SOU (2006a, p. 33).


Furthermore, what seems to be a universal in sign languages, and does apply to SSL, is that verbs lack inflection for tense (SOU, 2006a, p. 31). However, some signs, such as the perfect marker PERF, have temporal features (SOU, 2006a, p. 31). PERF is one of several genuine signs, i.e. not borrowed from Swedish, and it is used either as a perfect tense marker or as, for example, "have been (to a place)".

It has been claimed that personal pronouns based on the conversational roles of referents do not exist in SSL as in many spoken languages (for example English) (SOU, 2006a, p. 33). Instead, pronominal pointings are directed towards the location of the person or entity referred to (Nilsson, 2010).

2.3 Induction of Word Classes

In previous research within the field of identifying word classes for spoken languages, automatic and semi-automatic methods have been employed to facilitate or replace doing this manually, with little or no knowledge about the languages represented in the available data. In such studies, supervised or unsupervised learning has been employed. For instance, Brill and Marcus (1992) performed a pilot experiment that succeeded in automatically tagging corpora of unknown data (some with as little as about 8000 words), with only a little help from an informant who corrected the data as part of the machine learning process.

Automatically finding word classes with unsupervised learning is – as mentioned – often referred to as induction (see Christodoulopoulos et al. (2010) for an extensive summary and evaluation of different machine learning algorithms for induction).

Finding distributional similarity of words can also be employed as a first step in automatic word class induction (Kiss, 1973, p. 29; Schütze, 1993).

2.3.1 Supervised and Unsupervised learning

In unsupervised learning, the aim is to find structure in an unlabeled data set (Alpaydin, 2004, p. 11). This can be contrasted with supervised learning, where an algorithm learns from labeled examples, e.g. a corpus annotated with word classes (Alpaydin, 2004, p. 9) (N.B. labeled examples can also be used as part of the evaluation phase in unsupervised learning). As no previously labeled data of this kind is available, supervised learning is not suitable for the task in this thesis, but unsupervised learning is.

2.3.2 Word Space Models

Word space models are based on the hypothesis that words with similar meanings tend to occur in the same contexts, often referred to as the distributional hypothesis (Harris, 1954). For instance, if it is found in a corpus that cat and dog occur in the same contexts more often than cat and likely, then cat and dog have higher semantic similarity than cat and likely, by definition. To produce information that reveals how close words are to each other (measured in e.g. Euclidean or cosine distance, see Section 2.3.3), word space models use statistics for generating high-dimensional context vectors (Sahlgren, 2005).

One way of representing the data in a word space model is to populate a matrix, where the unique words in the corpus correspond to the rows in the matrix and the unique left and right context words correspond to the columns in the matrix. For instance, the data available could be:

"I eat meat. You eat vegetables."


To populate the matrix, one iterates over the corpus, so that each time a word appears in a context, a one is added to the corresponding cell of the matrix. For example, the first word is "I", which has no left context, but its right context is "eat". Therefore, a one is added to the cell corresponding to "I" and "eat, right" in the matrix. In this way, the matrix would be populated as follows:

Table 2: Example matrix.

             eat, right   I, left   meat, right   eat, left   vegetables, right   you, left
I                 1          0           0            0              0                0
eat               0          1           1            0              1                1
meat              0          0           0            1              0                0
you               1          0           0            0              0                0
vegetables        0          0           0            1              0                0

Each word now has its own context vector. In this example, the vectors of the words "meat" and "vegetables" are identical, which is to say that they are semantically similar according to the model (or rather, occur in the same contexts). Representing a word space model in this way is called a "co-occurrence matrix" (Sahlgren, 2006, p. 31).
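The procedure just described can be sketched in a few lines; this is a minimal illustration (one-word left/right contexts, sentence-internal only), not the program used in the thesis:

```python
from collections import defaultdict

def cooccurrence_matrix(sentences):
    """Build a word-by-context co-occurrence matrix.

    Each context is a (word, side) pair: the immediate left or
    right neighbour within the same sentence.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().strip(".").split()
        for i, word in enumerate(tokens):
            if i > 0:
                counts[word][(tokens[i - 1], "left")] += 1
            if i < len(tokens) - 1:
                counts[word][(tokens[i + 1], "right")] += 1
    return counts

matrix = cooccurrence_matrix(["I eat meat.", "You eat vegetables."])
# "meat" and "vegetables" end up with identical context vectors,
# mirroring Table 2.
```

Each inner dictionary plays the role of one row of Table 2, with absent contexts implicitly zero.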

In this example the matrix is very small and it is easy to see which words are the most similar (one does not even need the matrix to see which words appear in the same contexts). With more data, however – which is also needed to be able to say something more generic, e.g. in the case of induction – the matrix easily becomes too large to be analyzed manually. The number of dimensions in each word's vector is the total number of contexts in the data (often hundreds of thousands of instances (Jurafsky and Martin, 2009, p. 803)). Since this is unmanageable for human analysis, a word space model can be useful as a first step in induction, e.g. with the produced vectors used as input data to an unsupervised learning algorithm, to find coherence in the matrix.

However, even if a larger amount of data is available, the matrix will still suffer from being sparse, i.e. most of the entries are zeros (Sahlgren, 2005), which is also the case for this example matrix. This is due to the simple fact that most words in natural language do not occur in more than a few contexts.

Different solutions for this have been suggested (see Sahlgren (2005, p. 3) for a brief summary).

Sahlgren (2005, p. 3) also points out that word space models allow for defining "semantic similarity in mathematical terms", and argues that they do not require any previous linguistic knowledge. However, it is worth noting that the task of tokenization (segmenting the words) is still needed. For instance, it is not always as simple as splitting the words on whitespace, as in cases like rain forest, which for most applications should probably be treated as one token (Jurafsky and Martin, 2009).

2.3.3 Euclidean and Cosine Distance

There are different ways of measuring the distance between the vectors in a word space model. Two of these are Euclidean and cosine distance, where the first is the "ordinary" distance, i.e. a line between the end points of the vectors. To calculate the Euclidean distance between two vectors, one takes the square root of the sum of all the squared differences of the corresponding values in the vectors (Deza and Deza, 2009, p. 94), as in Equation 1 below (adapted from Deza and Deza (2009, p. 94)).

d_E = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}    (1)

One problem with utilizing the Euclidean distance for measuring the distance between vectors, though, is that it does not compensate for the length of the vectors. For instance, if two words, say cat and dog, often appear in the same contexts, but cat is twice as frequent as dog, their vector lengths will differ. Thus, if there is also the word elephant, which seldom appears in the same contexts as cat or dog, but is as frequent as cat, then the Euclidean distance between the vectors of cat and dog might be larger than the distance between cat and elephant. To cope with this, one can instead measure the cosine of the angle between the vectors. Equation 2 describes the cosine similarity of two vectors, i.e. the dot product of vector x and vector y, divided by the magnitude of vector x times the magnitude of vector y (Deza and Deza, 2009, p. 308). The magnitude of a vector is the square root of the sum of its squared components.

\cos(\vec{x}, \vec{y}) = \frac{x \cdot y}{\|x\| \, \|y\|}    (2)

In this work, the distance is computed as 1 minus the cosine similarity, so that a low value corresponds to a smaller distance, i.e. higher similarity (see the tables in Section 4.2).
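To make the contrast between the two measures concrete, here is a small sketch with invented, illustrative count vectors: cat and dog share a context profile but differ in frequency, while elephant is as frequent as cat but has a different profile. Euclidean distance then ranks elephant closer to cat than dog is, while cosine similarity does not:

```python
import math

def euclidean_distance(p, q):
    # Square root of the sum of squared coordinate-wise differences (Equation 1).
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector magnitudes (Equation 2).
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Invented context-count vectors for illustration only:
cat = [10, 0]       # frequent, profile A
dog = [2, 0]        # rare, same profile A
elephant = [7, 7]   # as frequent as cat, different profile
```

Here euclidean_distance(cat, dog) is 8.0, but euclidean_distance(cat, elephant) is only about 7.6, so the raw Euclidean measure calls cat and elephant the closer pair, whereas cosine_similarity(cat, dog) is 1.0 (the vectors are parallel).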

2.3.4 Clustering and theK-means Algorithm

One method in unsupervised learning is called cluster analysis, i.e. automatically finding coherent clusters in an unlabeled data set. The most widely used clustering algorithm is the K-means algorithm (Kanungo et al., 2002; Tajunisha, 2011), which is easy to understand (Hamerly and Elkan, 2002), as well as effective and simple (Wang et al., 2012). Furthermore, the K-means algorithm has been used in word class induction, with context vectors as input (Finch and Chater, 1992; Headden III et al., 2008).

The K-means algorithm is an iterative algorithm that works as follows. First, decide K, the number of clusters. The algorithm initializes that many points, called cluster centroids, in R^d, where d is the number of dimensions in the data. The initialization can be done in several ways, e.g. randomly or with some refinement (see for example Fayyad et al. (1998) or Bradley and Fayyad (1998)). Then the algorithm alternates between two steps. In the first step, it goes through the vectors (i.e. not the cluster centroids) and assigns each vector to the closest cluster centroid (measured using for instance cosine distance, see Section 2.3.3). In the second step, the algorithm calculates the mean of all the vectors assigned to each cluster, and moves the cluster centroids to these means. These two steps are repeated until no movement, or only a little movement (defined by some preset value), takes place (Kanungo et al., 2002).

How well K-means performs is highly dependent on the initialization of the cluster centroids, and because of this, the algorithm is often run several times with different initializations (Alpaydin, 2004, p. 147). The output consists of the preset number of clusters.
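The two alternating steps can be sketched as follows. This is a minimal illustration (random initialization from the data points, Euclidean distance, a single run with no restarts), not the implementation used in the thesis:

```python
import math
import random

def k_means(vectors, k, iterations=100, seed=0):
    """Plain K-means: random initial centroids drawn from the data,
    then alternating assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = []
    for _ in range(iterations):
        # Step 1: assign each vector to its nearest centroid.
        labels = []
        for v in vectors:
            dists = [math.dist(v, c) for c in centroids]
            labels.append(dists.index(min(dists)))
        # Step 2: move each centroid to the mean of its assigned vectors.
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep empty clusters in place
        if new_centroids == centroids:
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points, e.g. [[0, 0], [0, 1], [10, 10], [10, 11]] with k=2, the loop converges to the obvious two-cluster partition after a few iterations.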

2.3.5 The Brown Algorithm

Another example of cluster analysis is the Brown algorithm, which is an agglomerative algorithm. This means that it works bottom-up: it lets each word in the corpus start out as a cluster of its own, and then repeatedly merges pairs of clusters upwards in the hierarchy. The pair of clusters whose merge causes the smallest possible decrease in the likelihood of the corpus, according to a class-based bigram model, is chosen (see below for a description of the algorithm). The input is unlabeled data, thus plain text (often millions of words), in contrast to K-means, which takes vectors as input. The output of the Brown algorithm can be both of the following (Liang, 2005):

1. Words sectioned into different clusters, for instance the following set of word clusters: (apple, pear), (boat, car) and (walk, run).

2. A hierarchical word clustering. See Figure 1.

As can be seen in Figure 1, each leaf node is a word, and each leaf has a corresponding binary number (see Table 3). The least interesting cluster here is of course the topmost one, which contains all the words in the corpus (cluster a). The second-highest cluster on the left, b, comprises clusters d (apple, pear) and e (boat, car) – all nouns – and is separated from c (walk, run), which contains verbs, and so on.

Table 3 shows that, for instance, the prefix 1 corresponds to (apple, pear) and (boat, car), the prefix 10 only to (boat, car), and the prefix 0 only to (walk, run), and so on.


[Figure: a binary tree with root a. Cluster a splits into b (nouns) and c (verbs); b splits into d (apple, pear) and e (boat, car); c contains (walk, run). Each branch is labeled 1 or 0, and the labels along the path to a leaf yield the binary numbers in Table 3.]

Figure 1: Example of hierarchical clustering with the Brown algorithm.

Table 3: Example of how every word in an output from the Brown algorithm is represented by a binary number.

WORD     BINARY NUMBER
apple    111
pear     110
boat     101
car      100
walk     01
run      00
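One convenient property of these binary paths is that truncating them to a common prefix length yields clusterings at different granularities. A small sketch, using the paths from Table 3:

```python
def clusters_from_paths(paths, prefix_length):
    """Group words by the first `prefix_length` bits of their
    Brown-cluster binary path (shorter paths are used whole)."""
    clusters = {}
    for word, path in paths.items():
        clusters.setdefault(path[:prefix_length], []).append(word)
    return clusters

# The word-to-path mapping from Table 3.
paths = {"apple": "111", "pear": "110", "boat": "101",
         "car": "100", "walk": "01", "run": "00"}
```

With prefix length 1 this separates the nouns (prefix 1) from the verbs (prefix 0); with prefix length 2 it recovers the finer clusters (apple, pear), (boat, car), (walk) and (run).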

Equation 3 describes the probability of a corpus (w_1, ..., w_T) given a clustering, which the Brown algorithm tries to maximize (Brown et al., 1992).³

p(w_1, w_2, \ldots, w_T) = \prod_{i=1}^{T} e(w_i \mid C(w_i)) \, q(C(w_i) \mid C(w_{i-1}))    (3)

The Brown clustering algorithm works as follows. Each word in the corpus is assigned to its own cluster. Then, for each pair of clusters, the effect on the likelihood of the corpus of merging that pair is calculated. This calculation is based on two parameters, namely the emission probability and the transition probability (e and q in Equation 3). The former is the probability of the word w_i given its cluster C(w_i), and the latter is the probability of the cluster C(w_i) following the previous word's cluster C(w_{i-1}). The pair whose merge yields the highest remaining likelihood is merged into a new cluster, and this is repeated up the hierarchy.

One advantage of the Brown algorithm is that, in spite of being relatively old and "relatively simple", it works "as well or better than" newer systems for word class induction (Christodoulopoulos et al., 2010, p. 576).

2.3.6 Cluster Evaluation and Evaluation of Word Class Induction Systems

One downside with cluster analysis is that there is no established way for optimally evaluating the outcome (Färber et al., 2010). However, there are mainly two different paradigms for doing this, which are called internal and external evaluation (Halkidi et al., 2002).

In internal evaluation, the clusters produced are evaluated based on features of the actual data (Halkidi et al., 2002), and there are basically two parameters affecting the score: inter-similarity and intra-similarity. The first is the level of similarity between the different clusters, whereas the second is the level of similarity within each cluster. Thus, a high internal evaluation score is obtained with low inter-similarity and high intra-similarity (Pantel and Lin, 2002). Internal evaluation will therefore perhaps not say very much about the actual clusters, but is used for comparing different algorithms.

³ Equation 3 is derived from Christodoulopoulos et al. (2010, p. 575).

In external evaluation, on the other hand, a gold standard, for instance a corpus annotated with word classes, is used. In short, the results are compared to true values. This is viewed as the best way of evaluating clustering (Färber et al., 2010), although the quality of the evaluation is directly related to the quality of the gold standard.

Even so, evaluating word class induction systems with external evaluation is not a process without issues, since there is seemingly no established method for mapping the output clusters found by the algorithm to whatever true labels are available (Christodoulopoulos et al., 2010). Additionally, the algorithm might find more or fewer classes than are in the gold standard (Christodoulopoulos et al., 2010).

2.3.7 V-Measure

After evaluating several different measures for word class induction, Christodoulopoulos et al. (2010) recommend a particular measure called V-Measure. It was introduced by Rosenberg and Hirschberg (2007), and is an external, entropy-based measure. Entropy is a measure of the disorder within a system: the higher the disorder, the higher the entropy (Jurafsky and Martin, 2009, p. 148).

V-Measure (see Equation 10)⁴ is comparable to F-measure (Christodoulopoulos et al., 2010; Rosenberg and Hirschberg, 2007), in that F-measure is the weighted harmonic mean of precision and recall, where precision is the number of relevant retrieved items divided by the total number of retrieved items, and recall is the number of relevant retrieved items divided by the total number of relevant instances in the data (Jurafsky and Martin, 2009, p. 489).

Similarly, V-Measure is the weighted harmonic mean of two values, namely homogeneity and completeness (Rosenberg and Hirschberg, 2007). A high homogeneity score generally decreases the completeness score, and the other way around (Rosenberg and Hirschberg, 2007).

For the homogeneity criterion to be completely fulfilled, zero conditional entropy is required, i.e. each cluster is only allowed to contain members from one class (see Equations 4, 5 and 6). The completeness criterion (see Equations 7, 8 and 9) instead measures the degree to which all members of a class are assigned to the same cluster (Rosenberg and Hirschberg, 2007).

In the equations below, N is the number of data points, C is the set of classes and K the set of clusters. Equation 5, for instance, loops over the set of clusters K and, within each cluster, over the set of classes C; each term is a_ck divided by the number of data points, times the logarithm of a_ck divided by the total count of cluster k.

Homogeneity (precision analogue):

h = 1 - \frac{H(C|K)}{H(C)}    (4)

where

H(C|K) = -\sum_{k=1}^{|K|} \sum_{c=1}^{|C|} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{c=1}^{|C|} a_{ck}}    (5)

H(C) = -\sum_{c=1}^{|C|} \frac{\sum_{k=1}^{|K|} a_{ck}}{n} \log \frac{\sum_{k=1}^{|K|} a_{ck}}{n}    (6)

Completeness (recall analogue):

c = 1 - \frac{H(K|C)}{H(K)}    (7)

where

H(K|C) = -\sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{a_{ck}}{N} \log \frac{a_{ck}}{\sum_{k=1}^{|K|} a_{ck}}    (8)

H(K) = -\sum_{k=1}^{|K|} \frac{\sum_{c=1}^{|C|} a_{ck}}{n} \log \frac{\sum_{c=1}^{|C|} a_{ck}}{n}    (9)

V-Measure (F-measure analogue):

V_\beta = \frac{(1 + \beta) h c}{\beta h + c}    (10)

⁴ All the equations in Section 2.3.7 are adapted from Rosenberg and Hirschberg (2007, pp. 411-412).

The counts a_ck come from the matrix A = {a_ij}, which Rosenberg and Hirschberg (2007, p. 411) describe as "the number of data points that are members of class c_i and elements of cluster k_j".

In V-Measure, if β is greater than 1, completeness is weighted more strongly than homogeneity; conversely, if β is less than 1, homogeneity is weighted more strongly than completeness (Rosenberg and Hirschberg, 2007). The authors also point out that the measure is independent of the other parameters, i.e. the number of classes and the number of clusters, as well as the size of the data set and the clustering algorithm utilized (Rosenberg and Hirschberg, 2007). For a more thorough specification of V-Measure, see Rosenberg and Hirschberg (2007).
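As a concrete illustration, Equations 4 through 10 can be computed directly from two parallel label sequences, one with gold classes and one with induced cluster ids. The sketch below uses natural logarithms and handles the degenerate zero-entropy cases; it follows my reading of Rosenberg and Hirschberg (2007), not any code from the thesis:

```python
import math
from collections import Counter

def v_measure(classes, clusters, beta=1.0):
    """Homogeneity, completeness and V-Measure (Equations 4-10)
    from parallel sequences of gold classes and induced clusters."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))   # the a_ck counts
    n_c = Counter(classes)                    # per-class totals
    n_k = Counter(clusters)                   # per-cluster totals
    H_C = -sum((m / n) * math.log(m / n) for m in n_c.values())
    H_K = -sum((m / n) * math.log(m / n) for m in n_k.values())
    # H(C|K): remaining uncertainty about the class given the cluster (Eq. 5).
    H_C_K = -sum((a / n) * math.log(a / n_k[k]) for (c, k), a in joint.items())
    # H(K|C): remaining uncertainty about the cluster given the class (Eq. 8).
    H_K_C = -sum((a / n) * math.log(a / n_c[c]) for (c, k), a in joint.items())
    h = 1.0 if H_C == 0 else 1.0 - H_C_K / H_C
    c = 1.0 if H_K == 0 else 1.0 - H_K_C / H_K
    v = 0.0 if h + c == 0 else (1 + beta) * h * c / (beta * h + c)
    return h, c, v
```

A perfect clustering of two gold classes gives h = c = V = 1, while lumping everything into one cluster gives perfect completeness but zero homogeneity, and hence V = 0.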

2.4 Purpose

The aim of this work is not to define or re-define word classes in SSL. Rather, it aims to serve as a first step towards understanding how well established computational linguistic methods of word class induction perform with transcribed glosses of spontaneous SSL as input data, and what could be done to improve the outcome. To match the scope of the work, the algorithms utilized are delimited to the K-means algorithm and the Brown algorithm. To this end, the following questions are addressed:

1. Is utilizing unsupervised learning for finding word classes in SSL a feasible approach?

2. Which, if any, of the K-means algorithm and the Brown algorithm will perform well in identifying word classes for SSL?

3. Is it possible from the results of this work to say anything about the number of word classes in SSL?


3 Data and Method

The method of this thesis consists of five main steps. The first was to extract data from The Swedish Sign Language Corpus (Mesch et al., 2012), the second to generate a co-occurrence matrix from the extracted utterances, the third to cluster this matrix with the K-means algorithm, and the fourth to cluster the extracted utterances with the Brown algorithm. The fifth step was to use V-Measure to evaluate the clusters produced by the two algorithms. Below, a brief description of the corpus is presented, and then each of these five steps is explained.

3.1 The Swedish Sign Language Corpus

The Swedish Sign Language Corpus (Mesch et al., 2012) is a corpus containing films (n=47) in which pairs of deaf people sign with each other in SSL. What the signers (n=42) say – each sign in these films – has been transcribed in the program ELAN⁵ by other signers, who have also marked up the start and end points of each sign. Each sign is thus represented by a gloss (token frequency=29686) which is unique in its meaning, i.e. represents only one sign (type frequency=5572). Additionally, some of the glosses have been tagged with extra information, for instance that the sign was reduplicated (BURN@r) or performed with the non-dominant hand (INDEX@nh) (Wallin et al., 2009).⁶

Furthermore, interpreters have translated what the signers say into Swedish. These translations are in the form of sentences, rather than word slots corresponding to the glosses (see Wallin et al. (2009)), since the latter would be futile given that the syntax of SSL differs to some extent from the syntax of Swedish (SOU, 2006a). The translations have therefore been made based on what the interpreters have felt is reasonable with respect to SSL, although this sometimes had to be compromised (Anna-Lena Nilsson, p.c.). For instance, an utterance in SSL can end with a repetition of a sign, which in the Swedish translation would appear earlier or not at all (Anna-Lena Nilsson, p.c.).

More importantly, the interpreters in most cases translated the films before the glossing had been done (Anna-Lena Nilsson, p.c.), which means, for instance, that a translation sometimes ends before a gloss starts, or the other way around.

As can be seen in Figure 2, the glosses and the translations are on different tiers in ELAN.

5 See http://tla.mpi.nl/tools/tla-tools/elan/ for more information about the software.

6 In this thesis, words written in uppercase with THIS font represent a gloss. In the corpus, these very much resemble Swedish. In this work, however, I have translated them into English.


Figure 2: An example from one of the transcribed and translated films in ELAN.

The output files from ELAN are in an XML-based format, with information about who says what, the start and end points of each transcription, and, of course, the actual transcriptions.

3.2 Extracting Data

In order to extract the relevant data from the corpus, I wrote a program that mainly does two things:

1. Processing the glosses.

2. Mapping the glosses to utterances.

3.2.1 Processing the Glosses

As explained in Section 3.1, some of the transcribed glosses contain extra information. Depending on the type, this extra information was sometimes beneficial and sometimes detrimental for the task at hand. The goal was to remove all tags that could be removed without changing the meaning of the gloss.

Table 4 presents the different tags that have been used in addition to the glosses. The first two columns are taken from Wallin et al. (2009, p. 1), and the third is a description of how the gloss was processed in the program. Examples include GLOSS@xxx (meaning that the transcriber was not completely sure whether this was the sign performed) and GLOSS@h (meaning that a sign is held while the signer continues to perform one or more signs with the other hand).


Table 4: Explanation of the tags and how they were processed. The first two columns are adapted from Wallin et al. (2009, p. 1).

TYPE               MEANING                                                    PROCESSED
GLOSS@b            finger-spelled                                             extracted with @b included
GLOSS@p            polysynthetic sign                                         extracted with @p included
GLOSS@g            gesture or gesture-like sign                               extracted with @g included
GLOSS@&            interrupted sign                                           not extracted at all
GLOSS@n            name or name sign                                          extracted with @n included
GLOSS@zzz          suggestion to name of gloss by transcriber                 extracted without @zzz
GLOSS@xxx          transcriber not sure whether this was the sign performed   extracted without @xxx
GLOSS<>GLOSS2@nh   overlapping signs                                          both added (in order)
GLOSS@r            reduplicated sign                                          extracted with @r included
GLOSS@nh           performed with non-dominant hand                           extracted without @nh
GLOSS@h            held sign                                                  extracted only once, without @h (even if held while performing other signs)
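The processing rules in Table 4 can be sketched as a small Python function. This is a simplified reconstruction for illustration, not the actual extraction program; in particular, the deduplication of held signs across time slots is omitted here:

```python
def process_gloss(gloss):
    """Apply (a simplification of) the Table 4 rules to one raw gloss.

    Returns a list of zero, one or two output glosses.
    """
    if gloss.endswith("@&"):                   # interrupted sign: dropped
        return []
    if "<>" in gloss:                          # overlapping signs: both kept, in order
        return [out for part in gloss.split("<>") for out in process_gloss(part)]
    for tag in ("@zzz", "@xxx", "@nh", "@h"):  # uninformative tags are stripped
        if gloss.endswith(tag):
            return [gloss[: -len(tag)]]
    return [gloss]                             # @b, @p, @g, @n, @r stay as-is
```

For example, `process_gloss("GLOSS<>GLOSS2@nh")` yields both glosses in order, with the @nh tag stripped.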

3.2.2 Mapping Glosses to Utterances

As explained in Section 3.1, the start or end of a translation did not always correspond exactly to the glosses on the other tier in ELAN. However, since the corpus lacks information about utterance boundaries in SSL, the translations were the only cues available for mapping the glosses to utterances.

In order to do this, the program decided which translation each gloss should be mapped to as follows. A gloss was considered part of an utterance if:

1. the middle of the gloss was within the time interval of a translation,

2. a larger part of the gloss was within the interval of one translation than outside it or inside another translation, or

3. the gloss was closer to one translation than to any other, but no further away than 200 milliseconds.

A large number of the resulting utterances consisted of only one gloss, which is unusable for the word space model (see Section 2.3.2). These were, however, used for the Brown algorithm. See Table 5 for an example from the output of the implemented program.7

Table 5: Example of an utterance extracted from ELAN.

PRO>present USUALLY VIEW TV@b PROGRAM USUALLY INSIDE COMPUTER OR TV@b ZAP USUALLY PRO>present

“Do you usually watch sign language shows on TV or online?”
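The mapping heuristics described in this section can be sketched as follows. This is a simplified reconstruction, not the actual program, and the time intervals in the example are hypothetical (in milliseconds):

```python
def map_gloss(gloss, translations, max_gap_ms=200):
    """Assign a (start, end) gloss to one of a list of (start, end)
    translation intervals, following a simplification of the mapping
    heuristics described above. Returns the index of the chosen
    translation, or None if no translation qualifies."""
    g_start, g_end = gloss
    mid = (g_start + g_end) / 2
    # Rule 1: the gloss midpoint falls inside a translation interval.
    for i, (t_start, t_end) in enumerate(translations):
        if t_start <= mid <= t_end:
            return i
    # Rule 2: the translation with the largest temporal overlap wins.
    overlaps = [max(0, min(g_end, t_end) - max(g_start, t_start))
                for t_start, t_end in translations]
    if max(overlaps) > 0:
        return overlaps.index(max(overlaps))
    # Rule 3: the nearest translation, if no further than max_gap_ms away.
    gaps = [min(abs(g_start - t_end), abs(t_start - g_end))
            for t_start, t_end in translations]
    best = gaps.index(min(gaps))
    return best if gaps[best] <= max_gap_ms else None
```

With `translations = [(0, 1000), (1500, 3000)]`, a gloss at (900, 1100) is mapped to the first translation by the midpoint rule, while a gloss at (5000, 5100) is mapped to nothing, since it lies more than 200 ms from both.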

As a result, roughly 93% of the glosses could be used (token frequency = 27351, type/token ratio = 0.141), corresponding to 3484 utterances (≈ 7.85 glosses per utterance). These utterances constitute the output data retrieved from the corpus and the input data to the co-occurrence matrix.

For the Brown algorithm, all the mappable glosses (including utterances consisting of only one gloss) were used (token frequency = 28046, type/token ratio = 0.137).

7 A part of this utterance can be seen in Figure 2. The translation in Table 5 is an English rendering of the corresponding Swedish translation.


3.3 Gold Standard

Although there is no established consensus as to which word classes exist in SSL, a gold standard of the 110 most common glosses in the corpus (corresponding to a minimum frequency of 40) was developed based on the definitions made by SOU (2006a, p. 27).8 From this set, some glosses were removed due to ambiguity.

3.4 The Co-occurence Matrix

I implemented another program to populate a co-occurrence matrix (see Section 2.3.2), where the contexts were one gloss to the left and one gloss to the right of each gloss in the utterances extracted from the corpus. A context never stretched over two utterances. A function was implemented that allows gloss contexts with a frequency below a certain value to be excluded when populating the matrix. This will henceforth be referred to as the context frequency stop-list.

Also, a function that lets the user check for the closest neighbors in the vector space was implemented.9
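A matrix-population routine of this kind, including the context frequency stop-list, can be sketched as follows (a reconstruction under the assumptions above, not the actual thesis program):

```python
from collections import Counter

import numpy as np

def cooccurrence_matrix(utterances, min_context_freq=0):
    """Build a gloss-by-context co-occurrence matrix with a window of one
    gloss to the left and one to the right, never crossing utterance
    boundaries. Contexts rarer than min_context_freq are excluded (the
    'context frequency stop-list')."""
    pair_counts = Counter()
    for utt in utterances:
        for i, gloss in enumerate(utt):
            if i > 0:
                pair_counts[(gloss, ("L", utt[i - 1]))] += 1
            if i < len(utt) - 1:
                pair_counts[(gloss, ("R", utt[i + 1]))] += 1
    context_totals = Counter()
    for (_, ctx), n in pair_counts.items():
        context_totals[ctx] += n
    rows = sorted({g for g, _ in pair_counts})
    cols = sorted(c for c, n in context_totals.items() if n >= min_context_freq)
    row_ix = {g: i for i, g in enumerate(rows)}
    col_ix = {c: j for j, c in enumerate(cols)}
    M = np.zeros((len(rows), len(cols)))
    for (g, c), n in pair_counts.items():
        if c in col_ix:
            M[row_ix[g], col_ix[c]] = n
    return M, rows, cols
```

Left and right neighbors are kept as distinct context dimensions, so a gloss's row vector encodes its positional distribution.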

3.5 Clustering with the K-means Algorithm

I also implemented a third program to cluster the matrix with the K-means algorithm, utilizing packages from Scikit-learn.10 Different parameters were varied in order to see how the result changed, for instance the value of K and how often a gloss had to appear in the matrix (henceforth referred to as the vector-sum stop-list). Additionally, experiments were carried out that differed with respect to the context frequency stop-list.

Furthermore, since the initialization of the cluster centroids affects the outcome extensively (see Section 2.3.4), a parameter available in the open-source program for refining the initialization was utilized.11 This parameter prevents the cluster centroids from being initialized too close to each other.
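The clustering step can be sketched with Scikit-learn's KMeans and its k-means++ initialization, which spreads the initial centroids apart (presumably the refinement option referred to above). The toy matrix here stands in for the gloss-by-context matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy "gloss" groups in a 4-dimensional context space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.05, size=(5, 4))
               for loc in ([1, 0, 0, 0], [0, 0, 1, 0])])

# init="k-means++" keeps the initial centroids far apart.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)
```

With clearly separated data like this, the two groups of rows end up in two different clusters.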

3.6 Clustering with the Brown Algorithm

The Brown algorithm – as described in Section 2.3.5 – does not require more than unlabeled text. Therefore the matrix was not used for this step, but rather just the extracted utterances. This algorithm was also run with an open-source program.12 The Brown algorithm was run with a frequency stop-list set to 40 (applied to the glosses in the extracted utterances), corresponding to the 110 most frequent glosses, since these are the ones in the gold standard.

3.7 Evaluation

In order to evaluate the clusters produced by the two algorithms, another small program was implemented, also using the package from Scikit-learn.13 This program was used to calculate the V-measure for the different algorithms.
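The evaluation step essentially amounts to a single Scikit-learn call; a sketch with hypothetical gold labels and induced cluster ids:

```python
from sklearn.metrics import v_measure_score

# Hypothetical gold word-class labels and induced cluster ids.
gold    = ["noun", "noun", "verb", "verb", "adv"]
induced = [0, 0, 1, 1, 1]           # the verbs and the adverb end up together

score = v_measure_score(gold, induced)
```

Since the adverb is merged into the verb cluster, homogeneity is below 1 and the score lands strictly between 0 and 1; a clustering that matches the gold classes exactly (under any renaming of the cluster ids) scores 1.0.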

8 The gold standard was also kindly hand-corrected by Lars Wallin.

9 http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

10 http://scikit-learn.org/stable/index.html

11 http://scikit-learn.org/stable/modules/clustering.html#k-means

12 https://github.com/percyliang/brown-cluster

13 http://scikit-learn.org/0.13/modules/generated/sklearn.metrics.v_measure_score.html


4 Results

For the different steps, different evaluation methods were used. Below, the results are presented as follows: a short presentation of the data distribution, a manual evaluation of the outcome of the word space model, the evaluation of the K-means clustering, and lastly the evaluation of the Brown clustering.

4.1 Frequencies of Glosses in Swedish Sign Language

Table 6 shows a sample of the glosses with their frequencies and ranks.

Table 6: The frequencies and the rank for a small sample of the glosses.

GLOSS FREQUENCY RANK

INDEX-1 2089 1

INDEX 1067 2

PU@g 841 3

... ... ...

MUCH 141 21

... ... ...

PERF 136 23

GOOD 136 24

... ... ...

CHILD 80 55

WRITE-KEYBOARD 80 56

WORK 79 57

INSIDE 78 58

... ... ...

WHICH 13 336

EASY 13 337

TREE 13 338

OTHER 13 340

THRIVE 13 341

... ... ...

ARTICLE 1 3670

... ... ...

BURN@r 1 3674

... ... ...

REASON 1 3691

... ... ...

PRAYER 1 3848


It is worth noting that the extracted data follows Zipf's law, which states that the most frequent word in a corpus of natural language appears roughly twice as often as the second most frequent word, three times as often as the third most frequent word, and so on (Zipf, 1949). See Figure 3.

Figure 3: Frequency distribution for the extracted utterances. The x-axis corresponds to the rank of the type and the y-axis to the token frequency. R² = 0.974.
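An R² of this kind can be obtained from a linear fit in log-log space; the following sketch uses synthetic Zipfian counts rather than the corpus data:

```python
import numpy as np

# Synthetic Zipf-distributed frequencies: f(r) proportional to 1/r.
ranks = np.arange(1, 201)
freqs = np.round(2000 / ranks)

# Fit log f = a * log r + b and compute R^2 of the fit.
log_r, log_f = np.log(ranks), np.log(freqs)
a, b = np.polyfit(log_r, log_f, 1)
pred = a * log_r + b
r2 = 1 - np.sum((log_f - pred) ** 2) / np.sum((log_f - np.mean(log_f)) ** 2)
```

For ideal Zipfian data the slope a is close to −1 and R² close to 1; the empirical distribution in Figure 3 deviates somewhat more.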

4.2 The Word Space Model

The reason for using the K-means cluster analysis was that the matrix is too large to be analyzed manually. However, some manual evaluation was done: a sample of a handful of glosses and their nearest neighbors was extracted, with the context frequency stop-list set to 40.

Below, two frequent and two infrequent glosses and their nearest neighbors are presented. As can be seen, the nearest neighbors of the two frequent glosses (MOTHER and HAVE, see Tables 7 and 8) seem good, whereas the neighbors of the two infrequent ones (COFFEE and OPEN, see Tables 9 and 10) seem less accurate. In the case of COFFEE and OPEN, and other glosses that appear only a few times in the utterances, the context vector happens to be similar to those of, for instance, BUT and THEN@b.


Table 7: Cosine distances for the glosses closest to the gloss MOTHER. Lower values are better. Gloss frequency = 0.00347.

GLOSS 1-COS V

PARENTS 0.2370

FOOT@b 0.2715

OLDEST 0.2715

FORMER 0.2715

FATHER 0.2788

SISTER 0.2972

GRANDMOTHER^GRANDFATHER 0.3132

BROTHER 0.3255

MOTHER^FATHER 0.3315

Table 8: Cosine distances for the glosses closest to the gloss HAVE. Lower values are better. Gloss frequency = 0.00559.

GLOSS 1-COS V

MEET 0.3036

CAN 0.3312

WORK 0.3345

PRACTICE@r 0.3417

PERF 0.3429

WANT 0.3559

SIT 0.3649

FEEL 0.3727

Table 9: Cosine distances for the glosses closest to the gloss COFFEE. Lower values are better. Gloss frequency = 0.00022.

GLOSS 1-COS V

BUT 0.2821

THEN@b 0.2917

BECAUSE 0.2922

SOME@b 0.2929

ONE^TWO^THREE 0.2929

EX@b 0.2929

TALK@r 0.2929


Table 10: Cosine distances for the glosses closest to the gloss OPEN. Lower values are better. Gloss frequency = 0.00022.

GLOSS 1-COS V

SOME@b 0.3876

PREPARE 0.4226

TWELVE-O’CLOCK 0.4227

ENTITY(J)@p-OVERLAP 0.4227

SHIRT 0.4227
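Nearest-neighbor listings like those in Tables 7-10 can be produced with SciPy's cdist (the function referred to in footnote 9); the context vectors below are invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist

glosses = ["MOTHER", "FATHER", "HAVE", "COFFEE"]
# Toy context-count vectors (rows correspond to the glosses above).
M = np.array([[5.0, 1.0, 0.0],
              [4.0, 1.0, 0.0],
              [0.0, 3.0, 4.0],
              [1.0, 0.0, 1.0]])

# Cosine distances (1 - cos) from MOTHER to every gloss.
d = cdist(M[:1], M, metric="cosine")[0]
order = np.argsort(d)
neighbors = [glosses[i] for i in order if glosses[i] != "MOTHER"]
```

With these toy vectors, FATHER is closest to MOTHER (nearly parallel vectors), followed by COFFEE and HAVE.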

4.3 The K-means Algorithm

Table 11 shows a typical example of the outcome of the K-means algorithm, regardless of what K (the number of clusters), the context frequency stop-list or the vector-sum stop-list were set to. In this case the clustered matrix was populated with the context frequency stop-list set to 40, the vector-sum stop-list also set to 40, and K set to 8. The clusters with only one gloss (all except cluster 0) actually contained only the glosses shown, whereas for the cluster with several glosses (cluster 0) a sample is shown. Thus, almost all of the glosses appeared in the same cluster.

Table 11: An example of clusters produced with the K-means clustering algorithm.

CLUSTER GLOSSES

0 FILM, WIFE, CONTACT, MOST, FUNNY, MAN, ...

1 INDEX-1

2 MAJA@b@n

3 GIVE

4 YOUNG

5 GROUND

6 OLDER

7 PERCEIVE@r

8 SAY

Furthermore, all runs of the K-means algorithm performed poorly according to V-measure; with K set to 9 and the context frequency stop-list set to 40, the value was 0.0933.

Also, the results did not improve when using the function for refining the cluster centroid initialization (see Section 3.5).


4.4 The Brown Algorithm

To establish a baseline, the glosses in the gold standard were assigned to clusters at random and evaluated with V-measure. To obtain a representative random classification, the average over 100 000 runs was taken, as well as the standard deviation of these.
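The baseline computation can be sketched as follows, with hypothetical gold labels and far fewer runs than the 100 000 used in the thesis:

```python
import numpy as np
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)

# Hypothetical gold word-class labels for ten glosses.
gold = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]

# Average V-measure over many random assignments to k clusters.
k, runs = 4, 1000
scores = [v_measure_score(gold, rng.integers(0, k, size=len(gold)))
          for _ in range(runs)]
baseline, sd = float(np.mean(scores)), float(np.std(scores))
```

With few items, random clusterings score well above zero by chance, which is why an induced clustering must be compared against this simulated mean and its spread rather than against zero.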

An evaluation with V-measure (see Section 2.3.7) was run with the initial14 number of clusters set to different values (with the gold standard as true labels). See Table 12.

Table 12: Results from the Brown algorithm. Value is between 0 and 1, higher is better.

NUMBER OF CLUSTERS VALUE BASELINE (STANDARD DEVIATION)

110 0.575 0.433 [0.0151]

55 0.509 0.432 [0.0178]

20 0.379 0.298 [0.0219]

16 0.379 0.266 [0.0217]

12 0.369 0.225 [0.0227]

11 0.297 0.214 [0.0226]

10 0.283 0.202 [0.0232]

9 0.277 0.187 [0.0212]

8 0.347 0.172 [0.0236]

7 0.310 0.156 [0.0236]

4 0.166 0.097 [0.0233]

3 0.176 0.072 [0.0218]

All the clusters produced by the Brown algorithm are significantly better than the random baseline, i.e. they lie outside the 95% confidence interval (baseline ± 1.96 × the standard deviation).

However, since SOU (2006b) presents eight word classes for SSL, the most interesting results are perhaps those for 7, 8, 9, 10 and 11 clusters, given that there might be more or fewer word classes in SSL than eight. Here, a two-tailed non-parametric test, the Mann–Whitney U test (Mann and Whitney, 1947, see Equation 11), was performed.

U = N1N2 + N1(N1 + 1)/2 − R1    (11)

where N1 and N2 are the sizes of the two samples and R1 is the sum of the ranks of the first sample.

The null hypothesis was that population one, containing the values for 7, 8, 9, 10 and 11 clusters, does not differ significantly from population two, i.e. their baselines. The alternative hypothesis was that they differ.

The p-value from the Mann–Whitney U test is 0.01208; in other words, the two groups differ significantly, and the null hypothesis can be rejected.
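The test can be reproduced with SciPy on the values from Table 12 (clusters 7-11 and their baselines). Note that SciPy's exact method may yield a p-value slightly different from the 0.01208 reported above, depending on whether an exact or a normal-approximation computation is used:

```python
from scipy.stats import mannwhitneyu

# V-measure values for 7-11 clusters and their baselines (Table 12).
brown     = [0.310, 0.347, 0.277, 0.283, 0.297]
baselines = [0.156, 0.172, 0.187, 0.202, 0.214]

u, p = mannwhitneyu(brown, baselines, alternative="two-sided")
```

Since every Brown value exceeds every baseline value, the U statistic takes its maximum (25 for two samples of five), and the p-value falls well below 0.05.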

14 As was explained in Section 2.3.5, the initial number of clusters can be set to any number not higher than the total number of words (or glosses, as in this case).


Table 13 shows a selection of instances in the clusters produced with the Brown algorithm when the number of clusters was set to 8. One cluster contained only (two) glosses that were removed from the gold standard due to ambiguity (see Section 3.3). Table 13 should not be seen as representative of what the clusters contain, since these glosses are just a selection; rather, it should be seen as an example of output from the algorithm (see Appendix for the whole outcome of this clustering).

Table 13: A selection from a clustering produced by the Brown algorithm, with 8 clusters and the glosses of a frequency of 40 or greater.

CLUSTER GLOSS FREQUENCY

00 MANY 52

00 PRO-MANY-1 88

00 PRO-1 2089

0100 SAME 102

0100 IMPLY 115

0100 (to) SIGN 119

0101 HUG 57

0101 WILL 58

0101 BEFORE 63

0110 HOW 155

0110 CAUSE 104

0110 THEN 157

0111 OTHER 54

0111 ALL 60

0111 OBJ-PRON-1 60

110 FATHER 54

110 CHILD 80

110 MOTHER 95

111 MUCH 141

111 OR 77

111 TO 100


Figure 4 shows how V-measure changes with the number of clusters produced by the Brown algorithm. These results should be interpreted carefully, first because V-measure weights homogeneity more strongly than completeness when the number of clusters is larger than the number of word classes (the true labels in the gold standard), and second because the number of word classes in the gold standard is set to eight. This will be discussed further in Section 5.3.3.

Figure 4: Change in V-measure for the different numbers of clusters produced with the Brown algorithm.

5 Discussion

In this section, the different parts of this work are discussed in turn.

5.1 Discussion of Background

In previous research on identifying word classes in unlabeled data sets, it has often been the case that an informant has been available for correcting the outcome by hand as a step in the machine learning process, i.e. semi-supervised learning.

Others have employed unsupervised learning algorithms for inducing word classes and evaluated them against some set of true labels. This was also the case for this work. Even so, the validity of these true labels can be disputed, since there is no clear consensus as to which word classes exist in SSL.

Furthermore, there are many different methods for word class induction that have not been discussed in this thesis, although some of them might have produced better results (see for instance Christodoulopoulos et al. (2010) for a summary).

5.2 Discussion of Data and Method

Bob Mercer's famous statement "There is no data like more data" (Bertin-Mahieux et al., 2011) does apply here, since the amount of data was relatively small. Having more data would maybe not be enough, however. Earlier studies, such as Brill and Marcus (1992), have succeeded in getting better results with even less data. Then again, they employed semi-supervised learning of the kind mentioned above.

Nonetheless, the content of the extracted data should be considered. As mentioned, the implemented extraction program processed glosses with tags (GLOSS@) and mapped glosses to utterances (see Table 4).

5.2.1 Discussion of Processing the Glosses

The different tags are:

GLOSS@b, GLOSS@p, GLOSS@g, GLOSS@&, GLOSS@n, GLOSS@zzz, GLOSS@xxx, GLOSS<>GLOSS2@nh, GLOSS@r, GLOSS@nh and GLOSS@h.

Below follows a short explanation of how it was decided to process each of these.

Glosses of the type GLOSS@b, meaning the sign was finger-spelled, were extracted with the tag included, based on the assumption that fingerspelling does not change the meaning of the sign. This might need to be examined by someone with more knowledge on the subject.

Deciding how to process glosses of the type GLOSS@p, meaning the sign is polysynthetic, was difficult, since polysynthetic signs are complex, resembling whole sentences (Wallin, 1994). Glosses with this tag were extracted with the tag included, which was perhaps not optimal for the algorithms used. It is reasonable to assume that, being complex, polysynthetic signs are also rare in terms of token frequency.

As both the word space model and the Brown algorithm are based on the distributional hypothesis, splitting these glosses could have improved the results. However, this too would have required a deeper analysis.

Glosses of the type GLOSS@g, meaning the gloss represents a gesture rather than a sign, should maybe not have been included in the extraction of the corpus data, since gestures are seemingly a difficult matter. If it is not clear how using gestures as input to algorithms based on distributional semantics affects the results, then both alternatives should have been tested. That is to say, the word space model – and by extension the K-means clustering – or the Brown clustering might have performed better had these glosses not been extracted at all.

Interrupted signs – represented with GLOSS@& – were not included in the extraction, since whatever sign followed an interrupted sign was probably the actual sign the signer had in mind.

Glosses tagged with @n, meaning the sign was a name, were always extracted, since removing them would remove information essential for the algorithms: the data would then contain utterances where the sign for a place or a person was omitted.

Glosses tagged as GLOSS@zzz, meaning the gloss name was a suggestion by the transcriber, were extracted without the tag @zzz. This tag was used so that those who work on the corpus could return to the gloss later and check whether the chosen gloss name was the correct one, which is beyond the scope of this work. Glosses tagged as GLOSS@xxx, meaning the transcriber was uncertain whether this was the sign performed, were also extracted without their tags. This decision was based on the assumption that a native signer of a language in general has a good idea of what is being said. It would be difficult to evaluate this decision without manual analysis of at least a sample of the glosses with this tag: maybe these glosses only added noise to the data, or maybe they enhanced its quality.

Since GLOSS<>GLOSS2@nh meant that two signs overlapped, but that one was still produced before the other, these glosses were split into two and extracted chronologically. Glosses tagged with @r were not processed but extracted with the tag, since reduplication can change the meaning of the sign (see Section 2.2). The information that a sign was produced with the non-dominant hand, tagged as GLOSS@nh, could maybe have been of use, although this would probably require some other algorithm than those utilized in this work, since the non-dominant hand (as mentioned in Section 3.2.1) is sometimes held. Including every repetition of such a held sign would give data where, for instance, every third transcribed gloss was the same sign. These glosses were therefore extracted without the tag, and only once. The same applies to GLOSS@h.


To summarize, the meaning of all the tags should perhaps have been analyzed further before deciding how to process them. However, since the purpose of using unsupervised learning is to circumvent manual analysis, it is altogether a tricky question to what extent such an analysis should be employed.

It is also worth noting that some of the other transcribed signs could benefit from being revised, which has also been noted by Lars Wallin (p.c.). For instance, some glosses in the corpus consist of two unified signs: if there is no movement between the signs, they have been glossed as, for example, UN^USUAL or SAID^PRO. Results produced with algorithms using distributional semantics would probably benefit from the former being one token and the latter being two tokens.

5.2.2 Discussion of Mapping the Glosses

As for mapping the glosses to utterances, the method employed was perhaps not optimal. Still, there was hardly any way of knowing how long an utterance in the corpus is other than to use the Swedish translations – that is, short of being fluent in SSL, or abandoning automatic extraction. Since the validity of the word space matrix might have been higher had the utterance boundaries been decided based on cues in SSL, such as head position and gaze direction (Bergman, 1995) or eye blinks (Baker, 1976), a study that takes this into consideration would be a suitable complement or replacement for the step of mapping glosses to utterances. Fenlon (2010) presents results from such a study, suggesting that "boundaries can be identified in a reliable way" (Fenlon, 2010, p. 4). However, this was out of scope for this thesis.

Another solution could have been not to use the utterances at all, but instead to extract the glosses in the order they appeared chronologically.

5.3 Discussion of Results

This section gives a short summary of the results of the study. It also addresses a number of problems in this work, and proposes some solutions to these problems.

5.3.1 Discussion of the Word Space Model

The word space model tended to become either too noisy, i.e. many glosses appeared only once in many different contexts (if the context frequency stop-list was set to e.g. 0), or too sparse, i.e. many glosses appeared many times in the same contexts (if the context frequency stop-list was set to e.g. 40). Using a larger context window, e.g. two glosses to the left and right of each focus gloss, could maybe have given better results. Another alternative could have been to use some other model, such as LSA.

5.3.2 Discussion of the K-means Algorithm

The K-means algorithm performed poorly on the task of clustering the matrices produced with the word space model into word classes. Glosses that according to the distinctions made by SOU (2006a) clearly should be classified as belonging to different word classes, such as SPORT, ALWAYS and KNOW, appeared in the same cluster, independent of what any of the parameters were set to. Evaluating with V-measure confirmed the poor results. One reason for this may be that the data set is small, which makes it hard for the algorithm to find coherence, but it may also be that no dimensionality reduction of the matrix was employed.

Additionally, semi-supervised learning should perhaps have been employed: the K-means algorithm could have been given some true values before clustering, for instance that MOTHER is a noun and HAVE is a verb, seeing as these glosses seemed to have good neighbors.

5.3.3 Discussion of the Brown algorithm

The results of the Brown algorithm are far more promising than those of the K-means algorithm. First of all, the clusters themselves comprise glosses that to some extent seem coherent in terms of word classes.

Also, the algorithm significantly outperforms the random baseline.


The result of the Mann–Whitney U test needs to be interpreted with caution, since one of the values might increase the overall value for the whole group, but the result of the hypothesis test at least indicates that the Brown algorithm can be worth using in future studies.

However, for all numbers of clusters except 55 and 110, the value of V-measure is below 0.5, which is quite weak compared to the results presented by Christodoulopoulos et al. (2010, p. 581) – although they used almost twice as much data as this study. Nonetheless, the results indicate that the Brown algorithm works for the kind of data used here.

As for how the results change depending on the number of clusters, V-measure improves from 4 clusters up to 8 clusters, drops at 9 clusters, and then steadily rises again (see Figure 4). This might be explained by how V-measure works (see Section 2.3.7). If the homogeneity criterion is 1, as in the case of 110 clusters, meaning that each gloss starts out in its own cluster (zero entropy), then the V-measure will be somewhat skewed, even though completeness is weighted more strongly (Christodoulopoulos et al., 2010). Vlachos et al. (2009) present an improved version of V-measure, called the V-beta measure, which adjusts for this. Furthermore, it is not easy to say anything generic about the number of word classes from the results presented in this study, since the gold standard includes exactly 8 word classes.
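A sketch of how such a beta-weighted score adjusts for over-clustering follows. Note the assumption here: β is set to the ratio of the number of clusters to the number of classes, following my reading of Vlachos et al. (2009); the labels are hypothetical:

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Hypothetical over-clustered solution: four clusters for three classes.
gold    = ["n", "n", "v", "v", "adv", "adv"]
induced = [0, 0, 1, 2, 3, 3]

h, c, v = homogeneity_completeness_v_measure(gold, induced)

# Beta-weighted V: with beta > 1, completeness is weighted more strongly,
# penalizing solutions that split classes into many pure clusters.
beta = len(set(induced)) / len(set(gold))    # assumed: clusters / classes
v_beta = (1 + beta) * h * c / (beta * h + c)
```

Here every cluster is pure, so homogeneity is 1 while completeness is below 1; since β > 1, the beta-weighted score comes out lower than the standard V-measure, as intended.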

5.3.4 Discussion of Evaluation

Evaluating clustering is, as mentioned, not an easy task. However, using established methods such as V-measure for comparison against a gold standard gives a hint of how good the results are.

Something that could have been done is to use internal evaluation for comparing the performance of K-means to that of Brown, but given the poor results of the former, this seemed futile.

As an aside, a more manual evaluation could also be employed. Since the clusters produced with the Brown algorithm are hierarchical, and since there are not many glosses to consider and compare to the gold standard, manually evaluating them should be no large task. Perhaps the errors in the clusters would reveal some pattern.

5.4 Future Work

As has been discussed, most of the steps in this work would have gained from some kind of refinement.

Above all, though, a deeper analysis of all the relevant features of SSL as part of choosing the induction algorithm would probably improve the outcome. Furthermore, since V-measure favors a high initial number of clusters, the V-beta measure – constructed to compensate for this issue – should be used instead.

6 Conclusions

Utilizing unsupervised learning for finding word classes in SSL seems to be a feasible approach, since the Brown algorithm performs significantly better at inducing word classes in SSL than a random baseline. The Brown algorithm also outperforms the K-means algorithm, although this comparison is somewhat skewed, as the input to the former was spontaneous communication transcribed into glosses, and the input to the latter a co-occurrence matrix populated with these glosses and their immediate left and right neighbors. The results presented in this study do not particularly shed new light on the number of word classes in SSL.


Bibliography

E. Alpaydin. Introduction to machine learning. The MIT Press, 2004.

C. Baker. Eye-Openers in ASL. In R. Underhill, editor, Proceedings: Sixth California Linguistics Association Conference, pages 37–38. San Diego, CA: Campanile Press, 1976.

B. Bergman. Verbs and adjectives: morphological processes in Swedish Sign Language. Language in sign: An international perspective on sign language, pages 3–9, 1983.

B. Bergman. Gränsmarkörer. Kompendium i teckenspråksgrammatik. Institutionen för Lingvistik, Stockholm, 1995.

T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. In ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24-28, 2011, Miami, Florida, pages 591–596. University of Miami, 2011.

C. Börstell. Revisiting Reduplication: Toward a description of reduplication in predicative signs in Swedish Sign Language. PhD thesis, Stockholm, 2011.

P. S. Bradley and U. M. Fayyad. Refining initial points for k-means clustering. In Proceedings of the fifteenth international conference on machine learning, volume 66. San Francisco, CA, USA, 1998.

E. Brill and M. Marcus. Tagging an unfamiliar text with minimal human supervision. In Proceedings of the Fall Symposium on Probabilistic Approaches to Natural Language, pages 10–16, 1992.

P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479, 1992.

C. Christodoulopoulos, S. Goldwater, and M. Steedman. Two Decades of Unsupervised POS induction: How far have we come? In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 575–584. Association for Computational Linguistics, 2010.

M. M. Deza and E. Deza. Encyclopedia of distances. Springer, 2009.

I. Färber, S. Günnemann, H.-P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, and A. Zimek. On using class-labels in evaluation of clusterings. In MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD, 2010.

U. Fayyad, C. Reina, and P. S. Bradley. Initialization of iterative refinement clustering algorithms. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 194–198, 1998.

J. Fenlon. Seeing sentence boundaries: the production and perception of visual markers signalling boundaries in signed languages. PhD thesis, UCL (University College London), 2010.

S. Finch and N. Chater. Bootstrapping syntactic categories using statistical methods. Background and Experiments in Machine Learning of Natural Language, 229:235, 1992.

M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Cluster validity methods: part 1. ACM Sigmod Record, 31(2):40–45, 2002.

G. Hamerly and C. Elkan. Alternatives to the k-means algorithm that find better clusterings. In Proceedings of the eleventh international conference on Information and knowledge management, pages 600–607. ACM, 2002.

Z. S. Harris. Distributional structure. Word, 1954.


W. P. Headden III, D. McClosky, and E. Charniak. Evaluating unsupervised part-of-speech tagging for grammar induction. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 329–336. Association for Computational Linguistics, 2008.

J. Hopkins. Choosing how to write sign language: a sociolinguistic perspective. International Journal of the Sociology of Language, 192(1):75–89, 2008.

T. Johnston and A. Schembri. On defining lexeme in a signed language. Sign Language & Linguistics, 2(2):115–185, 1999.

D. Jurafsky and J. Martin. Speech and Language Processing. Pearson Education Inc., 2nd edition, 2009.

T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. An Efficient Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:881–892, 2002.

G. Kiss. Grammatical word classes: A learning process and its simulation. Psychology of Learning and Motivation, 7:1–41, 1973.

P. Liang. Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology, 2005.

H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50–60, 1947.

J. Mesch, L. Wallin, A.-L. Nilsson, and B. Bergman. Svensk teckenspråkskorpus. Datamängd. Projektet Korpus för det svenska teckenspråket 2009–2011 (version 1), Avdelningen för teckenspråk, Institutionen för lingvistik, Stockholms universitet, 2012.

A.-L. Nilsson. Studies in Swedish Sign Language: Reference, Real Space Blending, and Interpretation. PhD thesis, Stockholm University, Department of Linguistics, 2010.

P. Pantel and D. Lin. Document clustering with committees. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 199–206. ACM, 2002.

A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 410–420, 2007.

M. Sahlgren. An Introduction to Random Indexing. In Proceedings of Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering. Copenhagen, 2005.

M. Sahlgren. The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Stockholm University, 2006.

P. Schachter. Language Typology and Syntactic Description: Volume 1, Clause Structure. Cambridge University Press, 2007.

H. Schütze. Part-of-speech induction from scratch. In Proceedings of the 31st annual meeting on Association for Computational Linguistics, pages 251–258. Association for Computational Linguistics, 1993.

W. Schwager and U. Zeshan. Word classes in sign languages: Criteria and classifications. Studies in Language, 32(3):509–545, 2008.


SOU. Utredningen Översyn av teckenspråkets ställning. Number 29. Socialdepartementet, Stockholm, 2006a.

SOU. Teckenspråk och teckenspråkiga. Översyn av teckenspråkets ställning. Number 54. Socialdepartementet, Stockholm, 2006b.

S. Tajunisha. An efficient method to improve the clustering performance for high dimensional data by principal component analysis and modified K-means. International Journal of Database Management Systems (IJDMS), 3, 2011.

A. Vlachos, A. Korhonen, and Z. Ghahramani. Unsupervised and constrained Dirichlet process mixture models for verb clustering. In Proceedings of the workshop on geometrical models of natural language semantics, pages 74–82. Association for Computational Linguistics, 2009.

L. Wallin. Polysyntetiska tecken i svenska teckenspråket. PhD thesis, Stockholm University, Department of Linguistics, 1994.

L. Wallin, J. Mesch, and A.-L. Nilsson. Transkriptionskonventioner för teckenspråkstexter. 2009.

J. Wang, J. Wang, Q. Ke, G. Zeng, and S. Li. Fast approximate k-means via cluster closures. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3037–3044. IEEE, 2012.

G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
