
Dependency Parsing for Chinese Social Media Text

Xiao Cheng

Uppsala University

Department of Linguistics and Philology Master Programme in Language Technology

Master’s Thesis in Language Technology, 30 ects credits

June 5, 2019


Abstract

In this thesis, we investigate dependency parsing for Chinese social media text. To answer our research questions, we conduct several parsing experiments on standard Chinese and social media data with both UUParser and UDPipe, in order to assess parser performance on social media data and to improve that performance over our baselines. The main contributions of this study are as follows: first, we create annotated data sets for Chinese social media text, both with and without manual annotation; second, we show that dependency parsing of social media data can be improved by providing more relevant training data. In addition, we find that a mismatch between the domains of the training and test data hurts the performance of both parsers.


Contents

Preface
1 Introduction
  1.1 Purpose
  1.2 Outline
2 Background
  2.1 The Application of Social Media Text in NLP
  2.2 Dependency Parsing
    2.2.1 Data-driven approaches for dependency parsing
    2.2.2 Transition-based Parsing
    2.2.3 Parsing with Neural Networks
    2.2.4 Universal Dependencies
    2.2.5 Chinese Parsing
  2.3 Semi-supervised Learning
    2.3.1 Self-training
    2.3.2 Co-training
3 Data and tools
  3.1 Standard Chinese
  3.2 Annotations for Weibo
    3.2.1 Weibo Training Data Sets
  3.3 Tools
    3.3.1 UUParser
    3.3.2 UDPipe
4 Experiments
  4.1 Parsing Models
  4.2 Baseline Models
  4.3 Self-training and Co-training
    4.3.1 Self-training
    4.3.2 Co-training
    4.3.3 Comparing Self-training with Co-training
  4.4 Merged Training Data
    4.4.1 Concatenation
    4.4.2 Treebank Embeddings
  4.5 Discussion
5 Conclusion and Future Work


Preface

I would like to thank my supervisor Joakim Nivre for his continuing support, advice and encouragement. I would also like to thank my family, friends and all my teachers in the Department of Linguistics and Philology at Uppsala University.


1 Introduction

Dependency parsing is one of the core tasks in natural language processing (NLP): it analyzes the syntactic structure of a sentence in terms of the relations between words. Microblogging services attract many users who share their experiences and opinions on a wide range of topics. Sina Weibo [1] (the Chinese counterpart of Twitter) is one of China's most popular microblogging platforms. Over the last few decades, NLP techniques for analyzing Chinese have been well studied for regular text, but current techniques struggle with the syntactic analysis of Weibo posts. Comparing the two kinds of text yields some interesting observations. First, social media text differs from regular text: most tokenizers, taggers and parsers trained on standard Chinese fail on Weibo data. Weibo posts are much shorter than other web text because of the character limit per post, and they are typically full of noise; as a result, deriving syntactic structures is difficult for human annotators and automatic parsers alike. Second, treebanks for social media data are scarce, especially for Chinese microblogs (Sina Weibo). Research has focused on building resources and tools for English, and other languages have fewer resources in general.

1.1 Purpose

The main goal of this study is to investigate dependency parsing on Chinese social media data. In this thesis, the following research challenges will be addressed:

• Collecting and annotating Chinese social media data, and performing linguistic and error analysis of that data to identify the differences between standard Chinese and social media text.

• Analyzing the performance of parsers on both standard Chinese data and Chinese social media data, and comparing two neural network-based parsers: UUParser and UDPipe.

• Investigating the influence of different experimental setups for automatic annotation of in-domain data. Comparing results using both self-training and co-training on Chinese social media data.

We annotate the Weibo data based on the Universal Dependencies guidelines, version 2.3 (Nivre et al., 2018), which allows us to compare our data set to standard Chinese treebanks. We also conduct several experiments using UUParser and UDPipe, with and without gold or predicted POS tags, in order to study the impact of social media data on dependency parsing.

[1] http://en.wikipedia.org/wiki/Sina_Weibo


1.2 Outline

The rest of the thesis is structured as follows:

• Chapter 2 provides an overview of related work. We introduce the application of social media data in natural language processing (NLP), present the concepts of data-driven dependency parsing, and describe how recent advances in this field compare to traditional methods, focusing on transition-based parsing and dependency parsing with neural networks. We also give an overview of semi-supervised learning in the form of self-training and co-training, and briefly introduce dependency parsing for social media data and for Chinese.

• Chapter 3 describes the data sets and tools used for training and evaluation. We explain how we chose the data, describe our process of annotating Chinese social media data, and introduce the two tools used in this thesis, UUParser and UDPipe.

• Chapter 4 describes our parsing experiments on both standard Chinese and Chinese social media text, which aim to improve parsing accuracy. We explain the experimental setup for UUParser and UDPipe separately, introduce our different parsing models, explain how we select the best-performing system, and discuss the results of the various experiments on different training and evaluation sets.

• Chapter 5 summarizes the main findings of the thesis as well as suggestions for future studies.


2 Background

This chapter first reviews the application of social media data in natural language processing. Then, we introduce the concepts most relevant to this study, i.e., dependency parsing, focusing on data-driven approaches, including transition-based parsing and dependency parsing with neural networks. Finally, we give an overview of studies on self-training and co-training methods.

2.1 The Application of Social Media Text in NLP

In natural language processing research, social media sources have received widespread attention. Social media text contains a great deal of non-standard language, including emojis and lexical variants, as well as the use of hashtags, URLs and mentions on social media platforms such as Twitter, Facebook and Sina Weibo (the Chinese microblogging service). Many researchers regard this non-standard language as "noise"; Yin et al. (2015) claim that social media text is full of "noisy" text.

In order to investigate which social media text is noisier, Baldwin et al. (2013) evaluate the characteristics of social media text from different sources.

Social media data is widely used in NLP. Firstly, sentiment analysis is a natural language processing application for tracking public opinion about a particular product or topic (Nasukawa and Yi, 2003). Twitter is one of the largest social media platforms, and there is an abundance of studies focusing on building corpora of Twitter text and improving the techniques and algorithms used for sentiment analysis (Go et al., 2009; Pak and Paroubek, 2010).

Secondly, for the dependency parsing task, Pak and Paroubek (2010) propose a new dependency parser for English tweets. To our knowledge, Kong et al. (2014) and Y. Liu et al. (2018) have developed tweebanks for English Twitter. A further study by Blodgett et al. (2018) focuses on dependency parsing of African-American and mainstream American English.

In this thesis, we study dependency parsing for Chinese social media text, comparing it with standard Chinese. Dependency parsing of Chinese social media data (Sina Weibo) has not been well studied in past years; most research on Weibo concerns sentiment analysis (Fu et al., 2014; Xue et al., 2014; Yuan et al., 2013). Due to the lack of research on the lexical and syntactic features of Weibo, machine translation suffers from mismatch problems caused by microblog text. Ling et al. (2013) present a framework to crawl parallel data from microblogs. Wang et al. (2014) present a Chinese Weibo treebank for dealing with the dependency parsing problem on Weibo text. Considering the ellipsis phenomenon in spoken Chinese, Ren et al. (2018) introduce a practical scheme for ellipsis annotation and build an ellipsis-aware Chinese dependency treebank for Weibo text.


2.2 Dependency Parsing

2.2.1 Data-driven approaches for dependency parsing

Dependency parsing consists of predicting head-dependent relationships in a sentence, which can be difficult due to the statistical sparsity of these word-to-word relations. Kübler et al. (2009) divide dependency parsing approaches into two categories: data-driven and grammar-based. Grammar-based approaches rely on a defined formal grammar, whereas data-driven approaches use machine learning on linguistic data. All parsing systems used in this thesis are data-driven. Data-driven dependency parsing has two main phases: learning, in which a model is induced from training data, and parsing, in which the learned model is used to produce dependency trees for new, unseen sentences.

2.2.2 Transition-based Parsing

In recent years, data-driven approaches to dependency parsing can be broadly categorized into graph-based and transition-based parsers (Kübler et al., 2009).

The model of graph-based parsing (R. McDonald, 2006) learns a scoring function for dependency graphs for a given sentence, such that the correct tree is scored above all other trees. The model of transition-based parsing (Nivre, 2008) learns to score transitions and greedily take the highest scoring transition from all the parser states until we have a complete dependency graph. It treats parsing as a sequence of actions that produce a parse tree.

A transition-based dependency parsing system uses a model parameterized over the transitions of an abstract machine that moves between configurations while deriving dependency graphs. Parsing starts in the initial configuration and repeatedly takes the optimal transition from the current configuration, chosen by a guide, until a terminal configuration is reached, at which point the dependency tree associated with that configuration is returned. During parsing, the input words are placed in a buffer, and partially built structures are organized on a stack, which contains only the root node at the beginning of the process. Transitions add arcs to the dependency graph and modify the stack and the buffer. When the process terminates, the set of dependency arcs constitutes the complete dependency tree. Many transition systems are inspired by shift-reduce parsing; we take the arc-eager system (Nivre, 2003) as an example, whose transitions are:

• Shift: removes the first item of the buffer, and pushes it onto the top of the stack.

• Reduce: removes the top item from the stack.

• Left-Arc: adds an arc from the first item of the buffer to the top item of the stack, and removes the top item from the stack.

• Right-Arc: adds an arc from the top item of the stack to the first item of the buffer, then removes the first item from the buffer and pushes it onto the top of the stack.
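As an illustration, the four arc-eager transitions can be written out in a few lines of Python. This is a toy sketch in which words are represented by integer positions and the transition sequence is chosen by hand rather than by a trained guide; it is not the implementation used by UUParser or UDPipe.

```python
# A minimal sketch of the arc-eager transition system (Nivre, 2003).
# A configuration is (stack, buffer, arcs); arcs are (head, dependent) pairs.

def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))

def reduce_(stack, buffer, arcs):
    stack.pop()

def left_arc(stack, buffer, arcs):
    # arc from the front of the buffer to the top of the stack; pop the stack
    arcs.add((buffer[0], stack.pop()))

def right_arc(stack, buffer, arcs):
    # arc from the top of the stack to the front of the buffer;
    # move the front of the buffer onto the stack
    arcs.add((stack[-1], buffer[0]))
    stack.append(buffer.pop(0))

# Hand-picked transition sequence for a three-node example (0 is the root).
stack, buffer, arcs = [0], [1, 2], set()
shift(stack, buffer, arcs)       # stack=[0, 1], buffer=[2]
left_arc(stack, buffer, arcs)    # arc 2 -> 1, stack=[0]
right_arc(stack, buffer, arcs)   # arc 0 -> 2, stack=[0, 2], buffer=[]
reduce_(stack, buffer, arcs)     # pop 2, which now has a head
print(sorted(arcs), stack)       # [(0, 2), (2, 1)] [0]
```

Note that in a real parser the next transition at each step is predicted by the guide (a classifier), not fixed in advance as here.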


Goldberg and Nivre (2012) provide an improved oracle for the arc-eager transition system. However, this oracle does not allow reordering of words and cannot produce non-projective trees. The UUParser system (Lhoneux, Stymne, et al., 2017) used in this thesis is a transition-based parser in which the authors extend the arc-hybrid system with a swap transition that enables word reordering and the construction of non-projective trees.

2.2.3 Parsing with Neural Networks

Early attempts at using simple recurrent networks (SRNs) to describe phrases include Elman (1991), who used recurrent neural networks (RNNs) to create complex representations of sentences from a simple grammar and to analyze the linguistic structures of the resulting representations. Neural network language models turn out to have many advantages over traditional language models.

For example, neural networks can handle much longer histories, and they can generalize over contexts of similar words. In recent years, great parsing success has been achieved through the use of neural language models. Neural networks have had an important influence on natural language processing, including syntactic parsing. Henderson (2004) was the first to attempt to use neural networks for parsing Penn Treebank data, using a neural network architecture called Simple Synchrony Networks (SSNs) to predict parse decisions in a constituency parser.

There are some recent uses of deep learning in dependency parsing (Stenetorp, 2013; Weiss et al., 2015). Transition-based parsers provide a good balance between efficiency and accuracy, but they suffer from search errors due to their inability to revise incorrect decisions. The motivation for using neural networks is to reduce this limitation of transition-based parsing, as described in D. Chen and C. Manning (2014), where the authors explore the potential advantages of a neural network classifier for a greedy transition-based dependency parser. In their study, a feed-forward neural network with one hidden layer is used to learn dense and compact feature representations, but it is not competitive with state-of-the-art dependency parsers trained for structured search.

One of the studies most closely related to this thesis was carried out by Lhoneux, Stymne, et al. (2017), in which the authors developed a parsing system that uses no part-of-speech tags, morphological features or lemmas. The Uppsala system consists of two main components: a model for joint sentence and word segmentation, and a greedy transition-based parser based on Kiperwasser and Goldberg (2016) that makes parsing decisions given only the raw words of a sentence. The UUParser system was further updated in Lhoneux, Stymne, et al. (2017) and Smith et al. (2018) to allow non-projective trees, the use of external embeddings, and some tuning of default hyperparameters.

We also use UDPipe (Straka et al., 2016) in this thesis. It is a pipeline that performs sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing. Its parser, Parsito (Straka et al., 2015), is inspired by D. Chen and C. Manning (2014): a transition-based, non-projective dependency parser using a simple neural network with one hidden layer and locally normalized scores. UDPipe was further updated in Straka and Straková (2017) to achieve low running times and moderately sized models.

2.2.4 Universal Dependencies

There are different annotation formats for dependencies. In past years, the most common way of developing a treebank for a language was to choose the annotation scheme most appropriate for capturing that language's grammatical features. However, this means that annotation schemes vary enormously across languages. Therefore, there have been attempts to develop a homogeneous annotation scheme that can be used to annotate multiple languages (R. McDonald et al., 2013). The Universal Dependencies project [1] aims to build cross-linguistically consistent treebanks from several existing annotation frameworks: Stanford dependencies (De Marneffe and C. D. Manning, 2008; Marneffe et al., 2014), the universal Google dependency scheme (R. McDonald et al., 2013), the Google universal POS tags (Petrov et al., 2011), etc. The UD project provides guidelines and categories for building a UD treebank for a language, or for converting treebanks in other annotation formats into UD treebanks. Universal dependency relations are cross-linguistically applicable and can be used to facilitate multilingual studies as well as dependency tasks. The updated version, UD v2.3, consists of more than 100 treebanks across 70 languages.

The annotations are formatted according to the CoNLL-U format, which is based on the CoNLL-X format (R. McDonald and Nivre, 2007). In this thesis, we mainly use the Chinese UD treebank (GSD) in our experiments. Figure 2.1 shows an example of a sentence taken from that treebank.

Figure 2.1: Example of CoNLL-U format, taken from the Chinese UD Treebank.

2.2.5 Chinese Parsing

Research on the Chinese Treebank (Bikel and Chiang, 2000; Chiang and Bikel, 2002) represents some of the earliest work on Chinese parsing. Levy and C. Manning (2003) use the CTB for parsing. Li et al. (2011) and Ma et al. (2012) propose joint learning for Chinese POS tagging and dependency parsing. In recent years, the CoNLL shared tasks have focused on multilingual dependency parsing, which also includes Chinese (Che et al., 2010; Nivre et al., 2007; Y. Zhang and Clark, 2008). However, there are few resources for other varieties such as social media Chinese, although dependency parsing of standard Chinese newswire data (like People's Daily) was proposed by T. Liu et al. (2006).

[1] https://universaldependencies.org

2.3 Semi-supervised Learning

In parsing, we attempt to derive the syntactic structure of a string of words. Much of the challenge lies in extracting the appropriate features from texts. There are many supervised techniques for training parsers on labeled data (Charniak and Johnson, 2005; Henderson, 2004). Unlike supervised techniques, semi-supervised techniques allow us to use both labeled and unlabeled data to train models. Specific types of semi-supervised technique include self-training (Charniak, 1997) and co-training (Blum and Mitchell, 1998; Steedman et al., 2003). In this thesis, we investigate the influence of self-training and co-training methods on dependency parsing of Chinese social media data.

2.3.1 Self-training

Self-training incorporates unlabeled data into a new model: an existing model first labels the unlabeled data, and the newly labeled data is then treated as ground truth and combined with the actual labeled data to train a new model. To our knowledge, the first reported use of self-training for parsing was by Charniak (1997), who trained a model on the combination of newly parsed text and the Wall Street Journal training data (Marcus et al., 1993); however, the self-trained model did not beat the original model. In a further attempt, McClosky et al. (2006) present an effective method showing a positive impact of self-training on parsing, in contrast to the findings of Charniak (1997). This raises the question of when and why self-training is helpful, and McClosky et al. (2008) test four hypotheses for a better understanding of self-training for parsing. The authors conducted several experiments analyzing non-generative and generative re-ranker features as an extension of McClosky et al. (2006). They conclude that the advantage of self-training appears most influenced by seeing known words in new combinations.

To summarize, many studies have verified that self-training can improve parsing accuracy under specific parameter settings, while others have failed to show improvements. Our goal in this study is to see whether self-trained models work on standard Chinese and Chinese social media text. We therefore believe that studying the impact of self-training on dependency parsing in detail is essential for investigating syntactic parsing of Chinese social media text.
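The self-training loop described above can be sketched as follows; the training and parsing functions below are hypothetical stand-ins for a real parser, used only to make the control flow concrete.

```python
# A minimal sketch of self-training. `train_fn` and `parse_fn` are
# hypothetical stand-ins, not the actual UUParser or UDPipe interfaces.

def self_train(train_fn, parse_fn, labeled, unlabeled):
    model = train_fn(labeled)                        # 1. train on gold data
    auto = [parse_fn(model, s) for s in unlabeled]   # 2. label the unlabeled data
    return train_fn(labeled + auto)                  # 3. retrain on the union

# Toy stand-ins just to make the loop runnable.
toy_train = lambda data: {"trained_on": len(data)}
toy_parse = lambda model, sent: (sent, "auto")

model = self_train(toy_train, toy_parse, ["s1", "s2"], ["s3"])
print(model)  # {'trained_on': 3}
```

The key design point is step 3: the automatically labeled sentences are treated as if they were gold, which is why self-training only helps when the base model's errors are not reinforced.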

2.3.2 Co-training

Blum and Mitchell (1998) first defined co-training under the hypothesis that two learning algorithms are trained separately, one on each view of the data, and that each algorithm's predictions on new unlabeled text are then used to enlarge the training set of the other. A further analysis is given by Nigam and Ghani (2000), who address the question of when co-training outperforms other methods that use unlabeled data. Their co-training method also achieves lower classification error rates, but we cannot assert that co-training will always perform well in other applications; we can only say that co-training succeeds in part.

Co-training has been used to solve problems in many areas. Firstly, co-training methods are widely used in classification tasks. Kiritchenko and Matwin (2011) suggest that the performance of co-training depends on the learning algorithm itself when applying co-training in the email domain. The research of Denis et al. (2003) shows that the co-training algorithm has a positive effect on classifiers learned from positive and unlabeled data. For Chinese classification, Wan (2009) proposes a co-training approach to improve the accuracy of polarity identification in Chinese product reviews. Secondly, domain adaptation is important in many areas of applied machine learning, especially in natural language processing, where different genres often use very different vocabulary to describe similar concepts. M. Chen et al. (2011) optimize a co-training algorithm to achieve the best performance on a domain adaptation task. In addition, early work on combining labeled and unlabeled data for NLP was done in unsupervised part-of-speech tagging, as described in Cutting et al. (1992). Unlike unsupervised methods, Sarkar (2001) introduces a co-training approach that combines unlabeled data with a small set of labeled data to train a statistical parser. It should also be noted that co-training has its limitations: Pierce and Cardie (2001) point out that optimized methods help in learning the behavior of co-training on NLP tasks, especially with large amounts of training data.
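For comparison with self-training, one round of co-training can be sketched as follows; the two learners here are hypothetical stand-ins, not UUParser or UDPipe, and the sketch omits the confidence-based selection that practical co-training usually applies.

```python
# A minimal sketch of one co-training round (Blum and Mitchell, 1998):
# each learner's predictions extend the *other* learner's training set.

def co_train_round(train_a, train_b, label_a, label_b, gold, pool):
    model_a = train_a(gold)
    model_b = train_b(gold)
    extra_for_b = [label_a(model_a, s) for s in pool]  # A labels for B
    extra_for_a = [label_b(model_b, s) for s in pool]  # B labels for A
    return train_a(gold + extra_for_a), train_b(gold + extra_for_b)

# Toy stand-ins just to make the round runnable.
train = lambda data: {"n": len(data)}
label = lambda model, sent: (sent, "auto")

a, b = co_train_round(train, train, label, label, ["g1"], ["u1", "u2"])
print(a, b)  # {'n': 3} {'n': 3}
```

Unlike self-training, neither model ever retrains on its own predictions, which is what is meant above by the two learners enlarging each other's training sets.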

According to these previous investigations, we use self-training and co-training methods to close the gap between standard Chinese and Chinese social media text in dependency parsing.


3 Data and tools

In this thesis, we hypothesize that adding social media data to the training set benefits our parsing models. To verify this hypothesis and assess parser accuracy on Chinese social media data, we first collected a Weibo data set of 2548 sentences, on which to train and test the parsers, from two publicly available Weibo corpora: the µtopia dataset (Ling et al., 2013) and the Leiden Weibo Corpus (LWC) [1]. We also train and test the parsers on standard Chinese treebanks for comparison. We conducted different experiments on our data sets using UUParser (Lhoneux, Stymne, et al., 2017) and UDPipe (Straka et al., 2016), and also used UDPipe for preprocessing the Weibo data. In this chapter, we describe the annotated Weibo data set, which we later split into training, development and test sets for the dependency parsers, and present our data and tools in depth.

3.1 Standard Chinese

The standard Chinese data sets come from version 2.3 of the Universal Dependencies Chinese GSD treebank (Nivre et al., 2018), for which the training, development and test splits are already defined. We first give a brief description of the Chinese UD treebank used in this thesis.

The Chinese UD GSD treebank has been part of Universal Dependencies since the UD v1.3 release in 2016. The treebank was annotated and converted by Google and includes sentences collected from Wikipedia articles. Part-of-speech tags and dependency relations in the treebank were annotated manually by native speakers, whereas lemmas and features were predicted automatically. We use the v2.3 version, released in 2018, with gold sentence and word segmentation, and we use the original data set sizes of the UD Chinese treebank for our experiments. The size of the treebank can be found in Table 3.1.

                  Training Set   Development Set   Test Set
Standard Chinese      3997             500            500
Weibo                 2198             250            100

Table 3.1: Size of data sets (number of sentences).

3.2 Annotations for Weibo

The goal of the task in this thesis is to evaluate parser accuracy on social media data, so we needed to create a gold standard annotation of the collected data.

1

http://lwc.daanvanesch.nl/index.php


Before preprocessing the Weibo data, we split it into three data sets; their sizes can be found in Table 3.1. The training and development sets were automatically segmented and tagged using UDPipe, and parsed using either UDPipe or UUParser. The test set was annotated manually by two native speakers of Chinese with linguistic backgrounds, previous experience of working on Universal Dependencies projects, and familiarity with the UD guidelines. To create a Chinese social media data set, we first needed to collect raw text in the form of Weibo posts.

Firstly, as mentioned above, the sentences were taken from the µtopia dataset (Ling et al., 2013) and the Leiden Weibo Corpus (LWC) [2]. The LWC consists of more than 5 million messages posted on Weibo in January 2012, and the µtopia corpus is a parallel corpus containing sentence pairs mined from microblogs. We chose the Chinese-English data sets from µtopia, which include 2003 sentences, and randomly picked about 500 sentences from the LWC.

Secondly, we needed to filter noisy text out of the Weibo posts in order to better evaluate parsing accuracy. Parsing noisy social media data is interesting but also brings big challenges. One reason is that microblogs are normally short and full of noise, e.g. hashtags, information sources, URLs, etc. (Cao et al., 2015), which distinguishes them from standard Chinese or other newswire text. For example, Weibo posts contain "#nickname#", "来自iPhone客户端" (From iPhone), "http://t.cn/zT6za7g" and so on. Although such noisy elements are part of the Weibo posts, it is unclear how to annotate them, so it is better to remove them before parsing the Weibo data. We therefore manually filtered this noise out of the Weibo posts. Table 3.2 shows example data pre-processing rules.

Rule                  Raw Text                                        Processed Text
Hashtags              #祈福雅安#保佑                                    保佑
URLs                  亲爱的老爸, 你终于出现了。http://t.cn/zT6za7g      亲爱的老爸, 你终于出现了。
Nicknames             今天真是个好日子, NBA巨星@DwyaneWade               今天真是个好日子, NBA巨星
Information sources   我开车, 你到后座去小睡一下。(来自iPhone客户端)       我开车, 你到后座去小睡一下。
Sharing news          只比你努力一点的人, 已经甩你太远//@小百科:           只比你努力一点的人, 已经甩你太远

Table 3.2: Removing Noisy Text
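Rules of the kind shown in Table 3.2 could be implemented with simple regular expressions. The patterns below are illustrative assumptions for the five rule types, not the procedure used to build the thesis data set, where the filtering was done manually.

```python
import re

# Illustrative patterns for the pre-processing rules in Table 3.2.
# Order matters: the shared-post marker must be stripped before the
# generic @nickname rule, or "//@user:" would only lose its "@user:" part.
RULES = [
    (re.compile(r"#[^#]+#"), ""),              # hashtags: #topic#
    (re.compile(r"https?://\S+"), ""),         # URLs
    (re.compile(r"//@[^::]*[::]?"), ""),       # shared-post markers: //@user:
    (re.compile(r"@\S+"), ""),                 # @nicknames
    (re.compile(r"[((]来自[^))]*[))]"), ""),   # information sources: (来自...)
]

def clean_weibo(post: str) -> str:
    for pattern, repl in RULES:
        post = pattern.sub(repl, post)
    return post.strip()

print(clean_weibo("#祈福雅安#保佑"))         # 保佑
print(clean_weibo("NBA巨星@DwyaneWade"))     # NBA巨星
print(clean_weibo("已经甩你太远//@小百科:"))  # 已经甩你太远
```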

Thirdly, we focused on annotating the cleaned Weibo data, which raised many challenges during the annotation procedure. The annotations needed to be produced in CoNLL-U format. To facilitate the annotation process, we first preprocessed the Weibo data using the UDPipe pipeline v2.3, including a Chinese word segmenter, tokenizer and part-of-speech tagger. UDPipe is a "trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files" [3]. We used UDPipe for segmentation and POS tagging of the training and development sets without manual checking. In addition, we needed a gold-standard test set. For the test data, the annotation process had two stages: in the first stage, we relied on the draft version of the Weibo data produced by UDPipe; in the second stage, we manually revised mis-segmented word tokens, POS tags and dependency relations. To ensure consistency with the UD guidelines and with the Chinese UD treebank, we discussed difficult cases during the annotation process in order to come to an agreement.

[2] http://lwc.daanvanesch.nl/index.php

3.2.1 Weibo Training Data Sets

The starting point of the experiments is to build a training set from the Weibo data. We created two versions of the Weibo training data by using UUParser and UDPipe to parse our 2548 sentences of raw Weibo data. The first version is segmented, tagged and parsed automatically by UDPipe. The second is segmented and tagged by UDPipe, but parsed by UUParser. The Weibo training data is in the CoNLL-U format introduced in the previous chapter. Examples of the parsed data can be found in Figure 3.1 and Figure 3.2.

Figure 3.1: Example for UDPipe-parsed Weibo Data

As shown in Figure 3.1 and Figure 3.2, each line represents one word/token and comprises ten columns, which specify, in order: a unique id (integer for words), word form, lemma, universal part-of-speech tag, language-specific part-of-speech tag, morphological features, head, dependency relation, enhanced dependency graph, and any other annotations.
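A token line in this format can be read into a field dictionary with a few lines of Python; the field names follow the CoNLL-U column order just described, and the example line is an invented fragment, not taken from the thesis data.

```python
# Reading one CoNLL-U token line (ten tab-separated columns) into a dict.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_token_line(line: str) -> dict:
    cols = line.rstrip("\n").split("\t")
    assert len(cols) == 10, "CoNLL-U token lines have exactly 10 columns"
    return dict(zip(FIELDS, cols))

tok = parse_token_line("1\t今天\t今天\tNOUN\tNN\t_\t3\tnsubj\t_\t_")
print(tok["form"], tok["head"], tok["deprel"])  # 今天 3 nsubj
```

A full CoNLL-U reader would also have to handle comment lines starting with "#" and the blank lines that separate sentences.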

Comparing the outputs of the two Weibo data sets, we can observe differences in heads and dependency relations on many lines. The most obvious one is that the dependency relation of the final punctuation mark (index 15) should be punct, but in the UDPipe-parsed output it is not. The UUParser-parsed Weibo data is therefore better than the UDPipe-parsed data in some cases. However, we

[3] http://ufal.mff.cuni.cz/udpipe


Figure 3.2: Example for UUParser-parsed Weibo Data

cannot say which one is best overall, since we did not evaluate the quality of the Weibo training data sets directly.

3.3 Tools

In order to compare different dependency parsers, we use UUParser and UDPipe to perform the tasks in this thesis. We briefly introduce the two parsing systems below.

3.3.1 UUParser

UUParser [4] (Lhoneux, Shao, et al., 2017) is based on the transition-based parser [5] of Kiperwasser and Goldberg (2016), which relies on a BiLSTM to learn features of words. The UUParser system was adapted to Universal Dependencies. It uses the arc-hybrid transition system (Kuhlmann et al., 2011), extended with a SWAP transition and a static-dynamic oracle, as described in Lhoneux, Stymne, et al. (2017). The latest version of UUParser, available since 2018, is extended with cross-treebank functionality and the use of external embeddings (Stymne et al., 2018). This parser allows the construction of non-projective dependency trees (Nivre, 2009).

3.3.2 UDPipe

UDPipe [6] (Straka et al., 2016) is a neural pipeline containing a tokenizer, part-of-speech tagger, lemmatizer and dependency parser for Universal Dependencies treebanks in CoNLL-U format. UDPipe is easily retrainable on annotated data in CoNLL-U format, and in some cases also on raw text. It was also a baseline system in both the CoNLL 2017 and CoNLL 2018 shared tasks.

The improved version of the original UDPipe, UDPipe 1.2, is described in Straka and Straková (2017) and processes CoNLL-U v2.0 files.

[4] https://github.com/UppsalaNLP/uuparser
[5] https://github.com/elikip/bist-parser
[6] https://ufal.mff.cuni.cz/udpipe/


UDPipe is designed to easily process raw text to CoNLL-U files, which are tagged and parsed dependency trees. UDPipe employs a GRU network during the inference of segmentation and tokenization. The tokenizer is based on a bidirectional LSTM artificial neural network (Graves and Schmidhuber, 2005).

To predict POS tags and lemmas, UDPipe uses character features in a supervised averaged perceptron with rich features, as described in Collins (2002).

Neither the tagger nor the lemmatizer has hyperparameters that require tuning. For the dependency parser, UDPipe uses Parsito (Straka et al., 2015), a transition-based neural dependency parser in which a single hidden layer predicts the transitions. The parser also makes use of information from lemmas, POS tags, morphological features and dependency relations through a group of embeddings precomputed by word2vec on the training data. This parser allows the construction of both projective and non-projective dependency trees. For the UDPipe 1.2 system, Straka and Straková (2017) merged UD treebanks of the same language, trained on the whole training data, and searched for updated hyperparameters using the development data.

In this thesis, we use UDPipe 1.2 with pretrained models for UD v2.3 treebanks.

We use UUParser and UDPipe to pre-process the Weibo data and to perform dependency parsing. Systems are scored with the unlabeled attachment score (UAS) and the labeled attachment score (LAS). UAS measures whether each token is assigned the correct head; LAS measures whether it is assigned both the correct head and the correct dependency relation. A further explanation is given in Nivre et al. (2007).
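Concretely, given aligned gold and predicted (head, relation) pairs for each token, the two scores can be computed as in this small sketch (our own illustration):

```python
def attachment_scores(gold, pred):
    """Compute UAS and LAS over aligned token lists of (head, deprel) pairs.

    UAS counts tokens whose predicted head is correct; LAS additionally
    requires the predicted dependency relation to be correct.
    """
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * uas / n, 100.0 * las / n

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "obj"), (0, "root"), (2, "obj")]   # first token: right head, wrong relation
print(attachment_scores(gold, pred))           # UAS 100.0, LAS about 66.7
```

The example shows why LAS is always at most UAS: every token counted for LAS is also counted for UAS.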


4 Experiments

In this chapter, we conduct a series of experiments to explore how we can improve parsing accuracy on Chinese social media data, including the conversion of the data to CoNLL-U format and the design of different parsing models. To find a domain-adaptation technique that closes the gap between standard Chinese and our Weibo data set, we tested several methods, such as self-training, co-training, and the use of a single treebank.

4.1 Parsing Models

Given our knowledge of the features of Weibo data, Weibo differs from standard Chinese in its expressions. We hypothesize that parsing accuracy on standard Chinese will degrade when we train the parsers on the Weibo data, while accuracy on Weibo data will increase when we add social media data to the training set. To verify this hypothesis, we conducted a series of experiments in several in-domain scenarios, setting up 12 different parsing models for the chosen data sets with both parsers.

Firstly, we trained the baselines with UUParser and UDPipe individually, i.e., training the models on standard Chinese. We use the original standard Chinese treebank with gold POS tags, reporting results with gold and predicted POS tags for both UUParser and UDPipe, and results without POS tags for UUParser.

Secondly, to compare with the baseline models, we conducted self-training and co-training experiments on two versions of the Weibo training data. The first is segmented, tagged and parsed by UDPipe; the second is tokenized and tagged with UDPipe, but parsed with UUParser. We trained UUParser and UDPipe separately on these two Weibo training data sets. We call this self-training when a parser is trained on its own output, and co-training when it is trained on the output of the other parser.
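Both setups follow the same data flow: one parser labels the unlabeled data, and the final parser is trained on the result. The sketch below makes this concrete with a deliberately simple stand-in trainer of our own (it learns the most frequent head offset per POS tag); it only illustrates the procedure, not either parser's actual learning algorithm.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Toy trainer: learn the most frequent head offset for each POS tag."""
    offsets = defaultdict(Counter)
    for tags, heads in corpus:
        for i, (tag, head) in enumerate(zip(tags, heads)):
            offsets[tag][head - (i + 1)] += 1       # positions are 1-based; head 0 = root
    best = {t: c.most_common(1)[0][0] for t, c in offsets.items()}

    def parse(tags):
        heads = []
        for i, tag in enumerate(tags):
            h = i + 1 + best.get(tag, -(i + 1))     # unseen tag: attach to root
            heads.append(h if 0 <= h <= len(tags) else 0)
        return heads
    return parse

def bootstrap(train_final, label_with, labeled, unlabeled):
    """Self-training if label_with came from train_final itself,
    co-training if it came from the other parser."""
    pseudo = [(tags, label_with(tags)) for tags in unlabeled]
    return train_final(labeled + pseudo)

labeled = [(["NOUN", "VERB"], [2, 0])]              # NOUN -> next word, VERB -> root
model = train(labeled)
retrained = bootstrap(train, model, labeled, [["NOUN", "VERB", "NOUN"]])
print(retrained(["NOUN", "VERB"]))                  # [2, 0]
```

In our experiments, `train_final` and `label_with` correspond to UUParser and UDPipe in the four possible pairings.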

Thirdly, we combined standard Chinese and Weibo data in various ways. The first method is concatenation, i.e., adding Weibo data to the standard Chinese treebank. To get self-trained and co-trained models, we trained parsers on the combination of standard Chinese and Weibo data with UDPipe. The combined training sets comprise two different combinations of standard Chinese with the two Weibo training data sets. For comparison, we conducted the same in-domain experiments with UUParser. In this way, we can compare results using both self-training and co-training. We report results of all these experiments, with gold and predicted POS tags for both UUParser and UDPipe, and without POS tags for UUParser.

In addition, we compare these results to training parsers with treebank embeddings, since Stymne et al. (2018) report that this often gives better accuracy than simply combining the training sets. For this method, instead of training the parser on a single training set consisting of standard Chinese and Weibo data merged together, we trained a model on two separate training sets (one for standard Chinese and one for Weibo) and used a treebank embedding to record which set each sentence comes from. When we apply the model to a test set, the same treebank embedding is used to distinguish the two domains.
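The treebank-embedding mechanism can be sketched as follows (a simplified illustration with made-up dimensions, not UUParser's actual implementation): each token representation is its word embedding concatenated with a learned vector identifying the sentence's source treebank.

```python
import random

random.seed(0)
WORD_DIM, TB_DIM = 8, 4

def rand_vec(n):
    """Stand-in for a learned embedding: a random vector of length n."""
    return [random.uniform(-1, 1) for _ in range(n)]

# One learned vector per treebank; at parsing time the sentence's source selects it.
treebank_emb = {"standard": rand_vec(TB_DIM), "weibo": rand_vec(TB_DIM)}

def token_repr(word_vec, treebank):
    """Concatenate a word embedding with the embedding of its source treebank."""
    return word_vec + treebank_emb[treebank]

word_vec = rand_vec(WORD_DIM)
v_std = token_repr(word_vec, "standard")
v_weibo = token_repr(word_vec, "weibo")
print(len(v_std))                               # 12
print(v_std[:WORD_DIM] == v_weibo[:WORD_DIM])   # True: the word part is shared
```

The same word thus receives different representations depending on the domain of its sentence, which is what lets the model keep the two treebanks apart while still sharing most parameters.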

The explanation for the parsing models can be found in Table 4.1.

Abbreviation   Full Name
SC             Single standard Chinese treebank
ST             Self-training (Weibo training data)
CT             Co-training (Weibo training data)
SC+ST          Standard Chinese concatenated with Weibo training data (self-training)
SC+CT          Standard Chinese concatenated with Weibo training data (co-training)
TE(ST)         Treebank embeddings (self-training)
TE(CT)         Treebank embeddings (co-training)

Table 4.1: Names of the parsing models

• SC: This model is trained on the standard Chinese treebank with gold POS tags. We use this model for testing on both standard Chinese and Weibo data with UUParser and UDPipe.

• ST: This model is trained on Weibo training data parsed by UDPipe when UDPipe is the final parser, or on Weibo training data parsed by UUParser (but segmented and tagged by UDPipe) when UUParser is the final parser. We use this model for testing on both standard Chinese and Weibo data with UUParser and UDPipe.

• CT: This model is trained on Weibo training data parsed by UDPipe when UUParser is the final parser, or on Weibo training data parsed by UUParser (but segmented and tagged by UDPipe) when UDPipe is the final parser. We use this model for testing on both standard Chinese and Weibo data with both UUParser and UDPipe.

• SC+ST: This model parses standard Chinese and Weibo data and is trained on the combination of standard Chinese and Weibo data. We use this model with both UUParser and UDPipe in a self-training scenario.

• SC+CT: This model parses standard Chinese and Weibo data and is trained on the combination of standard Chinese and Weibo data. We use this model with both UUParser and UDPipe in a co-training scenario.

• TE(ST): This model is trained on two separate training sets (one for standard Chinese and one for Weibo data). We use this model for UUParser in a self-training scenario.

• TE(CT): This model is trained on two separate training sets (one for standard Chinese and one for Weibo data). We use this model for UUParser in a co-training scenario.


4.2 Baseline Models

The first step is to build a baseline by training a model on standard Chinese training data, the Chinese UD treebank. We present our experimental results of both UUParser and UDPipe, testing the parsing model (SC) described in the previous section on the Chinese UD treebank with and without POS tags, as well as on the Weibo test set for comparison. The results of these experiments are listed in Table 4.2.

As can be seen in Table 4.2, for standard Chinese with gold tags, both parsers achieve a UAS above 83, with UUParser scoring higher than UDPipe. With predicted tags, UUParser beats UDPipe by more than 4 UAS points. Although UUParser's results without POS tags are worse than its results with POS tags, its UAS and LAS scores are still higher than UDPipe's.

There is a big drop on the Weibo test data for both UUParser and UDPipe. The UAS on Weibo with gold tags decreases to 68.04 for UDPipe and 69.47 for UUParser, and the drop is larger with predicted tags than with gold tags. UUParser still performs better than UDPipe, but the differences between the two parsers are small. We also observe that UUParser's UAS and LAS scores without POS tags are lower than UDPipe's scores with predicted tags.

Test Set          POS Tags    Metric   UDPipe   UUParser
Standard Chinese  Gold        UAS      83.13    83.56
                              LAS      79.98    80.13
                  Predicted   UAS      75.05    79.06
                              LAS      69.25    73.92
                  Without     UAS      None     78.86
                              LAS      None     73.77
Weibo             Gold        UAS      68.04    69.47
                              LAS      64.99    65.61
                  Predicted   UAS      63.88    64.25
                              LAS      60.99    62.36
                  Without     UAS      None     62.30
                              LAS      None     57.12

Table 4.2: UAS and LAS Scores for Baseline Models

4.3 Self-training and Co-training

In order to improve the accuracy of Weibo test data, we tested the self-trained and co-trained models trained on single treebanks. All the experimental results are listed in Table 4.3.

4.3.1 Self-training

To investigate the effectiveness of the Weibo training data, we performed self-training on only the Weibo training data. In self-training, the data consist of two training sets, one labeled and one unlabeled. First we self-train UDPipe on the raw Weibo data; that is, we train a model on the labeled data and use that model to label the unlabeled data. From the combination of our original labeled data (the standard Chinese treebank) and the newly labeled data (the Weibo data), we train a second model, our self-trained model. We also performed self-training with UUParser using the same experimental setup.

Test Set          POS Tags    Metric   UDPipe           UUParser
                                       ST      CT       ST      CT
Standard Chinese  Gold        UAS      74.23   69.51    70.79   75.14
                              LAS      68.30   64.40    66.84   69.77
                  Predicted   UAS      56.33   53.76    62.75   67.97
                              LAS      50.46   47.76    57.31   60.48
                  Without     UAS      None    None     62.31   65.91
                              LAS      None    None     59.28   60.92
Weibo             Gold        UAS      81.02   78.53    83.47   87.82
                              LAS      75.83   72.50    77.06   82.87
                  Predicted   UAS      78.50   74.39    80.06   84.82
                              LAS      72.17   69.40    73.12   78.07
                  Without     UAS      None    None     75.68   78.35
                              LAS      None    None     70.88   71.93

Table 4.3: UAS and LAS Scores for Self-training and Co-training Models. For each test set, the best result is marked with bold.

As shown in Tables 4.2 and 4.3, for standard Chinese with gold POS tags, UDPipe is better than UUParser, with a UAS of 74.23 and an LAS of 68.30. Conversely, UUParser outperforms UDPipe by more than 5 UAS points and 4 LAS points when testing on standard Chinese with predicted tags; UDPipe has its lowest UAS and LAS scores (56.33 and 50.46, respectively) when testing with predicted tags. We did not see positive results for UUParser when testing without tags, but its UAS and LAS scores are still higher than UDPipe's.

Compared with the baseline, results on the Weibo test data show substantial gains in both UAS and LAS: an improvement of more than 10 points with either gold or predicted tags for both parsers. Although UUParser without tags did not beat UDPipe with predicted tags, UUParser generally performs better on the Weibo test data.

4.3.2 Co-training

In order to capitalize on the respective advantages of both UUParser and UDPipe, the co-trained parsers are tested with standard Chinese and Weibo data. In the co-training experiments, we first train UDPipe on Weibo training data parsed by UUParser (but segmented and tagged by UDPipe), then we train UUParser on the Weibo training data parsed by UDPipe.

As Table 4.3 shows, compared with the baseline methods, the proposed co-training approach yields a large decline on standard Chinese with gold or predicted POS tags, but UUParser still outperforms UDPipe. The UAS and LAS scores on standard Chinese with predicted tags for UDPipe are much lower than in the previous experiments (UAS and LAS decrease to 53.76 and 47.76, respectively). Results on standard Chinese without tags with UUParser are still better than all results for UDPipe.

Our results show that Weibo benefits greatly from the co-training method. While UDPipe did not beat UUParser in general, the UAS and LAS for the Weibo test data saw a substantial increase of at least 10 points with both parsers. The best parser achieved a UAS of 87.82 and an LAS of 82.87 on Weibo data with gold tags, which is significantly higher than the baselines. Therefore, the co-training method outperforms all baselines across all our results on the Weibo test data.

4.3.3 Comparing Self-training with Co-training

So far, UUParser outperforms UDPipe in both self-training and co-training (except in self-training when evaluating with gold tags on standard Chinese). UUParser always performs better than UDPipe on the Weibo test data, although UDPipe with predicted tags is better than UUParser when UUParser uses no POS tags. It is also interesting that UUParser, when trained only on Weibo data, achieves higher accuracy on the Weibo test set than on the standard test set. A possible reason for this observation is that combining training data from different domains may lead to lower results than using data from a single domain.

Comparing the results of the self-training and co-training methods, the UAS and LAS scores on standard Chinese drop relative to all the baseline parsers, while the Weibo data benefits from these parsing models. In general, self-training outperforms co-training for UDPipe; by contrast, co-training is better than self-training for UUParser. One possible reason is that the dependency relations in the Weibo training data parsed by UDPipe are strongly tied to the POS tags assigned by UDPipe itself, which is why both parsers do better on that training set. The results also indicate that the best setup is co-training for UUParser, where the Weibo training set was tagged and parsed with UDPipe. Therefore, self-training and co-training each have their own advantages in different cases.

4.4 Merged Training Data

To see whether merged training data has a positive impact on the dependency parsing task, we further merged each of the Weibo training sets with the standard Chinese training data, having already tested the self-trained and co-trained single-treebank results in the previous section.

There are two methods for merging the standard Chinese and Weibo training data. The first is concatenation, in which the annotated Weibo training data is added to the standard Chinese training data, supplementing the Chinese UD training data with automatically annotated in-domain data. We tested concatenation with both UUParser and UDPipe. In some cases, especially for small treebanks, concatenation helps considerably over a single treebank, whereas it actually hurts for Weibo data in our experiments.

The second method is treebank embeddings, which we applied only with UUParser: we trained a model on two separate training sets (one for standard Chinese and one for Weibo) and used a treebank embedding when parsing the test data sets. Table 4.4 shows the results on the test sets for each merged training set.

Test Set          POS Tags    Metric   UDPipe           UUParser
                                       SC+ST   SC+CT    SC+ST   SC+CT   TE(ST)   TE(CT)
Standard Chinese  Gold        UAS      82.66   82.34    83.77   84.39   79.56    80.04
                              LAS      79.50   78.85    80.95   81.86   76.06    74.93
                  Predicted   UAS      73.93   73.68    78.48   78.03   75.48    75.03
                              LAS      67.94   67.92    72.85   72.74   71.05    70.25
                  Without     UAS      None    None     78.52   78.03   78.32    77.14
                              LAS      None    None     73.55   72.24   73.22    72.04
Weibo             Gold        UAS      72.07   65.01    68.84   75.59   65.04    71.48
                              LAS      67.08   50.64    60.18   67.79   60.48    68.89
                  Predicted   UAS      66.60   60.82    62.92   69.30   60.22    65.30
                              LAS      60.32   54.49    56.74   63.35   55.04    60.74
                  Without     UAS      None    None     56.82   62.94   62.77    66.13
                              LAS      None    None     53.12   58.09   57.82    62.32

Table 4.4: UAS and LAS Scores for Merged Training Data Models. For each test set, the best result is marked with bold.

4.4.1 Concatenation

Firstly, compared with the results in Table 4.2, the UAS and LAS scores on the standard Chinese test data decrease, since the baseline scores are higher. As the table shows, on average the concatenations do not beat the baseline for UDPipe, with the worst results for SC+CT. Compared to ST and CT for UDPipe, when testing on standard Chinese with gold or predicted POS tags, the concatenated models perform better, with higher UAS and LAS scores, although SC+ST and SC+CT fall short of the baseline model (SC) by around 1 UAS and LAS point.

On the other hand, we observed some improvements with UUParser, where the accuracy on standard Chinese with gold POS tags improved over the baseline model (SC): the SC+CT model beats the baseline with a UAS of 84.39 and an LAS of 81.86. However, the UAS and LAS scores on standard Chinese for SC+ST with predicted tags or without tags outperform those for SC+CT. When we consider the results of UDPipe and UUParser together, the concatenation SC+ST is more suitable for parsing standard Chinese. To sum up, the concatenations do not improve parsing accuracy on standard Chinese over the baseline models in most cases, but the difference between baselines and concatenations is only marginal.

Secondly, on average the Weibo test data gains in both UAS and LAS when using SC+ST with UDPipe, an improvement of 4.0-16.0 UAS and 1.0-4.0 LAS points compared with the baseline models, while SC+CT drops with either gold or predicted tags. For UUParser, parsing with SC+CT yields a considerable 5-point increase in UAS over parsing with SC+ST, with LAS also higher than SC+ST. When testing the Weibo test data without tags, the UAS and LAS scores decrease, but only slightly. Although we see a clear decrease compared with the results of ST and CT, we can still say that the supplement of Weibo training data has a positive effect on dependency parsing accuracy.

It can be seen from the table that the concatenation methods are more suitable for standard Chinese with both parsers in self-training. For Weibo, self-training still works better with UDPipe, while co-training is better with UUParser. Although we got some negative results in our experiments, in practice it is nearly always better to combine treebanks in some way than to use only a single treebank.

4.4.2 Treebank Embeddings

Instead of training the parser on a single training set consisting of standard Chinese and Weibo data merged together, we also evaluated the performance of multi-treebank models. Table 4.4 shows the results on the test sets for each training set. We trained these models for UUParser only, since UDPipe is not able to use treebank embeddings. We find the results of the multi-treebank model TE(CT) quite encouraging on the Weibo test set with gold POS tags, where it achieves a UAS of 71.48 and an LAS of 68.89. We also see positive gains on the Weibo test set with predicted tags and without any POS tags. However, we did not see the same improvement for TE(ST), whose results are worse than the baselines. Compared with the concatenation method, we are slightly surprised to see that the multi-treebank model is in some cases worse than simply training on the combined data when POS tags are used on standard Chinese and Weibo data. Besides, the UAS and LAS of Weibo for SC+CT are higher than for TE(ST) and TE(CT) with gold or predicted tags, although we also see some positive results from TE(CT).

4.5 Discussion

The experiments in this thesis have clarified many details about the differences between standard Chinese and Chinese social media data (Weibo) for dependency parsing. From the experimental results, we can conclude that several of the methods yield improvements, especially on Weibo data.

Parallel to the experiments conducted with UDPipe, we also trained the different parsing models for both standard Chinese and Weibo data with UUParser. Overall, UUParser outperforms UDPipe with higher parsing accuracy.

Both UDPipe and UUParser seem to learn better from the Weibo training data, no matter whether we use gold or predicted POS tags. On the other hand, looking at the results of self-training and co-training on only the Weibo training data, we see some indication that the model adapts to Weibo, and all of these models perform much better with UUParser than with UDPipe. The single-treebank results (including self-training, co-training and standard Chinese) have higher UAS and LAS scores than the concatenation results when testing on the same type of data (train and parse on standard Chinese, or train and parse on Weibo data). It is also interesting that our best performing model for the Weibo test data is trained only on Weibo data, although we get improvements with the concatenated models. We can also conclude that, in general, self-training is better for UDPipe, but co-training is better for UUParser.

By comparison, it is clear that the supplement of social media data did not improve UAS and LAS scores on the standard Chinese test set. However, even when the Chinese UD treebank is a bad fit, concatenation and treebank embeddings can still improve parsing accuracy over using only the best-fitting single model for Weibo. In fact, when the data set is tiny, as for Weibo, the effect of adding a small data set to a large one is minor, but we do see positive gains in our experiments.


5 Conclusion and Future Work

In this thesis, we mainly explored dependency parsing for Chinese social media data, and tried to answer the research questions listed in chapter 1 by conducting a series of parsing experiments both with UUParser and UDPipe. The main contributions of our work are as follows.

Firstly, in chapter 3, we annotated a data set of Chinese social media data, Weibo, according to the Universal Dependencies annotation guidelines. Compared to a standard Chinese treebank, we found that the Weibo data is full of noisy text. In addition, each sentence is much shorter than the sentences in our standard Chinese data, which we hypothesized would negatively affect parsing quality for standard Chinese, but positively affect Weibo parsing accuracy. We also hypothesized that ambiguity caused by mismatched cases, such as segmentation, POS tags, and dependency relations, would be a source of errors on both data sets.

Secondly, we conducted several parsing experiments to test these hypotheses, showing that the concatenation and treebank-embedding models did indeed perform better on social media data; on the other hand, not all of these models failed on the standard Chinese treebank. We discussed the results of both dependency parsers, which also confirmed the hypothesis that Weibo data benefits from new training data (including Weibo data), no matter whether we use gold or predicted POS tags during parsing. We also investigated the impact of automatic and manual annotation of such domain data by leveraging the advantages of both UUParser and UDPipe, showing that self-training is better for UDPipe, but co-training is better for UUParser. Both self-training and co-training improve substantially over the baseline on the Weibo test data. The main finding of this thesis is that these methods give substantial gains for social media data in scenarios that combine the two data sets, even though the automatically annotated data should be revised manually.

Although both parsers perform well on social media data, Weibo text brings challenges due to lexical and syntactic features that differ from standard Chinese. Our experiments show that parsing accuracy increased on Weibo data; however, the Weibo training data was pre-processed by UDPipe automatically, so many mismatched cases have not been corrected. As future work, it is essential to collect a large amount of social media data and annotate it more systematically in order to achieve better results, since we have seen the importance of the domain of the training data. We also consider using other state-of-the-art approaches to dependency parsing for Chinese social media text.


Bibliography

Baldwin, Timothy, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang (2013). “How noisy social media text, how diffrnt social media sources?” In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 356–364.

Bikel, Daniel M and David Chiang (2000). “Two statistical parsing models applied to the Chinese Treebank”. In: Proceedings of the second workshop on Chinese language processing: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 12 . Association for Computational Linguistics, pp. 1–6.

Blodgett, Su Lin, Johnny Wei, and Brendan O’Connor (2018). “Twitter Universal Dependency Parsing for African-American and Mainstream American English”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1, pp. 1415–1425.

Blum, Avrim and Tom Mitchell (1998). “Combining labeled and unlabeled data with co-training”. In: Proceedings of the eleventh annual conference on Computational learning theory. ACM, pp. 92–100.

Cao, Yuhui, Zhao Chen, Ruifeng Xu, Tao Chen, and Lin Gui (2015). “A Joint Model for Chinese Microblog Sentiment Analysis”. In: Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing. Beijing, China: Association for Computational Linguistics, July 2015, pp. 61–67.

Charniak, Eugene (1997). “Statistical Parsing with a Context-free Grammar and Word Statistics”. In: Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence . AAAI’97/IAAI’97. Providence, Rhode Island: AAAI Press, pp. 598–603.

Charniak, Eugene and Mark Johnson (2005). “Coarse-to-fine n-best parsing and MaxEnt discriminative reranking”. In: Proceedings of the 43rd annual meeting on association for computational linguistics . Association for Computational Linguistics, pp. 173–180.

Che, Wanxiang, Zhenghua Li, and Ting Liu (2010). “Ltp: A chinese language technology platform”. In: Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. Association for Computational Linguistics, pp. 13–16.

Chen, Danqi and Christopher Manning (2014). “A Fast and Accurate Dependency Parser using Neural Networks”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 740–750.

Chen, Minmin, Kilian Q Weinberger, and John Blitzer (2011). “Co-training for domain adaptation”. In: Advances in neural information processing systems, pp. 2456–2464.


Chiang, David and Daniel M Bikel (2002). “Recovering latent information in treebanks”. In: Proceedings of the 19th international conference on Computational linguistics-Volume 1. Association for Computational Linguistics, pp. 1–7.

Collins, Michael (2002). “Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms”. In: Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10. Association for Computational Linguistics, pp. 1–8.

Cutting, Doug, Julian Kupiec, Jan Pedersen, and Penelope Sibun (1992). “A practical part-of-speech tagger”. In: Third Conference on Applied Natural Language Processing .

De Marneffe, Marie-Catherine and Christopher D Manning (2008). Stanford typed dependencies manual . Tech. rep. Technical report, Stanford University.

Denis, Francois, Anne Laurent, Rémi Gilleron, and Marc Tommasi (2003). “Text classification and co-training from positive and unlabeled examples”. In: Proceedings of the ICML 2003 workshop: the continuum from labeled to unlabeled data, pp. 80–87.

Elman, Jeffrey L (1991). “Distributed representations, simple recurrent networks, and grammatical structure”. Machine learning 7.2-3, pp. 195–225.

Fu, Chen, Bai Xue, and Zhan Shaobin (2014). “A study on recursive neural network based sentiment classification of Sina Weibo”. In: 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) . IEEE, pp. 681–685.

Go, Alec, Richa Bhayani, and Lei Huang (2009). “Twitter sentiment classification using distant supervision”. CS224N Project Report, Stanford 1.12.

Goldberg, Yoav and Joakim Nivre (2012). “A dynamic oracle for arc-eager dependency parsing”. Proceedings of COLING 2012, pp. 959–976.

Graves, Alex and Jürgen Schmidhuber (2005). “Framewise phoneme classification with bidirectional LSTM and other neural network architectures”. Neural Networks 18.5-6, pp. 602–610.

Henderson, James (2004). “Discriminative Training of a Neural Network Statistical Parser”. In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume. Barcelona, Spain, July 2004, pp. 95–102.

Kiperwasser, Eliyahu and Yoav Goldberg (2016). “Simple and accurate dependency parsing using bidirectional LSTM feature representations”. Transactions of the Association for Computational Linguistics 4, pp. 313–327.

Kiritchenko, Svetlana and Stan Matwin (2011). “Email classification with co-training”. In: Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research. IBM Corp., pp. 301–312.

Kong, Lingpeng, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A Smith (2014). “A dependency parser for tweets”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1001–1012.

Kübler, Sandra, Ryan McDonald, and Joakim Nivre (2009). “Dependency parsing”. Synthesis Lectures on Human Language Technologies 1.1.

Kuhlmann, Marco, Carlos Gómez-Rodríguez, and Giorgio Satta (2011). “Dynamic programming algorithms for transition-based dependency parsers”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, pp. 673–682.

Levy, Roger and Christopher Manning (2003). “Is it harder to parse Chinese, or the Chinese Treebank?” In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1. Association for Computational Linguistics, pp. 439–446.

Lhoneux, Miryam de, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre (2017). “From Raw Text to Universal Dependencies - Look, No Tags!” In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 207–217.

Lhoneux, Miryam de, Sara Stymne, and Joakim Nivre (2017). “Arc-hybrid non-projective dependency parsing with a static-dynamic oracle”. In: Proceedings of the 15th International Conference on Parsing Technologies, pp. 99–104.

Li, Zhenghua, Min Zhang, Wanxiang Che, Ting Liu, Wenliang Chen, and Haizhou Li (2011). “Joint models for Chinese POS tagging and dependency parsing”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 1180–1191.

Ling, Wang, Guang Xiang, Chris Dyer, Alan Black, and Isabel Trancoso (2013). “Microblogs as parallel corpora”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1, pp. 176–186.

Liu, Ting, Jinshan Ma, and Sheng Li (2006). “Building a Dependency Treebank for Improving Chinese Parser.” Journal of Chinese Language and Computing 16.4, pp. 207–224.

Liu, Yijia, Yi Zhu, Wanxiang Che, Bing Qin, Nathan Schneider, and Noah A. Smith (2018). “Parsing Tweets into Universal Dependencies”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, June 2018, pp. 965–975. doi: 10.18653/v1/N18-1088.

Ma, Ji, Tong Xiao, Jingbo Zhu, and Feiliang Ren (2012). “Easy-first Chinese POS tagging and dependency parsing”. Proceedings of COLING 2012, pp. 1731–1746.

Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz (1993). “Building a Large Annotated Corpus of English: The Penn Treebank”. Computational Linguistics 19.2, pp. 313–330.

Marneffe, Marie-Catherine de, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning (2014). “Universal Stanford dependencies: A cross-linguistic typology”. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14) . Reykjavik, Iceland: European Language Resources Association (ELRA), May 2014, pp. 4585–4592.

McClosky, David, Eugene Charniak, and Mark Johnson (2006). “Effective self-training for parsing”. In: Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. Association for Computational Linguistics, pp. 152–159.


McClosky, David, Eugene Charniak, and Mark Johnson (2008). “When is self-training effective for parsing?” In: Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, pp. 561–568.

McDonald, Ryan (2006). “Discriminative Learning and Spanning Tree Algorithms for Dependency Parsing”. AAI3225503. PhD thesis. Philadelphia, PA, USA. isbn: 978-0-542-79978-5.

McDonald, Ryan and Joakim Nivre (2007). “Characterizing the errors of data-driven dependency parsing models”. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

McDonald, Ryan, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, et al. (2013). “Universal dependency annotation for multilingual parsing”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) . Vol. 2, pp. 92–97.

Nasukawa, Tetsuya and Jeonghee Yi (2003). “Sentiment analysis: Capturing favorability using natural language processing”. In: Proceedings of the 2nd international conference on Knowledge capture . ACM, pp. 70–77.

Nigam, Kamal and Rayid Ghani (2000). “Analyzing the effectiveness and applicability of co-training”. In: CIKM. Vol. 5, p. 3.

Nivre, Joakim (2003). “An Efficient Algorithm for Projective Dependency Parsing”. In: Proceedings of the Eighth International Workshop on Parsing Technologies (IWPT). Nancy, France, Apr. 2003, pp. 149–160.

Nivre, Joakim (2008). “Algorithms for Deterministic Incremental Dependency Parsing”. Computational Linguistics 34.4, pp. 513–553.

Nivre, Joakim (2009). “Non-projective dependency parsing in expected linear time”. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Volume 1. Association for Computational Linguistics, pp. 351–359.

Nivre, Joakim, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret (2007). “The CoNLL 2007 Shared Task on Dependency Parsing”. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007. Prague, Czech Republic: Association for Computational Linguistics, June 2007, pp. 915–932.

Nivre, Joakim, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan T McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. (2018). “Universal dependencies v1: A multilingual treebank collection”.

Pak, Alexander and Patrick Paroubek (2010). “Twitter as a corpus for sentiment analysis and opinion mining.” In: LREC. Vol. 10. 2010, pp. 1320–1326.

Petrov, Slav, Dipanjan Das, and Ryan McDonald (2011). “A universal part-of-speech tagset”. arXiv preprint arXiv:1104.2086.

Pierce, David and Claire Cardie (2001). “Limitations of co-training for natural language learning from large datasets”. In: Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing.
