Corpus methods in linguistics and NLP
Lecture 7: Programming for large-scale data processing
UNIVERSITY OF GOTHENBURG
Richard Johansson
today's lecture
I as you've seen, processing large corpora can take time!
I for instance, building the frequency tables in the word sketch assignment
I in this lecture, we'll think of how we can process large volumes of data by parallelizing our programs
I some basic ideas, some techniques, and pointers to software
I we'll just dip our toes, but there will be pointers for further reading
overview
basics of parallel programming
parallel programming in Python
architectures for large-scale parallel processing: MapReduce, Hadoop, Spark
speeding up by parallelizing
I can we buy a machine that runs our code 10 times faster?
I I have a 2 GHz CPU: can I get a 20 GHz CPU instead?
I it's probably easier to buy 10 machines, or a machine with 10 CPUs, and then try to make the program parallel
Moore's law
I Moore's law was formulated by Gordon Moore at Intel in the early 1970s
I overall processing power for computers doubles every 2 years
I until about 2000, this used to mean that processors got faster
I Moore's law still holds, but its effect nowadays is increased parallelization
I increased number of CPUs in computers
I and each CPU can run more than one process at a time
Moore's law (Wikipedia)
computer clusters
I computations may be distributed over large collections of machines: clusters
I for instance, Sweden has the SNIC infrastructure that connects clusters at different universities
parallelizing an algorithm
I making an algorithm work in a parallel fashion may involve significant changes
I a parallel algorithm is efficient if
T_parallel ≈ T_sequential / (number of processors)
I for instance, if I can compute a frequency table in a corpus 10 times as fast by using 10 machines
I often, this isn't exactly the case because there is some administrative overhead when we parallelize
processing embarrassingly parallel tasks
I an embarrassingly parallel (or trivially parallel) job
I can be split into separate pieces with little or no effort,
I where the pieces can be processed independently: when processing one piece, we don't need to care about what other pieces contain
I and where it's easy to collect the result in the end
I how can I process such a task if I have 10 machines (or CPUs)?
I split the data into 10 pieces (of roughly equal size)
I assign a piece to each machine
I run the 10 machines in parallel
I concatenate the 10 results
what kinds of tasks are embarrassingly parallel?
I lowercasing the 1000 text files in a directory?
I building a frequency table of the words in a corpus?
I PoS tagging? parsing?
I machine learning tasks:
I Naive Bayes training?
I perceptron training?
embarrassing parallelization on Unix-like systems
I on Unix-like systems (e.g. Mac or Linux), commands such as split can be handy
I for instance, we split a file bigfile into 10 parts:
split -n 10 bigfile smallfile_
I then we will get smallfile_aa, smallfile_ab, etc
I if we have more than one CPU on the machine, we can start multiple processes at once:
python3 do_something.py smallfile_aa &
python3 do_something.py smallfile_ab &
...
I if we have many machines, we may need to copy files; on a computer cluster with many machines, the file system is usually shared between the machines
when parallelization is not trivial
I typically, algorithms that work in an incremental fashion are hard to parallelize
I when the result in the current step depends on what has happened before
I a good example is the perceptron learning algorithm
I what we do in this step depends on all the errors we made before
I parallelized versions of the perceptron (and related algorithms such as SVM, LR) use mini-batches rather than single instances (see the sketch below)
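I as a rough illustration, here is a minimal sketch of a binary perceptron trained with mini-batch updates; this is a sketch, not code from the lecture, and it assumes that X is a NumPy matrix of feature vectors and Y a vector of labels in {-1, +1}

import numpy as np

def minibatch_perceptron(X, Y, n_epochs=10, batch_size=100):
    # X: (n_examples, n_features) feature matrix, Y: labels in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for start in range(0, len(X), batch_size):
            xb = X[start:start+batch_size]
            yb = Y[start:start+batch_size]
            # within the batch, all examples are checked against the
            # *same* weight vector -- this is the part that could be
            # distributed over several workers
            errors = yb * xb.dot(w) <= 0
            # apply the summed perceptron updates for the whole batch
            w += (yb[errors, None] * xb[errors]).sum(axis=0)
    return w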
overview
basics of parallel programming
parallel programming in Python
architectures for large-scale parallel processing: MapReduce, Hadoop, Spark
simple parallelization in Python
I in programming, we distinguish between two types of parallel activities:
I threads are parallel activities that share memory (variables, data structures, etc)
I processes run with separate memory, so they need to communicate over the network or through files
I in Python, for various technical reasons (chiefly the global interpreter lock), using threads is in general less efficient than using separate processes
I but threading can be useful for many other purposes, for instance to process events in a server application
I if you're interested, take a look at the threading library
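I as a first taste (a minimal sketch, not an example from the lecture), starting and joining a few threads with the threading library looks like this:

import threading
import time

def background_task(name):
    # threads share memory with the main program, so they could also
    # update shared data structures (with appropriate locking)
    time.sleep(0.5)
    print('thread {0} is done'.format(name))

threads = [threading.Thread(target=background_task, args=[i]) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for all threads to finish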
the multiprocessing library
I the multiprocessing library (included in Python's standard library) contains some functions for managing processes:
I creating a process
I waiting for a process to end, or stopping it forcibly
I communicating between processes
I synchronization: making sure that processes don't mess up for each other
I managing a group of slave processes: the Pool
simple multiprocessing example
import time
import random
import multiprocessing as mp

def do_something(job_nbr):
    # each worker just prints a message at random intervals, forever
    while True:
        print('Process {0} says hello!'.format(job_nbr))
        time.sleep(random.random())

if __name__ == '__main__':
    nbr_workers = 5
    for i in range(nbr_workers):
        worker = mp.Process(target=do_something, args=[i])
        worker.start()
master-slave architecture and the Pool
I in many cases, we have a master process (the main program) that creates tasks for a number of slave processes that work in parallel to do the hard work
I with the Pool class from the multiprocessing library, we can simplify the management of slaves:
I the master process submits tasks to the Pool, which distributes the tasks to the slaves
I the slaves process the tasks in parallel
I the master collects the results
Pool example
import multiprocessing as mp

### THIS PART IS EXECUTED IN THE SLAVE PROCESS ###
def compute_square(number):
    return number*number

### THIS PART IS EXECUTED IN THE MASTER PROCESS ###
square_list = []

def add_square(square):
    square_list.append(square)

if __name__ == '__main__':
    pool = mp.Pool(processes=4)  # or mp.cpu_count()
    for i in range(10):
        # submit a job
        pool.apply_async(compute_square, args=[i], callback=add_square)
    pool.close()  # no more jobs will be submitted
    pool.join()   # wait for all jobs to finish
    print(square_list)
word counting example: not parallelized
I now we'll do something more useful: computing frequencies
I we'll start from this non-parallelized example:
from collections import Counter

filename = ... something ...
freqs = Counter()
with open(filename) as f:
    for l in f:
        freqs.update(l.split())

print(freqs.most_common(5))
I now, let's divide this into master and slave
parallelized word counting example: slave part
def compute_frequencies(lines):
    # make a frequency table for these lines
    freqs = Counter()
    for l in lines:
        freqs.update(l.split())
    # send the frequency table back to the master
    return freqs
parallelized word count example: master part (1)
if __name__ == '__main__':
    filename = ... something ...
    pool = mp.Pool(processes=mp.cpu_count())
    with open(filename) as f:
        chunk = read_chunk(f, 100000)
        while chunk:
            # submit a job
            pool.apply_async(compute_frequencies, args=[chunk],
                             callback=merge)
            chunk = read_chunk(f, 100000)
    pool.close()  # tell the pool we're done
    pool.join()   # wait for all jobs to finish
    print(total_result.most_common(5))
parallelized word count example: master part (2)
# this is the callback function, called every time we
# get a partial frequency table from a slave
total_result = Counter()

def merge(partial_result):
    total_result.update(partial_result)

# helper function to read a number of lines that should
# be sent to a slave
def read_chunk(f, chunk_size):
    chunk = []
    for line in f:
        chunk.append(line)
        if len(chunk) == chunk_size:
            break
    return chunk
word counting example: how much improvement?
[plot: elapsed time in seconds (0-20) versus number of processes (1-10)]
why not half the time with twice the number of processes?
I splitting:
I reading the file, dividing it into chunks
I communication overhead:
I processes don't share memory, so they need to send and receive data
I inputs (the chunks) and outputs (the partial tables) are pickled and unpickled
I this becomes even more critical if the processes run on separate machines, because then the data is sent over a network
I assembling the end result:
I for instance, merging the partial tables
I process administration: starting the worker processes and scheduling the jobs
overview
basics of parallel programming
parallel programming in Python
architectures for large-scale parallel processing: MapReduce, Hadoop, Spark
architectures for large-scale processing
I on a single machine, multiprocessing solutions such as Python's Pool can be useful, although a bit low-level
I we'll have a look at frameworks that can help us program for larger systems that may be distributed on many machines
connections to functional programming
I some architectures for large-scale processing borrow a few concepts from functional programming
I FP has the following characteristics:
I data structures are immutable (not modifiable): instead of being modified, they are transformed into new structures
I many standard operations on data structures (transforming, collecting, filtering, etc) are implemented as higher-order functions: functions that take other functions as input
I (in Python, list comprehension plays much of the same role)
I uses small on-the-fly functions a lot: lambda in Python
I FP is attractive for this purpose because it separates the what from the how
I we want to transform a list, but we don't want to worry about how its parts are distributed to different machines or in which order the parts are processed
a higher-order function in Python: map
I the function map applies a function to all elements in a collection
def add1(x):
    return x + 1

print(list(map(add1, [1, 2, 3, 4, 5])))
# prints [2, 3, 4, 5, 6]

print(list(map(lambda x: x + 1, [1, 2, 3, 4, 5])))
# prints [2, 3, 4, 5, 6]

print(list(map(len, ['a', 'few', 'strings'])))
# prints [1, 3, 7]
another higher-order function: reduce
I the function reduce applies a function to accumulate the elements in a collection
I typical example: summing or multiplying all elements
I reduce lives in the functools library in Python
from functools import reduce

def add(x, y):
    return x + y

print(reduce(add, [1, 2, 3, 4, 5]))
# prints 15

print(reduce(lambda x, y: x + y, [1, 2, 3, 4, 5]))
# prints 15
contrived example using map and reduce
I sum the lengths of some words:
words = ['a', 'few', 'strings']
print(reduce(lambda x, y: x + y, map(len, words)))
# prints 11
less contrived example
I the parallelized word counting program we wrote before can be thought of as mapping and reducing (see the sketch below):
I map: for each chunk, compute a partial frequency table
I reduce: combine all partial tables into a complete table
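I a minimal non-parallel sketch of that view, using the built-in map and functools.reduce; the corpus is assumed here to be pre-split into chunks of lines

from collections import Counter
from functools import reduce

def count_chunk(lines):
    # the "map" step: a partial frequency table for one chunk
    freqs = Counter()
    for line in lines:
        freqs.update(line.split())
    return freqs

# hypothetical pre-split corpus: two chunks of lines
chunks = [['a rose is a rose', 'is a rose'], ['a rose by any other name']]

# the "reduce" step: merge the partial tables into a complete table
total = reduce(lambda t1, t2: t1 + t2, map(count_chunk, chunks))
print(total.most_common(3))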
MapReduce
I MapReduce [Dean and Ghemawat, 2004] is an architecture developed by Google that models large-scale computation tasks in terms of mapping and reducing
I see also this paper for a popular-scientific introduction
I the user defines the map and reduce tasks to be carried out
I MapReduce was designed to take care of many of the complexities in distributed processing:
I large files can be distributed across several machines
I to minimize network traffic, tasks are carried out locally as much as possible: a machine handles the piece it stores
I sometimes computers break down, so the system may need to reprocess tasks that have disappeared
Hadoop
I Hadoop is an open-source implementation of an architecture similar to Google's ideas
I https://hadoop.apache.org/
I its central parts are
I processing part: Hadoop MapReduce
I file system: HDFS (Hadoop Distributed File System)
I . . . but it also has many other components
Spark
I Spark [Zaharia et al., 2012] is a more recent framework that addresses some of the drawbacks of Hadoop
I most importantly, it tries to keep data in memory, rather than in files, which can lead to significant speedups for some tasks
I Spark can be installed not only on a cluster but also on a single machine (standalone mode)
I see http://spark.apache.org/
word counting example in Spark
I the Spark engine is implemented in the Scala language
I a fairly new functional programming language that runs on the Java virtual machine
I however, we can write Spark programs not only in Scala or Java, but also in other languages, including Python
I here's a Python example from the Spark web page:
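I the code itself did not survive in these notes; what follows is the standard word-count example, reconstructed from the step-by-step walkthrough on the next slides (spark is assumed to be a SparkContext, and NAME_OF_FILE a path to a text file)

text_file = spark.textFile(NAME_OF_FILE)
counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)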
intuition of the word counting program
Spark's fundamental data structure: the RDD
I Spark works by processing RDDs: Resilient Distributed Datasets
I Resilient: it recomputes data in case of loss
I Distributed: may be spread out over different machines
I conceptually, an RDD is similar to a Python list (or more precisely, a generator)
I word counting example:
1. RDD with lines
2. RDD with tokens
3. RDD with (token, 1) pairs
4. RDD with (token, count) pairs
transformations of RDDs
I Spark includes many transformations of RDDs
I many of the transformations are well-known higher-order functions in FP
I not just map and reduce!
I check the overview here: http://spark.apache.org/docs/latest/programming-guide.html
I see a complete list of transformations here: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
I let's walk through the steps in the word counting program
step 1: reading a text le as lines
I spark.textFile reads a text file and returns an RDD containing the lines
text_file = spark.textFile(NAME_OF_FILE)
step 2: flatMap; splitting the lines into tokens
I flatMap is a transformation that applies some function to all elements in an RDD
I . . . and then flattens the result: removes lists inside the RDD
I we use this to convert the lines into a new RDD with tokens
step1 = text_file.flatMap(lambda line: line.split())
step 3: map
I map is a transformation that applies some function to all elements in an RDD
I this is simpler than flatMap: no flattening involved
I in our case, we make a new RDD consisting of word-count pairs (but all counts are 1 so far)
step2 = step1.map(lambda word: (word, 1))
step 4: reduceByKey
I reduceByKey is similar to reduce that we explained before, but operates on key-value pairs
I and the aggregation operation is applied to the values, separately for each key
I in our case, we sum all the 1s for each word separately
step3 = step2.reduceByKey(lambda a, b: a + b)
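I the walkthrough stops at the RDD of (word, count) pairs; transformations are lazy, so to actually trigger the computation and get results back to the master program, one would add an action at the end, for instance (not shown on the original slide):

# an action forces the computation; take() returns a few pairs to the
# master program, saveAsTextFile() would write the whole RDD to disk
print(step3.take(5))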
the final VG assignment: using Spark
I do a few small word counting exercises using Spark
I we have installed Spark on the lab machines
I . . . or you may install it on your own
I we don't have a real cluster: you'll have to make believe!
references I
I Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI'04.
I Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.