
(1)

Corpus methods in linguistics and NLP
Lecture 7: Programming for large-scale data processing

UNIVERSITY OF GOTHENBURG

Richard Johansson

(2)

today's lecture

▶ as you've seen, processing large corpora can take time!
▶ for instance, building the frequency tables in the word sketch assignment
▶ in this lecture, we'll think of how we can process large volumes of data by parallelizing our programs
▶ some basic ideas, some techniques, and pointers to software
▶ we'll just dip our toes, but there will be pointers for further reading

(3)

overview

basics of parallel programming

parallel programming in Python

architectures for large-scale parallel processing: MapReduce, Hadoop, Spark

(4)

speeding up by parallelizing

▶ can we buy a machine that runs our code 10 times faster?
▶ I have a 2 GHz CPU: can I get a 20 GHz CPU instead?
▶ it's probably easier to buy 10 machines, or a machine with 10 CPUs, and then try to make the program parallel

(5)

Moore's law

▶ Moore's law was formulated by Gordon Moore at Intel in the early 1970s
▶ overall processing power for computers doubles every 2 years
▶ until about 2000, this used to mean that processors got faster
▶ Moore's law still holds, but its effect nowadays is increased parallelization
  ▶ increased number of CPUs in computers
  ▶ and each CPU can run more than one process at a time

(6)

Moore's law (Wikipedia)

(7)

computer clusters

▶ computations may be distributed over large collections of machines: clusters
▶ for instance, Sweden has the SNIC infrastructure that connects clusters at different universities

(8)

parallelizing an algorithm

▶ making an algorithm work in a parallel fashion may involve significant changes
▶ a parallel algorithm is efficient if

    T_parallel ≈ T_sequential / (number of processors)

▶ for instance, if I can compute a frequency table in a corpus 10 times as fast by using 10 machines
▶ often, this isn't exactly the case, because there is some administrative overhead when we parallelize

(9)

processing embarrassingly parallel tasks

▶ an embarrassingly parallel (or trivially parallel) job
  ▶ can be split into separate pieces with little or no effort,
  ▶ the pieces can be processed independently: when processing one piece, we don't need to care about what other pieces contain,
  ▶ and it's easy to collect the results in the end
▶ how can I process such a task if I have 10 machines (or CPUs)?
  ▶ split the data into 10 pieces (of roughly equal size)
  ▶ assign a piece to each machine
  ▶ run the 10 machines in parallel
  ▶ concatenate the 10 results


(11)

what kinds of tasks are embarrassingly parallel?

▶ lowercasing the 1000 text files in a directory?
▶ building a frequency table of the words in a corpus?
▶ PoS tagging? parsing?
▶ machine learning tasks:
  ▶ Naive Bayes training?
  ▶ perceptron training?

(12)

embarrassing parallelization on Unix-like systems

▶ on Unix-like systems (e.g. Mac or Linux), commands such as split can be handy
▶ for instance, we split a file bigfile into 10 parts:

    split -n 10 bigfile smallfile_

▶ then we will get smallfile_aa, smallfile_ab, etc.
▶ if we have more than one CPU on the machine, we can start multiple processes at once (see also the sketch below):

    python3 do_something.py smallfile_aa &
    python3 do_something.py smallfile_ab &
    ...

▶ if we have many machines, we may need to copy files; on a computer cluster with many machines, the file system is usually shared between the machines
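if you prefer to drive this from Python, here is a small hypothetical sketch along the same lines; it assumes that do_something.py writes its result to <input>.out, which is a made-up convention:

import glob
import subprocess

# start one process per piece, then wait for all of them to finish
procs = [subprocess.Popen(['python3', 'do_something.py', name])
         for name in sorted(glob.glob('smallfile_??'))]
for p in procs:
    p.wait()

# concatenate the partial results into a single file
with open('result.txt', 'w') as out:
    for name in sorted(glob.glob('smallfile_??.out')):
        with open(name) as f:
            out.write(f.read())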

(13)

when parallelization is not trivial

▶ typically, algorithms that work in an incremental fashion are hard to parallelize
▶ when the result in the current step depends on what has happened before
▶ a good example is the perceptron learning algorithm
  ▶ what we do in this step depends on all the errors we made before
▶ parallelized versions of the perceptron (and of related algorithms such as SVM and LR) use mini-batches rather than single instances (see the sketch below)
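to make the mini-batch idea concrete, here is a minimal sketch (not an algorithm from the course, and the data is synthetic): each worker computes perceptron updates for its share of the current mini-batch against a fixed copy of the weights, and the master applies the summed updates before moving on to the next mini-batch

import multiprocessing as mp
import numpy as np

def batch_updates(args):
    # sum of perceptron updates for the misclassified examples in one piece
    w, X, y = args
    delta = np.zeros_like(w)
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:
            delta += y_i * x_i
    return delta

if __name__ == '__main__':
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))            # synthetic training data
    y = np.sign(X @ rng.normal(size=20))
    w = np.zeros(20)
    with mp.Pool(processes=4) as pool:
        for X_b, y_b in zip(np.array_split(X, 10), np.array_split(y, 10)):
            pieces = [(w, Xp, yp) for Xp, yp in
                      zip(np.array_split(X_b, 4), np.array_split(y_b, 4))]
            # the workers handle the pieces of the mini-batch in parallel
            w = w + sum(pool.map(batch_updates, pieces))
    print('training accuracy:', np.mean(np.sign(X @ w) == y))

since the workers only read the weights and only the master updates them, each mini-batch is itself a small embarrassingly parallel job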

(14)

overview

basics of parallel programming

parallel programming in Python

architectures for large-scale parallel processing: MapReduce, Hadoop, Spark

(15)

simple parallelization in Python

I in programming, we distinguish between two types of parallel activities:

I threadsare parallel activities that share memory (variables, data structures, etc)

I processesrun with separate memory, so they need to communicate over the network or through les

I in Python, forvarious technical reasons, using threads is less ecient in general than using separate processes

I but threading can be useful for many other purposes, for instance to process events in a server application

I if you're interested, take a look at the threading library

(16)

the multiprocessing library

I the multiprocessing library (included in Python's standard library) contains some functions for managing processes:

I creating a process

I waiting for a process to end, or stop it violently

I communicating between processes

I synchronization: making sure that processes don't mess up for each other

I managing a group of slave processes: the Pool

(17)

simple multiprocessing example

import time
import multiprocessing as mp
import random

def do_something(job_nbr):
    while True:
        print('Process {0} says hello!'.format(job_nbr))
        time.sleep(random.random())

if __name__ == '__main__':
    nbr_workers = 5
    for i in range(nbr_workers):
        worker = mp.Process(target=do_something, args=[i])
        worker.start()

(18)

master-slave architecture and the Pool

▶ in many cases, we have a master process (the main program) that creates tasks for a number of slave processes that work in parallel to do the hard work
▶ with the Pool class from the multiprocessing library, we can simplify the management of the slaves:
  ▶ the master process submits tasks to the Pool, which distributes the tasks to the slaves
  ▶ the slaves process the tasks in parallel
  ▶ the master collects the results

(19)

Pool example

import multiprocessing as mp

### THIS PART IS EXECUTED IN THE SLAVE PROCESS ###

def compute_square(number):
    return number*number

### THIS PART IS EXECUTED IN THE MASTER PROCESS ###

square_list = []

def add_square(square):
    square_list.append(square)

if __name__ == '__main__':
    pool = mp.Pool(processes=4)   # or mp.cpu_count()
    for i in range(10):
        # submit a job
        pool.apply_async(compute_square, args=[i], callback=add_square)
    pool.close()   # no more jobs will be submitted
    pool.join()    # wait for all jobs to finish
    print(square_list)

(20)

word counting example: not parallelized

▶ now we'll do something more useful: computing frequencies
▶ we'll start from this non-parallelized example:

from collections import Counter

filename = ... something ...

freqs = Counter()
with open(filename) as f:
    for l in f:
        freqs.update(l.split())

print(freqs.most_common(5))

▶ now, let's divide this into master and slave

(21)

parallelized word counting example: slave part

def compute_frequencies(lines):
    # make a frequency table for these lines
    freqs = Counter()
    for l in lines:
        freqs.update(l.split())
    # send the frequency table back to the master
    return freqs

(22)

parallelized word count example: master part (1)

if __name__ == '__main__':
    filename = ... something ...
    pool = mp.Pool(processes=mp.cpu_count())
    with open(filename) as f:
        chunk = read_chunk(f, 100000)
        while chunk:
            # submit a job
            pool.apply_async(compute_frequencies, args=[chunk],
                             callback=merge)
            chunk = read_chunk(f, 100000)
    pool.close()   # tell the pool we're done
    pool.join()    # wait for all jobs to finish
    print(total_result.most_common(5))

(23)

parallelized word count example: master part (2)

# this is the callback function, called every time we
# get a partial frequency table from a slave
total_result = Counter()

def merge(partial_result):
    total_result.update(partial_result)

# helper function to read a number of lines that should
# be sent to a slave
def read_chunk(f, chunk_size):
    chunk = []
    for line in f:
        chunk.append(line)
        if len(chunk) == chunk_size:
            break
    return chunk

(24)

word counting example: how much improvement?

[plot: running time in seconds (0-20) as a function of the number of processes (1-10)]

(25)

why not half the time with twice the number of processes?

▶ splitting:
  ▶ reading the file, dividing it into chunks
▶ communication overhead:
  ▶ processes don't share memory, so they need to send and receive data
  ▶ inputs (the chunks) and outputs (the partial tables) are pickled and unpickled
  ▶ this becomes even more critical if the processes run on separate machines, because then the data is sent over a network
▶ assembling the end result:
  ▶ for instance, merging the partial tables
▶ process administration:
  ▶ starting, scheduling, and stopping the worker processes also takes some time

(26)

overview

basics of parallel programming

parallel programming in Python

architectures for large-scale parallel processing: MapReduce, Hadoop, Spark

(27)

architectures for large-scale processing

▶ on a single machine, multiprocessing solutions such as Python's Pool can be useful, although a bit low-level
▶ we'll have a look at frameworks that can help us program for larger systems that may be distributed on many machines

(28)

connections to functional programming

▶ some architectures for large-scale processing borrow a few concepts from functional programming (FP)
▶ FP has the following characteristics:
  ▶ data structures are immutable (not modifiable): instead of being modified, they are transformed into new structures
  ▶ many standard operations on data structures (transforming, collecting, filtering, etc.) are implemented as higher-order functions: functions that take other functions as input
  ▶ (in Python, list comprehensions play much of the same role; see the small comparison below)
  ▶ small on-the-fly functions are used a lot: lambda in Python
▶ FP is attractive for this purpose because it separates the what from the how
  ▶ we want to transform a list, but we don't want to worry about how its parts are distributed to different machines or in which order the parts are processed
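as a small illustration of the point about list comprehensions, the higher-order-function style and the comprehension style express the same transformation:

words = ['a', 'few', 'strings']

lengths = list(map(len, words))      # higher-order function style
lengths = [len(w) for w in words]    # list comprehension style
# both give [1, 3, 7]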

(29)

a higher-order function in Python: map

▶ the function map applies a function to all elements in a collection

def add1(x):
    return x + 1

print(list(map(add1, [1, 2, 3, 4, 5])))
# prints [2, 3, 4, 5, 6]

print(list(map(lambda x: x + 1, [1, 2, 3, 4, 5])))
# prints [2, 3, 4, 5, 6]

print(list(map(len, ['a', 'few', 'strings'])))
# prints [1, 3, 7]

(30)

another higher-order function: reduce

▶ the function reduce applies a function to accumulate the elements in a collection
▶ typical example: summing or multiplying all elements
▶ reduce lives in the functools library in Python

from functools import reduce

def add(x, y):
    return x + y

print(reduce(add, [1, 2, 3, 4, 5]))
# prints 15

print(reduce(lambda x, y: x + y, [1, 2, 3, 4, 5]))
# prints 15

(31)

contrived example using map and reduce

▶ sum the lengths of some words:

words = ['a', 'few', 'strings']

print(reduce(lambda x, y: x + y, map(len, words)))

# prints 11

(32)

less contrived example

▶ the parallelized word counting program we wrote before can be thought of as mapping and reducing (a small single-process illustration follows below):
  ▶ map: for each chunk, compute a partial frequency table
  ▶ reduce: combine all partial tables into a complete table
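a minimal single-process illustration of that view, using the map and reduce functions shown above (the chunks here are just hard-coded lists of lines):

from collections import Counter
from functools import reduce

chunks = [['a few lines', 'of text'],
          ['some more lines', 'of text']]

def count_chunk(lines):
    # map step: a partial frequency table for one chunk
    freqs = Counter()
    for l in lines:
        freqs.update(l.split())
    return freqs

def merge(c1, c2):
    # reduce step: combine two partial tables
    return c1 + c2

total = reduce(merge, map(count_chunk, chunks), Counter())
print(total.most_common(3))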

(33)

MapReduce

▶ MapReduce [Dean and Ghemawat, 2004] is an architecture developed by Google that models large-scale computation tasks in terms of mapping and reducing
▶ see also this paper for a popular-scientific introduction
▶ the user defines the map and reduce tasks to be carried out
▶ MapReduce was designed to take care of many of the complexities in distributed processing:
  ▶ large files can be distributed across several machines
  ▶ to minimize network traffic, tasks are carried out locally as much as possible: a machine handles the piece it stores
  ▶ sometimes computers break down, so the system may need to re-run tasks whose results have been lost

(34)

Hadoop

▶ Hadoop is an open-source implementation of an architecture similar to Google's ideas
▶ https://hadoop.apache.org/
▶ its central parts are
  ▶ the processing part: Hadoop MapReduce
  ▶ the file system: HDFS (Hadoop Distributed File System)
▶ ...but it also has many other components

(35)

Spark

▶ Spark [Zaharia et al., 2012] is a more recent framework that addresses some of the drawbacks of Hadoop
▶ most importantly, it tries to keep data in memory rather than in files, which can lead to significant speedups for some tasks
▶ Spark can be installed not only on a cluster but also on a single machine (standalone mode)
▶ see http://spark.apache.org/

(36)

word counting example in Spark

▶ the Spark engine is implemented in the Scala language
  ▶ a fairly new functional programming language that runs on the Java virtual machine
▶ however, we can write Spark programs not only in Scala or Java but also in other languages, including Python
▶ here's a Python example from the Spark web page:
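a word counting example in the spirit of the one on the Spark web page, assembled from the step-by-step walkthrough on the following slides (NAME_OF_FILE is a placeholder):

text_file = spark.textFile(NAME_OF_FILE)

counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

# one way to pull the result back to the driver program
print(counts.collect())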

(37)

intuition of the word counting program

(38)

Spark's fundamental data structure: the RDD

▶ Spark works by processing RDDs: Resilient Distributed Datasets
  ▶ Resilient: it recomputes data in case of loss
  ▶ Distributed: may be spread out over different machines
▶ conceptually, an RDD is similar to a Python list (or more precisely, a generator)
▶ word counting example:
  1. RDD with lines
  2. RDD with tokens
  3. RDD with (token, 1) pairs
  4. RDD with (token, count) pairs

(39)

transformations of RDDs

▶ Spark includes many transformations of RDDs
▶ many of the transformations are well-known higher-order functions from FP
  ▶ not just map and reduce!
▶ check the overview here:
  http://spark.apache.org/docs/latest/programming-guide.html
▶ see a complete list of transformations here:
  http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
▶ let's walk through the steps in the word counting program

(40)

step 1: reading a text file as lines

▶ spark.textFile reads a text file and returns an RDD containing the lines

text_file = spark.textFile(NAME_OF_FILE)

(41)

step 2: flatMap; splitting the lines into tokens

▶ flatMap is a transformation that applies some function to all elements in an RDD
▶ ...and then flattens the result: removes lists inside the RDD
▶ we use this to convert the lines into a new RDD with tokens:

step1 = text_file.flatMap(lambda line: line.split())

(42)

step 3: map

▶ map is a transformation that applies some function to all elements in an RDD
▶ this is simpler than flatMap: no flattening involved
▶ in our case, we make a new RDD consisting of word-count pairs (but all counts are 1 so far):

step2 = step1.map(lambda word: (word, 1))

(43)

step 4: reduceByKey

▶ reduceByKey is similar to the reduce we explained before, but operates on key-value pairs
▶ the aggregation operation is applied to the values, separately for each key
▶ in our case, we sum up the 1s for each word separately:

step3 = step2.reduceByKey(lambda a, b: a + b)
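to see what reduceByKey computes, here is a plain-Python illustration (not Spark) on a tiny hard-coded list of pairs:

pairs = [('a', 1), ('few', 1), ('a', 1)]

combine = lambda a, b: a + b   # the same function we pass to reduceByKey

counts = {}
for key, value in pairs:
    counts[key] = combine(counts[key], value) if key in counts else value

print(counts)   # prints {'a': 2, 'few': 1}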

(44)

the final VG assignment: using Spark

▶ do a few small word counting exercises using Spark
▶ we have installed Spark on the lab machines
  ▶ ...or you may install it on your own
▶ we don't have a real cluster: you'll have to make believe!

(45)

references

▶ Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI'04.
▶ Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.
