Corpus methods in linguistics and NLP
Lecture 7: Programming for large-scale data processing
UNIVERSITY OF GOTHENBURG
Richard Johansson
today's lecture
I as you've seen, processing large corpora can take time!
I for instance, building the frequency tables in the word sketch assignment
I in this lecture, we'll think of how we can process large volumes of data by parallelizing our programs
I some basic ideas, some techniques, and pointers to software
I we'll just dip our toes, but there will be pointers for further reading
overview
basics of parallel programming
parallel programming in Python
architectures for large-scale parallel processing: MapReduce, Hadoop, Spark
speeding up by parallelizing
I can we buy a machine that runs our code 10 times faster?
I I have a 2 GHz CPU: can I get a 20 GHz CPU instead?
I it's probably easier to buy 10 machines, or a machine with 10 CPUs, and then try to make the program parallel
Moore's law
I Moore's law was formulated by Gordon Moore at Intel in the early 1970s
I overall processing power for computers doubles every 2 years
I until about 2000, this used to mean that processors got faster
I Moore's law still holds, but its effect nowadays is increased parallelization
I increased number of CPUs in computers
I and each CPU can run more than one process at a time
Moore's law (Wikipedia)
computer clusters
I computations may be distributed over large collections of machines: clusters
I for instance, Sweden has the SNIC infrastructure that connects clusters at different universities
parallelizing an algorithm
I making an algorithm work in a parallel fashion may involve significant changes
I a parallel algorithm is efficient if
T_parallel ≈ T_sequential / (number of processors)
I for instance, if I can compute a frequency table in a corpus 10 times as fast by using 10 machines
I often, this isn't exactly the case because there is some administrative overhead when we parallelize
processing embarrassingly parallel tasks
I an embarrassingly parallel (or trivially parallel) job
I can be split into separate pieces with little or no effort,
I where the pieces can be processed independently: when processing one piece, we don't need to care about what other pieces contain
I and where it's easy to collect the result in the end
I how can I process such a task if I have 10 machines (or CPUs)?
I split the data into 10 pieces (of roughly equal size)
I assign a piece to each machine
I run the 10 machines in parallel
I concatenate the 10 results
what kinds of tasks are embarrassingly parallel?
I lowercasing the 1000 text files in a directory?
I building a frequency table of the words in a corpus?
I PoS tagging? parsing?
I machine learning tasks:
I Naive Bayes training?
I perceptron training?
embarrassing parallelization on Unix-like systems
I on Unix-like systems (e.g. Mac or Linux), commands such as split can be handy
I for instance, we split a file bigfile into 10 parts:
split -n 10 bigfile smallfile_
I then we will get smallfile_aa, smallfile_ab, etc
I if we have more than one CPU on the machine, we can start multiple processes at once:
python3 do_something.py smallfile_aa &
python3 do_something.py smallfile_ab &
...
I if we have many machines, we may need to copy files; on a computer cluster with many machines, the file system is usually shared between the machines
when parallelization is not trivial
I typically, algorithms that work in an incremental fashion are hard to parallelize
I when the result in the current step depends on what has happened before
I a good example is the perceptron learning algorithm
I what we do in this step depends on all the errors we made before
I parallelized versions of the perceptron (and related algorithms such as SVM, LR) use mini-batches rather than single instances (see the sketch below)
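I as a rough illustration, here is a minimal sketch of a binary perceptron trained with mini-batch updates; this is a sketch, not code from the lecture, and it assumes that X is a NumPy matrix of feature vectors and Y a vector of labels in {-1, +1}

import numpy as np

def minibatch_perceptron(X, Y, n_epochs=10, batch_size=100):
    # X: (n_examples, n_features) feature matrix, Y: labels in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for start in range(0, len(X), batch_size):
            xb = X[start:start+batch_size]
            yb = Y[start:start+batch_size]
            # within the batch, all examples are checked against the
            # *same* weight vector -- this is the part that could be
            # distributed over several workers
            errors = yb * xb.dot(w) <= 0
            # apply the summed perceptron updates for the whole batch
            w += (yb[errors, None] * xb[errors]).sum(axis=0)
    return w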
overview
basics of parallel programming
parallel programming in Python
architectures for large-scale parallel processing: MapReduce, Hadoop, Spark
simple parallelization in Python
I in programming, we distinguish between two types of parallel activities:
I threads are parallel activities that share memory (variables, data structures, etc)
I processes run with separate memory, so they need to communicate over the network or through files
I in Python, for various technical reasons (chiefly the global interpreter lock), using threads is in general less efficient than using separate processes
I but threading can be useful for many other purposes, for instance to process events in a server application
I if you're interested, take a look at the threading library
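I as a first taste (a minimal sketch, not an example from the lecture), starting and joining a few threads with the threading library looks like this:

import threading
import time

def background_task(name):
    # threads share memory with the main program, so they could also
    # update shared data structures (with appropriate locking)
    time.sleep(0.5)
    print('thread {0} is done'.format(name))

threads = [threading.Thread(target=background_task, args=[i]) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for all threads to finish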
the multiprocessing library
I the multiprocessing library (included in Python's standard library) contains some functions for managing processes:
I creating a process
I waiting for a process to end, or stopping it forcibly
I communicating between processes
I synchronization: making sure that processes don't mess up for each other
I managing a group of slave processes: the Pool
simple multiprocessing example
import time
import random
import multiprocessing as mp

def do_something(job_nbr):
    # each worker just prints a message at random intervals, forever
    while True:
        print('Process {0} says hello!'.format(job_nbr))
        time.sleep(random.random())

if __name__ == '__main__':
    nbr_workers = 5
    for i in range(nbr_workers):
        worker = mp.Process(target=do_something, args=[i])
        worker.start()
master-slave architecture and the Pool
I in many cases, we have a master process (the main program) that creates tasks for a number of slave processes that work in parallel to do the hard work
I with the Pool class from the multiprocessing library, we can simplify the management of slaves:
I the master process submits tasks to the Pool, which distributes the tasks to the slaves
I the slaves process the tasks in parallel
I the master collects the results
Pool example
import multiprocessing as mp

### THIS PART IS EXECUTED IN THE SLAVE PROCESS ###
def compute_square(number):
    return number*number

### THIS PART IS EXECUTED IN THE MASTER PROCESS ###
square_list = []

def add_square(square):
    square_list.append(square)

if __name__ == '__main__':
    pool = mp.Pool(processes=4)  # or mp.cpu_count()
    for i in range(10):
        # submit a job
        pool.apply_async(compute_square, args=[i], callback=add_square)
    pool.close()  # no more jobs will be submitted
    pool.join()   # wait for all jobs to finish
    print(square_list)
word counting example: not parallelized
I now we'll do something more useful: computing frequencies
I we'll start from this non-parallelized example:
from collections import Counter

filename = ... something ...
freqs = Counter()
with open(filename) as f:
    for l in f:
        freqs.update(l.split())

print(freqs.most_common(5))
I now, let's divide this into master and slave
parallelized word counting example: slave part
def compute_frequencies(lines):
    # make a frequency table for these lines
    freqs = Counter()
    for l in lines:
        freqs.update(l.split())
    # send the frequency table back to the master
    return freqs
parallelized word count example: master part (1)
if __name__ == '__main__':
    filename = ... something ...
    pool = mp.Pool(processes=mp.cpu_count())
    with open(filename) as f:
        chunk = read_chunk(f, 100000)
        while chunk:
            # submit a job
            pool.apply_async(compute_frequencies, args=[chunk],
                             callback=merge)
            chunk = read_chunk(f, 100000)
    pool.close()  # tell the pool we're done
    pool.join()   # wait for all jobs to finish
    print(total_result.most_common(5))
parallelized word count example: master part (2)
# this is the callback function, called every time we
# get a partial frequency table from a slave
total_result = Counter()

def merge(partial_result):
    total_result.update(partial_result)

# helper function to read a number of lines that should
# be sent to a slave
def read_chunk(f, chunk_size):
    chunk = []
    for line in f:
        chunk.append(line)
        if len(chunk) == chunk_size:
            break
    return chunk
word counting example: how much improvement?
[plot: elapsed time in seconds (0-20) versus number of processes (1-10)]
why not half the time with twice the number of processes?
I splitting:
I reading the file, dividing it into chunks
I communication overhead:
I processes don't share memory, so they need to send and receive data
I inputs (the chunks) and outputs (the partial tables) are pickled and unpickled
I this becomes even more critical if the processes run on separate machines, because then the data is sent over a network
I assembling the end result:
I for instance, merging the partial tables
I process administration: starting the worker processes and scheduling the jobs
overview
basics of parallel programming
parallel programming in Python
architectures for large-scale parallel processing: MapReduce, Hadoop, Spark
architectures for large-scale processing
I on a single machine, multiprocessing solutions such as Python's Pool can be useful, although a bit low-level
I we'll have a look at frameworks that can help us program for larger systems that may be distributed on many machines
connections to functional programming
I some architectures for large-scale processing borrow a few concepts from functional programming
I FP has the following characteristics:
I data structures are immutable (not modifiable): instead of being modified, they are transformed into new structures
I many standard operations on data structures (transforming, collecting, filtering, etc) are implemented as higher-order functions: functions that take other functions as input
I (in Python, list comprehension plays much of the same role)
I uses small on-the-fly functions a lot: lambda in Python
I FP is attractive for this purpose because it separates the what from the how
I we want to transform a list, but we don't want to worry about how its parts are distributed to different machines or in which order the parts are processed
a higher-order function in Python: map
I the function map applies a function to all elements in a collection
def add1(x):
    return x + 1

print(list(map(add1, [1, 2, 3, 4, 5])))
# prints [2, 3, 4, 5, 6]

print(list(map(lambda x: x + 1, [1, 2, 3, 4, 5])))
# prints [2, 3, 4, 5, 6]

print(list(map(len, ['a', 'few', 'strings'])))
# prints [1, 3, 7]
another higher-order function: reduce
I the function reduce applies a function to accumulate the elements in a collection
I typical example: summing or multiplying all elements
I reduce lives in the functools library in Python
from functools import reduce

def add(x, y):
    return x + y

print(reduce(add, [1, 2, 3, 4, 5]))
# prints 15

print(reduce(lambda x, y: x + y, [1, 2, 3, 4, 5]))
# prints 15
contrived example using map and reduce
I sum the lengths of some words:
words = ['a', 'few', 'strings']
print(reduce(lambda x, y: x + y, map(len, words)))
# prints 11
less contrived example
I the parallelized word counting program we wrote before can be thought of as mapping and reducing (see the sketch below):
I map: for each chunk, compute a partial frequency table
I reduce: combine all partial tables into a complete table
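I a minimal non-parallel sketch of that view, using the built-in map and functools.reduce; the corpus is assumed here to be pre-split into chunks of lines

from collections import Counter
from functools import reduce

def count_chunk(lines):
    # the "map" step: a partial frequency table for one chunk
    freqs = Counter()
    for line in lines:
        freqs.update(line.split())
    return freqs

# hypothetical pre-split corpus: two chunks of lines
chunks = [['a rose is a rose', 'is a rose'], ['a rose by any other name']]

# the "reduce" step: merge the partial tables into a complete table
total = reduce(lambda t1, t2: t1 + t2, map(count_chunk, chunks))
print(total.most_common(3))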
MapReduce
I MapReduce [Dean and Ghemawat, 2004] is an architecture developed by Google that models large-scale computation tasks in terms of mapping and reducing
I see also this paper for a popular-scientific introduction
I the user defines the map and reduce tasks to be carried out
I MapReduce was designed to take care of many of the complexities in distributed processing:
I large files can be distributed across several machines
I to minimize network traffic, tasks are carried out locally as much as possible: a machine handles the piece it stores
I sometimes computers break down, so the system may need to reprocess tasks that have disappeared
Hadoop
I Hadoop is an open-source implementation of an architecture similar to Google's ideas
I https://hadoop.apache.org/
I its central parts are
I processing part: Hadoop MapReduce
I file system: HDFS (Hadoop Distributed File System)
I . . . but it also has many other components
Spark
I Spark [Zaharia et al., 2012] is a more recent framework that addresses some of the drawbacks of Hadoop
I most importantly, it tries to keep data in memory, rather than in files, which can lead to significant speedups for some tasks
I Spark can be installed not only on a cluster but also on a single machine (standalone mode)
I see http://spark.apache.org/
word counting example in Spark
I the Spark engine is implemented in the Scala language
I a fairly new functional programming language that runs on the Java virtual machine
I however, we can write Spark programs not only in Scala or Java, but also in other languages, including Python
I here's a Python example from the Spark web page:
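I the code itself did not survive in these notes; what follows is the standard word-count example, reconstructed from the step-by-step walkthrough on the next slides (spark is assumed to be a SparkContext, and NAME_OF_FILE a path to a text file)

text_file = spark.textFile(NAME_OF_FILE)
counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)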
intuition of the word counting program
Spark's fundamental data structure: the RDD
I Spark works by processing RDDs: Resilient Distributed Datasets
I Resilient: it recomputes data in case of loss
I Distributed: may be spread out over different machines
I conceptually, an RDD is similar to a Python list (or more precisely, a generator)
I word counting example:
1. RDD with lines
2. RDD with tokens
3. RDD with (token, 1) pairs
4. RDD with (token, count) pairs
transformations of RDDs
I Spark includes many transformations of RDDs
I many of the transformations are well-known higher-order functions in FP
I not just map and reduce!
I check the overview here: http://spark.apache.org/docs/latest/programming-guide.html
I see a complete list of transformations here: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
I let's walk through the steps in the word counting program
step 1: reading a text le as lines
I spark.textFile reads a text file and returns an RDD containing the lines
text_file = spark.textFile(NAME_OF_FILE)
step 2: flatMap; splitting the lines into tokens
I flatMap is a transformation that applies some function to all elements in an RDD
I . . . and then flattens the result: removes lists inside the RDD
I we use this to convert the lines into a new RDD with tokens
step1 = text_file.flatMap(lambda line: line.split())
step 3: map
I map is a transformation that applies some function to all elements in an RDD
I this is simpler than flatMap: no flattening involved
I in our case, we make a new RDD consisting of word-count pairs (but all counts are 1 so far)
step2 = step1.map(lambda word: (word, 1))
step 4: reduceByKey
I reduceByKey is similar to reduce that we explained before, but operates on key-value pairs
I and the aggregation operation is applied to the values, separately for each key
I in our case, we sum all the 1s for each word separately
step3 = step2.reduceByKey(lambda a, b: a + b)
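I the walkthrough stops at the RDD of (word, count) pairs; transformations are lazy, so to actually trigger the computation and get results back to the master program, one would add an action at the end, for instance (not shown on the original slide):

# an action forces the computation; take() returns a few pairs to the
# master program, saveAsTextFile() would write the whole RDD to disk
print(step3.take(5))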
the final VG assignment: using Spark
I do a few small word counting exercises using Spark
I we have installed Spark on the lab machines
I . . . or you may install it on your own
I we don't have a real cluster: you'll have to make believe!
references I
I Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI'04.
I Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.