Learning Hadoop 2
Design and implement data processing, lifecycle management, and analytic workflows with the cutting-edge toolbox of Hadoop 2
Garry Turkington Gabriele Modena
BIRMINGHAM - MUMBAI
Learning Hadoop 2
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2015
Production reference: 1060215
Published by Packt Publishing Ltd.
Livery Place 35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-551-8
www.packtpub.com
Credits
Authors
Garry Turkington, Gabriele Modena
Reviewers
Atdhe Buja, Amit Gurdasani, Jakob Homan, James Lampton, Davide Setti, Valerie Parham-Thompson
Commissioning Editor
Edward Gordon
Acquisition Editor
Joanne Fitzpatrick
Content Development Editor
Vaibhav Pawar
Technical Editors
Indrajit A. Das, Menza Mathew
Copy Editors
Roshni Banerjee, Sarang Chari, Pranjali Chury
Project Coordinator
Kranti Berde
Proofreaders
Simran Bhogal, Martin Diver, Lawrence A. Herman, Paul Hindle
Indexer
Hemangini Bari
Graphics
Abhinash Sahu
Production Coordinator
Nitesh Thakur
Cover Work
Nitesh Thakur
About the Authors
Garry Turkington has over 15 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems.
In his current role as the CTO at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA.
He has BSc and PhD degrees in Computer Science from Queen's University Belfast in Northern Ireland, and a Master's degree in Engineering in Systems Engineering from Stevens Institute of Technology in the USA. He is the author of Hadoop Beginner's Guide, published by Packt Publishing in 2013, and is a committer on the Apache Samza project.
I would like to thank my wife Lea and mother Sarah for their support and patience through the writing of another book and my daughter Maya for frequently cheering me up and asking me hard questions. I would also like to thank Gabriele for being such an amazing co-author on this project.
Gabriele Modena is a data scientist at Improve Digital. In his current position, he uses Hadoop to manage, process, and analyze behavioral and machine-generated data. Gabriele enjoys using statistical and computational methods to look for patterns in large amounts of data. Prior to his current job in ad tech, he held a number of positions in academia and industry, where he did research in machine learning and artificial intelligence.
He holds a BSc degree in Computer Science from the University of Trento, Italy and a Research MSc degree in Artificial Intelligence: Learning Systems, from the University of Amsterdam in the Netherlands.
First and foremost, I want to thank Laura for her support, constant encouragement and endless patience putting up with far too many "can't do, I'm working on the Hadoop book". She is my rock and I dedicate this book to her.
A special thank you goes to Amit, Atdhe, Davide, Jakob, James and Valerie, whose invaluable feedback and commentary made this work possible.
Finally, I'd like to thank my co-author, Garry, for bringing me on
board with this project; it has been a pleasure working together.
About the Reviewers
Atdhe Buja is a certified ethical hacker, DBA (MCITP, OCA11g), and developer with good management skills. He is a DBA at the Agency for Information Society / Ministry of Public Administration, where he also manages some projects of e-governance and has more than 10 years' experience working on SQL Server.
Atdhe is a regular columnist for UBT News. Currently, he holds an MSc degree in computer science and engineering and has a bachelor's degree in management and information. He specializes in and is certified in many technologies, such as SQL Server (all versions), Oracle 11g, CEH, Windows Server, MS Project, SCOM 2012 R2, BizTalk, and integration business processes.
He was the reviewer of the book, Microsoft SQL Server 2012 with Hadoop, published by Packt Publishing. His capabilities go beyond the aforementioned knowledge!
I thank Donika and my family for all the encouragement and support.
Amit Gurdasani is a software engineer at Amazon. He architects distributed systems to process product catalogue data. Prior to building high-throughput systems at Amazon, he worked across the entire software stack, both as a systems-level developer at Ericsson and IBM and as an application developer at Manhattan Associates. He maintains a strong interest in bulk data processing, data streaming, and service-oriented software architectures.
Jakob Homan has been involved with big data and the Apache Hadoop ecosystem for more than 5 years. He is a Hadoop committer as well as a committer for the Apache Giraph, Spark, Kafka, and Tajo projects, and is a PMC member. He has worked in bringing all these systems to scale at Yahoo! and LinkedIn.
James Lampton is a seasoned practitioner of all things data (big or small) with 10 years of hands-on experience in building and using large-scale data storage and processing platforms. He is a believer in holistic approaches to solving problems using the right tool for the right job. His favorite tools include Python, Java, Hadoop, Pig, Storm, and SQL (which he sometimes likes and sometimes doesn't). He recently completed his PhD at the University of Maryland with the release of Pig Squeal: a mechanism for running Pig scripts on Storm.
I would like to thank my spouse, Andrea, and my son, Henry, for giving me time to read work-related things at home. I would also like to thank Garry, Gabriele, and the folks at Packt Publishing for the opportunity to review this manuscript and for their patience and understanding, as my free time was consumed when writing my dissertation.
Davide Setti, after graduating in physics from the University of Trento, joined the SoNet research unit at the Fondazione Bruno Kessler in Trento, where he applied large-scale data analysis techniques to understand people's behaviors in social networks and large collaborative projects such as Wikipedia.
In 2010, Davide moved to Fondazione, where he led the development of data analytic tools to support research on civic media, citizen journalism, and digital media.
In 2013, Davide became the CTO of SpazioDati, where he leads the development of tools to perform semantic analysis of massive amounts of data in the business information sector.
When not solving hard problems, Davide enjoys taking care of his family vineyard and playing with his two children.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.