Hadoop/Pig Install
Download and install VirtualBox (www.virtualbox.org) and the VirtualBox Extension Pack
Download the virtual machine (link in the schedule: https://rmacchpcsymposium2015.sched.org/?iframe=no)
Import the virtual machine in VirtualBox: File > Import Appliance
Hadoop & Pig
Dr. Karina Hauser, Senior Lecturer, Management & Entrepreneurship
Outline
Introduction (Setup)
Hadoop, HDFS and MapReduce
Pig
Introduction What is Hadoop and where did it come from?
Big Data
Big Data Sources
Every day 2.5 quintillion bytes, or 2.5 exabytes (10^18), are generated; that number is estimated to double every 40 months
Astronomy: the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in 2000, 200 GB (10^9 bytes) per night; the Large Synoptic Survey Telescope (LSST) (~2020) is estimated at ~20 TB (10^12 bytes) per night
Business: Twitter, 12 terabytes (10^12 bytes) of Tweets every day; Walmart, 2.5 petabytes (10^15 bytes) of data every hour from its customer transactions
Big Data - Big Business
IDC predicts big data technology and services will grow worldwide from $3.2 billion in 2010 to $16.9 billion in 2015, a compound annual growth rate of 40 percent, about seven times that of the overall information and communications technology market.
(Source: Infoworld Special Report: Big Data Deep Dive)
Big Data in the News
The Economist Intelligence Unit study showed that nine out of 10 surveyed business leaders believe data is now the fourth factor of production, as fundamental to business as land, labor and capital.
(Source: The Economist, Deciding Factor)
What is Big Data? When the data itself becomes part of the problem
Three (to five) dimensions: Volume, Variety, Velocity, (Veracity), (Value)
The Solution
Short History
Created by Doug Cutting, named after his son's toy elephant
2002 - Nutch, search engine, scalability problems
2004 - Google papers on GFS and MapReduce
2006 - Yahoo hires Doug to improve Hadoop
2008 - Hadoop becomes Apache Top Level Project
2013 - Hadoop 2.0
Hadoop Today
Moving from Internet companies (Yahoo, Google) to business and science applications: customer relationship management, bioinformatics, astrophysics
Hadoop Definition Framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model Open-source software, maintained by The Apache Software Foundation http://hadoop.apache.org/
Apache Hadoop Subprojects: HDFS, YARN (version 2), MapReduce, Hadoop Common
Apache Hadoop Related Projects: Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, Spark, Tez, ZooKeeper
Hadoop Cluster Commodity hardware Individual disk space on each node Hadoop framework handles: Data "backups" (through replication) Hardware failure Parallelization of code (through MapReduce paradigm)
HDFS
Write-once, read-many
Each file is stored as a sequence of same-sized blocks (default size 64 MB)
Blocks are replicated across different nodes
Highly reliable: redundant data storage, heartbeat messages to detect connectivity problems, automatic failover
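One way to see the blocks and replicas of a stored file is the fsck tool (a minimal sketch, assuming beowulf.txt has already been copied into HDFS as shown later in the setup):
$ bin/hdfs fsck /rmacc/beowulf.txt -files -blocks -locations
This lists each block of the file together with its size and the data nodes holding the replicas.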
From: http://nosql.mypopescu.com/post/15561851616/hadoop-distributed-file-system-hdfs-a-cartoon-is-worth
Pipelining From: http://nosql.mypopescu.com/post/15561851616/hadoop-distributed-file-system-hdfs-a-cartoon-is-worth
MapReduce
Programming model designed for batch processing of large volumes of data in parallel by dividing the work into a set of independent tasks
Not limited to Hadoop
MapReduce WordCount Example
Problems suited for MapReduce Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate results Aggregate intermediate results Generate final output
Hadoop 2.0
YARN - Yet Another Resource Negotiator
"Operating System" for Hadoop
Responsible for cluster management; improves resource utilization
Allocates resources to different applications
Supports multiple processing modes, not just MapReduce
Security
Hadoop 2.0 Components
Hadoop Components
Name Node: stores metadata (filenames, replication factors, ...); checks data node availability (heartbeat). Hadoop 1: one; Hadoop 2: multiple
Data Node: stores data; replicates blocks; computation
Hadoop Components
Resource Manager (RM): scheduler that allocates available cluster resources amongst the competing applications
Node Manager (NM): takes direction from the RM; manages resources on a single node
Application Master: an instance of a framework-specific library that runs a specific YARN job and is responsible for negotiating resources from the RM; works with the NM to execute and monitor containers (allocated resources on a node)
What makes Hadoop different?
Traditional Clusters vs. Hadoop
Traditional: computationally intensive; moves data to code; shared file storage
Hadoop: data intensive; moves code to data; individual disks
MPI vs. MapReduce
MPI: requires the user to implement data splitting, data management, parallelization, synchronization and fault tolerance in their applications
Hadoop/MapReduce: the framework handles data splitting etc.
Databases vs. HDFS
Databases: schema created before storage; data cleaned before storage
HDFS: data loaded without a schema; schema created during data processing (see the sketch below)
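Schema-on-read in practice, previewed here with Pig Latin (introduced later in this talk): the same HDFS file can be read with no schema at all, or with a schema applied only at processing time (a minimal sketch, using the KDD sample file from the hands-on part):
> raw = LOAD '/rmacc/kddrmacc.txt'; -- no schema: all fields default to bytearray
> typed = LOAD '/rmacc/kddrmacc.txt' AS (key:int, date:chararray, time:chararray, qty:float);
> DESCRIBE typed; -- typed: {key: int, date: chararray, time: chararray, qty: float}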
Setup
Setup
Three options: Standalone (single Java process), Pseudo-Distributed (separate Java processes), Fully-Distributed
Prerequisites (on the virtual machine): Ubuntu Server 14.04 with SSH, Ubuntu Desktop (for monitoring), Oracle (Sun) Java 1.7.0_80-b15
http://www.webupd8.org/2012/01/install-oracle-java-jdk-7-in-ubuntu-via.html
Setup Files
hosts (IP addresses)
.bashrc (Java dir, home dir)
Hadoop configuration files in /usr/local/hadoop/conf:
hadoop-env.sh (Java dir)
core-site.xml (default file system)
hdfs-site.xml (replication factor)
mapred-site.xml (MapReduce framework)
yarn-site.xml (NodeManager aux-services)
masters/slaves (IPs for multi-node cluster)
See the illustrative entries below.
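As an illustration, typical minimal entries for a pseudo-distributed Hadoop 2 setup (the values are common textbook settings, not necessarily those on the virtual machine):
core-site.xml: fs.defaultFS = hdfs://localhost:9000 (default file system)
hdfs-site.xml: dfs.replication = 1 (replication factor)
mapred-site.xml: mapreduce.framework.name = yarn (MapReduce framework)
yarn-site.xml: yarn.nodemanager.aux-services = mapreduce_shuffle (NodeManager aux-services)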
Login
User: rmacc2015 / Password: rmacc2015
Start terminal (Ctrl+Alt+F1)
Login as hduser (User: hduser / Password: hduser)
Change directory to Hadoop: cd $HADOOP_PREFIX or cd /usr/local/hadoop
Format Namenode
All existing data will be deleted!!!
Delete tmp directory: $ rm -R /tmp/hadoop-hduser/dfs/data
Format Namenode: $ bin/hdfs namenode -format
Starting Hadoop Daemons
All:
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
Log output in the logs directory
Check Daemons
jps
1367 NameNode
8695 DataNode
2600 SecondaryNameNode
4786 ResourceManager
5019 NodeManager
4558 Jps
Web Interfaces
http://localhost:8088 - cluster status and jobs
http://localhost:50070 - HDFS
HDFS Hadoop Distributed File System
Ubuntu (Linux) vs. Hadoop (HDFS)
Local file system: /usr/local/hadoop/... (mis6110: Files, Scripts)
HDFS: /user/hduser/... (mis6110: KDD, Movies)
Copy to HDFS: bin/hadoop dfs -copyFromLocal mis6110/files/file.txt mis6110/kdd/file.txt
Copy from HDFS: bin/hadoop dfs -copyToLocal mis6110/movies/file.txt mis6110/files/file.txt
Create directory: mkdir (Linux) / bin/hadoop dfs -mkdir (HDFS)
Show content of directory: ls (Linux) / bin/hadoop dfs -ls (HDFS)
HDFS Shell
bin/hdfs dfsadmin -help : all admin commands
bin/hdfs dfs -help : all commands
Most commands are similar to Unix: dfs -copyFromLocal, dfs -ls
Shell commands: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
Importing Data
KDD Example: Legcare sales data from KDD Cup 2000 ("We wish to thank Blue Martini Software for contributing the KDD Cup 2000 data"), cleaned for easier/faster use
Copy the files KDDCupCleaned.txt and KDDrmacc.txt to HDFS
Copying Files to HDFS
Create new directory: $ bin/hdfs dfs -mkdir /rmacc
Copy files:
$ bin/hdfs dfs -copyFromLocal ../rmacc/beowulf.txt /rmacc/
$ bin/hdfs dfs -copyFromLocal ../rmacc/kdd* /rmacc/
Check: bin/hdfs dfs -ls -R /rmacc or localhost:50070 > Utilities > Browse the file system
MapReduce WordCount Example From: http://www.rabidgremlin.com/data20/#(3)
Java Code for WordCount Example

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
There Must be an Easier Way
Pig (Latin)
Pig High-level data processing language (Pig Latin) Resides on user machine, not cluster Pig Latin compiled into efficient MapReduce jobs
Starting Job History Daemon
Pig relies on the job history daemon (otherwise you will see errors like "0.0.0.0:10020 ... retrying to connect to server")
$ sbin/mr-jobhistory-daemon.sh start historyserver
Check Daemons
jps
1367 NameNode
8695 DataNode
2600 SecondaryNameNode
4786 ResourceManager
5019 NodeManager
7409 JobHistoryServer
4558 Jps
Test Pig Installation (Hadoop has to be running) $ cd $PIG_HOME $ bin/pig -help
How to Run Pig
Grunt interactive shell; two modes:
Local, standalone (pig -x local)
Hadoop, distributed (pig -x mapreduce, or just pig)
Scripts (.pig), as shown below
Embedded in Java or Python
PigPen, Eclipse plugin
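For example, a script saved as wordcount.pig (the file name is just an illustration) can be run in either mode:
$ bin/pig -x local wordcount.pig (local mode, local file system)
$ bin/pig wordcount.pig (MapReduce mode, reads and writes HDFS)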
Pig Statements in Grunt
LOAD
Transform data
DUMP or STORE
Example:
grunt> A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
grunt> B = FOREACH A GENERATE name;
grunt> DUMP B;
A is called a relation or outer bag
Pig LOAD Function
"Pigs eat anything"
LOAD 'data' [USING function] [AS schema];
USING: PigStorage - structured text file (default); TextLoader - unstructured UTF-8 data; other built-in and user-defined functions
AS: (field1[:type], field2[:type], ... fieldN[:type]); bytearray is the default type
See the sketch after this list.
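A short sketch of the USING clause (the tab delimiter for the cleaned KDD file is an assumption; PigStorage defaults to tab anyway):
> a = LOAD '/rmacc/kddcupcleaned.txt' USING PigStorage('\t') AS (key:int, date:chararray);
> b = LOAD '/rmacc/beowulf.txt' USING TextLoader() AS (line:chararray);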
Pig Example: Loading Data
$ bin/pig
> a = LOAD '/rmacc/beowulf.txt';
All statements end with a semicolon!!!
Pig Debugging Statements
DUMP - display results
DESCRIBE - display schema of relation
EXPLAIN - display execution plan
ILLUSTRATE - display step-by-step execution
Full list: http://pig.apache.org/docs/r0.15.0/test.html#diagnostic-ops
DESCRIBE and ILLUSTRATE only work if a schema is provided
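A minimal Grunt sketch, reusing the student relation from the earlier slide:
> A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
> DESCRIBE A; -- A: {name: chararray, age: int, gpa: float}
> EXPLAIN A; -- logical, physical and MapReduce execution plans
> ILLUSTRATE A; -- sample rows traced through each step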
Pig Example: KDD Data
Key | Date | Time | Unit Price | Order LineID | Qty | Order Status | Tax | Amount | Weekday | Hour | City | State | Customer ID
1 | 2/27/2000 | 08\:06\:35 | 15 | 1 | 15 | Shipped | 1.3 | 16.3 | Sunday | 8 | Westport | CT | 62
2 | 3/30/2000 | 10\:00\:18 | 9 | 1 | 9 | Shipped | 0 | 9 | Thursday | 10 | Westport | CT | 62
3 | 1/28/2000 | 14\:43\:34 | 12 | 1 | 12 | Shipped | 1.02 | 13.02 | Friday | 14 | San Francisco | CA | 96
4 | 1/29/2000 | 10\:22\:37 | 12 | 1 | 12 | Shipped | 0.87 | 12.87 | Saturday | 10 | Novato | CA | 132
5 | 2/1/2000 | 08\:44\:48 | 6.5 | 1 | 6.5 | Shipped | 0.55 | 7.05 | Tuesday | 8 | Cupertino | CA | 168
6 | 2/29/2000 | 10\:31\:42 | 15 | 1 | 15 | Shipped | 1.24 | 16.24 | Tuesday | 10 | San Ramon | CA | 184
7 | 2/29/2000 | 10\:31\:42 | 14 | 1 | 14 | Shipped | 1.16 | 15.16 | Tuesday | 10 | San Ramon | CA | 184
8 | 2/29/2000 | 10\:31\:42 | 6.5 | 2 | 6.5 | Shipped | 1.07 | 14.07 | Tuesday | 10 | San Ramon | CA | 184
9 | 3/8/2000 | 16\:48\:47 | 11 | 3 | 11 | Shipped | 2.72 | 35.72 | Wednesday | 16 | San Ramon | CA | 184
10 | 1/30/2000 | 14\:13\:57 | 10 | 1 | 10 | Shipped | 0 | 10 | Sunday | 14 | Scarsdale | NY | 224
11 | 1/30/2000 | 14\:13\:57 | 13.5 | 1 | 13.5 | Shipped | 0 | 13.5 | Sunday | 14 | Scarsdale | NY | 224
12 | 2/26/2000 | 03\:42\:17 | 12.7 | 1 | 12.7 | Shipped | 0 | 12.7 | Saturday | 3 | Novato | CA | 236
13 | 3/30/2000 | 11\:51\:44 | 6.5 | 1 | 4.88 | Shipped | 0 | 4.88 | Thursday | 11 | Novato | CA | 236
14 | 3/30/2000 | 11\:51\:44 | 6.5 | 1 | 4.88 | Shipped | 0 | 4.88 | Thursday | 11 | Novato | CA | 236
15 | 3/30/2000 | 11\:51\:44 | 7 | 1 | 7 | Shipped | 0 | 7 | Thursday | 11 | Novato | CA | 236
16 | 3/30/2000 | 11\:51\:44 | 12 | 1 | 12 | Shipped | 0 | 12 | Thursday | 11 | Novato | CA | 236
Pig Example: Loading Data
> a = LOAD '/rmacc/kddrmacc.txt' AS (key, date, time, qty:float);
Columns can be accessed by position ($0 is the first column, which contains the key) or by name (key, date, time, ...)
Pig Data Types
Simple: int, long, float, double, chararray, bytearray, boolean
Complex:
tuple - a set of fields: (10, 5, alpha)
bag - a collection of tuples: {(10,5,alpha),(8,2,beta)}
map - a set of key-value pairs: [key#value]
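A sketch of complex types in a LOAD schema (file name and contents are hypothetical):
-- complex.txt, one tab-separated line: (3,8) {(john),(mary)} [pet#dog]
> a = LOAD 'complex.txt' AS (t:tuple(x:int, y:int), b:bag{r:tuple(name:chararray)}, m:map[]);
> x = FOREACH a GENERATE t.x, m#'pet'; -- tuple field by name, map value by key
> DUMP x;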
Pig Example: Reduce # of Fields
> a = LOAD '/rmacc/kddcupcleaned.txt' AS (key:int, date, time, up, ol, qty, os, tax, amount:float, wd, hour, city, state, ci);
> b = FOREACH a GENERATE key, date, time, amount;
> STORE b INTO '/rmacc/kddcupshort.txt' USING PigStorage('*');
(STORE always needs a NEW directory)
Pig Example: Group per Date
> a = LOAD '/rmacc/kddrmacc.txt' AS (key, date, time, qty:float);
> groupday = GROUP a BY date;
> ILLUSTRATE groupday;
group is the new key for each bag (day); the bag contains the tuples (with the data)
Pig Example: Sum per Date
> a = LOAD '/rmacc/kddrmacc.txt' AS (key, date, time, qty:float);
> groupday = GROUP a BY date;
> sumday = FOREACH groupday GENERATE group, SUM(a.qty);
> STORE sumday INTO '/rmacc/sumday';
Evaluation Functions (Case Sensitive)
AVG - calculates average
CONCAT - concatenates two expressions of identical type
COUNT - counts the number of elements in a bag
COUNT_STAR - like COUNT but includes NULL values in the count
DIFF - compares two fields in a tuple
IsEmpty - checks if a bag or map is empty
MAX - calculates maximum
MIN - calculates minimum
SIZE - computes the number of elements (characters)
SUM - calculates sum
TOKENIZE - splits a string and outputs a bag of words
Full list with examples: http://pig.apache.org/docs/r0.15.0/func.html#eval-functions
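For example, AVG and COUNT can replace SUM in the grouping example above to give the average quantity and the number of purchases per day (a sketch on the same KDD sample):
> a = LOAD '/rmacc/kddrmacc.txt' AS (key, date, time, qty:float);
> groupday = GROUP a BY date;
> avgday = FOREACH groupday GENERATE group, AVG(a.qty), COUNT(a);
> DUMP avgday;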
Other Functions Math Functions String Functions Datetime Functions Tuple, Bag, Map Functions User Defined Functions
Pig Example: Filter Daily Sums > $1000
> a = LOAD '/rmacc/sumday' AS (date, sum:float);
> bigpur = FILTER a BY sum > 1000;
> DUMP bigpur;
Relational Operators
LOAD - loads data from the file system
GROUP - groups the data in one or more relations
FOREACH - generates data transformations based on columns of data
FILTER - selects tuples from a relation based on some condition
Full list with examples: http://pig.apache.org/docs/r0.15.0/basic.html#relational+operators
Other Operators (see the sketch after this list)
Arithmetic Operators
Boolean Operators
Cast Operators
Comparison Operators
Type Construction Operators
Dereference Operators
Disambiguate Operator
Flatten Operator
Null Operators
Sign Operators
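A small sketch (threshold and labels are hypothetical) combining cast, comparison, boolean and null operators with the bincond (?:) operator:
> a = LOAD '/rmacc/sumday' AS (date, sum:float);
> b = FOREACH a GENERATE date, (int)sum AS whole, (sum > 1000.0 ? 'big' : 'small') AS label;
> c = FILTER b BY (label == 'big') AND (date IS NOT NULL);
> DUMP c;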
Pig WordCount Code
> b = LOAD '/rmacc/beowulf.txt' AS (beo:chararray);
> beowords = FOREACH b GENERATE FLATTEN(TOKENIZE(beo)) AS bw;
> wg = GROUP beowords BY bw;
> wc = FOREACH wg GENERATE group, COUNT(beowords) AS bc;
> sumord = ORDER wc BY bc;
> STORE sumord INTO '/rmacc/beowulfwc';
Hadoop Summary
Accessible - runs on commodity hardware
Robust - handles hardware failures
Scalable - by adding more nodes
Simple - allows users to quickly write efficient parallel code
Pig - easy to learn
Conclusions
Individual components are easy to set up; integration is more complicated
Resources:
Apache Hadoop (download and documentation): http://hadoop.apache.org
Online searches
Books: good for an overview, not technical details
Evolving project: constantly changing; documentation can't keep up with development
Questions?