Introduction to MapReduce, Hadoop, & Spark Jonathan Carroll-Nellenback Center for Integrated Research Computing
Big Data Outline
- Analytics
- Map Reduce Programming Model
- Hadoop Ecosystem: HDFS, Pig, Hive, Oozie, Sqoop, HBase, YARN
- Spark
Big Data
- Volume: too big for typical database software tools to capture, store, manage, and analyze (TB, PB, EB)
- Velocity: real-time streaming data
- Variety: structured and unstructured (documents, graphs, videos, audio, etc.)
- Veracity: how trustworthy is the data?
- Value: how useful is the data?
Big Data Analytics
- Descriptive analytics: what happened
- Predictive analytics: what will happen
- Prescriptive analytics: what to do going forward
- Social media analytics: mine social media data to discover sentiment, behavior patterns, and individual preferences
- Entity analytics: analyze common types of events, people, things, transactions, and relationships
- Cognitive computing: simulation of human thought processes via natural language processing, pattern recognition, and data mining (e.g. IBM Watson)
Map Reduce
Map Reduce Programming Model (Google):
map(K1, V1) => List[K2, V2]
reduce(K2, List[V2]) => List[K3, V3]
Word Count
Map Reduce Programming Model:
map(K1, V1) => List[K2, V2]
reduce(K2, List[V2]) => List[K3, V3]
Count occurrences of words in a document:
- K1 = line number, V1 = text
- Map returns a list of words, each paired with the number 1: K2 = word, V2 = 1
- Reduce adds up the values for each key: K3 = word, V3 = number of occurrences

Input (K1, V1):
1 It was the best of times,
2 It was the worst of times,
3 it was the age of wisdom,
4 it was the age of foolishness,
5 it was the epoch of belief,
6 it was the epoch of incredulity,
7 it was the season of Light,
8 it was the season of Darkness,

After map (K2, V2):
It 1, was 1, the 1, best 1, of 1, times 1, It 1, was 1, ...

After reduce (K3, V3):
It 10, was 12, the 100, best 10, of 50, times 15, age 4, wisdom 3, ...
[Figure omitted. Image credit: Broomfield Technology Consultants]
WordCount.ipynb
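A minimal PySpark sketch of this word-count pattern (not necessarily the notebook's exact code), assuming a SparkContext sc is already available, as in the course notebooks, along with a local copy of taleoftwocities.txt:

lines = sc.textFile("taleoftwocities.txt")        # K1 = line, V1 = text (implicit)
counts = (lines
          .flatMap(lambda line: line.split())     # map: emit each word
          .map(lambda word: (word, 1))            # (K2 = word, V2 = 1)
          .reduceByKey(lambda a, b: a + b))       # reduce: sum the 1s per word
print(counts.take(10))                            # e.g. [('It', 10), ('was', 12), ...]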
Generate Index
Map Reduce Programming Model:
map(K1, V1) => List[K2, V2]
reduce(K2, List[V2]) => List[K3, V3]
List all of the lines that a word appears on:
- K1 = line number, V1 = text
- Map returns a list of words, each paired with its line number: K2 = word, V2 = line #
- Reduce concatenates the values for each key: K3 = word, V3 = list of line #s

Input (K1, V1):
1 It was the best of times,
2 It was the worst of times,
3 it was the age of wisdom,
4 it was the age of foolishness,
5 it was the epoch of belief,
6 it was the epoch of incredulity,
7 it was the season of Light,
8 it was the season of Darkness,

After map (K2, V2):
It 1, was 1, the 1, best 1, of 1, times 1, It 2, was 2, ...

After reduce (K3, V3):
It 1,2,3,...; was 1,2,3,...; the 1,2,3,...; best 1; of 1,2,3,...; times 1,2; age 3,4; wisdom 3
Index.ipynb
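A sketch of the index pattern (again assuming sc); zipWithIndex supplies the line number (K1), which textFile does not expose directly:

lines = sc.textFile("taleoftwocities.txt").zipWithIndex()           # (text, 0-based line #)
index = (lines
         .flatMap(lambda p: [(w, p[1] + 1) for w in p[0].split()])  # (K2 = word, V2 = line #)
         .groupByKey()                                              # gather all line #s per word
         .mapValues(lambda nums: sorted(set(nums))))                # (K3 = word, V3 = list of line #s)
print(index.take(5))

groupByKey is used rather than reduceByKey because the complete list of line numbers for each word has to be collected in one place.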
Join Tables
Map Reduce Programming Model:
map(K1, V1) => List[K2, V2]
reduce(K2, List[V2]) => List[K3, V3]
Join two tables using a key:
- K1 = table id, V1 = row of data
- Map returns the value to join on as the key, with the row and table id as the value: K2 = join value, V2 = (table id, row)
- Reduce does an outer/inner product on each set of rows from different table ids: K3 = none, V3 = product of each set of rows

Input tables:
first: 1 George, 2 John, 3 Thomas
last: 1 Washington, 2 Adams, 3 Jefferson

Input (K1, V1):
first  1 George
first  2 John
first  3 Thomas
last   1 Washington
last   2 Adams
last   3 Jefferson

After map (K2, V2):
1  first George
2  first John
3  first Thomas
1  last Washington
2  last Adams
3  last Jefferson

After reduce (K3, V3):
1  George Washington
2  John Adams
3  Thomas Jefferson
Join.ipynb
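A sketch of the same join in PySpark, assuming sc. Rows are tagged with their table id, re-keyed on the join value, and split back apart so that Spark's built-in join can perform the per-key inner product the reduce step describes:

rows = sc.parallelize([
    ("first", (1, "George")), ("first", (2, "John")), ("first", (3, "Thomas")),
    ("last",  (1, "Washington")), ("last", (2, "Adams")), ("last", (3, "Jefferson")),
])                                                       # (K1 = table id, V1 = row)
keyed  = rows.map(lambda r: (r[1][0], (r[0], r[1][1])))  # (K2 = join key, V2 = (table id, value))
firsts = keyed.filter(lambda p: p[1][0] == "first").mapValues(lambda v: v[1])
lasts  = keyed.filter(lambda p: p[1][0] == "last").mapValues(lambda v: v[1])
print(firsts.join(lasts).collect())
# [(1, ('George', 'Washington')), (2, ('John', 'Adams')), (3, ('Thomas', 'Jefferson'))], in some order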
Join Tables
Collecting the data for each join key within a single partition allows for any type of join (left, right, inner, outer). The tables don't have to have the same number of columns.

Input (K1, V1):
first  1 George 1789
first  2 John 1797
first  3 Thomas 1801
last   1 Washington
last   2 Adams
last   3 Jefferson

After map (K2, V2):
1  first George 1789
2  first John 1797
3  first Thomas 1801
1  last Washington
2  last Adams
3  last Jefferson

After reduce (K3, V3):
1  George Washington 1789
2  John Adams 1797
3  Thomas Jefferson 1801
Map Reduce
Map Reduce Programming Model:
map(K1, V1) => List[K2, V2]
reduce(K2, List[V2]) => List[K3, V3]
Group by key / shuffle:
- The initial key-value pairs are partitioned across multiple nodes, so mapping can occur locally.
- Grouping by key involves shuffling data between nodes prior to the reduce.
- If the reduction operator is associative and commutative, there is no need to group all of the data for each key K2 on the same partition: reduction can happen within each partition first and then across partitions (see the sketch below).
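A sketch of that distinction in PySpark, assuming sc: reduceByKey can combine values within each partition before the shuffle (addition is associative and commutative), while groupByKey must move every key-value pair across the network first:

pairs   = sc.parallelize([("a", 1), ("b", 1), ("a", 1)] * 1000)
summed  = pairs.reduceByKey(lambda a, b: a + b)   # partial sums per partition, merged after the shuffle
grouped = pairs.groupByKey().mapValues(sum)       # full shuffle first, then summed
assert sorted(summed.collect()) == sorted(grouped.collect())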
Hadoop Ecosystem
Hadoop Framework:
- Hadoop MapReduce
- HDFS (Hadoop Distributed File System): fault tolerant; designed for batch processing of large datasets; simple coherency model (file content cannot be updated, just appended)
- Hadoop YARN: resource management and job scheduling
Additional packages:
- Pig: procedural language (Pig Latin) to express map and reduce processes
- Hive: provides a SQL interface on top of MapReduce
- HBase: column-oriented key-value store on top of HDFS
- Sqoop: efficiently moves data from relational databases into HDFS
Apache Spark
- Runs in memory
- Main components:
  - Spark Core: MapReduce
  - Spark SQL: database
  - Spark Streaming: real-time analysis of streaming data
  - Spark MLlib: distributed machine learning library
  - Spark GraphX: distributed graph processing framework
- Procedural-style interface similar to Pig Latin; libraries available for Python, R, Scala, and Java
- Supports lazy evaluation: map operations (transformations) only occur when needed by a reduction; reductions (actions) are scheduled to run within a MapReduce cluster; intermediate results can be cached (see the sketch below)
- Comes with a standalone native Spark cluster, or can interface with YARN
- Can connect to HDFS, GPFS, Cassandra, Amazon S3, etc.
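A sketch of lazy evaluation and caching, assuming sc; the file name here is hypothetical:

data   = sc.textFile("events.log")            # no work happens yet
errors = data.filter(lambda l: "ERROR" in l)  # still no work: only the lineage is recorded
errors.cache()                                # mark for reuse once computed
n      = errors.count()                       # action: the whole pipeline runs now
first  = errors.first()                       # served from the cached result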
Jupyter
A web application that allows for the creation and sharing of documents that contain live code, equations, visualizations, and explanatory text. Supports Julia, Python, and R.
Jupyter + Python + Spark = easy interactive big-data analytics
Spark+Jupyter on BlueHive
Install X2Go following the instructions for your OS under the Getting Started Guide at https://info.circ.rochester.edu:
Getting Started → Connecting using <OS> → Connecting using a Graphical User Interface
Spark+Jupyter on BlueHive
Open a terminal on the head node (under Applications → System Tools).
Copy over the directory containing the sample files:
cp -r /public/jcarrol5/csc461 .
cd csc461
Run the command init_jupyter.sh, which will:
1. create a self-signed key for the Jupyter web app
2. prompt you for a password to secure your Jupyter notebook server
3. update your Jupyter notebook configuration file
Spark+Jupyter on BlueHive
Run the command interactive -p debug -t 60 --exclusive. This requests an interactive session on a compute node in the debug partition for 60 minutes.
Run the script start-spark-jupyter. This starts a standalone Spark cluster within your job's allocation and launches your Jupyter notebook initialized with the Spark context (sc) and Spark SQL context (sqlContext).
Note the URL of the Spark master (http://bhc0001:8080) and the Jupyter notebook (https://bhc0001:8888).
Open the Firefox web browser (under Applications → Internet) and connect to the Jupyter notebook (and the Spark master).
Spark Data Structures
Spark Core:
- RDD (Resilient Distributed Dataset): each element is just a value (e.g. a string)
- MapRDD (pair RDD): each element is a key-value pair, though the value can be anything
- RowRDD: each element is a Row object with some schema
Spark SQL:
- DataFrame: column-oriented distributed dataset (Hive); supports SQL queries as well as procedural operations (join, select, ...)
A sketch of each structure appears below.
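A minimal sketch, assuming the sc and sqlContext objects that the start-spark-jupyter script provides:

from pyspark.sql import Row
rdd     = sc.parallelize(["George", "John", "Thomas"])  # RDD: each element is just a value
pairRdd = rdd.map(lambda name: (name[0], name))         # MapRDD: key-value pairs
rowRdd  = rdd.map(lambda name: Row(first=name))         # RowRDD: Rows with a schema
df      = sqlContext.createDataFrame(rowRdd)            # DataFrame: column-oriented
df.show()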
Wegmans Data
The Wegmans.ipynb notebook contains example code that:
- loads data from text files into an RDD of strings
- maps that onto an RDD of Rows
- transforms the RowRDDs into DataFrames
- performs SQL queries as well as procedural queries
- performs a market basket analysis to find sets of frequent items
A sketch of the same flow appears below.
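A sketch of that flow with a hypothetical tab-separated file and schema (the real Wegmans file layout is not reproduced here), assuming sc and sqlContext:

from pyspark.sql import Row
lines = ssc_lines = sc.textFile("transactions.txt")              # RDD of strings
rows  = lines.map(lambda l: l.split("\t")) \
             .map(lambda f: Row(item=f[0], price=float(f[1])))   # RDD of Rows
df = sqlContext.createDataFrame(rows)                            # DataFrame
df.registerTempTable("sales")
sqlContext.sql("SELECT item, SUM(price) AS gross FROM sales "
               "GROUP BY item ORDER BY gross DESC LIMIT 20").show()  # SQL query
df.groupBy("item").sum("price").show()                           # procedural equivalent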
Spark Modules
- Core: RDD
- SQL: DataFrames
- Streaming: supports a continuous inflow of data
- MLlib: machine learning library
  - statistics
  - classification/regression
  - collaborative filtering (ALS)
  - clustering (e.g. k-means)
  - dimensionality reduction (SVD/PCA)
  - frequent pattern mining (FP-growth; see the sketch below)
- GraphX: supports operations on graphs (mapTriplets, groupEdges, pageRank, connectedComponents, ...)
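A sketch of frequent-pattern mining with MLlib's FP-growth, one way to do the market basket analysis described earlier; the baskets here are made up for illustration, and sc is assumed:

from pyspark.mllib.fpm import FPGrowth
baskets = sc.parallelize([
    ["bread", "milk"],
    ["bread", "milk", "soda"],
    ["milk", "soda"],
])
model = FPGrowth.train(baskets, minSupport=0.5, numPartitions=2)
for itemset in model.freqItemsets().collect():
    print(itemset.items, itemset.freq)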
Homework
1. Modify the word count example to count letters instead of words in the taleoftwocities.txt file.
2. Modify the Wegmans.ipynb Jupyter notebook to calculate the top 20 grossing products sold by Wegmans.
3. Calculate the average number of products in each transaction.
4. Repeat the basket analysis using item category names instead of item names. Determine what items are frequently bought with CARBONATED SODA POP.