Assignment 4: Pig, Hive and Spark
Jean-Pierre Lozi
March 15, 2015

Provided files

An archive that contains all the files you will need for this assignment can be found at the following URL:

Download it and extract it (using tar -xvzf assignment4.tar.gz, for instance).

1 Pig

Apache Pig makes it possible to write complex queries over big datasets using a language named Pig Latin. The queries are transparently compiled into MapReduce jobs and run with Hadoop. While doing this exercise (the first questions of which are based on Philippe Rigaux's Travaux Pratiques from his course on Distributed Computing), you are advised to read Chapter 11 ("Pig") from the book Hadoop: The Definitive Guide. The book is available in SFU's online library. You will need more information than what is contained in that single chapter, though, so don't hesitate to look for other resources online. In particular, you are invited to check the official documentation. For questions in which a sample output is provided, make sure your output matches it exactly.

You will find two files named movies_en.json and artists_en.json in assignment4.tar.gz. They contain a small movie database in the JSON format. Here is an example record from movies_en.json (both files actually have one record per line; newlines are added here for readability):

{
  "id": "movie:14",
  "title": "Se7en",
  "year": 1995,
  "genre": "Crime",
  "summary": "Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his modus operandi.",
  "country": "USA",
  "director": {
    "id": "artist:31",
    "last_name": "Fincher",
    "first_name": "David",
    "year_of_birth": "1962"
  },
  "actors": [
    { "id": "artist:18", "role": "Doe" },
    { "id": "artist:22", "role": "Somerset" },
    { "id": "artist:32", "role": "Mills" }
  ]
}

And here is an example record from artists_en.json:

{ "id": "artist:18", "last_name": "Spacey", "first_name": "Kevin", "year_of_birth": "1959" }

As you can see, movies_en.json contains the full names and years of birth of movie directors, but only the identifiers of actors. The full names and years of birth of all artists, as well as their identifiers, are listed in artists_en.json.
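Independently of Pig, the nesting of a movie record can be inspected with plain Python. The snippet below is only meant to illustrate the structure, using a shortened version of the example record above:

import json

# A shortened version of the movie:14 record shown above.
line = ('{"id": "movie:14", "title": "Se7en", "year": 1995, "country": "USA", '
        '"director": {"id": "artist:31", "last_name": "Fincher"}, '
        '"actors": [{"id": "artist:18", "role": "Doe"}, {"id": "artist:22", "role": "Somerset"}]}')
record = json.loads(line)
print record["title"], record["director"]["last_name"]    # Se7en Fincher
print [actor["id"] for actor in record["actors"]]         # ['artist:18', 'artist:22']

The Pig schemas used in Question 1 mirror this nesting: the director is a single tuple, while the actors form a bag of tuples.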
2 } { "id": "artist:18", "role": "Doe" }, { "id": "artist:22", "role": "Somerset" }, { "id": "artist:32", "role": "Mills" } ] And here is an example record from artists_en.json: { "id": "artist:18", "last_name": "Spacey", "first_name": "Kevin", "year_of_birth": "1959" } As you can see, movies_en.json contains the full names and years of birth of movie directors, but only the identifiers of actors. The full names and years of birth of all artists, as well as their identifier, are listed in artists_en.json. Question 1 Connect to the Hadoop cluster and copy the two files movies_en.json and artists_en.json to the HDFS. Run Grunt, Pig s interactive shell, using: $ pig We want to load the two files into two relations named movies and artists, respectively. How can we load a JSON file into a relation? Start with the easier case: artists_en.json. We want a relation with four fields: id, firstname, lastname, and yearofbirth. All fields will be of type chararray. Use DESCRIBE and DUMP to ensure your relation was created properly. When it is, load movies_en.json into a relation with the following fields (all fields except director and actors are of type chararray): id, title, year, genre, summary, country, director: (id, lastname, firstname, yearofbirth), actors: {(id, role)} Question 2 We want to get rid of the descriptions of the movies to simplify the output of the next questions. How can we modify the movies relation to remove descriptions? 2
Question 3

Create a relation named mus_year that groups the titles of American movies by year. DUMP it to ensure it's correct. Your results should look like the following:

(1988,{(Rain Man),(Die Hard)})
(1990,{(The Godfather: Part III),(Die Hard 2),(The Silence of the Lambs),(King of New York)})
(1992,{(Unforgiven),(Bad Lieutenant),(Reservoir Dogs)})
(1994,{(Pulp Fiction)})

Question 4

Create a relation named mus_director that groups the titles of American movies by director. Your results should look like this:

((artist:1,coppola,sofia,1971),{(marie Antoinette)})
((artist:3,hitchcock,alfred,1899),{(psycho),(north by Northwest),(Rear Window),(Marnie),(The Birds),(vertigo)})
((artist:4,scott,ridley,1937),{(alien),(gladiator),(blade Runner)})
((artist:6,cameron,james,1954),{(the Terminator),(Titanic)})

Question 5

Create a relation named mus_actors that contains (movieid, actorid, role) tuples. Each movie will appear in as many tuples as there are actors listed for that movie. Again, we only want to take American movies into account. Your results should look like the following:

(movie:1,artist:15,john Ferguson)
(movie:1,artist:16,madeleine Elster)
(movie:2,artist:5,ripley)
(movie:3,artist:109,rose DeWitt Bukater)

Question 6

Create a relation named mus_actors_full that associates the identifier of each American movie with the full description of each of its actors. Your results should look like the following:

(movie:67,artist:2,marie Antoinette,Dunst,Kirsten,)
(movie:2,artist:5,ripley,weaver,sigourney,1949)
(movie:5,artist:11,sean Archer/Castor Troy,Travolta,John,1954)
(movie:17,artist:11,vincent Vega,Travolta,John,1954)

Question 7

Create a relation named mus_full that associates the full description of each American movie with the full description of all of its actors. First, do it using JOIN, which will work but will repeat all information about a movie for each of its actors:

(movie:1,
 {(movie:1,artist:16,madeleine Elster,Novak,Kim,1925,
   movie:1,vertigo,1958,drama,usa,(artist:3,hitchcock,alfred,1899)),
  (movie:1,artist:15,john Ferguson,Stewart,James,1908,
   movie:1,vertigo,1958,drama,usa,(artist:3,hitchcock,alfred,1899))})

Do it again using COGROUP. You should obtain cleaner results:

(movie:1,
 {(movie:1,vertigo,1958,drama,usa,(artist:3,hitchcock,alfred,1899))},
 {(movie:1,artist:15,john Ferguson,Stewart,James,1908),
  (movie:1,artist:16,madeleine Elster,Novak,Kim,1925)})

For both versions, clean up the results even more by not needlessly repeating the movie identifier.

Question 8

Create a relation named mus_artists_as_actors_directors that lists, for each artist, the American movies that they directed and in which they played. The full name of the artist as well as their year of birth should be made available. You should get, for instance:

(artist:20,eastwood,clint,1930,artist:20,
 {(movie:8,unforgiven,artist:20,william Munny),
  (movie:63,million Dollar Baby,artist:20,Frankie Dunn),
  (movie:26,absolute Power,artist:20,Luther Whitney)},
 {(movie:26,absolute Power,artist:20),
  (movie:63,million Dollar Baby,artist:20),
  (movie:8,unforgiven,artist:20)})

Question 9 (Filter UDF)

We now want to list the title and director of all movies for which the director's first name and last name start with the same letter. To do so, you will need to write a Filter User-Defined Function (Filter UDF). Create a project in Eclipse with one class for your Filter UDF. What JAR(s) will you have to add? Which version? Where do you get them? What will you put in your build.xml file to automatically create and upload a JAR to the Hadoop cluster? How do you make it so that Pig finds your JAR? How do you refer to your Filter UDF to use it? You should get the following output:

(Jaws,(artist:45,Spielberg,Steven,1946))
(The Lost World: Jurassic Park,(artist:45,Spielberg,Steven,1946))
(Sound and Fury,(artist:138,Chabrol,Claude,1930))
Question 10 (Eval UDF)

We now want to list the title and the imdb.com URL of the three movies found in the previous question. To do so, you will write an Eval User-Defined Function (Eval UDF) that queries IMDB's search function for a movie. For the movie The Matrix Reloaded, for instance, it will access the following webpage:

Warning: make sure that you add a call to Thread.sleep(500) after you access this URL to ensure that you don't query the IMDB servers too often, which could result in rcg-hadoop's IP getting blacklisted!

IMDB movie identifiers start with "tt", followed by 7 digits. You can look for the first occurrence of such a string in the page (using a regular expression, for instance) and consider that it is the movie identifier of the first result (a short Python sketch of this extraction step is given after Question 11). You can then construct the URL of the movie by using this identifier. For instance, for The Matrix Reloaded, the corresponding IMDB URL will be:

Your objective will be to obtain the following output:

(Jaws,
(The Lost World: Jurassic Park,
(Sound and Fury,

Question 11 (Load UDF)

The file artists_en.txt from the assignment4.tar.gz archive contains the same information as the artists_en.json file, in a different format:

id=artist:1 last_name=coppola first_name=sofia year_of_birth=
id=artist:2 last_name=dunst first_name=kirsten year_of_birth=null
--
id=artist:3 last_name=hitchcock first_name=alfred year_of_birth=

Write a Load UDF that loads the file into a new relation. Make sure that this relation contains the same data as artists (you can just DUMP the two relations and compare the results manually; there is no need to write a program that compares the relations).
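The identifier-extraction step of Question 10 (finding the first "tt" followed by 7 digits) can be sketched in plain Python; the actual UDF is written in Java, and the page snippet and the tt1234567 identifier below are made up for illustration:

import re

# Pretend search-result HTML; a real page would come from IMDB's search function.
page = '<html>... <a href="/title/tt1234567/">Some Movie</a> ...</html>'
match = re.search(r"tt\d{7}", page)
if match:
    # The movie URL can then be built from this identifier.
    print "Found movie identifier: %s" % match.group(0)   # tt1234567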
Question 12 (Store UDF)

We would like to export our tables in the XML format. Could Pig easily produce a well-formed XML file? What is the issue? (You are not asked to write a Store UDF, just answer the question ;-))

Question 13 (Aggregate/Algebraic UDF)

We now want to find the description of the movie with the longest description. To this end, write an Algebraic UDF that finds the longest chararray in a bag of chararrays. Since we got rid of the movie descriptions in Question 2, you may need to reload the movies relation. The objective is to obtain the following result:

(The Birds, A wealthy San Francisco socialite pursues a potential boyfriend to a small Northern California town that slowly takes a turn for the bizarre when birds of all kinds suddenly begin to attack people there in increasing numbers and with increasing viciousness.)

Bonus points if you use an Accumulator!

2 Hive

Like Pig, Apache Hive makes it possible to write complex queries over big datasets, but it has the advantage of using HiveQL, a language that is very easy to learn for people who are used to working with databases, since it is very similar to SQL. While doing this exercise, you are advised to read Chapter 12 ("Hive") from the book Hadoop: The Definitive Guide. The book is available in SFU's online library. Again, you are encouraged to look up resources online and to use your Google-fu to figure out how to solve questions.

Question 1

Similarly to what we did in the previous exercise, we want to create two tables, named artists and movies, that contain the data from the files artists_en.json and movies_en.json, respectively. Unfortunately, Hive doesn't support JSON natively. Using the following Serializer/Deserializer (SerDe), create two tables with structures similar to the Pig relations from the first exercise, and populate them with the data from the JSON files:

Hint: you can keep the column names from the JSON files if it makes things easier.

Question 2

Take questions 3 to 11 from Exercise 1 and solve them again, using Hive instead of Pig this time. You may have to adapt things a bit, but always try to make your output as close as possible to what you got with Pig. If you don't manage to solve some of the questions with Hive, try to explain why. Good luck!

3 Spark

In this exercise (based on code written by Justin Fuston), we will experiment with Apache Spark. According to Wikipedia:
"Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms."

Have a look at the official documentation and examples. Again, you're encouraged to use Google!

We will use the WDC tick data. You will find the 1.1GB WDC_tickbidask.txt file in /cs/bigdata/datasets. You will also find a 4.6MB file named WDC_tickbidask_short.txt that only contains data for the first few days; working with this file is much faster. Both files contain comma-separated values:

09/28/2009,09:10:37,35.6,35.29,35.75,150
09/28/2009,09:30:06,35.48,35.34,35.64,
09/28/2009,09:30:06,35.49,35.37,35.61,100
09/28/2009,09:30:10,35.4,35.4,35.57,100
09/28/2009,09:30:10,35.48,35.43,35.53,200

According to the description of the data:

The order of fields in the tick files is: Date,Time,Price,Bid,Ask,Size. Bid and Ask values are omitted in regular tick files and are aggregated from multiple exchanges and ECNs. For each trade, current best bid/ask values are recorded together with the transaction price and volume. Trade records are not aggregated and all transactions are included.

In assignment4.tar.gz, you will find a file named stocktick.py that contains the declaration of a Python class named StockTick that will be used to represent a record.

Contents of stocktick.py:

class StockTick:
    def __init__(self, text_line=None, date="", time="", price=0.0, bid=0.0, ask=0.0, units=0):
        if text_line != None:
            tokens = text_line.split(",")
            self.date = tokens[0]
            self.time = tokens[1]
            try:
                self.price = float(tokens[2])
                self.bid = float(tokens[3])
                self.ask = float(tokens[4])
                self.units = int(tokens[5])
            except:
                self.price = 0.0
                self.bid = 0.0
                self.ask = 0.0
                self.units = 0
        else:
            self.date = date
            self.time = time
            self.price = price
            self.bid = bid
            self.ask = ask
            self.units = units
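As a quick sanity check (this is not part of the provided files, and it assumes stocktick.py is in the current directory), the class can be exercised on one of the sample lines above:

from stocktick import StockTick

# Parse one of the sample CSV lines shown above.
tick = StockTick("09/28/2009,09:10:37,35.6,35.29,35.75,150")
print tick.date, tick.time, tick.price, tick.bid, tick.ask, tick.units
# Malformed numeric fields fall back to zeros because of the except branch:
bad = StockTick("09/28/2009,09:30:06,not_a_price")
print bad.price, bad.units   # 0.0 0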
Question

You will also find a file named stock.py that you are asked to complete. The objective is to compute various statistics on the tick file. Read all the comments in the file carefully.

Contents of stock.py:

import sys
from stocktick import StockTick
from pyspark import SparkContext

def maxValuesReduce(a, b):
    ### TODO: Return a StockTick object with the maximum value between a and b for each one of its
    ### four fields (price, bid, ask, units)

def minValuesReduce(a, b):
    ### TODO: Return a StockTick object with the minimum value between a and b for each one of its
    ### four fields (price, bid, ask, units)

def generateSpreadsDailyKeys(tick):
    ### TODO: Write Me (see below)

def generateSpreadsHourlyKeys(tick):
    ### TODO: Write Me (see below)

def spreadsSumReduce(a, b):
    ### TODO: Write Me (see below)

if __name__ == "__main__":
    """
    Usage: stock
    """
    sc = SparkContext(appName="StockTick")

    # rawTickData is a Resilient Distributed Dataset (RDD)
    rawTickData = sc.textFile("/cs/bigdata/datasets/WDC_tickbidask.txt")
    tickData = ### TODO: use map to convert each line into a StockTick object
    goodTicks = ### TODO: use filter to only keep records for which all fields are > 0
    ### TODO: store goodTicks in the in-memory cache
    numTicks = ### TODO: count the number of lines in this RDD
    sumValues = ### TODO: use goodTicks.reduce(lambda a,b: StockTick(???)) to sum the price, bid,
                ### ask and unit fields
    maxValuesReduce = goodTicks.reduce(maxValuesReduce) ### TODO: write the maxValuesReduce function
    minValuesReduce = goodTicks.reduce(minValuesReduce) ### TODO: write the minValuesReduce function

    avgUnits = sumValues.units / float(numTicks)
    avgPrice = sumValues.price / float(numTicks)

    print "Max units %i, avg units %f\n" % (maxValuesReduce.units, avgUnits)
    print "Max price %f, min price %f, avg price %f\n" % (maxValuesReduce.price, minValuesReduce.price, avgPrice)

    ### Daily and hourly spreads
    # Here is how the daily spread is computed. For each data point, the spread can be calculated
    # using the following formula: (ask - bid) / 2 * (ask + bid)
    # 1) We have a MapReduce phase that uses the generateSpreadsDailyKeys() function as an argument
    #    to map(), and the spreadsSumReduce() function as an argument to reduce()
    #    - The keys will be a unique date in the ISO 8601 format (so that sorting dates
    #      alphabetically will sort them chronologically)
    #    - The values will be tuples that contain adequate values to (a) only take one value into
    #      account per second (which value is picked doesn't matter), (b) sum the spreads for the
    #      day, and (c) count the number of spread values that have been added.
    # 2) We have a Map phase that computes the average spread using (b) and (c)
    # 3) A final Map phase formats the output by producing a string with the following format:
    #    "<key (date)>, <average_spread>"
    # 4) The output is written using .saveAsTextFile("WDC_daily")
    avgDailySpreads = goodTicks.map(generateSpreadsDailyKeys).reduceByKey(spreadsSumReduce); # (1)
    #avgDailySpreads = avgDailySpreads.map(lambda a: ???) # (2)
    #avgDailySpreads = avgDailySpreads.sortByKey().map(lambda a: ???) # (3)
    avgDailySpreads = avgDailySpreads.saveAsTextFile("WDC_daily") # (4)

    # For the hourly spread you only need to change the key. How?
    avgHourlySpreads = goodTicks.map(generateSpreadsHourlyKeys).reduceByKey(spreadsSumReduce); # (1)
    #avgHourlySpreads = avgHourlySpreads.map(lambda a: ???) # (2)
    #avgHourlySpreads = avgHourlySpreads.sortByKey().map(lambda a: ???) # (3)
    avgHourlySpreads = avgHourlySpreads.saveAsTextFile("WDC_hourly") # (4)

    sc.stop()
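The map, then reduceByKey, then map pipeline that steps (1) to (3) describe has the same general shape as the following self-contained sketch, which averages made-up per-day values. It is only meant to illustrate the pattern, not to be pasted into stock.py:

from pyspark import SparkContext

sc = SparkContext(appName="AvgPatternDemo")
# (key, (sum, count)) pairs, one pair per record; the keys and values are made up.
pairs = sc.parallelize([("2009-09-28", (0.004, 1)),
                        ("2009-09-28", (0.006, 1)),
                        ("2009-09-29", (0.002, 1))])
sums = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))    # phase (1): sum values and counts per key
avgs = sums.map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))             # phase (2): average = sum / count
lines = avgs.sortByKey().map(lambda kv: "%s, %f" % (kv[0], kv[1]))   # phase (3): format "<key>, <average>"
print lines.collect()
sc.stop()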
When you are done editing the file, you can try running your program. To this end, you must first load Spark:

$ module load natlang
$ module load NL/HADOOP/SPARK/1.0.0

If you get an error message saying that load was not found, you can try running source /etc/profile and then running the command again. You can then run your script using:

$ spark-submit --py-files stocktick.py stock.py
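Before completing stock.py, it can be worth checking that Spark runs at all against the smaller file. A minimal sketch, assuming the short file lives in the directory given above and is submitted with the same spark-submit command:

from pyspark import SparkContext

# Quick sanity check against the smaller file before running on the full 1.1GB file.
sc = SparkContext(appName="ShortFileCheck")
raw = sc.textFile("/cs/bigdata/datasets/WDC_tickbidask_short.txt")
print "Lines in the short file: %d" % raw.count()
sc.stop()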
With WDC_tickbidask_short.txt, the output should be:

Max units , avg units
Max price , min price , avg price

with the following contents for the files that contain the daily and hourly spreads:

$ cat WDC_daily/part
, e
, e
,
, e
,
$ head WDC_hourly/part
:09,
:10, e
:11, e
:12, e
:13, e
:14, e
:15, e
:16,
:09,
:10,

What results do you get for WDC_tickbidask.txt?
