COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Size: px
Start display at page:

Download "COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015"

Transcription

1 COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt file) explanations to the code answers to questions Deliver electronically to Expected by Friday, March 31, 11.59pm In case of questions: ask, ask, ask! 1

2 Part 1: Bulk Import into HBase 1. Given the same stack exchange data sets as used during the 1 st homework assignment. Develop a MapReduce code to perform a bulk import of the data in /bigdata-hw2 (not the large file from hw1) Same format as in the 1 st assignment, just different size ( ~220 MB). you have to decide on the schema, e.g. whether and how to group columns in families etc. Justify your selection in the document! Additional information: New build.xml example available on webpage, since you have to utilize different directories for the hbase jar files Additional argument required when starting a job to tell yarn where the hbase jar files are HADOOP_CLASSPATH=`hbase classpath` yarn jar edgarairquality.jar airqualitydemo.airquality /gabriel/input /gabriel/output Requirements: Final table name has to have the format bigd<xy>-hbasesample to avoid name clash of student tables 2

3 Required operations need to create first a table manually setting up the schema Perform MapReduce job using HFileOutputformat Input data using bulkload tool Document all steps in details Since not everything that you do is in the java file, please make sure that you document precisely how you created the table, how you executed the bulkload etc. Part 2 Write a MapReduce code, which calculate the average number of answers per question using the data from HBase (instead of the file) Compare the performance of the Hbase version of the code with your version from the 1 st homework assignment for the new file 3

4 Documentation The Documentation should contain (Brief) Problem description Solution strategy Results section Description of resources used Description of measurements performed Results (graphs/tables + findings) The document should not contain Replication of the entire source code that s why you have to deliver the sources Screen shots of every single measurement you made Actually, no screen shots at all. The output files 4

5 Pig Pig is an abstraction on top of Hadoop Pig is a platform for analyzing large data sets Provides high level programming language designed for data processing Converted into MapReduce and executed on Hadoop Clusters Why using Pig? MapReduce requires programmers Must think in terms of map and reduce functions More than likely will require Java programming Pig provides high-level language that can be used by Analysts and Scientists Does not require know how in parallel programming Pig s Features Join Datasets Sort Datasets Filter Data Types Group By User Defined Functions 5

6 Pig Components Pig Latin Command based language Designed specifically for data transformation and flow expression Execution Environment The environment in which Pig Latin commands are executed Supporting local and Hadoop execution modes Pig compiler converts Pig Latin to MapReduce Automatic vs. user level optimizations compared to manual MapReduce code Running Pig Script Execute commands in a file $pig scriptfile.pig Grunt Interactive Shell for executing Pig Commands Started when script file is NOT provided Can execute scripts from Grunt via run or exec commands Embedded Execute Pig commands using PigServer class Can have programmatic access to Grunt via PigRunner class 6

7 Pig Latin concepts Building blocks Field piece of data Tuple ordered set of fields, represented with ( and ) (10.4, 5, word, 4, field1) Bag collection of tuples, represented with { and } { (10.4, 5, word, 4, field1), (this, 1, blah) } Similar to Relational Database Bag is a table in the database Tuple is a row in a table $ pig grunt> cat /input/pig/a.txt a 1 d 4 c 9 k 6 Simple Pig Latin example Load grunt in default map-reduce mode grunt supports file system commands Load contents of text file into a bag called records grunt> records = LOAD '/input/a.txt' as (letter:chararray, count:int); grunt> dump records;... org.apache.pig.backend.hadoop.executionengine.mapreducelayer.mapreducelauncher - 50% complete :36:22,040 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreducelayer.mapreducelauncher - 100% complete... (a,1) (d,4) (c,9) (k,6) grunt> Display records on screen 7

8 Simple Pig Latin example No action is taken until DUMP or STORE commands are encountered Pig will parse, validate and analyze statements but not execute them STORE saves results (typically to a file) DUMP displays the results to the screen doesn t make sense to print large arrays to the screen For information and debugging purposes you can print a small sub-set to the screen grunt> records = LOAD '/input/excite-small.log' AS (userid:chararray, timestamp:long, query:chararray); grunt> toprint = LIMIT records 5; grunt> DUMP toprint; Simple Pig Latin example LOAD 'data' [USING function] [AS schema]; data name of the directory or file Must be in single quotes USING specifies the load function to use By default uses PigStorage which parses each line into fields using a delimiter Default delimiter is tab ( \t ) The delimiter can be customized using regular expressions AS assign a schema to incoming data Assigns names and types to fields ( alias:type) (name:chararray, age:int, gpa:float) 8

9 records = LOAD '/input/excite-small.log USING PigStorage() AS (userid:chararray, timestamp:long, query:chararray); int Signed 32-bit integer 10 long Signed 64-bit integer 10L or 10l float 32-bit floating point 10.5F or 10.5f double 64-bit floating point 10.5 or 10.5e2 or 10.5E2 chararray Character array (string) in Unicode UTF-8 bytearray Byte array (blob) hello world tuple An ordered set of fields (T: tuple (f1:int, f2:int)) bag A collection of tuples (B: bag {T: tuple(t1:int, t2:int)}) Pig Latin Diagnostic Tools Display the structure of the Bag grunt> DESCRIBE <bag_name>; Display Execution Plan Produces Various reports, e.g. logical plan, MapReduce plan grunt> EXPLAIN <bag_name>; Illustrate how Pig engine transforms the data grunt> ILLUSTRATE <bag_name>; 9

10 FOREACH statement FOREACH <bag> GENERATE <data> Iterate over each element in the bag and produce a result grunt> records = LOAD /input/a.txt' AS (c:chararray, i:int); grunt> dump records; (a,1) (d,4) (c,9) (k,6) grunt> counts = foreach records generate i; grunt> dump counts; (1) (4) (9) (6) Joining Two Data Sets Join Steps Load records into a bag from input #1 Load records into a bag from input #2 Join the 2 data-sets (bags) by provided join key Default Join is Inner Join Rows are joined where the keys match Rows that do not have matches are not included in the result Inner join Set 1 Set 2 10

11 Simple join example 1. Load records into a bag from input #1 posts = load '/input/user-posts.txt' using PigStorage(',') as (user:chararray, post:chararray, date:long); 2. Load records into a bag from input #2 likes = load '/input/user-likes.txt' using PigStorage(',') as (user:chararray,likes:int,date:long); 3. Join the data sets when a key is equal in both data-sets then the rows are joined into a new single row; In this case when user name is equal userinfo = join posts by user, likes by user; dump userinfo; $ hdfs dfs -cat /input/user-posts.txt user1,funny Story, user2,cool Deal, user4,interesting Post, user5,yet Another Blog, $ hdfs dfs -cat /input/user-likes.txt user1,12, user2,7, user3,0, user4,50, $ pig /code/innerjoin.pig (user1,funny Story, ,user1,12, ) (user2,cool Deal, ,user2,7, ) (user4,interestingpost, ,user4,50, ) 11

12 Join re-uses the names of the input fields and prepends the name of the input bag <bag_name>::<field_name> grunt> describe posts; posts: {user: chararray,post: chararray,date: long} grunt> describe likes; likes: {user: chararray,likes: int,date: long} grunt> describe userinfo; UserInfo: { posts::user: chararray, posts::post: chararray, posts::date: long, likes::user: chararray, likes::likes: int, likes::date: long} Outer Join Records which will not join with the other record-set are still included in the result Left Outer Records from the first data-set are included whether they have a match or not. Fields from the unmatched (second) bag are set to null. Right Outer The opposite of Left Outer Join: Records from the second dataset are included no matter what. Fields from the unmatched (first) bag are set to null. Full Outer Records from both sides are included. For unmatched records the fields from the other bag are set to null. 12

13 Pig Use cases Loading large amounts of data Pig is built on top of Hadoop -> scales with the number of servers Alternative to manual bulkloading Using different data sources, e.g. collect web server logs, use external programs to fetch geo-location data for the users IP addresses, join the new set of geo-located web traffic to click maps stored Support for data sampling Hive Data Warehousing Solution built on top of Hadoop Provides SQL-like query language named HiveQL Minimal learning curve for people with SQL expertise Data analysts are target audience Early Hive development work started at Facebook in 2007 Translates HiveQL statements into a set of MapReduce Jobs which are then executed on a Hadoop Cluster 13

14 Hive Ability to bring structure to various data formats Simple interface for ad hoc querying, analyzing and summarizing large amounts of data Access to files on various data stores such as HDFS and HBase Hive does NOT provide low latency or realtime queries Even querying small amounts of data may take minutes Designed for scalability and ease-of-use rather than low latency responses Hive To support features like schema(s) and data partitioning Hive keeps its metadata in a Relational Database Packaged with Derby, a lightweight embedded SQL DB Default Derby based is good for evaluation an testing Schema is not shared between users as each user has their own instance of embedded Derby Stored in metastore_db directory which resides in the directory that hive was started from Can easily switch another SQL installation such as MySQL 14

15 Hive Architecture Hive Interface Options Command Line Interface (CLI) Hive Web Interface https://cwiki.apache.org/confluence/display/hive/hivewebinterface Java Database Connectivity (JDBC) https://cwiki.apache.org/confluence/display/hive/hiveclient Re-used from Relational Databases Database: Set of Tables, used for name conflict resolution Table: Set of Rows that have the same schema (same columns) Row: A single record; a set of columns Column: provides value and type for a single value 15

16 Hive creating a table hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE; OK Time taken: seconds creates a table with 3 columns How the underlying file should be parsed hive> show tables; OK posts Time taken: seconds hive> describe posts; OK user string post string time bigint Time taken: seconds Display schema for posts table Hive Query Data hive> select * from posts where user="user2"; OK user2 Cool Deal Time taken: seconds hive> select * from posts where time<= limit 2; OK user1 Funny Story user2 Cool Deal Time taken: seconds hive> 16

17 Partitions To increase performance Hive has the capability to partition data The values of partitioned column divide a table into segments Entire partitions can be ignored at query time Similar to relational databases indexes but not as granular Partitions have to be properly crated by users When inserting data must specify a partition At query time, whenever appropriate, Hive will automatically filter out partitions Bucketing Mechanism to query and examine random samples of data Break data into a set of buckets based on a hash function of a "bucket column" Capability to execute queries on a sub-set of random data Doesn t automatically enforce bucketing User is required to specify the number of buckets by setting # of reducer 17

18 Joins Hive support outer joins left, right and full joins Can join multiple tables Default Join is Inner Join Rows are joined where the keys match Rows that do not have matches are not included in the result Pig vs. Hive vs. Spark Hive Uses an SQL like query language called HQL Gives non-programmers the ability to query and analyze data in Hadoop. Pig Uses a workflow driven scripting language Don't need to be an expert Java programmer but need a few coding skills. Can be used to convert unstructured data into a meaningful form. Spark Successor to map-reduce in Hadoop, Emphasis on in-memory computing, lower level programming required than in Hive and Spark 18

Apache Pig Joining Data-Sets

Apache Pig Joining Data-Sets 2012 coreservlets.com and Dima May Apache Pig Joining Data-Sets Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized Hadoop training courses

More information

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig

Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop

More information

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Spring,2015 Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE Contents: Briefly About Big Data Management What is hive? Hive Architecture Working

More information

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce The Hadoop

More information

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS) Use Data from a Hadoop Cluster with Oracle Database Hands-On Lab Lab Structure Acronyms: OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS) All files are

More information

American International Journal of Research in Science, Technology, Engineering & Mathematics

American International Journal of Research in Science, Technology, Engineering & Mathematics American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629

More information

Recommended Literature for this Lecture

Recommended Literature for this Lecture COSC 6339 Big Data Analytics Introduction to MapReduce (III) and 1 st homework assignment Edgar Gabriel Spring 2015 Recommended Literature for this Lecture Andrew Pavlo, Erik Paulson, Alexander Rasin,

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the

More information

ITG Software Engineering

ITG Software Engineering Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Big Data Too Big To Ignore

Big Data Too Big To Ignore Big Data Too Big To Ignore Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2 Agenda! Defining Big Data! Introduction

More information

Introduction to Apache Hive

Introduction to Apache Hive Introduction to Apache Hive Pelle Jakovits 14 Oct, 2015, Tartu Outline What is Hive Why Hive over MapReduce or Pig? Advantages and disadvantages Running Hive HiveQL language User Defined Functions Hive

More information

11/18/15 CS 6030. q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in.

11/18/15 CS 6030. q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in. by shatha muhi CS 6030 1 q Big Data: collections of large datasets (huge volume, high velocity, and variety of data). q Apache Hadoop framework emerged to solve big data management and processing challenges.

More information

Introduction to Apache Hive

Introduction to Apache Hive Introduction to Apache Hive Pelle Jakovits 1. Oct, 2013, Tartu Outline What is Hive Why Hive over MapReduce or Pig? Advantages and disadvantages Running Hive HiveQL language Examples Internals Hive vs

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12 Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Hadoop Tutorial GridKa School 2011

Hadoop Tutorial GridKa School 2011 Hadoop Tutorial GridKa School 2011 Ahmad Hammad, Ariel García Karlsruhe Institute of Technology September 7, 2011 Abstract This tutorial intends to guide you through the basics of Data Intensive Computing

More information

Hadoop Scripting with Jaql & Pig

Hadoop Scripting with Jaql & Pig Hadoop Scripting with Jaql & Pig Konstantin Haase und Johan Uhle 1 Outline Introduction Markov Chain Jaql Pig Testing Scenario Conclusion Sources 2 Introduction Goal: Compare two high level scripting languages

More information

Hadoop Job Oriented Training Agenda

Hadoop Job Oriented Training Agenda 1 Hadoop Job Oriented Training Agenda Kapil CK hdpguru@gmail.com Module 1 M o d u l e 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop Why Another Data Warehousing System? Data, data and more data 200GB per day in March 2008 12+TB(compressed) raw data per day today Trends

More information

Complete Java Classes Hadoop Syllabus Contact No: 8888022204

Complete Java Classes Hadoop Syllabus Contact No: 8888022204 1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What

More information

Big Data and Scripting Systems build on top of Hadoop

Big Data and Scripting Systems build on top of Hadoop Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

Xiaoming Gao Hui Li Thilina Gunarathne

Xiaoming Gao Hui Li Thilina Gunarathne Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal

More information

Introduction to Pig. Content developed and presented by: 2009 Cloudera, Inc.

Introduction to Pig. Content developed and presented by: 2009 Cloudera, Inc. Introduction to Pig Content developed and presented by: Outline Motivation Background Components How it Works with Map Reduce Pig Latin by Example Wrap up & Conclusions Motivation Map Reduce is very powerful,

More information

Big Data Weather Analytics Using Hadoop

Big Data Weather Analytics Using Hadoop Big Data Weather Analytics Using Hadoop Veershetty Dagade #1 Mahesh Lagali #2 Supriya Avadhani #3 Priya Kalekar #4 Professor, Computer science and Engineering Department, Jain College of Engineering, Belgaum,

More information

Qsoft Inc www.qsoft-inc.com

Qsoft Inc www.qsoft-inc.com Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

More information

Integration of Apache Hive and HBase

Integration of Apache Hive and HBase Integration of Apache Hive and HBase Enis Soztutar enis [at] apache [dot] org @enissoz Page 1 About Me User and committer of Hadoop since 2007 Contributor to Apache Hadoop, HBase, Hive and Gora Joined

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

Programming with Pig. This chapter covers

Programming with Pig. This chapter covers 10 Programming with Pig This chapter covers Installing Pig and using the Grunt shell Understanding the Pig Latin language Extending the Pig Latin language with user-defined functions Computing similar

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Introduction to Apache Pig Indexing and Search

Introduction to Apache Pig Indexing and Search Large-scale Information Processing, Summer 2014 Introduction to Apache Pig Indexing and Search Emmanouil Tzouridis Knowledge Mining & Assessment Includes slides from Ulf Brefeld: LSIP 2013 Organizational

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

Comparing SQL and NOSQL databases

Comparing SQL and NOSQL databases COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

HareDB HBase Client Web Version USER MANUAL HAREDB TEAM

HareDB HBase Client Web Version USER MANUAL HAREDB TEAM 2013 HareDB HBase Client Web Version USER MANUAL HAREDB TEAM Connect to HBase... 2 Connection... 3 Connection Manager... 3 Add a new Connection... 4 Alter Connection... 6 Delete Connection... 6 Clone Connection...

More information

Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

More information

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

More information

Scaling Up 2 CSE 6242 / CX 4242. Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

Scaling Up 2 CSE 6242 / CX 4242. Duen Horng (Polo) Chau Georgia Tech. HBase, Hive CSE 6242 / CX 4242 Scaling Up 2 HBase, Hive Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le

More information

How, What, and Where of Data Warehouses for MySQL

How, What, and Where of Data Warehouses for MySQL How, What, and Where of Data Warehouses for MySQL Robert Hodges CEO, Continuent. Introducing Continuent The leading provider of clustering and replication for open source DBMS Our Product: Continuent Tungsten

More information

Allan Mitchell. Joint author on 2005/2008 SSIS Book by Wrox Websites. www.copperblueconsulting.com

Allan Mitchell. Joint author on 2005/2008 SSIS Book by Wrox Websites. www.copperblueconsulting.com Jive with Hive Allan Mitchell Joint author on 2005/2008 SSIS Book by Wrox Websites www.copperblueconsulting.com Specialise in Data and Process Integration Microsoft SQL Server MVP Twitter: allansqlis E:

More information

Using distributed technologies to analyze Big Data

Using distributed technologies to analyze Big Data Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/

More information

Integrating VoltDB with Hadoop

Integrating VoltDB with Hadoop The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.

More information

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering MySQL and Hadoop: Big Data Integration Shubhangi Garg & Neha Kumari MySQL Engineering 1Copyright 2013, Oracle and/or its affiliates. All rights reserved. Agenda Design rationale Implementation Installation

More information

Big Data: Pig Latin. P.J. McBrien. Imperial College London. P.J. McBrien (Imperial College London) Big Data: Pig Latin 1 / 36

Big Data: Pig Latin. P.J. McBrien. Imperial College London. P.J. McBrien (Imperial College London) Big Data: Pig Latin 1 / 36 Big Data: Pig Latin P.J. McBrien Imperial College London P.J. McBrien (Imperial College London) Big Data: Pig Latin 1 / 36 Introduction Scale Up 1GB 1TB 1PB P.J. McBrien (Imperial College London) Big Data:

More information

Two kinds of Map/Reduce programming In Java/Python In Pig+Java Today, we'll start with Pig

Two kinds of Map/Reduce programming In Java/Python In Pig+Java Today, we'll start with Pig Pig Page 1 Programming Map/Reduce Wednesday, February 23, 2011 3:45 PM Two kinds of Map/Reduce programming In Java/Python In Pig+Java Today, we'll start with Pig Pig Page 2 Recall from last time Wednesday,

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

An Insight on Big Data Analytics Using Pig Script

An Insight on Big Data Analytics Using Pig Script An Insight on Big Data Analytics Using Pig Script J.Ramsingh 1, Dr.V.Bhuvaneswari 2 1 Ph.D research scholar Department of Computer Applications, Bharathiar University, 2 Assistant professor Department

More information

Data Warehouse and Hive. Presented By: Shalva Gelenidze Supervisor: Nodar Momtselidze

Data Warehouse and Hive. Presented By: Shalva Gelenidze Supervisor: Nodar Momtselidze Data Warehouse and Hive Presented By: Shalva Gelenidze Supervisor: Nodar Momtselidze Decision support systems Decision Support Systems allowed managers, supervisors, and executives to once again see the

More information

07/11/2014 Julien! Poorna! Andreas

07/11/2014 Julien! Poorna! Andreas Ad-hoc Query Brown Bag Session 07/11/2014 Julien Poorna Andreas User Story Procedures are only developer friendly and not ad-hoc Open datasets to broader audience of non developers Introduce schema to

More information

Introduction to Big Data Training

Introduction to Big Data Training Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB

More information

A Brief Introduction to MySQL

A Brief Introduction to MySQL A Brief Introduction to MySQL by Derek Schuurman Introduction to Databases A database is a structured collection of logically related data. One common type of database is the relational database, a term

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Next Gen Hadoop Gather around the campfire and I will tell you a good YARN

Next Gen Hadoop Gather around the campfire and I will tell you a good YARN Next Gen Hadoop Gather around the campfire and I will tell you a good YARN Akmal B. Chaudhri* Hortonworks *about.me/akmalchaudhri My background ~25 years experience in IT Developer (Reuters) Academic (City

More information

Scaling Up HBase, Hive, Pegasus

Scaling Up HBase, Hive, Pegasus CSE 6242 A / CS 4803 DVA Mar 7, 2013 Scaling Up HBase, Hive, Pegasus Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,

More information

SQL Server An Overview

SQL Server An Overview SQL Server An Overview SQL Server Microsoft SQL Server is designed to work effectively in a number of environments: As a two-tier or multi-tier client/server database system As a desktop database system

More information

Hadoop Pig. Introduction Basic. Exercise

Hadoop Pig. Introduction Basic. Exercise Your Name Hadoop Pig Introduction Basic Exercise A set of files A database A single file Modern systems have to deal with far more data than was the case in the past Yahoo : over 170PB of data Facebook

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna

More information

On a Hadoop-based Analytics Service System

On a Hadoop-based Analytics Service System Int. J. Advance Soft Compu. Appl, Vol. 7, No. 1, March 2015 ISSN 2074-8523 On a Hadoop-based Analytics Service System Mikyoung Lee, Hanmin Jung, and Minhee Cho Korea Institute of Science and Technology

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

Alexander Rubin Principle Architect, Percona April 18, 2015. Using Hadoop Together with MySQL for Data Analysis

Alexander Rubin Principle Architect, Percona April 18, 2015. Using Hadoop Together with MySQL for Data Analysis Alexander Rubin Principle Architect, Percona April 18, 2015 Using Hadoop Together with MySQL for Data Analysis About Me Alexander Rubin, Principal Consultant, Percona Working with MySQL for over 10 years

More information

Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved.

Important Notice. (c) 2010-2013 Cloudera, Inc. All rights reserved. Hue 2 User Guide Important Notice (c) 2010-2013 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this document

More information

Oracle Data Integrator for Big Data. Alex Kotopoulis Senior Principal Product Manager

Oracle Data Integrator for Big Data. Alex Kotopoulis Senior Principal Product Manager Oracle Data Integrator for Big Data Alex Kotopoulis Senior Principal Product Manager Hands on Lab - Oracle Data Integrator for Big Data Abstract: This lab will highlight to Developers, DBAs and Architects

More information

From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten

From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten MC Brown, Director of Documentation Linas Virbalas, Senior Software Engineer. About Tungsten Replicator Open source drop-in

More information

Alternatives to HIVE SQL in Hadoop File Structure

Alternatives to HIVE SQL in Hadoop File Structure Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The

More information

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved. Collaborative Big Data Analytics 1 Big Data Is Less About Size, And More About Freedom TechCrunch!!!!!!!!! Total data: bigger than big data 451 Group Findings: Big Data Is More Extreme Than Volume Gartner!!!!!!!!!!!!!!!

More information

Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.

Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo. Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.com Hadoop Hadoop is an open source distributed platform for data storage

More information

Introduction To Hive

Introduction To Hive Introduction To Hive How to use Hive in Amazon EC2 CS 341: Project in Mining Massive Data Sets Hyung Jin(Evion) Kim Stanford University References: Cloudera Tutorials, CS345a session slides, Hadoop - The

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

Big Data : Experiments with Apache Hadoop and JBoss Community projects

Big Data : Experiments with Apache Hadoop and JBoss Community projects Big Data : Experiments with Apache Hadoop and JBoss Community projects About the speaker Anil Saldhana is Lead Security Architect at JBoss. Founder of PicketBox and PicketLink. Interested in using Big

More information

Architecting the Future of Big Data

Architecting the Future of Big Data Hive ODBC Driver User Guide Revised: October 1, 2012 2012 Hortonworks Inc. All Rights Reserved. Parts of this Program and Documentation include proprietary software and content that is copyrighted and

More information

CASE STUDY OF HIVE USING HADOOP 1

CASE STUDY OF HIVE USING HADOOP 1 CASE STUDY OF HIVE USING HADOOP 1 Sai Prasad Potharaju, 2 Shanmuk Srinivas A, 3 Ravi Kumar Tirandasu 1,2,3 SRES COE,Department of er Engineering, Kopargaon,Maharashtra, India 1 psaiprasadcse@gmail.com

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning

More information

Hadoop Hands-On Exercises

Hadoop Hands-On Exercises Hadoop Hands-On Exercises Lawrence Berkeley National Lab July 2011 We will Training accounts/user Agreement forms Test access to carver HDFS commands Monitoring Run the word count example Simple streaming

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS...253 PART 4 BEYOND MAPREDUCE...385 brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 1 Hadoop in a heartbeat 3 2 Introduction to YARN 22 PART 2 DATA LOGISTICS...59 3 Data serialization working with text and beyond 61 4 Organizing and

More information

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

More information

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop

More information

Firebird meets NoSQL (Apache HBase) Case Study

Firebird meets NoSQL (Apache HBase) Case Study Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture. Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in

More information

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE Anjali P P 1 and Binu A 2 1 Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi. M G University, Kerala

More information