Introduction To Hive




Introduction To Hive How to use Hive in Amazon EC2 CS 341: Project in Mining Massive Data Sets Hyung Jin (Evion) Kim Stanford University References: Cloudera tutorials, CS345a session slides, Hadoop: The Definitive Guide Roshan Sumbaly, LinkedIn

Today's Session Framework: Hadoop/Hive Computing power: Amazon Web Services Demo LinkedIn's frameworks & project ideas

Hadoop A collection of related subprojects for distributed computing Open source Core, Avro, MapReduce, HDFS, Pig, HBase, ZooKeeper, Hive, Chukwa...

Hive A data warehousing tool on top of Hadoop Built at Facebook 3 parts: Metastore over Hadoop Libraries for (de)serialization Query engine (HiveQL)

AWS - Amazon Web Services S3 - data storage EC2 - computing power Elastic MapReduce

Step by step Prepare security keys Upload your input files to S3 Turn on an Elastic MapReduce job flow Log in to the job flow Run HiveQL with a custom mapper/reducer

0. Prepare Security Keys AWS: Access Key / Private Key EC2: Key Pair - key name and key file (.pem)

1. Upload files to S3 Data is stored in buckets (folders) This is your only permanent storage in AWS - save input and output here Use the Firefox add-on S3Fox Organizer (http://www.s3fox.net)

2. Turn Elastic MapReduce On

3. Connect to Job Flow (1) Using the Amazon Elastic MapReduce Client http://developer.amazonwebservices.com/connect/entry.jspa?externalid=2264 Requires Ruby installed on your computer

3. Connect to Job Flow (2) - security Place credentials.json and the .pem file in the Amazon Elastic MapReduce Client folder, so you don't have to type them again and again:
{
  "access-id": "<access-id>",
  "private-key": "<private-key>",
  "key-pair": "new-key",
  "key-pair-file": "./new-key.pem",
  "region": "us-west-1"
}

3. Connect to Job Flow (3) List job flows: elastic-mapreduce --list Terminate a job flow: elastic-mapreduce --terminate --jobflow <id> SSH to the master: elastic-mapreduce --ssh <id>

4. HiveQL (1) A SQL-like language Hive wiki: http://wiki.apache.org/hadoop/Hive/GettingStarted Cloudera Hive tutorial: http://www.cloudera.com/hadoop-training-hive-introduction

4. HiveQL (2) SQL-like queries: SHOW TABLES, DESCRIBE, DROP TABLE CREATE TABLE, ALTER TABLE SELECT, INSERT

4. HiveQL (3) - usage Create a schema around data: CREATE EXTERNAL TABLE Use it like regular SQL: Hive automatically turns SQL queries into map/reduce jobs Use it with a custom mapper/reducer: any executable program that reads stdin and writes stdout

Example - problem Basic map/reduce example - count the frequency of each word! I - 3 data - 2 mining - 2 awesome - 1...

Example - Input Input: 270 Twitter tweets sample_tweets.txt T 2009-06-08 21:49:37 U http://twitter.com/evion W I think data mining is awesome! T 2009-06-08 21:49:37 U http://twitter.com/hyungjin W I don't think so. I don't like data mining

Example - How? Create a table from the raw data file table raw_tweets Parse the data file to match our format and save to a new table parser.py table tweets_parsed Run map/reduce mapper.py, reducer.py Save the result to a new table table word_count Find the top 10 most frequent words from the word_count table.

Example - Create Input Table Create a schema around the raw data file: CREATE EXTERNAL TABLE raw_tweets(line string) ROW FORMAT DELIMITED LOCATION 's3://cs341/test-tweets'; With this command, \n separates rows; Hive's default field delimiter is Ctrl-A (\001), but since this table has a single string column, each whole line becomes one row.

Example - Create Output Tables CREATE EXTERNAL TABLE tweets_parsed (time string, id string, tweet string) ROW FORMAT DELIMITED LOCATION 's3://cs341/tweets_parsed'; CREATE EXTERNAL TABLE word_count (word string, count int) ROW FORMAT DELIMITED LOCATION 's3://cs341/word_count';

Example - TRANSFORM TRANSFORM - the given Python script transforms the input columns Let's parse the original file into <time>, <id>, <tweet>: ADD FILE parser.py; INSERT OVERWRITE TABLE tweets_parsed SELECT TRANSFORM(line) USING 'python parser.py' AS (time, id, tweet) FROM raw_tweets; Add any script file you want to use to Hive first. The result of this SELECT is written out to the tweets_parsed table.
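parser.py itself is not shown on the slides; here is a minimal sketch of what such a TRANSFORM script could look like, assuming the T/U/W line layout of the sample input (the exact grouping logic is an assumption, not the course's actual script):

```python
import sys

def parse(lines):
    """Group the T (time) / U (user URL) / W (tweet text) lines of the
    raw file into one (time, id, tweet) record per tweet."""
    time = user = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("T "):
            time = line[2:].strip()
        elif line.startswith("U "):
            user = line[2:].strip()
        elif line.startswith("W ") and time and user:
            yield (time, user, line[2:].strip())
            time = user = None  # reset for the next record

if __name__ == "__main__":
    # Hive TRANSFORM feeds rows on stdin and reads tab-separated
    # columns back from stdout.
    for record in parse(sys.stdin):
        print("\t".join(record))
```

The AS (time, id, tweet) clause in the query then maps the three tab-separated output columns onto the tweets_parsed schema.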

Example - Map/Reduce Use the commands MAP and REDUCE: basically the same as TRANSFORM tweets_parsed -> map_output -> word_count ADD FILE mapper.py; ADD FILE reducer.py; FROM ( FROM tweets_parsed MAP tweets_parsed.time, tweets_parsed.id, tweets_parsed.tweet USING 'python mapper.py' AS word, count CLUSTER BY word) map_output INSERT OVERWRITE TABLE word_count REDUCE map_output.word, map_output.count USING 'python reducer.py' AS word, count; CLUSTER BY word uses word as the key.
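The slides reference mapper.py and reducer.py without showing them; a hedged sketch of what such a streaming word-count pair could look like (the column layout and tokenization are assumptions). The mapper emits (word, 1) pairs; because CLUSTER BY word delivers the mapper output grouped by word, the reducer only needs to sum consecutive runs of the same key:

```python
import sys

def map_words(rows):
    """Mapper: emit (word, 1) for every word in the tweet column
    of a tab-separated (time, id, tweet) row."""
    for row in rows:
        cols = row.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue
        for word in cols[2].lower().split():
            yield (word, 1)

def reduce_counts(pairs):
    """Reducer: input arrives clustered by word, so summing runs of
    identical keys gives the total count per word."""
    current, total = None, 0
    for word, count in pairs:
        if word != current:
            if current is not None:
                yield (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield (current, total)

if __name__ == "__main__":
    # Invoked as 'python wordcount.py map' or 'python wordcount.py reduce';
    # in both cases Hive streams rows via stdin/stdout.
    mode = sys.argv[1] if len(sys.argv) > 1 else None
    if mode == "map":
        for word, n in map_words(sys.stdin):
            print("%s\t%d" % (word, n))
    elif mode == "reduce":
        pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, n in reduce_counts(pairs):
            print("%s\t%d" % (word, n))
```

In the actual job flow these would be two separate files, one per ADD FILE statement; they are combined here only to keep the sketch compact.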

Example - Finding Top 10 Words Using syntax similar to SQL: SELECT word, count FROM word_count ORDER BY count DESC LIMIT 10;

Example - JOIN Finding pairs of words that have the same count, with counts greater than 5: SELECT wc1.word, wc2.word, wc2.count FROM word_count wc1 JOIN word_count wc2 ON (wc1.count = wc2.count) WHERE wc1.count > 5 AND wc1.word < wc2.word;
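To see what the self-join computes, here is the same logic over a small made-up word_count table in plain Python (the sample counts are invented for illustration). The wc1.word < wc2.word condition keeps each unordered pair exactly once and drops self-pairs:

```python
from itertools import combinations

# Hypothetical word_count contents, for illustration only.
word_count = {"data": 7, "mining": 7, "hadoop": 9, "hive": 9, "the": 9, "a": 3}

# combinations over the sorted items guarantees w1 < w2, mirroring
# the WHERE wc1.word < wc2.word condition in the HiveQL join.
pairs = [
    (w1, w2, c1)
    for (w1, c1), (w2, c2) in combinations(sorted(word_count.items()), 2)
    if c1 == c2 and c1 > 5
]
```

With these sample counts, the result is the four pairs sharing a count of 7 or 9; without the word ordering condition, each pair would appear twice (once in each order) plus once as a self-pair.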

Frameworks from LinkedIn A complete data stack from LinkedIn, open source @ http://sna-projects.com Any questions - rsumbaly@linkedin.com Today we introduce Kafka and Azkaban.

Kafka (1) A distributed publish/subscribe system Used at LinkedIn for tracking activity events http://sna-projects.com/kafka/

Kafka (2) Parsing data in files every time you want to run an algorithm is tedious What would be ideal? An iterator over your data (hiding all the underlying semantics) Kafka lets you publish data once (or continuously) to the system and then consume it as a stream.
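The "iterator over your data" idea can be illustrated with a toy Python sketch (this is not Kafka's API, just the abstraction it provides): the consumer sees a plain stream of events and never touches files, offsets, or network details.

```python
def event_stream(source):
    """Toy stand-in for a consumer: expose events as a plain iterator,
    hiding where and how they are stored (not the real Kafka API)."""
    for event in source:
        yield event.strip().lower()

def count_events(stream):
    """A streaming algorithm then just consumes the iterator,
    independent of what is feeding it."""
    counts = {}
    for event in stream:
        counts[event] = counts.get(event, 0) + 1
    return counts
```

Swapping the file-backed source for a live feed would not change count_events at all, which is the point of the abstraction.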

Kafka (3) Example: makes it easy to implement streaming algorithms on top of the Twitter stream

Azkaban (1) A simple Hadoop workflow system Used at LinkedIn to generate workflows for recommendation features Last year many students wanted to iterate on their algorithms multiple times. This required them to build a chain of Hadoop jobs which they ran manually every day.

Azkaban (2) Example workflow: generate n-grams as a Java program -> feed the n-grams to MR algorithm X running on Hadoop -> fork n parallel MR jobs to feed this to algorithms X_1 to X_n -> compare the results at the end http://sna-projects.com/azkaban