python hadoop pig October 29, 2015




1 Python Hadoop Pig

This notebook shows how to submit a PIG job to a remote Hadoop cluster. The Hadoop distribution tested here is Cloudera. It is easier to follow if you already know Hadoop; otherwise I recommend reading Map/Reduce avec PIG (in French) first.

First, we download the data that we are going to upload to the remote cluster.

In [1]:
    import pyensae
    pyensae.download_data("ConfLongDemo_JSI.txt", website="https://archive.ics.uci.edu/ml/machine-le
Out[1]: ConfLongDemo_JSI.txt

We open an SSH connection to the bridge, the machine that can communicate with the cluster.

In [1]:
    import pyquickhelper
    params = {"server": "", "username": "", "password": ""}
    pyquickhelper.open_html_form(params=params, title="credentials", key_save="ssh_remote_hadoop")
Out[1]: <IPython.core.display.HTML at 0x742c9f0>

In [2]:
    password = ssh_remote_hadoop["password"]
    server = ssh_remote_hadoop["server"]
    username = ssh_remote_hadoop["username"]

We open the SSH connection:

In [4]: %remote_open
Out[4]: <pyensae.remote.ssh_remote_connection.ASSHClient at 0xa2422e8>

We check the content of the remote machine:

In [4]: %remote_cmd ls -l
Out[4]: <IPython.core.display.HTML object>

In [7]: %remote_ls .
Out[7]:
    attributes  code  alias        folder       size     unit \
    -rw-rw-r--  1     xavierdupre  xavierdupre  1043     Jul 14 23:40
    -rw-r--r--  1     xavierdupre  xavierdupre  2        Jul 15 00:22
    -rw-rw-r--  1     xavierdupre  xavierdupre  0        Sep 27 00:21
                1     xavierdupre  xavierdupre  290      Jul 14 23:48
                1     xavierdupre  xavierdupre  1654     Jul 15 00:20
                1     xavierdupre  xavierdupre  235      Jul 14 23:37
                1     xavierdupre  xavierdupre  1778     Jul 14 23:57
                1     xavierdupre  xavierdupre  4570     Jul 15 00:45
                1     xavierdupre  xavierdupre  4570     Jul 15 23:52
                1     xavierdupre  xavierdupre  574      Jul 15 23:51
                1     xavierdupre  xavierdupre  659      Sep 27 00:21
                1     xavierdupre  xavierdupre  382      Sep 27 00:21
                1     xavierdupre  xavierdupre  26186    Jul 15 23:52
                1     xavierdupre  xavierdupre  0        Jul 15 23:51
                1     xavierdupre  xavierdupre  3400818  Jul 15 23:48

    -rw-rw-r--  1     centrer_reduire.pig        False
    -rw-r--r--  1     diff_cluster               False
    -rw-rw-r--  1     dummy                      False
                1     init_random.pig            False
                1     iteration_complete.pig     False
                1     nb_obervations.pig         False
                1     pig_1436911046432.log      False
                1     pig_1436913856496.log      False
                1     pig_1436997076356.log      False
                1     post_traitement.pig        False
                1     pystream.pig               False
                1     pystream.py                False
                1     redirection.err            False
                1     redirection.out            False
                1     Skin_NonSkin.txt           False

We check the content on the cluster:

In [5]: %remote_cmd hdfs dfs -ls
Out[5]: <IPython.core.display.HTML object>

In [8]: %dfs_ls .
Out[8]:
        attributes  code  alias        folder       size     date        time \
    0   drwx------  -     xavierdupre  xavierdupre  0        2015-09-27  02:00
    1   drwx------  -     xavierdupre  xavierdupre  0        2015-09-27  00:22
    2   -rw-r--r--  3     xavierdupre  xavierdupre  132727   2014-11-16  02:37
    3   drwxr-xr-x  -     xavierdupre  xavierdupre  0        2014-11-16  02:38
    4   -rw-r--r--  3     xavierdupre  xavierdupre  3400818  2015-07-14  23:35
    5   drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:22
    6   drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-14  23:44
    7   drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-14  23:43
    8   drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-14  23:49
    9   drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-14  23:41
    10  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-14  23:38
    11  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:05
    12  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:22
    13  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:07
    14  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:09
    15  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:11
    16  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:13
    17  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:15
    18  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:17
    19  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:18
    20  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-07-15  00:20
    21  -rw-r--r--  3     xavierdupre  xavierdupre  461444   2014-11-20  01:33
    22  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2014-11-23  22:03
    23  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2014-11-23  22:07
    24  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2014-12-03  22:55
    25  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2014-11-20  23:43
    26  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-09-27  00:23
    27  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2015-09-27  00:22
    28  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2014-11-20  01:53
    29  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2014-11-21  01:17
    30  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2014-11-23  21:34
    31  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2014-11-23  21:51
    32  drwxr-xr-x  -     xavierdupre  xavierdupre  0        2014-11-21  11:08

    0   .Trash                                        True
    1   .staging                                      True
    2   ConfLongDemo_JSI.small.example.txt            False
    3   ConfLongDemo_JSI.small.example2.walking.txt   True
    4   Skin_NonSkin.txt                              False
    5   diff_cluster                                  True
    6   donnees_normalisees                           True
    7   ecartstypes                                   True
    8   init_random                                   True
    9   moyennes                                      True
    10  nb_obervations                                True
    11  output_iter1                                  True
    12  output_iter10                                 True
    13  output_iter2                                  True
    14  output_iter3                                  True
    15  output_iter4                                  True
    16  output_iter5                                  True
    17  output_iter6                                  True
    18  output_iter7                                  True
    19  output_iter8                                  True
    20  output_iter9                                  True
    21  paris.2014-11-11_22-00-18.331391.txt          False
    22  python_info.txt                               True
    23  python_info2.txt                              True
    24  random                                        True
    25  unitest2                                      True
    26  unittest                                      True
    27  unittest2                                     True
    28  velib_1hjs                                    True
    29  velib_py                                      True
    30  velib_py_results                              True
    31  velib_py_results_3days                        True
    32  velib_several_days                            True

We upload the file to the bridge (zipping it first would reduce the upload time).

In [9]: %remote_up ConfLongDemo_JSI.txt ConfLongDemo_JSI.txt
Out[9]: ConfLongDemo_JSI.txt

We check it got there:

In [12]: %remote_cmd ls Conf*JSI.txt
Out[12]: <IPython.core.display.HTML object>

We put it on the cluster:

In [13]: %remote_cmd hdfs dfs -put ConfLongDemo_JSI.txt ConfLongDemo_JSI.txt
Out[13]: <IPython.core.display.HTML object>

We check it was put on the cluster:

In [14]: %remote_cmd hdfs dfs -ls Conf*JSI.txt
Out[14]: <IPython.core.display.HTML object>

In [15]: %dfs_ls Conf*JSI.txt
Out[15]:
       attributes  code  alias        folder       size      date        time \
    0  -rw-r--r--  3     xavierdupre  xavierdupre  21546346  2015-09-27  11:33

    0  ConfLongDemo_JSI.txt  False

We create a simple PIG program:

In [5]:
    %%PIG filter_example.pig
    myinput = LOAD 'ConfLongDemo_JSI.txt' USING PigStorage(',')
              AS (index:long, sequence, tag, timestamp:long, dateformat, x:double, y:double, z:double, activity) ;
    filt = FILTER myinput BY activity == 'walking' ;
    STORE filt INTO 'ConfLongDemo_JSI.walking.txt' USING PigStorage() ;

In [6]: %pig_submit filter_example.pig -r=filter_example.redirect
Out[6]: <IPython.core.display.HTML object>

We check the redirected files were created:

In [7]: %remote_cmd ls f*redirect*
Out[7]: <IPython.core.display.HTML object>

We check the tail of the error log on a regular basis to follow the running job (other commands can be used to monitor jobs, see %remote_cmd mapred --help).

In [11]: %remote_cmd tail filter_example.redirect.err
Out[11]: <IPython.core.display.HTML object>

In [10]: %remote_cmd hdfs dfs -ls Conf*JSI.walking.txt
Out[10]: <IPython.core.display.HTML object>

In [9]: %dfs_ls Conf*JSI.walking.txt
Out[9]:
       attributes  code  alias        folder       size  date        time \
    0  -rw-r--r--  3     xavierdupre  xavierdupre  0     2015-09-27  11:38
    1  -rw-r--r--  3     xavierdupre  xavierdupre  0     2015-09-27  11:38

    0  ConfLongDemo_JSI.walking.txt/_SUCCESS      False
    1  ConfLongDemo_JSI.walking.txt/part-m-00000  False

After that, the stream has to be downloaded to the bridge and then to the local machine with %remote_down. We finally close the connection.

In [12]: %remote_close
Out[12]: True
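
The upload step notes that zipping the file first would shorten the transfer. Here is a minimal sketch of that idea, assuming the standard-library gzip module is acceptable and that the bridge has a tool such as gunzip to unpack the archive. The helper name compress_for_upload and the small stand-in file are illustrative only; they are not part of pyensae.

```python
import gzip
import os
import shutil
import tempfile


def compress_for_upload(path):
    """Write a gzip copy of `path` next to it and return the new file name."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the content, no full load in memory
    return gz_path


# Small stand-in file (made-up content) so the sketch runs without the real dataset.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "sample.txt")
with open(sample, "w") as f:
    f.write("0,A01,tag-1,1000,walking\n" * 1000)

archive = compress_for_upload(sample)
print(os.path.getsize(sample), "->", os.path.getsize(archive), "bytes")
```

The archive would then be sent with %remote_up in place of the raw file and unpacked on the bridge, for instance through %remote_cmd, before the hdfs dfs -put step.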
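
Before submitting filter_example.pig, it can help to try the same filter locally on a few lines. The sketch below mirrors the LOAD, FILTER and STORE steps with the standard csv module. The column names come from the PIG script; the rows are made up for illustration, and filter_walking is a hypothetical helper, not part of pyensae or PIG.

```python
import csv
import io

# Column names taken from the AS (...) clause of the PIG script.
COLUMNS = ["index", "sequence", "tag", "timestamp", "dateformat",
           "x", "y", "z", "activity"]


def filter_walking(lines, activity="walking"):
    """Yield the rows whose activity field equals `activity`,
    like FILTER myinput BY activity == 'walking'."""
    reader = csv.DictReader(lines, fieldnames=COLUMNS)
    for row in reader:
        if row["activity"] == activity:
            yield row


# Made-up rows in the same comma-separated layout, for illustration only.
sample = io.StringIO(
    "0,A01,tag-1,1000,27.05.2009 14:03:25,4.06,1.89,0.50,walking\n"
    "1,A01,tag-2,1001,27.05.2009 14:03:26,4.29,1.78,1.34,sitting\n"
    "2,A01,tag-1,1002,27.05.2009 14:03:27,4.35,1.82,0.43,walking\n"
)

kept = list(filter_walking(sample))
print(len(kept), "rows kept")
for row in kept:
    # PigStorage() without an argument writes tab-separated fields.
    print("\t".join(row[c] for c in COLUMNS))
```

Running this on a handful of real lines from ConfLongDemo_JSI.txt is a cheap way to confirm the exact spelling of the activity label before waiting for a cluster job to finish.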