Big Data and Scripting Systems built on top of Hadoop



Similar documents
Big Data and Scripting Systems built on top of Hadoop

How To Scale Out Of A Nosql Database

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

American International Journal of Research in Science, Technology, Engineering & Mathematics

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Hadoop Ecosystem B Y R A H I M A.

Integration of Apache Hive and HBase

Scaling Up HBase, Hive, Pegasus

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Using distributed technologies to analyze Big Data

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Implement Hadoop jobs to extract business value from large and varied data sets

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Qsoft Inc

Hadoop IST 734 SS CHUNG

Introduction to Apache Hive

Big Data and Apache Hadoop s MapReduce

Open source large scale distributed data management with Google s MapReduce and Bigtable

Data storing and data access

COURSE CONTENT Big Data and Hadoop Training

Integrating Big Data into the Computing Curricula

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Does Hadoop only support development in Java? Is Hadoop only support Java? We can't redesign everything, can we? How to stay compatible with legacy systems? Can Hadoop work with existing software?

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

The Hadoop Eco System Shanghai Data Science Meetup

ITG Software Engineering

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Apache HBase: the Hadoop Database

Complete Java Classes Hadoop Syllabus Contact No:

Hadoop and Big Data Research

Comparing SQL and NOSQL databases

Hadoop Scripting with Jaql & Pig

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Hypertable Architecture Overview

Workshop on Hadoop with Big Data

HareDB HBase Client Web Version USER MANUAL HAREDB TEAM

Introduction to Apache Hive

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Xiaoming Gao Hui Li Thilina Gunarathne

Sentimental Analysis using Hadoop Phase 2: Week 2

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Introduction To Hive

Introduction to Hbase Gkavresis Giorgos 1470

A Brief Outline on Bigdata Hadoop

Transforming the Telecoms Business using Big Data and Analytics

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Big Data Too Big To Ignore

Moving From Hadoop to Spark

MapReduce with Apache Hadoop Analysing Big Data

Introduction to Big Data Training

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Constructing a Data Lake: Hadoop and Oracle Database United!

Apache Hadoop. Alexandru Costan

Big Data and Data Science: Behind the Buzz Words

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Chase Wu New Jersey Ins0tute of Technology

Cloudera Certified Developer for Apache Hadoop

Big Data Management. Big Data Management. (BDM) Autumn Povl Koch November 11,

Media Upload and Sharing Website using HBASE

Introduction to Hadoop

Big Data Course Highlights

Peers Technologies Pvt. Ltd. HADOOP

BIG DATA What it is and how to use?

NoSQL Data Base Basics

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop Job Oriented Training Agenda

Hadoop: The Definitive Guide

Introduction to Apache Pig Indexing and Search

Internals of Hadoop Application Framework and Distributed File System

MapR, Hive, and Pig on Google Compute Engine

Hadoop and Map-Reduce. Swati Gore

Can the Elephants Handle the NoSQL Onslaught?

HiBench Introduction. Carson Wang Software & Services Group

Open Source Technologies on Microsoft Azure

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

HIVE. Data Warehousing & Analytics on Hadoop. Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics


Understanding NoSQL on Microsoft Azure

Hadoop Usage At Yahoo! Milind Bhandarkar

NoSQL and Hadoop Technologies On Oracle Cloud

Best Practices for Hadoop Data Analysis with Tableau

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Practical Cassandra. Vitalii

Transcription:

1, Big Data and Scripting Systems built on top of Hadoop

2, Pig/Latin
high-level map reduce programming platform
interactive execution of map reduce jobs
Pig is the name of the system, Pig Latin is the provided programming language
Pig Latin is similar to query languages like SQL, but still procedural, not declarative: commands describe actions to execute, not the desired result
can be extended using various languages
developed at Yahoo, moved to the Apache Software Foundation in 2007, top-level project (pig.apache.org)

3, Pig/Latin aims
provide means to quickly express algorithms that can be run on a Hadoop cluster
simple, easy-to-learn language
enable rapid prototyping of map reduce applications
runs on local or distributed Hadoop, in interactive or batch mode
commands from script or command line are translated into Hadoop jobs and executed on the Hadoop system

4, an example session
[nagelu@localhost bds]$ pig -x local
...
grunt> A = load '/etc/passwd' using PigStorage(':');
grunt> dump A;
...
start in local mode (access to the local file system)
assign the file content to variable A
dump the file content to the console
load does not result in an actual action
lazy evaluation: the content of the file is only loaded when necessary for execution
dump A; starts a map reduce job that reads and prints the file content

5, concepts in Pig Latin
commands (operators) act on relations
a relation is typically a CSV file from HDFS
example: assign a relation with named fields to variable A
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
further operators then transform relations into other relations
example: group items by field age
B = GROUP A BY age;
again, nothing is executed until needed
relations can have schemas (column names and types)
schemas can be used to ensure type-safety

6, overview
use LOAD to assign relations from files to variables
modify relations with the various operators (FILTER, GROUP, UNION, JOIN)
store results using STORE
Pig statements can be embedded in a number of languages
example in JavaScript:
importPackage(Packages.org.apache.pig.scripting.js)
Pig = org.apache.pig.scripting.js.JSPig
function main() {
    input = "original"
    output = "output"
    P = Pig.compile("A = load '$in'; store A into '$out';")
    result = P.bind({'in':input, 'out':output}).runSingle()
    if (result.isSuccessful()) {
...
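
The same embedding is possible from Java; below is a minimal sketch using Pig's PigServer API (the input file 'student' and its schema are illustrative assumptions):

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigEmbedding {
    public static void main(String[] args) throws Exception {
        // local mode, as in the example session above
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("A = LOAD 'student' USING PigStorage(':') "
                + "AS (name:chararray, age:int, gpa:float);");
        pig.registerQuery("B = GROUP A BY age;"); // still lazy, nothing runs yet
        // openIterator forces execution and streams back the result tuples
        Iterator<Tuple> it = pig.openIterator("B");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}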

7, extending Pig Latin
new functions can be added by implementing them in various languages: Java, Python, JavaScript, Ruby
the most extensive support is provided in Java:
extend the class org.apache.pig.EvalFunc (a sketch follows below)
register in Pig: REGISTER myudfs.jar;
Java programs run natively on Hadoop
special interfaces allow more efficient integration of special types of functions
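
A minimal sketch of such a Java UDF, modeled on the standard upper-casing example from the Pig documentation (the package name matches the REGISTER line above; usage is shown in the comment):

package myudfs;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// upper-cases its first argument; after REGISTER myudfs.jar; it can be
// called as, e.g., B = FOREACH A GENERATE myudfs.UPPER(name);
public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}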

8, aggregation interfaces
generic (EvalFunc): gets one input tuple at a time
Algebraic: incremental aggregation of values, can be executed distributed; must be able to compute its own intermediate results; produces intermediate and final results
Accumulator: gets all tuples belonging to one key, produces one (final) output value for all tuples of one key
FilterFunc: decides for each input tuple whether it passes (see the sketch below)
specified Java types relate to information from the relation schema
implemented functions can be overloaded, providing fitting implementations for different types
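
A hedged sketch of the last interface, a filter function; the field name and its semantics are illustrative assumptions:

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// passes only tuples whose first field (an age) is at least 18;
// usable as, e.g., B = FILTER A BY IsAdult(age);
public class IsAdult extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return Boolean.FALSE;
        }
        Integer age = (Integer) input.get(0);
        return age != null && age >= 18;
    }
}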

9, Apache Hive
top-level Apache project built on top of Hadoop
distributed data warehouse allowing queries and transformations
uses various file systems as backend (HDFS, Amazon S3, ...)
SQL-like query language HiveQL
execution by translation into map reduce jobs
indexing to accelerate queries
command line interface Hive CLI
originally developed by Facebook, turned into an open source project

10, HiveQL
create a table with two columns:
hive> create table student (sid string, sname string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ',';
tables correspond to directories in the underlying file system, stored as CSV files
load some data:
hive> LOAD DATA INPATH '/tmp/students.txt' INTO TABLE student;
imports the file content into Hive's storage
dropping the table deletes data and index from Hive storage, but does not affect external data
select * from student;

11, HiveQL
multiple tables can be joined, but only equality joins:
SELECT * FROM student JOIN scores ON (student.sid = scores.sid);
many standard SQL statements are available, e.g. GROUP BY:
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
grouping and aggregation write the result into a new table

12, indexing
indexes allow speeding up certain operations
indexes are stored independently from the actual data, in an RDBMS
example:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS index_type ...
Hive allows plugin indexes: implement create, rebuild, drop in the interface HiveIndexHandler (Java)

13, data organization
databases and tables correspond to (top-level) directories
partitions: sub-directories of the table directory
tables can be partitioned by some column:
CREATE TABLE tbl (col1 INT, col2 STRING) PARTITIONED BY (col3 DATE);
buckets: further break-downs of partitions, allow better organization with respect to map reduce (see the sketch below)
storage of the actual data in flat files
arbitrary formats can be used, description via regular expressions
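
A hedged sketch of such a layout, created from Java over Hive's JDBC interface; it assumes the hive-jdbc driver on the classpath and a HiveServer2 at localhost:10000, and all names are illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveLayoutExample {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        try (Statement stmt = con.createStatement()) {
            // one sub-directory per date value, each split into 8 buckets
            stmt.execute("CREATE TABLE visits (col1 INT, col2 STRING) "
                    + "PARTITIONED BY (col3 DATE) "
                    + "CLUSTERED BY (col1) INTO 8 BUCKETS");
        } finally {
            con.close();
        }
    }
}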

14, Hive summary
brings together SQL functionality and the scaling features of Hadoop
subset of the table operations specified by SQL
no low-latency queries, optimized for scalability
storage in flat files on a distributed file system
querying/processing by translation into map reduce jobs on Hadoop
extends storage by indexing in an RDBMS
due to distributed storage: no individual updates
appending to tables not (yet) possible

15, HBase
sparse, distributed, persistent multidimensional sorted map
implementation of the Bigtable 1 idea (Google)
uses HDFS, Hadoop, and ZooKeeper for storage and execution
implements servers for administration and storage/computation
maps keys to values: keys are structured, values are arbitrary
implements random read/write access on top of HDFS
provides consistency
accessible via shell or Java API
1 research.google.com/archive/bigtable.html

16, structure of HBase
keys are stored sorted, allowing range queries
keys are highly structured into: rowkey, column family, column, timestamp
tables are stored sparsely: missing values are not encoded but simply not stored
every value has to be stored with its full address
data is distributed, load balancing is automated
consistency is guaranteed on row-key level:
all changes within one rowkey are atomic
all data of one rowkey is stored on a single machine

17, storage and access
data is partitioned by keys
column families define storage properties
columns are only labels for the corresponding values
principal operations:
put: insert/update a value
delete: delete a value
get: retrieve a single value
scan: retrieve a collection of values by sequential reading
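
These operations map directly onto the Java client API; a minimal sketch (HBase 1.x-style API; the table, family, and column names are assumptions, and the table is assumed to exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("student"))) {
            // put: insert/update a value under rowkey / column family / column
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
                    Bytes.toBytes("alice"));
            table.put(put);
            // get: retrieve a single row by key
            Result r = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));
            // scan: sequential read over the sorted keys
            try (ResultScanner rs = table.getScanner(new Scan())) {
                for (Result row : rs) {
                    System.out.println(row);
                }
            }
        }
    }
}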

18, timestamps
each value is stored together with a timestamp, its version
there can be arbitrarily many values in the same row/column family/column differing only by their version
get/scan can retrieve more than one version of a value
put uses the current time as version by default; this can be changed to another timestamp
it is possible to delete only certain versions (e.g. the old ones) of a value
column families can have a time-to-live value; versions older than this are deleted automatically
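
Continuing the sketch above, a hedged illustration of working with explicit versions (the timestamp and names are assumptions):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersions {
    static void versionedAccess(Table table) throws IOException {
        // write with an explicit timestamp instead of the default current time
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
                42L, Bytes.toBytes("older value"));
        table.put(put);
        // read back up to three versions of each cell
        Get get = new Get(Bytes.toBytes("row1"));
        get.setMaxVersions(3);
        Result r = table.get(get);
        System.out.println(r);
    }
}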

19, coprocessors
HBase provides hooks into events
coprocessors run on the server side and are triggered by all kinds of specific events:
reading/scanning
writing/deletion
before start and after completion of individual operations
possible use cases:
secondary indexes
referential integrity
aggregating views
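
A hedged sketch of such a hook using the 1.x-era observer API: a coprocessor that logs every Put on the server side before it is applied (the class name and output are illustrative):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// loaded into the region server; triggered before every Put is applied
public class AuditObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability)
            throws IOException {
        System.out.println("about to write row " + Bytes.toString(put.getRow()));
    }
}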

20, guarantees in HBase
atomicity
mutations are atomic within a row; the result of the operation is reported
not atomic over multiple rows (parts may fail while others succeed)
consistency and isolation
returned rows consist of complete rows that existed at some point; the contained data may have changed in between
the data returned refers to a single point in the past
scans are not consistent over multiple rows: different rows may refer to different points in time

21, guarantees in HBase
visibility
after successful writing, data is immediately visible to all clients
versions of rows strictly increase
durability
refers to data being stored on disk
data that has been read from a cell is guaranteed to be durable
successful operations are durable, failed operations are not
visibility and durability may be tuned for performance:
individual reads without visibility guarantees
instead of durability, only periodic writing of data

22, what HBase can and can't do
joins: no
not supported natively; joins have to be implemented on the application level
select: yes
it is possible to create a scan reading the contents of a complete table or a subset of columns
scans can be provided with a filter that selects subsets of rows/columns (see the sketch below)
filters are executed on the server side
group by/aggregate: not yet
a planned feature; workarounds using Hive (external table) or map reduce
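
A hedged sketch of such a server-side filtered scan (HBase 1.x API; the column, its string encoding, and the threshold are assumptions):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScan {
    static void selectAdults(Table table) throws IOException {
        Scan scan = new Scan();
        // evaluated on the region servers, not in the client;
        // assumes ages are stored as fixed-width strings
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("age"),
                CompareFilter.CompareOp.GREATER_OR_EQUAL,
                Bytes.toBytes("18")));
        try (ResultScanner rs = table.getScanner(scan)) {
            for (Result row : rs) {
                System.out.println(row);
            }
        }
    }
}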

23, comparison
Pig Latin
allows viewing data as tables
provides ad hoc queries
extendable to arbitrary map reduce jobs
Hive
tries to provide SQL functionality
slow, large-scale queries
structured query language, query planner
HBase
more like a NoSQL database or key/value store
no SQL operations, only storage and retrieval
guarantees for operations
optimized for random, real-time access
note: Pig and Hive can access data from HBase directly
note: Cassandra is a database system similar to HBase, optimized for availability

24, Mahout
implementations of data mining/machine learning algorithms
top-level Apache project
provides a library for easy access to machine learning implementations
provides algorithms for the most common problems:
clustering
classification
frequent pattern mining, ...
optimized for practical (e.g. business) usage
language: Java

25, some implemented algorithms (examples)
classification: logistic regression, Bayesian classification, random forests, hidden Markov models
clustering: (fuzzy) k-means, EM-clustering, latent Dirichlet allocation, spectral clustering
frequent item set mining: parallel FP-Growth
spectral decomposition: (approximated) singular value decomposition
recommenders/collaborative filtering: item-based, matrix factorization

26, usage, parameters, environment
provides a large set of Java classes
these can be used directly in other applications
some (most) are implemented as map reduce jobs
implementations can be adapted to specific problems:
provide individual I/O classes
individual similarity/distance functions, ...
most algorithms can be started from the command line
some provide standalone servers for integration into other systems
runs on a single machine or distributed (on Hadoop)
integration of Apache Lucene (document search engine)

27, example 2
a recommendation engine can be specified by configuring a standard framework and applying it to data
configure classes implementing Java interfaces:
DataModel: reads information from storage in files
UserSimilarity: defines a similarity function between users
ItemSimilarity: defines a similarity function between items
Recommender: the actual recommender implementation
UserNeighborhood: computes a neighborhood of similar users
Mahout provides off-the-shelf implementations for all of these
can be extended by individual implementations
uses the library Taste
2 source: www.ibm.com/developerworks/java/library/j-mahout/

28, configuring the parts
//create the data model
FileDataModel dataModel = new FileDataModel(new File(recsFile));
UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
// Optional:
userSimilarity.setPreferenceInferrer(
    new AveragingPreferenceInferrer(dataModel));
//Get a neighborhood of users
UserNeighborhood neighborhood =
    new NearestNUserNeighborhood(neighborhoodSize, userSimilarity, dataModel);
the data model reads directly from a file
uses Pearson correlation as the user similarity

29, generating recommendations:
//Create the recommender
Recommender recommender =
    new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);
User user = dataModel.getUser(userID);
System.out.println("User: " + user);
//Print out the user's own preferences first
TasteUtils.printPreferences(user, handler.map);
//Get the top 5 recommendations
List<RecommendedItem> recommendations = recommender.recommend(userID, 5);
TasteUtils.printRecs(recommendations, handler.map);
Mahout provides classes that allow running the individual computations directly on a Hadoop cluster