Introduction to Apache Pig Indexing and Search
- Melvyn Mathews
- 9 years ago
Transcription
1 Large-scale Information Processing, Summer 2014 Introduction to Apache Pig Indexing and Search Emmanouil Tzouridis Knowledge Mining & Assessment Includes slides from Ulf Brefeld: LSIP 2013
2 Organizational
1st Tutorial: Tuesday 29th, Pig and Indexing
Every participant marks with a cross the exercises he/she solved
For every exercise, one person is picked to present his/her solution
At least 50% crosses are needed to qualify for the exam
Example sheet:
Name Last name Ex 1.1 Ex 1.2 Ex 1.3
Max Muestermann X X
Daniel Winkler X X
3 What is Pig? Apache Pig is a high-level platform for big data analysis: Compiler: generates Map Reduce programs High-level language: PigLatin
4 Why Pig? Writing Map Reduce jobs can be painful: difficult to make abstractions; verbose; joins are difficult; linking many Map Reduce jobs can be difficult. Pig aims to solve these problems.
5 Why Pig? High level. Supports many relational features: Join, Group by, User defined functions. Makes chaining multiple MapReduce jobs easy.
6 Why Pig? Motivation by example
Assume two data sources: user data and website data
We need: the top 10 visited URLs for people within a given age range
Steps: Load users, Load sites -> Filter by age -> Join by user id -> Group by url -> Count clicks -> Sort by clicks -> Take top 10
7 Example in MapReduce
8 Example in PigLatin
Users = load 'users' as (name, age);                                -- Load users
filtered_users = filter Users by age > 18 and age < 45;             -- Filter by age
Pages = load 'pages' as (user, url);                                -- Load sites
Joined = join filtered_users by name, Pages by user;                -- Join by user id
Grouped = group Joined by url;                                      -- Group by url
Counted = foreach Grouped generate group, COUNT(Joined) as clicks;  -- Count clicks
Sorted = order Counted by clicks desc;                              -- Sort by clicks
Top10 = limit Sorted 10;                                            -- Take top 10
store Top10 into 'topten';
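To make the dataflow of the script concrete, here is a minimal Python sketch of the same steps on in-memory lists. The function name `top_urls`, its parameters, and the sample data are illustrative assumptions, not part of the slides; it also assumes user names are unique, so the join reduces to a membership test.

```python
from collections import Counter

def top_urls(users, pages, min_age=18, max_age=45, k=10):
    """Return the k most-clicked URLs among users with min_age < age < max_age."""
    # filter: keep users strictly inside the age range, as in the Pig script
    wanted = {name for name, age in users if min_age < age < max_age}
    # join + group + count: count clicks per URL for the kept users
    clicks = Counter(url for user, url in pages if user in wanted)
    # order by clicks desc + limit k
    return clicks.most_common(k)

# Hypothetical sample data: (name, age) and (user, url)
users = [("ann", 25), ("bob", 17), ("eve", 44), ("dan", 60)]
pages = [("ann", "a.org"), ("bob", "a.org"), ("eve", "a.org"),
         ("eve", "b.org"), ("dan", "b.org"), ("ann", "c.org"), ("ann", "c.org")]
print(top_urls(users, pages, k=3))
```

Unlike Pig, this sketch holds everything in memory on one machine; it only illustrates what each relational step computes.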
9 Usage Run modes: Local mode, MapReduce mode. Ways to run: interactive (Grunt shell), script, embedded in another program
10 Running
Pig executes PigLatin statements in two steps:
1) Validation of the syntax and semantics of the statements
grunt> Employees = load 'employees' as (name, address);
grunt> EmployeesF = filter Employees by address == 'Berlin';
2) Execution, triggered only by a DUMP or STORE statement
grunt> dump EmployeesF;
(Thomas, Berlin)
(Maria, Berlin)
(Jan, Berlin)
...
11 Data types
Simple data types: int, long, float, double, chararray, bytearray
Complex data types:
Tuple: ordered, fixed-length sequence of values, accessed by index
Bag: unordered collection of tuples
Map: keys all of the same type (chararray), values of any type
12 Example
Schema:
employees = load 'department1' as (
    firstname:chararray,
    lastname:chararray,
    salary:float,
    subordinates:bag{t:(firstname:chararray, lastname:chararray)},
    deductions:map[float],
    address:tuple(street:chararray, city:chararray, state:chararray, zip:int));
File format (fields separated by '\t'):
Patrik Peters {(Jan, Roberts),(Fritz, Karls)} [Federal Taxes#0.2,State Taxes#0.05,Insurance#0.1] (Zeughausstrasse 30, Darmstadt, Hessen,64289)
Tuples: (field1, field2,...)
Bags: {(tuple1), (tuple2),...}
Maps: [key1#value1, key2#value2,...]
13 I/O operations
Load: X = load '/data/customers.tsv' as (id:int, name:chararray, age:int);
Store: store X into '/data/customers.tsv';
Dump: dump X;
14 Relational operations
Foreach (projection operation):
Y = foreach X generate $1, $2;
I = foreach employees generate lastname, deductions#'Insurance';
C = foreach employees generate lastname, address.city;
Filter:
Y = filter X by age > 30;
Y = filter X by name matches 'Ja.*';
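As a rough analogy, projection and filtering behave like list comprehensions over records. The sketch below is a Python illustration only: the `employees` records, including the `age` field, are hypothetical sample data. It relies on two Pig behaviours: looking up a missing map key yields null, and `matches` must match the whole string (hence `re.fullmatch`).

```python
import re

# Hypothetical records; the field names follow the slides, the values are made up
employees = [
    {"lastname": "Peters", "age": 41,
     "deductions": {"Insurance": 0.1}, "address": {"city": "Darmstadt"}},
    {"lastname": "Jansen", "age": 28,
     "deductions": {}, "address": {"city": "Berlin"}},
]

# foreach employees generate lastname, deductions#'Insurance'
insurance = [(e["lastname"], e["deductions"].get("Insurance")) for e in employees]

# foreach employees generate lastname, address.city
cities = [(e["lastname"], e["address"]["city"]) for e in employees]

# filter ... by age > 30
over_30 = [e["lastname"] for e in employees if e["age"] > 30]

# filter ... by lastname matches 'Ja.*' (whole-string regular expression match)
ja = [e["lastname"] for e in employees if re.fullmatch(r"Ja.*", e["lastname"])]

print(insurance, cities, over_30, ja)
```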
15 Relational operations
Group:
Y = group X by age;
count_by_age = foreach Y generate group, COUNT(X);
Order:
O = order X by age;
Join:
Z = join X by id, Y by id;
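The following Python sketch shows what group, count, order and an inner join compute on small in-memory relations. The relation `W` and all sample rows are assumptions introduced for illustration; a second relation with an `id` field is needed for the join to make sense.

```python
from collections import defaultdict

X = [(1, "Jan", 31), (2, "Maria", 25), (3, "Tom", 31)]   # (id, name, age)
W = [(1, "Berlin"), (3, "Darmstadt"), (4, "Kassel")]     # (id, city), hypothetical

# group X by age: one bag of tuples per distinct age
groups = defaultdict(list)
for row in X:
    groups[row[2]].append(row)

# foreach ... generate group, COUNT(X)
count_by_age = {age: len(bag) for age, bag in groups.items()}

# order X by age
ordered = sorted(X, key=lambda row: row[2])

# join X by id, W by id: inner join, output rows carry the fields of both inputs
w_by_id = defaultdict(list)
for row in W:
    w_by_id[row[0]].append(row)
joined = [x + w for x in X for w in w_by_id.get(x[0], [])]

print(count_by_age, ordered, joined)
```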
16 Other operations
Flatten: removes nesting
A = foreach employees generate lastname, flatten(address);
Sample:
S = sample employees 0.10;
Describe: displays the schema of a relation
Explain: displays execution plans
17 Word count example
A = load './input.txt' as (sentence:chararray);
B = foreach A generate flatten(TOKENIZE(sentence)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
18 Word Count Example
input.txt:
the cat and the dog
the dog eats
he eats bananas
i have a dog
Dump A;
(the cat and the dog)
(the dog eats)
(he eats bananas)
(i have a dog)
19 Word Count Example
Dump B;
(the) (cat) (and) (the) (dog) (the) (dog) (eats) (he) (eats) (bananas) (i) (have) (a) (dog)
20 Word Count Example
Dump C;
(the, {(the), (the), (the)})
(cat, {(cat)})
(and, {(and)})
(dog, {(dog), (dog), (dog)})
(eats, {(eats), (eats)})
(he, {(he)})
(bananas, {(bananas)})
(i, {(i)})
(have, {(have)})
(a, {(a)})
21 Word Count Example
Dump D;
(the, 3) (cat, 1) (and, 1) (dog, 3) (eats, 2) (he, 1) (bananas, 1) (i, 1) (have, 1) (a, 1)
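For comparison, the same word count can be sketched in a few lines of Python. This is only an in-memory illustration of relations B, C and D. The function name `word_count` is an assumption, and whitespace splitting stands in for Pig's tokenizer, which also splits on some punctuation.

```python
from collections import Counter

def word_count(lines):
    # B: tokenize each sentence and flatten the result into one stream of words
    words = [word for line in lines for word in line.split()]
    # C and D: group equal words and count the size of each group
    return Counter(words)

lines = ["the cat and the dog", "the dog eats", "he eats bananas", "i have a dog"]
counts = word_count(lines)
for word, n in counts.items():
    print(word, n)
```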
22 Built-in functions Math: MIN, MAX, SUM,... String manipulation: CONCAT, REPLACE,... Others: e.g. SIZE, IsEmpty, TOTUPLE, TOBAG,...
23 User Defined Functions Support for UDFs For things that cannot be done in pure PigLatin Custom load, Column transformation, filtering, aggregation Can be written in Java, Python or Javascript PiggyBank Java UDFs (from users to users)
24 UDF example in Java
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    // Returns the first field of the input tuple in upper case,
    // or null when the input tuple is missing or empty.
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
25 Calling UDF
register myudfs.jar;
U = foreach employee generate myudfs.UPPER(name);
dump U;
26 UDF example in Python
def UPPER(word):
    return word.upper()
Registering:
register 'myudfs.py' using jython as myudfs;
U = foreach employee generate myudfs.UPPER(name);
dump U;
27 Is Pig fast? PigMix Set of queries to test Pig's efficiency On average 1.1x the time of a Map-Reduce program
28 Pig Conclusion Pig opens the Map-Reduce system to more people (non-Java experts). Pig provides common (relational) operations. Increases productivity: 10 lines of Pig vs. 200 lines of Java. Only slightly slower than a Java implementation.
29 Searching Search for a specific keyword/query over many documents?
30 Searching Search for a specific keyword/query over many documents. Do it sequentially: not efficient! Build data structures for indexing. Data structures for searching: Inverted index, Tries
31 Inverted indexes
Map tokens to documents
Extra information can be considered: HTML tags, typesetting, etc.
Terms   Occurrences
T1      D1, D3
T2      D2, D
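A minimal sketch of such a term-to-documents mapping in Python follows. The function names `build_inverted_index` and `search`, the sample documents, and the lowercase whitespace tokenization are all illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every token to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

def search(index, query):
    """Return the documents that contain every token of the query."""
    postings = [set(index.get(tok, ())) for tok in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical documents
docs = {"D1": "pig compiles to map reduce",
        "D2": "tries index tokens",
        "D3": "pig latin"}
index = build_inverted_index(docs)
print(index["pig"], search(index, "pig latin"))
```

A query is answered by intersecting posting lists, so the documents themselves are never scanned at search time.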
32 Inverted index for documents
33 Inverted index for text position
34 Empirical memory analysis
Text size = n
Vocabulary: grows sublinearly with n (Heaps' Law, depends on the text)
Storing occurrences: 0.3n to 0.4n, omitting stop words
35 Naive Construction
Straightforward computation:
1) Loop over all documents
2) Loop over all tokens
3) Insert(token, position)
Problem: insertions need an efficient data structure
Workaround: use an intermediate data structure and derive the index at the end
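The loop above can be sketched in Python for a single document, with a hash map standing in for the intermediate data structure. The function name `positional_index` is an assumption, and the sample sentence is an assumption chosen so that the 1-based character positions agree with the token:position labels on the trie slides.

```python
import re
from collections import defaultdict

def positional_index(text):
    """Map each lowercase token to the 1-based character positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[a-z]+", text.lower()):
        # Insert(token, position)
        index[match.group()].append(match.start() + 1)
    return dict(index)

# Assumed sample text
text = "Curiosity kills the cat. Unfortunate for the cat"
index = positional_index(text)
print(index)
```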
36 Tries Tree-based structure for building indexes Inner nodes indicate potential splits for unseen tokens Leaves are labeled with token and position Edges are labeled with characters
37-46 Tries (diagram sequence: a trie is built from the root, one insertion per slide): curiosity:1; kills:11; the:17; cat:21 (the 'c' branch splits into 'a' for cat and 'u' for curiosity); Unfortunate:26; for:38; the:42 (leaf becomes the:17,42); cat:46 (leaf becomes cat:21,46); caramba (the 'ca' branch splits into 'r' for caramba and 't' for cat)
47 Constructing a Trie
48 Inverted indexes using Tries
1) Loop over all documents
2) Loop over all tokens
3) Insert(token, position) into Trie
4) Extract index from Trie
If out of memory: save the Trie and load a subtree; only consider tokens of that subtree; save, then loop again over all documents using the next subtree
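Steps 3 and 4 can be sketched with a plain character trie in Python. Note the swap: this sketch uses one node per character, whereas the trie on the slides only splits as far as needed and keeps whole tokens at the leaves. The class and method names are assumptions, and the out-of-memory subtree strategy is not shown.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # edge label (one character) -> child node
        self.positions = []   # positions of the token that ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token, position):
        """Follow or create one edge per character, then record the position."""
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.positions.append(position)

    def extract_index(self):
        """Walk the trie depth-first and return {token: [positions]}."""
        index = {}
        stack = [(self.root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.positions:
                index[prefix] = node.positions
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))
        return index

trie = Trie()
for token, pos in [("curiosity", 1), ("kills", 11), ("the", 17), ("cat", 21),
                   ("unfortunate", 26), ("for", 38), ("the", 42), ("cat", 46)]:
    trie.insert(token, pos)
index = trie.extract_index()
print(sorted(index.items()))
```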
49 Inverted Index in Hadoop Similar to the WordCount example. Mapper: compute (token, occurrence). Reducer: sort/merge output of the Mapper. Tutorial: Pig can be used for simplicity
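The mapper/reducer split can be simulated in a single Python process to show the data that flows between the two phases. This is a sketch, not Hadoop code: `mapper`, `reducer` and `run_job` are hypothetical names, and an occurrence is assumed to be a (document id, word offset) pair.

```python
from itertools import groupby
from operator import itemgetter

def mapper(doc_id, text):
    """Emit one (token, occurrence) pair per token of the document."""
    for offset, token in enumerate(text.lower().split()):
        yield token, (doc_id, offset)

def reducer(token, occurrences):
    """Merge all occurrences of one token into a sorted posting list."""
    return token, sorted(occurrences)

def run_job(docs):
    # map phase: run the mapper over every document
    pairs = [kv for doc_id, text in docs.items() for kv in mapper(doc_id, text)]
    # shuffle/sort phase: bring equal keys together
    pairs.sort(key=itemgetter(0))
    # reduce phase: one reducer call per distinct token
    return dict(reducer(token, [occ for _, occ in group])
                for token, group in groupby(pairs, key=itemgetter(0)))

docs = {"D1": "the cat and the dog", "D2": "the dog eats"}
index = run_job(docs)
print(index)
```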
50 More about indexing
51 Pig resources Programming Pig Cloudera's introduction IBM tutorial
52 Thanks for your attention!
