On Saying Enough Already! in MapReduce
|
|
|
- Andrea Davidson
- 5 years ago
- Views:
From this document you will learn the answers to the following questions:
How many f2 f2 did the researchers use to sort data?
What did the work on MapReduce use?
What did the work on top - k queries provide?
Transcription
1 On Saying Enough Already! in MapReduce Christos Doulkeridis and Kjetil Nørvåg Department of Computer and Information Science (IDI) Norwegian University of Science and Technology (NTNU) Trondheim, Norway
2 Outline Motivation Preliminaries MapReduce Top-k queries and top-k joins Rank-aware query processing in MapReduce Sorted access Intelligent data placement Data synopses Related work Conclusions and outlook 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 2
3 Motivation Business intelligence (BI) technology is essential for effective decision-making for the enterprise The challenges and needs of BI applications explode in the era of Big Data Data collection from various sources Retail, banking, RFID tags, , query logs, blogs, reviews, etc. Cloud Intelligence for data analysis of massive data sets The only scalable solution to-date Popularity of MapReduce and its open-source implementation Hadoop Important to support advanced BI operators over MapReduce Focus of this work: efficient rank-aware query processing 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 3
4 MapReduce Overview Map: (k 1, v 1 ) [(k 2,v 2 )] Reduce: (k 2, [v 2 ]) [v 3 ] Salient features Scalability, fault-tolerance, ease of use, flexibility, Limitations Performance!!! Lack of early termination 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 4
5 Rank-aware Processing Important tool for BI applications Decision-making based on top-k results matching the user s constraints, and ranked according to user preferences Inspection of a bounded set of k tuples only Rather than retrieval and display of a huge result set Challenging to report the top-k result, without exhaustive access to the underlying input data (early termination) 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 5
6 Top-k Queries ID NAME CITY PRICE SELECT * FROM hotels WHERE hotels.city= Trondheim ORDER BY hotels.price STOP AFTER k; h1 BEST Trondheim 1400 h2 ABC London 1200 h3 HOLIDAY Trondheim 1500 h4 CHEAP Trondheim 1200 h5 HILTON London 1800 h1 BEST Trondheim 1400 h3 HOLIDAY Trondheim 1500 h4 CHEAP Trondheim 1200 k = 1 h4 CHEAP Trondheim 1200 h1 BEST Trondheim 1400 h3 HOLIDAY Trondheim st International Workshop on Cloud Intelligence (Cloud-I 2012) 6
7 Top-k Queries A top-k query q k ( f ) is defined by a user-specified scoring function f, which aggregates the objects characteristics into a single score E.g., f = w 1 *X + w 2 *Y where Σw i =1 defines a total ordering Y p 7 Given a positive integer k and a user-defined weighting vector w Find p 3 p 4 the k data points p with the minimum f(p) scores p 1 w p2 p 6 p 5 X 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 7
8 Top-k Queries SELECT * FROM hotels HOLIDAY WHERE hotels.city = Trondheim ORDER BY 0.5*hotels.price+0.5*hotels.dist STOP AFTER 1; SELECT * FROM hotels CHEAP WHERE hotels.city = Trondheim ORDER BY 0.9*hotels.price+0.1*hotels.dist STOP AFTER 1; ID NAME CITY PRICE DIST h1 BEST Trondheim h2 ABC London h3 HOLIDAY Trondheim h4 CHEAP Trondheim h5 HILTON London SCORE st International Workshop on Cloud Intelligence (Cloud-I 2012) 8
9 Top-k Joins ID NAME CITY PRICE SELECT * FROM hotels, flights WHERE hotels.city = flights.to_city ORDER BY (0.5*hotels.price + 0.5*flights.price ) STOP AFTER k h1 BEST Trondheim 1400 h2 ABC London 1200 h3 HOLIDAY Trondheim 1500 h4 CHEAP Trondheim 1200 h5 HILTON London 1800 ID AIRLINE TO_CITY PRICE f1 KLM Trondheim 5000 f2 KLM London 4200 f3 SAS Trondheim 3000 f4 SAS Trondheim 3500 f5 KLM London st International Workshop on Cloud Intelligence (Cloud-I 2012) 9
10 Top-k Joins ID NAME CITY PRICE h1 BEST Trondheim 1400 h2 ABC London 1200 h3 HOLIDAY Trondheim 1500 h4 CHEAP Trondheim 1200 h5 HILTON London 1800 ID AIRLINE TO_CITY PRICE f1 KLM Trondheim 5000 f2 KLM London 4200 f3 SAS Trondheim 3000 f4 SAS Trondheim 3500 f5 KLM London 2000 ID NAME CITY PRICE ID AIRLINE TO_CITY PRICE h1 BEST Trondheim 1400 f1 KLM Trondheim 5000 h1 BEST Trondheim 1400 f3 SAS Trondheim 3000 h1 BEST Trondheim 1400 f4 SAS Trondheim st International Workshop on Cloud Intelligence (Cloud-I 2012) 10
11 Rank-aware Processing in MapReduce Access the complete input data! No support for early termination! 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 11
12 Rank-aware Processing in MapReduce Our contributions: Sorted access for top-k queries Intelligent data placement Use of data synopses 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 12
13 1. Sorted Access for Top-k Queries In centralized DBs, efficient processing of topk queries relies on sorted access to data Directly: data is stored sorted on disk Indirectly: provided by a secondary index How can we provide sorted access in MapReduce to support top-k queries q k ( f )? Two alternative techniques Always sort data before top-k processing based on f (query-dependent sorting) Store data sorted based on a scoring function F f (query-independent sorting) 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 13
14 Query-dependent Sorting Use a separate MR job to sort the data based on f before processing the top-k query q k ( f ) Easy to find the top-k results Simply report the k first tuples High overhead to sort data for each incoming query Sorting is query-dependent on scoring function f MR1 MR2 MR1 MR2 MR1 MR2 DataNodes 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 14
15 Query-independent Sorting Store data sorted on DataNodes based on a scoring function F (different than f) To process query q k ( f ) it suffices to access only the K first tuples of the stored data (where K > k) Sorting based on F is a one-time cost Again, can be performed using a separate MR job The difference (extra cost) in the number of accessed tuples K-k increases when F differs much from f MR1 MR2 MR3 MR1 MR2 MR3 MR1 MR2 MR3 DataNodes 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 15
16 2. Data Placement Data placement on DataNodes affects performance Determined by a Partitioner class in Hadoop Existing partitioning schemes (e.g., range or hash-based) used for data placement are oblivious to the nature of top-k queries Example: RanKloud [IEEE Multimedia 11] Desiderata Balance the useful work to DataNodes Avoid redundant processing 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 16
17 Intelligent Data Placement for Top-k Queries Angle-based partitioning [SIGMOD 08] (proposed by our group in the context of skyline queries) Splits the useful work fairly Splits the region near the origin of the axes to all partitions Advantages More intuitive for top-k queries Easy to generalize in higher dimensions Can be combined with sorting Sort based on distance to origin 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 17
18 3. Data Synopses Main idea Create and store metadata together with data Exploit the metadata during query processing to access only those blocks that may contain top-k results Metadata in the form of multidimensional histograms 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 18
19 Metadata for Top-k Queries Number of tuples Block IDs Can be constructed seamlessly during data upload Score range for d 2 Score range for d 1 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 19
20 Top-k Query Processing Exploiting Data Synopses Progressively access histogram bins Until it is guaranteed that the top-k tuples are enclosed in the bins Use upper bound on score Retrieve block IDs Use random access to retrieve only these blocks Example (k=5) Access bins up to d 13 and d 22 Total of 34 tuples (> 5) Block IDs = {1,2,3} 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 20
21 Optimizations Cost model for deciding when the cost of random access of few blocks is smaller than sequential access of many blocks Use optimized histograms (e.g., equi-depth) Use more advanced methods for histogram bin exploration E.g., examine more bins from the dimension that is more promising to produce the top-k results faster The use of data synopses can be combined with sorting and intentional data placement to boost the performance of query processing 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 21
22 Related Work on MapReduce RanKloud [IEEE Multimedia 11] Proposed for top-k join queries Cannot guarantee retrieval of k results CoHadoop [PVLDB 11] Co-location of files on the same DataNode Useful for joins EARL [PVLDB 12] Mechanism to stop execution of MapReduce jobs on demand 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 22
23 Conclusions & Outlook An overview of techniques for supporting rankaware processing in MapReduce Sorting, Data placement, Use of data synopses Currently, we evaluate these techniques Future work Analytical cost models Optimal partitioning scheme for top-k queries More complex query functions Extend the techniques to be applicable for intermediate results produced by other MapReduce jobs 1st International Workshop on Cloud Intelligence (Cloud-I 2012) 23
24 More information:
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
An Approach to High-Performance Scalable Temporal Object Storage
An Approach to High-Performance Scalable Temporal Object Storage Kjetil Nørvåg Department of Computer and Information Science Norwegian University of Science and Technology 791 Trondheim, Norway email:
Developing MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
A Survey of Large-Scale Analytical Query Processing in MapReduce
The VLDB Journal manuscript No. (will be inserted by the editor) A Survey of Large-Scale Analytical Query Processing in MapReduce Christos Doulkeridis Kjetil Nørvåg Received: date / Accepted: date Abstract
Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
Processing Joins over Big Data in MapReduce
Processing Joins over Big Data in MapReduce Christos Doulkeridis Department of Digital Systems School of Information and Communication Technologies University of Piraeus http://www.ds.unipi.gr/cdoulk/
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Apache Hadoop FileSystem Internals
Apache Hadoop FileSystem Internals Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Storage Developer Conference, San Jose September 22, 2010 http://www.facebook.com/hadoopfs
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Apache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
<Insert Picture Here> Oracle and/or Hadoop And what you need to know
Oracle and/or Hadoop And what you need to know Jean-Pierre Dijcks Data Warehouse Product Management Agenda Business Context An overview of Hadoop and/or MapReduce Choices, choices,
Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM
Using Big Data for Smarter Decision Making Colin White, BI Research July 2011 Sponsored by IBM USING BIG DATA FOR SMARTER DECISION MAKING To increase competitiveness, 83% of CIOs have visionary plans that
Search and Real-Time Analytics on Big Data
Search and Real-Time Analytics on Big Data Sewook Wee, Ryan Tabora, Jason Rutherglen Accenture & Think Big Analytics Strata New York October, 2012 Big Data: data becomes your core asset. It realizes its
Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework
Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework Fariha Atta MSc Informatics School of Informatics University of Edinburgh 2010 T Abstract he Map/Reduce
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Big Data Technology CS 236620, Technion, Spring 2013
Big Data Technology CS 236620, Technion, Spring 2013 Structured Databases atop Map-Reduce Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Roadmap Previous class MR Implementation This class Query Languages
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.
Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,
Hadoop and its Usage at Facebook. Dhruba Borthakur [email protected], June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur [email protected], June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
Accelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Analysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 [email protected] ABSTRACT MapReduce is a programming model
A Survey of Top-k Query Processing Techniques in Relational Database Systems
11 A Survey of Top-k Query Processing Techniques in Relational Database Systems IHAB F. ILYAS, GEORGE BESKALES, and MOHAMED A. SOLIMAN University of Waterloo Efficient processing of top-k queries is a
Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel
Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel In this technical workshop This presentation is for anyone that uses ArcGIS and is interested in analyzing large amounts of data We will
Analysis and Modeling of MapReduce s Performance on Hadoop YARN
Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University [email protected] Dr. Thomas C. Bressoud Dept. of Mathematics and
Design of Electric Energy Acquisition System on Hadoop
, pp.47-54 http://dx.doi.org/10.14257/ijgdc.2015.8.5.04 Design of Electric Energy Acquisition System on Hadoop Yi Wu 1 and Jianjun Zhou 2 1 School of Information Science and Technology, Heilongjiang University
From GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu [email protected] MapReduce/Hadoop
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
Massive Cloud Auditing using Data Mining on Hadoop
Massive Cloud Auditing using Data Mining on Hadoop Prof. Sachin Shetty CyberBAT Team, AFRL/RIGD AFRL VFRP Tennessee State University Outline Massive Cloud Auditing Traffic Characterization Distributed
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
low-level storage structures e.g. partitions underpinning the warehouse logical table structures
DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures
Apache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
GeoGrid Project and Experiences with Hadoop
GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology
RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE
RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients
Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,[email protected]
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,[email protected] Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
Fast Contextual Preference Scoring of Database Tuples
Fast Contextual Preference Scoring of Database Tuples Kostas Stefanidis Department of Computer Science, University of Ioannina, Greece Joint work with Evaggelia Pitoura http://dmod.cs.uoi.gr 2 Motivation
MS SQL Performance (Tuning) Best Practices:
MS SQL Performance (Tuning) Best Practices: 1. Don t share the SQL server hardware with other services If other workloads are running on the same server where SQL Server is running, memory and other hardware
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
Load Balancing in MapReduce Based on Scalable Cardinality Estimates
Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
CLOUD STORAGE USING HADOOP AND PLAY
27 CLOUD STORAGE USING HADOOP AND PLAY Devateja G 1, Kashyap P V B 2, Suraj C 3, Harshavardhan C 4, Impana Appaji 5 1234 Computer Science & Engineering, Academy for Technical and Management Excellence
Data warehousing with PostgreSQL
Data warehousing with PostgreSQL Gabriele Bartolini http://www.2ndquadrant.it/ European PostgreSQL Day 2009 6 November, ParisTech Telecom, Paris, France Audience
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD. Introduction to Pig
Introduction to Pig Agenda What is Pig? Key Features of Pig The Anatomy of Pig Pig on Hadoop Pig Philosophy Pig Latin Overview Pig Latin Statements Pig Latin: Identifiers Pig Latin: Comments Data Types
Integrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
Data: To BI or not to BI?
NATIONAL CONFERENCE ON BMS, 30 MAY 2013, HILTON HOTEL Challenges in building a BI and Big data analytics system Data: To BI or not to BI? Iva Valerieva, 1 Marketing & Business Development Manager Who Are
Snapshots in Hadoop Distributed File System
Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any
MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Sector vs. Hadoop. A Brief Comparison Between the Two Systems
Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector
Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900)
Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900) Ian Foster Computation Institute Argonne National Lab & University of Chicago 2 3 SQL Overview Structured Query Language
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig
BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig Contents Acknowledgements... 1 Introduction to Hive and Pig... 2 Setup... 2 Exercise 1 Load Avro data into HDFS... 2 Exercise 2 Define an
Hadoop SNS. renren.com. Saturday, December 3, 11
Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
KEYWORD SEARCH IN RELATIONAL DATABASES
KEYWORD SEARCH IN RELATIONAL DATABASES N.Divya Bharathi 1 1 PG Scholar, Department of Computer Science and Engineering, ABSTRACT Adhiyamaan College of Engineering, Hosur, (India). Data mining refers to
Vertica Live Aggregate Projections
Vertica Live Aggregate Projections Modern Materialized Views for Big Data Nga Tran - HPE Vertica - [email protected] November 2015 Outline What is Big Data? How Vertica provides Big Data Solutions? What
Data Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
Why Big Data in the Cloud?
Have 40 Why Big Data in the Cloud? Colin White, BI Research January 2014 Sponsored by Treasure Data TABLE OF CONTENTS Introduction The Importance of Big Data The Role of Cloud Computing Using Big Data
Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
In Memory Accelerator for MongoDB
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
QuickDB Yet YetAnother Database Management System?
QuickDB Yet YetAnother Database Management System? Radim Bača, Peter Chovanec, Michal Krátký, and Petr Lukáš Radim Bača, Peter Chovanec, Michal Krátký, and Petr Lukáš Department of Computer Science, FEECS,
Data Doesn t Communicate Itself Using Visualization to Tell Better Stories
SAP Brief Analytics SAP Lumira Objectives Data Doesn t Communicate Itself Using Visualization to Tell Better Stories Tap into your data big and small Tap into your data big and small In today s fast-paced
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Actian Vector in Hadoop
Actian Vector in Hadoop Industrialized, High-Performance SQL in Hadoop A Technical Overview Contents Introduction...3 Actian Vector in Hadoop - Uniquely Fast...5 Exploiting the CPU...5 Exploiting Single
2015 The MathWorks, Inc. 1
25 The MathWorks, Inc. 빅 데이터 및 다양한 데이터 처리 위한 MATLAB의 인터페이스 환경 및 새로운 기능 엄준상 대리 Application Engineer MathWorks 25 The MathWorks, Inc. 2 Challenges of Data Any collection of data sets so large and complex
Data Semantics Aware Cloud for High Performance Analytics
Data Semantics Aware Cloud for High Performance Analytics Microsoft Future Cloud Workshop 2011 June 2nd 2011, Prof. Jun Wang, Computer Architecture and Storage System Laboratory (CASS) Acknowledgement
MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan [email protected] Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
SQLSaturday #399 Sacramento 25 July, 2015. Big Data Analytics with Excel
SQLSaturday #399 Sacramento 25 July, 2015 Big Data Analytics with Excel Presenter Introduction Peter Myers Independent BI Expert Bitwise Solutions BBus, SQL Server MCSE, SQL Server MVP since 2007 Experienced
Management & Analysis of Big Data in Zenith Team
Management & Analysis of Big Data in Zenith Team Zenith Team, INRIA & LIRMM Outline Introduction to MapReduce Dealing with Data Skew in Big Data Processing Data Partitioning for MapReduce Frequent Sequence
Design and Evolution of the Apache Hadoop File System(HDFS)
Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
