Hadoop: Distributed Data Processing. Amr Awadallah, Founder/CTO, Cloudera, Inc. ACM Data Mining SIG, Thursday, January 25th, 2010
1 Hadoop: Distributed Data Processing. Amr Awadallah, Founder/CTO, Cloudera, Inc. ACM Data Mining SIG, Thursday, January 25th, 2010
2 Outline Scaling for Large Data Processing What is Hadoop? HDFS and MapReduce Hadoop Ecosystem Hadoop vs RDBMSes Conclusion Amr Awadallah, Cloudera Inc 2
8 Current Storage Systems Can't Compute
[Diagram: Instrumentation; Collection; Storage Farm for Unstructured Data (20TB/day), Mostly Append; ETL Grid; RDBMS (200GB/day); Interactive Apps. Callouts: Filer heads are a bottleneck; Ad hoc Queries & Data Mining; Non-Consumption]
12 The Solution: A Store-Compute Grid
[Diagram: Instrumentation; Collection; Storage + Computation, Mostly Append; ETL and Aggregations; RDBMS; Interactive Apps; Batch Apps; Ad hoc Queries & Data Mining]
19 What is Hadoop?
- A scalable fault-tolerant grid operating system for data storage and processing
- Its scalability comes from the marriage of:
  - HDFS: Self-Healing High-Bandwidth Clustered Storage
  - MapReduce: Fault-Tolerant Distributed Processing
- Operates on unstructured and structured data
- A large and active ecosystem (many developers and additions like HBase, Hive, Pig, ...)
- Open source under the friendly Apache License
30 Hadoop History
- 2002: Doug Cutting and Mike Cafarella started working on Nutch
- 2003-2004: Google publishes GFS and MapReduce papers
- 2004: Cutting adds DFS & MapReduce support to Nutch
- 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
- 2007: NY Times converts 4TB of archives over 100 EC2 instances
- 2008: Web-scale deployments at Y!, Facebook, Last.fm
- April 2008: Yahoo does fastest sort of a TB, 3.5 mins over 910 nodes
- May 2009: Yahoo does fastest sort of a TB, 62 secs over 1460 nodes; Yahoo sorts a PB in 16.25 hours over 3658 nodes
- June 2009, Oct 2009: Hadoop Summit (750), Hadoop World (500)
- September 2009: Doug Cutting joins Cloudera
35 Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
37 HDFS: Hadoop Distributed File System
- Block Size = 64MB
- Replication Factor = 3
- Cost/GB is a few ¢/month vs. $/month
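The block size and replication factor on this slide imply some simple storage arithmetic. The following is a minimal sketch of it, not Hadoop code; the function name `hdfs_footprint` and the 1 GB example file are assumptions made for illustration.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # replication factor quoted on the slide


def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw MB stored across the cluster).

    Hypothetical helper: a file is cut into fixed-size blocks and every
    block is stored REPLICATION_FACTOR times.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_mb


# Example with an assumed 1 GB (1024 MB) file.
blocks, raw_mb = hdfs_footprint(1024)
print(blocks, raw_mb)
```

Under these assumptions, each logical byte costs three bytes of raw disk, which is why the cost comparison on the slide depends on cheap commodity disks.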
39 MapReduce: Distributed Processing
43 MapReduce Example for Word Count
In SQL: SELECT word, COUNT(1) FROM docs GROUP BY word;
As a Unix pipeline: cat *.txt | mapper.pl | sort | reducer.pl > out.txt
[Diagram: Split 1..N (docid, text), e.g. "To Be Or Not To Be?" -> Map 1..M emit (words, counts), e.g. Be, 12; Be, 5; Be, 7; Be, 6 -> Shuffle -> Reduce 1..R receive (sorted words, counts) and emit (sorted words, sum of counts), e.g. Be, 30 -> Output File 1..R]
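The map, shuffle, and reduce stages of this slide can be imitated in a single process. The sketch below is a plain Python simulation, not the Hadoop API; the function names and the second input split are assumptions for illustration, and a real job would run many mappers and reducers in parallel on different nodes.

```python
import re
from itertools import groupby


def mapper(docid, text):
    """Map stage: emit a (word, 1) pair for every word in one split."""
    for word in re.findall(r"[a-z']+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle stage: sort pairs by word so equal keys become adjacent."""
    return sorted(pairs, key=lambda kv: kv[0])


def reducer(word, counts):
    """Reduce stage: sum the counts received for one word."""
    return word, sum(counts)


def word_count(splits):
    mapped = [pair for docid, text in splits for pair in mapper(docid, text)]
    result = {}
    for word, group in groupby(shuffle(mapped), key=lambda kv: kv[0]):
        key, total = reducer(word, (count for _, count in group))
        result[key] = total
    return result


# Two assumed input splits as (docid, text) records.
splits = [(1, "To Be Or Not To Be?"), (2, "To be is to do")]
print(word_count(splits))
```

The sort in `shuffle` plays the same role as `sort` in the Unix pipeline: it guarantees that each reducer call sees all counts for a word together.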
44 Hadoop High-Level Architecture
- Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
- Name Node: Maintains mapping of file blocks to data node slaves
- Job Tracker: Schedules jobs across task tracker slaves
- Data Node: Stores and serves blocks of data
- Task Tracker: Runs tasks (work units) within a job
- Data Node and Task Tracker share a physical node
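The Name Node's role as a block-to-node directory can be pictured with a small data structure. This is a toy sketch only: the class `ToyNameNode`, its methods, and the round-robin replica placement are assumptions for illustration and do not reflect HDFS's actual placement policy or API.

```python
class ToyNameNode:
    """Hypothetical in-memory map from (file path, block index) to data nodes."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        if replication > len(self.data_nodes):
            raise ValueError("need at least as many data nodes as replicas")
        self.replication = replication
        self.block_map = {}   # (path, block_index) -> list of data node names
        self._next = 0        # round-robin cursor (assumed placement rule)

    def add_file(self, path, num_blocks):
        """Record where each block of a new file is stored."""
        n = len(self.data_nodes)
        for block_index in range(num_blocks):
            replicas = [self.data_nodes[(self._next + r) % n]
                        for r in range(self.replication)]
            self.block_map[(path, block_index)] = replicas
            self._next = (self._next + 1) % n

    def locate(self, path, block_index):
        """What a client asks for: which data nodes serve this block."""
        return self.block_map[(path, block_index)]


name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
name_node.add_file("/logs/a", num_blocks=2)
print(name_node.locate("/logs/a", 0))
print(name_node.locate("/logs/a", 1))
```

The point of the sketch is the separation of concerns on the slide: the Name Node holds only metadata, while the data nodes it names store and serve the blocks themselves.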
48 Apache Hadoop Ecosystem
[Diagram: ETL Tools, BI Reporting, RDBMS; Pig (Data Flow); Hive (SQL); Sqoop (Streaming/Pipes APIs); MapReduce (Job Scheduling/Execution System); HBase (key-value store); HDFS (Hadoop Distributed File System); ZooKeeper (Coordination); Avro (Serialization)]
51 Use The Right Tool For The Right Job
Hadoop: when to use?
- Affordable Storage/Compute
- Structured or Not (Agility)
- Resilient Auto Scalability
Relational Databases: when to use?
- Interactive Reporting (<1sec)
- Multistep Transactions
- Interoperability
57 Economics of Hadoop
- Typical Hardware:
  - Two Quad Core Nehalems
  - 24GB RAM
  - 12 * 1TB SATA disks (JBOD mode, no need for RAID)
  - 1 Gigabit Ethernet card
- Cost/node: $5K/node
- Effective HDFS Space:
  - ¼ reserved for temp shuffle space, which leaves 9TB/node
  - 3-way replication leads to 3TB effective HDFS space/node
  - But assuming 7x compression that becomes ~20TB/node
- Effective Cost per user TB: $250/TB
- Other solutions cost in the range of $5K to $100K per user TB
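The cost figure on this slide follows from a short chain of arithmetic, reproduced below as a sketch. The function name and parameter names are assumptions; the default values are the ones quoted on the slide. Note that the exact product is 21TB per node, which the slide rounds to about 20TB, giving the rounded $250/TB.

```python
def effective_cost_per_tb(disks=12, disk_tb=1.0, shuffle_reserve=0.25,
                          replication=3, compression=7.0,
                          node_cost_usd=5000.0):
    """Return (user TB per node, cost in USD per user TB)."""
    raw_tb = disks * disk_tb                          # 12 TB of raw disk
    after_shuffle = raw_tb * (1 - shuffle_reserve)    # 9 TB left for HDFS
    after_replication = after_shuffle / replication   # 3 TB of unique data
    user_tb = after_replication * compression         # 21 TB, "~20TB" on the slide
    return user_tb, node_cost_usd / user_tb


user_tb, cost = effective_cost_per_tb()
print(f"{user_tb:.0f} TB/node, ${cost:.0f} per user TB")
```

Changing the assumed compression ratio has the largest effect on the result, so the $250/TB figure should be read as dependent on how compressible the data is.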
58 Sample Talks from Hadoop World '09
- VISA: Large Scale Transaction Analysis
- JP Morgan Chase: Data Processing for Financial Services
- China Mobile: Data Mining Platform for Telecom Industry
- Rackspace: Cross Data Center Log Processing
- Booz Allen Hamilton: Protein Alignment using Hadoop
- eHarmony: Matchmaking in the Hadoop Cloud
- General Sentiment: Understanding Natural Language
- Yahoo!: Social Graph Analysis
- Visible Technologies: Real-Time Business Intelligence
- Facebook: Rethinking the Data Warehouse with Hadoop and Hive
Slides and Videos at
59 Cloudera Desktop
61 Conclusion
Hadoop is a data grid operating system which provides an economically scalable solution for storing and processing large amounts of unstructured or structured data over long periods of time.
62 Contact Information
Amr Awadallah, CTO, Cloudera Inc.
Online Training Videos and Info:
63 (c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.
Big Data Big Data/Data Analytics & Software Development
Big Data Big Data/Data Analytics & Software Development Danairat T. [email protected], 081-559-1446 1 Agenda Big Data Overview Business Cases and Benefits Hadoop Technology Architecture Big Data Development
Apache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer [email protected], twitter: @awadallah Hadoop Past
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
Hadoop and its Usage at Facebook. Dhruba Borthakur [email protected], June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur [email protected], June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Application Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
BIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
The Future of Data Management
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class
Cost-Effective Business Intelligence with Red Hat and Open Source
Cost-Effective Business Intelligence with Red Hat and Open Source Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 1 Agenda Introductions Quick survey What is BI?: reporting,
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Apache Hadoop in the Enterprise. Dr. Amr Awadallah, CTO/Founder @awadallah, [email protected]
Apache Hadoop in the Enterprise Dr. Amr Awadallah, CTO/Founder @awadallah, [email protected] Cloudera The Leader in Big Data Management Powered by Apache Hadoop The Leading Open Source Distribution of Apache
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Apache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. [email protected], twicer: @awadallah
Apache Hadoop: The Pla/orm for Big Data Amr Awadallah CTO, Founder, Cloudera, Inc. [email protected], twicer: @awadallah 1 The Problems with Current Data Systems BI Reports + Interac7ve Apps RDBMS (aggregated
Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect
Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.
Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
The Future of Data Management with Hadoop and the Enterprise Data Hub
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah Cofounder & CTO, Cloudera, Inc. Twitter: @awadallah 1 2 Cloudera Snapshot Founded 2008, by former employees of Employees
Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Hadoop for MySQL DBAs + 1 About me Sarah Sproehnle, Director of Educational Services @ Cloudera Spent 5 years at MySQL At Cloudera for the past 2 years [email protected] 2 What is Hadoop? An open-source
BIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
Introduction to Analytics and Big Data - Hadoop. Rob Peglar EMC Isilon
Introduction to Analytics and Big Data - Hadoop Rob Peglar EMC Isilon SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies and individual members may use
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
L1: Introduction to Hadoop
L1: Introduction to Hadoop Feng Li [email protected] School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Hadoop Distributed File System. Jordan Prosch, Matt Kipps
Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?
Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
Constructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
The Enterprise Data Hub and The Modern Information Architecture
The Enterprise Data Hub and The Modern Information Architecture Dr. Amr Awadallah CTO & Co-Founder, Cloudera Twitter: @awadallah 1 2013 Cloudera, Inc. All rights reserved. Cloudera Overview The Leader
Big Data and Industrial Internet
Big Data and Industrial Internet Keijo Heljanko Department of Computer Science and Helsinki Institute for Information Technology HIIT School of Science, Aalto University [email protected] 16.6-2015
Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
CS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
A Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
ITG Software Engineering
Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Using Hadoop for Webscale Computing. Ajay Anand Yahoo! [email protected] Usenix 2008
Using Hadoop for Webscale Computing Ajay Anand Yahoo! [email protected] Agenda The Problem Solution Approach / Introduction to Hadoop HDFS File System Map Reduce Programming Pig Hadoop implementation
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Data Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
Hadoop Ecosystem By Rahim A.
Hadoop Ecosystem By Rahim A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer [email protected] Assistants: Henri Terho and Antti
Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics
In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning
Hadoop: A Framework for Data-Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data-Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang 2011-10
Application and practice of parallel cloud computing in ISP Guangzhou Institute of China Telecom Zhilan Huang 2011-10 Outline Mass data management problem Applications of parallel cloud computing in ISPs
Chase Wu New Jersey Institute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Institute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov 03.12.2015 Agenda Introduction Research goals
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Hadoop Trends and Practical Use Cases. April 2014
Hadoop Trends and Practical Use Cases John Howey Cloudera [email protected] Kevin Lewis Cloudera [email protected] April 2014 1 Agenda Hadoop Overview Latest Trends in Hadoop Enterprise Ready Beyond
BITKOM & NIK - Big Data: Where Are the Opportunities for Small and Medium-Sized Businesses?
BITKOM & NIK - Big Data: Where Are the Opportunities for Small and Medium-Sized Businesses? The Big Data Buzz big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
White Paper: What You Need To Know About Hadoop
CTOlabs.com White Paper: What You Need To Know About Hadoop June 2011 A White Paper providing succinct information for the enterprise technologist. Inside: What is Hadoop, really? Issues the Hadoop stack
Certified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
Big Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technology, Solapur, Maharashtra,
WHITE PAPER USING CLOUDERA TO IMPROVE DATA PROCESSING
WHITE PAPER USING CLOUDERA TO IMPROVE DATA PROCESSING Using Cloudera to Improve Data Processing CLOUDERA WHITE PAPER 2 Table of Contents What is Data Processing? 3 Challenges 4 Flexibility and Data Quality
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
HDP Hadoop From concept to deployment.
HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some
Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing
Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Wayne W. Eckerson Director of Research, TechTarget Founder, BI Leadership Forum Business Analytics
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Yahoo! Grid Services Where Grid Computing at Yahoo! is Today
Yahoo! Grid Services Where Grid Computing at Yahoo! is Today Marco Nicosia Grid Services Operations [email protected] What is Apache Hadoop? Distributed File System and Map-Reduce programming platform
Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
Testing Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
MySQL and Hadoop. Percona Live 2014 Chris Schneider
MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for
Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012
Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster Nov 7, 2012 Who I Am Robert Lancaster Solutions Architect, Hotel Supply Team [email protected] @rob1lancaster Organizer of Chicago
Big Data and Market Surveillance. April 28, 2014
Big Data and Market Surveillance April 28, 2014 Copyright 2014 Scila AB. All rights reserved. Scila AB reserves the right to make changes to the information contained herein without prior notice. No part
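Hadoop and ecosystem * The statements in this document represent only the author's personal views * Some of the figures in this document come from the Internet. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * The statements in this document represent only the author's personal views * Some of the figures in this document come from the Internet Information Management 2012 IBM Corporation Agenda Hadoop technology Hadoop overview Hadoop 1.x Hadoop 2.x Hadoop ecosystem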
Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies
Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies 1 Copyright 2011, Oracle and/or its affiliates. All rights Big Data, Advanced Analytics:
Big Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
