Google Bing Daytona Microsoft Research
|
|
- Matilda Lindsey
- 8 years ago
- Views:
Transcription
1
2 Google Bing Daytona Microsoft Research
3
4 Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch...
5
6 An increased number and variety of data sources that generate large quantities of data Sensors (e.g. location, acoustical, ) Web 2.0 (e.g. twitter, wikis, ) Web clicks Realization that data was too valuable to delete Dramatic decline in the cost of hardware, especially storage If storage was still $00/GB there would be no big data revolution underway
7 Old guard Use a parallel database system ebay 0PB on 256 nodes Young turks Use a NoSQL system Facebook - 20PB on 2700 nodes Bing 50PB on 40K nodes Just a subset of the data these companies manage
8 NOSQL What s in the name...
9 NO to SQL It s not about saying that SQL should never be used, or that SQL is dead...
10 NOT Only SQL It s about recognizing that for some problems other storage solutions are better suited! Would have been more intuitive if named NO JOINS
11 More data model flexibility JSON as a data model, simple grammar, maps to ds, No schema first requirement Relaxed consistency models such as eventual consistency Willing to trade consistency for availability and speed See Sebastian Burckhardt s talk(s) Low upfront software costs Never learned anything but C/Java in school Hate declarative languages like SQL Faster time to insight from data acquisition
12 SQL: Data Arrives Cleanse the data Transform the data Load the data Derive a schema 2 RDBMS SQL Queries 6 NoSQL: Data Arrives No cleansing! No ETL! No load! NOSQL System Application Program 2 Analyze data where it lands!
13 Key/Value Stores Examples: MongoDB, CouchBase, Cassandra, Windows Azure, Flexible data model such as JSON Records sharded across nodes in a cluster by hashing on key Single record retrievals based on key Hadoop Scalable fault tolerant framework for storing and processing MASSIVE data sets Typically no data model Records stored in distributed file system
14 Relational Databases vs. Hadoop? (what s the future?) Relational databases and Hadoop are designed to meet different needs 4
15 Structured & Unstructured Relational DB Systems Structured data w/ known schema ACID Transactions SQL Rigid Consistency Model ETL Longer time to insight Maturity, stability, efficiency NoSQL Systems (Un)(Semi)structured data w/o schema No ACID No transactions No SQL Eventual consistency No ETL Faster time to insight Flexibility
16 2003 MR/GFS 2006 Hadoop Requirements: Store Hadoop = HDFS + MapReduce Massive amounts of click stream data that had to be stored and analyzed Store Scalable to PBs and 000s of nodes Highly fault tolerant Simple to program against Process GFS + MapReduce distributed $ fault-tolerant Process new programming paradigm
17 Scalability and a high degree of fault tolerance Ability to quickly analyze massive collections of records without forcing data to first be modeled, cleansed, and loaded Easy to use programming paradigm for writing and executing analysis programs that scale to 000s of nodes and PBs of data Low, up front software and hardware costs Think data warehousing for Big Data
18 Zookeeper Avro (Serialization) HDFS distributed, fault tolerant file system MapReduce framework for writing/executing distributed, fault tolerant algorithms Hive & Pig SQL-like declarative languages Sqoop package for moving data between HDFS and relational DB systems + Others BI Reporting ETL Tools RDBMS Hive & Pig Map/ Reduce HBase Sqoop HDFS
19 Underpinnings of the entire Hadoop ecosystem HDFS design goals: Scalable to 000s of nodes Assume failures (hardware and software) are common Targeted towards small numbers of very large files Write once, read multiple times Traditional hierarchical file organization of directories and files Highly portable Hive & Pig Map/ Reduce HDFS Sqoop
20 Large File Let s color-code them 6440MB Block Block 2 Block 3 Block 4 Block 5 Block 6 Block 00 Block 0 64MB 64MB 64MB 64MB 64MB 64MB 64MB 40MB e.g., Block Size = 64MB Files are composed of set of blocks Typically 64MB in size Each block is stored as a separate file in the local file system (e.g. NTFS)
21 Block Block 2 Block 3 Block 2 Block 2 Block 3 Block Block 3 Block Default placement policy: Node Node 2 e.g., Replication factor = 3 Node 3 Node 4 Node 5 First copy is written to the node creating the file (write affinity) Second copy is written to a data node within the same rack (to minimize cross-rack network traffic) Third copy is written to a data node in a different rack (to tolerate switch failures)
22 NameNode one instance per cluster Responsible for filesystem metadata operations on cluster, replication and locations of file blocks NameNode Master Backup Node responsible for backup of NameNode CheckpointNode or BackupNode (backups) DataNodes one instance on each node of the cluster Responsible for storage of file blocks Serving actual file data to client DataNode DataNode DataNode DataNode Slaves
23 NameNode BackupNode namespace backups (heartbeat, balancing, replication, etc.) DataNode DataNode DataNode DataNode DataNode nodes write to local disk
24 Giant File HDFS Client {node2, {node, node3, node4, node2, node 4} 5} 3} (based on replication factor ) (3 by default) Name node tells client where to store each block of the file Client transfers block directly to assigned data nodes NameNode BackupNode and so on DataNode DataNode DataNode DataNode DataNode
25 Giant File HDFS Client return locations of blocks of file stream blocks from data nodes NameNode BackupNode DataNode DataNode DataNode DataNode DataNode
26 HDFS was designed with the expectation that failures (both hardware and software) would occur frequently Failure types: Disk errors and failures DataNode failures Switch/Rack failures NameNode failures Datacenter failures DataNode NameNode
27 NameNode BackupNode Blocks are NameNode auto-replicated detects DataNode on remaining loss nodes to satisfy replication factor DataNode DataNode DataNode DataNode DataNode
28 NameNode BackupNode NameNode Blocks detects are re-balanced new DataNode and is added re-distributed to cluster DataNode DataNode DataNode DataNode DataNode
29 Highly scalable 000s of nodes and massive (00s of TB) files Large block sizes to maximize sequential I/O performance No use of mirroring or RAID. Reduce cost Use one mechanism (triply replicated blocks) to deal with a wide variety of failure types rather than multiple different mechanisms Negatives Why? Block locations and record placement is invisible to higher level software Makes it impossible to employ many optimizations successfully employed by parallel DB systems
30 Zookeeper Avro (Serialization) HDFS distributed, fault tolerant file system MapReduce framework for writing/executing distributed, fault tolerant algorithms Hive & Pig SQL-like declarative languages Sqoop package for moving data between HDFS and relational DB systems + Others BI Reporting Hive & Pig ETL Tools Map/ Reduce RDBMS Sqoop HDFS
31 Programming framework (library and runtime) for analyzing data sets stored in HDFS MapReduce jobs are composed of two functions: map() reduce() sub-divide & conquer combine & reduce cardinality User only writes the Map and Reduce functions MR framework provides all the glue and coordinates the execution of the Map and Reduce jobs on the cluster. Fault tolerant Scalable
32 REDUCE MAP Essentially, it s. Take a large problem and divide it into sub-problems 2. Perform the same function on all sub-problems DoWork() DoWork() DoWork() 3. Combine the output from all sub-problems Output
33 DataNode <key i, value i > Mapper <key A, value a > <key B, value b > <key C, value c > <key A, list(value a, value b, value c, )> Reducer DataNode DataNode <key i, value i > <key i, value i > Mapper Mapper <key A, value a > <key B, value b > <key C, value c > <key A, value a > <key B, value b > <key C, value c > Sort and group by key <key B, list(value a, value b, value c, )> Reducer Output <key C, list(value a, value b, value c, )> <key i, value i > Mapper <key A, value a > <key B, value b > <key C, value c > Reducer
34 - Coordinates all M/R tasks & events - Manages job queues and scheduling - Maintains and Controls TaskTrackers - Moves/restarts map/reduce tasks if needed - Uses checkpointing to combat single points of failure Master hadoop-namenode JobTracker NameNode JobTracker controls and heartbeats TaskTracker nodes MapReduce Layer HDFS Layer Execute individual map and reduce tasks as assigned by JobTracker MapReduce (in separate JVM) Layer HDFS Layer Slaves hadoopdatanode hadoopdatanode2 hadoopdatanode3 hadoopdatanode4 TaskTracker TaskTracker TaskTracker TaskTracker TaskTrackers store temp data DataNode Temporary DataNode data stored to DataNode local file system DataNode
35 MR Client Submit jobs to JobTracker JobTracker jobs get queued map() s are assigned to TaskTrackers (HDFS DataNode reduce phase locality begins aware) mappers spawned in separate JVM and execute TaskTracker TaskTracker TaskTracker TaskTracker Mapper Mapper Mapper Mapper Reducer Reducer Reducer Reducer mappers store temp results Temporary data stored to local file system
36 DataNode DataNode2 DataNode3 Get sum sales grouped by zipcode (custid, zipcode, amount) $75 Blocks of the Sales file in HDFS $ $ $ $ $ $ $ $ $ $65 $ $ $ $ $ $5 Mapper Group By $ $ $ $ $ $ $5 One output bucket per reduce task $ $ $5 Mapper $ $25 Group By 025 $5 025 $ $ $ $5 Map tasks 4433 $ $25
37 Mapper Shuffle Mapper Reducer $ $ $ $ $ $5 Sort $ $5 SUM $ $ $ $55 Reducer $ $ $ $ $95 Sort 0025 $ $ $ $25 SUM 0025 $ $ $ $ $55 Reduce tasks 4433 $55 Reducer 025 $5 025 $ $ $ $ $ $5 025 $5 Sort 025 $5 025 $ $ $22 SUM 025 $ $97
38 Actual number of Map tasks M is generally made much larger than the number of nodes used. Why? Helps deal with data skew and failure Example: Say M = 0,000 and W = 00 (W is number of Map workers) Each worker does (0,000 / 00) = 00 Map tasks If it suffers from skew or fails the uncompleted work can easily be shifted to another worker Skew with Reducers is still a problem Example: In a query = get sales by zipcodes, some zipcodes (e.g. 0025) may have many more sales records than others
39 Like HDFS, MapReduce framework designed to be highly fault tolerant Worker (Map or Reduce) failures Detected by periodic Master pings Map or Reduce jobs that fail are reset and then given to a different node If a node failure occurs after the Map job has completed, the job is redone and all Reduce jobs are notified Master failure (Hadoop 2..0-beta 8/5/203) If the master fails there is now failover
40 Highly fault tolerant Relatively easy to write arbitrary distributed computations over very large amounts of data MR framework removes burden of dealing with failures from programmer Schema embedded in application code A lack of shared schema Makes sharing data between applications difficult Makes lots of DBMS goodies such as indices, integrity constraints, views, impossible No declarative query language
41 Pace of cloud-scale data processing available in the open source community is accelerating Workflow Management Chubby Azkaban ZooKeeper Big Top Data Analytics Data Processing Protoco l Buffers Pig Dremel & Pregel Giraph FlumeJava Crunch Storage and File Systems BigTable Voldemort Spanner GFS MapReduce Colossus
42 HBase saw a sharp rise in it s code base due, in large part, to contributions from Cloudera, Twitter, and Facebook. 570K Lines of code Lines of Code & 50K 00+ Adopters # of Adopters 5+ New Committers Cumulative Committers Note: Number of adopters is non-exhaustive Represents one committer
43 Increased integration with Hadoop helped adoption of Cassandra grow from 20+ to 68+ companies between 20 and K 20K Lines of code Lines of Code 68+ Adopters & # of Adopters 50K 75K 20+ New Committers Cumulative Committers Note: Number of adopters is non-exhaustive Represents one committer
44 Apache Mahout is driving a new era of machine learning analytics. Lines of Code 50K 250K 300K 26+ Adopters Lines of code & # of Adopters 0K 50K New Committers Cumulative Committers Represents one contributor
45 At LinkedIn in product development and operations. People You May Know What it takes 20B New daily relationships Who s Viewed My Profile 82 Hadoop jobs 6TB ~5 Intermediate data Weekly test algorithms Career Explorer
46 46
NoSQL for SQL Professionals William McKnight
NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationMapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationOverview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationData-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationApplication Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationMySQL and Hadoop. Percona Live 2014 Chris Schneider
MySQL and Hadoop Percona Live 2014 Chris Schneider About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationLarge scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
More informationConstructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationInternational Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
More informationA Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
More informationHadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
More informationHadoop Distributed File System. Jordan Prosch, Matt Kipps
Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?
More informationMASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015
7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE
More informationInternals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
More informationGetting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
More informationBig Data Introduction
Big Data Introduction Ralf Lange Global ISV & OEM Sales 1 Copyright 2012, Oracle and/or its affiliates. All rights Conventional infrastructure 2 Copyright 2012, Oracle and/or its affiliates. All rights
More informationTake An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
More informationBIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationBig Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
More information<Insert Picture Here> Big Data
Big Data Kevin Kalmbach Principal Sales Consultant, Public Sector Engineered Systems Program Agenda What is Big Data and why it is important? What is your Big
More information!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
More informationBIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
More informationDeploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution
More informationLecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015
Lecture 2 (08/31, 09/02, 09/09): Hadoop Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015 K. Zhang BUDT 758 What we ll cover Overview Architecture o Hadoop
More informationBig Data Analytics: Hadoop-Map Reduce & NoSQL Databases
Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases Abinav Pothuganti Computer Science and Engineering, CBIT,Hyderabad, Telangana, India Abstract Today, we are surrounded by data like oxygen. The exponential
More informationIntro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
More informationApache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
More informationBBM467 Data Intensive ApplicaAons
Hace7epe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive ApplicaAons Dr. Fuat Akal akal@hace7epe.edu.tr Problem How do you scale up applicaaons? Run jobs processing 100 s of terabytes
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationWeekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
More informationSQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford
SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationUsing Hadoop for Webscale Computing. Ajay Anand Yahoo! aanand@yahoo-inc.com Usenix 2008
Using Hadoop for Webscale Computing Ajay Anand Yahoo! aanand@yahoo-inc.com Agenda The Problem Solution Approach / Introduction to Hadoop HDFS File System Map Reduce Programming Pig Hadoop implementation
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationHadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013
Hadoop 101 Lars George NoSQL- Ma4ers, Cologne April 26, 2013 1 What s Ahead? Overview of Apache Hadoop (and related tools) What it is Why it s relevant How it works No prior experience needed Feel free
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris
More informationCommunicating with the Elephant in the Data Center
Communicating with the Elephant in the Data Center Who am I? Instructor Consultant Opensource Advocate http://www.laubersoltions.com sml@laubersolutions.com Twitter: @laubersm Freenode: laubersm Outline
More informationQsoft Inc www.qsoft-inc.com
Big Data & Hadoop Qsoft Inc www.qsoft-inc.com Course Topics 1 2 3 4 5 6 Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:
More informationAccelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
More informationNoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
More informationCertified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
More informationHadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.
Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software
More informationL1: Introduction to Hadoop
L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationIntroduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
More informationCOURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
More informationBig Data Management in the Clouds. Alexandru Costan IRISA / INSA Rennes (KerData team)
Big Data Management in the Clouds Alexandru Costan IRISA / INSA Rennes (KerData team) Cumulo NumBio 2015, Aussois, June 4, 2015 After this talk Realize the potential: Data vs. Big Data Understand why we
More informationSession: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
More informationBIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets
More informationextensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010
System/ Scale to Primary Secondary Joins/ Integrity Language/ Data Year Paper 1000s Index Indexes Transactions Analytics Constraints Views Algebra model my label 1971 RDBMS O tables sql-like 2003 memcached
More informationHadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010
Hadoop: Distributed Data Processing Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010 Outline Scaling for Large Data Processing What is Hadoop? HDFS and MapReduce
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationCS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
More informationSurvival Tips for Big Data Impact on Performance Share Pittsburgh Session 15404
Survival Tips for Big Data Impact on Performance Share Pittsburgh Session 15404 Laura Knapp WW Business Consultant Laurak@aesclever.com ipv6hawaii@outlook.com 08/04/2014 Applied Expert Systems, Inc. 2014
More informationBig Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationCan the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
More informationComparison of the Frontier Distributed Database Caching System with NoSQL Databases
Comparison of the Frontier Distributed Database Caching System with NoSQL Databases Dave Dykstra dwd@fnal.gov Fermilab is operated by the Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359
More informationA Survey of Distributed Database Management Systems
Brady Kyle CSC-557 4-27-14 A Survey of Distributed Database Management Systems Big data has been described as having some or all of the following characteristics: high velocity, heterogeneous structure,
More informationHADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationA Survey on Big Data Concepts and Tools
A Survey on Big Data Concepts and Tools D. Rajasekar 1, C. Dhanamani 2, S. K. Sandhya 3 1,3 PG Scholar, 2 Assistant Professor, Department of Computer Science and Engineering, Sri Krishna College of Engineering
More informationHADOOP AND THE BI ENVIRONMENT
1 Toronto Edition! HADOOP AND THE BI ENVIRONMENT TDWI Toronto Chapter March 1, 2012 An introduction to Hadoop, its components, and position in tomorrow s BI environments 2 HADOOP AND THE BI ENVIRONMENT
More informationSystems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
More informationBig Data Processing using Hadoop. Shadi Ibrahim Inria, Rennes - Bretagne Atlantique Research Center
Big Data Processing using Hadoop Shadi Ibrahim Inria, Rennes - Bretagne Atlantique Research Center Apache Hadoop Hadoop INRIA S.IBRAHIM 2 2 Hadoop Hadoop is a top- level Apache project» Open source implementation
More informationBigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
More informationВовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН
Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН Zettabytes Petabytes ABC Sharding A B C Id Fn Ln Addr 1 Fred Jones Liberty, NY 2 John Smith?????? 122+ NoSQL Database
More informationHadoop Introduction. Olivier Renault Solution Engineer - Hortonworks
Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013
More information