Hadoop 2.0 Introduction with HDP for Windows. Seele Lin


1 Hadoop 2.0 Introduction with HDP for Windows Seele Lin

2 Who am I Speaker: 林彥辰 (a.k.a. Seele Lin) Mail: [email protected] Experience: 2010~Present; 2013~2014 Trainer of Hortonworks Certified Training lectures HCAHD (Hortonworks Certified Apache Hadoop Developer) and HCAHA (Hortonworks Certified Apache Hadoop Administrator)

3 Agenda What is Big Data The Need for Hadoop Hadoop Introduction What is Hadoop 2.0 Hadoop Architecture Fundamentals What is HDFS What is MapReduce What is YARN Hadoop eco-systems HDP for Windows What is HDP How to install HDP on Windows The advantages of HDP What's Next Conclusion Q&A

4 What is Big Data

5 What is Big Data? 1. In what timeframe do we now create the same amount of information that we created from the dawn of civilization until 2003? 2. 90% of the world's data was created in the last (how many years)? 3. This is data from a 2010 report! Answers: 2 days; 2 years

6 How large can it be? 1ZB = 1000 EB = 1,000,000 PB = 1,000,000,000 TB

7 Every minute

8 The definition? "Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it too." (Dan Ariely) A set of files? A database? A single file?

9 Big Data Includes All Types of Data Structured: pre-defined schema (relational database systems) Semi-structured: inconsistent structure, cannot be stored in rows in a single table, often has nested structure (logs, tweets) Unstructured: irregular structure, or parts of it lack structure (pictures, video) Often time-sensitive and immutable

10 6 Key Hadoop DATA TYPES 1. Sentiment: how your customers feel 2. Clickstream: website visitors' data 3. Sensor/Machine: data from remote sensors and machines 4. Geographic: location-based data 5. Server Logs 6. Text: millions of web pages, e-mails, and documents

11 4 V s of Big Data

12 Next Product to Buy (NPTB) Business Problem: Telecom product portfolios are complex There are many cross-sell opportunities to the installed base Sales associates use in-person conversations to guess about NPTB recommendations, with little supporting data Solution: Hadoop gives telcos the ability to make confident NPTB recommendations, based on data from all their customers Confident NPTB recommendations empower sales associates and improve their interactions with customers Use the HDP data lake to reduce sales friction and create an NPTB advantage like Amazon's advantage in e-commerce

13 Use case Walmart prediction Beer Diapers Friday Revenue?

14 Localized, Personalized Promotions Business Problem Telcos can geo-locate their mobile subscribers They could create localized and personalized promotions This requires connections with both deep historical data and realtime streaming data Those connections have been expensive and complicated Solution Hadoop brings the data together to inexpensively localize and personalize promotions delivered to mobile devices Notify subscribers about local attractions, events and sales that align with their preferences and location Telcos can sell these promotional services to retailers

15 360 View of the Customer Business Problem Retailers interact with customers across multiple channels Customer interaction and purchase data is often siloed Few retailers can correlate customer purchases with marketing campaigns and online browsing behavior Merging data in relational databases is expensive Solution Hadoop gives retailers a 360 view of customer behavior Store data longer & track phases of the customer lifecycle Gain competitive advantage: increase sales, reduce supply chain expenses and retain the best customers

16 Use case: the Target case Target mined their customer data and sent coupons to shoppers who had high pregnancy prediction scores. One angry father stormed into a Target to yell at them for sending his daughter coupons for baby clothes and cribs. Guess what: she was pregnant, and hadn't told her father yet.

17 Changes in Analyzing Data Big data is fundamentally changing the way we analyze information. We can now analyze vast amounts of data rather than evaluating sample sets. Historically we have had to look at causes; now we can look at patterns and correlations in data that give us a much better perspective.

18 Recent case 1:

19 Recent case 1: Practice on LINE

20 Recent case 2: in Taiwan Media analysis of the 2014 Taipei City mayoral election IMHO, from the blog 黑貘來說: 破解社群與APP行銷 (Cracking Social and App Marketing), 行銷絕不等於買廣告 (Marketing Is Never Just Buying Ads), 台北市長選舉 柯文哲與連... (the Taipei mayoral election: Ko Wen-je and Lien...)

21 Scale up or Scale out?

22 Guess what Traditionally, computation has been processor-bound For decades, the primary push was to increase the computing power of a single machine Faster processor, more RAM Distributed systems evolved to allow developers to use multiple machines for a single job At compute time, data is copied to the compute nodes

23 Scaling with a traditional database Problems: scaling with a queue, sharding the database, fault-tolerance issues, corruption issues

24 NoSQL (Not Only SQL) provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. Column: Accumulo, Cassandra, Druid, HBase, Vertica Document: Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB Key-value: Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike Graph: Allegro, Neo4j, InfiniteGraph, OrientDB, Virtuoso, Stardog

25 First principles (1/2) "At the most fundamental level, what does a data system do?" "A data system answers questions based on information that was acquired in the past." "What is this person's name?" "How many friends does this person have?" A bank account web page answers questions like "What is my current balance?" and "What transactions have occurred on my account recently?"

26 First principles (2/2) "Data" is often used interchangeably with the word "information". You answer questions on your data by running functions that take data as input. The most general-purpose data system can answer questions by running functions that take the entire dataset as input. In fact, any query can be answered by running a function on the complete dataset.

27 Desired Properties of a Big Data System Robust and fault-tolerant Low latency reads and updates Scalable General Extensible Allow ad hoc queries Minimal maintenance

28 The Lambda Architecture There is no single tool that provides a complete solution; you have to use a variety of tools and techniques to build a complete Big Data system. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in realtime by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer

29 The Lambda Architecture model

30 Batch Layer 1 The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset. The master dataset can be thought of as a very large list of records. The batch layer does two things: store an immutable, constantly growing master dataset, and compute arbitrary functions on that dataset. If you're going to precompute views on a dataset, you need to be able to do so for any view and any dataset. There's a class of systems called "batch processing systems" that are built to do exactly what the batch layer requires

31 Batch Layer 2

32 What is a Batch View Everything starts from the "query = function(all data)" equation. You could literally run your query functions on the fly on the complete dataset to get the results, but it would take a huge amount of resources and would be unreasonably expensive. Instead of computing the query on the fly, you read the results from a precomputed view.
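To make the "query = function(all data)" idea concrete, here is a minimal Java sketch (the dataset, keys and counts are hypothetical) contrasting an on-the-fly scan with a lookup in a precomputed batch view:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchViewSketch {
    public static void main(String[] args) {
        // Master dataset: an immutable list of raw page-view records (URL strings)
        List<String> masterDataset = Arrays.asList("/home", "/home", "/products", "/home");

        // query = function(all data): scan the complete dataset on every request (correct but expensive)
        long onTheFly = masterDataset.stream().filter(url -> url.equals("/home")).count();

        // Batch view: precompute counts for every URL once, then serve queries by simple lookup
        Map<String, Long> batchView = new HashMap<>();
        for (String url : masterDataset) {
            batchView.merge(url, 1L, Long::sum);
        }
        long fromView = batchView.getOrDefault("/home", 0L);

        System.out.println(onTheFly + " == " + fromView);   // both print 3
    }
}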

33 How we get Batch View

34 Serving Layer 1

35 Serving Layer 2 The batch layer emits batch views as the result of its functions. The next step is to load the views somewhere so that they can be queried. The serving layer indexes the batch view and loads it up so it can be efficiently queried to get particular values out of the view.

36 Batch and serving layers satisfy almost all properties Robust and fault tolerant Scalable General Extensible Allows ad hoc queries Minimal maintenance

37 Speed Layer 1

38 Speed Layer 2 The speed layer is similar to the batch layer in that it produces views based on data it receives. One big difference is that, in order to achieve the lowest latencies possible, the speed layer doesn't look at all the new data at once; it updates the realtime view as it receives new data instead of recomputing the view from scratch like the batch layer does: "incremental updates" vs. "recomputation updates". Page view example

39 Speed Layer 3 Complexity isolation: complexity is pushed into a layer whose results are only temporary The last piece of the Lambda Architecture is merging the results from the batch and realtime views

40 Summary of the Lambda Architecture

41 Summary of the Lambda Architecture All new data is sent to both the batch layer and the speed layer The master dataset is an immutable, append-only set of data The batch layer pre-computes query functions from scratch The serving layer indexes the batch views produced by the batch layer and makes it possible to get particular values out of a batch view very quickly The speed layer compensates for the high latency of updates to the serving layer. Queries are resolved by getting results from both the batch and realtime views and merging them together
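As a rough sketch of the last point (the view names and numbers are hypothetical), a query merges the precomputed batch view with the small, incrementally updated realtime view:

import java.util.HashMap;
import java.util.Map;

public class LambdaQuerySketch {
    // Batch view: complete but hours old; realtime view: covers only the most recent data
    static Map<String, Long> batchView = new HashMap<>();
    static Map<String, Long> realtimeView = new HashMap<>();

    // Resolve a query by combining both views
    static long pageViews(String url) {
        return batchView.getOrDefault(url, 0L) + realtimeView.getOrDefault(url, 0L);
    }

    public static void main(String[] args) {
        batchView.put("/home", 1_000_000L);      // precomputed by the batch layer
        realtimeView.put("/home", 42L);          // incrementally updated by the speed layer
        System.out.println(pageViews("/home"));  // 1000042
    }
}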

42 The Need for Hadoop SCALE (storage & processing) Traditional Database EDW MPP Analytics NoSQL Hadoop Platform Store and use all types of data Process all the data Scalability Commodity hardware

43 Hadoop as a Data Factory A role Hadoop can play in an enterprise data platform is that of a data factory Structured, semi-structured and raw data Business value Hadoop

44 Hadoop as a Data Lake A larger, more general role Hadoop can play in an enterprise data platform is that of a data lake.

45 Integrating Hadoop ODBC Access for Popular BI Tools (diagram labels: machine-generated data, web logs and click streams, messaging, social media; Hadoop staging area; ODBC to applications & spreadsheets and visualization & intelligence tools; Hadoop connectors to the EDW, data marts, OLTP)

46 Hadoop Introduction

47 The Apache Hadoop project was inspired by Google's MapReduce and Google File System papers. An open-sourced, flexible and available architecture for large-scale computation and data processing on a network of commodity hardware. Hadoop creator: Doug Cutting. Yahoo has been the largest contributor to the project and uses Hadoop extensively in its Web search and Ad business.

48 Hadoop Concepts Distribute the data as it is initially stored in the system Moving Computation is Cheaper than Moving Data Individual nodes can work on data local to those nodes Users can focus on developing applications.

49 Relational Databases vs. Hadoop Schema: required on write (Relational) vs. required on read (Hadoop) Speed: reads are fast vs. writes are fast Governance: standards and structured vs. loosely structured Processing: limited, no data processing vs. processing coupled with data Data types: structured vs. multi- and unstructured Best-fit use: interactive OLAP analytics, complex ACID transactions, operational data store vs. data discovery, processing unstructured data, massive storage/processing

50 Different behaviors between RDBMS and Hadoop (diagram: RDBMS side - Application, Schema, SQL, RDBMS; Hadoop side - Application, Schema, MapReduce, Hadoop)

51 Why use Hadoop, not an RDBMS? Limitations of an RDBMS: Capacity: 100 GB ~ 100 TB Speed Cost: high-end device prices increase faster than linearly, plus software costs for technical support or license fees Too complex A distributed file system is more likely to fit our need: a DFS usually provides backup and fault-tolerance mechanisms, and is cheaper than an RDBMS when the data is really huge

52 What is Hadoop 2.0? The Apache Hadoop 2.0 project consists of the following modules: Hadoop Common: the utilities that provide support for the other Hadoop modules. HDFS: the Hadoop Distributed File System YARN: a framework for job scheduling and cluster resource management. MapReduce: for processing large data sets in a scalable and parallel fashion.

53 Difference between Hadoop 1.0 and 2.0

54 What is YARN Yet Another Resource Negotiator A Jira ticket (MAPREDUCE-279) was raised in January 2008 by Hortonworks co-founder Arun Murthy; YARN is the result of 5 years of subsequent development in the open community. YARN has been tested by Yahoo! since September 2012 and has been in production across 30,000 nodes and 325 PB of data since January. More recently, other enterprises such as Microsoft, eBay, Twitter, XING and Spotify have adopted a YARN-based architecture. "Apache Hadoop YARN wins Best Paper award at SoCC 2013!" - Hortonworks

55 YARN: Taking Hadoop Beyond Batch With YARN, applications run natively in Hadoop (instead of on Hadoop)

56 HDFS Federation /app/hive /app/hbase /home/ Hadoop Hadoop 2.0

57 HDFS High Availability (HA) The Secondary NameNode is not a standby NameNode

58 HDFS High Availability (HA)

59 Hadoop Architecture Fundamentals

60 What is HDFS NameNode Shared multi-petabyte file system for an entire cluster Managed by a single NameNode Multiple DataNodes DataNode DataNode DataNode

61 The Components of HDFS NameNode The master node of HDFS Determines and maintains how the chunks of data are distributed across the DataNodes DataNode Stores the chunks of data, and is responsible for replicating the chunks across other DataNodes

62 Concept: What is the NameNode The NameNode holds metadata for the files; one HDFS cluster has only one set of metadata The NameNode is a single point of failure: there is only one NameNode per HDFS cluster One HDFS cluster has only one namespace and one root directory Metadata is kept in the NameNode's RAM so it can be queried faster 1 GB of RAM can hold the mapping metadata for roughly 1,000,000 blocks If the block size is 64 MB, that metadata may map to about 64 TB of actual data

63 More on the Metadata The NameNode uses two important local files to save the metadata information: fsimage: saves the file directory tree information and the mapping of files to blocks edits: saves the file system journal When a client tries to create or move a file, the operation is first recorded into edits. If the operation succeeds, the metadata in RAM is then changed; fsimage is NOT instantly changed.

64 The NameNode (diagram: NameNode with its fsimage and edits files) 1. When the NameNode starts, it reads the fsimage and edits files. 2. The transactions in edits are merged with fsimage, and edits is emptied. 3. A client application creates a new file in HDFS. 4. The NameNode logs that transaction in the edits file.

65 Metadata in memory vs. on disk
Memory (File Name / Replicas / Block Sequence / Others):
/data/part-0 / 2 / B1, B2, B3 / user, group, ...
/data/part-1 / 3 / B4, B5 / foo, bar, ...
Disk, fsimage (File Name / Replicas / Block Sequence / Others):
/data/part-0 / 3 / B1, B2, B3 / user, group, ...
/data/part-1 / 3 / B4, B5 / user, group, ...
Disk, edits (OP Code / Operands):
OP_SET_REPLICATION / "/data/part-0", 2
OP_SET_OWNER / "/data/part-1", "foo", "bar"

66 Concept: What is the DataNode DataNodes hold the actual blocks Each block is 64 MB or 128 MB in size Each block is replicated three times on the cluster DataNodes communicate with the NameNode through heartbeats

67 Block backup and replication Each block is replicated multiple times Default replica number = 3; the client can modify this configuration All replicas of a block share the same ID, so the system has no need to record which blocks are the same Replicas are placed with rack awareness: the first copy on one rack, the other two copies on another rack but on different machines
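As a small illustration of the configurable replica count, a minimal Java sketch using the standard HDFS FileSystem API (the path and replica count are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        conf.set("dfs.replication", "2");           // default replica count for files this client creates
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file to 2
        fs.setReplication(new Path("/data/part-0"), (short) 2);
        fs.close();
    }
}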

68 The DataNodes (diagram) DataNodes to NameNode: "I'm still alive! This is my latest Blockreport." "I'm here! Here is my latest Blockreport." NameNode: "Replicate block 123 to DataNode 1." DataNode 1, DataNode 2, DataNode 3, DataNode 4

69 Writing data 1. The client sends a request to the NameNode to add a file to HDFS 2. The NameNode tells the client how and where to distribute the blocks 3. The client breaks the data into blocks and distributes the blocks to the DataNodes 4. The DataNodes replicate the blocks (as instructed by the NameNode)
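For illustration, a minimal Java sketch of that client-side step using the standard HDFS FileSystem API (the path and file contents are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // talks to the NameNode named in fs.defaultFS
        // The client asks the NameNode where to put the blocks, then streams them to DataNodes
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}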

70 What is MapReduce Two functions: Mapper: since we are processing a huge amount of data, it's natural to split the input data; the Mapper reads data in the form of key/value pairs M(K1, V1) → list(K2, V2) Reducer: since the input data are split, we need another phase to aggregate the results of each split R(K2, list(V2)) → list(K3, V3)

71 Hadoop 1.0 Basic Core Architecture Mapper Reducer Map Shuffle/Sort Reduce MapReduce Hadoop Distributed File System (HDFS) Hadoop

72 Words to Websites - Simplified From words, provide locations Provides what to display for a search (Note: page rank determines the order) For example, to find URLs with books on them Map input <url, keyword>: books, calendars, sports, finance, celebrity, shoes, books, toolkits, finance, search, operating-system, productivity, system Reduce output <keyword, url>: books, finance, groceries, toolkits

73 Data Model MapReduce works on <key, value> pairs: (Key input, Value input), e.g. (..., "books calendars") → Map (combined with other map results) → (Key intermediate, Value intermediate), e.g. (books, ...) → Reduce → (Key output, Value output), e.g. (books, ...)

74 The M/R concept Job Tracker Heartbeat, Task Report Worker Nodes Task Tracker Task Tracker Task Tracker Task Tracker M M M M M M M M M M M M R R R R R R R R

75 Map -> Shuffle -> Reduce Task Tracker A Mapper A Sort A Task Tracker D Task Tracker B A Mapper B Sort B Fetch B Merge Reducer 0 Task Tracker C C Mapper C Sort C

76 Map -> Shuffle -> Reduce Mapper A Partition + Sort A0 A1 Fetch A0 B0 Merge Reducer 0 Mapper B Partition + Sort B0 B1 C0 Fetch A1 B1 Merge Reducer 1 Mapper C Partition + Sort C0 C1 C1

77 Word Count Example Input Key: offset, Value: line (e.g. 0: The cat sat on the mat; 22: The aardvark sat on the sofa) Map output Key: word, Value: count Reduce output Key: word, Value: sum of counts

78 What is YARN? YARN is a re-architecture of Hadoop that allows multiple applications to run on the same platform

79 Why YARN Support non-MapReduce workloads, reducing the need to move data between Hadoop HDFS and other storage systems Improve scalability: hardware grew from nodes with 16 GB of RAM and 4x1 TB of disk to nodes with more cores, 48-96 GB of RAM and 12x2 TB or 12x3 TB of disk, while MapReduce v1 was designed to scale to production deployments of ~5000 nodes of 2009-vintage hardware Improve cluster utilization: the JobTracker views the cluster as composed of nodes (managed by individual TaskTrackers) with distinct map slots and reduce slots Customer agility

80 How YARN Works YARN's original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities: a global ResourceManager, a per-application ApplicationMaster, a per-node slave NodeManager, and per-application Containers running on the NodeManagers
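A minimal Java sketch of talking to that global ResourceManager, assuming the standard YarnClient API from Hadoop 2.x (cluster settings are taken from the usual yarn-site.xml on the classpath):

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnAppsSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient client = YarnClient.createYarnClient();
        client.init(conf);
        client.start();
        // Ask the ResourceManager for all applications it knows about
        for (ApplicationReport app : client.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getName());
        }
        client.stop();
    }
}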

81 MapReduce v1

82 YARN

83 The Hadoop 1.x and 2 Ecosystem Hadoop

84 The Path to ROI 1. Put the raw data into HDFS (Hadoop Distributed File System) in its raw format 2. Use Pig to explore and transform the data 3. Data analysts use Hive to query the structured data (answers to questions = $$) 4. Data scientists use MapReduce, R and Mahout to mine the data (hidden gems = $$)

85 Flume & Sqoop

86 Flume / Sqoop Data Integration Framework What's the problem with data collection? Data collection is currently a priori and ad hoc: A priori: decide what you want to collect ahead of time Ad hoc: each kind of data source goes through its own collection path

87 What is Flume (and how can it help?) A distributed data collection service It efficiently collects, aggregates, and moves large amounts of data Fault tolerant, with many failover and recovery mechanisms A one-stop solution for data collection of all formats

88 Flume: High-Level Overview Logical Node Source Sink

89 An example flow

90 Sqoop Easy, parallel database import/export Use it when you want to: Import data from an RDBMS into HDFS Export data from HDFS back into an RDBMS

91 Sqoop - import process

92 Sqoop - export process Exports are performed in parallel using MapReduce

93 Why Sqoop JDBC-based implementation Works with many popular database vendors Auto-generation of tedious user-side code Write MapReduce applications to work with your data, faster Integration with Hive Allows you to stay in a SQL-based environment
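A minimal Java sketch of such an import, assuming Sqoop 1.x whose org.apache.sqoop.Sqoop class exposes a runTool entry point; the connection string, credentials, table and target directory are hypothetical:

import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        // Same arguments an operator would pass on the command line
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost.acme.com/sales",
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/data/orders",
            "--num-mappers", "4"        // four parallel map tasks perform the import
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}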

94 Pig & Hive

95 Why Hive and Pig? Although MapReduce is very powerful, it can also be complex to master Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code Many organizations have programmers who are skilled at writing code in scripting languages Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce Hive was initially developed at Facebook, Pig at Yahoo!

96 Pig Initiated by Yahoo! An engine for executing programs on top of Hadoop A high-level scripting language (Pig Latin) Processes data one step at a time A simpler way to write MapReduce programs: easy to understand, easy to debug
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';

97 Hive Developed by Facebook What is Hive? An SQL-like interface to Hadoop: treat your Big Data as tables A data warehouse infrastructure which provides data summarization and ad hoc querying on top of Hadoop, using MapReduce for execution Maintains metadata information about your Big Data stored on HDFS Hive Query Language: SELECT * FROM purchases WHERE price > 100 GROUP BY storeid

98 WordCount Example Input: Hello World Bye World / Hello Hadoop Goodbye Hadoop For the given sample input the map emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1> <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1> The reduce just sums up the values: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

99 WordCount Example In MapReduce

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: for each input line, emit <word, 1> for every token
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

100 WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) AS count;
DUMP C;

101 WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;

102 Hive vs. Pig Language: HiveQL (SQL-like) vs. Pig Latin, a scripting language Schema: table definitions stored in a metastore vs. a schema optionally defined at runtime Programmatic access: JDBC, ODBC vs. PigServer
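For the JDBC route, a minimal Java sketch assuming the HiveServer2 JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath; the host name and credentials are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // register the HiveServer2 driver
        // Default HiveServer2 port is 10000
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.acme.com:10000/default", "hive", "");
        Statement stmt = conn.createStatement();
        // Reuse the word-count query from the previous slide
        ResultSet rs = stmt.executeQuery("SELECT token, count(*) FROM wordcount GROUP BY token");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}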

103 HCatalog in the Ecosystem Java MapReduce HCatalog HDFS HBase???

104 Oozie

105 What is Oozie? A Java web application Oozie is a workflow scheduler for Hadoop: "cron for Hadoop" Job 1 → Job 2 → Job 3 → Job 4 → Job 5

106 How it is triggered Time: execute your workflow every 15 minutes (00:15, 00:30, 00:45, 01:00) Event: materialize your workflow every hour, but only run it when the input data is ready ("Input data exists?" checked against Hadoop at 01:00, 02:00, 03:00, 04:00)

107 Defining an Oozie Workflow Start → Action → Control Flow → Action / Action / Action → End
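A minimal Java sketch of submitting such a workflow, assuming the Oozie client library (org.apache.oozie.client.OozieClient); the server URL, HDFS paths and property values are hypothetical:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflowSketch {
    public static void main(String[] args) throws Exception {
        // Oozie server (default port 11000) and the workflow definition stored on HDFS
        OozieClient client = new OozieClient("http://oozieserver.acme.com:11000/oozie");
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/seele/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8050");
        String jobId = client.run(conf);    // submit and start the workflow
        System.out.println("Workflow job submitted: " + jobId);
    }
}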

108 HDP on Windows

109 General Planning Considerations Run on a single node? Fine for testing and simple operations, but not suitable for big data Start from a small cluster, maybe 4 or 6 nodes; as the data grows, add more nodes Expand when necessary: when storage is not enough, or to improve computing capability

110 Traditional Operating System Selection RedHat Enterprise Linux CentOS Ubuntu Server SuSE Enterprise Linux

111 Topology Master Node Active NameNode ResourceManager Secondary NameNode ( or Standby NameNode) Slave Node DataNode NodeManager

112 Network Topology World Switch Switch Switch Switch Switch Switch Namenode RM SNN DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM DN + NM Rack 1 Rack 2 Rack 3 Rack n

113 Hadoop Distributions Apache Hadoop, Cloudera, MapR, Hortonworks, Amazon EMR, Greenplum, IBM

114 Get all your Hadoop packages and make sure the packages are compatible!

115 Who is Hortonworks? Upstream community projects, downstream enterprise product: a virtuous cycle where development and issue fixes are done upstream and stable project releases flow downstream (diagram: Design & Develop, Test & Patch, Release for the Apache projects - Hadoop, Pig, Hive, HBase, HCatalog, Ambari and other Apache projects; Integrate & Test, Package & Certify, Distribute for the Hortonworks Data Platform) No lock-in: an integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects

116 What is HDP HORTONWORKS DATA PLATFORM (HDP): Enterprise Hadoop The ONLY 100% open source and complete distribution Enterprise grade, proven and tested at scale Ecosystem endorsed to ensure interoperability (diagram: Hadoop core - HDFS, MapReduce, YARN; data services - Pig, Hive, HCatalog, HBase; platform services - Flume, Sqoop, WebHDFS; operational services - Ambari, Oozie; enterprise readiness - HA, DR, snapshots, security; runs on OS, cloud, VM, or appliance)

117 The management for HDP

118 What is HDP for Windows HDP for Windows significantly expands the ecosystem for the next-generation big data platform. This means that the Microsoft partners and tools you already rely on can help you with your Big Data initiatives. HDP for Windows is the Microsoft-recommended way to deploy Hadoop on Windows Server environments. Supports Windows Server 2008 and Windows Server 2012

119 Choose your HDP

120 HDP hardware recommendations (columns: machine type / workload pattern / storage / processor (# of cores) / memory (GB) / network)
Slave nodes / balanced workload / twelve 2-3 TB disks / ... / ... / GbE onboard, 2x10 GbE mezzanine/external
Slave nodes / compute-intensive workload / twelve 1-2 TB disks / ... / ... / GbE onboard, 2x10 GbE mezzanine/external
Slave nodes / storage-heavy workload / twelve 4+ TB disks / ... / ... / GbE onboard, 2x10 GbE mezzanine/external
NameNode / balanced workload / four or more 2-3 TB disks, RAID 10 with spares / ... / ... / GbE onboard, 2x10 GbE mezzanine/external
ResourceManager / balanced workload / four or more 2-3 TB disks, RAID 10 with spares / ... / ... / GbE onboard, 2x10 GbE mezzanine/external

121 The installation of HDP for Windows 1

122 The installation of HDP for Windows 2

123 The installation of HDP for Windows 3
#Log directory
HDP_LOG_DIR=d:\hadoop\logs
#Data directory
HDP_DATA_DIR=d:\hdp\data
#Hosts
NAMENODE_HOST=NAMENODE_MASTER.acme.com
SECONDARY_NAMENODE_HOST=SECONDARY_NAMENODE_MASTER.acme.com
RESOURCEMANAGER_HOST=RESOURCEMANAGER_MASTER.acme.com
HIVE_SERVER_HOST=HIVE_SERVER_MASTER.acme.com
OOZIE_SERVER_HOST=OOZIE_SERVER_MASTER.acme.com
WEBHCAT_HOST=WEBHCAT_MASTER.acme.com
FLUME_HOSTS=FLUME_SERVICE1.acme.com,FLUME_SERVICE2.acme.com,FLUME_SERVICE3.acme.com
HBASE_MASTER=HBASE_MASTER.acme.com
HBASE_REGIONSERVERS=slave1.acme.com,slave2.acme.com,slave3.acme.com
ZOOKEEPER_HOSTS=slave1.acme.com,slave2.acme.com,slave3.acme.com
SLAVE_HOSTS=slave1.acme.com,slave2.acme.com,slave3.acme.com

124 The installation of HDP for Windows 4
#Database host
DB_FLAVOR=derby
DB_HOSTNAME=DB_myHostName
#Hive properties
HIVE_DB_NAME=hive
HIVE_DB_USERNAME=hive
HIVE_DB_PASSWORD=hive
#Oozie properties
OOZIE_DB_NAME=oozie
OOZIE_DB_USERNAME=oozie
OOZIE_DB_PASSWORD=oozie

125 What is HDP for Windows(Con.)

126 The management of HDP for Windows The Ambari SCOM integration is made possible by the pluggable nature of Ambari.

127 The management of HDP for Windows(Con.)

128 The Advantages of HDP for Windows Hadoop on Windows Made Easy: With HDP for Windows, Hadoop is both simple to install and manage. It demystifies the Hadoop distribution so you don't need to choose and test the right combination of Hadoop projects to deploy. Clean and Easy Management: Apache Ambari, the open source choice for management of a Hadoop cluster, is integrated with and extends Microsoft System Center, so that IT operators can manage their Hadoop clusters side-by-side with their databases, applications and other IT assets on a single screen. Secure, Reliable, Enterprise-Ready Hadoop: Offering the most reliable, innovative and trusted distribution available, Microsoft and Hortonworks together deliver tighter security through integration with Windows Server Active Directory, and ease of management through System Center integration.

129 The Data Integration The Hive ODBC Driver BI Tools Analytics Reporting Hive ODBC Driver

130 Using Hive with Excel Using the Hive ODBC Driver, your Excel spreadsheets can query data stored in Hadoop

131 Querying Hive from Excel

132 Querying Hive from Excel (Con.)

133 Combine model using Power View in Excel

134 What's Next? Spark, Ambari, Ranger, Falcon

135 Why MapReduce is too slow Spark aims to make data analytics fast: both fast to run and fast to write, especially when the workload involves iterative algorithms

136 What is Spark An in-memory distributed computing framework Created by the UC Berkeley AMP Lab in 2010 Targets problems that Hadoop MapReduce is bad at: iterative algorithms (machine learning) and interactive data mining More general purpose than Hadoop MapReduce Active contributions from ~15 companies

137 What is different between Hadoop and Spark (diagram: Hadoop chains Map and Reduce stages through HDFS; Spark reads from data sources and applies transformations such as map(), join() and cache() in memory)
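A minimal Java sketch of the in-memory reuse idea, assuming Spark's Java API with Java 8 lambdas; the input path and filter terms are hypothetical:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("log-analysis");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load once and keep the RDD in memory, so repeated queries avoid re-reading HDFS
        JavaRDD<String> lines = sc.textFile("hdfs:///data/app.log").cache();

        long errors   = lines.filter(l -> l.contains("ERROR")).count();
        long warnings = lines.filter(l -> l.contains("WARN")).count();

        System.out.println("errors=" + errors + " warnings=" + warnings);
        sc.stop();
    }
}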

138 What is Ambari Provision a Hadoop Cluster: Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts, and handles configuration of Hadoop services for the cluster. Manage a Hadoop Cluster: Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster. Monitor a Hadoop Cluster: Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster.

139 Ambari installation Wizard

140 Ambari central dashboard

141 Conclusion

142 Recap - Lifecycle of a YARN Application Client → ResourceManager → NodeManagers running an ApplicationMaster and Containers Container: the basic unit of allocation (e.g. Container A = 2 GB, 1 CPU); fine-grained resource allocation that replaces the fixed map/reduce slots
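A minimal Java sketch of such a container request, assuming the Hadoop 2.x AMRMClient API; the memory, core and priority values simply mirror the example on the slide:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.util.Records;

public class ContainerRequestSketch {
    // Build a request an ApplicationMaster could hand to its AMRMClient:
    // one container of 2 GB memory and 1 virtual core, like "Container A" above
    public static ContainerRequest twoGigOneCore() {
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(2048);      // MB
        capability.setVirtualCores(1);
        Priority priority = Records.newRecord(Priority.class);
        priority.setPriority(0);
        // nodes and racks left null = no locality constraint
        return new ContainerRequest(capability, null, null, priority);
    }
}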

143 Hadoop 2.0 Eco-systems

144 Q&A

145 Questions?


More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Please give me your feedback

Please give me your feedback Please give me your feedback Session BB4089 Speaker Claude Lorenson, Ph. D and Wendy Harms Use the mobile app to complete a session survey 1. Access My schedule 2. Click on this session 3. Go to Rate &

More information

BIG DATA APPLICATIONS

BIG DATA APPLICATIONS BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics

More information

Modernizing Your Data Warehouse for Hadoop

Modernizing Your Data Warehouse for Hadoop Modernizing Your Data Warehouse for Hadoop Big data. Small data. All data. Audie Wright, DW & Big Data Specialist [email protected] O 425-538-0044, C 303-324-2860 Unlock Insights on Any Data Taking

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan [email protected]

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan [email protected] Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. [email protected] http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015 Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015 We Do Hadoop Fall 2014 Page 1 HDP delivers a comprehensive data management platform GOVERNANCE Hortonworks Data Platform

More information

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

More information

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop

More information

Microsoft SQL Server 2012 with Hadoop

Microsoft SQL Server 2012 with Hadoop Microsoft SQL Server 2012 with Hadoop Debarchan Sarkar Chapter No. 1 "Introduction to Big Data and Hadoop" In this package, you will find: A Biography of the author of the book A preview chapter from the

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information