The Cloud Computing Era and Ecosystem. Phoenix Liau, Technical Manager

1 The Cloud Computing Era and Ecosystem Phoenix Liau, Technical Manager

2 Three Major Trends to Change the World: Cloud Computing, Big Data, Mobile

3 Mobility and Personal Cloud My World! My Way!

4 What is Personal Cloud
"Lets consumers use a variety of connected devices, such as smartphones, media tablets, TVs and PCs, to seamlessly store, sync, stream and share content over the network." -- Gartner
Multiple Screens, Diverse Platforms: Sync, Store, Stream, Share

5 User Behaviors (Data source: Gartner)

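6 Personal Cloud Challenge
- Gartner estimates that consumers will spend about US$2.2 trillion on digital technology products and services in 2012, nearly 10% of the average household's disposable income.
- By 2015, worldwide consumer spending on services for running and delivering content to connected devices will reach US$2.8 trillion.
But if you are a developer...
- 80% of developers do not earn enough to sustain a standalone business.
- 59% of developers cannot recoup the money they invested in development.
- 63% of developers have apps that were downloaded fewer than 50,000 times.
- 75% of developers earn US$5,000 or less per app.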

7 Personal Cloud Opportunity
Business model: social network, location-based, mobile powered
"Services", not just "Apps"

8 Cloud Computing - What Comes to Your Mind?

9 Cloud Computing - What Comes to Your Mind? Virtualization, Social Media, SaaS, Cloud Computing, Hadoop, Internet, Storage, Mobility

10 The NIST Definition of Cloud Computing: Essential Characteristics, Service Models, Deployment Models
Delivering scalable and elastic IT-related capabilities to users as a service, using Internet technologies.

11 It's About the Ecosystem
- Cloud Computing (SaaS, PaaS, IaaS) generates Big Data (structured and semi-structured; Enterprise Data Warehouse)
- Big Data leads to Business Insights
- Business Insights create Competition, Innovation, Productivity

12 Top Cloud Computing Predictions for 2012
- In 2012, 80% of new commercial enterprise apps will be deployed on cloud platforms. -- IDC
- Amazon Web Services will exceed $1 billion in cloud services business in 2012, with Google's Enterprise business to follow within 18 months. -- IDC
- By 2015, low-cost cloud services will cannibalize up to 15% of top outsourcing players' revenue. -- Gartner
- By 2016, 40% of enterprises will make proof of independent security testing a precondition for using any type of cloud service. -- Gartner
- At year-end 2016, more than 50% of Global 1000 companies will have stored customer-sensitive data in the public cloud. -- Gartner
- An estimated 20% or more of organizations have already begun to selectively store their customer-sensitive data in a hybrid architecture.

13 The Need for Business Agility on Infrastructure
- Business Owner: "We got a mission from the CEO to introduce a new SaaS service."
- Developers: "Just getting the infrastructure to develop on is so slow!"
- Operations: "How do we get the h/w, manage the app and deliver the SLA in production?"
We need to:
- Get capacity now
- Get s/w stacks deployed
- Simulate production
Once in prod, we need to:
- Plan capacity for the app
- Place it on Tier 1 capacity
- Provision the app server, web, database
- Set up the load balancer
- Set up the firewall
- Set up data protection
- Set up mgmt
- Manage the app

14 Cloud Infrastructure Landscape Cloud Infrastructure As-a-Product As-a-Service Virtualization.. a lot more Storage

15 Cloud IaaS Providers
- Cloud IaaS is the future of outsourced hosting
- Every company that offers Web hosting is being forced to evolve its business
- On-demand, pay-as-you-go capability has become the norm
- Primary customers: 1. Traditional Web hosting customers 2. Those interested in what cloud computing can do for their business

16 Cloud Infrastructure Management Central management solution for enterprises to manage servers in multiple public clouds and private clouds, or hybrid clouds. Solution Providers:

17 CIOs' Concerns about Cloud Computing
Concerns which drive the cloud adoption strategy:
- Security and Compliance
- Performance and SLAs
- Availability and Data Protection
- Intellectual Property

18 Private Cloud Infrastructure Options
Enterprise Private Cloud
- Options:
  - Commercial: VMware vCloud (with VMware-based virtualization)
  - Open source: CloudStack, OpenStack, Eucalyptus (with Xen-based virtualization)
- Pros: Infrastructure is dedicated to and owned by the organization, and is thus more secure
- Cons: Does not give the enterprise the full benefits of the cloud: true elasticity and elimination of CapEx and OpEx
Virtual Private Data Center (VPDC) from a public cloud service provider
- Options: Amazon (Virtual Private Cloud), Savvis, Rackspace, Terremark
- Pros:
  - Meets business agility and flexibility requirements at relatively low cost
  - Better security than the public cloud
  - The enterprise can access servers in the VPDC via a secured tunnel
  - Dedicated hardware for a single customer under a VPDC offering is available from some vendors (e.g. Amazon EC2 Dedicated Instances)
  - Enterprise-level managed services are offered by most of the vendors
- Cons: Isolation occurs at the network layer. Information stored in a VPDC still shares the actual servers with other companies' data

19 Hybrid Cloud Infrastructure
A 2011 survey of 500 CIOs across the UK, France, Germany, Spain and Benelux highlights:
- 16% have company-wide implementations of cloud computing to date
- 60% believe that the cloud will be their most significant IT operating method by ...
- ...% prefer the hybrid cloud method
Hybrid cloud balances the security strengths of a private cloud with the lower costs and elasticity of a public cloud service, while maintaining business agility.
Example: traffic bursting for newly introduced services like SafeSync
Diagram: Private Cloud (security), bridged to Public Cloud (lower cost, elasticity), forming the Hybrid Cloud

20 Comparison of Different Cloud Deployment Models ROI TCO Average Security Performance Elasticity Private Cloud Hybrid Cloud Public Cloud

21 What is BigData? A set of files A database A single file

22 The Data-Driven World
- Modern systems have to deal with far more data than in the past
- Organizations are generating huge amounts of data
- That data has inherent value and cannot be discarded
- Examples: Yahoo! has over 170PB of data, Facebook over 30PB, eBay over 5PB
- Many organizations are generating data at a rate of terabytes per day

23 What is the problem Traditionally, computation has been processor-bound For decades, the primary push was to increase the computing power of a single machine Faster processor, more RAM Distributed systems evolved to allow developers to use multiple machines for a single job At compute time, data is copied to the compute nodes

24 What is the problem Getting the data to the processors becomes the bottleneck Quick calculation Typical disk data transfer rate: 75MB/sec Time taken to transfer 100GB of data to the processor: approx. 22 minutes!
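The quick calculation on this slide can be reproduced with a few lines of Python. This is only a sketch: it assumes 1 GB = 1000 MB, and the 100-disk scenario at the end is a hypothetical illustration of why spreading data across nodes helps, not a figure from the slides.

```python
def transfer_minutes(data_gb: float, rate_mb_per_sec: float, mb_per_gb: int = 1000) -> float:
    """Minutes needed to stream data_gb from a single disk at rate_mb_per_sec."""
    seconds = data_gb * mb_per_gb / rate_mb_per_sec
    return seconds / 60

# Slide figures: 100 GB read from one disk at 75 MB/sec.
single_disk = transfer_minutes(100, 75)
print(f"1 disk: {single_disk:.1f} min")

# Hypothetical: the same 100 GB spread evenly over 100 disks read in parallel,
# so each disk only has to deliver 1 GB.
parallel_seconds = transfer_minutes(100 / 100, 75) * 60
print(f"100 disks: {parallel_seconds:.1f} s")
```

Under these assumptions a single disk needs roughly 22 minutes, in line with the slide, while the parallel read finishes in seconds.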

25 What is the problem
- Failure of a component may cost a lot
- What do we need when a job fails? A component failure:
  - May result in a graceful degradation of application performance, but the entire system should not fail completely
  - Should not result in the loss of any data
  - Should not affect the outcome of the job

26 Big Data Solutions by Industries The most common problems Hadoop can solve

27 Threat Analysis/Trade Surveillance Challenge: Detecting threats in the form of fraudulent activity or attacks Large data volumes involved Like looking for a needle in a haystack Solution with Hadoop: Parallel processing over huge datasets Pattern recognition to identify anomalies i.e., threats Typical Industry: Security, Financial Services

28 Big Data Use Case: Smart Protection Network
Challenge: Information accessibility and transparency problems for threat researchers, due to the size and sources of the data (volume, variety and velocity)
Size of data
- Overall data
  - Data sources: 20+
  - Data fields:
  - Daily new records: 23 billion+
  - Daily new data size: 4TB+
- SPN Smart Feedback
  - Feedback components: 26
  - Data fields: 300+
  - Daily new file counts: 6 million+
  - Daily new records: 90 million+
  - Daily new data size: 261GB+

29 Index= vsapi zbot

31 Recommendation Engine
- Challenge: Using user data to predict which products to recommend
- Solution with Hadoop:
  - Batch processing framework that allows execution in parallel over large datasets
  - Collaborative filtering: collecting taste information from many users, and using that information to predict what similar users like
- Typical industry: ISP, Advertising
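Collaborative filtering as described here can be sketched on a single machine in a few lines of Python. The version below is a minimal user-based variant that assumes Jaccard similarity between users' sets of liked items; the users, items and the `recommend` helper are all made up for illustration, and a real Hadoop job would distribute the same idea over far more data.

```python
from collections import defaultdict

def jaccard(a: set, b: set) -> float:
    """Similarity of two users: shared items divided by all items either one likes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target: str, likes: dict, top_n: int = 3) -> list:
    """Rank items the target has not seen by the similarity of the users who like them."""
    scores = defaultdict(float)
    for user, items in likes.items():
        if user == target:
            continue
        similarity = jaccard(likes[target], items)
        for item in items - likes[target]:
            scores[item] += similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked if score > 0][:top_n]

# Hypothetical taste data.
likes = {
    "alice": {"router", "switch", "firewall"},
    "bob": {"router", "switch", "nas"},
    "carol": {"firewall", "nas", "ups"},
    "dave": {"camera"},
}
suggestions = recommend("alice", likes)
print(suggestions)
```

Items liked only by users who share nothing with the target (here, dave's camera) get a score of zero and are dropped.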

33 Hadoop!

34 Hadoop
- An Apache project inspired by Google's MapReduce and Google File System papers
- An open-source, flexible and available architecture for large-scale computation and data processing on a network of commodity hardware
- Open-source software + commodity hardware = IT cost reduction

35 Hadoop Concepts Distribute the data as it is initially stored in the system Individual nodes can work on data local to those nodes Users can focus on developing applications.

36 Hadoop Components
- Hadoop consists of two core components: the Hadoop Distributed File System (HDFS) and the MapReduce software framework
- There are many other projects based around core Hadoop, often referred to as the Hadoop Ecosystem: Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
Ecosystem diagram:
- Hue (Web Console)
- Mahout (Data Mining)
- Oozie (Job Workflow & Scheduling)
- Zookeeper (Coordination)
- Sqoop/Flume (Data integration)
- MapReduce Runtime (Dist. Programming Framework)
- Pig/Hive (Analytical Language)
- Hbase (Column NoSQL DB)
- Hadoop Distributed File System (HDFS)

37 Hadoop Components: HDFS
- HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
- Two roles in HDFS: NameNode (records metadata) and DataNode (stores data)

38 How Files Are Stored: Example NameNode holds metadata for the data files DataNodes hold the actual blocks Each block is replicated three times on the cluster

39 HDFS: Points To Note When a client application wants to read a file: It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on It then communicates directly with the DataNodes to read the data
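The split between metadata and data can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not HDFS code: the `NameNode` class, `write_file`, `read_file`, the 8-byte block size and the round-robin replica placement are all invented for the example (real HDFS uses much larger blocks and rack-aware placement).

```python
import itertools

BLOCK_SIZE = 8      # bytes; deliberately tiny for the example
REPLICATION = 3     # each block is stored on three DataNodes

class NameNode:
    """Holds metadata only: which blocks make up a file and where the replicas live."""

    def __init__(self, datanode_names):
        self.datanode_names = datanode_names
        self.files = {}                      # path -> [(block_id, [datanode names])]
        self._block_ids = itertools.count()

    def allocate_block(self, path):
        block_id = next(self._block_ids)
        n = len(self.datanode_names)
        # Assumption: simple round-robin placement of the replicas.
        replicas = [self.datanode_names[(block_id + i) % n] for i in range(REPLICATION)]
        self.files.setdefault(path, []).append((block_id, replicas))
        return block_id, replicas

    def locate(self, path):
        return self.files[path]

def write_file(namenode, datanodes, path, data):
    """Split data into blocks and store every block on each assigned DataNode."""
    for start in range(0, len(data), BLOCK_SIZE):
        block_id, replicas = namenode.allocate_block(path)
        for name in replicas:
            datanodes[name][block_id] = data[start:start + BLOCK_SIZE]

def read_file(namenode, datanodes, path):
    """Ask the NameNode where the blocks are, then fetch the bytes from DataNodes."""
    chunks = []
    for block_id, replicas in namenode.locate(path):      # step 1: metadata lookup
        chunks.append(datanodes[replicas[0]][block_id])   # step 2: direct data read
    return b"".join(chunks)

datanodes = {name: {} for name in ["dn1", "dn2", "dn3", "dn4"]}   # name -> {block_id: bytes}
namenode = NameNode(list(datanodes))
payload = b"The cat sat on the mat"
write_file(namenode, datanodes, "/demo.txt", payload)
restored = read_file(namenode, datanodes, "/demo.txt")
print(namenode.locate("/demo.txt"))
```

Note that the file's bytes never pass through the `NameNode` object; it only answers the question of which DataNodes to contact.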

40 Hadoop Components: MapReduce
- MapReduce is a method for distributing a task across multiple nodes
- It works like a Unix pipeline: cat input | grep | sort | uniq -c | cat > output
- Stages: Input, Map, Shuffle & Sort, Reduce, Output
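The Map, Shuffle & Sort, and Reduce stages can be imitated in plain Python to show how data flows between them. This is a single-process sketch, not the Hadoop API: `map_phase`, `reduce_phase` and `run_job` are hypothetical names, the two input records are borrowed from the deck's word count slide, and lowercasing the words is an extra assumption made so that "The" and "the" are counted together.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(offset, line):
    """Map: turn one (offset, line) record into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    """Reduce: sum all counts that were grouped under one word."""
    yield word, sum(counts)

def run_job(records):
    # Map every input record.
    mapped = [pair for offset, line in records for pair in map_phase(offset, line)]
    # Shuffle & sort: bring identical keys next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce each group of values that share a key.
    output = []
    for word, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reduce_phase(word, (count for _, count in group)))
    return output

records = [(0, "The cat sat on the mat"), (22, "The aardvark sat on the sofa")]
result = run_job(records)
for word, total in result:
    print(word, total)
```

In a real cluster the map calls run on the nodes that hold the input blocks and the sort happens across the network; only the two small functions at the top are what a developer would normally write.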

41 Features of MapReduce
- Automatic parallelization and distribution
- Automatic re-execution on failure
- Locality optimizations
- MapReduce abstracts all the housekeeping away from the developer, who can concentrate simply on writing the Map and Reduce functions

42 Example: word count
- Word count is challenging over massive amounts of data: using a single compute node would be too time-consuming, and the number of unique words can easily exceed the available RAM
- MapReduce breaks complex tasks down into smaller elements that can be executed in parallel
- The more nodes, the faster the job

43 Word Count Example
- Map input: key = offset, value = line
- Map output: key = word, value = count
- Reduce output: key = word, value = sum of counts
Input records:
0: The cat sat on the mat
22: The aardvark sat on the sofa

44 The Hadoop Ecosystems

45 Growing Hadoop Ecosystem The term Hadoop is taken to be the combination of HDFS and MapReduce There are numerous other projects surrounding Hadoop Typically referred to as the Hadoop Ecosystem Zookeeper Hive and Pig HBase Flume Other Ecosystem Projects Sqoop Oozie Hue Mahout

46 The Ecosystem is the System Hadoop has become the kernel of the distributed operating system for Big Data No one uses the kernel alone A collection of projects at Apache

47 Relation Map Hue (Web Console) Mahout (Data Mining) Oozie (Job Workflow & Scheduling) Zookeeper (Coordination) Sqoop/Flume (Data integration) MapReduce Runtime (Dist. Programming Framework) Pig/Hive (Analytical Language) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)

48 Zookeeper Coordination Framework

49 What is ZooKeeper
- A centralized service for maintaining configuration information and providing distributed synchronization
- A set of tools for building distributed applications that can safely handle partial failures
- ZooKeeper was designed to store coordination data: status information, configuration, location information

50 Why use ZooKeeper? Manage configuration across nodes Implement reliable messaging Implement redundant services Synchronize process execution

51 ZooKeeper Architecture
- All servers store a copy of the data (in memory)
- A leader is elected at startup
- Two roles: leader and follower
- Followers service clients; all updates go through the leader
- Update responses are sent when a majority of servers have persisted the change
- HA support
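The majority rule for updates comes down to simple arithmetic, sketched below in Python. The function names are hypothetical and nothing here talks to a real ZooKeeper ensemble; it only shows how many acknowledgements count as a majority and how many server failures an ensemble of a given size can survive.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1

def update_committed(acks: int, ensemble_size: int) -> bool:
    """An update is acknowledged once a majority has persisted it."""
    return acks >= quorum_size(ensemble_size)

def tolerated_failures(ensemble_size: int) -> int:
    """Servers that may fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)

for size in (3, 4, 5):
    print(size, quorum_size(size), tolerated_failures(size))
```

A 4-server ensemble tolerates only one failure, the same as a 3-server ensemble, which is one reason odd ensemble sizes are the usual choice.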

52 Hbase Column NoSQL DB

53 Structured-data vs Raw-data

54 HBase
- An Apache open-source project, inspired by Google Bigtable
- Non-relational, distributed database written in Java
- Coordinated by ZooKeeper

55 Row & Column Oriented

56 HBase Data Model
- Cells are versioned
- Table rows are sorted by row key
- Region: a row range [start-key:end-key]
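These three properties can be modelled in a small Python sketch. It is not the HBase client API: the `Table` and `Region` classes, the split keys and the column names are all invented, and the sketch assumes that a region's range includes its start key and excludes its end key.

```python
import bisect

class Region:
    """A contiguous row range, start_key inclusive and end_key exclusive."""

    def __init__(self, start_key, end_key):
        self.start_key, self.end_key = start_key, end_key
        self.rows = {}   # row_key -> {column -> [(timestamp, value), ...] newest first}

class Table:
    def __init__(self, split_keys):
        bounds = [""] + sorted(split_keys) + [None]   # None marks the open upper end
        self.regions = [Region(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
        self._starts = [region.start_key for region in self.regions]

    def region_for(self, row_key):
        """Binary search over region start keys, possible because row keys are ordered."""
        return self.regions[bisect.bisect_right(self._starts, row_key) - 1]

    def put(self, row_key, column, value, timestamp):
        row = self.region_for(row_key).rows.setdefault(row_key, {})
        versions = row.setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)          # keep the newest version first

    def get(self, row_key, column):
        versions = self.region_for(row_key).rows.get(row_key, {}).get(column, [])
        return versions[0][1] if versions else None

    def scan(self):
        """Yield row keys in sorted order, region by region."""
        for region in self.regions:
            yield from sorted(region.rows)

table = Table(split_keys=["g", "p"])         # three regions: [""-g), [g-p), [p-...)
table.put("row-b", "info:name", "old", timestamp=1)
table.put("row-b", "info:name", "new", timestamp=2)
table.put("kiwi", "info:name", "k", timestamp=1)
table.put("apple", "info:name", "a", timestamp=1)
print(list(table.scan()), table.get("row-b", "info:name"))
```

Because regions partition a sorted key space, finding the region for a row key is a lookup rather than a search through all the data, which is what makes random reads cheap.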

57 Architecture Master Server (HMaster) Assigns regions to regionservers Monitors the health of regionservers RegionServers Contain regions and handle client read/write request

58 Hbase workflow

59 When to use HBase
- You need random, low-latency access to the data
- The application has a variable schema where each row is slightly different:
  - Columns are added over time
  - Most columns are NULL in each row

60 Flume / Sqoop Data Integration Framework

61 What's the problem with data collection
- Data collection is currently a priori and ad hoc
- A priori: you decide what you want to collect ahead of time
- Ad hoc: each kind of data source goes through its own collection path

62 What is Flume (and how can it help?)
- A distributed data collection service
- Efficiently collects, aggregates, and moves large amounts of data
- Fault tolerant, with many failover and recovery mechanisms
- One-stop solution for data collection of all formats

63 Flume: High-Level Overview Logical Node Source Sink

64 Architecture: basic diagram, one master controls multiple nodes

65 Architecture: multiple masters control multiple nodes

66 An example flow

67 Flume / Sqoop Data Integration Framework

68 Sqoop: Easy, parallel database import/export
What do you want to do?
- Import data from an RDBMS into HDFS
- Export data from HDFS back into an RDBMS

69 What is Sqoop A suite of tools that connect Hadoop and database systems Import tables from databases into HDFS for deep analysis Export MapReduce results back to a database for presentation to end-users Provides the ability to import from SQL databases straight into your Hive data warehouse

70 How Sqoop helps The Problem Structured data in traditional databases cannot be easily combined with complex data stored in HDFS Sqoop (SQL-to-Hadoop) Easy import of data from many databases to HDFS Generate code for use in MapReduce applications

71 Sqoop - import process

72 Sqoop - export process Exports are performed in parallel using MapReduce

73 Why Sqoop JDBC-based implementation Works with many popular database vendors Auto-generation of tedious user-side code Write MapReduce applications to work with your data, faster Integration with Hive Allows you to stay in a SQL-based environment

74 Sqoop - JOB
Job management options, e.g.:
sqoop job --create myjob -- import --connect xxxxxxx --table mytable

75 Pig / Hive Analytical Language

76 Why Hive and Pig? Although MapReduce is very powerful, it can also be complex to master Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code Many organizations have programmers who are skilled at writing code in scripting languages Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce Hive was initially developed at Facebook, Pig at Yahoo!

77 Hive
Developed by Facebook
What is Hive?
- An SQL-like interface to Hadoop
- Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop: MapReduce for execution, HDFS for storage
Hive Query Language
- Basic SQL: Select, From, Join, Group-By
- Equi-Join, Multi-Table Insert, Multi-Group-By
- Batch query
SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid

78 Pig
Initiated by Yahoo!
- A high-level scripting language (Pig Latin)
- Processes data one step at a time
- A simple way to write MapReduce programs
- Easy to understand
- Easy to debug
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';

79 Hive vs. Pig
- Language: Hive uses HiveQL (SQL-like); Pig uses Pig Latin, a scripting language
- Schema: Hive uses table definitions that are stored in a metastore; in Pig, a schema is optionally defined at runtime
- Programmatic access: Hive via JDBC, ODBC; Pig via PigServer

80 WordCount Example
Input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop
For the given sample input, the map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1> <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
The reduce just sums up the values:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

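81 WordCount Example In MapReduce

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: for each input line, emit <word, 1> for every token.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}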

82 WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) as count;
DUMP C;

83 WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;

84 Oozie Job Workflow & Scheduling

85 What is Oozie?
- A Java web application
- Oozie is a workflow scheduler for Hadoop
- Crond for Hadoop
Diagram: Job 1, Job 2, Job 3, Job 4, Job 5

86 Why Oozie
- Why use Oozie instead of just cascading jobs one after another?
- Major flexibility: start, stop, suspend, and re-run jobs
- Oozie allows you to restart from a failure: you can tell Oozie to restart a job from a specific node in the graph, or to skip specific failed nodes

87 High Level Architecture Web Service API database store : Workflow definitions Currently running workflow instances, including instance states and variables Oozie WS API Tomcat web-app Hadoop/Pig/HDFS DB

88 How it is triggered
- Time: execute your workflow every 15 minutes (00:15, 00:30, 00:45, 01:00)
- Time and data: materialize your workflow every hour, but only run it when the input data is ready (01:00, 02:00, 03:00, 04:00; check whether the input data exists in Hadoop)
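The difference between the two trigger types can be sketched in plain Python. This does not use Oozie's API or configuration format: `materialize` and `runnable` are hypothetical helpers, the date is arbitrary, and the set of hours whose input has arrived is made up.

```python
from datetime import datetime, timedelta

def materialize(start, end, every):
    """Yield nominal run times from start to end (inclusive), spaced by `every`."""
    current = start
    while current <= end:
        yield current
        current += every

def runnable(nominal_times, input_ready):
    """Keep only the materialized runs whose input data is already available."""
    return [t for t in nominal_times if input_ready(t)]

day = datetime(2012, 1, 1)   # arbitrary date for the example

# Time trigger: every 15 minutes from 00:15 to 01:00.
time_triggered = list(materialize(day.replace(minute=15), day.replace(hour=1),
                                  timedelta(minutes=15)))

# Time-and-data trigger: materialized hourly, but run only if the input exists.
hourly = list(materialize(day.replace(hour=1), day.replace(hour=4), timedelta(hours=1)))
arrived = {day.replace(hour=1), day.replace(hour=2)}   # hypothetical: data for two hours
data_triggered = runnable(hourly, lambda t: t in arrived)

print([t.strftime("%H:%M") for t in time_triggered])
print([t.strftime("%H:%M") for t in data_triggered])
```

The 03:00 and 04:00 runs stay materialized but waiting; they would start once their input shows up.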

89 Example Workflow

90 Oozie use criteria
- You need to launch, control, and monitor jobs from your Java apps: Java Client API / Command Line Interface
- You need to control jobs from anywhere: Web Service API
- You have jobs that you need to run every hour, day, or week
- You need to receive a notification when a job is complete

91 Hue Web Console

92 Hue developed by Hadoop User Experience Apache Open source project HUE is a web UI for Hadoop Platform for building custom applications with a nice UI library

93 Hue HUE comes with a suite of applications File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files. Job Browser: View jobs, tasks, counters, logs, etc. Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.

94 Hue: File Browser UI

95 Hue: Beeswax UI

96 Mahout Data Mining

97 What is Mahout
- A machine-learning tool
- Distributed and scalable machine learning algorithms on the Hadoop platform
- Makes building intelligent applications easier and faster

98 Why Mahout
Current state of ML libraries:
- Lack community
- Lack documentation and examples
- Lack scalability
- Are research oriented

99 Mahout scale
- Scales to large datasets: Hadoop MapReduce implementations that scale linearly with data
- Scalable to support your business case: Mahout is distributed under a commercially friendly Apache Software License
- Scalable community: vibrant, responsive and diverse

100 Mahout four use cases
Mahout machine learning algorithms:
- Recommendation mining: takes users' behavior and finds items a specified user might like
- Clustering: takes e.g. text documents and groups them by related topics
- Classification: learns from existing categorized documents what documents of a specific category look like, and assigns unlabeled documents to the appropriate category
- Frequent item set mining: takes a set of item groups (e.g. terms in a query session, shopping cart contents) and identifies which individual items typically appear together
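Of the four use cases, frequent item set mining is the easiest to show in a few lines. The Python sketch below counts item pairs by brute force, which is a stand-in for illustration and not the algorithm Mahout implements; the shopping baskets and the `frequent_pairs` helper are invented.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Count how often each pair of items shares a basket; keep pairs seen min_support times or more."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so (a, b) and (b, a) are counted as the same pair.
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical shopping carts.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

Counting every pair grows quadratically with basket size, which is why large-scale implementations rely on smarter algorithms and on distributing the work.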

101 Use case example
Predict what the user likes based on:
- His/her historical behavior
- The aggregate behavior of people similar to him/her

102 Conclusion Today, we introduced: Why Hadoop is needed The basic concepts of HDFS and MapReduce What sort of problems can be solved with Hadoop What other projects are included in the Hadoop ecosystem

103 Recap Hadoop Ecosystem Hue (Web Console) Mahout (Data Mining) Oozie (Job Workflow & Scheduling) Zookeeper (Coordination) Sqoop/Flume (Data integration) MapReduce Runtime (Dist. Programming Framework) Pig/Hive (Analytical Language) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)

104 Questions?

105 Thank you!

Cloud Computing Era. Trend Micro

Cloud Computing Era. Trend Micro Cloud Computing Era Trend Micro Three Major Trends to Chang the World Cloud Computing Big Data Mobile 什 麼 是 雲 端 運 算? 美 國 國 家 標 準 技 術 研 究 所 (NIST) 的 定 義 : Essential Characteristics Service Models Deployment

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Word Count Code using MR2 Classes and API

Word Count Code using MR2 Classes and API EDUREKA Word Count Code using MR2 Classes and API A Guide to Understand the Execution of Word Count edureka! A guide to understand the execution and flow of word count WRITE YOU FIRST MRV2 PROGRAM AND

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

From Distributed Systems to Data Science. William C. Benton Red Hat Emerging Technology

From Distributed Systems to Data Science. William C. Benton Red Hat Emerging Technology From Distributed Systems to Data Science William C. Benton Red Hat Emerging Technology About me At Red Hat: scheduling, configuration management, RPC, Fedora, data engineering, data science. Before Red

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Hadoop 2.0 Introduction with HDP for Windows. Seele Lin

Hadoop 2.0 Introduction with HDP for Windows. Seele Lin Hadoop 2.0 Introduction with HDP for Windows Seele Lin Who am I Speaker: 林 彥 辰 A.K.A Seele Lin Mail: seele_lin@trend.com.tw Experience 2010~Present 2013~2014 Trainer of Hortonworks Certificated Training

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

Deploying Hadoop with Manager

Deploying Hadoop with Manager Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer plinnell@suse.com Alejandro Bonilla / Sales Engineer abonilla@suse.com 2 Hadoop Core Components 3 Typical Hadoop Distribution

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

Xiaoming Gao Hui Li Thilina Gunarathne

Xiaoming Gao Hui Li Thilina Gunarathne Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data

More information

Data processing goes big

Test report: Integration Big Data Edition Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

MapReduce Implementations Google was the first to apply MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

Hadoop and ecosystem * The views expressed in this document are solely the author's own * Some figures in this document come from the Internet. Information Management. IBM CDL Lab

2012 IBM Corporation Agenda Hadoop technology Hadoop overview Hadoop 1.x Hadoop 2.x Hadoop ecosystem

Introduction to Cloud Computing

Cloud Computing I (intro) 15-319, spring 2010 2nd Lecture, Jan 14th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31st, 2013

Virtualizing Apache Hadoop. June, 2012

Table of Contents: EXECUTIVE SUMMARY, INTRODUCTION, VIRTUALIZING APACHE HADOOP, INTRODUCTION TO VSPHERE™, USE CASES AND ADVANTAGES OF VIRTUALIZING HADOOP, MYTHS ABOUT RUNNING

Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN

Warning! Slides are only for presentation guide We will discuss+debate additional

Hadoop Job Oriented Training Agenda

Kapil CK hdpguru@gmail.com Module 1 Understanding Hadoop This module covers an overview of big data, Hadoop, and the Hortonworks Data Platform. 1.1 Module

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

Architectures for Big Data Analytics A database perspective

Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Big Data: Global Digital Data Growth Growing by leaps and bounds, 40+% Year over Year! 2009 = .8 Zettabytes = .08

Qsoft Inc www.qsoft-inc.com

Big Data & Hadoop Course Topics Week 1: Introduction to Big Data, Hadoop Architecture and HDFS Week 2: Setting up Hadoop Cluster Week 3: MapReduce Part 1 Week 4:

Big Data Too Big To Ignore

Geert: Big Data Consultant and Manager; currently finishing a 3rd Big Data project; IBM & Cloudera Certified; IBM & Microsoft Big Data Partner. Agenda: Defining Big Data; Introduction

Introduction to MapReduce and Hadoop

Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Why Map/Reduce? Massive data Cannot be stored on a single machine Takes too long to process

BIG DATA What it is and how to use?

Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than a precise term

Hadoop & Spark Using Amazon EMR

Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

HDP Hadoop From concept to deployment.

Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

Introduction to Hadoop, MapReduce and HDFS for Big Data Applications. Serge Blazhievsky Nice Systems

SNIA Legal Notice The material contained in this tutorial is copyrighted

HADOOP SDJ INFOSOFT PVT LTD

DATA FACT 6/17/2016 SDJ INFOSOFT PVT. LTD www.javapadho.com Big Data Definition Big data is high-volume, high-velocity and high-variety information assets that demand cost

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce

Cloud Computing i Hadoop

X JPL Barcelona, 01/07/2011 Marc de Palol @lant Who am I? Grid Computing vs Cloud Both are distributed systems

Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl

Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm

BIG DATA APPLICATIONS

BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP Big data has become one of the most important aspects in scientific computing and business analytics

III Big Data Technologies

Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technology, Solapur, Maharashtra,

Firebird meets NoSQL (Apache HBase) Case Study

Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

Hadoop and Map-Reduce. Swati Gore

Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

Workshop on Hadoop with Big Data

Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

Big Data Management. Big Data Management. (BDM) Autumn 2013. Povl Koch November 11, 2013 10-11-2013 1

Overview Today's program 1. Little more practical details about this course 2. Recap from last time (Google

CSE-E5430 Scalable Cloud Computing Lecture 2

Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 Google MapReduce A scalable batch processing

MySQL and Hadoop. Percona Live 2014 Chris Schneider

About Me Chris Schneider, Database Architect @ Groupon Spent the last 10 years building MySQL architecture for multiple companies Worked with Hadoop for

An Introduction to MOHAMMAD REZA KARIMI DASTJERDI SPRING

2015 Table Of Contents Introduction Problems with RDBMSs What is Hadoop? Who uses Hadoop? Job Positions History Hadoop Distributions Hadoop Ecosystem

A Brief Outline on Bigdata Hadoop

Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

Big Data Course Highlights

The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014

Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce

Introduction to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu

Programming Model The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.

Big Data With Hadoop

With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce Source: [Tutorials

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

Word count example Abdalrahman Alsaedi

To run word count in AWS you have two options: either use the existing WordCount program or write your own. First: Using AWS word count program

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of

Apache Hadoop: Past, Present, and Future

The 4th China Cloud Computing Conference May 25th, 2012. Dr. Amr Awadallah Founder, Chief Technical Officer aaa@cloudera.com, twitter: @awadallah Hadoop Past

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

Cost-Effective Business Intelligence with Red Hat and Open Source

Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 Agenda Introductions Quick survey What is BI?: reporting,

Hadoop vs Apache Spark

Innovate, Integrate, Transform www.altencalsoftlabs.com Introduction "Any sufficiently advanced technology is indistinguishable from magic," said Arthur C. Clarke. Big data technologies

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

Testing Big data is one of the biggest

Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning

Big Data on Microsoft Platform

Prepared by GJ Srinivas Corporate TEG - Microsoft Contents 1. What is Big Data? 2. Characteristics of Big Data 3. Enter Hadoop 4. Microsoft Big Data Solutions

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Dr. Wlodek Zadrozny (Most slides come from Prof. Akella's class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

Introduction to Big Data Training

The quickest way to be introduced to NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW

Agenda Big Data The Technical Challenges Architecture of Hadoop

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

Peers Technologies Pvt. Ltd. HADOOP

Course Brochure Overview Hadoop is an open source project from Apache, which provides reliable storage and faster processing by using the Hadoop distributed file system and

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Summary Xiangzhe Li Nowadays, there is more and more data every day about everything. For instance, here are some of the astonishing

So What's the Big Deal?

Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data

Hadoop: Understanding the Big Data Processing Method

Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology

HPCHadoop: MapReduce on Cray X-series

Scott Michael Research Analytics Indiana University Cray User Group Meeting May 7, 2014 Outline Motivation & Design of HPCHadoop HPCHadoop demo Benchmarking Methodology

Big Data Analytics* Outline. Issues. Big Data

Outline Big Data Analytics* Big Data Data Analytics: Challenges and Issues Misconceptions Big Data Infrastructure Scalable Distributed Computing: Hadoop Programming in Hadoop: MapReduce Paradigm Example

The Future of Data Management

The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in

Hadoop Beginner's Guide. Garry Turkington

Learn how to crunch big data to extract meaning from data avalanche. PUBLISHING, open source community experience distilled. BIRMINGHAM - MUMBAI

Outline. What is Big Data? Hadoop HDFS MapReduce

Intro To Hadoop What is big data? A bunch of data? An industry? An expertise? A trend? A cliche? Wikipedia big data In information technology, big data

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008

Who Am I? Hadoop Developer Core contributor since Hadoop's infancy Focussed

Lecture 10 - Functional programming: Hadoop and MapReduce

Sohan Dharmaraja For today Big Data and Text analytics Functional

White Paper: What You Need To Know About Hadoop

CTOlabs.com June 2011 A White Paper providing succinct information for the enterprise technologist. Inside: What is Hadoop, really? Issues the Hadoop stack

Microsoft SQL Server 2012 with Hadoop

Debarchan Sarkar Chapter No. 1 "Introduction to Big Data and Hadoop" In this package, you will find: A Biography of the author of the book A preview chapter from the

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013

What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables

Hadoop implementation of MapReduce computational model. Ján Vaňo

What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google's distributed

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop, Why? Need to process huge datasets on large clusters of computers

Application Development. A Paradigm Shift

Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat 2012 by Intelsat. Published by The Aerospace Corporation with permission. Motivation for the

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their self-explanatory images on the internet. Thanks