The little elephant driving Big Data




The little elephant driving Big Data

- Despite the funny-sounding name, Hadoop is a serious enterprise software suite that drives Big Data.
- Hadoop enables the storage and processing of very large databases on clusters of inexpensive servers.
- Internet companies such as Yahoo, Facebook, LinkedIn and many others use Hadoop to manage their databases.
- There is a growing community of startups focused on expanding and commercializing Hadoop technology.

DEBORAH WEINSWIG
Executive Director, Head of Global Retail & Technology
Fung Business Intelligence Centre
deborahweinswig@fung1937.com
New York: 646.839.7017

Executive Summary

Hadoop represents the circulatory and central nervous systems of Big Data. Despite the funny-sounding name, Hadoop is a serious enterprise software suite that enables the storage and processing of very large databases on a cluster of inexpensive servers.

Why Hadoop? We are awash in data, and the situation is likely to grow even more acute as machines, appliances and our clothing become part of the Internet of Things. Cisco Systems forecasts that the amount of Internet Protocol traffic passing through data centers will grow at a 23% CAGR to 8.6 zettabytes (that is, 8.6 x 10^21 bytes) during 2013-2018.

A Hadoop system has two main parts: a distributed file system that handles the storage of data across a cluster of servers (or nodes), and a management program that coordinates the storage of data and the running of programs within the individual nodes. A key distinction of Hadoop is that the individual nodes both store the data and handle processing in parallel. This provides redundancy of both storage and processing, so that if a server drops out, no data is lost and no processing is interrupted.

Hadoop is based on software originally written at Google, which Hadoop's author reverse-engineered and altruistically made available as open source. That led to widespread adoption by many Internet companies and enterprises. The funny-sounding name comes from the toy stuffed elephant that belonged to the son of its creator.

Hadoop's adoption built upon itself and fostered an ecosystem of Hadoop tools and add-ons, which also have odd-sounding names: Pig, Hive and HBase. A community of startups focused on expanding and commercializing Hadoop technology has also emerged. This report contains a list of 13 startups, which have raised a total of nearly $1.7 billion. One Hadoop-focused company, Hortonworks, recently raised $115 million in an IPO, and Cloudera is the next leading contender to go public, having already raised $1.2 billion.

Most Internet names you know (Yahoo, Facebook, LinkedIn and many others) use Hadoop to manage their databases. Since Hadoop, founded in 2006, is getting a bit old in Internet years, other technologies for handling large-scale databases, continuing the tradition of odd-sounding names (Percolator, Dremel and Pregel), have emerged. Interestingly, Google, the original creator of the underlying technology, has moved on to these newer technologies and is therefore no longer a big Hadoop user.

Hadoop is the little yellow stuffed animal with the funny name that powers many of the Internet services we depend on today.

History of Hadoop

In the early 2000s, Google faced an immense technical challenge: how to organize the entire world's information, which was stored on the Internet and steadily growing in volume. No commercially available software was up to the task, and Google's custom-designed hardware was running out of steam. Google engineers Jeff Dean and Sanjay Ghemawat designed two tools to solve this problem: Google File System (GFS), for fault-tolerant, reliable and scalable storage, and Google Map/Reduce (GMR), for parallel data analysis across a large number of servers. They described the tools in academic papers published in 2003 and 2004.

At that time, Doug Cutting was a well-known open-source software developer who was working on a web-indexing program and facing similar challenges. Cutting replaced the data collection and processing code in his web crawler with reverse-engineered versions of GFS and GMR, and he named the framework after his two-year-old son's toy elephant, Hadoop. Learning of Cutting's work, Yahoo! invested in Hadoop's development, and Cutting decided that Hadoop would remain open source: free to use, and available for expansion and improvement by everyone. By 2006, established and emerging web companies had started to use Hadoop in production systems.

Today, the Apache Software Foundation coordinates Hadoop development, and Mr. Cutting is Chief Architect at Cloudera, which was founded in 2008 to commercialize Hadoop technology. (The Apache HTTP Server software, commonly just called Apache, is the world's most widely used software for running web servers.)

What is Hadoop?

According to its current home, Apache, Hadoop is "a framework for running applications on large cluster[s] built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion."
The first part of that description means that Hadoop lets large banks of computers analyze data; the latter part means that Hadoop has built-in redundancy that can recover from the failure of a server or a rack of servers, and that the framework can process data that is changing over time.

Hadoop has two main parts:

- Map/Reduce, which divides a large piece of data into many small fragments; the analysis of each fragment can be executed or re-executed on any node in the cluster.
- The Hadoop Distributed File System (HDFS), which stores data within the nodes of the cluster; this file system scheme offers high aggregate bandwidth across the cluster.

Prominent Hadoop users include Facebook, Yahoo! and LinkedIn.

We are awash in data, and the amount of data that needs to be processed will only get larger as the Internet of Things grows to include smart home appliances and other objects, including wearable technology whose sensors generate reams of data. Figure 1 illustrates Cisco Systems' forecast that the amount of Internet Protocol traffic passing through data centers will grow at a 23% CAGR during 2013-2018. One zettabyte is 10^21 bytes, or 1,000,000,000,000,000,000,000 bytes.

Figure 1. Global Data Center IP Traffic Growth (Zettabytes per Year)
2013: 3.1 | 2014: 3.8 | 2015: 4.7 | 2016: 5.8 | 2017: 7.1 | 2018: 8.6
Source: Cisco Global Cloud Index, 2013-2018

The SAS Institute outlines the benefits of Hadoop:

- It's inexpensive: Hadoop uses low-cost commodity hardware.
- It's scalable: More nodes can be added to increase capacity and processing power.
- It can use unstructured data: Any data type can be stored and processed.
- It employs parallel processing and redundancy: Hadoop can process multiple copies of data and redirect jobs away from malfunctioning servers.

Hadoop also has its limitations:

- Other methods for rationalizing clusters with data center infrastructure are being developed.
- Data security is fragmented.
- Map/Reduce is batch-oriented.
- The ecosystem lacks easy-to-use, full-featured tools for data integration and other functions.
- Skilled Hadoop professionals are scarce and expensive.

Technology continues to evolve, and next-generation alternatives to Hadoop are emerging. Google, the pioneer of the underlying technology, is moving away from Map/Reduce and embracing other technologies such as Percolator, Dremel and Pregel.

How Does Hadoop Work?

Figure 2 shows the three key parts of a Hadoop cluster:

- Client machines load data into the cluster, submit Map/Reduce jobs that describe how the data should be processed, and then retrieve and view the results of the finished jobs.
- Master nodes oversee data storage with HDFS and coordinate parallel computations via Map/Reduce.
- Slave nodes store the data and run the computations.

Figure 2. Hadoop Server Roles
Source: BradHedlund.com
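The division of labor described above can be illustrated with a toy word count in plain Python. This is only a sketch of the Map/Reduce idea, not Hadoop's actual API (production jobs are typically written in Java against the framework); the "fragments" below stand in for data blocks that would be processed on separate slave nodes.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one data fragment."""
    return [(word, 1) for word in fragment.split()]

def reduce_phase(pairs):
    """Reduce: group the emitted pairs by word and sum the counts."""
    pairs = sorted(pairs, key=itemgetter(0))  # the shuffle/sort step
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

# In a real cluster, each fragment is mapped on a different slave node.
fragments = ["big data big cluster", "big data small cost"]
mapped = [pair for frag in fragments for pair in map_phase(frag)]
counts = reduce_phase(mapped)
print(counts)  # {'big': 3, 'cluster': 1, 'cost': 1, 'data': 2, 'small': 1}
```

Because each fragment is mapped independently, a failed map task can simply be re-executed on another node, which is the redundancy property described earlier.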

Figure 3 illustrates the Hadoop Distributed File System, which breaks a large file into pieces that are redundantly stored on several different servers, making it possible to recover completely from the failure of one (or even two) servers.

Figure 3. The Hadoop Distributed File System
Source: Cloudera

Figure 4 illustrates the Map/Reduce software framework, which similarly breaks a problem into smaller pieces that are redundantly processed, so that the analysis can be completed even if one or more servers fail.

Figure 4. The Map/Reduce Framework
Source: Cloudera
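The storage scheme in Figure 3 can be sketched in a few lines of Python. This is a simplified illustration only: real HDFS uses blocks of 64-128 MB rather than 8 bytes, places replicas with rack awareness rather than simple rotation, and the node names here are made up.

```python
import itertools

BLOCK_SIZE = 8    # bytes here; real HDFS blocks are 64-128 MB
REPLICATION = 3   # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(data):
    """Split data into fixed-size blocks and assign each block to
    REPLICATION distinct nodes, rotating through the cluster."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    rotation = itertools.cycle(NODES)
    placement = {i: [next(rotation) for _ in range(REPLICATION)]
                 for i in range(len(blocks))}
    return blocks, placement

blocks, placement = place_blocks(b"a very large file stored across the cluster")
# Every block now lives on three different servers, so losing any one
# (or even two) nodes leaves at least one intact copy of each block.
```

Reassembling the file only requires one surviving replica of each block, which is exactly the recovery property the figure depicts.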

Hadoop Components

A Hadoop setup contains the following core software modules:

Module                   | Function
Common                   | Hadoop libraries and utilities
Distributed File System  | Storing data across the cluster
YARN                     | Managing and scheduling activities in the cluster
Map/Reduce               | Processing data across multiple servers

More Unusual Names in the Hadoop Ecosystem: Apache Pig, Hive and HBase

The success of Hadoop has sparked an ecosystem of related software:

- Apache Pig: A high-level platform for creating Map/Reduce programs that run on Hadoop.
- Apache Hive: A data warehouse infrastructure on top of Hadoop for data summarization, query and analysis.
- Apache HBase: An open-source, non-relational distributed database written in Java.

Market Opportunity

IDC estimates that the worldwide Big Data technology market (comprising hardware, software and services) will reach $32 billion in 2017. Gartner estimates that the global data-related enterprise software market will hit $110 billion in 2018. Of this figure, startup Cloudera projects that $30 billion represents analytical workloads and operational data stores, as opposed to transactional workloads; that is, data used for analysis rather than commerce. Cloudera says this market is the most immediately addressable by Hadoop technology, as well as one of the fastest-growing segments of the data-related enterprise market.

Who Uses It?

As of 2008, many of the world's best-known Internet companies (eBay, Facebook, LinkedIn, Yahoo! and others) had already adopted Hadoop as the software foundation for their big-data processing activities. Figure 6 shows the top-20 selected Hadoop users, sorted by number of nodes. These deployments represent 90% of the more than 17,000 Hadoop nodes counted by Apache. Node counts for Hadoop users Amazon, Google, IBM and Twitter were not available.

Figure 6. Selected Top Hadoop Users (by number of nodes, largest first): Yahoo!, LinkedIn, Facebook, Spotify, Criteo, InMobi, eBay, CRS4, Adknowledge, Neptune, AOL, FOX Audience, Specific Media, Search Wikia, eCircle, Lydia News Analysis, A9 (Amazon), ARA.COM.TR, Cornell Univ., Last.fm
Source: Apache

Top Hadoop Technology Companies
Source: edureka!

Hadoop Startups

Hadoop startups have been well funded by venture capitalists. Figure 7 shows selected Hadoop startups, whose funding totals nearly $1.7 billion, with Cloudera receiving more than half of the total. Cloudera reportedly generated more than $100 million in revenue in 2014 and was recently valued at $4.1 billion.

Figure 7. Selected Venture-Capital Investments in Hadoop Startups

Company           | Description                                                                | Location          | Total Funding ($M)
Cloudera          | Innovator and largest contributor to the Hadoop community                  | Palo Alto, CA     | 1,200
MapR Technologies | Enterprise-grade platform for mission-critical and real-time production    | San Jose, CA      | 174
Platfora          | Analytics that transform raw data into interactive, in-memory business intelligence | San Mateo, CA | 65.2
Altiscale         | Hadoop-as-a-service                                                        | Palo Alto, CA     | 42
Trifacta          | Platform to transform raw, complex data for analysis                       | San Francisco, CA | 41.3
Datameer          | End-to-end Big Data analytics application native to Hadoop                 | San Francisco, CA | 36.8
DataTorrent       | Real-time stream processing platform                                       | Santa Clara, CA   | 23.8
Alpine Data Labs  | Solutions that simplify the process of building predictive models on Big Data | San Francisco, CA | 23.5
Splice Machine    | Hadoop-based, SQL-compliant database                                       | San Francisco, CA | 22
Qubole            | Self-service platform for Big Data analytics built on the Amazon, Microsoft and Google clouds | Mountain View, CA | 20
Cask              | Brings virtualization to Hadoop data and apps                              | Palo Alto, CA     | 12.5
Nuevora           | Analytics solutions for marketing effectiveness, customer management and risk mitigation | San Ramon, CA | 2.3
Xplenty           | Platform enabling the transformation of data into business insights        | Tel Aviv          | 2

Source: CrunchBase

Hortonworks, which makes business software focused on the development and support of Apache Hadoop, raised $100 million in an initial public offering on December 11, 2014.
Conclusion

At the time of its introduction, Hadoop brought Google's technology for the low-cost analysis of very large data sets on clusters of inexpensive PC hardware to the world, thanks to its author's roots in the open-source software community. Subsequently, many Internet search engines and large enterprises adopted the software, and a community of tools and startups grew up around it to enhance the technology; that community has produced one IPO so far. The Hadoop community continues to thrive, even though the software is somewhat aged by Internet standards and some newer alternatives have emerged. That's quite an achievement for a piece of software named after a yellow stuffed elephant.

Deborah Weinswig, CPA
Executive Director, Head of Global Retail & Technology
Fung Business Intelligence Centre
New York: 917.655.6790
Hong Kong: +852 6119 1779
deborahweinswig@fung1937.com

Cam Bolden, cambolden@fung1937.com
Marie Driscoll, CFA, mariedriscoll@fung1937.com
John Harmon, CFA, johnharmon@fung1937.com
Amy Hedrick, amyhedrick@fung1937.com
Aragorn Ho, aragornho@fung1937.com
John Mercer, johnmercer@fung1937.com
Charlie Poon, charliepoon@fung1937.com
Kiril Popov, kirilpopov@fung1937.com
Stephanie Reilly, stephaniereilly@fung1937.com
Lan Rosengard, lanrosengard@fung1937.com
Jing Wang, jingwang@fung1937.com