Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014




Defining Big: Not Just Massive Data

"Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." - The McKinsey Global Institute, 2011

This data is more than just large; it is also non-traditional and needs to be handled differently. Big Data is about adopting new technologies that enable the storage, processing, and analysis of data that was previously ignored. (Ref. 12, pg. 19)

Dark Data & Big Data

Gartner defines dark data as information assets that organizations collect, process and store in the course of their regular business activity, but generally fail to use for other purposes. Hadoop clusters and NoSQL databases can process large volumes of data, which makes it feasible to incorporate long-neglected information into big data analytics applications and unlock its business value. Edmunds.com put a Hadoop-based data warehouse into production in February. It has accelerated the process of mining dark data and has opened up new views of data that are helping the company reduce operating costs, said Paddy Hannon, VP of architecture at Edmunds in Santa Monica, California.

Characteristics of Big Data

Defining Data - Volume

Volume is the size of the data. Big data comes in one size: large, or rather, massive. In 1986, the world's technological capacity to receive information through one-way broadcast networks was 0.432 zettabytes. In 2016, Internet traffic is expected to reach 1.3 zettabytes. (Source: Wikipedia)

Defining Data - Velocity

Velocity is how fast data is being generated. Big data must be used as it streams into the enterprise to maximize its value to the business. Velocity typically considers how quickly the data arrives and is stored, and its associated rate of retrieval. Think of this as data in motion, or the speed at which the data is flowing. Examples:
1. Number of tweets per hour worldwide.
2. Traffic sensors in Los Angeles during rush hour, or international airplane traffic sensors/signals while planes are in flight.
3. Twitter processes 400,000,000 tweets/day, or over 4,500 tweets per second.
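The Twitter figure can be checked with a line of arithmetic. The sketch below converts a daily count into an average per-second rate; the function and constant names are illustrative, not taken from the slides, and the result is an average that ignores peaks.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def events_per_second(events_per_day: int) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


# 400,000,000 tweets/day is the figure quoted on the slide.
tweets_per_second = events_per_second(400_000_000)
print(f"{tweets_per_second:,.0f} tweets per second on average")
```

The average comes out a little above the "over 4,500" quoted on the slide, so the two numbers are consistent.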

Describing Big Data - Variety

Variety is the variation of data types, including source, format, and structure. Big data extends beyond structured data to unstructured data of all varieties, including text, audio, video, click streams, and log files. Example: banking involves various types of transactions occurring around the world every minute, via iPhone, phone, in person, computers, terminals, and tellers.

Defining Data - Veracity

SQL Databases & NoSQL

Traditional OLAP/OLTP limitations:
1. A SQL database needs to know what is being stored in advance.
2. The Agile development approach doesn't work well. Each time new features are added, the schema of the database requires changes.
3. If the database is large, the schema change is slow.
4. Rapid iterations and frequent data changes result in frequent downtime.

NoSQL Advantages

1. NoSQL databases allow insertion of data without a predefined schema.
2. Application changes in real time are easier, resulting in faster development.
3. Code integration is more reliable, and less database administration is needed.
4. NoSQL covers a variety of database technologies. It was developed in response to the volume of data, the frequency with which this data is accessed, and performance and processing needs.
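The schema difference can be sketched in a few lines of Python. This is a minimal illustration, not a benchmark or a real NoSQL client: SQLite (from the standard library) stands in for a relational database, and a plain list of JSON strings stands in for a document store. The `customers` table, its fields, and the sample records are hypothetical.

```python
import json
import sqlite3

# Relational side: the table's columns must be declared before any insert.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")

# A new feature needs an email field. The insert fails until the schema changes.
new_row = "INSERT INTO customers (id, name, email) VALUES (2, 'Bob', 'bob@example.com')"
try:
    conn.execute(new_row)
    schema_change_needed = False
except sqlite3.OperationalError:
    schema_change_needed = True
    conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")
    conn.execute(new_row)

# Document side: a list of JSON documents (a stand-in for a document store)
# accepts the extra field without any schema declaration.
documents = []
documents.append(json.dumps({"id": 1, "name": "Ada"}))
documents.append(json.dumps({"id": 2, "name": "Bob", "email": "bob@example.com"}))

sql_rows = conn.execute("SELECT id, name, email FROM customers ORDER BY id").fetchall()
doc_fields = [sorted(json.loads(doc).keys()) for doc in documents]
```

On a small in-memory table the `ALTER TABLE` step is instant; the limitation listed above concerns large production databases, where such changes take time and may require downtime.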

Sample NoSQL Databases by DB Type

Hadoop & Big Data

The term Hadoop is often considered synonymous with the term Big Data. So, what is Hadoop? Hadoop is open-source software from the Apache Software Foundation that stores and processes large non-relational data sets via a large, reliable, scalable distributed computing model. Commercial Hadoop distributions are available from companies such as Hortonworks and Cloudera. (Ref. 4)

Key Hadoop Components

Elements of Hadoop

Hadoop is a framework made of a variety of components that allows for the distributed processing of large data sets across a fault-tolerant cluster of servers.
Hadoop Common: part of the core Hadoop project, which includes the utilities that support the other Hadoop modules.
Hadoop Distributed File System: a distributed file system that provides high-throughput access to application data.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
See more at: http://www.cioinsight.com/it-news-trends/slideshows/hadoopadoption-proves-slow-but-steady-08/#sthash.6u7xjwik.dpuf

Chief Advantages of Hadoop and MapReduce

1. Potentially lower costs than analytical databases, and more scalability with reduced processing time and higher performance.
2. It's open source. Although this implies free, it's not entirely free, because you might want to pay for support. However, it's a lower-cost alternative.
3. There is no database license. Hadoop and other open source big data implementations offer a less expensive alternative to traditional, proprietary data warehouses.

Chief Advantages of Hadoop and MapReduce - II

Improved scalability over analytic databases:
1. It can handle very large amounts of data because you can use 10, 50, or 100 machines to do the processing. The infrastructure around it handles the parallel processing.
2. Relatively simple routines can be written for mapping and reduction. The infrastructure takes responsibility for scheduling the jobs on each of the 100 machines and making sure that all 100 complete successfully. If one fails, it redistributes that work to the other machines.
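The division of labor described above, simple map and reduce routines on one side and scheduling infrastructure on the other, can be sketched as a single-process Python toy. This is not Hadoop's API: `run_job`, the worker names, and the simulated failure are all hypothetical, and a real cluster runs the tasks in parallel on separate machines.

```python
from collections import defaultdict


def map_words(line):
    """Map routine: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce routine: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines, workers, failing_workers=frozenset()):
    """Toy scheduler: assign each line to a worker, reassigning on failure."""
    healthy = [w for w in workers if w not in failing_workers]
    if not healthy:
        raise RuntimeError("no healthy workers left")
    grouped = defaultdict(list)
    assignments = []
    for index, line in enumerate(lines):
        worker = workers[index % len(workers)]
        if worker in failing_workers:
            # Simulated failure: redistribute this task to a healthy worker.
            worker = healthy[index % len(healthy)]
        assignments.append(worker)
        for word, count in map_words(line):  # map phase
            grouped[word].append(count)      # shuffle: group values by key
    results = dict(reduce_counts(w, c) for w, c in grouped.items())  # reduce phase
    return results, assignments


lines = ["big data big buzz", "hadoop stores big data", "hadoop processes data"]
counts, assignments = run_job(
    lines, workers=["node1", "node2", "node3"], failing_workers={"node2"}
)
```

The author of the job writes only `map_words` and `reduce_counts`; everything in `run_job` stands in for what the framework does on the user's behalf.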

When Not To Use Hadoop

When to Use Big Data Tooling

Three characteristics describe how users want to interact with their data: totality, exploration, and frequency. Totality refers to the increased desire to process and analyze all available data, rather than analyzing a sample of data and extrapolating the results. However, Apache Hadoop does not replace the data warehouse, and NoSQL databases do not replace transactional relational databases. Neither do MapReduce, streaming analytics, or Hive, Apache's data warehousing application used to query Hadoop data stores.

Gartner Prediction for Big Data

By 2015: Big data demand will reach 4.4 million jobs globally, but only one-third of those jobs will be filled. Gartner says the demand for Big Data is growing, and enterprises will need to reassess their competencies and skills to respond to this opportunity. Jobs that are filled will result in real financial and competitive benefits for organizations. An important part of the challenge in filling these jobs is that enterprises need people with new skills (data management, analytics and business expertise) and the non-traditional skills necessary for extracting the value of Big Data, as well as artists and designers for data visualization. (Ref. 3)

Gartner Predictions for Big Data - II

By 2016: Wearable smart electronics in shoes, tattoos and accessories will emerge as a $10 billion industry. Gartner claims the majority of revenue from wearable smart electronics over the next few years will come from athletic shoes and fitness tracking, communications devices for the ear, and automatic insulin delivery for diabetics.

By 2017: 40 per cent of enterprise contact information will have leaked into Facebook via employees' increased use of mobile device collaboration applications. According to Gartner, while many organizations have been legitimately concerned about the physical coexistence of consumer and enterprise applications on devices that interact with IT infrastructure (Ref. 3)

The Hadoop Project & Components

Hadoop delivers a highly available service on top of a cluster of computers, each of which may be prone to failures. The project includes the following modules:
1. Hadoop Common: Common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
3. Hadoop YARN: A framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: A core Hadoop analytics component using a YARN-based system for parallel processing of large data sets. Complex analytics that are hard to express in SQL can be easier to write in MapReduce.
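One way to picture how a file system can stay available on failure-prone machines is block replication, the approach HDFS takes. The Python sketch below is a simplified model under stated assumptions: the 128 MB block size and replication factor of 3 are the Hadoop 2.x defaults, the round-robin placement is a stand-in for HDFS's real rack-aware policy, and the function names and node names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # default HDFS block size in Hadoop 2.x
REPLICATION_FACTOR = 3   # default HDFS replication factor


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB,
                 replicas=REPLICATION_FACTOR):
    """Split a file into blocks and assign each block to distinct nodes.

    Round-robin placement is a simplification; real HDFS placement is rack-aware.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    block_count = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(block_count):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replicas)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(file_size_mb=500, nodes=nodes)
survives_any_single_failure = all(
    readable_after_failure(placement, node) for node in nodes
)
```

Because every block lives on three different machines, losing any one machine leaves at least two copies of each block, which is why the cluster as a whole can be more reliable than its individual servers.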

Hadoop 1.0 vs. 2.0

Overview of Apache Hadoop-Related Projects

1. Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
   - It includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heat maps.
   - It can also display MapReduce, Pig and Hive applications visually and provides a user interface for diagnosing their performance characteristics.
2. Avro™: A data serialization system. http://avro.apache.org
3. Cassandra: A scalable multi-master database with no single points of failure. http://cassandra.apache.org
4. Chukwa: A data collection system for managing large distributed systems.

Overview of Apache Hadoop-Related Projects - II

6. HBase: A scalable, distributed database that supports structured data storage for large tables.
7. Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying. Runs on the MapReduce framework of Platform Symphony.
8. Mahout: A scalable machine learning and data mining library.
9. Pig: A high-level data-flow language and execution framework for parallel computation. Runs on the MapReduce framework of Platform Symphony.

Overview of Apache Hadoop-Related Projects - III

11. Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
12. Oozie: The scheduler used to run and manage jobs.
13. Fair Scheduler: Used for basic management of job submission.
Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data around a cluster. http://flume.apache.org
14. HCatalog: A table and storage management service for Hadoop.

Tooling for Big Data - Top 16 Platforms Source: Information Week Jan. 30, 2014

References

1. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, Zikopoulos, Paul C., Eaton, Chris, et al., McGraw Hill, 2012.
2. The Forrester Wave: Enterprise Hadoop Solutions, Q1 2012, Kobielus, James G.
3. 7 Big Data Trends for 2014, December 27, 2013, Rijmenam, Mark van, http://smartdatacollective.com/bigdatastartups/174741/seven-big-data-trends-2014
9. Introduction to NoSQL, Fowler, Martin, http://www.youtube.com/watch?v=qi_g07c_q5i
12. Harness the Power of Big Data: The IBM Big Data Platform, Zikopoulos, Paul, et al., McGraw Hill, 2013.
13. IBM Whitepaper, Wrangling big data: Fundamentals of data lifecycle management
15. Hadoop Architecture, Keith McDonald, http://www.youtube.com/watch?v=yewlbxj3rv8
16. Intro to Map Reduce, MapRAcademy, http://www.youtube.com/watch?v=hfplubebhcm
17. How Big Is a Petabyte, Exabyte, Zettabyte, or a Yottabyte? http://highscalability.com/blog/2012/9/11/how-big-is-a-petabyte-exabyte-zettabyte-ora-yottabyte.html

Other Reading

1. Hadoop -- http://hadoop.apache.org
2. Avro -- http://avro.apache.org
3. Flume -- http://flume.apache.org
4. HBase -- http://hbase.apache.org
5. Hive -- http://hive.apache.org
6. Lucene -- http://lucene.apache.org
7. Oozie -- http://oozie.apache.org
8. Pig -- http://pig.apache.org
9. ZooKeeper -- http://zookeeper.apache.org