IT and Storage for Big Data Analytics

IT and Storage for Big Data Analytics
Randy Kerns, Senior Strategist, Evaluator Group

Overview
Big data can mean two different things:
- Storage for large amounts of data
- Analytics against very large amounts of data
  - Usually from machine-to-machine data
  - Called pervasive computing
So, what does this mean for storage?

What It Means for IT

The Storage Way to Say Big Data
Defined by architectural platform, big data storage is:
- Scale-out NAS
- Global namespace file system
- NAS gateway to SAN and scale-out SAN
Defined by application, big data storage is:
- Storage for applications that handle large files and require performance
- Storage for extremely large numbers of files
Examples: media & entertainment, oil & gas exploration, life sciences, etc.

The Analytics Way to Say Big Data
Big data analytics is:
- A term for business intelligence (BI) processes that are different from traditional data warehousing
- The ability to tap unstructured data as a source for BI processes
- Information delivered to users in real or near real-time (but not an absolute requirement)
- Convergence of multiple data sources
Latency introduced by storage, including networked storage, is often assiduously avoided
Cost is minimized

Data Analytics Model
[Diagram labels: Customer Profiles; NoSQL DB; HDFS; Logs, Tweets, Location; High-Scale Data Reductions; Predictions on Buying Behavior; BI and Analytics; Batch; Low Latency]
1) Identify user
2a) Look up user profile
2b) Look up location
3) Input into expert system
4) Real-time: determine best offer for this user
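The numbered low-latency steps in this slide can be sketched as a short lookup pipeline. This is a minimal illustration only: the in-memory dictionaries stand in for the NoSQL stores, and every name and rule here (`PROFILES`, `LOCATIONS`, `choose_offer`, the sample user and offers) is hypothetical rather than taken from the presentation.

```python
# Hypothetical stand-ins for the NoSQL lookups named in the diagram.
PROFILES = {"u42": {"segment": "frequent_buyer"}}
LOCATIONS = {"u42": "Denver"}


def choose_offer(profile, location):
    """Toy 'expert system': a single hard-coded rule (an assumption)."""
    if profile.get("segment") == "frequent_buyer":
        return f"loyalty discount near {location}"
    return f"welcome offer near {location}"


def best_offer(user_id):
    # 1) Identify user (here the id is simply passed in)
    # 2a) Look up user profile
    profile = PROFILES.get(user_id, {})
    # 2b) Look up location
    location = LOCATIONS.get(user_id, "unknown")
    # 3) Input into expert system, 4) determine best offer for this user
    return choose_offer(profile, location)


print(best_offer("u42"))
```

In a real deployment the rule in `choose_offer` would presumably be driven by the batch-side predictions; here it is fixed so the flow stays easy to follow.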

Why Should Storage Professionals Care?
Distributed computing for analytics (Hadoop, for example) is moving from science experiment to mission-critical
As this happens, data encompassed by these applications becomes the responsibility of people who worry about:
- Security
- Data protection/disaster recovery/business continuance
- Data governance and compliance
- Digital records management and archiving

Shared Storage for the Traditional Data Warehouse
[Diagram labels: Archive; OLTP; Files / XML data; Log Files; Operational; Extract, Transform, Load (ETL); Data Warehouse; Schedules; Ad hoc Queries; Reports; Dashboards; Notifications]

Distributed, Shared-Nothing Architectures for Big Data Analytics
[Diagram: network layer; compute layer (nodes 1, 2, 3 ... n); storage layer (DAS on each node)]

CAP Theorem
It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
- Consistency (all nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it was successful or failed)
- Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
A distributed system can satisfy any two of these guarantees at the same time, but not all three
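The trade-off can be made concrete with a toy two-replica store. This is a sketch under simplifying assumptions, not a model of any real system: the class `TwoReplicaStore` and its `mode` values are invented for illustration, and a "partition" is just a flag that stops the replicas from talking to each other.

```python
class TwoReplicaStore:
    """Toy store with two replicas, illustrating the choice under partition."""

    def __init__(self, mode):
        self.mode = mode          # "CP" favors consistency, "AP" availability
        self.replica_a = None
        self.replica_b = None
        self.partitioned = False

    def write(self, value):
        """Return True if the write is accepted, False if it is refused."""
        if not self.partitioned:
            self.replica_a = self.replica_b = value
            return True
        if self.mode == "CP":
            return False          # stay consistent by refusing the write
        self.replica_a = value    # stay available; replica_b falls behind
        return True

    def consistent(self):
        return self.replica_a == self.replica_b


for mode in ("CP", "AP"):
    store = TwoReplicaStore(mode)
    store.write(1)
    store.partitioned = True
    accepted = store.write(2)
    print(mode, "accepted:", accepted, "consistent:", store.consistent())
```

During the partition the "CP" store refuses the write and stays consistent, while the "AP" store accepts it and its replicas diverge.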

Issue for IT
How to store information for big data
- How much data is there?
- Where did this idea come from?
What are the requirements?
Is it from analytics operations?
- Store original data capture in flight as part of the analytics operation?
- Store as secondary process?
- Don't save anything, except results?
What about rental data?

Shared Storage as Secondary Storage
Is there a place for shared storage in shared-nothing? If so, what does it look like?
[Diagram: network layer; compute layer (nodes 1, 2, 3 ... n); storage layer (SAN/NAS)]

Shared Storage as Primary Storage
[Diagram: network layer; compute layer (nodes 1, 2, 3 ... n); storage layer (SAN or NAS, but more commonly scale-out NAS)]

Shared Primary/Secondary Storage
Advantages
- Can reduce latency for queries that span nodes
- Enhances system availability
- Addresses the enterprise storage requirements
  - Security
  - Data protection/disaster recovery/business continuance
  - Data governance and compliance
  - Digital records management and archiving
Disadvantages
- Additional cost
- Crosses a cultural boundary

Why Not Shared Storage?

Big Data Storage for Big Data Analytics
Shared storage as secondary storage for big data analytics
- Data protection, database of record, archive
- Examples: NetApp and ParAccel, EMC Data Domain/VMAX and Greenplum, RainStor
Shared storage as primary storage for big data analytics
- Examples: Calpont, Red Hat Gluster, IBM GPFS, Nexenta ZFS, Hadoop nodes in virtual machines

Is Hadoop a Storage Device?
NO
- It's a distributed computing platform
YES
- 1K-node cluster w/ 1 TB RAM per node = 1 PB of very high performance storage
- Data protection built in (multiple data copies, but not RAID)
- HDFS: embedded, distributed file system (like scale-out NAS)
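The 1 PB figure is simple aggregation across nodes, which a short sketch makes explicit. The function name is hypothetical, and decimal units (1 PB = 1000 TB) are assumed.

```python
def aggregate_capacity_pb(nodes, tb_per_node):
    """Total capacity across the cluster in PB, assuming 1 PB = 1000 TB."""
    return nodes * tb_per_node / 1000


# 1K nodes with 1 TB of RAM each, as on the slide
print(aggregate_capacity_pb(1000, 1))
```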

HDFS: Hadoop File System
- Very large distributed file system (DFS): 10K nodes, 100 million files, 10 PB
- Uses standard servers with direct attached storage
- Files are replicated to handle hardware failure: 3 copies
- Detects failures and recovers from them
- Optimized for batch processing
- Data locations exposed so that computations can move to where data resides
- Provides very high aggregate bandwidth
- Runs in user space, heterogeneous OS
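Two of these points, three-way replication and exposed data locations, can be illustrated with a small sketch. It is not Hadoop code and does not use any Hadoop API: `raw_capacity_needed` and `pick_node` are hypothetical helpers, the node names are made up, and the scheduler is reduced to "prefer a free node that already holds a replica".

```python
REPLICATION = 3  # copies kept of each file, as stated on the slide


def raw_capacity_needed(logical_tb, replication=REPLICATION):
    """Raw disk (TB) consumed when every byte is stored `replication` times."""
    return logical_tb * replication


def pick_node(block_locations, free_nodes):
    """Choose where to run a task that reads one block.

    block_locations: nodes holding a replica of the block
    free_nodes: non-empty set of nodes with spare compute capacity
    Returns (node, is_local).
    """
    for node in block_locations:
        if node in free_nodes:
            return node, True      # computation moves to the data
    # No replica holder is free: run elsewhere and read over the network.
    return sorted(free_nodes)[0], False


print(raw_capacity_needed(100))
print(pick_node(["n1", "n2", "n3"], {"n3", "n9"}))
print(pick_node(["n1", "n2", "n3"], {"n7", "n9"}))
```

Keeping three replicas triples the raw disk consumed, but it also gives the scheduler three candidate nodes on which a task can read its block locally.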

Hadoop File System on Standard Servers Source: Matt Foley

Typical Hadoop Configuration
[Diagram: network layer; compute layer (nodes 1, 2, 3 ... n); storage layer (DAS on each node)]

Hadoop Key Milestones
- Dec 2004: Google GFS paper published
- July 2005: MapReduce first used
- Feb 2006: Becomes Lucene subproject
- Apr 2007: Yahoo! on 1000-node cluster
- Jan 2008: Apache Top Level Project
- May 2009: Hadoop sorts a petabyte in 17 hours
- Aug 2010: World's largest Hadoop cluster at Facebook (2900 nodes, 30+ petabytes)

Evaluating Hadoop as a Storage Device
- Snapshots?
- Scale capacity and performance concurrently?
- SSD and automated tiering?
- Dedupe?
- Insert your hot-button storage feature here:

Evaluating Hadoop as a Storage Device

IT and Big Data Analytics
There will be big data
Circumstances may vary, and change
Participate early
- Data scientists may not have the same concerns or requirements
- Decisions can limit choices
Understand options
- Products / software

Thank You! Questions? Randy Kerns: randy@evaluatorgroup.com Twitter: @rgkerns Blog: http://itknowledgeexchange.techtarget.com/storage-soup/