



Big Data: Science Use Cases and Infrastructure. Dr. André Luckow (Image source: http://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919)

Agenda: Motivation, Use Cases, Infrastructure, MapReduce, Beyond MapReduce

"From the dawn of civilization until 2003 humankind generated five exabytes of data. Now we produce five exabytes every two days... and the pace is accelerating." Eric Schmidt, Google (2010)

The Digital Universe

"As we begin to distribute trillions of connected sensors around the planet, virtually every animate and inanimate object on earth will be generating and transmitting data, including our homes, our cars, our natural and man-made environments, and yes, even our bodies." Anthony D. Williams

Data Everywhere: A380: TB per flight; Car: GB per drive; Human: GB per day

Motivation. McKinsey Big Data Report (2011): "Data have swept into every industry and business function and are now an important factor of production." Challenge: derive more and more value from more and more data in less time. (Image source: http://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919)

Gartner Hype Cycle. [Figure: Hype Cycle for Emerging Technologies, 2012, plotting technologies such as Big Data, Internet of Things, In-Memory Analytics, In-Memory Database Management Systems and Cloud Computing along the axes expectations vs. time, from the Technology Trigger through the Peak of Inflated Expectations and Trough of Disillusionment to the Plateau of Productivity. Source: Gartner (July 2012)]

What is Big Data? "Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." (McKinsey) "Big Data is any data that is expensive to manage and hard to extract value from." (Michael Franklin, Berkeley)

The 3 V's of Big Data. The volumes of data that need to be processed (in near realtime) are increasing in various domains. How to deal with the 3 V's?* Volume, Velocity, Variety. (*Doug Laney, Gartner, 3-D Data Management, 2001; see also http://strataconf.com/strataeu)

Agenda (next: Use Cases): Motivation, Use Cases, Infrastructure, MapReduce, Beyond MapReduce

The Fourth Paradigm of Science: 1. Empirical, 2. Theoretical, 3. Computational, 4. Data Exploration. [Slide shows the cover and back-cover endorsements of the book.] Tony Hey, Stewart Tansley, and Kristin Tolle (eds.), The Fourth Paradigm: Data-Intensive Scientific Discovery

Science is not only Simulation: Experimentation, Theory, Simulation, Big Data

Big Data in the Sciences. High Energy Physics: the LHC at CERN produces petabytes of data per day, of which O(10) TB/day are stored; data is distributed to Tier 1 and Tier 2 sites. Astronomy: Sloan Digital Sky Survey (80 TB over 7 years); LSST will produce 40 TB per day (for 10 years). Genomics: data volume is increasing with every new generation of sequencing machine; a single machine can produce TB/day, while the costs of sequencing are decreasing. Jha, Hong, Dobson, Katz, Luckow, Rana, Simmhan, Introducing Distributed, Dynamic, Data-intensive Sciences (D3): Understanding Applications and Infrastructure, in preparation for CCPE, 2013

Genome Sequencing. Source: http://www3.appliedbiosystems.com/cms/groups/applied_markets_marketing/documents/generaldocuments/cms_096460.pdf

Genome Sequencing: 3 GB per sequenced human genome. Source: National Human Genome Research Institute, The Costs of DNA Sequencing

Spatial Data: maps & metadata, observational data, location-based services, crowdsourced location data (e.g. GPS probes). Source: http://blog.uber.com/2012/01/09/uberdata-san-franciscomics/

Agenda (next: Infrastructure and MapReduce): Motivation, Use Cases, Infrastructure, MapReduce, Beyond MapReduce

Types of Data. Unstructured data: irregular structure, or parts of it lack structure; free-form text, reports, customer feedback forms, pictures and video; low information density. Structured data: pre-defined schema, relational data; high information density. [Pyramid: Raw Data at the base, Derived and Combined Data in the middle, Publications at the top]

Infrastructure for Big Data. Challenges: disk capacities have increased, but disk latencies and bandwidths have not kept up. How to efficiently store, transfer and process data?
Year | Capacity | Transfer rate | Time to read full disk
1997 | 2.1 GB   | 16.6 MB/s     | 126 sec
2004 | 200 GB   | 56.5 MB/s     | 59 min
2012 | 3000 GB  | 210 MB/s      | 239 min
Source: Cloudera. CPU, RAM and disk sizes double every 18-24 months, while seek times improve only about 5% per year.
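The read times in the table are simply capacity divided by transfer rate; a minimal worked check in Python (the figures are taken from the table above, the variable names are illustrative):

    # Full-disk sequential read time = capacity / transfer rate
    # (figures from the table above, Source: Cloudera).
    disks = [
        ("1997", 2.1e9, 16.6e6),   # capacity in bytes, transfer in bytes/s
        ("2004", 200e9, 56.5e6),
        ("2012", 3000e9, 210e6),
    ]
    for year, capacity, rate in disks:
        seconds = capacity / rate
        print(f"{year}: {seconds:.0f} s (~{seconds / 60:.0f} min)")
    # Prints roughly 127 s, 59 min and 238 min, matching the table.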

Data Management, Traditional. Storage nodes and compute nodes are separated by the network: 1. copy files from storage to compute nodes, 2. process the data, 3. copy the results back.

Data Management, the Google Way. Google File System: http://research.google.com/archive/gfs.html (2003). Google MapReduce: http://research.google.com/archive/mapreduce.html (2004). Google BigTable: http://research.google.com/archive/bigtable.html (2006).

Google Filesystem. Design principles: inspired by distributed/parallel filesystems (PVFS, Lustre); usage of commodity hardware; disk seeks are expensive, so GFS is optimized for sequential reads of large files; partitioning for optimal data parallelism; failures are very common. But! No strong consistency, no file locking. [Figure: GFS architecture. The GFS client sends (file name, chunk index) to the GFS master and receives (chunk handle, chunk locations); it then requests (chunk handle, byte range) directly from a GFS chunkserver, which stores chunks on a Linux file system and returns the chunk data. Control messages flow between master and chunkservers, data messages between client and chunkservers.] Source: Sanjay Ghemawat et al., The Google File System, 2003
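The paper excerpt on this slide describes how the GFS client turns a byte offset into a chunk index using the fixed chunk size before asking the master for chunk locations. A minimal sketch of that translation, assuming GFS's published 64 MB chunk size (the function name and return format are illustrative):

    CHUNK_SIZE = 64 * 1024 * 1024  # fixed GFS chunk size

    def to_chunk_requests(file_name, byte_offset, length):
        """Translate an application read into per-chunk (index, range) requests,
        as the GFS client library does before contacting the master."""
        requests = []
        offset, end = byte_offset, byte_offset + length
        while offset < end:
            chunk_index = offset // CHUNK_SIZE
            start_in_chunk = offset % CHUNK_SIZE
            end_in_chunk = min(CHUNK_SIZE, start_in_chunk + (end - offset))
            requests.append((file_name, chunk_index, start_in_chunk, end_in_chunk))
            offset += end_in_chunk - start_in_chunk
        return requests

    # A 100 MB read starting at offset 10 MB touches chunks 0 and 1:
    print(to_chunk_requests("/foo/bar", 10 * 1024**2, 100 * 1024**2))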

Google MapReduce. Built on top of the Google Filesystem; brings compute to the data. A programming model for efficiently processing large volumes of data, modeled after functional languages that often have a map and a reduce function. Abstracts low-level details such as data distribution, parallelism and fault tolerance. [Figure: Map, Shuffle, Reduce phases]

MapReduce Programming Model. Query: how many tweets per user?

    Select user, count(*) from tweet group by user;

Input: key=row, value=tweet info. Map: output key=user_id, value=1. Shuffle: sort by user_id. Reduce: for each user_id, sum. Output: user_id, tweet count.
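A minimal sketch of how this query maps onto the model, with plain Python standing in for a MapReduce framework (the record format and variable names are illustrative):

    from itertools import groupby
    from operator import itemgetter

    tweets = [("t1", "alice"), ("t2", "bob"), ("t3", "alice")]  # (row, user_id)

    # Map: emit (user_id, 1) for every tweet
    mapped = [(user, 1) for _row, user in tweets]

    # Shuffle: sort and group the intermediate pairs by key (user_id)
    shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))

    # Reduce: sum the counts per user_id
    for user, group in shuffled:
        print(user, sum(count for _user, count in group))
    # alice 2
    # bob 1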

Apache Hadoop. A set of open source projects inspired by the Google Filesystem and MapReduce. Transforms commodity hardware into a service that stores petabytes of data reliably (Hadoop Filesystem, HDFS) and allows huge distributed computations (MapReduce). Attributes: redundant and reliable (no data loss), batch-processing centric, easy to program distributed apps.

Hadoop Components. Hadoop Filesystem (HDFS): open source implementation of the Google filesystem. MapReduce: key/value-based API for the parallel processing of large data volumes; implementations: Apache Hadoop (MR1, MR2, Local); Hadoop as a Service: Amazon Elastic MapReduce.

HDFS Overview. Chunking: data is split into large chunks (128 MB) to facilitate data-parallel processing. Replication: data is distributed across nodes for performance and reliability. Analysis is moved to the data (avoiding data copies), and data is scanned sequentially (avoiding random seeks). [Figure: input data split into blocks and replicated across data nodes]
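A minimal sketch of the chunking and replication idea (simplified round-robin placement; the block size is the 128 MB from this slide, the helper name is illustrative, and this is not HDFS's actual placement policy):

    BLOCK_SIZE = 128 * 1024 * 1024  # HDFS block size from the slide
    REPLICATION = 3                 # typical HDFS replication factor

    def place_blocks(file_size, data_nodes):
        """Split a file into fixed-size blocks and assign each block to
        REPLICATION distinct data nodes (simplified round-robin)."""
        n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
        return {
            block: [data_nodes[(block + r) % len(data_nodes)]
                    for r in range(REPLICATION)]
            for block in range(n_blocks)
        }

    # A 300 MB file becomes 3 blocks, each stored on 3 of the 4 nodes:
    print(place_blocks(300 * 1024**2, ["node1", "node2", "node3", "node4"]))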

HDFS Architecture. Source: http://hadoop.apache.org/docs/stable/hdfs_design.html

Hadoop MapReduce Architecture. A MapReduce application submits a job via the Job Client (1); the Job Tracker retrieves the input splits from the HDFS Namenode (2) and assigns tasks (3) to Task Trackers, which run the Map or Reduce tasks (4) against HDFS.

Hadoop Distributions. The Hadoop market is very dynamic; Cloudera and Hortonworks are the leading Hadoop distributions. Differentiation is often based on enterprise features, e.g. security and management. There are several cloud-based offerings: Amazon's Elastic MapReduce, Hadoop on Azure. Source: The Forrester Wave, Enterprise Hadoop Solutions, Q1 2012

Hadoop Ecosystem. Various tools on top of Hadoop: SQL-based analytics with Hive (SQL query engine), Cloudera Impala and Apache Drill; Pig (ETL tool); Mahout (machine learning).

"Big'data'is'what'happened'when'the'cost'of' keeping'information'became'less'than'the'cost' of'throwing'it'away."'s'george'dyson'' ' ' ' 33

Hadoop Example Usages. Data warehousing; search index creation; machine learning: discovery of statistical patterns (commonly requires multiple iterations/read passes); science applications: CloudBlast genome sequencing, processing telescope images (e.g. from LSST) with machine learning algorithms.

Hadoop Limitations. Common query patterns, e.g. joins, are difficult to express as MapReduce applications. Algorithms that require many iterations are difficult to implement on top of the MapReduce model; there is no point-to-point communication. Critical functionalities are missing, e.g. SQL, in-memory analytics or realtime capabilities. Security is insufficient: authentication, encryption.

Agenda (next: Beyond MapReduce): Motivation, Use Cases, Infrastructure, MapReduce, Beyond MapReduce

Beyond MapReduce and Traditional Databases. MapReduce is too low-level, and traditional databases have limited scalability. Stonebraker: "One Size Fits All": An Idea Whose Time Has Come and Gone. Different data processing tasks have different requirements; Google, for example, has multiple data stores in place: BigTable (a NoSQL store), Dremel (a read-only query engine), Spanner (a SQL ACID engine).

Infrastructure: Beyond MapReduce. MapReduce/Hadoop is increasingly popular: open source/free, good vendor support. However, there are further options: parallel databases, i.e. massively parallel data stores with full SQL functionality; in-memory databases support smaller data volumes but provide lower latencies. [Figure: systems plotted by volume (terabytes to petabytes) and velocity (latency to throughput): traditional databases, parallel databases, in-memory databases, MapReduce, with a region of innovation potential]

Scale Up versus Scale Out. Traditionally, databases were centrally managed, and scaling up was the only way to scale a database: it is limited by physics (CPU, memory, ...), and the cost/performance ratio drops significantly after a certain point (reliable hardware is really expensive!). Scale-out is a new option for scaling databases; shared-nothing architectures enable the greatest scalability.

Column versus Row Data Layout. Traditionally, the data values in a database are stored in a row-oriented fashion, with complete tuples in adjacent blocks on disk or in main memory; column-oriented storage instead stores columns, rather than rows, in adjacent blocks. Traditional databases are row-storage systems optimized for accessing complete tuples: row-oriented storage allows fast reading of single tuples but is not well suited to reading a set of results from a single column. Analytic workloads typically operate along columns, and a column-based data organization avoids accessing data that is not required for a join. [Figure 3.3: row- and column-oriented data layout] Source: Plattner, Zeier, In-Memory Data Management, Springer, 2012
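A minimal sketch of the difference in pure Python (illustrative data; a real column store adds compression and vectorized execution on top of this layout):

    # The same table stored two ways.
    rows = [                       # row layout: complete tuples adjacent
        {"id": 1, "country": "DE", "revenue": 100},
        {"id": 2, "country": "US", "revenue": 250},
        {"id": 3, "country": "DE", "revenue": 175},
    ]
    columns = {                    # column layout: one array per attribute
        "id": [1, 2, 3],
        "country": ["DE", "US", "DE"],
        "revenue": [100, 250, 175],
    }

    # Transactional access: fetch one complete tuple, natural in the row layout.
    print(rows[1])

    # Analytic access: aggregate a single attribute over all tuples;
    # the column layout touches only the "revenue" array, not id/country.
    print(sum(columns["revenue"]))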

Brewer's CAP Theorem. Distributed data systems can only have two out of these three desirable properties (Eric A. Brewer, Towards Robust Distributed Systems, PODC Keynote, 2000). Consistency: you receive a correct response; applications behave as if there is no concurrency. Availability: you eventually receive a response. Partition tolerance: communication in distributed systems is unreliable; the system is capable of dealing with a network split, i.e. the existence of multiple non-communicating groups.

Brewer's CAP Theorem (2). Availability/partition tolerance: a reachable replica provides service even if a network partition exists, which may result in inconsistencies. Consistency/partition tolerance: always consistent, but an online replica may deny service if it cannot synchronize with a replica in another partition. Consistency/availability: available and consistent if no network partition occurs.

In-Memory Databases.
Main memory access: 100 ns
Read 1 MB sequentially from memory: 250,000 ns
Disk seek: 5,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
[Figure 1.2: storage price development for main memory, flash drives and disk drives, 1955-2010] Source: Plattner, Zeier, In-Memory Data Management, Springer, 2011

"Tape is Dead, Disk is Tape, Flash is Disk, RAM Locality is King." Jim Gray

Infrastructure: Beyond MapReduce. Many new data management systems have emerged, both in the proprietary space and in the open source space. [Figure: systems plotted by volume (terabytes to petabytes) and velocity (latency to throughput): traditional databases, parallel databases (e.g. Greenplum), in-memory databases (e.g. SAP HANA), Bigtable, Dremel, Hadoop/HDFS, MapReduce]

Google BigTable. Columnar storage based on the Google Filesystem. Design goals: enable random access to data; manage small, structured data entities in large volumes. Master/slave architecture. Data model: "...a sparse, distributed, persistent multi-dimensional sorted map..." (BigTable paper). Limited support for transactions.

BigTable'Data'Model'' "contents:" "anchor:cnnsi.com" "anchor:my.look.ca" "com.cnn.www" "<html>..." "<html>..." "<html>..." t "CNN" t 9 "CNN.com" t8 5 t 6 t 3 ure 1: A slice Row of ankeys: example table that stores Web pages. The row name is a reversed URL. The contents column family s the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN s home Lexigraphically ordered and used for dynamic partioning ferenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi anchor:my.look.ca. Each Each partition anchor cell is has called one version; tablet the and contents is the column unit has of distribution three versions, at timestamps t 3, t 5, and Column keys: e settled on this Column data model family after examining to group adata variety (access Column control, Families co-location) otential uses Multiple of a Bigtable-like versions stored system. As one conte example that drove some of our design decisions, Column keys are grouped into sets called column f Source: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew 47 pose we Fikes, want Robert toe. keep Gruber, abigtable: copya of Distributed a large Storage collection System for Structured of lies, Data, OSDI, which 2006 formthe basic unit of access control. All stored in a column family is usually of the same type

BigTable-based Systems. Google provides access to BigTable as part of its Google App Engine cloud service. BigTable provided the blueprint for a proliferation of NoSQL stores: Apache HBase, Apache Accumulo, Apache Cassandra.

Google Dremel. Interactive query engine for large data volumes; optimized for semi-structured, nested (denormalized) data; operates on data in-situ, e.g. in the Google Filesystem or BigTable; available as the BigQuery cloud service. [Figure 7 from the Dremel paper: system architecture and execution tree, with a client, a root server, intermediate servers and leaf servers (with local storage) on top of a storage layer such as GFS.] Example from the paper: for the aggregation query

    SELECT A, COUNT(B) FROM T GROUP BY A

the root server determines all tablets, i.e. horizontal partitions of the table, that comprise T and rewrites the query as

    SELECT A, SUM(c) FROM (R1_1 UNION ALL ... R1_n) GROUP BY A

where R1_1, ..., R1_n are the results of queries sent to the leaf nodes.
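A minimal sketch of that two-level aggregation over in-memory "tablets" (pure Python; the data and function names are illustrative):

    from collections import Counter

    # Three tablets (horizontal partitions) of table T with columns A and B.
    tablets = [
        [("x", 1), ("y", 2), ("x", 3)],
        [("y", 4), ("y", 5)],
        [("x", 6)],
    ]

    def leaf_query(tablet):
        """Leaf server: SELECT A, COUNT(B) AS c FROM tablet GROUP BY A."""
        partial = Counter()
        for a, _b in tablet:
            partial[a] += 1
        return partial

    def root_query(partials):
        """Root server: SELECT A, SUM(c) FROM (R1_1 UNION ALL ...) GROUP BY A."""
        total = Counter()
        for p in partials:
            total.update(p)
        return dict(total)

    print(root_query(leaf_query(t) for t in tablets))  # {'x': 3, 'y': 3}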

Hadoop-based Analytics Engines: Hive, Apache Drill, Shark/Spark, Cloudera Impala, Pivotal HAWQ. [Figure 1 from the Shark paper: performance of Shark vs. Hive/Hadoop on two SQL queries from an early user and one iteration of logistic regression; runtime in seconds on a 100-node cluster. Figure 2: Shark architecture with a master node and slave nodes running the Spark runtime on top of HDFS.] Sources: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/; Stoica et al., Shark: SQL and Rich Analytics at Scale, Berkeley, 2012

Big'Data'&'Science' Highly'distributed'heterogeneous'computing' environments'(hpc,'htc)' Complex'usage'models'of'data:' Many'custom'applications' Data'is'distributed'and'accessed'worldwide' MultiSstep'work'flows/data'processing' MapReduce'often'not'suitable' HPC'infrastructures'not'optimal'for'dataSintensive'tasks:' E.g.'Analysis'of'O(PB)'data'archives' Lead'to'a'proliferation'of'custom'domainSspecific' solutions'for'data'management,'e.g.'panda'in'lhc'grid' (Atlas)' 51

Conclusion. Big Data comes from the data exhaust of users, new and pervasive sensors, and the ability to keep everything. One size does not fit all: Hadoop is optimized for embarrassingly parallel tasks; interactive querying and analytics tools are emerging on top of Hadoop; in-memory systems suit highly interactive tasks. Data analysis is the new bottleneck to discovery (in contrast to data acquisition): less than 1% of the world's data is analyzed (IDC Digital Universe Study).

"Data is the Intel Inside of next-generation compute applications." Tim O'Reilly

"Data noise to one group of scientists is gold to another." Rick Smolan

Questions?