Big Data: Science Use Cases and Infrastructure
Dr. André Luckow
Agenda
- Motivation
- Use Cases
- Infrastructure
- MapReduce
- Beyond MapReduce
"From the dawn of civilization until 2003, humankind generated five exabytes of data. Now we produce five exabytes every two days... and the pace is accelerating."
- Eric Schmidt, Google (2010)
The Digital Universe
"As we begin to distribute trillions of connected sensors around the planet, virtually every animate and inanimate object on earth will be generating and transmitting data, including our homes, our cars, our natural and man-made environments, and yes, even our bodies."
- Anthony D. Williams
Data Everywhere
- A380: TB per flight
- Car: GB per drive
- Human: GB per day
Motivation
McKinsey Big Data Report (2011): "Data have swept into every industry and business function and are now an important factor of production."
Challenge: derive more and more value from more and more data in less time.
Source: http://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
Gartner Hype Cycle
Figure 1: Hype Cycle for Emerging Technologies, 2012, which places Big Data near the Peak of Inflated Expectations, alongside technologies such as the Internet of Things, HTML5 and Crowdsourcing. Source: Gartner (July 2012).
What is Big Data?
"Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." (McKinsey)
"Big Data is any data that is expensive to manage and hard to extract value from." (Michael Franklin, Berkeley)
The 3 Vs of Big Data
The volumes of data that need to be processed (in near real time) are increasing in various domains. How to deal with the 3 Vs?*
- Volume
- Velocity
- Variety
*Doug Laney, Gartner, 3-D Data Management, 2001
Source: http://strataconf.com/strataeu
Agenda
- Motivation
- Use Cases
- Infrastructure
- MapReduce
- Beyond MapReduce
The Fourth Paradigm of Science
1. Empirical
2. Theoretical
3. Computational
4. Data Exploration
Jim Gray's vision of a fourth paradigm of discovery based on data-intensive science, beyond experimental and theoretical research and computer simulation.
Source: Tony Hey, Stewart Tansley, and Kristin Tolle (eds.), The Fourth Paradigm: Data-Intensive Scientific Discovery
Science is not only Simulation
- Experimentation
- Theory
- Simulation
- Big Data
Big Data in Sciences
High Energy Physics:
- The LHC at CERN produces petabytes of data per day; O(10) TB/day are stored
- Data is distributed to Tier 1 and Tier 2 sites
Astronomy:
- Sloan Digital Sky Survey (80 TB over 7 years)
- LSST will produce 40 TB per day (for 10 years)
Genomics:
- Data volume increases with every new generation of sequencing machine; a single machine can produce TB/day
- Costs for sequencing are decreasing
Jha, Hong, Dobson, Katz, Luckow, Rana, Simmhan: Introducing Distributed, Dynamic, Data-intensive Sciences (D3): Understanding Applications and Infrastructure, in preparation for CCPE, 2013
Genome Sequencing
Source: http://www3.appliedbiosystems.com/cms/groups/applied_markets_marketing/documents/generaldocuments/cms_096460.pdf
Genome Sequencing
~3 GB per sequenced human genome
Source: National Human Genome Research Institute, The Costs of DNA Sequencing
Spatial Data
- Maps & metadata
- Observational data
- Location-based services
- Crowdsourced location data (e.g. GPS probes)
Source: http://blog.uber.com/2012/01/09/uberdata-san-franciscomics/
Agenda
- Motivation
- Use Cases
- Infrastructure
- MapReduce
- Beyond MapReduce
Types of Data
Unstructured data:
- Irregular structure, or parts of the data lack structure
- Free-form text, reports, customer feedback forms, pictures and video
- Low information density
Structured data:
- Pre-defined schema, relational data
- High information density
(Figure: data pyramid, from raw data, to derived and combined data, to publications)
Infrastructure for Big Data
Challenges: disk capacities have increased, but disk latencies and bandwidths have not kept up. How to efficiently store, transfer and process data?

Year | Capacity | Transfer rate | Time to read full disk
1997 | 2.1 GB   | 16.6 MB/s     | 126 sec
2004 | 200 GB   | 56.5 MB/s     | 59 min
2012 | 3000 GB  | 210 MB/s      | 239 min

Source: Cloudera. CPU, RAM and disk capacity double every 18-24 months, while seek times improve only about 5% per year.
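The read times in the table follow directly from capacity divided by sustained transfer rate; a quick sketch (the disk figures are the ones from the table above, the code itself is only illustrative):

```python
# Full-disk sequential scan time = capacity / transfer rate.
# Disk figures are taken from the Cloudera-sourced table above.
disks = [
    (1997, 2.1, 16.6),      # year, capacity in GB, transfer rate in MB/s
    (2004, 200.0, 56.5),
    (2012, 3000.0, 210.0),
]

for year, capacity_gb, transfer_mb_s in disks:
    seconds = capacity_gb * 1000 / transfer_mb_s   # decimal GB -> MB
    if seconds < 600:
        print(f"{year}: {seconds:.0f} sec")
    else:
        print(f"{year}: {seconds / 60:.0f} min")
```

The point of the table becomes visible immediately: capacity grew by three orders of magnitude while scan time grew by two, which is why Hadoop-style systems scan many disks in parallel instead of one disk sequentially.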
Data Management: Traditional
Storage nodes and compute nodes are separated by the network:
1. Copy files from the storage nodes to the compute nodes
2. Process the data
3. Copy the results back
Data Management: The Google Way
- Google File System: http://research.google.com/archive/gfs.html (2003)
- Google MapReduce: http://research.google.com/archive/mapreduce.html (2004)
- Google BigTable: http://research.google.com/archive/bigtable.html (2006)
Google Filesystem
Design principles:
- Inspired by distributed/parallel filesystems (PVFS, Lustre)
- Usage of commodity hardware; failures are very common
- Disk seeks are expensive: optimized for sequential reads of large files
- Partitioning for optimal data parallelism
But:
- No strong consistency
- No file locking
Figure 1: GFS architecture. The client asks the master which chunkservers to contact, caches that information, and then reads and writes chunk data directly from the chunkservers, which store chunks in the local Linux file system. The master keeps the file namespace and chunk metadata in memory and never serves file data itself, so it does not become a bottleneck.
Source: Sanjay Ghemawat et al., The Google File System, 2003
Google MapReduce
- Built on top of the Google Filesystem
- Brings compute to the data
- Programming model for efficiently processing large volumes of data
- Modeled after functional languages, which often have map and reduce functions
- Abstracts low-level details such as data distribution, parallelism and fault tolerance
(Figure: Map, then Shuffle, then Reduce)
MapReduce Programming Model
Query: how many tweets per user?

Select user, count(*) from tweet group by user;

- Input: key=row, value=tweet info
- Map: output key=user_id, value=1
- Shuffle: sort by user_id
- Reduce: for each user_id, sum
- Output: user_id, tweet count
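The five steps above can be sketched as a single-process Python toy (the tweet records are invented; a real framework additionally handles input splitting, distribution and fault tolerance):

```python
from collections import defaultdict

def map_phase(records):
    """Map: for each (row, tweet) record, emit (user_id, 1)."""
    for _row, tweet in records:
        yield tweet["user_id"], 1

def shuffle(pairs):
    """Shuffle: group the emitted values by key (user_id)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each user_id."""
    return {user: sum(counts) for user, counts in groups.items()}

tweets = [(1, {"user_id": "alice", "text": "hi"}),
          (2, {"user_id": "bob", "text": "hello"}),
          (3, {"user_id": "alice", "text": "again"})]

counts = reduce_phase(shuffle(map_phase(tweets)))
print(counts)  # {'alice': 2, 'bob': 1}
```

The same three functions, distributed across machines with the shuffle done over the network, are exactly the tweets-per-user job described above.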
Apache Hadoop
A set of open source projects inspired by the Google Filesystem and MapReduce. Transforms commodity hardware into a service that:
- Stores petabytes of data reliably (Hadoop Filesystem, HDFS)
- Allows huge distributed computations (MapReduce)
Attributes:
- Redundant and reliable (no data loss)
- Batch-processing centric
- Easy to program distributed apps
Hadoop Components
- Hadoop Filesystem (HDFS): open source implementation of the Google Filesystem
- MapReduce: key/value-based API for the parallel processing of large data volumes
- Implementations: Apache Hadoop (MR1, MR2, Local)
- Hadoop as a Service: Amazon Elastic MapReduce
HDFS Overview
- Chunking: data is split into large chunks (128 MB) to facilitate data-parallel processing
- Replication: data is distributed across data nodes for performance and reliability
- Analysis is moved to the data (avoiding data copies)
- Data is scanned sequentially (avoiding random seeks)
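A minimal sketch of the chunking and replication idea, assuming a 128 MB chunk size, a replication factor of 3 and a naive round-robin placement (real HDFS uses a rack-aware placement policy; the node names are invented):

```python
# Toy sketch of HDFS-style chunking and replica placement.
CHUNK_SIZE = 128 * 1024 * 1024   # 128 MB
REPLICATION = 3
DATA_NODES = ["node1", "node2", "node3", "node4"]

def place_chunks(file_size):
    """Return a placement map: chunk index -> list of data nodes."""
    n_chunks = -(-file_size // CHUNK_SIZE)   # ceiling division
    placement = {}
    for chunk in range(n_chunks):
        # Round-robin: each chunk gets REPLICATION distinct nodes.
        placement[chunk] = [DATA_NODES[(chunk + r) % len(DATA_NODES)]
                            for r in range(REPLICATION)]
    return placement

# A 300 MB file needs 3 chunks, each stored on 3 of the 4 nodes.
plan = place_chunks(300 * 1024 * 1024)
print(len(plan))   # 3
print(plan[0])     # ['node1', 'node2', 'node3']
```

Because every chunk lives on several nodes, a map task can almost always be scheduled on a node that already holds its input chunk, which is what "moving analysis to the data" means in practice.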
HDFS Architecture
Source: http://hadoop.apache.org/docs/stable/hdfs_design.html
Hadoop MapReduce Architecture
1. The MapReduce application submits a job via the Job Client to the JobTracker
2. The JobTracker retrieves the input splits from the HDFS NameNode
3. The JobTracker assigns tasks to the TaskTrackers
4. The TaskTrackers run the Map or Reduce tasks against HDFS
Hadoop Distributions
- The Hadoop market is very dynamic
- Cloudera and Hortonworks are the leading Hadoop distributions
- Differentiation is often based on enterprise features, e.g. security and management
- Several cloud-based offerings: Amazon's Elastic MapReduce, Hadoop on Azure
Source: The Forrester Wave, Enterprise Hadoop Solutions, Q1 2012
Hadoop Ecosystem
Various tools on top of Hadoop:
- SQL-based analytics: Hive (SQL query engine), Cloudera Impala, Apache Drill
- Pig (ETL tool)
- Mahout (machine learning)
"Big data is what happened when the cost of keeping information became less than the cost of throwing it away."
- George Dyson
Hadoop Example Usages
- Data warehousing
- Search index creation
- Machine learning: discovery of statistical patterns (commonly requires multiple iterations/read passes)
- Science applications: CloudBlast genome sequencing; processing telescope images (e.g. from the LSST) with machine learning algorithms
Hadoop Limitations
- Common query patterns, e.g. joins, are difficult to express as MapReduce applications
- Algorithms that require many iterations are difficult to implement on top of the MapReduce model (no point-to-point communication)
- Critical functionality is missing, e.g. SQL, in-memory analytics or realtime capabilities
- Security is insufficient: authentication, encryption
Agenda
- Motivation
- Use Cases
- Infrastructure
- MapReduce
- Beyond MapReduce
Beyond MapReduce and Traditional Databases
- MapReduce is too low-level; traditional databases have limited scalability
- Stonebraker: "One Size Fits All": An Idea Whose Time Has Come and Gone
- Different data processing tasks have different requirements
- Google, for example, has multiple data stores in place:
  - BigTable (a NoSQL store)
  - Dremel (a read-only query engine)
  - Spanner (a SQL ACID engine)
Infrastructure: Beyond MapReduce
MapReduce/Hadoop is increasingly popular:
- Open source/free
- Good vendor support
However, there are further options:
- Parallel databases: massively parallel data stores with full SQL functionality
- In-memory databases: support smaller data volumes, but provide lower latencies
(Figure: innovation potential of traditional databases, parallel databases, in-memory databases and MapReduce, positioned by volume (terabytes to petabytes) and velocity (latency to throughput))
Scale Up versus Scale Out
Traditionally, databases were centrally managed, and scale-up was the only way to scale a database:
- Limited by physics (CPU, memory, ...)
- The cost/performance ratio drops significantly after a certain point (reliable hardware is really expensive!)
Scale-out is a new option for scaling databases:
- Shared-nothing architectures enable the greatest scalability
Column versus Row Data Layout
- Traditional databases are row storage systems, optimized for accessing complete tuples: data values are stored row-oriented, with complete tuples in adjacent blocks on disk or in main memory
- Analytic workloads typically operate along columns; using a column-based data organization avoids accessing data that is not required for a join
(Figure 3.3: row- and column-oriented data layout. Row-oriented storage allows fast reading of single tuples but is not well suited to reading a set of results from a single column.)
Source: Plattner, Zeier, In-Memory Data Management, Springer, 2012
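The difference can be illustrated with a toy table (column names and values invented): an aggregate over one column touches every field of every tuple in a row layout, but only a single contiguous array in a column layout:

```python
# Row layout: complete tuples stored adjacently. An aggregate over one
# column still walks every field of every row.
rows = [
    {"id": 1, "region": "EU", "revenue": 10.0},
    {"id": 2, "region": "US", "revenue": 20.0},
    {"id": 3, "region": "EU", "revenue": 5.0},
]
total_row_layout = sum(r["revenue"] for r in rows)

# Column layout: each column stored contiguously. The same aggregate
# scans only the values it actually needs.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "revenue": [10.0, 20.0, 5.0],
}
total_column_layout = sum(columns["revenue"])

print(total_row_layout, total_column_layout)  # 35.0 35.0
```

Both layouts yield the same answer; the column layout simply reads a third of the data here, and in a wide analytic table the saving is proportionally larger.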
Brewer's CAP Theorem
Distributed data systems can only have two out of these three desirable properties (Eric A. Brewer, Towards Robust Distributed Systems, PODC Keynote, 2000):
- Consistency: you receive a correct response; applications behave as if there is no concurrency.
- Availability: you eventually receive a response.
- Partition tolerance: communication in distributed systems is unreliable; the system is capable of dealing with a network split, i.e. the existence of multiple non-communicating groups.
Brewer's CAP Theorem (2)
- Availability/partition tolerance: a reachable replica provides service even if a network partition exists, which may result in inconsistencies.
- Consistency/partition tolerance: always consistent, but an online replica may deny service if it cannot synchronize with a replica in another partition.
- Consistency/availability: available and consistent as long as no network partition occurs.
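The AP-versus-CP choice during a partition can be made concrete with a toy replica (the `Replica` class and policy names are invented for this sketch):

```python
# Toy illustration of the AP-versus-CP trade-off when a network
# partition separates a replica from its peers.
class Replica:
    def __init__(self):
        self.value = "v1"
        self.synced = True   # can this replica reach the others?

def read(replica, policy):
    if replica.synced:
        return replica.value   # no partition: consistent AND available
    if policy == "AP":
        return replica.value   # stay available, possibly serve stale data
    if policy == "CP":
        # stay consistent: refuse to answer while out of sync
        raise RuntimeError("replica cannot synchronize; denying service")

r = Replica()
r.synced = False               # simulate a network partition
print(read(r, "AP"))           # served, but may be stale
try:
    read(r, "CP")
except RuntimeError as e:
    print(e)                   # service denied to preserve consistency
```

The two branches correspond exactly to the first two bullet points above; the third (consistency/availability) is the `synced` case, where no trade-off is needed.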
In-Memory Databases
- Main memory access: 100 ns
- Read 1 MB sequentially from memory: 250,000 ns
- Disk seek: 5,000,000 ns
- Read 1 MB sequentially from disk: 30,000,000 ns
(Figure 1.2: storage price development for main memory, flash drives and disk drives, 1955-2010)
Source: Plattner/Zeier, In-Memory Data Management, Springer, 2011
"Tape is dead, disk is tape, flash is disk, RAM locality is king."
- Jim Gray
Infrastructure: Beyond MapReduce
Many new data management systems have emerged, both in the proprietary space as well as in open source.
(Figure: systems positioned by volume (terabytes to petabytes) and velocity (latency to throughput): traditional databases; parallel databases (Greenplum); in-memory databases (SAP HANA); MapReduce (Hadoop/HDFS); BigTable; Dremel)
Google BigTable
- Columnar storage based on the Google Filesystem
- Design goals: enable random access to data; manage small, structured data entities in large volumes
- Master/slave architecture
- Data model: "...a sparse, distributed, persistent multi-dimensional sorted map..." (BigTable paper)
- Limited support for transactions
BigTable Data Model
- Row keys: lexicographically ordered and used for dynamic partitioning; each partition is called a tablet and is the unit of distribution
- Column keys: grouped into sets called column families (access control, co-location); column families form the basic unit of access control
- Multiple versions of each cell are stored, indexed by timestamp
(Figure 1: a slice of an example table that stores web pages. The row name is a reversed URL, e.g. "com.cnn.www"; the contents column family holds the page contents, and the anchor column family contains the text of any anchors that reference the page.)
Source: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber: Bigtable: A Distributed Storage System for Structured Data, OSDI, 2006
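The "sparse, distributed, persistent multi-dimensional sorted map" can be sketched as a plain map keyed by (row key, column key, timestamp); the cells below follow the webtable example from the paper, with abbreviated values:

```python
# Sketch of BigTable's data model: a sparse map from
# (row key, column_family:qualifier, timestamp) -> value.
table = {
    ("com.cnn.www", "contents:", 6): "<html>...",
    ("com.cnn.www", "contents:", 5): "<html>...",
    ("com.cnn.www", "contents:", 3): "<html>...",
    ("com.cnn.www", "anchor:cnnsi.com", 9): "CNN",
    ("com.cnn.www", "anchor:my.look.ca", 8): "CNN.com",
}

def read_cell(table, row, column):
    """Return the most recent version of a cell (highest timestamp)."""
    versions = {ts: v for (r, c, ts), v in table.items()
                if r == row and c == column}
    return versions[max(versions)] if versions else None

print(read_cell(table, "com.cnn.www", "contents:"))        # <html>...
print(read_cell(table, "com.cnn.www", "anchor:cnnsi.com")) # CNN
```

Reversing the URL in the row key is a deliberate design choice: because rows are kept in lexicographic order, all pages of one domain (everything under "com.cnn.") become a contiguous key range and thus tend to land in the same tablet.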
BigTable-based Systems
- Google provides access to BigTable as part of its Google App Engine cloud service
- BigTable provided the blueprint for a proliferation of NoSQL stores:
  - Apache HBase
  - Apache Accumulo
  - Apache Cassandra
Google Dremel
- Interactive query engine for large data volumes
- Optimized for semi-structured, nested (denormalized) data
- Operates on data in situ, e.g. in the Google Filesystem or BigTable
- Available as the BigQuery cloud service
Queries are executed by a tree of servers: a root server, intermediate servers, and leaf servers with local storage, on top of a storage layer (e.g. GFS). Consider a simple aggregation query:

SELECT A, COUNT(B) FROM T GROUP BY A

When the root server receives this query, it determines all tablets, i.e. horizontal partitions of the table, that comprise T and rewrites the query as:

SELECT A, SUM(c) FROM (R1_1 UNION ALL ... R1_n) GROUP BY A

where R1_1, ..., R1_n are the results of the queries sent to the nodes at the next level of the tree.
(Figure 7: system architecture and execution inside a server node)
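The rewrite can be mimicked in a few lines of Python: each "leaf" computes partial counts over its tablet, and the "root" sums the unioned partials (the tablet contents are invented for illustration):

```python
from collections import Counter

# Sketch of Dremel's serving-tree rewrite for
#   SELECT A, COUNT(B) FROM T GROUP BY A
tablets = [
    [("x", 1), ("y", 2), ("x", 3)],   # tablet held by leaf server 1
    [("y", 4), ("y", 5)],             # tablet held by leaf server 2
]

def leaf_query(tablet):
    """Leaf: SELECT A, COUNT(B) over one horizontal partition."""
    partial = Counter()
    for a, b in tablet:
        if b is not None:             # COUNT(B) skips NULLs
            partial[a] += 1
    return partial

def root_query(partials):
    """Root: SELECT A, SUM(c) over the unioned partial results."""
    total = Counter()
    for p in partials:
        total.update(p)               # sums the per-tablet counts
    return dict(total)

print(root_query(leaf_query(t) for t in tablets))  # {'x': 2, 'y': 3}
```

The intermediate servers in the real tree do the same thing as `root_query`, just hierarchically, which keeps any single server's merge workload small even over thousands of tablets.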
Hadoop-based Analytics Engines
- Hive
- Apache Drill
- Shark/Spark
- Cloudera Impala
- Pivotal Hawq
(Figure 1: performance of Shark vs. Hive/Hadoop on two SQL queries from an early user and one iteration of logistic regression; runtimes in seconds on a 100-node cluster.)
Shark is implemented to be compatible with Apache Hive: it can query an existing Hive warehouse and return results much faster, without modification to either the data or the queries. Built on Spark's RDDs, it adds in-memory columnar storage and compression as well as Partial DAG Execution, which re-optimizes a running query based on observed statistics. Users can load high-value data into Shark's memory store for fast analytics:

CREATE TABLE latest_logs TBLPROPERTIES ("shark.cache"=true)
AS SELECT * FROM logs WHERE date > now()-3600;

Sources: Stoica et al., Shark: SQL and Rich Analytics at Scale, Berkeley, 2012; http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
Big Data & Science
- Highly distributed, heterogeneous computing environments (HPC, HTC)
- Complex usage models of data: many custom applications; data is distributed and accessed worldwide; multi-step workflows/data processing
- MapReduce is often not suitable
- HPC infrastructures are not optimal for data-intensive tasks, e.g. the analysis of O(PB) data archives
- This led to a proliferation of custom domain-specific solutions for data management, e.g. PanDA in the LHC grid (ATLAS)
Conclusion
- Big Data comes from the data exhaust of users, new and pervasive sensors, and the ability to keep everything
- One size does not fit all:
  - Hadoop is optimized for embarrassingly parallel tasks
  - Interactive querying and analytics tools are emerging on top of Hadoop
  - In-memory databases serve highly interactive tasks
- Data analysis is the new bottleneck to discovery, in contrast to data acquisition: less than 1% of the world's data is analyzed (IDC Digital Universe Study)
"Data is the Intel Inside of next-generation compute applications."
- Tim O'Reilly
"Data noise to one group of scientists is gold to another."
- Rick Smolan
Questions?