Big Data: Science Use Cases and Infrastructure. Dr. André Luckow


1 Big Data: Science Use Cases and Infrastructure. Dr. André Luckow

2 Agenda: Motivation, Use Cases, Infrastructure, MapReduce, Beyond MapReduce

3 "From the dawn of civilization until 2003, humankind generated five exabytes of data. Now we produce five exabytes every two days... and the pace is accelerating." - Eric Schmidt, Google (2010)

4 The Digital Universe

5 "As we begin to distribute trillions of connected sensors around the planet, virtually every animate and inanimate object on earth will be generating and transmitting data, including our homes, our cars, our natural and man-made environments, and yes, even our bodies." - Anthony D. Williams

6 Data Everywhere: A380: TB/flight; Car: GB/drive; Human: GB/day

7 Motivation. McKinsey Big Data Report (2011): "Data have swept into every industry and business function and are now an important factor of production." Challenge: derive more and more value from more and more data in less time.

8 Gartner Hype Cycle. [Figure 1: Hype Cycle for Emerging Technologies, 2012. Source: Gartner, July 2012. Expectations plotted over time through the phases Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, and Plateau of Productivity; Big Data appears among the technologies approaching the Peak of Inflated Expectations.]

9 What is Big Data? "Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." (McKinsey) "Big Data is any data that is expensive to manage and hard to extract value from." (Michael Franklin, Berkeley)

10 The 3 V's of Big Data. The volumes of data that need to be processed (in near real time) are increasing in various domains. How to deal with the 3 V's: Volume, Velocity, Variety?* (*Doug Laney, Gartner, 3-D Data Management, 2001)

11 Agenda: Motivation, Use Cases, Infrastructure, MapReduce, Beyond MapReduce

12 The Fourth Paradigm of Science: 1. Empirical, 2. Theoretical, 3. Computational, 4. Data Exploration. [Slide reproduces the cover and back-cover text of the book, which expands on Jim Gray's vision of a fourth paradigm of discovery based on data-intensive science.] Source: Tony Hey, Stewart Tansley, and Kristin Tolle (eds.), The Fourth Paradigm: Data-Intensive Scientific Discovery.

13 Science is not only Simulation: Experimentation, Theory, Simulation, Big Data

14 Big Data in Sciences.
High Energy Physics: the LHC at CERN produces petabytes of data per day; O(10) TB/day are stored; data is distributed to Tier 1 and Tier 2 sites.
Astronomy: the Sloan Digital Sky Survey collected 80 TB over 7 years; the LSST will produce 40 TB per day (for 10 years).
Genomics: data volume is increasing with every new generation of sequencing machine; a machine can produce TB/day; costs for sequencing are decreasing.
(Jha, Hong, Dobson, Katz, Luckow, Rana, Simmhan: Introducing Distributed, Dynamic, Data-intensive Sciences (D3): Understanding Applications and Infrastructure, in preparation for CCPE)

15 Genome'Sequencing' Source: documents/generaldocuments/cms_ pdf 15

16 Genome Sequencing: 3 GB per sequenced human genome. Source: National Human Genome Research Institute, The Costs of DNA Sequencing

17 Spatial Data: maps & metadata; observational data; location-based services; crowdsourced location data (e.g. GPS probes).

18 Agenda: Motivation, Use Cases, Infrastructure, MapReduce, Beyond MapReduce

19 Types of Data.
Unstructured data: irregular structure, or parts of it lack structure; free-form text, reports, customer feedback forms, pictures and video; low information density.
Structured data: pre-defined schema, relational data; high information density.
[Pyramid: Publications, Derived and Combined Data, Raw Data.]

20 Infrastructure for Big Data. Challenges: disk capacities have increased, but disk latencies and bandwidths have not kept up. How to efficiently store, transfer, and process data?

Year | Capacity | Transfer rate | Time to read full disk
1997 | 2.1 GB   | 16.6 MB/s     | 126 sec
2004 | 200 GB   | 56.5 MB/s     | 59 min
2012 | 3000 GB  | 210 MB/s      | 239 min

(Source: Cloudera.) CPU, RAM, and disk size double every months; seek times only increase 5% per year.
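The read times in the table follow directly from capacity divided by transfer rate; a quick sanity check (using 1 GB = 1000 MB; the results match the table to within rounding):

```python
# Verify the table: time to read a full disk = capacity / transfer rate.
disks = [
    (1997, 2.1, 16.6),     # year, capacity in GB, transfer rate in MB/s
    (2004, 200.0, 56.5),
    (2012, 3000.0, 210.0),
]
for year, capacity_gb, transfer_mbps in disks:
    seconds = capacity_gb * 1000 / transfer_mbps  # 1 GB = 1000 MB
    print(year, round(seconds), "sec =", round(seconds / 60), "min")
# 1997: ~127 sec; 2004: ~59 min; 2012: ~238 min
```

Note how the full-disk read time grows from about two minutes to about four hours even though transfer rates improved: capacity grew far faster than bandwidth, which is exactly the gap data-parallel systems like HDFS exploit by reading many disks at once.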

21 Data Management, Traditional. Storage nodes and compute nodes are separate: 1. copy files across the network to the compute nodes; 2. process the data; 3. copy the results back.

22 Data Management, The Google Way: the Google File System (2003), Google MapReduce (2004), and Google BigTable (2006).

23 Google File System. [Figure 1: GFS architecture. An application uses the GFS client to ask the GFS master (which holds the file namespace) for chunk handles and chunk locations, then reads chunk data directly from GFS chunkservers, which store chunks in a Linux file system. Source: Sanjay Ghemawat et al., The Google File System, 2003.]
Design principles:
- Inspired by distributed/parallel filesystems (PVFS, Lustre)
- Usage of commodity hardware; failures are very common
- Disk seeks are expensive, so GFS is optimized for sequential reads of large files
- Partitioning for optimal data parallelism
But: no strong consistency, no file locking.

24 Google MapReduce. Built on top of the Google File System: bring compute to the data. A programming model for efficiently processing large volumes of data, modeled after functional languages that often have map and reduce functions. It abstracts low-level details such as data distribution, parallelism, and fault tolerance. Phases: Map, Shuffle, Reduce.

25 MapReduce Programming Model.
Query: How many tweets per user?
Select user, count(*) from tweet group by user;
Input: key=row, value=tweet info
Map: output key=user_id, value=1
Shuffle: sort by user_id
Reduce: for each user_id, sum
Output: user_id, tweet count
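The map/shuffle/reduce steps above can be sketched in a few lines of plain Python; the tweet data and function names here are made up for illustration, and a real framework would run the phases across many machines:

```python
from collections import defaultdict

# Hypothetical input, as on the slide: (row key, tweet info) pairs.
tweets = [
    (0, {"user_id": "alice", "text": "hello"}),
    (1, {"user_id": "bob", "text": "hi"}),
    (2, {"user_id": "alice", "text": "again"}),
]

def map_fn(key, value):
    # Map: emit (user_id, 1) for every tweet.
    yield value["user_id"], 1

def shuffle(pairs):
    # Shuffle: group emitted values by key, sorted by user_id.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

def reduce_fn(key, values):
    # Reduce: sum the 1s for each user_id.
    return key, sum(values)

mapped = [pair for k, v in tweets for pair in map_fn(k, v)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(mapped))
print(counts)  # {'alice': 2, 'bob': 1}
```

Only `map_fn` and `reduce_fn` are written by the programmer; the shuffle (and the distribution, parallelism, and fault tolerance) is the framework's job, which is exactly the abstraction the slide describes.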

26 Apache Hadoop. A set of open-source projects inspired by the Google File System and MapReduce. It transforms commodity hardware into a service that stores petabytes of data reliably (Hadoop Filesystem, HDFS) and allows huge distributed computations (MapReduce). Attributes: redundant and reliable (no data loss); batch-processing centric; easy to program distributed apps.

27 Hadoop Components. Hadoop Filesystem (HDFS): an open-source implementation of the Google File System. MapReduce: a key/value-based API for parallel processing of large data volumes. Implementations: Apache Hadoop (MR1, MR2, Local); Hadoop as a Service: Amazon Elastic MapReduce.

28 HDFS Overview. Chunking: input data is split into large chunks (128 MB) to facilitate data-parallel processing. Replication: chunks are distributed across data nodes for performance and reliability. Analysis is moved to the data (avoiding data copies); data is scanned sequentially (avoiding random seeks).
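A minimal sketch of the chunking-and-replication idea, under simplifying assumptions: fixed-size chunks, a made-up round-robin placement (real HDFS placement is rack-aware), and invented node names:

```python
CHUNK_SIZE = 128 * 1024 * 1024  # 128 MB, as on the slide
REPLICATION = 3                 # HDFS's default replication factor

def place_chunks(file_size, data_nodes, chunk_size=CHUNK_SIZE, replicas=REPLICATION):
    """Toy model of HDFS placement: split a file into fixed-size chunks
    and assign each chunk to `replicas` distinct nodes (round-robin here;
    real HDFS is rack-aware)."""
    n_chunks = -(-file_size // chunk_size)  # ceiling division
    placement = {}
    for i in range(n_chunks):
        placement[i] = [data_nodes[(i + r) % len(data_nodes)]
                        for r in range(replicas)]
    return placement

# A 1 GB file on four (hypothetical) data nodes.
plan = place_chunks(file_size=1 * 1024**3,
                    data_nodes=["dn1", "dn2", "dn3", "dn4"])
print(len(plan))  # 8 chunks of 128 MB
print(plan[0])    # chunk 0 lives on 3 distinct nodes
```

Because every chunk exists on several nodes, a map task can usually be scheduled on a node that already holds its input chunk, which is what "moving the analysis to the data" means in practice.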

29 HDFS Architecture

30 Hadoop MapReduce Architecture. 1. A MapReduce application submits a job through the Job Client to the JobTracker. 2. The JobTracker retrieves the input splits from the HDFS NameNode. 3. The JobTracker assigns tasks to TaskTrackers. 4. Each TaskTracker runs Map or Reduce tasks that read from and write to HDFS.

31 Hadoop Distributions. The Hadoop market is very dynamic; Cloudera and Hortonworks are the leading Hadoop distributions. Differentiation is often based on enterprise features, e.g. security and management. There are several cloud-based offerings: Amazon Elastic MapReduce, Hadoop on Azure. (Source: The Forrester Wave, Enterprise Hadoop Solutions)

32 Hadoop Ecosystem. Various tools on top of Hadoop: SQL-based analytics with Hive (SQL query engine), Cloudera Impala, and Apache Drill; Pig (ETL tool); Mahout (machine learning).

33 "Big data is what happened when the cost of keeping information became less than the cost of throwing it away." - George Dyson

34 Hadoop Example Usages. Data warehousing; search index creation; machine learning: discovery of statistical patterns (commonly requires multiple iterations/read passes); science applications: CloudBlast genome sequencing, processing telescope images (e.g. from the LSST) with machine learning algorithms.

35 Hadoop Limitations. Common query patterns, e.g. joins, are difficult to express as MapReduce applications. Algorithms that require many iterations are difficult to implement on top of the MapReduce model; there is no point-to-point communication. Critical functionality is missing, e.g. SQL, in-memory analytics, or real-time capabilities. Security is insufficient: authentication, encryption.

36 Agenda: Motivation, Use Cases, Infrastructure, MapReduce, Beyond MapReduce

37 Beyond MapReduce and Traditional Databases. MapReduce is too low-level; traditional databases have limited scalability. Stonebraker: "One Size Fits All": An Idea Whose Time Has Come and Gone. Different data processing tasks have different requirements. Google, for example, has multiple data stores in place: BigTable (a NoSQL store), Dremel (a read-only query engine), Spanner (a SQL ACID engine).

38 Infrastructure: Beyond MapReduce. MapReduce/Hadoop is increasingly popular: open source/free, good vendor support. However, there are further options. Parallel databases: massively parallel data stores with full SQL functionality. In-memory databases: support smaller data volumes but provide lower latencies. [Figure: innovation potential across volume (terabytes to petabytes) and velocity (latency to throughput), positioning traditional databases, parallel databases, in-memory databases, and MapReduce.]

39 Scale Up versus Scale Out. Traditionally, databases were centrally managed, and scale-up was the only way to scale a database: it is limited by physics (CPU, memory, ...), and the cost/performance ratio drops significantly after a certain point (reliable hardware is really expensive!). Scale-out is a new option for scaling databases; shared-nothing architectures enable the greatest scalability.

40 Column versus Row Data Layout. Traditionally, the data values in a database are stored in a row-oriented fashion, with complete tuples in adjacent blocks on disk or in main memory; in column-oriented storage, columns rather than rows are stored in adjacent blocks (Figure 3.3; source: Plattner, Zeiler, In-Memory Data Management, Springer, 2012). Traditional databases are row storage systems, optimized for accessing complete tuples: row-oriented storage allows fast reading of single tuples but is not well suited to reading a result set from a single column. Analytic workloads typically operate along columns; using a column-based data organization avoids accessing data that is not required for a join.
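The difference can be illustrated with a toy table in both layouts (not a real DBMS; the table contents below are made up). A row store has to visit every full tuple even for a single-column aggregate, while a column store reads exactly the one column it needs:

```python
# The same table in row-oriented and column-oriented layout.
rows = [
    {"id": 1, "region": "EU", "revenue": 10},
    {"id": 2, "region": "US", "revenue": 30},
    {"id": 3, "region": "EU", "revenue": 5},
]

# Column layout: one contiguous sequence per attribute.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Row store: an aggregate over one attribute still walks every full tuple...
total_row_layout = sum(row["revenue"] for row in rows)
# ...while the column store touches only the "revenue" column.
total_col_layout = sum(columns["revenue"])
print(total_row_layout, total_col_layout)  # 45 45
```

In memory the effect is the same as on disk: the column layout keeps the values of one attribute adjacent, so a scan over it is sequential and cache-friendly, which is why analytic (column-oriented) workloads favor it.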

41 Brewer's CAP Theorem. Distributed data systems can have only two of these three desirable properties (Eric A. Brewer, Towards Robust Distributed Systems, PODC Keynote, 2000):
Consistency: you receive a correct response; applications behave as if there is no concurrency.
Availability: you eventually receive a response.
Partition tolerance: communication in distributed systems is unreliable; the system is capable of dealing with a network split, i.e. the existence of multiple non-communicating groups.

42 Brewer's CAP Theorem (2).
Availability/partition tolerance: a reachable replica provides service even if a network partition exists, which may result in inconsistencies.
Consistency/partition tolerance: always consistent, but an online replica may deny service if it cannot synchronize with a replica in another partition.
Consistency/availability: available and consistent if no network partition occurs.

43 In-Memory Databases. Typical access latencies:
Main memory access: 100 ns
Read 1 MB sequentially from memory: 250,000 ns
Disk seek: 5,000,000 ns
Read 1 MB sequentially from disk: 30,000,000 ns
[Figure 1.2: storage price development for main memory, flash drives, and disk drives over the years. Source: Plattner/Zeiler, In-Memory Data Management, Springer.]
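Two ratios worth computing from these numbers, since they motivate both in-memory databases and the sequential-scan design of GFS/HDFS:

```python
# Latency figures from the slide, in nanoseconds.
latency_ns = {
    "main memory access": 100,
    "read 1 MB from memory": 250_000,
    "disk seek": 5_000_000,
    "read 1 MB from disk": 30_000_000,
}

# Sequential reads: disk is 120x slower than memory for the same 1 MB.
seq_ratio = latency_ns["read 1 MB from disk"] / latency_ns["read 1 MB from memory"]

# A single disk seek costs as much as reading 20 MB sequentially from memory.
seek_vs_memory = latency_ns["disk seek"] / latency_ns["read 1 MB from memory"]

print(seq_ratio, seek_vs_memory)  # 120.0 20.0
```

The second ratio explains the earlier "avoid random seeks" principle: a workload dominated by seeks wastes almost all of its time waiting, whereas long sequential scans come within a constant factor of memory speed.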

44 "Tape is dead, disk is tape, flash is disk, RAM locality is king." - Jim Gray

45 Infrastructure: Beyond MapReduce. Many new data management systems have emerged, in the proprietary space as well as in the open-source space. [Figure: the volume/velocity landscape, now populated with concrete systems: Dremel, BigTable, parallel databases (Greenplum), in-memory databases (SAP HANA), traditional databases, and MapReduce on Hadoop/HDFS.]

46 Google BigTable. Columnar storage based on the Google File System. Design goals: enable random access to data; manage small, structured data entities in large volumes. Master/slave architecture. Data model: "...a sparse, distributed, persistent multi-dimensional sorted map..." (BigTable paper). Limited support for transactions.

47 BigTable Data Model. [Figure 1 of the paper: a slice of an example table that stores web pages. The row key is a reversed URL ("com.cnn.www"); the contents column family holds the page contents, and the anchor column family holds the text of anchors referencing the page (e.g. "anchor:cnnsi.com", "anchor:my.look.ca"). Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.]
Row keys: lexicographically ordered and used for dynamic partitioning; each partition is called a tablet and is the unit of distribution.
Column keys: grouped into sets called column families (access control, co-location); all data stored in a column family is usually of the same type. Multiple versions of a cell are stored, indexed by timestamp.
Source: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI, 2006
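The "sparse, distributed, persistent multi-dimensional sorted map" can be sketched as a map from (row key, column key, timestamp) to value; this is a minimal single-machine toy mirroring the paper's web-page example, with helper names made up for illustration:

```python
# Toy BigTable data model: (row_key, column_key, timestamp) -> value.
table = {}

def put(row, column, ts, value):
    table[(row, column, ts)] = value

def read_latest(row, column):
    """Return the most recent version of a cell, or None if absent."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if (r, c) == (row, column)]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", 3, "<html>v1</html>")
put("com.cnn.www", "contents:", 5, "<html>v2</html>")
put("com.cnn.www", "contents:", 6, "<html>v3</html>")
put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")

print(read_latest("com.cnn.www", "contents:"))  # <html>v3</html>

# Rows are kept in lexicographic order, so a scan over a reversed-URL
# prefix like "com.cnn." touches one contiguous range of tablets.
ordered_rows = sorted({r for (r, _, _) in table})
```

Reversing the URL is the key trick: it makes all pages of one domain adjacent in the sort order, so per-domain scans hit a contiguous tablet range instead of being scattered across the table.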

48 BigTable-based Systems. Google provides access to BigTable as part of its Google App Engine cloud service. BigTable provided the blueprint for a proliferation of NoSQL stores: Apache HBase, Apache Accumulo, Apache Cassandra.

49 Google Dremel. An interactive query engine for large data volumes. Optimized for semi-structured, nested (denormalized) data. Operates on data in situ, e.g. in the Google File System or BigTable. Available as the BigQuery cloud service. [Figure 7: system architecture. Queries flow down an execution tree of a root server, intermediate servers, and leaf servers with local storage on top of a storage layer such as GFS. For a simple aggregation such as SELECT A, COUNT(B) FROM T GROUP BY A, the root server determines the tablets (horizontal partitions) comprising T and rewrites the query as an aggregation over the union of partial results computed by the lower levels.]

50 Hadoop-based Analytics Engines: Hive, Apache Drill, Shark/Spark, Cloudera Impala, Pivotal HAWQ. [Figures from the Shark paper: performance of Shark vs. Hive/Hadoop on two SQL queries from an early user and one iteration of logistic regression on a 100-node cluster, and the Shark architecture of a master node and slave nodes running the Spark runtime over HDFS, with warehouse metadata in an external metastore. Shark is Hive-compatible and gains its speed from in-memory columnar storage and compression plus Partial DAG Execution for query reoptimization. Source: Stoica et al., Shark: SQL and Rich Analytics at Scale, Berkeley, 2012.]

51 Big Data & Science. Highly distributed, heterogeneous computing environments (HPC, HTC). Complex usage models of data: many custom applications; data is distributed and accessed worldwide; multi-step workflows/data processing; MapReduce is often not suitable. HPC infrastructures are not optimal for data-intensive tasks, e.g. the analysis of O(PB) data archives. This has led to a proliferation of custom, domain-specific solutions for data management, e.g. PanDA in the LHC grid (ATLAS).

52 Conclusion. Big Data comes from the data exhaust of users, from new and pervasive sensors, and from the ability to keep everything. One size does not fit all: Hadoop is optimized for embarrassingly parallel tasks; interactive querying and analytics tools are emerging on top of Hadoop; in-memory databases serve highly interactive tasks. Data analysis, not data acquisition, is the new bottleneck to discovery: less than 1% of the world's data is analyzed (IDC Digital Universe Study).

53 "Data is the Intel Inside of next-generation compute applications." - Tim O'Reilly

54 "Data noise to one group of scientists is gold to another." - Rick Smolan

55 Questions?


Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Big Data Management in the Clouds. Alexandru Costan IRISA / INSA Rennes (KerData team)

Big Data Management in the Clouds. Alexandru Costan IRISA / INSA Rennes (KerData team) Big Data Management in the Clouds Alexandru Costan IRISA / INSA Rennes (KerData team) Cumulo NumBio 2015, Aussois, June 4, 2015 After this talk Realize the potential: Data vs. Big Data Understand why we

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Big Data Analysis using Hadoop components like Flume, MapReduce, Pig and Hive

Big Data Analysis using Hadoop components like Flume, MapReduce, Pig and Hive Big Data Analysis using Hadoop components like Flume, MapReduce, Pig and Hive E. Laxmi Lydia 1,Dr. M.Ben Swarup 2 1 Associate Professor, Department of Computer Science and Engineering, Vignan's Institute

More information

On the Varieties of Clouds for Data Intensive Computing

On the Varieties of Clouds for Data Intensive Computing On the Varieties of Clouds for Data Intensive Computing Robert L. Grossman University of Illinois at Chicago and Open Data Group Yunhong Gu University of Illinois at Chicago Abstract By a cloud we mean

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql

Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 7, July 2014, pg.759

More information

Big Table A Distributed Storage System For Data

Big Table A Distributed Storage System For Data Big Table A Distributed Storage System For Data OSDI 2006 Fay Chang, Jeffrey Dean, Sanjay Ghemawat et.al. Presented by Rahul Malviya Why BigTable? Lots of (semi-)structured data at Google - - URLs: Contents,

More information

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy Native Connectivity to Big Data Sources in MicroStrategy 10 Presented by: Raja Ganapathy Agenda MicroStrategy supports several data sources, including Hadoop Why Hadoop? How does MicroStrategy Analytics

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Hadoop-BAM and SeqPig

Hadoop-BAM and SeqPig Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer

More information

Integrating Apache Spark with an Enterprise Data Warehouse

Integrating Apache Spark with an Enterprise Data Warehouse Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

A programming model in Cloud: MapReduce

A programming model in Cloud: MapReduce A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value

More information

Big Data Challenges in Bioinformatics

Big Data Challenges in Bioinformatics Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?

More information

Large-Scale Data Processing

Large-Scale Data Processing Large-Scale Data Processing Eiko Yoneki eiko.yoneki@cl.cam.ac.uk http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Firebird meets NoSQL (Apache HBase) Case Study

Firebird meets NoSQL (Apache HBase) Case Study Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com Big Data Primer Alex Sverdlov alex@theparticle.com 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Timo Elliott VP, Global Innovation Evangelist. 2015 SAP SE or an SAP affiliate company. All rights reserved. 1

Timo Elliott VP, Global Innovation Evangelist. 2015 SAP SE or an SAP affiliate company. All rights reserved. 1 Timo Elliott VP, Global Innovation Evangelist 2015 SAP SE or an SAP affiliate company. All rights reserved. 1 Analytics Takes Over The World 2015 SAP SE or an SAP affiliate company. All rights reserved.

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com

Unified Big Data Analytics Pipeline. 连 城 lian@databricks.com Unified Big Data Analytics Pipeline 连 城 lian@databricks.com What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an

More information

HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA SQL ANALYTICS PROCESSING IN THE CLOUD A REAL WORLD CASE STUDY

HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA SQL ANALYTICS PROCESSING IN THE CLOUD A REAL WORLD CASE STUDY HIVE + AMAZON EMR + S3 = ELASTIC BIG DATA SQL ANALYTICS PROCESSING IN THE CLOUD A REAL WORLD CASE STUDY Jaipaul Agonus FINRA Strata Hadoop World New York, Sep 2015 FINRA - WHAT DO WE DO? Collect and Create

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Storage of Structured Data: BigTable and HBase. New Trends In Distributed Systems MSc Software and Systems

Storage of Structured Data: BigTable and HBase. New Trends In Distributed Systems MSc Software and Systems Storage of Structured Data: BigTable and HBase 1 HBase and BigTable HBase is Hadoop's counterpart of Google's BigTable BigTable meets the need for a highly scalable storage system for structured data Provides

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Reference Architecture, Requirements, Gaps, Roles

Reference Architecture, Requirements, Gaps, Roles Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture

More information

Timo Elliott VP, Global Innovation Evangelist. 2015 SAP SE or an SAP affiliate company. All rights reserved. 1

Timo Elliott VP, Global Innovation Evangelist. 2015 SAP SE or an SAP affiliate company. All rights reserved. 1 Timo Elliott VP, Global Innovation Evangelist 2015 SAP SE or an SAP affiliate company. All rights reserved. 1 Analytics Takes Over The World 2015 SAP SE or an SAP affiliate company. All rights reserved.

More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

Spark: Making Big Data Interactive & Real-Time

Spark: Making Big Data Interactive & Real-Time Spark: Making Big Data Interactive & Real-Time Matei Zaharia UC Berkeley / MIT www.spark-project.org What is Spark? Fast and expressive cluster computing system compatible with Apache Hadoop Improves efficiency

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Snapshots in Hadoop Distributed File System

Snapshots in Hadoop Distributed File System Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

How Companies are! Using Spark

How Companies are! Using Spark How Companies are! Using Spark And where the Edge in Big Data will be Matei Zaharia History Decreasing storage costs have led to an explosion of big data Commodity cluster software, like Hadoop, has made

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES AWS GLOBAL INFRASTRUCTURE 10 Regions 25 Availability Zones 51 Edge locations WHAT

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Data Management in SAP Environments

Data Management in SAP Environments Data Management in SAP Environments the Big Data Impact Berlin, June 2012 Dr. Wolfgang Martin Analyst, ibond Partner und Ventana Research Advisor Data Management in SAP Environments Big Data What it is

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information