Parallel Techniques for Big Data. Patrick Valduriez INRIA, Montpellier
1 Parallel Techniques for Big Data
Patrick Valduriez, INRIA, Montpellier
2 Outline of the Talk
- Big data: problem and issues
- Parallel data processing
- Parallel architectures
- Parallel techniques
- Cloud data management
- NoSQL DBMS
- MapReduce
- Conclusion
3 Big Data: problem and issues
4 Big Data: what is it?
- A buzz word!
  - With different meanings depending on your perspective
  - E.g. 10 terabytes is big for a TP system, but small for a web search engine
- A definition (Wikipedia)
  - Consists of data sets that grow so large that they become awkward to work with using on-hand data management tools
  - Difficulties: capture, storage, search, sharing, analytics, visualizing
  - But size is only one dimension of the problem
- How big is big?
  - Moving target: terabyte (10^12 bytes), petabyte (10^15 bytes)
5 Why Big Data Today?
- Overwhelming amounts of data
  - Generated by all kinds of devices, networks and programs
  - E.g. sensors, mobile devices, internet, social networks, computer simulations, satellites, radiotelescopes, etc.
- Increasing storage capacity
  - Storage capacity has doubled every 3 years since 1980 with prices steadily going down
  - 1 Gigabyte for: 1M$ in 1982, 1K$ in 1995, 0.12$ in 2011
- Very useful in a digital world!
  - Massive data can produce high-value information and
6 Some estimates
- 1.8 zettabytes: an estimate of the data stored by humankind in zettabytes in 2020
- Less than 1% of big data is analyzed
- Less than 20% of big data is protected
Source: Digital Universe study of International Data Corporation, December
7 Big Data Dimensions: the three Vs
- Volume
  - Refers to massive amounts of data
  - Makes it hard to store and manage, but also to analyze (big analytics)
- Velocity
  - Continuous data streams are being captured (e.g. from sensors or mobile devices) and produced
  - Makes it hard to perform online processing
- Variety
  - Different data formats (sequences, graphs, arrays, ...), different semantics, uncertain data (because of data capture), multiscale data (with lots of dimensions)
8 Scientific Data: common features
- Big data
- Manipulated through complex, distributed workflows
- Important metadata about experiments and their provenance
- Mostly append-only (with rare updates)
9 Parallel Data Processing
10 The solution to big data processing!
- Exploit a massively parallel computer
  - A computer that interconnects lots of CPUs, RAM and disk units
- To obtain
  - High performance through data-based parallelism
    - High throughput for OLTP loads
    - Low response time for OLAP queries
  - High availability and reliability through data replication
  - Extensibility of the architecture
11 Extensibility: ideal goals
- Linear increase in performance for a constant database size and load, and a proportional increase of the system components (CPU, memory, disk)
- Sustained performance for a linear increase of database size and load, and a proportional increase of components
[Figure: two plots of performance. Left: ideal performance versus components. Right: ideal performance versus components + (db & load).]
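The two goals above are commonly measured as speed-up and scale-up. Those two names, and the helper functions below, are not on the slide; they are assumptions used for a minimal sketch of how the ideal curves could be checked from measured elapsed times.

```python
def speedup(time_base: float, time_scaled: float) -> float:
    """Same database size and load; the scaled system has more components."""
    return time_base / time_scaled


def is_linear_speedup(time_base: float, time_scaled: float,
                      factor: float, tolerance: float = 0.05) -> bool:
    """Ideal case: multiplying components by `factor` divides elapsed time by `factor`."""
    return abs(speedup(time_base, time_scaled) / factor - 1.0) <= tolerance


def scaleup(time_base: float, time_scaled: float) -> float:
    """Database size, load and components all grown by the same factor.

    The ideal (sustained performance) value is 1.0.
    """
    return time_base / time_scaled


# Hypothetical measurements: 100 s on the base system, 25 s with 4x components.
print(speedup(100.0, 25.0))
print(is_linear_speedup(100.0, 25.0, factor=4))
```

A scale-up well below 1.0 would indicate that performance is not sustained as data, load and components grow together.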
12 Data-based Parallelism
- Inter-query
  - Different queries on the same data
  - For concurrent queries
- Inter-operation
  - Different operations of the same query on different data
  - For complex queries
- Intra-operation
  - The same operation on different data
  - For large queries
[Figures: queries Q1 ... Qn on data D; operations Op1, Op2 and Op3 on data D1 and D2; the same operation Op on data D1 ... Dn.]
13 Parallel Architectures
14 Parallel Architectures for Data Management
- Three main alternatives, depending on how processors, memory and disk are interconnected
  - Shared-memory computer
  - Shared-disk cluster
  - Shared-nothing cluster
15 Shared-memory Computer
- All memory and disk are shared
  - Symmetric Multiprocessor (SMP)
  - Non Uniform Memory Architecture (NUMA)
- Examples
  - IBM Numascale, HP Proliant, Data General NUMALiiNE, Bull Novascale
- Pros: simple for apps, fast communication, load balancing
- Cons: complex interconnect limits extensibility, cost
- Suited to write-intensive workloads, but expensive for big data
[Figure: processors (P) sharing memory (M).]
16 Shared-disk (SD) Cluster
- Disk is shared, memory is private
  - Storage Area Network (SAN) to interconnect memory and disk (block level)
  - Needs distributed lock manager (DLM) for cache coherence
- Examples
  - Oracle RAC and Exadata
  - IBM PowerHA
- Pros: simple for apps, extensibility
- Cons: complex DLM, cost
- Suited to write-intensive workloads or big data
[Figure: processors (P) with private memory (M) sharing disks.]
17 Shared-nothing (SN) Cluster
- No sharing of either memory or disk across nodes
  - No need for DLM
  - But needs data partitioning
- Examples
  - DB2 DPF, SQL Server Parallel DW, Teradata, MySQL Cluster
  - Google search engine, NoSQL key-value stores (Bigtable, ...)
- Pros: highest extensibility, cost
- Cons: updates, distributed transactions
- Perfect match for big data (read intensive)
[Figure: processors (P), each with private memory (M) and disk.]
18 Parallel Data Management Techniques
19 A Simple Model for Parallel Data
- Shared-nothing architecture
  - The most general and most scalable
- Set-oriented
  - Each dataset D is represented by a table of rows
- Key-value
  - Each row is represented by a <key, value> pair, where
    - Key uniquely identifies the value in D
    - Value is a list of (attribute name : attribute value) pairs
- Can represent structured (relational) data or NoSQL data
  - But graph is another story (see Pregel or DEX)
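The key-value model above can be sketched with plain dictionaries. This is only an illustration under assumptions: the type aliases, the `put`/`get` helpers and the sample rows are hypothetical and do not come from the talk.

```python
from typing import Any, Dict, Optional

# A value is a list of (attribute name : attribute value) pairs, modeled as a dict.
Value = Dict[str, Any]
# A dataset D maps each unique key to its value.
Dataset = Dict[str, Value]


def put(dataset: Dataset, key: str, value: Value) -> None:
    """Store (or overwrite) the row identified by key."""
    dataset[key] = dict(value)


def get(dataset: Dataset, key: str) -> Optional[Value]:
    """Return the row identified by key, or None if it is absent."""
    return dataset.get(key)


emp: Dataset = {}
# A relational-style row: every attribute is present.
put(emp, "tid1", {"ENAME": "Smith", "TITLE": "Engineer", "CITY": "Paris"})
# A NoSQL-style row: attributes may be missing, no schema is enforced.
put(emp, "tid2", {"ENAME": "Lee", "CITY": "Montpellier"})

print(get(emp, "tid1"))
```

The same structure holds both rows with a fixed set of attributes and rows with missing attributes, which is what lets the model cover relational and NoSQL data.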
20 Design Considerations
- Big datasets
  - Data partitioning and indexing
    - Problem with skewed data distributions
  - Disk is very slow (10K times slower than RAM)
    - Exploit RAM data structures and compression
    - Exploit flash memory (read 10 times faster than disk)
- Query parallelization and optimization
  - Automatic if the query language is declarative (e.g. SQL)
    - Parallel algorithms for algebraic operators
    - Select is easy, Join is difficult
  - Programmer-assisted otherwise (e.g. MapReduce)
21 Data Partitioning
- Vertical
  - Basis for column stores (e.g. MonetDB, Vertica): efficient for OLAP queries
  - Easy to compress, e.g. using Bloom filters
- Horizontal (sharding)
  - Shards can be stored (and replicated) at different nodes
[Figure: a table of keys and values, partitioned vertically and horizontally.]
22 Sharding Schemes
- Round-robin
  - ith row to node (i mod n)
  - Perfect balancing, but full scan only
- Hashing
  - (k,v) to node h(k)
  - Exact-match queries, but problem with skew
- Range
  - (k,v) to node that holds k's interval
  - Exact-match and range queries
  - Deals with skew
[Figure: range partitions a-g, h-m, ..., u-z.]
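The three schemes can be sketched as small routing functions. This is a minimal sketch under assumptions: the function names, the CRC32 hash and the interval bounds ("g", "m", "t", "z", keyed on the first lowercase letter) are illustrative choices, not taken from the slide.

```python
import bisect
import zlib
from typing import List


def round_robin_node(i: int, n: int) -> int:
    """The i-th row goes to node (i mod n)."""
    return i % n


def hash_node(key: str, n: int) -> int:
    """(k, v) goes to node h(k); CRC32 is an arbitrary deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % n


def range_node(key: str, upper_bounds: List[str]) -> int:
    """(k, v) goes to the node whose interval contains the first letter of k.

    upper_bounds[j] is the last letter handled by node j. Keys are assumed
    to start with a lowercase letter no greater than the final bound.
    """
    return bisect.bisect_left(upper_bounds, key[0])


bounds = ["g", "m", "t", "z"]  # hypothetical intervals a-g, h-m, n-t, u-z
print(round_robin_node(5, 4), hash_node("alice", 4), range_node("harry", bounds))
```

Only the range scheme keeps neighbouring keys on the same node, which is why it supports range queries as well as exact-match ones.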
23 Indexing
- Can be supported by special tables with rows of the form: <attribute, list of keys> pairs
  - Ex. <att-value, (doc-id:id1, doc-id:id2, doc-id:id10)>
  - Given an attribute value, returns all corresponding keys
  - These keys can in turn be used to access the corresponding rows in shards
- Complements sharding with secondary indices or inverted files to speed up attribute-based queries
- Can be partitioned
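An index table of this form can be sketched as a dictionary from attribute value to list of keys. The helper names and the sample rows below are hypothetical; the sketch ignores partitioning of the index itself.

```python
from collections import defaultdict
from typing import Any, Dict, List

Rows = Dict[str, Dict[str, Any]]


def build_index(rows: Rows, attribute: str) -> Dict[Any, List[str]]:
    """Build <attribute value, list of keys> pairs for one attribute."""
    index: Dict[Any, List[str]] = defaultdict(list)
    for key, value in rows.items():
        if attribute in value:
            index[value[attribute]].append(key)
    return dict(index)


def lookup(rows: Rows, index: Dict[Any, List[str]], att_value: Any) -> List[Dict[str, Any]]:
    """Use the index to get the keys, then fetch the corresponding rows."""
    return [rows[key] for key in index.get(att_value, [])]


docs: Rows = {
    "id1": {"city": "Paris", "title": "a"},
    "id2": {"city": "Paris", "title": "b"},
    "id10": {"city": "Montpellier", "title": "c"},
}
city_index = build_index(docs, "city")
print(city_index["Paris"])
```

In a sharded setting the keys returned by the index would then be routed to the nodes holding the corresponding shards.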
24 Replication and Failover
- Replication
  - The basis for fault tolerance and availability
  - Have several copies of each shard
- Failover
  - On a node failure, another node detects the failure and recovers the failed node's tasks
[Figure: a client, Node 1 and Node 2, with ping and connect1 links.]
25 Parallel Query Processing
1. Query parallelization
  - Produces an optimized parallel execution plan, with operators
  - Based on partitioning, replication, indexing
2. Parallel execution
  - Relies on parallel main memory algorithms for operators
  - Use of hash-based join algorithms
  - Adaptive degree of partitioning to deal with skew
[Figure: the query "Select ... from R, S where ... group by ..." is parallelized into Select (Sel.) operators on partitions R1 to R4, Join operators with partitions S1 to S4, and Group-by (Grb) operators.]
26 Parallel Hash Join Algorithm
- Both tables R and S are partitioned by hashing on the join attribute
- $R \bowtie S = \bigcup_{i=1}^{p} (R_i \bowtie S_i)$
[Figure: partitions R1, R2 and S1, S2 redistributed to node 1 and node 2.]
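The algorithm can be sketched as follows. This is a single-process simulation under assumptions: the p "nodes" are just list slots processed in a loop, rows are dictionaries, and all function names are illustrative. Each node would run `local_hash_join` on its own pair (Ri, Si) in parallel in a real system.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def partition(table: List[Row], attr: str, p: int) -> List[List[Row]]:
    """Send each row to partition hash(join attribute) mod p."""
    parts: List[List[Row]] = [[] for _ in range(p)]
    for row in table:
        parts[hash(row[attr]) % p].append(row)
    return parts


def local_hash_join(r_part: List[Row], s_part: List[Row], attr: str) -> List[Row]:
    """Build a hash table on Ri, then probe it with each row of Si."""
    build: Dict[Any, List[Row]] = defaultdict(list)
    for r in r_part:
        build[r[attr]].append(r)
    return [{**r, **s} for s in s_part for r in build.get(s[attr], [])]


def parallel_hash_join(R: List[Row], S: List[Row], attr: str, p: int) -> List[Row]:
    """R join S is the union of the p local joins Ri join Si."""
    result: List[Row] = []
    for r_i, s_i in zip(partition(R, attr, p), partition(S, attr, p)):
        result.extend(local_hash_join(r_i, s_i, attr))
    return result


R = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
S = [{"k": 1, "b": "u"}, {"k": 3, "b": "v"}, {"k": 1, "b": "w"}]
joined = parallel_hash_join(R, S, "k", p=2)
print(sorted(joined, key=lambda row: row["b"]))
```

Because both tables use the same hash function on the join attribute, matching rows always land in the same partition, so no local join needs data from another node.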
27 Parallel DBMS
- Generic: with full support of SQL, with user defined functions
  - Structured data, XML, multimedia, etc.
  - Automatic optimization and parallelization
- Transactional guarantees
  - Atomicity, Consistency, Isolation, Durability
  - Transactions make it easy to program complex updates
- Performance through
  - Data partitioning, indexing, caching
  - Sophisticated parallel algorithms, load balancing
- Two kinds
  - Row-based: Oracle, MySQL, MS SQL Server, IBM DB2,
28 Main products
Vendor | Product | Architecture | Remarks
EMC | GreenPlum | SN | Hybrid SQL/MapReduce, based on PostgreSQL
HP | Vertica | SN | Column-based
IBM | DB2 Pure Scale | SD |
IBM | DB2 Database Partitioning Feature (DPF) | SN |
Microsoft | SQL Server | SD |
Microsoft | SQL Server Parallel Data Warehouse | SN |
Oracle | Real Application Cluster | SD |
Oracle | Exadata Database machine | SD |
Oracle | MySQL | SN |
ParAccel | ParAccel Analytic Database | SN | Column-based
Teradata | Teradata Database | SN with Bynet | Unix and Windows
Teradata | Aster | SN | Hybrid SQL/MapReduce and row/column
Remarks for the IBM, Microsoft and Oracle rows, in slide order: AIX on SP; Linux on cluster; Windows only; Acquisition of Netezza; Portability; OSS on Linux cluster
29 Cloud Data Management
30 Cloud Data: problem and solution
- Cloud data
  - Can be very large (e.g. text-based or scientific applications), unstructured or semi-structured, and typically append-only (with rare updates)
- Cloud users and application developers
  - In very high numbers, with very diverse expertise but very little DBMS expertise
- Therefore, current cloud data management solutions trade consistency for scalability, simplicity and flexibility
  - New file systems: GFS, HDFS, ...
  - NoSQL systems: Amazon SimpleDB, Google Base, ...
31 Google File System (GFS)
- Used by many Google applications
  - Search engine, Bigtable, MapReduce, etc.
- The basis for Hadoop HDFS (Apache & Yahoo)
- Optimized for specific needs
  - Shared-nothing cluster of thousands of nodes, built from inexpensive hardware => node failure is the norm!
  - Very large files, of typically several GB, containing many objects such as web documents
  - Mostly read and append (random updates are rare)
    - Large reads of bulk data (e.g. 1 MB) and small random reads (e.g. 1 KB)
    - Append operations are also large and there may be many concurrent clients that append to the same file
32 Design Choices
- Traditional file system interface (create, open, read, write, close, and delete file)
  - Two additional operations: snapshot and record append
- Relaxed consistency, with atomic record append
  - No need for distributed lock management
  - Up to the application to use techniques such as checkpointing and writing self-validating records
- Single GFS master
  - Maintains file metadata such as namespace, access control information, and data placement information
  - Simple, lightly loaded, fault-tolerant
- Fast recovery and replication strategies
33 GFS Distributed Architecture
- Files are divided into fixed-size partitions, called chunks, of large size (i.e. 64 MB), each replicated at several nodes
[Figure: the application's GFS client gets the chunk location from the GFS master, then gets the chunk data from GFS chunk servers, each running on a Linux file system.]
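With fixed-size chunks, a client can work out which chunks a read touches from the byte offset alone, before asking the master where those chunks live. The sketch below only shows that arithmetic; the function names are illustrative, and it does not model the master, replication or any real GFS interface.

```python
from typing import List

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given on the slide


def chunk_index(offset: int) -> int:
    """Index of the chunk that contains the given byte offset of a file."""
    return offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> List[int]:
    """Indexes of all chunks touched by reading `length` bytes from `offset`."""
    if length <= 0:
        return []
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return list(range(first, last + 1))


# A 1 MB read that straddles the boundary between the first two chunks.
print(chunks_for_range(CHUNK_SIZE - 512 * 1024, 1024 * 1024))
```

Large chunks keep this list short for bulk reads, which limits the number of location requests sent to the single master.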
34 NoSQL Systems
35 NoSQL (Not Only SQL): definition
- Specific DBMS: for web-based data
  - Specialized data model
    - Key-value, table, document, graph
  - Trade relational DBMS properties
    - Full SQL, ACID transactions, data independence
  - For
    - Simplicity (schema, basic API)
    - Scalability and performance
    - Flexibility for the programmer (integration with programming language)
- NB: SQL is just a language and has nothing to
36 NoSQL Approaches
- Characterized by the data model, in increasing order of complexity:
  1. key-value: Amazon DynamoDB and SimpleDB
  2. big table: Google Bigtable
  3. document: 10gen MongoDB
  4. graph: Neo4J
- What about object DBMS or XML DBMS?...
  - Were around long before NoSQL
  - Sometimes presented as NoSQL
  - But not really scalable
37 Key-value store: DynamoDB
- The basis for many systems
  - Cassandra, Voldemort
- Simple (key, value) data model
  - Key = unique id
  - Value = a small object (< 1 Mbyte)
- Simple queries
  - Put (key, value)
  - Get (key)
- Replication and eventual consistency
  - If no more updates, the replicas get mutually consistent
- No security
  - Assumes the environment is secured (cloud)
38 DynamoDB consistent hashing
- Hash-based partitioning
  - The interval of hash values is treated as a ring
  - Ex. Node B is responsible for interval (A,B]
- Advantage: if a node fails, its successor takes over its data
  - No impact on other nodes
[Figure: a ring of nodes A to F; put(c,v) is routed using h(c).]
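The ring can be sketched in a few lines. This is a simplified illustration under assumptions: the `Ring` class, the MD5-based hash and the one-position-per-node layout are choices made here, with no virtual nodes or replication, and the ring is assumed to be non-empty.

```python
import bisect
import hashlib
from typing import List, Tuple


def h(name: str) -> int:
    """Map a key or a node name to a position on the ring."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes: List[str]) -> None:
        # Each node owns the interval (predecessor position, own position].
        self._points: List[Tuple[int, str]] = sorted((h(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        """First node at or after h(key), wrapping around the ring."""
        positions = [p for p, _ in self._points]
        i = bisect.bisect_left(positions, h(key)) % len(positions)
        return self._points[i][1]

    def remove(self, node: str) -> None:
        """Simulate a failure: the successor now covers the failed node's interval."""
        self._points = [(p, n) for p, n in self._points if n != node]


ring = Ring(["A", "B", "C", "D", "E", "F"])
keys = [f"key{i}" for i in range(200)]
before = {k: ring.owner(k) for k in keys}
ring.remove("B")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "keys changed owner")
```

Only keys that were owned by the removed node change owner, and they all move to the same successor, which is the "no impact on other nodes" property stated above.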
39 Google Bigtable
- Database storage system for a shared-nothing cluster
  - Uses GFS to store structured data, with fault-tolerance and availability
- Used by popular Google applications
  - Google Earth, Google Analytics, Google+, etc.
- The basis for popular Open Source implementations
  - Hadoop HBase on top of HDFS (Apache & Yahoo)
- Specific data model that combines aspects of row-store and column-store DBMS
  - Rows with multi-valued, timestamped attributes
  - A Bigtable is defined as a multidimensional map,
40 A Bigtable Row
[Figure: a row with row key "com.google.www" and three column families. Contents: "<html> ... </html>" at t1 and t5. Anchor: column keys inria.fr and uwaterloo.ca, with values "google.com" (t2), "Google" (t3) and "google.com" (t4). Language: "english" at t1.]
- Row key = row unique id
- Column family = a kind of multi-valued attribute
  - Set of columns (of the same type), each identified by a key
    - Column key = attribute value, but used as a name
  - Unit of access control and compression
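The row above can be read as entries of a sparse map from (row key, column, timestamp) to value. The sketch below assumes that reading; the `SparseTable` class, the "family:key" column naming and the integer timestamps 1 to 5 standing in for t1 to t5 are illustrative and are not Bigtable's actual API.

```python
from typing import Dict, Optional, Tuple

Cell = Tuple[str, str, int]  # (row key, "family:column key", timestamp)


class SparseTable:
    def __init__(self) -> None:
        self._cells: Dict[Cell, str] = {}

    def put(self, row: str, column: str, timestamp: int, value: str) -> None:
        """Add a new timestamped version of a cell."""
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str, timestamp: Optional[int] = None) -> Optional[str]:
        """Return the most recent version, optionally as of a given timestamp."""
        versions = [
            (t, v) for (r, c, t), v in self._cells.items()
            if r == row and c == column and (timestamp is None or t <= timestamp)
        ]
        return max(versions)[1] if versions else None


table = SparseTable()
row = "com.google.www"
table.put(row, "contents:", 1, "<html>v1</html>")
table.put(row, "contents:", 5, "<html>v5</html>")
table.put(row, "anchor:inria.fr", 2, "google.com")
table.put(row, "language:", 1, "english")

print(table.get(row, "contents:"))               # latest version
print(table.get(row, "contents:", timestamp=3))  # version visible at t3
```

Because the column key is part of the cell address, a row can carry any number of columns within a family without a schema change.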
41 Bigtable DDL and DML
- No such thing as SQL
- Basic API for defining and manipulating tables, within a programming language such as C++
  - No impedance mismatch
  - Various operators to write and update values, and to iterate over subsets of data, produced by a scan operator
  - Various ways to restrict the rows, columns and timestamps produced by a scan, as in relational select, but no complex operator such as join or union
  - Transactional atomicity for single row updates only
42 Dynamic Range Partitioning
- Range partitioning of a table on the row key
  - Tablet = a partition (shard) corresponding to a row range
  - Partitioning is dynamic, starting with one tablet (the entire table range) which is subsequently split into multiple tablets as the table grows
  - Metadata table itself partitioned in metadata tablets, with a single root tablet stored at a master server, similar to GFS's master
- Implementation techniques
  - Compression of column families
  - Grouping of column families with high locality of access
  - Aggressive caching of metadata information by clients
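Dynamic splitting can be sketched with a sorted list of split keys. This is a toy model under assumptions: the class name, the row-count threshold and the median split point are choices made for illustration, and the metadata tablets and master server are not modeled.

```python
import bisect
from typing import Dict, List, Optional


class RangePartitionedTable:
    def __init__(self, max_rows_per_tablet: int = 4) -> None:
        self.max_rows = max_rows_per_tablet
        # Tablet i holds row keys k with split_keys[i-1] <= k < split_keys[i].
        self.split_keys: List[str] = []
        self.tablets: List[Dict[str, str]] = [{}]  # starts as one tablet

    def _tablet_for(self, row_key: str) -> int:
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key: str, value: str) -> None:
        i = self._tablet_for(row_key)
        self.tablets[i][row_key] = value
        if len(self.tablets[i]) > self.max_rows:
            self._split(i)

    def get(self, row_key: str) -> Optional[str]:
        return self.tablets[self._tablet_for(row_key)].get(row_key)

    def _split(self, i: int) -> None:
        """Split tablet i at its median row key into two adjacent tablets."""
        keys = sorted(self.tablets[i])
        mid = keys[len(keys) // 2]
        left = {k: self.tablets[i][k] for k in keys if k < mid}
        right = {k: self.tablets[i][k] for k in keys if k >= mid}
        self.tablets[i:i + 1] = [left, right]
        self.split_keys.insert(i, mid)


t = RangePartitionedTable(max_rows_per_tablet=4)
for n in range(10):
    t.put(f"row{n:02d}", f"value{n}")
print(len(t.tablets), t.split_keys)
```

Lookups stay a binary search over the split keys however many tablets exist, which is the role the metadata tablets play in the real system.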
43 Document DBMS: MongoDB
- Objectives: performance and scalability as in (key, value) stores, but with typed values
  - A document is a collection of (key, typed value) with a unique key (generated by MongoDB)
- Data model and query language based on the BSON (Binary JSON) format
- No schema, no join, no complex transaction
- Sharding, replication and failover
- Secondary indices
- Integration with MapReduce
44 MongoDB document (post) example
{ _id : ObjectId("4e77bb3b8a3e f7a"),   // generated by MongoDB
  when : Date(" T02:10:11.3Z"),
  author : "alex",
  title : "No Free Lunch",
  text : "This is the text of the post. It could be very long.",
  tags : [ "business", "ramblings" ],   // array
  votes : 5,
  voters : [ "jane", "joe", "spencer", "phyllis", "li" ],   // array
  comments : [   // array of documents
    { who : "jane", when : Date(" T04:00:10.112Z"), comment : "I agree." },
    { who : "meghan", when : Date(" T14:36:06.958Z"), comment : "You must be joking. etc etc..." }
  ]
}
45 MongoDB query language
- Expression of the form db.collection.function(JSON expression)
- Update examples
  db.posts.insert({author: "alex", title: "No Free Lunch"})
  db.posts.update({author: "alex"}, {$set: {age: 30}})
  db.posts.update({author: "alex"}, {$push: {tags: "music"}})
- Select examples
  db.posts.find({author: "alex"})            All posts from Alex
  db.posts.find({"comments.who": "jane"})    All posts commented by Jane
46 Graph DBMS: Neo4J
- Applications with very big graphs
  - Billions of nodes and links
  - Social networks, hypertext documents, linked open data, etc.
- Direct support of graphs
  - Data model, API, query language
  - Implemented by linked lists on disk
  - Optimized for graph processing
  - Transactions
- Implemented on SN cluster
  - Asymmetric replication
  - Graph partitioning
47 Neo4J data model
- Nodes
- Links between nodes
- Properties on nodes and links
[Figure: example graph with nodes Bob (35), Mary (29), Group and Tennis, and labels Eng., friend, member-of, members and likes.]
Ex. of Neo transaction
  NeoService neo = // factory
  Transaction tx = neo.beginTx();
  Node n1 = neo.createNode();
  n1.setProperty("name", "Bob");
  n1.setProperty("age", 35);
  Node n2 = neo.createNode();
  n2.setProperty("name", "Mary");
  n2.setProperty("age", 29);
48 Neo4J - languages
- Java API (navigational)
- Cypher query language
  - Queries and updates with graph traversals
- Support of SPARQL for RDF data
- Ex. of Cypher query that returns the (indirect) friends of Bob whose name starts with "M"
  START bob=node:node_auto_index(name = 'Bob')
  MATCH bob-[:friend]->()-[:friend]->follower
  WHERE follower.name =~ 'M.*'
  RETURN bob, follower.name
49 Main NoSQL Systems
Vendor | Product | Category | Comments
Amazon | Dynamo, SimpleDB | KV | Proprietary
Apache | Cassandra | KV | Open source, Origin Facebook
Apache | Accumulo | Big table | Open source, Origin NSA
Armadillo S.A. | Armadillo | Document | Proprietary, security
Google | Bigtable | Big table | Proprietary, patents
Google | Pregel | Graph | Proprietary, patents
Hadoop | HBase | Big table | Open source, Orig. Yahoo
LinkedIn | Voldemort | KV | Open source
10gen | MongoDB | Document | Open source
Neo4J.org | Neo4J | Graph | Open source
Sparsity | DEX | Graph | Proprietary, Orig. UPC, Barcelona
Ubuntu | CouchDB | Document | Open source
50 NoSQL versus Relational
- The techniques are not new
  - Database machines, SN cluster
  - But very large scale
- Pros NoSQL
  - Scalability, performance
  - APIs suitable for programming
- Pros Relational
  - Strong consistency, transactions
  - Standard SQL (many tools)
- Towards NoSQL/Relational hybrids?
  - Google F1: combines the scalability, fault tolerance, transparent sharding, and cost benefits so far available only in NoSQL systems with the usability, familiarity,
51 MapReduce
52 MapReduce Framework
- Invented by Google
  - Proprietary (and protected by software patents)
  - But popular Open Source version by Hadoop (Apache & Yahoo)
- For data analysis of very large data sets
  - Highly dynamic, irregular, schemaless, etc.
  - SQL or XQuery too heavy
- New, simple parallel programming model
  - Data structured as (key, value) pairs
    - E.g. (doc-id, content), (word, count), etc.
  - Functional programming style with two functions to be given:
    - Map(key, value) -> ikey, ivalue
53 MapReduce Typical Usages
- Counting the numbers of some words in a set of docs
- Distributed grep: text pattern matching
- Counting URL access frequencies in Web logs
- Computing a reverse Web-link graph
- Computing the term-vectors (summarizing the most important words) in a set of documents
- Computing an inverted index for a set of documents
- Distributed sorting
54 MapReduce Processing
[Figure: the input data set is divided into splits (Split 0, Split 1, Split 2, ...). Map phase: each Map task emits pairs such as (k1,v) and (k2,v). Shuffle phase: pairs are grouped by key into (k1,(v,v,v)) and (k2,(v,v,v,v)). Reduce phase: each Reduce task processes one group and writes to the output data set.]
- Simple programming model
- Key-value data storage
- Hash-based data partitioning
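The three phases can be sketched as a small single-process function, using the word count usage as the example. The `map_reduce` driver and the two user functions are illustrative names chosen here; a real framework would run the Map and Reduce tasks on different nodes.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple


def map_reduce(
    inputs: Iterable[Tuple[Any, Any]],
    map_fn: Callable[[Any, Any], Iterable[Tuple[Any, Any]]],
    reduce_fn: Callable[[Any, List[Any]], Any],
) -> Dict[Any, Any]:
    # Map phase: each (key, value) yields intermediate (ikey, ivalue) pairs.
    # Shuffle phase: intermediate pairs are grouped by ikey.
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Reduce phase: one call per group (ikey, list of ivalues).
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}


def wc_map(doc_id: str, content: str):
    """Emit (word, 1) for every word in the document."""
    for word in content.split():
        yield (word, 1)


def wc_reduce(word: str, counts: List[int]) -> int:
    """Sum the partial counts of one word."""
    return sum(counts)


docs = [("d1", "a b a"), ("d2", "b c")]
print(map_reduce(docs, wc_map, wc_reduce))
```

Swapping in other Map and Reduce functions covers the other typical usages without touching the driver, which is the point of the programming model.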
55 MapReduce Example
EMP (ENAME, TITLE, CITY)
Query: for each city, return the number of employees whose name is "Smith"
  SELECT CITY, COUNT(*)
  FROM EMP
  WHERE ENAME LIKE "%Smith"
  GROUP BY CITY
With MapReduce
  Map (Input: (TID,emp), Output: (CITY,1))
    if emp.ename like "%Smith" return (CITY,1)
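The query can be sketched end to end in Python. The Map function follows the slide; the Reduce function (summing the 1s per city), the in-memory shuffle and the sample EMP rows are assumptions added here to make the sketch complete.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Emp = Dict[str, str]


def emp_map(tid: str, emp: Emp) -> Iterable[Tuple[str, int]]:
    """Emit (CITY, 1) when ENAME matches LIKE "%Smith", i.e. ends with Smith."""
    if emp["ENAME"].endswith("Smith"):
        yield (emp["CITY"], 1)


def emp_reduce(city: str, ones: List[int]) -> int:
    """Assumed Reduce step: count the employees emitted for one city."""
    return sum(ones)


def count_smiths_by_city(table: Iterable[Tuple[str, Emp]]) -> Dict[str, int]:
    groups: Dict[str, List[int]] = defaultdict(list)
    for tid, emp in table:                  # Map phase
        for city, one in emp_map(tid, emp):
            groups[city].append(one)        # Shuffle: group by CITY
    return {city: emp_reduce(city, ones) for city, ones in groups.items()}


EMP = [
    ("t1", {"ENAME": "Smith", "TITLE": "Engineer", "CITY": "Paris"}),
    ("t2", {"ENAME": "Jones", "TITLE": "Analyst", "CITY": "Paris"}),
    ("t3", {"ENAME": "Smith", "TITLE": "Manager", "CITY": "Montpellier"}),
    ("t4", {"ENAME": "Smith", "TITLE": "Analyst", "CITY": "Paris"}),
]
print(count_smiths_by_city(EMP))
```

The selection (WHERE) is done in Map, the grouping (GROUP BY) by the shuffle, and the aggregate (COUNT) in Reduce.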
56 Fault-tolerance
- Fault-tolerance is fine-grain and well suited for large jobs
- Input and output data are stored in GFS
  - Already provides high fault-tolerance
- All intermediate data is written to disk
  - Helps checkpointing Map operations, and thus provides tolerance to soft failures
- If one Map node or Reduce node fails during execution (hard failure)
  - The tasks are made eligible by the master for scheduling onto other nodes
  - It may also be necessary to re-execute completed Map tasks, since the input data on the failed node disk is
57 MapReduce vs Parallel DBMS
- [Pavlo et al. SIGMOD09]: Hadoop MapReduce vs two parallel DBMS, one row-store DBMS and one column-store DBMS
  - Benchmark queries: a grep query, an aggregation query with a group by clause on a Web log, and a complex join of two tables with aggregation and filtering
  - Once the data has been loaded, the DBMS are significantly faster, but loading is much more time-consuming for the DBMS
  - Suggest that MapReduce is less efficient than DBMS because it performs repetitive format parsing and does not exploit pipelining and indices
- [Dean and Ghemawat, CACM10]
  - Make the difference between the MapReduce model and
58 MapReduce Performance
- Much room for improvement (see MapReduce workshops)
- Map phase
  - Minimize I/O cost using indices (Hadoop++)
- Shuffle phase
  - Minimize data transfers by partitioning data on the same intermediate key
  - Current work in Zenith
- Reduce phase
  - Exploit fine-grain parallelism of Reduce tasks
  - Current work in Zenith
59 Conclusion
60 Research Directions
- Basic techniques are not new
  - Parallel database machines, shared-nothing cluster
  - Data partitioning, replication, indexing, parallel hash join, etc.
  - But need to scale up
- Much room for research and innovation
  - MapReduce extensions
  - Dynamic workload-based partitioning
  - Data-oriented scientific workflows
  - Uncertain data mining
  - Content-based IR
  - Data privacy
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
NoSQL Systems for Big Data Management
NoSQL Systems for Big Data Management Venkat N Gudivada East Carolina University Greenville, North Carolina USA Venkat Gudivada NoSQL Systems for Big Data Management 1/28 Outline 1 An Overview of NoSQL
Zenith: a hybrid P2P/cloud for Big Scientific Data
Zenith: a hybrid P2P/cloud for Big Scientific Data Esther Pacitti & Patrick Valduriez INRIA & LIRMM Montpellier, France 1 1 Basis for this Talk Zenith: Scientific Data Management on a Large Scale. Esther
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world
Analytics March 2015 White paper Why NoSQL? Your database options in the new non-relational world 2 Why NoSQL? Contents 2 New types of apps are generating new types of data 2 A brief history of NoSQL 3
Making Sense ofnosql A GUIDE FOR MANAGERS AND THE REST OF US DAN MCCREARY MANNING ANN KELLY. Shelter Island
Making Sense ofnosql A GUIDE FOR MANAGERS AND THE REST OF US DAN MCCREARY ANN KELLY II MANNING Shelter Island contents foreword preface xvii xix acknowledgments xxi about this book xxii Part 1 Introduction
Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05
Introduction to NoSQL Databases Tore Risch Information Technology Uppsala University 2013-03-05 UDBL Tore Risch Uppsala University, Sweden Evolution of DBMS technology Distributed databases SQL 1960 1970
Big Data Technologies Compared June 2014
Big Data Technologies Compared June 2014 Agenda What is Big Data Big Data Technology Comparison Summary Other Big Data Technologies Questions 2 What is Big Data by Example The SKA Telescope is a new development
Big Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
So What s the Big Deal?
So What s the Big Deal? Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
Infrastructures for big data
Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)
Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010
Hadoop s Entry into the Traditional Analytical DBMS Market Daniel Abadi Yale University August 3 rd, 2010 Data, Data, Everywhere Data explosion Web 2.0 more user data More devices that sense data More
CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #13: NoSQL and MapReduce
CS 4604: Introduc0on to Database Management Systems B. Aditya Prakash Lecture #13: NoSQL and MapReduce Announcements HW4 is out You have to use the PGSQL server START EARLY!! We can not help if everyone
Performance and Scalability Overview
Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and
A Comparison of Approaches to Large-Scale Data Analysis
A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce
NoSQL. Thomas Neumann 1 / 22
NoSQL Thomas Neumann 1 / 22 What are NoSQL databases? hard to say more a theme than a well defined thing Usually some or all of the following: no SQL interface no relational model / no schema no joins,
Cloud Scale Distributed Data Storage. Jürmo Mehine
Cloud Scale Distributed Data Storage Jürmo Mehine 2014 Outline Background Relational model Database scaling Keys, values and aggregates The NoSQL landscape Non-relational data models Key-value Document-oriented
Pla7orms for Big Data Management and Analysis. Michael J. Carey Informa(on Systems Group UCI CS Department
Pla7orms for Big Data Management and Analysis Michael J. Carey Informa(on Systems Group UCI CS Department Outline Big Data Pla6orm Space The Big Data Era Brief History of Data Pla6orms Dominant Pla6orms
Introduction to Apache Cassandra
Introduction to Apache Cassandra White Paper BY DATASTAX CORPORATION JULY 2013 1 Table of Contents Abstract 3 Introduction 3 Built by Necessity 3 The Architecture of Cassandra 4 Distributing and Replicating
Preparing Your Data For Cloud
Preparing Your Data For Cloud Narinder Kumar Inphina Technologies 1 Agenda Relational DBMS's : Pros & Cons Non-Relational DBMS's : Pros & Cons Types of Non-Relational DBMS's Current Market State Applicability
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics
MongoDB in the NoSQL and SQL world. Horst Rechner [email protected] Berlin, 2012-05-15
MongoDB in the NoSQL and SQL world. Horst Rechner [email protected] Berlin, 2012-05-15 1 MongoDB in the NoSQL and SQL world. NoSQL What? Why? - How? Say goodbye to ACID, hello BASE You
Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014
Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014 Scale, Security, Schema Scale to scale 1 - (vt) to change the size of something let s scale the
A survey of big data architectures for handling massive data
CSIT 6910 Independent Project A survey of big data architectures for handling massive data Jordy Domingos - [email protected] Supervisor : Dr David Rossiter Content Table 1 - Introduction a - Context
Hypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
Apache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
Data Management in the Cloud
Data Management in the Cloud Ryan Stern [email protected] : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
Scalable Architecture on Amazon AWS Cloud
Scalable Architecture on Amazon AWS Cloud Kalpak Shah Founder & CEO, Clogeny Technologies [email protected] 1 * http://www.rightscale.com/products/cloud-computing-uses/scalable-website.php 2 Architect
Advanced Data Management Technologies
ADMT 2014/15 Unit 15 J. Gamper 1/44 Advanced Data Management Technologies Unit 15 Introduction to NoSQL J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE ADMT 2014/15 Unit 15
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
Big Data Explained. An introduction to Big Data Science.
Big Data Explained An introduction to Big Data Science. 1 Presentation Agenda What is Big Data Why learn Big Data Who is it for How to start learning Big Data When to learn it Objective and Benefits of
wow CPSC350 relational schemas table normalization practical use of relational algebraic operators tuple relational calculus and their expression in a declarative query language relational schemas CPSC350
Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
Hadoop for MySQL DBAs + 1 About me Sarah Sproehnle, Director of Educational Services @ Cloudera Spent 5 years at MySQL At Cloudera for the past 2 years [email protected] 2 What is Hadoop? An open-source
Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012
Big Data Buzzwords From A to Z By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012 Big Data Buzzwords Big data is one of the, well, biggest trends in IT today, and it has spawned a whole new generation
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
