bigdata: Flexible, Reliable, Affordable Web-scale Computing


Slide 1: bigdata. Flexible, Reliable, Affordable Web-scale Computing.

Slide 2: Background
- Requirement: fast analytic access to massive, heterogeneous data.
- Traditional approaches: relational databases, supercomputers, business intelligence tools, semantic web or graph-based stores.
- New approach: an evolution led by internet companies; linear scale-out on commodity hardware; higher concurrency and uptime at lower cost.

Slide 3: The Information Factories
- Google: massively concurrent reads and writes with atomic row updates and petabytes of data. The basic data model is { application key, column name, timestamp } : { value }. (A sketch of this model follows below.)
- Yahoo/Amazon: cloud computing and functional programming to distribute processing over clusters (Apache Hadoop, Amazon Elastic Compute Cloud).
- eBay: data partitioned horizontally and into transactional and non-transactional regimes.
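The { application key, column name, timestamp } : { value } model can be pictured as a sorted map keyed on that triple, with the newest timestamp winning on read. The following is a minimal, self-contained Java sketch of that idea; the class and method names are illustrative and are not Google's or bigdata's API.

import java.util.Map;
import java.util.TreeMap;

// Illustrative sparse row store: each cell is addressed by
// { application key (row), column name, timestamp } and holds a value.
public class SparseRowStore {

    // Cell address; ordered by row, then column, then most recent timestamp first.
    record CellKey(String row, String column, long timestamp) implements Comparable<CellKey> {
        public int compareTo(CellKey o) {
            int c = row.compareTo(o.row);
            if (c != 0) return c;
            c = column.compareTo(o.column);
            if (c != 0) return c;
            return Long.compare(o.timestamp, timestamp); // newest first
        }
    }

    private final TreeMap<CellKey, byte[]> cells = new TreeMap<>();

    // Atomic at the granularity of a single row update in the real systems;
    // here we simply insert one timestamped cell.
    public synchronized void put(String row, String column, long timestamp, byte[] value) {
        cells.put(new CellKey(row, column, timestamp), value);
    }

    // Returns the most recent value for { row, column }, or null if absent.
    public synchronized byte[] get(String row, String column) {
        Map.Entry<CellKey, byte[]> e =
                cells.ceilingEntry(new CellKey(row, column, Long.MAX_VALUE));
        return (e != null && e.getKey().row().equals(row)
                && e.getKey().column().equals(column)) ? e.getValue() : null;
    }

    public static void main(String[] args) {
        SparseRowStore store = new SparseRowStore();
        store.put("com.example/page1", "anchor:text", System.currentTimeMillis(), "hello".getBytes());
        System.out.println(new String(store.get("com.example/page1", "anchor:text")));
    }
}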

Slide 4: Use Cases
- Online services: search (Google), advertising (Google, AOL), recommendation systems (Amazon), auctions (eBay), logistics (Wal-Mart, FedEx).
- Federal government: OSINT, predict and interdict, information fusion.
- Bond trade analysis.
- Bioinformatics.

Slide 5: OSINT
- Harvest OSINT to support the IC mission: many source languages; directed and broad harvests; long-standing and ad hoc analyses.
- Entities of interest are automatically tagged; knowledge is captured and republished using a semantic web database aligned to multiple ontologies.
- Challenges: scaling deployment from department to enterprise; continued data growth and an increasing growth rate; federated, policy-based information sharing at scale.

Slide 6: Connect the Actors
- Multi-agency database of names, events, and relationships.
- Data quality issues (alternative spellings, timelines, provenance).
- Policy-based data sharing.
- Key computation: preempt actors by linking entities and events to predict and interdict enemy operations.

Slide 7: Semantic Web Databases
- Facilitate these dynamic demands, but not at the necessary scale.
- The best solutions offer about 15k triples per second; at that rate, a billion triples takes roughly 20 hours to load.
- Scale-up architecture.

Slide 8: Desired Solution Characteristics
- Cost-effective, scalable access to and analysis of massive data.
- Petabyte scale with linear scale-out.
- Support for rapid innovation; low management costs.
- Very high parallelism, concurrency, resource utilization, and aggregate IO bandwidth.

Slide 9: bigdata Architecture

Slide 10: Architecture Layer Cake
- Services: search and indexing, events, data analysis, pub/sub, SLA reports.
- Data models: GOM (OODBMS), RDFS, sparse row store.
- bigdata storage and computing fabric: transaction manager, map/reduce, metadata service, data service.

Slide 11: Distributed Service Architecture
- Centralized services: transaction manager, metadata service.
- Distributed services: data services, map/reduce services.
- Grid-enabled cluster; service discovery using Jini; data discovery using the metadata service.
- Automatic failover for centralized services.

Slide 12: Service Discovery
1. Data services discover registrars and advertise themselves.
2. Metadata services discover registrars, advertise themselves, and monitor data service leave/join.
3. Clients discover registrars, look up the metadata service, and use it to locate data services.
4. Clients talk directly to data services.
(A toy sketch of these steps follows below.)
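The four steps above can be modelled with a toy in-process registry. The sketch below is only an illustration of the flow; the names (Registrar, ServiceItem) are hypothetical and this is not the Jini API that bigdata actually uses.

import java.util.ArrayList;
import java.util.List;

// Toy in-process registrar modelling the discovery steps on this slide.
public class DiscoveryDemo {

    record ServiceItem(String type, String endpoint) {}

    static class Registrar {
        private final List<ServiceItem> items = new ArrayList<>();
        // Steps 1 and 2: services advertise themselves.
        synchronized void advertise(ServiceItem item) { items.add(item); }
        // Step 3: clients look up services by type (e.g. the metadata service).
        synchronized List<ServiceItem> lookup(String type) {
            return items.stream().filter(i -> i.type().equals(type)).toList();
        }
    }

    public static void main(String[] args) {
        Registrar registrar = new Registrar();
        // 1. Data services discover the registrar and advertise themselves.
        registrar.advertise(new ServiceItem("DataService", "host-a:9090"));
        registrar.advertise(new ServiceItem("DataService", "host-b:9090"));
        // 2. The metadata service advertises itself (and would monitor join/leave).
        registrar.advertise(new ServiceItem("MetadataService", "host-m:9090"));
        // 3. A client looks up the metadata service ...
        ServiceItem mds = registrar.lookup("MetadataService").get(0);
        // ... and would then ask it to locate data services.
        // 4. The client talks directly to the located data services.
        System.out.println("Metadata service at " + mds.endpoint());
        System.out.println("Data services: " + registrar.lookup("DataService"));
    }
}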

Slide 13: Dynamic Data Layer
- [Diagram: clients and data services, with writes and reads flowing through journals that overflow into index segments.]
- Append-only journals and read-optimized index segments are the basic building blocks.

Slide 14: Data Service API (low level)
- Batch B+Tree operations: insert, contains, remove, lookup.
- Range query: fast key-range scans with optional filters.
- Submit job: run a procedure in the local process; procedures are automatically mapped across the data.
(An interface sketch follows below.)
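One plausible Java rendering of such a low-level API is sketched below; the interface, method names, and signatures are hypothetical, not bigdata's actual service interface.

import java.io.Serializable;
import java.util.Iterator;
import java.util.function.Predicate;

// Illustrative low-level data service API: batch B+Tree operations,
// key-range scans with optional filters, and mapped procedures.
public interface DataService {

    // Batch B+Tree operations over an index partition.
    void insert(String indexName, byte[][] keys, byte[][] values);
    boolean[] contains(String indexName, byte[][] keys);
    byte[][] lookup(String indexName, byte[][] keys);
    void remove(String indexName, byte[][] keys);

    // Fast key-range scan; the optional filter is applied at the data service.
    Iterator<byte[]> rangeScan(String indexName, byte[] fromKey, byte[] toKey,
                               Predicate<byte[]> filter);

    // Submit a procedure to run in the local process next to the data;
    // the framework maps the procedure across all partitions of the index.
    <T extends Serializable> T submit(String indexName, IndexProcedure<T> proc);

    interface IndexProcedure<T extends Serializable> extends Serializable {
        T apply(Iterator<byte[]> partitionData);
    }
}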

Slide 15: Data Service Architecture
- The journal is ~200M per host; writes are pipelined for data redundancy.
- On overflow: open a new journal for new writes and migrate the buffered writes onto index segments.
- Periodically: perform a compacting merge of index segments, release old journals and segments, and split/join index segments to maintain the target size per segment.
(A sketch of the overflow cycle follows below.)
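A minimal sketch of the overflow cycle described above, assuming simple in-memory stand-ins for the journal and the index segments; all names and the threshold handling are illustrative rather than bigdata's implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative overflow cycle for a data service: writes go to an append-style
// journal; on overflow a new journal is opened and the old one is migrated
// into read-optimized index segments, which are periodically compacted.
public class OverflowDemo {

    static final long JOURNAL_CAPACITY = 200L * 1024 * 1024; // ~200M per host

    static class Journal {
        final TreeMap<String, byte[]> writes = new TreeMap<>();
        long bytesWritten = 0;
        void write(String key, byte[] value) {
            writes.put(key, value);
            bytesWritten += key.length() + value.length;
        }
        boolean shouldOverflow() { return bytesWritten >= JOURNAL_CAPACITY; }
    }

    // A read-optimized, immutable index segment built from journal writes.
    record IndexSegment(TreeMap<String, byte[]> data) {}

    Journal liveJournal = new Journal();
    final List<IndexSegment> segments = new ArrayList<>();

    void write(String key, byte[] value) {
        liveJournal.write(key, value);
        if (liveJournal.shouldOverflow()) overflow();
    }

    // On overflow: new journal for new writes, migrate old writes onto a segment.
    void overflow() {
        Journal old = liveJournal;
        liveJournal = new Journal();
        segments.add(new IndexSegment(new TreeMap<>(old.writes)));
        if (segments.size() > 4) compactingMerge();
    }

    // Periodic compacting merge: fold all segments into one and release the rest.
    void compactingMerge() {
        TreeMap<String, byte[]> merged = new TreeMap<>();
        for (IndexSegment s : segments) merged.putAll(s.data());
        segments.clear();
        segments.add(new IndexSegment(merged));
        // A real system would also split/join segments here to hold the target size.
    }

    public static void main(String[] args) {
        OverflowDemo ds = new OverflowDemo();
        ds.write("key1", new byte[16]);
        System.out.println("segments=" + ds.segments.size());
    }
}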

Slide 16: Metadata Service
- Index management: add, drop, provision.
- Index partition management: locate, split, join, compact.
- Resource reclamation.

Slide 17: Metadata Service Architecture
- One metadata index per scale-out index.
- The metadata index maps application keys onto data service locators for index partitions.
- Clients query the metadata service for partition locations, then go directly to the data services for read and write operations. (A lookup sketch follows below.)
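A minimal sketch of the locator lookup, assuming a sorted map from partition separator keys to data service endpoints; the types are hypothetical, not bigdata's metadata service API.

import java.util.TreeMap;

// Illustrative metadata index: maps the first application key of each index
// partition onto a data service locator; clients resolve a key once and then
// talk to that data service directly.
public class MetadataIndexDemo {

    record PartitionLocator(String partitionId, String dataServiceEndpoint) {}

    // Keyed by the left separator key of each partition.
    private final TreeMap<String, PartitionLocator> locators = new TreeMap<>();

    void register(String separatorKey, PartitionLocator locator) {
        locators.put(separatorKey, locator);
    }

    // The partition responsible for a key is the one with the greatest
    // separator key <= the application key.
    PartitionLocator locate(String applicationKey) {
        var e = locators.floorEntry(applicationKey);
        if (e == null) throw new IllegalStateException("no partition for " + applicationKey);
        return e.getValue();
    }

    public static void main(String[] args) {
        MetadataIndexDemo mdi = new MetadataIndexDemo();
        mdi.register("", new PartitionLocator("p0", "host-a:9090"));
        mdi.register("m", new PartitionLocator("p1", "host-b:9090"));
        // The client resolves the key once, then reads/writes against host-b directly.
        System.out.println(mdi.locate("mike"));
    }
}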

Slide 18: Managed Index Partitions
- An index begins with just one partition.
- Index partitions are split as they grow, and new partitions are distributed across the grid.
- CPU, RAM, and IO bandwidth grow with index size.

Slide 19: Metadata Addressing
- L0 alone can address 16 terabytes; L1 can address 8 exabytes per index.
- L0: a 128M metadata partition with 256-byte records.
- L1: 128M metadata partitions with 1024-byte records.
- Data partitions (p0 ... pn): 128M per application index partition.

Slide 20: bigdata Map/Reduce Processing on Distributed File Systems

Slide 21: Map/Reduce
- A distributed computing paradigm. The big win: it is easy to write robust jobs that scale.
- Lots of uses: crawlers, indexers, reverse link graphs, etc.
- Implementations: Google, Hadoop, bigdata.
- Map: distribute code to the data, using local IO. Reduce: the aggregation phase collects the intermediate outputs.
- A flexible basis for scalable enterprise computing services.
(A minimal word-count illustration follows below.)
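The canonical illustration is word count. The single-process Java sketch below only demonstrates the two phases (map emits key/value pairs near the data, reduce aggregates by key) and is not tied to any particular framework's API.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-process word count structured as map and reduce phases; a real job
// would run the map over many partitions in parallel and shuffle the
// intermediate outputs to the reducers.
public class WordCount {

    // Map: for each input record, emit (word, 1) pairs.
    static void map(String line, Map<String, List<Integer>> intermediate) {
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                intermediate.computeIfAbsent(word, w -> new ArrayList<>()).add(1);
            }
        }
    }

    // Reduce: for each key, aggregate the list of intermediate values.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data big wins", "data scales out");
        Map<String, List<Integer>> intermediate = new HashMap<>();
        input.forEach(line -> map(line, intermediate));           // map phase
        intermediate.forEach((word, counts) ->                    // reduce phase
                System.out.println(word + " = " + reduce(word, counts)));
    }
}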

Slide 22: Distributed File System
- Scale-out storage with block-structured files (64M block size); partial blocks are stored compactly.
- Applications manage logical records but enforce block boundaries.
- Atomic operations on file blocks: read, write, append, delete.
(An interface sketch follows below.)
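A hypothetical client interface for such a block-structured file system might look like the sketch below; the names and signatures are illustrative only.

import java.util.List;

// Illustrative client interface for a block-structured distributed file
// system with atomic operations on 64M blocks.
public interface BlockFileSystem {

    long BLOCK_SIZE = 64L * 1024 * 1024; // 64M blocks; partial blocks stored compactly

    // Atomic block operations.
    byte[] readBlock(String file, long blockId);
    void writeBlock(String file, long blockId, byte[] data);  // data.length <= BLOCK_SIZE
    long appendBlock(String file, byte[] data);               // returns the new block id
    void deleteBlock(String file, long blockId);

    // Applications manage their own logical records but must not let a
    // record span a block boundary.
    List<Long> blocks(String file);
}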

Slide 23: Use Case: Telescopic to Microscopic Analysis
- Begin with broad queries; initial questions must be broad to bring in all relevant data.
- Ingest resources related to those queries; each query might select 10,000 documents.
- Sort, sift, and filter, looking for key entities and new relationships; irrelevant data is filtered or discarded to focus on the core problem.

Slide 24: Scale-out Entity Extraction
- [Diagram: a query drives control flow through the map/reduce services (Search, Harvest, Extract); data flows through the distributed file system (URLs, docs, RDF/XML).]
1. Search web or intranet services, noting URLs of interest.
2. Harvest downloads the documents.
3. Extract the entities of interest.

Slide 25: bigdata Federation and Semantic Alignment

Slide 26: Semantic Web Technology
- Has been applied to federate IC data; well suited to the problems of federation and semantic alignment.
- Dynamic, declarative mapping of classes, properties, and instances to one another: owl:equivalentClass, owl:equivalentProperty, owl:sameAs.

Slide 27: RDF Data Model: Resources
- URIs.
- Literals: plain literals (e.g. "bigdata"); language-code literals (e.g. "guten tag" with language code de); datatype literals (e.g. "12.0" typed as xsd:double); XML literals.
- Blank nodes: anonymous resources that can be used to construct containers.

Slide 28: RDF Data Model: Statements
- The general form is a statement or assertion: { Subject, Predicate, Object }.
- Examples: x:mike rdf:type x:terrorist . and x:mike x:name "Mike" .
- There are constraints on the types of terms that may appear in each position of a statement.
- The model theory licenses entailments (aka inferences).
(A minimal statement representation is sketched below.)
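A minimal, self-contained representation of such statements is sketched below; it is not an RDF library, and the term handling is deliberately simplified (every term is a string).

import java.util.List;

// Minimal statement representation: { Subject, Predicate, Object }.
// In RDF proper, subjects are URIs or blank nodes, predicates are URIs,
// and objects may be URIs, blank nodes, or literals.
public class Statements {

    record Statement(String subject, String predicate, String object) {}

    public static void main(String[] args) {
        List<Statement> statements = List.of(
                new Statement("x:mike", "rdf:type", "x:terrorist"),
                new Statement("x:mike", "x:name", "\"Mike\""));
        statements.forEach(s ->
                System.out.println(s.subject() + " " + s.predicate() + " " + s.object() + " ."));
    }
}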

Slide 29: Semantic Alignment with RDFS
- Two schemas for the same problem: in (A), x:terrorist rdfs:subClassOf x:person and x:name has rdfs:domain x:person; in (B), y:terroristActor rdfs:subClassOf y:individual and y:fullName has rdfs:domain y:individual.
- Sample instance data: (A) x:mike rdf:type x:terrorist, x:mike x:name "Mike"; (B) y:michael rdf:type y:terroristActor, y:michael y:fullName "Michael".
- Inferred properties: x:mike rdf:type x:person; y:michael rdf:type y:individual.

Slide 30: Mapping Ontologies Together
- The same two schemas, (A) and (B), as on the previous slide.
- Assertions that map (A) and (B) together: x:terrorist owl:equivalentClass y:terroristActor; x:name owl:equivalentProperty y:fullName; x:mike owl:sameAs y:michael.

Slide 31: Semantically Aligned View
- [Diagram: the aligned graph, in which x:mike / y:michael carries both x:name "Mike" and y:fullName "Michael" and is typed as x:terrorist, y:terroristActor, x:person, and y:individual, with explicit and inferred properties distinguished.]
- The data from both sources snap together once we assert that x:mike and y:michael are the same individual.

Slide 32: Semantic Web Database
- Built on the bigdata architecture.
- A feature-complete triple store with RDFS inference.
- Very competitive performance.
- Open source.

Slide 33: RDFS+ Semantics
- RDF model theory: eager closure is implemented, with truth maintenance and declarative rules (easily extended).
- Backward chaining of selected rules to reduce data storage requirements.
- Simple OWL extensions support semantic alignment and federation.
- A magic sets solution is planned: more scalable, and it makes truth maintenance easier.
(A toy forward-closure sketch follows below.)
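As an illustration of eager (forward) closure, the sketch below applies a single RDFS rule, rdfs9 (if ?x rdf:type ?C and ?C rdfs:subClassOf ?D, then ?x rdf:type ?D), to a fixed point. The real rule set is larger and is maintained incrementally; this is only a toy.

import java.util.HashSet;
import java.util.Set;

// Forward closure over one RDFS rule (rdfs9): type statements are propagated
// up the rdfs:subClassOf hierarchy until no new statements are produced.
public class ForwardClosure {

    record Triple(String s, String p, String o) {}

    static Set<Triple> closure(Set<Triple> db) {
        Set<Triple> closed = new HashSet<>(db);
        boolean changed = true;
        while (changed) {                       // iterate to a fixed point
            changed = false;
            Set<Triple> inferred = new HashSet<>();
            for (Triple t : closed) {
                if (!t.p().equals("rdf:type")) continue;
                for (Triple sub : closed) {
                    if (sub.p().equals("rdfs:subClassOf") && sub.s().equals(t.o())) {
                        inferred.add(new Triple(t.s(), "rdf:type", sub.o()));
                    }
                }
            }
            changed = closed.addAll(inferred);
        }
        return closed;
    }

    public static void main(String[] args) {
        Set<Triple> db = new HashSet<>(Set.of(
                new Triple("x:terrorist", "rdfs:subClassOf", "x:person"),
                new Triple("x:mike", "rdf:type", "x:terrorist")));
        // Entails: x:mike rdf:type x:person
        closure(db).forEach(System.out::println);
    }
}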

Slide 34: Lexicon
- The lexicon uses two indices.
- terms index: assigns unique term identifiers, with full Unicode and datatype support. Multi-component key: [typecode] [language code | datatype URI]? [lexical form | datatype value]; the value is the 64-bit unique term identifier.
- ids index: the key is the term identifier and the value is the URI, blank node, or literal; its primary use is to emit RDF.
(A simplified sketch follows below.)
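A simplified sketch of the two lexicon indices, assuming plain strings for terms and sequentially assigned 64-bit identifiers; the real key layout (typecode, language code or datatype URI, Unicode sort keys) is considerably more involved.

import java.util.HashMap;
import java.util.Map;

// Simplified lexicon: a forward index assigning unique 64-bit identifiers to
// terms, and a reverse index used to turn identifiers back into terms when
// emitting RDF.
public class Lexicon {

    private final Map<String, Long> terms = new HashMap<>(); // term -> id
    private final Map<Long, String> ids = new HashMap<>();   // id   -> term
    private long nextId = 1;

    // Assigns (or returns) the unique term identifier.
    public synchronized long addTerm(String term) {
        return terms.computeIfAbsent(term, t -> {
            long id = nextId++;
            ids.put(id, t);
            return id;
        });
    }

    // Reverse lookup, primarily used to emit RDF.
    public synchronized String getTerm(long id) {
        return ids.get(id);
    }

    public static void main(String[] args) {
        Lexicon lex = new Lexicon();
        long mike = lex.addTerm("x:mike");
        long type = lex.addTerm("rdf:type");
        System.out.println(lex.getTerm(mike) + " has id " + mike
                + ", " + lex.getTerm(type) + " has id " + type);
    }
}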

Slide 35: Perfect Statement Index Strategy
- All access paths (SPO, POS, OSP); facts are duplicated in each index, so there is no bias in the access path.
- The key is the concatenation of the term ids; the value is unused.
- Fast range counts for choosing the join ordering.
- Provided by the bigdata B+Tree package.
(A key-construction sketch follows below.)
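The sketch below shows the key construction (concatenation of three 64-bit term identifiers) and a simple rule for choosing the access path from the bound positions of a triple pattern; the helper types are assumptions, not the bigdata B+Tree package.

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.concurrent.ConcurrentSkipListMap;

// Three statement indices (SPO, POS, OSP): every fact is written to each
// index under a key that concatenates its three term identifiers in the
// index's order, so any triple pattern has an unbiased access path.
public class StatementIndices {

    private final ConcurrentSkipListMap<byte[], byte[]> spo = newIndex();
    private final ConcurrentSkipListMap<byte[], byte[]> pos = newIndex();
    private final ConcurrentSkipListMap<byte[], byte[]> osp = newIndex();

    private static ConcurrentSkipListMap<byte[], byte[]> newIndex() {
        return new ConcurrentSkipListMap<>(Arrays::compareUnsigned);
    }

    // Key = concatenation of the 64-bit term ids; the value is unused.
    static byte[] key(long a, long b, long c) {
        return ByteBuffer.allocate(24).putLong(a).putLong(b).putLong(c).array();
    }

    void add(long s, long p, long o) {
        byte[] empty = new byte[0];
        spo.put(key(s, p, o), empty);
        pos.put(key(p, o, s), empty);
        osp.put(key(o, s, p), empty);
    }

    // Pick the access path whose leading components are bound (0 = unbound).
    String accessPathFor(long s, long p, long o) {
        if (s != 0) return "SPO";
        if (p != 0) return "POS";
        return "OSP";
    }

    public static void main(String[] args) {
        StatementIndices ix = new StatementIndices();
        ix.add(1, 2, 3); // e.g. ids for x:mike, rdf:type, x:terrorist
        System.out.println(ix.accessPathFor(0, 2, 3)); // POS serves (?s, p, o)
    }
}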

Slide 36: Data Load Strategy
- Sort the data for each index.
- The RDF parser (Rio) buffers ~100k statements per second; duplicate terms and statements are discarded.
- Batch ~100k statements per write; the index writes (terms, ids, spo, pos, osp) run concurrently against the journal.
- The net load rate is ~20,000 triples per second.

Slide 37: bigdata Bulk Load Rates
- [Table: per-ontology load statistics (#triples, #terms, triples per second, seconds) for wines.daml, sw_community, russiaa, Could_have_been, iptc-srs, hu, core, enzymes, alibaba_v41, wordnet nouns, cyc, ncioncology, Thesaurus, and taxonomy.]
- Totals: 3,512,377 triples and 1,964,576 terms, loaded at roughly 21k triples per second.

Slide 38: Comparison of Bulk Load Rates
- Oracle (1).
- OpenLink: 3k-6k tps (2).
- Franz AllegroGraph: 15k-20k tps (3).
- bigdata: ~20k tps (4), comparable to the commercial best of breed, with plenty of room for improvement through simple tuning.
- Notes: (1) source is Oracle; (4) measured on a Dell Latitude D620 (laptop) using 1G of RAM and JRockit VR26.

Slide 39: Comparison to Franz AllegroGraph
AllegroGraph (AMD64 quad-processor machine with 16GB of RAM):
- LUBM (Lehigh U. Benchmark), 1,000 files (50 universities?): 6.88M triples at 13.9K tps.
- OpenCyc: 255K triples at 15.0K tps.
bigdata (Xeon 32-bit quad-processor machine with 4GB of RAM):
- LUBM (Lehigh U. Benchmark), 1,000 files (100 universities): 13.33M triples at 22.0K tps.
- OpenCyc: 255K triples at 21.3K tps.

Slide 40: Scale-out Data Load
- Distributed architecture: range-partitioned indices; Jini for service discovery; both client and database are distributed.
- Test platform: Xeon (32-bit) 3GHz quad-processor machines with 4GB of RAM and a 280G disk each.
- Data set: LUBM U1000 (133M triples).

Slide 41: Scale-out Performance
- There is a performance penalty for the distributed architecture, but it is fully regained by the second machine; 10 machines would yield roughly 100k triples per second.
- Single-host architecture: 18.5K tps.
- Scale-out architecture, one server: 11.5K tps.
- Scale-out architecture, two servers: 25.0K tps.

Slide 42: Next Steps
- Sesame 2.x integration: quad store, SPARQL.
- Inference and high-level query: backward chaining for owl:sameAs, distributed JOINs, replacing justifications with SLD/magic sets.
- Scale-out: dynamic range partitioning and load balancing; service and data failover.

Slide 43: bigdata Timeline
- [Timeline diagram with milestones: development begins, journal, B+Tree, bulk index builds, partitioned indices, transactions, group commit and high concurrency, distributed indices, map/reduce, dynamic partitioning, RDF application, forward closure, consistent data load, fast query and truth maintenance, massive parallel ingest, quad store.]

Slide 44: Contact
- Bryan Thompson, Chief Scientist, SYSTAP, LLC.
- bigdata(TM): flexible, reliable, affordable web-scale computing.
