Completing the Big Data Ecosystem:

Similar documents
Cloud Computing at Google. Architecture

PERFORMANCE MODELS FOR APACHE ACCUMULO:

How To Scale Out Of A Nosql Database

In Memory Accelerator for MongoDB

Practical Cassandra. Vitalii

Schema Design Patterns for a Peta-Scale World. Aaron Kimball Chief Architect, WibiData

Storage of Structured Data: BigTable and HBase. New Trends In Distributed Systems MSc Software and Systems

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

How To Improve Performance In A Database

CitusDB Architecture for Real-Time Big Data

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Vendor briefing Business Intelligence and Analytics Platforms Gartner 15 capabilities

InfiniteGraph: The Distributed Graph Database

Comparing SQL and NOSQL databases

The Quest for Extreme Scalability

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Bigdata : Enabling the Semantic Web at Web Scale

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

COURSE SYLLABUS COURSE TITLE:

The Sierra Clustered Database Engine, the technology at the heart of

Apache HBase. Crazy dances on the elephant back

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Challenges for Data Driven Systems

NoSQL and Hadoop Technologies On Oracle Cloud

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

Database Scalability and Oracle 12c

Big Data Analytics. Rasoul Karimi

LINKED DATA EXPERIENCE AT MACMILLAN Building discovery services for scientific and scholarly content on top of a semantic data model

Big Data and Scripting Systems build on top of Hadoop

Implementing Data Models and Reports with Microsoft SQL Server 2012 MOC 10778

An Approach to Implement Map Reduce with NoSQL Databases

Performance Tuning for the Teradata Database

Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores

Bigtable is a proven design Underpins 100+ Google services:

EZManage V4.0 Release Notes. Document revision 1.08 ( )

Java SE 8 Programming

Module 1: Getting Started with Databases and Transact-SQL in SQL Server 2008

MS Designing and Optimizing Database Solutions with Microsoft SQL Server 2008

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

FHE DEFINITIVE GUIDE. ^phihri^^lv JEFFREY GARBUS. Joe Celko. Alvin Chang. PLAMEN ratchev JONES & BARTLETT LEARN IN G. y ti rvrrtuttnrr i t i r

Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & the Solution

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2

Writing Queries Using Microsoft SQL Server 2008 Transact-SQL

Big Table A Distributed Storage System For Data

Performance rule violations usually result in increased CPU or I/O, time to fix the mistake, and ultimately, a cost to the business unit.

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

The Apache Cassandra storage engine

Oracle BI 11g R1: Build Repositories

Apache Cassandra for Big Data Applications

Developing Microsoft SharePoint Server 2013 Advanced Solutions MOC 20489

Apache Kylin Introduction Dec 8,

Cassandra A Decentralized, Structured Storage System

Cloud Self Service Mobile Business Intelligence MAKE INFORMED DECISIONS WITH BIG DATA ANALYTICS, CLOUD BI, & SELF SERVICE MOBILITY OPTIONS

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

The Hadoop Eco System Shanghai Data Science Meetup

A programming model in Cloud: MapReduce

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

bigdata Managing Scale in Ontological Systems

Toad for Oracle 8.6 SQL Tuning

Big Data and Scripting Systems build on top of Hadoop

In-Memory Databases MemSQL

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

Introduction to Database Systems CSE 444

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Open source large scale distributed data management with Google s MapReduce and Bigtable

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

SQL Server 2012 Business Intelligence Boot Camp

Ingo Hilgefort. Integrating SAP. Business Objects BI with SAP NetWeaver. Bonn Boston

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

Writing Queries Using Microsoft SQL Server 2008 Transact-SQL

Business Intelligence in SharePoint 2013

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Apache HBase: the Hadoop Database

Self-Service Business Intelligence

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Microsoft End to End Business Intelligence Boot Camp

Big Data with Component Based Software

The IBM Cognos Platform for Enterprise Business Intelligence

Graph Database Proof of Concept Report

Hadoop IST 734 SS CHUNG

David Dye. Extract, Transform, Load

Integrating Big Data into the Computing Curricula

Java SE 7 Programming

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

Business Application Services Testing

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

ABSTRACT 1. INTRODUCTION. Kamil Bajda-Pawlikowski

MS SQL Performance (Tuning) Best Practices:

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

SQL Server 2012 Optimization, Performance Tuning and Troubleshooting

PostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor

Big Data and Big Data Modeling

Transcription:

Completing the Big Data Ecosystem: in sqrrl data INC. August 3, 2012

Design Drivers in Analysis of big data is central to our customers requirements, in which the strongest drivers are: Scalability: The ability to do twice the work at only (about) twice the cost. Adaptability: The ability to rapidly evolve the analytical tools available in an operational environment, building upon and enhancing existing capabilities. Security: Getting all of the above without giving up secrecy and assurance properties. From these directives we can derive the following requirements: Data-Centric Security to reduce coordination needed between application developers and data providers. Simplicity in the overall architecture to encourage participation and ameliorate learning curve. Generic design patterns to store and organize data whose mat we don t control. Generic discovery analytics to retrieve and visualize generic data.

Optimization... is a secondary concern, given: hundreds of evolving applications, hundreds of changing data sources, petabytes/exabytes of data, many complicated interactions. Instead, we need a generic platm that is cheap, simple, scalable, secure, and adaptable, with pretty good permance. in

Key/Value Structure in An Key is a 5-tuple, including: Row: controls Atomicity Column Family: controls Locality Column Qualifier: controls Uniqueness Visibility: controls Access (unique to ) Timestamp: controls Versioning Sample Entries Row : Col. Fam. : Col. Qual. : Visibility : Timestamp Value Adam : Favorites : Food : (Public) : 20090801 Sushi Adam : Favorites : Programming Language : (Private) : 20090830 Java Adam : Favorites : Programming Language : (Private) : 20070725 C++ Adam : Friends : Bob : (Public) : 20110601 Adam : Friends : Joe : (Private) : 20110601

Visibility Label Syntax and Semantics in Document Labels Doc 1 : (Federation) Doc 2 : (Klingon Vulcan) Doc 3 : (Federation&Human&Vulcan) Doc 4 : (Federation&(Human Vulcan)) Syntax WORD [a-za-z0-9 ]+ CLAUSE AND OR AND AND & AND (CLAUSE) WORD OR OR OR (CLAUSE) WORD User Authorization Sets CptKirk : {Federation,Human} MrSpock : {Federation,Human,Vulcan} Semantics (T τ) (τ A) term (T, A) = true (T T 1 & T 2 ) ((T 1, A) = true) ((T 2, A) = true) and (T, A) = true (T T 1 T 2 ) (((T 1, A) = true) ((T 2, A) = true)) or (T, A) = true (T (T1)) (T1 = true) paren (T, A) = true

Tablets in Collections of key/value pairs m Tables Tables are partitioned into Tablets Metadata tablets hold info about other tablets, ming a three-level hierarchy A Tablet is a unit of work a Tablet Server

Distributed Processes in

Tablet Server Composition in Quick and loose definitions: Table: A map of keys to values with one global sort order among keys. Tablet: A row range within a Table. Tablet Server: The mechanism that hosts Tablets, providing the primary functionality of Bigtable or. Tablet servers have several primary functions: 1 Hosting RPCs (read, write, etc.) 2 Managing resources (RAM, CPU, File I/O, etc.) 3 Scheduling background tasks (compactions, caching, etc.) 4 Handling key/value pairs Category 4 is almost entirely accomplished through the Iterator framework.

Tablet Server Data Flow in Iterator Uses File Reads Block Caching Merging Deletion Isolation Locality Groups Range Selection Column Selection Cell-level Security Versioning Filtering Aggregation Partitioned Joins

The Perils of Distributed Computing in Dealing with failures is hard! Operations like table creation are logically atomic, but consist of multiple operations on distributed systems. Resource locking (via mutex, semaphores, etc.) provides some sanity. Distributed systems have many complicated failure modes: clients, master, tablet servers, and dependent systems can all go offline periodically. Who is responsible unlocking locks when any component can fail? do we know it s safe to unlock a lock?

Testing Procedures in Testing Frameworks Unit: Verify correct functioning of each module separately System: Perm correctness and permance tests on a small running instance Load/Scale: Generate high loads at scale and measure permance and correctness Random Walk: Randomly, repeatedly, and concurrently execute a variety of test modules representative of user activity on an instance at scale Simulation: Evaluate the model to gauge expected permance Other Considerations Scoping tests to include server-side code, client-side code, dependent processes, etc. Code coverage vs. path coverage Static vs. dynamic analysis Simulating failures of distributed components Strange failure modes (often hardware/physics-related)

Adampotence Idempotent: f (f (x)) = f (x) Adampotent: f (f (x)) = f (x), where f (x) denotes partial execution of f (x) in

Fault-Tolerant Executor in If a process dies, previously submitted operations continue to execute on restart. FATE serializes every task in Zookeeper bee execution. The Master process uses FATE to execute table operations and administrative actions. FATE eliminates the single point of failure.

Verified State Models State models used many internal functions Explicit-state model checking proves correctness in

in

in

Event Table with Inverted Index in

Inverted Index Flow in

Multidimensional Index in See also: http://en.wikipedia.org/wiki/geohash

Graph Table in

The shard Table in

Appstores in Reduced barrier to entry Faster app development More innovation!

Security Model Supported Innovation in Role 1 Role 2 Role 3 Role 4 App 1 App 2 App 3 App 4 = role-centric security = data-centric security

Data-Centric Security in No in-app security necessary Apps can be used by anyone Adding users is easy

Maximizing Utility in Application developers need underlying design patterns exposed via an API. Examples include SQL, SPARQL, JAQL, D4M, and many more. What are the right ways to expose s technology and techniques?

Analytic Primitives Inmation Retrieval Graph Analytics in Online Statistics

Questions/Contact Info in Cofounder and CTO sqrrl data, INC. adam@sqrrl.co General Info Email: Website: John Cofounder and Director of Ecosystems sqrrl data, INC. john@sqrrl.co info@sqrrl.co www.sqrrl.co