
OAK: Database optimizations and architectures for complex large data. Ioana MANOLESCU-GOUJOT, INRIA Saclay Île-de-France, Université Paris Sud, LRI (UMR CNRS 8623).

Plan
1. The team
2. OAK research at a glance
3. Zoom: adaptive heterogeneous stores for Big Data Analytics
4. Wrap-up

1 The team

OAK project-team, joint between INRIA and U. Paris Sud.
INRIA: Ioana Manolescu (DR).
U. Paris Sud faculty: Nicole Bidoit (Pr), Bogdan Cautis (Pr), Benoit Groz (MdC).
External faculty: Dario Colazzo (Pr, U. Dauphine), François Goasdoué (Pr, U. Rennes 1).
Plus 2 post-docs, 2 engineers, 6 PhD students, 2 M2 interns.

2 OAK research at a glance

Database optimizations and architectures. Database processing: query and transform the data through declarative languages; users specify what to do, the system figures out how to do it.
1. Formal models for describing the data and the processing: a careful compromise between expressivity and efficiency.
2. Logical optimization: inferring whether a computation is equivalent to, or contained in, another; enumerating alternative methods of evaluating a given computation; query optimization for novel data models and languages.
3. Physical optimization: automated storage tuning (selecting materialized views, indices); physical operators.

Long-term goal: efficient tools for the declarative management of complex data. Impact: industrialize the construction of innovative data-centric applications.
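Logical optimization hinges on tests like the one below: a minimal, brute-force sketch (illustrative Python, not the team's code) of conjunctive-query containment via Chandra-Merlin containment mappings: Q2 is contained in Q1 iff some homomorphism maps Q1's atoms into Q2's atoms and Q1's head onto Q2's head.

from itertools import product

def is_var(t):
    # Variables are marked explicitly with a leading '?'; everything else is a constant.
    return isinstance(t, str) and t.startswith("?")

def contained_in(q_small, q_big):
    """Check q_small is contained in q_big (conjunctive queries, Chandra & Merlin):
    search for a homomorphism from q_big's atoms to q_small's atoms that maps
    q_big's head variables, positionally, onto q_small's head.
    Queries are dicts: {"head": [terms], "body": [(relation, [terms]), ...]}."""
    big_atoms, small_atoms = q_big["body"], q_small["body"]
    # Each atom of q_big must map to some atom of q_small over the same relation.
    candidates = [[a for a in small_atoms if a[0] == rel] for (rel, _) in big_atoms]
    if any(not c for c in candidates):
        return False
    for choice in product(*candidates):
        h, ok = {}, True
        for (rel, args), (_, img_args) in zip(big_atoms, choice):
            if len(args) != len(img_args):
                ok = False
                break
            for a, b in zip(args, img_args):
                if is_var(a):
                    if a in h and h[a] != b:   # inconsistent mapping
                        ok = False
                        break
                    h[a] = b
                elif a != b:                   # constants must match exactly
                    ok = False
                    break
            if not ok:
                break
        if ok and [h.get(v, v) for v in q_big["head"]] == q_small["head"]:
            return True
    return False

# Q1(x) :- R(x, y)           "all x with some R-partner"
# Q2(x) :- R(x, y), R(y, z)  "all x starting a length-2 R-path"
q1 = {"head": ["?x"], "body": [("R", ["?x", "?y"])]}
q2 = {"head": ["?x"], "body": [("R", ["?x", "?y"]), ("R", ["?y", "?z"])]}
print(contained_in(q2, q1))   # True: every answer to Q2 is also an answer to Q1
print(contained_in(q1, q2))   # False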

OAK research at a glance. Data: document data (JSON, XML, ...), semantic data (RDF, OWL, ...), other complex data (XR, social, ...). Techniques: static analysis and query optimization; storage optimization through views and indices; massively parallel processing in the cloud.

3 Zoom: Self-tuning heterogeneous stores

The problem: a glut of varied data management systems (DMSs, where DM includes DBMS): NoSQL DMSs, cloud DMSs, ...
- Different data models: relational, nested relational, tree, key-value, graphs, ...
- Different data access capabilities (from simple APIs to various query languages)
- Different architectures: disk- vs. memory-based, centralized vs. distributed, etc.
- Different performance
- Different levels of transaction support
How do we get performance for a variety of datasets on a variety of DMSs? The focus is not on beating the most specialized optimizations of the most specialized engine for a given model/application, but on robust performance for varied data models across a changing set of heterogeneous DMSs.

The problem, qualified: with no hassle for the application layer, with correctness guarantees, automatically, and resilient to changes.

Sample application: Big Data Analytics in Datalyse (Investissement d'Avenir Cloud & Big Data, 2013-2016), led by Business et Decision, with INRIA Lille, LIG, LIRMM. Goal: build cloud-based Big Data Analytics tools for heterogeneous data from the project's data providers.

Invisible glue for heterogeneous stores. Data models: as the data is (side by side). Systems: those available (side by side). Store each data set as a set of (potentially indexed) fragments, that is, splits / shards / partitions / indexes / materialized views; each fragment resides in a DMS.

Dataset fragmentations (figure): a table with columns A, B, C, D and rows 1 to 6, split either into horizontal fragments (subsets of the rows, keeping all columns, e.g. rows 1-2, 3-4, 5-6) or into vertical fragments (all rows, keeping column A paired with another column: A+B, A+C, A+D).
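As a toy illustration of the figure above (plain Python, invented data, standing in for no particular DMS): horizontal fragments keep a subset of the rows, vertical fragments keep the key column A plus some other column, and both fragmentations are lossless.

# Toy relation R(A, B, C, D) with rows keyed by A (values 1..6).
R = [
    {"A": 1, "B": "b1", "C": "c1", "D": "d1"},
    {"A": 2, "B": "b2", "C": "c2", "D": "d2"},
    {"A": 3, "B": "b3", "C": "c3", "D": "d3"},
    {"A": 4, "B": "b4", "C": "c4", "D": "d4"},
    {"A": 5, "B": "b5", "C": "c5", "D": "d5"},
    {"A": 6, "B": "b6", "C": "c6", "D": "d6"},
]

def horizontal(rows, pred):
    """Horizontal fragment: a subset of the rows, all columns."""
    return [r for r in rows if pred(r)]

def vertical(rows, cols):
    """Vertical fragment: all rows, a subset of the columns (always keep the key A)."""
    return [{c: r[c] for c in ["A"] + cols} for r in rows]

# Horizontal split by ranges of A (rows 1-2, 3-4, 5-6).
h1 = horizontal(R, lambda r: r["A"] <= 2)
h2 = horizontal(R, lambda r: 3 <= r["A"] <= 4)
h3 = horizontal(R, lambda r: r["A"] >= 5)

# Vertical split into (A, B), (A, C), (A, D).
v_ab, v_ac, v_ad = vertical(R, ["B"]), vertical(R, ["C"]), vertical(R, ["D"])

# Losslessness: the horizontal fragments union back to R,
# and the vertical fragments join back to R on the key A.
assert sorted(h1 + h2 + h3, key=lambda r: r["A"]) == R
rejoined = [{**a, **b, **c} for a in v_ab for b in v_ac for c in v_ad
            if a["A"] == b["A"] == c["A"]]
assert sorted(rejoined, key=lambda r: r["A"]) == R
print("fragmentation is lossless")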

Dataset fragmentations. Example: relational dataset R.

Fragmentations made of views. The content of each fragment is described declaratively: fragment = (materialized) view [+ parameters]. Examples: «the name and addresses of all clients»; «the sales partitioned by zipcode». This also covers indexes («the name and addresses of all clients, by their age and zipcode»), as well as navigation in trees or graphs and key-value stores: fragment = materialized view [+ parameters] [+ input pattern].
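A minimal sketch of "fragment = (materialized) view [+ parameters]" using SQLite from Python (all table and view names are invented for the example; SQLite views are virtual rather than materialized, but they play the role of the fragments here). The second fragment is only useful once a zipcode value is supplied, which is the "input pattern" of an index-like fragment.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE clients(id INTEGER PRIMARY KEY, name TEXT, address TEXT,
                     zipcode TEXT, age INTEGER);
CREATE TABLE sales(id INTEGER PRIMARY KEY, client_id INTEGER, amount REAL,
                   zipcode TEXT);

-- Fragment 1 = plain view: "the name and addresses of all clients".
CREATE VIEW client_contact AS SELECT name, address FROM clients;

-- Fragment 2 = view + parameter: "the sales partitioned by zipcode".
-- The partitioning parameter shows up as a binding in the query
-- that accesses the fragment.
CREATE VIEW sales_by_zip AS SELECT zipcode, client_id, amount FROM sales;
""")

db.executemany("INSERT INTO clients VALUES (?,?,?,?,?)",
               [(1, "Ada", "1 rue A", "75013", 36), (2, "Bob", "2 rue B", "91400", 52)])
db.executemany("INSERT INTO sales VALUES (?,?,?,?)",
               [(1, 1, 10.0, "75013"), (2, 2, 25.0, "91400")])

print(db.execute("SELECT * FROM client_contact").fetchall())
# Accessing the partitioned fragment requires a value for the parameter
# (the input pattern), much like probing an index:
print(db.execute("SELECT client_id, amount FROM sales_by_zip WHERE zipcode = ?",
                 ("75013",)).fetchall())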

Fragments distribution across stores: the fragments of a dataset may be spread over an RDF DMS, a key-value store, a JSON DMS, a relational DBMS, a Pig store on top of a DFS, etc. Data model translation is applied at loading time; the extraction logic is in the view. Applications query the data in its native format. Fragment description by views guarantees properties such as completeness and equivalence.
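Data model translation at loading, sketched (a made-up example, not the project's code): the extraction logic of a view flattens a nested JSON record into relational-style tuples before it is placed in a relational fragment.

import json

# Hypothetical input document and extraction logic for one fragment.
doc = json.loads("""
{"client": {"id": 1, "name": "Ada"},
 "orders": [{"id": 10, "amount": 10.0}, {"id": 11, "amount": 25.0}]}
""")

def extract_orders(d):
    """Model translation at load time: nested JSON -> flat tuples (client_id, order_id, amount)."""
    c = d["client"]["id"]
    return [(c, o["id"], o["amount"]) for o in d["orders"]]

print(extract_orders(doc))   # [(1, 10, 10.0), (1, 11, 25.0)]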

Query answering = view-based rewriting (VBR). VBR is known for dramatic performance improvements, and there is no limit on the views (e.g., a view may be the query itself). Comparison with «Local As Views» mediation, in terms of data models: in LAV, a common data model is used for the query Q, the mediator schema, and the source schemas V1, ..., Vn (one per DMS); here, Q is expressed in the native dataset model against the dataset schema, while the source schemas V1, ..., Vn still live in their respective DMSs.

Query answering = view-based rewriting. Comparison with «Local As Views» mediation, continued: the data models sit side by side at the top. Each query Q is expressed in the native model of its dataset, against that dataset's schema: dataset 1 is backed by source schemas V^1_1 (DMS1), ..., V^1_n1 (DMSn); dataset k by V^k_1 (DMSk1), ..., V^k_nk (DMSknk).
→ Common benefit with LAV: applications are unaware of the fragmentation!
→ Novel benefit: fragments can migrate to other systems and data models.
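A small worked example of the rewriting itself (hypothetical names, SQLite from Python): the application writes its query against the virtual dataset R(A, B, C), which is physically stored only as two vertical fragments; the view-based rewriting replaces R by a join of the fragments on the key A, which is equivalent because the fragmentation is lossless.

import sqlite3

db = sqlite3.connect(":memory:")
# R(A, B, C) is only virtual; what is physically stored are two vertical
# fragments, each defined as a view over R:
#   F_AB(A, B) = SELECT A, B FROM R      F_AC(A, C) = SELECT A, C FROM R
db.executescript("""
CREATE TABLE F_AB(A INTEGER, B TEXT);
CREATE TABLE F_AC(A INTEGER, C TEXT);
INSERT INTO F_AB VALUES (1,'b1'),(2,'b2'),(3,'b3');
INSERT INTO F_AC VALUES (1,'c1'),(2,'c2'),(3,'c3');
""")

# Application query, written against the native dataset schema:
q = "SELECT A, C FROM R WHERE B = 'b2'"
print("application query:", q)

# Equivalent view-based rewriting, using only the stored fragments:
q_rewritten = """
SELECT F_AC.A, F_AC.C
FROM F_AB JOIN F_AC ON F_AB.A = F_AC.A
WHERE F_AB.B = 'b2'
"""
print(db.execute(q_rewritten).fetchall())   # [(2, 'c2')]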

Estocada architecture (figure): a data-centric application stores and queries datasets (Dataset 1, Dataset 2, ..., Dataset n). The Storage Advisor and the Storage Descriptors Manager decide how each dataset is cut into fragments (F1, F2, ...); the Query Evaluator produces a query execution plan; the Estocada runtime / execution engine runs it over the fragments (D1/F1, D2/F2, ...) hosted in the underlying stores: key-value store, document store, nested relations store, relational store, other NoSQL systems.
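One way to picture what the Storage Descriptors Manager keeps track of (a purely hypothetical structure, not Estocada's actual format): each fragment records the dataset it comes from, its defining view, the store hosting it, that store's data model, and any input pattern.

# Hypothetical storage descriptors: which fragment lives where, under which model.
storage_descriptors = {
    "D1/F1": {"dataset": "D1", "view": "SELECT name, address FROM clients",
              "store": "PostgreSQL", "model": "relational"},
    "D1/F2": {"dataset": "D1", "view": "clients keyed by zipcode",
              "store": "Redis", "model": "key-value",
              "input_pattern": ["zipcode"]},
    "D2/F1": {"dataset": "D2", "view": "orders nested by customer",
              "store": "MongoDB", "model": "document"},
}

def fragments_for(dataset):
    """All fragments (and their host stores) currently holding pieces of a dataset."""
    return {k: v["store"] for k, v in storage_descriptors.items()
            if v["dataset"] == dataset}

print(fragments_for("D1"))   # {'D1/F1': 'PostgreSQL', 'D1/F2': 'Redis'}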

Estocada core modules. View-based rewriting (VBR): outputs queries to the DMSs (in their native languages) plus the remaining integration operations; DMS capability descriptions are exploited here. Runtime: performs the integration operations; for this, a single runtime for the most expressive model (e.g., nested relations) should suffice, and we may borrow one of the DMSs' runtimes.
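A sketch of the runtime's job on one query (hypothetical stub adapters instead of real DMS calls): each store answers the subquery it can handle in its native language, and the remaining integration operation, here a cross-store join, is performed by the single mediator runtime.

# Hypothetical adapters standing in for two DMSs (stubs returning canned data);
# in reality each would push a query, in the store's native language, to the store.
def query_kv_store(zipcode):
    # An index-like fragment: zipcode -> client ids.
    return {"75013": [1, 7]}.get(zipcode, [])

def query_document_store(client_ids):
    # A JSON fragment: client id -> order document.
    docs = {1: {"client": 1, "total": 10.0}, 7: {"client": 7, "total": 99.0}}
    return [docs[i] for i in client_ids if i in docs]

def runtime_join(zipcode):
    """The remaining integration operation the mediator keeps for itself:
    neither store can perform this cross-store join on its own."""
    ids = query_kv_store(zipcode)         # subquery 1, answered by the k-v store
    orders = query_document_store(ids)    # subquery 2, answered by the document store
    return [{"zipcode": zipcode, **o} for o in orders]

print(runtime_join("75013"))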

What about performance? Select the rewriting likely to lead to the best query evaluation performance, using a cross-system cost model: based on cost-model calibration, with a modest extension for binding patterns. View recommendation: a «cross-model, cross-system data storage advisor». There has been great progress in recent years on single-model storage (view, index, etc.) recommendation; this is a combinatorial problem (select a subset of the possible views minimizing the cost estimation).
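A minimal sketch of cost-based rewriting selection (all constants below are invented; a real cross-system cost model would be calibrated against each store): estimate each candidate rewriting's cost as the calibrated per-store access costs plus the cost of the integration work left to the runtime, then pick the cheapest.

# Calibrated per-tuple access costs for each store model (illustrative values).
COST_PER_TUPLE = {"relational": 1.0, "key-value": 0.2, "document": 0.6}
INTEGRATION_COST_PER_TUPLE = 0.5   # work left to the mediator runtime

def rewriting_cost(fragment_accesses, integration_tuples):
    """fragment_accesses: list of (store model, estimated tuples scanned)."""
    access = sum(COST_PER_TUPLE[m] * n for m, n in fragment_accesses)
    return access + INTEGRATION_COST_PER_TUPLE * integration_tuples

candidates = {
    # Rewriting 1: scan the full relational fragment, no further integration.
    "scan relational fragment": rewriting_cost([("relational", 100_000)], 0),
    # Rewriting 2: probe the k-v index, fetch documents, join in the runtime.
    "k-v probe + document fetch": rewriting_cost([("key-value", 200),
                                                  ("document", 200)], 200),
}
best = min(candidates, key=candidates.get)
print(best, candidates[best])   # the k-v + document rewriting wins here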

4 Advancement and potential perspectives

Estocada: advancement and perspectives. Current status: 3 seniors (IM, FG, Alin Deutsch from UCSD), 2 post-docs, 1 PhD student, 1 more to start in 2015. Core code modules (VBR) are ready. Roadmap: deploy adaptors and a cost model for a few popular systems (Pig, MongoDB, a Hadoop-based RDF store). Would like to have: more real use-case scenarios; an engineer (preferred) and/or another PhD student.

Thank you / questions?