OAK: Database optimizations and architectures for complex large data. Ioana Manolescu-Goujot, INRIA Saclay Île-de-France and Université Paris-Sud (LRI, UMR CNRS 8623)
Plan: 1. The team 2. OAK research at a glance 3. Zoom: adaptive heterogeneous stores for Big Data Analytics 4. Wrap-up
1 The team
OAK project-team, joint between INRIA and U. Paris-Sud. INRIA: Ioana Manolescu (DR). U. Paris-Sud faculty: Nicole Bidoit (Pr), Bogdan Cautis (Pr), Benoit Groz (MdC). External faculty: Dario Colazzo (Pr, U. Dauphine), François Goasdoué (Pr, U. Rennes 1). 2 post-docs, 2 engineers, 6 PhD students, 2 M2 interns.
2 OAK research at a glance
Database optimizations and architectures. Database processing: query and transform the data through declarative languages. Users specify what to do; the system figures out how to do it. 1. Formal models for describing the data and the processing: a careful compromise between expressivity and efficiency. 2. Logical optimization: inferring whether a computation is equivalent to, or contained in, another; enumerating alternative methods of evaluating a given computation; query optimization for novel data models and languages. 3. Physical optimization: automated storage tuning (selecting materialized views, indices); physical operators. Long-term goal: efficient tools for the declarative management of complex data. Impact: industrialize the construction of innovative data-centric applications.
OAK research at a glance. Document data (JSON, XML, …): static analysis and query optimization; storage optimization through views and indices; massively parallel processing in the cloud. Semantic data (RDF, OWL, …). Other complex data (XR, social, …).
3 Zoom: Self-tuning heterogeneous stores
The problem. Glut of varied data management systems (DMSs; DMS includes DBMS, NoSQL DMSs, cloud DMSs): different data models (relational, nested relational, tree, key-value, graphs, …); different data access capabilities (from simple APIs to various query languages); different architectures (disk- vs. memory-based, centralized vs. distributed, etc.); different performance; different levels of transaction support. How do we get performance for a variety of datasets on a variety of DMSs? The focus is not on beating the most specialized optimizations of the most specialized engine for a given model/application, but on robust performance for varied data models across a changing set of heterogeneous DMSs.
The problem, qualified: with no hassle for the application layer; with correctness guarantees; automatically; resilient to changes.
Sample application: Big Data Analytics in Datalyse. Investissement d'Avenir Cloud & Big Data, 2013-2016. Led by Business & Decision, with INRIA Lille, LIG, LIRMM. Goal: build cloud-based Big Data Analytics tools for heterogeneous data from several data providers.
The invisible glue for heterogeneous stores. Data models: as the data is (side by side). Systems: those available (side by side). Store each data set as a set of (potentially indexed) fragments, or splits / shards / partitions / indexes / materialized views. Each fragment resides in a DMS.
Dataset fragmentations. [Figure: a relation R(A, B, C, D) with rows 1-6, fragmented horizontally (subsets of rows, all columns) and vertically (subsets of columns such as AB, AC, AD, all rows).]
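The two fragmentation styles in the figure can be sketched as follows. This is a hypothetical illustration, not Estocada code; the relation contents and function names are made up.

```python
# Toy relation R(A, B, C, D); values are illustrative only.
R = [
    {"A": 1, "B": 10, "C": 100, "D": 1000},
    {"A": 2, "B": 20, "C": 200, "D": 2000},
    {"A": 3, "B": 30, "C": 300, "D": 3000},
]

def horizontal_fragment(rows, predicate):
    """Horizontal fragmentation: keep a subset of the rows, all columns."""
    return [r for r in rows if predicate(r)]

def vertical_fragment(rows, columns):
    """Vertical fragmentation: keep a subset of the columns, all rows."""
    return [{c: r[c] for c in columns} for r in rows]

# A row subset (A <= 2) and the (A, B) column projection:
f1 = horizontal_fragment(R, lambda r: r["A"] <= 2)
f2 = vertical_fragment(R, ["A", "B"])
```

The union of horizontal fragments (or the join of vertical ones on a key) reconstructs R, which is what lets fragments stand in for the dataset.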
Dataset fragmentations Example: relational dataset R
Fragmentations made of views. The content of each fragment is described declaratively: fragment = (materialized) view [+ parameters]. «The name and addresses of all clients». «The sales, partitioned by zipcode». Also indexes: «The name and addresses of all clients, by their age and zipcode». Also: navigation in trees or graphs, key-value stores. Fragment = materialized view [+ parameters] [+ input pattern].
Fragments distribution across stores. Fragments of the datasets are spread across heterogeneous stores: an RDF DMS, a key-value store, a JSON DMS, a relational DBMS, a Pig store on top of a DFS. Data model translation is applied at loading; the extraction logic is in the view. Applications query the data in its native format. Fragment description by views guarantees properties such as completeness and equivalence.
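"Data model translation applied at loading" can be illustrated by materializing the same logical rows in two target models. A minimal sketch, with hypothetical function names; real loaders would stream into the actual stores rather than build in-memory structures.

```python
def load_into_kv_store(rows, key_attr):
    """Translate relational rows into key-value pairs at loading time:
    the chosen attribute becomes the key, the rest becomes the value."""
    return {r[key_attr]: {k: v for k, v in r.items() if k != key_attr}
            for r in rows}

def load_into_doc_store(rows):
    """Translate relational rows into JSON-style documents at loading time."""
    return [dict(r) for r in rows]

rows = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Bob"}]
kv = load_into_kv_store(rows, "id")    # for the key-value store
docs = load_into_doc_store(rows)       # for the JSON/document store
```

Because the translation happens once at load time, the application keeps querying in the dataset's native format while each store serves the fragment in its own model.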
Query answering = view-based rewriting (VBR). VBR is known for dramatic performance improvements. No limit on views (e.g., view = query). Comparison with «Local As Views» mediation, on data models: in LAV, a common data model is assumed for (V1, …, Vn, Q); the query Q is posed against a mediator schema, and each source schema Vi resides in DMSi. Here, the data models sit side by side at the top: for each dataset k, the query Q is posed in the native model of dataset k against the dataset's own schema, and each source schema V_k,i resides in some DMS. → Common benefit with LAV: applications are unaware of the fragmentation! → Novel benefit: fragments can migrate to other systems and data models.
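A toy version of view-based rewriting, to fix the idea: queries and views are reduced to attribute sets, and a view rewrites a query when it exposes every attribute the query needs. This is a deliberately simplified sketch (view names invented); real VBR also reasons about predicates, joins, and equivalence/containment of computations.

```python
# Materialized fragments (views), each exposing a set of attributes.
views = {
    "v_clients": {"name", "address"},
    "v_sales":   {"zipcode", "amount"},
    "v_full":    {"name", "address", "age", "zipcode"},
}

def rewrite(query_attrs, views):
    """Return the names of views that can each answer the query alone,
    i.e. whose attribute set contains the query's attribute set."""
    return sorted(v for v, attrs in views.items() if query_attrs <= attrs)

# «The name and addresses of all clients» has two candidate rewritings:
candidates = rewrite({"name", "address"}, views)
```

Having several candidate rewritings is exactly what makes a cost model necessary: correctness alone does not say which one to execute.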
Estocada architecture. [Figure: a data-centric application stores and queries datasets (Dataset 1, Dataset 2, …, Dataset n); each dataset is split into fragments (F1, F2, F3, F4). Modules: Storage Advisor, Query Evaluator, Storage Descriptors Manager; queries are compiled into a Query Execution Plan run by the Estocada Runtime Execution Engine. Fragments (D1/F1, D2/F2, D1/F2, D1/F3, D1/F4, D2/F3, D2/F1) are placed across a key-value store, a document store, a nested-relations store, and a relational store, NoSQL systems included.]
Estocada core modules. View-based rewriting (VBR): outputs queries to the DMSs (in their native languages) plus the remaining integration operations; DMS capability descriptions are exploited here. Runtime: performs the integration operations; for this, a single runtime (for the most expressive model, e.g., nested relations) should do; we may borrow the runtime of one of the DMSs.
What about performance? Select the rewriting likely to lead to the best query evaluation performance: a cross-system cost model, based on cost-model calibration, with a modest extension for binding patterns. View recommendation: a «cross-model, cross-system data storage advisor». Great progress in recent years on single-model storage recommendation (views, indexes, etc.). A combinatorial problem: select the subset of possible views minimizing the estimated cost.
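Selecting the cheapest rewriting under a calibrated cross-system cost model can be sketched as follows. The unit costs and the cost formula are invented for illustration; real calibration would measure each system's actual behavior.

```python
# Per-system unit costs, as a stand-in for calibrated cost-model constants.
store_unit_cost = {"kv": 1.0, "relational": 2.5, "document": 4.0}

def estimate(rewriting):
    """Toy cost: sum over accessed fragments of
    (store unit cost x fragment size), in abstract units."""
    return sum(store_unit_cost[store] * size for store, size in rewriting)

def pick_best(rewritings):
    """Keep the candidate rewriting with the lowest estimated cost."""
    return min(rewritings, key=estimate)

r1 = [("kv", 1000)]                        # one large key-value scan
r2 = [("relational", 200), ("kv", 100)]    # small relational probe + k-v lookup
best = pick_best([r1, r2])                 # r2 wins: 600.0 vs. 1000.0
```

The same skeleton extends to binding patterns by charging differently for a fragment depending on whether its required input attributes are bound.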
4 Advancement and potential perspectives
Estocada: advancement and perspectives. Current status: 3 seniors (IM, FG, Alin Deutsch from UCSD), 2 post-docs, 1 PhD student, 1 more to start in 2015. Core code modules (VBR) ready. Roadmap for deploying adaptors and cost models for a few popular systems: Pig, MongoDB, a Hadoop-based RDF store. Would like to have: more real use-case scenarios; an engineer (preferred) and/or another PhD student.
Thank you / Questions?