Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis

Similar documents

Apache Flink Next-gen data analysis. Kostas

Hadoop Job Oriented Training Agenda

Apache Flink. Fast and Reliable Large-Scale Data Processing

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Moving From Hadoop to Spark

Scaling Out With Apache Spark. DTL Meeting Slides based on

Hadoop Ecosystem B Y R A H I M A.

How Companies are! Using Spark

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

CitusDB Architecture for Real-Time Big Data

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Spark and the Big Data Library

Oracle Big Data SQL Technical Update

HDFS. Hadoop Distributed File System

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Workshop on Hadoop with Big Data

HiBench Introduction. Carson Wang Software & Services Group

The basic data mining algorithms introduced may be enhanced in a number of ways.

Hadoop in the Enterprise

Big Data and Scripting Systems build on top of Hadoop

Hadoop in Social Network Analysis - overview on tools and some best practices - Headline Goes Here

Big Data looks Tiny from the Stratosphere

Using distributed technologies to analyze Big Data

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Distributed Calculus with Hadoop MapReduce inside Orange Search Engine. mardi 3 juillet 12

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Enterprise Operational SQL on Hadoop Trafodion Overview

The Internet of Things and Big Data: Intro

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

A Brief Introduction to Apache Tez

Survey of the Benchmark Systems and Testing Frameworks For Tachyon-Perf

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Unified Big Data Processing with Apache Spark. Matei

On a Hadoop-based Analytics Service System

Upcoming Announcements

Big Data and Scripting Systems beyond Hadoop

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Next-Gen Big Data Analytics using the Spark stack

Unified Big Data Analytics Pipeline. 连城

Constructing a Data Lake: Hadoop and Oracle Database United!

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

Talend Real-Time Big Data Sandbox. Big Data Insights Cookbook

Big Data Weather Analytics Using Hadoop

Cloud Computing at Google. Architecture

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Big Data Course Highlights

HadoopRDF : A Scalable RDF Data Analysis System

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Beyond Hadoop with Apache Spark and BDAS

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

What is Analytic Infrastructure and Why Should You Care?

How To Handle Big Data With A Data Scientist

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Other Map-Reduce (ish) Frameworks. William Cohen

E6895 Advanced Big Data Analytics Lecture 4:! Data Store

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Ali Ghodsi Head of PM and Engineering Databricks

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

REAL-TIME BIG DATA ANALYTICS

Big Data and Scripting map/reduce in Hadoop

Analysis of Web Archives. Vinay Goel Senior Data Engineer

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

ISSN: (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Can the Elephants Handle the NoSQL Onslaught?

HADOOP. Revised 10/19/2015

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Report: Declarative Machine Learning on MapReduce (SystemML)

Brave New World: Hadoop vs. Spark

How To Create A Data Visualization With Apache Spark And Zeppelin

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Hadoop & Spark Using Amazon EMR

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

How to Choose Between Hadoop, NoSQL and RDBMS

How To Use Big Data For Telco (For A Telco)

MapReduce with Apache Hadoop Analysing Big Data

NoSQL Databases. Nikos Parlavantzas

Manifest for Big Data Pig, Hive & Jaql

A very short Intro to Hadoop

Transcription:

Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis Leonidas Fegaras University of Texas at Arlington http://mrql.incubator.apache.org/ 04/12/2015

Outline Who am I? Motivation Design objectives Overview of MRQL Examples Demo Architecture Current work Future plans Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 2

About me Leonidas Fegaras fegaras@cse.uta.edu Associate Professor at UTA (Univ. of Texas at Arlington) A committer and PPMC member of Apache MRQL Interested in big data management: cloud computing, web data management, distributed computing, data stream processing, query processing and optimization Past projects: HXQ: XQuery in Haskell XStreamCast: query processing of streamed XML data XQP: XQuery processing on P2P XQPull: stream processing for XQuery LDB: OODB query processing Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 3

History Apache MRQL (incubating) MRQL: a Map-Reduce Query Language History: Fall 2010: started at UTA as an academic research project March 2013: enters Apache Incubation 3 releases under Apache so far: latest MRQL 0.9.4 Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 4

Motivation MapReduce is not the only player in the Hadoop ecosystem any more designed for batch processing not well-suited for some big data workloads: real-time analytics, continuous queries, iterative algorithms,... Alternatives: Spark, Flink, Hama, Giraph,... New distributed stream processing engines: Spark Streaming, Flink Streaming, Storm, S4, Samza,... Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 5

Motivation Designed to relieve application developers from the intricacies of big-data analytics and distributed computing Steep learning curve Hard to develop, optimize, and maintain non-trivial applications coded in a general-purpose programming language Hard to tell which one of these systems will prevail in the near future applications coded in one of these paradigms may have to be rewritten as technologies evolve Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 6

Motivation... or you can express your applications in a query language that is independent of the underlying distributed platform! Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 7

Motivation... or you can express your applications in a query language that is independent of the underlying distributed platform! Does it have to be SQL? We re nosql after all! Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 8

Design objectives Wanted to develop a powerful and efficient query processing system for complex data analysis applications on big data more powerful than existing query languages able to capture most complex data analysis tasks declaratively able to work on read-only, raw (in-situ), complex data HDFS as the physical storage layer platform-independent: the same query can run on multiple platforms on the same cluster allowing developers to experiment with various platforms effortlessly efficient! Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 9

Design objectives We envision MRQL to be: a common front-end for the multitude of distributed processing frameworks emerging in the Hadoop ecosystem a tool for comparing these systems (functionality & performance) Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 10

Oh great! yet another SQL for map-reduce MRQL is NOT SQL! MRQL is an SQL-like query language for large-scale, distributed data analysis on a computer cluster Unlike SQL, MRQL supports a richer data model (nested collections, trees,...) arbitrary query nesting more powerful query constructs user-defined types and functions no nulls, no outer-joins MRQL queries can run on multiple distributed processing platforms currently Apache Hadoop MapReduce, Hama, Spark, and Flink The MRQL syntax and semantics have been influenced by modern database query languages (mostly, XQuery and ODMG OQL) functional programming languages (sequence comprehensions, algebraic data types, type inference) Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 11

Language features The MRQL query language: provides a rich type system that supports hierarchical data and nested collections uniformly general algebraic datatypes (similar to Haskell) JSON and XML are user-defined types pattern matching over data constructions (similar to case in Haskell) local type inference (similar to Scala) allows nested queries at any level and at any place no need for awkward nulls and outer-joins supports UDFs provided that they don t have side effects allows to operate on the grouped data using queries as is done in OQL and XQuery improves SQL group-by/aggregation (which are too awkward) Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 12

Language features The MRQL query language: supports custom aggregations/reductions using UDFs provided they have certain properties (associative & commutative) supports iteration declaratively to capture iterative algorithms, such as PageRank supports custom parsing and custom data fragmentation provides syntax-directed construction/deconstruction of data to capture domain-specific languages Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 13

How does MRQL compare to Hive? MRQL Hive metadata none stored in RDBMS data nested collections, trees, and custom complex types relational group-by on arbitrary queries not on subqueries aggregation arbitrary queries on grouped data SQL aggregations subqueries arbitrary query nesting limited subquery support platforms Hadoop, Hama, Spark, Flink Hadoop, Tez, (Spark) file formats text, sequence, XML, JSON text, sequence, ORC, RCFile iteration yes no streaming yes no Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 14

Simple example: matrix multiplication A sparse matrix X is represented as a bag of (X ij, i, j). Z ij = k X ik Y kj select ( sum(z), i, j ) from (x,i,k) in X, (y,k,j) in Y, z = x*y group by i, j Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 15

An XML example Group all persons according to their interests and the number of open auctions they watch. For each such group, return the number of persons in the group: select ( cat, os, count(p) ) from p in XMARK, i in p.profile.interest group by cat: i.@category, os: count(p.watches.@open_auctions) Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 16

Example: k-means clustering Derive k clusters from a set of points P: repeat centroids =... step select < X: avg(s.x), Y: avg(s.y) > from point in Points group by k: (select c from c in centroids order by distance(point,c))[0] Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 17

Example: the PageRank algorithm Simplified PageRank: A graph node is associated with a PageRank and its outgoing links: < id: 23, rank: 0.0, adjacent : { 10, 45, 35 } > Propagate the PageRank of a node to its outgoing links; each node gets a new PageRank by accumulating the propagated PageRanks from its incoming links: repeat nodes =... step select < id: m.id, rank: n.rank, adjacent : m.adjacent > from n in ( select < id: key, rank: sum(c.rank) > from c in ( select < id: a, rank: n.rank/count(n.adjacent) > from n in nodes, a in n.adjacent ) group by key: c. id ), m in nodes where n.id = m.id Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 18

The complete PageRank using map-reduce (MR) graph = select ( key, n.to ) from n in source(line, graph.csv,...) group by key: n.id; preprocessing: 1 MR job size = count(graph); select ( x.id, x.rank ) from x in (repeat nodes = select < id: key, rank: 1.0/size, adjacent: al > from (key,al) in graph init step: 1 MR job step select (< id: m.id, rank: n.rank, adjacent: m.adjacent >, abs((n.rank-m.rank)/m.rank) > 0.1) from n in (select < id: key, rank: 0.25/size+0.85*sum(c.rank) > from c in ( select < id: a, rank: n.rank/count(n.adjacent) > from n in nodes, a in n.adjacent ) group by key: c.id), m in nodes where n.id = m.id) repeat step: 1 MR job order by x.rank desc; postprocessing: 1 MR job Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 19

Demo Demo link Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 20

Architecture Query translation stages: 1 type inference 2 query translation and normalization 3 simplification 4 algebraic optimization 5 plan generation 6 plan optimization 7 compilation to Java code Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 21

The essence of distributed data processing distribute data to worker nodes (shuffling) perform computations on each data partition combine the results of these computations into one result Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 22

Algebraic operators Algebraic operations on bags: groupby ( X: {(κ, α)} ) : {(κ, {α})} flatmap ( f: α {β}, X: {α} ) : {β} reduce ( : (α, α) α, X: {α} ) : α union ( X: {α}, Y: {α} ) : {α} Extra operation (join): cogroup ( X: {(κ, α)}, Y: {(κ, β)} ) : {(κ, {α}, {β})} map-reduce = flatmap groupby flatmap List operations: orderby, append Iteration: repeat ( f: {α} {α}, X: {α} ) : {α} Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 23

Query optimization The query optimizer: uses a cost-based optimization framework to map algebraic terms to efficient workflows of physical operations handles dependent joins (used for nested collections) unnest deeply nested queries and converts them to join plans Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 24

Physical operations Case study: Queries similar to matrix multiplication: select ( sum(z), i, j ) from (x,i,k) in X, (y,k, j) in Y, z = x y group by i, j select h( k, reduce(acc, z) ) from x in X, y in Y, z = f(x,y) where jx(x) = jy(y) group by k: ( gx(x), gy(y) ) GroupByJoin (Valiant s algorithm): distribute the data to workers in the form of a grid n n of partitions each partition contains only those rows from X and those columns from Y needed to compute a single partition of the resulting matrix X and Y values are replicated n times each worker uses ( X Y )/n 2 memory Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 25

Current work: distributed stream processing Support for continuous queries over multiple streams of data Data come in incremental batches X Batch streaming based on sliding windows Query q(x 1, X 2 ; Y ) over one invariant and two streaming data sources Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 26

Current work: distributed stream processing select (k,avg(p.y)) from p in stream(binary,"points") group by k: p.x Currently, works on Spark Streaming Soon, on Flink Streaming Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 27

Next step: incremental query processing Problem: translate any batch program (eg, PageRank) to an incremental program automatically Solution: Break the query q(x 1, X 2 ; Y ) = g(f (X 1, X 2 ; Y )) such as: f (X 1 X 1, X 2 X 2 ; Y ) = f (X 1, X 2 ; Y ) f ( X 1, X 2 ; Y ) Requires program analysis & transformation Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 28

Summary Slides are available at the MRQL wiki page: http://wiki.apache.org/mrql/ We are looking for new developers to work on open tasks: add support for more distributed/streaming platforms Storm support more input formats, including key-value stores implement incremental query processing specify more data analysis algorithms benchmarking Are you developing a distributed processing platform in need for a query language? talk to us! Leonidas Fegaras (UTA) Apache MRQL http://mrql.incubator.apache.org/ 29