Platforms for Big Data Management and Analysis. Michael J. Carey, Information Systems Group, UCI CS Department




Outline
- Big Data Platform Space
  - The Big Data Era
  - Brief History of Data Platforms
  - Dominant Platforms
- AsterixDB Project @ UCI / UCR
  - Our Approach
  - Feature Overview
  - Call for Customers
- Quick Q&A (time permitting)

It's the Big Data Era
- Unprecedented growth in data being generated and its potential uses/value
  - Tweets, social networks (statuses, check-ins, shared content), blogs, click streams, various logs, ...
  - Facebook: > 1.3B monthly active users, > 2.7B likes/day
  - Twitter: > 270M monthly active users, > 500M tweets/day
- Everyone is interested (might be why you're here! :-) )
  - Trade press and popular press: "Big Data!"
  - Enterprises, Web companies, online businesses, governments, public health researchers, social scientists, ...
- Untapped value and countless new opportunities to understand, optimize, assist, and/or compete

One Reason Big Data's Exploding

Brief History of Big Data
- Database community: all about Big Data management since the beginning of time
  - Born from the relational database movement initiated by Ted Codd in 1970
- Distributed systems community: all about Big Data since 2000 or so (thanks to the Web)
  - Google, Yahoo!, Amazon, Facebook, ..., all started fresh, but are now following the DB field's path
- Let's briefly review architectures from both worlds (to combine/extend their best ideas)

Big Data in the Database World
- Enterprises wanted to store and query historical business data (data warehouses)
  - 1970s: Relational databases appeared (w/SQL)
  - Late 1970s: Database machines based on novel hardware and early (brute force) parallelism
  - 1980s: Parallel database systems based on shared-nothing architectures (Gamma, GRACE, Teradata)
  - 2000s: Netezza, Aster Data, DATAllegro, Greenplum, Vertica, ParAccel, ... (Serious Big $ acquisitions!)
- (Each node runs an instance of a legitimate, indexed DB data storage and runtime system)

Also OLTP Database Systems
- On-line transaction processing is another Big Data dimension
  - Applications that power our daily business
  - Producers of the data being warehoused
- Shared-nothing clustered architectures are also a serious scalable OLTP player
  - 1980s: Tandem's NonStop SQL system

Big Data in the Systems World

(MapReduce: A Quick Example)
[Figure: input splits (distributed) -> mapper outputs -> SHUFFLE PHASE (based on keys) -> reducer inputs -> reducer outputs (distributed). Callout: Partitioned Parallelism!]
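The data flow on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce steps, not Hadoop code: the word-count task, the function names, and the sample splits are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle_phase(mapper_outputs, num_reducers):
    """Shuffle: route each pair to a reducer by hashing its key, grouping values per key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in chain.from_iterable(mapper_outputs):
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reducer: sum the grouped counts for each key in one partition."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(splits, num_reducers=2):
    mapper_outputs = [map_phase(split) for split in splits]   # one mapper per split
    partitions = shuffle_phase(mapper_outputs, num_reducers)  # group by key
    return [reduce_phase(p) for p in partitions]              # one output per reducer


# Hypothetical input: two splits, each a list of text lines.
splits = [["big data big ideas"], ["data platforms", "big platforms"]]
reducer_outputs = word_count(splits)
totals = {k: v for out in reducer_outputs for k, v in out.items()}
```

Because every occurrence of a key hashes to the same reducer, each reducer can finish its keys without talking to the others, which is the partitioned parallelism the slide calls out.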

Soon a Star Was Born
- Yahoo!, Facebook, and friends cloned Google's Big Data infrastructure from papers
  - GFS → Hadoop Distributed File System (HDFS)
  - MapReduce → Hadoop MapReduce
  - Widely used for Web indexing, click stream analysis, log analysis, information extraction, some machine learning
- Tired of "puzzle-solving with just two moves", developers built higher-level languages to hide MR
  - E.g., Pig (Yahoo!), Hive (Facebook), Jaql (IBM)
  - Now in heavy use over MR (Pig > 60%, HiveQL > 90%)
- Similar happenings at Microsoft: Cosmos, Dryad, and SCOPE (which powers Bing)

Also Key-Value or Document Stores
- Another Big Data dimension, for applications that power social sites, gaming sites, and so on
  - Systems community's version of OLTP, roughly
- Need for simple record stores
  - Simple, key-based retrievals and updates (put/get)
  - Fast, highly scalable, highly reliable, and available
- Many "NoSQL" systems (see Cattell survey)
  - Proprietary: BigTable (Google), Dynamo (Amazon), ...
  - Open Source: HBase (BigTable), Cassandra (Dynamo), ...
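To make the put/get interface concrete, here is a minimal sketch of a hash-partitioned record store. It is a toy under stated assumptions: the class name, the in-memory "nodes", and the MD5-based placement are all invented for illustration, and it omits the replication, persistence, and failure handling that real systems such as HBase or Cassandra provide.

```python
import hashlib


class PartitionedKVStore:
    """Toy key-value store: records are hash-partitioned across in-memory 'nodes'."""

    def __init__(self, num_nodes=3):
        # Each dict stands in for the storage of one machine in the cluster.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # A stable hash of the key decides which node owns the record.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, record):
        """Insert or overwrite the record stored under key."""
        self._node_for(key)[key] = record

    def get(self, key, default=None):
        """Return the record stored under key, or default if absent."""
        return self._node_for(key).get(key, default)


store = PartitionedKVStore(num_nodes=3)
store.put("user:42", {"alias": "Margarita", "city": "Irvine"})
store.put("user:42", {"alias": "Margarita", "city": "Riverside"})  # update overwrites
profile = store.get("user:42")
missing = store.get("user:7")
```

Every operation touches exactly one node, chosen from the key alone, which is what lets this style of store scale out for simple key-based workloads.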

Parallel Database System Stack
Notes:
- One storage manager per machine in a parallel cluster
- Upper layers orchestrate their shared-nothing cooperation
- One way in/out: through the SQL door at the top

Open Source Big Data Stack
Notes:
- Giant byte sequence files at the bottom
- Map, sort, shuffle, reduce layer in middle
- Possible storage layer in middle as well
- HLLs now at the top
(Notice: More open...!)

No More One Size Fits All

Other Up-and-Coming Platforms
- Bulk Synchronous Programming (BSP) platforms, e.g., Pregel, Giraph, GraphLab, ..., for doing Big Graph analysis
  - "Think Like a Vertex": receive messages, update state, send messages
  - ("Big" is the platform's concern)
- Spark for in-memory cluster computing for repetitive data analyses, iterative machine learning tasks, ...
[Figure: two panels. One-time processing: Input, distributed memory, query 1, query 2, query 3, ... Iterative processing: Input, iter. 1, iter. 2, ...]
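The "Think Like a Vertex" loop can be sketched as a single-process simulation. The example below propagates the maximum value through a graph, a common teaching example for this model; the function name, the graph encoding, and the superstep cap are assumptions for illustration and do not reflect the API of Pregel, Giraph, or GraphLab.

```python
def run_max_value(graph, values, max_supersteps=30):
    """Propagate the largest vertex value to every reachable vertex.

    graph:  dict mapping each vertex to a list of its out-neighbors
    values: dict mapping each vertex to its initial value
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])
    supersteps = 1

    # Keep running supersteps while any message is still in flight.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if not messages:
                continue                    # no messages: vertex stays halted
            best = max(messages)            # receive messages
            if best > values[v]:
                values[v] = best            # update state
                for n in graph[v]:
                    outbox[n].append(best)  # send messages
        inbox = outbox                      # barrier: delivered in the next superstep
        supersteps += 1
    return values, supersteps


# Hypothetical 4-vertex chain a - b - c - d, with edges in both directions.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, supersteps = run_max_value(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
```

The vertex logic only ever sees its own state and its incoming messages; distributing the vertices and delivering messages at scale is left to the platform, which is the slide's point about "Big" being the platform's concern.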

AsterixDB: "One Size Fits a Bunch"
[Figure labels: Parallel Database Systems; Semistructured Data Management; Data-Intensive Computing]
BDMS Desiderata:
- Flexible data model
- Efficient runtime
- Full query language
- Costs proportional to tasks at hand
- Built-in support for continuous data ingestion
- Support today's "Big Data data types"

User Model (Scale-Independent!)

    create dataverse MyNewSocialSite;

    create type MugshotUserType as {
        id: int32,
        alias: string,
        name: string,
        user-since: datetime,
        address: {
            street: string,
            city: string,
            state: string,
            zip: string,
            country: string
        },
        friend-ids: {{ int32 }},
        employment: [EmploymentType]
    }

    create dataset MugshotUsers(MugshotUserType) primary key id;

Ex: User names and messages sent by users who joined the Mugshot social network within a certain time window:

    for $user in dataset MugshotUsers
    where $user.user-since >= datetime('2010-07-22T00:00:00')
      and $user.user-since <= datetime('2012-07-29T23:59:59')
    return {
        "uname" : $user.name,
        "messages" :
            for $message in dataset MugshotMessages
            where $message.author-id = $user.id
            return $message.message
    };

Highlights include:
- JSON++ based data model
- Rich base types (spatial, temporal, ...)
- Records, lists, bags
- Open vs. closed types
- Powerful query language

#AsterixDB
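For readers unfamiliar with AQL, the example query's meaning can be approximated in plain Python over in-memory records. This is only a sketch of the query's semantics, not of how AsterixDB evaluates it: the sample users and messages are invented, and the function name is hypothetical. Only the field names `name`, `user-since`, `id`, `author-id`, and `message` come from the slide.

```python
from datetime import datetime

# Invented sample data standing in for the MugshotUsers and MugshotMessages datasets.
users = [
    {"id": 1, "name": "Ann", "user-since": datetime(2011, 3, 1)},
    {"id": 2, "name": "Bob", "user-since": datetime(2009, 1, 15)},
    {"id": 3, "name": "Cy", "user-since": datetime(2012, 7, 29, 23, 59, 59)},
]
messages = [
    {"author-id": 1, "message": "hello"},
    {"author-id": 2, "message": "too early"},
    {"author-id": 1, "message": "again"},
]


def users_with_messages(users, messages, start, end):
    """For each user who joined within [start, end], return name plus nested messages."""
    return [
        {
            "uname": user["name"],
            # Nested sub-query: all messages authored by this user.
            "messages": [m["message"] for m in messages if m["author-id"] == user["id"]],
        }
        for user in users
        if start <= user["user-since"] <= end
    ]


result = users_with_messages(
    users,
    messages,
    start=datetime(2010, 7, 22, 0, 0, 0),
    end=datetime(2012, 7, 29, 23, 59, 59),
)
```

As in the AQL version, a qualifying user with no messages still appears in the result, with an empty nested list.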

ASTERIX System Overview
(ADM = ASTERIX Data Model; AQL = ASTERIX Query Language)

ASTERIX Software Stack
[Figure: layered software stack. Labels: AQL (AsterixDB), HiveQL (Hivesterix), XQuery (Apache VXQuery), Algebricks algebra layer, Pregel job (Pregelix), Hadoop M/R job (M/R layer), Hyracks job, Hyracks data-parallel platform]

Project Status
- 4-year initial NSF project done (now ~300 KLOC)
  - Code developed jointly by UCI and UCR
  - Two new efforts, including hardening/sharing AsterixDB
- AsterixDB BDMS is here! (@ June 6th, 2013)
  - Semistructured "NoSQL" style data model
  - Declarative parallel queries, inserts, deletes, ...
  - Data storage with primary and secondary indexing
  - Internal and external datasets both supported
  - Rich set of data types (incl. text, time, and geo location)
  - Fuzzy and spatial query processing
  - Data feeds and external indexing as well
- We would love to work with you, if you have use cases...!
  - Ex: Social data analytics, education (especially MOOCs), public health, emergency response, power use monitoring, ...

For More Info
- AsterixDB project page: http://asterixdb.ics.uci.edu
- Pregelix project page: http://pregelix.ics.uci.edu
- Open source code base:
  - ASTERIX: http://code.google.com/p/asterixdb/
  - Hyracks: http://code.google.com/p/hyracks
  - (Pregelix: http://hyracks.org/projects/pregelix/)