Big Data Storage: Should We Pop the (Software) Stack? Michael Carey Information Systems Group CS Department UC Irvine. #AsterixDB



Similar documents
Pla7orms for Big Data Management and Analysis. Michael J. Carey Informa(on Systems Group UCI CS Department

Big Data Platforms: What s Next?

Hadoop Ecosystem B Y R A H I M A.

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Enterprise Operational SQL on Hadoop Trafodion Overview

How To Scale Out Of A Nosql Database

Integrating Big Data into the Computing Curricula

ASTERIX: An Open Source System for Big Data Management and Analysis (Demo) :: Presenter :: Yassmeen Abu Hasson

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Moving From Hadoop to Spark

GENOME ANALYTICS. Performance in-situ DDN BPGW15. Hanif Khalak September 22, 2015 Cambridge, UK

Hypertable Architecture Overview

TRAINING PROGRAM ON BIGDATA/HADOOP

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Apache Flink Next-gen data analysis. Kostas

Large scale processing using Hadoop. Ján Vaňo

Workshop on Hadoop with Big Data

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Cloud Computing at Google. Architecture

Hadoop IST 734 SS CHUNG

Hadoop implementation of MapReduce computational model. Ján Vaňo

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Oracle Big Data SQL Technical Update

How Companies are! Using Spark

Dominik Wagenknecht Accenture

Analytics on Spark &

Big Data and Scripting Systems build on top of Hadoop

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Challenges for Data Driven Systems

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

Can the Elephants Handle the NoSQL Onslaught?

HDFS. Hadoop Distributed File System

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Native Connectivity to Big Data Sources in MSTR 10

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Architectures for Big Data Analytics A database perspective

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Inside Big Data Management : Ogres, Onions, or Parfaits?

Lecture Data Warehouse Systems

Comparing SQL and NOSQL databases

Microsoft Azure Data Technologies: An Overview

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

NoSQL for SQL Professionals William McKnight

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Enhancing Massive Data Analytics with the Hadoop Ecosystem

Hadoop. Sunday, November 25, 12

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Big Data Course Highlights

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

What s next for the Berkeley Data Analytics Stack?

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

Chase Wu New Jersey Ins0tute of Technology

Where is Hadoop Going Next?

Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql

Navigating the Big Data infrastructure layer Helena Schwenk

Pilot-Streaming: Design Considerations for a Stream Processing Framework for High- Performance Computing

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Big Data Analysis and HADOOP

Kafka & Redis for Big Data Solutions

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

How To Handle Big Data With A Data Scientist

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Big Data and Apache Hadoop s MapReduce

Apache Hadoop. Alexandru Costan

A Brief Introduction to Apache Tez

Cloud Scale Distributed Data Storage. Jürmo Mehine

SQL on NoSQL (and all of the data) With Apache Drill

Big Data Research in the AMPLab: BDAS and Beyond

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Unified Big Data Analytics Pipeline. 连 城

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

American International Journal of Research in Science, Technology, Engineering & Mathematics

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Replicating to everything

Map Reduce & Hadoop Recommended Text:

Transcription:

Big Data Storage: Should We Pop the (Software) Stack? Michael Carey Information Systems Group CS Department UC Irvine #AsterixDB 0

Rough Topical Plan Background and motivation (quick!) Big Data storage landscape (satellite view ) Two points of view (plus cloudy skies) AsterixDB: a next-generation BDMS What we re doing (plus hedging our bets) Research plan, Q&A, and RFI... 1

Big Data: It s Everywhere... SEMI- What s going on 2

Ancient DB History I: DIRECT INGRES DIRECT 3

DB History II: Shared What? Wars Shared-everything Shared-nothing Shared-disk (1980 s) 4

Big Data in the Database World Enterprise data warehouses 1980 s: Shared-nothing parallel DBMSs 2000 s: Enter new players (Netezza, Aster Data, DATAllegro, Greenplum, Vertica, ParAccel,...) Scalable OLTP 1980 s: Tandem s NonStop SQL Notes: One storage manager per machine Upper layers orchestrate query execution One way in/out: through the SQL door 5

Later: Big Data in the Systems World Out to index and query the Web, Google laid a new foundation in the early 2000 s Google File System (GFS): Files spanning many machines with 3-way replication MapReduce (MR): Parallel programming for dummies (UDFs + parallel framework) Yahoo!, FB, et al read the papers HDFS and Hadoop MapReduce Declarative HLLs: Pig, Hive,... HLLs now heavily preferred to MR Also key-value stores ( NoSQL ) Social sites, online games, BigTable/HBase, Dynamo/Cassandra, MongoDB, 6

No Shortage of NoSQL Big Data Analysis Platforms! Query/Scripting Language SCOPE AQL Meteor PigLatin Jaql Sawzall Dremel SQL High-Level API Compiler/Optimizer SCOPE DryadLINQ Algebricks Spark Sopremo Java/Scala Pig Cascading Cascading Jaql FlumeJava FlumeJava Dremel SQL Low-Level API Execution Engine Dryad Hyracks RDDs Spark Nephele PACT Tez MapReduce Hadoop MapReduce Google MapReduce Dremel Dataflow Processor Data Store Cosmos TidyFS Hyracks LSM Storage HBase HDFS GFS Bigtable Relational Row/ Column Storage Resource Management Quincy Mesos YARN Omega 7

Remember History? (DIRECT) Shared secondary storage Scan-based query processing 8

One More Bit of History #AsterixDB 9

Also: Today s Big Data Tangle (Pig) 10

AsterixDB: One Size Fits a Bunch Parallel Database Systems Semistructured Data Management World of Hadoop & Friends BDMS Desiderata: Flexible data model Efficient runtime Full query capability Cost proportional to task at hand (!) Designed for continuous data ingestion Support today s Big Data data types 11

ASTERIX Data Model (ADM) create dataverse TinySocial; use dataverse TinySocial; create type MugshotUserType as { id: int32, alias: string, name: string, user-since: datetime, address: { street: string, city: string, state: string, zip: string, country: string }, friend-ids: {{ int32 }}, employment: [EmploymentType] } create type EmploymentType as open { organization-name: string, start-date: date, end-date: date? } create dataset MugshotUsers(MugshotUserType) primary key id; Highlights include: JSON++ based data model Rich type support (spatial, temporal, ) Records, lists, bags Open vs. closed types 12

ASTERIX Data Model (ADM) create dataverse TinySocial; use dataverse TinySocial; create type MugshotUserType as { id: int32, } alias: string, name: string, user-since: datetime, address: { street: string, city: string, state: string, zip: string, country: string }, friend-ids: {{ int32 }}, employment: [EmploymentType] } create type EmploymentType as open { organization-name: string, start-date: date, end-date: date? } create dataset MugshotUsers(MugshotUserType) primary key id; Highlights include: JSON++ based data model Rich type support (spatial, temporal, ) Records, lists, bags Open vs. closed types 13

Other DDL Features create index msusersinceidx on MugshotUsers(user-since); create index mstimestampidx on MugshotMessages(timestamp); create index msauthoridx on MugshotMessages(author-id) type btree; create index mssenderlocindex on MugshotMessages(sender-location) type rtree; create index msmessageidx on MugshotMessages(message) type keyword; create type AccessLogType as closed { ip: string, time: string, user: string, verb: string, path: string, stat: int32, size: int32 }; create external dataset AccessLog(AccessLogType) using localfs (("path"="{hostname}://{path}"), ("format"="delimited-text"), ("delimiter"=" ")); create feed socket_feed using socket_adaptor (("sockets"="{address}:{port}"), ("addresstype"="ip"), ("type-name"="mugshotmessagetype"), ("format"="adm")); connect feed socket_feed to dataset MugshotMessages; External data highlights: Equal opportunity access Keep everything! Data ingestion, not streams 14 Queries unchanged

ASTERIX Query Language (AQL) Ex: Identify active users and group/count them by country: with $end := current-datetime() with $start := $end - duration("p30d") from $user in dataset MugshotUsers where some $logrecord in dataset AccessLog satisfies $user.alias = $logrecord.user and datetime($logrecord.time) >= $start and datetime($logrecord.time) <= $end group by $country := $user.address.country with $user select { "country" : $country, "active users" : count($user) } AQL highlights: Lots of other features (see website!) Spatial predicates and aggregation Set-similarity matching And plans for more 15

AsterixDB System Overview 16

Local LSM-Based Storage & Indexes New data On-Disk Components C 0 C 1 C 2 In-Memory Component Instance of Index I Deleted-Key B + -Tree Bloom Filter LSM-ified Indexes: B+ trees R trees (secondary) Inverted (secondary) 17

Distributed Storage in AsterixDB Hash-partitioned, shared-nothing, local drives Partitioning based on primary key (hashing) Secondary indexes local to, and consistent with, corresponding primary partitions (all LSM-based) Also offer external dataset feature (for HDFS) Multiple (Hive) formats, secondary index support Index partitions co-located with data (if possible) Developed for space and IT comfort reasons #AsterixDB 18

Data Replication in AsterixDB (WIP) Chained Declustering Log-Based Replication (synchronous, recovery-only copies kept) 19

Hedging Our Bets We re currently porting our LSM-based storage system to also work on top of HDFS (and YARN) Might somehow feel more comforting (and/or environmentally friendly ) to Big Data IT shops Another path to replication and high availability Interesting experiments lie ahead Revisit Stonebraker-like OS issues (modern version) Bake-off: Distributed record management vs. DFS Just how well does HDFS do w.r.t. locality of writes? #AsterixDB 20

What About the Cloud? Computing may be elastic, but data is not...! Native storage hard to expand & contract Seems to make the argument for a shared-disklike approach based on cloud storage facilities Experimentation is needed E.g., Google persistent disks (in Google Cloud)? Performance implications will be interesting to explore... 21

Some AsterixDB Use Cases Recent/projected use case areas include: Behavioral science Social data analytics Cell phone event analytics Education (MOOC analytics) Power usage monitoring Public health Cluster management log analytics Let s take a quick pick at the first two Time permitting! 22

Current Status 4 year initial NSF project (250+ KLOC @ UCI/UCR) AsterixDB BDMS is here! ( June 6 th, 2013) Semistructured NoSQL style data model Declarative parallel queries, inserts, deletes, LSM-based storage/indexes (primary & secondary) Internal and external datasets both supported Rich set of data types (including text, time, location) Fuzzy and spatial query processing NoSQL-like transactions (for inserts/deletes) Data feeds and external indexes will appear soon Now in Apache incubation mode! 23

For More Info AsterixDB project page: http://asterixdb.ics.uci.edu Open source code base: ASTERIX: http://code.google.com/p/asterixdb/ Hyracks: http://code.google.com/p/hyracks (Pregelix: http://hyracks.org/projects/pregelix/) 24

The New Kid on the Block! http://asterixdb.ics.uci.edu 25