How To Scale Out Of A Nosql Database



Similar documents
Hadoop. Sunday, November 25, 12

Hadoop implementation of MapReduce computational model. Ján Vaňo

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Application Development. A Paradigm Shift

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Big Data and Apache Hadoop s MapReduce

Large scale processing using Hadoop. Ján Vaňo

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

How To Handle Big Data With A Data Scientist

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Accelerating and Simplifying Apache

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Hadoop Ecosystem B Y R A H I M A.

Chase Wu New Jersey Ins0tute of Technology

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

MapReduce with Apache Hadoop Analysing Big Data

Hadoop IST 734 SS CHUNG

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

BIG DATA TRENDS AND TECHNOLOGIES

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

A Brief Outline on Bigdata Hadoop

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Open source large scale distributed data management with Google s MapReduce and Bigtable

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

NoSQL and Hadoop Technologies On Oracle Cloud

So What s the Big Deal?

BIG DATA TECHNOLOGY. Hadoop Ecosystem

How To Use Big Data For Telco (For A Telco)

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Comparing SQL and NOSQL databases

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

White Paper: What You Need To Know About Hadoop

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Big Data and Scripting Systems build on top of Hadoop

Speak<geek> Tech Brief. RichRelevance Distributed Computing: creating a scalable, reliable infrastructure

Dominik Wagenknecht Accenture

Chapter 7. Using Hadoop Cluster and MapReduce

Big Data With Hadoop

Xiaoming Gao Hui Li Thilina Gunarathne

Hadoop & its Usage at Facebook

Apache Hadoop FileSystem and its Usage in Facebook

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Big Data Management. Big Data Management. (BDM) Autumn Povl Koch November 11,

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Can the Elephants Handle the NoSQL Onslaught?

HDFS. Hadoop Distributed File System

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Hadoop & its Usage at Facebook

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Big Systems, Big Data

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

Integrating Big Data into the Computing Curricula

Keywords: Big Data, Hadoop, cluster, heterogeneous, HDFS, MapReduce

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

NoSQL Data Base Basics

INTRODUCTION TO CASSANDRA

White Paper: Hadoop for Intelligence Analysis

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Open source Google-style large scale data analysis with Hadoop

Lecture Data Warehouse Systems

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

BIG DATA What it is and how to use?

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases

BIG DATA-AS-A-SERVICE

Transforming the Telecoms Business using Big Data and Analytics

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop Big Data for Processing Data and Performing Workload

Big Data and Scripting Systems build on top of Hadoop

Sentimental Analysis using Hadoop Phase 2: Week 2

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Workshop on Hadoop with Big Data

Constructing a Data Lake: Hadoop and Oracle Database United!

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop Introduction coreservlets.com and Dima May coreservlets.com and Dima May

Cloud Scale Distributed Data Storage. Jürmo Mehine

A Survey on Big Data Concepts and Tools

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Introduction to Apache Cassandra

Big Data: Tools and Technologies in Big Data

NoSQL for SQL Professionals William McKnight

Peers Techno log ies Pv t. L td. HADOOP

Big Data Course Highlights

Transcription:

Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI (FH) +43 7236 3343 843 michael.zwick@scch.at www.scch.at The SCCH is an initiative of The SCCH is located at

ATTENTION Think BIG Source: http://www.telovation.com/articles/gallery-old-computers.html 2

ATTENTION Think BIGGER Source: Google pictures search engine 3

Agenda Big Data Trend Scalability Challenges Apache Hadoop/HBase Firebird meets NoSQL Case study Q&A 4

Big Data Trend Data volume grows dramatically 40% up to 60% per year is common What is big? Gigabyte, Terabyte, Petabyte...? It depends on your business and the technology you use to manage this data Google, Facebook, Yahoo etc. are all managing petabytes of data Most of the data is unstructured or semistructured, thus (far) away from our perfect relational/normalized world 5

Big Data Trend The Three Vs Big Data is not just about data volume VOLUME Terabytes Records Transactions Tables, files VELOCITY Batch Near time Real time Streams Structured Unstructured Semistructured All the above VARIETY Source: Big Data Analytics TDWI Best Practices Report 6

Big Data Trend Typical application domains Sensor networks Social networks Search engines Health care (medical images and patient records) Science Bioinformatics Financial services Condition Monitoring systems 7

Agenda Big Data Trend Scalability Challenges Apache Hadoop/HBase Firebird meets NoSQL Case study 8

Scalability Challenges RDBMS scalability issues How do you tackle scalability issues in your RDBMS environment? Scale-Up database server Faster/more CPU, more RAM, Faster I/O Create indices, optimize statements, partition data Reduce network traffic Pre-aggregate data Denormalize data to avoid complex join statements Replication for distributed workload Problem: This usually fails in the Big Data area due to technical or licensing issues Solution: Big Data Management demands Scale-Out 9

Scalability Challenges A True Story (not happened to me) A MySQL (doesn t matter) environment Started with a regular setup, a powerful database server Business grow, data volume increased dramatically The project team began to Partition data on several physical disks Replicate data to several machines to distribute workload Denormalize data to avoid complex joins Removed transaction support (ISAM storage engine) At the end, they gave up: More or less a few (or even one) denormalized table Data spread across several machines and physical disks Expensive and hard to administrate Now they are using a NoSQL solution 10

Scalability Challenges How can NoSQL help NoSQL Not Only SQL NoSQL implementations target different user bases Document databases Key/Value databases Graph databases NoSQL can Scale-Out easily In the previous true story, they switched to a distributed key/value store implementation called HBase (NoSQL database) on top of Apache Hadoop 11

Agenda Big Data Trend Scalability Challenges Apache Hadoop/HBase Firebird meets NoSQL Case study Q&A 12

Apache Hadoop/HBase The Hadoop Universe Apache Hadoop is an open source software for reliable, scalable and distributed computing Consists of Hadoop Common: Common utilities to support other Hadoop sub-projects Hadoop Distributed Filesystem (HDFS): Distributed file system Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters Can run on commodity hardware Widely used Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York, Times, PowerSet, Veoh, Yahoo!... 13

Apache Hadoop/HBase The Hadoop Universe Other Hadoop-related Apache projects Avro : A data serialization system. Cassandra : A scalable multi-master database with no single points of failure. Chukwa : A data collection system for managing large distributed systems. HBase : A scalable, distributed database that supports structured data storage for large tables. Hive : A data warehouse infrastructure that provides data summarization and ad hoc querying. Mahout : A Scalable machine learning and data mining library. Pig : A high-level data-flow language and execution framework for parallel computation. ZooKeeper : A high-performance coordination service for distributed applications. 14

Apache Hadoop/HBase MapReduce Framework Source: http://www.recessframework.org/page/map-reduce-anonymous-functions-lambdas-php 15

Apache Hadoop/HBase MapReduce Framework Source: http://www.larsgeorge.com/2009/05/hbase-mapreduce-101-part-i.html 16

Apache Hadoop/HBase HBase Overview Open source Java implementation of Google s BigTable concept Non-relational, distributed database Column-oriented storage, optimized for sparse data Multi-dimensional High Availability High Performance Runs on top of Apache Hadoop Distributed Filesystem (HDFS) Goals Billions of rows * Million of columns * Thousands of Versions Petabytes across thousands of commodity servers 17

Apache Hadoop/HBase HBase Architecture Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html 18

Apache Hadoop/HBase HBase Data Model A sparse, multi-dimensional, sorted map {rowkey, column, timestamp} -> cell Column = <column_family>:<qualifier> Rows are sorted lexicographically based on the rowkey 19

Apache Hadoop/HBase HBase API s Native Java REST Thrift Avro Ruby shell Apache MapReduce Hive... 20

Apache Hadoop/HBase HBase Users Facebook Twitter Mozilla Trend Micro Yahoo Adobe... 21

Apache Hadoop/HBase HBase vs. RDBMS HBase is not a drop-in replacement for a RDBMS No data types just byte arrays, interpretation happens at the client No query engine No joins No transactions Secondary indexes problematic Denormalized data HBase is bullet-proof to Store key/value pairs in a distributed environment Scale horizontally by adding more nodes to the cluster Offer a flexible schema Efficiently store sparse tables (NULL doesn t need any space) Support semi-structured and structered data 22

Agenda Big Data Trend Scalability Challenges Apache Hadoop/HBase Firebird meets NoSQL Case study Q&A 23

Firebird meets NoSQL Case study High-Level Requirements Condition Monitoring System in the domain of solar collectors and inverters Collecting measurement values (current, voltage, temperature...) Expected data volume ~ 1300 billions measurement values (rows) per year Long-term storage solution necessary Web-based customer portal accesing aggregated and detailed data Backend data analysis, data mining, machine learning, fault detection 24

Firebird meets NoSQL Case study Prototypical implementation Initial problem: You can t manage this data volume with a RDBMS Measurement values are a good example for key/value pairs Evaluated Apache Hadoop/HBase for storing detailed measurement values Virtualized Hadoop/HBase test cluster with 12 nodes RDBMS (Firebird/Oracle) for storing aggregated data 25

Firebird meets NoSQL Case study Prototypical implementation HBase data model based on the web application object model Row-Key: <DataLogger-ID>-<Device-ID>-<Timestamp> 0000000000000050-0000000000000001-20110815134520 Column Families and Qualifiers based on classes and attributes in the object model Test data generator (TDG) and scanner (TDS) implementing the HBase data model for performance tests TDG: ~70000 rows per second TDS: <150ms response time with 400 simulated clients querying detail values for a particular DataLogger-ID/Device- ID for a given day for a HBase table with ~5 billions rows 26

Firebird meets NoSQL Case study Prototypical implementation MapReduce job implementation for data aggregation and storage Throughput: 600000 rows per second Pre-aggregation of ~15 billions rows in ~7h, including storage in RDBMS Prototypical implementation based on a fraction of the expected data volume Simply add more nodes to the cluster to handle more data 27

MapReduce Demonstration Live Demonstration 28

Agenda Big Data Trend Scalability Challenges Apache Hadoop/HBase Firebird meets NoSQL Case study Q&A 29

Q&A Questions and Answers 30

ATTENTION Think small 31

The End Thanks for your attention! thomas.steinmaurer@scch.at 32

Resources Apache Hadoop: http://hadoop.apache.org/ Apache HBase: http://hbase.apache.org/ MapReduce: http://www.larsgeorge.com/2009/05/hbase-mapreduce- 101-part-i.html Hadoop/Hbase powered by: http://wiki.apache.org/hadoop/poweredby http://wiki.apache.org/hadoop/hbase/poweredby HBase goals: http://www.docstoc.com/docs/2996433/hadoop-and-hbase-vs-rdbms HBase architecture: http://www.larsgeorge.com/2009/10/hbase-architecture-101- storage.html Cloudera Distribution: http://www.cloudera.com/ 33