CC 2.0 by William Brawley http://flic.kr/p/7pdup3




Agenda
- Why Hadoop and HBase?
- Social Media Monitoring
- Prospective Search and Coprocessors
- Challenges & Lessons Learned
- Resources to get started

About me
- Software Architect @ sentric
- Co-founder and organizer of the Swiss HUG
- Contact: christian.guegi@sentric.ch, http://www.sentric.ch, @chrisgugi

About sentric
- Spin-off of MeMo News AG, the leading provider for Social Media Monitoring & Analytics in Switzerland
- Big Data expert, focused on Hadoop, HBase and Solr
- Objective: Transforming data into insights

CC 2.0 by Pete Reed http://flic.kr/p/ks9kf

Why Hadoop and HBase? Social Media Monitoring Process
- Information Gathering
- Information Processing
- Analysis & Interpretation
- Insight Presentation

Why Hadoop and HBase? Requirements for SMM
- Cost effective
- Reliable
- Highly scalable
- Analytical capabilities
- RT alerting

Why Hadoop and HBase? Technology Stack
- Storage: HBase/HDFS
- Search: Solr
- Analytics: Hadoop, Mahout
- Event mechanism (MQ): HBase RowLog
- Real-time alerting: Prospective search

CC 2.0 by nolifebeforecoffee http://flic.kr/p/c1utf

Social Media Monitoring: Overview
[Diagram labels: Downloaded Articles, Search Agents, match?, Output, Web-UI, Reports, RT Alerts. Icons by http://dryicons.com]

Social Media Monitoring: Solution Architecture
[Diagram labels: n Crawler, REST, HBase, Web-UI, RowLog, Coprocessor, MySQL, Solr, RT Alerts. Icons by http://dryicons.com]

Short Primer on Coprocessors: Overview
- Inspired by Google Bigtable coprocessors
- HBase version 0.92
- Embed code directly into server processes
- High-level call interface for clients
- Automatic scaling, load balancing, request routing

Short Primer on Coprocessors: Observer Classes
- Like a database trigger
- Provides event-based hooks
- Concrete implementations:
  - RegionObserver: CRUD or DML type operations
  - MasterObserver: DDL or metadata operations and cluster administration
  - WALObserver: write-ahead-log appending and restoration

Short Primer on Coprocessors: Observer Execution
1. Client: Get()
2. RegionServer: CP1:preGet(), CP2:preGet(), CP3:preGet()
3. RegionServer: HRegion:Get()
4. RegionServer: CP1:postGet(), CP2:postGet(), CP3:postGet()
5. Client: response
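The call order on this slide can be modelled in a few lines of plain Java. This is only a sketch of the ordering, not the HBase API: the `GetObserver` interface, the `ObserverChainSketch` class and the in-memory `store` map are hypothetical stand-ins introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the observer call order; NOT the HBase API.
class ObserverChainSketch {

    // Hypothetical stand-in for the get-related hooks of a region observer.
    interface GetObserver {
        void preGet(String row, List<String> trace);
        void postGet(String row, List<String> trace);
    }

    // Creates an observer that only records when its hooks fire.
    static GetObserver named(String name) {
        return new GetObserver() {
            @Override public void preGet(String row, List<String> trace) {
                trace.add(name + ":preGet");
            }
            @Override public void postGet(String row, List<String> trace) {
                trace.add(name + ":postGet");
            }
        };
    }

    // Runs all pre hooks, then the region read, then all post hooks.
    static List<String> runGet(List<GetObserver> chain,
                               Map<String, String> store, String row) {
        List<String> trace = new ArrayList<>();
        for (GetObserver o : chain) o.preGet(row, trace);
        trace.add("HRegion:get=" + store.get(row)); // the actual read
        for (GetObserver o : chain) o.postGet(row, trace);
        return trace;
    }

    // Three observers (CP1..CP3) around one stored row, as on the slide.
    static List<String> demo() {
        Map<String, String> store = new HashMap<>();
        store.put("row1", "value1");
        List<GetObserver> chain = List.of(named("CP1"), named("CP2"), named("CP3"));
        return runGet(chain, store, "row1");
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Running `main` prints the seven steps in order: the three pre hooks, the region read, then the three post hooks.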

Short Primer on Coprocessors: Endpoint Classes
- Comparable to stored procedures
- Custom RPC protocol, used between client and region server
- Loaded in the region server
- Client calls APIs over a single row or a row range
- Framework translates row keys to region locations
- Parallel execution

Short Primer on Coprocessors: Endpoint Call Routine
Client code:
    Batch.Call<CountProtocol, Integer>
    Integer call(CountProtocol p) { return p.getRowCount(); }
- HTable coprocessorExec() invokes the call on each region
- Region Server 1: table,,12345678 (CountProtocol) and table,bbb,12345678 (CountProtocol)
- Region Server 2: table,ccc,12345678 (CountProtocol) and table,ddd,12345678 (CountProtocol)
- Result: Map<byte[], Integer> countsByRegion
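The fan-out shown on this slide (one call per region, results collected per region) can be sketched in plain Java. The sketch assumes nothing about the real HBase classes: `Region`, `Call`, and this `coprocessorExec` are hypothetical stand-ins, and region start keys are plain strings rather than `byte[]` so that the result map is easy to read.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Plain-Java model of an endpoint call; NOT the HBase API.
class EndpointCallSketch {

    // Hypothetical mirror of the CountProtocol named on the slide.
    interface CountProtocol {
        int getRowCount();
    }

    // Hypothetical stand-in for a per-region callback.
    interface Call<T, R> {
        R call(T instance);
    }

    // A region knows its start key and counts only its own rows.
    static final class Region implements CountProtocol {
        final String startKey;
        final List<String> rowKeys;

        Region(String startKey, List<String> rowKeys) {
            this.startKey = startKey;
            this.rowKeys = rowKeys;
        }

        @Override public int getRowCount() {
            return rowKeys.size();
        }
    }

    // Invokes the call on every region in parallel and
    // keys each partial result by the region's start key.
    static <R> Map<String, R> coprocessorExec(List<Region> regions,
                                              Call<CountProtocol, R> callable) {
        Map<String, R> results = new ConcurrentSkipListMap<>();
        regions.parallelStream()
               .forEach(r -> results.put(r.startKey, callable.call(r)));
        return results;
    }

    // Four regions with illustrative (made-up) row keys.
    static Map<String, Integer> demo() {
        List<Region> regions = List.of(
            new Region("", List.of("aaa", "abc")),
            new Region("bbb", List.of("bbb", "bcd", "bzz")),
            new Region("ccc", List.of("ccc")),
            new Region("ddd", List.<String>of()));
        Call<CountProtocol, Integer> countRows = p -> p.getRowCount();
        return coprocessorExec(regions, countRows);
    }

    public static void main(String[] args) {
        Map<String, Integer> countsByRegion = demo();
        int total = countsByRegion.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println(countsByRegion + " total=" + total);
    }
}
```

The client still has to combine the per-region results itself; here `main` sums them to get a table-wide row count.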

Short Primer on Coprocessors: Use Cases
- HBase Security (version 0.94)
- Aggregate operations: avg(), sum() (AggregatorProtocol)
- HBASE-3529: Embedded search

Social Media Monitoring: Prospective Search with Coprocessors
[Diagram labels: Processing Put operations, HRegionServer, HRegion, Prospective Search, RT Alerts. Icons by http://dryicons.com]
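Prospective search inverts ordinary search: the queries (search agents) are stored, and each newly written article is matched against them. The sketch below models that idea in plain Java, with the matching done inside a `put` method to mimic a hook that runs on every write. It does not use the HBase API, and the matching rule (an agent fires when all of its terms occur in the article, case-insensitively) is an assumption made for illustration; the slides do not describe the actual matching logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java model of prospective search on write; NOT the HBase API.
class ProspectiveSearchSketch {
    // Stored queries: agent name -> terms that must all occur.
    private final Map<String, Set<String>> agents = new LinkedHashMap<>();
    // Stand-in for the table that receives the articles.
    private final Map<String, String> table = new HashMap<>();
    // Alerts raised so far, in the order they fired.
    final List<String> alerts = new ArrayList<>();

    void registerAgent(String name, String... terms) {
        Set<String> required = new HashSet<>();
        for (String t : terms) required.add(t.toLowerCase(Locale.ROOT));
        agents.put(name, required);
    }

    // Stores the article, then checks it against every registered agent.
    void put(String rowKey, String text) {
        table.put(rowKey, text);
        Set<String> tokens = new HashSet<>(
            Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
        for (Map.Entry<String, Set<String>> agent : agents.entrySet()) {
            if (tokens.containsAll(agent.getValue())) {
                alerts.add(agent.getKey() + " matched " + rowKey);
            }
        }
    }

    // Two hypothetical agents and three made-up articles.
    static List<String> demo() {
        ProspectiveSearchSketch ps = new ProspectiveSearchSketch();
        ps.registerAgent("hbase-agent", "hbase", "coprocessor");
        ps.registerAgent("solr-agent", "solr");
        ps.put("article-1", "HBase coprocessor primer");
        ps.put("article-2", "Search with Solr and HBase");
        ps.put("article-3", "Nothing relevant here");
        return ps.alerts;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this sketch every write is checked against every agent, so the work done per write grows with the number of registered agents.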

Social Media Monitoring: Testing Setup
- Standard, virtualized test cluster: 4 RS/DN (RegionServer/DataNode), 1 HM (HMaster), 1 NN (NameNode), 3 ZK (ZooKeeper)
- Test dataset created from 2 h of live index (1 GB)
- Drive load on RS/DN

Social Media Monitoring: Test Results
[Chart: Writes/sec (y-axis, 0 to 1800) against # of agents (x-axis: 0, 10, 50, 100, 200, 400, 800)]

CC 2.0 by Sean Maurik http://flic.kr/p/juduu

Challenges & Lessons Learned: Challenges
- Everyone is still learning
- Some issues only appear at scale
- Production cluster configuration
- Hardware issues
- Tuning cluster configuration to our workloads
- HBase stability
- Monitoring health of HBase

Challenges & Lessons Learned: Lessons
- Be careful with expensive operations in coprocessors
- At scale, nothing works as advertised
- Monitoring/operational tooling is most important
- Play with all the configurations and benchmark for tuning

Resources to get started
- https://blogs.apache.org/hbase/entry/coprocessor_introduction
- http://hbase.apache.org/apidocs/index.html
- http://www.lilyproject.org/lily/about/playground/hbaserowlog.html
- http://www.github.com/sentric/HBasePS

Questions?
Christian Gügi, christian.guegi@sentric.ch
Berlin Buzzwords
Thank you!