Using distributed technologies to analyze Big Data




Using distributed technologies to analyze Big Data
Abhijit Sharma, Innovation Lab, BMC Software

Data Explosion in the Data Center - Performance / Time Series Data
- Incoming data rates: ~millions of data points/min
- Data generated per server per year: ~2 GB
- 50 K servers: ~100 TB of data/year
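The yearly volume above follows directly from the per-server rate; a quick back-of-the-envelope check using the slide's own figures:

```python
servers = 50_000           # fleet size from the slide
gb_per_server_year = 2     # data generated per server per year

total_gb = servers * gb_per_server_year
total_tb = total_gb / 1_000  # using decimal units, 1 TB = 1000 GB

print(total_tb)  # 100.0 -> ~100 TB/year, matching the slide
```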

Online Warehouse - Time Series
- Extreme storage requirements: TS data for a data center, e.g. the last year
- Online TS data availability, i.e. no separate ETL
- Support for common analytics operations:
  - Roll-up of data, e.g. CPU/min to CPU/hour, CPU/day, etc.
  - Slice and dice, e.g. CPU util. for UNIX servers in the SFO data center last week
  - Statistical operations: sum, count, avg., var., std., moving avg., frequency distributions, forecasting, etc.
- Ease of use: SQL interface, design schema for TS data
- Horizontal scaling on lower-cost commodity hardware
- High R/W volume

[Figure: data cube of CPU values over the Data Center, OS, and Time dimensions]
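The roll-up operation listed above (CPU/min to CPU/hour) can be sketched in a few lines of plain Python; the per-minute samples here are made up for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-minute samples: (epoch minute, CPU utilization %).
samples = [(0, 40.0), (1, 60.0), (59, 50.0), (60, 20.0), (61, 40.0)]

# Roll up CPU/min to CPU/hour: bucket each minute into its hour,
# then average within each bucket.
by_hour = defaultdict(list)
for minute, cpu in samples:
    by_hour[minute // 60].append(cpu)

hourly_avg = {hour: mean(values) for hour, values in by_hour.items()}
print(hourly_avg)  # {0: 50.0, 1: 30.0}
```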

Why not use RDBMS-based Data Warehousing?
- Star schema (dimensions & facts)
- Offline data availability: ETL required, not online
- Expensive to scale vertically: high-end hardware & software
- Limits to vertical scaling: big data may not fit
- Features like transactions are unnecessary overhead for certain applications
- Large-scale distribution/partitioning is painful and sub-optimal at high W/R ratios
- No support for a flexible schema that can be changed on the fly

High Level Architecture
- Real-time continuous load of metric & dimension data
- Schema & query: Hive (distributed SQL)
- NoSQL column store: HBase
- Hadoop: HDFS & MapReduce framework (MapReduce & HDFS nodes)

Map Reduce - Recap
- Map function:
  - Applied to input data; emits a reduction key and value
  - Output of Map is sorted and partitioned for use by Reducers
- Reduce function:
  - Applied to data grouped by reduction key
  - Often reduces the data (for example sum(values))
- Mappers and Reducers can be chained together
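The map, shuffle/sort, reduce flow recapped above can be mimicked in plain Python; the word-count input below is illustrative, not from the slides:

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit (reduction key, value) pairs for each input record.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: applied to all values grouped under one reduction key;
    # here it reduces by summing, as in sum(values) on the slide.
    return key, sum(values)

lines = ["big data big", "data lake"]

# Shuffle/sort: group the map output by reduction key.
grouped = defaultdict(list)
for line in lines:
    for key, value in map_fn(line):
        grouped[key].append(value)

result = dict(reduce_fn(k, vs) for k, vs in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'lake': 1}
```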

HDFS Sweet Spot
- Big data storage: optimized for large files (ETL)
- Writes are creates, appends, and large
- Reads are mostly big and streaming
- Throughput is more important than latency
- Distributed, HA, transparent replication

When is raw HDFS unsuitable?
- Mutable data: create, update, delete
- Small writes
- Random reads, a high share of small reads
- Structured data
- Online access to data (HDFS loading is an offline / batch process)

NoSQL Data Stores - Column
- Excellent concurrent W/R performance: fast writes and fast reads (random and sequential), required for near-real-time updates of TS data
- Distributed architecture, horizontal scaling, transparent replication of data
- Highly Available (HA) and Fault Tolerant (FT): no SPOF, shared-nothing architecture
- Reasonably rich data model
- Flexible schema, amenable to ad-hoc changes even at runtime

HBase
- A (Table, Row, Column Family:Column, Timestamp) tuple maps to a stored value
- A table is split into multiple equal-sized regions, each a range of sorted keys (partitioned automatically by the key)
- Rows are ordered by key; columns are ordered within a column family
- The table schema defines the column families; rows can have different numbers of columns
- Columns have a value and any number of versions
- Column-range and key-range queries

Row Key      | Column Family (dimensions)   | Column Family (metric)
112334-7782  | server:host1  dc:PUNE        | value:20
112334-7783  | server:host2                 | value:10
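The (row key, column family:column, timestamp) -> value model above can be sketched with nested dicts; this is a toy model, not the HBase API, populated with the slide's sample rows:

```python
# Toy model of an HBase table:
# row key -> "family:column" -> {timestamp: value}
table = {
    "112334-7782": {
        "dimensions:server": {1: "host1"},
        "dimensions:dc": {1: "PUNE"},
        "metric:value": {1: 20},
    },
    "112334-7783": {
        "dimensions:server": {1: "host2"},
        "metric:value": {1: 10},
    },
}

def get(row, column, timestamp=None):
    # When no timestamp is given, return the latest version,
    # mirroring HBase's versioned-cell semantics.
    versions = table[row][column]
    ts = timestamp if timestamp is not None else max(versions)
    return versions[ts]

print(get("112334-7782", "metric:value"))   # 20
print(get("112334-7783", "dimensions:server"))  # host2
```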

Hive - Distributed SQL > MR
- MR is not easy to code for analytics tasks (e.g. group, aggregate, etc.); chaining several Mappers & Reducers is required
- Hive provides familiar SQL queries, which are automatically translated into a flow of appropriate Mappers and Reducers that execute the query via MR
- Leverages the Hadoop ecosystem: MR, HDFS, HBase
- Hive defines a schema in its meta-tables, which its SQL queries can use and where it stores metadata
- Storage handlers for HDFS and HBase
- Hive SQL supports the common SQL clauses: select, filter, grouping, aggregation, insert, etc.
- Hive stores data by partitions (you can specify a partitioning key while loading Hive tables) and buckets (useful for statistical operations like sampling)
- Hive queries can also include custom map/reduce tasks as scripts

Hive Queries - CREATE TABLE

TABLE:
CREATE TABLE wordfreq (word STRING, freq INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'freq.txt' OVERWRITE INTO TABLE wordfreq;

EXTERNAL TABLE:
CREATE EXTERNAL TABLE iops (key STRING, os STRING, deploymentsize STRING, ts INT, value INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:os,data:deploymentsize,data:ts,data:value");

Hive Queries - SELECT

TABLE:
select * from wordfreq where freq > 100 sort by freq desc limit 3;
explain select * from wordfreq where freq > 100 sort by freq desc limit 3;
select freq, count(*) AS f2 from wordfreq group by freq sort by f2 desc limit 3;

EXTERNAL TABLE:
select ts, avg(value) as cpu from cpu_util_5min group by ts;
select architecture, avg(value) as cpu from cpu_util_5min group by architecture;

Hive SQL -> Map Reduce
CPU utilization / 5 min with dimensions server, server-type, cluster, data-center; grouped by timestamp and filtered by server-type = 'Unix':

SELECT timestamp, AVG(value)
FROM timeseries
WHERE server-type = 'Unix'
GROUP BY timestamp;

timeseries -> Map -> Shuffle/Sort -> Reduce
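The Map -> Shuffle/Sort -> Reduce pipeline Hive derives from this query can be sketched in Python: the map phase applies the WHERE filter and emits (timestamp, value), the shuffle groups by timestamp, and the reduce phase computes the average. The rows below are illustrative, not from the slides:

```python
from collections import defaultdict
from statistics import mean

# Illustrative time-series rows: (timestamp, server_type, value).
rows = [
    (100, "Unix", 40.0),
    (100, "Unix", 60.0),
    (100, "Win",  90.0),
    (105, "Unix", 30.0),
]

def map_fn(row):
    ts, server_type, value = row
    # WHERE server-type = 'Unix' is applied in the map phase.
    if server_type == "Unix":
        yield ts, value  # emit (reduction key, value)

# Shuffle/sort: group emitted values by timestamp.
grouped = defaultdict(list)
for row in rows:
    for ts, value in map_fn(row):
        grouped[ts].append(value)

# Reduce: AVG(value) per timestamp, i.e. the GROUP BY aggregation.
result = {ts: mean(values) for ts, values in grouped.items()}
print(result)  # {100: 50.0, 105: 30.0}
```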

Thanks