Big Data Analytics - Accelerated. stream-horizon.com

Similar documents
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Big Data Analytics - Accelerated. stream-horizon.com

Big Data With Hadoop

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Enabling High performance Big Data platform with RDMA

Hadoop: Embracing future hardware

Design and Evolution of the Apache Hadoop File System(HDFS)

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

[Hadoop, Storm and Couchbase: Faster Big Data]

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Big Data Technology Core Hadoop: HDFS-YARN Internals

Hadoop Architecture. Part 1

From Spark to Ignition:

Apache Hadoop. Alexandru Costan

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

CSE-E5430 Scalable Cloud Computing Lecture 2

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Mr. Apichon Witayangkurn Department of Civil Engineering The University of Tokyo

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Hadoop Ecosystem B Y R A H I M A.

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Virtualizing Apache Hadoop. June, 2012

Large scale processing using Hadoop. Ján Vaňo

HDMQ :Towards In-Order and Exactly-Once Delivery using Hierarchical Distributed Message Queues. Dharmit Patel Faraj Khasib Shiva Srivastava

I/O Considerations in Big Data Analytics

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

NoSQL Data Base Basics

Big Data Introduction

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division

Putting Apache Kafka to Use!

Apache HBase. Crazy dances on the elephant back

Application Development. A Paradigm Shift

Hadoop Distributed File System (HDFS) Overview

Distributed Filesystems

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Hadoop & its Usage at Facebook

Alternatives to HIVE SQL in Hadoop File Structure

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Dell* In-Memory Appliance for Cloudera* Enterprise

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Luncheon Webinar Series May 13, 2013

Architectures for massive data management

Kafka & Redis for Big Data Solutions

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Massive Cloud Auditing using Data Mining on Hadoop

Apache Hadoop FileSystem and its Usage in Facebook

THE HADOOP DISTRIBUTED FILE SYSTEM

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

The 3 questions to ask yourself about BIG DATA

Dell In-Memory Appliance for Cloudera Enterprise

A Brief Introduction to Apache Tez

HYPER-CONVERGED INFRASTRUCTURE STRATEGIES

Big Fast Data Hadoop acceleration with Flash. June 2013

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

The Hadoop Distributed File System

Traditional BI vs. Business Data Lake A comparison

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

COURSE CONTENT Big Data and Hadoop Training

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Big Data Testbed for Research and Education Networks Analysis. SomkiatDontongdang, PanjaiTantatsanawong, andajchariyasaeung

Hadoop IST 734 SS CHUNG

Apache Hadoop FileSystem Internals

WHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka

Building & Optimizing Enterprise-class Hadoop with Open Architectures Prem Jain NetApp

A Brief Outline on Bigdata Hadoop

Hadoop & its Usage at Facebook

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

NOT IN KANSAS ANY MORE

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

International Journal of Advance Research in Computer Science and Management Studies

BIG DATA What it is and how to use?

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Agenda. Some Examples from Yahoo! Hadoop. Some Examples from Yahoo! Crawling. Cloud (data) management Ahmed Ali-Eldin. First part: Second part:

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Intro to Map/Reduce a.k.a. Hadoop

BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS

ANALYSIS OF BILL OF MATERIAL DATA USING KAFKA AND SPARK

Accelerating and Simplifying Apache

Transcription:

Big Data Analytics - Accelerated stream-horizon.com

StreamHorizon & Big Data Integrates into your Data Processing Pipeline Seamlessly integrates at any point of your your data processing pipeline Implements generic input/output connectivity. Enables you to implement your own, customised input/output connectivity. Ability to process data from heterogeneous sources like: Spark Kafka Storm Netty Accelerates your clients TCP Streams Local File System Any Other Reduces Network data congestion Improves latency of, or any other Massively Parallel Processing S Query Engine. HBase Any Other And more Portable Across Heterogeneous Hardware and Software Platforms Portable from one platform to another

StreamHorizon Big Data Processing Pipeline

StreamHorizon - Flavours Big Data Processing Storm - Reactive, Fast, Real Time Processing Guaranteed data processing Guarantees no data loss Real-time processing Horizontal scalability Fault-tolerance Stateless nodes Open Source Seamless Integration Hadoop Big Batch Oriented Processing Batch processing Jobs runs to completion Stateful nodes Scalable Guarantees no data loss Open Source Kafka Designed for processing of real time activity stream data (metrics, KPI's, collections, social media streams) A distributed Publish-Subscribe messaging system for Big Data Acts as Producer, Broker, Consumer of message topics Persists messages (has ability to rewind) Initially developed by LinkedIn (current ownership of Apache)

NameNode Big Data & StreamHorizon Data Persistence (Example: Finance Industry) Data Calculation, Data Processing & Persistence Data Processing Tasks Big Data DataNode Cluster - Data Source (Quant Library Instance) - StreamHorizon instance (daemon)

HDSF vs. Tier 2 Storage Performance Benchmark Big Data () vs. Commodity Tier 2 Storage Benchmarks Non - filesystem filesystem Single Server Single DataNode Cluster of 10 DataNodes File to File 0.95 million records/sec 1.1 million records/sec per node 11 million records/sec per node Stream to File 1.04 million records/sec 1.2million records/sec per node 12million records/sec per node

StreamHorizon & Big Data Advanced Concepts Streaming Data Aggregations SDA SDA are integral part of all StreamHorizon instances (daemons) StreamHorizon SDA processes & aggregates data as it is processed (on the fly) Aggregated data is directly persisted to (or any alternative Data Target)

StreamHorizon Accelerator Method - persisting your Big Data (faster) Accelerate Hadoop or/and persistence by writing less: StreamHorizon enables you to persist only transactional data (traditionally known as Fact table data ) and omit persisting repetitive dimensional data (low cardinality data) to DataNodes across your Big Data cluster Client front end tool (or any end user application like Excel) simply merges Dimensional data with Fact (transactional data) brought from your Big Data cluster. Reduced Client & Network footprint StreamHorizon Accelerator Method reduces Network traffic between your clients & Big Data cluster to ~10% of nominal size (due to avoidance of shipping of low cardinality data (dimensional data) via network)

NameNode NameNode Big Data & StreamHorizon Data Retrieval (Example: Finance Industry) Nominal Network Congestion User Groups Network Congestion StreamHorizon Accelerator Method User Groups Dimensional Data (Low cardinality) Dimensional And Fact Data Fact Data (High Cardinality)

StreamHorizon beneficial to Big Data filesystem () StreamHorizon Accelerates Hadoop processing - MapReduce has two main disadvantages (processing is slow & inconvenient to use) + StreamHorizon SDA outperforms ( based reporting stack usually has higher query latency compared to stack. This is significantly improved with StreamHorizon) Due to aggregation, single file contains even more of your business data. Benefits are: queries are more effective Reduces number of I/O requests Single I/O request executes as sequential read in comparison to default I/O footprint Reduces replication latency Move via Network only high cardinality (Fact) rather than low cardinality (Dimensional) data. Achieve reduction of network traffic down to 10% of it s nominal value. StreamHorizon SDA accelerates heavy data operations like joins etc.

Client Queries - accelerated by StreamHorizon StreamHorizon accelerates ad-hoc queries for, or any other MPP S Query Engine. This is can be achieved with: StreamHorizon SDA (Streaming Data Aggregations) StreamHorizon Accelerated Method data topology StreamHorizon reduces memory pressure for (or any other memory dependent data access component) StreamHorizon reduces MapReduce processing latency for (when utilizing StreamHorizon SDA) query latency reduced by order of magnitude (function of data volume reduction of your StreamHorizon SDA aggregations)

Streaming Data Aggregations impact on Big Data Query Latency Out of the Box With Streaming Data Aggregations Query Latency Medium-Low Medium-High Low Medium-Low Memory Footprint High Nominal Medium - Low Nominal - Low Processing Footprint Medium-Low High Low Low Space Consumption Medium High Low Low

Q&A stream-horizon.com