Hybrid Software Architectures for Big Data. Laurence.Hubert@hurence.com @hurence http://www.hurence.com




Headquarters: Grenoble. A pure player: expert-level consulting, training, and Big Data / X-data R&D, plus hot-line assistance. Products: B-DAP (Big Data Analytics Platform) and BotSearch (bot detection in an unsupervised way). Copyright Hurence 2014

Big Data versus High Performance Computing. Big Data inherits the concepts and architectures designed by HPC experts: high processor density; I/O optimizations (avoiding network latency by co-locating tasks and data); high availability (fail-over mechanisms) to ensure no downtime; etc. Big Data democratizes HPC and makes it enter traditional information systems. It is still generally less advanced than specialized HPC systems with GPUs, FPGAs, etc., and it is more heterogeneous from a software standpoint: the various tools come from different open-source communities and were not initially designed to work together. It is not much focused on power consumption at the moment; it is more focused on providing easy enough installation and monitoring tools, as well as programming tools and high-level environments for non-programmers (the Business Intelligence or Marketing teams).

Big Data benefits from HPC's revisited software. The scale-out approach with commodity hardware (versus scale-up) rests on two pillars. First, a distributed file system (GridFS, GFS, GPFS, Lustre): store data on several machines (high availability) and replicate it several times (fault tolerance). Second, a parallel programming model (MPP as in Spark, GraphX on top of Spark, MPI as in Storm or IBM System-S): organize tasks on multiple machines (parallelization) and schedule tasks where the data is (no network latency).

Big Data star: the Hadoop cluster. 2 racks (on different electrical circuits); a 10 Gbit/s network switch to connect the machines; a number of machines acting as data nodes (storing data) and task nodes (computing things). Typical machines: 2 processors (2x4, 2x6 or 2x8 cores); not less than 6 gigabytes of RAM per core; DAS (Directly Attached Storage) with 1 to 4 terabyte SAS or SATA disks configured as JBOD (Just a Bunch of Disks). (Illustration: 2011 Gfi Informatique.)

Big Data / Hadoop Distributed File System principles. REPLICATION: a file is stored as multiple blocks, and each block is replicated 3 times. CO-LOCATION OF TASKS AND DATA: a job processing the file is decomposed into parallel tasks, and each task is scheduled on a server that holds its block of data.
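The replication principle above can be sketched in a few lines of Python. This is a toy illustration of HDFS-style block placement, not the real HDFS code: the tiny block size and the round-robin placement are assumptions chosen for readability.

```python
# Toy sketch of HDFS-style block replication (illustrative, not the real HDFS).
# A file is split into fixed-size blocks; each block is placed on 3 distinct nodes.
from itertools import cycle

BLOCK_SIZE = 4    # bytes per block (tiny, for illustration; HDFS uses 128 MB blocks)
REPLICATION = 3   # HDFS default replication factor

def place_blocks(data, nodes):
    """Split `data` into blocks and assign each block to REPLICATION distinct nodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    ring = cycle(range(len(nodes)))          # round-robin over the cluster
    placement = {}
    for idx, block in enumerate(blocks):
        start = next(ring)
        replicas = [nodes[(start + r) % len(nodes)] for r in range(REPLICATION)]
        placement[idx] = (block, replicas)   # block -> 3 distinct data nodes
    return placement

placement = place_blocks(b"hello big data world", ["node1", "node2", "node3", "node4"])
for idx, (block, replicas) in placement.items():
    print(idx, block, replicas)
```

Losing any single node leaves at least two live copies of every block, which is what makes co-locating tasks with data practical: the scheduler can pick whichever replica host is least loaded.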

Parallel programming models. Batch or «near real time»: Map Reduce. Real time: MPI (Message Passing Interface, stream processing) and MPP (Massively Parallel Processing). Specialized: GraphX, to process graphs on top of MPP. Since Hadoop YARN 2.0, Hadoop supports all of these models and is becoming a de-facto Big Data operating system.

Understanding Map Reduce through Hadoop MR: counting colored squares.
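The colored-squares example can be replayed in plain Python: the map, shuffle and reduce phases below mirror what Hadoop MR distributes across the cluster (here everything runs in one process, so this is only a conceptual sketch).

```python
# Minimal map/shuffle/reduce sketch of the "counting colored squares" example.
from collections import defaultdict

squares = ["red", "blue", "red", "green", "blue", "red"]

# MAP: each mapper emits a (color, 1) pair for every square in its input split.
mapped = [(color, 1) for color in squares]

# SHUFFLE: pairs are grouped by key, so each reducer receives one color.
groups = defaultdict(list)
for color, count in mapped:
    groups[color].append(count)

# REDUCE: each reducer sums the counts for its color.
totals = {color: sum(counts) for color, counts in groups.items()}
print(totals)  # {'red': 3, 'blue': 2, 'green': 1}
```

On a real cluster the map tasks run on the nodes holding the input blocks (the co-location principle), and the shuffle moves only the small (color, 1) pairs over the network, not the raw data.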

Understanding MPI (Message Passing Interface) through Storm. Jobs are DAGs (Directed Acyclic Graphs) of tasks with 2 types: Spouts, the sources of data, and Bolts, the processors of data. Example: a Twitter spout (using the Twitter API) emits tweets to a text-mining bolt; a YouTube spout emits video to a video-to-text bolt; both feed a storage bolt that persists the text and the extracted facts.
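The spout/bolt pipeline can be sketched with Python generators. This is a hypothetical illustration of the topology idea, not the real Storm API: the names (twitter_spout, text_mining_bolt) and the canned tweets are assumptions standing in for live streams.

```python
# Hypothetical sketch of a Storm-style topology: spouts emit tuples, bolts
# transform them, and a terminal bolt stores the results. Not the Storm API.

def twitter_spout():
    """Spout: a source of data (canned tweets instead of the Twitter API)."""
    yield from ["big data is fun", "spark and storm"]

def text_mining_bolt(stream):
    """Bolt: a processor of data (here, tokenizing each tweet into words)."""
    for tweet in stream:
        for word in tweet.split():
            yield word

def storage_bolt(stream):
    """Terminal bolt: collects the processed tuples."""
    return list(stream)

# Wire the DAG: spout -> text-mining bolt -> storage bolt.
result = storage_bolt(text_mining_bolt(twitter_spout()))
print(result)
```

In Storm each stage would run as many parallel task instances on different machines, with tuples passed between them as messages; the generator chaining above only captures the dataflow shape.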

Understanding MPP (Massively Parallel Processing) with Spark. A program is decomposed into parallel tasks positioned on the machines. On the first iteration, no data is cached; on the second iteration, the program may run on cached data sets. Data is cached in-memory => a distributed cache (RDD: Resilient Distributed Dataset). The result: low-latency, real-time, in-memory parallel processing.
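The first-iteration-computes / second-iteration-reuses behaviour can be shown with a small caching class. This is a toy single-machine illustration of the RDD caching idea, not the Spark API; the class and method names are invented for the example.

```python
# Toy sketch of the RDD caching idea: the first action computes and caches the
# dataset; later actions reuse the in-memory copy instead of recomputing it.
# Illustrative only -- not the Spark API.

class CachedDataset:
    def __init__(self, compute):
        self._compute = compute   # how to (re)build the data from storage
        self._cache = None
        self.computations = 0     # counts expensive recomputations

    def collect(self):
        if self._cache is None:          # first iteration: nothing cached yet
            self._cache = self._compute()
            self.computations += 1
        return self._cache               # later iterations: served from memory

rdd = CachedDataset(lambda: [x * x for x in range(5)])
first = rdd.collect()    # triggers the computation
second = rdd.collect()   # served from the cache, no recomputation
print(rdd.computations)  # 1
```

In Spark the cache is distributed across the cluster's RAM and partitions can be rebuilt from lineage if a node fails, which is what the "Resilient" in RDD refers to.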

Big Data life cycle and Big Bazaar! Big Data platforms cover the whole cycle: collecting, cleaning and enriching; preparing data; storing; analyzing and predicting; visualizing.

λ (lambda) architectures

New architectures enabled by Hadoop YARN 2.0. Different job schedulers can now launch different jobs (batch jobs or real-time jobs), and resources are managed globally by a resource manager. Still... some layers / tools do not release resources they no longer need; they are not good multi-tenant citizens...

Different parallel paradigms want to access the same resources. The speed layer says «I want all my RDDs to be in-memory» and configures 64 GB of RAM. The service layer says «I want to cache the database data in-memory» and configures 64 GB of RAM. The batch layer says «I want to set a maximum memory on my processes to avoid swapping, since swapping is a performance killer» and configures 64 GB of RAM. And for network-latency reasons, all of them go to the same cluster of machines.

The problem with superposing many Big Data technologies. We get an inflation of memory on the machines, because we must provide enough memory for each of the components, including databases, each in its own silo. An example: two different in-memory systems using the same file will load the data in-memory twice, because there is no global knowledge across tools that the data is already in-memory. Also, if one layer is unoccupied, the other layers cannot use the memory it leaves idle in a flexible and dynamic way (there is no global capacity scheduler): every single tool has a static memory configuration. => the heterogeneity of Big Data requires better management of resources (in particular memory resources).
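The waste caused by static per-tool memory configuration is easy to quantify. The layer names and gigabyte figures below are invented for illustration; the point is only that reserved-but-unused memory cannot be reassigned without a global capacity scheduler.

```python
# Toy illustration of the static-allocation problem: each tool reserves a fixed
# slice of RAM whether or not it uses it, so idle memory cannot be reassigned.
# All figures are hypothetical.

TOTAL_RAM_GB = 128
static_config = {"speed_layer": 64, "service_layer": 32, "batch_layer": 32}
actually_used = {"speed_layer": 64, "service_layer": 4, "batch_layer": 30}

reserved = sum(static_config.values())   # RAM claimed by static configurations
used = sum(actually_used.values())       # RAM doing useful work right now
wasted = reserved - used                 # idle RAM no other layer may touch
print(f"reserved {reserved} GB, used {used} GB, idle but unusable: {wasted} GB")
```

Here the service layer sits almost idle, yet the speed layer cannot borrow its 28 unused gigabytes: exactly the rigidity a cluster-wide resource manager is meant to remove.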

The way forward... => There is a need for very clever resource managers and schedulers, sufficiently standardized to allow many technologies to work together rather than in hardware silos. => Hadoop YARN has been a first move towards providing common resource management, but many improvements are needed to manage resources in a much more clever way. Rendez-vous in 2015 for the Hadoop improvements...