In-Memory Data Grids

Similar documents

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

GridGain In- Memory Data Fabric: UlCmate Speed and Scale for TransacCons and AnalyCcs

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

Moving From Hadoop to Spark

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

How Companies are! Using Spark

In Memory Accelerator for MongoDB

Big Data Analytics - Accelerated. stream-horizon.com

Enterprise Operational SQL on Hadoop Trafodion Overview

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Hadoop IST 734 SS CHUNG

So What s the Big Deal?

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

How To Scale Out Of A Nosql Database

How To Handle Big Data With A Data Scientist

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Hadoop implementation of MapReduce computational model. Ján Vaňo

Big Data and Data Science: Behind the Buzz Words

I/O Considerations in Big Data Analytics

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Real-time Big Data Analytics with Storm

Large scale processing using Hadoop. Ján Vaňo

Native Connectivity to Big Data Sources in MSTR 10

In-Memory Analytics for Big Data

Hadoop & Spark Using Amazon EMR

How To Create A Data Visualization With Apache Spark And Zeppelin

INTRODUCING APACHE IGNITE An Apache Incubator Project

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

MapReduce with Apache Hadoop Analysing Big Data

<Insert Picture Here> Getting Coherence: Introduction to Data Grids South Florida User Group

BIG DATA: STORAGE, ANALYSIS AND IMPACT GEDIMINAS ŽYLIUS

Hadoop Ecosystem B Y R A H I M A.

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Big Data Technologies Compared June 2014

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Dominik Wagenknecht Accenture

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns

Table Of Contents. 1. GridGain In-Memory Database

BIG DATA TOOLS. Top 10 open source technologies for Big Data

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Scalable Architecture on Amazon AWS Cloud

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Workshop on Hadoop with Big Data

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

NoSQL and Hadoop Technologies On Oracle Cloud

Internals of Hadoop Application Framework and Distributed File System

Using an In-Memory Data Grid for Near Real-Time Data Analysis

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Introduction to Big Data Training

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

IN-MEMORY DATA FABRIC: Data Grid

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Tap into Hadoop and Other No SQL Sources

Unified Big Data Processing with Apache Spark. Matei

NoSQL Data Base Basics

Real-Time Analytics for Big Market Data with XAP In-Memory Computing

The Internet of Things and Big Data: Intro

Open Source Technologies on Microsoft Azure

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Big Data Introduction

Big Data Success Step 1: Get the Technology Right

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Lecture Data Warehouse Systems

Big Data: Are You Ready? Kevin Lancaster

Oracle Database 12c Plug In. Switch On. Get SMART.

Unified Big Data Analytics Pipeline. 连城

In-Memory Computing for Iterative CPU-intensive Calculations in Financial Industry In-Memory Computing Summit 2015

Upcoming Announcements

Real Time Big Data Processing

TRAINING PROGRAM ON BIGDATA/HADOOP

Spark: Making Big Data Interactive & Real-Time

In-Memory Computing : Premiere

GigaSpaces Real-Time Analytics for Big Data

How to Choose Between Hadoop, NoSQL and RDBMS

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

Reference Architecture, Requirements, Gaps, Roles

In-Memory BigData. Summer 2012, Technology Overview

Transcription:

The Future of Real-time Big Data: In-Memory Data Grids Jacek Kruszelnicki, Numatica Corporation E-mail: jacek@numatica.com Phone: 781 756 8064 1

Presenter Jacek Kruszelnicki President & CEO, Numatica Corporation Intellectuals solve problems. Geniuses prevent them. -- Albert Einstein 20+ years of shipping scalable, distributed enterprise software Vendor neutral. Merit-based. Biased towards simplicity/elegance. Author, mentor & conference speaker Founder and past persident of New England WebLogic User Group M.S. in Computer Science, Adjunct Professor at Northeastern University Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 2

Agenda Definitions: What is Big Data? What is Real-time Big Data Analytics? What is High-velocity data stream computing Business Drivers Big Data Universe: Hadoop, NoSQL and Distributed Data Grids Distributed Data Grids: What are Distributed Data Grids Charateristics of a Distributed Data Grid Major players, and the winner is... From Technlogies to Platforms The Ultimate: Unified Big Data processing Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 3

Not on Agenda Not an IMDG tutorial Not In-depth comparison of IMDGs (things change too fast) Data modeling, partitioning strategy (hub/spoke, data affinity, etc) (separate presentation) Analysis of competing frameworks suitable for Real-Time B.D.A. (separate presentation) CAP Theorem discussion Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 4

What is Big Data? Data sets too large and complex to process using standard DB tools and data processing applications Few dozen terabytes to Many petabytes of data in a single data set. High Volume High Variety Structured, semi-structured At-rest, In-flight Variety of formats High Velocity Millions of events per second Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 5

Real-time Big Data Analytics High Volume High Variety Focus on Velocity + Variety Rapidly changing data (at rest or in-motion): Business activities Examples: trades, orders High Velocity External stimuli Examples: market data, current customer location Operational monitoring Example: application log entries Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 6

Business Drivers Operational Intelligence: Analyze stream of business activities and external stimuli on-the fly React to them (preferably) instantaneously First-mover competitive advantage (new market conditions, customer wishes, etc) Real time data stream processing is critical, otherwise business value is lost. Real-time Big Data Analytics examples: Dynamic pricing (e-commerce) High-frequency trading Network security threats Credit card fraud prevention Factory floor data collection, RFID Mobile infrastructure, machine to machine (M2M) applications Prescriptive or Location-based applications Real-time dashboards, alerts, and reports Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 7

Growth (per Gartner) In-Memory Computing Racing Towards Mainstream Adoption Market to Reach $1 Billion by 2016 Issues so far: Lack of standards, Scarcity of skills, Relative architectural complexity, Monitoring and management challenges Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 8

Real-time Data Processing Advanced queries or complex transactions on huge datasets Get results in (near) real-time In-motion: Rapidly changing, massive stream(s) of data (events) Multiple sources, formats Needs to be processed in-flight Events persisted or not, depending on value At rest: Data already persisted, change notification Re-ingest, re-process (deltas only?) Typically, a combination of both Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 9

Big Data Universe Hadoop Spark Shark (Spark on Hive) 40x Still needs Hadoop: HDFS, YARN, Pig, Hive, HBase, JobTracker. TaskTracker,... Open-source, batch-oriented MPP DBs Teradata Vertica Greenplum Proprietary, complex, expensive NoSQL DBs Column: Cassandra, Hbase, Redshift Document/K-V: MongoDB, CouchDB, Riak Graph: Neo4j, etc. No tx, no referential integrity, triggers, foreign keys In-Memory Data Grids Coherence (commercial Hazelcast (OSS) Terracotta (commercial) Gemfire (commercial) GridGain (commercial) Gigaspaces XAP (OSS) Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 10

The Origins of In-Memory Data Grids Local Cache Distributed Cache (K-V Store) + Distributed List, Queue, Topic Computational Grids Reliable jobs execution, Scheduling Load balancing In-Memory Distributed Data/Compute Grid RAM is the new disk DISK is the new tape SQL (not Turing complete) MapReduce (low granularity, invasive), often not enough From: Disk IO optimization To: Data management over network (data in a grid) Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 11

The Concept Node 1 Node 2 Node 3 RAM/JVM RAM/JVM RAM/JVM Auxiliary Storage (RDBMS, FS) (Pluggable Serialization) Data partitioning: Logical (on key) % LOGICAL_PARTITION_COUNT (Hazelcast = 271) Physical (one or more logical) Partition ownership, replicas reassigned on membership change Can determine physical location by key, to dispatch logic Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 12

Real-time Big Data Universe Hadoop family Apache Storm Spark/Shark In-Memory Data Grids => Data/Compute Grids Coherence (commercial Hazelcast (OSS) Terracotta (commercial) Gemfire (commercial) GridGain (OSS, commercial) Gigaspaces XAP (OSS) Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 13

Traits of a Real-time Analytics platform Low Transactional Latency In-memory transaction speeds, minimize network trips, response in ms Low Data Latency Real-time data, rather than stale data weeks or months old High Throughput, High Scalability Horizontal and vertical Stream processing/messaging/cep Continuous Query Resilience Stateless, decentralized, peer-peer - no single point of failure, Partitioned with backup replicas. nodes can fail without data loss and join without any data corruption Elastic, simple cluster management No Mesos on top of ZooKeeper, etc. Programmatic, performant SQL-like querying (data affinity) Txs, including XA (when needed) Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 14

Major players - comparison Feature Hazelcast 3.2 GemFire 7 Oracle Coherence 12.1.3 Distributed data/collections DistributedMap MultiMap Dist. Queue Dist. Events No Multimap No Multimap Distributed Concurrency Dist. Lock Dist. Atomic Ref Dist. Atomic Dist Semaphore No Dist. Atomic Ref. No Dist. Semaphore No Dist. Atomic Ref. No Dist. Semaphore Distributed Computing Executor Service, Key Affinity, Map/Reduce Convoluted, Invasive Convoluted, Invasive Resilience In-memory replicas Recovery WAN replication Elasticity/Cluster Management Yes Yes Yes Transactions Local + XA Local + XA Local + XA Yes Yes Distributed Querying SQL (subset) Predicate Continuous Query Yes Yes License Apache 2.0 + Commercial Commercial Commercial Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 15

And the winner is... The most intellectually elegant distributed in-memory data/compute grid. Minimalism as design aesthetic: Non-intrusive, No dependencies 2.6MB single jar library. Implements Java APIs (Map, List, Set, Queue, Lock) in a distributed manner Distributed Execution Framework (extension of Java s Executor Service) Distributed Queries (SQL/Predicate), Data affinity (execution on specific node execution) Cluster management Java API included (dead simple) Auto-discovery of nodes and intelligent synchronization Apache License 2, commercial extensions + support Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 16

Code example: Distributed Execution Framework (1) import java.util.concurrent.callable; public class Echo implements Callable<String>, Serializable { String input = null; } public Echo() { } public Echo(String input) { this.input = input; } public String call() { return Hazelcast.getCluster().getLocalMember().toString() + input; } ExecutorService executorservice = Executors.newSingleThreadExecutor(); Future<String> future = executorservice.submit (new Echo("myinput"));... String result = future.get(); Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 17

Code example Distributed Execution Framework (2) import com.hazelcast.core.iexecutorservice; import java.util.concurrent.callable; import java.util.concurrent.future; import java.util.set; public void echosomewhere(string input, Member member) { Callable<String> task = new Echo(input); HazelcastInstance hz = Hazelcast.newHazelcastInstance(); IExecutorService executorservice = hz.getexecutorservice("default"); Future<String> future = executorservice.submittomember(task, member); Future<String> future = executorservice.submittokeyowner(task, key); Map<Member, Future<String>> futures = executorservice.submittomembers(new Echo(input), members); String echoresult = future.get(); Execution callbacks, cancellation APIs provided Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 18

Code example: Data Affinity(1) HazelcastInstance instance = Hazelcast.newHazelcastInstance(cfg); Map mapa = instance.getmap("mapa"); Map mapb = instance.getmap("mapb"); Map mapc = instance.getmap("mapc"); // different entries, but the operation will take place on the same member mapa.put("key1", value); mapb.get("key1"); mapc.remove("key1"); // lock operation will still execute on the same member of the cluster instance.getlock ("key1").lock(); // consistent with distributed execution instance.getexecutorservice().executeonkeyowner(runnable, "key1"); Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 19

Code example: Data Affinity(2) public class OrderKey implements Serializable, PartitionAware { int customerid; int orderid; public OrderKey(int orderid, int customerid) {...} public Object getpartitionkey() { return customerid; } } Map mapcustomers = instance.getmap("customers") ; Map maporders = instance.getmap("orders"); mapcustomers.put(1, customer); maporders.put(new OrderKey(21, 1), order); public static class OrderDeletionTask implements Callable<Integer>, PartitionAware, Serializable {... } ExecutorService es = instance.getexecutorservice(); OrderDeletionTask task = new OrderDeletionTask(customerId, orderid); Future future = es.submit(task); int remainingorders = future.get(); Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 20

Code example: Distributed Query public class Employee implements Serializable { private String name; private int age; private boolean active; private double salary; } HazelcastInstance hz = Hazelcast.newHazelcastInstance(cfg); IMap map = hz.getmap("employeemap"); Set<Employee> employees = (Set<Employee>) map.values( new SqlPredicate("active AND age < 30")); // or JPA-like criteria Predicate predicate = e.is("active").and(e.get("age").lessthan(30)); Set<Employee> employees = (Set<Employee>) map.values(predicate); Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 21

Code example: Cluster Management HazelcastInstance hz = Hazelcast.newHazelcastInstance(cfg); Cluster cluster = hz.getcluster(); cluster.addmembershiplistener(new MembershipListener(){ public void memberadded(membershipevent membersipevent) { System.out.println("MemberAdded " + membersipevent); } } Member localmember = cluster.getlocalmember(); System.out.println ("my inetaddress= " + localmember.getinetaddress()); Set setmembers = cluster.getmembers(); for (Member member : setmembers) { System.out.println ("islocalmember " + member.localmember()); System.out.println ("member.inetaddress " + member.getinetaddress()); System.out.println ("member.port " + member.getport()); } Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 22

Example: Admin console Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 23

More than just Real-time Big Data Scalable Computing: DB Caching Web/Session Clustering High-speed K-V Data Store Distributed queues, topics (Fast) SOA Services Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 24

Limitations RAM currently maxes out at ~ 640GB/server Still more expensive than SSDs and HDDs Garbage collection Cost and capacity limitations will disappear over time Off-heap memory and specialized JVMs (Azul, etc.) Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 25

From Technologies to Platforms In-Memory Data/Compute Grids not enough: Backing DB(s) Enterprise Message Store(s) Example: GemFire GreenPlum DB Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 26

Holy Grail: Unified Big Data Processing Multi-Source Data Harvesting, Ingestion, Transformation Data-type agnostic: Highly Structured (RDBMS data) Semi-structured (documents, key-value tuples) In-Motion and At Rest Distributed Computing, not just Data Tx and non-tx Batch and/or Real-time Processing The fewer moving parts the better Data Warehousing -> Hadoop/MapReduce -> Unified Big Data Processing Relational + Batch Batch All of the above Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 27

Unified Architecture Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 28

Summary One order of magnitude faster Data And Logic colocated, affinity-able (with careful design) Flexibility: Turing complete SQL/Predicate queries MapReduce framework on top Scalable and Elastic, simply (automatic repartitioning) Current limitations (cost, GC) fading fast Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 29

Q & A http://www.numatica.com Copyright (C) 2014 Numatica Corporation. All Rights Reserved. 30

Appendix 31

Berkeley Big-data Analytics Stack (BDAS) 32