GigaSpaces Real-Time Analytics for Big Data




GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems.

The rapidly increasing use of large-scale, location-aware social media and mobile applications is driving the need for scalable, real-time platforms that can handle streaming analysis and processing of massive amounts of data. Today, creating an analytics system for big data generally means collecting multiple technologies from various providers and building the system yourself. This presents challenges in terms of performance, cost, scalability, real-time capability, and more. GigaSpaces resolves these issues.

You need to handle massive amounts of data in real time, without losing data and at minimum cost. Most analytics systems are not designed for real time: it can take hours or days before the impact of an event shows up in reports and you can take action. The challenge becomes even greater as events are gathered from more sources at significantly higher volumes.

One option: construct your own solution by combining various available technologies. This can be complex: in addition to messaging, data storage, and processing, you need management and orchestration to automate deployment and ensure the continuous availability of the assorted parts.

A simpler option: just plug in the GigaSpaces Real-Time Analytics solution. You can focus on your business logic and leave the rest to us. GigaSpaces makes building and deploying a large-scale real-time analytics system simple: you provide the event-processing business logic, and we handle the scalability, performance, and database integration. Seamlessly.

GigaSpaces delivers middleware that provides enterprises and ISVs with end-to-end application scalability and cloud enablement for mission-critical applications, serving hundreds of tier-1 organizations worldwide.

It's open: use any stack, avoid lock-in.
- Pick your own Big Data database (RDBMS or NoSQL).
- Plug in consistent management and monitoring across the stack without changing your code.
- Write event handlers using common languages.
- Access your data using standard SQL/JPA APIs.

All while minimizing costs: a unique combination of memory- and disk-based databases ensures the optimum cost/performance ratio, and leveraging automation and cloud-based deployment reduces operational costs.

[Diagram: the GigaSpaces XAP Real-Time Solution for Big Data Applications eliminates this complexity, plugging into Cassandra, HBase, MongoDB, or Redis.]

CURRENT TECHNOLOGIES OVERVIEW

There is no one-size-fits-all technology. Building an analytics application that addresses both real-time and batch requirements means combining the available technologies. The challenge becomes integrating these various pieces, tuning the system to ensure consistent performance and scaling through the entire stack, and providing consistent management and monitoring across the stack.

Most analytics systems can be broken down into three stages of data flow:

- Metrics (real-time): various metrics are collected into counters, for example the number of requests per day.
- Correlation (near real-time): metrics are correlated for a more aggregative system view, for example analyzing which features hook users.
- Research (batch map/reduce processing): this information is used to run research and trend analysis over a period of time.

Currently, you must integrate different products and technologies to provide the entire analytics functionality. This method has many associated challenges:

- Database (RDBMS), used to run many traditional analytics systems. Performance: not designed for real time. Scaling: not designed to grow at the speed and volume of a Big Data environment, and a poor fit for data that is continuously evolving. Cost: most RDBMSs rely on expensive setup and hardware to maintain reliability and performance.
- Complex Event Processing (CEP), designed to correlate data in real time. Scaling: it is often necessary to aggregate events into a centralized source, which doesn't scale. Capacity: not designed to deal with historical data.
- Hadoop, designed for batch analytics and complex correlation. Performance: not designed for real time.
- In-Memory Data Grid, fast processing power for storing and processing data. Capacity: storing vast amounts of information in memory doesn't scale, in terms of both system scaling and cost.
- NoSQL, designed to handle large data volumes at low cost. Processing capability: the sheer amount of data can be challenging.

THE SOLUTION

Google, Facebook, and Twitter have already shown us the way by moving many of their analytics systems to real time. The question now is how businesses can build their own Google/Facebook/Twitter-like analytics, but in a significantly simpler way that fits existing applications and skill sets.

Step 1: Collect and store. Enable collection of large volumes of data from multiple sources in real time. The process must be reliable, to ensure the accuracy of the analytics. The solution is an in-memory data grid: memory enables throughput on the order of 100K messages/sec; reliability is achieved through redundancy and replication; and the grid can be accessed through a large set of APIs (Document, JMS, Memcached, and more).

Step 2: Speed up processing by co-locating business logic with data. By co-locating your business logic and data, you can process events as they enter the system, eliminating multiple network hops and serialization/deserialization overhead. You also reduce the number of moving parts, making the entire system significantly simpler to scale and maintain.

Step 3: Integrate with the Big Data store to meet volume and cost demands. Integrate through a generic plug-in, compatible with your data store of choice, whether NoSQL or SQL, and avoid lock-in to a specific NoSQL API.

The result: performance (reduced network hops and serialization overhead), simplicity (fewer moving parts), and scalability without compromising consistency (strict consistency at the front, eventual consistency for the long-term data), all behind a JPA/standard API.
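The co-location idea in Step 2 can be sketched in plain Java. This is a toy illustration, not the GigaSpaces API: the partition class, hash routing, and counter names are all invented for the example. The point is that the handler runs inside the partition that owns the state, so an event is processed where its data lives.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of business-logic/data co-location: each partition holds
// both its slice of the counters and the handler that updates them, so
// processing an event involves no network hop and no serialization
// between a messaging tier and a processing tier.
public class ColocatedProcessingSketch {

    // One partition: local state plus a co-located event handler.
    static class Partition {
        final Map<String, Long> counters = new HashMap<>();
        void onEvent(String metric) {              // business logic runs in place
            counters.merge(metric, 1L, Long::sum); // e.g. a requests-per-day counter
        }
    }

    private final Partition[] partitions;

    ColocatedProcessingSketch(int n) {
        partitions = new Partition[n];
        for (int i = 0; i < n; i++) partitions[i] = new Partition();
    }

    // Route an event to the partition that owns its key (hash-based routing).
    public void write(String metric) {
        int idx = Math.floorMod(metric.hashCode(), partitions.length);
        partitions[idx].onEvent(metric);
    }

    // Aggregate a counter across all partitions (a mini scatter-gather read).
    public long count(String metric) {
        long total = 0;
        for (Partition p : partitions) total += p.counters.getOrDefault(metric, 0L);
        return total;
    }

    public static void main(String[] args) {
        ColocatedProcessingSketch grid = new ColocatedProcessingSketch(4);
        for (int i = 0; i < 1000; i++) grid.write("page.view");
        System.out.println(grid.count("page.view")); // 1000
    }
}
```

In the real product the routing, replication, and handler deployment are handled by the platform; the sketch only shows why keeping handler and state together removes the serialization and network overhead the text describes.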

PUTTING IT ALL TOGETHER

1. Store events in memory. 2. Co-locate business logic with data for real-time processing. 3. Integrate with the Big Data store for long-term data.

In more detail:

1. A cluster of in-memory data grids (IMDG) sits at the front, with a Big Data database at the backend.
2. Feeds are stored directly into the IMDG.
3. The feeds trigger a set of co-located processors. Processing can include validation and enrichment of the data, as well as creation of new data sets needed for further correlation and post-processing.
4. Data is forwarded to the backend Big Data store through the built-in write-behind feature of the IMDG. The IMDG serves as a processing buffer: after processing, data is stored in Big Data storage, and the grid can also hold the last day of information.
5. Data sent to the NoSQL store is written in batches to maximize write throughput. The analytics application reads the data directly from the NoSQL store; when it needs only the last day of activity, it can access the data grid directly through the built-in JPA/SQL interface.
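The write-behind forwarding described in step 4 can be sketched as a simple batching buffer. This is a conceptual sketch under assumed semantics, not XAP's implementation; the class and method names are illustrative. Writers return as soon as the event is in memory, and the backend sees one batched write instead of one round-trip per event.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the write-behind pattern: acknowledge writes once they
// are in memory, and push them to the long-term store in batches to
// maximize write throughput.
public class WriteBehindSketch {
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> backendBatches = new ArrayList<>(); // stands in for Cassandra/HBase/...
    private final int batchSize;

    WriteBehindSketch(int batchSize) { this.batchSize = batchSize; }

    // Fast path: the caller returns as soon as the event is buffered in memory.
    public synchronized void write(String event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) flush();   // real systems also flush on a timer
    }

    // Slow path: one batched round-trip to the backend store.
    public synchronized void flush() {
        if (buffer.isEmpty()) return;
        backendBatches.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    public synchronized int batchesWritten() { return backendBatches.size(); }

    public static void main(String[] args) {
        WriteBehindSketch store = new WriteBehindSketch(1000);
        for (int i = 0; i < 2500; i++) store.write("event-" + i);
        store.flush();                              // drain the remaining tail
        System.out.println(store.batchesWritten()); // 2500 events -> 3 batches
    }
}
```

The decoupling is also what enables the continuous-availability property discussed below the diagram: if the backend is down, events keep accumulating in the buffer and are flushed once it returns.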

MAIN FEATURES & BENEFITS

Performance: Maximum throughput is achieved by using in-memory devices and by distributing events between nodes and processing them in parallel. Writes to the database are batched and asynchronous, maximizing throughput to the underlying database; the built-in write-behind synchronization uses batches to speed up write performance.

Simplicity: To build your entire Facebook-like analytics system, all you need to write is your event-handler business logic. GigaSpaces takes care of performance, high availability, scalability, and deployment management.

Continuous Availability: Keeping the real-time and long-term parts decoupled makes it possible to continue serving real-time feeds even when the database is down. It also makes it easier to handle the planned downtime required for long-term data maintenance, such as re-sharding.

Cloud Enabled: Works with any private or public cloud, such as CloudStack, VMware, OpenStack, Amazon, Rackspace, and Azure.

Consistent Management: GigaSpaces cluster management offers built-in integration with many popular databases, such as MySQL, Postgres, Cassandra, and MongoDB, and with popular web platforms such as Tomcat, JBoss, and Node.js, enabling you to deploy the entire application stack with a single click.

Elasticity: Scaling is achieved by adding more machines, without any downtime.

Security: Access to the data is secured on both the feeder side and the analytics side. You can also define roles that control which data sets are accessible to specific users.

Transactionality and Consistency: All processing is done under transactions, ensuring the consistency and reliability of the data.

Openness: Choose any Big Data database (RDBMS or NoSQL), and plug in consistent management and monitoring across the stack without changing your code. Write event handlers in Java, .Net, Groovy, JavaScript, JRuby, and a large set of other dynamic languages, and access the data using standard SQL/JPA APIs.

REAL-TIME IN-MEMORY PROCESSING GRID AND BIG DATA STORAGE FEATURES

Real-Time Event Processing: Events are stored in memory. A built-in mechanism enables triggering of events based on SQL templates.

Standard Query: Users can access the data through a standard JPA/SQL interface.

Write/Read Dynamic Scalability: With a NoSQL backend data store, the system can grow with the data, reducing the costs associated with over-provisioning.

Built-In Pub/Sub: Remote clients and services can subscribe to the processed data directly, with no need for an additional messaging system.

Map/Reduce: Data correlation and aggregation are done through parallel query and code execution across the entire data grid.

Open Database Plug-In: Easily plug in different SQL and NoSQL databases without changing the application code. You can start with a SQL database at small scale and switch to NoSQL later, as your system grows.
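The Map/Reduce feature above follows a familiar pattern: compute a local aggregate in each partition in parallel (map), then combine the partial results (reduce). A minimal plain-Java sketch of that pattern, with partition contents invented for illustration (it is not the grid's execution API):

```java
import java.util.List;
import java.util.Map;

// Toy sketch of grid-side map/reduce aggregation: each partition
// produces a local aggregate in parallel, and the partials are summed.
public class GridMapReduceSketch {
    public static long totalRequests(List<Map<String, Long>> partitions) {
        return partitions.parallelStream()                       // "map" runs per partition
                .mapToLong(p -> p.getOrDefault("requests", 0L))  // local aggregate
                .sum();                                          // "reduce" combines partials
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partitions = List.of(
                Map.of("requests", 120L),
                Map.of("requests", 80L),
                Map.of("requests", 50L));
        System.out.println(totalRequests(partitions)); // 250
    }
}
```

In the grid, the map step is code shipped to and executed on each partition; only the small partial results cross the network, which is what makes aggregation over the entire data set fast.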

COST BENEFITS

Economic Data Scaling: Leverage commodity hardware and software-based storage to provide a large-scale data store at low cost. The solution: memory for short-term data, disk for long-term data. Combining memory and disk gives the optimum cost/performance ratio: according to Stanford research, memory is 10x-100x lower cost than disk for high data-access rates, while disk is lower cost for high capacity at lower access rates.

[Chart: cost vs. throughput. Below a crossover throughput, disk is cheaper; above it, RAM is cheaper; the optimum cost curve combines both.]

Example: processing 10K events per second and storing them for a window of one hour (until the data is pulled to long-term storage), at a 500-byte message size, requires only ~16 GB of memory, at a cost of ~$32 per month per server.

Economic App Scaling:
- Automation: reduce operational cost.
- Elastic scaling: reduce over-provisioning cost.
- Cloud portability: choose the right cloud for the job.
- Cloud bursting: scavenge extra capacity when needed.

Industry use cases that particularly need real-time insights from big data sets include:

- Social Networking: Measure the immediate impact on your site traffic from social media, whether a new blog post, a tweet, a Like, or even a comment. Knowing this information translates to better conversion and more effective online campaigns.
- SaaS: Measuring user behavior and acting upon it is crucial for improving customer satisfaction and conversion rates, which translate into immediate increases in revenue.
- Financial Services: Determining in real time whether your portfolio is losing money, or whether there is fraud in your system, means you can prevent disasters as they occur, not after the damage is done. Correlating multiple market sources in real time yields a more accurate view of the market and enables more accurate actions to maximize profit.
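The ~16 GB sizing figure above checks out arithmetically. A small sketch of the calculation (ignoring replication, indexing, and object overhead, which a real deployment would add on top):

```java
// Sizing check for the one-hour in-memory window:
// 10,000 events/sec * 500 bytes * 3,600 sec = 18 GB = ~16.8 GiB.
public class MemorySizingCheck {
    public static double windowGiB(long eventsPerSec, long bytesPerEvent, long windowSec) {
        double bytes = (double) eventsPerSec * bytesPerEvent * windowSec;
        return bytes / (1024.0 * 1024.0 * 1024.0);   // convert bytes to GiB
    }

    public static void main(String[] args) {
        System.out.printf("%.1f GiB%n", windowGiB(10_000, 500, 3_600)); // prints "16.8 GiB"
    }
}
```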
ABOUT GIGASPACES

GigaSpaces Technologies is the pioneer of a new generation of application virtualization platforms and a leading provider of end-to-end scaling solutions for distributed, mission-critical application environments and cloud-enabling technologies. GigaSpaces is the only platform on the market that offers a truly silo-free architecture, along with operational agility and openness, delivering enhanced efficiency, extreme performance, and always-on availability. Our technology was designed from the ground up to support any cloud environment (private, public, or hybrid) and offers a pain-free, evolutionary path from today's data center to the technologies of tomorrow.

GIGASPACES OFFICES WORLDWIDE
US East Coast Office, New York. Tel: +1-646-421-2830
US West Coast Office, San Jose. Tel: +1-408-878-6982
International Office, Tel Aviv. Tel: +972-9-952-6751
Europe Office, London. Tel: +44-207-117-0213
Asia Pacific Office, Singapore. Tel: +65-65497220