Implementing Operational Intelligence Using In-Memory Computing. William L. Bain June 29, 2015
2 Agenda
- What is Operational Intelligence?
- Example: Tracking Set-Top Boxes
- Using an In-Memory Data Grid (IMDG) for Operational Intelligence: tracking and analyzing live data; comparison to Spark
- Implementing OI Using Data-Parallel Computing in an IMDG
- A Detailed OI Example in Financial Services: code samples in Java
- Implementing MapReduce on an IMDG: optimizing MapReduce for OI
- Integrating Operational and Business Intelligence
ScaleOut Software, Inc. 2
3 About ScaleOut Software
Develops and markets in-memory data grids, software middleware for scaling application performance and providing operational intelligence using in-memory data storage and computing.
Dr. William Bain, Founder & CEO: career focused on parallel computing (Bell Labs, Intel, Microsoft); 3 prior start-ups, the last acquired by Microsoft, whose product now ships as Network Load Balancing in Windows Server.
Ten years in the market; 400+ customers, 10,000+ servers.
4 ScaleOut Software's Product Portfolio
- ScaleOut StateServer (SOSS): in-memory data grid for Windows and Linux; scales application performance; industry-leading performance and ease of use.
- ScaleOut ComputeServer: adds operational intelligence for live data; comprehensive management tools.
- ScaleOut hServer: full Hadoop Map/Reduce engine (>40X faster*); Hadoop Map/Reduce on live, in-memory data.
- ScaleOut GeoServer: WAN-based data replication for DR; global data access and synchronization.
*in benchmark testing
5 In-Memory Computing Is Not New
1980s: SIMD systems, e.g. Caltech Cosmic Cube, Thinking Machines Connection Machine
1990s: commercial parallel supercomputers, e.g. Intel iPSC/2, IBM SP1
6 What's New: IMC on Commodity Hardware
1990s to early 2000s: HPC on clusters, e.g. HP blade servers
Since ~2005: public clouds, e.g. Amazon EC2, Windows Azure
7 Introductory Video: What is Operational Intelligence?
8 Online Systems Need Operational Intelligence
Goal: provide immediate (sub-second) feedback to a system handling live data. A few example use cases requiring immediate feedback within a live system:
- E-commerce: personalized, real-time recommendations
- Healthcare: patient monitoring, predictive treatment
- Equity trading: minimize risk during a trading day
- Reservations systems: identify issues, reroute, etc.
- Credit cards & wire transfers: detect fraud in real time
- IoT, smart grids: optimize power distribution & detect issues
9 Operational vs. Business Intelligence
Operational Intelligence: real-time; live data sets; gigabytes to terabytes; in-memory storage; sub-second to seconds. Best uses: tracking live data, immediately identifying trends and capturing opportunities, providing immediate feedback.
Business Intelligence: batch; static data sets; petabytes; disk storage; minutes to hours. Best uses: analyzing warehoused data, mining for long-term trends.
(Diagram: the big data analytics spectrum, with IMDGs, CEP, and Storm on the OI side and Hadoop, Spark, and Hana on the BI side.)
10 Example: Enhancing the Cable TV Experience
Goals:
- Make real-time, personalized upsell offers
- Immediately respond to service issues
- Detect and manage network hot spots
- Track aggregate behavior to identify patterns, e.g.: total instantaneous incoming event rate; most popular programs and # of viewers by zip code
Requirements:
- Track events from 10M set-top boxes at 25K events/sec (2.2B/day)
- Correlate, cleanse, and enrich events per rules (e.g., ignore fast channel switches, match channels to programs)
- Feed enriched events to a recommendation engine within 5 seconds
- Immediately examine any set-top box (e.g., box status) & track aggregate statistics
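The sizing in these requirements can be sanity-checked with simple arithmetic. A back-of-the-envelope sketch (the 25K/sec and 10M-box figures come from the slide; the helper names are illustrative):

```java
public class EventRateCheck {
    // Events per day for a sustained events-per-second rate (86,400 sec/day).
    static long eventsPerDay(long eventsPerSecond) {
        return eventsPerSecond * 60L * 60L * 24L;
    }

    // Average events per box per hour, given the fleet size.
    static double eventsPerBoxPerHour(long eventsPerSecond, long boxes) {
        return eventsPerSecond * 3600.0 / boxes;
    }

    public static void main(String[] args) {
        // 25K events/sec -> 2,160,000,000/day, i.e. the slide's ~2.2B/day.
        System.out.println(eventsPerDay(25_000));
        // Across 10M boxes, that is a modest 9 events per box per hour.
        System.out.println(eventsPerBoxPerHour(25_000, 10_000_000));
    }
}
```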
11 The Result: An OI Platform
Based on a simulated workload for the San Diego metropolitan area:
- Continuously correlates and cleanses telemetry from 10M simulated set-top boxes (from a synthetic load generator)
- Processes more than 30K events/second
- Enriches events with program information every second
- Tracks aggregate statistics (e.g., top 10 programs by zip code) every 10 seconds
(Screenshot: real-time dashboard.)
12 Using an IMDG to Implement OI
An IMDG models and tracks the state of a live system, then analyzes that state in parallel and provides real-time feedback:
- The IMDG tracks the live system's state with an in-memory, object-oriented model.
- The IMDG enriches the in-memory model from disk-based, historical data.
- The IMDG analyzes in-memory data with an integrated compute engine.
13 Example: Tracking Set-Top Boxes
- Each set-top box is represented as an object in the IMDG; the object holds raw & enriched event streams, viewer parameters, and statistics.
- The IMDG captures incoming events by updating objects.
- The IMDG uses data-parallel computation to immediately enrich box objects, to generate alerts to the recommendation engine, and to continuously collect and report global statistics.
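The per-box object described above can be sketched as a plain Java class. This is an illustration only; the field names, event representation, and statistics are assumptions, not the actual ScaleOut schema:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of one set-top box as an IMDG object: raw and enriched event
// streams, viewer parameters, and running statistics, updated in place
// as telemetry arrives.
public class SetTopBox {
    private final long boxId;          // viewer parameters (illustrative)
    private final String zipCode;
    private final Deque<String> rawEvents = new ArrayDeque<>();      // incoming telemetry
    private final Deque<String> enrichedEvents = new ArrayDeque<>(); // after cleansing/enrichment
    private final Map<String, Long> secondsViewedByProgram = new HashMap<>();

    public SetTopBox(long boxId, String zipCode) {
        this.boxId = boxId;
        this.zipCode = zipCode;
    }

    // Capture an incoming event by updating the object.
    public void recordEvent(String event) {
        rawEvents.addLast(event);
    }

    // Enrichment step: e.g., match a channel-change event to program info
    // and update per-program viewing statistics.
    public void enrich(String event, String program, long secondsViewed) {
        enrichedEvents.addLast(event + ":" + program);
        secondsViewedByProgram.merge(program, secondsViewed, Long::sum);
    }

    public Map<String, Long> viewingStats() {
        return secondsViewedByProgram;
    }
}
```

A data-parallel pass over all such objects (as the slide describes) would then aggregate `viewingStats()` by zip code for the global statistics.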
14 The Foundation: In-Memory Data Grids
An in-memory data grid (IMDG) provides scalable, highly available storage for live data.
Designed to manage business-logic state:
- Object-oriented collections by type
- Create/read/update/delete APIs for Java/C#/C++
- Parallel query by object properties
- Data shared by multiple clients
Designed for transparent scalability and high availability:
- Automatic load-balancing across commodity servers
- Automatic data replication, failure detection, and recovery
IMDGs provide an ideal platform for operational intelligence: easy to track live systems with large workloads, and an appropriate availability model for production deployments.
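The storage model above (keyed CRUD plus query by object properties) can be modeled in a few lines of plain Java. This is a conceptual stand-in, not the ScaleOut client API; in a real IMDG the map is partitioned and replicated across servers, and the query runs in parallel on each partition:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Conceptual model of an IMDG object collection: one collection per type,
// create/read/update/delete by key, and parallel query by property.
public class ObjectCollection<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();

    public void create(K key, V value) { store.put(key, value); }  // create/update
    public V read(K key)               { return store.get(key); }  // read
    public void delete(K key)          { store.remove(key); }      // delete

    // Query by object properties; a grid evaluates the filter in parallel
    // on each server against its local partition.
    public List<V> query(Predicate<V> propertyFilter) {
        return store.values().parallelStream()
                    .filter(propertyFilter)
                    .collect(Collectors.toList());
    }
}
```

For example, a portfolio query like the one later in this deck would look like `portfolios.query(p -> p.getRegion() == Region.US)` (hypothetical usage of this sketch, not the real API call).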
15 Comparing IMDGs to Spark
On the surface, both are surprisingly similar:
- Both designed as scalable, in-memory computing platforms
- Both implement data-parallel operators
- Both can handle streaming data
But there are key differences that impact use for operational intelligence:
- Best use: IMDGs: live, operational data. Spark: static data or batched streams.
- In-memory model: IMDGs: object-oriented collections. Spark: resilient distributed datasets.
- Focus of APIs: IMDGs: CRUD, eventing, data-parallel computing. Spark: data-parallel operators for analytics.
- High availability tradeoffs: IMDGs: data replication for fast recovery. Spark: lineage for maximum performance.
16 Data-Parallel Computing on an IMDG
IMDGs provide a powerful, cost-effective platform for data-parallel computing:
- Enable integrated computing with data storage: take advantage of the cluster's commodity servers and cores; avoid delays due to data motion (both to/from disk and across the network).
- Leverage the object-oriented model to minimize development effort: easily define data-parallel tasks as class methods; easily specify the domain as an object collection.
Example: Parallel Method Invocation (PMI), an object-oriented version of the standard HPC model:
- Runs class methods in parallel across the cluster to analyze data ("eval").
- Selects objects using a parallel query of an object collection.
- Combines results in parallel ("merge").
- Serves as a platform for implementing MapReduce and other data-parallel operators.
17 PMI Example: OI in Financial Services
Goal: track market price fluctuations for a hedge fund and keep portfolios in balance.
How:
- Keep portfolios of stocks (long and short positions) in an object collection within the IMDG.
- Collect market price changes in one-second snapshots.
- Define a method which applies a snapshot to a portfolio and optionally generates an alert to rebalance.
- Perform repeated parallel method invocations on a selected (i.e., queried) set of portfolios.
- Combine alerts in parallel using a second user-defined method.
- Report alerts to the UI every second for the fund manager.
18 Defining the Dataset
Simplified example of a portfolio class (Java). Note: some properties are made query-able, and the evalPositions method analyzes the portfolio for a market snapshot.

public class Portfolio {
    private long id;
    private Set<Stock> longPositions;
    private Set<Stock> shortPositions;
    private double totalValue;
    private Region region;
    private boolean alerted;  // alert for trading

    // query-able property
    public double getTotalValue() { return totalValue; }

    // query-able property
    public Region getRegion() { return region; }

    // analyzes the portfolio for a market snapshot
    public Set<Long> evalPositions(MarketSnapshot ms) { ... }
}
19 Defining the Parallel Methods
Implement the PMI interface to define methods for analyzing each object and for merging the results:

public class PortfolioAnalysis
        implements Invokable<Portfolio, MarketSnapshot, Set<Long>> {

    public Set<Long> eval(Portfolio p, MarketSnapshot ms) throws InvokeException {
        // update the portfolio and return its id if alerted:
        return p.evalPositions(ms);
    }

    public Set<Long> merge(Set<Long> set1, Set<Long> set2) throws InvokeException {
        set1.addAll(set2);
        return set1;  // merged set of alerted portfolio ids
    }
}
20 Running the Analysis
PMI can be run from a remote workstation. The IMDG ships code and libraries to the cluster of servers:
- The execution environment can be pre-staged for fast startup.
- In-line execution minimizes scheduling time and avoids batch-scheduling delays.
PMI automatically runs in parallel across all grid servers:
- Uses software multicast to accelerate startup.
- Passes the market snapshot parameter to all servers.
- Uses all servers and cores to maximize throughput.
21 Spawning the Compute Engine
First obtain a reference to the IMDG's object collection of portfolios:

NamedCache pset = CacheFactory.getCache("portfolios");

Then create an invocation grid, a re-usable compute engine for the application. This spawns a JVM on all grid servers and connects them to the in-memory data grid, stages the application code on all JVMs, and associates the invocation grid with an object collection:

InvocationGrid grid = new InvocationGridBuilder("grid")
    .addClass(DependencyClass.class)
    .addJar("/path/to/dependency.jar")
    .setJVMParameters("-Xmx2m")
    .load();
pset.setInvocationGrid(grid);
22 Invoking the PMI
Run the PMI on a queried set of objects within the collection. This multicasts the invocation and parameters to all JVMs, runs the data-parallel computation, then merges the results and returns a final result to the point of call:

InvokeResult alertedPortfolios = pset.invoke(
    PortfolioAnalysis.class,
    Portfolio.class,
    and(greaterThan("totalValue", ...),  // query spec
        equals("region", Region.US)),
    marketSnapshot,                      // parameters
    ...);
System.out.println("The alerted portfolios are " + alertedPortfolios.getResult());
23 Execution Steps
Eval phase: each server queries its local objects and runs the eval and merge methods. Accessing local data avoids networking overhead. The phase completes with one result object per server.
Merge phase: all servers perform a distributed merge to create the final result. The merge runs in parallel to minimize completion time and returns the final result object to the client.
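The distributed merge works because the user-supplied merge operation here (set union, as in PortfolioAnalysis.merge) is associative and commutative, so per-server partial results can be combined pairwise in any order. A minimal plain-Java sketch of that property (not grid code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Each server's eval pass yields one partial result (a set of alerted
// portfolio ids); partials are then merged pairwise. Because set union is
// associative and commutative, the parallel merge tree can combine them
// in any order and still produce the same final result.
public class EvalMerge {
    // Mirror of the merge method: set union.
    static Set<Long> merge(Set<Long> a, Set<Long> b) {
        Set<Long> out = new HashSet<>(a);
        out.addAll(b);
        return out;
    }

    // Combine one partial result per server into the final result.
    static Set<Long> mergeAll(List<Set<Long>> perServerPartials) {
        Set<Long> result = new HashSet<>();
        for (Set<Long> partial : perServerPartials) {
            result = merge(result, partial);
        }
        return result;
    }
}
```

Any merge method supplied to the engine should satisfy the same property, since the engine chooses the combining order.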
24 Importance of Avoiding Data Motion
Local data access enables linear throughput scaling. Network access creates a bottleneck that limits throughput.
25 Outputting Continuous Alerts to the UI
The PMI runs every second and completes in 350 msec, immediately refreshing the UI. The UI alerts the trader to portfolios that need rebalancing and allows the trader to examine portfolio details to determine which specific positions are out of balance. Result: in-memory computing delivers operational intelligence.
26 Demonstration Video: Comparison of PMI to Apache Hadoop
27 PMI Scales for Large In-Memory Datasets
Measured a similar financial services application (back-testing stock trading strategies on stock histories):
- Hosted the IMDG in Amazon EC2 using 75 servers holding 1 TB of stock history data in memory.
- The IMDG handled a continuous stream of updates (1.1 GB/s).
Results: analyzed 1 TB in 4.1 seconds (about 250 GB/s), with linear scaling observed as the dataset and update rate grew.
28 Using PMI to Implement MapReduce for OI
PMI serves as a foundational platform for MapReduce and other parallel operators. MapReduce is implemented with two PMI phases:
- Runs standard Hadoop MapReduce applications.
- Data can be input from either the IMDG or an external data source; works with any input/output format.
- The IMDG uses PMI phases to invoke the mappers and reducers, eliminating batch-scheduling overhead.
- Intermediate results are stored within the IMDG, minimizing data motion in the shuffle phase; sorting is optional.
Note: the output of a single reducer/combiner optionally can be globally merged.
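The two-phase structure can be sketched in plain, single-process Java (a word-count style illustration; the real engine runs each phase data-parallel via PMI and keeps the shuffled intermediates inside the grid):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual two-phase MapReduce: phase 1 invokes the mappers over input
// splits and groups intermediate key/value pairs in memory (the shuffle
// lives in the grid, so no disk spill); phase 2 invokes the reducers over
// each key's grouped values.
public class TwoPhaseMapReduce {
    static Map<String, Integer> wordCount(List<String> lines) {
        // Phase 1: map + in-memory shuffle (group emitted values by key).
        Map<String, List<Integer>> shuffled = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // Phase 2: reduce each key's value list to a single output value.
        Map<String, Integer> out = new HashMap<>();
        shuffled.forEach((word, ones) ->
            out.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }
}
```

In the grid version, each server maps its local splits and reduces the keys it owns, so both phases follow the same eval/merge pattern shown earlier for PMI.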
29 MapReduce for OI Requires a New Data Model
IMDGs historically implement a feature-rich data model:
- Efficiently manages large objects (KBs to MBs).
- Supports object timeouts, locking, query by properties, dependency relationships, etc.
MapReduce typically targets very large collections of small key/value pairs:
- Does not require rich object semantics.
- Does require efficient storage (minimal metadata) and highly pipelined access.
Solution: a new IMDG data model for MapReduce:
- Uses standard Java named map APIs for access.
- MapReduce uses standard input/output formats.
- Stores data in chunks and pipelines it to/from the engine.
- Automatically defines splits for the mappers and holds shuffled data for the reducers.
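The chunked-storage idea can be illustrated in a few lines: many small key/value pairs are packed into fixed-size chunks (minimal per-pair metadata), and each chunk then serves as one input split for a mapper. The class and method names below are illustrative, not the actual named map API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Why a chunked data model suits MapReduce: pairs are appended into
// fixed-size chunks rather than stored as individual grid objects, and
// the chunk list doubles as the split list handed to the mappers.
public class ChunkedMap<K, V> {
    private final int chunkSize;
    private final List<List<Map.Entry<K, V>>> chunks = new ArrayList<>();

    public ChunkedMap(int chunkSize) {
        this.chunkSize = chunkSize;
    }

    public void put(K key, V value) {
        // Start a new chunk when the current one is full (or none exists).
        if (chunks.isEmpty() || chunks.get(chunks.size() - 1).size() == chunkSize) {
            chunks.add(new ArrayList<>());
        }
        chunks.get(chunks.size() - 1).add(new SimpleEntry<>(key, value));
    }

    // One chunk per split, pipelined to/from the compute engine.
    public List<List<Map.Entry<K, V>>> splits() {
        return chunks;
    }
}
```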
30 Optimizing MapReduce for OI: simplemr
Integrate an in-memory named map with MapReduce to minimize execution time. A new API (simplemr, in Java and C#) simplifies apps and removes Hadoop dependencies (C# shown):

public class Mapper : IMapper<int, string, string, int>
{
    void IMapper<int, string, string, int>.Map(int key, string value,
                                               IContext<string, int> context)
    {
        ...
        context.Emit(Encoding.ASCII.GetString(...), 1);
    }
}

inputMap = new NamedMap<int, string>("input_map");
outputMap = new NamedMap<string, int>("output_map");
inputMap.RunMapReduce<string, int, string, int>(outputMap,
    new Mapper(), new Combiner(), new Reducer(), ...);
31 Integrating OI and BI in the Data Warehouse
In-memory data grids can add value to a BI platform, e.g.:
- Transform live data and store it in HDFS for analysis (an ETL example).
- Provide immediate feedback to the live system pending deep analysis.
Using YARN, an IMDG can be directly integrated into a BI cluster:
- The IMDG holds fast-changing data.
- YARN directs MapReduce jobs to the IMDG.
- The IMDG can output results to HDFS.
32 Recap: In-Memory Computing for OI
Online systems need operational intelligence on live data for immediate feedback; this creates important new business opportunities. Operational intelligence can be implemented using standard data-parallel computing techniques. In-memory data grids provide an excellent platform for operational intelligence:
- Model and track the state of a live system.
- Implement high availability.
- Offer fast, data-parallel computation for immediate feedback.
- Provide a straightforward, object-oriented development model.
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
CAPTURING & PROCESSING REAL-TIME DATA ON AWS
CAPTURING & PROCESSING REAL-TIME DATA ON AWS @ 2015 Amazon.com, Inc. and Its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent
An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise
An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise Solutions Group The following is intended to outline our
Unified Batch & Stream Processing Platform
Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems Current Solutions Were Built
Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks
Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013
FINANCIAL SERVICES: FRAUD MANAGEMENT A solution showcase
FINANCIAL SERVICES: FRAUD MANAGEMENT A solution showcase TECHNOLOGY OVERVIEW FRAUD MANAGE- MENT REFERENCE ARCHITECTURE This technology overview describes a complete infrastructure and application re-architecture
Big Data Processing: Past, Present and Future
Big Data Processing: Past, Present and Future Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. [email protected] [email protected] @OrionGM
Safe Harbor Statement
Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP
PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP Your business is swimming in data, and your business analysts want to use it to answer the questions of today and tomorrow. YOU LOOK TO
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
Please give me your feedback
Please give me your feedback Session BB4089 Speaker Claude Lorenson, Ph. D and Wendy Harms Use the mobile app to complete a session survey 1. Access My schedule 2. Click on this session 3. Go to Rate &
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Scalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
INTRODUCING APACHE IGNITE An Apache Incubator Project
WHITE PAPER BY GRIDGAIN SYSTEMS FEBRUARY 2015 INTRODUCING APACHE IGNITE An Apache Incubator Project COPYRIGHT AND TRADEMARK INFORMATION 2015 GridGain Systems. All rights reserved. This document is provided
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
In-memory computing with SAP HANA
In-memory computing with SAP HANA June 2015 Amit Satoor, SAP @asatoor 2015 SAP SE or an SAP affiliate company. All rights reserved. 1 Hyperconnectivity across people, business, and devices give rise to
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Big Data for Investment Research Management
IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment Management firms turn big data into actionable
Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
