Real-time Ad-hoc Analytics on S3 with MemSQL

Satish Cattamanchi, 4INFO
Sarvesh Gupta, Tavant Technologies
September 2015

ABSTRACT

Enterprises are witnessing a rapid increase in data volume as business volumes grow. At the same time, there is rising demand for immediate results and readily available metrics, delivered cost-effectively, to support strategic decisions. This paper shares key insights from our experiment with the MemSQL in-memory database to solve the challenge of querying huge data volumes from Amazon S3 in real time, enabling interactive ad-hoc queries. The paper outlines the details of the experiment along with benchmark details: the size of the data set, cluster configuration, types of queries, and the tuning methods used to reduce response time. The initial benchmark results indicate tremendous potential in MemSQL as a scalable real-time database engine with high data ingestion speeds, accessible through a familiar SQL interface. Towards the end of the paper, the roadmap section highlights further opportunities with MemSQL.

Keywords: real-time analytics, in-memory database, big data, ad-hoc query

INTRODUCTION

The advent of open-source Hadoop, with its ability to scale to massive amounts of data, has democratized big data and analytic capabilities for all companies to exploit. The possibility of processing such huge volumes of data has given rise to newer use cases, one of them being real-time processing to unlock the true business potential of relevant businesses. Being a data-driven company, we play a big role in enabling our teams to gain insights from a multi-terabyte data warehouse (DW). Our use cases range from analyzing campaign performance to analyzing audience behavior and applying statistical models for proprietary insights. The underpinning of our current technical architecture is that we leverage the Amazon S3 web service for our DW, which allows us to isolate our storage and compute layers.
Our team writes Hive queries or Python scripts to perform ETL and analytics, the output of which is transferred to MySQL for reporting. Beyond this, we also need to analyze our broader data set on S3, consisting of billions of requests and events generated every day, and to equip the team with an infrastructure that enables interactive querying and real-time analytics while avoiding the delays of a batch processing pipeline running on Hadoop. The time to get results is critical to our business needs. We came across the MemSQL in-memory database, which looked promising, with strong industry backing and some big implementations. Together with the technology and consulting company Tavant Technologies, we evaluated MemSQL as a viable option for this use case.

CHALLENGE

With tremendous growth in ad traffic, our server log volume skyrocketed within a few months. To keep innovating, the team needed to analyze high volumes of data consisting of billions of data points in real time to produce insights and do strategic planning. Although a robust data pipeline existed using the Amazon EMR service and Spark, the sheer volume and velocity of data made it too slow for the team to take full advantage of it. We faced multiple challenges in collecting and processing billions of records with real-time interactive response. We strongly believe in the importance of building a data-driven culture that gives as many people as possible within the company broad access to data for discovery and analysis. However, that requires a widely learned and accepted form of data access. Traditionally, the use of SQL to connect with BI tools or for interactive analysis with dashboards has made SQL the common thread that binds all.
Since a significant portion of our data contains some form of structure that can be accessed via the SQL paradigm, we wanted to find a solution with a SQL-like interface and real-time capabilities to meet our needs.

REQUIREMENTS

In the existing setup, we run a technical stack powered by the Amazon EMR service. Processing billions of records is slow: the response time to generate reports is anywhere between 5 and 15 hours. The current infrastructure is unable to meet the performance and scalability requirements at 4INFO. The objective behind implementing this technical stack is to ensure that we can get results in near real time. This will allow our ad ops/research team to take quick actions for tuning campaign performance, as well as provide early insight into the relevant metrics for our partners. While doing this exercise, we wanted to cover some frequent simple as well as complex scenarios in order to validate the usefulness of the solution. With this aim, we selected different use cases which we ran on our existing setup to gather performance numbers for the new solution. As per Figure 1, we have the infrastructure to support our data mining needs; however, we also want to fulfill the needs of interactive queries and data analyst queries.

Figure 1: Expected response times per use case

With the rise in popularity of SQL-on-Hadoop solutions like Impala, Presto, and HAWQ, choosing one of them could have been one path to travel. Ideally, though, we wanted an open-source interactive query engine that can query the S3 warehouse without the need to spin up a Hadoop cluster. After narrowing down our objective, we brainstormed the key factors that would drive our decision to adopt a solution, and came up with the following list of requirements to zero in on a technology:

Low latency/real-time: This factor is critical to the whole objective of having a platform that can answer questions without waiting hours or days. Low latency not only improves productivity but also removes the apprehension that discourages asking questions.

Ad-hoc queries: The goal of setting up this pipeline is to enable the data team to perform ad-hoc queries, which implies the team should be able to connect and fire queries for any use case at the desired level of granularity. Thus, it is ideal to stay close to the Structured Query Language (SQL) paradigm, as the users of our system do not belong to the data science world but are ones who leverage decades of knowledge and experience in SQL-style querying, generating enormous business value.

Scalable: Our business volume has grown significantly in the recent past and is expected to grow manifold in the near future. Hence, it is important for us to design the system to be robust and scalable, so that we can sustain even higher data workloads without re-engineering the platform.

Ease of use: As with ad-hoc queries, ease of use is important to ensure that implementation and maintenance are hassle-free. This covers installation and scaling of the servers, proper documentation and support, and intuitive tools to connect to and query the data.

TECHNOLOGIES VISITED

To solve this challenge, we explored multiple options. A few of the key initial attempts were with the following technologies:

Druid [4]: Druid is an open-source analytics data store designed for real-time exploratory analytics on large data sets. Druid provides low-latency (real-time) data ingestion, flexible data exploration, and fast data aggregation.

MemSQL [1]: MemSQL combines transactions and analytics for sub-second data processing and reporting. Enterprises can easily build real-time applications to instantly respond to changes in their business, make better decisions, and deliver precision insights with real-time reporting.

Pivotal GemFire [5]: Pivotal GemFire is a data management platform that provides real-time, consistent access to data-intensive applications throughout widely distributed cloud architectures.

Apache Drill [6]: Drill is an Apache open-source SQL query engine for big data exploration. Drill is designed from the ground up to support high-performance analysis of the semi-structured and rapidly evolving data coming from modern big data applications, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language.

We did a deeper dive with Apache Drill, installing it on our infrastructure, as we were seeing a lot of buzz around it. Drill includes a distributed execution environment, purpose-built for large-scale data processing. Setting up Drill is quite straightforward: we spun up a cluster with 24 Drillbit nodes and 1 ZooKeeper instance. Getting started with querying is really simple; we just created a storage profile pointing to S3 and started querying. Ingestion is intuitive, as you do not need to create tables or schemas beforehand: you can use the profile and fire queries from the command line. In our exploratory attempt with Apache Drill, however, we were not able to achieve good performance. Standard single-table queries against our request and segment tables took really long (hours) at the data sizes mentioned above. We contacted the Apache Drill community as well, and it seems Drill's support for querying S3 is still evolving. An alternative option is to run Apache Drill over a Hadoop cluster; however, we wanted to skip a Hadoop cluster. We will still be evaluating it on a parallel track and trying to get better performance. Apart from the above technologies, we also had Amazon Redshift and Presto on the radar, but we were not much inclined towards them for reasons such as dependency on the Amazon stack and the recommended use of the Parquet file format for significant performance improvements.

Why MemSQL?

After these initial attempts to look closely at the above set of technologies, we shortlisted MemSQL.
There were multiple reasons behind this:

- The architecture and setup are quite intuitive
- The query syntax is ANSI SQL
- Strong industry backing and success stories (Comcast, Pinterest, and many more)
- Being wire-compliant with MySQL, as they put it, connecting to your data is seamless
- It blends with our data structure (gzipped flat files) and does not need transformation for data ingestion
- It allows reading directly from the file system
- It can be deployed as an in-house solution

BENCHMARKING

Every user's specific applications or workloads are unique, so no performance benchmark can be termed perfect. Therefore, in this performance benchmark we chose to limit variables and focus on a straightforward comparative: real-time analytics on S3. We used open-source and community editions of software with data loaded in standard file formats, while striving to keep other factors constant.

MEMSQL OVERVIEW

MemSQL has a two-tiered, clustered architecture. Each instance of the MemSQL program is called a node, and all nodes run identical software; the only difference is the role the nodes are configured to play. Aggregator nodes provide a single interface to database clients and applications: aggregators broker SQL queries to the cluster and aggregate results. Leaf nodes store and process data. Communication between leaves and aggregators is all over standard SQL.

Figure 2: MemSQL Architecture [1]

SETUP

To set up MemSQL, we used the MemSQL CloudFormation template to launch a cluster with the MemSQL community edition. MemSQL can also be set up from the tar file and gotten up and running quickly, and it is quite simple to expand and shrink the cluster using the web UI. However, the CloudFormation template is recommended by the MemSQL team, as it configures a cluster following their best-practice recommendations; the cluster generator tool automatically provisions and configures a MemSQL cluster using Amazon CloudFormation.
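Once a cluster is up, its two-tier topology can be inspected from an aggregator with standard MemSQL management commands. A minimal sketch (the exact output columns vary by version):

```sql
-- List the master and child aggregator nodes known to the cluster
SHOW AGGREGATORS;

-- List the leaf nodes along with their host, port, and state
SHOW LEAVES;
```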
BENCHMARK #1

DATA SET

The data set comprised three types of ad-server logs: request, event, and metadata logs. We picked a canned date range to select the test data set. The details of the data set are as follows:

  S.No.  File Type     Number of records  Compressed size (in TB)  Number of fields
  1      Request Log   4,641,205,
  2      Event Log     126,201,
  3      Metadata Log  3,245,504,

After loading the data, all queries were run manually, simulating single-user query concurrency with no other activity. Performance was based on the average of 5 runs through the interactive command line, without any third-party access tools.

CLUSTER

  Option                         Value
  Number of Master Aggregators   1
  Number of Child Aggregators    8
  Number of Leaf nodes           20
  EBS Volume Size (in gigabytes) 2x80 (SSD)
  EC2 Instance Type              m3.2xlarge

INGESTING DATA

MemSQL supports two types of stores: a completely in-memory rowstore and a disk-backed columnstore. Since our data volume was huge, storing all of it in memory would be an expensive proposition. Therefore, we chose to store the bigger logs (request and metadata) in columnstore tables, whereas for the event log we created a rowstore table. It is important to select the right keys for your tables for optimum performance. For rowstore tables, select a primary key that uniquely identifies a record and secondary keys on the columns to be filtered upon. For columnstore tables, you specify a columnstore key, and a shard key for a more even distribution that prevents skew.

Since our data is stored on S3, we used MemSQL Loader, an open-source tool that loads sets of files from Amazon S3. MemSQL Loader also takes care of deduplicating files, parallelizing the workload, retrying files that fail to load, and more.

TOOLS TO INTERACT WITH THE CLUSTER

One of the benefits of MemSQL is that it is wire-compliant with MySQL, i.e. connecting is as easy as connecting to a MySQL database. Hence, you can use any MySQL client to connect to your MemSQL cluster. For this exercise, we used the command line from the master aggregator to create tables and execute queries.

KEY TAKEAWAYS/INSIGHTS

With MemSQL, we were able to achieve significant performance improvements. We are excited to share the results below.

INGESTION TIME

For data ingestion, you need to execute the data loading process from the aggregator nodes; thus, having more child aggregator nodes helps parallelize the loading process. We were able to load the entire data set (3 TB) within 2 hours. The rate of ingestion on each aggregator ranged between MB/sec.

QUERY PERFORMANCE

We present results for multiple queries, ranging from simple aggregations over a single table to complex queries with three-way joins.

Query 1: Standard computation over a single table

  select count(*) from table;

Query 2: 3-way join (2 columnstore and 1 rowstore)

To achieve this result, we initially stored all 3 logs in columnstore tables, but the performance was really poor. As per the MemSQL team, 3-way joins are not yet supported for columnstore tables and are still evolving. However, a recommended approach is to restrict yourself to 2 large-volume columnstore tables if they are to be used in a single query and keep the other tables in memory. Thus, we separated the less bulky event log into a rowstore table. As a result, we were able to query the data in quick time. Below is the query for the exercise:

  select c1, c2, c3
  from table1 t1
  join table2 t2 on t1.request_id = t2.request_id
  join table3 t3 on t2.request_id = t3.request_id
  group by c1, c2, c3;

Note that the query also used multiple MemSQL functions, such as crypto functions (MD5) and string functions (SUBSTR and CONCAT), in the select clause.

Query 3: 2-way join (1 rowstore and 1 columnstore)

PURGING DATA

The time required to purge data is a critical measure as well, because in production the pipeline must be reused for a recurring use case over a sliding window. To purge the data, we fired delete statements, with the following results:

  memsql> delete from metadata_log;
  Query OK, rows affected (8.39 sec)

  memsql> delete from request_log;
  Query OK, rows affected (10.78 sec)

  memsql> delete from event_log;
  Query OK, rows affected (13.43 sec)

However, we noticed that deleting data from the in-memory rowstore tables takes more time than from their columnstore counterparts.

TUNING

While querying the columnstore tables using a three-way join, we noticed it was consuming unreasonable time. This is when we came across the recommendation to keep fewer row segment groups in a columnstore table for better performance. After inspecting the state of our columnstore tables, we found a large number of row segment groups, probably because the optimistic merger could not catch up under the heavy workload. Thus, we triggered the pessimistic merger manually using the OPTIMIZE TABLE <tablename> command. We also connected with folks at MemSQL, who were really cooperative in resolving our queries; based on their recommendations, we made some changes to our queries to optimize the joins. Moreover, for rowstore tables it is recommended to use secondary keys in addition to the primary key to allow quicker filtering, i.e. you can add the columns frequently used in the where clause as secondary keys; multiple secondary keys can be specified as well.

BENCHMARK #2

After the first benchmark, we wanted to push the limits and test on larger data to establish our faith in MemSQL. As a result, we increased the overall data volume by 6x.

DATASET

  S.No.  File Type     Number of records  Compressed size (in TB)  Number of fields
  1      Request Log   29,064,598,
  2      Event Log     142,638,
  3      Metadata Log  19,664,562,

Based on the volume of the dataset, we spun up a new cluster for this benchmark with r3.4xlarge instances; there were a few experiments before settling on the final cluster to hold the data. After loading the data into MemSQL, we noticed that the columnstore engine compresses the data significantly: we saw a 50% compression ratio for our dataset, which is a huge benefit in terms of capacity planning for our clusters and, in turn, cost savings.

CLUSTER

  Option                         Value
  Number of Master Aggregators   1
  Number of Child Aggregators    4
  Number of Leaf nodes           18
  EBS Volume Size (in gigabytes) 1024 (EBS)
  EC2 Instance Type              r3.4xlarge

We spun up raw Amazon EC2 machines without installing any additional software except memsql-ops and memsql-loader (the latter only on aggregator nodes).

INGESTING DATA

To ingest this huge volume of data, we used MemSQL Loader again. The data ingest speed was very impressive (including the time to download the files from S3, extract them, and load them). But we faced some challenges on this front:

1. Lack of orchestration of the data loading process between master and child aggregators, i.e. each aggregator node is isolated from the others. Thereby, to maximize utilization of the aggregator nodes, we had to manually execute load statements on these machines separately. One possible solution to this problem seems to be using a load balancer to distribute the tasks across the aggregator nodes.

2. While loading huge data volumes, as mentioned in the tuning section above, the optimistic merger process falls behind, due to which several row segment groups get created for the columnstore tables. To extract maximum benefit from the sharding and enable faster queries, we had to trigger the pessimistic merger. The time taken to optimize one table was more than the loading time for the request table and metadata table combined. This indicates that loading data regularly is a better option to reduce the overall time to get data ready for querying.

QUERY PERFORMANCE

We ran multiple tests with different MemSQL functions. One of the noticeable results was with the count distinct query. This is a classic example of data going to the computation, due to which calculating an exact count of distinct values (i.e., the cardinality) of a big table can consume large amounts of memory and time.
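Concretely, the exact and approximate forms of such a cardinality query can be sketched as follows; the table and column names here are illustrative, not from our actual schema:

```sql
-- Exact cardinality: every distinct value must be tracked,
-- which is memory-heavy on a table with billions of rows
SELECT COUNT(DISTINCT device_id) FROM request_log;

-- Probabilistic cardinality via a HyperLogLog variant: bounded memory,
-- at the cost of a small relative error in the estimate
SELECT APPROX_COUNT_DISTINCT(device_id) FROM request_log;
```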
We ran this query on the request table and, as expected, it took a huge amount of time. To improve on this, MemSQL provides another aggregate function, APPROX_COUNT_DISTINCT, which gives a probabilistic count of the number of distinct values in the given column or expression, using a variant of the HyperLogLog algorithm. Below are the results of the execution on the metadata table:

  Function               Function Type  Time Unit  Response Time
  COUNT DISTINCT         Aggregate      Minutes    90
  APPROX COUNT DISTINCT  Aggregate      Minutes    25

We ran the same test on the request table, with the following results:

  Function               Function Type  Time Unit  Response Time
  COUNT DISTINCT         Aggregate      Minutes    FAILED (due to memory constraints)
  APPROX COUNT DISTINCT  Aggregate      Minutes    13

For the 3-way join query mentioned in Benchmark #1, the response time was 7 minutes. However, loading the output into an in-memory rowstore table took 140 minutes.

TABLEAU + MEMSQL

After querying the data from the command line, we thought it would be great to test it using a self-service BI tool like Tableau, which would enable the team to visualize patterns and generate quick reports as well. We connected Tableau to the MemSQL tables, and the rendering time for the charts was within seconds. This is in addition to the time taken to connect to the tables, i.e. the duration of Tableau loading the metadata:

  Columnstore Table: 2-6 minutes
  Aggregated In-memory Table: 1-2 minutes

Once the tables were connected, we were able to drag measures for standard aggregations, taking no more than seconds. As a next step, we sliced and diced the data by different dimensions and rendered various types of charts, with execution time staying under a minute; in fact, most of the charts rendered within 5-20 seconds.

Figure 3: Campaign Performance by Ads
Figure 4: Campaign Performance by Device OS (Geo)
Figure 5: Campaign Performance by Mobile Brand/Model

To speed up the rendering, we collected the aggregated output of the queries in rowstore tables in MemSQL and used them as the data source for our charts. As a result, the charts rendered almost instantly.
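The pre-aggregation step above can be sketched as follows, assuming a hypothetical chart of impressions per campaign and event type; the small rowstore table then serves as a fast data source for Tableau:

```sql
-- Hypothetical aggregate rowstore table used as a Tableau data source
CREATE TABLE campaign_perf_agg (
  campaign_id BIGINT NOT NULL,
  event_type VARCHAR(32) NOT NULL,
  impressions BIGINT,
  PRIMARY KEY (campaign_id, event_type)
);

-- Materialize the chart query once; Tableau reads the small result table
INSERT INTO campaign_perf_agg
SELECT campaign_id, event_type, COUNT(*)
FROM request_log
GROUP BY campaign_id, event_type;
```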

MemSQL has many more features than we could explore in the available bandwidth. The current setup is in the prototyping phase, but we are working closely to evaluate the use of MemSQL across a varied set of use cases for our production environments.

Figure 6: Campaign Performance by Event Type

The same reports and charts were also run for the larger data set loaded in Benchmark #2. The performance was better than in the first case with the in-memory aggregated table.

CONCLUSION

The initial results with MemSQL look very promising, making MemSQL a strong contender for cost-effective ad-hoc analysis over huge data volumes. Some of the noteworthy results include the high data ingestion rate and the query performance on single-table queries, both for columnstore and rowstore tables. We expected it to shine at reasonable data volumes, but sustaining the query times with the larger data set puts it way ahead for a use case like ours. MemSQL also posts an excellent compression ratio for repetitive column data. One thing to consider is the use of SSDs for data storage and the tradeoff between cost and performance: we did see a performance improvement with the smaller dataset, but it would be good to analyze this aspect in depth to aid capacity planning. Moreover, for data sets in the range of 100 GB to 500 GB, storing the data in memory is a viable scaling option for better throughput; with huge data sets this becomes an unreasonable bet.

ROADMAP

This benchmark was the first of several steps we plan to take to build the data pipeline. We started with this approach because it was relatively easy to measure and is a real use case we encounter with our current data processing stack. However, there are various other parameters to be examined in the upcoming weeks and months, including:

1. Workflow management: This is critical on the path to productionizing the use of MemSQL for this pipeline. The whole idea of allowing everyone to access the data for ad-hoc queries requires automation of tasks like cluster setup, data loading, and decommissioning of clusters.

2. Recovery: We want to quantify the capability and repercussions of a node failure.

3. Stress mode: How MemSQL performs when data starts to exceed capacity planning, or when parallel queries need more memory than is available on the cluster.

In addition, keeping in view other data engineering efforts being made by our team, we plan to scale this effort and wire MemSQL together with data processing frameworks to build an end-to-end real-time data pipeline. We also went through the case studies of Pinterest and Comcast, who use MemSQL heavily for their real-time needs. Comcast uses MemSQL as a sink for their Storm topology for real-time processing in parallel to their batch pipeline, while Pinterest is in the process of productionizing a real-time data pipeline using Kafka, Spark, and MemSQL. Their high-level architecture looks as follows:

Figure 7: Real-time data pipeline using MemSQL at Pinterest [2]

Since we have a robust processing layer built on Spark, employed for machine learning and other analytical problems, we plan to continue utilizing it and connect it to our MemSQL cluster for ad-hoc analytics. MemSQL recently introduced a connector to Spark, which will allow us to expand the analytics capabilities of MemSQL with the full range of Spark tools and libraries. This architecture (Figure 8) can help exploit MemSQL as a real-time database for business operations and for executing complex queries. At the same time, our data science team can manipulate and explore data using Spark and write the results back to MemSQL tables for consuming the output.

Figure 8: Spark with MemSQL

The benchmarking exercise was aimed at achieving certain results; many other considerations beyond query performance are still to be worked on. For example: the challenges of automating the end-to-end workflow as a regular job, and features like high availability, cross-data-center replication, and granular user permissions offered by MemSQL Enterprise Edition.

REFERENCES

1. MemSQL Docs
2. Real-time analytics at Pinterest (Official Blog): /real-time-analytics-at-pinterest
3. Real-time Stream Processing Architecture
4. Druid Website
5. Pivotal GemFire Overview (official docs): ing_started/topics/gemfire_overview.html
6. Apache Drill Docs
7. Extending MemSQL Analytics with Spark

AUTHORS

Satish Cattamanchi is a Team Lead at 4INFO and has been an integral part of the engineering team. He consistently works on designing and implementing comprehensive systems as part of the Ad Platform, using a plethora of technologies. He likes playing chess.

Sarvesh Gupta has extensive experience in designing and implementing analytics solutions, while being actively involved in cutting-edge technology projects. He has exposure to multiple domains as part of the Media and Entertainment practice, primarily advertising. When not coding, he enjoys reading, movies, and music.

ACKNOWLEDGMENTS

We would like to acknowledge the responsive MemSQL team for helping us resolve our queries and suggesting ways to improve.

DEFINITIONS, ACRONYMS, ABBREVIATIONS

BI: Business Intelligence
SQL: Structured Query Language
S3: Simple Storage Service


More information

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved. Collaborative Big Data Analytics 1 Big Data Is Less About Size, And More About Freedom TechCrunch!!!!!!!!! Total data: bigger than big data 451 Group Findings: Big Data Is More Extreme Than Volume Gartner!!!!!!!!!!!!!!!

More information

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations Beyond Lambda - how to get from logical to physical Artur Borycki, Director International Technology & Innovations Simplification & Efficiency Teradata believe in the principles of self-service, automation

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015 Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours

More information

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering June 2014 Page 1 Contents Introduction... 3 About Amazon Web Services (AWS)... 3 About Amazon Redshift... 3 QlikView on AWS...

More information

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Highly competitive enterprises are increasingly finding ways to maximize and accelerate

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

hmetrix Revolutionizing Healthcare Analytics with Vertica & Tableau

hmetrix Revolutionizing Healthcare Analytics with Vertica & Tableau Powered by Vertica Solution Series in conjunction with: hmetrix Revolutionizing Healthcare Analytics with Vertica & Tableau The cost of healthcare in the US continues to escalate. Consumers, employers,

More information

Best Practices for Hadoop Data Analysis with Tableau

Best Practices for Hadoop Data Analysis with Tableau Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks

More information

Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center

Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center Presented by: Dennis Liao Sales Engineer Zach Rea Sales Engineer January 27 th, 2015 Session 4 This Session

More information

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics BIG DATA & ANALYTICS Transforming the business and driving revenue through big data and analytics Collection, storage and extraction of business value from data generated from a variety of sources are

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

ANALYTICS BUILT FOR INTERNET OF THINGS

ANALYTICS BUILT FOR INTERNET OF THINGS ANALYTICS BUILT FOR INTERNET OF THINGS Big Data Reporting is Out, Actionable Insights are In In recent years, it has become clear that data in itself has little relevance, it is the analysis of it that

More information

How to Enhance Traditional BI Architecture to Leverage Big Data

How to Enhance Traditional BI Architecture to Leverage Big Data B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Analytics on Spark & Shark @Yahoo

Analytics on Spark & Shark @Yahoo Analytics on Spark & Shark @Yahoo PRESENTED BY Tim Tully December 3, 2013 Overview Legacy / Current Hadoop Architecture Reflection / Pain Points Why the movement towards Spark / Shark New Hybrid Environment

More information

Scalability and Performance Report - Analyzer 2007

Scalability and Performance Report - Analyzer 2007 - Analyzer 2007 Executive Summary Strategy Companion s Analyzer 2007 is enterprise Business Intelligence (BI) software that is designed and engineered to scale to the requirements of large global deployments.

More information

Introducing Oracle Exalytics In-Memory Machine

Introducing Oracle Exalytics In-Memory Machine Introducing Oracle Exalytics In-Memory Machine Jon Ainsworth Director of Business Development Oracle EMEA Business Analytics 1 Copyright 2011, Oracle and/or its affiliates. All rights Agenda Topics Oracle

More information

Ubuntu and Hadoop: the perfect match

Ubuntu and Hadoop: the perfect match WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

How To Turn Big Data Into An Insight

How To Turn Big Data Into An Insight mwd a d v i s o r s Turning Big Data into Big Insights Helena Schwenk A special report prepared for Actuate May 2013 This report is the fourth in a series and focuses principally on explaining what s needed

More information

Big data blue print for cloud architecture

Big data blue print for cloud architecture Big data blue print for cloud architecture -COGNIZANT Image Area Prabhu Inbarajan Srinivasan Thiruvengadathan Muralicharan Gurumoorthy Praveen Codur 2012, Cognizant Next 30 minutes Big Data / Cloud challenges

More information

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets

More information

Building Your Big Data Team

Building Your Big Data Team Building Your Big Data Team With all the buzz around Big Data, many companies have decided they need some sort of Big Data initiative in place to stay current with modern data management requirements.

More information

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics ESSENTIALS EMC ISILON Use the industry's first and only scale-out NAS solution with native Hadoop

More information

Exploring the Synergistic Relationships Between BPC, BW and HANA

Exploring the Synergistic Relationships Between BPC, BW and HANA September 9 11, 2013 Anaheim, California Exploring the Synergistic Relationships Between, BW and HANA Sheldon Edelstein SAP Database and Solution Management Learning Points SAP Business Planning and Consolidation

More information

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect A very short talk about Apache Kylin Business Intelligence meets Big Data Fabian Wilckens EMEA Solutions Architect 1 The challenge today 2 Very quickly: OLAP Online Analytical Processing How many beers

More information

Cost-Effective Business Intelligence with Red Hat and Open Source

Cost-Effective Business Intelligence with Red Hat and Open Source Cost-Effective Business Intelligence with Red Hat and Open Source Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 1 Agenda Introductions Quick survey What is BI?: reporting,

More information

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth MAKING BIG DATA COME ALIVE Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth Steve Gonzales, Principal Manager steve.gonzales@thinkbiganalytics.com

More information

Tableau Server 7.0 scalability

Tableau Server 7.0 scalability Tableau Server 7.0 scalability February 2012 p2 Executive summary In January 2012, we performed scalability tests on Tableau Server to help our customers plan for large deployments. We tested three different

More information

AtScale Intelligence Platform

AtScale Intelligence Platform AtScale Intelligence Platform PUT THE POWER OF HADOOP IN THE HANDS OF BUSINESS USERS. Connect your BI tools directly to Hadoop without compromising scale, performance, or control. TURN HADOOP INTO A HIGH-PERFORMANCE

More information

BIG DATA ANALYTICS For REAL TIME SYSTEM

BIG DATA ANALYTICS For REAL TIME SYSTEM BIG DATA ANALYTICS For REAL TIME SYSTEM Where does big data come from? Big Data is often boiled down to three main varieties: Transactional data these include data from invoices, payment orders, storage

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Information Architecture

Information Architecture The Bloor Group Actian and The Big Data Information Architecture WHITE PAPER The Actian Big Data Information Architecture Actian and The Big Data Information Architecture Originally founded in 2005 to

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com

Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com REPORT Splice Machine: SQL-on-Hadoop Evaluation Guide www.splicemachine.com The content of this evaluation guide, including the ideas and concepts contained within, are the property of Splice Machine,

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate

More information

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

More information

Using Tableau Software with Hortonworks Data Platform

Using Tableau Software with Hortonworks Data Platform Using Tableau Software with Hortonworks Data Platform September 2013 2013 Hortonworks Inc. http:// Modern businesses need to manage vast amounts of data, and in many cases they have accumulated this data

More information

Dashboard Engine for Hadoop

Dashboard Engine for Hadoop Matt McDevitt Sr. Project Manager Pavan Challa Sr. Data Engineer June 2015 Dashboard Engine for Hadoop Think Big Start Smart Scale Fast Agenda Think Big Overview Engagement Model Solution Offerings Dashboard

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches Introduction For companies that want to quickly gain insights into or opportunities from big data - the dramatic volume growth in corporate

More information

G-Cloud Big Data Suite Powered by Pivotal. December 2014. G-Cloud. service definitions

G-Cloud Big Data Suite Powered by Pivotal. December 2014. G-Cloud. service definitions G-Cloud Big Data Suite Powered by Pivotal December 2014 G-Cloud service definitions TABLE OF CONTENTS Service Overview... 3 Business Need... 6 Our Approach... 7 Service Management... 7 Vendor Accreditations/Awards...

More information

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA WHITE PAPER April 2014 Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA Executive Summary...1 Background...2 File Systems Architecture...2 Network Architecture...3 IBM BigInsights...5

More information

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata BIG DATA: FROM HYPE TO REALITY Leandro Ruiz Presales Partner for C&LA Teradata Evolution in The Use of Information Action s ACTIVATING MAKE it happen! Insights OPERATIONALIZING WHAT IS happening now? PREDICTING

More information

Dell* In-Memory Appliance for Cloudera* Enterprise

Dell* In-Memory Appliance for Cloudera* Enterprise Built with Intel Dell* In-Memory Appliance for Cloudera* Enterprise Find out what faster big data analytics can do for your business The need for speed in all things related to big data is an enormous

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Ali Ghodsi Head of PM and Engineering Databricks

Ali Ghodsi Head of PM and Engineering Databricks Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap 3 key strategic advantages, and a realistic roadmap for what you really need, and when 2012, Cognizant Topics to be discussed

More information

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014 5 Keys to Unlocking the Big Data Analytics Puzzle Anurag Tandon Director, Product Marketing March 26, 2014 1 A Little About Us A global footprint. A proven innovator. A leader in enterprise analytics for

More information

Navigating Big Data business analytics

Navigating Big Data business analytics mwd a d v i s o r s Navigating Big Data business analytics Helena Schwenk A special report prepared for Actuate May 2013 This report is the third in a series and focuses principally on explaining what

More information

Microsoft Analytics Platform System. Solution Brief

Microsoft Analytics Platform System. Solution Brief Microsoft Analytics Platform System Solution Brief Contents 4 Introduction 4 Microsoft Analytics Platform System 5 Enterprise-ready Big Data 7 Next-generation performance at scale 10 Engineered for optimal

More information

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON 2 The V of Big Data Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. Gartner The emergence

More information

Why Big Data in the Cloud?

Why Big Data in the Cloud? Have 40 Why Big Data in the Cloud? Colin White, BI Research January 2014 Sponsored by Treasure Data TABLE OF CONTENTS Introduction The Importance of Big Data The Role of Cloud Computing Using Big Data

More information

Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics

Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics Paper 1828-2014 Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics John Cunningham, Teradata Corporation, Danville, CA ABSTRACT SAS High Performance Analytics (HPA) is a

More information

How To Make Data Streaming A Real Time Intelligence

How To Make Data Streaming A Real Time Intelligence REAL-TIME OPERATIONAL INTELLIGENCE Competitive advantage from unstructured, high-velocity log and machine Big Data 2 SQLstream: Our s-streaming products unlock the value of high-velocity unstructured log

More information

Next-Generation Cloud Analytics with Amazon Redshift

Next-Generation Cloud Analytics with Amazon Redshift Next-Generation Cloud Analytics with Amazon Redshift What s inside Introduction Why Amazon Redshift is Great for Analytics Cloud Data Warehousing Strategies for Relational Databases Analyzing Fast, Transactional

More information

Interactive data analytics drive insights

Interactive data analytics drive insights Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has

More information

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Drive operational efficiency and lower data transformation costs with a Reference Architecture for an end-to-end optimization and offload

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics In Organizations Mark Vervuurt Cluster Data Science & Analytics AGENDA 1. Yellow Elephant 2. Data Ingestion & Complex Event Processing 3. SQL on Hadoop 4. NoSQL 5. InMemory 6. Data Science & Machine Learning

More information

BIG DATA-AS-A-SERVICE

BIG DATA-AS-A-SERVICE White Paper BIG DATA-AS-A-SERVICE What Big Data is about What service providers can do with Big Data What EMC can do to help EMC Solutions Group Abstract This white paper looks at what service providers

More information

Big Data Analytics - Accelerated. stream-horizon.com

Big Data Analytics - Accelerated. stream-horizon.com Big Data Analytics - Accelerated stream-horizon.com Legacy ETL platforms & conventional Data Integration approach Unable to meet latency & data throughput demands of Big Data integration challenges Based

More information

Customer Case Study. Sharethrough

Customer Case Study. Sharethrough Customer Case Study Customer Case Study Benefits Faster prototyping of new applications Easier debugging of complex pipelines Improved overall engineering team productivity Summary offers a robust advertising

More information

PUSH INTELLIGENCE. Bridging the Last Mile to Business Intelligence & Big Data. 2013 Copyright Metric Insights, Inc.

PUSH INTELLIGENCE. Bridging the Last Mile to Business Intelligence & Big Data. 2013 Copyright Metric Insights, Inc. PUSH INTELLIGENCE Bridging the Last Mile to Business Intelligence & Big Data 2013 Copyright Metric Insights, Inc. INTRODUCTION... 3 CHALLENGES WITH BI... 4 The Dashboard Dilemma... 4 Architectural Limitations

More information

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata Up Your R Game James Taylor, Decision Management Solutions Bill Franks, Teradata Today s Speakers James Taylor Bill Franks CEO Chief Analytics Officer Decision Management Solutions Teradata 7/28/14 3 Polling

More information

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture Apps and data source extensions with APIs Future white label, embed or integrate Power BI Deploy Intelligent

More information

Big Data must become a first class citizen in the enterprise

Big Data must become a first class citizen in the enterprise Big Data must become a first class citizen in the enterprise An Ovum white paper for Cloudera Publication Date: 14 January 2014 Author: Tony Baer SUMMARY Catalyst Ovum view Big Data analytics have caught

More information

A Scalable Data Transformation Framework using the Hadoop Ecosystem

A Scalable Data Transformation Framework using the Hadoop Ecosystem A Scalable Data Transformation Framework using the Hadoop Ecosystem Raj Nair Director Data Platform Kiru Pakkirisamy CTO AGENDA About Penton and Serendio Inc Data Processing at Penton PoC Use Case Functional

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

Unleash your intuition

Unleash your intuition Introducing Qlik Sense Unleash your intuition Qlik Sense is a next-generation self-service data visualization application that empowers everyone to easily create a range of flexible, interactive visualizations

More information