# MAD Skills: New Analysis Practices for Big Data

## Transcription

Whitepaper

MAD Skills: New Analysis Practices for Big Data

By Jeffrey Cohen, Greenplum; Brian Dolan, Fox Interactive Media; Mark Dunlap, Evergreen Technologies; Joseph M. Hellerstein, U.C. Berkeley; and Caleb Welton, Greenplum

Abstract

As massive data acquisition and storage become increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world's largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

Greenplum Software, 1900 South Norfolk Street, San Mateo, CA, USA tel:

This will produce a square matrix and a vector based on the size of the independent variable vector. The final calculation is computed by inverting the small matrix $A$ and multiplying by the vector to derive the coefficients $\beta$. Additionally, the coefficient of determination $R^2$ can be calculated concurrently by

$$SSR = b^\top \beta - \frac{1}{n}\Big(\sum_i y_i\Big)^2$$

$$TSS = \sum_i y_i^2 - \frac{1}{n}\Big(\sum_i y_i\Big)^2$$

$$R^2 = \frac{SSR}{TSS}$$

In the following SQL query, we compute the coefficients $\beta$, as well as the components of the coefficient of determination:

    CREATE VIEW ols AS
    SELECT pseudo_inverse(A) * b AS beta_star,
           (transpose(b) * (pseudo_inverse(A) * b) - sum_y2 / n)  -- SSR
           / (sum_yy - sum_y2 / n)                                -- TSS
           AS r_squared
    FROM (
      SELECT sum(transpose(d.vector) * d.vector) AS A,
             sum(d.vector * y) AS b,
             sum(y)^2 AS sum_y2,
             sum(y^2) AS sum_yy,
             count(*) AS n
      FROM design d
    ) ols_aggs;

Note the use of a user-defined function for vector transpose, and user-defined aggregates for summation of (multidimensional) array objects. The array A is a small in-memory matrix that we treat as a single object; the `pseudo_inverse` function implements the textbook Moore-Penrose pseudoinverse of the matrix. All of the above can be efficiently calculated in a single pass over the data. For convenience, we encapsulated this yet further via two user-defined aggregate functions:

    SELECT ols_coef(d.y, d.vector), ols_r2(d.y, d.vector)
    FROM design d;

Prior to the implementation of this functionality within the DBMS, one Greenplum customer was accustomed to calculating the OLS by exporting the data and importing it into R for calculation, a process that took several hours to complete. They reported significant performance improvement when they moved to running the regression within the DBMS.
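
The single-pass structure of this computation can be illustrated outside the database. The following is a minimal Python sketch, not the authors' implementation: the function names `solve` and `ols_single_pass` are hypothetical, and a plain Gaussian-elimination solve stands in for the pseudo-inverse, so it assumes the accumulated matrix A is invertible and that each design vector includes a constant term.

```python
def solve(a, b):
    """Solve a x = b for a small k x k matrix by Gaussian elimination
    with partial pivoting. Assumes a is invertible (an assumption of
    this sketch; the paper uses a pseudo-inverse instead)."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (m[i][k] - s) / m[i][i]
    return x


def ols_single_pass(rows):
    """rows yields (vector, y) pairs. One pass accumulates
    A = sum(x x^T), b = sum(x y), sum(y), sum(y^2) and n;
    the small system A beta = b is solved afterwards."""
    a = b = None
    sum_y = sum_yy = 0.0
    n = 0
    for x, y in rows:
        if a is None:
            k = len(x)
            a = [[0.0] * k for _ in range(k)]
            b = [0.0] * k
        for i in range(k):
            b[i] += x[i] * y
            for j in range(k):
                a[i][j] += x[i] * x[j]
        sum_y += y
        sum_yy += y * y
        n += 1
    beta = solve(a, b)
    ssr = sum(bi * wi for bi, wi in zip(b, beta)) - sum_y ** 2 / n
    tss = sum_yy - sum_y ** 2 / n
    return beta, ssr / tss


# Usage: points on the line y = 1 + 2x, with a constant term in each vector.
design = [([1.0, float(x)], 1.0 + 2.0 * x) for x in range(5)]
beta, r_squared = ols_single_pass(design)
print(beta, r_squared)
```

Because A, b, and the scalar sums are all plain sums over rows, partial results computed on separate data segments can be added together before the final solve, which is what makes the computation data-parallel.
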
Most of the benefit of the in-database approach derived from running the analysis in parallel, close to the data, with minimal data movement.

#### Conjugate Gradient

In this subsection we develop a data-parallel implementation of the Conjugate Gradient method for solving a system of linear equations. We can use this to implement Support Vector Machines, a state-of-the-art technique for binary classification. Binary classifiers are a common tool in modern ad placement, used to turn complex multi-dimensional user features into simple boolean labels like "is a car enthusiast" that can be combined into enthusiast charts. In addition to serving as a building block for SVMs, the Conjugate Gradient method allows us to optimize a large class of functions that can be approximated by second-order Taylor expansions.

To a mathematician, the solution to the matrix equation $Ax = b$ is simple when it exists: $x = A^{-1}b$. As noted in Section 5.1, we cannot assume we can find $A^{-1}$. If matrix $A$ is $n \times n$ symmetric and positive definite (SPD), we can use the Conjugate Gradient method. This method requires neither $df(y)$ nor $A^{-1}$ and converges in no more than $n$ iterations. A general treatment is given in [17]. Here we outline the solution to $Ax = b$ as an extremum of $f(x) = \frac{1}{2}x^\top A x - b^\top x + c$. Broadly, we have an estimate $\hat{x}$ to our solution $x$. Since $\hat{x}$ is only an estimate, $r_0 = A\hat{x} - b$ is non-zero. Subtracting this error $r_0$ from the estimate allows us to generate a series $p_i = r_{i-1} - \{A\hat{x} - b\}$ of orthogonal vectors. The solution will be $x = \sum_i \alpha_i p_i$ for $\alpha_i$ defined below. We end at the point $\|r_k\|_2 < \epsilon$ for a suitable $\epsilon$. There are several update algorithms; we have written ours in matrix notation, with the residual taken as $b - A\hat{x}$ and the search directions $p_i$ written as $v_i$.

$$r_0 = b - A\hat{x}_0, \quad \alpha_0 = \frac{r_0^\top r_0}{v_0^\top A v_0}, \quad v_0 = r_0, \quad i = 0$$

Begin iteration over $i$.

$$\begin{aligned}
\alpha_i &= \frac{r_i^\top r_i}{v_i^\top A v_i} \\
x_{i+1} &= x_i + \alpha_i v_i \\
r_{i+1} &= r_i - \alpha_i A v_i \\
&\text{check } \|r_{i+1}\|_2 \le \epsilon \\
v_{i+1} &= r_{i+1} + \frac{r_{i+1}^\top r_{i+1}}{r_i^\top r_i}\, v_i
\end{aligned}$$

To incorporate this method into the database, we stored $(v_i, x_i, r_i, \alpha_i)$ as a row and inserted row $i+1$ in one pass.
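
The update rules above can be sketched in a few lines of Python. This is an illustrative in-memory version under stated assumptions (dense lists, an SPD matrix, and the hypothetical helper names `matvec`, `dot`, and `conjugate_gradient`); it is not the row-at-a-time database implementation described in the text.

```python
def matvec(a, v):
    """Matrix-vector product for a dense list-of-lists matrix."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(a, b, x0, eps=1e-10):
    """Solve a x = b for an SPD matrix a, starting from the estimate x0."""
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, matvec(a, x))]  # r_0 = b - A x_0
    v = list(r)                                         # v_0 = r_0
    for _ in range(len(b)):       # at most n steps in exact arithmetic
        if dot(r, r) ** 0.5 <= eps:                     # ||r_i||_2 <= eps
            break
        av = matvec(a, v)
        alpha = dot(r, r) / dot(v, av)                  # alpha_i
        x = [xi + alpha * vi for xi, vi in zip(x, v)]   # x_{i+1}
        r_next = [ri - alpha * avi for ri, avi in zip(r, av)]  # r_{i+1}
        ratio = dot(r_next, r_next) / dot(r, r)
        v = [rn + ratio * vi for rn, vi in zip(r_next, v)]     # v_{i+1}
        r = r_next
    return x


# Usage: a 2 x 2 SPD system whose exact solution is (1/11, 7/11).
solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [2.0, 1.0])
print(solution)
```

Each pass through the loop touches the matrix only through one matrix-vector product, which is why the state $(v_i, x_i, r_i, \alpha_i)$ is enough to carry from one iteration to the next.
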
Storing each iteration as a row required the construction of functions `update_alpha(r_i, p_i, A)`, `update_x(x_i, alpha_i, v_i)`, `update_r(x_i, alpha_i, v_i, A)`, and `update_v(r_i, alpha_i, v_i, A)`. Though the function calls were redundant (for instance, `update_v()` also runs the update of $r_{i+1}$), this allowed us to insert one full row at a time. An external driver process then checks the value of $r_i$ before proceeding. Upon convergence, it is rudimentary to compute $x$.

The presence of the conjugate gradient method enables even more sophisticated techniques like Support Vector Machines (SVM). At their core, SVMs seek to maximize the distance between a set of points and a candidate hyperplane. This distance is denoted by the magnitude of the normal vectors $\|w\|^2$. Most methods incorporate the integers $\{0, 1\}$ as labels $c$, so the problem becomes

$$\arg\max_{w,b} f(w) = \frac{1}{2}\|w\|^2, \quad \text{subject to } c^\top w - b \ge 0.$$

This method applies to the more general issue of high-dimensional functions under a Taylor expansion

$$f_{x_0}(x) \approx f(x_0) + df(x)(x - x_0) + \frac{1}{2}(x - x_0)^\top d^2 f(x)(x - x_0).$$

With a good initial guess for $x$ and the common assumption of continuity of $f(\cdot)$, we know the matrix will be SPD near $x$. See [17] for details.

### 5.3 Functionals

Basic statistics are not new to relational databases; most support means, variances and some form of quantiles. But modeling and comparative statistics are not typically built-in functionality. In this section we provide data-parallel implementations of a number of comparative statistics expressed in SQL.

In the previous section, scalars or vectors were the atomic unit. Here a probability density function is the foundational object. For instance, the Normal (Gaussian) density

$$f(x) = e^{-(x - \mu)^2 / 2\sigma^2}$$

is considered by mathematicians as a single entity with two attributes: the mean $\mu$ and variance $\sigma^2$. A common statistical question is to see how well a data set fits a target density function. The z-score of a datum $x$ is given by

$$z(x) = \frac{x - \mu}{\sigma / \sqrt{n}}$$

and is easy to obtain in standard SQL.

    SELECT x.value,
           (x.value - d.mu) * sqrt(d.n) / d.sigma AS z_score
    FROM x, design d;

#### Mann-Whitney U Test

Rank and order statistics are quite amenable to relational treatments, since their main purpose is to evaluate a set of data, rather than one datum at a time. The next example illustrates the notion of comparing two entire sets of data without the overhead of describing a parameterized density.

The Mann-Whitney U Test (MWU) is a popular substitute for Student's t-test in the case of non-parametric data. The general idea is to take two populations A and B and decide if they are from the same underlying population by examining the rank order in which members of A and B show up in a general ordering. The cartoon is that if members of A are at the front of the line and members of B are at the back of the line, then A and B are different populations. In an advertising setting, click-through rates for web ads tend to defy simple parametric models like Gaussians or log-normal distributions. But it is often useful to compare click-through rate distributions for different ad campaigns, e.g., to choose one with a better median click-through. MWU addresses this task. Given a table T with columns SAMPLE_ID, VALUE, row numbers are obtained and summed via SQL windowing functions.
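
The rank-sum idea can also be sketched outside the database. This minimal Python sketch assumes exactly two non-empty samples and, like `row_number()`, breaks ties arbitrarily; the function name `mann_whitney` is hypothetical. It uses the standard definition $U = R - n(n+1)/2$ for a sample with rank sum $R$ and size $n$, together with the normal approximation, which is only justified for large samples (the tiny input below is for illustration only).

```python
from math import sqrt


def mann_whitney(rows):
    """rows: list of (sample_id, value) pairs covering two samples.
    Returns per-sample U statistics and normal-approximation z-scores."""
    # Rank all values together, largest first (rank 1 = largest value).
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    rank_sum, count = {}, {}
    for rank, (sample_id, _) in enumerate(ordered, start=1):
        rank_sum[sample_id] = rank_sum.get(sample_id, 0) + rank
        count[sample_id] = count.get(sample_id, 0) + 1
    # U = R - n(n + 1) / 2 for each sample.
    u = {s: rank_sum[s] - count[s] * (count[s] + 1) / 2 for s in rank_sum}
    sum_u = sum(u.values())      # equals n_A * n_B
    sum_n = sum(count.values())  # equals n_A + n_B
    z = {s: (u[s] - sum_u / 2) / sqrt(sum_u * (sum_n + 1) / 12) for s in u}
    return u, z


# Usage: sample A holds the three largest values, sample B the three smallest.
data = [("A", 10), ("A", 9), ("A", 8), ("B", 3), ("B", 2), ("B", 1)]
u_stats, z_scores = mann_whitney(data)
print(u_stats, z_scores)
```

The two U statistics always sum to the product of the sample sizes, which is why a single pair of totals is enough to standardize either one.
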
The view R computes the rank sum and U statistic for each sample:

    CREATE VIEW R AS
    SELECT sample_id,
           avg(value) AS sample_avg,
           sum(rown) AS rank_sum,
           count(*) AS sample_n,
           sum(rown) - count(*) * (count(*) + 1) / 2 AS sample_u
    FROM (SELECT sample_id,
                 row_number() OVER (ORDER BY value DESC) AS rown,
                 value
          FROM T) AS ordered
    GROUP BY sample_id;

Assuming the condition of large sample sizes, for instance greater than 5,000, the normal approximation can be justified. Using the previous view R, the final reported statistics are given by

    SELECT r.sample_u, r.sample_avg, r.sample_n,
           (r.sample_u - a.sum_u / 2) /
           sqrt(a.sum_u * (a.sum_n + 1) / 12) AS z_score
    FROM R AS r,
         (SELECT sum(sample_u) AS sum_u,
                 sum(sample_n) AS sum_n
          FROM R) AS a
    GROUP BY r.sample_u, r.sample_avg, r.sample_n,
             a.sum_n, a.sum_u;

The end result is a small set of numbers that describe a relationship of functions. This simple routine can be encapsulated by stored procedures and made available to the analysts via a simple `SELECT mann_whitney(value) FROM table` call, elevating the vocabulary of the database tremendously.

#### Log-Likelihood Ratios

Likelihood ratios are useful for comparing a subpopulation to an overall population on a particular attribute. As an example in advertising, consider two attributes of users: beverage choice, and family status. One might want to know whether coffee attracts new parents more than the general population.

This is a case of having two density (or mass) functions for the same data set $X$. Denote one distribution as null hypothesis $f_0$ and the other as alternate $f_A$. Typically, $f_0$ and $f_A$ are different parameterizations of the same density, for instance $N(\mu_0, \sigma_0)$ and $N(\mu_A, \sigma_A)$. The likelihood $L$ under $f_i$ is given by

$$L_{f_i} = L(X \mid f_i) = \prod_k f_i(x_k).$$

The log-likelihood ratio is given by the quotient $-2 \log (L_{f_0} / L_{f_A})$. Taking the log allows us to use the well-known $\chi^2$ approximation for large $n$. Also, the products turn nicely into sums and an RDBMS can handle it easily in parallel.

$$LLR = 2 \sum_k \log f_A(x_k) - 2 \sum_k \log f_0(x_k).$$
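
As a worked illustration of the sum form, the following minimal Python sketch evaluates the log-likelihood ratio for two Normal densities. The names `normal_pdf` and `log_likelihood_ratio` and the sample values are hypothetical, and natural logarithms are assumed.

```python
from math import exp, log, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Normalized Gaussian density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def log_likelihood_ratio(values, f_alt, f_null):
    """LLR = 2 * sum(log f_A(x_k)) - 2 * sum(log f_0(x_k))."""
    return (2 * sum(log(f_alt(x)) for x in values)
            - 2 * sum(log(f_null(x)) for x in values))


# Usage: alternate N(1, 1) against null N(0, 1) on five observations.
values = [0.9, 1.1, 1.0, 1.2, 0.8]
llr = log_likelihood_ratio(
    values,
    f_alt=lambda x: normal_pdf(x, 1.0, 1.0),
    f_null=lambda x: normal_pdf(x, 0.0, 1.0),
)
print(llr)
```

For these two unit-variance densities each term reduces to $2x_k - 1$, so the result for the five values above is approximately 5. Each of the two sums is an ordinary aggregate over rows, so partial sums from different data segments can simply be added.
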
The log-likelihood ratio calculation distributes nicely if $f_i : \mathbb{R} \to \mathbb{R}$, which most do. If $f_i : \mathbb{R}^n \to \mathbb{R}$, then care must be taken in managing the vectors as distributed objects. Suppose the values are in table T and the function $f_A(\cdot)$ has been written as a user-defined function `f_llk(x numeric, param numeric)`. Then the entire experiment can be performed with the call

    SELECT 2 * sum(log(f_llk(t.value, d.alt_param))) -
           2 * sum(log(f_llk(t.value, d.null_param))) AS llr
    FROM T, design AS d;

This represents a significant gain in flexibility and sophistication for any RDBMS.

#### Example: The Multinomial Distribution

The multinomial distribution extends the binomial distribution. Consider a random variable $X$ with $k$ discrete outcomes. These have probabilities $p = (p_1, \ldots, p_k)$. In $n$ trials, the joint probability distribution is given by

$$P(X \mid p) = \binom{n}{n_1, \ldots, n_k} p_1^{n_1} \cdots p_k^{n_k}.$$

To obtain $p_i$, we assume a table outcome with column outcome representing the base population.

    CREATE VIEW B AS
    SELECT outcome,
           outcome_count / sum(outcome_count) OVER () AS p
    FROM (SELECT outcome, count(*)::numeric AS outcome_count
          FROM input
          GROUP BY outcome) AS a;

In the context of model selection, it is often convenient to compare the same data set under two different multinomial distributions, here written with parameters $p$ and $\tilde{p}$.

$$\begin{aligned}
LLR &= -2 \log \left( \frac{P(X \mid p)}{P(X \mid \tilde{p})} \right) \\
&= -2 \log \left( \frac{\binom{n}{n_1, \ldots, n_k} p_1^{n_1} \cdots p_k^{n_k}}{\binom{n}{n_1, \ldots, n_k} \tilde{p}_1^{n_1} \cdots \tilde{p}_k^{n_k}} \right) \\
&= 2 \sum_i n_i \log \tilde{p}_i - 2 \sum_i n_i \log p_i.
\end{aligned}$$

Or in SQL:


### ANALYTICS CENTER LEARNING PROGRAM

Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals

### Information Management course

Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

### BUSINESS INTELLIGENCE. Keywords: business intelligence, architecture, concepts, dashboards, ETL, data mining

BUSINESS INTELLIGENCE Bogdan Mohor Dumitrita 1 Abstract A Business Intelligence (BI)-driven approach can be very effective in implementing business transformation programs within an enterprise framework.

### Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

### OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

OLAP and OLTP AMIT KUMAR BINDAL Associate Professor Databases Databases are developed on the IDEA that DATA is one of the critical materials of the Information Age Information, which is created by data,

### QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QlikView Technical Case Study Series Big Data June 2012 qlikview.com Introduction This QlikView technical case study focuses on the QlikView deployment

Predictive Analytics Powered by SAP HANA Cary Bourgeois Principal Solution Advisor Platform and Analytics Agenda Introduction to Predictive Analytics Key capabilities of SAP HANA for in-memory predictive

### DBMS / Business Intelligence, SQL Server

DBMS / Business Intelligence, SQL Server Orsys, with 30 years of experience, is providing high quality, independant State of the Art seminars and hands-on courses corresponding to the needs of IT professionals.

### Data Warehousing and Data Mining in Business Applications

133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business

### Cost Savings THINK ORACLE BI. THINK KPI. THINK ORACLE BI. THINK KPI. THINK ORACLE BI. THINK KPI.

THINK ORACLE BI. THINK KPI. THINK ORACLE BI. THINK KPI. MIGRATING FROM BUSINESS OBJECTS TO OBIEE KPI Partners is a world-class consulting firm focused 100% on Oracle s Business Intelligence technologies.

### PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions. A Technical Whitepaper from Sybase, Inc.

PowerDesigner WarehouseArchitect The Model for Data Warehousing Solutions A Technical Whitepaper from Sybase, Inc. Table of Contents Section I: The Need for Data Warehouse Modeling.....................................4

### Understanding the Value of In-Memory in the IT Landscape

February 2012 Understing the Value of In-Memory in Sponsored by QlikView Contents The Many Faces of In-Memory 1 The Meaning of In-Memory 2 The Data Analysis Value Chain Your Goals 3 Mapping Vendors to

### ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

ORACLE OLAP KEY FEATURES AND BENEFITS FAST ANSWERS TO TOUGH QUESTIONS EASILY KEY FEATURES & BENEFITS World class analytic engine Superior query performance Simple SQL access to advanced analytics Enhanced