Big Data Big Deal? Salford Systems www.salford-systems.com 2015





Copyright Salford Systems 2010-2015

Big Data Is The New "In" Thing
- Google Trends as of September 24, 2015
- Difficult to read the trade press without encountering Big Data stories

Zoom In On the Last 5 Years
- Search volume measured at January of each year, 2009-2016 (Google Trends index chart omitted)
- Searches more than tripled between 2010 and 2014
- Last year's growth was only about 5%, suggesting saturation or maturity

More Trends: Predictive Analytics and Machine Learning
- Predictive analytics has seen a fairly steady increase in searches
- Machine learning went from 73 (March 2004) to 32 (March 2009) and back to 70 (March 2015)

Influential High-Tech Marketing Book
- Published 1991, revised 2015
- Incisive review of how new technologies evolve from startups to mainstream
- Traces successes and failures and looks for common themes
- Three editions have covered three different decades (80s, 90s, 2000s), all reaching the same conclusions
- Big Data is moving quickly, with the earliest components of the technology being about 10 years old

Life Cycle of Diffusion of Innovation
- Big Data is at the beginning of the Early Majority phase (diffusion-curve image omitted)

Pressure Is On for Big Data Capability
- Every serious corporation is expected to have a Big Data strategy
- IT and the main lines of business have no choice but to introduce and deploy Big Data projects and solutions
- An ideal situation for vendors: buyers do not really know what they are buying or why, yet they have a budget that must be spent
- A replay of the data warehouse revolution that gathered such strength in the 1990s:
  - Requires new hardware (for in-house solutions)
  - Requires new software (some open source, some proprietary)
  - Requires new IT skills and possibly extensive consulting
- Substantial investment up front, with rewards unknown

Observation on 2013 Deployments: Everyone's Doing It, No One Knows Why
- "According to a recent Gartner report, 64% of enterprises surveyed indicate that they're deploying or planning Big Data projects. Yet even more acknowledge that they still don't know what to do with Big Data."
- "Have the inmates officially taken over the Big Data asylum? The same enterprises that seem most confused about Big Data seem to be the ones launching Big Data projects. What gives?"
- Source: "Gartner On Big Data: Everyone's Doing It, No One Knows Why," 18/09/2013


Competitive Bragging Rights
- "My Big Data is bigger than your Big Data"; "My Hadoop cluster is bigger than your Hadoop cluster"
- No one is likely to beat Google at this game (although Google uses its own internal software, not Hadoop)
- Obviously the question to answer is: what added value comes from all of this Big Data investment?
- A reasonable cluster would have from 20 to 100 nodes
- At today's prices of $7,000 per node the costs are modest: roughly $100K to $1 million for a starter cluster
- Of course the number of servers might need to go much higher
- Each server with 16 TB storage, 64 GB RAM, and 16 cores is a solid component for such a cluster (priced at Dell.com)
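The slide's cost range can be sanity-checked with a little arithmetic; the per-node price is the slide's 2015 figure, and the calculation is only an illustration:

```python
# Rough starter-cluster cost estimate using the slide's figures.
cost_per_node = 7_000            # dollars per server (slide's 2015 estimate)
small, large = 20, 100           # node counts for a "reasonable" cluster

low = small * cost_per_node      # 20-node cluster
high = large * cost_per_node     # 100-node cluster
print(f"${low:,} to ${high:,}")  # $140,000 to $700,000
```

The exact figures ($140K to $700K) round to the slide's "roughly $100K to $1 million," and exclude networking, power, and staff costs.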

Using Amazon AWS Servers
- Bloggers discuss options ranging in cost from $10,000 to $90,000 per month for a 10+ node cluster
- The cost for one year is similar to our estimates for in-house hardware
- http://www.intridea.com/blog/2014/2/13/big-data-on-small-budget/
- In other words, a relatively serious investment

Is Big Data Really New?
- The VLDB (Very Large Data Bases) conference series started in 1975, some 40 years ago
- The understanding that data can and will exceed the capacity of one server, or a few servers, has been evident since that first VLDB conference
- VLDB nevertheless remained a relatively obscure sub-specialty

KDD Conference 1995
- Knowledge Discovery in Data: bringing machine learning and large databases together

What Enabled the New Big Data?
- Google introduced new technology, MapReduce, facilitating massively parallel computation
- Yahoo developed the open source software Hadoop
- Yahoo developers founded new companies (Cloudera, Hortonworks)
- Today there is a growing ecosystem of companies extending the initial capabilities
- Source: Glenn Klockwood, http://www.glennklockwood.com/data-intensive/hadoop/overview.html

What Exactly Is Special About Big Data?
- Volume of data is, in and of itself, not very interesting or useful if all you have is more and more of exactly the same data as before
- Data becomes more interesting when it is broadened to include greater variety
- For example, some new-style peer-to-peer lenders leverage information about potential borrowers from social media and from the applicant's online behavior
- Growing interest in text mining
- The new age of Big Data has made it far more practical to unify multiple sources of diverse data into a single useful repository
- The unification might be virtual and managed by software

What Exactly Is Special About Hadoop?
- Hadoop can function as a universal data store: you can throw anything and everything into it (spreadsheets, video, audio, conventional data tables)
- You don't have to plan before you store (unlike a traditional database, which requires careful planning before you can do anything)
- Hadoop has been called "the new tape": you just write to it
- Hadoop data has also been described as "write once, read never," meaning it becomes a data cemetery

What Exactly Is Special About Hadoop?
- Hadoop allows you to defer all the work that is normally required for databases (creating schemas)
- If you don't do that work up front, you won't be able to use the data later until it is done
- Source: Glenn Klockwood, http://www.glennklockwood.com/data-intensive/hadoop/overview.html
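The deferred-schema idea is often called schema-on-read. A toy sketch of the contrast (field names and records are invented for illustration): raw lines are stored untouched, and a schema is applied only when the data is finally read.

```python
# Schema-on-write (traditional DB): parse and validate before storing.
# Schema-on-read (Hadoop-style): store raw bytes, parse only when queried.

raw_store = []                          # stands in for HDFS: append anything
raw_store.append("2015-09-24,click,42")
raw_store.append("2015-09-25,buy,17")

def read_with_schema(store):
    """Apply a schema only at read time."""
    for line in store:
        date, action, count = line.split(",")
        yield {"date": date, "action": action, "count": int(count)}

total = sum(rec["count"] for rec in read_with_schema(raw_store))
print(total)  # 59
```

The cost of this flexibility is exactly the slide's point: until someone writes `read_with_schema`, the stored lines are unusable.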

Analytics on Hadoop
- Flood of new interest in making Hadoop useful
- The original definitive use case: counting something
- If your data is so big that it must be distributed across many machines (e.g. 1,000 servers), each server can be organized to count something of interest in its local data store (doing anything on a single server is easy)
- We can then collect the subtotals and get a grand total
- Such a project can be expected to yield fast results no matter how many servers are involved
- This is the usual first example of working with Hadoop and MapReduce
- For more complex analytics, Hadoop alone has been found to be intolerably slow
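The counting use case is the classic word-count pattern. A minimal single-machine sketch of the map and reduce phases (real Hadoop runs the map phase on each server's local data and shuffles the pairs before reducing):

```python
from collections import defaultdict

def map_phase(chunk):
    """Each server emits (word, 1) pairs from its local chunk of text."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Merge per-server subtotals into grand totals."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two "servers", each holding part of the data locally.
chunks = ["big data big deal", "big data at scale"]
pairs = [p for chunk in chunks for p in map_phase(chunk)]
print(reduce_phase(pairs))
# {'big': 3, 'data': 2, 'deal': 1, 'at': 1, 'scale': 1}
```

Because each map call touches only its own chunk, adding servers adds capacity without changing the program, which is why counting scales so well and why it is the standard first MapReduce exercise.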

Romance of Big Data
- Most non-analytics professionals hated their mandatory university statistics courses and learned little
- In spite of the power of advanced analytics having been demonstrated time and again since at least the 1980s, it has been difficult to excite managers about the topic
- Once the topic of analytics was inextricably linked with Big Data, the trade press and the popular press became enthralled
- WikiLeaks and NSA phone call monitoring fueled the notion of Big Data as all-knowing and all-powerful
- Suddenly analytics was seen as a source of power and control

Romance of Big Data
- Source: Srikant Sastri, http://yourstory.com/2015/06/big-data-challenges-india/

Analytics Reality
- While we might now have access to huge repositories of data, it is not the case that we require, or will use, all of this data
- Impressive predictive models are often constructed from a relatively small number of core predictors
- Until recently, major bank credit risk models might leverage only about 15 essential predictors
- New-age modeling techniques might expand the set of predictors into the several hundreds
- We might have access to a huge number of predictors, but for any given predictive project we will typically end up using a very small fraction of them
- Once we have narrowed our focus, we can continue analysis on the relevant subset of predictors
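The narrowing step can be illustrated with the simplest possible screen: rank candidate predictors by a quick univariate score and keep only the top few. This is a toy correlation screen on invented data; real projects would use more robust importance measures.

```python
def corr(xs, ys):
    """Plain Pearson correlation, no libraries required."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def screen(predictors, target, keep=2):
    """Keep the predictors most correlated (in absolute value) with the target."""
    ranked = sorted(predictors, key=lambda name: -abs(corr(predictors[name], target)))
    return ranked[:keep]

# Hypothetical data: x1 and x2 track the target, x3 is mostly noise.
target = [1, 2, 3, 4, 5]
predictors = {
    "x1": [2, 4, 6, 8, 10],   # perfectly correlated
    "x2": [5, 4, 3, 2, 1],    # perfectly anti-correlated
    "x3": [3, 1, 4, 1, 5],    # weakly related
}
print(screen(predictors, target))  # ['x1', 'x2']
```

After a screen like this, the hundreds of available columns collapse to the handful the model will actually use, which is the slide's point.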

Mathematics of Rare Events
- Many analytical tasks focus on the prediction of rare events (fraud in a credit card transaction; conversion for an internet ad, i.e. someone actually makes a purchase)
- Suppose a rare event occurs in 1 of every 1,000 chances: one million chances will generate just 1,000 events
- To optimally analyze 1,000 rare events we require all of the rare-event records and only a small sample of the common event (say 10,000)
- Even if we start with 1 million rows of extensive data, we almost surely need only about 11,000 of those rows for first-rate modeling
- In other words, the Big Data problem quickly becomes a small data problem
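The million-to-11,000 reduction is plain down-sampling of the common class. A sketch on simulated data (the 1-in-1,000 event rate and 10,000-row common sample are the slide's figures):

```python
import random

random.seed(0)

# Simulate 1,000,000 rows where the event occurs exactly 1 in 1,000 times.
rows = [{"id": i, "event": (i % 1000 == 0)} for i in range(1_000_000)]

rare   = [r for r in rows if r["event"]]       # keep every rare event
common = [r for r in rows if not r["event"]]   # down-sample the rest

sample = rare + random.sample(common, 10_000)
print(len(rare), len(sample))  # 1000 11000
```

The modeling data set is about 1% of the original, yet it contains 100% of the information about the rare class.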

Value of Added GOODs in Small Samples of BADs
- Chart (omitted): variance of the discriminator as the 0:1 (GOOD:BAD) ratio increases, with N1 = 250 BADs and N0 = factor x N1 GOODs; the variance falls from about 0.0085 at factor 1 toward 0.004 as the factor grows to 40
- Assume there is a single relevant predictor X and we want to measure the difference in the mean of X between the GOOD and BAD samples
- The excess variance contributed by the finite GOOD sample is reduced by 95% when the factor reaches 20
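The chart's values are consistent with the textbook variance of a difference of two sample means, Var = sigma^2 (1/N0 + 1/N1), with sigma^2 = 1 and N1 = 250. That setup is my reconstruction, not stated explicitly on the slide:

```python
# Variance of (mean of X among GOODs) - (mean of X among BADs),
# assuming unit variance of X in each class (an assumption for illustration).
def discriminator_variance(factor, n_bad=250, sigma2=1.0):
    n_good = factor * n_bad
    return sigma2 * (1.0 / n_good + 1.0 / n_bad)

for f in (1, 2, 4, 8, 20, 40):
    print(f, round(discriminator_variance(f), 5))
# factor 1 -> 0.008, factor 20 -> 0.0042, factor 40 -> 0.0041
```

At factor 20 the GOOD-side term 1/N0 has shrunk by 95% relative to factor 1, while the 1/N1 floor from the 250 BADs remains, which is why the curve flattens.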

Validation ROC: Value of Increased Good/Bad Ratio
- Surprisingly low ratios are sufficient
- Chart (omitted): validation ROC (roughly 0.88 to 0.95) by Good/Bad ratio (0 to 40), with curves for 231, 491, 942, and 2495 BADs
- Each curve fixes the number of BADs and varies the ratio of GOODs to BADs
- Starting with more BADs yields better results, as expected

Single Server Capacity
- Take a modest 64 GB RAM server: it can effectively work with a data set of about 2 million rows by 3,000 columns, using learning machines that must hold the training data in RAM
- RAM is also needed for workspace, which is why the entire RAM is not dedicated to data storage
- If this is not enough, you can easily scale up to 512 GB RAM, which accommodates 16 million rows instead (for example)
- For very few predictive modeling problems is this capacity insufficient
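The 2-million-row figure follows from simple arithmetic if each value is stored as an 8-byte double (my assumption; actual tools vary in their storage formats):

```python
rows, cols = 2_000_000, 3_000
bytes_per_value = 8                          # 64-bit float, an assumption

data_gb = rows * cols * bytes_per_value / 2**30
print(round(data_gb, 1))  # 44.7
# About 44.7 GB of raw data fits in a 64 GB machine,
# leaving the remaining RAM for workspace.
```

Scaling RAM by 8x (to 512 GB) supports roughly 8x the rows at the same width, matching the slide's 16-million-row example.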

What We Recommend
- When it comes to Big Data, your organization has no choice: it will have to make the investments in hardware, software, and personnel (or get involved in the cloud)
- Once the Big Data team has actually pulled something together, determine what information they have been able to organize that is not typically available for your analytics
- Request access to that data, or obtain an extract from the big data store that you can comfortably work with on a single server
- Use modern advanced analytics to ascertain the value (if any) of that added data

Technologies You Need to Know About
- Gradient boosting
- Random Forests
- Many textbooks and training videos are available
- These can be found in the offerings of all major vendors and in open source
- These technologies were first brought to market by Salford Systems