Big Data: Big Deal?
Salford Systems, www.salford-systems.com, 2015
Big Data Is the New "In" Thing
Google Trends as of September 24, 2015.
It is difficult to read the trade press without encountering Big Data stories.
Zoom In on the Last 5 Years
[Chart: search volume measured at January of each year, 2009-2016]
Searches more than tripled between 2010 and 2014.
Last year's growth was only about 5%, suggesting saturation or maturity.
More Trends: Predictive Analytics and Machine Learning
Predictive analytics has seen a fairly steady increase in searches.
Machine Learning went from 73 (March 2004) down to 32 (March 2009) and back up to 70 (March 2015).
Influential High-Tech Marketing Book (1991, revised 2015)
An incisive review of how new technologies evolve from startups to mainstream.
Traces successes and failures and looks for common themes.
Three editions have covered three different decades (80s, 90s, 2000s), all reaching the same conclusions.
Big Data is moving quickly, with the earliest components of the technology being about 10 years old.
Life Cycle of Diffusion of Innovation
[Diagram: innovation adoption life cycle] Big Data is at the beginning of the Early Majority phase.
Pressure Is On for Big Data Capability
Every serious corporation is expected to have a Big Data strategy.
IT and the main lines of business have no choice other than to introduce and deploy Big Data projects and solutions.
An ideal situation for vendors: buyers do not really know what they are buying or why, and buyers have a budget that must be spent.
A replay of the data warehouse revolution that gathered such strength in the 1990s: it requires new hardware (for in-house solutions), new software (some open source, some proprietary), and new IT skills, possibly with extensive consulting.
A substantial investment up front, with rewards unknown.
Observation on 2013 Deployments: Everyone's Doing It, No One Knows Why
"According to a recent Gartner report, 64% of enterprises surveyed indicate that they're deploying or planning Big Data projects. Yet even more acknowledge that they still don't know what to do with Big Data. Have the inmates officially taken over the Big Data asylum? The same enterprises that seem most confused about Big Data seem to be the ones launching Big Data projects. What gives?"
Source: Gartner On Big Data: Everyone's Doing It, No One Knows Why, 18/09/2013
Competitive Bragging Rights
My Big Data is bigger than your Big Data; my Hadoop cluster is bigger than your Hadoop cluster.
No one is likely to beat Google at this game (although Google uses its own Google-only software, not Hadoop).
Obviously, the question to answer is: what added value comes from all of this Big Data investment?
A reasonable cluster would have from 20 to 100 nodes. At today's prices of about $7,000 per node the costs are modest: roughly $100K to $1 million for a starter cluster. Of course, the number of servers might need to go much higher.
A server with 16 TB storage, 64 GB RAM, and 16 cores is a solid component for such a cluster (priced at Dell.com).
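As a rough check on those cost figures, here is a minimal Python sketch (the $7,000-per-node price and the 20-to-100-node range are the slide's own assumptions):

```python
# Back-of-the-envelope cluster cost, using the per-node price quoted above.
COST_PER_NODE = 7_000  # USD, assumed price of a 16 TB / 64 GB RAM / 16-core server

for nodes in (20, 100):
    total = nodes * COST_PER_NODE
    print(f"{nodes:>3} nodes: ${total:,}")

# Output: 20 nodes -> $140,000 and 100 nodes -> $700,000,
# consistent with the "roughly $100K to $1 million" range above.
```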
Using Amazon AWS Servers
Bloggers discuss options ranging in cost from $10,000 to $90,000 per month for a 10+ node cluster.
The cost for one year is similar to our estimates for in-house hardware.
http://www.intridea.com/blog/2014/2/13/big-data-on-small-budget/
In other words, a relatively serious investment.
Is Big Data Really New?
VLDB (Very Large Data Bases) started in 1975, some 40 years ago.
The understanding that data can and will exceed the capacity of one server, or a few servers, has been evident since that first VLDB conference.
VLDB remained a relatively obscure sub-specialty.
KDD Conference, 1995: Knowledge Discovery in Databases
Bringing machine learning and large databases together.
What Enabled the New Big Data?
Google introduced a new technology, MapReduce, facilitating massively parallel computation.
Yahoo developed the open source software Hadoop.
Yahoo developers founded new companies (Cloudera, Hortonworks).
Today there is a growing ecosystem of companies extending the initial capabilities.
Source: Glenn Klockwood, http://www.glennklockwood.com/data-intensive/hadoop/overview.html
What Exactly Is Special About Big Data?
Volume of data is, in and of itself, not very interesting or useful IF all you have is more and more of exactly the same data as before.
Data becomes more interesting when it is broadened to include greater variety. For example, some new-style peer-to-peer lenders leverage information about potential borrowers from social media and from the applicant's online behavior.
There is growing interest in text mining.
The new age of Big Data has made it far more practical to unify multiple sources of diverse data into a single useful repository. The unification might be virtual and managed by software.
What Exactly Is Special About Hadoop?
Hadoop can function as a universal data store: you can throw anything and everything into it (spreadsheets, video, audio, conventional data tables).
You don't have to plan before you store (unlike a traditional database, which requires careful planning before you can do anything).
Hadoop has been called "the new tape": you just write to it.
Hadoop data has also been described as "write once, read never," meaning it becomes a data cemetery.
What Exactly Is Special About Hadoop?
Hadoop allows you to defer all the work (e.g., creating schemas) that is normally required before data can go into a database. The data still cannot be used for analysis until that work is done, but it can be done later, as needed.
Source: Glenn Klockwood, http://www.glennklockwood.com/data-intensive/hadoop/overview.html
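A toy Python sketch of this "schema on read" idea (an illustration of the concept only, not actual Hadoop code; the record fields are invented):

```python
import json

# Records are dumped into the store as raw JSON lines, with no schema
# defined up front -- the Hadoop-style "just write it" approach.
raw_store = [
    '{"id": 1, "amount": 19.99, "channel": "web"}',
    '{"id": 2, "note": "free-form text with entirely different fields"}',
]

# The schema work is deferred: structure is imposed only when an
# analysis finally needs the data.
def read_amounts(store):
    for line in store:
        record = json.loads(line)        # parse at read time
        if "amount" in record:           # apply the "schema" only now
            yield record["id"], record["amount"]

print(list(read_amounts(raw_store)))     # [(1, 19.99)]
```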
Analytics on Hadoop
There is a flood of new interest in making Hadoop useful.
The original definitive use case: counting something.
If your data is so big that it must be distributed across many machines (e.g., 1,000 servers), each server can be organized to count something of interest in its local data store (doing anything on a single server is easy). We can then collect the subtotals and get a grand total.
Such a project can be expected to yield fast results no matter how many servers are involved. This is the usual first example of working with Hadoop and MapReduce (sketched below).
For more complex analytics, Hadoop alone has been found to be intolerably slow.
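A minimal Python sketch of that counting pattern (plain Python standing in for Hadoop/MapReduce; the shard data is invented for illustration):

```python
from collections import Counter
from functools import reduce

# Each "server" holds a local shard of the data and counts it independently
# (the map phase); the subtotals are then combined into a grand total
# (the reduce phase).
shards = [
    ["fraud", "ok", "ok"],           # data local to server 1
    ["ok", "fraud", "fraud", "ok"],  # data local to server 2
]

def map_phase(shard):
    return Counter(shard)            # cheap local subtotal on one machine

def reduce_phase(a, b):
    return a + b                     # merge two subtotals

grand_total = reduce(reduce_phase, map(map_phase, shards))
print(grand_total)                   # Counter({'ok': 4, 'fraud': 3})
```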
Romance of Big Data
Most non-analytics professionals hated their mandatory university statistics courses and learned little.
In spite of the power of advanced analytics having been demonstrated time and again since at least the 1980s, it has been difficult to excite managers about the topic.
Once the topic of analytics was inextricably linked with Big Data, the trade press and the popular press became enthralled.
WikiLeaks and NSA phone-call monitoring fueled the notion of Big Data as all-knowing and all-powerful. Suddenly analytics was seen as a source of power and control.
Source: Srikant Sastri, http://yourstory.com/2015/06/big-data-challenges-india/
Analytics Reality
While we might now have access to huge repositories of data, it is not the case that we require, or will use, all of this data.
Impressive predictive models are often constructed from a relatively small number of core predictors. Until recently, major bank credit-risk models might leverage only about 15 essential predictors; new-age modeling techniques might expand the set of predictors into the several hundreds.
We might have access to a huge number of predictors, but for any given predictive project we will typically end up using a very small fraction of them. Once we have narrowed our focus we can continue the analysis on the relevant subset of predictors.
Mathematics of Rare Events
Many analytical tasks focus on the prediction of rare events (fraud in a credit card transaction, conversion for an internet ad, i.e., someone actually makes a purchase).
Suppose a rare event occurs in 1 in 1,000 chances: one million chances will generate just 1,000 events.
We know that to optimally analyze 1,000 rare events we require all of the rare-event records and only a small sample of the common-event records (say 10,000).
Even if we start with 1 million rows of extensive data, we almost surely need only about 11,000 of these rows for first-rate modeling.
In other words, the Big Data problem quickly becomes a small data problem (see the sketch below).
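A short Python sketch of that downsampling arithmetic (the 1-in-1,000 rate and the 10,000-row sample of the common class follow the figures above; the data itself is simulated):

```python
import random

random.seed(0)

# Simulate one million "chances" with a 1-in-1,000 rare-event rate.
N = 1_000_000
rows = ["rare" if random.random() < 0.001 else "common" for _ in range(N)]

rare   = [r for r in rows if r == "rare"]      # keep ALL rare events (~1,000)
common = [r for r in rows if r == "common"]

# Keep every rare event plus a modest sample of the common class.
modeling_set = rare + random.sample(common, 10_000)

print(len(rare), len(modeling_set))   # roughly 1,000 rare rows, ~11,000 rows in total
```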
Value of Added GOODs in Small Samples of BADs
[Chart: variance of the discriminator as the 0:1 (GOOD:BAD) ratio increases, with N1 = 250 BADs and N0 = Factor × N1 GOODs]
Assume there is a single relevant predictor X and we want to measure the difference in the mean of X between the GOOD and BAD samples. The variance falls from 0.008 (factor 1) to about 0.0042 (factor 20), achieving roughly 95% of the maximum possible reduction.
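The variance claim can be checked with the standard formula for the variance of a difference of means. A minimal Python sketch, assuming unit variance of X within each class and reading the 95% figure as 95% of the maximum achievable reduction (an interpretation consistent with the chart's values):

```python
# Var(mean_GOOD - mean_BAD) = sigma^2/N0 + sigma^2/N1, with sigma^2 = 1 here.
N1 = 250                        # number of BADs, held fixed

def var_diff(factor, n1=N1):
    n0 = factor * n1            # number of GOODs = factor * number of BADs
    return 1.0 / n0 + 1.0 / n1

v1, v20 = var_diff(1), var_diff(20)
v_limit = 1.0 / N1              # best possible, with unlimited GOODs

print(v1, v20)                        # 0.008 and 0.0042, matching the chart
print((v1 - v20) / (v1 - v_limit))    # ~0.95 of the achievable reduction
```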
Validation ROC: Value of an Increased GOOD/BAD Ratio
Surprisingly low ratios are sufficient.
[Chart: validation ROC by GOOD/BAD ratio (0 to 40), with separate curves for 231, 491, 942, and 2,495 BADs; ROC ranges roughly from 0.88 to 0.95]
Each curve has a given number of BADs and varies the ratio of GOODs. Starting with more BADs yields better results, as expected.
Single Server Capacity
Take a modest 64 GB RAM server. Using learning machines that must hold the training data in RAM, it can effectively work with a data set of about 2 million rows by 3,000 columns. RAM is also needed for workspace, which is why the entire RAM is not dedicated to data storage.
If this is not enough, one can easily scale up to 512 GB RAM, which accommodates about 16 million rows instead (for example).
There are very few predictive modeling problems for which this capacity is not far more than sufficient.
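A quick arithmetic check of those capacity figures (a sketch assuming each value is stored as an 8-byte double; actual in-memory layouts vary by tool):

```python
# Approximate in-RAM footprint of a dense numeric training matrix.
BYTES_PER_VALUE = 8          # assumed 8-byte double per cell
GB = 1024 ** 3

def dataset_gb(rows, cols):
    return rows * cols * BYTES_PER_VALUE / GB

print(f"{dataset_gb(2_000_000, 3_000):.1f} GB")    # ~44.7 GB -- most of a 64 GB server
print(f"{dataset_gb(16_000_000, 3_000):.1f} GB")   # ~357.6 GB -- fits in 512 GB with workspace to spare
```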
What We Recommend
When it comes to Big Data, your organization has no choice: it will have to make the investments in hardware, software, and personnel (or get involved in the cloud).
Once the Big Data team has actually pulled something together, determine what information they have been able to organize that is not typically available for your analytics.
Request access to that data, or obtain an extract from the Big Data store that you can comfortably work with on a single server.
Use modern advanced analytics to ascertain the value (if any) of that added data.
Technologies You Need to Know About
Gradient Boosting and Random Forests.
Many textbooks and training videos are available.
These can be found in the offerings of all major vendors and in open source.
Both technologies were first brought to market by Salford Systems.