Wesley Pasfield Northwestern University


William.pasifield2016@u.northwestern.edu


March Madness Prediction Using Big Data Technology

Abstract: This paper explores leveraging big data technologies to create probabilistic predictive models for the NCAA Men's Basketball Tournament (March Madness). Historically, many approaches have been used to predict March Madness, ranging from picking teams based on mascot to heavily quantitative predictive models. Building a consistently successful model has proven extremely challenging because of the variability in each individual tournament. This paper attempts to identify the key regular season characteristics of teams that dictate success in the NCAA tournament, while leveraging big data technologies through the Apache Hadoop ecosystem. Specifically, it discusses creating probabilistic predictions using three different algorithms: logistic regression through the generalized linear model (GLM), a random forest classifier, and gradient boosting machines (GBM) with a Bernoulli distribution. All three use the R scripting functionality for the open source machine learning engine H2O on top of a single node Apache Hadoop cluster. The data used in this analysis came from the data science competition website Kaggle.com and from Ken Pomeroy's college basketball analytics website KenPom.com.

Keywords: Generalized Linear Model, Gradient Boosting Machines

I. INTRODUCTION

The NCAA basketball March Madness tournament is notorious for upsets and unpredictability. Many people fill out brackets trying to predict the outcome of the tournament, but most have to rip up their bracket after the first few rounds. Increasingly, efforts have been made to create a system that better predicts the outcome of NCAA tournament games.

In 2015, HP sponsored a competition on the data science competition website Kaggle, offering prizes to teams and individuals who built a model that minimizes the predictive binomial deviance in predicting the outcome of NCAA tournament games. Submissions were made in two stages: the first evaluates all possible matchups in the previous four tournaments, and the second evaluates the upcoming 2015 tournament. Historic data is provided in the form of game-by-game regular season and NCAA tournament data dating back to 2003, with the option of integrating external data as well. This paper examines the use of machine learning techniques through big data technologies to create these predictions. Everything discussed in this paper is based on the first submission.

Big Data Technologies Discussion

Big data technologies have disrupted organizations across industries, and in recent years have begun gaining widespread traction. In its early stages the Hadoop ecosystem, which facilitates massive data processing and evaluation through parallel computing, was seen as a necessity largely for internet-based companies that collected huge amounts of data and needed to extract and evaluate that data (Facebook, Amazon, Yahoo, etc.). In recent years, however, Hadoop and other big data technologies have gained adoption across more traditional industries. Big data is no longer viewed as a niche operation; instead, it has become a necessity for remaining competitive across numerous industries.

Hadoop

Hadoop is an open source framework for writing and running distributed applications that process massive quantities of data (Lam, 4). One of the major advantages distributed processing has over traditional relational database management systems is horizontal scalability. Relational databases scale vertically, typically restricting storage to one machine, which becomes problematic when data grows beyond the capability of that machine. Distributed processing scales horizontally on a cluster of commodity machines, with the cluster size determined by the user based on need. Another advantage of Hadoop is fault tolerance: data is replicated across multiple nodes spread across different servers, so if one node goes down the data remains intact. In addition, Hadoop is built for unstructured data, as it does not require a predefined schema before data is loaded, unlike relational databases. In summary, Hadoop is highly scalable, fault tolerant and built for semi-structured or unstructured data, making it an excellent complement to relational databases, which are still preferable for smaller datasets with a defined schema.

Hadoop Ecosystem:

In its early days, the only way to use the Hadoop ecosystem was to write MapReduce code, which is difficult to program and requires Java knowledge (Lam, 212). Programs like Pig and Hive have opened up Hadoop programming, in an easy to understand format, to people who do not know how to program MapReduce. Pig is a procedural language that allows users to supply high-level data transformations and to control the number of reduce tasks using the PARALLEL clause (Lam, 231). Hive is a declarative language that is almost identical to SQL and, like Pig, allows users to access the Hadoop ecosystem without having to deal with the complexities of MapReduce code (Lam, 247). These programs have opened up the Hadoop ecosystem to the masses, as most business analysts or programmers have some familiarity with either procedural or declarative languages and are thus able to use either Pig or Hive.
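To make the contrast between hand-written MapReduce and a declarative query concrete, the following is a minimal single-machine sketch in Python of the map, shuffle and reduce phases for a per-team average. It is an illustration only, not Hadoop code and not code from this project; the team names, point totals and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical game records: (team, points scored). Illustrative data only.
GAMES = [("Duke", 80), ("Duke", 70), ("Kansas", 65), ("Kansas", 75), ("Kansas", 70)]


def map_phase(records):
    """Emit one (key, value) pair per input record, as a mapper would."""
    for team, points in records:
        yield team, points


def shuffle_phase(pairs):
    """Group values by key, mimicking the framework's shuffle/sort step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single result (here, the mean)."""
    return {key: sum(values) / len(values) for key, values in grouped.items()}


average_points = reduce_phase(shuffle_phase(map_phase(GAMES)))
print(average_points)
```

In a declarative language such as Hive, the same aggregation would be expressed roughly as `SELECT team, AVG(points) FROM games GROUP BY team;` (with `games` as a hypothetical table name), leaving the map and reduce steps to the engine.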
Machine Learning with Hadoop:

Programs have also been developed that allow for machine learning and data mining on top of the Hadoop platform, such as Apache Mahout and H2O. In this project, models were built by running R on top of H2O; the details of the models and the H2O environment are discussed at length in later sections of this paper.

II. DATA COLLECTION, MANIPULATION & EXPLORATION

Details

Prior to modeling, numerous transformations and additions were made to the dataset. First, information was obtained from the college basketball website KenPom.com, which contains historical efficiency statistics for college basketball teams. These efficiency statistics include:

- Adjusted Offense: offensive efficiency relative to the number of possessions each team has. This captures scoring efficiency rather than just points per game, since teams play at different speeds and get different numbers of opportunities per game.
- Adjusted Defense: defensive efficiency relative to the number of possessions each team has. The same explanation as for offense applies.
- Adjusted Tempo: measures the speed at which teams play. The purpose of including it in the model is to see whether faster or slower teams have an advantage.
- SOS Difference: strength of schedule, which measures the quality of competition for each team (KenPom.com, 2).

All of these statistics should be extremely valuable in distilling insights about the teams, and they are complementary to the data provided by Kaggle, which was also transformed as detailed below:

- Location: whether the winning team was playing at home, away or on a neutral court.
- Offensive Rebound Differential: average offensive rebound differential per game for a team, calculated the same way as average point differential.
- Defensive Rebound Differential: average defensive rebound differential per game for a team, calculated the same way as the others in this section.
- Turnover Differential: average turnover differential per game for a team, calculated the same way as the others in this section.
- Three Point Reliance: percentage of a team's points scored from three-pointers.
- True Shooting Percentage: true shooting percentage for a team. True shooting percentage is a more accurate version of shooting percentage, as it weights shots based on their value (e.g., three-pointers are worth more).
- True Shooting Percentage Allowed: true shooting percentage allowed by a team. Same as above, but for what the team allows rather than shoots.
- Block Differential: average difference in blocks per game for a team vs. its opponents.

Dataset Summary:

Each instance in the dataset contains all of this information for the winning and losing team in each NCAA regular season and tournament game since 2003. Game-by-game statistics were provided, but all of the statistics listed above were rolled into season averages for each team for use in the tournament predictor. This allows for predictive evaluation of teams; using game-by-game statistics as predictors of outcome is not helpful, since it is impossible to know a game's statistics before the game occurs. All factors are expected to have a positive relationship with winning apart from adjusted defense, true shooting percentage allowed and turnover differential, where in all three cases the goal is to have lower figures than opponents. The objective of this competition is to minimize log loss, so a binary win/loss indicator was created; however, a continuous target could also be created using the margin of victory for each team in each game.

Modeling Details

The individual game predictors for this study were season averages for each team. The model must be run on season averages because, as discussed, in-game statistics are not available prior to the game.

Hive Analysis:

Prior to modeling, data was uploaded into Hive to get a better understanding of how the variables relate to the outcome variable. Prior to evaluation, all factors were converted to z scores based on yearly averages for flexibility in evaluation and modeling. In each game evaluation, the difference in z scores between the winning and losing team is the indicator for each variable.

The above chart shows the average z score for teams that won vs. teams that lost, and the code above it is what was used to generate the report. Teams that won are represented by the binary variable in the newmm.binarymargin column: 1 indicates wins, 0 indicates losses. Unsurprisingly, teams that have an above average offensive efficiency rate and a below average defensive efficiency rate (that is, they allow fewer points per possession) win more frequently. Tempo is very close to zero, indicating there isn't much of a relationship between pace of play and winning in a vacuum, while strength of schedule has a positive relationship, which makes sense because better teams play harder schedules. Let's take a look at the remaining statistics, all from the provided Kaggle database:

All of these results are also intuitive: teams that shoot better and force their opponents into worse shooting percentages win more often. Defensive rebounding and true shooting had the highest z scores associated with winning teams. The difference between offensive and defensive rebounding is particularly interesting, as it appears defensive rebounding is much more important.

Finally, an analysis was run to see how these figures look when taking margin of victory into account. Presumably, the numbers would be more pronounced in either direction, because margin of victory takes into account the disparity between the two teams rather than just a binary win or loss.
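The per-season z-score conversion and winner-minus-loser differencing described in the Hive analysis can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Hive code used in the project: the team labels and efficiency values are made up, and the use of the population standard deviation is an assumption, since the paper does not say which form was used.

```python
from statistics import mean, pstdev

# Hypothetical season averages: {season: {team: adjusted offensive efficiency}}.
ADJ_OFFENSE = {
    2014: {"A": 110.0, "B": 100.0, "C": 90.0},
    2015: {"A": 104.0, "B": 100.0, "C": 96.0},
}


def season_z_scores(values_by_team):
    """Standardize one season's values against that season's own mean and spread."""
    mu = mean(values_by_team.values())
    sigma = pstdev(values_by_team.values())  # assumption: population standard deviation
    return {team: (value - mu) / sigma for team, value in values_by_team.items()}


# Z scores are computed separately for each year, as described in the text.
Z = {season: season_z_scores(teams) for season, teams in ADJ_OFFENSE.items()}


def z_difference(season, winner, loser):
    """Per-game indicator: winner's z score minus loser's z score."""
    return Z[season][winner] - Z[season][loser]


print(z_difference(2014, "A", "C"), z_difference(2015, "A", "C"))
```

In this toy data the raw gap between teams A and C is 20 points in 2014 and 8 in 2015, yet the z-score difference is the same in both years, which is the sense in which yearly standardization makes seasons comparable.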

The above chart validates the importance of the adjusted offensive and defensive efficiency statistics. As the margin of victory increases (far left column), the adjusted offensive efficiency z score increases, while the adjusted defensive efficiency z score decreases. There are minor inconsistencies, but in general there is a clear linear trend for both variables. That trend is outlined by the graph below, which was generated using ggplot2 in R after exporting the Hive results to CSV.

The higher margins show larger variation because there are few instances of them; however, the trends for both adjusted offensive efficiency and adjusted defensive efficiency are clear, which is promising in terms of these variables being strong indicators of the outcome in the models.

As seen in the code below, this pull was filtered to include only double-digit victories, because there is less randomness when the margin of victory increases. The purpose of this chart is to confirm that the expected trends exist, so looking at the cases where randomness is less of a factor is useful.

Using the same premise, true shooting percentage was evaluated relative to true shooting percentage allowed:

In this case the same logical relationship emerges, but not as strongly as with overall offensive and defensive efficiency. This again makes sense: shooting is only a portion of a team's overall performance on offense and defense, so overall offense and defense have a stronger relationship with the outcome than true shooting percentage and true shooting percentage allowed in isolation.

III. MODEL BUILDING & EVALUATION

Modeling was completed using the H2O integration with R, running on top of the Hadoop cluster installed on a Cloudera virtual machine. To do this, H2O and R (plus RStudio) must be installed on the virtual machine, the H2O package must then be installed within R, and finally the H2O connection has to be configured with the below code. The great thing about this integration is that R acts as the interface for H2O, letting users with a background in R keep working in R while retaining the power of distributed computing that H2O provides. While the interface may be R, the objects, math and computing are all handled by H2O, and to reach the optimal power of H2O it must be run on a server, or else the environment will become unstable (Lang, 1). H2O must be running before the H2O package is initiated within R, or the package will not be able to load.

After configuration, training and testing files were imported into the R environment using the below code (summary was used to ensure the data looks in order):

Logistic Regression:

After this step, a logistic regression model was built using the H2O GLM functionality, including all independent variables in the file and using 75% of the instances from the regular season file as the training set (Fu, Aiello, Rao, Wang, Kraljevic, Maj 46). Below is the code, as well as the results from the model:

Logistic Regression Evaluation:

These results show strong overall performance (as seen in the high AUC); however, the model seems to over-predict victories relative to losses, as seen in the low threshold figure as well as the disparity of false to true predictions. The coefficients are logical based on the data exploration done using Hive, as offense is highly positive and defense is highly negative. It does appear that many factors are unimportant, so there is potential to rerun the model with fewer variables, but in the interest of time that was not done in this analysis. Running the test set on this file produced very similar results to training.

The above chart shows the tendency to predict teams to win: notice that there are more missed upsets than correct predictions when the margin of victory is very low (1-4). A negative margin of victory with a missed upset indicates a team that lost and was projected to win.

Next, the model was run against a different test set: historic NCAA tournament results. Major differences between performance on this test set and on the regular season test set would indicate that there are significant differences between regular season and postseason play in predicting the outcomes of games. Notable differences between the regular season and the tournament are that all tournament games are played on neutral courts, the quality of each team is better in the tournament, and there are far fewer observations in the tournament file. Here are the results from the tournament test set:

The results look very similar to the regular season test set: a tendency to over-predict wins relative to losses but, in general, fairly accurate.

Finally, predictions were run on the submission dataset. The code is below; no outcome evaluation is possible because the outcomes are not provided:

Unfortunately, the submission was uploaded after the competition deadline due to time zone confusion; however, it would have ranked 22nd out of 347 as scored on the predictive binomial deviance.

Random Forest Classifier:

Next, a random forest classifier was built and its performance was compared to the original logistic regression model (Fu et al. 83). The random forest model is shown below. The regular season test set and the tourney test set were both used to validate the model, and overall results were inferior to the logistic regression model despite trying different parameters.
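Because every model in this section is compared on predictive binomial deviance (log loss), it may help to see how a logistic model turns z-score differences into a win probability and how that probability is scored. The sketch below is a stdlib-only Python illustration, not the H2O GLM used in the project; the coefficient values and feature names are hypothetical and were not taken from the fitted model.

```python
import math


def win_probability(z_diffs, coefficients, intercept=0.0):
    """Logistic model: sigmoid of a weighted sum of z-score differences."""
    linear = intercept + sum(coefficients[name] * z_diffs[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-linear))


def log_loss(outcomes, probabilities, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (1 = win, 0 = loss)."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) can never occur
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(outcomes)


# Hypothetical coefficients: offense positive, defense negative, as in the text.
COEFFICIENTS = {"adj_offense": 1.0, "adj_defense": -1.0}

even_matchup = win_probability({"adj_offense": 0.0, "adj_defense": 0.0}, COEFFICIENTS)
favorite = win_probability({"adj_offense": 1.5, "adj_defense": -1.0}, COEFFICIENTS)

print(even_matchup, favorite)
print(log_loss([1, 0], [0.5, 0.5]))
```

Under this metric a confident wrong prediction is penalized far more heavily than a cautious one, which is why a model that systematically over-predicts wins can still score well if its probabilities stay moderate on close matchups.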

Unsurprisingly, the model better predicted the regular season dataset (testfile), which is much larger and more representative of the training set than the tournament test set (tourney). The submission prediction was run, and the predictive binomial deviance was also worse than that of the logistic regression model, most likely due to overfitting (too many trees or too much tree depth).

GBM Model:

The final model run was the h2o.gbm model (Fu et al. 39). GBM in this example refers to gradient boosted classification trees; it essentially creates numerous decision trees, each learning from the residuals of the prior tree. The results are as follows:

Ultimately the GBM predictor was outperformed by the original logistic regression model, but only slightly. Surprisingly, the GBM tourney test set had a higher AUC than the regular season test set, unlike the random forest model. With more time and effort, the GBM model has the potential to improve significantly, and it is ultimately the most promising algorithmic solution.

IV. CONCLUSION

Overall, the power of big data technologies is expansive, and the ability to run complex and sophisticated analyses on massive datasets represents a tremendous opportunity globally. With very little manipulation, the H2O machine learning engine was able to produce an extremely competitive and actionable model in a very short period of time. H2O is an extremely powerful technology that brings the power of R to a much greater data scale. Currently, the algorithms available in H2O are not as expansive as the library in R; however, that gap should continue to narrow as H2O evolves as a platform. In conclusion, big data technologies are no longer restricted to a select few individuals and organizations. New languages like Pig and Hive, and open source platforms like Cloudera and H2O, have opened up distributed computing to the masses. As the Hadoop ecosystem continues to gain traction and to represent a competitive advantage in the marketplace based on insight extraction, the importance of big data will continue to grow.
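For readers who want to see the residual-fitting idea behind the GBM model of Section III in concrete form, the following is a minimal Python sketch of gradient boosting with a Bernoulli loss, using one-split trees (stumps) on a single feature. It is not H2O's implementation and was not part of the project; the data, function names, number of trees and learning rate are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_stump(xs, residuals):
    """Find the one-split tree that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbm(xs, ys, n_trees=50, learning_rate=0.5):
    """Each stump is fit to y - p, the negative gradient of the Bernoulli log loss."""
    stumps = []
    scores = [0.0] * len(xs)  # running log-odds for each training row
    for _ in range(n_trees):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Store leaf values already scaled by the learning rate.
        stump = (threshold, learning_rate * left_value, learning_rate * right_value)
        stumps.append(stump)
        scores = [s + (stump[1] if x <= threshold else stump[2]) for x, s in zip(xs, scores)]
    return stumps


def predict(stumps, x):
    """Sum every stump's contribution to the log-odds, then convert to a probability."""
    return sigmoid(sum(left if x <= t else right for t, left, right in stumps))


# Hypothetical data: a z-score difference per game and whether the team won.
XS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
YS = [0, 0, 0, 1, 1, 1]
MODEL = fit_gbm(XS, YS)

print(predict(MODEL, 2.0), predict(MODEL, -2.0))
```

Each round shrinks the residuals a little further, so the number of trees and the learning rate together control how closely the ensemble fits the training data, which is one place where additional tuning effort could be spent.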

REFERENCES

Fu, Anqi, Aiello, Spencer, Rao, Ariel, Wang, Amy, Kraljevic, Tom and Maj, Peter. H2O R Interface. February 2, < project.org/web/packages/h2o/h2o.pdf>.

Lang, Irene. Running a GLM Model in H2O + R (notes from the hand-on meetup (Sept. 26)). September 27, <0xdata.com>.

Pomeroy, Ken. Ratings Glossary. June 8, <kenpom.com>.

Lam, Chuck. Hadoop In Action. Stamford, CT: Manning Publication Co. Print.


More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

2015 Workshops for Professors

2015 Workshops for Professors SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Big Data Big Deal? Salford Systems www.salford-systems.com

Big Data Big Deal? Salford Systems www.salford-systems.com Big Data Big Deal? Salford Systems www.salford-systems.com 2015 Copyright Salford Systems 2010-2015 Big Data Is The New In Thing Google trends as of September 24, 2015 Difficult to read trade press without

More information

Seed Distributions for the NCAA Men s Basketball Tournament: Why it May Not Matter Who Plays Whom*

Seed Distributions for the NCAA Men s Basketball Tournament: Why it May Not Matter Who Plays Whom* Seed Distributions for the NCAA Men s Basketball Tournament: Why it May Not Matter Who Plays Whom* Sheldon H. Jacobson Department of Computer Science University of Illinois at Urbana-Champaign shj@illinois.edu

More information

Big Data Analytics OverOnline Transactional Data Set

Big Data Analytics OverOnline Transactional Data Set Big Data Analytics OverOnline Transactional Data Set Rohit Vaswani 1, Rahul Vaswani 2, Manish Shahani 3, Lifna Jos(Mentor) 4 1 B.E. Computer Engg. VES Institute of Technology, Mumbai -400074, Maharashtra,

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look IBM BigInsights Has Potential If It Lives Up To Its Promise By Prakash Sukumar, Principal Consultant at iolap, Inc. IBM released Hadoop-based InfoSphere BigInsights in May 2013. There are already Hadoop-based

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by :- Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress

More information

Delivering Real-World Total Cost of Ownership and Operational Benefits

Delivering Real-World Total Cost of Ownership and Operational Benefits Delivering Real-World Total Cost of Ownership and Operational Benefits Treasure Data - Delivering Real-World Total Cost of Ownership and Operational Benefits 1 Background Big Data is traditionally thought

More information

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases

INDUS / AXIOMINE. Adopting Hadoop In the Enterprise Typical Enterprise Use Cases INDUS / AXIOMINE Adopting Hadoop In the Enterprise Typical Enterprise Use Cases. Contents Executive Overview... 2 Introduction... 2 Traditional Data Processing Pipeline... 3 ETL is prevalent Large Scale

More information

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

A survey of big data architectures for handling massive data

A survey of big data architectures for handling massive data CSIT 6910 Independent Project A survey of big data architectures for handling massive data Jordy Domingos - jordydomingos@gmail.com Supervisor : Dr David Rossiter Content Table 1 - Introduction a - Context

More information

Comprehensive Analytics on the Hortonworks Data Platform

Comprehensive Analytics on the Hortonworks Data Platform Comprehensive Analytics on the Hortonworks Data Platform We do Hadoop. Page 1 Page 2 Back to 2005 Page 3 Vertical Scaling Page 4 Vertical Scaling Page 5 Vertical Scaling Page 6 Horizontal Scaling Page

More information

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: 2454-2377 Vol. 1, Issue 6, October 2015. Big Data and Hadoop ISSN: 2454-2377, October 2015 Big Data and Hadoop Simmi Bagga 1 Satinder Kaur 2 1 Assistant Professor, Sant Hira Dass Kanya MahaVidyalaya, Kala Sanghian, Distt Kpt. INDIA E-mail: simmibagga12@gmail.com

More information

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata

Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata Up Your R Game James Taylor, Decision Management Solutions Bill Franks, Teradata Today s Speakers James Taylor Bill Franks CEO Chief Analytics Officer Decision Management Solutions Teradata 7/28/14 3 Polling

More information

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE

NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE Anjali P P 1 and Binu A 2 1 Department of Information Technology, Rajagiri School of Engineering and Technology, Kochi. M G University, Kerala

More information

Big Data and Natural Language: Extracting Insight From Text

Big Data and Natural Language: Extracting Insight From Text An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5

More information

Big Data Weather Analytics Using Hadoop

Big Data Weather Analytics Using Hadoop Big Data Weather Analytics Using Hadoop Veershetty Dagade #1 Mahesh Lagali #2 Supriya Avadhani #3 Priya Kalekar #4 Professor, Computer science and Engineering Department, Jain College of Engineering, Belgaum,

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

locuz.com Big Data Services

locuz.com Big Data Services locuz.com Big Data Services Big Data At Locuz, we help the enterprise move from being a data-limited to a data-driven one, thereby enabling smarter, faster decisions that result in better business outcome.

More information

How To Improve Your Profit With Optimized Prediction

How To Improve Your Profit With Optimized Prediction Higher Business ROI with Optimized Prediction Yottamine s Unique and Powerful Solution Forward thinking businesses are starting to use predictive analytics to predict which future business events will

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

Oracle Data Miner (Extension of SQL Developer 4.0)

Oracle Data Miner (Extension of SQL Developer 4.0) An Oracle White Paper October 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Generate a PL/SQL script for workflow deployment Denny Wong Oracle Data Mining Technologies 10 Van de Graff Drive Burlington,

More information

Big Data must become a first class citizen in the enterprise

Big Data must become a first class citizen in the enterprise Big Data must become a first class citizen in the enterprise An Ovum white paper for Cloudera Publication Date: 14 January 2014 Author: Tony Baer SUMMARY Catalyst Ovum view Big Data analytics have caught

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

In-Database Analytics

In-Database Analytics Embedding Analytics in Decision Management Systems In-database analytics offer a powerful tool for embedding advanced analytics in a critical component of IT infrastructure. James Taylor CEO CONTENTS Introducing

More information

CDH AND BUSINESS CONTINUITY:

CDH AND BUSINESS CONTINUITY: WHITE PAPER CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop Abstract Using the sophisticated built-in capabilities of CDH for tunable

More information

Client Overview. Engagement Situation. Key Requirements

Client Overview. Engagement Situation. Key Requirements Client Overview Our client is one of the leading providers of business intelligence systems for customers especially in BFSI space that needs intensive data analysis of huge amounts of data for their decision

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

BIG DATA TOOLS. Top 10 open source technologies for Big Data

BIG DATA TOOLS. Top 10 open source technologies for Big Data BIG DATA TOOLS Top 10 open source technologies for Big Data We are in an ever expanding marketplace!!! With shorter product lifecycles, evolving customer behavior and an economy that travels at the speed

More information

The Rise of Industrial Big Data

The Rise of Industrial Big Data GE Intelligent Platforms The Rise of Industrial Big Data Leveraging large time-series data sets to drive innovation, competitiveness and growth capitalizing on the big data opportunity The Rise of Industrial

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business Instructor: Kunpeng Zhang (kzhang@rmsmith.umd.edu) Lecture-Discussions:

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

ANALYTICS CENTER LEARNING PROGRAM

ANALYTICS CENTER LEARNING PROGRAM Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals

More information

Big Data at Cloud Scale

Big Data at Cloud Scale Big Data at Cloud Scale Pushing the limits of flexible & powerful analytics Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For

More information

DATAOPT SOLUTIONS. What Is Big Data?

DATAOPT SOLUTIONS. What Is Big Data? DATAOPT SOLUTIONS What Is Big Data? WHAT IS BIG DATA? It s more than just large amounts of data, though that s definitely one component. The more interesting dimension is about the types of data. So Big

More information

Big Data Analytics and Optimization

Big Data Analytics and Optimization Big Data Analytics and Optimization C e r t i f i c a t e P r o g r a m i n E n g i n e e r i n g E x c e l l e n c e e.edu.in http://www.insof LIST OF COURSES Essential Business Skills for a Data Scientist...

More information

Risk pricing for Australian Motor Insurance

Risk pricing for Australian Motor Insurance Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model

More information

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools

More information

Case Study : 3 different hadoop cluster deployments

Case Study : 3 different hadoop cluster deployments Case Study : 3 different hadoop cluster deployments Lee moon soo moon@nflabs.com HDFS as a Storage Last 4 years, our HDFS clusters, stored Customer 1500 TB+ data safely served 375,000 TB+ data to customer

More information

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

More information

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES Translating data into business value requires the right data mining and modeling techniques which uncover important patterns within

More information

Cleaned Data. Recommendations

Cleaned Data. Recommendations Call Center Data Analysis Megaputer Case Study in Text Mining Merete Hvalshagen www.megaputer.com Megaputer Intelligence, Inc. 120 West Seventh Street, Suite 10 Bloomington, IN 47404, USA +1 812-0-0110

More information

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley Disclaimer: This material is protected under copyright act AnalytixLabs, 2011. Unauthorized use and/ or duplication of this material or

More information