Predict NBA Game Pace ECE 539 final project Fa Wang

Abstract

Not long ago, people believed that putting all the best players on the court could guarantee winning a game, or even a championship. More recent historical data shows that a winning team need not have the highest-scoring offense and the stingiest defense. To evaluate the quality of a team, an index called the pace value has been introduced. The goal of our project is to predict the pace value of an NBA game by analyzing a large amount of historical data from the Internet. To do that, we first extract key features from play-by-play data and game stats data with the map-reduce pattern. Because of the large volume of historical play-by-play data, we take advantage of HBase [1] to store and process it. Unlike traditional standalone machine-learning designs, this paper focuses on applying the map-reduce pattern together with machine-learning methods to solve a big-data problem. By examining the mean-square error (MSE) of the final result, we can judge how far our prediction is from the real value and thereby evaluate the correctness and accuracy of our model.

I. INTRODUCTION

Predicting NBA game scores has been a popular topic for a long time. In recent years, people have turned to possession-based score models, which have proved to give more accurate statistical results. Later, a new index was introduced to evaluate the performance of a team in a given game: the pace value [2]. According to Basketball-Reference, the pace factor is an estimate of the number of possessions a team uses per game. Understanding an opponent's pace value allows a coach to prepare better for a game, because the pace factor is closely related to the game's score. The score formula is Score = ASPM * Pace, where ASPM [3] denotes points per possession and Pace is the metric for possessions. An accurate prediction of the pace value therefore yields a precise final score.
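The score formula above is simply a per-possession rate times a possession count. A minimal sketch (the numbers 1.12 and 96 below are illustrative, not taken from the paper):

```python
def expected_score(aspm: float, pace: float) -> float:
    """Expected score = points per possession (ASPM) times possessions (pace)."""
    return aspm * pace

# A team scoring 1.12 points per possession over 96 possessions
# projects to roughly 107.5 points.
print(round(expected_score(1.12, 96), 1))  # prints 107.5
```

This also shows why a small pace error matters: misestimating pace by 4 possessions shifts the projected score by about 4 * ASPM points.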
In our paper, we merge play-by-play (PBP) data and each game's stats data to predict a future game's pace value. The challenge of predicting a game's pace is obvious. Different styles of defensive and offensive strategy may directly influence the pace of a game, especially during clutch time, when a game comes down to a few possessions; different teams may use the clock differently to defeat their opponents. An older team may also be more likely to spend longer on each offensive possession than a younger team. Pace may even depend on players' health and attitude. As a result, the pace factor is nonlinear and difficult to predict with a small set of data.

II. MOTIVATION

Precisely predicting the result of a future NBA game is a tough but appealing task for NBA fans and non-fans alike. Recent research shows that a precise game pace value is the key factor, so our team decided to take a trial with this sub-topic. With a precise pace evaluation of a team, we can apply it to all of its players, so that each player can be correctly tagged with a pace capability. A faster team may choose faster players, so our assessment could serve as a toolbox for teams to select desired players. NBA game betting is another appealing target for this kind of research.

III. RELATED WORK

There is a good deal of research on the topic of pace value. Paper [4] examines the optimality of the shooting decisions of National Basketball Association (NBA) players using a rich dataset of 1.4 million offensive possessions. The decision to shoot is a complex problem that involves weighing the continuation value of the possession against the outside option of a teammate shooting. Applying their abstract model to the data, the authors make assumptions about the distribution of potential shots. In line with dynamic efficiency, they find that the cut threshold declines monotonically with time remaining on the shot clock at approximately the correct rate. Most line-ups show strong adherence to allocative efficiency, and departures from optimality are linked to line-up experience, player salary, and overall ability.
Paper [5] discusses how teams trade off the value of controlling the length of a game. The author uses the 859 games played by the 30 NBA teams during a regular season, and finds that as a game nears its end, the leading team tries to decrease the pace while the trailing team tries to increase it. The author also concludes that the leading team scores fewer points in the last few minutes of the game than in the third quarter. Whether a player's decision would increase or decrease the pace, given the strategy of the opponent, requires further investigation.

Bigtable

The Bigtable [6] data structure is a key component of the project and of mainstream big-data analytics today. The paper "Bigtable: A Distributed Storage System for Structured Data" gives a thorough overview of the design and usage of the data structure, from index keys and data compression to implementation and the Google Earth example. We consider Bigtable a big help for our project because the data structure it defines is sparse and distributed across all clusters. The main operations in our project are mapping and reducing, and a Bigtable indexed by row key, column key, and timestamp lets us retrieve data across multiple clusters efficiently.

HBase

Managing a huge amount of data is a nightmare for engineers: it demands fault tolerance from both the system and the program. Processing large data efficiently and safely is also a key part of the project. MySQL was a reliable ACID storage platform for the last decade, until the introduction of big-data analytics. A related paper, "Solving Big Data Challenges for Enterprise Application Performance Management" [7], compares the workload performance of MySQL and HBase. Their experiments show that MySQL has very good scan throughput on a single node, but it does not scale with the number of nodes as the data size increases. HBase, on the other hand, scales to a linear increase in throughput with the number of nodes. Another reason for using HBase is execution speed: in the traditional approach, data is copied to and run on a local machine. The paper "MapReduce: Simplified Data Processing on Large Clusters" [8] discusses locality in large-cluster environments: the MapReduce master takes location information into account and schedules each map task near a replica of its input data.

IV. DESIGN

Fg1. Program Diagram

The design of the program has three main phases, as the diagram above shows. First, in the data-source retrieval phase, a large set of HTML play-by-play files is interpreted and stored into HBase, alongside the game-status data source. Second, in the feature-generation phase, map-reduce is used to combine the two data sources in HBase and create 20 features for computation. Third, in the data-training phase, since the data size has been reduced in phase 2, we can use the support vector regression (SVR) algorithm to train on the data set in the conventional way.

Data source retrieval

Two types of data source are used in this paper. The first is the PBP (play-by-play) data, retrieved from the NBA website. The PBP data records ball possession, shot attempts, and possession turnovers at each moment of an NBA game. The PBP data used in this paper consists of 4000 games, where each game's PBP record is presented in HTML format on the NBA website; the total PBP data size is about 2.76 GB. Each PBP record has two tables that are useful to our project: one gives the abbreviations of the home team and the away team, and the other holds the detailed PBP data.

To reduce size and improve accuracy, a Java preprocessor program translates the raw data and eliminates noise from the website. After preprocessing, 5000 PBP records are inserted into an HBase table as HTML-format strings for the feature-generation phase. Game status data is the other data source; it contains more static information about each team's condition, such as the pace value of the current game and how long the home team has rested before the current game. Parts of the game status data can be computed after the map-reduce process during the feature-generation phase.

Table 1. HBase table layout

gameid | Column
-------+------------------
...    | <html>...</html>

Features generation

With 5000 HTML file entries stored in HBase, we can now implement the map-reduce function. The input to a mapper is a pair of gameid and column value from the HBase table. In the mapper, each HTML file is translated into home-team attack time and home-team defense time. Table 2 is a sample PBP HTML record; to compute the home-team attack time, we subtract the Time value of one row from that of the previous row, and a similar process computes the defense time. After the mapper, a pair of gameid and a feature value (the set of attack and defense times) is output to the reduce function.

Table 2. Sample rows of a PBP record

Time     | New York                                                  | Score | Indiana
11:      |                                                           |       | L. Stephenson makes 2-pt shot from 1 ft
11:15.0  | Shumpert makes pt shot from 23 ft (assist by C. Anthony)  |       |
10:      |                                                           |       | Personal block foul by I. Shumpert
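The subtraction of successive Time values can be sketched as follows. This is a Python sketch of the mapper's core computation (the project's own mapper is written in Java over HBase); the event format and the rule of charging each interval to the team that held the ball are simplified assumptions:

```python
def clock_to_seconds(clock: str) -> float:
    """Convert a game-clock string like '11:15.0' to seconds remaining."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)

def possession_times(rows):
    """Given (clock, team) events in game order, return per-team elapsed
    possession time. Each interval between two events is charged to the
    team that held the ball at the start of the interval."""
    totals = {}
    for (prev_clock, team), (next_clock, _) in zip(rows, rows[1:]):
        elapsed = clock_to_seconds(prev_clock) - clock_to_seconds(next_clock)
        totals[team] = totals.get(team, 0.0) + elapsed
    return totals

# Two Indiana possessions totalling 22.8 + 17.0 seconds before New York's event.
print(possession_times([("11:37.8", "IND"), ("11:15.0", "IND"), ("10:58.0", "NYK")]))
```

Summing these per-team intervals over a whole game yields the attack and defense times that the mapper emits per gameid.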

The reduce function is called once for each gameid. In our case there is no real work in the reduce phase; its only task is to write the mapper's output directly into a database alongside the pace value of the current game. The mapper phase has already extracted features from the raw data and reduced the data size heavily. With a set of base features for each gameid entry, more features can be computed in a post-map-reduce phase: the average home attack time over the last 5 games can be computed by iterating through the last 5 records, the average game pace value in a similar fashion, and so on. Half of the complete feature list is shown in Table 3; the second half is the same except that it is for the away team.

Table 3. Features (home-team half)

Feature                  Description
gameid                   ID identifying each game.
HomeTeam                 Home team name.
Avg1 Home defend time    Average of the home team's defend time over all games before the current game.
Avg2 Home defend time    Average of the home team's defend time over all games this season.
Avg1 Home offense time   Average of the home team's offense time over all games before the current game.
Avg2 Home offense time   Average of the home team's offense time over all games this season.
Avg1 Home Pace           Average of the home team's pace value over all games before the current game.
Avg2 Home Pace           Average of the home team's pace value over all games this season.
Std1 Var Home Pace       Standard variance of the home team's pace value over all games before the current game.
Std2 Var Home Pace       Standard variance of the home team's pace value over all games this season.
Avg3 Home Pace           Average of the home team's pace value over the last 3 games before the current game.
Avg5 Home Pace           Average of the home team's pace value over the last 5 games before the current game.
Avg7 Home Pace           Average of the home team's pace value over the last 7 games before the current game.

The first step in calculating features is reading data from HBase and storing it in a container. Because features are calculated by game ID and team name, we need to look data up whenever we use it. A hash map is chosen as the container because lookup by a known key is very fast. Three hash maps are needed. The first stores the imported data: the game ID is the key, and an array holding the other data, such as average attack time, defend time, actual days since season start, rest days, and actual pace, is the value. Importantly, the team names must also be elements of this array, because features are calculated not only by game ID but also by team name: the first value of the array is the home team's name, and the first value of the second half of the array is the away team's name. The second hash map stores intermediate data: team names are the keys, and the values hold running data such as the sum of average attack time and the sum of average defend time up to the day in question, the number of games the team has played, and so on. The third hash map stores the final result: the game ID is the key, and an array of all the features is the value.

The second step is the calculation. The average attack time is the sum of attack times divided by the number of games played; average defend time and average pace value are computed the same way. The variance is the average of the squared difference between each pace and the average pace. The rest days of each team take one of eight forms, so eight features are added per team: the feature corresponding to the team's rest-day form is 1 and the other seven are 0. All of this data is stored directly into the third hash map, with game IDs as keys and feature arrays as values.
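The hash-map pipeline described above can be sketched with Python dictionaries (the project's own implementation uses Java HashMaps; the field set is a simplified assumption, and only the home-team averages are shown, the away-team half being symmetric):

```python
def build_features(games):
    """games: list of (game_id, home, away, home_attack, home_defend, pace)
    tuples in chronological order. Returns {game_id: feature dict}, where each
    game's features are averages over that team's *previous* games only."""
    running = {}   # team -> running sums and game count (the intermediate map)
    features = {}  # game_id -> features known before tip-off (the result map)
    for game_id, home, away, attack, defend, pace in games:
        stats = running.setdefault(home, {"attack": 0.0, "defend": 0.0, "pace": 0.0, "n": 0})
        if stats["n"] > 0:  # no features for a team's first game
            features[game_id] = {
                "avg_home_attack": stats["attack"] / stats["n"],
                "avg_home_defend": stats["defend"] / stats["n"],
                "avg_home_pace": stats["pace"] / stats["n"],
            }
        stats["attack"] += attack
        stats["defend"] += defend
        stats["pace"] += pace
        stats["n"] += 1
    return features

games = [
    ("g1", "NYK", "IND", 14.0, 15.0, 95.0),
    ("g2", "NYK", "BOS", 13.0, 16.0, 97.0),
    ("g3", "NYK", "MIA", 12.5, 15.5, 99.0),
]
print(build_features(games))
```

Computing each game's features only from games strictly before it is what keeps the training data free of leakage from the game being predicted.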

Data training

We decided to use support vector regression (SVR) as the main learning method for this project. As discussed in the previous section, the pace value of a game is difficult to predict because multiple nonlinear factors, such as player health, team chemistry, and the coach's game plan, are at work, and these factors are unpredictable. SVR is therefore the primary algorithm for the data-training phase. To understand this model, start with the most elementary algorithm, linear regression, which minimizes the quadratic cost function (1). In linear regression, we solve the equation by finding the optimal weight vector w. Equation (1) has a problem even with a linear training set: over-fitting, which appears when the function performs well only on the training set but poorly on unseen data. To avoid this problem, a penalty term on w is introduced in equation (2). With the penalized form of the linear equation, we can develop a nonlinear form by adding a basis function that maps the existing vectors into a higher dimension. The idea is very simple: if we cannot predict the pace value accurately with the existing features, we simply add more features! A basis function is a mapping from finite vectors to finite or infinite vectors. Before introducing the basis function, we rewrite equation (2) as (3); the basis function is then applied to convert each xi into Bi = B(xi).
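The numbered equations referenced in this section did not survive transcription. A standard reconstruction consistent with the surrounding discussion (w, xi, and B are as used in the text; the penalty weight lambda and the coefficients alpha are assumed notation, not the authors' own):

```latex
% (1) quadratic cost of linear regression
J(w) = \sum_{i=1}^{n} \left( y_i - w^{\top} x_i \right)^2
% (2) penalized form that controls over-fitting
J(w) = \sum_{i=1}^{n} \left( y_i - w^{\top} x_i \right)^2 + \lambda \lVert w \rVert^2
% (3) the same cost after mapping inputs through a basis function B
J(w) = \sum_{i=1}^{n} \left( y_i - w^{\top} B(x_i) \right)^2 + \lambda \lVert w \rVert^2
% (4) the optimal w lies in the span of the mapped training points
w = \sum_{i=1}^{n} \alpha_i B(x_i)
% (5) predictions then need only inner products, i.e. a kernel
f(x) = \sum_{i=1}^{n} \alpha_i \, k(x_i, x), \qquad k(x_i, x) = B(x_i)^{\top} B(x)
```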

With some algebra, the final form of w is derived (4); note that B is a vector with as many entries as there are features. Substituting (4) into the original equation gives form (5), the final form of the equation. The inner product of basis functions can be evaluated by a kernel function; in fact, the kernel used in this project is the radial basis function (RBF), which maps data from a finite dimension to an infinite one. The optimal parameters of the algorithm are found through quadratic optimization; the complete derivation and optimization can be found in the tutorial by Alex J. Smola [10].

We slightly change our feature file by prepending a number to each line, as required by the SVR input format. We then split the feature file into two parts, one covering 2009 to 2012 and one covering the remaining year; the former is used as training data and the latter as testing data. The model is fit on the training data and its predictions are evaluated against the testing data. We use the mean-square error, that is, the variance of the error between the estimated and the real result, to evaluate our prediction. The best result in the field to date is about 14.2, so we set our goal at 16 or less; in any case we follow the principle that lower is better. We then have to choose the SVR core and mode to build the most suitable model for our data. Since our feature data does not fall into two or more discrete categories, we use a regression core instead of a classification core. For the mode there are two main choices, a linear fit and a Gaussian fit; we test both and compare the outcomes.
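The two ingredients named here, the RBF kernel and the mean-square error, can be sketched directly. This is a minimal sketch with illustrative numbers, not the full SVR solver; gamma is an assumed kernel-width parameter:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian/RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def mse(y_true, y_pred):
    """Mean-square error between real and estimated pace values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Identical feature vectors have kernel value 1; distant ones decay toward 0.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # prints 1.0
# An MSE of 16 on pace values near 100 is a 4-possession RMSE, about 4%.
print(math.sqrt(mse([96.0, 100.0], [100.0, 104.0])))  # prints 4.0
```

The kernel value plays the role of the inner product B(xi)^T B(x) in form (5), which is why the infinite-dimensional mapping never has to be computed explicitly.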
We then draw visual graphs, as in Fg2, to see which of them may be more important.

Fg2. Feature Selection

From the charts shown above, we can see that pace is highly related to features such as attack/defense time but less sensitive to the pace standard deviation. Since an extra feature does not hurt our system much, we do not go deep into proving the reliability or correctness of the chosen features; we let the final prediction result tell which features are helpful and which are not.

Prediction Accuracy

We ran through many combinations of features to perform the final prediction, using the mean-square error (MSE) to reflect the accuracy of our prediction. MSE is a classic metric for evaluating the average error of a system.
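One simple way to quantify the "highly related" judgment made from the charts is a Pearson correlation between each candidate feature and the observed pace. The data below is illustrative, not from the project:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Illustrative: shorter average attack times should go with faster pace,
# so the correlation comes out strongly negative.
attack_time = [13.2, 13.0, 12.5, 12.1, 11.8]
pace        = [93.0, 94.5, 96.0, 98.5, 99.0]
print(round(pearson(attack_time, pace), 3))
```

Features whose correlation magnitude is near zero, such as the pace standard deviation in Fg2, are the ones that can be dropped with little loss.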

Fg3. MSE results for feature sets 1-3 with linear and Gaussian kernels (best MSE around 15.4)

As Fg3 shows, more complex combinations of features reduce the MSE to a certain degree. Pace values are around 100, so an MSE of 16 corresponds to a root-mean-square error of 4 possessions, an average error of about 4%.

VI. Future Work

Evaluating the pace value of an NBA game is not an easy task in general. Even with the play-by-play data we have collected from the website, reaching truly meaningful analytical data that describes a game is still challenging. Play-by-play data is an abstract description of a game at a given time frame; we cannot know the exact context and motivation of each play. Another limitation relates to team strategy. In our current model we treat each game equally, but in reality, to get a better seed and better shape for the post-season, teams may follow different game plans through a season. For example, some teams may slow down their pace after securing a playoff position. Players, coaches, and training staff may also differ from one year to the next, so historical data may not reflect the current team's pace value. These more dynamic factors are not taken into account in this project; in the future, new parameters could be introduced to characterize such effects. From a technical point of view, more data might improve prediction accuracy. However, as the tutorial by Alex J. Smola [10] notes, SVR is a very expensive operation that may take quadratic time to solve its optimization problem. Once the data size reaches that bottleneck, the training phase in our design becomes a time-consuming process. One way to solve this is to add another map-reduce layer that compresses the raw data further; we could also migrate the training phase itself into another map-reduce task that solves the optimization problem.

VII. Conclusion

We have accomplished the objective we proposed at the early stage. Each design phase allowed us to practice and develop skills for dealing with big data. Using map-reduce and HBase to compress and extract features from 2.1 GB of data down to a few MB was a key component of our project; it would be very inefficient to train and analyze a huge data set with a sophisticated algorithm like SVM. Furthermore, studying big data is really about revealing favorable, characteristic data within tons of data, not just amassing a huge data set full of noise. With that in mind, our key features describe the behavior of the pace value of each game, as our results show.

Acknowledgment

I would like to acknowledge the help and support of my friends Jiawei Dong and Cai Qi.

References

[1] Shoji Nishimura et al.: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services.
[2] NBA Basketball Reference.
[3] Historical NBA ASPM: DStats/2013/nba-stats/historical-nba-aspm-and-hall-rating-released/
[4] Matt Goldman and Justin M. Rao: Tick-Tock Shot Clock: Optimal Stopping in the NBA.
[5] Tim Xin: The Value of Pace in the NBA. Spring 2012.
[6] Fay Chang et al.: Bigtable: A Distributed Storage System for Structured Data.
[7] Craig Franke, Samuel Morin, Artem Chebotko, John Abraham, and Pearl Brazier: Distributed Semantic Web Data Management in HBase and MySQL Cluster.
[8] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. 2004.
[9] Hao Helen Zhang and Marc Genton: Compactly Supported Radial Basis Function Kernels.
[10] Alex J. Smola and Bernhard Schölkopf: A Tutorial on Support Vector Regression. September 30, 2003.


How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

More information

The Progression from 4v4 to 11v11

The Progression from 4v4 to 11v11 The Progression from 4v4 to 11v11 The 4v4 game is the smallest model of soccer that still includes all the qualities found in the bigger game. The shape of the team is a smaller version of what is found

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Predicting Flight Delays

Predicting Flight Delays Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

More information

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

ALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite

ALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite ALGEBRA Pupils should be taught to: Generate and describe sequences As outcomes, Year 7 pupils should, for example: Use, read and write, spelling correctly: sequence, term, nth term, consecutive, rule,

More information

Big Systems, Big Data

Big Systems, Big Data Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Bigtable is a proven design Underpins 100+ Google services:

Bigtable is a proven design Underpins 100+ Google services: Mastering Massive Data Volumes with Hypertable Doug Judd Talk Outline Overview Architecture Performance Evaluation Case Studies Hypertable Overview Massively Scalable Database Modeled after Google s Bigtable

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Drugs store sales forecast using Machine Learning

Drugs store sales forecast using Machine Learning Drugs store sales forecast using Machine Learning Hongyu Xiong (hxiong2), Xi Wu (wuxi), Jingying Yue (jingying) 1 Introduction Nowadays medical-related sales prediction is of great interest; with reliable

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

KEITH LEHNERT AND ERIC FRIEDRICH

KEITH LEHNERT AND ERIC FRIEDRICH MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They

More information

Creating, Solving, and Graphing Systems of Linear Equations and Linear Inequalities

Creating, Solving, and Graphing Systems of Linear Equations and Linear Inequalities Algebra 1, Quarter 2, Unit 2.1 Creating, Solving, and Graphing Systems of Linear Equations and Linear Inequalities Overview Number of instructional days: 15 (1 day = 45 60 minutes) Content to be learned

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Overview. Essential Questions. Precalculus, Quarter 4, Unit 4.5 Build Arithmetic and Geometric Sequences and Series

Overview. Essential Questions. Precalculus, Quarter 4, Unit 4.5 Build Arithmetic and Geometric Sequences and Series Sequences and Series Overview Number of instruction days: 4 6 (1 day = 53 minutes) Content to Be Learned Write arithmetic and geometric sequences both recursively and with an explicit formula, use them

More information

Forecasting in STATA: Tools and Tricks

Forecasting in STATA: Tools and Tricks Forecasting in STATA: Tools and Tricks Introduction This manual is intended to be a reference guide for time series forecasting in STATA. It will be updated periodically during the semester, and will be

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Clustering and mapper

Clustering and mapper June 17th, 2014 Overview Goal of talk Explain Mapper, which is the most widely used and most successful TDA technique. (At core of Ayasdi, TDA company founded by Gunnar Carlsson.) Basic idea: perform clustering

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Infrastructures for big data

Infrastructures for big data Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)

More information

How to Win at the Track

How to Win at the Track How to Win at the Track Cary Kempston cdjk@cs.stanford.edu Friday, December 14, 2007 1 Introduction Gambling on horse races is done according to a pari-mutuel betting system. All of the money is pooled,

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Wesley Pasfield Northwestern University William.pasifield2016@u.northwestern.edu

Wesley Pasfield Northwestern University William.pasifield2016@u.northwestern.edu March Madness Prediction Using Big Data Technology Wesley Pasfield Northwestern University William.pasifield2016@u.northwestern.edu Abstract- - This paper explores leveraging big data technologies to create

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

Maximizing Precision of Hit Predictions in Baseball

Maximizing Precision of Hit Predictions in Baseball Maximizing Precision of Hit Predictions in Baseball Jason Clavelli clavelli@stanford.edu Joel Gottsegen joeligy@stanford.edu December 13, 2013 Introduction In recent years, there has been increasing interest

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

Solving Systems of Linear Equations Using Matrices

Solving Systems of Linear Equations Using Matrices Solving Systems of Linear Equations Using Matrices What is a Matrix? A matrix is a compact grid or array of numbers. It can be created from a system of equations and used to solve the system of equations.

More information

Pre-Algebra Lecture 6

Pre-Algebra Lecture 6 Pre-Algebra Lecture 6 Today we will discuss Decimals and Percentages. Outline: 1. Decimals 2. Ordering Decimals 3. Rounding Decimals 4. Adding and subtracting Decimals 5. Multiplying and Dividing Decimals

More information

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

Fun Basketball Drills Collection for Kids

Fun Basketball Drills Collection for Kids Fun Basketball Drills Collection for Kids Most of the listed drills will improve the players fundamental skills in a fun way. They can be used for U10 until senior level players. When you are teaching

More information

A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

10 FREE BASKETBALL DRILLS

10 FREE BASKETBALL DRILLS BASKETBALL DRILLS AND PRACTICE PLANS 1 10 FREE BASKETBALL DRILLS by Coach Patrick Anderson BASKETBALL DRILLS AND PRACTICE PLANS 2 CONTENTS 1.1 Rapid Swing Pass... 3 1.2 Box Out React Drill... 3 1.3 Bump...

More information

Operation Count; Numerical Linear Algebra

Operation Count; Numerical Linear Algebra 10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point

More information

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013 Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Topic: Passing and Receiving for Possession

Topic: Passing and Receiving for Possession U12 Lesson Plans Topic: Passing and Receiving for Possession Objective: To improve the players ability to pass, receive, and possess the soccer ball when in the attack Dutch Square: Half of the players

More information

Drills to Improve Football Skills www.ulster.gaa.ie 1

Drills to Improve Football Skills www.ulster.gaa.ie 1 Drills to Improve Football Skills www.ulster.gaa.ie 1 Drills to Improve Football Skills Drills to Improve Football Skills has been designed with the intention that the coach should step back to take a

More information

Efficient Data Structures for Decision Diagrams

Efficient Data Structures for Decision Diagrams Artificial Intelligence Laboratory Efficient Data Structures for Decision Diagrams Master Thesis Nacereddine Ouaret Professor: Supervisors: Boi Faltings Thomas Léauté Radoslaw Szymanek Contents Introduction...

More information

The Need for Training in Big Data: Experiences and Case Studies

The Need for Training in Big Data: Experiences and Case Studies The Need for Training in Big Data: Experiences and Case Studies Guy Lebanon Amazon Background and Disclaimer All opinions are mine; other perspectives are legitimate. Based on my experience as a professor

More information

Independent samples t-test. Dr. Tom Pierce Radford University

Independent samples t-test. Dr. Tom Pierce Radford University Independent samples t-test Dr. Tom Pierce Radford University The logic behind drawing causal conclusions from experiments The sampling distribution of the difference between means The standard error of

More information

Systems of Equations Involving Circles and Lines

Systems of Equations Involving Circles and Lines Name: Systems of Equations Involving Circles and Lines Date: In this lesson, we will be solving two new types of Systems of Equations. Systems of Equations Involving a Circle and a Line Solving a system

More information