Predict NBA Game Pace ECE 539 final project Fa Wang
Abstract — For a long time, people believed that putting all the best players on the court could guarantee winning a game or even a championship title. More recently, historical data has shown that a winning team does not necessarily have the highest offensive output and the lowest points allowed on defense. To evaluate the quality of a team, an index called the pace value has been introduced. The goal of our project is to predict the pace value of an NBA game by analyzing a large amount of historical data from the Internet. To do this, we first extract key features from play-by-play data and game stats data using the map-reduce pattern. Because of the large volume of historical play-by-play data, we take advantage of HBase [1] to store and process our data. Unlike traditional standalone machine-learning designs, this paper focuses on applying the map-reduce pattern together with a machine-learning method to solve a big-data problem. By examining the mean-square error (MSE) of the final result, we can judge how far our predictions are from the real values, and thus evaluate the correctness and accuracy of our model.

I. INTRODUCTION

Predicting NBA game scores has been a heated topic for a long time. In recent years, people have turned to possession-based score models, which have proved to provide more accurate statistical results. Later, a new index was introduced to evaluate the performance of a team in a given game: the pace value [2]. According to Basketball-Reference, the pace factor is an estimate of the number of possessions a team uses per game. Understanding an opponent's pace value allows a coach to better prepare for a game, because the pace factor is closely related to the score of a game. The score formula is: Score = ASPM * Pace, where ASPM [3] denotes points per possession and Pace is the metric for possessions. An accurate prediction of the pace value therefore yields a precise final score.
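As a minimal illustration of the score formula above, the numbers below are hypothetical, not taken from the paper's data:

```python
# Illustrative only: the ASPM and pace values below are hypothetical.
def predicted_score(aspm: float, pace: float) -> float:
    """Score = ASPM * Pace, where ASPM is points per possession
    and Pace is the estimated number of possessions."""
    return aspm * pace

# A team scoring 1.08 points per possession at a pace of 96 possessions:
print(predicted_score(1.08, 96))  # -> 103.68
```

Because Score is simply the product of the two factors, any error in the predicted pace translates proportionally into an error in the predicted score.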
In this paper, we merge play-by-play (PBP) data and each game's stats data to predict the pace value of future games. The challenge of predicting a game's pace is obvious. Different styles of defensive and offensive strategy may directly influence the pace of a game, especially during clutch time, when a game comes down to a few possessions; different teams may use the clock differently to defeat their opponents. An older team may also be more likely to spend longer on each offensive possession than a younger team. Factors affecting the pace value may even depend on players' health and attitude. As a result, the pace factor is nonlinear and difficult to predict with a small set of data.

II. MOTIVATION

Precisely predicting the result of a future NBA game is a tough but appealing task for NBA fans and non-fans alike. Recent research shows that a precise game pace value is the key factor, so our team decided to tackle this sub-topic. With a precise pace evaluation of a team, we can apply it to all of its players, so that each player can be correctly tagged with a pace capability. A faster team may prefer faster players, so our assessment could serve as a toolbox for teams to select desired players. NBA game betting is another appealing target for this research.

III. RELATED WORK

There is much research on the topic of pace value. Paper [4] examines the optimality of the shooting decisions of National Basketball Association (NBA) players using a rich dataset of 1.4 million offensive possessions. The decision to shoot is a complex problem that involves weighing the continuation value of the possession against the outside option of a teammate shooting. To apply their abstract model to the data, the authors make assumptions about the distribution of potential shots. In line with dynamic efficiency, they find that the cut threshold declines monotonically with time remaining on the shot clock, at approximately the correct rate. Most line-ups show strong adherence to allocative efficiency, and departures from optimality are linked to line-up experience, player salary, and overall ability.
Paper [5] discusses how teams trade off the value of controlling the length of games. The author uses the 859 games played by the 30 NBA teams during a regular season and finds that, as a game nears its end, the leading team tries to decrease the pace while the trailing team tries to increase it. The author also concludes that the leading team scores fewer points in the last few minutes of the game compared to the 3rd quarter. Whether a player's decision would increase or decrease the pace, given the strategy of the opponent, requires further investigation.

Bigtable

The Bigtable [6] data structure is a key component of this project and of mainstream big-data analytics today. The paper "Bigtable: A Distributed Storage System for Structured Data" gives a thorough overview of the design and usage of the data structure, from index key variables, data compression, and implementation to the Google Earth example. We consider Bigtable a big help for our project because the data structure it defines is sparse and distributed across all clusters. The majority of the operations in our project are mapping and reducing, and a Bigtable indexed by row key, column key, and timestamp lets us retrieve data across multiple clusters efficiently.

HBase

Managing a huge amount of data is a nightmare for engineers: it demands fault tolerance from both the system and the program, and processing large data efficiently and safely is also a key part of the project. MySQL was the reliable ACID storage platform of the last decade, until the introduction of big-data analytics. The paper "Solving Big Data Challenges for Enterprise Application Performance Management" compares the workload performance of MySQL and HBase [7]. Their experiments show that MySQL has very good scan throughput on a single node, but it does not scale with the number of nodes as the data size increases. HBase, on the other hand, scales to a linear increase in throughput with the number of nodes. Another reason for using HBase is execution speed: in the traditional approach, data is copied to and processed on a local machine. The paper "MapReduce: Simplified Data Processing on Large Clusters" [8] discusses locality in a large-cluster environment: the MapReduce master takes location information into account and schedules each map task near a replica of its input data.
IV. DESIGN

Fg1. Program Diagram

The design of the program has three main phases, as shown in the diagram above. In the first phase, data source retrieval, a large set of HTML play-by-play files is parsed and stored into HBase, alongside the game status data source. In the second phase, feature generation, map-reduce is used to combine the two data sources in HBase and create 20 features for computation. In the third phase, data training, since the data size has been reduced in phase 2, we can use the support vector regression (SVR) algorithm to train on the data set in the traditional standalone fashion.

Data source retrieval

Two types of data source are used in this paper. The first is PBP (play-by-play) data, retrieved from the NBA website. The PBP data records ball possession, shot attempts, and possession turnovers at each moment of an NBA game. The PBP data used in this paper consists of 4000 games, where each game's PBP record is presented in HTML format on the NBA website. The total PBP data size is about 2.76 GB. Each PBP record has two tables that are useful to our project: one gives the abbreviations of the home team and the away team, and the other contains the detailed PBP data.
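The extraction of those two tables can be sketched as follows. The HTML below is a made-up stand-in, since the paper does not reproduce the NBA site's actual markup; the original preprocessor was written in Java, so this is only an illustrative Python sketch:

```python
# Toy sketch of the preprocessing step. The markup is hypothetical,
# not the NBA website's real HTML.
import re

html = """
<table id="teams"><tr><td>NYK</td><td>IND</td></tr></table>
<table id="pbp">
  <tr><td>11:15.0</td><td>Shumpert makes shot</td></tr>
  <tr><td>10:58.0</td><td>Stephenson makes shot</td></tr>
</table>
"""

# Table 1: the two three-letter team abbreviations (home, away).
teams = re.findall(r"<td>([A-Z]{3})</td>", html)
# Table 2: (clock time, event description) rows of the PBP record.
rows = re.findall(r"<tr><td>(\d+:\d+\.\d)</td><td>([^<]+)</td></tr>", html)

print(teams)    # -> ['NYK', 'IND']
print(rows[0])  # -> ('11:15.0', 'Shumpert makes shot')
```

In practice a real HTML parser would be more robust than regular expressions, but the idea is the same: discard markup and noise, keep only the team abbreviations and the timestamped event rows.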
To reduce the data size and improve accuracy, a Java preprocessor program translates the raw data and eliminates noise from the website. After preprocessing, 5000 PBP records are inserted into an HBase table as HTML-format strings for the feature generation phase. Game status data is the other data source; it contains more static information about each team's condition, such as the pace value of the current game and how long the home team has been resting before the current game. Parts of the game status data can be computed after the map-reduce process during the feature generation phase. The HBase table layout is simply:

  Row key: gameid | Column: <html>...</html>

Features generation

With 5000 HTML file entries stored in HBase, we can now implement the map-reduce function. The input to a mapper function is a pair of gameid and column value from the HBase table. In the mapper function, each HTML file is translated into home-team attack time and home-team defense time. Table 2 is a sample PBP HTML record; to compute the home-team attack time, we subtract the Time value of the second row from that of the first, and a similar process computes the defense time. After the mapper function, a pair of gameid and feature values (the set of attack times and defense times) is output to the reduce function.

Table 2. Sample PBP record

  Time     | New York                                             | Score | Indiana
  11:      |                                                      |       | L. Stephenson makes 2-pt shot from 1 ft
  11:15.0  | Shumpert makes pt shot from 23 ft (assist by C. Anthony) |   |
  10:      | Personal block foul by I. Shumpert                   |       |
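The mapper described above can be sketched as follows. The project's mapper was written against Hadoop/HBase in Java; this is a simplified Python stand-in, and the toy event format is hypothetical, not the actual PBP schema:

```python
# Simplified stand-in for the project's Java mapper. The event format
# (seconds remaining on the clock, team gaining possession) is hypothetical.

def mapper(gameid, events):
    """events: chronological list of (seconds_remaining, team) marking
    which team gained possession at that clock time. Emits
    (gameid, (attack_times, defense_times)) for the home team."""
    home = events[0][1]  # assume the home team has the first possession
    attack, defense = [], []
    for (t0, team), (t1, _next) in zip(events, events[1:]):
        duration = t0 - t1  # the game clock counts down
        (attack if team == home else defense).append(duration)
    return gameid, (attack, defense)

gid, (atk, dfn) = mapper("0021300001",
                         [(720, "NYK"), (699, "IND"), (683, "NYK"), (660, "IND")])
print(atk, dfn)  # -> [21, 23] [16]
```

Each possession's duration is the difference between consecutive clock times, credited as attack time when the home team holds the ball and as defense time otherwise.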
The reduce function is called once per gameid. In our case there is no real work in the reduce phase; the only task of the reduce function is to write the mapper output directly into a database, alongside the pace value of the current game. The mapper phase has already extracted features from the raw data and reduced the data size heavily. With a set of base features for each gameid entry, more features can be computed in a post-map-reduce phase: the average home attack time over the last 5 games can be computed by iterating through the last 5 records, the average pace value in a similar fashion, and so on. Half of the complete feature list is shown in Table 3; the second half is the same, except that it is for the away team.

Table 3. Feature list (home-team half)

  Feature                | Description
  gameid                 | ID identifying each game.
  HomeTeam               | Home team name.
  Avg1 Home defend time  | Average of the home team's defend time over all games before the current game.
  Avg2 Home defend time  | Average of the home team's defend time over all games this season.
  Avg1 Home offense time | Average of the home team's offense time over all games before the current game.
  Avg2 Home offense time | Average of the home team's offense time over all games this season.
  Avg1 Home Pace         | Average of the home team's pace value over all games before the current game.
  Avg2 Home Pace         | Average of the home team's pace value over all games this season.
  Std1 Var Home Pace     | Standard variance of the home team's pace value over all games before the current game.
  Std2 Var Home Pace     | Standard variance of the home team's pace value over all games this season.
  Avg3 Home Pace         | Average of the home team's pace value over the last 3 games before the current game.
  Avg5 Home Pace         | Average of the home team's pace value over the last 5 games before the current game.
  Avg7 Home Pace         | Average of the home team's pace value over the last 7 games before the current game.

The first step in calculating the features is reading the data from HBase and storing it in a container. Because features are calculated by game ID and team name, data must be looked up quickly whenever it is used, so a hash map is chosen as the container: lookup is very fast when the key is known. Three hash maps are needed. The first stores the imported data: the game ID is the key, and the value is an array of all the other data, such as average attack time, defend time, actual days since season start, rest days, and actual pace. Crucially, the team names are also elements of this array, because features are computed not only by game ID but also by team name: the first value of the array is the home team's name, and the first value of the second half of the array is the away team's name. The second hash map stores intermediate data: team names are the keys, and the values hold running quantities such as the sum of average attack time and the sum of average defend time up to the day in question, the number of games the team has played, and so on. The third hash map stores the final result: the game ID is the key, and an array of all the features is the value.

The second step is the calculation itself. The average attack time is the sum of attack times divided by the number of games played; the average defend time and average pace value are computed the same way. The variance is the average of the squared differences between each pace value and the average pace. For each team's rest days there are eight possible values, so eight indicator features are added per team: the feature corresponding to the team's rest-day value is set to 1 and the other seven are set to 0. All of this data is stored directly into the third hash map, with game IDs as keys and the feature arrays as values.
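The three-hash-map pass above can be sketched as follows. The field names and the rest-day buckets are illustrative assumptions, not the project's exact schema:

```python
# Sketch of the feature pass: map 1 is the imported game list, map 2 the
# per-team running sums, map 3 the final per-game feature arrays.
from collections import defaultdict

def one_hot_rest(rest_days, buckets=8):
    """Eight indicator features: 1 in the team's rest-day bucket, else 0."""
    vec = [0] * buckets
    vec[min(rest_days, buckets - 1)] = 1
    return vec

def build_features(games):
    """games: chronological list of dicts with keys
    gameid, home, home_attack, home_pace, home_rest (simplified)."""
    running = defaultdict(lambda: {"atk": 0.0, "pace": 0.0, "n": 0})  # map 2
    features = {}                                                     # map 3
    for g in games:
        s = running[g["home"]]
        if s["n"]:  # averages use only games strictly before the current one
            avg_atk, avg_pace = s["atk"] / s["n"], s["pace"] / s["n"]
        else:
            avg_atk = avg_pace = 0.0
        features[g["gameid"]] = [avg_atk, avg_pace] + one_hot_rest(g["home_rest"])
        s["atk"] += g["home_attack"]; s["pace"] += g["home_pace"]; s["n"] += 1
    return features

feats = build_features([
    {"gameid": "g1", "home": "NYK", "home_attack": 14.0, "home_pace": 96.0, "home_rest": 2},
    {"gameid": "g2", "home": "NYK", "home_attack": 16.0, "home_pace": 100.0, "home_rest": 1},
])
print(feats["g2"][:2])  # -> [14.0, 96.0]
```

Updating the running sums only after emitting a game's features guarantees that each feature row sees only games played before it, which is what makes the features usable for prediction.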
Data training

We decided to use support vector regression (SVR) as the main learning method for this project. As discussed in the previous section, the pace value of each game is difficult to predict because there are multiple non-linear factors, such as player health, team chemistry, and the coach's game plan, and these factors are unpredictable. Therefore SVR is the primary algorithm of the data training phase. To understand this model, let us start with the most elementary algorithm, linear regression, which minimizes the quadratic cost function (1). In linear regression we solve the problem by finding the optimal weight vector w. Equation (1) has a further problem, even with linearly distributed training data: over-fitting, which appears when the function performs perfectly on the training set but poorly on unseen data. To avoid this problem, a penalty term on w is introduced, giving equation (2). With the penalized form of the linear equation, we can develop a non-linear form by adding a basis function that maps the existing vectors into a higher dimension. The idea is simple: if we cannot predict the pace value accurately with the existing features, we simply add more features! A basis function maps finite-dimensional vectors to finite- or infinite-dimensional vectors. Before introducing the basis function, we rewrite equation (2) as (3); the basis function is then applied to convert each xi into Bi = B(xi).
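The numbered equations are not reproduced in this copy of the paper; in standard ridge-regression/kernel notation they presumably take the following form (a reconstruction under that assumption, not the authors' exact typesetting):

```latex
% Presumed standard forms for equations (1)-(5).
\begin{align}
J(w) &= \sum_{i=1}^{n} \left( y_i - w^{\top} x_i \right)^2 \tag{1} \\
J(w) &= \sum_{i=1}^{n} \left( y_i - w^{\top} x_i \right)^2 + \lambda \lVert w \rVert^2 \tag{2} \\
J(w) &= \sum_{i=1}^{n} \left( y_i - w^{\top} B(x_i) \right)^2 + \lambda \lVert w \rVert^2 \tag{3} \\
w &= \sum_{i=1}^{n} \alpha_i \, B(x_i) \tag{4} \\
f(x) &= \sum_{i=1}^{n} \alpha_i \, k(x_i, x), \qquad k(x_i, x) = B(x_i)^{\top} B(x) \tag{5}
\end{align}
```

Equation (5) is the key step: the basis vectors appear only through inner products, which a kernel function can evaluate directly.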
With some algebra tricks, the final form of w is derived (4); note that B is a vector with as many entries as there are features. Substituting (4) into the original equation gives form (5), the final form of the model. The inner product of basis functions can be evaluated by a kernel function; in fact, the kernel used in this project is the radial basis function (RBF), which maps data from a finite-dimensional space to an infinite-dimensional one. The optimal parameters of the algorithm are found through quadratic optimization; the complete derivation and optimization can be found in the tutorial by Alex J. Smola [10].

We slightly modify our feature file by prepending a label number, as required by the SVR input format. We then split the feature file into two parts: one covering 2009 to 2012, and one for the remaining year. The former is used as training data and the latter as testing data: the model is fit on the training data, and the testing data is used to evaluate the estimates. We use the mean-square error, i.e. the variance of the error between the estimated and real results, to evaluate our predictions. The best result we know of achieves an MSE of 14.2, so we set our goal at 16 or less; as a matter of fact, we follow the principle "the less, the better." We then have to choose the SVR core and mode that best model our data. Since our features do not fall into two or more discrete categories, we use a regression core instead of a classification core. For the mode there are two main choices, a linear fit and a Gaussian fit; we test both and compare the outcomes.

V. RESULTS

Since choosing representative features is the key to a good result, the first step of our experiment is to filter out a set of features that best describe the target game. We consulted industry experts to generate tens of candidate features that might contribute to our system.
We then drew visual graphs, as in Fg2, to see which of them might be more important.

Fg2. Feature Selection

From the charts, we can easily see that pace is highly correlated with features like attack/defense time but less sensitive to the pace standard deviation. Since an extra feature does not do much harm to our system, we do not go deep into proving the reliability or correctness of the chosen features; we simply let the final prediction result tell us which kinds of features are helpful and which are not.

Prediction Accuracy

We ran through many combinations of features to perform the final prediction, using the MSE (mean-square error) to reflect the accuracy of our predictions. MSE is a classic metric for evaluating the average error of a system.
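The training-and-evaluation loop described in the Data training section can be sketched with scikit-learn's SVR. This is an assumed library (the paper does not name its SVR tool), and the data below is synthetic, so the printed MSE values are not the paper's results:

```python
# Sketch of the training phase: chronological split, linear vs. Gaussian
# (RBF) kernels, MSE evaluation. Synthetic data; scikit-learn assumed.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # stand-in for the 20 real features
y = 96 + X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(scale=1.0, size=200)

# Earlier games train, later games test, mirroring the 2009-2012 / later split.
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

for kernel in ("linear", "rbf"):  # the paper's linear fit vs. Gaussian fit
    model = SVR(kernel=kernel, C=10.0).fit(X_train, y_train)
    mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(kernel, round(mse, 2))
```

Splitting chronologically rather than randomly matters here: a random split would let the model see "future" games while predicting past ones, inflating the apparent accuracy.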
Fg3. MSE result (linear kernel vs. Gaussian kernel, feature sets 1-3; lowest MSE shown: 15.4)

As seen in Fg3, a more complex combination of features reduces the MSE to a certain degree. Pace values are around 100, so an MSE of 16 corresponds to an average (root-mean-square) error of about 4, i.e. roughly 4%.

VI. FUTURE WORK

Evaluating the pace value of an NBA game is not easy in general. Even with the play-by-play data we have collected from the website, reaching truly meaningful analytical data that describes a game is still challenging. Play-by-play data is an abstract description of a game at certain time frames; we cannot know the exact context and motivation of each play. Another limitation relates to team strategy. In our current model we treat each game equally, but in reality, to secure a better position and shape for the post-season, some teams have different game plans throughout a season; for example, some teams slow down their pace after securing a playoff position. Players, coaches, and training staff may also differ from one year to another. As a result, historical data may not reflect the current team's pace value. These more dynamic factors are not taken into account in this project; in the future, new parameters could be introduced to characterize such effects. From a technical point of view, we may require more data to improve prediction accuracy. However, from the tutorial by Alex J. Smola [10], we realize that SVR is a very expensive operation that may take quadratic time to solve the optimization problem. When the data size reaches that bottleneck, the training phase of our design becomes time-consuming. One way to solve this problem is to add another map-reduce layer to compress the raw data further; we could also migrate the training phase itself into another map-reduce task that solves the Newton-step equations.

VII. CONCLUSION

We have successfully accomplished the objective proposed at the early stage. Each design phase truly allowed us to practice and develop skills for dealing with big data. Using MapReduce and HBase to compress and extract features, from 2.1 GB of data down to a few MB, was the key component of our project; it would be very inefficient to train and analyze a huge data set with a sophisticated algorithm like SVR. Furthermore, studying big data is really about revealing favorable, characteristic data from tons of data, not just amassing huge data full of noise. With that in mind, our key features describe the behavior of each game's pace value, as shown in our results.

Acknowledgment

I would like to acknowledge the help and support of my friends Jiawei Dong and Cai Qi.

References

[1] Shoji Nishimura et al.: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services.
[2] NBA Basketball-Reference.
[3] Historical NBA ASPM: DStats/2013/nba-stats/historical-nba-aspm-and-hall-rating-released/
[4] Matt Goldman and Justin M. Rao: Tick-Tock Shot Clock: Optimal Stopping in the NBA.
[5] Tim Xin: The Value of Pace in the NBA, Spring 2012.
[6] Fay Chang et al.: Bigtable: A Distributed Storage System for Structured Data.
[7] Craig Franke, Samuel Morin, Artem Chebotko, John Abraham, and Pearl Brazier: Distributed Semantic Web Data Management in HBase and MySQL Cluster.
[8] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, 2004.
[9] Hao Helen Zhang and Marc Genton: Compactly Supported Radial Basis Function Kernels.
[10] Alex J. Smola and Bernhard Schölkopf: A Tutorial on Support Vector Regression, September 30, 2003.
Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *
More informationPredict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite
ALGEBRA Pupils should be taught to: Generate and describe sequences As outcomes, Year 7 pupils should, for example: Use, read and write, spelling correctly: sequence, term, nth term, consecutive, rule,
More informationBig Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
More informationDistributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
More informationIntroduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
More informationBigtable is a proven design Underpins 100+ Google services:
Mastering Massive Data Volumes with Hypertable Doug Judd Talk Outline Overview Architecture Performance Evaluation Case Studies Hypertable Overview Massively Scalable Database Modeled after Google s Bigtable
More informationOverview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationDrugs store sales forecast using Machine Learning
Drugs store sales forecast using Machine Learning Hongyu Xiong (hxiong2), Xi Wu (wuxi), Jingying Yue (jingying) 1 Introduction Nowadays medical-related sales prediction is of great interest; with reliable
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationKEITH LEHNERT AND ERIC FRIEDRICH
MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They
More informationCreating, Solving, and Graphing Systems of Linear Equations and Linear Inequalities
Algebra 1, Quarter 2, Unit 2.1 Creating, Solving, and Graphing Systems of Linear Equations and Linear Inequalities Overview Number of instructional days: 15 (1 day = 45 60 minutes) Content to be learned
More informationLarge-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationOverview. Essential Questions. Precalculus, Quarter 4, Unit 4.5 Build Arithmetic and Geometric Sequences and Series
Sequences and Series Overview Number of instruction days: 4 6 (1 day = 53 minutes) Content to Be Learned Write arithmetic and geometric sequences both recursively and with an explicit formula, use them
More informationForecasting in STATA: Tools and Tricks
Forecasting in STATA: Tools and Tricks Introduction This manual is intended to be a reference guide for time series forecasting in STATA. It will be updated periodically during the semester, and will be
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationActive Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
More informationProgramming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
More informationLeast-Squares Intersection of Lines
Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a
More informationModule 3: Correlation and Covariance
Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis
More informationClustering and mapper
June 17th, 2014 Overview Goal of talk Explain Mapper, which is the most widely used and most successful TDA technique. (At core of Ayasdi, TDA company founded by Gunnar Carlsson.) Basic idea: perform clustering
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationInfrastructures for big data
Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)
More informationHow to Win at the Track
How to Win at the Track Cary Kempston cdjk@cs.stanford.edu Friday, December 14, 2007 1 Introduction Gambling on horse races is done according to a pari-mutuel betting system. All of the money is pooled,
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationWesley Pasfield Northwestern University William.pasifield2016@u.northwestern.edu
March Madness Prediction Using Big Data Technology Wesley Pasfield Northwestern University William.pasifield2016@u.northwestern.edu Abstract- - This paper explores leveraging big data technologies to create
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
More informationMaximizing Precision of Hit Predictions in Baseball
Maximizing Precision of Hit Predictions in Baseball Jason Clavelli clavelli@stanford.edu Joel Gottsegen joeligy@stanford.edu December 13, 2013 Introduction In recent years, there has been increasing interest
More informationAnalysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
More informationSolving Systems of Linear Equations Using Matrices
Solving Systems of Linear Equations Using Matrices What is a Matrix? A matrix is a compact grid or array of numbers. It can be created from a system of equations and used to solve the system of equations.
More informationPre-Algebra Lecture 6
Pre-Algebra Lecture 6 Today we will discuss Decimals and Percentages. Outline: 1. Decimals 2. Ordering Decimals 3. Rounding Decimals 4. Adding and subtracting Decimals 5. Multiplying and Dividing Decimals
More informationAPPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
More informationDATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.
DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,
More informationFun Basketball Drills Collection for Kids
Fun Basketball Drills Collection for Kids Most of the listed drills will improve the players fundamental skills in a fun way. They can be used for U10 until senior level players. When you are teaching
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationMATRIX ALGEBRA AND SYSTEMS OF EQUATIONS
MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a
More informationEnergy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More information10 FREE BASKETBALL DRILLS
BASKETBALL DRILLS AND PRACTICE PLANS 1 10 FREE BASKETBALL DRILLS by Coach Patrick Anderson BASKETBALL DRILLS AND PRACTICE PLANS 2 CONTENTS 1.1 Rapid Swing Pass... 3 1.2 Box Out React Drill... 3 1.3 Bump...
More informationOperation Count; Numerical Linear Algebra
10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point
More informationHadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013
Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationTopic: Passing and Receiving for Possession
U12 Lesson Plans Topic: Passing and Receiving for Possession Objective: To improve the players ability to pass, receive, and possess the soccer ball when in the attack Dutch Square: Half of the players
More informationDrills to Improve Football Skills www.ulster.gaa.ie 1
Drills to Improve Football Skills www.ulster.gaa.ie 1 Drills to Improve Football Skills Drills to Improve Football Skills has been designed with the intention that the coach should step back to take a
More informationEfficient Data Structures for Decision Diagrams
Artificial Intelligence Laboratory Efficient Data Structures for Decision Diagrams Master Thesis Nacereddine Ouaret Professor: Supervisors: Boi Faltings Thomas Léauté Radoslaw Szymanek Contents Introduction...
More informationThe Need for Training in Big Data: Experiences and Case Studies
The Need for Training in Big Data: Experiences and Case Studies Guy Lebanon Amazon Background and Disclaimer All opinions are mine; other perspectives are legitimate. Based on my experience as a professor
More informationIndependent samples t-test. Dr. Tom Pierce Radford University
Independent samples t-test Dr. Tom Pierce Radford University The logic behind drawing causal conclusions from experiments The sampling distribution of the difference between means The standard error of
More informationSystems of Equations Involving Circles and Lines
Name: Systems of Equations Involving Circles and Lines Date: In this lesson, we will be solving two new types of Systems of Equations. Systems of Equations Involving a Circle and a Line Solving a system
More information