Drugs store sales forecast using Machine Learning
|
|
|
- Allyson Phillips
- 9 years ago
- Views:
Transcription
1 Drugs store sales forecast using Machine Learning Hongyu Xiong (hxiong2), Xi Wu (wuxi), Jingying Yue (jingying) 1 Introduction Nowadays medical-related sales prediction is of great interest; with reliable sales prediction, medical companies could allocate their resources more wisely and make better profits. We join a Kaggle competition to predict the everyday drug sale for each store based on the store, promotion and competitor data. We aim to apply different machine learning techniques to tune the model and make predictions on drug sales using time series analysis. 2 Data The Kaggle website provides us training data of 1115 Rossmann stores daily sales dated back to 2013, with 1,017,209 entries in total. The training data includes features of promotion and competitors information. Since we are not able to access to the real sales amount for testing during Kaggle competition, we decide to use 70% of the contest given training data as the training set for our model, the rest 30% as test set for cross validation. For now, the cross validation will be our estimated test error. 3 Method and Results Since the problem involves time series data, we intend to use time series analysis model to deal with it in the first place. We establish auto-regression (AR) model and train it using the data we have to get the parameters. We test AR models with different order numbers, and calculate the test errors. In the time series analysis part, we predicted each store's sale based on just its past data. Now we want to see how stores are different according to their different features. First, we establish a time-independent model by averaging the daily sales per store and collapsing the time dimension; in this frame we use Random Forest to select features and then use Support Vector Regression to fit different features (such as assortment type and competitors) to each store's mean sales. Finally, we manage to generalize the time series model to predict several stores at a time, instead of one by one in the first section. 3.1 Time Series Analysis In order to get a big picture, we first plot several stores daily sales with respect to time evolution. From the plots, we recognize that it's a time series data and for a certain store, its daily sales evolve periodically with a period of a year. For example, according to Figure 2, we can see that there will be sales peaks at certain dates around January and May, but in general the sales keep at a constant level Auto-Regression (AR) Model
2 Since the prediction for the daily sale for each store from past data seems to be a time series problem, we manage to solve this problem by building a time series analysis model. The basic method we are using is auto-regression (AR) model. X p = ϕ X + ε t i t i t i= 1 We looked online and discovered a package called "Forecast" in R, which does autoregression. We made some revision so the program could do what we want. To use this model, we pre-process the data to make weekly sales as a time unit. The first reason we want to do that is because that "zero Sunday sales" makes it hard to use daily sales as a time unit, and monthly sale would kill all the detail patterns; the second reason is that we plot sales at different days in a week through time, and discover that they basically follow the same pattern (Figure 1). Since we condense the daily sales to weekly sales at the beginning, after we predicted the weekly sales we still have to regenerate sales on each day in that week. We assume a normal distribution N(µ, σ 2 ) with µ equal to the weekly sales and variance σ 2 to be trained. We compare AR models with order p = 1, 2, 5, 10, 20 to see which order number fit best. Figure 1 Sales of store1 on different days of the week Cross-validation For each order p, we train parameters φ, ε, µ, and σ with 70% of the data and cross validation with the remain 30% to get the test error. The cost function we are using is the sum of the square of the difference between predicted and real daily sales of store1. The table below is the comparison among test errors using different order number. Order p Test Error (10^10) As we can see order number p = 2 yields the lowest test errors. We plot the predicted sales of store1 (Figure 3). If we compare Fig 2 and Fig 3, we can see the model capture the general pattern of the store1's daily sales.
3 Specifically speaking, for sales on Tuesday, Wednesday and Thursday, our predictions are within 10% of the real daily sales. However, it gives conservative predictions for Mondays and Fridays, when sales are apparently higher compared to other days in a week. Figure 2 Sales of store1 at daily base Figure 3 Predicted daily sales of store1 3.2 Random Forest Since there are so many factors influencing the sales, such as store types, promotions, competitors, and even holidays, we are trying to identify the most important features that influence the sales. We use random forest to do that; in order to decrease the amount of data to test the model and the program, we average the daily sales for each store, so the amount of data decrease from a million to a thousand. In the raw data provided, five parameters have missing data points. We first combine CompetitionSinceMonth and CompetitionSinceYear to a new variable CompetitionExistMonth. Similarly, we create a new variable PromoExistMonth from four promotion variables. We also scale the variable CompetitionDistance by 100. Figure 4 Feature selection through random forest We built a model using the "random Forest" package in R. Our model was built on 500 trees and 8 variables in the "store.csv" and the mean of sales for each store as the response variable. We used na.omit for the missing values and generated the variable importance plot for the feature selection. The out-of-bag errors were traced for the model.
4 It seems that the out-of-bag errors are pretty small. But we will have to dig more into the data and model fitting to find out if it's overfitting. The variable importance plot is as following: As you can see from the plot, the feature Assortment is the most important variable because including it could make the biggest reduction in the MSE. Also notice that CompetitionExistence is the least important one among all variables in the store characteristics. 3.3 Support Vector Regression (SVR) We used Support Vector Regression (SVR) method to find relation between each store's mean sales and different kind of features. SVR is a variation of Support Vector Machine (SVM). The essence of SVM is to find a classification boundary with maximum margins to the feature points; by similar token, SVR is to find a regression curve with minimum margins to the feature points. We used the standard liblinear2.1 package in MATLAB to implement SVR. 1 2 min wb, ω 2 () i T () i y ω x b ε st.. for i = 1, 2,...,m T ( i) ( i) ω x + b y ε According to the data, there are four different store types (a, b, c, d) and three different assortment types (a, b, c), in total 8 combinations (aa, ac, ba, bb, ca, cc, da, dc). We ran SVR algorithm with linear kernel according to different combinations. Test Error (a, a) (a, c) (b, a) (b, b) (c, a) (c, c) (d, a) (d, c) Linear Linear + Parabola Sqrt + Parabola *10 ^ Extension for Time Series Model In this model we used a method similar to seasonalized times series regression, in which prediction of sales Y = T SI. We adjusted this equation to Y = T DI, in which Y means the daily sale prediction, T means the sale trend, and DI means daily sale index. To obtain T, we linearly fitted of all historical sales data of a single store with equation y = a + bt, in which a = - /-, and b =. / 0. Then to predict sale at time t 1, T t 1 = a + bt 1. DI was obtained from historical data, and the 365 days in a year each has a different DI. For example, if we wish to get DI for January 2nd, we got historical sales on January 2nd, say s 3, s 4, s 5, at time point t 3, t 4, t 5, respectively, and linearly fitted sales y 3, y 4, y 5, then DI = DI reflects the sales fluctuations at different periods within a year. We take considerations for zero sales cases, and if s 3 = 0, we just ignores it when calculating DI. Instead in the final prediction we assigned sales on Sundays to be zero, based on observations of historical data, and holiday dates with all historical sales data equal to zero, like January 1st, were also predicted to be zero. We used 76% historical data for training, and 24% for testing, and the cost functions for 10 stores were listed.
5 Test Error Test Error Store1 Store2 Store3 Store4 Store Store6 Store7 Store8 Store9 Store Figure 5 Comparison between raw and predicted sales of store 1 & 2. 4 Conclusion and Prospective We use AR model to predict the sales with small discrepancy to the test data, and we use RF and SVR to find relations between store mean sales and other features. There are certainly rooms for improvements. We can make further predictions on daily sales using SVR. Even though we found the relations between the features, and make fairly good predictions on average sales for each store. We think next we could try to use SVR see how the parameters set in AR change according to features. By doing so, we could automate the process of making predictions on daily sales for all the stores in the time series model. Reference: [1] Wensen Dai et al. A Clustering-based Sales Forecasting Scheme Using Support Vector Regression for Computer Server. Procedia Manufacturing 2 ( 2015 ) [2] Marco Hulsmann et al. General Sales Forecast Models for Automobile Markets and their Analysis. Transactions on Machine Learning and Data Mining Vol. 5, No. 2 (2012)
Machine Learning in Statistical Arbitrage
Machine Learning in Statistical Arbitrage Xing Fu, Avinash Patra December 11, 2009 Abstract We apply machine learning methods to obtain an index arbitrage strategy. In particular, we employ linear regression
AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
Sales Forecasting for Retail Chains
1 Sales Forecasting for Retail Chains Ankur Jain 1, Manghat Nitish Menon 2, Saurabh Chandra 3 A53097130 1, A53097652 2, A53104614 3 {anj022 1, mnmenon 2, sbipinch 3 }@eng.ucsd.edu Abstract This paper presents
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Server Load Prediction
Server Load Prediction Suthee Chaidaroon ([email protected]) Joon Yeong Kim ([email protected]) Jonghan Seo ([email protected]) Abstract Estimating server load average is one of the methods that
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
Predicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
Predicting daily incoming solar energy from weather data
Predicting daily incoming solar energy from weather data ROMAIN JUBAN, PATRICK QUACH Stanford University - CS229 Machine Learning December 12, 2013 Being able to accurately predict the solar power hitting
UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee
UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee 1. Introduction There are two main approaches for companies to promote their products / services: through mass
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
Classification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
Forecasting of Economic Quantities using Fuzzy Autoregressive Model and Fuzzy Neural Network
Forecasting of Economic Quantities using Fuzzy Autoregressive Model and Fuzzy Neural Network Dušan Marček 1 Abstract Most models for the time series of stock prices have centered on autoregressive (AR)
Making Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research
Forecasting in supply chains
1 Forecasting in supply chains Role of demand forecasting Effective transportation system or supply chain design is predicated on the availability of accurate inputs to the modeling process. One of the
Machine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
JetBlue Airways Stock Price Analysis and Prediction
JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue
Early defect identification of semiconductor processes using machine learning
STANFORD UNIVERISTY MACHINE LEARNING CS229 Early defect identification of semiconductor processes using machine learning Friday, December 16, 2011 Authors: Saul ROSA Anton VLADIMIROV Professor: Dr. Andrew
Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering
IEICE Transactions on Information and Systems, vol.e96-d, no.3, pp.742-745, 2013. 1 Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering Ildefons
D-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries
A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries Aida Mustapha *1, Farhana M. Fadzil #2 * Faculty of Computer Science and Information Technology, Universiti Tun Hussein
Beating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS
MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a
The normal approximation to the binomial
The normal approximation to the binomial The binomial probability function is not useful for calculating probabilities when the number of trials n is large, as it involves multiplying a potentially very
Forecasting methods applied to engineering management
Forecasting methods applied to engineering management Áron Szász-Gábor Abstract. This paper presents arguments for the usefulness of a simple forecasting application package for sustaining operational
CS 147: Computer Systems Performance Analysis
CS 147: Computer Systems Performance Analysis One-Factor Experiments CS 147: Computer Systems Performance Analysis One-Factor Experiments 1 / 42 Overview Introduction Overview Overview Introduction Finding
IST 557 Final Project
George Slota DataMaster 5000 IST 557 Final Project Abstract As part of a competition hosted by the website Kaggle, a statistical model was developed for prediction of United States Census 2010 mailing
Introduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.
Comparison of machine learning methods for intelligent tutoring systems
Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu
Credit Scoring Modelling for Retail Banking Sector.
Credit Scoring Modelling for Retail Banking Sector. Elena Bartolozzi, Matthew Cornford, Leticia García-Ergüín, Cristina Pascual Deocón, Oscar Iván Vasquez & Fransico Javier Plaza. II Modelling Week, Universidad
Java Modules for Time Series Analysis
Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series
Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016
Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency
BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
New Year Holiday Period Traffic Fatality Estimate, 2015-2016
New Year Holiday Period Traffic Fatality Estimate, 2015-2016 Prepared by Statistics Department National Safety Council December 2, 2015 Holiday period definition New Year's Day is January 1 and it is observed
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Demand Forecasting When a product is produced for a market, the demand occurs in the future. The production planning cannot be accomplished unless
Demand Forecasting When a product is produced for a market, the demand occurs in the future. The production planning cannot be accomplished unless the volume of the demand known. The success of the business
Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error
Supply Chain Forecasting Model Using Computational Intelligence Techniques
CMU.J.Nat.Sci Special Issue on Manufacturing Technology (2011) Vol.10(1) 19 Supply Chain Forecasting Model Using Computational Intelligence Techniques Wimalin S. Laosiritaworn Department of Industrial
Author Gender Identification of English Novels
Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in
Designing a learning system
Lecture Designing a learning system Milos Hauskrecht [email protected] 539 Sennott Square, x4-8845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing
Modeling of Sales Forecasting in Retail Using Soft Computing Techniques
Modeling of Sales Forecasting in Retail Using Soft Computing Techniques Luís Lobo da Costa, Susana M. Vieira and João M. C. Sousa Center of Intelligent Systems - IDMEC-LAETA Instituto Superior Técnico,
A Comparative Study of the Pickup Method and its Variations Using a Simulated Hotel Reservation Data
A Comparative Study of the Pickup Method and its Variations Using a Simulated Hotel Reservation Data Athanasius Zakhary, Neamat El Gayar Faculty of Computers and Information Cairo University, Giza, Egypt
The Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
How To Predict Web Site Visits
Web Site Visit Forecasting Using Data Mining Techniques Chandana Napagoda Abstract: Data mining is a technique which is used for identifying relationships between various large amounts of data in many
Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney
Combining GLM and datamining techniques for modelling accident compensation data Peter Mulquiney Introduction Accident compensation data exhibit features which complicate loss reserving and premium rate
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort [email protected] Motivation Location matters! Observed value at one location is
Predictive Data modeling for health care: Comparative performance study of different prediction models
Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath [email protected] National Institute of Industrial Engineering (NITIE) Vihar
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
Beating the MLB Moneyline
Beating the MLB Moneyline Leland Chen [email protected] Andrew He [email protected] 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series
Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang
Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental
1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96
1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years
Statistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
Introduction to Logistic Regression
OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction
Decompose Error Rate into components, some of which can be measured on unlabeled data
Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance
Lecture 13: Validation
Lecture 3: Validation g Motivation g The Holdout g Re-sampling techniques g Three-way data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model
Support Vector Machine (SVM)
Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
Sales and operations planning (SOP) Demand forecasting
ing, introduction Sales and operations planning (SOP) forecasting To balance supply with demand and synchronize all operational plans Capture demand data forecasting Balancing of supply, demand, and budgets.
Time series Forecasting using Holt-Winters Exponential Smoothing
Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract
Predict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
Support Vector Machines Explained
March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),
Statistical Learning for Short-Term Photovoltaic Power Predictions
Statistical Learning for Short-Term Photovoltaic Power Predictions Björn Wolff 1, Elke Lorenz 2, Oliver Kramer 1 1 Department of Computing Science 2 Institute of Physics, Energy and Semiconductor Research
Regression and Time Series Analysis of Petroleum Product Sales in Masters. Energy oil and Gas
Regression and Time Series Analysis of Petroleum Product Sales in Masters Energy oil and Gas 1 Ezeliora Chukwuemeka Daniel 1 Department of Industrial and Production Engineering, Nnamdi Azikiwe University
Sheet Resistance = R (L/W) = R N ------------------ L
Sheet Resistance Rewrite the resistance equation to separate (L / W), the length-to-width ratio... which is the number of squares N from R, the sheet resistance = (σ n t) - R L = -----------------------
Pentaho Data Mining Last Modified on January 22, 2007
Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org
Moving averages. Rob J Hyndman. November 8, 2009
Moving averages Rob J Hyndman November 8, 009 A moving average is a time series constructed by taking averages of several sequential values of another time series. It is a type of mathematical convolution.
Indian School of Business Forecasting Sales for Dairy Products
Indian School of Business Forecasting Sales for Dairy Products Contents EXECUTIVE SUMMARY... 3 Data Analysis... 3 Forecast Horizon:... 4 Forecasting Models:... 4 Fresh milk - AmulTaaza (500 ml)... 4 Dahi/
The Operational Value of Social Media Information. Social Media and Customer Interaction
The Operational Value of Social Media Information Dennis J. Zhang (Kellogg School of Management) Ruomeng Cui (Kelley School of Business) Santiago Gallino (Tuck School of Business) Antonio Moreno-Garcia
Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
Equity forecast: Predicting long term stock price movement using machine learning
Equity forecast: Predicting long term stock price movement using machine learning Nikola Milosevic School of Computer Science, University of Manchester, UK [email protected] Abstract Long
Principal components analysis
CS229 Lecture notes Andrew Ng Part XI Principal components analysis In our discussion of factor analysis, we gave a way to model data x R n as approximately lying in some k-dimension subspace, where k
A Study on the Comparison of Electricity Forecasting Models: Korea and China
Communications for Statistical Applications and Methods 2015, Vol. 22, No. 6, 675 683 DOI: http://dx.doi.org/10.5351/csam.2015.22.6.675 Print ISSN 2287-7843 / Online ISSN 2383-4757 A Study on the Comparison
2.2 Elimination of Trend and Seasonality
26 CHAPTER 2. TREND AND SEASONAL COMPONENTS 2.2 Elimination of Trend and Seasonality Here we assume that the TS model is additive and there exist both trend and seasonal components, that is X t = m t +
Standardization and Its Effects on K-Means Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03
Better credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION
MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION Matthew A. Lanham & Ralph D. Badinelli Virginia Polytechnic Institute and State University Department of Business
Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk [email protected] Tom Kelsey ID5059-19-B &
2) The three categories of forecasting models are time series, quantitative, and qualitative. 2)
Exam Name TRUE/FALSE. Write 'T' if the statement is true and 'F' if the statement is false. 1) Regression is always a superior forecasting method to exponential smoothing, so regression should be used
Analysis of Bayesian Dynamic Linear Models
Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main
8.1 Summary and conclusions 8.2 Implications
Conclusion and Implication V{tÑàxÜ CONCLUSION AND IMPLICATION 8 Contents 8.1 Summary and conclusions 8.2 Implications Having done the selection of macroeconomic variables, forecasting the series and construction
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
2. What is the general linear model to be used to model linear trend? (Write out the model) = + + + or
Simple and Multiple Regression Analysis Example: Explore the relationships among Month, Adv.$ and Sales $: 1. Prepare a scatter plot of these data. The scatter plots for Adv.$ versus Sales, and Month versus
Machine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
Hedge Effectiveness Testing
Hedge Effectiveness Testing Using Regression Analysis Ira G. Kawaller, Ph.D. Kawaller & Company, LLC Reva B. Steinberg BDO Seidman LLP When companies use derivative instruments to hedge economic exposures,
Section A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1. Page 1 of 11. EduPristine CMA - Part I
Index Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1 EduPristine CMA - Part I Page 1 of 11 Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Sales forecasting # 2
Sales forecasting # 2 Arthur Charpentier [email protected] 1 Agenda Qualitative and quantitative methods, a very general introduction Series decomposition Short versus long term forecasting
