# Predict the Popularity of YouTube Videos Using Early View Data

Size: px
Start display at page:

Transcription

2 The rest of the paper is organized as follows: Section 2 describes the video view counts and popularity distribution, and formulate the prediction problem. In Section 3, we introduced the existing algorithms and newly proposed models to solve the problem. Experiment results are presented in section 4, followed by conclusions in section 5. 2 Popularity Prediction Problem YouTube provides the statistical information along with the video on the web, and we focus on the video popularity associated with time. Figure1 shows an example of the view counts of Gangnam Style M/V 2, which is the most viewed YouTube video in the category music. The cumulative distribution of total view number is in Figure1-a. Around 1.5 trillion views have cumulated in the 273 days from July 15, 2012 (the video was uploaded) to April 13, 2013 (the data was sampled). The corresponding daily view count distribution is depicted in Figure1-b. Total number of views 1e Day Number of daily views 1e Day Figure 1: Total views and daily views of the video Gangnam Style M/V To predict the relations between video popularity and the time in a dataset, the description of the data is exhibited. Linearity is always the first option to be examined to describe the property of data and model choice in machine learning. Among a large number of videos, there is no strongly linearity in days and view counts, or between daily view counts and total view counts. An intuition in the popularity growth is that the view patterns are similar for the majority of videos. An example of the total views at day 30 and day 7 are shown in Figure 2. The dataset contains 5652 YouTube videos. After log transformation, an approximately linear relationship comes out between the total view numbers at two different days [3]. Similar patterns are observed in the view counts of other days. Next, the description of the problem is formulated after introducing the background. 2.1 Problem Formulation For a YouTube video v, denote the total view count at the date t by N(v, t). Given the daily view counts of days earlier than the reference date t r (inclusively), the goal is to predict the total view number of the video at the target date t t in the future. Let N(v, t t ) be the total view number at day t t, and ˆN(v, t r, t t ) be the predicted total view number. Therefore, ˆN(v, tr, t t ) is a function of the view counts in the first t r days: where t r 1, and t t > t r. ˆN(v, t r, t t ) = Φ(N(v, 1),, N(v, t r )) (1) For different services with regard to the popularity prediction, relative errors are likely to be more important than the absolute errors in the various contents [4]. The relative squared error (RSE) is 2 2

3 Total number of views in 30 days Total number of views in 7 days Figure 2: Total views at day 30 versus total views at day 7 used to evaluate the prediction of the total view number: RSE = ( ˆN(v, t r, t t ) N(v, t t ) 1) 2 (2) In a set of videos V, the mean of the relative squared error (MRSE) of all videos is the criterion of the performance: MRSE = 1 V ( ˆN(v, t r, t t ) 1) 2 (3) N(v, t t ) 3 Methods In these section, the four methods of video popularity prediction are presented. 3.1 Univariate Linear (UL) Model v V The popularity in a future date is strongly correlated with the total view count at the reference date via a log transformation, described by Szabo et al [3]. In the model, only N(v, t r ) is used instead of the total view count from the beginning of the sequence in 1. Then the relation in 1 is: ˆN(v, t r, t t ) = r tr,t r N(v, t r ) (4) Considering a uniformed model for all videos, r tr,t r is expected to be independent of the video, and only decided by t r and t t. Adopting linear regression in equation 3, the optimal coefficient is: r tr,t r = 3.2 Multivariate Linear (ML) Model v V ( N(v,tr) N(v,t t) ) v V ( N(v,tr) N(v,t t) )2 (5) Contrary to the UL model only using the data at the day t r, the multivariate linear model in [6] thinks that daily views are not equally important in contributing to the views at the day t t, and gives different priorities to the views in days earlier than t r. For simplicity, the total view number at the target day is a linear combination of the daily view counts multiplied by weights. Denote the view number in the ith day by x i, and x i (v) = N(v, i) N(v, i 1) (6) 3

4 The predicted popularity at the target date ˆN(v, t r, t t ) is expressed as: ˆN(v, t r, t t ) = Θ tr,t t X tr (v) (7) where Θ tr,t t = (θ 1,, θ tr ) and X tr (v) = (x 1 (v),, x tr (v)) T are the parameter vector and feature vector respectively. Then the MRSE in equation 3 becomes: MRSE = argmin 1 V ( Θ t r,t t X tr (v) 1) 2 (8) N(v, t t ) v V The approximate linearity of ˆN(v, t r, t t ) is shown in the UL model, and N(v, t t ) is a scalar. Let X (v) =, the criterion is: Xtr (v) N(v,t t) MRSE = argmin 1 V (Θ tr,t t X (v) 1) 2 (9) This linear least squares problem is solved via a single value decomposition of the matrix composed of X (v) [6]. 3.3 RBF Model v V Neither UL nor ML model tackles with the variance in the dataset. Only a single set of parameters is used for all videos, and it is weak to follow the different popularity growth patterns. Henrique et al [4] use ML model and radial basis functions (RBFs) together to depict the approximately linear functions, which is formally defined as: ˆN(v, t r, t t ) = Θ tr,t t X tr (v) + c C ω c RBF c (v) (10) where C is the set of examples chosen as centers for the kernel and ω c is the corresponding weight. The Gaussian kernel of the RBF is aimed at capture the variance in behaviors, and the RBF is: RBF c (v) = e X(v) X(c) 2 ( 2 δ 2 ) Different from the original, consider ML and RBF features together and then find optimums. To avoid large derivation from ML model results, the residuals of the prediction and real values are the inputs as actual object values. It is a degraded model used in the project compared with the original RBF algorithm. We compare the results with the ML, and only accept when the new error is better than ML model. Another difference is the way of finding RBF parameters. There are no best prior parameters. To solve the problem, they use grid search to find global optimal values. We choose centers of feature vector rather than directly using part of the dataset acting as centers. 3.4 Evolution Model In this method [5], the theory is based on an one-peak assumption in the daily view distribution. Details are not repeated here. We classify the fraction of peak views into different categories (memoryless, viral, quality and junk), and train models on each category. Theoretical depictions of the four categories are shown in Figure 3. The colors yellow, green, blue and red represent memoryless, viral, quality and junk videos, respectively. The values of views per day in the figure is lees important than the trends of curves. A burst is assumed to happen at the day 50. The obvious difference in patterns could possibly increase the performance of the ML model in the prediction. 4 Evaluation 4.1 Datasets YouTube API 3 contains the module YouTube Insight, a tool to retrieving statistics data from the video. The API provides video reports including view demographics and view report showing the 3 guide protocol (11) 4

5 Views per day Time Figure 3: Four patterns in the evolution model way that viewers interacted with the video. More information is available in the forms such as viewer locations, referrers, comments, favorites and ratings. However, it requires the authentication to access the statistical data of the objective videos or channels. Google Chart API 4 is a tool that allows creating charts in web pages. For the view count data, YouTube requests the Google Chart API to plot the graph of the statistics using at most 100 points. The API call to generate graphs were intercepted and the view counts data were collected via chart wrappers. Videos with points less than 100 are accurate in the daily view counts, while videos longer than 100 days need interpolation to infer the view count at each day. After the data collection and pre-processing, all videos are classified into two groups: Top videos and Random Videos [4]. Top Videos include top lists such as most viewed and dominant favorites in terms of region, category, period, etc. They have higher opportunities than other videos to appear in the recommendation list and thus possibly being viewed videos exist in this category. While random videos do not have any preference in the dataset generation, and a total videos belong to the random category. 4.2 Experimental Results Considering the variability of the data, we adopted cross validation which reduced the model s dependence on the data that were used in the training. The entire dataset was randomly divided into 10 folds of equal size. For each fold k {1, 2,, 10}, we trained the models on all folds but the kth, and tested on the kth fold. The same process was repeated for 10 times with a different fold as the validation data each time. After separated testing on different folds, the results of the ten tests were averaged as a whole. In this section, the default reference day is 7, and the target day is 30. Actually the meaning of predicting the 30th data using the previous 29 data points is much less than predicting using only the first 7 days. Figure 4 depicts the predicted mrse values using the UL and ML models in the top dataset. Model equations for top dataset UL model: ˆN(v, 7, 30) = N(v, 7) ML model: ˆN(v, 7, 30) = x1 (v) x 2 (v) x 3 (v) x 4 (v) x 5 (v) x 6 (v) x 7 (v) 4 5

6 mrse Model equations for random dataset Reference date Figure 4: UL and ML Model Prediction Errors UL model: ˆN(v, 7, 30) = N(v, 7) ML model: ˆN(v, 7, 30) = x1 (v) x 2 (v) x 3 (v) x 4 (v) x 5 (v) x 6 (v) x 7 (v) Generally, both models have the mean relative squared error decrease with the increasing day, namely more data is utilized. Compared with UL model, ML model has all daily view information in the days earlier than the reference day, providing the possible choices in the regression. At the left and right end, the two have similar results, but the reasons are not the same. At the first day, both UL and ML model predict based on only the first day data, thus generating the same result. When the data approaches the reference day, ML model attaches more importance to the recent data, which is the entire data the UL model could use. Through all reference dates, ML outperforms UL between the day 3 and day 11, with a 20% error deduction. The previous two models do not tackle with the variances in the dataset, since only a set of parameters are generated for the entire dataset. Then the radial basis function are introduced to reduce the variance. Compared with the original RBF features chosen in [6], our RBF model is degraded in the choices of centers. The radial basis function provided by scipy 5 is used. For convenience, log transformed view counts were used for calculation and fitting. The RBFs could fit the relative error, but also make a bad impact on the data points with no error by switching to the RBF value. Only 1 dimensional RBF is shown in Figure 5 for good visualbility. The last method is to classify the video into one of the four categories. The criterion is the fraction of numbers around the peak. The mean of the peak average for the memoryless, viral, quality and junk are 0.04, 0.04, 0.08, and 0.28 respectively. Corresponding intervals are [0.01, 0.21], [0.01, 0.15], [0.02, 0.32], and [0.03, 0.62]. We switch the predicted model based on the category. For a value that lies in several categories, we choose the its difference from the mean as the weight when conducting a weighted sum as the model. Future direction can be K-means classification. In general, the mean mrse values for the four methods are , , , and The combination of the ML and four categories classification, namely the 4th method has the best result

7 Relative error View count Figure 5: 1-d radial basis function on relative error Table 1: Four methods performance mrse 4.3 UL model ML model ML model and RBF ML model and four categories More Considerations Currently the dynamics are not clear. In most situations, more views lead to more comments, favorites, likes, dislikes. More likes cause more recommendations and views. It is hard to model without figure out relations. Then user behavior, internal mechanism of video recommendations are good areas to explore. Last but not least, no enough time to consider more options without enough data for the project. With more data in YouTube API, more specific questions like the study in terms of region is in [7]. 5 Conclusions In this article we have conducted four methods for predicting the long-term popularity of YouTube videos based on early measurements of view data. On a technical level, an approximate linear correlation exists between the logarithmically transformed video popularity at early and later times, with the residuals. The linearity is solved via linear regression while the residuals are dealt by different methods in the proposed models. ML has a big improvement over the UL model when the reference date is around 1/3 of the target date. Aimed at being more adaptive to the variance in the dataset, RBFs and classification are two methods. The better performance of the classification indicates that the dynamics of the social system plays an important role in the view number. In the future, we could possibly study why some videos are more popular than other videos, and their relations with referrals and recommendations. References [1] Statistics-YouTube [2] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics, [3] G. Szabo and B. Huberman. Predicting the popularity of online content. Communic. of ACM, 53(8),

8 [4] F. Figueiredo, F. Benevenuto, and J. Almeida. The tube over time: Characterizing popularity growth of youtube videos. In Proc. Conference of Web Search and Data Mining, [5] R. Crane and D. Sornette. Robust dynamic classes revealed by measuring the response function of a social system. Proc. National Academy of Sciences, 105(41), [6] H. Pinto, J. Almeida, and M. Gonalves. Using early view patterns to predict the popularity of youtube videos. In Proceedings of the sixth ACM international conference on Web search and data mining, [7] A. Brodersen, S. Scellato, and M. Wattenhofer. YouTube around the world: geographic popularity of videos. In Proceedings of the 21st international conference on World Wide Web,

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

### Music Mood Classification

Music Mood Classification CS 229 Project Report Jose Padial Ashish Goel Introduction The aim of the project was to develop a music mood classifier. There are many categories of mood into which songs may

### The Combination Forecasting Model of Auto Sales Based on Seasonal Index and RBF Neural Network

, pp.67-76 http://dx.doi.org/10.14257/ijdta.2016.9.1.06 The Combination Forecasting Model of Auto Sales Based on Seasonal Index and RBF Neural Network Lihua Yang and Baolin Li* School of Economics and

### Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

### Module 3: Correlation and Covariance

Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

### Beating the MLB Moneyline

Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

### The Operational Value of Social Media Information. Social Media and Customer Interaction

The Operational Value of Social Media Information Dennis J. Zhang (Kellogg School of Management) Ruomeng Cui (Kelley School of Business) Santiago Gallino (Tuck School of Business) Antonio Moreno-Garcia

### Chapter 1 Introduction. 1.1 Introduction

Chapter 1 Introduction 1.1 Introduction 1 1.2 What Is a Monte Carlo Study? 2 1.2.1 Simulating the Rolling of Two Dice 2 1.3 Why Is Monte Carlo Simulation Often Necessary? 4 1.4 What Are Some Typical Situations

### A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### Module 5: Multiple Regression Analysis

Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College

### Data, Measurements, Features

Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

### CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### A Study on the Comparison of Electricity Forecasting Models: Korea and China

Communications for Statistical Applications and Methods 2015, Vol. 22, No. 6, 675 683 DOI: http://dx.doi.org/10.5351/csam.2015.22.6.675 Print ISSN 2287-7843 / Online ISSN 2383-4757 A Study on the Comparison

### A Study of YouTube recommendation graph based on measurements and stochastic tools

A Study of YouTube recommendation graph based on measurements and stochastic tools Yonathan Portilla, Alexandre Reiffers, Eitan Altman, Rachid El-Azouzi ABSTRACT The Youtube recommendation is one the most

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### D-optimal plans in observational studies

D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

### How can we discover stocks that will

Algorithmic Trading Strategy Based On Massive Data Mining Haoming Li, Zhijun Yang and Tianlun Li Stanford University Abstract We believe that there is useful information hiding behind the noisy and massive

### 1 Teaching notes on GMM 1.

Bent E. Sørensen January 23, 2007 1 Teaching notes on GMM 1. Generalized Method of Moment (GMM) estimation is one of two developments in econometrics in the 80ies that revolutionized empirical work in

### Trend and Seasonal Components

Chapter 2 Trend and Seasonal Components If the plot of a TS reveals an increase of the seasonal and noise fluctuations with the level of the process then some transformation may be necessary before doing

### Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

### Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

### 1 Review of Least Squares Solutions to Overdetermined Systems

cs4: introduction to numerical analysis /9/0 Lecture 7: Rectangular Systems and Numerical Integration Instructor: Professor Amos Ron Scribes: Mark Cowlishaw, Nathanael Fillmore Review of Least Squares

### Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Multiple Regression: What Is It?

Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

### The Probit Link Function in Generalized Linear Models for Data Mining Applications

Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/\$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

### Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round \$200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

### Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Final Project Report

CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

### arxiv:1506.04135v1 [cs.ir] 12 Jun 2015

Reducing offline evaluation bias of collaborative filtering algorithms Arnaud de Myttenaere 1,2, Boris Golden 1, Bénédicte Le Grand 3 & Fabrice Rossi 2 arxiv:1506.04135v1 [cs.ir] 12 Jun 2015 1 - Viadeo

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Making Sense of Web Traffic Data and Its Implications for B2B Marketing Strategies

Making Sense of Web Data and Its Implications for B2B Marketing Strategies Analysis with Data Generated Website Abstract In the realm of B2B marketing, it is now realized buyers are switching catalogs

### Supervised and unsupervised learning - 1

Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

### CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

### ROOT: A data mining tool from CERN What can actuaries do with it?

ROOT: A data mining tool from CERN What can actuaries do with it? Ravi Kumar, Senior Manager, Deloitte Consulting LLP Lucas Lau, Senior Consultant, Deloitte Consulting LLP Southern California Casualty

### Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Chapter 4: Vector Autoregressive Models

Chapter 4: Vector Autoregressive Models 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie IV.1 Vector Autoregressive Models (VAR)...

### CLUSTER ANALYSIS FOR SEGMENTATION

CLUSTER ANALYSIS FOR SEGMENTATION Introduction We all understand that consumers are not all alike. This provides a challenge for the development and marketing of profitable products and services. Not every

### Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

### Introduction to Machine Learning Using Python. Vikram Kamath

Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression

### Chapter 1. Vector autoregressions. 1.1 VARs and the identi cation problem

Chapter Vector autoregressions We begin by taking a look at the data of macroeconomics. A way to summarize the dynamics of macroeconomic data is to make use of vector autoregressions. VAR models have become

### Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

### Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

### A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

### Statistical Learning for Short-Term Photovoltaic Power Predictions

Statistical Learning for Short-Term Photovoltaic Power Predictions Björn Wolff 1, Elke Lorenz 2, Oliver Kramer 1 1 Department of Computing Science 2 Institute of Physics, Energy and Semiconductor Research

### MSCA 31000 Introduction to Statistical Concepts

MSCA 31000 Introduction to Statistical Concepts This course provides general exposure to basic statistical concepts that are necessary for students to understand the content presented in more advanced

### THE SELECTION OF RETURNS FOR AUDIT BY THE IRS. John P. Hiniker, Internal Revenue Service

THE SELECTION OF RETURNS FOR AUDIT BY THE IRS John P. Hiniker, Internal Revenue Service BACKGROUND The Internal Revenue Service, hereafter referred to as the IRS, is responsible for administering the Internal

### Long-term Stock Market Forecasting using Gaussian Processes

Long-term Stock Market Forecasting using Gaussian Processes 1 2 3 4 Anonymous Author(s) Affiliation Address email 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

### Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

### Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

### Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

### Jitter Measurements in Serial Data Signals

Jitter Measurements in Serial Data Signals Michael Schnecker, Product Manager LeCroy Corporation Introduction The increasing speed of serial data transmission systems places greater importance on measuring

### JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions

Edith Cowan University Research Online ECU Publications Pre. JPEG compression of monochrome D-barcode images using DCT coefficient distributions Keng Teong Tan Hong Kong Baptist University Douglas Chai

### Introduction to Matrix Algebra

Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary

### Multivariate Analysis of Variance (MANOVA)

Chapter 415 Multivariate Analysis of Variance (MANOVA) Introduction Multivariate analysis of variance (MANOVA) is an extension of common analysis of variance (ANOVA). In ANOVA, differences among various

### This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),

### Solving Systems of Linear Equations Using Matrices

Solving Systems of Linear Equations Using Matrices What is a Matrix? A matrix is a compact grid or array of numbers. It can be created from a system of equations and used to solve the system of equations.

### Simple Regression Theory II 2010 Samuel L. Baker

SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

### New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

### A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia

### Predicting daily incoming solar energy from weather data

Predicting daily incoming solar energy from weather data ROMAIN JUBAN, PATRICK QUACH Stanford University - CS229 Machine Learning December 12, 2013 Being able to accurately predict the solar power hitting

### Linear Algebra Methods for Data Mining

Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 Lecture 3: QR, least squares, linear regression Linear Algebra Methods for Data Mining, Spring 2007, University

### Twitter Analytics: Architecture, Tools and Analysis

Twitter Analytics: Architecture, Tools and Analysis Rohan D.W Perera CERDEC Ft Monmouth, NJ 07703-5113 S. Anand, K. P. Subbalakshmi and R. Chandramouli Department of ECE, Stevens Institute Of Technology

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### Chapter 6: Multivariate Cointegration Analysis

Chapter 6: Multivariate Cointegration Analysis 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie VI. Multivariate Cointegration

### 1 Prior Probability and Posterior Probability

Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which

### Simple Linear Regression Inference

Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

### A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression

### Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

### Analytical Test Method Validation Report Template

Analytical Test Method Validation Report Template 1. Purpose The purpose of this Validation Summary Report is to summarize the finding of the validation of test method Determination of, following Validation

### Method To Solve Linear, Polynomial, or Absolute Value Inequalities:

Solving Inequalities An inequality is the result of replacing the = sign in an equation with ,, or. For example, 3x 2 < 7 is a linear inequality. We call it linear because if the < were replaced with

### Practical Guide to the Simplex Method of Linear Programming

Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear

### South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)

### The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy

BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.

### Introduction to time series analysis

Introduction to time series analysis Margherita Gerolimetto November 3, 2010 1 What is a time series? A time series is a collection of observations ordered following a parameter that for us is time. Examples

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### A Short Tour of the Predictive Modeling Process

Chapter 2 A Short Tour of the Predictive Modeling Process Before diving in to the formal components of model building, we present a simple example that illustrates the broad concepts of model building.

### INCORPORATION OF LIQUIDITY RISKS INTO EQUITY PORTFOLIO RISK ESTIMATES. Dan dibartolomeo September 2010

INCORPORATION OF LIQUIDITY RISKS INTO EQUITY PORTFOLIO RISK ESTIMATES Dan dibartolomeo September 2010 GOALS FOR THIS TALK Assert that liquidity of a stock is properly measured as the expected price change,

### Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want