Save this PDF as:

Size: px
Start display at page:

## Transcription

2 After parsing and processing the data, the features available for us to use are: day of the week (Monday, Tuesday, etc.), month, date, year, hour, minute, second, number of unique users, number of POST requests, number of GET requests, and content size of the interaction. The diagram below represents our prediction model: At the very bottom level, we predict the server load based on time. The problem of predicting the server load only based on time would be extremely difficult to solve. So, we decided to divide the problem into two parts, one to predict the server load, and another, to predict the user features based on time. For regressions and classifications tests in the following section, we will assume that the user features were correctly predicted. The output value that we tested was chosen to be the 5-minute average server load because the uptime command was run every 5 minutes on the server. Experimental Result Linear Regression We assume that the user features such as number of unique users and number of requests, etc. were correctly predicted by h1(t). This assumption had to be made because of amount of data which was not sufficient to predict the user features since the logs had been made only since this August. The features month and year were dropped from the feature lists to create a training matrix that is of full rank. To create a larger set of features, we concatenated powers of the matrix X to the original matrix. So, our training matrix, X = [Xorig Xorig 2... Xorig n ].

3 Since the full rank of matrix X could not be preserved from n = 5, we could only test the cases where n = 1,2,3,4. To come up with the optimal n, some tests were done to compute the average training and test errors. The following graph shows the errors. As we can see from the graph above, the overall error actually got worse as we increased the number of features used. However, because we thought that n = 3 gave a prediction curve of shape closer to the actual data, we picked n = 3. Below are the plots of parts of the actual and predicted server load for training and test data. The corresponding errors with respect to n are listed below: n average training error average test error As the two graphs display, the predictions are not as accurate as it ought to be. However, the trends in the server load are quite nicely fit. For the above regression, the training error was , and the test error was Since the shape of the curve is quite nicely fit, for our purpose of predicting the server load to increase or decrease the number of servers, we could amplify the prediction curve by a constant factor to reduce the latency experienced by the users, or do the opposite to reduce the number of operating servers. Multinomial Logistic Regression In order to transform our problem into a logistic regression problem, we classified the values of the server load (CPU usage) into ten values (0,1,, 9) as follows: if the CPU usage was less than or equal to 0.10, it would be 0; else if less than or equal to 0.20, it would be 1; ; else if less than or

4 equal to 0.90, it would be 8; else it would be 9. We trained 10 separate hypotheses hi for each classification. For testing, the system would choose the classification that gives us the highest confidence level. We use the same approach as linear regression case to find the maximum power of the matrix we want to use when coming up with the matrix X, where X = [Xorig Xorig 2... Xorig n ]. The following plots for training and test errors were acquired: n average training error average test error As in the linear regression case, we chose n = 3 for logistic regression. The following is the resulting plot of our prediction compared to the actual data. As we can see, the multinomial logistic regression always outputs 0 as the classification no matter what the features are. This is in part because too many of our data points has Y-values of 0 as the below histogram shows. Because the company we got the data from has just launched, the server load is usually very low. As the website becomes more popular, we would be able to obtain a better

5 result since the Y-values will be more evenly distributed. Locally weight linear regression Like linear regression, we first assumed all features used as a input, such as number of users, total number of requests and so on, were correctly predicted by for the same reason. For building up a hypothesis, we had to determine what weight function we would use. In common sense, it is reasonable that server loads after five or ten minutes would be dependent on previous several hours of behavior so we decided to use as a weight function in our hypothesis. However, for using the weight function we were needed to find out the optimal value of which is the bandwidth of the bell-shaped weight function. So we used Hold-out Cross Validation for picking up the optimal and the way and results are as follows

6 First of all, we had picked 70% of the original data randomly and named it as training data and did again 30% and named testing data. Then we trained the hypothesis with various values of on the training data and applied testing data on the hypotheses and got test errors in each value of. Above chart shows test errors of sever loads tested on hypothesis h(t,u) with various. Above plot displays the pattern of the test errors with respect to. As we can see in the chart and graph, at =0.03, the test error is the smallest value on all of three cases so we picked up this one for our weight function. Above plots show the actual server load and prediction of test and train data, respectively. As making a prediction on every unit of query point x is prohibitively time consuming process, the prediction is made on each query point in train and test data. Though it seems that it followed the pattern of actual load, sometimes it could not follow spikes and it needed very long time to predict even on query points of test so it might not be suitable to be used in a real time system which would be needed in a general sever load prediction system. Thus, it is desirable that locally weighted regression will be trained just on previous several hours not on whole previous history if sufficient training data were given. Support Vector Regression We use SVM Light 3 to perform a regression. We remove all zero features ( which indicating that a server is idle ) and select training samples randomly. The trade-off between training error and margin is 0.1 and epsilon width of the tube is 1.0. We use a linear kernel and found that there are 148 support

7 vectors and the norm of the weight vectors are The training error is and the testing error is Above plots are parts of the results from the Support vector regression. The red curve is an actual data and the blue curve is the prediction. Although the support vector regression is quickly converged for 6,000 training data set, its prediction has higher training and testing errors than the rest of learning algorithms. We tried to adjust an epsilon width and found that the higher epsilon width will only offset an estimation by a fixed constant. Conclusion The prediction from the locally weight linear regression yields the smallest testing error but it takes an hour to process 6,000 training samples. The linear regression also yields an acceptable prediction when we concatenated the same feature vectors up to a degree of 3. Both Multinomial logistic regression and support vector regression do not perform well with sparse training data. Support vector regression, however, takes only a few minutes to converge, but yields the largest training error. Issues Bias with sparse data Our learning algorithms cannot estimate an uptime reliably because there are almost 90% of the training samples have an uptime value between 0.0 and 0.1. We believe that the data we have collected are not diverse due to the low number of activities of our testing web server. Not enough features In our experiment, a dimension of the feature vectors are of them are part of a timestamp. We think that we do not have enough data to capture the current physical state of the cloud servers such as the memory usage, bandwidth, and network latency. weak prediction model We think that our decision to use the number of transactions and unique users as part of the feature vectors does not accurately model the computing requirements of the cloud servers. An access log file and uptime file are generated separately by different servers. Our assumption that a user activity is the only main cause of high server load average may not be accurate. It is possible that a single transaction could cause a high demand for a computing requirement. Future work We have found a research paper that describes the use of website access logs to predict a user's activities by clustering users based on their recent web activities, we like this idea and think that it is possible to categorize website users based on their computing demands. When these type of users are using the website, we think there is a high chance of increasing the computing requirements. References

8 [1] Access Log specification. [2] Uptime specification. [3] SVMLight. Vers K. Aug Thorsten Joachims. Nov

### WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

### Beating the NCAA Football Point Spread

Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over

### Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

### Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

### STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

### Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

### The Scientific Data Mining Process

Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Data mining knowledge representation

Data mining knowledge representation 1 What Defines a Data Mining Task? Task relevant data: where and how to retrieve the data to be used for mining Background knowledge: Concept hierarchies Interestingness

### The Data Mining Process

Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

### MTH 140 Statistics Videos

MTH 140 Statistics Videos Chapter 1 Picturing Distributions with Graphs Individuals and Variables Categorical Variables: Pie Charts and Bar Graphs Categorical Variables: Pie Charts and Bar Graphs Quantitative

### BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

### Bootstrapping Big Data

Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Drugs store sales forecast using Machine Learning

Drugs store sales forecast using Machine Learning Hongyu Xiong (hxiong2), Xi Wu (wuxi), Jingying Yue (jingying) 1 Introduction Nowadays medical-related sales prediction is of great interest; with reliable

### Curve Fitting Best Practice

Enabling Science Curve Fitting Best Practice Part 3: Fitting data Regression and residuals are an important function and feature of curve fitting and should be understood by anyone doing this type of analysis.

### A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

### CAS CS 565, Data Mining

CAS CS 565, Data Mining Course logistics Course webpage: http://www.cs.bu.edu/~evimaria/cs565-10.html Schedule: Mon Wed, 4-5:30 Instructor: Evimaria Terzi, evimaria@cs.bu.edu Office hours: Mon 2:30-4pm,

### Analysis Tools and Libraries for BigData

+ Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I

### Predicting Flight Delays

Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

### Development of Monitoring and Analysis Tools for the Huawei Cloud Storage

Development of Monitoring and Analysis Tools for the Huawei Cloud Storage September 2014 Author: Veronia Bahaa Supervisors: Maria Arsuaga-Rios Seppo S. Heikkila CERN openlab Summer Student Report 2014

### Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

### Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

### Diagrams and Graphs of Statistical Data

Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in

### Our Raison d'être. Identify major choice decision points. Leverage Analytical Tools and Techniques to solve problems hindering these decision points

Analytic 360 Our Raison d'être Identify major choice decision points Leverage Analytical Tools and Techniques to solve problems hindering these decision points Empowerment through Intelligence Our Suite

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

### Statistical Models in Data Mining

Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features Charlie Berger, MS Eng, MBA Sr. Director Product Management, Data Mining and Advanced Analytics charlie.berger@oracle.com www.twitter.com/charliedatamine

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Data Mining. SPSS Clementine 12.0. 1. Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

Data Mining SPSS 12.0 1. Overview Spring 2010 Instructor: Dr. Masoud Yaghini Introduction Types of Models Interface Projects References Outline Introduction Introduction Three of the common data mining

### Capacity Planning. Capacity Planning Process

Process, page 1 Getting Started, page 2 Collected Data Categorization, page 3 Capacity Utilization, page 6 Process Figure 1: Process You should make changes to an existing Unified ICM/Unified CCE deployment

### Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

### Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### Lecture - 32 Regression Modelling Using SPSS

Applied Multivariate Statistical Modelling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Lecture - 32 Regression Modelling Using SPSS (Refer

### Pearson's Correlation Tests

Chapter 800 Pearson's Correlation Tests Introduction The correlation coefficient, ρ (rho), is a popular statistic for describing the strength of the relationship between two variables. The correlation

### Bayes and Naïve Bayes. cs534-machine Learning

Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

### Optimizing Search Engines using Clickthrough Data

Optimizing Search Engines using Clickthrough Data Presented by - Kajal Miyan Seminar Series, 891 Michigan state University *Slides adopted from presentations of Thorsten Joachims (author) and Shui-Lung

### Smart Queue Scheduling for QoS Spring 2001 Final Report

ENSC 833-3: NETWORK PROTOCOLS AND PERFORMANCE CMPT 885-3: SPECIAL TOPICS: HIGH-PERFORMANCE NETWORKS Smart Queue Scheduling for QoS Spring 2001 Final Report By Haijing Fang(hfanga@sfu.ca) & Liu Tang(llt@sfu.ca)

### Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

### DNS (Domain Name System) is the system & protocol that translates domain names to IP addresses.

Lab Exercise DNS Objective DNS (Domain Name System) is the system & protocol that translates domain names to IP addresses. Step 1: Analyse the supplied DNS Trace Here we examine the supplied trace of a

### Machine Learning. Topic: Evaluating Hypotheses

Machine Learning Topic: Evaluating Hypotheses Bryan Pardo, Machine Learning: EECS 349 Fall 2011 How do you tell something is better? Assume we have an error measure. How do we tell if it measures something

### ANALYTICS IN BIG DATA ERA

ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY, DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNT OF DATA MAURIZIO SALUSTI SAS Copyr i g ht 2012, SAS Ins titut

### Using News Articles to Predict Stock Price Movements

Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 gyozo@cs.ucsd.edu 21, June 15,

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### Predict the Popularity of YouTube Videos Using Early View Data

000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

### Semester 1 Statistics Short courses

Semester 1 Statistics Short courses Course: STAA0001 Basic Statistics Blackboard Site: STAA0001 Dates: Sat. March 12 th and Sat. April 30 th (9 am 5 pm) Assumed Knowledge: None Course Description Statistical

### Spreadsheet software for linear regression analysis

Spreadsheet software for linear regression analysis Robert Nau Fuqua School of Business, Duke University Copies of these slides together with individual Excel files that demonstrate each program are available

### Exercise 1: How to Record and Present Your Data Graphically Using Excel Dr. Chris Paradise, edited by Steven J. Price

Biology 1 Exercise 1: How to Record and Present Your Data Graphically Using Excel Dr. Chris Paradise, edited by Steven J. Price Introduction In this world of high technology and information overload scientists

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### Figure 1. IBM SPSS Statistics Base & Associated Optional Modules

IBM SPSS Statistics: A Guide to Functionality IBM SPSS Statistics is a renowned statistical analysis software package that encompasses a broad range of easy-to-use, sophisticated analytical procedures.

### How to Win at the Track

How to Win at the Track Cary Kempston cdjk@cs.stanford.edu Friday, December 14, 2007 1 Introduction Gambling on horse races is done according to a pari-mutuel betting system. All of the money is pooled,

### JetBlue Airways Stock Price Analysis and Prediction

JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue

### Big Data para comprender la Dinámica Humana

Big Data para comprender la Dinámica Humana Carlos Sarraute Grandata Labs Hablemos de Big Data 26 noviembre 2014 @ 2014 All rights reserved. Brief presentation Grandata Founded in 2012. Leverages advanced

### A semi-supervised Spam mail detector

A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach

### Web Site Visit Forecasting Using Data Mining Techniques

Web Site Visit Forecasting Using Data Mining Techniques Chandana Napagoda Abstract: Data mining is a technique which is used for identifying relationships between various large amounts of data in many

### THE PROCESS CAPABILITY ANALYSIS - A TOOL FOR PROCESS PERFORMANCE MEASURES AND METRICS - A CASE STUDY

International Journal for Quality Research 8(3) 399-416 ISSN 1800-6450 Yerriswamy Wooluru 1 Swamy D.R. P. Nagesh THE PROCESS CAPABILITY ANALYSIS - A TOOL FOR PROCESS PERFORMANCE MEASURES AND METRICS -

### A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication

2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:

### The Melvyl Recommender Project: Final Report

Performance Testing Testing Goals Current instances of XTF used in production serve thousands to hundreds of thousands of documents. One of our investigation goals was to experiment with a much larger

### South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)

### Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

### CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations

### Consumption of OData Services of Open Items Analytics Dashboard using SAP Predictive Analysis

Consumption of OData Services of Open Items Analytics Dashboard using SAP Predictive Analysis (Version 1.17) For validation Document version 0.1 7/7/2014 Contents What is SAP Predictive Analytics?... 3

### CHAPTER 6 NEURAL NETWORK BASED SURFACE ROUGHNESS ESTIMATION

CHAPTER 6 NEURAL NETWORK BASED SURFACE ROUGHNESS ESTIMATION 6.1. KNOWLEDGE REPRESENTATION The function of any representation scheme is to capture the essential features of a problem domain and make that

### MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

### USE OF ARIMA TIME SERIES AND REGRESSORS TO FORECAST THE SALE OF ELECTRICITY

Paper PO10 USE OF ARIMA TIME SERIES AND REGRESSORS TO FORECAST THE SALE OF ELECTRICITY Beatrice Ugiliweneza, University of Louisville, Louisville, KY ABSTRACT Objectives: To forecast the sales made by

### CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on

CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace

### The scatterplot indicates a positive linear relationship between waist size and body fat percentage:

STAT E-150 Statistical Methods Multiple Regression Three percent of a man's body is essential fat, which is necessary for a healthy body. However, too much body fat can be dangerous. For men between the

### Oracle Data Miner (Extension of SQL Developer 4.0)

An Oracle White Paper October 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Generate a PL/SQL script for workflow deployment Denny Wong Oracle Data Mining Technologies 10 Van de Graff Drive Burlington,

### Designing a learning system

Lecture Designing a learning system Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x4-8845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing

### Electric Automobiles

Production Possibility Curves (Transformation Curves) Assumption: the curve or PPF (Production Possibility Frontier) represents full utilization of your productive resources. This means that all labor,

### Lasso on Categorical Data

Lasso on Categorical Data Yunjin Choi, Rina Park, Michael Seo December 14, 2012 1 Introduction In social science studies, the variables of interest are often categorical, such as race, gender, and nationality.

### The PageRank Citation Ranking: Bring Order to the Web

The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized

### Computer Systems Performance Analysis: An Introduction

Computer Systems Performance Analysis: An Introduction Dr. John Mellor-Crummey Department of Computer Science Rice University johnmc@cs.rice.edu COMP 528 Lecture 1 13 January 2005 Course Objectives Learn

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Toad for Oracle 8.6 SQL Tuning

Quick User Guide for Toad for Oracle 8.6 SQL Tuning SQL Tuning Version 6.1.1 SQL Tuning definitively solves SQL bottlenecks through a unique methodology that scans code, without executing programs, to

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### Empirical Examination of a Collaborative Web Application

Empirical Examination of a Collaborative Web Application Christopher Stewart (stewart@cs.rochester.edu), Matthew Leventi (mlventi@gmail.com), and Kai Shen (kshen@cs.rochester.edu) University of Rochester

### Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

### Predictive Data modeling for health care: Comparative performance study of different prediction models

Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath hiremat.nitie@gmail.com National Institute of Industrial Engineering (NITIE) Vihar

### Minitab Guide. This packet contains: A Friendly Guide to Minitab. Minitab Step-By-Step

Minitab Guide This packet contains: A Friendly Guide to Minitab An introduction to Minitab; including basic Minitab functions, how to create sets of data, and how to create and edit graphs of different

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### Automatic Web Page Classification

Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

### Machine Learning model evaluation. Luigi Cerulo Department of Science and Technology University of Sannio

Machine Learning model evaluation Luigi Cerulo Department of Science and Technology University of Sannio Accuracy To measure classification performance the most intuitive measure of accuracy divides the

### Music Mood Classification

Music Mood Classification CS 229 Project Report Jose Padial Ashish Goel Introduction The aim of the project was to develop a music mood classifier. There are many categories of mood into which songs may

### Data Mining Lab 5: Introduction to Neural Networks

Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese

### Applications. Network Application Performance Analysis. Laboratory. Objective. Overview

Laboratory 12 Applications Network Application Performance Analysis Objective The objective of this lab is to analyze the performance of an Internet application protocol and its relation to the underlying

### Statgraphics Centurion XVII (Version 17)

Statgraphics Centurion XVII (currently in beta test) is a major upgrade to Statpoint's flagship data analysis and visualization product. It contains 32 new statistical procedures and significant upgrades

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Inferential Statistics

Inferential Statistics Sampling and the normal distribution Z-scores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are