Production Floor Optimizations Using Dynamic Modeling and Data Mining


The Interdisciplinary Center, Herzliya
Efi Arazi School of Computer Science

Production Floor Optimizations Using Dynamic Modeling and Data Mining

M.Sc. dissertation
Submitted by
Under the supervision of Dr. Tami Tamir & Prof. Yaakov Zehavi
December, 2007

The Interdisciplinary Center, Herzliya
Efi Arazi School of Computer Science

Production Floor Optimizations Using Dynamic Modeling and Data Mining

Submitted as the final thesis of a research project for the M.Sc. degree by Aviv Bronstein (אביב ברונשטיין)
The work was carried out under the supervision of Dr. Tami Tamir and Prof. Yaakov Zehavi
December, 2007

Abstract

Data mining is defined as the process of discovering new correlations, hidden knowledge, unexpected patterns, and new rules from large databases [1]. It comprises a collection of methodologies, standards and algorithms that are used in several research areas such as machine learning, marketing, statistics, artificial intelligence and visualization. This paper describes how data mining and statistical analysis can be applied to a real problem of production floor optimization in factories. To accomplish the task, an automated module was developed that takes advantage of massive amounts of raw data and, through periodic analyses, produces a valuable set of conclusions that can improve the efficiency of the production floor and/or increase its profits.

Although data mining seems to be a promising solution for knowledge discovery and decision making, it is not a panacea. Some data mining methods work well in some domains but fail in others [2]; therefore, correct modeling of the problem domain and the choice of an adequate data mining method directly affect the quality and added value of the results. The project reviews both the theoretical background needed to support our thesis and the challenges introduced when trying to apply the data mining paradigm to a real application with real constraints, limitations and tradeoffs.

Abstract (translated from the Hebrew)

Data mining is defined as the discovery of correlations, hidden knowledge, unexpected patterns and new rules from large databases [1]. The field consists of a collection of methods, standards and algorithms that are used in a variety of research areas such as machine learning, marketing, statistics, artificial intelligence, graphical visualization and more. This document describes how data mining and statistical analysis are applied to a real problem of production floor optimization in factories. To carry out the task, an automated module was developed that exploits huge amounts of data, on which it performs periodic analysis in order to produce conclusions whose application can increase the factory's efficiency and/or its profit.

Although data mining looks like a promising solution for knowledge discovery and for drawing conclusions, it does not solve every problem in the field. Some data mining methods work well in certain cases but fail when the problem domain changes [2]. Therefore, correct modeling of the problem domain and finding a suitable algorithm significantly affect the quality of the results and their added value. This project surveys the theoretical background required to support the underlying premise and the challenges that arose when we tried to apply the data mining process to a real system with constraints, limitations and tradeoffs.

Table of Contents

Introduction
    Preface
    The data mining process
        Definition of the problem
        Choosing an adequate model for the problem
        Model execution over a training data set
        Model validation over a test data set
        Scoring, concluding and applying
Theoretical background
    Regression Analysis
        Linear Regression
        Least Median Square
        Support Vector Machine
    Statistics
        Correlation Coefficient
        Coefficient of determination
Preliminaries
    Definitions and Notations
        The order
        Logical Build
        Logical Image
        Physical Media Creation
        Factory machines and products deployments
    Objective
    Motivation
    Innovation
    Challenges
    Organization of this document
Experiment
    Building the model
    Executing the model
    Results
        (Ordinary Least Squares) Linear Regression
        Linear Regression (Least Median Square)
        Support Vector Regression
    Conclusions
Module Design
    Entity Relational Diagram
    Class Diagrams
    Algorithms
        Algorithm for Warehousing
        Algorithm for Model Building
        Algorithm for Training and Validation
        Algorithm for scoring
        Algorithm for Applying
    Services
Results and Conclusions
    Module Results
Further research
    Improvements
    Additional data mining tasks
References

List of Figures

Figure 1 - 2D Linear Regression
Figure 2 - 3D Linear Regression
Figure 3 - Linear Classifiers of Data
Figure 4 - Several sets of (x,y) points, with the correlation coefficient of x and y for each set
Figure 5 - Meteora production process schematics
Figure 6 - Histogram of number of attributes products have
Figure 7 - Production time as a function of Attr1
Figure 8 - Production time as a function of Attr2
Figure 9 - Production time as a function of Attr3
Figure 10 - Entity Relation Diagram of the module
Figure 11 - Class Diagram 1/2
Figure 12 - Class Diagram 2/2
Figure 13 - r2 as a function of products (sorted by descending r2)

1 Introduction

Preface

Manufacturing enterprises rely on vast amounts of data and information stored in large databases [3]. Standard information systems usually use only a small fraction of that data in their manufacturing management, resource allocation, supply chain, customer relationship management and quality assurance applications. That fraction is in most cases sufficient for the common applications to perform calculations, summaries and, in general, any data manipulation task that can be expressed in a structured programming or query language. Nevertheless, most modern manufacturing industries have evolved in all aspects of business intelligence and gained an important technological and/or financial advantage that helps them overcome present challenges. In order to extract valuable information from a very large amount of raw data, a manufacturing application needs to overcome the limitations of a relational, one-dimensional query language.

Business intelligence offers a variety of methods for extracting valuable information from the data collected from operational transactions executed during the manufacturing process. The field of business intelligence is divided into three main categories:

Query and Reporting (Information): Extraction of detailed and roll-up data; answers to questions such as what, who and when. Example: which machine had the smallest downtime in the last month?

On-line Analytical Processing (Analyses): Summaries, trends and forecasts. Example: what is the average number of malfunctions reported by a machine? By work shift?

Data Mining (Insight and Predictions): Knowledge discovery of hidden patterns; answers to why and how. Example: what will influence a machine to malfunction?

The patterns identified by data mining solutions can be transformed into knowledge, which can then be used to support business decision making [2].

The data mining process

Using data mining techniques for knowledge discovery is an iterative, deterministic process that involves a series of steps. After examining behavioral reports acquired from the OLAP stage, we can make educated guesses regarding relationships between behaviors measured in the factory. For example, machine downtime is an inevitable reality and can be naively assumed to be randomly distributed. For each downtime, the operational system archives a vast amount of data related to that specific instance, e.g. days since the last calibration of the machine, month of year, day of week, last 10 products produced, load before downtime, machine startup time, user logged in, percentage of faulty products, etc. The list is long and, in most cases, the majority of the attributes collected are completely irrelevant to our target variable, whereas the remainder conform to a finite set of predictive models. Those models are used to quantify the correlation between the features and the target variable and can be further analyzed to compute the correlation's quality. The data mining process includes the following steps:

Definition of the problem

Upon examining the production chain of a factory, we find that most manufacturers suffer from the same common difficulties related to the production process. Downtimes, malfunctions, high production latency, raw

material shortage are some of the key issues taken into account when designing a production floor optimization scheme. From the problems mentioned above, and by modeling the factory's entities, we can derive our target variables and propose a thesis for our data mining process.

Warehousing

Warehousing is a broad area defining several standards and best practices for transforming raw, distributed, non-homogeneous, high-volume data into flat, normalized, scaled, unified, analysis-ready feature vectors. Once we have our basic problem definition, data gathering may begin. In a manufacturing enterprise, data is collected by gathering information distributed throughout the enterprise operational system. That information suffers from data type inconsistency, missing attributes and edge values. These can be handled by preprocessing the data, which may also include normalization and clipping (some algorithms do not tolerate outliers). The recommended approach for storing the warehouse is a flat table of feature vectors, where each item in a vector holds the value of the respective attribute.

Choosing an adequate model for the problem

The model chosen for the analysis is usually derived directly from the problem definition. Most common problems can be modeled and reduced so that they can be solved using one of a finite set of methods for discovering patterns and relationships in the data. The field of data mining spans a variety of possibilities for productive analysis of data, including:

Classification: binary decisions (would this product meet its due date [t/f]?) and n-ary binning (what is the severity of an unknown new malfunction [low/medium/high]?).

Regression: continuous measurements (how much money will the factory earn next year?).

Association Rules: does one event imply another (e.g. a specific employee's shift and a raw material shortage)?

Clustering: finding groups which are very different from one another, but whose members are "alike" [6] (which products are similar enough to be promoted together?).

Feature Selection: ranking the importance of gathered measurements (what is the most influential parameter in the invalidation of a product?).

Once the problem has been assigned to one of the above subjects, the analysis reaches a branching point. Each subject holds an impressive arsenal of heterogeneous algorithms, which in turn provide a variety of configuration parameters for execution. In addition, each algorithm involves tradeoffs among CPU load, memory consumption, running time and analysis quality. After several experiments with sampled data, only a few algorithms will be found suitable for the problem. Once the number of candidate algorithms is reduced, nothing prevents executing several of them and choosing the best at run time.

Model execution over a training data set

After a model has been chosen, the analyses may begin. The warehouse is sampled for a certain percentage of its records to obtain the training data set. The sampling stage assumes a random, independent distribution of the data. The model executes over that set of data, and the analysis results are the coefficients and the statistics. A common problem arises when a model behaves very well on the specific data set it was executed on, but yields poor results when executed on another data set. The recommended solution is to rerun the model on a data set reserved for testing (also called the validation set).

Model validation over a test data set

Overcoming the problem introduced in the previous section requires running the model against a set of validation data. The validation data should (just like the training data) be independent and randomly sampled with uniform distribution. Best practices suggest a 1:3 ratio between the validation and the

training data used to build the model. After validation the model is mature enough to be applied to new data.

Scoring, concluding and applying

Discovering valuable information after a thorough analysis session resembles finding gold in a mine. The results of the analysis can be interpreted by a human analyst, who translates the numbers into facts, while the quality of the model can be evaluated using statistical measures and visualized in charts and reports. After scoring and concluding, the trained and validated model is applied to new sets of data. The knowledge acquired in the process can be used to refine the model and uncover further insights.
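The training and validation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the module's implementation: the data is synthetic, the helper names (split_records, fit_line, mean_absolute_error) are hypothetical, only a single explanatory attribute is used, and the validation fraction of 0.25 simply mirrors the 1:3 validation-to-training ratio mentioned above.

```python
import random


def split_records(records, validation_fraction=0.25, seed=0):
    """Randomly split records into (training, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def fit_line(pairs):
    """Least squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, pairs):
    """Average absolute difference between observed and predicted y."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in pairs) / len(pairs)


# Synthetic (attribute value, production time) records: assumed data only.
noise = random.Random(1)
records = [(x, 3.0 * x + 10.0 + noise.gauss(0.0, 2.0)) for x in range(200)]

training, validation = split_records(records)
model = fit_line(training)                      # build on training data only
slope, intercept = model
training_mae = mean_absolute_error(model, training)
validation_mae = mean_absolute_error(model, validation)
print(len(training), len(validation), training_mae, validation_mae)
```

A validation error much larger than the training error would point to the problem described above, where a model behaves well only on the data it was built from.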

2 Theoretical background

This section summarizes the mathematics behind the data mining methodology, with a focus on regression analysis and three of its implementation algorithms.

Regression Analysis

In statistics, regression analysis examines the relation of a dependent variable (the response variable, $y$) to specified independent variables (the explanatory variables $x_1, x_2, \ldots, x_n$). The mathematical model of their relationship is the regression equation. The algorithms used in this project are limited to linear regressions with an equation of the form

$$y = \sum_i a_i x_i,$$

where the $a_i$ are the regression coefficients [7].

Linear Regression

Linear regression is a regression method that models the relationship between a dependent variable $Y$, independent variables $x_i$ ($i = 1, \ldots, n$) and a random term $\varepsilon$. The model can be written as

$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \varepsilon,$$

where $\beta_0$ is the intercept ("constant" term), the $\beta_i$ are the respective parameters of the independent variables, and $n$ is the number of independent variables, so that $n + 1$ parameters are estimated in the linear regression. This model is called "linear" because the relation of the response (the dependent variable $Y$) to the independent variables is assumed to be a linear function of the parameters.
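As an illustration of the linear model above, the following Python sketch estimates the intercept and the coefficients by ordinary least squares through the normal equations. It is a sketch under stated assumptions: the function names (solve, fit_linear, predict) are hypothetical, the data is synthetic and noise-free, and a small Gauss-Jordan solver stands in for a numerical library.

```python
def solve(a, b):
    """Solve the square system a * x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(xs, ys):
    """Return [beta_0, beta_1, ..., beta_n] minimizing the sum of squared
    residuals, by solving the normal equations (X^T X) beta = X^T y."""
    rows = [[1.0] + list(x) for x in xs]   # leading 1 carries the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, x):
    """Evaluate beta_0 + sum_i beta_i * x_i for one feature vector x."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


# Synthetic, noise-free data generated from y = 5 + 2*x1 - 1*x2.
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
ys = [5.0 + 2.0 * x1 - 1.0 * x2 for x1, x2 in xs]

beta = fit_linear(xs, ys)
estimate = predict(beta, (6.0, 2.0))
print(beta, estimate)
```

Because the synthetic data contains no random term, the fit should recover the generating coefficients up to rounding; with real production data the residuals would not vanish.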

Figure 1 - 2D Linear Regression

Figure 2 - 3D Linear Regression

Least Median Square

In regression analysis, Ordinary Least Squares (OLS) is a method for linear regression that determines the values of unknown quantities in a statistical model by minimizing the sum of the squared residuals (the differences between the predicted and observed values) [8]. Least Median Square is a variation of the Ordinary Least Squares method that minimizes the median of the squared residuals instead of their sum, which yields a robust regression that is far less affected by outliers in the data set [9].

Support Vector Machine

Support vector machines map input vectors to a higher-dimensional space where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes. The assumption is that the larger the margin, or distance, between these parallel hyperplanes, the lower the generalization error of the classifier will be [10]. As shown in Figure 3, many linear classifiers separate the data. However, only one achieves maximum separation (with respect to that specific data set).

Figure 3 - Linear Classifiers of Data

A support vector machine is a classification algorithm. In order to benefit from it in the area of regression we can use its version called Support Vector

Regression. Instead of finding hyperplanes that maximize the separation, the regression algorithm finds one hyperplane that crosses the vectors and minimizes the distance of all vectors from that hyperplane.

Statistics

The linear model needs to be evaluated before it is used in real applications. There are several methods of evaluating linear models; some involve the following statistics:

Correlation Coefficient

An indicator of the strength and direction of a linear relationship between two random variables [11]. The coefficient is denoted $\rho_{X,Y}$ and calculated as

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y},$$

where $X, Y$ are random variables, $\sigma_X$ is the standard deviation of $X$ and $\sigma_Y$ is the standard deviation of $Y$.

Figure 4 - Several sets of (x,y) points, with the correlation coefficient of x and y for each set
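The correlation coefficient defined above can be computed directly from its definition. The following Python sketch uses made-up sample values and a hypothetical function name (correlation); it estimates the covariance and the standard deviations from the sample and shows the three typical cases: a perfect increasing relation, a perfect decreasing relation, and a symmetric non-linear relation that has no linear correlation.

```python
import math


def correlation(xs, ys):
    """Sample estimate of cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


# Illustrative values only (not taken from the factory data).
r_up = correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_down = correlation([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
r_parabola = correlation([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])
print(r_up, r_down, r_parabola)
```

The last case is a reminder that a coefficient near zero rules out only a linear relationship: the points lie exactly on a parabola, yet their linear correlation vanishes.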

Coefficient of determination

The proportion of variability in a data set that is accounted for by a statistical model [12]. In simpler terms, the coefficient of determination (denoted $R^2$) is the fraction of the variability in the data that the linear model was able to "explain". The variability is defined as the sum of squares. $R^2$ is calculated as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

where $y_i$ is the i-th value from the samples of $Y$, $\hat{y}_i$ is the estimated i-th value of $Y$ from the regression model, and $\bar{y}$ is the arithmetic mean of the $Y$ values.

Another method to calculate the coefficient of determination uses the correlation coefficient:

$$R^2 = (\text{correlation coefficient})^2$$
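Both ways of computing the coefficient of determination can be compared numerically. The Python sketch below is an illustration on made-up values with hypothetical function names; it assumes a single explanatory variable and a least squares fit that includes an intercept, which is the setting in which the two calculations coincide.

```python
import math


def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, predictions):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def correlation(xs, ys):
    """Sample correlation coefficient of xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative values only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
predictions = [slope * x + intercept for x in xs]
r2_from_sums = r_squared(ys, predictions)
r2_from_correlation = correlation(xs, ys) ** 2
print(r2_from_sums, r2_from_correlation)
```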

3 Preliminaries

Definitions and Notations

"Meteora" is a high-scale software development center. Each year several software products are released and join Meteora's product list. For purposes of software distribution, Meteora holds a production floor that handles on-line building, production, packaging and distribution of Meteora's software releases. The production process is iterative and includes several stages.

The order

When a customer wishes to purchase one of Meteora's products, he contacts the ordering center and places an order. While filling in the order form, the customer can choose the distribution method (CD/DVD by snail mail or delivery), the software distribution corresponding to the customer's needs (Enterprise, Personal, etc.), the number of copies/licenses and several other parameters related to the product features.

Logical Build

After the order is processed, it initiates a production request in the factory. The first phase in the production process is called "logical build" (LB) and can take between 10 seconds and 2 hours, depending on the software itself and other order parameters. The logical build process is executed on dedicated machines (Logical Build Servers, LBS). An LBS gets its input from a central resource allocation module (explained later on). The input contains the software name, version, number of copies and other parameters the user requested in the ordering process. A common LBS execution usually includes transferring a copy of the source files that reside on a main version control server and executing the software's proprietary build process (compilation, linking, build signing, etc.).

Logical Image

A logical image (LI) is the image of the software to be burnt on the media (CD/DVD) or sent via e-mail. After an LB is created, the Logical Imaging Servers (LIS) create images from the LB (ready-for-distribution copies of the product). In this stage the LIS derive data from the LB and execute image-dependent tasks (e.g. creation of license keys and assignment of serial numbers). The image creation process also depends on parameters within the order (e.g. the parameter "number of copies" increases the LI creation time linearly; the software version [Enterprise, Personal, etc.] affects the image size and therefore directly influences the LI production time). The LI creation process takes between 1 second and 2 hours.

Physical Media Creation

When the LI is ready, all that is left is burning the image on physical media (PM), i.e. CD/DVD. For that purpose, the factory holds several machines for producing physical copies of the images. Some of those machines are manually operated by dedicated personnel and the rest are fully automatic. The physical media is delivered to the customer by snail mail. (Some orders are sent directly to the customer by e-mail. In this scenario, the physical media creation stage is bypassed and the LI is sent directly using e-mail / FTP.)

Factory machines and products deployments

In this project we mainly deal with the three main products: LB, LI and PM. For each software order, all three of those products need to be produced. The factory manufactures about 500 software products. All the factory machines are resource-planned by a centralized production management system (CPMS). The CPMS is responsible for scheduling, machine control and monitoring of the entire production process (from the order stage to the final media packaging). While the production floor is operational

and active, the CPMS gathers raw production data (e.g. production times, faulty percentage, malfunction statistics, etc.). This vast amount of data is used for analysis, inquiries and summary reports.

Figure 5 - Meteora production process schematics

Objective

The project's main goal is to find correlations between the production times of different products and their order-specific parameters (those production parameters are referred to as attributes [or features] from now on, in order to conform to standard data mining notation). An example of an environmental attribute in PM is the size of the LI: it is an educated guess that the media burning time depends linearly on the size of the image. Similarly, the enterprise versions usually contain more source files to be built in the Logical Build stage and therefore increase the LB production time linearly. Some of the attributes are product specific, e.g. the version of the third-party database to be packaged alongside the software product itself, which is likely to increase the production time. Nevertheless, not all the attributes influence the production time; e.g. the customer is also an attribute in the order, but it would not affect production times at all.

Figure 6 - Histogram of number of attributes products have (axes: number of attributes each product has, number of products)

Motivation

If prior knowledge of the production times could be attained (or at least fairly estimated), the resource allocation module could use it to apply more sophisticated scheduling algorithms. This information can also be used by the CPMS to alert if any product will not be ready on time. A fair estimate of production times could also be the basis for finding additional problems in the production process (e.g. if a product is estimated to be produced in 1-2 minutes and its actual production time is 20 minutes, it would raise a red flag).

Innovation

Most organizations employ several data analysts to handle the data mining, knowledge discovery and business intelligence aspects of the enterprise. Unfortunately, Meteora does not have the trained personnel to do the data analyses; therefore, this project provides a replacement for expensive personnel by automating the entire data mining process. When no human is involved in the process, the question of how to quantify the model's quality arises. For that purpose, the module defines a set of heuristics and ranking methods for scoring the calculated models without input from a user.

Challenges

1. Every product has its own set of influencing attributes, which are hidden inside an XML document that arrives with the order.
2. The order-specific attributes arrive without knowledge of their data type. The value 120 can be interpreted either as a number or as an item in a set {100, 110, 120, 130, 140, 150}.

3. Not all products have a sufficient number of archived historical production entries for analysis.
4. For several products, the production time depends neither on measured attributes nor on order-specific parameters.
5. Both the warehousing and the analysis consume expensive resources such as CPU time, I/O and memory, and must therefore be executed when the factory is not operational, i.e. within a time frame of one night (approx. 6 hours).

Organization of this document

The next chapter describes the experiment held on one of the products in the factory, alongside its results and conclusions. The chapter that follows it is dedicated to the design of the module, accompanied by Entity-Relational diagrams, class diagrams and pseudo code. Chapter 5 summarizes the results of the execution and details the conclusions. The closing chapter describes possible further research in the area.

4 Experiment

The experiment performs all the data mining tasks defined for this project on a sample test case containing a single product, which was sampled for this purpose. This product (220) has 3 attributes (denoted attr1, attr2 and attr3 respectively) and about 500 production instances for the regression. The hypothesis is that the production time of product 220 depends linearly on attr2. The pilot will strengthen or weaken this assumption.

Building the model

Before building the model we can visualize the correlations between the independent variables and the target attribute. Each of the independent attributes used in this experiment is plotted against the target variable.

Attribute #1

Figure 7 - Production time as a function of Attr1

Attribute #2

Figure 8 - Production time as a function of Attr2

Attribute #3

Figure 9 - Production time as a function of Attr3

From observing the plots of the production time as a function of each of the attributes, we can guess that attr1 influences the production time, while attr2 and attr3 do not drastically affect the dependent variable.

Executing the model

The sample warehouse used to execute the model contains 475 records. We used 66% of the instances to build the model (training data), and the remaining 34% to validate the model built on the training data (test data). To achieve the best result, we executed the model using several regression algorithms: Ordinary Least Squares Linear Regression, Least Median Square Linear Regression and Support Vector Regression.

Results

(Ordinary Least Squares) Linear Regression

Prodtime = … Attr…

Statistic | Value
Correlation Coefficient | …
Mean Absolute Error | …
Root Mean Square Error | …
Relative Absolute Error | … %
Root Relative Square Error | …
R2 (Coefficient of Determination) | …

Linear Regression (Least Median Square)

Prodtime = … Attr… … Attr… … (Attr3=Value2) …

Statistic | Value
Correlation Coefficient | …
Mean Absolute Error | …
Root Mean Square Error | …
Relative Absolute Error | … %
Root Relative Square Error | …
R2 (Coefficient of Determination) | …

Support Vector Regression

Prodtime = … Attr1(Norm.) … Attr2(Norm.) … Attr3(Norm.) …

Statistic | Value
Correlation Coefficient | …
Mean Absolute Error | …
Root Mean Square Error | …
Relative Absolute Error | … %
Root Relative Square Error | …
R2 (Coefficient of Determination) | …

Conclusions

The experiment showed that there are relations between the order-specific attributes and the production time. The value of R2 (about 65%) shows that the model is better than the null model (the model without explanatory attributes).
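The comparison against the null model can be made concrete with a short Python sketch. It rests on assumptions that the text does not spell out: the null model is taken to be a predictor that always outputs the mean of the observed production times, and the relative errors are taken to be the model's errors divided by the null model's errors, which is a common convention for these statistics. The function name and all numbers are illustrative and are not the experiment's results.

```python
import math


def error_statistics(actual, predicted):
    """Error statistics of predictions, relative to a mean-only null model."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Errors of the null model, which predicts the mean for every record.
    null_abs_err = sum(abs(a - mean_actual) for a in actual)
    null_sq_err = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mean_absolute_error": abs_err / n,
        "root_mean_square_error": math.sqrt(sq_err / n),
        "relative_absolute_error": abs_err / null_abs_err,
        "root_relative_square_error": math.sqrt(sq_err / null_sq_err),
        "r_squared": 1.0 - sq_err / null_sq_err,
    }


# Made-up production times (seconds) and predictions, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
null_predicted = [sum(actual) / len(actual)] * len(actual)

model_stats = error_statistics(actual, predicted)
null_stats = error_statistics(actual, null_predicted)
print(model_stats)
print(null_stats)
```

Under these definitions the null model scores a relative error of 1 (100%) and an R2 of 0, so any regression model with relative errors below 1 improves on it.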

5 Module Design

The project is divided into five independent stages that are combined to accomplish the final result. Those stages are: warehousing, model building, attribute selection (optional), regression, validating and scoring against test data, and finally applying the model to new sets of data. This section describes the algorithms in use, the tables and objects, and how the implementation challenges (see chapter 2) were handled.

Entity Relational Diagram

Figure 10 - Entity Relation Diagram of the module (entities shown: Production Entries Summary, Product Specific Attributes, Product Generic Attributes, Products, Warehouse, Applied Products, Null Models, Regression Models, Regression Algorithms, Regression Config)

Products

Holds information about all the products in the factory.

Column Name | Data type | Description
Id (PK) | Numeric | 4-digit unique identifier of a product. The first digit states the product's type and the remaining 3 are the numeric identifier of the product itself.
Type | String | {Type 1, Type 2, Type 3}
Number | Numeric | Last 3 digits of the id

Product Generic Attributes

Holds all the attributes of the product types which are not order specific. These attributes are predefined by the data analyst.

Column Name | Data type | Description
Product Type (FK -> Products.Type) | String | Product type identifier {Type 1, Type 2, Type 3}
Attribute Name | String | Name of the attribute
Attribute Datatype | String | The datatype of the attribute. {Numeric, String}
Attribute Get Function | String | The function used to get the attribute value. This function will be called on warehouse construction.

Product Specific Attributes

Holds all the order-specific attributes of all the products. This table is built dynamically and rebuilt each month.

Column Name | Data type | Description
Product Id (FK -> Products.Id) | Numeric | Product identifier
Attribute Name | String | Name of the attribute
Attribute Datatype | String | The datatype of the attribute. {Numeric, String}
Attribute Get Function | String | The function used to get the attribute value. This function will be called on warehouse construction.

Attributes Blacklist

Holds every attribute that was taken as numeric while its actual value is a string (see Challenges). Those attributes are inserted into this table at run time, and the warehousing process retries without sampling them.

Column Name | Data type | Description
Product Id (FK -> Products.Id) | Numeric | Product identifier
Attribute Name | String | Name of the attribute not to be included in the sampling process

Warehouse

This table holds a flat representation of the products and their attributes. The attribute columns are common to all the products and are distinguished by the attribute index stored in "Product Specific Attributes" and "Product Generic Attributes".

Column Name | Data type | Description
Id | Numeric | Unique identification of a product within its type
Product Id | Numeric | Product identifier
Purpose | String | This table stores the warehouse both for training the regression model and for applying it to new records. {Training, Applying}
Production Time | Numeric | Number of seconds the production of this product instance took (the dependent variable).
Attribute1 | String | The value of attribute #1
Attribute2 | String | The value of attribute #2
...
Attribute200 | String | The value of attribute #200

Regression Models

Holds the model results for each product. This table also holds a repository of the models themselves, to be retrieved in the applying stage.

Column Name | Data type | Description
Id | Numeric | Unique identifier
Product Id | Numeric | Product identifier
Model | Binary | After the model building, the classifier is serialized and stored in this column.
Details | String | Human-readable results of the regression.
Correlation Coefficient | Numeric | Correlation coefficient of this regression, in [-1, 1].
Mean absolute Error | Numeric | Mean absolute error of this regression.
Relative absolute Error | Numeric | Relative absolute error of this regression.
R Square | Numeric | Coefficient of determination. The percentage of the variance that can be explained by the regression.

Null Models

Some of the products do not have enough records to perform regression. For these products the null model is used instead.

Column Name | Data type | Description
Id | Numeric | Unique identifier
Product Id | Numeric | Product identifier
Model | Binary | After the model building, the classifier is serialized and stored in this column.

Regression Algorithms

Lists all the regression algorithms the module uses.

Column Name | Data type | Description
Id | Numeric | Number of the algorithm
Algorithm | String | Name of the algorithm

Regression Config

Summarizes all the configuration parameters for the algorithms.

Column Name | Data type | Description
Id (PK) | Numeric | Unique identifier
Regression Algorithm (FK) | Numeric | Foreign key to regression algorithms
Param Name | String | The name of the parameter
Option Flag | String | Short string representation
Param Value | Numeric | The value of the parameter

Applied Products

All the new products are queued in this table until the Applier service sets their production time using the pre-stored regression models.

Column Name | Data type | Description
Id | Numeric | Id of the product
Product Id | String | FK to products table
Production Time | Numeric | The estimated production time for this product, in seconds.
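The interplay between the Regression Models table (serialized model in a binary column) and the Applied Products queue can be sketched as follows. This is only an illustration under loud assumptions: an in-memory SQLite database stands in for the module's real database, the snake_case table and column names are hypothetical renderings of the tables above, the attr1 column on the queue is a simplification (the text keeps attribute values in the warehouse), the stored model is a plain dictionary of made-up coefficients serialized with pickle, and apply_models is a hypothetical stand-in for the Applier service.

```python
import pickle
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE regression_models ("
    "id INTEGER PRIMARY KEY, product_id INTEGER, model BLOB)"
)
db.execute(
    "CREATE TABLE applied_products ("
    "id INTEGER PRIMARY KEY, product_id INTEGER, attr1 REAL, "
    "production_time REAL)"
)

# Model building stage: serialize the trained model into the binary column.
trained_model = {"intercept": 12.0, "coefficients": [0.5]}   # made-up values
db.execute(
    "INSERT INTO regression_models (product_id, model) VALUES (?, ?)",
    (220, pickle.dumps(trained_model)),
)

# A new product instance is queued without a production time.
db.execute(
    "INSERT INTO applied_products (product_id, attr1) VALUES (?, ?)",
    (220, 40.0),
)


def apply_models(connection):
    """Fill in the estimated production time of every queued product."""
    queued = connection.execute(
        "SELECT id, product_id, attr1 FROM applied_products "
        "WHERE production_time IS NULL"
    ).fetchall()
    for row_id, product_id, attr1 in queued:
        stored = connection.execute(
            "SELECT model FROM regression_models WHERE product_id = ?",
            (product_id,),
        ).fetchone()
        if stored is None:
            continue   # no regression model stored for this product
        model = pickle.loads(stored[0])
        estimate = model["intercept"] + model["coefficients"][0] * attr1
        connection.execute(
            "UPDATE applied_products SET production_time = ? WHERE id = ?",
            (estimate, row_id),
        )


apply_models(db)
estimated_time = db.execute(
    "SELECT production_time FROM applied_products"
).fetchone()[0]
print(estimated_time)
```

Storing the serialized model next to its statistics means the applying stage does not need to retrain anything; it only deserializes and evaluates.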

Class Diagrams

Figure 11: Class Diagram 1/2

Figure 12: Class Diagram 2/2

Algorithms

Algorithm for Warehousing

Warehousing is done at the beginning of the data mining process. Its objective is to create a simplified (flat) representation of the products and their attributes. Most attributes of a product can be derived directly from the database, but some require more complicated computation to retrieve. Warehousing is usually run as a nightly batch process to save computation time.

The main difficulty this process has to overcome is dealing with products that do not have a sufficient number of production records (Challenge 3). Another question is how much to sample. On the one hand, computation time grows with the sample size; on the other, small samples may not be sufficient to profit from regression analysis. The range of 500 to 5000 records per product was chosen by trial and error as the best tradeoff found between resource consumption and analysis quality.

Iterate all the products in the CPMS database and for each product p
    Query the production archives for product p
    If num_of_records_for_product_p < 500 then
        Skip this product
    Else
        Sample min(num_of_records_for_product_p, 5000) records
    End If
    For each production record
        Get a static list of product p's type generic attributes
        Get a dynamic list of product p's order specific attributes
        Gather all the attribute values of the product
    End Loop
End Loop
ApplyPreProcessingFiltering(MissingValuesFilter, EdgeValuesFilter, NormalizationFilter);

Algorithm for Model Building

Compound attributes creation

For any two numeric attributes a1 and a2, produce a third attribute a1,2 that holds the product of a1 and a2. For any two alphanumeric attributes c1 and c2, produce a third string attribute c1,2 that holds the concatenation of c1 and c2.

Regression Algorithms

In order to achieve the best results, we execute several regression algorithms and choose the best one at run time according to the results of their models. The algorithms differ from each other: on some data sets algorithm A will score higher than algorithm B, and on other data sets the opposite holds. The algorithms used are Ordinary Least Squares Linear Regression, Least Median Squares Linear Regression, and Support Vector Regression.

Algorithm for Training and Validation

The model is first fitted on the training data set. The resulting regression coefficients are then tested against the validation data set. The data set is divided into 2/3 training and 1/3 validation. The purpose of validating against a different data set is to detect potential problems with the model, such as under-fitting or over-fitting.

Execute model on Training Data to retrieve Training_Regression_coefficients
Run model with Training_Regression_coefficients on Validation Data
Summarize statistics (RMSE, correlation coefficient, r², MAE, etc.)
Store the model:
If there is no model stored for this product
    Store the current model
Else
    If the current model is better than the stored model (scoring methods are introduced below)
        Store the current model
    Else
        Discard the current model
    End If
End If

Algorithm for Scoring

There are 2 types of scoring mechanism. The first observes the values of the regression statistics. The coefficient of determination and the correlation coefficient are the statistics we use to quantify the quality of the model. Our goal is to get a higher r² than the previous model. For the second method we suggest the following algorithm:

For each product p in an input set of products
    Schedule p to m machines using SPT
End Loop

The algorithm is executed with two inputs: jobs with real processing (production) time, and jobs with estimated processing time from the regression analysis. The difference between the two makespans can be used to determine the quality of the model.

Algorithm for Applying

The best model for each product is stored in the database as a result of the previous stage. This algorithm uses a special model, called the "null model", for products that do not have a regression analysis. The null model uses a naive heuristic (the simple arithmetic mean) for production time estimation.

On-Event: New production order arrived
Iterate all products in the production order and for each product p
    Retrieve the model stored for product p
    If p doesn't have a regression model then
        Retrieve the null model
    End If
    Calculate the production time of product p according to the model coefficients
End Loop

Services

The data mining process is performed by 3 services that are independent of each other:
1. Data Warehouser: Handles the sampling, collecting, pre-processing and storing of the data.
2. Data Analyzer: Computes regression analyses on the warehouse and stores the results in the database.
3. Data Applier: Retrieves the best model for each product from the repository and uses it to apply new records.
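The second scoring method described in the scoring algorithm can be sketched in Python as follows. This is a minimal illustration, not the module's code. It assumes that SPT means list scheduling in shortest-processing-time-first order on m identical machines, that each job goes to the machine that becomes free first, and that the "difference between the makespans" is the absolute difference. The function names `spt_makespan` and `makespan_gap` are hypothetical.

```python
import heapq


def spt_makespan(processing_times, m):
    """Schedule jobs on m identical machines in SPT order; return the makespan."""
    if m < 1:
        raise ValueError("m must be at least 1")
    # Min-heap of the times at which each machine becomes free.
    finish_times = [0.0] * m
    heapq.heapify(finish_times)
    # Shortest jobs first; each job goes to the earliest available machine.
    for duration in sorted(processing_times):
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)


def makespan_gap(real_times, estimated_times, m):
    """Absolute difference between the makespans of the two SPT schedules."""
    return abs(spt_makespan(real_times, m) - spt_makespan(estimated_times, m))
```

For example, `makespan_gap(real, estimated, m)` would be called with the recorded production times and the model's estimates for the same set of jobs. Under the assumptions above, a smaller gap suggests that the estimates lead to a schedule length closer to the one obtained from the real times.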

6 Results and Conclusions

Module Results

The models produced by our module are evaluated using the value of r², where high values of r² represent good models. The following chart shows the value of r² for each product (sorted by descending r²).

Figure 13: r² as a function of products (sorted by descending r²)

From the chart above we can see that only a small number of products received r² greater than 0.2. In common applications of data mining, values of r² higher than 20% represent a good predictive model; below 20%, the model is considered weak.
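As a reference for how such an evaluation could be computed, the following Python sketch calculates r² from real and predicted production times and applies the 20% threshold mentioned above. It assumes the common definition r² = 1 - SS_res / SS_tot, which may differ from the statistic reported by the tool actually used in the project. The function names are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(actual) / len(actual)
    # Total variation of the real values around their mean.
    ss_tot = sum((y - mean) ** 2 for y in actual)
    # Variation left unexplained by the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    if ss_tot == 0:
        raise ValueError("r^2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


def is_good_model(r2, threshold=0.2):
    """Apply the 20% rule of thumb used in this chapter."""
    return r2 > threshold
```

With this definition, always predicting the arithmetic mean of the evaluated values gives r² = 0, so a regression model has to explain more than 20% of the variance beyond that baseline to pass the threshold.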

Algorithms performance

For each product, all 3 regression algorithms were executed. The following table summarizes their success.

Measure | OLS Linear Regression | Least Median Squares Linear Regression | Support Vector Regression
Best values of r² | 76% | 14% | 10%
Memory consumption | 10% | 15% | 75%
CPU time | 13% | 19% | 78%

Conclusions

1. Only a small number of products had their production time successfully captured by a linear model. The majority of the products did not fit the linear regression model, for which there could be several reasons:
   - Some products are produced in parallel on the same machine. This introduces an undesired error into the production times, which decreases the quality of the linear model.
   - Most of the products are not affected by order-specific parameters.
2. OLS linear regression is by far the best algorithm for these data sets. It is the fastest and the most accurate of them all. Nevertheless, it is worthwhile to invest the time and resources (CPU and memory) to execute the other algorithms: despite their heavy resource consumption, they achieved higher values of r² for some products.

7 Further research

Improvements

Our thesis suggests that the production times of products in the factory depend on attributes that arrive with the order. From the project's results, we can see that such a correlation does exist. Furthermore, we can improve the results (higher values of r²) by measuring attributes from the production phase itself in addition to those retrieved from the order.

Another improvement relates to the warehousing process itself. Existing sampling techniques largely hide any temporal relationships in the temporal data [3]. Pre-investigating the attribute values could uncover autocorrelations between pairs (or more) of attributes. Autocorrelations can mislead the data mining algorithm and decrease the quality of the results. In order to minimize that effect we can apply a suitable sampling technique.

Additional data mining tasks

The project's primary focus was on calculating the production times of the various products in the factory. The implementation of the data mining module in CPMS can be the basis for further estimations:

Size of product: The physical storage the products occupy on the servers is also an important consideration when planning the deployment of the server farm that handles the production of LI and LB. This variable can also be estimated using regression analyses on the order-specific parameters.

Malfunctions: Each malfunction also dumps a massive amount of environmental measurements taken before, during and after that specific malfunction instance. The causes of malfunctions, as far as the factory's knowledge goes, are stochastic, but with the right choice of explanatory attributes, they can also be predicted (at least to an extent).

8 References

[1] Adriaans, P., Zantige, D., 1996, Data Mining, Addison-Wesley, UK.
[2] Krzysztof Cios, Witold Pedrycz, Roman Swiniarski, 1998, Kluwer Academic Publishers, p. xvii.
[3] E. I. Neaga, J. A. Harding, "A Review of Data Mining Techniques and Software Systems to Improve Business Performance in Extended Manufacturing Enterprises"
[4] George Fernandez, 2003, "Data mining using SAS applications", Chapman & Hall/CRC, p. 21
[5] J.A. Harding, M. Shahbaz, Srinivas, A. Kusiak, 2006, "Data Mining in Manufacturing - A Review"
[6] Yaakov Zehavi, Data mining course slides, Introduction, s21
[7] Wikipedia, Linear Regression
[8] Wikipedia, Least Median Square
[9] Humberto Barreto, 2001, An Introduction to Least Median of Squares
[10] Wikipedia, Support Vector Regression
[11] Wikipedia, Correlation
[12] Wikipedia, Coefficient of determination

Web Data Mining: A Case Study. Abstract. Introduction

Web Data Mining: A Case Study. Abstract. Introduction Web Data Mining: A Case Study Samia Jones Galveston College, Galveston, TX 77550 Omprakash K. Gupta Prairie View A&M, Prairie View, TX 77446 okgupta@pvamu.edu Abstract With an enormous amount of data stored

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

On Correlating Performance Metrics

On Correlating Performance Metrics On Correlating Performance Metrics Yiping Ding and Chris Thornley BMC Software, Inc. Kenneth Newman BMC Software, Inc. University of Massachusetts, Boston Performance metrics and their measurements are

More information

A Prototype System for Educational Data Warehousing and Mining 1

A Prototype System for Educational Data Warehousing and Mining 1 A Prototype System for Educational Data Warehousing and Mining 1 Nikolaos Dimokas, Nikolaos Mittas, Alexandros Nanopoulos, Lefteris Angelis Department of Informatics, Aristotle University of Thessaloniki

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information

Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA

Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA Digging for Gold: Business Usage for Data Mining Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA ABSTRACT Current trends in data mining allow the business community to take advantage of

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Course Syllabus For Operations Management. Management Information Systems

Course Syllabus For Operations Management. Management Information Systems For Operations Management and Management Information Systems Department School Year First Year First Year First Year Second year Second year Second year Third year Third year Third year Third year Third

More information

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Data Warehouse design

Data Warehouse design Data Warehouse design Design of Enterprise Systems University of Pavia 21/11/2013-1- Data Warehouse design DATA PRESENTATION - 2- BI Reporting Success Factors BI platform success factors include: Performance

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Course 803401 DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Oman College of Management and Technology Course 803401 DSS Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization CS/MIS Department Information Sharing

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc. Oracle9i Data Warehouse Review Robert F. Edwards Dulcian, Inc. Agenda Oracle9i Server OLAP Server Analytical SQL Data Mining ETL Warehouse Builder 3i Oracle 9i Server Overview 9i Server = Data Warehouse

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

ETPL Extract, Transform, Predict and Load

ETPL Extract, Transform, Predict and Load ETPL Extract, Transform, Predict and Load An Oracle White Paper March 2006 ETPL Extract, Transform, Predict and Load. Executive summary... 2 Why Extract, transform, predict and load?... 4 Basic requirements

More information

B.Sc (Computer Science) Database Management Systems UNIT-V

B.Sc (Computer Science) Database Management Systems UNIT-V 1 B.Sc (Computer Science) Database Management Systems UNIT-V Business Intelligence? Business intelligence is a term used to describe a comprehensive cohesive and integrated set of tools and process used

More information

Data Mining Solutions for the Business Environment

Data Mining Solutions for the Business Environment Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control Andre BERGMANN Salzgitter Mannesmann Forschung GmbH; Duisburg, Germany Phone: +49 203 9993154, Fax: +49 203 9993234;

More information

Building a Database to Predict Customer Needs

Building a Database to Predict Customer Needs INFORMATION TECHNOLOGY TopicalNet, Inc (formerly Continuum Software, Inc.) Building a Database to Predict Customer Needs Since the early 1990s, organizations have used data warehouses and data-mining tools

More information

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Section 14 Simple Linear Regression: Introduction to Least Squares Regression Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

More information

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

More information

How To Use Neural Networks In Data Mining

How To Use Neural Networks In Data Mining International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Advanced analytics at your hands

Advanced analytics at your hands 2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously

More information

Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI

Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id) Faculty of Computer Science, University of Indonesia Objectives

More information

Server Load Prediction

Server Load Prediction Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

More information

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key

More information

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)

More information

A Knowledge Management Framework Using Business Intelligence Solutions

A Knowledge Management Framework Using Business Intelligence Solutions www.ijcsi.org 102 A Knowledge Management Framework Using Business Intelligence Solutions Marwa Gadu 1 and Prof. Dr. Nashaat El-Khameesy 2 1 Computer and Information Systems Department, Sadat Academy For

More information

Chapter 7: Data Mining

Chapter 7: Data Mining Chapter 7: Data Mining Overview Topics discussed: The Need for Data Mining and Business Value The Data Mining Process: Define Business Objectives Get Raw Data Identify Relevant Predictive Variables Gain

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

CRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining

CRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining Mining Process CRISP - DM Cross-Industry Standard Process for Mining (CRISP-DM) European Community funded effort to develop framework for data mining tasks Goals: Cross-Industry Standard Process for Mining

More information

WHITEPAPER. Creating and Deploying Predictive Strategies that Drive Customer Value in Marketing, Sales and Risk

WHITEPAPER. Creating and Deploying Predictive Strategies that Drive Customer Value in Marketing, Sales and Risk WHITEPAPER Creating and Deploying Predictive Strategies that Drive Customer Value in Marketing, Sales and Risk Overview Angoss is helping its clients achieve significant revenue growth and measurable return

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of

More information

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. White Paper Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. Contents Data Management: Why It s So Essential... 1 The Basics of Data Preparation... 1 1: Simplify Access

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate

More information

Data Mining Applications in Higher Education

Data Mining Applications in Higher Education Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

More information

Data Visualization Techniques

Data Visualization Techniques Data Visualization Techniques From Basics to Big Data with SAS Visual Analytics WHITE PAPER SAS White Paper Table of Contents Introduction.... 1 Generating the Best Visualizations for Your Data... 2 The

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION Exploration is a process of discovery. In the database exploration process, an analyst executes a sequence of transformations over a collection of data structures to discover useful

More information

Example: Boats and Manatees

Example: Boats and Manatees Figure 9-6 Example: Boats and Manatees Slide 1 Given the sample data in Table 9-1, find the value of the linear correlation coefficient r, then refer to Table A-6 to determine whether there is a significant

More information

COURSE RECOMMENDER SYSTEM IN E-LEARNING
