Production Floor Optimizations Using Dynamic Modeling and Data Mining


The Interdisciplinary Center, Herzliya
Efi Arazi School of Computer Science

Production Floor Optimizations Using Dynamic Modeling and Data Mining

M.Sc. dissertation
Submitted by Aviv Bronstein
Under the supervision of Dr. Tami Tamir & Prof. Yaakov Zehavi
December, 2007

The Interdisciplinary Center, Herzliya
Efi Arazi School of Computer Science

Production Floor Optimizations Using Dynamic Modeling and Data Mining

Submitted as the final thesis of a research project for the M.Sc. degree by Aviv Bronstein
The work was carried out under the supervision of Dr. Tami Tamir and Prof. Yaakov Zehavi
December, 2007

Abstract

Data mining is defined as the process of discovering new correlations, hidden knowledge, unexpected patterns, and new rules from large databases [1]. It comprises a collection of methodologies, standards and algorithms that are used in several research areas such as machine learning, marketing, statistics, artificial intelligence, visualization, and more. This paper describes how data mining and statistical analysis can be applied to a real problem of production floor optimization in factories. To accomplish the task, an automated module was developed that takes advantage of massive amounts of raw data and, through periodic analyses, produces a valuable set of conclusions that can improve the efficiency of the production floor and/or increase its profits. Although data mining seems to be a promising solution to knowledge discovery and decision making, it is not a panacea for all problems. Some data mining methods work well in some domains but fail in others [2]; therefore, correct modeling of the problem domain and finding an adequate data mining method directly affect the quality of the results and their added value. The project reviews both the theoretical background needed to support our thesis and the challenges introduced when trying to apply the data mining paradigm to a real application with real constraints, limitations and tradeoffs.

Abstract (Hebrew version)

Data mining is defined as the discovery of correlations, hidden knowledge, unexpected patterns and new rules from large databases [1]. The field consists of a collection of methods, standards and algorithms that are used in a variety of research areas such as machine learning, marketing, statistics, artificial intelligence, visualization and more. This document describes how data mining and statistical analysis are applied to a real problem of production floor optimization in factories. To carry out the task, an automated module was developed that exploits massive amounts of data, on which it performs periodic analysis in order to produce conclusions whose application can increase the factory's efficiency and/or its profit. Although data mining looks like a promising solution for knowledge discovery and for drawing conclusions, it does not solve every problem in the field. Some data mining methods work well in certain cases but fail when the problem domain changes [2]. Therefore, correct modeling of the problem domain and finding a suitable algorithm significantly affect the quality of the results and their added value. This project reviews the theoretical background required to support the underlying premise and the challenges that arose when we tried to apply the data mining process to a real system with constraints, limitations and tradeoffs.

Table of Contents

Introduction
    Preface
    The data mining process
        Definition of the problem
        Choosing an adequate model for the problem
        Model execution over a training data set
        Model validation over a test data set
        Scoring, concluding and applying
Theoretical background
    Regression Analysis
        Linear Regression
        Least Median Square
        Support Vector Machine
    Statistics
        Correlation Coefficient
        Coefficient of determination
    Definitions and Notations
        The order
        Logical Build
        Logical Image
        Physical Media Creation
        Factory machines and products deployments
    Objective
    Motivation
    Innovation
    Challenges
    Organization of this document
Experiment
    Building the model
    Executing the model
    Results
        (Ordinary Least Squares) Linear Regression
        Linear Regression (Least Median Square)
        Support Vector Regression
    Conclusions
Module Design
    Entity Relational Diagram
    Class Diagrams
    Algorithms
        Algorithm for Warehousing
        Algorithm for Model Building
        Algorithm for Training and Validation
        Algorithm for scoring
        Algorithm for Applying
    Services
Results and Conclusions
    Module Results
Further research
    Improvements
    Additional data mining tasks
References

List of Figures

Figure 1: 2D Linear Regression
Figure 2: 3D Linear Regression
Figure 3: Linear Classifiers of Data
Figure 4: Several sets of (x, y) points, with the correlation coefficient of x and y for each set
Figure 5: Meteora production process schematics
Figure 6: Histogram of number of attributes products have
Figure 7: Production time as a function of Attr1
Figure 8: Production time as a function of Attr2
Figure 9: Production time as a function of Attr3
Figure 10: Entity Relation Diagram of the module
Figure 11: Class Diagram 1/2
Figure 12: Class Diagram 2/2
Figure 13: r2 as a function of products (sorted by descending r2)

1. Introduction

Preface

Manufacturing enterprises rely on vast amounts of data and information stored in large databases [3]. Standard information systems usually use only a small fraction of that data in their manufacturing management, resource allocation, supply chain, customer relations management and quality assurance applications. That fraction is in most cases sufficient for common applications to perform calculations, summaries and, in general, any data manipulation task that can be expressed in a structured programming or query language. Nevertheless, most modern manufacturing industries have evolved in all aspects of business intelligence and gained an important technological and/or financial advantage that helps them overcome present challenges. In order to extract valuable information from a very large amount of raw data, a manufacturing application needs to overcome the limitations of a relational, one-dimensional query language. Business intelligence offers a variety of methods for extracting valuable information from the data collected by the operational transactions executed during the manufacturing process. The field of business intelligence is divided into three main categories:

Query and Reporting (Information): Extraction of detailed and roll-up data. Answers questions like what, who and when. Example: Which machine had the smallest downtime in the last month?

On-line Analytical Processing (Analyses): Summaries, trends and forecasts. Example: What is the average number of malfunctions reported by a machine? By work shift?

Data Mining (Insight and Predictions): Knowledge discovery of hidden patterns; answers to why and how. Example: What will influence a machine to malfunction?

The patterns identified by data mining solutions can be transformed into knowledge, which can then be used to support business decision making [2].

The data mining process

Using data mining techniques for knowledge discovery is an iterative, deterministic process that involves a series of steps. After examining behavioral reports acquired from the OLAP stage, we can make educated guesses regarding relationships between behaviors measured in the factory. For example, downtime of machines is an inevitable reality and can be naively assumed to be randomly distributed. For each downtime, the operational system archives a vast amount of data related to that specific downtime instance, e.g. days since the last calibration of the machine, month of year, day of week, last 10 products produced, load before downtime, machine startup time, user logged in, percentage of faulty products, etc. The list is long and, in most cases, the majority of the attributes collected are completely irrelevant to our target variable, whereas the remainder conform to a finite set of predictive models. Those models are used to quantify the correlation between the features and the target variable and can be further analyzed to compute the correlation's quality. The data mining process includes the following steps:

Definition of the problem

Upon examining the production chain of a factory, we find that most manufacturers suffer from the same common difficulties related to the production process. Downtimes, malfunctions, high production latency, raw

material shortage are some of the key issues taken into account when designing a production floor optimization scheme. From the problems mentioned above, and by modeling the factory's entities, we can derive our target variables and propose a thesis for our data mining process.

Warehousing

Warehousing is a broad area that defines standards and best practices for transforming raw, distributed, non-homogeneous, high-volume data into flat, normalized, scaled, unified, analysis-ready feature vectors. Once we have our basic problem definition, data gathering may begin. Data in a manufacturing enterprise is collected by gathering information distributed throughout the enterprise operational system. That information suffers from data type inconsistency, missing attributes and edge values. These can be handled by preprocessing the data, which may also include normalization and clipping (some algorithms do not tolerate outliers). The recommended approach for storing the warehouse is a flat table of feature vectors, where each item in a vector holds the value of the respective attribute.

Choosing an adequate model for the problem

The model chosen for the analysis is usually derived directly from the problem definition. Most common problems can be modeled and reduced so that they can be solved using one of a finite set of methods for discovering patterns and relationships in the data. The field of data mining spans a variety of possibilities for productive analysis of data, including Classification (binary decisions: "Would this product reach its due date [t/f]?"; n-ary binning: "What is the severity of an unknown new malfunction [low/medium/high]?"), Regression (continuous measurements: "How much money will the factory earn next year?"), Association Rules (does one event imply another? For example, a specific employee's shift and raw material shortage), Clustering (finding groups which are very different from one another, but whose members are "alike" [6]: "Which products are similar enough to be promoted together?") and Feature

Selection (ranking the importance of gathered measurements: "What is the most influential parameter in the invalidation of a product?").

Once the problem has been assigned to one of the above categories, the analysis faces a further set of choices. Each category offers a large number of heterogeneous algorithms, each of which exposes a variety of configuration parameters. In addition, each algorithm represents a tradeoff between CPU intensiveness, memory consumption, running time and analysis quality. After several experiments with sampled data, only a few algorithms will be found suitable for the problem. Once the number of algorithms has been reduced, nothing prevents us from executing several of them and choosing the best one at run time.

Model execution over a training data set

After a model has been chosen, the analysis may begin. The warehouse is sampled for a certain percentage of its records to obtain the training data set. The sampling stage assumes a random, independent distribution of the data. The model executes over that set of data, and the results of the analysis are the coefficients and the statistics. A common problem (overfitting) arises when a model behaves very well on the specific data set it was executed on but produces poor results when executed on another data set. The recommended solution is to rerun the model on a separate data set used for testing (also called a validation set).

Model validation over a test data set

Overcoming the problem introduced in the previous section requires running the model against a set of validation data. The validation data should (just like the training data) be independent and randomly sampled with uniform distribution. Best practices suggest a 1:3 ratio between the validation and the

training data used to build the model. After validation, the model is mature enough to be applied to new data.

Scoring, concluding and applying

Discovering valuable information after a thorough analysis session resembles finding gold in a mine. The results of the analysis can be interpreted by a human analyst, who translates the numbers into facts, whereas the quality of the model can be evaluated using statistical measures and visualized in charts and reports. After scoring and concluding, the trained and validated model is applied to new sets of data. The knowledge acquired in the process can be used to refine the model and to discover further insights.
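The sampling of the warehouse into a training set and a validation set, as described in this chapter, can be sketched as follows. This is a minimal illustration only, not the module's code: the function name `split_warehouse`, the fixed random seed and the toy records are assumptions made for the example, while the 1:3 validation-to-training ratio is the one quoted above.

```python
import random

def split_warehouse(records, validation_ratio=0.25, seed=0):
    """Randomly split records into (training, validation) lists.

    validation_ratio=0.25 corresponds to a 1:3 ratio between the
    validation and the training data. The seed is an assumption that
    only makes this illustration reproducible.
    """
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # uniform random permutation
    n_validation = round(len(shuffled) * validation_ratio)
    return shuffled[n_validation:], shuffled[:n_validation]

# Hypothetical warehouse records: an id and a production time in seconds.
records = [{"id": i, "prod_time": 10 + i} for i in range(100)]
training, validation = split_warehouse(records)
print(len(training), len(validation))
```

Because every record ends up in exactly one of the two sets, a model that scores well on the training set but poorly on the validation set can be flagged as overfitted.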

2. Theoretical background

This section summarizes the mathematics behind the data mining methodology, with a focus on regression analysis and three of its implementation algorithms.

Regression Analysis

In statistics, regression analysis examines the relation of a dependent variable (the response variable, y) to specified independent variables (the explanatory variables x_1, x_2, ..., x_n). The mathematical model of their relationship is the regression equation. The algorithms used in this project are limited to linear regressions with an equation of the form

$$y = \sum_{i} a_i x_i$$

where the a_i are the regression coefficients [7].

Linear Regression

Linear regression is a regression method that models the relationship between a dependent variable Y, independent variables x_i (i = 1, ..., n) and a random term \varepsilon. The model can be written as

$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \varepsilon$$

where \beta_0 is the intercept (the "constant" term), the \beta_i are the respective parameters of the independent variables, and p = n + 1 is the number of parameters to be estimated in the linear regression. This model is called "linear" because the relation of the response (the dependent variable Y) to the independent variables is assumed to be a linear function of the parameters.
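For a single independent variable, the least-squares estimates of the intercept and the slope of the linear model have a well-known closed form: the slope is the covariance of x and y divided by the variance of x. The sketch below illustrates this with plain Python; the function name `fit_simple_ols` and the sample data are illustrative assumptions and are not taken from the project.

```python
def fit_simple_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope). Assumes at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data generated from y = 1 + 2x, without a random term.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_ols(xs, ys)
print(intercept, slope)
```

With several independent variables the same idea applies, but the estimates are obtained by solving a system of linear equations rather than from a single ratio.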

Figure 1: 2D Linear Regression

Figure 2: 3D Linear Regression

Least Median Square

In regression analysis, Ordinary Least Squares (OLS) is a method for linear regression that determines the values of unknown quantities in a statistical model by minimizing the sum of the squared residuals (the differences between the predicted and observed values) [8]. Least Median Square is a variation of the Ordinary Least Squares method: it is a robust regression that minimizes the median of the squared residuals instead of their sum, and is therefore far less affected by outliers in the data set [9].

Support Vector Machine

Support vector machines map input vectors to a higher dimensional space where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The separating hyperplane is the hyperplane that maximizes the distance between the two parallel hyperplanes. The assumption is that the larger the margin (the distance between these parallel hyperplanes), the lower the generalization error of the classifier will be [10]. As shown in Figure 3, many linear classifiers separate the data. However, only one achieves maximum separation (with respect to that specific data set).

Figure 3: Linear Classifiers of Data

The support vector machine is a classification algorithm. In order to benefit from it in the area of regression, we can use its variant called Support Vector

Regression. Instead of finding hyperplanes that maximize the separation, the regression algorithm finds one hyperplane that passes through the vectors and minimizes the distance of all vectors from that hyperplane.

Statistics

The linear model needs to be evaluated before it is used in real applications. There are several methods for evaluating linear models; some involve the following statistics.

Correlation Coefficient

An indicator of the strength and direction of a linear relationship between two random variables [11]. The coefficient is denoted \rho_{X,Y} and calculated as

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

where X and Y are random variables, \sigma_X is the standard deviation of X, and \sigma_Y is the standard deviation of Y.

Figure 4: Several sets of (x, y) points, with the correlation coefficient of x and y for each set
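As a small worked example of the correlation coefficient, the covariance divided by the product of the two standard deviations can be computed directly from paired samples. The function name `pearson_correlation` and the sample values are assumptions chosen for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Return cov(X, Y) / (sigma_X * sigma_Y) for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and standard deviations; the 1/n factors
    # cancel in the ratio, so sample (1/(n-1)) versions give the same value.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)

image_sizes = [100.0, 200.0, 300.0, 400.0]    # hypothetical attribute values
burn_times = [12.0, 22.0, 32.0, 42.0]         # rises linearly with size
reversed_times = [42.0, 32.0, 22.0, 12.0]     # falls linearly with size

rho_up = pearson_correlation(image_sizes, burn_times)
rho_down = pearson_correlation(image_sizes, reversed_times)
print(rho_up, rho_down)
```

A value close to 1 or -1 indicates a strong linear relationship (increasing or decreasing, respectively), while a value near 0 indicates that no linear relationship was detected.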

Coefficient of determination

The proportion of variability in a data set that is accounted for by a statistical model [12]. In simpler terms, the coefficient of determination (denoted R^2) is the proportion of the variability of the data that the linear model was able to "explain". The variability is defined as the sum of squares. R^2 is calculated as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where y_i is the i-th value from the samples of Y, \hat{y}_i is the estimated i-th value of Y from the regression model, and \bar{y} is the arithmetic mean of the Y values.

Another method of calculating the coefficient of determination uses the correlation coefficient:

$$R^2 = \rho^2$$

that is, the coefficient of determination equals the square of the correlation coefficient.
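The definition of the coefficient of determination as one minus the ratio of the residual sum of squares to the total sum of squares translates directly into code. The sketch below is illustrative only; the function name `r_squared` and the sample values are assumptions.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total

actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]       # hypothetical model output
mean_only = [3.0] * 5                       # a model that always predicts the mean

model_r2 = r_squared(actual, predicted)
baseline_r2 = r_squared(actual, mean_only)
print(model_r2, baseline_r2)
```

A model that always predicts the mean of the observed values scores 0, which is why a clearly positive value indicates that the explanatory attributes add information.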

3. Preliminaries

Definitions and Notations

"Meteora" is a high-scale software development center. Each year several software products are released and join Meteora's product list. For purposes of software distribution, Meteora holds a production floor that handles on-line building, production, packaging and distribution of Meteora's software releases. The production process is iterative and includes several stages.

The order

When customers wish to purchase one of Meteora's products, they contact the ordering center and place an order. While filling in the ordering form, the customer can choose the distribution method (CD/DVD by snail mail or delivery), the software distribution corresponding to the customer's needs (Enterprise, Personal, etc.), the number of copies/licenses and several other parameters related to the product features.

Logical Build

After the order is processed, it initiates a production request in the factory. The first phase in the production process is called "logical build" (LB) and can take between 10 seconds and 2 hours, depending on the software itself and on other order parameters. The logical build process is executed on dedicated machines (Logical Build Servers, LBS). An LBS gets its input from a central resource allocation module (explained later). The input contains the software name, version, number of copies and other parameters the user requested in the ordering process. A common LBS execution usually includes transferring a copy of the source files that reside on a main version control server and executing the software's proprietary build process (compilation, linking, build signing, etc.).

Logical Image

A logical image (LI) is the image of the software to be burnt on the media (CD/DVD) or sent via e-mail. After an LB is created, the Logical Imaging Servers (LIS) create images from the LB (ready-for-distribution copies of the product). In this stage the LIS derive data from the LB and execute image-dependent tasks (e.g. creation of license keys and assignment of serial numbers). The image creation process also depends on parameters within the order (e.g. the parameter "number of copies" increases the LI creation time linearly; the software version [Enterprise, Personal, etc.] affects the image size and therefore directly influences the LI production). The LI creation process takes between 1 second and 2 hours, and this time is also influenced by several parameters from the order phase.

Physical Media Creation

When the LI is ready, all that is left is burning the image on a physical medium (CD/DVD). For that purpose, the factory holds several machines for producing physical copies of the images. Some of those machines are manually operated by dedicated personnel and the rest are fully automatic. The physical media (PM) is delivered to the customer by snail mail. (Some of the orders are sent directly to the customer by e-mail. In this scenario, the physical media creation stage is bypassed and the LI is sent directly using e-mail / FTP.)

Factory machines and products deployments

In this project we mainly deal with the 3 main products: LB, LI and PM. For each software order, all three of those products need to be produced. The factory manufactures about 500 software products. All the factory machines are resource-planned by a centralized production management system (CPMS). The CPMS is responsible for scheduling, machine control and monitoring of the entire production process (from the order stage to the final media packaging). While the production floor is operational

and active, CPMS gathers raw production data (e.g. production times, faulty percentage, malfunction statistics, etc.). This vast amount of data is used for analysis, inquiries and summary reports.

Figure 5: Meteora production process schematics

Objective

The project's main goal is to find correlations between the production times of different products and their order-specific parameters (those production parameters are referred to as attributes [or features] from now on, in order to conform to the data mining notation standards). An example of an environmental attribute in PM is the size of the LI. It would be an educated guess to assume that the media burning time is linearly dependent on the size of the image. The enterprise versions usually contain more source files to be built in the Logical Build stage and therefore increase the LB production time linearly. Some of the attributes are product specific, e.g. the version of the third-party database to be packaged alongside the software product itself, which is likely to increase the production time. Nevertheless, not all the attributes influence the production time; e.g. the customer is also an attribute in the order, but it would not affect production times at all.

Figure 6: Histogram of number of attributes products have (number of products as a function of the number of attributes each product has)

Motivation

If prior knowledge of the production times could be attained (or at least fairly estimated), it would benefit the resource allocation module and allow it to use more sophisticated scheduling algorithms. This information can also be used by CPMS to raise an alert if any product will not be ready on time. A fair estimation of production times could also be the basis for finding additional problems in the production process (e.g. if a product is estimated to be produced in 1-2 minutes and its actual production time is 20 minutes, it would raise a red flag).

Innovation

Most organizations employ several data analysts to handle the data mining, knowledge discovery and business intelligence aspects of the enterprise. Unfortunately, Meteora does not have trained personnel to do the data analyses; therefore, this project provides a replacement for expensive personnel by automating the entire data mining process. When no human is involved in the automated process, the question of how to quantify the model's quality arises. For that purpose, the module defines a set of heuristics and ranking methods for scoring the calculated models without any input from a user.

Challenges

1. Every product has its own set of influencing attributes, which are hidden inside an XML document that arrives with the order.
2. The order-specific attributes arrive without knowledge of their data type. The value 120 can be interpreted either as a number or as an item in a set {100, 110, 120, 130, 140, 150}.

3. Not all products have a sufficient number of archived historical production entries for analysis.
4. For several products, the production time depends neither on measured attributes nor on order-specific parameters.
5. Both the warehousing and the analysis consume expensive resources such as CPU time, I/O and memory, and must therefore be executed when the factory is not operational, i.e. within a time frame of one night (approx. 6 hours).

Organization of this document

The next chapter describes the experiment held on one of the products in the factory, alongside results and conclusions. The chapter that follows it is dedicated to the design of the module, accompanied by Entity-Relational diagrams, class diagrams and pseudo code. Chapter 6 summarizes the results of the execution and details the conclusions. The closing chapter describes possible further research in the area.
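Challenges 1 and 2 above can be made concrete with a short sketch. It is a sketch under stated assumptions, not the module's implementation: the XML layout (an `order` element containing `attr` elements with a `name` attribute), the function names and the `max_set_size` heuristic are all hypothetical, since the document does not specify the order format or how the ambiguity is resolved.

```python
import xml.etree.ElementTree as ET

def extract_attributes(order_xml):
    """Return {attribute name: raw string value} from an order document.

    The <order><attr name="...">value</attr></order> layout is an
    assumption made for this example only.
    """
    root = ET.fromstring(order_xml)
    return {node.get("name"): node.text for node in root.iter("attr")}

def infer_data_type(values, max_set_size=None):
    """Guess 'Numeric' or 'String' from the raw values seen for an attribute.

    If any value fails to parse as a number, the attribute is a String.
    Optionally (hypothetical heuristic), a numeric-looking attribute with
    at most max_set_size distinct values is treated as a set member.
    """
    try:
        numbers = [float(v) for v in values]
    except (TypeError, ValueError):
        return "String"
    if max_set_size is not None and len(set(numbers)) <= max_set_size:
        return "String"
    return "Numeric"

order = '<order><attr name="copies">3</attr><attr name="edition">Enterprise</attr></order>'
attributes = extract_attributes(order)
print(attributes)
print(infer_data_type(["120", "100", "abc"]))
print(infer_data_type(["100", "110", "120", "100"], max_set_size=6))
```

The last call shows why the value 120 is ambiguous: the same raw values are classified as numeric or as set members depending solely on the chosen threshold.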

4. Experiment

The experiment performs all the data mining tasks defined for this project on a sample test case containing a single product, which was sampled for this purpose. This product (220) has 3 attributes (denoted attr1, attr2 and attr3 respectively) and about 500 production instances for the regression. The hypothesis is that the production time of product 220 depends linearly on attr2. The pilot will strengthen or weaken this assumption.

Building the model

Before building the model, we can visualize the correlations between the independent variables and the target attribute. Each of the independent attributes used in this experiment is plotted against the target variable.

Figure 7: Production time as a function of Attr1 (Attribute #1)

Figure 8: Production time as a function of Attr2 (Attribute #2)

Figure 9: Production time as a function of Attr3 (Attribute #3)

From observing the plots of the production time as a function of each of the attributes, we can guess that attr1 influences the production time and that attr2 and attr3 do not drastically affect the dependent variable.

Executing the model

The sample warehouse used to execute the model contains 475 records. We used 66% of the instances to build the model (training data), and the remaining 34% were used to validate the model (test data) built on the training data. To achieve the best result, we executed the model using several regression algorithms, including Ordinary Least Squares Linear Regression, Least Median Square Linear Regression and Support Vector Regression.

Results

(Ordinary Least Squares) Linear Regression

Prodtime = Attr

Statistic | Value
Correlation Coefficient |
Mean Absolute Error |
Root Mean Square Error |
Relative Absolute Error (%) |
Root Relative Square Error |
R^2 (Coefficient of Determination) |

Linear Regression (Least Median Square)

Prodtime = Attr Attr (Attr3=Value2)

Statistic | Value
Correlation Coefficient |
Mean Absolute Error |
Root Mean Square Error |
Relative Absolute Error (%) |

Root Relative Square Error |
R^2 (Coefficient of Determination) |

Support Vector Regression

Prodtime = Attr1(Norm.) Attr2(Norm.) Attr3(Norm.)

Statistic | Value
Correlation Coefficient |
Mean Absolute Error |
Root Mean Square Error |
Relative Absolute Error (%) |
Root Relative Square Error |
R^2 (Coefficient of Determination) |

Conclusions

The experiment showed that there are relations between the order-specific attributes and the production time. The value of R^2 (about 65%) shows that the model is better than the null model (the model without explanatory attributes).
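The error statistics listed in the result tables can be computed from the actual and predicted production times of the test data. The sketch below assumes the common definitions in which the two relative errors are measured against a null model that always predicts the mean of the actual values; the document does not state which definitions were used, and the function name and sample values are illustrative.

```python
import math

def evaluation_statistics(actual, predicted):
    """Return MAE, RMSE and the two relative errors (in percent)."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Errors of the null model, which ignores all explanatory attributes.
    null_abs_errors = [abs(mean_actual - a) for a in actual]
    null_sq_errors = [(mean_actual - a) ** 2 for a in actual]
    return {
        "mean_absolute_error": sum(abs_errors) / n,
        "root_mean_square_error": math.sqrt(sum(sq_errors) / n),
        "relative_absolute_error_pct": 100.0 * sum(abs_errors) / sum(null_abs_errors),
        "root_relative_square_error_pct": 100.0 * math.sqrt(sum(sq_errors) / sum(null_sq_errors)),
    }

actual_times = [10.0, 20.0, 30.0, 40.0]       # hypothetical production times (seconds)
predicted_times = [12.0, 18.0, 33.0, 37.0]    # hypothetical model estimates
stats = evaluation_statistics(actual_times, predicted_times)
for name, value in stats.items():
    print(name, round(value, 3))
```

Under these definitions a relative error below 100% means the model predicts better than the null model, which is the comparison drawn in the conclusions above.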

5. Module Design

The project is divided into five stages; each of them is independent, and together they accomplish the final result. Those stages are: warehousing, model building, attribute selection (optional), regression, validating and scoring against test data, and finally applying the model to new sets of data. This section describes the algorithms in use, the tables and objects, and how the implementation challenges (see chapter 3) were handled.

Entity Relational Diagram

Figure 10: Entity Relation Diagram of the module (entities: Production Entries Summary, Products, Product Specific Attributes, Product Generic Attributes, Warehouse, Applied Products, Null Models, Regression Models, Regression Algorithms, Regression Config)

Products

Holds information about all the products in the factory.

Column Name | Data type | Description
Id (PK) | Numeric | 4-digit unique identifier of a product. The first digit states the product's type and the remaining 3 are the numeric identifier of the product itself.
Type | String | {Type 1, Type 2, Type 3}
Number | Numeric | Last 3 digits of the Id

Product Generic Attributes

Holds all the attributes of the product types which are not order specific. These attributes are predefined by the data analyst.

Column Name | Data type | Description
Product Type (FK -> Products.Type) | String | Product type identifier {Type 1, Type 2, Type 3}
Attribute Name | String | Name of the attribute
Attribute Datatype | String | The datatype of the attribute. {Numeric, String}
Attribute Get Function | String | The function used to get the attribute value. This function is called on warehouse construction.

Product Specific Attributes

Holds all the order-specific attributes of all the products. This table is built dynamically and rebuilt each month.

Column Name | Data type | Description
Product Id (FK -> Products.Id) | Numeric | Product identifier
Attribute Name | String | Name of the attribute
Attribute Datatype | String | The datatype of the attribute. {Numeric, String}
Attribute Get Function | String | The function used to get the attribute value. This function is called on warehouse construction.

Attributes Blacklist

Holds every attribute that was treated as numeric although its actual value is a string (see Challenges). Such attributes are inserted into this table at run time and

the warehousing process retries without sampling them.

Column Name | Data type | Description
Product Id (FK -> Products.Id) | Numeric | Product identifier
Attribute Name | String | Name of the attribute not to be included in the sampling process

Warehouse

This table holds a flat representation of the products and their attributes. The attribute columns are common to all the products and are distinguished by the attribute index stored in "Product Specific Attributes" and "Product Generic Attributes".

Column Name | Data type | Description
Id | Numeric | Unique identification of a product within its type
Product Id | Numeric | Product identifier
Purpose | String | This table stores the warehouse both for training the regression model and for applying it to new records. {Training, Applying}
Production Time | Numeric | Number of seconds the production of this product instance took (the dependent variable)
Attribute1 | String | The value of attribute #1
Attribute2 | String | The value of attribute #2
... | ... | ...
Attribute200 | String | The value of attribute #200

Regression Models

Holds the model results for each product. This table also holds a repository of the models themselves, to be retrieved in the applying stage.

Column Name | Data type | Description
Id | Numeric | Unique identifier
Product Id | Numeric | Product identifier
Model | Binary | After the model is built, the classifier is serialized and stored in this column.
Details | String | Human-readable results of the regression
Correlation Coefficient | Numeric | Correlation coefficient of this regression, in the range [-1, 1]
Mean Absolute Error | Numeric | Mean absolute error of this regression
Relative Absolute Error | Numeric | Relative absolute error of this regression

R Square | Numeric | Coefficient of determination. The percentage of the variance that can be explained by the regression.

Null Models

Some of the products do not have enough records to perform regression. For these products the null model is used.

Column Name | Data type | Description
Id | Numeric | Unique identifier
Product Id | Numeric | Product identifier
Model | Binary | After the model is built, the classifier is serialized and stored in this column.

Regression Algorithms

Lists all the regression algorithms the module uses.

Column Name | Data type | Description
Id | Numeric | Number of the algorithm
Algorithm | String | Name of the algorithm

Regression Config

Summarizes all the configuration parameters of the algorithms.

Column Name | Data type | Description
Id (PK) | Numeric | Unique identifier
Regression Algorithm (FK) | Numeric | Foreign key to Regression Algorithms
Param Name | String | The name of the parameter
Option Flag | String | Short string representation
Param Value | Numeric | The value of the parameter

Applied Products

All new products are queued in this table until the Applier service sets their production time using the pre-stored regression models.

Column Name | Data type | Description
Id | Numeric | Id of the product
Product Id | String | FK to the Products table
Production Time | Numeric | The estimated production time for this product, in seconds
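The interplay between the Regression Models and Applied Products tables can be sketched with an in-memory database. Everything below is a hypothetical illustration: the SQLite schema, the `attribute1` column on the queue table, the dictionary used as a "model" and the product id are assumptions, and the module itself may serialize and apply its classifiers quite differently.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE regression_models (product_id INTEGER PRIMARY KEY, model BLOB)")
conn.execute(
    "CREATE TABLE applied_products ("
    "id INTEGER PRIMARY KEY, product_id INTEGER, attribute1 REAL, production_time REAL)"
)

# A stand-in for a trained model: intercept and one coefficient (made-up values).
model = {"intercept": 30.0, "coefficients": [2.0]}
conn.execute("INSERT INTO regression_models VALUES (?, ?)", (1220, pickle.dumps(model)))
# A queued product instance whose production time has not been estimated yet.
conn.execute(
    "INSERT INTO applied_products (id, product_id, attribute1) VALUES (?, ?, ?)",
    (1, 1220, 45.0),
)

def apply_models(connection):
    """Estimate the production time of every queued row that has a stored model."""
    pending = connection.execute(
        "SELECT id, product_id, attribute1 FROM applied_products "
        "WHERE production_time IS NULL"
    ).fetchall()
    for row_id, product_id, attribute1 in pending:
        stored = connection.execute(
            "SELECT model FROM regression_models WHERE product_id = ?", (product_id,)
        ).fetchone()
        if stored is None:
            continue  # no model for this product; leave the row queued
        restored = pickle.loads(stored[0])
        estimate = restored["intercept"] + restored["coefficients"][0] * attribute1
        connection.execute(
            "UPDATE applied_products SET production_time = ? WHERE id = ?",
            (estimate, row_id),
        )

apply_models(conn)
estimate = conn.execute(
    "SELECT production_time FROM applied_products WHERE id = 1"
).fetchone()[0]
print(estimate)
```

Storing the serialized model next to its statistics means the applying stage never has to retrain: it only deserializes the stored model and evaluates it on the queued record.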

Class Diagrams

Figure 11: Class Diagram 1/2

Figure 12: Class Diagram 2/2

Algorithms

Algorithm for Warehousing

Warehousing is done at the beginning of the data mining process. The objective of warehousing is to create a simplified (flat) representation of the products and their attributes. Most attributes of a product can be derived directly from the database, but some attributes require more complicated computation to retrieve. Warehousing is usually a batch process performed at night to save computation time.

The main difficulty this process has to overcome is dealing with products that do not have a sufficient number of production records (Challenge 3). Another question that arises is how much to sample. On the one hand, computation time grows with the sample size; on the other, small samples may not be sufficient for a useful regression analysis. The range of 500 to 5000 records per product was chosen as a tradeoff between resource consumption and analysis quality. This choice is based on trial and error.

Iterate all the products in CPMS database and for each product p
    Query the production archives for product p
    If num_of_records_for_product_p < 500 then
        Skip this product;
    Else
        Sample min(num_of_records_for_product_p, 5000)
    End if
    For each production record
        Get a static list of product p's type generic attributes
        Get a dynamic list of product p's order specific attributes

        Gather all the attribute values of the product
    End Loop
End Loop
ApplyPreProcessingFiltering(MissingValuesFilter, EdgeValuesFilter, NormalizationFilter);

Algorithm for Model Building

Compound attributes creation

For any two numeric attributes a1 and a2, produce a third attribute a1,2 that holds the product a1 * a2. For any two alphanumeric attributes c1 and c2, produce a third string attribute c1,2 that holds the concatenation of c1 and c2.

Regression Algorithms

In order to achieve the best results we execute several regression algorithms and, according to the results of their models, choose the best one at run time. The algorithms differ from one another: on some data sets algorithm A will score higher than algorithm B, and on other data sets the opposite holds. The algorithms used are: Ordinary Least Squares Linear Regression, Least Median Square Method Linear Regression and Support Vector Regression.

Algorithm for Training and Validation

The model is first executed on the training data set. The coefficients from that regression are then tested on the validation data set. The data set is divided into 2/3 training and 1/3 validation. The purpose of validating against a different data set is to detect potential problems with the model such as under-fitting or over-fitting.

Execute model on Training Data to retrieve Training_Regression_coefficients

Run model with Training_Regression_coefficients on Validation Data
Summarize statistics (RMSE, correlation coefficient, r², MAE, etc.)
Store the model:
If there is no model stored for this product
    Store the current model
Else
    If the current model is better than the stored model (scoring methods will be introduced later)
        Store the current model
    Else
        Discard the current model
    End If
End If

Algorithm for scoring

There are two types of scoring mechanism. The first observes the values of the regression statistics. The coefficient of determination and the correlation coefficient are the statistics we use to quantify the quality of the model. Our goal is to obtain a higher r² than the previous model. For the second method we suggest the following algorithm:

For each product p in an input set of products
    Schedule p to m machines using SPT (shortest processing time first)
End Loop

The algorithm is executed with two inputs: jobs with real processing (production) time and jobs with estimated processing time from the

regression analysis. The difference between the makespans can be used to determine the quality of the model.

Algorithm for Applying

The best model for each product is stored in the database as a result of the previous stage. This algorithm uses a special model called the "null model" for products that do not have a regression analysis. The null model uses a naive heuristic (the simple arithmetic mean) for production time estimation.

On-Event: New production order arrived
Iterate all products in the production order and for each product p
    Retrieve the model stored for product p
    If p doesn't have a regression model then
        Retrieve the null model
    End if;
    Calculate the production time of product p according to model coefficients
End Loop

Services

The data mining process is performed by 3 services that are independent of each other:

1. Data Warehouser: Handles the sampling, collecting, pre-processing and storing of the data.
2. Data Analyzer: Computes regression analyses on the warehouse and stores the results in the database.
3. Data Applier: Retrieves the best model for a product from the repository and uses it to apply new records.
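The applying step and its null-model fallback can be illustrated with a minimal Python sketch. The names used here (make_null_model, make_linear_model, apply_order) and the in-memory dictionaries standing in for the "Regression Models" and "Null Models" tables are assumptions made for illustration; they are not part of the CPMS module, which stores serialized models in the database.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical model type: maps an attribute vector to a production time (seconds).
Model = Callable[[Sequence[float]], float]


def make_null_model(past_times: Sequence[float]) -> Model:
    """Null model: ignores the attributes and predicts the arithmetic mean
    of the production times recorded so far for the product."""
    average = mean(past_times)
    return lambda _attributes: average


def make_linear_model(intercept: float, coefficients: Sequence[float]) -> Model:
    """Linear model: intercept plus the weighted sum of the attribute values."""
    def predict(attributes: Sequence[float]) -> float:
        return intercept + sum(c * x for c, x in zip(coefficients, attributes))
    return predict


def apply_order(
    order: List[Tuple[str, Sequence[float]]],
    regression_models: Dict[str, Model],
    null_models: Dict[str, Model],
) -> List[float]:
    """Estimate the production time of every product in a new order.

    A product without a stored regression model falls back to its null model.
    """
    estimates = []
    for product_id, attributes in order:
        model: Optional[Model] = regression_models.get(product_id)
        if model is None:
            model = null_models[product_id]
        estimates.append(model(attributes))
    return estimates


# Illustrative data only: product "A" has a regression model, product "B" does not.
regression_models = {"A": make_linear_model(10.0, [2.0, 0.5])}
null_models = {"B": make_null_model([30.0, 50.0])}
order = [("A", [3.0, 4.0]), ("B", [1.0, 1.0])]
estimates = apply_order(order, regression_models, null_models)
```

In this sketch the fallback is resolved per product at lookup time, which mirrors the "If p doesn't have a regression model" branch of the pseudocode.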

6 Results and Conclusions

Module Results

The models produced by our module are evaluated using the value of r², where high values of r² represent good models. The following chart shows the values of r² as a function of the product (sorted by descending r²).

Figure 13: r² as a function of products (sorted by descending r²)

From the chart above we can see that a small number of products received r² greater than 0.2. In common applications of data mining, values of r² higher than 20% represent a good predictive model. Below 20%, the model is considered weak.
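For reference, the evaluation above can be sketched in a few lines of Python. The text does not state the formula it uses, so the sketch assumes the standard definition r² = 1 - SS_res / SS_tot; the function names r_squared and is_good_model are hypothetical, and the 0.2 threshold is the one quoted in the text.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, assuming r^2 = 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - f) ** 2 for y, f in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("r^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


def is_good_model(r2: float, threshold: float = 0.2) -> bool:
    """Apply the 20% rule of thumb quoted in the text."""
    return r2 > threshold


# Illustrative production times in seconds (not taken from the project data).
actual_times = [1.0, 2.0, 3.0]
estimated_times = [1.0, 2.0, 4.0]
score = r_squared(actual_times, estimated_times)
```

Under this definition, a model that always predicts the mean of the observed times scores exactly 0, and a model evaluated on held-out validation data can score below 0.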

Algorithms performance

For each product, all 3 regression algorithms were executed. The following table summarizes their success.

                    OLS Linear Regression  LSM Linear Regression  Support Vector Regression
Best values of r²   76%                    14%                    10%
Memory consumption  10%                    15%                    75%
CPU time            13%                    19%                    78%

Conclusions

1. Only a small number of products had their production time successfully captured by a linear model. The majority of the products did not fit the linear regression model, and there could be several reasons for that:
   - Some products are produced in parallel on the same machine. This introduces an undesired error into the production times, which decreases the quality of the linear model.
   - Most of the products are not affected by order-specific parameters.
2. Linear regression is by far the best algorithm for these data sets. It is the fastest and most accurate of them all. Nevertheless, it is worthwhile to invest the time and resources (CPU and memory consumption) to execute the other algorithms. Despite their excessive resource consumption, they achieved higher values of r² for some products.
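The per-product choice behind conclusion 2 can be sketched as follows, assuming that the "Best values of r²" row reports the share of products for which each algorithm produced the highest r². The function names and the scores are illustrative only; they are not taken from the project's results.

```python
from collections import Counter
from typing import Dict


def best_algorithm_per_product(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each product, pick the algorithm with the highest r^2.

    `scores` maps product id -> {algorithm name -> r^2}. On a tie, the
    algorithm listed first for that product wins.
    """
    return {
        product: max(by_algorithm, key=by_algorithm.get)
        for product, by_algorithm in scores.items()
    }


def win_shares(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Fraction of products for which each algorithm was the best one."""
    winners = Counter(best_algorithm_per_product(scores).values())
    return {algorithm: count / len(scores) for algorithm, count in winners.items()}


# Made-up r^2 values for four hypothetical products.
example_scores = {
    "P1": {"OLS": 0.30, "LSM": 0.10, "SVR": 0.20},
    "P2": {"OLS": 0.05, "LSM": 0.02, "SVR": 0.25},
    "P3": {"OLS": 0.40, "LSM": 0.35, "SVR": 0.10},
    "P4": {"OLS": 0.15, "LSM": 0.12, "SVR": 0.01},
}
best = best_algorithm_per_product(example_scores)
shares = win_shares(example_scores)
```

Running every algorithm and keeping the per-product winner is what allows the slower algorithms to contribute on the minority of products where they outperform ordinary least squares.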

7 Further research

Improvements

Our thesis suggests that the production times of products in the factory depend on attributes that arrive in the order. From the project's results, we can see that such a correlation does exist. Furthermore, we can improve the results (higher values of r²) by measuring attributes from the production phase itself in addition to those retrieved from the order.

Another improvement relates to the warehousing process itself. Existing sampling techniques largely hide any temporal relationships in the temporal data [3]. Pre-investigating the attribute values could uncover autocorrelations between pairs (or more) of attributes. Autocorrelations can mislead the data mining algorithm and decrease the quality of the results. In order to minimize that effect we can apply sampling.

Additional data mining tasks

The project focused primarily on calculating the production times of the various products in the factory. The implementation of the data mining module in CPMS can be the basis for further estimations:

Size of product: The physical storage the products occupy on the servers is also an important consideration when planning the deployment of the server farm that handles the production of LI and LB. This variable can also be estimated using regression analyses on the order-specific parameters.

Malfunctions: Each malfunction also dumps a massive amount of environmental measurements that were taken before, during and after that specific malfunction instance. The causes of malfunctions, as far as the factory's knowledge goes, are stochastic, but with the right

choice of explanatory attributes, they can also be predicted (at least to an extent).

8 References

[1] Adriaans, P., Zantinge, D., 1996, Data Mining, Addison-Wesley, UK.
[2] Krzysztof Cios, Witold Pedrycz, Roman Swiniarski, 1998, Kluwer Academic Publishers, p. xvii.
[3] E. I. Neaga, J. A. Harding, A Review of Data Mining Techniques and Software Systems to Improve Business Performance in Extended Manufacturing Enterprises.
[4] George Fernandez, 2003, "Data mining using SAS applications", Chapman & Hall/CRC, p. 21.
[5] J. A. Harding, M. Shahbaz, Srinivas, A. Kusiak, 2006, "Data Mining in Manufacturing - A Review".
[6] Yaakov Zehavi, Data mining course slides, Introduction, s21.
[7] Wikipedia, Linear Regression.
[8] Wikipedia, Least Median Square.
[9] Humberto Barreto, 2001, An Introduction to Least Median of Squares.
[10] Wikipedia, Support Vector Regression.
[11] Wikipedia, Correlation.
[12] Wikipedia, Coefficient of determination.