Vinification Mining: A Case Study on Wine Production

Jorge RIBEIRO 1, José NEVES 2, Juan SANCHEZ 3, Paulo NOVAIS 2, José MACHADO 2

1 Viana do Castelo Polytechnic Institute, School of Technology and Management, Viana do Castelo, Portugal, jribeiro@estg.ipvc.pt
2 University of Minho, DI-CCTC, Department of Computer Science, Portugal, {jneves,pjon,jmac}@di.uminho.pt
3 Viana do Castelo Polytechnic Institute, Agrarian School, Portugal, xavier@esa.ipvc.pt

Abstract

Throughout history, wine has played a relevant role in almost every civilization. Demarcated regions that benefit from a Denomination of Origin must ensure that wine production is subject to strict control at every phase, from the vineyard to the customer. Vinification is one of the stages of wine production that can influence the quality achieved. This quality is traditionally assessed by wine tasters, who analyze organoleptic parameters such as colour, foam, flavour and savour, which are very important for wine production and for its successful marketing. Data Mining techniques are highly relevant in this field: they reveal the importance of the numerous chemical parameters involved in wine production, and they allow the definition of classification models that determine the organoleptic parameters from the chemical parameters of the winemaking process. Decision Trees and Linear Regression were used as Data Mining techniques for the classification and regression objectives. The experiments were conducted with Microsoft SQL Server 2008 Business Intelligence Development Studio and an open-source Data Mining tool (WEKA). Very good results were achieved, with performances between 85% and 98% for all models.

Key words: Data Mining, Knowledge Discovery in Databases, Decision Trees, Linear Regression, Wine Vinification Process.

1. Introduction

In the context of wine production, the vinification process corresponds to the analysis of the wine's quality over time. During this process (Fig. 1), several chemical parameters, such as pH, anthocyanins and Chemistry Age [1, 2], are analyzed and recorded. With these data it is possible to examine relationships between the attributes, which allows knowledge to be extracted and classification models to be created in order, first, to adjust some parameters to improve the quality of the wine and, second, to analyze the chemical attributes that influence the best time to consume the wine. To complement these results and to assess the quality of the samples, wine tasters evaluate organoleptic (subjective) attributes such as savour, colour, flavour and foam.

For green red wines, winemaking begins with the grapes (in this case study, of the Vinhão variety). The grapes are then transported to an experimental winery, where they are sampled. After grape sampling comes fermentation with the different types of maceration [1], followed by racking, pressing and cold stabilization (Figure 1). The procedure continues with fining (the use of the "glue"), stabilization and bottling. To examine the organoleptic/subjective quality of the

vinification, the vinification samples are evaluated by a panel of wine tasters that reviews each sample eight times. The wines were produced by three different processes: pellicular fermentative maceration (a traditional method), rotary cube fermentation and carbonic maceration, as shown in Figure 1 and in Table 7 [1]. Total wine phenolics were determined by colorimetry with phosphotungstic-phosphomolybdic acid [3] at 750 nm. The results were expressed in units of the Folin-Ciocalteu Index (FCI).

Fig. 1 - Technological process used for the three types of maceration in wine production. Common steps: Vinhão grapes, grape sampling, transport to the experimental winery. Then, for each type:
  Pellicular Fermentative Maceration (C): destemming/crushing, maceration, racking after a week, pressing, cold stabilization/rackings, stabilization and bottling.
  Rotary Cube Maceration (RF): destemming/crushing, maceration, racking after a week, pressing, cold stabilization/rackings, stabilization and bottling.
  Carbonic Maceration (CM): whole grapes into a CO2-saturated tank, maceration, racking after two weeks, destemming/crushing, end of alcoholic fermentation, cold stabilization/rackings, stabilization and bottling.

In recent years, the application of Data Mining techniques [7] has become a powerful and easy-to-use way of analyzing relationships between the attributes of data sets. The high volume of data that organizations store over time poses a new challenge: extracting knowledge from the stored information. Through the Knowledge Discovery in Databases (KDD) process [4], organizations can exploit the stored data, discovering relationships or affinities within them and understanding the behaviour of the various agents that intervene in the organization, such as customers, suppliers and sellers. Several tasks (selection, pre-processing, transformation, data mining and interpretation) are associated with this process.
The Data Mining task centres on the application of algorithms, including artificial neural networks, decision trees, association rules and genetic algorithms, that extract patterns from the previously treated data and are applied according to the KDD objectives (classification, regression, clustering, forecasting and optimization). In this work we address the classification and regression objectives, using Decision Trees (DT) [5] and Linear Regression (LR) [6] as Data Mining techniques.

This study focuses on the creation of classification models for the subjective attributes (savour, colour, flavour and foam) [2] from the chemical parameters obtained during the process. To achieve this objective we used DT and LR to represent mathematical functions that express the relationship between the chemical attributes, so that the value of a given subjective parameter can be obtained from them. We use a data set from the green red wine vinification process of an agricultural cooperative in the North of Portugal. The tools used were an open-source tool (WEKA) [7] and a proprietary tool (Microsoft Business Intelligence 2008) [8]. With this work we intend to demonstrate the potential of Data Mining techniques in

the extraction of knowledge from databases, in particular for the creation of classification models of subjective attributes in the wine vinification process. With these techniques the managers of the agricultural cooperative could predict the values of the subjective parameters that result from varying some chemical parameters. In this way, they have a tool that helps them analyse the chemical parameters of the best wines, improve the wine's quality and determine the best conditions in which to consume the wine.

2. Materials and Methods

2.1 Wine vinification Data

This work used part of the data collected over four years during the wine production phase at a wine estate in the Minho Region (North of Portugal) that produces and markets green red wine. Three kinds of maceration [1] were used in the vinification process (Figure 1): pellicular fermentative maceration (C), carbonic maceration (CM) and rotary cube maceration (CR). For each maceration type, five fining ("glue" or clarification) treatments were used [1]: polyvinylpolypyrrolidone, albumin, gelatin and casein, plus the control ("witness"), without any fining agent. These treatments appear in Table 1 as p, a, g, c and t, respectively.

| Type | Attribute | Domain values (Min, Max; classes A to E) |
|---|---|---|
| | Sample Fermentation Time in months (SFTM) | {6, 8, 12, 14, 24, 30, 36} |
| | Clarification type | {t, p, a, g, c} |
| | Vinification Type (vt) | {C, MC, CR} |
| Chemical | pH | ,45 3,45 3,56 3,63 3,56 3,63 3,7 3,7 |
| Chemical | Absorbency - A | ,21 0,42 0,42 0,5 0,6 0,5 0,6 0,7 0,7 |
| Chemical | Absorbency - A | ,6 0,6 0,75 0,97 0,75 0,97 1,27 1,27 |
| Chemical | Absorbency - A | ,16 0,16 0,19 0,23 0,19 0,23 0,28 0,28 |
| Chemical | Chemistry Age (CA) | 0,32 0,40 0, ,32 0,4 0,48 0,56 0,56 |
| Chemical | Folin-Ciocalteu Index (FCI) | |
| Chemical | Anthocyanins - Ant (mg/l) | |
| Subjective | Savour | Min 1,9; Max 8; boundary between A (*) and B (**) at 5,4 |
| Subjective | Colour | Min 2,8; Max 8,7; boundary between A (*) and B (**) at 6,2 |
| Subjective | Foam | Min 2; Max 8,4; boundary between A (*) and B (**) at 5,6 |
| Subjective | Aroma | Min 1,5; Max 8,1; boundary between A (*) and B (**) at 4,7 |

Tab. 1: The main wine's vinification indicators.

The data set has two types of attributes: chemical attributes and subjective attributes (colour, foam, flavour and savour). Table 1 presents the attributes of the data set, with the minimum and maximum values of the continuous attributes and their correspondence to classes (A, B, C, D and E). The division of the continuous values into five classes was decided by the production managers in the context of green red wines. The subjective parameters (savour, colour, flavour and foam) are divided into two classes: "A", corresponding to "medium", and "B", corresponding to "good".

As mentioned, the first objective is to create classification models for the subjective attributes of the samples. The aim is to analyze how variation in the chemical parameters predicts the subjective parameters in terms of two classes: "A" ("medium", marked (*) in Table 1) and "B" ("good", marked (**) in Table 1).

Fig. 2: Histograms for the attributes of the wine vinification data set.

Several physico-chemical attributes are associated with the production of wine [2, 9]. Despite the relevance of all these parameters, the attributes that the production managers considered most relevant to the analysis of wine production are the pH, the absorbance at 420 nm (contribution to the colour yellow), the absorbance at 520 nm (contribution to the colour red), the absorbance at 620 nm (contribution to the colour blue), the anthocyanins [9, 10], the Chemistry Age (CA) [9] and the Folin-Ciocalteu Index [9]. The sample fermentation time (SFTM) is the time, in months, at which the sample was collected. It takes values from 6 to 36 months, corresponding to the period of the vinification process. A particularity of the chemical colour parameter is that it is the sum of the absorbances at three wavelengths (420, 520 and 620 nm).
For this reason the chemical colour parameter was removed from the data set. Before attempting the DM modelling, the data were pre-processed. The original data set contained attributes with missing values. Since it was not possible to obtain the correct values, the incomplete records were discarded [11], leaving a total of 362 examples. The main features of the vinification data set are described in Table 1. The frequency distributions (histograms) of these variables are plotted in Figure 2.
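The pre-processing described above (discarding incomplete records and mapping each taster score to class "A" or "B") can be sketched as follows. This is a minimal illustration, not the authors' procedure: the record layout, the function names and the toy values are hypothetical, the thresholds are read from the subjective rows of Table 1, and the direction of the comparison (scores below the threshold map to "A") is an assumption.

```python
# Thresholds read from the subjective rows of Table 1 ("aroma" is the name
# used there). Assumption: score < threshold -> "A" (medium), else "B" (good).
THRESHOLDS = {"savour": 5.4, "colour": 6.2, "foam": 5.6, "aroma": 4.7}


def drop_incomplete(records):
    """Keep only the records in which no attribute value is missing (None)."""
    return [r for r in records if all(v is not None for v in r.values())]


def to_class(attribute, score):
    """Map a taster score to class 'A' (medium) or 'B' (good)."""
    return "A" if score < THRESHOLDS[attribute] else "B"


# Hypothetical toy records; the real data set is not reproduced here.
records = [
    {"sftm": 6, "ph": 3.50, "savour": 6.1},
    {"sftm": 12, "ph": None, "savour": 5.0},  # missing pH: discarded
    {"sftm": 36, "ph": 3.65, "savour": 4.8},
]

clean = drop_incomplete(records)
labels = [to_class("savour", r["savour"]) for r in clean]
print(len(clean), labels)
```

Discarding records is the simplest treatment of missing values; it is reasonable here only because, as stated above, the correct values could not be recovered.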

According to the managers of this wine estate, assessing the quality of the wine during vinification was defined as a typical classification and regression problem.

2.2 Decision Trees and Linear Regression

The Decision Tree (DT) [12] is one of the most popular and efficient Data Mining classification algorithms. It represents a set of rules that follow a hierarchy of classes or values, expresses a simple conditional logic and is graphically similar to a tree (Figure 3). A DT classifies an instance by sorting it from the root node down to a terminal node (leaf), which provides the classification: each node of the tree specifies a test on an attribute of the instance, and each branch descending from that node corresponds to one of the possible values of the attribute. An instance is classified by first testing the attribute specified by the root node and then following the branch corresponding to the value of that attribute in the instance.

Fig. 3: Decision Tree example for the attribute Colour.

The most popular decision tree algorithms for classification are ID3, C4.5 and C5.0, proposed by Ross Quinlan [5]. The CART classification algorithm proposed by Breiman [6] is also widely adopted. In this study we use the C4.5 implementation in the WEKA tool and the Microsoft Decision Trees algorithm, which is a hybrid of these algorithms (C4.5 and CART). C4.5 is a decision tree algorithm based on the concept of information gain. The information gain represents the decrease in entropy caused by dividing a given data set according to an attribute. The attribute with the highest gain is chosen to divide the data set, and recursive

application of this procedure to the other relevant attributes structures the data set with respect to those attributes. In this study, J48 [7], a Java re-implementation of the C4.5 algorithm [13] that is part of the WEKA machine learning package [7], was used to induce the decision trees with the open-source tool. The other tool was the proprietary Microsoft Business Intelligence Studio 2008 [8].

The objective of Linear Regression [6] is to find a basis for predicting one variable, i.e. to find a function that represents the behaviour of the variables (Figure 4). In the Microsoft tool, Linear Regression uses an interestingness score to rank and sort attributes in columns that contain continuous, non-binary numeric data. The interestingness score is used to assess all input columns, to ensure consistency.

Fig. 4: Linear Regression between the attribute Flavour and Anthocyanin.

Regression typically requires that both dependent and independent variables be continuous and numeric. In this study, we applied linear regression to obtain lines for predicting the subjective variables of the data set. For this reason, the non-numeric attributes Clarification Type and Vinification Type were removed from the data set.

3. Results

Given the goals of the wine vinification analysis, it was decided to base the experiments on the classification of the subjective attributes of the data set. As mentioned, we use Decision Trees and Linear Regression. These two approaches are compared, with predictive accuracy as the criterion.

Fig. 5: Attribute dependency for the Flavour attribute.

The classification models for wine vinification analysis were developed using the C4.5 algorithm [12]. To ensure the statistical significance of the results, 10 runs were applied

in all tests, with the accuracy estimated by the holdout method [13]. The training strategy was tested with both balanced and non-balanced training sets. In each simulation, the available data are randomly divided into two mutually exclusive partitions: the training set, with 2/3 of the available data, used during the modelling phase; and the test set, with the remaining 1/3 of the examples, used after training to compute the accuracy values. A common tool for classification analysis is the confusion matrix [14]. This is a matrix of size N x N, where N denotes the number of possible classes. It is built by matching the predictions given by the Data Mining model against the actual desired results. In the experiments presented, J48 [7] with default parameter values was used to induce the classification trees. Model training and validation were based on 10-fold cross-validation, and the number of correctly classified instances was evaluated.

3.1 Experimental Results

Table 2 presents the confusion matrix of the DT for each tool, where the values denote the average of 10 runs. Both approaches have a predictive accuracy of about 90%. The experimental results show that, for both tools, using balanced training sets brings no improvement. The results reveal that Model 1 (Microsoft Decision Tree) is more accurate than Model 2 (WEKA Decision Tree).

| Attribute | Model 1 - Microsoft Decision Tree: predict probability | Model 1 - Microsoft Decision Tree: score | Model 2 - WEKA Decision Tree: correctly classified instances |
|---|---|---|---|
| Colour | ,36% | 0,85 | 83,3% |
| Foam | ,48% | 0,95 | 87,2% |
| Flavour | ,18% | 0,89 | 92,2% |
| Savour | ,31% | 0,96 | 87,5% |

Tab. 2: Confusion matrix of the obtained models.

| Attribute | Model 1 - Microsoft Decision Tree | Model 2 - WEKA Decision Tree |
|---|---|---|
| Colour | Vinification Type and SFTM | SFTM |
| Foam | Clarification Type and SFTM | SFTM and Clarification Type |
| Flavour | Vinification Type, CA | SFTM |
| Savour | SFTM, Clarification Type and Vinification Type | SFTM and Clarification Type |

Tab. 3: Relevant attributes for the Microsoft BI model and the WEKA model.

Table 3 presents the most relevant attributes for the classification models obtained by applying the DT. Both tools select the attribute "SFTM" as the most relevant for classifying the subjective attributes. The second most important attribute is the clarification type. For the classification of the attribute "Flavour", the most relevant attribute is the vinification type for the WEKA tool, and the vinification type and Chemistry Age for the Microsoft tool. Figure 5 presents the attribute dependency for the Flavour attribute. Although the tools obtain practically the same accuracy (91.18% and 92%), the difference in the attributes selected is explained by one tool analysing the correlation between the attributes in more detail than the other.
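The holdout protocol described in this section (ten random splits into 2/3 training and 1/3 test data, each scored through a confusion matrix) can be sketched as follows. This is an illustration under stated assumptions rather than the setup of either tool: all function names are hypothetical, the toy rows are invented, and a majority-class predictor stands in for the decision tree so that the example stays self-contained.

```python
import random
from collections import Counter


def holdout_split(rows, train_fraction=2 / 3, rng=None):
    """Randomly split rows into mutually exclusive training and test sets."""
    rng = rng or random.Random()
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def confusion_matrix(actual, predicted, classes=("A", "B")):
    """N x N matrix: rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def majority_classifier(train):
    """Stand-in model (assumption): always predicts the most frequent class."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda features: label


def repeated_holdout(rows, fit, runs=10, seed=0):
    """Average test accuracy over several random holdout splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = holdout_split(rows, rng=rng)
        model = fit(train)
        cm = confusion_matrix([y for _, y in test], [model(x) for x, _ in test])
        scores.append(accuracy(cm))
    return sum(scores) / len(scores)


# Hypothetical toy rows: (features, class label).
rows = [((sftm,), "B" if sftm <= 14 else "A") for sftm in [6, 8, 12, 14, 24, 30, 36] * 4]
mean_accuracy = repeated_holdout(rows, majority_classifier)
print(f"mean holdout accuracy over 10 runs: {mean_accuracy:.2f}")
```

Averaging over several random splits reduces the dependence of the estimate on one particular partition, which is the reason given above for repeating each test 10 times.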

General rules of the IF-THEN type can be deduced from decision trees by following the path from the root node to a leaf node of the tree. From the tree in Figure 3 it can be derived that if Chemistry Age is equal to "D" (between 0.48 and 0.56) and the vinification type is not equal to "CR", then the Flavour attribute is "B" ("good") with a probability of 91.18%.

| Attribute | Model 1 - Microsoft Decision Tree | Model 2 - WEKA Decision Tree |
|---|---|---|
| Colour | IF SFTM = 36 AND ca = B THEN COLOUR = B (99,36%) | IF SFTM="20" AND vt="cr" AND ant=e THEN B (75%) |
| Foam | IF SFTM = 26 AND a420 = B THEN FOAM = A (99,81%) | IF SFTM="20" AND ct="g" and ct="cr" THEN A (85%) |
| Flavour | IF ca = A AND ant = A THEN FLAVOUR = A (99,35%) | IF SFTM="26" AND ant="b" THEN A (70%) |
| Savour | IF SFTM {8, 30, 36} and vt = 'C' and CA = 'D' THEN SAVOUR = B (98,98%) | IF SFTM="14" AND ct="g" AND vt="mc" THEN B (65%) |

Tab. 4: Rules derived by the Data Mining tools applying the Decision Trees technique.

The rules presented in Table 4 correspond to the top of the tree path. For the "colour" attribute, one example of a rule that can be extracted is: IF SFTM is different from 30 and 36 and the vinification type is 'C', THEN the colour will be "good" (class "B") with a predictive probability of 94%. For Model 2, if SFTM is equal to 26 and the anthocyanins are equal to "B" (between 160 and 230), then the Flavour attribute is "A" ("medium") with a probability of 70%.
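To connect the information-gain criterion of Section 2.2 with IF-THEN rules of the kind listed in Table 4, the following minimal sketch selects the attribute with the highest information gain on a toy data set and reads one rule off each branch. It is not the J48 or Microsoft implementation: C4.5 additionally normalises the gain into a gain ratio, handles continuous attributes and prunes the tree, none of which is shown here. The records and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def group_by(rows, attribute, target):
    """Collect the target labels of the rows under each value of the attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    return groups


def information_gain(rows, attribute, target):
    """Decrease in entropy obtained by splitting the rows on the attribute."""
    base = entropy([row[target] for row in rows])
    groups = group_by(rows, attribute, target)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_attribute(rows, attributes, target):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


def one_level_rules(rows, attribute, target):
    """One IF-THEN rule per branch, with the share of the majority class."""
    rules = []
    for value, labels in sorted(group_by(rows, attribute, target).items()):
        label, count = Counter(labels).most_common(1)[0]
        rules.append(
            f"IF {attribute} = {value} THEN {target} = {label} ({count / len(labels):.0%})"
        )
    return rules


# Hypothetical toy records (not taken from the paper's data set).
rows = [
    {"sftm": 8, "vt": "C", "colour": "B"},
    {"sftm": 8, "vt": "CR", "colour": "B"},
    {"sftm": 12, "vt": "C", "colour": "B"},
    {"sftm": 30, "vt": "CR", "colour": "A"},
    {"sftm": 36, "vt": "C", "colour": "A"},
    {"sftm": 36, "vt": "CR", "colour": "A"},
]

split = best_attribute(rows, ["sftm", "vt"], "colour")
rules = one_level_rules(rows, split, "colour")
for rule in rules:
    print(rule)
```

In a full tree the same selection is repeated inside each branch, so a rule accumulates one condition per node on the path from the root to the leaf.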
| Attribute | Model 5 - Microsoft Linear Regression: model | Score | Model 2 - WEKA Decision Tree: model | CC (*) |
|---|---|---|---|---|
| Colour | Colour = 6,241 - 0,060*(SFTM-22,087) + 0,002*(AntmgL-351,970) | 1,38 | Colour = * SFTM * ph * A * A * FCI * Ant(mg/L) | 0.71% |
| Flavour | Flavour = 4,358 + 1,556*(CA-0,455) - 0,141*(SFTM-22,429) | ,91 | Flavour = * SFTM * ph * A * CA | 0.88% |
| Foam | Foam = 5,430 + 1,228*(pH-3,571) + 1,583*(CA-0,458) - 0,088*(SFTM-22,270) | ,41 | Foam = * SFTM * ph * A * CA * Ant(mg/L) | 0,76% |
| Savour | Savour = 5,133 - 0,123*(SFTM-22,143) | 1,49 | Savour = * SFTM * ph * A * A | 0,79% |

(*) Correctly Classified Instances

Tab. 6: Linear Regression results.

As mentioned, the objective of the regression is to find a function (Figure 4) that approximates the behaviour of the variables. The linear regression obtained with Microsoft Linear Regression for the attribute Flavour is presented in Figure 4, and the equations in Table 6.

Discussion

As the tables show, the Microsoft Decision Tree performed better than the open-source tool. Accuracies between 85% and 98% were achieved by the Microsoft tool and between 83% and 93% by the open-source tool. For both tools the most influential attributes were SFTM and the clarification type, which drive the prediction of the subjective attributes from the chemical parameters of the vinification process. This shows the importance of the sampling time and of the fining agent used. As the SFTM value increases, the quality

of the sample in the various subjective attributes decreases, indicating that this type of wine should be consumed between 8 and 12 months after the vinification process. Given the results, the production managers can apply such tools to other data sets with more chemical parameters of wine production, obtaining additional decision support.

4. Conclusion

This paper presented a study of the prediction of organoleptic attributes (colour, foam, flavour and savour) in the wine vinification process, using Decision Trees and Linear Regression models as Data Mining techniques. The experiments were conducted using Microsoft Business Intelligence Studio 2008 and the open-source WEKA tool. Accuracies between 85% and 98% were obtained, indicating that Data Mining models can be used to predict subjective attributes in the wine vinification process from chemical parameters. It was possible to create classification models for the various subjective attributes and to identify the relevance of the other attributes. Although the data set contains few attributes, quite good results were attained. In the future it would also be interesting to consider a new set of chemical attributes of wine production. With this work we present the advantages of using Data Mining tools to support the decision-making process, in particular in the winemaking field.

Literature

[1] Castillo-Sanchez, J.X., Arantes J. et Maia, M.O. Étude de l'Évolution des Composés Phénoliques des Vins du Nord du Portugal Issues des Différentes Processus de Vinification. In: Polyphenols Communications 96, Vol. I, 18th International Conference on Polyphenols, July 15-18, Bordeaux, pp. 55-56.
[2] Castillo-Sanchez, J.X., Mejuto, J.C., Garrido, J. and Garcia-Falcón, S. Influence of winemaking protocol and fining agents on the evolution of the anthocyanin content, color and general organoleptic quality of Vinhão wines. Food Chemistry, 97, 1.
[3] OIV. Office International de la Vigne et du Vin. Recueil des Méthodes Internationales d'Analyse des Vins et des Moûts. Paris.
[4] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. Advances in Knowledge Discovery and Data Mining. The MIT Press, Massachusetts, USA.
[5] Quinlan, J.R., Induction of decision trees. Machine Learning.
[6] Breiman, L., Friedman, J., Olshen, R., Stone, C., Classification and Regression Trees. Wadsworth, Pacific Grove.
[7] Witten, I.H., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco, p. 369.
[8] Larson, B., Delivering Business Intelligence with Microsoft SQL Server 2008. McGraw-Hill Osborne Media, 2nd edition.
[9] Somers, T.C. and Evans, M.E. Spectral Evaluation of Young Red Wines: Anthocyanin Equilibria, Total Phenolics, Free and Molecular SO2, "Chemical Age". J. Sci. Food Agric., 28.
[10] Papadopoulou C., Kalliopi, S., Ioannis, R., Potential Antimicrobial Activity of Red and White Wine Phenolic Extracts against Strains of Staphylococcus aureus, Escherichia coli and Candida albicans. Food Technol. Biotechnol. 43 (1), pp. 41-46.
[11] Pyle, D., Data Preparation for Data Mining. Morgan Kaufmann Publishers.
[12] Quinlan, J.R., Bagging, Boosting and C4.5. Proceedings of the Fourteenth National Conference on Artificial Intelligence.
[13] Souza, J., Matwin, S., Japkowicz, N., Evaluating Data Mining Models: A Pattern Language. Proceedings of the 9th Conference on Pattern Language of Programs, Illinois, USA.
[14] Kohavi, R., Provost, F., Glossary of Terms. Machine Learning, 30 (2/3).

Appendix A

| Vinification | Alcohol (vol.%) | Sugar (g l-1) | Volatile acidity (g l-1) | Total acidity (g l-1) | Total sulphur dioxide (mg l-1) | Free sulphur dioxide (mg l-1) | pH |
|---|---|---|---|---|---|---|---|
| C | 10,5 +/- 0,05 | 1,70 +/- 0,015 | 0,31 +/- 0,012 | 9,97 +/- 0,34 | 111,02 +/- 3,6 | 30,21 +/- 0,95 | 3,29 +/- 0,1 |
| CM | 10,7 +/- 0,05 | 1,77 +/- 0,02 | 0,55 +/- 0,022 | 6,69 +/- 0,24 | 99,0 +/- 2,8 | 25,36 +/- 0,75 | 3,49 +/- 0,12 |
| RF | 10,1 +/- 0,06 | 1,80 +/- 0,025 | 0,49 +/- 0,019 | 10,45 +/- 0,64 | 109,59 +/- 2,6 | 26,12 +/- 0,75 | 3,38 +/- 0,12 |

Tab. 7: Chemical parameters of the green red wines (three vinifications; average of the three samples).