Harumanis Mango Flowering Stem Prediction using Machine Learning Techniques

Transcription

1 Harumanis Mango Flowering Stem Prediction using Machine Learning Techniques R.S.M. Farook a,*, H. Ali b, A. Harun c, Ndzi. D. L. d, A. Y. M. Shakaff c, Mahmad Nor Jaafar e, Z. Husin a, A.H.A. Aziz a School of Computer and Communication Engineering, University Malaysia Perlis (UNIMAP), Perlis, Malaysia a, Block A, Kompleks Pusat Pengajian Seberang Ramai No. 12 & 14, Jalan Satu, Taman Seberang Jaya Fasa 3 Seberang Ramai, Kuala Perlis Perlis Darul Sunnah Telephone Number : rohani@unimap.edu.my Fax Number : Information Technology Department, Polytechnic Tuanku Syed Sirajuddin, (PTSS), Perlis Malaysia b School of Mechatronic Engineering, University Malaysia Perlis (UNIMAP), Perlis, Malaysia c School of Engineering, University of Portsmouth, Portsmouth, UK d Agrotechnology Research Station, University Malaysia Perlis (UNIMAP), Perlis, Malaysia e rohassanie@yahoo.com, David.ndzi@port.ac.uk { zulhusin, hallis, aliyeon, mahmad }@unimap.edu.my ABSTRACT Harumanis Mango (Mangifera indica) is known as one of the best table tropical fruit, due to its aroma and sweetness. Harumanis mango cultivar is included in the national agenda as a specialty fruit from Perlis, Malaysia for the world. Despite its overwhelming local demand in Malaysia and also internationally, the fruit supply never meets the demand. Mango flowering stem prediction is important as one of the factors to predict mango yield in order to implement effective forward marketing. Forward marketing is a co ntract that is signed between supplier and client based on the amount of delivery and the price of delivery in future, based on the predicted yield. In this paper, machine learning techniques are used to perform prediction of the flowering tree branches that could be used to predict yield in mango trees. Results shows that machine learning techniques could be used to predict the flowering branches. Research Notes in Information Science (RNIS) Volume13,May 2013 doi: /rnis.vol13.10 Categories and Subject Descriptors I.5.2 [Computing Methodologies]: Classifier Design and Analysis, Pattern Analysis. General Terms Algorithms, Performance. Keywords Machine learning; mango flowering stem prediction; soft computing 1. INTRODUCTION Harumanis mango is one of the fruit that has high economic demand and potential for Malaysia export business especially the Perlis State in Malaysia. Perlis exported 3.1 metric tons of Harumanis mango to Japan in 2010 and has targeted the export demand to increase to 100 metric tons by Harumanis mango tree is a yearly fruit bearing trees and reproductive phase of the mango trees often starts from January and ends nearly on June. This type of mango is highly sensitive to the climate and only grows in Perlis and part of Surabaya in Indonesia. It requires a significant dry weather period to initial flowering and the productive phase can be 46

2 significantly affected by change in weather. Although it does grow on Surabaya, the variety that grows in Perlis is often highly valued for export and therefore attracts high foreign earnings. The demand outstrips supply most of the time and there is a need to study and understand the yield cycle in order to accurately predict supply. Harumanis Mango growth and reproductive phases are illustrated in Figure 1 that depicts the growth and reproductive phases of Harumanis Mango in Perlis associated with the period of months. The vegetative growth period is approximately from July to December. December and January are considered as Pre-Flowering phase where the flower induction process is stimulated. January to February months is the period when the flowers grow and bloom. Fruit bearing occurs during March to April that leads to harvest from May to June. Figure 1. Harumanis Mango Growth and Reproductive Phases The pre flowering and flowering phases are identified as important stages in the plant reproductive physiology. The Harumanis mango flowering induction event can be influenced by a few factors such as pruning, defoliation and nitrogen fertilizer application [3,4,10]. Since Harumanis mango tree grow in Malaysia, a tropic country, flowering induction is influenced by the climatic factors[2,6,18], and biotic factors[10,12,15]. In tropics climate countries, it is an important factor that the terminal stems of the trees are allowed to rest after the previous vegetative growth, to be able to produce reproductive shoots [11]. This resting time is necessary to provide the stem ample time to mature and grow sufficient whorl, length and diameter. The mango trees yield predictions are essential to enable forward marketing signed between Malaysia and mango importing countries such as Japan. The forward marketing contract includes the specific mango quantity in tons to be delivered at specific times. A Harumanis mango exporter needs to know the approximate yield before agreeing to terms of the contract. The yield can be approximated once the farmer performs the possible flowering stems prediction. Machine learning techniques application in agriculture is a relatively new approach for classification and yield prediction in agriculture. There are a few research studies that include machine learning techniques in agricultural domains for yield prediction [1,5], crop classification, crop disease detection [19], management and advisory expert system [7,8,9,13,14,20]. Machine learning is about learning the structures from data. Machine learning techniques can be used to perform classification and prediction for future observation. Classification is a task of assigning objects to one of several predefined classes to create a classification model, whereas prediction is where the classification model is used to predict the new observations. A classification technique such as k-nn classifier, rule-based classifier and naïve Bayes classifiers employs a learning algorithm to find a model that best suits the relationship between the attributes and the classification categories. A training set consists of data whose class are known and is used to build a classification model. The classification model is later applied to a test data set, to test the classification accuracy of the model. Evaluation of the classification model is based on the counts of test records correctly and incorrectly classified by the model. The model performance can be displayed using performance metrics determine the level of accuracy. The performance of the model can be evaluated using several methods such as Hold out Method, Random Subsampling, Cross Validation and Bootstrap [16]. A Cross Validation technique is used to evaluate the performance of the classification models. In this paper, machine learning techniques are applied to identify the possible flowering tree branches using biotic factors. Five different classifiers performance used to predict the possible flowering branches are compared. The classifiers used are k-nearest Neighbour (k-nn), Naives Bayes, Support Vector Machine (SVM), Classification Trees (CAT) and Random Forest (RF). The rest of the paper is outlined as follows; Section 2 discusses the data sets descriptions and method. Results and discussion are presented in Section 3 and the paper ends with the conclusion in Section METHODS In this paper, the data from Harumanis Greenhouse at the Institute of Agrotechnology, University Malaysia Perlis are used. The biotic data c onsists of 254 s tems of generative and vegetative flushes. The attributes and the data types are given in Table 1: 47

3 Table 1. The Attributes and the Data Type of the Biotic Factors of Harumanis Branches Attributes Lysimeter Length of first whorl Length of second whorl Length of Third Whorl Data Type Nominal Figure 2. The typical mango terminal stems displaying the three whorls where the measurements are taken from. The whorls are representing the termination of each previous flush of vegetative growth. The Diameter of the Third Whorl Stem State Categorical There are 5 attributes that have been used in this research. These are lysimeter, that describe the type of lysimeter that the trees are planted in, length of the first, second and the third stem whorl (length in mm of the 3 whorl branches) and their diameters. Lysimeter feature is assigned a v alue between 1 and 3 which represent the root zones of the mango trees. The mango trees are planted in 3 different lysimeter sizes which are micro-lysimeter (1), lysimeter (2) and unrestricted root zones (3). The micro-lysimeter has dimension of 50 cm in deep and a diameter of 50 c m while the other lysimeters size with dimensions of 0.75 m deep and 1.5 m in diameter. Mango trees with unrestricted root zones are planted directly into the ground. Stem State describes the state of the stem. It is assign a state value of flowering or vegetative. Stem State is the class that will be used by the machine learning techniques to learn while in the training process and build an appropriate model. The test data set is used to validate the models developed to predict the Stem State. Figure 2 displays the features from the mango stem that are measured and used to classify the flowering stems. The stems whorls length are measured and recorded accordingly. Machine learning techniques that were used in this work are k-nn, Naïve Bayes, SVM, CAT and RF. k-nn algorithm is a technique used to classify the new observations based on the closest neighbor s labels. In this technique the distance (similarity) between the test set and the training set is used to determine the nearest-neighbor list. The appropriate number of k (the number of neighbor instances to be compared) is important to be determined. A small k value might cause the classifier to be susceptible to over fitting because of noise in the training data. On the other hand a large k value might cause misclassification because neighbors that are located far from the neighborhood will also be included in the classification decision. The k- NN models uses Euclidean distance metrics as shown in Equation 1 to get the nearest neighbors d = ( x i yi 2 ) Algorithm for k-nn algorithm (Equation 1) Data : Training Samples D= { x 1:N, c 1:N }, Test Point x*. Result : Class of new point c* 1: Let k be the number of nearest neighbors. 2: for i = 1 to N do 3: calculate distance d(x i, x*) between x* and every sample in D; 4: find x j, for which the distance the smallest, the set of k closest training samples to x*; 5: y = arg max I( v = yi ), v ( xi, yi ) Dx* 6: end 48

4 The naïve bayesian classifier algorithm is one of the classification algorithms where an instance is described with n, attributes a i ( i = 1 to n) and the classified class v from the set of possible classes V is described as follows: v = arg max n ( j ) P( ai v j ) P v vi V i= 1 (Equation 2) Root Node Internal Node Compute a vector of elements p j = P v n ( j ) P( ai v j ) i= 1 (Equation 3) which, after normalization so that the sum of p j is equal to 1, represents class probabilities. The class probabilities and conditional probabilities (priors) in the above formulae are estimates from the training data: class probability is equal to the relative class frequency, while the conditional probability of attribute value given class is computed by figuring out the proportion of instances with a value of i-th attribute equal to a i among instances that from class v j. Relative frequency is used when computing prior conditional probabilities. So the total number of training examples is n and nc is the number of training example that has the specific condition. The relative frequency corresponding to the probability would be nc P = n (Equation 4) SVM [17] is one of the techniques that is widely used in classification problems. SVM separates the classes independently with the hyperplane that maximizes the distance from a hyperplane separating the classes to the nearest point in the data set. CAT is an algorithm that splits the training instances accordingly and builds a tree that consists of root node, internal nodes and leaf nodes as a model to be used to classify the test examples. Leaf Node Leaf Node Leaf Node RF builds several classification trees to a d ata set and combines the predictions from all the trees. A classification tree is fitted to each bootstrap samples from the data. At each node, a small number of randomly selected variables are made available for binary partitioning. In random forest each variable that is importance is measured. The classifier models using machine learning techniques have been built and tested using 10 fold cross validation tests. The k-nn technique has been applied to build a model to classify the flowering and vegetative branches. The Euclidean metrics has been used as learning metrics. The classification accuracy with values from 0 to 1, where 0 is the worst classification and 1 is the best classification has been recorded. Two learner metrics in k-nn technique performance are compared to find the better learner technique that could be used in the classification model. The Learner metrics are Euclidean and Hamming metrics. The results are displayed in Figure 4. Since the Euclidean metrics show better learning ability, this metrics has been used through out the training and testing of the data. The other techniques, Naïve Bayes, SVM, CAT and RF have also been used to build the classifier models, tested and compared to report the best technique that could predict the flowering stems. The results are displayed and discussed in the following section. 3. RESULT AND DISCUSSION Figure 3 displays the result of k-nn technique classification accuracy which varies the number of neighbors from 1 t o 10, using 10 f old cross validation testing. The highest classification accuracy is achieved using Euclidean metrics for k value from 1 to 7 which is 49

5 The accuracy decreases for k from 8 to 10, with values from to On the other hand the classification accuracy of Hamming learning metric is lower than that of Euclidean metrics. The highest accuracy is achieved at k =9 where the accuracy is Figure 3. The Classification Accuracy using k-nn (k=1to10) and Different Leaning Metrics Table 2 displays the classification accuracy for the various classification models in predicting flowering branches. Classification Trees classifier model outperforms the others with 77.95% accuracy rate followed by the SVM classifier model. Naives Bayes classifier model also achieve 72.43% accuracy rate. k-nn and Random Forest achieve less than 70% accuracy. Classifier Model Table 2. Classififation Accuracy, Sensitivity and Specificity of the Classifier Models Classification Accuracy (%) Sensitivity (%) Specificity (%) k-nn Naives Bayes SVM Classification Trees Random Forest The sensitivity rates are also given in Table 2 where SVM outperforms the other techniques in classifying or predicting the flowering branches with 85.11% accuracy followed by the Classification Trees at 76.6%. Naives Bayes model achieves a lower Sensitivity rate at 65.96%. The k-nn and Random Forest sensitivity is very low at 29.79% and 26.6%, respectively. SVM and Classification Trees classifier models outperform the other methods applied in this study in predicting the flowering stems and also the vegetative stem with accuracy levels of more than 70%. 4. CONCLUSION The development of accurate yield prediction methods is invaluable both to suppliers and the importers. This is more critical in high value crops that can help in communities and government to predict and plan expenditure. The results presented in this paper show that SVM and Classification trees classifier models outperform other methods tested in this study in predicting the flowering stems. The results also demonstrate that the machine learning technique could be used to perform classification on flowering and non flowering stems. This classification algorithm can be used in Decision support system that could predict tree yield every season using biotic factors. 5. ACKNOWLEDGMENT Special thanks to the UniMAP Agricultural Station, for providing the samples for data collection. This project is funded by UniMAP Short Grant ( ), University Malaysia Perlis. Rohani S. Mohamed Farook acknowledges the sponsorship provided by UniMAP. 6. REFERENCES [1] Basso, B. Spatial validation of crop models for precision agriculture. Agricultural Systems 68, 2 (2001), [2] Chacko, E.K. Physiology of vegetative and reproductive in mango (Mangifera indica L.) trees. Proc. 1st Australian Mango Research Workshop, (1986), [3] Davenport, T.L., Ying, Z., Kulkarni, V., and White, T.L. Evidence for a translocatable florigenic promoter in mango. Scientia Horticulturae 110, 2 (2006), [4] Davenport, T.L. Reproductive physiology. In: Litz RE (ed), The Mango, Botany, Production and Uses, 2008, [5] Elwell, D.L., Curry, R.B., and Keener, M.E. Determination of potential yield-limiting factors of soybeans using SOYMOD/OARDC. Agricultural Systems 24, 3 (1987), [6] Farook, R.S.M., Aziz, A.H.A., Harun, A., et al. Data Mining on Climatic Factors for Harumanis Mango Yield Prediction. Intelligent Systems, Modelling and Simulation, International Conference on 0, (2012), [7] Fukuda, S., Spreer, W., Yasunaga, E., Yuge, K., Sardsud, V., and Müller, J. Random Forests modelling for the 50

6 estimation of mango (Mangifera indica L. cv. Chok Anan) fruit yields under different irrigation regimes. Agricultural Water Management, (2012). [8] Hernández-Sánchez, C., Luis, G., Moreno, I., et al. Differentiation of mangoes (Magnifera indica L.) conventional and organically cultivated according to their mineral content by using support vector machines. Talanta 97, (2012), [9] Hu, J., Li, D., Duan, Q., Han, Y., Chen, G., and Si, X. Fish species classification by color, texture and multi-class support vector machine using computer vision. Computers and Electronics in Agriculture 88, (2012), [10] Núñez-Elisea, R. and Davenport, T.L. Requirements for mature leaves during floral induction and floral transition in developing shoots of mango. Acta Hortic. 296, (1992), [11] Núñez-Elisea, R. and Davenport, T.L. Flowering of mango trees in containers as influenced by seasonal temperature and water stress. Scientia Horticulturae 58, 1-2 (1994), [12] Núñez-Elisea, R. Effect of leaf age, duration of cool temperature treatment, and photoperiod on bud dormancy release and floral initiation in mango. Scientia Horticulturae 62, 1-2 (1995), [13] Papageorgiou, E.I., Markinos, A.T., and Gemtos, T.A. Fuzzy cognitive map based approach for predicting yield in cotton crop production as a basis for decision support system in precision agriculture application. Applied Soft Computing Journal 11, 4 (2011), [14] Pomar, J. and Pomar, C. A knowledge-based decision support system to improve sow farm productivity. Expert Systems with Applications 29, 1 (2005), [15] Ramírez, F. and Davenport, T.L. Mango (Mangifera indica L.) flowering physiology. Scientia Horticulturae 126, 2 (2010), [16] Tan, P.-N., Steinbach, M., and Kumar, V. Introduction to Data Mining. Addison Wesley, [17] Vapnik, V.N. Statistical Learning Theory. John Wiley & Sons, Inc., [18] Whiley, A.W. Environmental effects on phenology and physiology of mango-a review. Acta Hortic. 341, (1993), [19] Yang, C.-C., Prasher, S.O., Landry, J.-A., and Ramaswamy, H.S. Development of a herbicide application map using artificial neural networks and fuzzy logic. Agricultural Systems 76, 2 (2003), [20] Zheng, H. and Lu, H. A least-squares support vector machine (LS-SVM) based on fractal analysis and CIELab parameters for the detection of browning degree on mango (Mangifera indica L.). Computers and Electronics in Agriculture 83, (2012),