Data Mining Approach for Predictive Modeling of Agricultural Yield Data

Branko Marinković, Jovan Crnobarac, Sanja Brdar, Borislav Antić, Goran Jaćimović, Vladimir Crnojević

Faculty of Agriculture, University of Novi Sad, Serbia
{branko, jovanc, jgoran}@polj.uns.ac.rs
Faculty of Technical Sciences, University of Novi Sad, Serbia
sanja_brdar@yahoo.com, {tk_boris, crnojevic}@uns.ac.rs

Abstract - Prediction of agricultural yields is a challenging task that demands a fusion of knowledge from different areas such as data mining, statistics and agriculture. This paper shows that data mining techniques can be successfully applied to agricultural data analysis. The results we present were obtained on a data set that contains monthly measurements of different environmental parameters and annual yields for maize, soybean and sugar beet. The results agree with previous results on plant production modeling that are at the core of agricultural science.

Keywords - Data mining, precision agriculture, yield prediction, genetic algorithms.

I. INTRODUCTION

Precision agriculture is a new paradigm that arose mostly from developments in the field of wireless sensor networks. Those networks can collect huge amounts of environmental data that have a strong impact on agricultural production. One of the interesting aspects of precision agriculture is the prediction of yields. Data mining offers possibilities to turn raw data into valuable information that can be used for making better decisions. Timely data collection directly from field-deployed sensors aims to improve agricultural measures and increase their profitability through the use of appropriate data analysis algorithms. Pioneering applications of data mining in agriculture have been reported in [1] [2], and the concept has also been applied successfully in other environmental fields, such as forestry and biodegradability analysis [3] [4].
Numerous factors have an impact on the yield of cultivated plants. They significantly determine its level, whether separately or through a very complex set of interactions. Climate conditions have the predominant role among all of these factors [5].

In the past few decades, plant production modeling has introduced some novel elements into modern agriculture. It is motivated by the everlasting human wish to anticipate the progress of cultivated plants, and it came about as a result of the joint work of many teams of biologists, agronomists, meteorologists and programmers [6]. In the course of their development and use, plant production models have served researchers as a valuable instrument for the organization and retrieval of data collected through field experiments. In developed countries, these models have become an irreplaceable source of information for counseling services, agricultural stations and other sites that use them for making important decisions regarding plant production.

The first techniques for plant production modeling were based on regression analysis, the simplest technique that interprets experimental data using a mathematical model or function that describes a certain phenomenon or process. These techniques have been substantially improved lately, and today there are very complex computer programs that forecast vegetation dynamics, yield components or yields of cultivated plants. By knowing the soil characteristics, the requirements of the cultivar or hybrid, the history of applied agro-technical measures and the weather conditions, it is possible to predict the moments of phenological phase changes, biomass development and plant yields.

The rest of the paper is structured as follows. Section II describes the dataset and Section III describes the data mining methodology used to build the predictive models. In Section IV, we present and discuss the results. Section V presents the conclusions and plans for future work.

II. DATA

Data about the yields of the main field crops (maize, soybean and sugar beet) in the Serbian province of Vojvodina, collected during the period from 1999 to 2008, were taken from the internal database of the Department of Field and Vegetable Crops at the Faculty of Agriculture in Novi Sad. Basic meteorological parameters in the vegetation period - maximal, minimal and average monthly air temperature, as well as precipitation level - have been used in the analysis. These parameters are calculated by averaging daily measurements made by seven hydro-meteorological stations distributed all over the province of Vojvodina.

For the analysis of water balance in the vegetation periods of some crops (ETP - potential evapotranspiration, ETR - real evapotranspiration, shortage or excess of water with respect to the plant's needs), the bioclimate method based on hydrophitothermic indices (HFTC) has been used [7]. In the semi-arid conditions of Vojvodina, this is the most widely used method for defining a plant's water deficit or surplus. The hydrophitothermic index HFTC shows the quantity of water (in
millimeters) used by a plant in the ETP process for every degree of average daily temperature. The monthly ETP value is calculated on the basis of the following formula:

$$\mathrm{ETP} = \mathrm{HFTC} \cdot \sum t \qquad (1)$$

where ETP represents potential evapotranspiration (measured in mm) for a one-month period, HFTC is the hydrophitothermic index and Σt represents the sum of all average daily temperatures (°C) that have been recorded during a particular month.

III. DATA MINING ALGORITHMS

Data mining algorithms were applied using WEKA software. It includes a wide variety of learning algorithms and preprocessing tools [8] [9]. Among the algorithms implemented in WEKA, the M5P model tree was the most suitable for our dataset and for the yield prediction problem that we want to solve. The M5P model tree is a combination of data classification and regression [10]. It follows the idea of decision tree methods, but its leaves hold linear regression functions instead of class labels.

The M5P model tree is constructed in a top-down way. At each step a decision is made whether to partition the training set (i.e. to create a split node) or to introduce a regression function as a leaf node. The decision is based on the standard deviation of the target variable and on the expected reduction of that deviation, calculated using equation (2). If we denote by T the set of training instances, by T_i the subsets of instances that are created by splitting the set T, and by std(T) and std(T_i) the standard deviations of the target variable in the sets T and T_i, the reduction term is given by

$$\Delta = \mathrm{std}(T) - \sum_i \frac{|T_i|}{|T|}\,\mathrm{std}(T_i) \qquad (2)$$

An important parameter that indicates the performance of a model tree is the correlation coefficient r. It measures the statistical correlation between the prediction p and the target variable a using equation (3), where cov(p,a) is the covariance between predicted and actual values, while std(p) and std(a) are their standard deviations. Error measures are expressed by the root mean squared error (RMSE), equation (4), and the mean absolute error (MAE), equation (5), where p_i and a_i are the predicted and actual values for instance i and n is the number of instances.

$$r = \frac{\mathrm{cov}(p, a)}{\mathrm{std}(p)\,\mathrm{std}(a)} \qquad (3)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(p_i - a_i)^2} \qquad (4)$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|p_i - a_i\right| \qquad (5)$$
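The split criterion and the three performance measures described above can be illustrated with a short Python sketch. This is not WEKA's implementation: the function names and the toy numbers are made up for illustration, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
import math
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction obtained by splitting `parent` into `subsets`."""
    n = len(parent)
    weighted = sum(len(s) / n * pstdev(s) for s in subsets)
    return pstdev(parent) - weighted


def correlation(pred, actual):
    """Correlation coefficient r = cov(p, a) / (std(p) * std(a))."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(pred, actual)) / n
    return cov / (pstdev(pred) * pstdev(actual))


def rmse(pred, actual):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))


def mae(pred, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)


# Hypothetical target values and a candidate split into two subsets.
toy_yields = [2.0, 2.0, 4.0, 4.0]
print("SDR :", sdr(toy_yields, [[2.0, 2.0], [4.0, 4.0]]))

# Hypothetical predictions that are always one unit too low.
predicted = [1.0, 2.0, 3.0, 4.0]
actual = [2.0, 3.0, 4.0, 5.0]
print("r   :", correlation(predicted, actual))
print("RMSE:", rmse(predicted, actual))
print("MAE :", mae(predicted, actual))
```

In the second toy case the predictions follow the actual values perfectly in shape but are offset by a constant, so r is close to 1 while RMSE and MAE are both 1. This is why the correlation coefficient and the error measures are reported side by side.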
We have also performed experiments with attribute selection filters in order to extract the most relevant attributes that have an impact on agricultural yields. In that way we managed to reduce the attribute list and to increase model tree performance. Attributes are grouped into feature subsets. Selection is done by evaluating the objective function for each feature subset. Subsets of features that are highly correlated with the target variable while having a low inter-correlation are preferred. The feature subset search methods that improved our results are the best-first search and the genetic algorithm.

The best-first search method searches the space of feature subsets by a greedy hill-climbing technique. It starts with a random solution and iteratively makes changes to the solution in order to improve it. The algorithm terminates when it cannot produce any further improvement. In WEKA this heuristic search method is implemented with a backtracking facility, which means that the algorithm keeps the previous state and can therefore return to it if the current state is found unpromising.

The genetic algorithm (GA) is a search method that incorporates principles of natural selection. The GA evolves a population of individuals, where each individual is a possible solution to the optimization problem. In our agricultural yield prediction problem, each individual is a candidate subset of attributes that strongly influence the yield. Every individual is quantitatively evaluated by a fitness function. Promising candidates are selected and copied to the next generation. These candidates are also randomly altered by the genetic operations of crossover and mutation. In the crossover operation two individuals swap segments of their code and in that way produce offspring. Mutation changes a few bits of an individual's code. These operations are intended to simulate the analogous processes of recombination and mutation of chromosomes in living beings.
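The selection, crossover and mutation steps, repeated over several generations, can be sketched in Python as follows. This is a minimal illustration and not the WEKA search used for the reported results. The fitness function `merit` is an assumed correlation-based score that rewards high attribute-target correlation and low inter-correlation; the population size, mutation rate, seeding of the population with the full attribute set and the toy data are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two sequences (0.0 if either one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def merit(mask, columns, target):
    """Assumed fitness: high attribute-target correlation, low inter-correlation."""
    chosen = [c for c, keep in zip(columns, mask) if keep]
    k = len(chosen)
    if k == 0:
        return 0.0
    r_ct = sum(abs(pearson(c, target)) for c in chosen) / k
    pairs = [(a, b) for i, a in enumerate(chosen) for b in chosen[i + 1:]]
    r_cc = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_ct / math.sqrt(k + k * (k - 1) * r_cc)


def ga_select(columns, target, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Evolve boolean attribute masks; return the best mask seen in any generation."""
    rng = random.Random(seed)
    n = len(columns)

    def fitness(mask):
        return merit(mask, columns, target)

    # Initial population: the full attribute set plus random subsets.
    population = [tuple([True] * n)] + [
        tuple(rng.random() < 0.5 for _ in range(n)) for _ in range(pop_size - 1)
    ]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Selection: the fitter half is copied to the next generation.
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        next_gen = list(parents)
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with probability p_mut.
            child = tuple((not g) if rng.random() < p_mut else g for g in child)
            next_gen.append(child)
        population = next_gen
        best = max(population + [best], key=fitness)
    return best


# Hypothetical data: four attribute columns and a target, six instances each.
target = [3.1, 4.0, 5.2, 6.1, 6.9, 8.2]
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],     # closely follows the target
    [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],     # nearly a copy of the first column
    [0.3, -0.2, 0.5, 0.1, -0.4, 0.2],   # weakly related
    [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],     # weakly related
]
best = ga_select(columns, target)
print(best, merit(best, columns, target))
```

Because the full attribute set is part of the initial population and the best mask is carried across generations, the returned subset never scores worse under `merit` than using all attributes.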
When a new generation is created, fitness evaluation is performed again. The overall process is repeated several times, and the solution that the GA produces is the best individual across all generations. The GA searches the solution space in multiple directions at once; its strength therefore lies in its effectiveness when searching large spaces.

IV. RESULTS

This section presents the results of our work. We estimated the performance of the applied data mining algorithms by 10-fold cross-validation. The data are randomly partitioned into 10 blocks; one block is held out for testing and the model is built on the remaining nine blocks. This procedure is then repeated for each of the other blocks. Finally, a measure of performance is calculated by averaging over the 10 runs. Part A describes the results obtained on the full attribute set, while part B describes the improved results obtained with attribute selection.

A. Full attribute set

Table I presents the correlation coefficients, mean absolute errors and root mean squared errors for the maize, soybean and sugar beet datasets.

TABLE I. MODEL TREE PERFORMANCE

Data set      Correlation coefficient   Mean absolute error   Root mean squared error
Maize         0.8948                    449.4081              551.6791
Soybean       0.8188                    281.2256              342.6852
Sugar beet    0.7462                    4477.7011             6315.5219

The best correlation coefficient is obtained for the maize data set. After pruning, the model trees for all three datasets
are reduced to only one regression function. Figures 1, 2 and 3 present these regression expressions. All attribute values were normalized in order to better understand and compare their influence on the yield.

1.0684 * Tmax_06 - 0.4045 * Tmax_07 + 0.4625 * Tmax_10 + 0.5726 * Tmin_06 - 0.2487 * Tmin_09 - 0.3218 * Tmin_10 - 1.4286 * Tsr_06 - 0.7805 * Tsr_08 + 0.1484 * Pmm_05 + 0.2769 * Pmm_09 + 0.939

Figure 1. Regression rule for maize data set

According to Figure 1, the maize yield is mostly affected by the temperatures in June, while the precipitation variables are most significant in May and September. May and June can be deemed critical for the growth and development of maize, since during these months there is an intensive increase of the vegetation mass and the generative organs start to form.

0.3963 * Tmax_06 + 0.3718 * Tmax_10 - 0.2749 * Tsr_07 - 0.6916 * Tsr_08 + 0.1582 * Tsr_09 + 0.2335 * Pmm_05 + 0.5767 * Pmm_06 + 0.3606 * Pmm_10 - 0.2465 * ETP_04 + 0.3359

Figure 2. Regression rule for soybean data set

Soybean yield, very much like the maize yield, depends on the temperatures during summer. The temperatures during the hottest months (July and August) had a negative impact on the yield, while the temperatures in June had a positive impact. Precipitation in May and June affects the formation of vegetation mass, which also has a positive impact on the yield. The results are in fair accordance with agricultural practice, since the literature states that soybean is particularly sensitive to drought during blooming and grain formation. Soybean's water needs grow from sowing onward; demand is greatest during summer (June, July and August) and then declines until the end of the vegetation period. This is related to the growth, development and ripening of soybean as well as to meteorological changes during the vegetation period.
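As a reading aid, the sketch below evaluates the soybean regression rule of Figure 2 and ranks its attributes by the absolute size of their coefficients, which is how the influence of individual months is compared in the discussion above. The coefficients are copied from Figure 2. The input values are hypothetical, and because the paper does not state the normalization scheme, the inputs are simply assumed to be already normalized.

```python
# Coefficients copied from the soybean regression rule (Figure 2).
SOYBEAN_RULE = {
    "Tmax_06": 0.3963, "Tmax_10": 0.3718, "Tsr_07": -0.2749,
    "Tsr_08": -0.6916, "Tsr_09": 0.1582, "Pmm_05": 0.2335,
    "Pmm_06": 0.5767, "Pmm_10": 0.3606, "ETP_04": -0.2465,
}
SOYBEAN_INTERCEPT = 0.3359


def predict(rule, intercept, attributes):
    """Evaluate a linear rule; attributes missing from the input count as 0."""
    return intercept + sum(w * attributes.get(name, 0.0) for name, w in rule.items())


def rank_by_influence(rule):
    """Attribute names ordered by the absolute value of their coefficients."""
    return sorted(rule, key=lambda name: abs(rule[name]), reverse=True)


# Hypothetical normalized attribute values for one season.
season = {"Tmax_06": 0.6, "Tsr_08": 0.9, "Pmm_06": 0.4}
print(predict(SOYBEAN_RULE, SOYBEAN_INTERCEPT, season))
print(rank_by_influence(SOYBEAN_RULE)[:3])
```

The ranking puts Tsr_08 first with a negative sign and Pmm_06 second with a positive sign, in line with the observation that hot Augusts reduce the soybean yield while June precipitation increases it.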
According to Figure 3, high average temperatures in August and maximum temperatures in September had the greatest negative impact on the yield of sugar beet, while high maximum temperatures in August had a positive impact on the yield. Also, real evapotranspiration values in May and June had a positive effect on the yield, while the potential evapotranspiration values in July negatively affected the yield.

1.0581 * Tmax_08 - 0.4732 * Tmax_09 + 0.1148 * Tmin_06 - 0.1353 * Tmin_09 - 1.6514 * Tsr_08 + 0.2753 * Tsr_09 - 0.1737 * ETP_07 + 0.2309 * ETR_05 + 0.1669 * ETR_06 + 0.8541

Figure 3. Regression rule for sugar beet data set

B. Reduced sets

For the soybean data set the highest improvement is obtained by the genetic algorithm search method for attribute selection, as shown in Table II. Starting from 41 attributes, the GA produces a reduced set of the seven most informative attributes: Tsr_04 (average temperature in April), Tsr_08 (average temperature in August), ETP_06 (potential evapotranspiration in June), ETP_07 (potential evapotranspiration in July), ETP_08 (potential evapotranspiration in August), ETR_04 (real evapotranspiration in April) and ETR_06 (real evapotranspiration in June).

TABLE II. IMPROVED MODEL TREE PERFORMANCE FOR SOYBEAN

Attribute selection            Correlation coefficient   Mean absolute error   Root mean squared error
Without attribute selection    0.8188                    281.2256              342.6852
Best-first search              0.8596                    243.8395              302.6648
Genetic algorithm search       0.862                     244.0551              300.6706

After attribute selection by the GA method, the resulting data set is more relevant to the soybean yield. The effect of temperatures and precipitation in October is eliminated, since it is known that they do not have any significant impact on soybean yield.
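To put the Table II figures in proportion, the following sketch computes the relative reduction of the root mean squared error for soybean with respect to the model built without attribute selection. The numbers are copied from Table II; the helper name `relative_reduction` is made up for illustration.

```python
# Soybean rows of Table II: (correlation coefficient, MAE, RMSE).
TABLE_II = {
    "without attribute selection": (0.8188, 281.2256, 342.6852),
    "best-first search": (0.8596, 243.8395, 302.6648),
    "genetic algorithm search": (0.862, 244.0551, 300.6706),
}


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


baseline_rmse = TABLE_II["without attribute selection"][2]
rmse_gain = {
    name: relative_reduction(baseline_rmse, row[2])
    for name, row in TABLE_II.items()
    if name != "without attribute selection"
}
for name, gain in rmse_gain.items():
    print(f"{name}: RMSE reduced by {gain:.2f}%")
```

Both search methods lower the RMSE by roughly 12 percent, with the genetic algorithm slightly ahead of best-first search.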
According to the improved regression rule for the soybean data set displayed in Figure 4, April emerges as a significant period for yield formation: an increase in the real evapotranspiration value brings a corresponding increase in yield (its impact is measured through temperature and precipitation values).
-0.3678 * Tsr_04 - 0.4962 * Tsr_08 - 0.4053 * ETP_07 + 0.348 * ETR_04 + 0.5737 * ETR_06 + 0.574

Figure 4. Improved regression rule for soybean data set

The highest improvement for the sugar beet data set is obtained by the best-first search method for attribute selection (Table III). Starting from the set of 41 attributes, the algorithm finds a subset of only eight attributes: Tmin_05 (minimum temperature in May), Tmin_08 (minimum temperature in August), Tsr_06 (average temperature in June), Tsr_07 (average temperature in July), Tsr_08 (average temperature in August), ETP_05 (potential evapotranspiration in May), ETP_08 (potential evapotranspiration in August) and ETR_05 (real evapotranspiration in May). The model tree built on these eight attributes is shown in Figure 5. It complies well with previously published results on the prediction of sugar beet yield from general environmental factors.

TABLE III. IMPROVED MODEL TREE PERFORMANCE FOR SUGAR BEET

Attribute selection            Correlation coefficient   Mean absolute error   Root mean squared error
Without attribute selection    0.7462                    4477.7011             6315.5219
Genetic algorithm search       0.8583                    3506.2127             4710.3439
Best-first search              0.8595                    3610.7432             4709.3904

V. CONCLUSION

In this paper we present new research possibilities for the application of data mining methodology to the problem of yield prediction for maize, soybean, sugar beet and other field crops. Data mining algorithms were applied using WEKA software. Basic meteorological parameters in the vegetation period - maximal, minimal and average monthly air temperature, as well as precipitation level - have been used in the analysis. For the analysis of water balance in the vegetation periods of some crops, the bioclimate method based on hydrophitothermic indices (HFTC) is used.
The M5P model tree applied to the full set of 41 attributes produced meaningful regression rules that are in accordance with plant production models proposed by agricultural scientists. Feature subset selection methods, such as best-first search and the genetic algorithm, improved the accuracy and produced simpler models that are easier for agronomists to interpret.

VI. ACKNOWLEDGEMENT

This work was supported by the Ministry of Science and Technological Development of the Republic of Serbia, under the technology development project "Wireless Sensor Networks and Remote Sensing Foundations of Modern Agricultural Infrastructure" (grant TR-11022). Sanja Brdar was supported through the student scholarship program of the Ministry of Science and Technological Development of the Republic of Serbia.

Tsr_08 <= 23.135 : LM1
Tsr_08 >  23.135 :
|   Tsr_08 <= 23.636 : LM2
|   Tsr_08 >  23.636 :
|   |   ETR_05 <= 42.65 : LM3
|   |   ETR_05 >  42.65 : LM4

LM num: 1
0.0695 * Tmin_05 - 0.3811 * Tmin_08 - 0.2122 * Tsr_07 + 0.0766 * ETP_05 - 0.0871 * ETP_08 + 0.0971 * ETR_05 + 0.8582

LM num: 2
-0.4307 * ETP_08 + 0.2833 * ETR_05 + 0.7463

LM num: 3
-0.364 * ETP_08 + 0.302 * ETR_05 + 0.6286

LM num: 4
-0.364 * ETP_08 + 0.2898 * ETR_05 + 0.6437

Figure 5. Improved model tree for sugar beet dataset

REFERENCES

[1] G. Ruß, R. Kruse, M. Schneider, P. Wagner, Data Mining with Neural Networks for Wheat Yield Prediction, in Advances in Data Mining. Medical Applications, E-Commerce, Marketing and Theoretical Aspects, Lecture Notes in Computer Science, Vol. 5077, Springer, pp. 47-56, 2008.
[2] D. Pokrajac, T. Fiez, D. Obradovic, S. Kwek, Z. Obradovic, Distribution comparison for site-specific regression modeling in agriculture, in Proc. 12th International Joint Conference on Neural Networks (IJCNN), pp. 3937-3941, 1999.
[3] S. Džeroski, A. Kobler, V. Gjorgjioski, P. Panov, Using decision trees to predict forest stand height and canopy cover from LANDSAT and LIDAR data, in Proc. 20th International Conference on Informatics for Environmental Protection, pp. 125-133, 2006.
[4] H. Blockeel, S. Džeroski, B. Kompare, S. Kramer, B. Pfahringer, W. V. Laer, Experiments in predicting biodegradability, in Proc. 9th International Workshop on Inductive Logic Programming, pp. 80-91, Springer, 1999.
[5] B. Marinković, J. Crnobarac, D. Marinković, G. Jaćimović, D. V. Mircov, Weather conditions in the function of optimal corn yield in Serbia and the Vojvodina province, in Proceeding of the 1st Scientific Agronomic Days, pp. 15-19, 2008.
[6] B. Lalic, L. Pankovic, D. T. Mihailovic, M. Malesevic, I. Arsenic, Crop models and its use in vegetation dynamic forecasting, in Proc. of Institute of Field and Vegetable Crops, Vol. 44, pp. 317-323, 2007.
[7] N. Vučić, Bioclimate coefficients and plant water regime - theory and practical applications, Vodoprivreda, Vol. 6, Num. 8, pp. 45-54, 1971.
[8] I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second edition, Morgan Kaufmann, 2005.
[9] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, 2005.
[10] J. R. Quinlan, Learning with Continuous Classes, in Proc. 5th Australian Joint Conference on Artificial Intelligence, pp. 343-348, 1992.