Machine learning data set analysis with visual simulation

Marko Bohanec¹, Mirjana Kljajić Borštnar², Marko Robnik Šikonja³

¹ Salvirt d.o.o., Slovenia, Marko.Bohanec@salvirt.com
² University of Maribor, Faculty of Organizational Sciences, Slovenia, Mirjana.Kljajic@fov.uni-mb.si
³ University of Ljubljana, Faculty of Computer and Information Science, Slovenia, Marko.RobnikSikonja@fri.uni-lj.si

Abstract

Most machine learning (ML) techniques require data as an input. Training data subsets are often chosen at random, so we have to check the suitability of the sampled data. Mostly this is done implicitly, through the performance of different classification techniques. The focus of this paper is to develop a visual approach which exposes data suitability through the distance between cases and the classifiers' response. We demonstrate our ideas on a real case of business-to-business sales forecasting data.

Keywords: machine learning, visual simulation, B2B sales, organizational learning, knowledge engineering.

1. Introduction

The use of machine learning methods in business, for example in the fields of finance and market segmentation, is widespread and well adopted. Presently the methods require well-structured, updated and complete information about past events, stored in commercial databases and reports. In cases where events from the environment significantly affect the decision-making process, or where information is only weakly structured, incomplete, and requires deep understanding, the automatic models fail to provide efficient decision support. This holds true especially in the area of inter-organizational business (business-to-business, B2B), which is complex in nature due to weakly structured problems and data, missing data, and frequent changes in the environment. These problems are therefore mostly left out of automatic approaches and are solved by the subjective judgement of individuals or groups.
Unfortunately, weak understanding of problem situations often results in a large gap between planned (announced) and realized goals. To reduce this gap between desired and realized business outcomes, our long-term goal is to develop machine learning approaches that, introduced into the organizational learning process, will support decision-makers in forecasting the closure of business-to-business sales opportunities. Based on transparent interpretation of individual sales opportunities, this will improve understanding of organizational processes and contribute to overall business results.

In this paper we investigate the selection of training data on a real-world business data set. In machine learning, training instances are mostly chosen randomly, which introduces a certain amount of variability and possibly noise into the models. We argue that in a real-world sales forecasting scenario not all training instances are equally adequate, and that a preprocessing stage in which we visualize and select the appropriate cases is useful: it helps create first a better data set and second better decision models. Before experts select the best data subset, we therefore propose several comprehensive visualizations that reveal differences between subsets. Similarly, after we build several ML models, a visualization of the performance of a particular classification model trained on a selected data set is beneficial and can help business users understand the model, its strengths and weaknesses, and its possible business applicability. Our aim in this paper is therefore to present informative visualizations that reflect the performance of the models trained on the underlying data. This will enable business users to understand the requirements for the high-quality data needed for successful machine learning modeling, without needing to understand technical details.

2. Literature review and methodology

Different techniques exist for splitting the available data into training and testing data sets (Batista, Bazan and Monard, 2004). In most cases the challenges are related to class imbalance, missing values, the reaction of classifiers to minority classes, and possible concept drift.

2.1 Distance among cases

In this paper we do not focus on class imbalance; rather, we are interested in the similarity of different training cases as used in the prediction models. We measure this similarity through the proximity measure defined with the random forest (RF) classification model (Breiman, 2001). Random forest is an ensemble learning technique using several decision trees. The trees are built in such a way as to minimize their inter-correlation but retain at least some strength. For this, bootstrap sampling of training sets and sampling of candidate features in the internal tree nodes are used during the construction phase.
As a side effect of bootstrap sampling with replacement, some training instances are not selected; these form a so-called out-of-bag set, which can be used as an independent validation set. The random forest method computes similarity between instances through classification of out-of-bag instances: if two out-of-bag cases are classified in the same tree leaf, the proximity between them is incremented. The proximity can be transformed into a distance with the expression [expression missing in the source].

2.2 Classification accuracy

To calculate the classification accuracy of a machine learning model, we first need to select a classification method. Following previous research (Čehovin, Bosnić, 2010; Kotsiantis, 2007), we selected Random Forest (Breiman, 2001) and Naïve Bayes, two frequently used classifiers. Random Forest was selected for its excellent and robust classification performance, and Naïve Bayes was chosen as a simple baseline method.
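
The proximity and distance described under "Distance among cases" can be sketched in a few lines. This is a minimal illustration in plain Python, not the R/CORElearn code used in the paper. The input format `leaf_ids`, the function names, the normalization by the number of trees in which both cases were observed, and the transform sqrt(1 - proximity) are all assumptions, since the text does not specify them.

```python
from math import sqrt


def proximity_matrix(leaf_ids):
    """Fraction of trees in which each pair of cases falls into the same leaf.

    leaf_ids[t][i] is the (hypothetical) leaf index that tree t assigns to
    case i, or None when case i is skipped for tree t, for example because
    it was not out-of-bag for that tree.
    """
    n = len(leaf_ids[0])
    shared = [[0] * n for _ in range(n)]    # trees where i and j share a leaf
    counted = [[0] * n for _ in range(n)]   # trees where both i and j were seen
    for leaves in leaf_ids:
        for i in range(n):
            if leaves[i] is None:
                continue
            for j in range(n):
                if leaves[j] is None:
                    continue
                counted[i][j] += 1
                if leaves[i] == leaves[j]:
                    shared[i][j] += 1
    proximity = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(1.0)  # a case is always identical to itself
            elif counted[i][j]:
                row.append(shared[i][j] / counted[i][j])
            else:
                row.append(0.0)  # pair never observed in the same tree
        proximity.append(row)
    return proximity


def to_distance(proximity):
    """Assumed transform: distance = sqrt(1 - proximity)."""
    return [[sqrt(max(0.0, 1.0 - p)) for p in row] for row in proximity]


# Toy forest: 4 trees, 3 cases, leaf indices invented for illustration.
toy_leaves = [
    [0, 0, 1],
    [2, 3, 3],
    [4, 4, 4],
    [5, 6, 7],
]
toy_distance = to_distance(proximity_matrix(toy_leaves))
```

With this transform, cases that always share a leaf end up at distance 0 and cases that never do end up at distance 1, which is the range the distance plots rely on.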

3. Results

3.1. Research setup

A real-world data set with B2B sales history was obtained from a software development company participating in the research. The data set consists of 22 features and 150 past sales cases; of these, 65 sales opportunities were signed and 85 were lost. From the initial data set, 60% of the cases were randomly partitioned into a training set (90 cases) and the rest were left for a testing set (60 cases). The data mining and machine learning suite R was used for data analysis, visualization and programming tasks. For machine learning techniques and methods the CORElearn library (Robnik-Šikonja, 2015) was used, and for visualization the lattice package (Sarkar, 2015).

3.2. Visual simulation of different training data sets

To gain insight into how training data sets differ from the perspective of the selected cases, we performed several random splits and simulated the difference between instances using the RF-based distance in Figure 1. The darkness of the squares indicates the distance between individual instances. The distance of each instance to itself is zero, hence the black squares (lowest distance, maximal similarity) on the diagonal. The left-hand visualization (Fig. 1a) shows instances that are relatively dissimilar (it has fewer dark or black boxes); the variability is therefore large, and the data set seems a good candidate for a training set that would cover the variability of the problem. The right-hand plot (Fig. 1b) reveals an obvious issue with the data set: we notice dark patches and even a black box, indicating that there is no difference at all between many instances. The left-hand sample is therefore the more appealing candidate for a useful training data set.

Figure 1: Training cases distance plots for different learning data sets, panels (a) and (b)
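
The random 60/40 partition and the shading convention of Figure 1 can be imitated with a short sketch. It is written in plain Python rather than the R and lattice tooling named above; the function names, the fixed seed, and the character ramp `SHADES` are illustrative assumptions. Darker characters stand for smaller distances, so identical cases appear as the darkest symbol.

```python
import random


def random_split(n_cases, train_fraction=0.6, seed=0):
    """Return (train_idx, test_idx) for one random partition of case indices."""
    rng = random.Random(seed)          # seeded so a split can be reproduced
    indices = list(range(n_cases))
    rng.shuffle(indices)
    n_train = round(n_cases * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# Darkest character first: distance 0 (identical cases) is drawn darkest,
# following the convention described for Figure 1.
SHADES = "#*+. "


def shade_rows(distance, max_distance=1.0):
    """Render a square distance matrix as rows of shade characters."""
    rows = []
    for row in distance:
        chars = []
        for d in row:
            scaled = min(max(d / max_distance, 0.0), 1.0)
            chars.append(SHADES[int(scaled * (len(SHADES) - 1))])
        rows.append("".join(chars))
    return rows


# Several random splits of 150 cases, as in the research setup.
splits = [random_split(150, 0.6, seed=s) for s in range(3)]

# Tiny invented distance matrix: cases 0 and 1 are identical, case 2 differs.
for line in shade_rows([[0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0]]):
    print(line)
```

In the rendered toy matrix, the 2x2 block of `#` in the upper-left corner plays the role of the dark patches in Fig. 1b: several cases with no difference between them.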

3.3 Growing model incrementally by adding cases and features

To simulate the performance of different classification models on the testing data, we constructed a wrapper method which incrementally adds cases and features from the data set to a classifier being trained and computes the classification accuracy (CA) of the growing model. For each pair (number of instances, number of features) we recorded the CA value in a matrix, one matrix each for the Random Forest (RF) and Naïve Bayes (NB) classifiers. The performance is shown in Figures 2 (RF) and 3 (NB), where the darkness of the squares indicates the CA score, with the actual numbers given on the right-hand side of each graph.

Figure 2: Classification accuracy of RF (shade), with increasing number of instances and attributes

Figure 3: Classification accuracy of NB (shade), with increasing number of instances and attributes

Comparing Figures 2 and 3, it appears that Naïve Bayes needs fewer cases than Random Forest to achieve higher classification accuracy. On this specific problem it also appears more resistant to the noise coming from redundant features, which we can infer from the absence of a performance drop after more features are added to the model. For Random Forest this CA drop is visible in Figure 2 inside the rounded area, occurring for low numbers of instances.
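
The wrapper described at the start of this subsection can be outlined as follows. This is a hedged sketch in plain Python, not the authors' R implementation. The 1-nearest-neighbour classifier is only a small stand-in for Random Forest and Naïve Bayes so that the example runs without external libraries, and all function names, the order in which cases and features are added, and the toy data are assumptions.

```python
def accuracy(predicted, actual):
    """Classification accuracy: share of predictions matching the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def one_nn_predict(train_x, train_y, test_x):
    """Stand-in classifier: 1-nearest neighbour with Hamming distance."""
    predictions = []
    for case in test_x:
        nearest = min(
            range(len(train_x)),
            key=lambda i: sum(a != b for a, b in zip(train_x[i], case)),
        )
        predictions.append(train_y[nearest])
    return predictions


def accuracy_grid(train_x, train_y, test_x, test_y, fit_predict,
                  case_counts, feature_counts):
    """CA for every (number of cases, number of features) pair.

    Cases and features are added in their stored order; the paper does not
    state which order was used, so this is an assumption.
    """
    grid = []
    for n_cases in case_counts:
        row = []
        for n_features in feature_counts:
            sub_train = [x[:n_features] for x in train_x[:n_cases]]
            sub_test = [x[:n_features] for x in test_x]
            predicted = fit_predict(sub_train, train_y[:n_cases], sub_test)
            row.append(accuracy(predicted, test_y))
        grid.append(row)
    return grid


# Invented toy data: two categorical features, outcome "won" or "lost".
toy_train_x = [("a", "x"), ("b", "x"), ("a", "y"), ("b", "y")]
toy_train_y = ["won", "lost", "won", "lost"]
toy_test_x = [("a", "y"), ("b", "x")]
toy_test_y = ["won", "lost"]

toy_grid = accuracy_grid(toy_train_x, toy_train_y, toy_test_x, toy_test_y,
                         one_nn_predict, case_counts=[1, 4],
                         feature_counts=[1, 2])
```

Each row of `toy_grid` corresponds to one number of training cases and each column to one number of features, which is the layout shaded in Figures 2 and 3. Any function with the same `fit_predict` signature could be passed in place of `one_nn_predict`.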

4. Discussion and Conclusions

We addressed the problem of business-to-business sales opportunity forecasting using machine learning methods. Specifically, we present two visualizations for identifying and understanding the quality of a data set. Visualization of data set quality might be a useful method. Our simulation, coupled with the visualization, aims to expose data quality and data suitability for machine learning modeling. We demonstrate the idea on a set of real business data and achieve our goal with the presented figures. In the future, similar simulations and visualizations will be applied to business-to-business data sets from different companies.

The grey spaces in the upper right corners of Figures 2 and 3 indicate a drop in classification accuracy when large numbers of features are used in the classification models. This phenomenon indicates noise in the data and a need to apply feature subset selection techniques; however, further research is needed for a conclusive explanation.

5. Acknowledgement

We are grateful to the company Salvirt ltd. for funding the research and development of the visual simulation presented in this paper.

6. References

Batista, G. E. A. P. A., Bazan, A. L., and Monard, M. C. (2004): A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data, ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets, Vol. 6, Iss. 1.
Bolon-Canedo, V., Sanchez-Marono, N., and Alonso-Betanzos, A. (2013): A review of feature selection methods on synthetic data, Knowledge Information Systems, Vol. 34.
Breiman, L. (2001): Random forests, Machine Learning Journal, Vol. 45.
Čehovin, L., and Bosnić, Z. (2010): Empirical evaluation of feature selection methods in classification, Intelligent Data Analysis 14, IOS Press.
Kotsiantis, S. B. (2007): Supervised Machine Learning: A review of Classification Techniques, Informatica 31.
Robnik-Šikonja, M. (2015): CORElearn: various machine learning algorithms and techniques, R package.
Sarkar, D. (2015): lattice graphical plots, R package.